In this paper, we present an overview of the extensible Retrieval, Annotation and Caching Engine (eRACE), a modular and distributed intermediary infrastructure that collects information from heterogeneous Internet sources according to registered profiles or end-user requests. Collected information is stored for filtering, transformation, aggregation, and subsequent personalized or wide-area dissemination over the wireline or wireless Internet. We study the architecture and implementation of the main module of eRACE, an HTTP proxy named WebRACE. WebRACE consists of a high-performance, distributed, multithreaded Web crawler, a multithreaded filtering processor, and an Object Cache. We discuss the implementation of WebRACE in Java, describe a number of performance optimizations, and present an assessment of its performance. © 2004 Elsevier B.V. All rights reserved.
Marios D. Dikaiakos, Demetrios Zeinalipour-Yazti