We present Content Extraction via Tag Ratios (CETR) – a method to extract content text from diverse webpages by using the HTML document’s tag ratios. We describe how to comput...
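As a rough illustration of the tag-ratio idea, the sketch below computes one ratio per line of an HTML document. The abstract is cut off before it defines the computation, so the definition used here (non-tag characters on a line divided by the number of tags on that line, with a tag count of zero treated as one) is an assumption, and the names `TAG_RE` and `tag_ratios` are hypothetical. It is not presented as the authors' implementation.

```python
import re
from typing import List

# Assumption: anything between '<' and '>' counts as one tag.
# A real extractor would use an HTML parser and would first drop
# script, style, and comment blocks.
TAG_RE = re.compile(r"<[^>]*>")


def tag_ratios(html: str) -> List[float]:
    """Return one text-to-tag ratio per line of the input."""
    ratios = []
    for line in html.splitlines():
        tag_count = len(TAG_RE.findall(line))
        # Characters left after removing tags, ignoring edge whitespace.
        text_len = len(TAG_RE.sub("", line).strip())
        # Treat a line with no tags as having one, to avoid dividing by zero.
        ratios.append(text_len / max(tag_count, 1))
    return ratios


sample = (
    "<div><a href='#'>Home</a></div>\n"
    "<p>This is the main article text.</p>\n"
    "plain"
)
print(tag_ratios(sample))
```

Under this assumed definition, lines dominated by markup (navigation, menus) get low ratios, while lines carrying body text get high ones.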
A parallel corpus is a rich linguistic resource for various multilingual text management tasks, including crosslingual text retrieval, multilingual computational linguistics and mul...
Gambal is an information retrieval system for indexing and accessing web pages that includes graphical interfaces to ease web page search and access. In particular, the interfa...
Summarization of web pages is a very interesting topic from both academic and commercial points of view. Academically, it is challenging to create a summary of a document (e.g. a w...
Hassan Alam, Rachmat Hartono, Aman Kumar, Ahmad Fu...
We discuss the design of a class of agents that we call adaptive web site agents. The goal of such an agent is to help a user find information at a particular web site, adapting i...