This paper presents a transaction-time HTTP server, called ? Apache that supports document versioning. A document often consists of a main file formatted in HTML or XML and severa...
Abstract. Current text classification systems typically use term stems for representing document content. Semantic Web technologies allow the usage of features on a higher semantic...
Named entities (e.g., "Kofi Annan", "Coca-Cola", "Second World War") are ubiquitous in web pages and other types of document and often provide a simpl...
Felix Weigel, Klaus U. Schulz, Levin Brunner, Edua...
Using Dirac Notation as a powerful tool, we investigate the three classical Information Retrieval (IR) models and some their extensions. We show that almost all such models can be...
We present Content Extraction via Tag Ratios (CETR) – a method to extract content text from diverse webpages by using the HTML document’s tag ratios. We describe how to comput...