Although text categorization is a burgeoning area of IR research, readily available test collections in this field are surprisingly scarce. We describe a methodology and system (...
In this paper, we argue that agglomerative clustering with the vector cosine similarity measure performs poorly for two reasons. First, the nearest neighbors of a document belo...
Most prior work on information extraction has focused on extracting information from text in digital documents. Often, however, the most important information being reported in an...
This paper describes a language-independent, scalable system for both challenges of cross-document co-reference: name variation and entity disambiguation. We provide system results...
Database systems often use XML schema to describe the format of valid XML documents. Usually, this format is determined when the system is designed. Sometimes, in an already funct...
Jarek Gryz, Marcin Kwietniewski, Stephanie Hazlewo...