We propose an unsupervised method for detecting spam documents in Web page data, based on equivalence relations on strings. We propose three measures for quantifying the alienness (...
This paper presents the advantages of combining multiple document representation schemes for processing XML queries on content and structure. We show how extending the Te...
Low-density languages raise difficulties for standard approaches to natural language processing that depend on large online corpora. Using Persian as a case study, we propose a no...
XML is an established document standard for representing and exchanging data on the Internet, and is also used as a standard language for the search and reuse of scattered doc...
Eun-Young Kim, Jin-Ho Choi, Jhung-Soo Hong, Tae-Hu...
Government regulations are semi-structured text documents that are often voluminous, heavily cross-referenced between provisions and even ambiguous. Multiple sources of regulation...