As opposed to representing a document as a "bag of words" in most information retrieval applications, we propose a model of representing a web page as sets of named enti...
Nan Di, Conglei Yao, Mengcheng Duan, Jonathan J. H...
Web pages are more than text and they contain much contextual and structural information, e.g., the title, the meta data, the anchor text, etc., each of which can be seen as a dat...
The field of automatic genre classification has primarily focused on extracting textual features from documents. The goal of this research is to investigate whether visual feature...
We describe an adaptive method for extracting records from web pages. Our algorithm combines a weighted tree matching metric with clustering for obtaining data extraction patterns...
Many anti-phishing schemes have recently been proposed in literature. Despite all those efforts, the threat of phishing attacks is not mitigated. One of the main reasons is that p...