Information extraction from HTML pages has been conventionally treated as plain text documents extended with HTML tags. However, the growing maturity and correct usage of HTML/XHT...
Duplicate URLs have brought serious troubles to the whole pipeline of a search engine, from crawling, indexing, to result serving. URL normalization is to transform duplicate URLs...
Tao Lei, Rui Cai, Jiang-Ming Yang, Yan Ke, Xiaodon...
Learning classifiers has been studied extensively the last two decades. Recently, various approaches based on patterns (e.g., association rules) that hold within labeled data hav...
Tree-structured probabilistic models admit simple, fast inference. However, they are not well suited to phenomena such as occlusion, where multiple components of an object may dis...
Abstract. Multiple-instance learning (MIL) allows for training classifiers from ambiguously labeled data. In computer vision, this learning paradigm has been recently used in many ...