A substantial subset of the web data follows some kind of underlying structure. In order to let software programs gain full benefit from these “semistructured” web sources, wra...
This paper presents experiments on classifying web pages by genre. Firstly, a corpus of 1539 manually labeled web pages was prepared. Secondly, 502 genre features were selected ba...
The World-Wide-Web is less agent-friendly than we might hope. Most information on the Web is presented in loosely structured natural language text with no agent-readable semantics...
This paper considers the problem of identifying on the Web compound documents (cDocs) ? groups of web pages that in aggregate constitute semantically coherent information entities...
: Combining multiple data sources, each with its own features, to achieve optimal inference has received a lot of attention in recent years. In inference from multiple data sources...
Shankara B. Subramanya, Zheshen Wang, Baoxin Li, H...