Organizing structured web sources by query schemas: a clustering approach



In recent years, the Web has been rapidly “deepened” by the prevalence of online databases. On this deep Web, many sources are structured: they provide structured query interfaces and results. Organizing such structured sources into a domain hierarchy is one of the critical steps toward the integration of heterogeneous Web sources. We observe that, for structured Web sources, query schemas (i.e., attributes in query interfaces) are discriminative representatives of the sources and thus can be exploited for source characterization. In particular, by viewing query schemas as a type of categorical data, we abstract the problem of source organization into the clustering of categorical data. Our approach hypothesizes that “homogeneous sources” are characterized by the same hidden generative models for their schemas. To find clusters governed by such statistical distributions, we propose a new objective function, model-differentiation, which employs principled hypothesis testing to ma...

Bin He, Tao Tao, Kevin Chen-Chuan Chang
