We present a declarative framework for collective deduplication of entity references in the presence of constraints. Constraints occur naturally in many data cleaning domains and c...
Record linkage, the problem of determining when two records refer to the same entity, has applications both for data cleaning (deduplication) and for integrating data from multipl...
The Prediction by Partial Matching (PPM) algorithm uses a cumulative frequency count of input symbols in different contexts to estimate their probability distribution. Excellent c...
Unstructured text represents a large fraction of the world's data. It often contains snippets of structured information (e.g., people's names and zip ...
Daisy Zhe Wang, Eirinaios Michelakis, Joseph M. He...
Using visualization techniques to assist conventional data mining tasks has attracted considerable interest in recent years. This paper addresses a challenging issue in the use of...