Text Joins for Data Cleansing and Integration in an RDBMS

An organization's data records are often noisy because of transcription errors, incomplete information, lack of standard formats for textual data, or combinations thereof. A fundamental task in a data cleaning system is matching textual attributes that refer to the same entity (e.g., organization name or address). This matching can be effectively performed via the cosine similarity metric from the information retrieval field. For robustness and scalability, these "text joins" are best done inside an RDBMS, which is where the data is likely to reside. Unfortunately, computing an exact answer to a text join can be expensive. In this paper, we propose an approximate, sampling-based text join execution strategy that can be robustly executed in a standard, unmodified RDBMS.
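
As a rough illustration of the kind of exact text join the abstract calls expensive, the following Python sketch pairs strings from two lists whose tf-idf cosine similarity meets a threshold. It is not the paper's sampling-based strategy and does not run inside an RDBMS. The whitespace tokenizer, the idf formula, the function names, and the example strings are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return text.lower().split()


def tfidf_vectors(strings):
    """Return one L2-normalized tf-idf weight dict per input string."""
    docs = [Counter(tokenize(s)) for s in strings]
    n = len(docs)
    # Document frequency: number of strings that contain each token.
    df = Counter(tok for doc in docs for tok in doc)
    vectors = []
    for doc in docs:
        # Assumed idf variant: log(1 + n / df), always positive.
        weights = {t: tf * math.log(1 + n / df[t]) for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def text_join(left, right, threshold):
    """Exact text join: compare every pair, keep those at or above threshold.

    The all-pairs comparison is quadratic in the input sizes, which is the
    cost that motivates an approximate strategy.
    """
    vecs = tfidf_vectors(left + right)
    lvecs, rvecs = vecs[:len(left)], vecs[len(left):]
    return [
        (a, b, cosine(u, v))
        for a, u in zip(left, lvecs)
        for b, v in zip(right, rvecs)
        if cosine(u, v) >= threshold
    ]


# Hypothetical example data, not taken from the paper.
left = ["AT&T Labs Research", "Columbia University"]
right = ["AT&T Labs", "Univ. of Columbia", "IBM"]
for a, b, score in text_join(left, right, threshold=0.5):
    print(f"{a!r} ~ {b!r}: {score:.3f}")
```

Because the vectors are normalized up front, the similarity reduces to a sparse dot product over shared tokens, which is also the form that a join on a token column would take in SQL.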

Luis Gravano, Panagiotis G. Ipeirotis, Nick Koudas, Divesh Srivastava

Database | ICDE 2003 | Text Join Execution | Textual Data | Unmodified Rdbms |

» Prefix Tree Indexing for Similarity Search and Similarity Joins on Genomic Data

» PIVOT and UNPIVOT Optimization and Execution Strategies in an RDBMS

» Efficient Compression of Text Attributes of Data Warehouse Dimensions

» Join Queries with External Text Sources Execution and Optimization Techniques

» The TEXTURE Benchmark Measuring Performance of Text Queries on a Relational DBMS

» Scalable Keyword Search on Large Data Streams

» Querying Structured Text in an XML Database

» Oracle database filesystem

Post Info

Added	01 Nov 2009
Updated	01 Nov 2009
Type	Conference
Year	2003
Where	ICDE
Authors	Luis Gravano, Panagiotis G. Ipeirotis, Nick Koudas, Divesh Srivastava
