Cleansing Databases of Misspelled Proper Nouns


The paper presents a data cleansing technique for string databases. We propose and evaluate an algorithm that identifies a group of strings consisting of (multiple) occurrences of a correctly spelled string plus nearby misspelled strings. All strings in a group are replaced by the most frequent string of the group. Our method targets proper noun databases, including names and addresses, which are not handled by dictionaries. At the technical level, we give an efficient solution for computing the center of a group of strings and determining the border of the group. We use inverse strings together with sampling to efficiently identify and cleanse a database. The experimental evaluation shows that, for proper nouns, the center calculation and border detection algorithms are robust and that even very small sample sizes yield good results.
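The core idea in the abstract, replacing every string in a group by the group's most frequent member, can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the center computation, border detection, inverse strings, and sampling are not reproduced here. As a stated assumption, the sketch uses plain Levenshtein distance with a fixed threshold (the hypothetical parameter `max_dist`) in place of border detection, and a greedy pass over strings in order of decreasing frequency in place of the center calculation. The function names `edit_distance` and `cleanse` are likewise illustrative.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cleanse(strings, max_dist=2):
    """Replace each string by the most frequent string of its group.

    Assumption (not from the paper): a group is every string within
    `max_dist` edits of a center, and centers are picked greedily
    from the most frequent distinct strings.
    """
    counts = Counter(strings)
    centers = []   # strings treated as correctly spelled
    mapping = {}   # distinct string -> its replacement
    # Visit distinct strings from most to least frequent.
    for s, _ in counts.most_common():
        nearest = min(centers, key=lambda c: edit_distance(s, c), default=None)
        if nearest is not None and edit_distance(s, nearest) <= max_dist:
            mapping[s] = nearest   # rare variant joins an existing group
        else:
            centers.append(s)      # frequent or isolated string starts a group
            mapping[s] = s
    return [mapping[s] for s in strings]


names = ["Johnson", "Johnson", "Johnson", "Jonson", "Johnsen",
         "Miller", "Miller", "Millar"]
cleaned = cleanse(names)
```

With the sample list above, `Jonson` and `Johnsen` are each one edit away from `Johnson`, and `Millar` is one edit away from `Miller`, so `cleaned` contains only the two frequent spellings. A fixed threshold is a crude stand-in: it will merge distinct short names that happen to be close, which is the kind of case a data-driven group border is meant to handle.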

Arturas Mazeika, Michael H. Böhlen


CLEANDB 2006 | Database | Nearby Misspelled Strings | Proper Nouns | String |


Post Info

Added	13 Jun 2010
Updated	13 Jun 2010
Type	Conference
Year	2006
Where	CLEANDB
Authors	Arturas Mazeika, Michael H. Böhlen
