Automatic Segmentation of Text into Structured Records


Download www.it.iitb.ac.in

In this paper we present a method for automatically segmenting unformatted text records into structured elements. Many useful data sources today are human-generated as continuous text, whereas convenient use requires the data to be organized as structured records. A prime motivation is the warehouse address cleaning problem of transforming dirty addresses, stored in large corporate databases as a single text field, into subfields like "City" and "Street". Existing tools rely on hand-tuned, domain-specific rule-based systems. We describe a tool, DATAMOLD, that learns to automatically extract structure when seeded with a small number of training examples. The tool enhances Hidden Markov Models (HMMs) to build a powerful probabilistic model that corroborates multiple sources of information, including the sequence of elements, their length distribution, distinguishing words from the vocabulary, and an optional external data dictionary. Experiments on real-life dataset...
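To make the HMM idea in the abstract concrete, here is a minimal sketch of segmenting an address with a plain first-order HMM and Viterbi decoding. It is not the DATAMOLD model: the length distributions and external dictionary mentioned in the abstract are left out, nothing is learned from training examples, and every probability is a hand-set toy value. The element name "HouseNo", the small word lists, and the function names `emission`, `viterbi`, and `segment` are assumptions made for illustration only.

```python
import math

# Element labels. "Street" and "City" come from the abstract; "HouseNo" is assumed.
STATES = ["HouseNo", "Street", "City"]

# Toy, hand-set parameters (illustrative assumptions, not learned values).
START = {"HouseNo": 0.8, "Street": 0.15, "City": 0.05}
TRANS = {
    "HouseNo": {"HouseNo": 0.1, "Street": 0.8, "City": 0.1},
    "Street": {"HouseNo": 0.05, "Street": 0.55, "City": 0.4},
    "City": {"HouseNo": 0.05, "Street": 0.05, "City": 0.9},
}
STREET_WORDS = {"street", "road", "avenue", "lane"}  # hypothetical vocabulary
CITY_WORDS = {"mumbai", "pune", "delhi"}             # hypothetical vocabulary


def emission(state, token):
    """Crude emission probability based on token shape and tiny word lists."""
    word = token.lower()
    if token.isdigit():
        return 0.9 if state == "HouseNo" else 0.05
    if word in STREET_WORDS:
        return 0.9 if state == "Street" else 0.05
    if word in CITY_WORDS:
        return 0.9 if state == "City" else 0.05
    # Unknown word: more likely part of a street name than anything else.
    return {"HouseNo": 0.05, "Street": 0.6, "City": 0.35}[state]


def viterbi(tokens):
    """Return the most probable label sequence for the tokens (log space)."""
    if not tokens:
        return []
    scores = {s: math.log(START[s]) + math.log(emission(s, tokens[0]))
              for s in STATES}
    backpointers = []
    for token in tokens[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state for reaching s at this position.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(emission(s, token)))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def segment(text):
    """Split text on whitespace and group tokens by their decoded label."""
    tokens = text.split()
    record = {}
    for token, label in zip(tokens, viterbi(tokens)):
        record[label] = (record.get(label, "") + " " + token).strip()
    return record
```

Calling `segment("18 Hill Road Mumbai")` with these toy parameters groups "18" under HouseNo, "Hill Road" under Street, and "Mumbai" under City. In a learned setting the start, transition, and emission tables would be estimated from the labeled training examples rather than written by hand.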

Vinayak R. Borkar, Kaustubh Deshmukh, Sunita Sarawagi


Database | Datasets Yielded Accuracy | Several Useful Data | SIGMOD 2001 | Unformatted Text Records |


Post Info

Added	08 Dec 2009
Updated	08 Dec 2009
Type	Conference
Year	2001
Where	SIGMOD
Authors	Vinayak R. Borkar, Kaustubh Deshmukh, Sunita Sarawagi


Sciweavers
