Optimizing complex extraction programs over evolving text data


Most information extraction (IE) approaches have considered only static text corpora, over which we apply IE only once. Many real-world text corpora, however, are dynamic. They evolve over time, so to keep extracted information up to date we often must apply IE repeatedly, to consecutive corpus snapshots. Applying IE from scratch to each snapshot can take a lot of time. To avoid doing this, we have recently developed Cyclex, a system that recycles previous IE results to speed up IE over subsequent corpus snapshots. Cyclex clearly demonstrated the promise of the recycling idea. The work itself, however, is limited in that it considers only IE programs that contain a single IE "blackbox." In practice, many IE programs are far more complex, containing multiple IE blackboxes connected in a compositional "workflow." In this paper, we present Delex, a system that removes the above limitation. First, we identify many difficult challenges raised by Delex, including modeling...
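The recycling idea in the abstract can be illustrated with a minimal sketch. It is not the Cyclex or Delex algorithm, whose details are not given here: it assumes the coarsest possible form of reuse, where a page whose text is byte-for-byte unchanged between snapshots gets its earlier extraction result back from a cache. The names `extract_emails`, `fingerprint`, and `extract_snapshot` are hypothetical, and the regex extractor merely stands in for an IE blackbox.

```python
import hashlib
import re
from typing import Callable, Dict, List, Tuple

Extractor = Callable[[str], List[str]]


def extract_emails(text: str) -> List[str]:
    """Hypothetical stand-in for an IE 'blackbox': find email-like strings."""
    return re.findall(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b", text)


def fingerprint(text: str) -> str:
    """Content hash used to detect pages that did not change."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def extract_snapshot(
    snapshot: Dict[str, str],
    extractor: Extractor,
    cache: Dict[str, List[str]],
) -> Tuple[Dict[str, List[str]], int]:
    """Run the extractor over one corpus snapshot, recycling cached results.

    Assumption: reuse happens only at whole-page granularity. Returns the
    per-page results and the number of pages the extractor actually ran on.
    """
    results: Dict[str, List[str]] = {}
    reruns = 0
    for page_id, text in snapshot.items():
        key = fingerprint(text)
        if key not in cache:
            # New or changed page: pay the full extraction cost once.
            cache[key] = extractor(text)
            reruns += 1
        results[page_id] = cache[key]
    return results, reruns


if __name__ == "__main__":
    cache: Dict[str, List[str]] = {}
    snapshot_1 = {
        "p1": "Contact alice@example.org for details",
        "p2": "No contact here",
    }
    snapshot_2 = {
        "p1": "Contact alice@example.org for details",  # unchanged page
        "p2": "Write to bob@example.com now",           # changed page
    }
    for snapshot in (snapshot_1, snapshot_2):
        results, reruns = extract_snapshot(snapshot, extract_emails, cache)
        print(results, "extractor runs:", reruns)
```

Keying the cache on a content hash rather than a page identifier means an unchanged page is recognized even if it is renamed. A workflow of several blackboxes, as in the programs Delex targets, would need reuse decisions at each step, which this sketch does not attempt.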

Fei Chen 0002, Byron J. Gao, AnHai Doan, Jun Yang 0001, Raghu Ramakrishnan


Complex IE Programs | Database | IE Programs | Learning-based IE Programs | SIGMOD 2009


Post Info

Added	05 Dec 2009
Updated	05 Dec 2009
Type	Conference
Year	2009
Where	SIGMOD
Authors	Fei Chen 0002, Byron J. Gao, AnHai Doan, Jun Yang 0001, Raghu Ramakrishnan
