Removing manually generated boilerplate from electronic texts: experiments with Project Gutenberg e-books

Download www.archipel.uqam.ca

Collaborative work on unstructured or semistructured documents, such as literature corpora or source code, often involves agreed-upon templates containing metadata. These templates are not consistent across users or over time. Rule-based parsing of these templates is expensive to maintain and tends to fail as new documents are added. Statistical techniques based on frequent occurrences have the potential to automatically identify a large fraction of the templates, thus reducing the burden on programmers. We investigate the case of the Project Gutenberg™ corpus, where most documents are in ASCII format with preambles and epilogues that are often copied and pasted or manually typed. We show that a statistical approach can handle most cases, though some documents require knowledge of English. We also survey various technical solutions that make our approach applicable to large data sets.
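
As a rough illustration of the frequency idea described in the abstract, the sketch below counts how many documents each line appears in and then trims leading and trailing lines that are shared across documents. It is only a minimal sketch under stated assumptions, not the authors' actual algorithm: the function names, the `min_docs` threshold, and the whitespace/case normalization are all hypothetical choices made for this example.

```python
from collections import Counter


def normalize(line):
    # Hypothetical normalization: lowercase and collapse runs of whitespace.
    return " ".join(line.lower().split())


def frequent_lines(documents, min_docs=2):
    """Return normalized lines that occur in at least min_docs documents."""
    doc_freq = Counter()
    for text in documents:
        # Count each distinct line at most once per document.
        seen = {normalize(line) for line in text.splitlines()}
        seen.discard("")
        doc_freq.update(seen)
    return {line for line, n in doc_freq.items() if n >= min_docs}


def strip_boilerplate(text, frequent):
    """Trim leading and trailing lines that are blank or frequent."""
    lines = text.splitlines()

    def is_boilerplate(line):
        norm = normalize(line)
        return norm == "" or norm in frequent

    start, end = 0, len(lines)
    # Skip the preamble.
    while start < end and is_boilerplate(lines[start]):
        start += 1
    # Skip the epilogue.
    while end > start and is_boilerplate(lines[end - 1]):
        end -= 1
    return "\n".join(lines[start:end])
```

Only the preamble and epilogue are trimmed here, so a frequent line that happens to appear in the middle of a text is left alone. A corpus-scale version would need a more memory-efficient way to count lines than an in-memory `Counter`.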

Owen Kaser, Daniel Lemire

CASCON 2007 | Documents Require Knowledge | Education | Semistructured Documents | Templates Containing Metadata

Post Info

Added	29 Oct 2010
Updated	29 Oct 2010
Type	Conference
Year	2007
Where	CASCON
Authors	Owen Kaser, Daniel Lemire
