DGO
2006

Next steps in near-duplicate detection for eRulemaking

Large volume public comment campaigns and web portals that encourage the public to customize form letters produce many near-duplicate documents, which increases processing and storage costs, but is rarely a serious problem. A more serious concern is that form letter customizations can include substantive issues that agencies are likely to overlook. The identification of exact- and near-duplicate texts, and recognition of unique text within near-duplicate documents, is an important component of data cleaning and integration processes for eRulemaking. This paper presents DURIAN (DUplicate Removal In lArge collectioN), a refinement of a prior near-duplicate detection algorithm. DURIAN uses a traditional bag-of-words document representation, document attributes ("metadata"), and document content structure to identify form letters and their edited copies in public comment collections. Experimental results demonstrate that DURIAN is about as effective as human assessors. The paper c...
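
The bag-of-words idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not DURIAN itself: the abstract does not specify its similarity measure or thresholds, so cosine similarity, the 0.8 cutoff, and the function names `tokenize`, `classify`, and `added_terms` are all assumptions chosen for illustration. Document attributes and content structure, which DURIAN also uses, are left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(form_letter, comment, near_threshold=0.8):
    """Label a comment relative to a known form letter.

    near_threshold is an illustrative value, not one taken from the paper.
    """
    form_tokens, comment_tokens = tokenize(form_letter), tokenize(comment)
    if form_tokens == comment_tokens:
        return "exact"
    similarity = cosine_similarity(Counter(form_tokens), Counter(comment_tokens))
    return "near-duplicate" if similarity >= near_threshold else "unique"


def added_terms(form_letter, comment):
    """Terms the commenter added beyond the form letter (candidate unique text)."""
    return Counter(tokenize(comment)) - Counter(tokenize(form_letter))


if __name__ == "__main__":
    form = "Please protect our national forests from new road construction."
    edited = form + " My family hikes there."
    print(classify(form, edited))
    print(sorted(added_terms(form, edited)))
```

In this sketch, `added_terms` stands in for the "unique text within near-duplicate documents" that the abstract says agencies are likely to overlook; a real system would report the added passages in context rather than as a bag of terms.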
Hui Yang, Jamie Callan, Stuart W. Shulman
Added 30 Oct 2010
Updated 30 Oct 2010
Type Conference
Year 2006
Where DGO