
COMAD
2009

Business Insight from Collection of Unstructured Formatted Documents with IBM Content Harvester

In this paper, we report on the development of, and experiments with, IBM Content Harvester (CH), a tool that analyzes and recovers templates and content from text documents created with word processors. CH is part of a larger effort to collect and reuse material generated in business service engagements. Specifically, it operates on unstructured formatted documents: it extracts their content, cleanses it of sensitive information, tags it with user-defined or domain-defined labels, and makes it available for flexible querying and for publishing in any open format. As a result, one can search for specific information by tag, aggregate information regardless of document source or formatting peculiarities, and publish the content in any format or template. CH has been applied to a broad variety of document collections containing hundreds of documents, including live engagements, with promising results.
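The extract, cleanse, tag, query, and publish stages named in the abstract can be illustrated with a minimal sketch. This is not CH's implementation: the function names, the `Fragment` type, the e-mail pattern used as a stand-in for "sensitive information", the keyword-based labels, and JSON as the open output format are all assumptions made for illustration only.

```python
import json
import re
from dataclasses import dataclass, field

# Hypothetical stand-in for "sensitive information"; not taken from the paper.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

# Hypothetical user-defined labels, each mapped to trigger keywords.
LABELS = {"risk": ("risk", "issue"), "cost": ("cost", "budget")}


@dataclass
class Fragment:
    source: str                      # document the text came from
    text: str                        # extracted paragraph
    tags: set = field(default_factory=set)


def extract(source, raw):
    """Split raw document text into non-empty paragraph fragments."""
    return [Fragment(source, p.strip()) for p in raw.split("\n\n") if p.strip()]


def cleanse(frag):
    """Replace anything matching the sensitive pattern with a placeholder."""
    frag.text = EMAIL.sub("[REDACTED]", frag.text)
    return frag


def tag(frag):
    """Attach every label whose keywords occur in the fragment text."""
    lowered = frag.text.lower()
    frag.tags = {label for label, words in LABELS.items()
                 if any(w in lowered for w in words)}
    return frag


def harvest(docs):
    """Run extract, cleanse, and tag over a {source: raw_text} mapping."""
    return [tag(cleanse(f)) for src, raw in docs.items() for f in extract(src, raw)]


def query(frags, label):
    """Return fragments carrying the label, regardless of source document."""
    return [f for f in frags if label in f.tags]


def publish(frags):
    """Serialize fragments to JSON, used here as one example of an open format."""
    return json.dumps(
        [{"source": f.source, "text": f.text, "tags": sorted(f.tags)} for f in frags],
        indent=2,
    )
```

Under these assumptions, `query(harvest(docs), "risk")` would gather risk-tagged paragraphs across all input documents, and `publish` would emit them in a single uniform format independent of how each source document was laid out.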
Biplav Srivastava, Yuan-Chi Chang
Type Conference
Year 2009
Where COMAD
Authors Biplav Srivastava, Yuan-Chi Chang