A considerable amount of clean semistructured data is internally available to companies in the form of business reports. However, business reports are untapped for data mining, data warehousing, and querying because they are not in relational form. Business reports have a regular structure that can be reconstructed. We present algorithms that automatically infer the regular structure underlying business reports and automatically generate wrappers to extract relational data.
Stephen W. Liddle, Douglas M. Campbell, Chad Crawf