As in many large organizations, the government's data is partitioned in many different ways and is collected at different times by different people. The resulting heterogeneity means that government staff cannot effectively locate, share, or compare data across sources, let alone achieve computational data interoperability. A case in point is the California Air Resources Board, a component of the California EPA, which every year must integrate emissions inventories from the 35 local air quality districts in California and send them to the US EPA in North Carolina (which in turn must integrate the data from all 50 states and from neighboring countries). The premise of our research is that the manual labor required for database wrapping and integration can be reduced significantly by automatically learning mappings from the data. In this research, we applied statistical algorithms to discover correspondences across comparable datasets at all levels. We have seen pa...
Patrick Pantel, Andrew Philpot, Eduard H. Hovy
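
To make the idea of learning correspondences from the data concrete, the following is a minimal sketch of instance-based column alignment. The abstract does not say which statistical algorithms were used, so this sketch substitutes plain Jaccard overlap of cell values as a stand-in. The function names, table layouts, column names, and values are all hypothetical and chosen only for illustration.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of cell values."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def align_columns(source, target):
    """For each source column, pick the target column with the highest overlap.

    `source` and `target` map column names to lists of cell values.
    Returns {source_column: (best_target_column, score)}.
    """
    mapping = {}
    for s_name, s_vals in source.items():
        scores = {t_name: jaccard(s_vals, t_vals)
                  for t_name, t_vals in target.items()}
        best = max(scores, key=scores.get)
        mapping[s_name] = (best, scores[best])
    return mapping


# Hypothetical toy tables: a district-level inventory and a state-level one.
# Column names and values are invented, not taken from any real inventory.
district = {
    "POLL": ["NOX", "CO", "SOX", "PM10"],
    "CNTY": ["Kern", "Fresno", "Tulare"],
}
state = {
    "pollutant_code": ["CO", "NOX", "PM10", "ROG"],
    "county_name": ["Fresno", "Kern", "Kings", "Tulare"],
}

result = align_columns(district, state)
for column, (match, score) in result.items():
    print(f"{column} -> {match} (overlap {score:.2f})")
```

On these toy tables the sketch pairs `POLL` with `pollutant_code` and `CNTY` with `county_name`, even though the column names share no obvious spelling. Matching on values rather than names is what would let such a mapping be proposed without hand-written wrappers, although a real system would need a more robust score and human review of the candidate mappings.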