Abstract--As computer networks grow, accessible data is becoming increasingly distributed. Understanding and integrating remote and unfamiliar data sources are important data management issues. In this paper, we propose using self-organizing maps (SOM) to aid the visualization and integration of relational database tables and attributes based on their contents. To accommodate the heterogeneous data types found in relational databases, we extend the TFIDF measure to handle numerical and binary attribute types. We present a SOM-based visualization algorithm that allows the user to browse heterogeneously typed database attributes and discover semantically similar clusters. The discovered semantic clusters can significantly aid the manual or automated construction of data integrity constraints in data cleaning and of schema mappings in data integration.
Farid Bourennani, Ken Q. Pu, Ying Zhu