We are interested in finding natural communities in largescale linked networks. Our ultimate goal is to track changes over time in such communities. For such temporal tracking, we require a clustering algorithm that is relatively stable under small perturbations of the input data. We have developed an efficient, scalable agglomerative strategy and applied it to the citation graph of the NEC CiteSeer database (250,000 papers; 4.5 million citations). Agglomerative clustering techniques are known to be unstable on data in which the community structure is not strong. We find that some communities are essentially random and thus unstable while others are natural and will appear in most clusterings. These natural communities will enable us to track the evolution of communities over time. Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval Keywords natural communities, large linked networks, hierarchical agglomerative clustering, sta...
John E. Hopcroft, Omar Khan, Brian Kulis, Bart Sel