We describe a novel approach for clustering collections of sets, and its application to the analysis and mining of categorical data. By "categorical data," we mean tables with fields that cannot be naturally ordered by a metric, e.g., the names of producers of automobiles, or the names of products offered by a manufacturer. Our approach is based on an iterative method for assigning and propagating weights on the categorical values in a table; this facilitates a type of similarity measure arising from the co-occurrence of values in the dataset. Our techniques can be studied analytically in terms of certain types of non-linear dynamical systems. We discuss experiments on a variety of tables of synthetic and real data; we find that our iterative methods converge quickly to prominently correlated values of various categorical fields.
David Gibson, Jon M. Kleinberg, Prabhakar Raghavan