mes, abstracts and year of publication of all 853 papers published.1 We then applied Porter stemming and stopword removal to this text, gave terms from the title fields twice the weight of terms from the author or abstract fields, and weighted each term using BM25 term weighting. Finally, we calculated an 853 × 853 similarity matrix for this set of documents and used Clustan Graphics version 5.25 [1] to generate a hierarchical, non-overlapping clustering of the document set. We chose Clustan Graphics because its user-friendly interface provides a full-screen visualisation of the hierarchical clustering and lets the user run a slider across the screen, effectively varying the similarity threshold above which clusters are created. By using this slider, the user can see not only how many clusters are generated, but also how large these clusters are relative to each other. In our case, we wanted to generate a number of clusters where the variability in size wa...
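The processing steps described above (field weighting, BM25 term weighting, a pairwise similarity matrix, and a similarity threshold that determines the clusters) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the tokens are assumed to be already stemmed and stopped, the doubled field is taken to be the title, cosine similarity and single-link merging stand in for whatever measure and linkage Clustan Graphics applied, and the function names, field names and BM25 parameters (`k1`, `b`) are all hypothetical.

```python
import math
from collections import Counter

# Assumption: title terms count twice as much as author or abstract terms.
FIELD_WEIGHTS = {"title": 2.0, "authors": 1.0, "abstract": 1.0}


def weighted_tf(doc):
    """Merge per-field token lists into one field-weighted term-frequency map."""
    tf = Counter()
    for field, weight in FIELD_WEIGHTS.items():
        for term in doc.get(field, []):
            tf[term] += weight
    return tf


def bm25_vectors(docs, k1=1.2, b=0.75):
    """Return one sparse BM25-weighted vector (dict) per document."""
    tfs = [weighted_tf(d) for d in docs]
    n = len(tfs)
    lengths = [sum(tf.values()) for tf in tfs]
    avg_len = (sum(lengths) / n) or 1.0
    df = Counter(term for tf in tfs for term in tf)  # document frequencies
    vectors = []
    for tf, length in zip(tfs, lengths):
        norm = k1 * (1 - b + b * length / avg_len)
        vectors.append({
            # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
            t: math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5)) * f * (k1 + 1) / (f + norm)
            for t, f in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors (assumed measure)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(vectors):
    """Full N x N matrix of pairwise similarities."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


def clusters_at(sim, threshold):
    """Single-link clusters: join documents whose similarity >= threshold."""
    parent = list(range(len(sim)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(sim)):
        for j in range(i + 1, len(sim)):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(sim)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy input: three already-stemmed "papers" (invented for illustration).
docs = [
    {"title": ["queri", "expans"], "authors": ["smith"], "abstract": ["queri", "feedback"]},
    {"title": ["queri", "reformul"], "authors": ["jone"], "abstract": ["relev", "feedback"]},
    {"title": ["imag", "retriev"], "authors": ["lee"], "abstract": ["imag", "colour"]},
]
sim = similarity_matrix(bm25_vectors(docs))
for threshold in (0.0, sim[0][1], 1.01):
    print(threshold, clusters_at(sim, threshold))
```

Sweeping `threshold` from low to high plays the role of the slider: each setting yields a different number of clusters and a different spread of cluster sizes, which can then be inspected before settling on a cut.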
Alan F. Smeaton, Gary Keogh, Cathal Gurrin, Kieran