Abstract— In this paper we suggest a new approach to represent text document collections, integrating background knowledge to improve clustering effectiveness. Background knowled...
A vast amount of documents in the Web have duplicates, which is a challenge for developing efficient methods that would compute clusters of similar documents. In this paper we use ...
News articles about the same event published over time have properties that challenge NLP and IR applications. A cluster of such texts typically exhibits instances of paraphrase a...
Document clustering techniques have been applied in several areas, with the web as one of the most recent and influent. Both general-purpose and text-oriented techniques exist and...
Abstract. A new methodology that structures the semantics of a collection of documents into the geometry of a simplicial complex is developed. A simplicial complex is topologically...