

A Language-Independent Approach to Identify the Named Entities in Under-Resourced Languages and Clustering Multilingual Document

Abstract. This paper presents a language-independent Multilingual Document Clustering (MDC) approach on comparable corpora. Named entities (NEs) such as persons, locations, and organizations play a major role in measuring document similarity. We propose a method to identify the NEs present in under-resourced Indian languages (Hindi and Marathi) using the NEs present in English, which is a high-resource language. The identified NEs are then used to form multilingual document clusters with the Bisecting k-means clustering algorithm. We did not use any non-English linguistic tools or resources such as WordNet, a Part-Of-Speech tagger, or bilingual dictionaries, which makes the proposed approach completely language-independent. Experiments are conducted on a standard dataset provided by FIRE for its 2010 Ad-hoc Cross-Lingual document retrieval task on Indian languages. We have considered English, Hindi and Marathi news datasets for our experiments. The sys...
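The abstract names Bisecting k-means over NE-based document representations but is cut off before any details. As a rough illustration only, the following sketch shows one common form of the algorithm: repeatedly split the largest cluster in two with a 2-means step under cosine similarity. The function names, the deterministic seeding rule, the choice to split the largest cluster, and the toy documents are all assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for key, value in vec.items():
            total[key] = total.get(key, 0.0) + value
    return {key: value / len(vectors) for key, value in total.items()}


def two_means(vectors, iterations=10):
    """Split a list of at least two vectors; returns two lists of positions."""
    # Assumed seeding rule: the first vector and the one least similar to it.
    seed_a = vectors[0]
    seed_b = min(vectors[1:], key=lambda v: cosine(seed_a, v))
    centers = [seed_a, seed_b]
    left, right = [], []
    for _ in range(iterations):
        left, right = [], []
        for i, vec in enumerate(vectors):
            closer_to_first = cosine(vec, centers[0]) >= cosine(vec, centers[1])
            (left if closer_to_first else right).append(i)
        if not left or not right:
            break  # degenerate split, e.g. all vectors identical
        centers = [centroid([vectors[i] for i in left]),
                   centroid([vectors[i] for i in right])]
    return left, right


def bisecting_kmeans(vectors, k):
    """Return up to k clusters, each a list of indices into `vectors`."""
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        largest = clusters[0]
        if len(largest) < 2:
            break  # nothing left to split
        left, right = two_means([vectors[i] for i in largest])
        if not left or not right:
            break
        clusters = clusters[1:] + [[largest[i] for i in left],
                                   [largest[i] for i in right]]
    return clusters


# Toy documents, each reduced to a bag of named entities (invented data).
# Non-English NEs are assumed to be already mapped to their English forms.
docs = [
    ["Delhi", "Manmohan Singh"],
    ["Delhi", "Manmohan Singh", "Congress"],
    ["Mumbai", "Sachin Tendulkar"],
    ["Sachin Tendulkar", "BCCI", "Mumbai"],
]
vectors = [Counter(d) for d in docs]
clusters = bisecting_kmeans(vectors, 2)
print(sorted(sorted(c) for c in clusters))
```

On this toy input the two political documents and the two cricket documents land in separate clusters, since they share no named entities across groups.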
N. Kiran Kumar, G. S. K. Santosh, Vasudeva Varma
Added 18 Dec 2011
Updated 18 Dec 2011
Type Journal
Year 2011
Where CLEF
Authors N. Kiran Kumar, G. S. K. Santosh, Vasudeva Varma