A Language-Independent Approach to Identify the Named Entities in Under-Resourced Languages and Clustering Multilingual Document

Abstract. This paper presents a language-independent Multilingual Document Clustering (MDC) approach on comparable corpora. Named entities (NEs) such as persons, locations, and organizations play a major role in measuring document similarity. We propose a method to identify the NEs present in under-resourced Indian languages (Hindi and Marathi) using the NEs present in English, which is a high-resource language. The identified NEs are then used to form multilingual document clusters with the Bisecting k-means clustering algorithm. We did not use any non-English linguistic tools or resources such as WordNet, Part-Of-Speech taggers, or bilingual dictionaries, which makes the proposed approach completely language-independent. Experiments are conducted on a standard dataset provided by FIRE for their 2010 Ad-hoc Cross-Lingual document retrieval task on Indian languages. We considered English, Hindi and Marathi news datasets for our experiments. The sys...
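The clustering step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each document has already been reduced to a list of English named entities (how the Hindi and Marathi NEs are mapped to English is not visible in the truncated abstract), and the document ids, entity lists, deterministic seeding, and split-the-largest-cluster rule are all illustrative assumptions.

```python
import math
from collections import Counter


def to_vector(entities):
    """Unit-length term-frequency vector over named-entity strings."""
    counts = Counter(entities)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {e: c / norm for e, c in counts.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b.get(e, 0.0) for e, w in a.items())


def centroid(vectors):
    """Normalised mean direction of a group of vectors."""
    total = Counter()
    for v in vectors:
        for e, w in v.items():
            total[e] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {e: w / norm for e, w in total.items()} if norm else {}


def two_means(vectors, iterations=10):
    """Split vectors into two index groups (assumed deterministic seeding)."""
    seed_a = vectors[0]
    seed_b = min(vectors, key=lambda v: cosine(seed_a, v))  # least similar to seed_a
    centers = [seed_a, seed_b]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for i, v in enumerate(vectors):
            side = 0 if cosine(v, centers[0]) >= cosine(v, centers[1]) else 1
            groups[side].append(i)
        if not groups[0] or not groups[1]:
            break  # degenerate split, e.g. all vectors identical
        centers = [centroid([vectors[i] for i in g]) for g in groups]
    return groups


def bisecting_kmeans(vectors, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        left, right = two_means([vectors[i] for i in largest])
        if not left or not right:
            clusters.append(largest)  # cannot split further
            break
        clusters.append([largest[i] for i in left])
        clusters.append([largest[i] for i in right])
    return clusters


# Hypothetical documents, each already reduced to English NEs.
docs = {
    "en_1": ["Sachin Tendulkar", "Mumbai", "BCCI"],
    "hi_1": ["Sachin Tendulkar", "BCCI"],
    "mr_1": ["Mumbai", "Sachin Tendulkar"],
    "en_2": ["Manmohan Singh", "New Delhi", "Congress"],
    "hi_2": ["Manmohan Singh", "Congress"],
    "mr_2": ["New Delhi", "Manmohan Singh"],
}
names = list(docs)
vectors = [to_vector(docs[n]) for n in names]
result = {frozenset(names[i] for i in c) for c in bisecting_kmeans(vectors, k=2)}
```

With these toy inputs, documents in different languages land in the same cluster whenever they share entities, which is the behaviour the abstract describes at a high level.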

N. Kiran Kumar, G. S. K. Santosh, Vasudeva Varma

Bilingual Dictionaries | CLEF 2011 | Clustering Algorithm | Comparable Corpora | Information Technology

Post Info

Added	18 Dec 2011
Updated	18 Dec 2011
Type	Journal
Year	2011
Where	CLEF
Authors	N. Kiran Kumar, G. S. K. Santosh, Vasudeva Varma
