WebKhoj: Indian language IR from multiple character encodings



Today, web search engines provide the easiest way to reach information on the web. In this scenario, more than 95% of Indian language content on the web is not searchable because web pages use multiple encodings. Most of these encodings are proprietary and hence need some kind of standardization to make the content accessible via a search engine. In this paper we present a search engine called WebKhoj, which is capable of searching multi-script and multi-encoded Indian language content on the web. We describe a language-focused crawler and the transcoding processes involved in making Indian language content accessible. In the end we report some of the experiments that were conducted, along with results on Indian language web content.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Information filtering, Selection process; H.3.1 [Content Analysis and Indexing]: Linguistic processing

General Terms: Standardization, Languages

Keywords: Indian languages, web...

Prasad Pingali, Jagadeesh Jagarlamudi, Vasudeva Va


Indian Language Content | Internet Technology | Language Web Content | WWW 2006


Post Info

Added	22 Nov 2009
Updated	22 Nov 2009
Type	Conference
Year	2006
Where	WWW
Authors	Prasad Pingali, Jagadeesh Jagarlamudi, Vasudeva Varma
