- Crawling web pages written in Arabic, or in any other language with limited content on the Web, may at first seem to parallel the process of crawling English content. However, two major challenges must be addressed carefully in order to build efficient language-based crawlers. First, because the content in these languages is limited compared to English content, we conjecture that the associated Web graphs are sparse and that the crawling process must be guided carefully to avoid downloading irrelevant pages. Second, many pages written in the desired language are referenced only by pages written in English and other foreign languages, so these pages cannot be reached without traversing many irrelevant pages. In this paper we present a number of language-based crawling techniques and demonstrate their viability through real crawling experiments. We will restrict our study to Arabic web pages since we believe that these techniques still apply to other la...
Saad H. Alabbad, Sultan Alanazi