Duplicate URLs have brought serious troubles to the whole pipeline of a search engine, from crawling, indexing, to result serving. URL normalization is to transform duplicate URLs...
Tao Lei, Rui Cai, Jiang-Ming Yang, Yan Ke, Xiaodon...
Matching regular expressions (regexps) is a very common workload. For example, tokenization, which consists of recognizing words or keywords in a character stream, appears in ever...
Web image search is inspired by text search techniques; it mainly relies on indexing textual data that surround the image file. But retrieval results are often noisy and image pro...
Event search is the problem of identifying events or activity of interest in a large database storing long sequences of activity. In this paper, our topic is the problem of identi...
Panagiotis Papapetrou, Paul Doliotis, Vassilis Ath...
The selection of indexing terms for representing documents is a key decision that limits how effective subsequent retrieval can be. Often stemming algorithms are used to normaliz...