Sciweavers

ICDE
2012
IEEE

Horizontal Reduction: Instance-Level Dimensionality Reduction for Similarity Search in Large Document Databases

12 years 1 months ago
Horizontal Reduction: Instance-Level Dimensionality Reduction for Similarity Search in Large Document Databases
—Dimensionality reduction is essential in text mining since the dimensionality of text documents could easily reach several tens of thousands. Most recent efforts on dimensionality reduction, however, are not adequate to large document databases due to lack of scalability. We hence propose a new type of simple but effective dimensionality reduction, called horizontal (dimensionality) reduction, for large document databases. Horizontal reduction converts each text document to a few bitmap vectors and provides tight lower bounds of inter-document distances using those bitmap vectors. Bitmap representation is very simple and extremely fast, and its instance-based nature makes it suitable for large and dynamic document databases. Using the proposed horizontal reduction, we develop an efficient k-nearest neighbor (k-NN) search algorithm for text mining such as classification and clustering, and we formally prove its correctness. The proposed algorithm decreases I/O and CPU overheads sim...
Min-Soo Kim 0001, Kyu-Young Whang, Yang-Sae Moon
Added 28 Sep 2012
Updated 28 Sep 2012
Type Journal
Year 2012
Where ICDE
Authors Min-Soo Kim 0001, Kyu-Young Whang, Yang-Sae Moon
Comments (0)