Sciweavers

DRR
2010

Time and space optimization of document content classifiers

Scaling up document-image classifiers to handle an unlimited variety of document and image types poses serious challenges to conventional trainable classifier technologies. Highly versatile classifiers demand representative training sets, which can be dauntingly large: in investigating document content extraction systems, we have demonstrated the advantages of employing as many as a billion training samples in approximate k-nearest neighbor (kNN) classifiers sped up using hashed K-d trees. We report here on an algorithm, which we call online bin-decimation, for coping with training sets that are too big to fit in main memory, and we show empirically that it is superior to offline pre-decimation, which simply discards a large fraction of the training samples at random before constructing the classifier. The key idea of bin-decimation is to enforce an approximate upper bound on the number of training samples stored in each K-d hash bin; an adaptive statistical technique allows this to ...
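The per-bin cap described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated before it describes the adaptive statistical technique, so the sketch substitutes plain reservoir sampling to hold each bin at a fixed cap while samples stream in. The names `bin_key`, `CappedBins`, `cap`, and `cell_size` are hypothetical, and the grid quantization only stands in for a hashed K-d tree.

```python
import random
from collections import defaultdict


def bin_key(sample, cell_size):
    """Hypothetical bin hash: quantize each feature onto a uniform grid.

    Stand-in for a hashed K-d tree cell; not taken from the paper.
    """
    return tuple(int(x // cell_size) for x in sample)


class CappedBins:
    """Streams training samples into bins, keeping at most `cap` per bin."""

    def __init__(self, cap, cell_size, seed=0):
        self.cap = cap
        self.cell_size = cell_size
        self.bins = defaultdict(list)   # bin key -> retained (sample, label) pairs
        self.seen = defaultdict(int)    # bin key -> samples offered so far
        self.rng = random.Random(seed)

    def add(self, sample, label):
        key = bin_key(sample, self.cell_size)
        self.seen[key] += 1
        bucket = self.bins[key]
        if len(bucket) < self.cap:
            # Bin is not full yet: always keep the sample.
            bucket.append((sample, label))
        else:
            # Assumed substitute for the paper's adaptive technique:
            # reservoir sampling keeps a uniform random subset of size `cap`.
            j = self.rng.randrange(self.seen[key])
            if j < self.cap:
                bucket[j] = (sample, label)


# Usage: stream synthetic 2-D samples without holding them all in memory.
store = CappedBins(cap=4, cell_size=0.25, seed=42)
data_rng = random.Random(7)
for _ in range(10_000):
    point = (data_rng.random(), data_rng.random())
    store.add(point, "text" if point[0] < 0.5 else "image")
print(max(len(bucket) for bucket in store.bins.values()))
```

Unlike offline pre-decimation, which discards samples uniformly before any bins exist, a per-bin cap of this kind thins only the crowded bins and leaves sparse bins untouched.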
Dawei Yin, Henry S. Baird, Chang An
Added 30 Sep 2010
Updated 30 Sep 2010
Type Conference
Year 2010
Where DRR