
ICDAR
2009
IEEE

Document Content Extraction Using Automatically Discovered Features

We report an automatic feature discovery method that achieves results comparable to a manually chosen, larger feature set on a document image content extraction problem: the location and segmentation of regions containing handwriting and machine-printed text in document images. As first detailed in [17], this approach is a greedy forward selection algorithm that iteratively constructs one linear feature at a time. The algorithm finds error clusters in the current feature space, then projects one tight cluster into the null space of the feature mapping, where a new feature that helps to classify these errors can be discovered. We conducted experiments on 87 diverse test images. The algorithm was given four manually chosen linear features with an error rate of 16.2%; it then found ten additional features, and the combined 14 features achieved an error rate of 13.8%. This outperforms a feature set of size 14 chosen by Principal Component Analysis (PCA) with an error rate ...
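The loop described in the abstract (find misclassified samples, pick a tight cluster of them, look in the null space of the current linear features for a new direction) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the nearest-class-mean classifier, the way the tight cluster is chosen, the class-mean-difference rule for the new direction, the toy data, and every function and parameter name (`null_space_basis`, `discover_feature`, `greedy_discovery`, `m`) are assumptions made for the example. The abstract does not specify these details; see [17] for the actual method.

```python
import numpy as np

def null_space_basis(W, tol=1e-10):
    """Orthonormal basis (as columns) of the null space of the k x d matrix W."""
    _, s, vt = np.linalg.svd(W, full_matrices=True)
    rank = int(np.sum(s > tol))
    return vt[rank:].T  # shape (d, d - rank)

def predict(F, y):
    """Toy classifier (assumption): assign each row of F to the nearest class mean."""
    means = np.stack([F[y == c].mean(axis=0) for c in (0, 1)])
    d2 = ((F[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def error_rate(X, y, W):
    """Training error of the toy classifier on the features X @ W.T."""
    return float(np.mean(predict(X @ W.T, y) != y))

def discover_feature(X, y, W, m=50):
    """Propose one new unit-norm linear feature orthogonal to the rows of W."""
    F = X @ W.T
    err = np.flatnonzero(predict(F, y) != y)
    if err.size == 0:
        return None
    # Assumed notion of a "tight cluster": seed at the error whose
    # m-th nearest fellow error (in feature space) is closest.
    D = np.linalg.norm(F[err][:, None, :] - F[err][None, :, :], axis=2)
    kth = min(m, err.size) - 1
    seed = err[np.argmin(np.sort(D, axis=1)[:, kth])]
    # Neighbourhood of the seed: samples the current features cannot tell apart.
    near = np.argsort(np.linalg.norm(F - F[seed], axis=1))[:m]
    N = null_space_basis(W)
    labels = y[near]
    if N.shape[1] == 0 or labels.min() == labels.max():
        return None
    # Project the neighbourhood into the null space and separate the two
    # classes there by the difference of class means (assumed rule).
    Z = X[near] @ N
    direction = Z[labels == 1].mean(axis=0) - Z[labels == 0].mean(axis=0)
    w = N @ direction
    norm = np.linalg.norm(w)
    return None if norm == 0 else w / norm

def greedy_discovery(X, y, W, n_new, m=50):
    """Greedy forward selection: append one discovered feature per iteration."""
    for _ in range(n_new):
        w = discover_feature(X, y, W, m)
        if w is None:
            break
        W = np.vstack([W, w])
    return W

# Synthetic toy data (not from the paper): coordinate 0 carries a weak class
# signal and is the only initial feature; coordinate 2 carries a strong
# signal that the initial feature cannot see.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
X = rng.normal(size=(400, 5))
X[:, 0] += 0.5 * y
X[:, 2] += 3.0 * y
W0 = np.array([[1.0, 0.0, 0.0, 0.0, 0.0]])
W1 = greedy_discovery(X, y, W0, n_new=1)
print(error_rate(X, y, W0), error_rate(X, y, W1))
```

Because each new feature is built from a null-space basis, it is orthogonal to the features already in use, so it can only add information the current feature mapping discards. In this toy setting the discovered direction lines up mostly with the hidden coordinate.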
Sui-Yu Wang, Henry S. Baird, Chang An
Added 18 Feb 2011
Updated 18 Feb 2011
Type Journal
Year 2009
Where ICDAR