
EMNLP
2011

Approximate Scalable Bounded Space Sketch for Large Data NLP

We exploit sketch techniques, in particular the Count-Min sketch, a memory- and time-efficient framework that approximates the frequency of a word pair in a corpus without explicitly storing the word pair itself. These methods use hashing to handle massive amounts of streaming text. We apply the Count-Min sketch to approximate word-pair counts and demonstrate its effectiveness on three important NLP tasks. On all three tasks, our experiments show performance comparable to both the exact word-pair count setting and state-of-the-art systems. Our method scales to 49 GB of unzipped web data using a bounded space of 2 billion counters (8 GB of memory).
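As a rough illustration of the idea in the abstract, the following is a minimal Count-Min sketch for word-pair counts. It is a sketch under stated assumptions, not the authors' implementation: the class name `CountMinSketch`, the helper `pair_key`, the choice of salted BLAKE2b as the hash family, the table dimensions, and the toy sentence are all hypothetical. The last lines check that the stated figures are consistent with 4-byte counters, which is also an assumption.

```python
import hashlib


class CountMinSketch:
    """Fixed-size table of counters; queries never underestimate a count."""

    def __init__(self, width, depth):
        self.width = width    # counters per row
        self.depth = depth    # number of rows, one hash function per row
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # Assumption: a salted BLAKE2b digest stands in for the hash family.
        data = key.encode("utf-8")
        for row in range(self.depth):
            salt = row.to_bytes(8, "little")
            digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def update(self, key, count=1):
        # Increment one counter in every row.
        for row, col in self._cells(key):
            self.table[row][col] += count

    def query(self, key):
        # Collisions can only inflate counters, so the minimum over rows
        # is the tightest available estimate.
        return min(self.table[row][col] for row, col in self._cells(key))


def pair_key(left, right):
    # Hypothetical encoding of a word pair as one string key.
    return left + "\t" + right


# Stream adjacent word pairs from a toy sentence; the pairs themselves
# are never stored, only the counters they hash to.
sketch = CountMinSketch(width=1024, depth=4)
tokens = "the cat sat on the mat the cat ran".split()
for left, right in zip(tokens, tokens[1:]):
    sketch.update(pair_key(left, right))

estimate = sketch.query(pair_key("the", "cat"))  # true count is 2

# Memory check, assuming 4-byte counters and decimal gigabytes.
memory_gb = 2_000_000_000 * 4 / 1e9
```

The estimate for a pair is always at least its true count, and it overshoots only when other pairs collide with it in every row. Widening the rows reduces the size of that overshoot, while adding rows makes a large overshoot less likely.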
Amit Goyal, Hal Daumé III
Added 20 Dec 2011
Updated 20 Dec 2011
Type Journal
Year 2011
Where EMNLP
Authors Amit Goyal, Hal Daumé III