ADtrees for sequential data and n-gram Counting

Abstract: We consider the problem of efficiently storing n-gram counts for large n over very large corpora. In such cases, the efficient storage of sufficient statistics can have a dramatic impact on system performance. One popular model for storing such data derived from tabular data sets with many attributes is the ADtree. Here, we adapt the ADtree to benefit from the sequential structure of corpora-type data. We demonstrate the usefulness of our approach on a portion of the well-known Wall Street Journal corpus from the Penn Treebank and show that our approach is exponentially more efficient than the naïve approach to storing n-grams and is also significantly more efficient than a traditional prefix tree.
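For readers unfamiliar with the prefix-tree baseline mentioned in the abstract, the following is a minimal Python sketch of storing n-gram counts in a prefix tree. It is only an illustration of that baseline, not the authors' ADtree adaptation, whose details are not given on this page. The names TrieNode, build_ngram_trie and ngram_count, and the toy token sequence, are hypothetical and chosen for the example.

```python
from collections import defaultdict


class TrieNode:
    """One node of a prefix tree; `count` is how often the path to it occurred."""

    def __init__(self):
        self.count = 0
        self.children = defaultdict(TrieNode)


def build_ngram_trie(tokens, n):
    """Insert every window of up to n tokens, counting each prefix along the way."""
    root = TrieNode()
    for start in range(len(tokens)):
        node = root
        # Windows near the end of the sequence are shorter than n.
        for token in tokens[start:start + n]:
            node = node.children[token]
            node.count += 1
    return root


def ngram_count(root, ngram):
    """Return the count of `ngram` (a sequence of tokens), or 0 if unseen."""
    node = root
    for token in ngram:
        # Use .get so lookups never create new nodes in the defaultdict.
        node = node.children.get(token)
        if node is None:
            return 0
    return node.count


# Toy data (an assumption for illustration, not the Wall Street Journal corpus).
tokens = "the cat sat on the cat mat".split()
trie = build_ngram_trie(tokens, n=3)
```

Because n-grams that share a prefix share nodes, a call such as ngram_count(trie, ["the", "cat"]) reads the count for that bigram from the same path that stores both trigrams beginning with "the cat".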

Robert Van Dam, Dan Ventura

Control Systems | SMC 2007 | Storing N-grams | Tabular Data Sets | Well-known Wall Street |

Post Info

Added	04 Jun 2010
Updated	04 Jun 2010
Type	Conference
Year	2007
Where	SMC
Authors	Robert Van Dam, Dan Ventura
