Large Datasets Lead to Overly Complex Models: An Explanation and a Solution

This paper explores unexpected results that lie at the intersection of two common themes in the KDD community: large datasets and the goal of building compact models. Experiments with many different datasets and several model construction algorithms (including tree learning algorithms such as c4.5 with three different pruning methods, and rule learning algorithms such as c4.5rules and ripper) show that increasing the amount of data used to build a model often results in a linear increase in model size, even when that additional complexity results in no significant increase in model accuracy. Despite the promise of better parameter estimation held out by large datasets, as a practical matter, models built with large amounts of data are often needlessly complex and cumbersome. In the case of decision trees, the cause of this pathology is identified as a bias inherent in several common pruning techniques. Pruning errors made low in the tree, where there is insufficient data to make accurate pa...
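
The kind of measurement described above (model size and accuracy tracked against training set size) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' experimental setup: it uses a tiny unpruned tree learner written for this example rather than c4.5, c4.5rules, or ripper, and it runs on synthetic data. The function names (`make_data`, `grow`, `experiment`), the 12 binary features, and the 20% label noise are all hypothetical choices.

```python
import random
from collections import Counter

def make_data(n, rng, n_features=12, noise=0.2):
    """Synthetic rows (assumption): label equals feature 0, flipped with
    probability `noise`; all other features are irrelevant."""
    rows = []
    for _ in range(n):
        x = tuple(rng.randint(0, 1) for _ in range(n_features))
        y = x[0] if rng.random() >= noise else 1 - x[0]
        rows.append((x, y))
    return rows

def gini(rows):
    """Gini impurity of a set of rows with 0/1 labels."""
    p = sum(y for _, y in rows) / len(rows)
    return 2 * p * (1 - p)

def grow(rows, features):
    """Grow an unpruned tree. A leaf is a label; an internal node is
    the tuple (feature, subtree_for_0, subtree_for_1)."""
    labels = Counter(y for _, y in rows)
    majority = labels.most_common(1)[0][0]
    if len(labels) == 1 or not features:
        return majority
    best = None
    for f in features:
        left = [r for r in rows if r[0][f] == 0]
        right = [r for r in rows if r[0][f] == 1]
        if not left or not right:
            continue  # this feature does not split the rows
        score = len(left) * gini(left) + len(right) * gini(right)
        if best is None or score < best[0]:
            best = (score, f, left, right)
    if best is None:
        return majority
    _, f, left, right = best
    rest = [g for g in features if g != f]
    return (f, grow(left, rest), grow(right, rest))

def size(tree):
    """Total number of nodes (internal nodes plus leaves)."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + size(tree[1]) + size(tree[2])

def predict(tree, x):
    while isinstance(tree, tuple):
        tree = tree[2] if x[tree[0]] else tree[1]
    return tree

def accuracy(tree, rows):
    return sum(predict(tree, x) == y for x, y in rows) / len(rows)

def experiment(sizes=(100, 200, 400, 800), seed=0):
    """Return (training size, tree size, held-out accuracy) per size."""
    rng = random.Random(seed)
    test = make_data(2000, rng)
    results = []
    for n in sizes:
        tree = grow(make_data(n, rng), list(range(12)))
        results.append((n, size(tree), accuracy(tree, test)))
    return results

if __name__ == "__main__":
    for n, nodes, acc in experiment():
        print(f"n={n:5d}  nodes={nodes:5d}  accuracy={acc:.3f}")
```

Because this sketch does no pruning at all, growth in tree size here is expected from the label noise alone; it shows only how the two quantities can be recorded side by side, not the pruning bias the paper analyzes.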

Tim Oates, David Jensen

Accurate Parameter Estimates | Data Mining | KDD 1998 | Large Datasets | Learning Algorithms

Post Info

Added	06 Aug 2010
Updated	06 Aug 2010
Type	Conference
Year	1998
Where	KDD
Authors	Tim Oates, David Jensen
