WWW
2008
ACM

Learning deterministic regular expressions for the inference of schemas from XML data

Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirical...
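
To make the k-ORE definition in the abstract concrete, the following is a minimal sketch that only measures the occurrence bound of an expression, that is, the smallest k such that no alphabet symbol appears more than k times. It is not the paper's learning algorithm, and it does not test determinism. The function names occurrence_bound and is_k_ore are hypothetical, and the sketch assumes a DTD-like syntax in which symbols are element names written as identifiers and every other character is an operator.

```python
import re
from collections import Counter


def occurrence_bound(expr: str) -> int:
    """Return the smallest k such that no symbol occurs more than k times.

    Assumption (not from the paper): symbols are identifier-like element
    names, and everything else (parentheses, '|', ',', '?', '*', '+',
    whitespace) is treated as an operator and ignored.
    """
    symbols = re.findall(r"[A-Za-z_][A-Za-z0-9_.-]*", expr)
    counts = Counter(symbols)
    # An expression with no symbols has an occurrence bound of 0.
    return max(counts.values(), default=0)


def is_k_ore(expr: str, k: int) -> bool:
    """Check the occurrence condition of a k-ORE (determinism is not checked)."""
    return occurrence_bound(expr) <= k


if __name__ == "__main__":
    # Each element name occurs once, so the bound is 1.
    print(occurrence_bound("(title, author+, (year | date)?)"))
    # The symbol 'a' occurs twice, so this needs k >= 2.
    print(occurrence_bound("(a, b)*, a"))
```

Under these assumptions, a content model such as (title, author+, (year | date)?) satisfies the condition for k = 1, while (a, b)*, a requires k = 2 because a occurs twice.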
Added 21 Nov 2009
Updated 21 Nov 2009
Type Conference
Year 2008
Where WWW
Authors Geert Jan Bex, Wouter Gelade, Frank Neven, Stijn Vansummeren