Learning Deterministic Regular Expressions for the Inference of Schemas from XML data

Bex, Geert Jan; Gelade, Wouter; Neven, Frank; Vansummeren, Stijn

doi:10.1145/1367497.1367609

Reference: Proceedings of the 17th International Conference on World Wide Web, WWW '08, ACM Press, pages 825-834
Publication: Published, 2008

Conference proceedings paper

Abstract:

Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of regular expressions in which each alphabet symbol occurs at most k times, for some small number k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.
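To make the k-ORE definition concrete, the following is a minimal Python sketch, not code from the paper. It computes the smallest k for which a given expression is a k-occurrence regular expression by counting how often each symbol appears. The function name `occurrence_bound` and the DTD-like input syntax are assumptions made for illustration. The sketch also assumes the input contains only element names and operators (no keywords such as `#PCDATA`), and it does not check whether the expression is deterministic.

```python
import re
from collections import Counter


def occurrence_bound(content_model: str) -> int:
    """Return the smallest k such that the expression is a k-ORE.

    Hypothetical helper: every run of name characters is treated as one
    alphabet symbol (an element name); operators such as , | * + ? ( )
    are ignored.
    """
    symbols = re.findall(r"[A-Za-z_][\w.\-]*", content_model)
    counts = Counter(symbols)
    # The bound is the count of the most frequent symbol (0 if none occur).
    return max(counts.values(), default=0)


# Each element name occurs once, so this is a 1-ORE.
single = occurrence_bound("(title, author+, year?)")

# "author" occurs twice, so this expression is a 2-ORE but not a 1-ORE.
double = occurrence_bound("(title, author, (author | editor)*)")
```

Under these assumptions, `single` is 1 and `double` is 2. Any k-ORE is also a (k+1)-ORE, so the classes grow with k, which is why the abstract speaks of learning for increasing values of k.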
