Efficient enumeration algorithms for regular document spanners

Florenzano, Fernando; Riveros, Cristian; Ugarte Caraball, Martin Ignacio; Vansummeren, Stijn; Vrgoč, Domagoj

doi:10.1145/3351451

Reference: ACM Transactions on Database Systems, 45, 1, 3
Publication: Published, 2020-02-01

Peer-reviewed article

Abstract:

Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages to locate the data that a user wants to extract from a text document and then store this data into variables. Since document spanners can easily generate large outputs, it is important to have efficient evaluation algorithms that can generate the extracted data in a quick succession, and with relatively little precomputation time. Toward this goal, we present a practical evaluation algorithm that allows output-linear delay enumeration of a spanner's result after a precomputation phase that is linear in the document. Although the algorithm assumes that the spanner is specified in a syntactic variant of variable-set automata, we also study how it can be applied when the spanner is specified by general variable-set automata, regex formulas, or spanner algebras. Finally, we study the related problem of counting the number of outputs of a document spanner and provide a fine-grained analysis of the classes of document spanners that support efficient enumeration of their results.
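To make the notion of a document spanner concrete, the following minimal sketch enumerates every span of a document whose content matches a regular expression and stores it in a single capture variable. It is an illustration only, not the algorithm from the paper: it is a naive quadratic scan and offers neither linear precomputation nor output-linear delay. The function name `enumerate_spans`, the variable name `x`, and the 0-based half-open span convention are assumptions made for this example.

```python
import re
from typing import Dict, Iterator, Tuple

# A span is a half-open interval [start, end) with 0-based positions
# (a convention chosen for this sketch).
Span = Tuple[int, int]


def enumerate_spans(pattern: str, document: str, var: str = "x") -> Iterator[Dict[str, Span]]:
    """Lazily yield every mapping {var: span} whose text fully matches `pattern`.

    Naive illustration: it tests all O(n^2) candidate spans one by one.
    """
    compiled = re.compile(pattern)
    n = len(document)
    for start in range(n + 1):
        for end in range(start, n + 1):
            # Pattern.fullmatch(string, pos, endpos) checks document[start:end].
            if compiled.fullmatch(document, start, end):
                yield {var: (start, end)}


if __name__ == "__main__":
    for mapping in enumerate_spans("a+", "aab"):
        print(mapping)
```

Unlike `re.finditer`, which reports only non-overlapping matches, this sketch returns all matching spans, including overlapping ones. That is why a spanner's output can be much larger than the document, and why the delay between consecutive outputs matters.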

