DS-prox: Dataset proximity mining for governing the data lake

Alserafi, Ayman; Calders, Toon; Abelló, Alberto; Romero, Oscar

doi:10.1007/978-3-319-68474-1_20

Reference: Lecture Notes in Computer Science, 10609 LNCS, pp. 284-299
Publication: Published, 2017

Peer-reviewed article

Abstract:

With the arrival of Data Lakes (DL) there is an increasing need for efficient dataset classification to support data analysis and information retrieval. Our goal is to use meta-features describing datasets to detect whether they are similar. We utilise a novel proximity mining approach to assess the similarity of datasets. The proximity scores are used as an efficient first step, where pairs of datasets with high proximity are selected for further time-consuming schema matching and deduplication. The proposed approach helps in early-pruning unnecessary computations, thus improving the efficiency of similar-schema search. We evaluate our approach in experiments using the OpenML online DL, which shows significant efficiency gains above 25% compared to matching without early-pruning, and recall rates reaching higher than 90% under certain scenarios.
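
The early-pruning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the meta-features, the dataset names, the proximity formula, and the threshold are all assumptions chosen for illustration. Each dataset is reduced to a small vector of meta-features, a cheap proximity score is computed for every pair, and only the pairs that reach the threshold would be passed on to the expensive schema matching step.

```python
from itertools import combinations
from math import sqrt

# Hypothetical meta-features per dataset: (number of attributes,
# fraction of numeric attributes, mean fraction of missing values).
# Names and values are made up for this sketch.
META_FEATURES = {
    "sales_2016": (12, 0.75, 0.02),
    "sales_2017": (12, 0.75, 0.03),
    "tweets": (4, 0.25, 0.40),
}


def proximity(a, b):
    """Map the distance between two meta-feature vectors into (0, 1].

    Each difference is scaled by the larger magnitude of the two values so
    that no single feature dominates. The formula is an assumption of this
    sketch, not the measure used in the paper.
    """
    total = 0.0
    for x, y in zip(a, b):
        scale = max(abs(x), abs(y))
        if scale > 0:
            total += ((x - y) / scale) ** 2
    return 1.0 / (1.0 + sqrt(total))


def candidate_pairs(features, threshold):
    """Return the dataset pairs whose proximity reaches the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(sorted(features.items()), 2):
        if proximity(a, b) >= threshold:
            pairs.append((name_a, name_b))
    return pairs


# Only these pairs would go on to full schema matching and deduplication.
selected = candidate_pairs(META_FEATURES, threshold=0.7)
print(selected)
```

With these made-up values, only the pair (sales_2016, sales_2017) reaches the threshold, so two of the three possible comparisons are pruned before any schema matching takes place. Raising the threshold prunes more pairs at the risk of missing true matches, which is the efficiency versus recall trade-off the abstract reports.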
