Keeping the Data Lake in Form: DS-kNN Datasets Categorization Using Proximity Mining

Alserafi, Ayman; Abelló, Alberto; Romero, Oscar; Calders, Toon

doi:10.1007/978-3-030-32065-2_3

Reference: Lecture Notes in Computer Science, 11815 LNCS, pages 35-49
Publication: Published, 2019-06-01

Peer-reviewed article

Abstract:

With the growth of the number of datasets stored in data repositories, there has been a trend of using Data Lakes (DLs) to store such data. DLs store datasets in their raw formats without any transformations or preprocessing, with accessibility available using schema-on-read. This makes it difficult for analysts to find datasets that can be crossed and that belong to the same topic. To support them in this DL governance challenge, we propose in this paper an algorithm for categorizing datasets in the DL into pre-defined topic-wise categories of interest. We utilise a k-NN approach for this task which uses a proximity score for computing similarities of datasets based on metadata. We test our algorithm on a real-life DL with a known ground-truth categorization. Our approach is successful in detecting the correct categories for datasets and outliers with a precision of more than 90% and recall rates exceeding 75% in specific settings.
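
The categorization idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' DS-kNN implementation: the abstract does not define the proximity score, so a Jaccard overlap of attribute names stands in for it here. The names `jaccard`, `categorize`, `EXAMPLE_LAKE`, and the parameters `k` and `min_proximity` (with their default values) are assumptions made only for illustration.

```python
from collections import Counter


def jaccard(a, b):
    """Stand-in proximity score (an assumption, not the paper's metric):
    overlap of two attribute-name collections, in the range [0, 1]."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def categorize(new_attrs, labelled, k=3, min_proximity=0.3):
    """Return the majority category among the k most similar labelled
    datasets, or None (treated as an outlier) if no labelled dataset
    reaches min_proximity."""
    scored = [(jaccard(new_attrs, attrs), cat) for attrs, cat in labelled]
    # Keep only neighbours that are close enough, nearest first, then cut to k.
    close = sorted(
        (s for s in scored if s[0] >= min_proximity),
        key=lambda s: s[0],
        reverse=True,
    )[:k]
    if not close:
        return None
    # Majority vote; on a tie, the category seen first (nearest) wins.
    votes = Counter(cat for _, cat in close)
    return votes.most_common(1)[0][0]


# Hypothetical labelled datasets: (attribute names, known category).
EXAMPLE_LAKE = [
    (["patient_id", "age", "diagnosis"], "health"),
    (["patient_id", "age", "blood_pressure"], "health"),
    (["ticker", "open", "close", "volume"], "finance"),
    (["ticker", "date", "close"], "finance"),
]

if __name__ == "__main__":
    print(categorize(["patient_id", "age", "heart_rate"], EXAMPLE_LAKE))
    print(categorize(["latitude", "longitude"], EXAMPLE_LAKE))
```

In this sketch, a dataset with no sufficiently close neighbour is returned as `None`, which mirrors the abstract's notion of an outlier that belongs to none of the pre-defined categories. Raising `min_proximity` would be expected to trade recall for precision, but the thresholds actually evaluated in the paper are not stated in this record.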
