by Khalid, Hiba
Jury chair Sakr, Mahmoud
Supervisor Zimanyi, Esteban
Publication Unpublished, 2023-10-17
Doctoral thesis
Abstract: As data production has grown, so has the generation and collection of the metadata that accompanies it. Metadata captured according to standardized procedures can offer valuable insights, leading to improved data analytics, data integration, and resource management. However, the absence of standardized practices often leaves metadata records incomplete, with missing attribute information, absent publishing URLs, or inadequate provenance details. Recorded metadata may also be inconsistent, with variations in value formats, special characters in values, and incorrectly entered values, among other problems.

While the database community has made substantial progress in cleaning, preparing, and transforming data, research on organizing and cleaning metadata has not progressed at the same pace. Metadata files resemble data files, and data preparation and transformation techniques apply to them in theory, but practical implementation is hindered by challenges unique to metadata files. Raw metadata files typically lack a well-defined structure because, in the absence of standards, users record metadata according to their individual preferences. Consequently, these files present not only structural challenges, such as recognizing layouts, elements, and value boundaries, but also semantic issues, including missing information, aliases, and duplicates. Although several approaches have shown promise in metadata collection and discovery, implementing an automated preprocessing pipeline for existing metadata introduces numerous challenges.

To address these challenges, this thesis proposes novel approaches to enhance the quality of existing metadata. The first step involves understanding the organization of information within raw metadata files using a system called MDOrg.
This system employs rule-based agents to discern how information is distributed across metadata files. Using user-supplied labels, the system improves its ability to recognize metadata elements and their boundaries. Experimental results demonstrate that the system learns from the provided labels and rules, which enables its application to unseen data. However, raw metadata files often contain information that must be prepared to mitigate inconsistencies, such as values containing special characters that may hinder processing by the associated rules.

To handle inconsistencies such as non-standardized values, duplicates, and missing values, the thesis introduces MDPrep. This system detects and resolves these inconsistencies, which can degrade the performance of data-driven applications, and saves end users time and effort. By applying data preparation techniques to both the syntactic and the semantic errors specific to metadata files, MDPrep improves the quality and readability of values and thereby the reusability of metadata files. Domain-specific data preparation operations are also considered to further improve the quality of values in metadata files. Performance is evaluated with a set of predefined queries applied to both unprepared and prepared information, and the results show improved performance on prepared information.

Once metadata file layouts are understood and the metadata information is prepared, the thesis introduces MDClean into the metadata improvement pipeline. MDClean focuses on rectifying semantic inconsistencies and enhancing the quality of information within metadata files. It improves the parsing of information about the corresponding data sources, rectifies misplaced metadata values and their associated properties by detecting value-property semantics, and, crucially, tracks the provenance of data sources by inferring their directories through a unique path.
Provenance details are extended by including the directory of MDClean itself in the path chain.

The objective of combining the aforementioned systems with the other systems developed in this thesis is to implement a metadata processing pipeline that ensures the availability of high-quality metadata. Such metadata significantly enhances data-driven decision-making, data integration, resource inference and reusability, and data maintenance.
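
As an illustration only, the kind of value preparation the abstract attributes to MDPrep (special characters, missing values, duplicates) might be sketched as follows. This is not the thesis implementation: the function names, the set of placeholder tokens treated as missing, and the punctuation that is trimmed are all assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical placeholders treated as missing values (an assumption, not from the thesis).
MISSING_TOKENS = {"", "n/a", "na", "none", "null", "unknown", "-"}


def prepare_value(value):
    """Return a normalized string, or None when the value counts as missing."""
    if value is None:
        return None
    # Unify compatibility characters (for example non-breaking spaces).
    text = unicodedata.normalize("NFKC", str(value))
    # Collapse whitespace runs first so that tabs and newlines become single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop remaining control and format characters (Unicode categories C*).
    text = "".join(ch for ch in text if unicodedata.category(ch)[0] != "C")
    # Trim stray punctuation at either end; the character set is illustrative.
    text = text.strip(" ;,|#*\"'")
    return None if text.lower() in MISSING_TOKENS else text


def prepare_record(record):
    """Normalize property names and values, omitting properties with missing values."""
    prepared = {}
    for key, value in record.items():
        clean = prepare_value(value)
        if clean is not None:
            prepared[key.strip().lower()] = clean
    return prepared


def deduplicate(records):
    """Keep the first prepared record for each case-insensitive fingerprint."""
    seen = set()
    unique = []
    for record in records:
        prepared = prepare_record(record)
        fingerprint = tuple(sorted((k, v.casefold()) for k, v in prepared.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(prepared)
    return unique


raw = [
    {"Title": "  ##Open  Data;; ", "Publisher": "N/A"},
    {"title": "open data", "URL": ""},
]
cleaned = deduplicate(raw)
```

Here the two raw records differ only in casing, stray characters, and empty placeholders, so they collapse into a single prepared record. A real system would need domain-specific rules to decide which differences are safe to ignore.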