Doctoral thesis
Abstract: NoSQL is an umbrella term for storage systems that offer an alternative to traditional Relational Database Management Systems (RDBMSs). At the time of writing, more than 200 NoSQL systems are available, and they can be classified into four main categories according to their data storage model: key-value stores, document stores, column-family stores, and graph stores. Document stores have gained popularity mainly because of their semi-structured data storage model and their rich query capabilities compared to other NoSQL systems, which makes them an ideal candidate for rapid prototyping. Document stores encourage users to take a data-first approach as opposed to a design-first one. Database design on document stores is mainly carried out by trial and error or with ad hoc rules rather than through a formal process such as normalization in an RDBMS. However, these approaches can easily lead to a non-optimal database design, which incurs additional costs in query processing, data storage, and redesign.

This PhD thesis aims to provide a novel multi-criteria approach to database design in document stores. Most existing approaches to database design optimize query performance alone. However, other factors matter as well, such as the storage requirements and the complexity of the stored documents, which are specific to each use case. Moreover, the different combinations of referencing and nesting of data produce a large solution space of alternative designs. Hence, we believe multi-criteria optimization, which has a proven track record of solving such problems in various domains, is well suited to this problem. However, to apply multi-criteria optimization to the data design problem, we first need to address several issues.

First, we evaluate the impact of alternative storage representations of semi-structured data. There are multiple equivalent ways to physically represent semi-structured data, but there is a lack of evidence about their potential impact on space and query performance.
Thus, we embark on the task of quantifying that impact precisely for document stores. We empirically compare multiple ways of representing semi-structured data, which allows us to derive a set of guidelines for efficient physical database design that considers both JSON and relational options in the same palette.

Then, we need a formal canonical model capable of representing alternative designs. To this end, we propose a hypergraph-based approach for representing heterogeneous data store designs. Taking an existing common programming interface to NoSQL systems, we extend and formalize it as hypergraphs. Then, we define design constraints and query transformation rules for three representative data store types. Next, we propose a simple algorithm that rewrites a generic query into one specific to the underlying data store, and we provide a prototype implementation. Furthermore, we introduce a storage statistics estimator for the underlying data stores. Finally, we show the feasibility of our approach on a use case of an existing polyglot system, as well as its usefulness in metadata and physical query path calculations.

Next, we require a formal query cost model to estimate and evaluate query performance on alternative document store designs. Document stores use primitive approaches to query processing instead of a formal cost model: they either evaluate all possible query plans to find the winning one and reuse it for subsequent similar queries, or rely on the end user to specify which indexes to use. However, we require a reliable way to compare how two alternative designs perform on a specific query. For this, we define a generic storage and query cost model based on disk access and memory allocation that allows estimating the impact of design decisions.
Since all document stores carry out data operations in memory, we first estimate the memory usage by considering the characteristics of the stored documents, their access patterns, and the memory management algorithms. Then, using this estimate and the metadata storage size, we introduce a cost model for random access queries. To the best of our knowledge, this is the first attempt at such an approach. Finally, we validate our work on two well-known document store implementations: MongoDB and Couchbase. The results show that the memory usage estimates have an average precision of 91% and that the predicted costs are highly correlated with the actual execution times. In the course of this work, we were able to suggest several improvements to document storage systems. Thus, this cost model also helps identify discrepancies between document store implementations and their theoretical expectations.

Finally, we implement the automated database design solution using multi-criteria optimization. First, we introduce an algebra of transformations that can systematically modify a design in our canonical representation. Then, using these transformations, we implement a local search algorithm driven by a loss function that can propose near-optimal designs with high probability. Finally, we compare our prototype against an existing document store data design solution driven purely by query cost. Our proposed designs perform better and are more compact, with less redundancy.
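
To make the idea of a loss-driven local search over nesting and referencing choices more concrete, the following is a minimal illustrative sketch and not the algorithm of the thesis. It assumes a design can be reduced to one choice per relationship (`"nest"` or `"ref"`), uses a single "flip one choice" move in place of the algebra of transformations, and combines three toy criteria (query cost, storage, complexity) in a weighted sum. The function names, the cost constants, and the weights are all hypothetical, and the search is plain steepest-descent hill climbing.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: one choice per relationship, "nest" or "ref".
Design = Tuple[str, ...]


def neighbours(design: Design) -> List[Design]:
    """Return every design reachable by flipping a single relationship."""
    result = []
    for i, choice in enumerate(design):
        flipped = "ref" if choice == "nest" else "nest"
        result.append(design[:i] + (flipped,) + design[i + 1:])
    return result


def loss(design: Design, weights: Dict[str, float]) -> float:
    """Weighted sum of three toy criteria (all constants are assumptions)."""
    nested = design.count("nest")
    referenced = len(design) - nested
    query_cost = referenced * 3.0    # assume each reference costs an extra lookup
    storage = nested * 1.5           # assume nesting duplicates some data
    complexity = nested ** 2 * 0.25  # assume nesting makes documents harder to handle
    return (weights["query"] * query_cost
            + weights["storage"] * storage
            + weights["complexity"] * complexity)


def local_search(start: Design, weights: Dict[str, float],
                 max_steps: int = 100) -> Tuple[Design, float]:
    """Steepest-descent hill climbing: move to the best neighbour until none improves."""
    current, current_loss = start, loss(start, weights)
    for _ in range(max_steps):
        best = min(neighbours(current), key=lambda d: loss(d, weights))
        best_loss = loss(best, weights)
        if best_loss >= current_loss:
            break  # local optimum under the single-flip move
        current, current_loss = best, best_loss
    return current, current_loss


if __name__ == "__main__":
    equal_weights = {"query": 1.0, "storage": 1.0, "complexity": 1.0}
    design, value = local_search(("ref", "ref", "ref", "ref"), equal_weights)
    print(design, value)
```

Changing the weights shifts the trade-off between the criteria; a search of this kind only guarantees a local optimum with respect to the moves it is given.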