Doctoral Thesis
Abstract: Data management systems are essential components of modern data-driven applications, given the increasing need to handle large volumes of data efficiently. Their performance is therefore a critical factor in the efficiency of data processing pipelines, as they play a crucial role in storing and managing the data. Modern database systems expose a variety of parameters that users and database administrators can configure to optimize the database for a specific application. Traditionally, the task of configuring database parameters has been performed manually, although in recent years several methods have been proposed to automate this process. However, many of these methods rely on statistical models that require large amounts of data and fail to capture all the factors affecting database performance, or they implement complex algorithmic solutions.

Evaluating the performance of a database is a fundamental operation, as it is a task common to all tuning methods. However, despite the general consensus on the importance of this task, little guidance is usually provided to practitioners who need to benchmark their database. In particular, many works in the area of database optimization do not provide adequate information about the setup used in their experiments and analyses.

The growing complexity of data management has led to the widespread adoption of data pipelines, which are sequences of software programs that automate the collection, storage, processing, and analysis of large amounts of data. Data pipelines play a crucial role in modern data-driven applications, enabling the efficient processing and analysis of large volumes of data. However, these pipelines are composed of different systems and can become very complex due to the heterogeneity of their components. This makes their configuration a difficult task, given the many factors to be considered, such as the intricate interactions between the component systems.

In this thesis we leverage the well-developed research on automatic configuration in optimization and machine learning, and we propose the use of irace, a general-purpose configuration tool, to automatically find the best parameter configuration for a database in a given context. The irace configurator is a black-box optimizer, meaning that it requires no prior knowledge of the system to be configured. As such, we demonstrate the potential of this methodology both for stand-alone databases and for more complex pipelines composed of a sequence of software systems.

We start by configuring a single NoSQL database, Cassandra, under different scenarios, using the YCSB benchmark. In this work, we achieve good results for Cassandra in terms of performance and scalability, outperforming an existing state-of-the-art configuration tool, and we provide an analysis of the tuned configurations. We then extend the methodology to optimize the performance of a data collection pipeline composed of Kafka and Elasticsearch. We demonstrate that a simple technique can improve the performance of an entire pipeline, and we establish a procedure for obtaining consistent and scalable performance improvements.

Finally, this thesis reports an experimental procedure that, through a sequence of experiments, analyzes the impact of various choices in the design of a database benchmark, leading to the identification of an experimental setup that balances the consistency of the results with the time needed to obtain them. We show that the minimal experimental setup we obtain is also representative of heavier scenarios, which makes it possible for the results of optimization tasks to scale.