Doctoral thesis
Abstract: Business Intelligence (BI) is the set of techniques and technologies that support decision making by providing an aggregated view of an organization's data. Because the events and applications running in an organization hold large amounts of potentially useful data, the BI market calls for new technologies that can exploit these data for analysis wherever they are available. In particular, Extract, Transform, and Load (ETL) processes, the fundamental BI technology responsible for integrating and cleansing organizational data, must meet these requirements.

However, the development of ETL processes is still considered very complex and time-consuming, to the point that roughly 80% of BI project effort is dedicated to ETL development. Among the phases of the ETL development life cycle, ETL modeling is a critical and laborious task. Indeed, this phase produces the first effective formal representation of the ETL process, i.e., the ETL model, which is reused and refined in the subsequent phases of development.

Typically, ETL processes are modeled using vendor-specific ETL tools from the very beginning of development. However, these tools are unsuitable for business users, since they produce overwhelmingly fine-grained models.

As an attempt to provide more appropriate tools to business users, vendor-independent ETL modeling languages have been proposed in the literature. Nevertheless, they remain immature. To obtain a precise view of these languages, we conduct a survey that: i) defines a set of criteria associated with the major ETL requirements identified in the literature; ii) compares the surveyed conceptual languages, which stem from research work, with the physical languages, which stem from prominent ETL tools; and iii) studies the overall ETL development methodologies associated with these modeling languages.

The analysis of our survey reveals several drawbacks in responding to the ETL requirements. In particular, the conceptual languages offer incomplete sets of elements for ETL modeling, with little or no formalization. Several languages are purely descriptive: they can neither be automatically translated into executable code nor be automatically maintained as changes occur over time.

To address these shortcomings, we present, in this thesis, a novel approach that tackles the whole development life cycle of ETL processes.

First, we propose a new vendor-independent language for modeling ETL processes in the same way as typical business processes, i.e., the processes responsible for managing the operations of an organization. The rationale behind this proposal is to give ETL processes better access to the data in the organization's events and applications, including fresh data, and better design capabilities, such as analysis available to any user. By using the standard representation mechanism BPMN (Business Process Model and Notation) and a classification of ETL elements resulting from a study of the most widely used commercial and open-source ETL tools, the language enables building agile and full-fledged ETL processes. We name our language BPMN4ETL, for BPMN for ETL processes.

Second, we build a model-driven framework that provides automatic code generation and improves maintenance support for our ETL language. We use Model-Driven Development (MDD) because it helps in developing software, particularly by automating the transformation from one phase of software development to the next. We present a set of model-to-text transformations that produce code for different business process engines and ETL engines. We also describe model-to-model transformations that automatically update the ETL models so that the generated code can be maintained as data sources evolve. A demonstration on a case study serves as an initial validation, showing that the framework, which covers modeling, implementation, and maintenance, can be used in practice.
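
As a rough illustration of the two kinds of transformation mentioned above, the following Python sketch renders a toy ETL model as SQL text (model-to-text) and rewrites the model after a source column is renamed (model-to-model). Everything in it is an assumption made for illustration: the `Task` class, the functions `to_sql` and `rename_source_column`, and the table and column names are hypothetical and do not reproduce the BPMN4ETL metamodel or the transformations defined in the thesis.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Task:
    """One step of a toy, linear ETL model (hypothetical, not the BPMN4ETL metamodel)."""
    kind: str                                   # "extract", "filter", or "load"
    table: str = ""                             # source or target table
    mapping: Tuple[Tuple[str, str], ...] = ()   # (source column, target column) pairs
    column: str = ""                            # column tested by a filter task
    predicate: str = ""                         # e.g. "> 0"


def to_sql(tasks: List[Task]) -> str:
    """Model-to-text: render an extract -> filter* -> load model as one SQL statement."""
    source = next(t for t in tasks if t.kind == "extract")
    target = next(t for t in tasks if t.kind == "load")
    filters = [f"{t.column} {t.predicate}" for t in tasks if t.kind == "filter"]
    source_cols = ", ".join(src for src, _ in source.mapping)
    target_cols = ", ".join(dst for _, dst in source.mapping)
    sql = (f"INSERT INTO {target.table} ({target_cols}) "
           f"SELECT {source_cols} FROM {source.table}")
    if filters:
        sql += " WHERE " + " AND ".join(filters)
    return sql + ";"


def rename_source_column(tasks: List[Task], old: str, new: str) -> List[Task]:
    """Model-to-model: return a new model in which a renamed source column is propagated."""
    updated = []
    for t in tasks:
        if t.kind == "extract":
            mapping = tuple((new if src == old else src, dst) for src, dst in t.mapping)
            t = replace(t, mapping=mapping)
        elif t.kind == "filter" and t.column == old:
            t = replace(t, column=new)
        updated.append(t)
    return updated


# Illustrative model; the table and column names are assumptions.
model = [
    Task(kind="extract", table="Orders",
         mapping=(("OrderID", "order_id"), ("Freight", "freight"))),
    Task(kind="filter", column="Freight", predicate="> 0"),
    Task(kind="load", table="FactOrders"),
]

print(to_sql(model))
# The source schema evolves: Freight is renamed to ShippingCost.
evolved = rename_source_column(model, "Freight", "ShippingCost")
print(to_sql(evolved))
```

In this sketch the code is never edited by hand: after the source schema changes, the model is updated and the code is regenerated from it, which is the maintenance idea the paragraph describes.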

To illustrate the new concepts introduced in the thesis, mainly the BPMN4ETL language and the implementation and maintenance framework, we use a case study based on the fictitious Northwind Traders company, a retailer that imports and exports food from around the world.