Information-theoretic variable selection and network inference from microarray data

Jury president: Cardinal, Jean

Supervisor: Bontempi, Gianluca

Publication: Unpublished, 2008-12-16

Doctoral thesis

Abstract:

Statisticians are accustomed to modeling interactions between variables on the basis of observed data. In many emerging fields, such as bioinformatics, they are confronted with datasets having thousands of variables, a lot of noise, non-linear dependencies and only tens of samples. Detecting functional relationships in data that carry this much uncertainty constitutes a major challenge.

Our work focuses on variable selection and network inference from datasets having many variables and few samples (a high variable-to-sample ratio), such as microarray data. Variable selection is the area of machine learning whose objective is to select, among a set of input variables, those that lead to the best predictive model. Applying variable selection methods to gene expression data can, for example, improve cancer diagnosis and prognosis by identifying a new molecular signature of the disease. Network inference consists of representing the dependencies between the variables of a dataset by a graph. Hence, when applied to microarray data, network inference can reverse-engineer the transcriptional regulatory network of a cell, with a view to discovering new drug targets to cure diseases.

In this work, two original tools are proposed: MASSIVE (Matrix of Average Sub-Subset Information for Variable Elimination), a new feature selection method, and MRNET (Minimum Redundancy NETwork), a new network inference algorithm. Both tools rely on the computation of mutual information, an information-theoretic measure of dependency. More precisely, MASSIVE and MRNET approximate the mutual information between a subset of variables and a target variable by combining mutual information terms between sub-subsets of variables and the target. These approximations make it possible to estimate a series of low-variate densities instead of one large multivariate density. Low-variate densities are well suited to datasets with a high variable-to-sample ratio, since they are computationally cheap and do not require a large number of samples to be estimated accurately. Numerous experimental results show the competitiveness of these new approaches. Finally, our thesis has led to freely available source code for MASSIVE and an open-source R and Bioconductor package for network inference.
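
As an illustration of how a selection criterion can be built from low-variate quantities only, here is a minimal Python sketch. It is not the MASSIVE or MRNET implementation: the function names (`mutual_information`, `greedy_select`), the toy data, and the scoring rule (relevance minus average pairwise redundancy, an mRMR-style criterion) are all assumptions made for this example. It uses a plug-in estimate of mutual information on discrete data, so only pairwise (bivariate) distributions are ever estimated.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in (empirical) estimate of I(X;Y) in bits for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def greedy_select(features, target, k):
    """Greedily pick k feature names by relevance minus mean redundancy.

    Illustrative mRMR-style score (an assumption, not the thesis's exact
    criterion): I(feature; target) minus the average of I(feature; s)
    over the already selected features s.
    """
    relevance = {name: mutual_information(col, target)
                 for name, col in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):
        best_name, best_score = None, -math.inf
        for name, col in features.items():
            if name in selected:
                continue
            redundancy = 0.0
            if selected:
                redundancy = sum(mutual_information(col, features[s])
                                 for s in selected) / len(selected)
            score = relevance[name] - redundancy
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected


# Toy data (hypothetical): the target is determined by two independent bits a and b.
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 1, 1, 0, 0, 1, 1]
target = [2 * ai + bi for ai, bi in zip(a, b)]
features = {"g1": a, "g2": list(a), "g3": b}  # g2 duplicates g1

selected = greedy_select(features, target, 2)
print(selected)  # ['g1', 'g3']
```

In this toy case `g1`, `g2` and `g3` each share one bit of information with the target, but `g2` is fully redundant with `g1`, so the second pick is `g3`. Every quantity involved is a bivariate estimate, which is the property that makes such criteria practical when samples are scarce.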
