Seminar Session

Probability and Statistics Seminar

Monday 25 March 2013 at 14:30 - CIRAD, campus de Lavalette, amphi Jacques Alliot
David Causeur (AgroCampus Ouest)

Sparse factor models for high dimensional data

The analysis of data generated by high-throughput technologies has received increasing scrutiny in the statistical literature, motivated in particular by emerging challenges in systems biology, neuroscience and astronomy. Microarray technologies for genome analysis, as well as brain imaging and electroencephalography, share the common goal of providing a detailed overview of complex systems on a large scale. Statistical analysis of the resulting data usually aims at identifying key components of the whole system, essentially through large-scale significance testing, regression or supervised classification. However, usual issues such as the control of error rates in multiple testing or model selection in classification turn out to be challenging in high-dimensional situations. For example, several papers (Leek and Storey, 2007, 2008; Friguet et al., 2009) have pointed out the negative impact of dependence among tests on the consistency of the ranking that results from multiple testing procedures in high dimension. These papers essentially show that unmodeled heterogeneity factors can induce unexpected dependence across the data, which generates high variability in the actual False Discovery Proportion and, more generally, reduces the efficiency of classical simultaneous testing methods.
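The effect of an unmodeled factor on the False Discovery Proportion can be illustrated with a small simulation. The sketch below is not taken from the cited papers: the function name `simulate_fdp`, the single shared factor, the loading value 0.7, the effect size and the fixed rejection threshold are all assumptions chosen for illustration. It compares the spread of the realized FDP across repeated experiments when test statistics are independent and when they share one latent factor.

```python
import numpy as np

def simulate_fdp(n_reps=200, m=500, m1=50, effect=3.0,
                 loading=0.0, z_crit=1.96, seed=0):
    """Return the realized false discovery proportion of each simulated experiment.

    Hypothetical setup: each of the m test statistics equals
    loading * factor + sqrt(1 - loading**2) * noise, so it keeps unit variance
    and any two statistics have correlation loading**2. The first m1
    statistics are true signals shifted by `effect`.
    """
    rng = np.random.default_rng(seed)
    is_null = np.arange(m) >= m1
    fdp = np.empty(n_reps)
    for r in range(n_reps):
        factor = rng.standard_normal()      # one latent factor shared by all tests
        noise = rng.standard_normal(m)      # test-specific noise
        z = loading * factor + np.sqrt(1.0 - loading**2) * noise
        z[:m1] += effect
        rejected = np.abs(z) > z_crit       # fixed two-sided threshold
        n_rejected = rejected.sum()
        fdp[r] = (rejected & is_null).sum() / max(n_rejected, 1)
    return fdp

fdp_indep = simulate_fdp(loading=0.0)   # independent tests
fdp_dep = simulate_fdp(loading=0.7)     # tests sharing an unmodeled factor

print("independent: mean FDP %.3f, sd %.3f" % (fdp_indep.mean(), fdp_indep.std()))
print("dependent:   mean FDP %.3f, sd %.3f" % (fdp_dep.mean(), fdp_dep.std()))
```

Under these assumptions the realized FDP fluctuates far more from one experiment to the next in the dependent case, because the shared factor inflates or deflates all null statistics at once.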

Models for interaction networks among the components of a complex system often reveal a few key components whose changes lead to variations in other connected components. This suggests that it is crucial to account for the system-wide dependence structure when selecting these key components. A sparse factor model is proposed to identify a low-dimensional linear kernel that captures data dependence. ℓ1-penalized estimation algorithms are presented, and strategies are derived for module detection in Graphical Gaussian Models for networks and for model selection in supervised classification. The properties are illustrated by issues in statistical genomics (see Blum et al., 2010) and in the analysis of ERP curves (see Causeur et al., 2012).
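To make the idea of a sparse factor model concrete, here is a minimal sketch of one generic penalized scheme. It is not the algorithm presented in the talk: it fits a single factor by alternating a soft-thresholding (lasso-type) update of the loadings with a normalized update of the scores, in the spirit of sparse rank-one matrix approximation. The names `soft_threshold` and `sparse_one_factor`, the simulated data and the penalty value `lam=5.0` are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(a, lam):
    """Elementwise soft-thresholding, the proximal operator of lam * |.|."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_one_factor(X, lam, n_iter=100):
    """Alternating minimization of 0.5 * ||Xc - z b^T||_F^2 + lam * ||b||_1
    over loadings b and unit-norm scores z, where Xc is the column-centered X.
    Illustrative single-factor sketch only.
    """
    Xc = X - X.mean(axis=0)
    # Initialize the scores with the leading left singular vector.
    u, _, _ = np.linalg.svd(Xc, full_matrices=False)
    z = u[:, 0]
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # With ||z|| = 1, the penalized update of b is a soft-thresholding.
        b = soft_threshold(Xc.T @ z, lam)
        if not b.any():          # penalty too strong: every loading is zero
            break
        # For fixed b, the best unit-norm scores are proportional to Xc b.
        z = Xc @ b
        z /= np.linalg.norm(z)
    return z, b

# Simulated data (assumed setup): 20 of 200 variables load on one latent factor.
rng = np.random.default_rng(1)
n, p, k = 100, 200, 20
b_true = np.zeros(p)
b_true[:k] = 1.0
f = rng.standard_normal(n)
X = np.outer(f, b_true) + rng.standard_normal((n, p))

z_hat, b_hat = sparse_one_factor(X, lam=5.0)
print("variables with nonzero estimated loading:", np.flatnonzero(b_hat))
```

The penalty shrinks the loadings of variables unrelated to the factor to exactly zero, so the estimated support points at the variables that drive the shared dependence. In practice the penalty level would need to be tuned, and several factors would be estimated jointly.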

keywords: high dimension, factor model, Graphical Gaussian Model, LASSO, selection stability, supervised classification.