Seminar Session

Probability and Statistics Seminar

Monday, December 16, 2019 at 13:45 - UM - Building 09 - Conference room (1st floor)

Ghislain Durif (Université de Montpellier)

Dimension reduction approaches for Single Cell Expression Data Analysis

Thanks to the development of high-throughput single-cell technologies in recent years, it is now possible to explore the genomic and functional diversity among the cells of a single organism. In particular, gene expression quantification at the single-cell level (scRNA-seq data) provides unique insight into cell-to-cell and gene-to-gene variability. However, single-cell expression data are (i) high-dimensional (thousands of cells and genes) and (ii) over-dispersed count data with drop-outs (a zero-inflation phenomenon). Because of these specificities, analysing such data remains a statistical challenge, both for data exploration (unsupervised) and for classification (supervised). We will focus on dimension reduction approaches specifically tailored to these types of questions. First, we will present an unsupervised approach that provides a low-dimensional representation of single-cell expression data, based on a probabilistic version of principal component analysis (PCA). In particular, we propose a probabilistic Count Matrix Factorization (pCMF) method, which relies on a sparse Gamma-Poisson factor model with two specific components that handle zero-inflation and enforce sparsity in the low-dimensional representation of genes. This framework induces a geometry that is suitable for visualizing single-cell expression data. Then, we will focus on a classification problem: identifying unknown cell types based on gene expression. For this problem, we will present a latent space projection method combined with a sparsity-based variable selection framework, called sparse PLS. We propose a generalization of sparse PLS regression to binary or multi-group classification, designed to overcome the convergence and stability issues (regarding prediction and selection) that arise when combining this type of dimension reduction algorithm with logistic regression.
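
To make the data structure described in the abstract concrete, here is a minimal sketch that simulates counts from a zero-inflated, sparse Gamma-Poisson factor model. It is not the pCMF implementation: the function name `simulate_zi_gamma_poisson`, the Gamma shape and scale values, the 0.5 sparsity probability, and the single global drop-out probability are all assumptions chosen for illustration only.

```python
import numpy as np


def simulate_zi_gamma_poisson(n_cells=200, n_genes=50, n_factors=2,
                              dropout_prob=0.3, seed=0):
    """Simulate a cells x genes count matrix (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)

    # Non-negative latent factors with Gamma priors (hyperparameters are assumed):
    # U holds the cell coordinates, V holds the gene loadings.
    U = rng.gamma(shape=2.0, scale=1.0, size=(n_cells, n_factors))
    V = rng.gamma(shape=2.0, scale=1.0, size=(n_genes, n_factors))

    # Sparsity on the gene side: each loading is kept with probability 0.5.
    sparsity_mask = rng.random((n_genes, n_factors)) < 0.5
    V = V * sparsity_mask

    # Poisson counts whose rate matrix is the low-rank product U V^T.
    rate = U @ V.T
    counts = rng.poisson(rate)

    # Drop-outs: each entry is independently zeroed with probability dropout_prob.
    keep = rng.random((n_cells, n_genes)) >= dropout_prob
    X = counts * keep
    return X, U, V


if __name__ == "__main__":
    X, U, V = simulate_zi_gamma_poisson()
    print("count matrix shape:", X.shape)
    print("fraction of zeros:", (X == 0).mean())
```

In this sketch the rows of `U` play the role of the low-dimensional representation of cells, and the zeros in `X` come from two sources: genuine zero Poisson draws and the drop-out mask. An inference method for such a model would work in the other direction, estimating `U` and `V` from `X` alone.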