Seminar Session

Séminaire des Doctorant·e·s

Wednesday 29 January 2020 at 15:00 - Room 109
Hamid Jalalzai

Classification in Extreme Regions, application to labeled data augmentation.

In a wide variety of applications involving anomaly detection (e.g. buzz in social network data, fraud, system failures), extreme observations play a key role because anomalies often correspond to large observations. The key issue is then to distinguish between large observations from the normal class and large observations from the anomaly class. This task can thus be formulated as a binary classification problem in extreme regions. However, because they are rare, extreme observations generally contribute negligibly to the (empirical) error. As a consequence, empirical risk minimizers generally perform very poorly in extreme regions. This paper develops a general framework for classification of extreme values. More precisely, under non-parametric heavy-tail assumptions, we propose a natural, asymptotic notion of risk that accounts for predictive performance in extreme regions. We prove that minimizers of an empirical version of this dedicated risk lead to classification rules with good generalization capacity, by means of maximal deviation inequalities in low-probability regions. Numerical experiments illustrate the relevance of the approach.
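The claim that rare extreme observations barely affect the empirical error can be illustrated with a small simulation. The sketch below is not the risk defined in the paper, which the abstract does not spell out. It assumes a simple finite-sample proxy: the error rate on the k observations of largest Euclidean norm. The names `overall_and_tail_error` and `predict_bulk_only`, the Pareto features, and the labeling rule are all hypothetical choices made for illustration.

```python
import numpy as np

def overall_and_tail_error(X, y, predict, k):
    """Return (error rate on all points, error rate on the k points of largest norm)."""
    norms = np.linalg.norm(X, axis=1)
    tail_idx = np.argsort(norms)[-k:]  # indices of the k largest observations
    errors = predict(X) != y
    return float(errors.mean()), float(errors[tail_idx].mean())

# Assumed toy data: heavy-tailed (Pareto) features, label given by the larger coordinate.
rng = np.random.default_rng(0)
n, k = 10_000, 100
X = rng.pareto(2.0, size=(n, 2)) + 1.0
y = (X[:, 0] > X[:, 1]).astype(int)

# Norm threshold separating the k most extreme points from the bulk.
threshold = np.quantile(np.linalg.norm(X, axis=1), 1 - k / n)

def predict_bulk_only(X):
    """Hypothetical rule: exact in the bulk, constant prediction 0 in the extreme region."""
    correct = (X[:, 0] > X[:, 1]).astype(int)
    in_tail = np.linalg.norm(X, axis=1) > threshold
    return np.where(in_tail, 0, correct)

overall, tail = overall_and_tail_error(X, y, predict_bulk_only, k)
print(f"overall error: {overall:.4f}, tail error: {tail:.4f}")
```

Under these assumptions the rule looks nearly perfect on the full sample, since at most k of the n points can be misclassified, while its error on the extreme region is far higher. This is the gap that a risk dedicated to extreme regions is meant to expose.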