• ### Mass Volume Curves and Anomaly Ranking(1705.01305)

Sept. 3, 2018 stat.ML
This paper aims at formulating the issue of ranking multivariate unlabeled observations depending on their degree of abnormality as an unsupervised statistical learning task. In the 1-d situation, this problem is usually tackled by means of tail estimation techniques: univariate observations are viewed as all the more abnormal' as they are located far in the tail(s) of the underlying probability distribution. It would be desirable as well to dispose of a scalar valued scoring' function allowing for comparing the degree of abnormality of multivariate observations. Here we formulate the issue of scoring anomalies as a M-estimation problem by means of a novel functional performance criterion, referred to as the Mass Volume curve (MV curve in short), whose optimal elements are strictly increasing transforms of the density almost everywhere on the support of the density. We first study the statistical estimation of the MV curve of a given scoring function and we provide a strategy to build confidence regions using a smoothed bootstrap approach. Optimization of this functional criterion over the set of piecewise constant scoring functions is next tackled. This boils down to estimating a sequence of empirical minimum volume sets whose levels are chosen adaptively from the data, so as to adjust to the variations of the optimal MV curve, while controling the bias of its approximation by a stepwise curve. Generalization bounds are then established for the difference in sup norm between the MV curve of the empirical scoring function thus obtained and the optimal MV curve.
• ### Profitable Bandits(1805.02908)

May 8, 2018 cs.LG, stat.ML
Originally motivated by default risk management applications, this paper investigates a novel problem, referred to as the profitable bandit problem here. At each step, an agent chooses a subset of the K possible actions. For each action chosen, she then receives the sum of a random number of rewards. Her objective is to maximize her cumulated earnings. We adapt and study three well-known strategies in this purpose, that were proved to be most efficient in other settings: kl-UCB, Bayes-UCB and Thompson Sampling. For each of them, we prove a finite time regret bound which, together with a lower bound we obtain as well, establishes asymptotic optimality. Our goal is also to compare these three strategies from a theoretical and empirical perspective both at the same time. We give simple, self-contained proofs that emphasize their similarities, as well as their differences. While both Bayesian strategies are automatically adapted to the geometry of information, the numerical experiments carried out show a slight advantage for Thompson Sampling in practice.
• ### Ranking Data with Continuous Labels through Oriented Recursive Partitions(1801.05772)

Jan. 17, 2018 stat.ML
We formulate a supervised learning problem, referred to as continuous ranking, where a continuous real-valued label Y is assigned to an observable r.v. X taking its values in a feature space $\mathcal{X}$ and the goal is to order all possible observations x in $\mathcal{X}$ by means of a scoring function $s:\mathcal{X}\rightarrow \mathbb{R}$ so that s(X) and Y tend to increase or decrease together with highest probability. This problem generalizes bi/multi-partite ranking to a certain extent and the task of finding optimal scoring functions s(x) can be naturally cast as optimization of a dedicated functional criterion, called the IROC curve here, or as maximization of the Kendall ${\tau}$ related to the pair (s(X), Y ). From the theoretical side, we describe the optimal elements of this problem and provide statistical guarantees for empirical Kendall ${\tau}$ maximization under appropriate conditions for the class of scoring function candidates. We also propose a recursive statistical learning algorithm tailored to empirical IROC curve optimization and producing a piecewise constant scoring function that is fully described by an oriented binary tree. Preliminary numerical experiments highlight the difference in nature between regression and continuous ranking and provide strong empirical evidence of the performance of empirical optimizers of the criteria proposed.
• ### Max K-armed bandit: On the ExtremeHunter algorithm and beyond(1707.08820)

July 27, 2017 cs.LG, stat.ML
This paper is devoted to the study of the max K-armed bandit problem, which consists in sequentially allocating resources in order to detect extreme values. Our contribution is twofold. We first significantly refine the analysis of the ExtremeHunter algorithm carried out in Carpentier and Valko (2014), and next propose an alternative approach, showing that, remarkably, Extreme Bandits can be reduced to a classical version of the bandit problem to a certain extent. Beyond the formal analysis, these two approaches are compared through numerical experiments.
• ### Sharp exponential inequalities in survey sampling: conditional Poisson sampling schemes(1610.03776)

Oct. 12, 2016 math.ST, stat.TH
This paper is devoted to establishing exponential bounds for the probabilities of deviation of a sample sum from its expectation, when the variables involved in the summation are obtained by sampling in a finite population according to a rejective scheme, generalizing sampling without replacement, and by using an appropriate normalization. In contrast to Poisson sampling, classical deviation inequalities in the i.i.d. setting do not straightforwardly apply to sample sums related to rejective schemes, due to the inherent dependence structure of the sampled points. We show here how to overcome this difficulty, by combining the formulation of rejective sampling as Poisson sampling conditioned upon the sample size with the Escher transformation. In particular, the Bennett/Bernstein type bounds established highlight the effect of the asymptotic variance of the (properly standardized) sample weighted sum and are shown to be much more accurate than those based on the negative association property shared by the terms involved in the summation. Beyond its interest in itself, such a result for rejective sampling is crucial, insofar as it can be extended to many other sampling schemes, namely those that can be accurately approximated by rejective plans in the sense of the total variation distance.
• ### Gossip Dual Averaging for Decentralized Optimization of Pairwise Functions(1606.02421)

June 8, 2016 cs.AI, cs.DC, cs.LG, stat.ML, cs.SY
In decentralized networks (of sensors, connected objects, etc.), there is an important need for efficient algorithms to optimize a global cost function, for instance to learn a global model from the local data collected by each computing unit. In this paper, we address the problem of decentralized minimization of pairwise functions of the data points, where these points are distributed over the nodes of a graph defining the communication topology of the network. This general problem finds applications in ranking, distance metric learning and graph inference, among others. We propose new gossip algorithms based on dual averaging which aims at solving such problems both in synchronous and asynchronous settings. The proposed framework is flexible enough to deal with constrained and regularized variants of the optimization problem. Our theoretical analysis reveals that the proposed algorithms preserve the convergence rate of centralized dual averaging up to an additive bias term. We present numerical simulations on Area Under the ROC Curve (AUC) maximization and metric learning problems which illustrate the practical interest of our approach.
• ### Scaling-up Empirical Risk Minimization: Optimization of Incomplete U-statistics(1501.02629)

April 19, 2016 cs.AI, cs.LG, stat.ML
In a wide range of statistical learning problems such as ranking, clustering or metric learning among others, the risk is accurately estimated by $U$-statistics of degree $d\geq 1$, i.e. functionals of the training data with low variance that take the form of averages over $k$-tuples. From a computational perspective, the calculation of such statistics is highly expensive even for a moderate sample size $n$, as it requires averaging $O(n^d)$ terms. This makes learning procedures relying on the optimization of such data functionals hardly feasible in practice. It is the major goal of this paper to show that, strikingly, such empirical risks can be replaced by drastically computationally simpler Monte-Carlo estimates based on $O(n)$ terms only, usually referred to as incomplete $U$-statistics, without damaging the $O_{\mathbb{P}}(1/\sqrt{n})$ learning rate of Empirical Risk Minimization (ERM) procedures. For this purpose, we establish uniform deviation results describing the error made when approximating a $U$-process by its incomplete version under appropriate complexity assumptions. Extensions to model selection, fast rate situations and various sampling techniques are also considered, as well as an application to stochastic gradient descent for ERM. Finally, numerical examples are displayed in order to provide strong empirical evidence that the approach we promote largely surpasses more naive subsampling techniques.
• ### Sparse Representation of Multivariate Extremes with Applications to Anomaly Ranking(1603.09584)

March 31, 2016 stat.ML
Extremes play a special role in Anomaly Detection. Beyond inference and simulation purposes, probabilistic tools borrowed from Extreme Value Theory (EVT), such as the angular measure, can also be used to design novel statistical learning methods for Anomaly Detection/ranking. This paper proposes a new algorithm based on multivariate EVT to learn how to rank observations in a high dimensional space with respect to their degree of 'abnormality'. The procedure relies on an original dimension-reduction technique in the extreme domain that possibly produces a sparse representation of multivariate extremes and allows to gain insight into the dependence structure thereof, escaping the curse of dimensionality. The representation output by the unsupervised methodology we propose here can be combined with any Anomaly Detection technique tailored to non-extreme data. As it performs linearly with the dimension and almost linearly in the data (in O(dn log n)), it fits to large scale problems. The approach in this paper is novel in that EVT has never been used in its multivariate version in the field of Anomaly Detection. Illustrative experimental results provide strong empirical evidence of the relevance of our approach.
• ### Sparsity in Multivariate Extremes with Applications to Anomaly Detection(1507.05899)

March 14, 2016 stat.ML
Capturing the dependence structure of multivariate extreme events is a major concern in many fields involving the management of risks stemming from multiple sources, e.g. portfolio monitoring, insurance, environmental risk management and anomaly detection. One convenient (non-parametric) characterization of extremal dependence in the framework of multivariate Extreme Value Theory (EVT) is the angular measure, which provides direct information about the probable 'directions' of extremes, that is, the relative contribution of each feature/coordinate of the 'largest' observations. Modeling the angular measure in high dimensional problems is a major challenge for the multivariate analysis of rare events. The present paper proposes a novel methodology aiming at exhibiting a sparsity pattern within the dependence structure of extremes. This is done by estimating the amount of mass spread by the angular measure on representative sets of directions, corresponding to specific sub-cones of $R^d\_+$. This dimension reduction technique paves the way towards scaling up existing multivariate EVT methods. Beyond a non-asymptotic study providing a theoretical validity framework for our method, we propose as a direct application a --first-- anomaly detection algorithm based on multivariate EVT. This algorithm builds a sparse 'normal profile' of extreme behaviours, to be confronted with new (possibly abnormal) extreme observations. Illustrative experimental results provide strong empirical evidence of the relevance of our approach.
• ### Extending Gossip Algorithms to Distributed Estimation of U-Statistics(1511.05464)

Nov. 17, 2015 cs.DC, stat.CO, cs.LG, stat.ML, cs.SY
Efficient and robust algorithms for decentralized estimation in networks are essential to many distributed systems. Whereas distributed estimation of sample mean statistics has been the subject of a good deal of attention, computation of $U$-statistics, relying on more expensive averaging over pairs of observations, is a less investigated area. Yet, such data functionals are essential to describe global properties of a statistical population, with important examples including Area Under the Curve, empirical variance, Gini mean difference and within-cluster point scatter. This paper proposes new synchronous and asynchronous randomized gossip algorithms which simultaneously propagate data across the network and maintain local estimates of the $U$-statistic of interest. We establish convergence rate bounds of $O(1/t)$ and $O(\log t / t)$ for the synchronous and asynchronous cases respectively, where $t$ is the number of iterations, with explicit data and network dependent terms. Beyond favorable comparisons in terms of rate analysis, numerical experiments provide empirical evidence the proposed algorithms surpasses the previously introduced approach.
• ### AUC Optimisation and Collaborative Filtering(1508.06091)

Aug. 25, 2015 cs.LG, stat.ML
In recommendation systems, one is interested in the ranking of the predicted items as opposed to other losses such as the mean squared error. Although a variety of ways to evaluate rankings exist in the literature, here we focus on the Area Under the ROC Curve (AUC) as it widely used and has a strong theoretical underpinning. In practical recommendation, only items at the top of the ranked list are presented to the users. With this in mind, we propose a class of objective functions over matrix factorisations which primarily represent a smooth surrogate for the real AUC, and in a special case we show how to prioritise the top of the list. The objectives are differentiable and optimised through a carefully designed stochastic gradient-descent-based algorithm which scales linearly with the size of the data. In the special case of square loss we show how to improve computational complexity by leveraging previously computed measures. To understand theoretically the underlying matrix factorisation approaches we study both the consistency of the loss functions with respect to AUC, and generalisation using Rademacher theory. The resulting generalisation analysis gives strong motivation for the optimisation under study. Finally, we provide computation results as to the efficacy of the proposed method using synthetic and real data.
• ### Learning the dependence structure of rare events: a non-asymptotic study(1505.06298)

May 23, 2015 math.ST, stat.TH
Assessing the probability of occurrence of extreme events is a crucial issue in various fields like finance, insurance, telecommunication or environmental sciences. In a multivariate framework, the tail dependence is characterized by the so-called stable tail dependence function (STDF). Learning this structure is the keystone of multivariate extremes. Although extensive studies have proved consistency and asymptotic normality for the empirical version of the STDF, non-asymptotic bounds are still missing. The main purpose of this paper is to fill this gap. Taking advantage of adapted VC-type concentration inequalities, upper bounds are derived with expected rate of convergence in O(k^-1/2). The concentration tools involved in this analysis rely on a more general study of maximal deviations in low probability regions, and thus directly apply to the classification of extreme data.
• ### A statistical network analysis of the HIV/AIDS epidemics in Cuba(1401.6449)

May 22, 2015 cs.SI, stat.AP
The Cuban contact-tracing detection system set up in 1986 allowed the reconstruction and analysis of the sexual network underlying the epidemic (5,389 vertices and 4,073 edges, giant component of 2,386 nodes and 3,168 edges), shedding light onto the spread of HIV and the role of contact-tracing. Clustering based on modularity optimization provides a better visualization and understanding of the network, in combination with the study of covariates. The graph has a globally low but heterogeneous density, with clusters of high intraconnectivity but low interconnectivity. Though descriptive, our results pave the way for incorporating structure when studying stochastic SIR epidemics spreading on social networks.
• ### On Anomaly Ranking and Excess-Mass Curves(1502.01684)

Feb. 5, 2015 math.PR, stat.ML
Learning how to rank multivariate unlabeled observations depending on their degree of abnormality/novelty is a crucial problem in a wide range of applications. In practice, it generally consists in building a real valued "scoring" function on the feature space so as to quantify to which extent observations should be considered as abnormal. In the 1-d situation, measurements are generally considered as "abnormal" when they are remote from central measures such as the mean or the median. Anomaly detection then relies on tail analysis of the variable of interest. Extensions to the multivariate setting are far from straightforward and it is precisely the main purpose of this paper to introduce a novel and convenient (functional) criterion for measuring the performance of a scoring function regarding the anomaly ranking task, referred to as the Excess-Mass curve (EM curve). In addition, an adaptive algorithm for building a scoring function based on unlabeled data X1 , . . . , Xn with a nearly optimal EM is proposed and is analyzed from a statistical perspective.
• ### Survey schemes for stochastic gradient descent with applications to M-estimation(1501.02218)

Jan. 9, 2015 stat.ML
In certain situations that shall be undoubtedly more and more common in the Big Data era, the datasets available are so massive that computing statistics over the full sample is hardly feasible, if not unfeasible. A natural approach in this context consists in using survey schemes and substituting the "full data" statistics with their counterparts based on the resulting random samples, of manageable size. It is the main purpose of this paper to investigate the impact of survey sampling with unequal inclusion probabilities on stochastic gradient descent-based M-estimation methods in large-scale statistical and machine-learning problems. Precisely, we prove that, in presence of some a priori information, one may significantly increase asymptotic accuracy when choosing appropriate first order inclusion probabilities, without affecting complexity. These striking results are described here by limit theorems and are also illustrated by numerical experiments.
• ### On Tail Index Estimation based on Multivariate Data(1402.0357)

April 9, 2014 math.ST, stat.TH
This article is devoted to the study of tail index estimation based on i.i.d. multivariate observations, drawn from a standard heavy-tailed distribution, i.e. of which 1-d Pareto-like marginals share the same tail index. A multivariate Central Limit Theorem for a random vector, whose components correspond to (possibly dependent) Hill estimators of the common shape index alpha, is established under mild conditions. Motivated by the statistical analysis of extremal spatial data in particular, we introduce the concept of (standard) heavy-tailed random field of tail index alpha and show how this limit result can be used in order to build an estimator of alpha with small asymptotic mean squared error, through a proper convex linear combination of the coordinates. Beyond asymptotic results, simulation experiments illustrating the relevance of the approach promoted are also presented.
• ### Multiresolution Analysis of Incomplete Rankings(1403.1994)

March 8, 2014 math.ST, stat.TH
Incomplete rankings on a set of items $\{1,\; \ldots,\; n\}$ are orderings of the form $a_{1}\prec\dots\prec a_{k}$, with $\{a_{1},\dots a_{k}\}\subset\{1,\dots,n\}$ and $k < n$. Though they arise in many modern applications, only a few methods have been introduced to manipulate them, most of them consisting in representing any incomplete ranking by the set of all its possible linear extensions on $\{1,\; \ldots,\; n\}$. It is the major purpose of this paper to introduce a completely novel approach, which allows to treat incomplete rankings directly, representing them as injective words over $\{1,\; \ldots,\; n\}$. Unexpectedly, operations on incomplete rankings have very simple equivalents in this setting and the topological structure of the complex of injective words can be interpretated in a simple fashion from the perspective of ranking. We exploit this connection here and use recent results from algebraic topology to construct a multiresolution analysis and develop a wavelet framework for incomplete rankings. Though purely combinatorial, this construction relies on the same ideas underlying multiresolution analysis on a Euclidean space, and permits to localize the information related to rankings on each subset of items. It can be viewed as a crucial step toward nonlinear approximation of distributions of incomplete rankings and paves the way for many statistical applications, including preference data analysis and the design of recommender systems.
• ### Efficient Eigen-updating for Spectral Graph Clustering(1301.1318)

Jan. 27, 2014 stat.ML
Partitioning a graph into groups of vertices such that those within each group are more densely connected than vertices assigned to different groups, known as graph clustering, is often used to gain insight into the organisation of large scale networks and for visualisation purposes. Whereas a large number of dedicated techniques have been recently proposed for static graphs, the design of on-line graph clustering methods tailored for evolving networks is a challenging problem, and much less documented in the literature. Motivated by the broad variety of applications concerned, ranging from the study of biological networks to the analysis of networks of scientific references through the exploration of communications networks such as the World Wide Web, it is the main purpose of this paper to introduce a novel, computationally efficient, approach to graph clustering in the evolutionary context. Namely, the method promoted in this article can be viewed as an incremental eigenvalue solution for the spectral clustering method described by Ng. et al. (2001). The incremental eigenvalue solution is a general technique for finding the approximate eigenvectors of a symmetric matrix given a change. As well as outlining the approach in detail, we present a theoretical bound on the quality of the approximate eigenvectors using perturbation theory. We then derive a novel spectral clustering algorithm called Incremental Approximate Spectral Clustering (IASC). The IASC algorithm is simple to implement and its efficacy is demonstrated on both synthetic and real datasets modelling the evolution of a HIV epidemic, a citation network and the purchase history graph of an e-commerce website.
• ### Online Matrix Completion Through Nuclear Norm Regularisation(1401.2451)

Jan. 10, 2014 stat.ML
It is the main goal of this paper to propose a novel method to perform matrix completion on-line. Motivated by a wide variety of applications, ranging from the design of recommender systems to sensor network localization through seismic data reconstruction, we consider the matrix completion problem when entries of the matrix of interest are observed gradually. Precisely, we place ourselves in the situation where the predictive rule should be refined incrementally, rather than recomputed from scratch each time the sample of observed entries increases. The extension of existing matrix completion methods to the sequential prediction context is indeed a major issue in the Big Data era, and yet little addressed in the literature. The algorithm promoted in this article builds upon the Soft Impute approach introduced in Mazumder et al. (2010). The major novelty essentially arises from the use of a randomised technique for both computing and updating the Singular Value Decomposition (SVD) involved in the algorithm. Though of disarming simplicity, the method proposed turns out to be very efficient, while requiring reduced computations. Several numerical experiments based on real datasets illustrating its performance are displayed, together with preliminary results giving it a theoretical basis.
• ### Functional Bipartite Ranking: a Wavelet-Based Filtering Approach(1312.5066)

Dec. 18, 2013 stat.ML
It is the main goal of this article to address the bipartite ranking issue from the perspective of functional data analysis (FDA). Given a training set of independent realizations of a (possibly sampled) second-order random function with a (locally) smooth autocorrelation structure and to which a binary label is randomly assigned, the objective is to learn a scoring function s with optimal ROC curve. Based on linear/nonlinear wavelet-based approximations, it is shown how to select compact finite dimensional representations of the input curves adaptively, in order to build accurate ranking rules, using recent advances in the ranking problem for multivariate data with binary feedback. Beyond theoretical considerations, the performance of the learning methods for functional bipartite ranking proposed in this paper are illustrated by numerical experiments.
• ### On Computer-Intensive Simulation and Estimation Methods for Rare Event Analysis in Epidemic Models(1308.5830)

Aug. 27, 2013 math.ST, stat.TH, stat.AP
This article focuses, in the context of epidemic models, on rare events that may possibly correspond to crisis situations from the perspective of Public Health. In general, no close analytic form for their occurrence probabilities is available and crude Monte-Carlo procedures fail. We show how recent intensive computer simulation techniques, such as interacting branching particle methods, can be used for estimation purposes, as well as for generating model paths that correspond to realizations of such events. Applications of these simulation-based methods to several epidemic models are also considered and discussed thoroughly.
• ### An Empirical Comparison of V-fold Penalisation and Cross Validation for Model Selection in Distribution-Free Regression(1212.1780)

Dec. 8, 2012 stat.ML
Model selection is a crucial issue in machine-learning and a wide variety of penalisation methods (with possibly data dependent complexity penalties) have recently been introduced for this purpose. However their empirical performance is generally not well documented in the literature. It is the goal of this paper to investigate to which extent such recent techniques can be successfully used for the tuning of both the regularisation and kernel parameters in support vector regression (SVR) and the complexity measure in regression trees (CART). This task is traditionally solved via V-fold cross-validation (VFCV), which gives efficient results for a reasonable computational cost. A disadvantage however of VFCV is that the procedure is known to provide an asymptotically suboptimal risk estimate as the number of examples tends to infinity. Recently, a penalisation procedure called V-fold penalisation has been proposed to improve on VFCV, supported by theoretical arguments. Here we report on an extensive set of experiments comparing V-fold penalisation and VFCV for SVR/CART calibration on several benchmark datasets. We highlight cases in which VFCV and V-fold penalisation provide poor estimates of the risk respectively and introduce a modified penalisation technique to reduce the estimation error.
• ### Dissemination of Health Information within Social Networks(1211.4235)

Nov. 18, 2012 physics.soc-ph, cs.SI
In this paper, we investigate, how information about a common food born health hazard, known as Campylobacter, spreads once it was delivered to a random sample of individuals in France. The central question addressed here is how individual characteristics and the various aspects of social network influence the spread of information. A key claim of our paper is that information diffusion processes occur in a patterned network of social ties of heterogeneous actors. Our percolation models show that the characteristics of the recipients of the information matter as much if not more than the characteristics of the sender of the information in deciding whether the information will be transmitted through a particular tie. We also found that at least for this particular advisory, it is not the perceived need of the recipients for the information that matters but their general interest in the topic.
• ### Hierarchical clustering for graph visualization(1210.5693)

Oct. 21, 2012 physics.soc-ph, cs.SI, stat.AP
This paper describes a graph visualization methodology based on hierarchical maximal modularity clustering, with interactive and significant coarsening and refining possibilities. An application of this method to HIV epidemic analysis in Cuba is outlined.
• ### Visual Mining of Epidemic Networks(1210.5694)

Oct. 21, 2012 physics.soc-ph, cs.SI, stat.AP
We show how an interactive graph visualization method based on maximal modularity clustering can be used to explore a large epidemic network. The visual representation is used to display statistical tests results that expose the relations between the propagation of HIV in a sexual contact network and the sexual orientation of the patients.