• ### Meta-learners for Estimating Heterogeneous Treatment Effects using Machine Learning(1706.03461)

April 24, 2019 math.ST, stat.TH, stat.ME
There is growing interest in estimating and analyzing heterogeneous treatment effects in experimental and observational studies. We describe a number of meta-algorithms that can take advantage of any supervised learning or regression method in machine learning and statistics to estimate the Conditional Average Treatment Effect (CATE) function. Meta-algorithms build on base algorithms---such as Random Forests (RF), Bayesian Additive Regression Trees (BART) or neural networks---to estimate the CATE, a function that the base algorithms are not designed to estimate directly. We introduce a new meta-algorithm, the X-learner, that is provably efficient when the number of units in one treatment group is much larger than in the other, and can exploit structural properties of the CATE function. For example, if the CATE function is linear and the response functions in treatment and control are Lipschitz continuous, the X-learner can still achieve the parametric rate under regularity conditions. We then introduce versions of the X-learner that use RF and BART as base learners. In extensive simulation studies, the X-learner performs favorably, although none of the meta-learners is uniformly the best. In two persuasion field experiments from political science, we demonstrate how our new X-learner can be used to target treatment regimes and to shed light on underlying mechanisms. A software package is provided that implements our methods.
• ### Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs(1801.05453)

April 27, 2018 cs.CL, cs.LG, stat.ML
The driving force behind the recent success of LSTMs has been their ability to learn complex and non-linear relationships. Consequently, our inability to describe these relationships has led to LSTMs being characterized as black boxes. To this end, we introduce contextual decomposition (CD), an interpretation algorithm for analysing individual predictions made by standard LSTMs, without any changes to the underlying model. By decomposing the output of a LSTM, CD captures the contributions of combinations of words or variables to the final prediction of an LSTM. On the task of sentiment analysis with the Yelp and SST data sets, we show that CD is able to reliably identify words and phrases of contrasting sentiment, and how they are combined to yield the LSTM's final prediction. Using the phrase-level labels in SST, we also demonstrate that CD is able to successfully extract positive and negative negations from an LSTM, something which has not previously been done.
• ### Stability and Convergence Trade-off of Iterative Optimization Algorithms(1804.01619)

April 4, 2018 math.ST, stat.TH, cs.LG, stat.ML
The overall performance or expected excess risk of an iterative machine learning algorithm can be decomposed into training error and generalization error. While the former is controlled by its convergence analysis, the latter can be tightly handled by algorithmic stability. The machine learning community has a rich history investigating convergence and stability separately. However, the question about the trade-off between these two quantities remains open. In this paper, we show that for any iterative algorithm at any iteration, the overall performance is lower bounded by the minimax statistical error over an appropriately chosen loss function class. This implies an important trade-off between convergence and stability of the algorithm -- a faster converging algorithm has to be less stable, and vice versa. As a direct consequence of this fundamental tradeoff, new convergence lower bounds can be derived for classes of algorithms constrained with different stability bounds. In particular, when the loss function is convex (or strongly convex) and smooth, we discuss the stability upper bounds of gradient descent (GD) and stochastic gradient descent and their variants with decreasing step sizes. For Nesterov's accelerated gradient descent (NAG) and heavy ball method (HB), we provide stability upper bounds for the quadratic loss function. Applying existing stability upper bounds for the gradient methods in our trade-off framework, we obtain lower bounds matching the well-established convergence upper bounds up to constants for these algorithms and conjecture similar lower bounds for NAG and HB. Finally, we numerically demonstrate the tightness of our stability bounds in terms of exponents in the rate and also illustrate via a simulated logistic regression problem that our stability bounds reflect the generalization errors better than the simple uniform convergence bounds for GD and NAG.
• ### Improved OpenCL-based Implementation of Social Field Pedestrian Model(1803.04782)

March 16, 2018 cond-mat.other, cs.DC
Two aspects of improvements are proposed for the OpenCL-based implementation of the social field pedestrian model. In the aspect of algorithm, a method based on the idea of divide-and-conquer is devised in order to overcome the problem of global memory depletion when fields are of a larger size. This is of importance for the study of finer pedestrian walking behavior, which usually requires larger fields. In the aspect of computation, the OpenCL heterogeneous framework is thoroughly studied. Factors that may affect the numerical efficiency are evaluated, with regarding to the social field model previously proposed. This includes usage of local memory, deliberate patch of data structures for avoidance of bank conflicts, and so on. Numerical experiments disclose that the numerical efficiency is brought to an even higher level. Compared to the CPU model and the previous GPU model, the current GPU model can be at most 71.56 and 13.3 times faster respectively so that it is more qualified to be a core engine for analysis of super-large scale crowd.
• ### Towards the Standardization of Non-orthogonal Multiple Access for Next Generation Wireless Networks(1802.03880)

Feb. 12, 2018 cs.IT, math.IT
Non-orthogonal multiple access (NoMA) as an efficient way of radio resource sharing can root back to the network information theory. For generations of wireless communication systems design, orthogonal multiple access (OMA) schemes in time, frequency, or code domain have been the main choices due to the limited processing capability in the transceiver hardware, as well as the modest traffic demands in both latency and connectivity. However, for the next generation radio systems, given its vision to connect everything and the much evolved hardware capability, NoMA has been identified as a promising technology to help achieve all the targets in system capacity, user connectivity, and service latency. This article will provide a systematic overview of the state-of-the-art design of the NoMA transmission based on a unified transceiver design framework, the related standardization progress, and some promising use cases in future cellular networks, based on which the interested researchers can get a quick start in this area.
• ### Solving Linear Programs with Complementarity Constraints using Branch-and-Cut(1802.02941)

Feb. 8, 2018 math.OC
A linear program with linear complementarity constraints (LPCC) requires the minimization of a linear objective over a set of linear constraints together with additional linear complementarity constraints. This class has emerged as a modeling paradigm for a broad collection of problems, including bilevel programs, Stackelberg games, inverse quadratic programs, and problems involving equilibrium constraints. The presence of the complementarity constraints results in a nonconvex optimization problem. We develop a branch-and-cut algorithm to find a global optimum for this class of optimization problems, where we branch directly on complementarities. We develop branching rules and feasibility recovery procedures and demonstrate their computational effectiveness in a comparison with CPLEX. The implementation builds on CPLEX through the use of callback routines. The computational results show that our approach is a strong alternative to constructing an integer programming formulation using big-$M$ terms to represent bounds for variables, with testing conducted on general LPCCs as well as on instances generated from bilevel programs with convex quadratic lower level problems.
• ### Log-concave sampling: Metropolis-Hastings algorithms are fast!(1801.02309)

Jan. 8, 2018 stat.CO, stat.ML
We consider the problem of sampling from a strongly log-concave density in $\mathbb{R}^d$, and prove a non-asymptotic upper bound on the mixing time of the Metropolis-adjusted Langevin algorithm (MALA). The method draws samples by running a Markov chain obtained from the discretization of an appropriate Langevin diffusion, combined with an accept-reject step to ensure the correct stationary distribution. Relative to known guarantees for the unadjusted Langevin algorithm (ULA), our bounds show that the use of an accept-reject step in MALA leads to an exponentially improved dependence on the error-tolerance. Concretely, in order to obtain samples with TV error at most $\delta$ for a density with condition number $\kappa$, we show that MALA requires $\mathcal{O} \big(\kappa d \log(1/\delta) \big)$ steps, as compared to the $\mathcal{O} \big(\kappa^2 d/\delta^2 \big)$ steps established in past work on ULA. We also demonstrate the gains of MALA over ULA for weakly log-concave densities. Furthermore, we derive mixing time bounds for a zeroth-order method Metropolized random walk (MRW) and show that it mixes $\mathcal{O}(\kappa d)$ slower than MALA. We provide numerical examples that support our theoretical findings, and demonstrate the potential gains of Metropolis-Hastings adjustment for Langevin-type algorithms.
• ### Iterative Random Forests to detect predictive and stable high-order interactions(1706.08457)

Dec. 23, 2017 q-bio.GN, stat.ML
Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on Random Forests (RF), Random Intersection Trees (RITs), and through extensive, biologically inspired simulations, we developed the iterative Random Forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with same order of computational cost as RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Moreover, novel third-order interactions, e.g. between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated splicing regulation, and identified novel 5th and 6th order interactions, indicative of multi-valent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens new avenues of inquiry into the molecular mechanisms underlying genome biology.
• ### Efficient Aerosol Retrieval for Multi-angle Imaging SpectroRadiometer (MISR): A Bayesian Approach(1708.01948)

Aug. 6, 2017 stat.AP
Recent research in Aerosol Optical Depth (AOD) retrieval algorithms for Multi-angle Imaging SpectroRadiometer (MISR) proposed a hierarchical Bayesian model. However the inference algorithm used in their work was Markov Chain Monte Carlo (MCMC), which was reported prohibitively slow. The poor speed of MCMC dramatically limited the production feasibility of the Bayesian framework if large scale (e.g. global scale) of aerosol retrieval is desired. In this paper, we present an alternative optimization method to mitigate the speed problem. In particular we adopt Maximize a Posteriori (MAP) approach, and apply a gradient-free "hill-climbing" algorithm: the coordinate-wise stochastic-search. Our method has shown to be much (about 100 times) faster than MCMC, easier to converge, and insensitive to hyper parameters. To further scale our approach, we parallelized our method using Apache Spark, which achieves linear speed-up w.r.t number of CPU cores up to 16. Due to these efforts, we are able to retrieve AOD at much finer resolution (1.1km) with a tiny fraction of time consumption compared with existing methods. During our research, we find that in low AOD levels, the Bayesian network tends to produce overestimated retrievals. We also find that high absorbing aerosol types are retrieved at the same time. This is likely caused by the Dirichlet prior for aerosol types, as it is shown to encourage selecting absorbing types in practice. After changing Dirichlet to uniform, the AOD retrievals show excellent agreement with ground measurement in all levels.
• ### Structural Compression of Convolutional Neural Networks(1705.07356)

March 25, 2020 cs.CV
Deep convolutional neural networks (CNNs) have been successful in many tasks in machine vision, however, millions of weights in the form of thousands of convolutional filters in CNNs makes them difficult for human intepretation or understanding in science. In this article, we introduce CAR, a greedy structural compression scheme to obtain smaller and more interpretable CNNs, while achieving close to original accuracy. The compression is based on pruning filters with the least contribution to the classification accuracy. We demonstrate the interpretability of CAR-compressed CNNs by showing that our algorithm prunes filters with visually redundant functionalities such as color filters. These compressed networks are easier to interpret because they retain the filter diversity of uncompressed networks with order of magnitude less filters. Finally, a variant of CAR is introduced to quantify the importance of each image category to each CNN filter. Specifically, the most and the least important class labels are shown to be meaningful interpretations of each filter.
• ### Affine Hirsch foliations on 3-manifolds(1604.03723)

Feb. 24, 2017 math.GT, math.DS
This paper is devoted to discussing affine Hirsch foliations on $3$-manifolds. First, we prove that up to isotopic leaf-conjugacy, every closed orientable $3$-manifold $M$ admits $0$, $1$ or $2$ affine Hirsch foliations. Furthermore, every case is possible. Then, we analyze the $3$-manifolds admitting two affine Hirsch foliations (abbreviated as Hirsch manifolds). On the one hand, we construct Hirsch manifolds by using exchangeable braided links (abbreviated as DEBL Hirsch manifolds); on the other hand, we show that every Hirsch manifold virtually is a DEBL Hirsch manifold. Finally, we show that for every $n\in \mathbb{N}$, there are only finitely many Hirsch manifolds with strand number $n$. Here the strand number of a Hirsch manifold $M$ is a positive integer defined by using strand numbers of braids.
• ### Superheat: An R package for creating beautiful and extendable heatmaps for visualizing complex data(1512.01524)

Jan. 26, 2017 stat.ME, stat.AP
The technological advancements of the modern era have enabled the collection of huge amounts of data in science and beyond. Extracting useful information from such massive datasets is an ongoing challenge as traditional data visualization tools typically do not scale well in high-dimensional settings. An existing visualization technique that is particularly well suited to visualizing large datasets is the heatmap. Although heatmaps are extremely popular in fields such as bioinformatics for visualizing large gene expression datasets, they remain a severely underutilized visualization tool in modern data analysis. In this paper we introduce superheat, a new R package that provides an extremely flexible and customizable platform for visualizing large datasets using extendable heatmaps. Superheat enhances the traditional heatmap by providing a platform to visualize a wide range of data types simultaneously, adding to the heatmap a response variable as a scatterplot, model results as boxplots, correlation information as barplots, text information, and more. Superheat allows the user to explore their data to greater depths and to take advantage of the heterogeneity present in the data to inform analysis decisions. The goal of this paper is two-fold: (1) to demonstrate the potential of the heatmap as a default visualization method for a wide range of data types using reproducible examples, and (2) to highlight the customizability and ease of implementation of the superheat package in R for creating beautiful and extendable heatmaps. The capabilities and fundamental applicability of the superheat package will be explored via three case studies, each based on publicly available data sources and accompanied by a file outlining the step-by-step analytic pipeline (with code).
• ### Do retinal ganglion cells project natural scenes to their principal subspace and whiten them?(1612.03483)

Dec. 11, 2016 q-bio.NC
Several theories of early sensory processing suggest that it whitens sensory stimuli. Here, we test three key predictions of the whitening theory using recordings from 152 ganglion cells in salamander retina responding to natural movies. We confirm the previous finding that firing rates of ganglion cells are less correlated compared to natural scenes, although significant correlations remain. We show that while the power spectrum of ganglion cells decays less steeply than that of natural scenes, it is not completely flattened. Finally, we find evidence that only the top principal components of the visual stimulus are transmitted.
• ### Formulas for Counting the Sizes of Markov Equivalence Classes of Directed Acyclic Graphs(1610.07921)

Oct. 23, 2016 cs.DM, stat.ML
The sizes of Markov equivalence classes of directed acyclic graphs play important roles in measuring the uncertainty and complexity in causal learning. A Markov equivalence class can be represented by an essential graph and its undirected subgraphs determine the size of the class. In this paper, we develop a method to derive the formulas for counting the sizes of Markov equivalence classes. We first introduce a new concept of core graph. The size of a Markov equivalence class of interest is a polynomial of the number of vertices given its core graph. Then, we discuss the recursive and explicit formula of the polynomial, and provide an algorithm to derive the size formula via symbolic computation for any given core graph. The proposed size formula derivation sheds light on the relationships between the size of a Markov equivalence class and its representation graph, and makes size counting efficient, even when the essential graphs contain non-sparse undirected subgraphs.
• ### Local identifiability of $l_1$-minimization dictionary learning: a sufficient and almost necessary condition(1505.04363)

July 12, 2016 stat.ML
We study the theoretical properties of learning a dictionary from $N$ signals $\mathbf x_i\in \mathbb R^K$ for $i=1,...,N$ via $l_1$-minimization. We assume that $\mathbf x_i$'s are $i.i.d.$ random linear combinations of the $K$ columns from a complete (i.e., square and invertible) reference dictionary $\mathbf D_0 \in \mathbb R^{K\times K}$. Here, the random linear coefficients are generated from either the $s$-sparse Gaussian model or the Bernoulli-Gaussian model. First, for the population case, we establish a sufficient and almost necessary condition for the reference dictionary $\mathbf D_0$ to be locally identifiable, i.e., a local minimum of the expected $l_1$-norm objective function. Our condition covers both sparse and dense cases of the random linear coefficients and significantly improves the sufficient condition by Gribonval and Schnass (2010). In addition, we show that for a complete $\mu$-coherent reference dictionary, i.e., a dictionary with absolute pairwise column inner-product at most $\mu\in[0,1)$, local identifiability holds even when the random linear coefficient vector has up to $O(\mu^{-2})$ nonzeros on average. Moreover, our local identifiability results also translate to the finite sample case with high probability provided that the number of signals $N$ scales as $O(K\log K)$.
• ### Lasso adjustments of treatment effect estimates in randomized experiments(1507.03652)

Dec. 21, 2015 math.ST, stat.TH
We provide a principled way for investigators to analyze randomized experiments when the number of covariates is large. Investigators often use linear multivariate regression to analyze randomized experiments instead of simply reporting the difference of means between treatment and control groups. Their aim is to reduce the variance of the estimated treatment effect by adjusting for covariates. If there are a large number of covariates relative to the number of observations, regression may perform poorly because of overfitting. In such cases, the Lasso may be helpful. We study the resulting Lasso-based treatment effect estimator under the Neyman-Rubin model of randomized experiments. We present theoretical conditions that guarantee that the estimator is more efficient than the simple difference-of-means estimator, and we provide a conservative estimator of the asymptotic variance, which can yield tighter confidence intervals than the difference-of-means estimator. Simulation and data examples show that Lasso-based adjustment can be advantageous even when the number of covariates is less than the number of observations. Specifically, a variant using Lasso for selection and OLS for estimation performs particularly well, and it chooses a smoothing parameter based on combined performance of Lasso and OLS.
• ### Optimal Subsampling Approaches for Large Sample Linear Regression(1509.05111)

Nov. 23, 2015 stat.ME, stat.ML
A significant hurdle for analyzing large sample data is the lack of effective statistical computing and inference methods. An emerging powerful approach for analyzing large sample data is subsampling, by which one takes a random subsample from the original full sample and uses it as a surrogate for subsequent computation and estimation. In this paper, we study subsampling methods under two scenarios: approximating the full sample ordinary least-square (OLS) estimator and estimating the coefficients in linear regression. We present two algorithms, weighted estimation algorithm and unweighted estimation algorithm, and analyze asymptotic behaviors of their resulting subsample estimators under general conditions. For the weighted estimation algorithm, we propose a criterion for selecting the optimal sampling probability by making use of the asymptotic results. On the basis of the criterion, we provide two novel subsampling methods, the optimal subsampling and the predictor- length subsampling methods. The predictor-length subsampling method is based on the L2 norm of predictors rather than leverage scores. Its computational cost is scalable. For unweighted estimation algorithm, we show that its resulting subsample estimator is not consistent to the full sample OLS estimator. However, it has better performance than the weighted estimation algorithm for estimating the coefficients. Simulation studies and a real data example are used to demonstrate the effectiveness of our proposed subsampling methods.
• ### Estimation Stability with Cross Validation (ESCV)(1303.3128)

Oct. 27, 2015 stat.ME, stat.ML
Cross-validation (CV) is often used to select the regularization parameter in high dimensional problems. However, when applied to the sparse modeling method Lasso, CV leads to models that are unstable in high-dimensions, and consequently not suited for reliable interpretation. In this paper, we propose a model-free criterion ESCV based on a new estimation stability (ES) metric and CV. Our proposed ESCV finds a locally ES-optimal model smaller than the CV choice so that the it fits the data and also enjoys estimation stability property. We demonstrate that ESCV is an effective alternative to CV at a similar easily parallelizable computational cost. In particular, we compare the two approaches with respect to several performance measures when applied to the Lasso on both simulated and real data sets. For dependent predictors common in practice, our main finding is that, ESCV cuts down false positive rates often by a large margin, while sacrificing little of true positive rates. ESCV usually outperforms CV in terms of parameter estimation while giving similar performance as CV in terms of prediction. For the two real data sets from neuroscience and cell biology, the models found by ESCV are less than half of the model sizes by CV. Judged based on subject knowledge, they are more plausible than those by CV as well. We also discuss some regularization parameter alignment issues that come up in both approaches.
• ### A Novel Scattered Pilot Design for FBMC/OQAM Systems(1509.08577)

Sept. 29, 2015 cs.IT, math.IT
Filter bank multi-carrier with offset quadrature amplitude modulation (FBMC/OQAM) has been heavily studied as an alternative waveform for 5G systems. Its advantages of higher spectrum efficiency, localized frequency response and insensitivity to synchronization errors may enable promising performance when orthogonal frequency division multiplexing (OFDM) fails. However, performing channel estimation under the intrinsic interference has been a fundamental obstacle towards adopting FBMC/OQMA in a practical system. Several schemes are available but the performance is far from satisfaction. In this paper, we will show the existing methods are trapped by the paradigm that a clean pilot is mandatory so as to explicitly carry a reference symbol to the receiver for the purpose of channel estimation. By breaking this paradigm, a novel dual dependent pilot scheme is proposed, which gives up the independent pilot and derives dual pilots from the imposed interference. By doing this, the interference between pilots can be fully utilized. Consequentially, the new scheme significantly outperforms existing solutions and the simulation results show FBMC/OQAM can achieve close-to-OFDM performance in a practical system even with the presence of strong intrinsic interference.
• ### A spectral-like decomposition for transitive Anosov flows in dimension three(1505.06259)

May 23, 2015 math.DS
Given a (transitive or non-transitive) Anosov vector field $X$ on a closed three-dimensional manifold $M$, one may try to decompose $(M,X)$ by cutting $M$ along two-tori transverse to $X$. We prove that one can find a finite collection $\{T_1,\dots,T_n\}$ of pairwise disjoint, pairwise non-parallel incompressible tori transverse to $X$, such that the maximal invariant sets $\Lambda_1,\dots,\Lambda_m$ of the connected components $V_1,\dots,V_m$ of $M-(T_1\cup\dots\cup T_n)$ satisfy the following properties: 1, each $\Lambda_i$ is a compact invariant locally maximal transitive set for $X$, 2, the collection $\{\Lambda_1,\dots,\Lambda_m\}$ is canonically attached to the pair $(M,X)$ (i.e., it can be defined independently of the collection of tori $\{T_1,\dots,T_n\}$), 3, the $\Lambda_i$'s are the smallest possible: for every (possibly infinite) collection $\{S_i\}_{i\in I}$ of tori transverse to $X$, the $\Lambda_i$'s are contained in the maximal invariant set of $M-\cup_i S_i$. To a certain extent, the sets $\Lambda_1,\dots,\Lambda_m$ are analogs (for Anosov vector field in dimension 3) of the basic pieces which appear in the spectral decomposition of a non-transitive axiom A vector field. Then we discuss the uniqueness of such a decomposition: we prove that the pieces of the decomposition $V_1,\dots,V_m$, equipped with the restriction of the Anosov vector field $X$, are "almost unique up to topological equivalence".
• ### The geometry of kernelized spectral clustering(1404.7552)

April 7, 2015 math.ST, stat.TH, stat.ML
Clustering of data sets is a standard problem in many areas of science and engineering. The method of spectral clustering is based on embedding the data set using a kernel function, and using the top eigenvectors of the normalized Laplacian to recover the connected components. We study the performance of spectral clustering in recovering the latent labels of i.i.d. samples from a finite mixture of nonparametric distributions. The difficulty of this label recovery problem depends on the overlap between mixture components and how easily a mixture component is divided into two nonoverlapping components. When the overlap is small compared to the indivisibility of the mixture components, the principal eigenspace of the population-level normalized Laplacian operator is approximately spanned by the square-root kernelized component densities. In the finite sample setting, and under the same assumption, embedded samples from different components are approximately orthogonal with high probability when the sample size is large. As a corollary we control the fraction of samples mislabeled by spectral clustering under finite mixtures with nonparametric components.
• ### Co-clustering for directed graphs: the Stochastic co-Blockmodel and spectral algorithm Di-Sim(1204.2296)

Jan. 8, 2015 math.ST, stat.TH, stat.ML
Directed graphs have asymmetric connections, yet the current graph clustering methodologies cannot identify the potentially global structure of these asymmetries. We give a spectral algorithm called di-sim that builds on a dual measure of similarity that correspond to how a node (i) sends and (ii) receives edges. Using di-sim, we analyze the global asymmetries in the networks of Enron emails, political blogs, and the c elegans neural connectome. In each example, a small subset of nodes have persistent asymmetries; these nodes send edges with one cluster, but receive edges with another cluster. Previous approaches would have assigned these asymmetric nodes to only one cluster, failing to identify their sending/receiving asymmetries. Regularization and "projection" are two steps of di-sim that are essential for spectral clustering algorithms to work in practice. The theoretical results show that these steps make the algorithm weakly consistent under the degree corrected Stochastic co-Blockmodel, a model that generalizes the Stochastic Blockmodel to allow for both (i) degree heterogeneity and (ii) the global asymmetries that we intend to detect. The theoretical results make no assumptions on the smallest degree nodes. Instead, the theorem requires that the average degree grows sufficiently fast and that the weak consistency only applies to the subset of the nodes with sufficiently large leverage scores. The results results also apply to bipartite graphs.
• ### Error Rate Bounds and Iterative Weighted Majority Voting for Crowdsourcing(1411.4086)

Nov. 15, 2014 math.PR, math.ST, stat.TH, cs.HC, cs.LG, stat.ML
Crowdsourcing has become an effective and popular tool for human-powered computation to label large datasets. Since the workers can be unreliable, it is common in crowdsourcing to assign multiple workers to one task, and to aggregate the labels in order to obtain results of high quality. In this paper, we provide finite-sample exponential bounds on the error rate (in probability and in expectation) of general aggregation rules under the Dawid-Skene crowdsourcing model. The bounds are derived for multi-class labeling, and can be used to analyze many aggregation methods, including majority voting, weighted majority voting and the oracle Maximum A Posteriori (MAP) rule. We show that the oracle MAP rule approximately optimizes our upper bound on the mean error rate of weighted majority voting in certain setting. We propose an iterative weighted majority voting (IWMV) method that optimizes the error rate bound and approximates the oracle MAP rule. Its one step version has a provable theoretical guarantee on the error rate. The IWMV method is intuitive and computationally simple. Experimental results on simulated and real data show that IWMV performs at least on par with the state-of-the-art methods, and it has a much lower computational cost (around one hundred times faster) than the state-of-the-art methods.
• ### Statistical guarantees for the EM algorithm: From population to sample-based analysis(1408.2156)

Aug. 9, 2014 math.ST, stat.TH, cs.LG, stat.ML
We develop a general framework for proving rigorous guarantees on the performance of the EM algorithm and a variant known as gradient EM. Our analysis is divided into two parts: a treatment of these algorithms at the population level (in the limit of infinite data), followed by results that apply to updates based on a finite set of samples. First, we characterize the domain of attraction of any global maximizer of the population likelihood. This characterization is based on a novel view of the EM updates as a perturbed form of likelihood ascent, or in parallel, of the gradient EM updates as a perturbed form of standard gradient ascent. Leveraging this characterization, we then provide non-asymptotic guarantees on the EM and gradient EM algorithms when applied to a finite set of samples. We develop consequences of our general theory for three canonical examples of incomplete-data problems: mixture of Gaussians, mixture of regressions, and linear regression with covariates missing completely at random. In each case, our theory guarantees that with a suitable initialization, a relatively small number of EM (or gradient EM) steps will yield (with high probability) an estimate that is within statistical error of the MLE. We provide simulations to confirm this theoretically predicted behavior.
• ### Impact of regularization on Spectral Clustering(1312.1733)

July 21, 2014 stat.ML
The performance of spectral clustering can be considerably improved via regularization, as demonstrated empirically in Amini et. al (2012). Here, we provide an attempt at quantifying this improvement through theoretical analysis. Under the stochastic block model (SBM), and its extensions, previous results on spectral clustering relied on the minimum degree of the graph being sufficiently large for its good performance. By examining the scenario where the regularization parameter $\tau$ is large we show that the minimum degree assumption can potentially be removed. As a special case, for an SBM with two blocks, the results require the maximum degree to be large (grow faster than $\log n$) as opposed to the minimum degree. More importantly, we show the usefulness of regularization in situations where not all nodes belong to well-defined clusters. Our results rely on a `bias-variance'-like trade-off that arises from understanding the concentration of the sample Laplacian and the eigen gap as a function of the regularization parameter. As a byproduct of our bounds, we propose a data-driven technique \textit{DKest} (standing for estimated Davis-Kahan bounds) for choosing the regularization parameter. This technique is shown to work well through simulations and on a real data set.