• We study the problem of recovery of matrices that are simultaneously low rank and row and/or column sparse. Such matrices appear in recent applications in cognitive neuroscience, imaging, computer vision, macroeconomics, and genetics. We propose a GDT (Gradient Descent with hard Thresholding) algorithm to efficiently recover matrices with such structure, by minimizing a bi-convex function over a nonconvex set of constraints. We show linear convergence of the iterates obtained by GDT to a region within statistical error of an optimal solution. As an application of our method, we consider multi-task learning problems and show that the statistical error rate obtained by GDT is near optimal compared to minimax rate. Experiments demonstrate competitive performance and much faster running speed compared to existing methods, on both simulations and real data sets.
  • We propose a novel class of time-varying nonparanormal graphical models, which allows us to model high dimensional heavy-tailed systems and the evolution of their latent network structures. Under this model, we develop statistical tests for presence of edges both locally at a fixed index value and globally over a range of values. The tests are developed for a high-dimensional regime, are robust to model selection mistakes and do not require commonly assumed minimum signal strength. The testing procedures are based on a high dimensional, debiasing-free moment estimator, which uses a novel kernel smoothed Kendall's tau correlation matrix as an input statistic. The estimator consistently estimates the latent inverse Pearson correlation matrix uniformly in both the index variable and kernel bandwidth. Its rate of convergence is shown to be minimax optimal. Our method is supported by thorough numerical simulations and an application to a neural imaging data set.
  • We develop a novel procedure for constructing confidence bands for components of a sparse additive model. Our procedure is based on a new kernel-sieve hybrid estimator that combines two most popular nonparametric estimation methods in the literature, the kernel regression and the spline method, and is of interest in its own right. Existing methods for fitting sparse additive model are primarily based on sieve estimators, while the literature on confidence bands for nonparametric models are primarily based upon kernel or local polynomial estimators. Our kernel-sieve hybrid estimator combines the best of both worlds and allows us to provide a simple procedure for constructing confidence bands in high-dimensional sparse additive models. We prove that the confidence bands are asymptotically honest by studying approximation with a Gaussian process. Thorough numerical results on both synthetic data and real-world neuroscience data are provided to demonstrate the efficacy of the theory.
  • We propose methods for distributed graph-based multi-task learning that are based on weighted averaging of messages from other machines. Uniform averaging or diminishing stepsize in these methods would yield consensus (single task) learning. We show how simply skewing the averaging weights or controlling the stepsize allows learning different, but related, tasks on the different machines.
  • Along with developing of Peaceman-Rachford Splittling Method (PRSM), many batch algorithms based on it have been studied very deeply. But almost no algorithm focused on the performance of stochastic version of PRSM. In this paper, we propose a new stochastic algorithm based on PRSM, prove its convergence rate in ergodic sense, and test its performance on both artificial and real data. We show that our proposed algorithm, Stochastic Scalable PRSM (SS-PRSM), enjoys the $O(1/K)$ convergence rate, which is the same as those newest stochastic algorithms that based on ADMM but faster than general Stochastic ADMM (which is $O(1/\sqrt{K})$). Our algorithm also owns wide flexibility, outperforms many state-of-the-art stochastic algorithms coming from ADMM, and has low memory cost in large-scale splitting optimization problems.
  • We consider the problem of estimating the latent structure of a social network based on the observed information diffusion events, or {\it cascades}. Here for a given cascade, we only observe the times of infection for infected nodes but not the source of the infection. Most of the existing work on this problem has focused on estimating a diffusion matrix without any structural assumptions on it. In this paper, we propose a novel model based on the intuition that an information is more likely to propagate among two nodes if they are interested in similar topics which are also prominent in the information content. In particular, our model endows each node with an influence vector (which measures how authoritative the node is on each topic) and a receptivity vector (which measures how susceptible the node is for each topic). We show how this node-topic structure can be estimated from the observed cascades and prove an analytical upper bound on the estimation error. The estimated model can be used to build recommendation systems based on the receptivity vectors, as well as for marketing based on the influence vectors. Experiments on synthetic and real data demonstrate the improved performance and better interpretability of our model compared to existing state-of-the-art methods.
  • Undirected graphical models are used extensively in the biological and social sciences to encode a pattern of conditional independences between variables, where the absence of an edge between two nodes $a$ and $b$ indicates that the corresponding two variables $X_a$ and $X_b$ are believed to be conditionally independent, after controlling for all other measured variables. In the Gaussian case, conditional independence corresponds to a zero entry in the precision matrix $\Omega$ (the inverse of the covariance matrix $\Sigma$). Real data often exhibits heavy tail dependence between variables, which cannot be captured by the commonly-used Gaussian or nonparanormal (Gaussian copula) graphical models. In this paper, we study the transelliptical model, an elliptical copula model that generalizes Gaussian and nonparanormal models to a broader family of distributions. We propose the ROCKET method, which constructs an estimator of $\Omega_{ab}$ that we prove to be asymptotically normal under mild assumptions. Empirically, ROCKET outperforms the nonparanormal and Gaussian models in terms of achieving accurate inference on simulated data. We also compare the three methods on real data (daily stock returns), and find that the ROCKET estimator is the only method whose behavior across subsamples agrees with the distribution predicted by the theory.
  • Hypothesis tests in models whose dimension far exceeds the sample size can be formulated much like the classical studentized tests only after the initial bias of estimation is removed successfully. The theory of debiased estimators can be developed in the context of quantile regression models for a fixed quantile value. However, it is frequently desirable to formulate tests based on the quantile regression process, as this leads to more robust tests and more stable confidence sets. Additionally, inference in quantile regression requires estimation of the so called sparsity function, which depends on the unknown density of the error. In this paper we consider a debiasing approach for the uniform testing problem. We develop high-dimensional regression rank scores and show how to use them to estimate the sparsity function, as well as how to adapt them for inference involving the quantile regression process. Furthermore, we develop a Kolmogorov-Smirnov test in a location-shift high-dimensional models and confidence sets that are uniformly valid for many quantile values. The main technical result are the development of a Bahadur representation of the debiasing estimator that is uniform over a range of quantiles and uniform convergence of the quantile process to the Brownian bridge process, which are of independent interest. Simulation studies illustrate finite sample properties of our procedure.
  • Sketching techniques have become popular for scaling up machine learning algorithms by reducing the sample size or dimensionality of massive data sets, while still maintaining the statistical power of big data. In this paper, we study sketching from an optimization point of view: we first show that the iterative Hessian sketch is an optimization process with preconditioning, and develop accelerated iterative Hessian sketch via the searching the conjugate direction; we then establish primal-dual connections between the Hessian sketch and dual random projection, and apply the preconditioned conjugate gradient approach on the dual problem, which leads to the accelerated iterative dual random projection methods. Finally to tackle the challenges from both large sample size and high-dimensionality, we propose the primal-dual sketch, which iteratively sketches the primal and dual formulations. We show that using a logarithmic number of calls to solvers of small scale problem, primal-dual sketch is able to recover the optimum of the original problem up to arbitrary precision. The proposed algorithms are validated via extensive experiments on synthetic and real data sets which complements our theoretical results.
  • We propose a novel, efficient approach for distributed sparse learning in high-dimensions, where observations are randomly partitioned across machines. Computationally, at each round our method only requires the master machine to solve a shifted ell_1 regularized M-estimation problem, and other workers to compute the gradient. In respect of communication, the proposed approach provably matches the estimation error bound of centralized methods within constant rounds of communications (ignoring logarithmic factors). We conduct extensive experiments on both simulated and real world datasets, and demonstrate encouraging performances on high-dimensional regression and classification tasks.
  • We study the problem of distributed multi-task learning with shared representation, where each machine aims to learn a separate, but related, task in an unknown shared low-dimensional subspaces, i.e. when the predictor matrix has low rank. We consider a setting where each task is handled by a different machine, with samples for the task available locally on the machine, and study communication-efficient methods for exploiting the shared structure.
  • We consider the problem of distributed multi-task learning, where each machine learns a separate, but related, task. Specifically, each machine learns a linear predictor in high-dimensional space,where all tasks share the same small support. We present a communication-efficient estimator based on the debiased lasso and show that it is comparable with the optimal centralized method.
  • We propose a robust inferential procedure for assessing uncertainties of parameter estimation in high-dimensional linear models, where the dimension $p$ can grow exponentially fast with the sample size $n$. Our method combines the de-biasing technique with the composite quantile function to construct an estimator that is asymptotically normal. Hence it can be used to construct valid confidence intervals and conduct hypothesis tests. Our estimator is robust and does not require the existence of first or second moment of the noise distribution. It also preserves efficiency in the sense that the worst case efficiency loss is less than 30\% compared to the square-loss-based de-biased Lasso estimator. In many cases our estimator is close to or better than the latter, especially when the noise is heavy-tailed. Our de-biasing procedure does not require solving the $L_1$-penalized composite quantile regression. Instead, it allows for any first-stage estimator with desired convergence rate and empirical sparsity. The paper also provides new proof techniques for developing theoretical guarantees of inferential procedures with non-smooth loss functions. To establish the main results, we exploit the local curvature of the conditional expectation of composite quantile loss and apply empirical process theories to control the difference between empirical quantities and their conditional expectations. Our results are established under weaker assumptions compared to existing work on inference for high-dimensional quantile regression. Furthermore, we consider a high-dimensional simultaneous test for the regression parameters by applying the Gaussian approximation and multiplier bootstrap theories. We also study distributed learning and exploit the divide-and-conquer estimator to reduce computation complexity when the sample size is massive. Finally, we provide empirical results to verify the theory.
  • Given $n$ i.i.d. observations of a random vector $(X,Z)$, where $X$ is a high-dimensional vector and $Z$ is a low-dimensional index variable, we study the problem of estimating the conditional inverse covariance matrix $\Omega(z) = (E[(X-E[X \mid Z])(X-E[X \mid Z])^T \mid Z=z])^{-1}$ under the assumption that the set of non-zero elements is small and does not depend on the index variable. We develop a novel procedure that combines the ideas of the local constant smoothing and the group Lasso for estimating the conditional inverse covariance matrix. A proximal iterative smoothing algorithm is used to solve the corresponding convex optimization problems. We prove that our procedure recovers the conditional independence assumptions of the distribution $X \mid Z$ with high probability. This result is established by developing a uniform deviation bound for the high-dimensional conditional covariance matrix from its population counterpart, which may be of independent interest. Furthermore, we develop point-wise confidence intervals for individual elements of the conditional inverse covariance matrix. We perform extensive simulation studies, in which we demonstrate that our proposed procedure outperforms sensible competitors. We illustrate our proposal on a S&P 500 stock price data set.
  • This article considers the problem of multi-group classification in the setting where the number of variables $p$ is larger than the number of observations $n$. Several methods have been proposed in the literature that address this problem, however their variable selection performance is either unknown or suboptimal to the results known in the two-group case. In this work we provide sharp conditions for the consistent recovery of relevant variables in the multi-group case using the discriminant analysis proposal of Gaynanova et al., 2014. We achieve the rates of convergence that attain the optimal scaling of the sample size $n$, number of variables $p$ and the sparsity level $s$. These rates are significantly faster than the best known results in the multi-group case. Moreover, they coincide with the optimal minimax rates for the two-group case. We validate our theoretical results with numerical analysis.
  • Despite its prevalence in statistical datasets, heteroscedasticity (non-constant sample variances) has been largely ignored in the high-dimensional statistics literature. Recently, studies have shown that the Lasso can accommodate heteroscedastic errors, with minor algorithmic modifications (Belloni et al., 2012; Gautier and Tsybakov, 2013). In this work, we study heteroscedastic regression with linear mean model and log-linear variances model with sparse high-dimensional parameters. In this work, we propose estimating variances in a post-Lasso fashion, which is followed by weighted-least squares mean estimation. These steps employ non-convex penalties as in Fan and Li (2001), which allows us to prove oracle properties for both post-Lasso variance and mean parameter estimates. We reinforce our theoretical findings with experiments.
  • We consider the problem of providing nonparametric confidence guarantees for undirected graphs under weak assumptions. In particular, we do not assume sparsity, incoherence or Normality. We allow the dimension $D$ to increase with the sample size $n$. First, we prove lower bounds that show that if we want accurate inferences with low assumptions then there are limitations on the dimension as a function of sample size. When the dimension increases slowly with sample size, we show that methods based on Normal approximations and on the bootstrap lead to valid inferences and we provide Berry-Esseen bounds on the accuracy of the Normal approximation. When the dimension is large relative to sample size, accurate inferences for graphs under low assumptions are not possible. Instead we propose to estimate something less demanding than the entire partial correlation graph. In particular, we consider: cluster graphs, restricted partial correlation graphs and correlation graphs.
  • We consider the high-dimensional discriminant analysis problem. For this problem, different methods have been proposed and justified by establishing exact convergence rates for the classification risk, as well as the l2 convergence results to the discriminative rule. However, sharp theoretical analysis for the variable selection performance of these procedures have not been established, even though model interpretation is of fundamental importance in scientific data analysis. This paper bridges the gap by providing sharp sufficient conditions for consistent variable selection using the sparse discriminant analysis (Mai et al., 2012). Through careful analysis, we establish rates of convergence that are significantly faster than the best known results and admit an optimal scaling of the sample size n, dimensionality p, and sparsity level s in the high-dimensional setting. Sufficient conditions are complemented by the necessary information theoretic limits on the variable selection problem in the context of high-dimensional discriminant analysis. Exploiting a numerical equivalence result, our method also establish the optimal results for the ROAD estimator (Fan et al., 2012) and the sparse optimal scaling estimator (Clemmensen et al., 2011). Furthermore, we analyze an exhaustive search procedure, whose performance serves as a benchmark, and show that it is variable selection consistent under weaker conditions. Extensive simulations demonstrating the sharpness of the bounds are also provided.
  • We consider the problems of detection and localization of a contiguous block of weak activation in a large matrix, from a small number of noisy, possibly adaptive, compressive (linear) measurements. This is closely related to the problem of compressed sensing, where the task is to estimate a sparse vector using a small number of linear measurements. Contrary to results in compressed sensing, where it has been shown that neither adaptivity nor contiguous structure help much, we show that for reliable localization the magnitude of the weakest signals is strongly influenced by both structure and the ability to choose measurements adaptively while for detection neither adaptivity nor structure reduce the requirement on the magnitude of the signal. We characterize the precise tradeoffs between the various problem parameters, the signal strength and the number of measurements required to reliably detect and localize the block of activation. The sufficient conditions are complemented with information theoretic lower bounds.
  • Many real world network problems often concern multivariate nodal attributes such as image, textual, and multi-view feature vectors on nodes, rather than simple univariate nodal attributes. The existing graph estimation methods built on Gaussian graphical models and covariance selection algorithms can not handle such data, neither can the theories developed around such methods be directly applied. In this paper, we propose a new principled framework for estimating graphs from multi-attribute data. Instead of estimating the partial correlation as in current literature, our method estimates the partial canonical correlations that naturally accommodate complex nodal features. Computationally, we provide an efficient algorithm which utilizes the multi-attribute structure. Theoretically, we provide sufficient conditions which guarantee consistent graph recovery. Extensive simulation studies demonstrate performance of our method under various conditions. Furthermore, we provide illustrative applications to uncovering gene regulatory networks from gene and protein profiles, and uncovering brain connectivity graph from functional magnetic resonance imaging data.
  • Network models have been popular for modeling and representing complex relationships and dependencies between observed variables. When data comes from a dynamic stochastic process, a single static network model cannot adequately capture transient dependencies, such as, gene regulatory dependencies throughout a developmental cycle of an organism. Kolar et al (2010b) proposed a method based on kernel-smoothing l1-penalized logistic regression for estimating time-varying networks from nodal observations collected from a time-series of observational data. In this paper, we establish conditions under which the proposed method consistently recovers the structure of a time-varying network. This work complements previous empirical findings by providing sound theoretical guarantees for the proposed estimation procedure. For completeness, we include numerical simulations in the paper.
  • We consider the high-dimensional heteroscedastic regression model, where the mean and the log variance are modeled as a linear combination of input variables. Existing literature on high-dimensional linear regres- sion models has largely ignored non-constant error variances, even though they commonly occur in a variety of applications ranging from biostatis- tics to finance. In this paper we study a class of non-convex penalized pseudolikelihood estimators for both the mean and variance parameters. We show that the Heteroscedastic Iterative Penalized Pseudolikelihood Optimizer (HIPPO) achieves the oracle property, that is, we prove that the rates of convergence are the same as if the true model was known. We demonstrate numerical properties of the procedure on a simulation study and real world data.
  • We study the problem of estimating a temporally varying coefficient and varying structure (VCVS) graphical model underlying nonstationary time series data, such as social states of interacting individuals or microarray expression profiles of gene networks, as opposed to i.i.d. data from an invariant model widely considered in current literature of structural estimation. In particular, we consider the scenario in which the model evolves in a piece-wise constant fashion. We propose a procedure that minimizes the so-called TESLA loss (i.e., temporally smoothed L1 regularized regression), which allows jointly estimating the partition boundaries of the VCVS model and the coefficient of the sparse precision matrix on each block of the partition. A highly scalable proximal gradient method is proposed to solve the resultant convex optimization problem; and the conditions for sparsistent estimation and the convergence rate of both the partition boundaries and the network structure are established for the first time for such estimators.
  • We propose a novel application of the Simultaneous Orthogonal Matching Pursuit (S-OMP) procedure for sparsistant variable selection in ultra-high dimensional multi-task regression problems. Screening of variables, as introduced in \cite{fan08sis}, is an efficient and highly scalable way to remove many irrelevant variables from the set of all variables, while retaining all the relevant variables. S-OMP can be applied to problems with hundreds of thousands of variables and once the number of variables is reduced to a manageable size, a more computationally demanding procedure can be used to identify the relevant variables for each of the regression outputs. To our knowledge, this is the first attempt to utilize relatedness of multiple outputs to perform fast screening of relevant variables. As our main theoretical contribution, we prove that, asymptotically, S-OMP is guaranteed to reduce an ultra-high number of variables to below the sample size without losing true relevant variables. We also provide formal evidence that a modified Bayesian information criterion (BIC) can be used to efficiently determine the number of iterations in S-OMP. We further provide empirical evidence on the benefit of variable selection using multiple regression outputs jointly, as opposed to performing variable selection for each output separately. The finite sample performance of S-OMP is demonstrated on extensive simulation studies, and on a genetic association mapping problem. $Keywords$ Adaptive Lasso; Greedy forward regression; Orthogonal matching pursuit; Multi-output regression; Multi-task learning; Simultaneous orthogonal matching pursuit; Sure screening; Variable selection
  • Stochastic networks are a plausible representation of the relational information among entities in dynamic systems such as living cells or social communities. While there is a rich literature in estimating a static or temporally invariant network from observation data, little has been done toward estimating time-varying networks from time series of entity attributes. In this paper we present two new machine learning methods for estimating time-varying networks, which both build on a temporally smoothed $l_1$-regularized logistic regression formalism that can be cast as a standard convex-optimization problem and solved efficiently using generic solvers scalable to large networks. We report promising results on recovering simulated time-varying networks. For real data sets, we reverse engineer the latent sequence of temporally rewiring political networks between Senators from the US Senate voting records and the latent evolving regulatory networks underlying 588 genes across the life cycle of Drosophila melanogaster from the microarray time course.