• Big data is transforming our world, revolutionizing operations and analytics everywhere, from financial engineering to biomedical sciences. The complexity of big data often makes dimension reduction techniques necessary before conducting statistical inference. Principal component analysis, commonly referred to as PCA, has become an essential tool for multivariate data analysis and unsupervised dimension reduction, the goal of which is to find a lower dimensional subspace that captures most of the variation in the dataset. This article provides an overview of methodological and theoretical developments of PCA over the last decade, with a focus on its applications to big data analytics. We first review the mathematical formulation of PCA and its theoretical development from the viewpoint of perturbation analysis. We then briefly discuss the relationship between PCA and factor analysis as well as its applications to large covariance estimation and multiple testing. PCA also finds important applications in many modern machine learning problems, and we focus on community detection, ranking, mixture models and manifold learning in this paper.
  • Over the last two decades, many exciting variable selection methods have been developed for finding a small group of covariates that are associated with the response from a large pool. Can the discoveries from these data mining approaches be spurious due to high dimensionality and limited sample size? Can our fundamental assumptions about the exogeneity of the covariates needed for such variable selection be validated with the data? To answer these questions, we need to derive the distributions of the maximum spurious correlations given a certain number of predictors, namely, the distribution of the correlation of a response variable $Y$ with the best $s$ linear combinations of $p$ covariates $\mathbf{X}$, even when $\mathbf{X}$ and $Y$ are independent. When the covariance matrix of $\mathbf{X}$ possesses the restricted eigenvalue property, we derive such distributions for both a finite $s$ and a diverging $s$, using Gaussian approximation and empirical process techniques. However, such a distribution depends on the unknown covariance matrix of $\mathbf{X}$. Hence, we use the multiplier bootstrap procedure to approximate the unknown distributions and establish the consistency of such a simple bootstrap approach. The results are further extended to the situation where the residuals are from regularized fits. Our approach is then used to construct the upper confidence limit for the maximum spurious correlation and to test the exogeneity of the covariates. The former provides a baseline for guarding against false discoveries and the latter tests whether our fundamental assumptions for high-dimensional model selection are statistically valid. Our techniques and results are illustrated with both numerical examples and real data analysis.
  • Recently, Chernozhukov, Chetverikov, and Kato [Ann. Statist. 42 (2014) 1564--1597] developed a new Gaussian comparison inequality for approximating the suprema of empirical processes. This paper exploits this technique to devise sharp inference on spectra of large random matrices. In particular, we show that two long-standing problems in random matrix theory can be solved: (i) simple bootstrap inference on sample eigenvalues when true eigenvalues are tied; (ii) conducting two-sample Roy's covariance test in high dimensions. To establish the asymptotic results, we develop a generalized $\epsilon$-net argument regarding the matrix rescaled spectral norm and several new empirical process bounds, which are of independent interest.
  • Matrix completion has been well studied under the uniform sampling model and the trace-norm regularized methods perform well both theoretically and numerically in such a setting. However, the uniform sampling model is unrealistic for a range of applications and the standard trace-norm relaxation can behave very poorly when the underlying sampling scheme is non-uniform. In this paper we propose and analyze a max-norm constrained empirical risk minimization method for noisy matrix completion under a general sampling model. The optimal rate of convergence is established under the Frobenius norm loss in the context of approximately low-rank matrix reconstruction. It is shown that the max-norm constrained method is minimax rate-optimal and yields a unified and robust approximate recovery guarantee, with respect to the sampling distributions. The computational effectiveness of this method is also discussed, based on first-order algorithms for solving convex optimizations involving max-norm regularization.
  • In this paper, we study the problem of testing the mean vectors of high dimensional data in both one-sample and two-sample cases. The proposed testing procedures employ maximum-type statistics and parametric bootstrap techniques to compute the critical values. Unlike existing tests that rely heavily on structural conditions on the unknown covariance matrices, the proposed tests allow general covariance structures of the data and therefore enjoy a wide scope of applicability in practice. To enhance the power of the tests against sparse alternatives, we further propose two-step procedures with a preliminary feature screening step. Theoretical properties of the proposed tests are investigated. Through extensive numerical experiments on synthetic datasets and a human acute lymphoblastic leukemia gene expression dataset, we illustrate the performance of the new tests and how they may assist in detecting disease-associated gene-sets. The proposed methods have been implemented in the R package HDtest and are available on CRAN.
  • Many data mining and statistical machine learning algorithms have been developed to select a subset of covariates to associate with a response variable. Spurious discoveries can easily arise in high-dimensional data analysis due to the enormous number of possible selections. How can we know statistically that our discoveries are better than those obtained by chance? In this paper, we define a measure of goodness of spurious fit, which shows how well a response variable can be fitted by an optimally selected subset of covariates under the null model, and propose a simple and effective LAMM algorithm to compute it. It coincides with the maximum spurious correlation for linear models and can be regarded as a generalized maximum spurious correlation. We derive the asymptotic distribution of such goodness of spurious fit for generalized linear models and $L_1$ regression. Such an asymptotic distribution depends on the sample size, ambient dimension, the number of variables used in the fit, and the covariance information. It can be consistently estimated by multiplier bootstrapping and used as a benchmark to guard against spurious discoveries. It can also be applied to model selection, which considers only candidate models with goodness of fit better than that of spurious fits. The theory and method are convincingly illustrated by simulated examples and an application to the binary outcomes from German Neuroblastoma Trials.
  • Two-sample $U$-statistics are widely used in a broad range of applications, including those in the fields of biostatistics and econometrics. In this paper, we establish sharp Cramér-type moderate deviation theorems for Studentized two-sample $U$-statistics in a general framework, including the two-sample $t$-statistic and Studentized Mann-Whitney test statistic as prototypical examples. In particular, a refined moderate deviation theorem with second-order accuracy is established for the two-sample $t$-statistic. These results extend the applicability of the existing statistical methodologies from the one-sample $t$-statistic to more general nonlinear statistics. Applications to two-sample large-scale multiple testing problems with false discovery rate control and the regularized bootstrap method are also discussed.
  • This paper studies the matrix completion problem under arbitrary sampling schemes. We propose a new estimator incorporating both max-norm and nuclear-norm regularization, based on which we can conduct efficient low-rank matrix recovery using a random subset of entries observed with additive noise under general non-uniform and unknown sampling distributions. This method significantly relaxes the uniform sampling assumption imposed for the widely used nuclear-norm penalized approach, and makes low-rank matrix recovery feasible in more practical settings. Theoretically, we prove that the proposed estimator achieves fast rates of convergence under different settings. Computationally, we propose an alternating direction method of multipliers algorithm to efficiently compute the estimator, which bridges a gap between theory and practice of machine learning methods with max-norm regularization. Further, we provide thorough numerical studies to evaluate the proposed method using both simulated and real datasets.
  • Cramér-type moderate deviation theorems quantify the accuracy of the relative error of the normal approximation and provide theoretical justifications for many commonly used methods in statistics. In this paper, we develop a new randomized concentration inequality and establish a Cramér-type moderate deviation theorem for general self-normalized processes which include many well-known Studentized nonlinear statistics. In particular, a sharp moderate deviation theorem under optimal moment conditions is established for Studentized $U$-statistics.
  • Comparing large covariance matrices has important applications in modern genomics, where scientists are often interested in understanding whether relationships (e.g., dependencies or co-regulations) among a large number of genes vary between different biological states. We propose a computationally fast procedure for testing the equality of two large covariance matrices when the dimensions of the covariance matrices are much larger than the sample sizes. A distinguishing feature of the new procedure is that it imposes no structural assumptions on the unknown covariance matrices. Hence the test is robust with respect to various complex dependence structures that frequently arise in genomics. We prove that the proposed procedure is asymptotically valid under weak moment conditions. As an interesting application, we derive a new gene clustering algorithm which shares the same nice property of avoiding restrictive structural assumptions for high-dimensional genomics data. Using an asthma gene expression dataset, we illustrate how the new test helps compare the covariance matrices of the genes across different gene sets/pathways between the disease group and the control group, and how the gene clustering algorithm provides new insights on the way gene clustering patterns differ between the two groups. The proposed methods have been implemented in the R package HDtest and are available on CRAN.
  • We consider nonparametric estimation of a regression curve when the data are observed with multiplicative distortion which depends on an observed confounding variable. We suggest several estimators, ranging from a relatively simple one that relies on restrictive assumptions usually made in the literature, to a sophisticated piecewise approach that involves reconstructing a smooth curve from an estimator of a constant multiple of its absolute value, and which can be applied in much more general scenarios. We show that, although our nonparametric estimators are constructed from predictors of the unobserved undistorted data, they have the same first order asymptotic properties as the standard estimators that could be computed if the undistorted data were available. We illustrate the good numerical performance of our methods on both simulated and real datasets.
  • This paper considers the problem of testing the equality of two unspecified distributions. The classical omnibus tests such as the Kolmogorov-Smirnov and Cramér-von Mises tests are known to suffer from low power against essentially all but location-scale alternatives. We propose a new two-sample test that modifies Neyman's smooth test and extends it to the multivariate case based on the idea of projection pursuit. The asymptotic null property of the test and its power against local alternatives are studied. The multiplier bootstrap method is employed to compute the critical value of the multivariate test. We establish validity of the bootstrap approximation in the case where the dimension is allowed to grow with the sample size. Numerical studies show that the new testing procedures perform well even for small sample sizes and are powerful in detecting local features or high-frequency components.
  • Let $\mathbf {x}_1,\ldots,\mathbf {x}_n$ be a random sample from a $p$-dimensional population distribution, where $p=p_n\to\infty$ and $\log p=o(n^{\beta})$ for some $0<\beta\leq1$, and let $L_n$ be the coherence of the sample correlation matrix. In this paper it is proved that $\sqrt{n/\log p}L_n\to2$ in probability if and only if $Ee^{t_0|x_{11}|^{\alpha}}<\infty$ for some $t_0>0$, where $\alpha$ satisfies $\beta=\alpha/(4-\alpha)$. Asymptotic distributions of $L_n$ are also derived under the same sufficient condition. Similar results remain valid for $m$-coherence when the variables of the population are $m$-dependent. The proofs are based on self-normalized moderate deviations, the Stein-Chen method and a newly developed randomized concentration inequality.
  • We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed.
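
The coherence result stated in the abstract beginning "Let $\mathbf {x}_1,\ldots,\mathbf {x}_n$ be a random sample" can be checked numerically on a small scale. The sketch below is only an illustration under stated assumptions, not code from any of the papers: it assumes independent standard Gaussian entries (which satisfy the exponential moment condition with $\alpha=2$, hence $\beta=1$), takes $L_n$ to be the largest absolute off-diagonal entry of the sample correlation matrix, and uses the hypothetical helper names `sample_coherence` and `normalized_coherence`. The sizes $n$ and $p$ are arbitrary choices.

```python
import math
import random


def sample_coherence(data):
    """Largest absolute off-diagonal entry of the sample correlation matrix.

    `data` is a list of n rows, each a list of p floats.
    """
    n, p = len(data), len(data[0])
    # Centre each column and scale it to unit Euclidean norm, so that the
    # sample correlation of two columns is simply their inner product.
    cols = []
    for j in range(p):
        col = [row[j] for row in data]
        mean = sum(col) / n
        centred = [v - mean for v in col]
        norm = math.sqrt(sum(v * v for v in centred))
        cols.append([v / norm for v in centred])
    best = 0.0
    for i in range(p):
        for j in range(i + 1, p):
            r = sum(a * b for a, b in zip(cols[i], cols[j]))
            best = max(best, abs(r))
    return best


def normalized_coherence(n, p, seed=0):
    """Return sqrt(n / log p) * L_n for i.i.d. N(0, 1) entries (an assumption)."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    return math.sqrt(n / math.log(p)) * sample_coherence(data)


if __name__ == "__main__":
    # Illustrative sizes only; the limit theorem is asymptotic in n and p.
    for n, p in [(100, 50), (200, 100)]:
        print(n, p, round(normalized_coherence(n, p), 3))
```

Because the statement is asymptotic, the printed ratios at such small sizes need not be close to 2; the sketch only shows how the normalized quantity $\sqrt{n/\log p}L_n$ is formed from data.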