
Big data is transforming our world, revolutionizing operations and analytics
everywhere, from financial engineering to biomedical sciences. The complexity
of big data often makes dimension reduction techniques necessary before
conducting statistical inference. Principal component analysis, commonly
referred to as PCA, has become an essential tool for multivariate data analysis
and unsupervised dimension reduction, the goal of which is to find a lower
dimensional subspace that captures most of the variation in the dataset. This
article provides an overview of methodological and theoretical developments of
PCA over the last decade, with a focus on its applications to big data analytics.
We first review the mathematical formulation of PCA and its theoretical
development from the viewpoint of perturbation analysis. We then briefly
discuss the relationship between PCA and factor analysis as well as its
applications to large covariance estimation and multiple testing. PCA also
finds important applications in many modern machine learning problems, and we
focus on community detection, ranking, mixture models and manifold learning in
this paper.
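
As a minimal illustration of the formulation reviewed above, the leading principal component can be taken as the top eigenvector of the sample covariance matrix. The sketch below computes it by power iteration, which is our choice for keeping the example free of dependencies and is not a method named in the abstract. The function name `leading_component` and the toy data are assumptions made for illustration only.

```python
import math
import random


def leading_component(rows, iters=200):
    """Return (eigenvalue, unit eigenvector) of the top principal component."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration; the start vector must not be orthogonal to the answer.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' S v gives the variance captured along v.
    eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                 for i in range(p))
    return eigval, v


rng = random.Random(0)
# Toy data: most variation lies along the direction (1, 1, 0) / sqrt(2).
data = []
for _ in range(500):
    z = rng.gauss(0, 3)
    data.append([z + rng.gauss(0, 0.3), z + rng.gauss(0, 0.3),
                 rng.gauss(0, 0.3)])

eigval, direction = leading_component(data)
print(eigval, direction)
```

In this toy setting the recovered direction should be close to (1, 1, 0)/sqrt(2), so a one-dimensional subspace captures most of the variation.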

Over the last two decades, many exciting variable selection methods have been
developed for finding a small group of covariates that are associated with the
response from a large pool. Can the discoveries from these data mining
approaches be spurious due to high dimensionality and limited sample size? Can
our fundamental assumptions about the exogeneity of the covariates needed for
such variable selection be validated with the data? To answer these questions,
we need to derive the distributions of the maximum spurious correlations given
a certain number of predictors, namely, the distribution of the correlation of
a response variable $Y$ with the best $s$ linear combinations of $p$ covariates
$\mathbf{X}$, even when $\mathbf{X}$ and $Y$ are independent. When the
covariance matrix of $\mathbf{X}$ possesses the restricted eigenvalue property,
we derive such distributions for both a finite $s$ and a diverging $s$, using
Gaussian approximation and empirical process techniques. However, such a
distribution depends on the unknown covariance matrix of $\mathbf{X}$. Hence,
we use the multiplier bootstrap procedure to approximate the unknown
distributions and establish the consistency of such a simple bootstrap
approach. The results are further extended to the situation where the residuals
are from regularized fits. Our approach is then used to construct the upper
confidence limit for the maximum spurious correlation and to test the
exogeneity of the covariates. The former provides a baseline for guarding
against false discoveries and the latter tests whether our fundamental
assumptions for high-dimensional model selection are statistically valid. Our
techniques and results are illustrated with both numerical examples and real
data analysis.
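
To give a feel for the phenomenon, the sketch below simulates the maximum spurious correlation in the simplest case $s=1$, where $Y$ is independent of all $p$ covariates. It uses plain Monte Carlo under a known Gaussian null, not the multiplier bootstrap developed in the paper, and the names `pearson` and `max_spurious_corr` as well as the sizes $n=50$, $p=100$ are assumptions chosen for illustration.

```python
import math
import random


def pearson(u, v):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def max_spurious_corr(n, p, rng):
    """Largest |corr(Y, X_j)| over p covariates drawn independently of Y."""
    y = [rng.gauss(0, 1) for _ in range(n)]
    best = 0.0
    for _ in range(p):
        x = [rng.gauss(0, 1) for _ in range(n)]
        best = max(best, abs(pearson(x, y)))
    return best


rng = random.Random(1)
n, p, reps = 50, 100, 50
draws = sorted(max_spurious_corr(n, p, rng) for _ in range(reps))
median = draws[reps // 2]
# Empirical 95% upper limit of the maximum spurious correlation.
upper_95 = draws[math.ceil(0.95 * reps) - 1]
print(median, upper_95)
```

Even though no covariate is related to the response, the best of the $p$ sample correlations is far from zero, which is why an upper confidence limit of this kind is a useful baseline.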

Recently, Chernozhukov, Chetverikov, and Kato [Ann. Statist. 42 (2014)
1564-1597] developed a new Gaussian comparison inequality for approximating
the suprema of empirical processes. This paper exploits this technique to
devise sharp inference on spectra of large random matrices. In particular, we
show that two longstanding problems in random matrix theory can be solved: (i)
simple bootstrap inference on sample eigenvalues when true eigenvalues are
tied; (ii) conducting two-sample Roy's covariance test in high dimensions. To
establish the asymptotic results, a generalized $\epsilon$-net argument
regarding the matrix rescaled spectral norm and several new empirical process
bounds are developed, which are of independent interest.

Matrix completion has been well studied under the uniform sampling model and
the trace-norm regularized methods perform well both theoretically and
numerically in such a setting. However, the uniform sampling model is
unrealistic for a range of applications and the standard trace-norm relaxation
can behave very poorly when the underlying sampling scheme is nonuniform.
In this paper we propose and analyze a max-norm constrained empirical risk
minimization method for noisy matrix completion under a general sampling model.
The optimal rate of convergence is established under the Frobenius norm loss in
the context of approximately low-rank matrix reconstruction. It is shown that
the max-norm constrained method is minimax rate-optimal and yields a unified
and robust approximate recovery guarantee, with respect to the sampling
distributions. The computational effectiveness of this method is also
discussed, based on first-order algorithms for solving convex optimizations
involving max-norm regularization.

In this paper, we study the problem of testing the mean vectors of
high-dimensional data in both one-sample and two-sample cases. The proposed
testing procedures employ maximum-type statistics and parametric bootstrap
techniques to compute the critical values. Unlike existing tests
that rely heavily on structural conditions on the unknown covariance
matrices, the proposed tests allow general covariance structures of the data
and therefore enjoy a wide scope of applicability in practice. To enhance the
power of the tests against sparse alternatives, we further propose two-step
procedures with a preliminary feature screening step. Theoretical properties of
the proposed tests are investigated. Through extensive numerical experiments on
synthetic datasets and a human acute lymphoblastic leukemia gene expression
dataset, we illustrate the performance of the new tests and how they may
assist in detecting disease-associated gene-sets. The proposed
methods have been implemented in the R package HDtest and are available on CRAN.
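
The one-sample case can be sketched as follows, under simplifying assumptions that are ours and not the paper's: the statistic is the non-Studentized maximum $\sqrt{n}\max_j|\bar{X}_j|$, there is no screening step, and each bootstrap draw is formed as $n^{-1/2}\sum_i e_i(\mathbf{X}_i-\bar{\mathbf{X}})$ with i.i.d. standard normal $e_i$, which conditionally on the data is a centered Gaussian vector with the sample covariance matrix. The function name `max_type_test` is hypothetical and this is not the interface of the HDtest package.

```python
import math
import random


def max_type_test(rows, n_boot=200, seed=0):
    """Return (statistic, bootstrap p-value) for H0: the mean vector is zero."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    stat = math.sqrt(n) * max(abs(m) for m in means)
    exceed = 0
    for _ in range(n_boot):
        # Gaussian weights give a draw from N(0, sample covariance).
        e = [rng.gauss(0, 1) for _ in range(n)]
        draw = [sum(e[i] * centered[i][j] for i in range(n)) / math.sqrt(n)
                for j in range(p)]
        if max(abs(d) for d in draw) >= stat:
            exceed += 1
    return stat, exceed / n_boot


rng = random.Random(2)
n, p = 100, 20
null_data = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
# Sparse alternative: only the first coordinate has a nonzero mean.
alt_data = [[rng.gauss(1 if j == 0 else 0, 1) for j in range(p)]
            for _ in range(n)]

stat_null, p_null = max_type_test(null_data)
stat_alt, p_alt = max_type_test(alt_data)
print(stat_null, p_null, stat_alt, p_alt)
```

Because the critical value comes from draws that reuse the estimated covariance, no structural assumption on the covariance matrix enters the procedure.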

Many data mining and statistical machine learning algorithms have been
developed to select a subset of covariates to associate with a response
variable. Spurious discoveries can easily arise in high-dimensional data
analysis due to the enormous number of possible selections. How can we know
statistically that our discoveries are better than those found by chance? In
this paper, we
define a measure of goodness of spurious fit, which shows how well a response
variable can be fitted by an optimally selected subset of covariates under the
null model, and propose a simple and effective LAMM algorithm to compute it. It
coincides with the maximum spurious correlation for linear models and can be
regarded as a generalized maximum spurious correlation. We derive the
asymptotic distribution of such goodness of spurious fit for generalized linear
models and $L_1$ regression. Such an asymptotic distribution depends on the
sample size, ambient dimension, the number of variables used in the fit, and
the covariance information. It can be consistently estimated by multiplier
bootstrapping and used as a benchmark to guard against spurious discoveries. It
can also be applied to model selection, which considers only candidate models
whose goodness of fit is better than that of spurious fits. The theory and method
are convincingly illustrated by simulated examples and an application to the
binary outcomes from German Neuroblastoma Trials.

Two-sample $U$-statistics are widely used in a broad range of applications,
including those in the fields of biostatistics and econometrics. In this paper,
we establish sharp Cram\'{e}r-type moderate deviation theorems for Studentized
two-sample $U$-statistics in a general framework, including the two-sample
$t$-statistic and Studentized Mann-Whitney test statistic as prototypical
examples. In particular, a refined moderate deviation theorem with second-order
accuracy is established for the two-sample $t$-statistic. These results extend
the applicability of the existing statistical methodologies from the one-sample
$t$-statistic to more general nonlinear statistics. Applications to two-sample
large-scale multiple testing problems with false discovery rate control and the
regularized bootstrap method are also discussed.
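
The two prototypical examples can be computed from one routine. The sketch below Studentizes a two-sample $U$-statistic with kernel $h(x,y)$ using the sample variances of the row and column averages of the kernel matrix; this particular variance estimate is an assumption of the sketch and may differ in detail from the one analyzed in the paper. The function name `studentized_two_sample_u` is hypothetical. With $h(x,y)=x-y$ the routine reduces to the unpooled (Welch-type) two-sample $t$-statistic, and with $h(x,y)=1\{x\le y\}-1/2$ it gives a Studentized Mann-Whitney statistic.

```python
import math
import random


def studentized_two_sample_u(xs, ys, kernel):
    """Studentized two-sample U-statistic for a kernel h(x, y)."""
    n1, n2 = len(xs), len(ys)
    h = [[kernel(x, y) for y in ys] for x in xs]
    u = sum(map(sum, h)) / (n1 * n2)
    # Row and column averages of the kernel matrix.
    q = [sum(row) / n2 for row in h]
    r = [sum(h[i][j] for i in range(n1)) / n1 for j in range(n2)]
    var1 = sum((a - u) ** 2 for a in q) / (n1 - 1)
    var2 = sum((b - u) ** 2 for b in r) / (n2 - 1)
    return u / math.sqrt(var1 / n1 + var2 / n2)


rng = random.Random(4)
xs = [rng.gauss(0, 1) for _ in range(60)]
ys = [rng.gauss(1, 1) for _ in range(80)]  # second sample shifted by 1

t_stat = studentized_two_sample_u(xs, ys, lambda x, y: x - y)
mw_stat = studentized_two_sample_u(
    xs, ys, lambda x, y: (1.0 if x <= y else 0.0) - 0.5)

# Direct computation of the unpooled two-sample t-statistic, for comparison.
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
vx = sum((x - mx) ** 2 for x in xs) / (len(xs) - 1)
vy = sum((y - my) ** 2 for y in ys) / (len(ys) - 1)
welch_t = (mx - my) / math.sqrt(vx / len(xs) + vy / len(ys))
print(t_stat, welch_t, mw_stat)
```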

This paper studies the matrix completion problem under arbitrary sampling
schemes. We propose a new estimator incorporating both max-norm and
nuclear-norm regularization, based on which we can conduct efficient low-rank
matrix recovery using a random subset of entries observed with additive noise
under general nonuniform and unknown sampling distributions. This method
significantly relaxes the uniform sampling assumption imposed for the widely
used nuclear-norm penalized approach, and makes low-rank matrix recovery
feasible in more practical settings. Theoretically, we prove that the proposed
estimator achieves fast rates of convergence under different settings.
Computationally, we propose an alternating direction method of multipliers
algorithm to efficiently compute the estimator, which bridges a gap between
theory and practice of machine learning methods with max-norm regularization.
Further, we provide thorough numerical studies to evaluate the proposed method
using both simulated and real datasets.

Cram\'er type moderate deviation theorems quantify the accuracy of the
relative error of the normal approximation and provide theoretical
justifications for many commonly used methods in statistics. In this paper, we
develop a new randomized concentration inequality and establish a Cram\'er type
moderate deviation theorem for general self-normalized processes which include
many well-known Studentized nonlinear statistics. In particular, a sharp
moderate deviation theorem under optimal moment conditions is established for
Studentized $U$-statistics.

Comparing large covariance matrices has important applications in modern
genomics, where scientists are often interested in understanding whether
relationships (e.g., dependencies or co-regulations) among a large number of
genes vary between different biological states. We propose a computationally
fast procedure for testing the equality of two large covariance matrices when
the dimensions of the covariance matrices are much larger than the sample
sizes. A distinguishing feature of the new procedure is that it imposes no
structural assumptions on the unknown covariance matrices. Hence the test is
robust with respect to various complex dependence structures that frequently
arise in genomics. We prove that the proposed procedure is asymptotically valid
under weak moment conditions. As an interesting application, we derive a new
gene clustering algorithm which shares the same nice property of avoiding
restrictive structural assumptions for high-dimensional genomics data. Using an
asthma gene expression dataset, we illustrate how the new test helps compare
the covariance matrices of the genes across different gene sets/pathways
between the disease group and the control group, and how the gene clustering
algorithm provides new insights on the way gene clustering patterns differ
between the two groups. The proposed methods have been implemented in the
R package HDtest and are available on CRAN.

We consider nonparametric estimation of a regression curve when the data are
observed with multiplicative distortion which depends on an observed
confounding variable. We suggest several estimators, ranging from a relatively
simple one that relies on restrictive assumptions usually made in the
literature, to a sophisticated piecewise approach that involves reconstructing
a smooth curve from an estimator of a constant multiple of its absolute value,
and which can be applied in much more general scenarios. We show that, although
our nonparametric estimators are constructed from predictors of the unobserved
undistorted data, they have the same first order asymptotic properties as the
standard estimators that could be computed if the undistorted data were
available. We illustrate the good numerical performance of our methods on both
simulated and real datasets.

This paper considers the problem of testing the equality of two unspecified
distributions. The classical omnibus tests such as the Kolmogorov-Smirnov and
Cram\'er-von Mises tests are known to suffer from low power against essentially
all but location-scale alternatives. We propose a new two-sample test that
modifies Neyman's smooth test and extends it to the multivariate case based on
the idea of projection pursuit. The asymptotic null property of the test and its
power against local alternatives are studied. The multiplier bootstrap method
is employed to compute the critical value of the multivariate test. We
establish validity of the bootstrap approximation in the case where the
dimension is allowed to grow with the sample size. Numerical studies show that
the new testing procedures perform well even for small sample sizes and are
powerful in detecting local features or high-frequency components.

Let $\mathbf {x}_1,\ldots,\mathbf {x}_n$ be a random sample from a
$p$-dimensional population distribution, where $p=p_n\to\infty$ and $\log
p=o(n^{\beta})$ for some $0<\beta\leq1$, and let $L_n$ be the coherence of the
sample correlation matrix. In this paper it is proved that $\sqrt{n/\log
p}L_n\to2$ in probability if and only if $Ee^{t_0|x_{11}|^{\alpha}}<\infty$ for
some $t_0>0$, where $\alpha$ satisfies $\beta=\alpha/(4-\alpha)$. Asymptotic
distributions of $L_n$ are also proved under the same sufficient condition.
Similar results remain valid for $m$-coherence when the variables of the
population are $m$-dependent. The proofs are based on self-normalized moderate
deviations, the Stein-Chen method and a newly developed randomized
concentration inequality.
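
The limit $\sqrt{n/\log p}\,L_n\to2$ can be examined numerically. The sketch below computes the coherence, taken here as the largest absolute off-diagonal entry of the sample correlation matrix, for independent standard normal entries. The function name `coherence` and the sizes $n=200$, $p=50$ are assumptions for illustration; at such moderate sizes the rescaled value is expected to sit somewhat below the limit 2, since the convergence is slow.

```python
import math
import random


def coherence(rows):
    """Largest absolute off-diagonal entry of the sample correlation matrix."""
    n, p = len(rows), len(rows[0])
    cols = []
    for j in range(p):
        col = [r[j] for r in rows]
        m = sum(col) / n
        c = [v - m for v in col]
        norm = math.sqrt(sum(v * v for v in c))
        # Centered, unit-norm columns: inner products are correlations.
        cols.append([v / norm for v in c])
    return max(abs(sum(a * b for a, b in zip(cols[i], cols[j])))
               for i in range(p) for j in range(i + 1, p))


rng = random.Random(3)
n, p = 200, 50
data = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]

L_n = coherence(data)
ratio = math.sqrt(n / math.log(p)) * L_n
print(L_n, ratio)
```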

We consider in this paper the problem of noisy 1-bit matrix completion under
a general nonuniform sampling distribution using the max-norm as a convex
relaxation for the rank. A max-norm constrained maximum likelihood estimate is
introduced and studied. The rate of convergence for the estimate is obtained.
Information-theoretical methods are used to establish a minimax lower bound
under the general sampling model. The minimax upper and lower bounds together
yield the optimal rate of convergence for the Frobenius norm loss.
Computational algorithms and numerical performance are also discussed.