
We consider the estimation and inference of graphical models that
characterize the dependency structure of highdimensional tensorvalued data.
To facilitate the estimation of the precision matrix corresponding to each way
of the tensor, we assume the data follow a tensor normal distribution whose
covariance has a Kronecker product structure. A critical challenge in the
estimation and inference of this model is the fact that its penalized maximum
likelihood estimation involves minimizing a nonconvex objective function. To
address it, this paper makes two contributions: (i) In spite of the
nonconvexity of this estimation problem, we prove that an alternating
minimization algorithm, which iteratively estimates each sparse precision
matrix while fixing the others, attains an estimator with an optimal
statistical rate of convergence. (ii) We propose a debiased statistical
inference procedure for testing hypotheses on the true support of the sparse
precision matrices, and employ it for testing a growing number of hypothesis
with false discovery rate (FDR) control. The asymptotic normality of our test
statistic and the consistency of FDR control procedure are established. Our
theoretical results are backed up by thorough numerical studies and our real
applications on neuroimaging studies of Autism spectrum disorder and users'
advertising click analysis bring new scientific findings and business insights.
The proposed methods are encoded into a publicly available R package Tlasso.

Sparse generalized eigenvalue problem (GEP) plays a pivotal role in a large
family of highdimensional statistical models, including sparse Fisher's
discriminant analysis, canonical correlation analysis, and sufficient dimension
reduction. Sparse GEP involves solving a nonconvex optimization problem. Most
existing methods and theory in the context of specific statistical models that
are special cases of the sparse GEP require restrictive structural assumptions
on the input matrices. In this paper, we propose a twostage computational
framework to solve the sparse GEP. At the first stage, we solve a convex
relaxation of the sparse GEP. Taking the solution as an initial value, we then
exploit a nonconvex optimization perspective and propose the truncated Rayleigh
flow method (Rifle) to estimate the leading generalized eigenvector. We show
that Rifle converges linearly to a solution with the optimal statistical rate
of convergence for many statistical models. Theoretically, our method
significantly improves upon the existing literature by eliminating structural
assumptions on the input matrices for both stages. To achieve this, our
analysis involves two key ingredients: (i) a new analysis of the gradient based
method on nonconvex objective functions, and (ii) a finegrained
characterization of the evolution of sparsity patterns along the solution path.
Thorough numerical studies are provided to validate the theoretical results.

We propose a nonparametric method for detecting nonlinear causal relationship
within a set of multidimensional discrete time series, by using sparse additive
models (SpAMs). We show that, when the input to the SpAM is a $\beta$mixing
time series, the model can be fitted by first approximating each unknown
function with a linear combination of a set of Bspline bases, and then solving
a grouplassotype optimization problem with nonconvex regularization.
Theoretically, we characterize the oracle statistical properties of the
proposed sparse estimator in function estimation and model selection.
Numerically, we propose an efficient pathwise iterative shrinkage thresholding
algorithm (PISTA), which tames the nonconvexity and guarantees linear
convergence towards the desired sparse estimator with high probability.

We study the problem of recovery of matrices that are simultaneously low rank
and row and/or column sparse. Such matrices appear in recent applications in
cognitive neuroscience, imaging, computer vision, macroeconomics, and genetics.
We propose a GDT (Gradient Descent with hard Thresholding) algorithm to
efficiently recover matrices with such structure, by minimizing a biconvex
function over a nonconvex set of constraints. We show linear convergence of the
iterates obtained by GDT to a region within statistical error of an optimal
solution. As an application of our method, we consider multitask learning
problems and show that the statistical error rate obtained by GDT is near
optimal compared to minimax rate. Experiments demonstrate competitive
performance and much faster running speed compared to existing methods, on both
simulations and real data sets.

We propose a general theory for studying the \xl{landscape} of nonconvex
\xl{optimization} with underlying symmetric structures \tz{for a class of
machine learning problems (e.g., lowrank matrix factorization, phase
retrieval, and deep linear neural networks)}. In specific, we characterize the
locations of stationary points and the null space of Hessian matrices \xl{of
the objective function} via the lens of invariant groups\removed{for associated
optimization problems, including lowrank matrix factorization, phase
retrieval, and deep linear neural networks}. As a major motivating example, we
apply the proposed general theory to characterize the global \xl{landscape} of
the \xl{nonconvex optimization in} lowrank matrix factorization problem. In
particular, we illustrate how the rotational symmetry group gives rise to
infinitely many nonisolated strict saddle points and equivalent global minima
of the objective function. By explicitly identifying all stationary points, we
divide the entire parameter space into three regions: ($\cR_1$) the region
containing the neighborhoods of all strict saddle points, where the objective
has negative curvatures; ($\cR_2$) the region containing neighborhoods of all
global minima, where the objective enjoys strong convexity along certain
directions; and ($\cR_3$) the complement of the above regions, where the
gradient has sufficiently large magnitudes. We further extend our result to the
matrix sensing problem. Such global landscape implies strong global convergence
guarantees for popular iterative algorithms with arbitrary initial solutions.

We study a stochastic and distributed algorithm for nonconvex problems whose
objective consists of a sum of $N$ nonconvex $L_i/N$smooth functions, plus a
nonsmooth regularizer. The proposed NonconvEx primaldual SpliTTing (NESTT)
algorithm splits the problem into $N$ subproblems, and utilizes an augmented
Lagrangian based primaldual scheme to solve it in a distributed and stochastic
manner. With a special nonuniform sampling, a version of NESTT achieves
$\epsilon$stationary solution using
$\mathcal{O}((\sum_{i=1}^N\sqrt{L_i/N})^2/\epsilon)$ gradient evaluations,
which can be up to $\mathcal{O}(N)$ times better than the (proximal) gradient
descent methods. It also achieves Qlinear convergence rate for nonconvex
$\ell_1$ penalized quadratic problems with polyhedral constraints. Further, we
reveal a fundamental connection between primaldual based methods and a few
primal only methods such as IAG/SAG/SAGA.

We study the fundamental tradeoffs between computational tractability and
statistical accuracy for a general family of hypothesis testing problems with
combinatorial structures. Based upon an oracle model of computation, which
captures the interactions between algorithms and data, we establish a general
lower bound that explicitly connects the minimum testing risk under
computational budget constraints with the intrinsic probabilistic and
combinatorial structures of statistical problems. This lower bound mirrors the
classical statistical lower bound by Le Cam (1986) and allows us to quantify
the optimal statistical performance achievable given limited computational
budgets in a systematic fashion. Under this unified framework, we sharply
characterize the statisticalcomputational phase transition for two testing
problems, namely, normal mean detection and sparse principal component
detection. For normal mean detection, we consider two combinatorial structures,
namely, sparse set and perfect matching. For these problems we identify
significant gaps between the optimal statistical accuracy that is achievable
under computational tractability constraints and the classical statistical
lower bounds. Compared with existing works on computational lower bounds for
statistical problems, which consider general polynomialtime algorithms on
Turing machines, and rely on computational hardness hypotheses on problems like
planted clique detection, we focus on the oracle computational model, which
covers a broad range of popular algorithms, and do not rely on unproven
hypotheses. Moreover, our result provides an intuitive and concrete
interpretation for the intrinsic computational intractability of
highdimensional statistical problems. One byproduct of our result is a lower
bound for a strict generalization of the matrix permanent problem, which is of
independent interest.

We study parameter estimation and asymptotic inference for sparse nonlinear
regression. More specifically, we assume the data are given by $y = f( x^\top
\beta^* ) + \epsilon$, where $f$ is nonlinear. To recover $\beta^*$, we propose
an $\ell_1$regularized leastsquares estimator. Unlike classical linear
regression, the corresponding optimization problem is nonconvex because of the
nonlinearity of $f$. In spite of the nonconvexity, we prove that under mild
conditions, every stationary point of the objective enjoys an optimal
statistical rate of convergence. In addition, we provide an efficient algorithm
that provably converges to a stationary point. We also access the uncertainty
of the obtained estimator. Specifically, based on any stationary point of the
objective, we construct valid hypothesis tests and confidence intervals for the
low dimensional components of the highdimensional parameter $\beta^*$.
Detailed numerical results are provided to back up our theory.

Linear regression studies the problem of estimating a model parameter
$\beta^* \in \mathbb{R}^p$, from $n$ observations
$\{(y_i,\mathbf{x}_i)\}_{i=1}^n$ from linear model $y_i = \langle
\mathbf{x}_i,\beta^* \rangle + \epsilon_i$. We consider a significant
generalization in which the relationship between $\langle \mathbf{x}_i,\beta^*
\rangle$ and $y_i$ is noisy, quantized to a single bit, potentially nonlinear,
noninvertible, as well as unknown. This model is known as the singleindex
model in statistics, and, among other things, it represents a significant
generalization of onebit compressed sensing. We propose a novel spectralbased
estimation procedure and show that we can recover $\beta^*$ in settings (i.e.,
classes of link function $f$) where previous algorithms fail. In general, our
algorithm requires only very mild restrictions on the (unknown) functional
relationship between $y_i$ and $\langle \mathbf{x}_i,\beta^* \rangle$. We also
consider the high dimensional setting where $\beta^*$ is sparse ,and introduce
a twostage nonconvex framework that addresses estimation challenges in high
dimensional regimes where $p \gg n$. For a broad class of link functions
between $\langle \mathbf{x}_i,\beta^* \rangle$ and $y_i$, we establish minimax
lower bounds that demonstrate the optimality of our estimators in both the
classical and high dimensional regimes.

Many high dimensional sparse learning problems are formulated as nonconvex
optimization. A popular approach to solve these nonconvex optimization problems
is through convex relaxations such as linear and semidefinite programming. In
this paper, we study the statistical limits of convex relaxations.
Particularly, we consider two problems: Mean estimation for sparse principal
submatrix and edge probability estimation for stochastic block model. We
exploit the sumofsquares relaxation hierarchy to sharply characterize the
limits of a broad class of convex relaxations. Our result shows statistical
optimality needs to be compromised for achieving computational tractability
using convex relaxations. Compared with existing results on computational lower
bounds for statistical problems, which consider general polynomialtime
algorithms and rely on computational hardness hypotheses on problems like
planted clique detection, our theory focuses on a broad class of convex
relaxations and does not rely on unproven hypotheses.

We provide a general theory of the expectationmaximization (EM) algorithm
for inferring high dimensional latent variable models. In particular, we make
two contributions: (i) For parameter estimation, we propose a novel high
dimensional EM algorithm which naturally incorporates sparsity structure into
parameter estimation. With an appropriate initialization, this algorithm
converges at a geometric rate and attains an estimator with the (near)optimal
statistical rate of convergence. (ii) Based on the obtained estimator, we
propose new inferential procedures for testing hypotheses and constructing
confidence intervals for low dimensional components of high dimensional
parameters. For a broad family of statistical models, our framework establishes
the first computationally feasible approach for optimal estimation and
asymptotic inference in high dimensions. Our theory is supported by thorough
numerical results.

We provide theoretical analysis of the statistical and computational
properties of penalized $M$estimators that can be formulated as the solution
to a possibly nonconvex optimization problem. Many important estimators fall in
this category, including least squares regression with nonconvex
regularization, generalized linear models with nonconvex regularization and
sparse elliptical random design regression. For these problems, it is
intractable to calculate the global solution due to the nonconvex formulation.
In this paper, we propose an approximate regularization pathfollowing method
for solving a variety of learning problems with nonconvex objective functions.
Under a unified analytic framework, we simultaneously provide explicit
statistical and computational rates of convergence for any local solution
attained by the algorithm. Computationally, our algorithm attains a global
geometric rate of convergence for calculating the full regularization path,
which is optimal among all firstorder algorithms. Unlike most existing methods
that only attain geometric rates of convergence for one single regularization
parameter, our algorithm calculates the full regularization path with the same
iteration complexity. In particular, we provide a refined iteration complexity
bound to sharply characterize the performance of each stage along the
regularization path. Statistically, we provide sharp sample complexity analysis
for all the approximate local solutions along the regularization path. In
particular, our analysis improves upon existing results by providing a more
refined sample complexity bound as well as an exact support recovery result for
the final estimator. These results show that the final estimator attains an
oracle statistical property due to the usage of nonconvex penalty.

Sparse principal component analysis (PCA) involves nonconvex optimization for
which the global solution is hard to obtain. To address this issue, one popular
approach is convex relaxation. However, such an approach may produce suboptimal
estimators due to the relaxation effect. To optimally estimate sparse principal
subspaces, we propose a twostage computational framework named "tighten after
relax": Within the 'relax' stage, we approximately solve a convex relaxation of
sparse PCA with early stopping to obtain a desired initial estimator; For the
'tighten' stage, we propose a novel algorithm called sparse orthogonal
iteration pursuit (SOAP), which iteratively refines the initial estimator by
directly solving the underlying nonconvex problem. A key concept of this
twostage framework is the basin of attraction. It represents a local region
within which the `tighten' stage has desired computational and statistical
guarantees. We prove that, the initial estimator obtained from the 'relax'
stage falls into such a region, and hence SOAP geometrically converges to a
principal subspace estimator which is minimaxoptimal within a certain model
class. Unlike most existing sparse PCA estimators, our approach applies to the
nonspiked covariance models, and adapts to nonGaussianity as well as
dependent data settings. Moreover, through analyzing the computational
complexity of the two stages, we illustrate an interesting phenomenon that
larger sample size can reduce the total iteration complexity. Our framework
motivates a general paradigm for solving many complex statistical problems
which involve nonconvex optimization with provable guarantees.

We study sparse principal component analysis for high dimensional vector
autoregressive time series under a doubly asymptotic framework, which allows
the dimension $d$ to scale with the series length $T$. We treat the transition
matrix of time series as a nuisance parameter and directly apply sparse
principal component analysis on multivariate time series as if the data are
independent. We provide explicit nonasymptotic rates of convergence for
leading eigenvector estimation and extend this result to principal subspace
estimation. Our analysis illustrates that the spectral norm of the transition
matrix plays an essential role in determining the final rates. We also
characterize sufficient conditions under which sparse principal component
analysis attains the optimal parametric rate. Our theoretical results are
backed up by thorough numerical studies.