
Hypothesis testing in the linear regression model is a fundamental
statistical problem. We consider linear regression in the highdimensional
regime where the number of parameters exceeds the number of samples ($p> n$).
In order to make informative inference, we assume that the model is
approximately sparse, that is the effect of covariates on the response can be
well approximated by conditioning on a relatively small number of covariates
whose identities are unknown. We develop a framework for testing very general
hypotheses regarding the model parameters. Our framework encompasses testing
whether the parameter lies in a convex cone, testing the signal strength, and
testing arbitrary functionals of the parameter. We show that the proposed
procedure controls the type I error, and also analyze the power of the
procedure. Our numerical experiments confirm our theoretical findings and
demonstrate that we control false positive rate (type I error) near the nominal
level, and have high power. By duality between hypotheses testing and
confidence intervals, the proposed framework can be used to obtain valid
confidence intervals for various functionals of the model parameters. For
linear functionals, the length of confidence intervals is shown to be minimax
rate optimal.

The stochastic gradient descent (SGD) algorithm has been widely used in
statistical estimation for largescale data due to its computational and memory
efficiency. While most existing works focus on the convergence of the objective
function or the error of the obtained solution, we investigate the problem of
statistical inference of true model parameters based on SGD when the population
loss function is strongly convex and satisfies certain smoothness conditions.
Our main contributions are twofold. First, in the fixed dimension setup, we
propose two consistent estimators of the asymptotic covariance of the average
iterate from SGD: (1) a plugin estimator, and (2) a batchmeans estimator,
which is computationally more efficient and only uses the iterates from SGD.
Both proposed estimators allow us to construct asymptotically exact confidence
intervals and hypothesis tests. Second, for highdimensional linear regression,
using a variant of the SGD algorithm, we construct a debiased estimator of each
regression coefficient that is asymptotically normal. This gives a onepass
algorithm for computing both the sparse regression coefficients and confidence
intervals, which is computationally attractive and applicable to online data.

In this paper we study the problem of learning a shallow artificial neural
network that best fits a training data set. We study this problem in the
overparameterized regime where the number of observations are fewer than the
number of parameters in the model. We show that with quadratic activations the
optimization landscape of training such shallow neural networks has certain
favorable characteristics that allow globally optimal models to be found
efficiently using a variety of local search heuristics. This result holds for
an arbitrary training data of input/output pairs. For differentiable activation
functions we also show that gradient descent, when suitably initialized,
converges at a linear rate to a globally optimal model. This result focuses on
a realizable model where the inputs are chosen i.i.d. from a Gaussian
distribution and the labels are generated according to planted weight
coefficients.

Matrix completion is a basic machine learning problem that has wide
applications, especially in collaborative filtering and recommender systems.
Simple nonconvex optimization algorithms are popular and effective in
practice. Despite recent progress in proving various nonconvex algorithms
converge from a good initial point, it remains unclear why random or arbitrary
initialization suffices in practice. We prove that the commonly used nonconvex
objective function for \textit{positive semidefinite} matrix completion has no
spurious local minima  all local minima must also be global. Therefore, many
popular optimization algorithms such as (stochastic) gradient descent can
provably solve positive semidefinite matrix completion with \textit{arbitrary}
initialization in polynomial time. The result can be generalized to the setting
when the observed entries contain noise. We believe that our main proof
strategy can be useful for understanding geometric properties of other
statistical problems involving partial or noisy observations.

This work considers the question: what convergence guarantees does the
stochastic subgradient method have in the absence of smoothness and convexity?
We prove that the stochastic subgradient method, on any semialgebraic locally
Lipschitz function, produces limit points that are all firstorder stationary.
More generally, our result applies to any function with a Whitney stratifiable
graph. In particular, this work endows the stochastic subgradient method, and
its proximal extension, with rigorous convergence guarantees for a wide class
of problems arising in data scienceincluding all popular deep learning
architectures.

We provide new theoretical insights on why overparametrization is effective
in learning neural networks. For a $k$ hidden node shallow network with
quadratic activation and $n$ training data points, we show as long as $ k \ge
\sqrt{2n}$, overparametrization enables local search algorithms to find a
\emph{globally} optimal solution for general smooth and convex loss functions.
Further, despite that the number of parameters may exceed the sample size,
using theory of Rademacher complexity, we show with weight decay, the solution
also generalizes well if the data is sampled from a regular distribution such
as Gaussian. To prove when $k\ge \sqrt{2n}$, the loss function has benign
landscape properties, we adopt an idea from smoothed analysis, which may have
other applications in studying loss surfaces of neural networks.

We analyze the convergence of (stochastic) gradient descent algorithm for
learning a convolutional filter with Rectified Linear Unit (ReLU) activation
function. Our analysis does not rely on any specific form of the input
distribution and our proofs only use the definition of ReLU, in contrast with
previous works that are restricted to standard Gaussian input. We show that
(stochastic) gradient descent with random initialization can learn the
convolutional filter in polynomial time and the convergence rate depends on the
smoothness of the input distribution and the closeness of patches. To the best
of our knowledge, this is the first recovery guarantee of gradientbased
algorithms for convolutional filter on nonGaussian input distributions. Our
theory also justifies the twostage learning rate strategy in deep neural
networks. While our focus is theoretical, we also present experiments that
illustrate our theoretical findings.

In this work, we study two firstorder primaldual based algorithms, the
Gradient PrimalDual Algorithm (GPDA) and the Gradient Alternating Direction
Method of Multipliers (GADMM), for solving a class of linearly constrained
nonconvex optimization problems. We show that with random initialization of
the primal and dual variables, both algorithms are able to compute secondorder
stationary solutions (ss2) with probability one. This is the first result
showing that primaldual algorithm is capable of finding ss2 when only using
firstorder information, it also extends the existing results for firstorder,
but primalonly algorithms.
An important implication of our result is that it also gives rise to the
first global convergence result to the ss2, for two classes of unconstrained
distributed nonconvex learning problems over multiagent networks.

Generative Adversarial Networks (GANs) are one of the most practical
strategies to learn data distributions. A popular GAN formulation is based on
the use of Wasserstein distance as a metric between probability distributions.
Unfortunately, minimizing the Wasserstein distance between the data
distribution and the generative model distribution is a challenging problem as
its objective is nonconvex, nonsmooth, and even hard to compute. In this
work, we propose to use a smooth approximation of the Wasserstein GANs. We show
that this smooth approximation is close to the original objective. Moreover,
obtaining gradient information of this approximate formulation is
computationally effortless and hence one can easily apply first order
optimization methods to optimize this objective. Based on this observation, we
proposed a class of algorithms with guaranteed theoretical convergence to
stationarity. Unlike the original nonsmooth objective, our proposed algorithm
only requires solving the discriminator to approximate optimality. We applied
our method to learning Gaussian mixtures on a grid and also to learning MNIST
digits. Our method allows the use of powerful cost functions based on latent
representations of the data, where this latent representation could also be
optimized adversarially.

We establish that firstorder methods avoid saddle points for almost all
initializations. Our results apply to a wide variety of firstorder methods,
including gradient descent, block coordinate descent, mirror descent and
variants thereof. The connecting thread is that such algorithms can be studied
from a dynamical systems perspective in which appropriate instantiations of the
Stable Manifold Theorem allow for a global stability analysis. Thus, neither
access to secondorder derivative information nor randomness beyond
initialization is necessary to provably avoid saddle points.

We propose a fast proximal Newtontype algorithm for minimizing regularized
finite sums that returns an $\epsilon$suboptimal point in
$\tilde{\mathcal{O}}(d(n + \sqrt{\kappa d})\log(\frac{1}{\epsilon}))$ FLOPS,
where $n$ is number of samples, $d$ is feature dimension, and $\kappa$ is the
condition number. As long as $n > d$, the proposed method is more efficient
than stateoftheart accelerated stochastic firstorder methods for nonsmooth
regularizers which requires $\tilde{\mathcal{O}}(d(n + \sqrt{\kappa
n})\log(\frac{1}{\epsilon}))$ FLOPS. The key idea is to form the subsampled
Newton subproblem in a way that preserves the finite sum structure of the
objective, thereby allowing us to leverage recent developments in stochastic
firstorder methods to solve the subproblem. Experimental results verify that
the proposed algorithm outperforms previous algorithms for $\ell_1$regularized
logistic regression on real datasets.

Although gradient descent (GD) almost always escapes saddle points
asymptotically [Lee et al., 2016], this paper shows that even with fairly
natural random initialization schemes and nonpathological functions, GD can be
significantly slowed down by saddle points, taking exponential time to escape.
On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et
al., 2017] is not slowed down by saddle points  it can find an approximate
local minimizer in polynomial time. This result implies that GD is inherently
slower than perturbed GD, and justifies the importance of adding perturbations
for efficient nonconvex optimization. While our focus is theoretical, we also
present experiments that illustrate our theoretical findings.

We present a Communicationefficient Surrogate Likelihood (CSL) framework for
solving distributed statistical inference problems. CSL provides a
communicationefficient surrogate to the global likelihood that can be used for
lowdimensional estimation, highdimensional regularized estimation and
Bayesian inference. For lowdimensional estimation, CSL provably improves upon
naive averaging schemes and facilitates the construction of confidence
intervals. For highdimensional regularized estimation, CSL leads to a
minimaxoptimal estimator with controlled communication cost. For Bayesian
inference, CSL can be used to form a communicationefficient quasiposterior
distribution that converges to the true posterior. This quasiposterior
procedure significantly improves the computational efficiency of MCMC
algorithms even in a nondistributed setting. We present both theoretical
analysis and experiments to explore the properties of the CSL approximation.

Importance sampling is widely used in machine learning and statistics, but
its power is limited by the restriction of using simple proposals for which the
importance weights can be tractably calculated. We address this problem by
studying blackbox importance sampling methods that calculate importance
weights for samples generated from any unknown proposal or blackbox mechanism.
Our method allows us to use better and richer proposals to solve difficult
problems, and (somewhat counterintuitively) also has the additional benefit of
improving the estimation accuracy beyond typical importance sampling. Both
theoretical and empirical analyses are provided.

Sketching techniques have become popular for scaling up machine learning
algorithms by reducing the sample size or dimensionality of massive data sets,
while still maintaining the statistical power of big data. In this paper, we
study sketching from an optimization point of view: we first show that the
iterative Hessian sketch is an optimization process with preconditioning, and
develop accelerated iterative Hessian sketch via the searching the conjugate
direction; we then establish primaldual connections between the Hessian sketch
and dual random projection, and apply the preconditioned conjugate gradient
approach on the dual problem, which leads to the accelerated iterative dual
random projection methods. Finally to tackle the challenges from both large
sample size and highdimensionality, we propose the primaldual sketch, which
iteratively sketches the primal and dual formulations. We show that using a
logarithmic number of calls to solvers of small scale problem, primaldual
sketch is able to recover the optimum of the original problem up to arbitrary
precision. The proposed algorithms are validated via extensive experiments on
synthetic and real data sets which complements our theoretical results.

We derive a new discrepancy statistic for measuring differences between two
probability distributions based on combining Stein's identity with the
reproducing kernel Hilbert space theory. We apply our result to test how well a
probabilistic model fits a set of observations, and derive a new class of
powerful goodnessoffit tests that are widely applicable for complex and high
dimensional distributions, even for those with computationally intractable
normalization constants. Both theoretical and empirical properties of our
methods are studied thoroughly.

We develop a general approach to valid inference after model selection. At
the core of our framework is a result that characterizes the distribution of a
postselection estimator conditioned on the selection event. We specialize the
approach to model selection by the lasso to form valid confidence intervals for
the selected coefficients and test whether all relevant variables have been
included in the model.

We show that gradient descent converges to a local minimizer, almost surely
with random initialization. This is proved by applying the Stable Manifold
Theorem from dynamical systems theory.

We study distributed optimization algorithms for minimizing the average of
convex functions. The applications include empirical risk minimization problems
in statistical machine learning where the datasets are large and have to be
stored on different machines. We design a distributed stochastic variance
reduced gradient algorithm that, under certain conditions on the condition
number, simultaneously achieves the optimal parallel runtime, amount of
communication and rounds of communication among all distributed firstorder
methods up to constant factors. Our method and its accelerated extension also
outperform existing distributed algorithms in terms of the rounds of
communication as long as the condition number is not too large compared to the
size of data in each machine. We also prove a lower bound for the number of
rounds of communication for a broad class of distributed firstorder methods
including the proposed algorithms in this paper. We show that our accelerated
distributed stochastic variance reduced gradient algorithm achieves this lower
bound so that it uses the fewest rounds of communication among all distributed
firstorder algorithms.

We study nonconvex empirical risk minimization for learning halfspaces and
neural networks. For loss functions that are $L$Lipschitz continuous, we
present algorithms to learn halfspaces and multilayer neural networks that
achieve arbitrarily small excess risk $\epsilon>0$. The time complexity is
polynomial in the input dimension $d$ and the sample size $n$, but exponential
in the quantity $(L/\epsilon^2)\log(L/\epsilon)$. These algorithms run multiple
rounds of random initialization followed by arbitrary optimization steps. We
further show that if the data is separable by some neural network with constant
margin $\gamma>0$, then there is a polynomialtime algorithm for learning a
neural network that separates the training data with margin $\Omega(\gamma)$.
As a consequence, the algorithm achieves arbitrary generalization error
$\epsilon>0$ with ${\rm poly}(d,1/\epsilon)$ sample and time complexity. We
establish the same learnability result when the labels are randomly flipped
with probability $\eta<1/2$.

We study the improper learning of multilayer neural networks. Suppose that
the neural network to be learned has $k$ hidden layers and that the
$\ell_1$norm of the incoming weights of any neuron is bounded by $L$. We
present a kernelbased method, such that with probability at least $1 
\delta$, it learns a predictor whose generalization error is at most $\epsilon$
worse than that of the neural network. The sample complexity and the time
complexity of the presented method are polynomial in the input dimension and in
$(1/\epsilon,\log(1/\delta),F(k,L))$, where $F(k,L)$ is a function depending on
$(k,L)$ and on the activation function, independent of the number of neurons.
The algorithm applies to both sigmoidlike activation functions and ReLUlike
activation functions. It implies that any sufficiently sparse neural network is
learnable in polynomial time.

We devise a oneshot approach to distributed sparse regression in the
highdimensional setting. The key idea is to average "debiased" or
"desparsified" lasso estimators. We show the approach converges at the same
rate as the lasso as long as the dataset is not split across too many machines.
We also extend the approach to generalized linear models.

This thesis studies two problems in modern statistics. First, we study
selective inference, or inference for hypothesis that are chosen after looking
at the data. The motiving application is inference for regression coefficients
selected by the lasso. We present the ConditiononSelection method that allows
for valid selective inference, and study its application to the lasso, and
several other selection algorithms.
In the second part, we consider the problem of learning the structure of a
pairwise graphical model over continuous and discrete variables. We present a
new pairwise model for graphical models with both continuous and discrete
variables that is amenable to structure learning. In previous work, authors
have considered structure learning of Gaussian graphical models and structure
learning of discrete models. Our approach is a natural generalization of these
two lines of work to the mixed case. The penalization scheme involves a novel
symmetric use of the grouplasso norm and follows naturally from a particular
parametrization of the model. We provide conditions under which our estimator
is model selection consistent in the highdimensional regime.

Regularized Mestimators are used in diverse areas of science and engineering
to fit highdimensional models with some lowdimensional structure. Usually the
lowdimensional structure is encoded by the presence of the (unknown)
parameters in some lowdimensional model subspace. In such settings, it is
desirable for estimates of the model parameters to be \emph{model selection
consistent}: the estimates also fall in the model subspace. We develop a
general framework for establishing consistency and model selection consistency
of regularized Mestimators and show how it applies to some special cases of
interest in statistical learning. Our analysis identifies two key properties of
regularized Mestimators, referred to as geometric decomposability and
irrepresentability, that ensure the estimators are consistent and model
selection consistent.

We generalize Newtontype methods for minimizing smooth functions to handle a
sum of two convex functions: a smooth function and a nonsmooth function with a
simple proximal mapping. We show that the resulting proximal Newtontype
methods inherit the desirable convergence behavior of Newtontype methods for
minimizing smooth functions, even when search directions are computed
inexactly. Many popular methods tailored to problems arising in bioinformatics,
signal processing, and statistical learning are special cases of proximal
Newtontype methods, and our analysis yields new convergence results for some
of these methods.