
We study the problem of recovery of matrices that are simultaneously low rank
and row and/or column sparse. Such matrices appear in recent applications in
cognitive neuroscience, imaging, computer vision, macroeconomics, and genetics.
We propose a GDT (Gradient Descent with hard Thresholding) algorithm to
efficiently recover matrices with such structure, by minimizing a biconvex
function over a nonconvex set of constraints. We show linear convergence of the
iterates obtained by GDT to a region within statistical error of an optimal
solution. As an application of our method, we consider multitask learning
problems and show that the statistical error rate obtained by GDT is near
optimal compared to minimax rate. Experiments demonstrate competitive
performance and much faster running speed compared to existing methods, on both
simulations and real data sets.

We propose a novel class of timevarying nonparanormal graphical models,
which allows us to model high dimensional heavytailed systems and the
evolution of their latent network structures. Under this model, we develop
statistical tests for presence of edges both locally at a fixed index value and
globally over a range of values. The tests are developed for a highdimensional
regime, are robust to model selection mistakes and do not require commonly
assumed minimum signal strength. The testing procedures are based on a high
dimensional, debiasingfree moment estimator, which uses a novel kernel
smoothed Kendall's tau correlation matrix as an input statistic. The estimator
consistently estimates the latent inverse Pearson correlation matrix uniformly
in both the index variable and kernel bandwidth. Its rate of convergence is
shown to be minimax optimal. Our method is supported by thorough numerical
simulations and an application to a neural imaging data set.

We develop a novel procedure for constructing confidence bands for components
of a sparse additive model. Our procedure is based on a new kernelsieve hybrid
estimator that combines two most popular nonparametric estimation methods in
the literature, the kernel regression and the spline method, and is of interest
in its own right. Existing methods for fitting sparse additive model are
primarily based on sieve estimators, while the literature on confidence bands
for nonparametric models are primarily based upon kernel or local polynomial
estimators. Our kernelsieve hybrid estimator combines the best of both worlds
and allows us to provide a simple procedure for constructing confidence bands
in highdimensional sparse additive models. We prove that the confidence bands
are asymptotically honest by studying approximation with a Gaussian process.
Thorough numerical results on both synthetic data and realworld neuroscience
data are provided to demonstrate the efficacy of the theory.

We propose methods for distributed graphbased multitask learning that are
based on weighted averaging of messages from other machines. Uniform averaging
or diminishing stepsize in these methods would yield consensus (single task)
learning. We show how simply skewing the averaging weights or controlling the
stepsize allows learning different, but related, tasks on the different
machines.

Along with developing of PeacemanRachford Splittling Method (PRSM), many
batch algorithms based on it have been studied very deeply. But almost no
algorithm focused on the performance of stochastic version of PRSM. In this
paper, we propose a new stochastic algorithm based on PRSM, prove its
convergence rate in ergodic sense, and test its performance on both artificial
and real data. We show that our proposed algorithm, Stochastic Scalable PRSM
(SSPRSM), enjoys the $O(1/K)$ convergence rate, which is the same as those
newest stochastic algorithms that based on ADMM but faster than general
Stochastic ADMM (which is $O(1/\sqrt{K})$). Our algorithm also owns wide
flexibility, outperforms many stateoftheart stochastic algorithms coming
from ADMM, and has low memory cost in largescale splitting optimization
problems.

We consider the problem of estimating the latent structure of a social
network based on the observed information diffusion events, or {\it cascades}.
Here for a given cascade, we only observe the times of infection for infected
nodes but not the source of the infection. Most of the existing work on this
problem has focused on estimating a diffusion matrix without any structural
assumptions on it. In this paper, we propose a novel model based on the
intuition that an information is more likely to propagate among two nodes if
they are interested in similar topics which are also prominent in the
information content. In particular, our model endows each node with an
influence vector (which measures how authoritative the node is on each topic)
and a receptivity vector (which measures how susceptible the node is for each
topic). We show how this nodetopic structure can be estimated from the
observed cascades and prove an analytical upper bound on the estimation error.
The estimated model can be used to build recommendation systems based on the
receptivity vectors, as well as for marketing based on the influence vectors.
Experiments on synthetic and real data demonstrate the improved performance and
better interpretability of our model compared to existing stateoftheart
methods.

Undirected graphical models are used extensively in the biological and social
sciences to encode a pattern of conditional independences between variables,
where the absence of an edge between two nodes $a$ and $b$ indicates that the
corresponding two variables $X_a$ and $X_b$ are believed to be conditionally
independent, after controlling for all other measured variables. In the
Gaussian case, conditional independence corresponds to a zero entry in the
precision matrix $\Omega$ (the inverse of the covariance matrix $\Sigma$). Real
data often exhibits heavy tail dependence between variables, which cannot be
captured by the commonlyused Gaussian or nonparanormal (Gaussian copula)
graphical models. In this paper, we study the transelliptical model, an
elliptical copula model that generalizes Gaussian and nonparanormal models to a
broader family of distributions. We propose the ROCKET method, which constructs
an estimator of $\Omega_{ab}$ that we prove to be asymptotically normal under
mild assumptions. Empirically, ROCKET outperforms the nonparanormal and
Gaussian models in terms of achieving accurate inference on simulated data. We
also compare the three methods on real data (daily stock returns), and find
that the ROCKET estimator is the only method whose behavior across subsamples
agrees with the distribution predicted by the theory.

Hypothesis tests in models whose dimension far exceeds the sample size can be
formulated much like the classical studentized tests only after the initial
bias of estimation is removed successfully. The theory of debiased estimators
can be developed in the context of quantile regression models for a fixed
quantile value. However, it is frequently desirable to formulate tests based on
the quantile regression process, as this leads to more robust tests and more
stable confidence sets. Additionally, inference in quantile regression requires
estimation of the so called sparsity function, which depends on the unknown
density of the error. In this paper we consider a debiasing approach for the
uniform testing problem. We develop highdimensional regression rank scores and
show how to use them to estimate the sparsity function, as well as how to adapt
them for inference involving the quantile regression process. Furthermore, we
develop a KolmogorovSmirnov test in a locationshift highdimensional models
and confidence sets that are uniformly valid for many quantile values. The main
technical result are the development of a Bahadur representation of the
debiasing estimator that is uniform over a range of quantiles and uniform
convergence of the quantile process to the Brownian bridge process, which are
of independent interest. Simulation studies illustrate finite sample properties
of our procedure.

Sketching techniques have become popular for scaling up machine learning
algorithms by reducing the sample size or dimensionality of massive data sets,
while still maintaining the statistical power of big data. In this paper, we
study sketching from an optimization point of view: we first show that the
iterative Hessian sketch is an optimization process with preconditioning, and
develop accelerated iterative Hessian sketch via the searching the conjugate
direction; we then establish primaldual connections between the Hessian sketch
and dual random projection, and apply the preconditioned conjugate gradient
approach on the dual problem, which leads to the accelerated iterative dual
random projection methods. Finally to tackle the challenges from both large
sample size and highdimensionality, we propose the primaldual sketch, which
iteratively sketches the primal and dual formulations. We show that using a
logarithmic number of calls to solvers of small scale problem, primaldual
sketch is able to recover the optimum of the original problem up to arbitrary
precision. The proposed algorithms are validated via extensive experiments on
synthetic and real data sets which complements our theoretical results.

We propose a novel, efficient approach for distributed sparse learning in
highdimensions, where observations are randomly partitioned across machines.
Computationally, at each round our method only requires the master machine to
solve a shifted ell_1 regularized Mestimation problem, and other workers to
compute the gradient. In respect of communication, the proposed approach
provably matches the estimation error bound of centralized methods within
constant rounds of communications (ignoring logarithmic factors). We conduct
extensive experiments on both simulated and real world datasets, and
demonstrate encouraging performances on highdimensional regression and
classification tasks.

We study the problem of distributed multitask learning with shared
representation, where each machine aims to learn a separate, but related, task
in an unknown shared lowdimensional subspaces, i.e. when the predictor matrix
has low rank. We consider a setting where each task is handled by a different
machine, with samples for the task available locally on the machine, and study
communicationefficient methods for exploiting the shared structure.

We consider the problem of distributed multitask learning, where each
machine learns a separate, but related, task. Specifically, each machine learns
a linear predictor in highdimensional space,where all tasks share the same
small support. We present a communicationefficient estimator based on the
debiased lasso and show that it is comparable with the optimal centralized
method.

We propose a robust inferential procedure for assessing uncertainties of
parameter estimation in highdimensional linear models, where the dimension $p$
can grow exponentially fast with the sample size $n$. Our method combines the
debiasing technique with the composite quantile function to construct an
estimator that is asymptotically normal. Hence it can be used to construct
valid confidence intervals and conduct hypothesis tests. Our estimator is
robust and does not require the existence of first or second moment of the
noise distribution. It also preserves efficiency in the sense that the worst
case efficiency loss is less than 30\% compared to the squarelossbased
debiased Lasso estimator. In many cases our estimator is close to or better
than the latter, especially when the noise is heavytailed. Our debiasing
procedure does not require solving the $L_1$penalized composite quantile
regression. Instead, it allows for any firststage estimator with desired
convergence rate and empirical sparsity. The paper also provides new proof
techniques for developing theoretical guarantees of inferential procedures with
nonsmooth loss functions. To establish the main results, we exploit the local
curvature of the conditional expectation of composite quantile loss and apply
empirical process theories to control the difference between empirical
quantities and their conditional expectations. Our results are established
under weaker assumptions compared to existing work on inference for
highdimensional quantile regression. Furthermore, we consider a
highdimensional simultaneous test for the regression parameters by applying
the Gaussian approximation and multiplier bootstrap theories. We also study
distributed learning and exploit the divideandconquer estimator to reduce
computation complexity when the sample size is massive. Finally, we provide
empirical results to verify the theory.

Given $n$ i.i.d. observations of a random vector $(X,Z)$, where $X$ is a
highdimensional vector and $Z$ is a lowdimensional index variable, we study
the problem of estimating the conditional inverse covariance matrix $\Omega(z)
= (E[(XE[X \mid Z])(XE[X \mid Z])^T \mid Z=z])^{1}$ under the assumption
that the set of nonzero elements is small and does not depend on the index
variable. We develop a novel procedure that combines the ideas of the local
constant smoothing and the group Lasso for estimating the conditional inverse
covariance matrix. A proximal iterative smoothing algorithm is used to solve
the corresponding convex optimization problems. We prove that our procedure
recovers the conditional independence assumptions of the distribution $X \mid
Z$ with high probability. This result is established by developing a uniform
deviation bound for the highdimensional conditional covariance matrix from its
population counterpart, which may be of independent interest. Furthermore, we
develop pointwise confidence intervals for individual elements of the
conditional inverse covariance matrix. We perform extensive simulation studies,
in which we demonstrate that our proposed procedure outperforms sensible
competitors. We illustrate our proposal on a S&P 500 stock price data set.

This article considers the problem of multigroup classification in the
setting where the number of variables $p$ is larger than the number of
observations $n$. Several methods have been proposed in the literature that
address this problem, however their variable selection performance is either
unknown or suboptimal to the results known in the twogroup case. In this work
we provide sharp conditions for the consistent recovery of relevant variables
in the multigroup case using the discriminant analysis proposal of Gaynanova
et al., 2014. We achieve the rates of convergence that attain the optimal
scaling of the sample size $n$, number of variables $p$ and the sparsity level
$s$. These rates are significantly faster than the best known results in the
multigroup case. Moreover, they coincide with the optimal minimax rates for
the twogroup case. We validate our theoretical results with numerical
analysis.

Despite its prevalence in statistical datasets, heteroscedasticity
(nonconstant sample variances) has been largely ignored in the
highdimensional statistics literature. Recently, studies have shown that the
Lasso can accommodate heteroscedastic errors, with minor algorithmic
modifications (Belloni et al., 2012; Gautier and Tsybakov, 2013). In this work,
we study heteroscedastic regression with linear mean model and loglinear
variances model with sparse highdimensional parameters. In this work, we
propose estimating variances in a postLasso fashion, which is followed by
weightedleast squares mean estimation. These steps employ nonconvex penalties
as in Fan and Li (2001), which allows us to prove oracle properties for both
postLasso variance and mean parameter estimates. We reinforce our theoretical
findings with experiments.

We consider the problem of providing nonparametric confidence guarantees for
undirected graphs under weak assumptions. In particular, we do not assume
sparsity, incoherence or Normality. We allow the dimension $D$ to increase with
the sample size $n$. First, we prove lower bounds that show that if we want
accurate inferences with low assumptions then there are limitations on the
dimension as a function of sample size. When the dimension increases slowly
with sample size, we show that methods based on Normal approximations and on
the bootstrap lead to valid inferences and we provide BerryEsseen bounds on
the accuracy of the Normal approximation. When the dimension is large relative
to sample size, accurate inferences for graphs under low assumptions are not
possible. Instead we propose to estimate something less demanding than the
entire partial correlation graph. In particular, we consider: cluster graphs,
restricted partial correlation graphs and correlation graphs.

We consider the highdimensional discriminant analysis problem. For this
problem, different methods have been proposed and justified by establishing
exact convergence rates for the classification risk, as well as the l2
convergence results to the discriminative rule. However, sharp theoretical
analysis for the variable selection performance of these procedures have not
been established, even though model interpretation is of fundamental importance
in scientific data analysis. This paper bridges the gap by providing sharp
sufficient conditions for consistent variable selection using the sparse
discriminant analysis (Mai et al., 2012). Through careful analysis, we
establish rates of convergence that are significantly faster than the best
known results and admit an optimal scaling of the sample size n, dimensionality
p, and sparsity level s in the highdimensional setting. Sufficient conditions
are complemented by the necessary information theoretic limits on the variable
selection problem in the context of highdimensional discriminant analysis.
Exploiting a numerical equivalence result, our method also establish the
optimal results for the ROAD estimator (Fan et al., 2012) and the sparse
optimal scaling estimator (Clemmensen et al., 2011). Furthermore, we analyze an
exhaustive search procedure, whose performance serves as a benchmark, and show
that it is variable selection consistent under weaker conditions. Extensive
simulations demonstrating the sharpness of the bounds are also provided.

We consider the problems of detection and localization of a contiguous block
of weak activation in a large matrix, from a small number of noisy, possibly
adaptive, compressive (linear) measurements. This is closely related to the
problem of compressed sensing, where the task is to estimate a sparse vector
using a small number of linear measurements. Contrary to results in compressed
sensing, where it has been shown that neither adaptivity nor contiguous
structure help much, we show that for reliable localization the magnitude of
the weakest signals is strongly influenced by both structure and the ability to
choose measurements adaptively while for detection neither adaptivity nor
structure reduce the requirement on the magnitude of the signal. We
characterize the precise tradeoffs between the various problem parameters, the
signal strength and the number of measurements required to reliably detect and
localize the block of activation. The sufficient conditions are complemented
with information theoretic lower bounds.

Many real world network problems often concern multivariate nodal attributes
such as image, textual, and multiview feature vectors on nodes, rather than
simple univariate nodal attributes. The existing graph estimation methods built
on Gaussian graphical models and covariance selection algorithms can not handle
such data, neither can the theories developed around such methods be directly
applied. In this paper, we propose a new principled framework for estimating
graphs from multiattribute data. Instead of estimating the partial correlation
as in current literature, our method estimates the partial canonical
correlations that naturally accommodate complex nodal features.
Computationally, we provide an efficient algorithm which utilizes the
multiattribute structure. Theoretically, we provide sufficient conditions
which guarantee consistent graph recovery. Extensive simulation studies
demonstrate performance of our method under various conditions. Furthermore, we
provide illustrative applications to uncovering gene regulatory networks from
gene and protein profiles, and uncovering brain connectivity graph from
functional magnetic resonance imaging data.

Network models have been popular for modeling and representing complex
relationships and dependencies between observed variables. When data comes from
a dynamic stochastic process, a single static network model cannot adequately
capture transient dependencies, such as, gene regulatory dependencies
throughout a developmental cycle of an organism. Kolar et al (2010b) proposed a
method based on kernelsmoothing l1penalized logistic regression for
estimating timevarying networks from nodal observations collected from a
timeseries of observational data. In this paper, we establish conditions under
which the proposed method consistently recovers the structure of a timevarying
network. This work complements previous empirical findings by providing sound
theoretical guarantees for the proposed estimation procedure. For completeness,
we include numerical simulations in the paper.

We consider the highdimensional heteroscedastic regression model, where the
mean and the log variance are modeled as a linear combination of input
variables. Existing literature on highdimensional linear regres sion models
has largely ignored nonconstant error variances, even though they commonly
occur in a variety of applications ranging from biostatis tics to finance. In
this paper we study a class of nonconvex penalized pseudolikelihood estimators
for both the mean and variance parameters. We show that the Heteroscedastic
Iterative Penalized Pseudolikelihood Optimizer (HIPPO) achieves the oracle
property, that is, we prove that the rates of convergence are the same as if
the true model was known. We demonstrate numerical properties of the procedure
on a simulation study and real world data.

We study the problem of estimating a temporally varying coefficient and
varying structure (VCVS) graphical model underlying nonstationary time series
data, such as social states of interacting individuals or microarray expression
profiles of gene networks, as opposed to i.i.d. data from an invariant model
widely considered in current literature of structural estimation. In
particular, we consider the scenario in which the model evolves in a piecewise
constant fashion. We propose a procedure that minimizes the socalled TESLA
loss (i.e., temporally smoothed L1 regularized regression), which allows
jointly estimating the partition boundaries of the VCVS model and the
coefficient of the sparse precision matrix on each block of the partition. A
highly scalable proximal gradient method is proposed to solve the resultant
convex optimization problem; and the conditions for sparsistent estimation and
the convergence rate of both the partition boundaries and the network structure
are established for the first time for such estimators.

We propose a novel application of the Simultaneous Orthogonal Matching
Pursuit (SOMP) procedure for sparsistant variable selection in ultrahigh
dimensional multitask regression problems. Screening of variables, as
introduced in \cite{fan08sis}, is an efficient and highly scalable way to
remove many irrelevant variables from the set of all variables, while retaining
all the relevant variables. SOMP can be applied to problems with hundreds of
thousands of variables and once the number of variables is reduced to a
manageable size, a more computationally demanding procedure can be used to
identify the relevant variables for each of the regression outputs. To our
knowledge, this is the first attempt to utilize relatedness of multiple outputs
to perform fast screening of relevant variables. As our main theoretical
contribution, we prove that, asymptotically, SOMP is guaranteed to reduce an
ultrahigh number of variables to below the sample size without losing true
relevant variables. We also provide formal evidence that a modified Bayesian
information criterion (BIC) can be used to efficiently determine the number of
iterations in SOMP. We further provide empirical evidence on the benefit of
variable selection using multiple regression outputs jointly, as opposed to
performing variable selection for each output separately. The finite sample
performance of SOMP is demonstrated on extensive simulation studies, and on a
genetic association mapping problem. $Keywords$ Adaptive Lasso; Greedy forward
regression; Orthogonal matching pursuit; Multioutput regression; Multitask
learning; Simultaneous orthogonal matching pursuit; Sure screening; Variable
selection

Stochastic networks are a plausible representation of the relational
information among entities in dynamic systems such as living cells or social
communities. While there is a rich literature in estimating a static or
temporally invariant network from observation data, little has been done toward
estimating timevarying networks from time series of entity attributes. In this
paper we present two new machine learning methods for estimating timevarying
networks, which both build on a temporally smoothed $l_1$regularized logistic
regression formalism that can be cast as a standard convexoptimization problem
and solved efficiently using generic solvers scalable to large networks. We
report promising results on recovering simulated timevarying networks. For
real data sets, we reverse engineer the latent sequence of temporally rewiring
political networks between Senators from the US Senate voting records and the
latent evolving regulatory networks underlying 588 genes across the life cycle
of Drosophila melanogaster from the microarray time course.