
This paper studies how to capture dependency graph structures from real data
which may not be multivariate Gaussian. Starting from marginal loss functions
not necessarily derived from probability distributions, we utilize an additive
overparametrization with shrinkage to incorporate variable dependencies into
the criterion. An iterative Gaussian graph learning algorithm is proposed with
ease in implementation. Statistical analysis shows that the estimators achieve
satisfactory accuracy with the error measured in terms of a proper Bregman
divergence. Reallife examples in different settings are given to demonstrate
the efficacy of the proposed methodology.

This paper revisits the classic iterative proportional scaling (IPS) from a
modern optimization perspective. In contrast to the criticisms made in the
literature, we show that based on a coordinate descent characterization, IPS
can be slightly modified to deliver coefficient estimates, and from a
majorizationminimization standpoint, IPS can be extended to handle logaffine
models with features not necessarily binaryvalued or nonnegative. Furthermore,
some stateoftheart optimization techniques such as blockwise computation,
randomization and momentumbased acceleration can be employed to provide more
scalable IPS algorithms, as well as some regularized variants of IPS for
concurrent feature selection.

In highdimensional multivariate regression problems, enforcing low rank in
the coefficient matrix offers effective dimension reduction, which greatly
facilitates parameter estimation and model interpretation. However,
commonlyused reducedrank methods are sensitive to data corruption, as the
lowrank dependence structure between response variables and predictors is
easily distorted by outliers. We propose a robust reducedrank regression
approach for joint modeling and outlier detection. The problem is formulated as
a regularized multivariate regression with a sparse meanshift parametrization,
which generalizes and unifies some popular robust multivariate methods. An
efficient thresholdingbased iterative procedure is developed for optimization.
We show that the algorithm is guaranteed to converge, and the coordinatewise
minimum point produced is statistically accurate under regularity conditions.
Our theoretical investigations focus on nonasymptotic robust analysis, which
demonstrates that joint rank reduction and outlier detection leads to improved
prediction accuracy. In particular, we show that redescending $\psi$functions
can essentially attain the minimax optimal error rate, and in some less
challenging problems convex regularization guarantees the same low error rate.
The performance of the proposed method is examined by simulation studies and
real data examples.

Variable selection for models including interactions between explanatory
variables often needs to obey certain hierarchical constraints. The weak or
strong structural hierarchy requires that the existence of an interaction term
implies at least one or both associated main effects to be present in the
model. Lately, this problem has attracted a lot of attention, but existing
computational algorithms converge slow even with a moderate number of
predictors. Moreover, in contrast to the rich literature on ordinary variable
selection, there is a lack of statistical theory to show reasonably low error
rates of hierarchical variable selection.
This work investigates a new class of estimators that make use of multiple
group penalties to capture structural parsimony. We give the minimax lower
bounds for strong and weak hierarchical variable selection and show that the
proposed estimators enjoy sharp rate oracle inequalities. A generalpurpose
algorithm is developed with guaranteed convergence and global optimality.
Simulations and real data experiments demonstrate the efficiency and efficacy
of the proposed approach.

This paper studies simultaneous feature selection and extraction in
supervised and unsupervised learning. We propose and investigate selective
reduced rank regression for constructing optimal explanatory factors from a
parsimonious subset of input features. The proposed estimators enjoy sharp
oracle inequalities, and with a predictive information criterion for model
selection, they adapt to unknown sparsity by controlling both rank and row
support of the coefficient matrix. A class of algorithms is developed that can
accommodate various convex and nonconvex sparsityinducing penalties, and can
be used for rankconstrained variable screening in highdimensional
multivariate data. The paper also showcases applications in macroeconomics and
computer vision to demonstrate how lowdimensional data structures can be
effectively captured by joint variable selection and projection.

In largescale modern data analysis, firstorder optimization methods are
usually favored to obtain sparse estimators in high dimensions. This paper
performs theoretical analysis of a class of iterative thresholding based
estimators defined in this way. Oracle inequalities are built to show the
nearly minimax rate optimality of such estimators under a new type of
regularity conditions. Moreover, the sequence of iterates is found to be able
to approach the statistical truth within the best statistical accuracy
geometrically fast. Our results also reveal different benefits brought by
convex and nonconvex types of shrinkage.

Many computer vision and medical imaging problems are faced with learning
from largescale datasets, with millions of observations and features. In this
paper we propose a novel efficient learning scheme that tightens a sparsity
constraint by gradually removing variables based on a criterion and a schedule.
The attractive fact that the problem size keeps dropping throughout the
iterations makes it particularly suitable for big data learning. Our approach
applies generically to the optimization of any differentiable loss function,
and finds applications in regression, classification and ranking. The resultant
algorithms build variable screening into estimation and are extremely simple to
implement. We provide theoretical guarantees of convergence and selection
consistency. In addition, one dimensional piecewise linear response functions
are used to account for nonlinearity and a second order prior is imposed on
these functions to avoid overfitting. Experiments on real and synthetic data
show that the proposed method compares very well with other state of the art
methods in regression, classification and ranking while being computationally
very efficient and scalable.

Principal Component Analysis (PCA) is a dimension reduction technique. It
produces inconsistent estimators when the dimensionality is moderate to high,
which is often the problem in modern largescale applications where algorithm
scalability and model interpretability are difficult to achieve, not to mention
the prevalence of missing values. While existing sparse PCA methods alleviate
inconsistency, they are constrained to the Gaussian assumption of classical PCA
and fail to address algorithm scalability issues. We generalize sparse PCA to
the broad exponential family distributions under highdimensional setup, with
builtin treatment for missing values. Meanwhile we propose a family of
iterative sparse generalized PCA (SGPCA) algorithms such that despite the
nonconvexity and nonsmoothness of the optimization task, the loss function
decreases in every iteration. In terms of ease and intuitive parameter tuning,
our sparsityinducing regularization is far superior to the popular Lasso.
Furthermore, to promote overall scalability, accelerated gradient is integrated
for fast convergence, while a progressive screening technique gradually
squeezes out nuisance dimensions of a largescale problem for feasible
optimization. Highdimensional simulation and real data experiments demonstrate
the efficiency and efficacy of SGPCA.

Recently, the robustification of principal component analysis has attracted
lots of attention from statisticians, engineers and computer scientists. In
this work we study the type of outliers that are not necessarily apparent in
the original observation space but can seriously affect the principal subspace
estimation. Based on a mathematical formulation of such transformed outliers, a
novel robust orthogonal complement principal component analysis (ROCPCA) is
proposed. The framework combines the popular sparsityenforcing and low rank
regularization techniques to deal with rowwise outliers as well as
elementwise outliers. A nonasymptotic oracle inequality guarantees the
accuracy and high breakdown performance of ROCPCA in finite samples. To tackle
the computational challenges, an efficient algorithm is developed on the basis
of Stiefel manifold optimization and iterative thresholding. Furthermore, a
batch variant is proposed to significantly reduce the cost in ultra high
dimensions. The paper also points out a pitfall of a common practice of SVD
reduction in robust PCA. Experiments show the effectiveness and efficiency of
ROCPCA in both synthetic and real data.

This paper studies largescale dynamical networks where the current state of
the system is a linear transformation of the previous state, contaminated by a
multivariate Gaussian noise. Examples include stock markets, human brains and
gene regulatory networks. We introduce a transition matrix to describe the
evolution, which can be translated to a directed Granger transition graph, and
use the concentration matrix of the Gaussian noise to capture the secondorder
relations between nodes, which can be translated to an undirected conditional
dependence graph. We propose regularizing the two graphs jointly in topology
identification and dynamics estimation. Based on the notion of joint
association graph (JAG), we develop a joint graphical screening and estimation
(JGSE) framework for efficient network learning in big data. In particular, our
method can predetermine and remove unnecessary edges based on the joint
graphical structure, referred to as JAG screening, and can decompose a large
network into smaller subnetworks in a robust manner, referred to as JAG
decomposition. JAG screening and decomposition can reduce the problem size and
search space for fine estimation at a later stage. Experiments on both
synthetic data and realworld applications show the effectiveness of the
proposed framework in largescale network topology identification and dynamics
estimation.

Largescale recurrent networks have drawn increasing attention recently
because of their capabilities in modeling a large variety of realworld
phenomena and physical mechanisms. This paper studies how to identify all
authentic connections and estimate system parameters of a recurrent network,
given a sequence of node observations. This task becomes extremely challenging
in modern network applications, because the available observations are usually
very noisy and limited, and the associated dynamical system is strongly
nonlinear. By formulating the problem as multivariate sparse sigmoidal
regression, we develop simpletoimplement network learning algorithms, with
rigorous convergence guarantee in theory, for a variety of sparsitypromoting
penalty forms. A quantile variant of progressive recurrent network screening is
proposed for efficient computation and allows for direct cardinality control of
network topology in estimation. Moreover, we investigate recurrent network
stability conditions in Lyapunov's sense, and integrate such stability
constraints into sparse network learning. Experiments show excellent
performance of the proposed algorithms in network topology identification and
forecasting.

We introduce and study the Group SquareRoot Lasso (GSRL) method for
estimation in high dimensional sparse regression models with group structure.
The new estimator minimizes the square root of the residual sum of squares plus
a penalty term proportional to the sum of the Euclidean norms of groups of the
regression parameter vector. The net advantage of the method over the existing
Group Lasso (GL)type procedures consists in the form of the proportionality
factor used in the penalty term, which for GSRL is independent of the variance
of the error terms. This is of crucial importance in models with more
parameters than the sample size, when estimating the variance of the noise
becomes as difficult as the original problem. We show that the GSRL estimator
adapts to the unknown sparsity of the regression vector, and has the same
optimal estimation and prediction accuracy as the GL estimators, under the same
minimal conditions on the model. This extends the results recently established
for the SquareRoot Lasso, for sparse regression without group structure.
Moreover, as a new type of result for SquareRoot Lasso methods, with or
without groups, we study correct pattern recovery, and show that it can be
achieved under conditions similar to those needed by the Lasso or
GroupLassotype methods, but with a simplified tuning strategy. We implement
our method via a new algorithm, with proved convergence properties, which,
unlike existing methods, scales well with the dimension of the problem. Our
simulation studies support strongly our theoretical findings.

Recently, sparsitybased algorithms are proposed for superresolution
spectrum estimation. However, to achieve adequately high resolution in
realworld signal analysis, the dictionary atoms have to be close to each other
in frequency, thereby resulting in a coherent design. The popular convex
compressed sensing methods break down in presence of high coherence and large
noise. We propose a new regularization approach to handle model collinearity
and obtain parsimonious frequency selection simultaneously. It takes advantage
of the pairing structure of sine and cosine atoms in the frequency dictionary.
A probabilistic spectrum screening is also developed for fast computation in
high dimensions. A dataresampling version of highdimensional Bayesian
Information Criterion is used to determine the regularization parameters.
Experiments show the efficacy and efficiency of the proposed algorithms in
challenging situations with small sample size, high frequency resolution, and
low signaltonoise ratio.

We propose dimension reduction methods for sparse, highdimensional
multivariate response regression models. Both the number of responses and that
of the predictors may exceed the sample size. Sometimes viewed as
complementary, predictor selection and rank reduction are the most popular
strategies for obtaining lowerdimensional approximations of the parameter
matrix in such models. We show in this article that important gains in
prediction accuracy can be obtained by considering them jointly. We motivate a
new class of sparse multivariate regression models, in which the coefficient
matrix has low rank and zero rows or can be well approximated by such a matrix.
Next, we introduce estimators that are based on penalized least squares, with
novel penalties that impose simultaneous row and rank restrictions on the
coefficient matrix. We prove that these estimators indeed adapt to the unknown
matrix sparsity and have fast rates of convergence. We support our theoretical
results with an extensive simulation study and two data analyses.

Supervised linear feature extraction can be achieved by fitting a reduced
rank multivariate model. This paper studies rank penalized and rank constrained
vector generalized linear models. From the perspective of thresholding rules,
we build a framework for fitting singular value penalized models and use it for
feature extraction. Through solving the rank constraint form of the problem, we
propose progressive feature space reduction for fast computation in high
dimensions with little performance loss. A novel projective crossvalidation is
proposed for parameter tuning in such nonconvex setups. Real data applications
are given to show the power of the methodology in supervised dimension
reduction and feature extraction.

We provide a simple method and relevant theoretical analysis for efficiently
estimating higherorder lp distances. While the analysis mainly focuses on l4,
our methodology extends naturally to p = 6,8,10..., (i.e., when p is even).
Distancebased methods are popular in machine learning. In largescale
applications, storing, computing, and retrieving the distances can be both
space and time prohibitive. Efficient algorithms exist for estimating lp
distances if 0 < p <= 2. The task for p > 2 is known to be difficult. Our work
partially fills this gap.

Highdimensional data pose challenges in statistical learning and modeling.
Sometimes the predictors can be naturally grouped where pursuing the
betweengroup sparsity is desired. Collinearity may occur in realworld
highdimensional applications where the popular $l_1$ technique suffers from
both selection inconsistency and prediction inaccuracy. Moreover, the problems
of interest often go beyond Gaussian models. To meet these challenges,
nonconvex penalized generalized linear models with grouped predictors are
investigated and a simpletoimplement algorithm is proposed for computation. A
rigorous theoretical result guarantees its convergence and provides tight
preliminary scaling. This framework allows for grouped predictors and nonconvex
penalties, including the discrete $l_0$ and the `$l_0+l_2$' type penalties.
Penalty design and parameter tuning for nonconvex penalties are examined.
Applications of superresolution spectrum estimation in signal processing and
cancer classification with joint gene selection in bioinformatics show the
performance improvement by nonconvex penalized estimation.

This paper studies the outlier detection problem from the point of view of
penalized regressions. Our regression model adds one mean shift parameter for
each of the $n$ data points. We then apply a regularization favoring a sparse
vector of mean shift parameters. The usual $L_1$ penalty yields a convex
criterion, but we find that it fails to deliver a robust estimator. The $L_1$
penalty corresponds to soft thresholding. We introduce a thresholding (denoted
by $\Theta$) based iterative procedure for outlier detection ($\Theta$IPOD). A
version based on hard thresholding correctly identifies outliers on some hard
test problems. We find that $\Theta$IPOD is much faster than iteratively
reweighted least squares for large data because each iteration costs at most
$O(np)$ (and sometimes much less) avoiding an $O(np^2)$ least squares estimate.
We describe the connection between $\Theta$IPOD and $M$estimators. Our
proposed method has one tuning parameter with which to both identify outliers
and estimate regression coefficients. A datadependent choice can be made based
on BIC. The tuned $\Theta$IPOD shows outstanding performance in identifying
outliers in various situations in comparison to other existing approaches. This
methodology extends to highdimensional modeling with $p\gg n$, if both the
coefficient vector and the outlier pattern are sparse.

We introduce a new criterion, the Rank Selection Criterion (RSC), for
selecting the optimal reduced rank estimator of the coefficient matrix in
multivariate response regression models. The corresponding RSC estimator
minimizes the Frobenius norm of the fit plus a regularization term proportional
to the number of parameters in the reduced rank model. The rank of the RSC
estimator provides a consistent estimator of the rank of the coefficient
matrix; in general, the rank of our estimator is a consistent estimate of the
effective rank, which we define to be the number of singular values of the
target matrix that are appropriately large. The consistency results are valid
not only in the classic asymptotic regime, when $n$, the number of responses,
and $p$, the number of predictors, stay bounded, and $m$, the number of
observations, grows, but also when either, or both, $n$ and $p$ grow, possibly
much faster than $m$. We establish minimax optimal bounds on the mean squared
errors of our estimators. Our finite sample performance bounds for the RSC
estimator show that it achieves the optimal balance between the approximation
error and the penalty term. Furthermore, our procedure has very low
computational complexity, linear in the number of candidate models, making it
particularly appealing for large scale problems. We contrast our estimator with
the nuclear norm penalized least squares (NNP) estimator, which has an
inherently higher computational complexity than RSC, for multivariate
regression models. We show that NNP has estimation properties similar to those
of RSC, albeit under stronger conditions. However, it is not as parsimonious as
RSC. We offer a simple correction of the NNP estimator which leads to
consistent rank estimation.

This paper discusses a class of thresholdingbased iterative selection
procedures (TISP) for model selection and shrinkage. People have long before
noticed the weakness of the convex $l_1$constraint (or the softthresholding)
in wavelets and have designed many different forms of nonconvex penalties to
increase model sparsity and accuracy. But for a nonorthogonal regression
matrix, there is great difficulty in both investigating the performance in
theory and solving the problem in computation. TISP provides a simple and
efficient way to tackle this so that we successfully borrow the rich results in
the orthogonal design to solve the nonconvex penalized regression for a general
design matrix. Our starting point is, however, thresholding rules rather than
penalty functions. Indeed, there is a universal connection between them. But a
drawback of the latter is its nonunique form, and our approach greatly
facilitates the computation and the analysis. In fact, we are able to build the
convergence theorem and explore theoretical properties of the selection and
estimation via TISP nonasymptotically. More importantly, a novel HybridTISP is
proposed based on hardthresholding and ridgethresholding. It provides a
fusion between the $l_0$penalty and the $l_2$penalty, and adaptively achieves
the right balance between shrinkage and selection in statistical modeling. In
practice, HybridTISP shows superior performance in testerror and is
parsimonious.