
Modelbased clustering defines population level clusters relative to a model
that embeds notions of similarity. Algorithms tailored to such models yield
estimated clusters with a clear statistical interpretation. We take this view
here and introduce the class of Gblock covariance models as a background model
for variable clustering. In such models, two variables in a cluster are deemed
similar if they have similar associations will all other variables. This can
arise, for instance, when groups of variables are noise corrupted versions of
the same latent factor. We quantify the difficulty of clustering data generated
from a Gblock covariance model in terms of cluster proximity, measured with
respect to two related, but different, cluster separation metrics. We derive
minimax cluster separation thresholds, which are the metric values below which
no algorithm can recover the modeldefined clusters exactly, and show that they
are different for the two metrics. We therefore develop two algorithms, COD and
PECOK, tailored to Gblock covariance models, and study their
minimaxoptimality with respect to each metric. Of independent interest is the
fact that the analysis of the PECOK algorithm, which is based on a corrected
convex relaxation of the popular Kmeans algorithm, provides the first
statistical analysis of such algorithms for variable clustering. Additionally,
we contrast our methods with another popular clustering method, spectral
clustering, specialized to variable clustering, and show that ensuring exact
cluster recovery via this method requires clusters to have a higher separation,
relative to the minimax threshold. Extensive simulation studies, as well as our
data analyses, confirm the applicability of our approach.

This work introduces a novel estimation method, called LOVE, of the entries
and structure of a loading matrix A in a sparse latent factor model X = AZ + E,
for an observable random vector X in Rp, with correlated unobservable factors Z
\in RK, with K unknown, and independent noise E. Each row of A is scaled and
sparse. In order to identify the loading matrix A, we require the existence of
pure variables, which are components of X that are associated, via A, with one
and only one latent factor. Despite the fact that the number of factors K, the
number of the pure variables, and their location are all unknown, we only
require a mild condition on the covariance matrix of Z, and a minimum of only
two pure variables per latent factor to show that A is uniquely defined, up to
signed permutations. Our proofs for model identifiability are constructive, and
lead to our novel estimation method of the number of factors and of the set of
pure variables, from a sample of size n of observations on X. This is the first
step of our LOVE algorithm, which is optimizationfree, and has low
computational complexity of order p2. The second step of LOVE is an easily
implementable linear program that estimates A. We prove that the resulting
estimator is minimax rate optimal up to logarithmic factors in p. The model
structure is motivated by the problem of overlapping variable clustering,
ubiquitous in data science. We define the population level clusters as groups
of those components of X that are associated, via the sparse matrix A, with the
same unobservable latent factor, and multifactor association is allowed.
Clusters are respectively anchored by the pure variables, and form overlapping
subgroups of the pdimensional random vector X. The Latent model approach to
OVErlapping clustering is reflected in the name of our algorithm, LOVE.

The problem of variable clustering is that of grouping similar components of
a $p$dimensional vector $X=(X_{1},\ldots,X_{p})$, and estimating these groups
from $n$ independent copies of $X$. When cluster similarity is defined via
$G$latent models, in which groups of $X$variables have a common latent
generator, and groups are relative to a partition $G$ of the index set $\{1,
\ldots, p\}$, the most natural clustering strategy is $K$means. We explain why
this strategy cannot lead to perfect cluster recovery and offer a correction,
based on semidefinite programing, that can be viewed as a penalized convex
relaxation of $K$means (PECOK). We introduce a cluster separation measure
tailored to $G$latent models, and derive its minimax lower bound for perfect
cluster recovery. The clusters estimated by PECOK are shown to recover $G$ at a
near minimax optimal cluster separation rate, a result that holds true even if
$K$, the number of clusters, is estimated adaptively from the data. We compare
PECOK with appropriate corrections of spectral clusteringtype procedures, and
show that the former outperforms the latter for perfect cluster recovery of
minimally separated clusters.

This work provides a unified analysis of the properties of the sample
covariance matrix $\Sigma_n$ over the class of $p\times p$ population
covariance matrices $\Sigma$ of reduced effective rank $r_e(\Sigma)$. This
class includes scaled factor models and covariance matrices with decaying
spectrum. We consider $r_e(\Sigma)$ as a measure of matrix complexity, and
obtain sharp minimax rates on the operator and Frobenius norm of
$\Sigma_n\Sigma$, as a function of $r_e(\Sigma)$ and $\\Sigma\_2$, the
operator norm of $\Sigma$. With guidelines offered by the optimal rates, we
define classes of matrices of reduced effective rank over which $\Sigma_n$ is
an accurate estimator. Within the framework of these classes, we perform a
detailed finite sample theoretical analysis of the merits and limitations of
the empirical scree plot procedure routinely used in PCA. We show that
identifying jumps in the empirical spectrum that consistently estimate jumps in
the spectrum of $\Sigma$ is not necessarily informative for other goals, for
instance for the selection of those sample eigenvalues and eigenvectors that
are consistent estimates of their population counterparts. The scree plot
method can still be used for selecting consistent eigenvalues, for appropriate
threshold levels. We provide a threshold construction and also give a rule for
checking the consistency of the corresponding sample eigenvectors. We
specialize these results and analysis to population covariance matrices with
polynomially decaying spectra, and extend it to covariance operators with
polynomially decaying spectra. An application to fPCA illustrates how our
results can be used in functional data analysis.

We introduce a new sparse estimator of the covariance matrix for
highdimensional models in which the variables have a known ordering. Our
estimator, which is the solution to a convex optimization problem, is
equivalently expressed as an estimator which tapers the sample covariance
matrix by a Toeplitz, sparselybanded, dataadaptive matrix. As a result of
this adaptivity, the convex banding estimator enjoys theoretical optimality
properties not attained by previous banding or tapered estimators. In
particular, our convex banding estimator is minimax rate adaptive in Frobenius
and operator norms, up to log factors, over commonlystudied classes of
covariance matrices, and over more general classes. Furthermore, it correctly
recovers the bandwidth when the true covariance is exactly banded. Our convex
formulation admits a simple and efficient algorithm. Empirical studies
demonstrate its practical effectiveness and illustrate that our exactlybanded
estimator works well even when the true covariance matrix is only close to a
banded matrix, confirming our theoretical results. Our method compares
favorably with all existing methods, in terms of accuracy and speed. We
illustrate the practical merits of the convex banding estimator by showing that
it can be used to improve the performance of discriminant analysis for
classifying sound recordings.

This paper considers the banding estimator proposed in Bickel and Levina
(2008) for estimation of large covariance matrices. We prove that the banding
estimator achieves rateoptimality under the operator norm, for a class of
approximately banded covariance matrices, improving the existing results in
Bickel and Levina (2008). In addition, we propose a Stein's unbiased risk
estimate (Sure)type approach for selecting the bandwidth for the banding
estimator. Simulations indicate that the Suretuned banding estimator
outperforms competing estimators.

We introduce and study the Group SquareRoot Lasso (GSRL) method for
estimation in high dimensional sparse regression models with group structure.
The new estimator minimizes the square root of the residual sum of squares plus
a penalty term proportional to the sum of the Euclidean norms of groups of the
regression parameter vector. The net advantage of the method over the existing
Group Lasso (GL)type procedures consists in the form of the proportionality
factor used in the penalty term, which for GSRL is independent of the variance
of the error terms. This is of crucial importance in models with more
parameters than the sample size, when estimating the variance of the noise
becomes as difficult as the original problem. We show that the GSRL estimator
adapts to the unknown sparsity of the regression vector, and has the same
optimal estimation and prediction accuracy as the GL estimators, under the same
minimal conditions on the model. This extends the results recently established
for the SquareRoot Lasso, for sparse regression without group structure.
Moreover, as a new type of result for SquareRoot Lasso methods, with or
without groups, we study correct pattern recovery, and show that it can be
achieved under conditions similar to those needed by the Lasso or
GroupLassotype methods, but with a simplified tuning strategy. We implement
our method via a new algorithm, with proved convergence properties, which,
unlike existing methods, scales well with the dimension of the problem. Our
simulation studies support strongly our theoretical findings.

We propose dimension reduction methods for sparse, highdimensional
multivariate response regression models. Both the number of responses and that
of the predictors may exceed the sample size. Sometimes viewed as
complementary, predictor selection and rank reduction are the most popular
strategies for obtaining lowerdimensional approximations of the parameter
matrix in such models. We show in this article that important gains in
prediction accuracy can be obtained by considering them jointly. We motivate a
new class of sparse multivariate regression models, in which the coefficient
matrix has low rank and zero rows or can be well approximated by such a matrix.
Next, we introduce estimators that are based on penalized least squares, with
novel penalties that impose simultaneous row and rank restrictions on the
coefficient matrix. We prove that these estimators indeed adapt to the unknown
matrix sparsity and have fast rates of convergence. We support our theoretical
results with an extensive simulation study and two data analyses.

We introduce a new criterion, the Rank Selection Criterion (RSC), for
selecting the optimal reduced rank estimator of the coefficient matrix in
multivariate response regression models. The corresponding RSC estimator
minimizes the Frobenius norm of the fit plus a regularization term proportional
to the number of parameters in the reduced rank model. The rank of the RSC
estimator provides a consistent estimator of the rank of the coefficient
matrix; in general, the rank of our estimator is a consistent estimate of the
effective rank, which we define to be the number of singular values of the
target matrix that are appropriately large. The consistency results are valid
not only in the classic asymptotic regime, when $n$, the number of responses,
and $p$, the number of predictors, stay bounded, and $m$, the number of
observations, grows, but also when either, or both, $n$ and $p$ grow, possibly
much faster than $m$. We establish minimax optimal bounds on the mean squared
errors of our estimators. Our finite sample performance bounds for the RSC
estimator show that it achieves the optimal balance between the approximation
error and the penalty term. Furthermore, our procedure has very low
computational complexity, linear in the number of candidate models, making it
particularly appealing for large scale problems. We contrast our estimator with
the nuclear norm penalized least squares (NNP) estimator, which has an
inherently higher computational complexity than RSC, for multivariate
regression models. We show that NNP has estimation properties similar to those
of RSC, albeit under stronger conditions. However, it is not as parsimonious as
RSC. We offer a simple correction of the NNP estimator which leads to
consistent rank estimation.

This paper studies sparse density estimation via $\ell_1$ penalization
(SPADES). We focus on estimation in highdimensional mixture models and
nonparametric adaptive density estimation. We show, respectively, that SPADES
can recover, with high probability, the unknown components of a mixture of
probability densities and that it yields minimax adaptive density estimates.
These results are based on a general sparsity oracle inequality that the SPADES
estimates satisfy. We offer a data driven method for the choice of the tuning
parameter used in the construction of SPADES. The method uses the generalized
bisection method first introduced in \citebb09. The suggested procedure
bypasses the need for a grid search and offers substantial computational
savings. We complement our theoretical results with a simulation study that
employs this method for approximations of one and twodimensional densities
with mixtures. The numerical results strongly support our theoretical findings.

Dimension reduction and variable selection are performed routinely in
casecontrol studies, but the literature on the theoretical aspects of the
resulting estimates is scarce. We bring our contribution to this literature by
studying estimators obtained via L1 penalized likelihood optimization. We show
that the optimizers of the L1 penalized retrospective likelihood coincide with
the optimizers of the L1 penalized prospective likelihood. This extends the
results of Prentice and Pyke (1979), obtained for nonregularized likelihoods.
We establish both the supnorm consistency of the odds ratio, after model
selection, and the consistency of subset selection of our estimators. The
novelty of our theoretical results consists in the study of these properties
under the casecontrol sampling scheme. Our results hold for selection
performed over a large collection of candidate variables, with cardinality
allowed to depend and be greater than the sample size. We complement our
theoretical results with a novel approach of determining data driven tuning
parameters, based on the bisection method. The resulting procedure offers
significant computational savings when compared with grid search based methods.
All our numerical experiments support strongly our theoretical findings.

This paper investigates correct variable selection in finite samples via
$\ell_1$ and $\ell_1+\ell_2$ type penalization schemes. The asymptotic
consistency of variable selection immediately follows from this analysis. We
focus on logistic and linear regression models. The following questions are
central to our paper: given a level of confidence $1\delta$, under which
assumptions on the design matrix, for which strength of the signal and for what
values of the tuning parameters can we identify the true model at the given
level of confidence? Formally, if $\widehat{I}$ is an estimate of the true
variable set $I^*$, we study conditions under which
$\mathbb{P}(\widehat{I}=I^*)\geq 1\delta$, for a given sample size $n$, number
of parameters $M$ and confidence $1\delta$. We show that in identifiable
models, both methods can recover coefficients of size $\frac{1}{\sqrt{n}}$, up
to small multiplicative constants and logarithmic factors in $M$ and
$\frac{1}{\delta}$. The advantage of the $\ell_1+\ell_2$ penalization over the
$\ell_1$ is minor for the variable selection problem, for the models we
consider here. Whereas the former estimates are unique, and become more stable
for highly correlated data matrices as one increases the tuning parameter of
the $\ell_2$ part, too large an increase in this parameter value may preclude
variable selection.

In this article we investigate consistency of selection in regression models
via the popular Lasso method. Here we depart from the traditional linear
regression assumption and consider approximations of the regression function
$f$ with elements of a given dictionary of $M$ functions. The target for
consistency is the index set of those functions from this dictionary that
realize the most parsimonious approximation to $f$ among all linear
combinations belonging to an $L_2$ ball centered at $f$ and of radius
$r_{n,M}^2$. In this framework we show that a consistent estimate of this index
set can be derived via $\ell_1$ penalized least squares, with a data dependent
penalty and with tuning sequence $r_{n,M}>\sqrt{\log(Mn)/n}$, where $n$ is the
sample size. Our results hold for any $1\leq M\leq n^{\gamma}$, for any
$\gamma>0$.

This paper studies statistical aggregation procedures in the regression
setting. A motivating factor is the existence of many different methods of
estimation, leading to possibly competing estimators. We consider here three
different types of aggregation: model selection (MS) aggregation, convex (C)
aggregation and linear (L) aggregation. The objective of (MS) is to select the
optimal single estimator from the list; that of (C) is to select the optimal
convex combination of the given estimators; and that of (L) is to select the
optimal linear combination of the given estimators. We are interested in
evaluating the rates of convergence of the excess risks of the estimators
obtained by these procedures. Our approach is motivated by recently published
minimax results [Nemirovski, A. (2000). Topics in nonparametric statistics.
Lectures on Probability Theory and Statistics (SaintFlour, 1998). Lecture
Notes in Math. 1738 85277. Springer, Berlin; Tsybakov, A. B. (2003). Optimal
rates of aggregation. Learning Theory and Kernel Machines. Lecture Notes in
Artificial Intelligence 2777 303313. Springer, Heidelberg]. There exist
competing aggregation procedures achieving optimal convergence rates for each
of the (MS), (C) and (L) cases separately. Since these procedures are not
directly comparable with each other, we suggest an alternative solution. We
prove that all three optimal rates, as well as those for the newly introduced
(S) aggregation (subset selection), are nearly achieved via a single
``universal'' aggregation procedure. The procedure consists of mixing the
initial estimators with weights obtained by penalized least squares. Two
different penalties are considered: one of them is of the BIC type, the second
one is a datadependent $\ell_1$type penalty.

This paper studies oracle properties of $\ell_1$penalized least squares in
nonparametric regression setting with random design. We show that the penalized
least squares estimator satisfies sparsity oracle inequalities, i.e., bounds in
terms of the number of nonzero components of the oracle vector. The results
are valid even when the dimension of the model is (much) larger than the sample
size and the regression matrix is not positive definite. They can be applied to
highdimensional linear regression, to nonparametric adaptive regression
estimation and to the problem of aggregation of arbitrary estimators.

This paper studies statistical aggregation procedures in regression setting.
A motivating factor is the existence of many different methods of estimation,
leading to possibly competing estimators.
We consider here three different types of aggregation: model selection (MS)
aggregation, convex (C) aggregation and linear (L) aggregation. The objective
of (MS) is to select the optimal single estimator from the list; that of (C) is
to select the optimal convex combination of the given estimators; and that of
(L) is to select the optimal linear combination of the given estimators. We are
interested in evaluating the rates of convergence of the excess risks of the
estimators obtained by these procedures. Our approach is motivated by recent
minimax results in Nemirovski (2000) and Tsybakov (2003).
There exist competing aggregation procedures achieving optimal convergence
separately for each one of (MS), (C) and (L) cases. Since the bounds in these
results are not directly comparable with each other, we suggest an alternative
solution. We prove that all the three optimal bounds can be nearly achieved via
a single "universal" aggregation procedure. We propose such a procedure which
consists in mixing of the initial estimators with the weights obtained by
penalized least squares. Two different penalities are considered: one of them
is related to hard thresholding techniques, the second one is a data dependent
$L_1$type penalty. Consequently, our method can be endorsed by both the
proponents of model selection and the advocates of model averaging.

This paper presents a model selection technique of estimation in
semiparametric regression models of the type
Y_i=\beta^{\prime}\underbarX_i+f(T_i)+W_i, i=1,...,n. The parametric and
nonparametric components are estimated simultaneously by this procedure.
Estimation is based on a collection of finitedimensional models, using a
penalized least squares criterion for selection. We show that by tailoring the
penalty terms developed for nonparametric regression to semiparametric models,
we can consistently estimate the subset of nonzero coefficients of the linear
part. Moreover, the selected estimator of the linear component is
asymptotically normal.