
Vector autoregression (VAR) is a fundamental tool for modeling multivariate
time series. However, as the number of component series is increased, the VAR
model becomes overparameterized. Several authors have addressed this issue by
incorporating regularized approaches, such as the lasso in VAR estimation.
Traditional approaches address overparameterization by selecting a low lag
order, based on the assumption of short range dependence, assuming that a
universal lag order applies to all components. Such an approach constrains the
relationship between the components and impedes forecast performance. The
lassobased approaches work much better in highdimensional situations but do
not incorporate the notion of lag order selection.
We propose a new class of hierarchical lag structures (HLag) that embed the
notion of lag selection into a convex regularizer. The key modeling tool is a
group lasso with nested groups which guarantees that the sparsity pattern of
lag coefficients honors the VAR's ordered structure. The HLag framework offers
three structures, which allow for varying levels of flexibility. A simulation
study demonstrates improved performance in forecasting and lag order selection
over previous approaches, and a macroeconomic application further highlights
forecasting improvements as well as HLag's convenient, interpretable output.

The generalized partially linear additive model (GPLAM) is a flexible and
interpretable approach to building predictive models. It combines features in
an additive manner, allowing each to have either a linear or nonlinear effect
on the response. However, the choice of which features to treat as linear or
nonlinear is typically assumed known. Thus, to make a GPLAM a viable approach
in situations in which little is known $a~priori$ about the features, one must
overcome two primary model selection challenges: deciding which features to
include in the model and determining which of these features to treat
nonlinearly. We introduce the sparse partially linear additive model (SPLAM),
which combines model fitting and $both$ of these model selection challenges
into a single convex optimization problem. SPLAM provides a bridge between the
lasso and sparse additive models. Through a statistical oracle inequality and
thorough simulation, we demonstrate that SPLAM can outperform other methods
across a broad spectrum of statistical regimes, including the highdimensional
($p\gg N$) setting. We develop efficient algorithms that are applied to real
data sets with half a million samples and over 45,000 features with excellent
predictive performance.

It is common in modern prediction problems for many predictor variables to be
counts of rarely occurring events. This leads to design matrices in which many
columns are highly sparse. The challenge posed by such "rare features" has
received little attention despite its prevalence in diverse areas, ranging from
natural language processing (e.g., rare words) to biology (e.g., rare species).
We show, both theoretically and empirically, that not explicitly accounting for
the rareness of features can greatly reduce the effectiveness of an analysis.
We next propose a framework for aggregating rare features into denser features
in a flexible manner that creates better predictors of the response. Our
strategy leverages side information in the form of a tree that encodes feature
similarity.
We apply our method to data from TripAdvisor, in which we predict the
numerical rating of a hotel based on the text of the associated review. Our
method achieves high accuracy by making effective use of rare words; by
contrast, the lasso is unable to identify highly predictive words if they are
too rare. A companion R package, called rare, implements our new estimator,
using the alternating direction method of multipliers.

Regularization has become a primary tool for developing reliable estimators
of the covariance matrix in highdimensional settings. To curb the curse of
dimensionality, numerous methods assume that the population covariance (or
inverse covariance) matrix is sparse, while making no particular structural
assumptions on the desired pattern of sparsity. A highlyrelated, yet
complementary, literature studies the specific setting in which the measured
variables have a known ordering, in which case a banded population matrix is
often assumed. While the banded approach is conceptually and computationally
easier than asking for "patternless sparsity," it is only applicable in very
specific situations (such as when data are measured over time or
onedimensional space). This work proposes a generalization of the notion of
bandedness that greatly expands the range of problems in which banded
estimators apply.
We develop convex regularizers occupying the broad middle ground between the
former approach of "patternless sparsity" and the latter reliance on having a
known ordering. Our framework defines bandedness with respect to a known graph
on the measured variables. Such a graph is available in diverse situations, and
we provide a theoretical, computational, and applied treatment of two new
estimators. An R package, called ggb, implements these new methods.

The lasso has been studied extensively as a tool for estimating the
coefficient vector in the highdimensional linear model; however, considerably
less is known about estimating the error variance. Indeed, most wellknown
theoretical properties of the lasso, including recent advances in selective
inference with the lasso, are established under the assumption that the
underlying error variance is known. Yet the error variance in practice is, of
course, unknown. In this paper, we propose the natural lasso estimator for the
error variance, which maximizes a penalized likelihood objective. A key aspect
of the natural lasso is that the likelihood is expressed in terms of the
natural parameterization of the multiparameter exponential family of a Gaussian
with unknown mean and variance. The result is a remarkably simple estimator
with provably good performance in terms of mean squared error. These
theoretical results do not require placing any assumptions on the design matrix
or the true regression coefficients. We also propose a companion estimator,
called the organic lasso, which theoretically does not require tuning of the
regularization parameter. Both estimators do well compared to preexisting
methods, especially in settings where successful recovery of the true support
of the coefficient vector is hard.

The TREX is a recently introduced approach to sparse linear regression. In
contrast to most wellknown approaches to penalized regression, the TREX can be
formulated without the use of tuning parameters. In this paper, we establish
the first known prediction error bounds for the TREX. Additionally, we
introduce extensions of the TREX to a more general class of penalties, and we
provide a bound on the prediction error in this generalized setting. These
results deepen the understanding of TREX from a theoretical perspective and
provide new insights into penalized regression in general.

The Vector AutoRegressive Moving Average (VARMA) model is fundamental to the
theory of multivariate time series; however, in practice, identifiability
issues have led many authors to abandon VARMA modeling in favor of the simpler
Vector AutoRegressive (VAR) model. Such a practice is unfortunate since even
very simple VARMA models can have quite complicated VAR representations. We
narrow this gap with a new optimizationbased approach to VARMA identification
that is built upon the principle of parsimony. Among all equivalent
datagenerating models, we seek the parameterization that is "simplest" in a
certain sense. A userspecified strongly convex penalty is used to measure
model simplicity, and that same penalty is then used to define an estimator
that can be efficiently computed. We show that our estimator converges to a
parsimonious element in the set of all equivalent datagenerating models, in a
double asymptotic regime where the number of component time series is allowed
to grow with sample size. Further, we derive nonasymptotic upper bounds on the
estimation error of our method relative to our specially identified target.
Novel theoretical machinery includes nonasymptotic analysis of infiniteorder
VAR, elastic net estimation under a singular covariance structure of
regressors, and new concentration inequalities for quadratic forms of random
variables from Gaussian time series. We illustrate the competitive performance
of our methods in simulation and several application domains, including
macroeconomic forecasting, demand forecasting, and volatility forecasting.

Demanding sparsity in estimated models has become a routine practice in
statistics. In many situations, we wish to require that the sparsity patterns
attained honor certain problemspecific constraints. Hierarchical sparse
modeling (HSM) refers to situations in which these constraints specify that one
set of parameters be set to zero whenever another is set to zero. In recent
years, numerous papers have developed convex regularizers for this form of
sparsity structure, which arises in many areas of statistics including
interaction modeling, time series analysis, and covariance estimation. In this
paper, we observe that these methods fall into two frameworks, the group lasso
(GL) and latent overlapping group lasso (LOG), which have not been
systematically compared in the context of HSM. The purpose of this paper is to
provide a sidebyside comparison of these two frameworks for HSM in terms of
their statistical properties and computational efficiency. We call special
attention to GL's more aggressive shrinkage of parameters deep in the
hierarchy, a property not shared by LOG. In terms of computation, we introduce
a finitestep algorithm that exactly solves the proximal operator of LOG for a
certain simple HSM structure; we later exploit this to develop a novel
pathbased block coordinate descent scheme for general HSM structures. Both
algorithms greatly improve the computational performance of LOG. Finally, we
compare the two methods in the context of covariance estimation, where we
introduce a new sparselybanded estimator using LOG, which we show achieves the
statistical advantages of an existing GLbased method but is simpler to express
and more efficient to compute.

In many applications, data come with a natural ordering. This ordering can
often induce local dependence among nearby variables. However, in complex data,
the width of this dependence may vary, making simple assumptions such as a
constant neighborhood size unrealistic. We propose a framework for learning
this local dependence based on estimating the inverse of the Cholesky factor of
the covariance matrix. Penalized maximum likelihood estimation of this matrix
yields a simple regression interpretation for local dependence in which
variables are predicted by their neighbors. Our proposed method involves
solving a convex, penalized Gaussian likelihood problem with a hierarchical
group lasso penalty. The problem decomposes into independent subproblems which
can be solved efficiently in parallel using firstorder methods. Our method
yields a sparse, symmetric, positive definite estimator of the precision
matrix, encoding a Gaussian graphical model. We derive theoretical results not
found in existing methods attaining this structure. In particular, our
conditions for signed support recovery and estimation consistency rates in
multiple norms are as mild as those in a regression problem. Empirical results
show our method performing favorably compared to existing methods. We apply our
method to genomic data to flexibly model linkage disequilibrium. Our method is
also applied to improve the performance of discriminant analysis in sound
recording classification.

The vector autoregression (VAR) has long proven to be an effective method for
modeling the joint dynamics of macroeconomic time series as well as
forecasting. A major shortcoming of the VAR that has hindered its applicability
is its heavy parameterization: the parameter space grows quadratically with the
number of series included, quickly exhausting the available degrees of freedom.
Consequently, forecasting using VARs is intractable for lowfrequency,
highdimensional macroeconomic data. However, empirical evidence suggests that
VARs that incorporate more component series tend to result in more accurate
forecasts. Conventional methods that allow for the estimation of large VARs
either tend to require ad hoc subjective specifications or are computationally
infeasible. Moreover, as global economies become more intricately intertwined,
there has been substantial interest in incorporating the impact of stochastic,
unmodeled exogenous variables. Vector autoregression with exogenous variables
(VARX) extends the VAR to allow for the inclusion of unmodeled variables, but
it similarly faces dimensionality challenges.
We introduce the VARXL framework, a structured family of VARX models, and
provide methodology that allows for both efficient estimation and accurate
forecasting in highdimensional analysis. VARXL adapts several prominent
scalar regression regularization techniques to a vector time series context in
order to greatly reduce the parameter space of VAR and VARX models. We also
highlight a compelling extension that allows for shrinking toward reference
models, such as a vector random walk. We demonstrate the efficacy of VARXL in
both low and highdimensional macroeconomic forecasting applications and
simulated data examples. Our methodology is easily reproducible in a publicly
available R package.

The R package BigVAR allows for the simultaneous estimation of
highdimensional time series by applying structured penalties to the
conventional vector autoregression (VAR) and vector autoregression with
exogenous variables (VARX) frameworks. Our methods can be utilized in many
forecasting applications that make use of timedependent data such as
macroeconomics, finance, and internet traffic. Our package extends solution
algorithms from the machine learning and signal processing literatures to a
time dependent setting: selecting the regularization parameter by sequential
cross validation and provides substantial improvements in forecasting
performance over conventional methods. We offer a userfriendly interface that
utilizes R's s4 object class structure which makes our methodology easily
accessible to practicioners.
In this paper, we present an overview of our notation, the models that
comprise BigVAR, and the functionality of our package with a detailed example
using publicly available macroeconomic data. In addition, we present a
simulation study comparing the performance of several procedures that refit the
support selected by a BigVAR procedure according to several variants of least
squares and conclude that refitting generally degrades forecast performance.

The TREX is a recently introduced method for performing sparse
highdimensional regression. Despite its statistical promise as an alternative
to the lasso, squareroot lasso, and scaled lasso, the TREX is computationally
challenging in that it requires solving a nonconvex optimization problem. This
paper shows a remarkable result: despite the nonconvexity of the TREX problem,
there exists a polynomialtime algorithm that is guaranteed to find the global
minimum. This result adds the TREX to a very short list of nonconvex
optimization problems that can be globally optimized (principal components
analysis being a famous example). After deriving and developing this new
approach, we demonstrate that (i) the ability of the preexisting TREX heuristic
to reach the global minimum is strongly dependent on the difficulty of the
underlying statistical problem, (ii) the new polynomialtime algorithm for TREX
permits a novel variable ranking and selection scheme, (iii) this scheme can be
incorporated into a rule that controls the false discovery rate (FDR) of
included features in the model. To achieve this last aim, we provide an
extension of the results of Barber & Candes (2015) to establish that the
knockoff filter framework can be applied to the TREX. This investigation thus
provides both a rare case study of a heuristic for nonconvex optimization and
a novel way of exploiting nonconvexity for statistical inference.

The simulator is an R package that streamlines the process of performing
simulations by creating a common infrastructure that can be easily used and
reused across projects. Methodological statisticians routinely write
simulations to compare their methods to preexisting ones. While developing
ideas, there is a temptation to write "quick and dirty" simulations to try out
ideas. This approach of rapid prototyping is useful but can sometimes backfire
if bugs are introduced. Using the simulator allows one to remove the "dirty"
without sacrificing the "quick." Coding is quick because the statistician
focuses exclusively on those aspects of the simulation that are specific to the
particular paper being written. Code written with the simulator is succinct,
highly readable, and easily shared with others. The modular nature of
simulations written with the simulator promotes code reusability, which saves
time and facilitates reproducibility. The syntax of the simulator leads to
simulation code that is easily humanreadable. Other benefits of using the
simulator include the ability to "step in" to a simulation and change one
aspect without having to rerun the entire simulation from scratch, the
straightforward integration of parallel computing into simulations, and the
ability to rapidly generate plots, tables, and reports with minimal effort.

We consider the testing of all pairwise interactions in a twoclass problem
with many features. We devise a hierarchical testing framework that considers
an interaction only when one or more of its constituent features has a nonzero
main effect. The test is based on a convex optimization framework that
seamlessly considers main effects and interactions together. We show  both in
simulation and on a genomic data set from the SAPPHIRe study  a potential gain
in power and interpretability over a standard (nonhierarchical) interaction
test.

We introduce a new sparse estimator of the covariance matrix for
highdimensional models in which the variables have a known ordering. Our
estimator, which is the solution to a convex optimization problem, is
equivalently expressed as an estimator which tapers the sample covariance
matrix by a Toeplitz, sparselybanded, dataadaptive matrix. As a result of
this adaptivity, the convex banding estimator enjoys theoretical optimality
properties not attained by previous banding or tapered estimators. In
particular, our convex banding estimator is minimax rate adaptive in Frobenius
and operator norms, up to log factors, over commonlystudied classes of
covariance matrices, and over more general classes. Furthermore, it correctly
recovers the bandwidth when the true covariance is exactly banded. Our convex
formulation admits a simple and efficient algorithm. Empirical studies
demonstrate its practical effectiveness and illustrate that our exactlybanded
estimator works well even when the true covariance matrix is only close to a
banded matrix, confirming our theoretical results. Our method compares
favorably with all existing methods, in terms of accuracy and speed. We
illustrate the practical merits of the convex banding estimator by showing that
it can be used to improve the performance of discriminant analysis for
classifying sound recordings.

We add a set of convex constraints to the lasso to produce sparse interaction
models that honor the hierarchy restriction that an interaction only be
included in a model if one or both variables are marginally important. We give
a precise characterization of the effect of this hierarchy constraint, prove
that hierarchy holds with probability one and derive an unbiased estimate for
the degrees of freedom of our estimator. A bound on this estimate reveals the
amount of fitting "saved" by the hierarchy constraint. We distinguish between
parameter sparsity  the number of nonzero coefficients  and practical
sparsity  the number of raw variables one must measure to make a new
prediction. Hierarchy focuses on the latter, which is more closely tied to
important data collection concerns such as cost, time and effort. We develop an
algorithm, available in the R package hierNet, and perform an empirical study
of our method.

Prototype methods seek a minimal subset of samples that can serve as a
distillation or condensed view of a data set. As the size of modern data sets
grows, being able to present a domain specialist with a short list of
"representative" samples chosen from the data set is of increasing
interpretative value. While much recent statistical research has been focused
on producing sparseinthevariables methods, this paper aims at achieving
sparsity in the samples. We discuss a method for selecting prototypes in the
classification setting (in which the samples fall into known discrete
categories). Our method of focus is derived from three basic properties that we
believe a good prototype set should satisfy. This intuition is translated into
a set cover optimization problem, which we solve approximately using standard
approaches. While prototype selection is usually viewed as purely a means
toward building an efficient classifier, in this paper we emphasize the
inherent value of having a set of prototypical elements. That said, by using
the nearestneighbor rule on the set of prototypes, we can of course discuss
our method as a classifier as well.

We consider rules for discarding predictors in lasso regression and related
problems, for computational efficiency. El Ghaoui et al (2010) propose "SAFE"
rules that guarantee that a coefficient will be zero in the solution, based on
the inner products of each predictor with the outcome. In this paper we propose
strong rules that are not foolproof but rarely fail in practice. These can be
complemented with simple checks of the Karush KuhnTucker (KKT) conditions to
provide safe rules that offer substantial speed and space savings in a variety
of statistical convex optimization problems.

The CUR decomposition provides an approximation of a matrix $X$ that has low
reconstruction error and that is sparse in the sense that the resulting
approximation lies in the span of only a few columns of $X$. In this regard, it
appears to be similar to many sparse PCA methods. However, CUR takes a
randomized algorithmic approach, whereas most sparse PCA methods are framed as
convex optimization problems. In this paper, we try to understand CUR from a
sparse optimization viewpoint. We show that CUR is implicitly optimizing a
sparse regression objective and, furthermore, cannot be directly cast as a
sparse PCA method. We also observe that the sparsity attained by CUR possesses
an interesting structure, which leads us to formulate a sparse PCA method that
achieves a CURlike sparsity.

We introduce a new nearestprototype classifier, the prototype vector machine
(PVM). It arises from a combinatorial optimization problem which we cast as a
variant of the set cover problem. We propose two algorithms for approximating
its solution. The PVM selects a relatively small number of representative
points which can then be used for classification. It contains 1NN as a special
case. The method is compatible with any dissimilarity measure, making it
amenable to situations in which the data are not embedded in an underlying
feature space or in which using a nonEuclidean metric is desirable. Indeed, we
demonstrate on the much studied ZIP code data how the PVM can reap the benefits
of a problemspecific metric. In this example, the PVM outperforms the highly
successful 1NN with tangent distance, and does so retaining fewer than half of
the data points. This example highlights the strengths of the PVM in yielding a
lowerror, highly interpretable model. Additionally, we apply the PVM to a
protein classification problem in which a kernelbased distance is used.