
We consider the estimation and inference of graphical models that
characterize the dependency structure of highdimensional tensorvalued data.
To facilitate the estimation of the precision matrix corresponding to each way
of the tensor, we assume the data follow a tensor normal distribution whose
covariance has a Kronecker product structure. A critical challenge in the
estimation and inference of this model is the fact that its penalized maximum
likelihood estimation involves minimizing a nonconvex objective function. To
address it, this paper makes two contributions: (i) In spite of the
nonconvexity of this estimation problem, we prove that an alternating
minimization algorithm, which iteratively estimates each sparse precision
matrix while fixing the others, attains an estimator with an optimal
statistical rate of convergence. (ii) We propose a debiased statistical
inference procedure for testing hypotheses on the true support of the sparse
precision matrices, and employ it for testing a growing number of hypothesis
with false discovery rate (FDR) control. The asymptotic normality of our test
statistic and the consistency of FDR control procedure are established. Our
theoretical results are backed up by thorough numerical studies and our real
applications on neuroimaging studies of Autism spectrum disorder and users'
advertising click analysis bring new scientific findings and business insights.
The proposed methods are encoded into a publicly available R package Tlasso.

Tuning parameter selection is of critical importance for kernel ridge
regression. To this date, data driven tuning method for divideandconquer
kernel ridge regression (dKRR) has been lacking in the literature, which
limits the applicability of dKRR for large data sets. In this paper, by
modifying the Generalized Crossvalidation (GCV, Wahba, 1990) score, we propose
a distributed Generalized CrossValidation (dGCV) as a datadriven tool for
selecting the tuning parameters in dKRR. Not only the proposed dGCV is
computationally scalable for massive data sets, it is also shown, under mild
conditions, to be asymptotically optimal in the sense that minimizing the dGCV
score is equivalent to minimizing the true global conditional empirical loss of
the averaged function estimator, extending the existing optimality results of
GCV to the divideandconquer framework.

The increased availability of massive data sets provides a unique opportunity
to discover subtle patterns in their distributions, but also imposes
overwhelming computational challenges. To fully utilize the information
contained in big data, we propose a twostep procedure: (i) estimate
conditional quantile functions at different levels in a parallel computing
environment; (ii) construct a conditional quantile regression process through
projection based on these estimated quantile curves. Our general quantile
regression framework covers both linear models with fixed or growing dimension
and series approximation models. We prove that the proposed procedure does not
sacrifice any statistical inferential accuracy provided that the number of
distributed computing units and quantile levels are chosen properly. In
particular, a sharp upper bound for the former and a sharp lower bound for the
latter are derived to capture the minimal computational cost from a statistical
perspective. As an important application, the statistical inference on
conditional distribution functions is considered. Moreover, we propose
computationally efficient approaches to conducting inference in the distributed
estimation setting described above. Those approaches directly utilize the
availability of estimators from subsamples and can be carried out at almost no
additional computational cost. Simulations confirm our statistical inferential
theory.

A common challenge in nonparametric inference is its high computational
complexity when data volume is large. In this paper, we develop computationally
efficient nonparametric testing by employing a random projection strategy. In
the specific kernel ridge regression setup, a simple distancebased test
statistic is proposed. Notably, we derive the minimum number of random
projections that is sufficient for achieving testing optimality in terms of the
minimax rate. An adaptive testing procedure is further established without
prior knowledge of regularity. One technical contribution is to establish upper
bounds for a range of tail sums of empirical kernel eigenvalues. Simulations
and real data analysis are conducted to support our theory.

In this paper, we propose a general framework for sparse and lowrank tensor
estimation from cubic sketchings. A twostage nonconvex implementation is
developed based on sparse tensor decomposition and thresholded gradient
descent, which ensures exact recovery in the noiseless case and stable recovery
in the noisy case with high probability. The nonasymptotic analysis sheds
light on an interplay between optimization error and statistical error. The
proposed procedure is shown to be rateoptimal under certain conditions. As a
technical byproduct, novel highorder concentration inequalities are derived
for studying highmoment subGaussian tensors. An interesting tensor
formulation illustrates the potential application to highorder interaction
pursuit in highdimensional linear regression.

In this paper, we derive minimax rates for estimating both parametric and
nonparametric components in partially linear additive models with high
dimensional sparse vectors and smooth functional components. The minimax lower
bound for Euclidean components is the typical sparse estimation rate that is
independent of nonparametric smoothness indices. However, the minimax lower
bound for each component function exhibits an interplay between the
dimensionality and sparsity of the parametric component and the smoothness of
the relevant nonparametric component. Indeed, the minimax risk for smooth
nonparametric estimation can be slowed down to the sparse estimation rate
whenever the smoothness of the nonparametric component or dimensionality of the
parametric component is suffciently large. In the above setting, we demonstrate
that penalized least square estimators can nearly achieve minimax lower bounds.

We consider joint estimation of multiple graphical models arising from
heterogeneous and highdimensional observations. Unlike most previous
approaches which assume that the cluster structure is given in advance, an
appealing feature of our method is to learn cluster structure while estimating
heterogeneous graphical models. This is achieved via a high dimensional version
of Expectation Conditional Maximization (ECM) algorithm (Meng and Rubin, 1993).
A joint graphical lasso penalty is imposed on the conditional maximization step
to extract both homogeneity and heterogeneity components across all clusters.
Our algorithm is computationally efficient due to fast sparse learning routines
and can be implemented without unsupervised learning knowledge. The superior
performance of our method is demonstrated by extensive experiments and its
application to a Glioblastoma cancer dataset reveals some new insights in
understanding the Glioblastoma cancer. In theory, a nonasymptotic error bound
is established for the output directly from our high dimensional ECM algorithm,
and it consists of two quantities: statistical error (statistical accuracy) and
optimization error (computational complexity). Such a result gives a
theoretical guideline in terminating our ECM iterations.

We propose two semiparametric versions of the debiased Lasso procedure for
the model $Y_i = X_i\beta_0 + g_0(Z_i) + \epsilon_i$, where $\beta_0$ is high
dimensional but sparse (exactly or approximately). Both versions are shown to
have the same asymptotic normal distribution and do not require the minimal
signal condition for statistical inference of any component in $\beta_0$. Our
method also works when $Z_i$ is high dimensional provided that the function
classes $E(X_{ij} Z_i)$s and $E(Y_iZ_i)$ belong to exhibit certain sparsity
features, e.g., a sparse additive decomposition structure. We further develop a
simultaneous hypothesis testing procedure based on multiplier bootstrap. Our
testing method automatically takes into account of the dependence structure
within the debiased estimates, and allows the number of tested components to be
exponentially high.

A collection of quantile curves provides a complete picture of conditional
distributions. Properly centered and scaled versions of estimated curves at
various quantile levels give rise to the socalled quantile regression process
(QRP). In this paper, we establish weak convergence of QRP in a general series
approximation framework, which includes linear models with increasing
dimension, nonparametric models and partial linear models. An interesting
consequence is obtained in the last class of models, where parametric and
nonparametric estimators are shown to be asymptotically independent.
Applications of our general process convergence results include the
construction of noncrossing quantile curves and the estimation of conditional
distribution functions. As a result of independent interest, we obtain a series
of Bahadur representations with exponential bounds for tail probabilities of
all remainder terms. Bounds of this kind are potentially useful in analyzing
statistical inference procedures under divideandconquer setup.

In this paper, we explore statistical versus computational tradeoff to
address a basic question in the application of a distributed algorithm: what is
the minimal computational cost in obtaining statistical optimality? In
smoothing spline setup, we observe a phase transition phenomenon for the number
of deployed machines that ends up being a simple proxy for computing cost.
Specifically, a sharp upper bound for the number of machines is established:
when the number is below this bound, statistical optimality (in terms of
nonparametric estimation or testing) is achievable; otherwise, statistical
optimality becomes impossible. These sharp bounds partly capture intrinsic
computational limits of the distributed algorithm considered in this paper, and
turn out to be fully determined by the smoothness of the regression function.
As a side remark, we argue that sample splitting may be viewed as an
alternative form of regularization, playing a similar role as smoothing
parameter.

We develop a set of scalable Bayesian inference procedures for a general
class of nonparametric regression models. Specifically, nonparametric Bayesian
inferences are separately performed on each subset randomly split from a
massive dataset, and then the obtained local results are aggregated into global
counterparts. This aggregation step is explicit without involving any
additional computation cost. By a careful partition, we show that our
aggregated inference results obtain an oracle rule in the sense that they are
equivalent to those obtained directly from the entire data (which are
computationally prohibitive). For example, an aggregated credible ball achieves
desirable credibility level and also frequentist coverage while possessing the
same radius as the oracle ball.

We consider nonparametric testing in a nonasymptotic framework. Our
statistical guarantees are exact in the sense that Type I and II errors are
controlled for any finite sample size. Meanwhile, one proposed test is shown to
achieve minimax optimality in the asymptotic sense. An important consequence of
this nonasymptotic theory is a new and practically useful formula for
selecting the optimal smoothing parameter in nonparametric testing. The leading
example in this paper is smoothing spline models under Gaussian errors. The
results obtained therein can be further generalized to the kernel ridge
regression framework under possibly nonGaussian errors. Simulations
demonstrate that our proposed test improves over the conventional asymptotic
test when sample size is small to moderate.

Stability is an important aspect of a classification procedure because
unstable predictions can potentially reduce users' trust in a classification
system and also harm the reproducibility of scientific conclusions. The major
goal of our work is to introduce a novel concept of classification instability,
i.e., decision boundary instability (DBI), and incorporate it with the
generalization error (GE) as a standard for selecting the most accurate and
stable classifier. Specifically, we implement a twostage algorithm: (i)
initially select a subset of classifiers whose estimated GEs are not
significantly different from the minimal estimated GE among all the candidate
classifiers; (ii) the optimal classifier is chosen as the one achieving the
minimal DBI among the subset selected in stage (i). This general selection
principle applies to both linear and nonlinear classifiers. Largemargin
classifiers are used as a prototypical example to illustrate the above idea.
Our selection method is shown to be consistent in the sense that the optimal
classifier simultaneously achieves the minimal GE and the minimal DBI. Various
simulations and real examples further demonstrate the advantage of our method
over several alternative approaches.

In a general class of Bayesian nonparametric models, we prove that the
posterior distribution can be asymptotically approximated by a Gaussian
process. Our results apply to nonparametric exponential family that contains
both Gaussian and nonGaussian regression, no matter the design is fxed or
random, and also hold for both efficient (rootn) and inefficient (non rootn)
estimation. Our general approximation theorem does not rely on posterior
conjugacy, and can be verified in a class of Gaussian process priors that has a
smoothing spline interpretation [51, 36]. In particular, the limiting posterior
measure becomes priorfree under a Bayesian version of "undersmoothing"
condition. Finally, we apply our approximation theorem to examine the
asymptotic frequentist properties of Bayesian procedures such as credible
regions and credible intervals.

Factor modeling is an essential tool for exploring intrinsic dependence
structures among highdimensional random variables. Much progress has been made
for estimating the covariance matrix from a highdimensional factor model.
However, the blessing of dimensionality has not yet been fully embraced in the
literature: much of the available data is often ignored in constructing
covariance matrix estimates. If our goal is to accurately estimate a covariance
matrix of a set of targeted variables, shall we employ additional data, which
are beyond the variables of interest, in the estimation? In this paper, we
provide sufficient conditions for an affirmative answer, and further quantify
its gain in terms of Fisher information and convergence rate. In fact, even an
oraclelike result (as if all the factors were known) can be achieved when a
sufficiently large number of variables is used. The idea of utilizing data as
much as possible brings computational challenges. A divideandconquer
algorithm is thus proposed to alleviate the computational burden, and also
shown not to sacrifice any statistical accuracy in comparison with a pooled
analysis. Simulation studies further confirm our advocacy for the use of full
data, and demonstrate the effectiveness of the above algorithm. Our proposal is
applied to a microarray data example that shows empirical benefits of using
more data.

We propose a novel sparse tensor decomposition method, namely Tensor
Truncated Power (TTP) method, that incorporates variable selection into the
estimation of decomposition components. The sparsity is achieved via an
efficient truncation step embedded in the tensor power iteration. Our method
applies to a broad family of high dimensional latent variable models, including
high dimensional Gaussian mixture and mixtures of sparse regressions. A
thorough theoretical investigation is further conducted. In particular, we show
that the final decomposition estimator is guaranteed to achieve a local
statistical rate, and further strengthen it to the global statistical rate by
introducing a proper initialization procedure. In high dimensional regimes, the
obtained statistical rate significantly improves those shown in the existing
nonsparse decomposition methods. The empirical advantages of TTP are confirmed
in extensive simulated results and two real applications of clickthrough rate
prediction and highdimensional gene clustering.

This paper proposes a bootstrapassisted procedure to conduct simultaneous
inference for high dimensional sparse linear models based on the recent
desparsifying Lasso estimator (van de Geer et al. 2014). Our procedure allows
the dimension of the parameter vector of interest to be exponentially larger
than sample size, and it automatically accounts for the dependence within the
desparsifying Lasso estimator. Moreover, our simultaneous testing method can
be naturally coupled with the margin screening (Fan and Lv 2008) to enhance its
power in sparse testing with a reduced computational cost, or with the
stepdown method (Romano and Wolf 2005) to provide a strong control for the
familywise error rate. In theory, we prove that our simultaneous testing
procedure asymptotically achieves the prespecified significance level, and
enjoys certain optimality in terms of its power even when the model errors are
nonGaussian. Our general theory is also useful in studying the support
recovery problem. To broaden the applicability, we further extend our main
results to generalized linear models with convex loss functions. The
effectiveness of our methods is demonstrated via simulation studies.

We consider a partially linear framework for modelling massive heterogeneous
data. The major goal is to extract common features across all subpopulations
while exploring heterogeneity of each subpopulation. In particular, we propose
an aggregation type estimator for the commonality parameter that possesses the
(nonasymptotic) minimax optimal bound and asymptotic distribution as if there
were no heterogeneity. This oracular result holds when the number of
subpopulations does not grow too fast. A plugin estimator for the
heterogeneity parameter is further constructed, and shown to possess the
asymptotic distribution as if the commonality information were available. We
also test the heterogeneity among a large number of subpopulations. All the
above results require to regularize each subestimation as though it had the
entire sample size. Our general theory applies to the divideandconquer
approach that is often used to deal with massive homogeneous data. A technical
byproduct of this paper is the statistical inferences for the general kernel
ridge regression. Thorough numerical results are also provided to back up our
theory.

A massive dataset often consists of a growing number of (potentially)
heterogeneous subpopulations. This paper is concerned about testing various
forms of heterogeneity arising from massive data. In a general nonparametric
framework, a set of testing procedures are designed to accommodate a growing
number of subpopulations, denoted as $s$, with computational feasibility. In
theory, their null limit distributions are derived as being nearly Chisquare
with diverging degrees of freedom as long as $s$ does not grow too fast.
Interestingly, we find that a lower bound on $s$ needs to be set for obtaining
a sufficiently powerful testing result, socalled "blessing of aggregation." As
a byproduc, a type of homogeneity testing is also proposed with a test
statistic being aggregated over all subpopulations. Numerical results are
presented to support our theory.

The stability of statistical analysis is an important indicator for
reproducibility, which is one main principle of scientific method. It entails
that similar statistical conclusions can be reached based on independent
samples from the same underlying population. In this paper, we introduce a
general measure of classification instability (CIS) to quantify the sampling
variability of the prediction made by a classification method. Interestingly,
the asymptotic CIS of any weighted nearest neighbor classifier turns out to be
proportional to the Euclidean norm of its weight vector. Based on this concise
form, we propose a stabilized nearest neighbor (SNN) classifier, which
distinguishes itself from other nearest neighbor classifiers, by taking the
stability into consideration. In theory, we prove that SNN attains the minimax
optimal convergence rate in risk, and a sharp convergence rate in CIS. The
latter rate result is established for general plugin classifiers under a
lownoise condition. Extensive simulated and real examples demonstrate that SNN
achieves a considerable improvement in CIS over existing nearest neighbor
classifiers, with comparable classification accuracy. We implement the
algorithm in a publicly available R package snn.

We propose a roughness regularization approach in making nonparametric
inference for generalized functional linear models. In a reproducing kernel
Hilbert space framework, we construct asymptotically valid confidence intervals
for regression mean, prediction intervals for future response and various
statistical procedures for hypothesis testing. In particular, one procedure for
testing global behaviors of the slope function is adaptive to the smoothness of
the slope function and to the structure of the predictors. As a byproduct, a
new type of Wilks phenomenon [Ann. Math. Stat. 9 (1938) 6062; Ann. Statist. 29
(2001) 153193] is discovered when testing the functional linear models.
Despite the generality, our inference procedures are easy to implement.
Numerical examples are provided to demonstrate the empirical advantages over
the competing methods. A collection of technical tools such as
integrodifferential equation techniques [Trans. Amer. Math. Soc. (1927) 29
755800; Trans. Amer. Math. Soc. (1928) 30 453471; Trans. Amer. Math. Soc.
(1930) 32 860868], Stein's method [Ann. Statist. 41 (2013) 27862819] [Stein,
Approximate Computation of Expectations (1986) IMS] and functional Bahadur
representation [Ann. Statist. 41 (2013) 26082638] are employed in this paper.

Individualized treatment rules (ITRs) tailor treatments according to
individual patient characteristics. They can significantly improve patient care
and are thus becoming increasingly popular. The data collected during
randomized clinical trials are often used to estimate the optimal ITRs.
However, these trials are generally expensive to run, and, moreover, they are
not designed to efficiently estimate ITRs. In this paper, we propose a
costeffective estimation method from an active learning perspective. In
particular, our method recruits only the "most informative" patients (in terms
of learning the optimal ITRs) from an ongoing clinical trial. Simulation
studies and realdata examples show that our active clinical trial method
significantly improves on competing methods. We derive risk bounds and show
that they support these observed empirical advantages.

We consider a joint asymptotic framework for studying seminonparametric
regression models where (finitedimensional) Euclidean parameters and
(infinitedimensional) functional parameters are both of interest. The class of
models in consideration share a partially linear structure and are estimated in
two general contexts: (i) quasilikelihood and (ii) true likelihood. We first
show that the Euclidean estimator and (pointwise) functional estimator, which
are rescaled at different rates, jointly converge to a zeromean Gaussian
vector. This weak convergence result reveals a surprising joint asymptotics
phenomenon: these two estimators are asymptotically independent. A major goal
of this paper is to gain firsthand insights into the above phenomenon.
Moreover, a likelihood ratio testing is proposed for a set of joint local
hypotheses, where a new version of the Wilks phenomenon [Ann. Math. Stat. 9
(1938) 6062; Ann. Statist. 1 (2001) 153193] is unveiled. A novel technical
tool, called a joint Bahadur representation, is developed for studying these
joint asymptotics results.

The major goal of this paper is to study the second order frequentist
properties of the marginal posterior distribution of the parametric component
in semiparametric Bayesian models, in particular, a second order semiparametric
Bernsteinvon Mises (BvM) Theorem. Our first contribution is to discover an
interesting interference phenomenon between Bayesian estimation and frequentist
inferential accuracy: more accurate Bayesian estimation on the nuisance
function leads to higher frequentist inferential accuracy on the parametric
component. As the second contribution, we propose a new class of dependent
priors under which Bayesian inference procedures for the parametric component
are not only efficient but also adaptive (w.r.t. the smoothness of
nonparametric component) up to the second order frequentist validity. However,
commonly used independent priors may even fail to produce a desirable rootn
contraction rate for the parametric component in this adaptive case unless some
stringent assumption is imposed. Three important classes of semiparametric
models are examined, and extensive simulations are also provided.

In Bayesian nonparametric models, Gaussian processes provide a popular prior
choice for regression function estimation. Existing literature on the
theoretical investigation of the resulting posterior distribution almost
exclusively assume a fixed design for covariates. The only random design result
we are aware of (van der Vaart & van Zanten, 2011) assumes the assigned
Gaussian process to be supported on the smoothness class specified by the true
function with probability one. This is a fairly restrictive assumption as it
essentially rules out the Gaussian process prior with a squared exponential
kernel when modeling rougher functions. In this article, we show that an
appropriate rescaling of the above Gaussian process leads to a rateoptimal
posterior distribution even when the covariates are independently realized from
a known density on a compact set. The proofs are based on deriving sharp
concentration inequalities for frequentist kernel estimators; the results might
be of independent interest.