• ### Tensor Graphical Model: Non-convex Optimization and Statistical Inference(1609.04522)

Feb. 25, 2019 stat.ME, stat.ML
We consider the estimation and inference of graphical models that characterize the dependency structure of high-dimensional tensor-valued data. To facilitate the estimation of the precision matrix corresponding to each way of the tensor, we assume the data follow a tensor normal distribution whose covariance has a Kronecker product structure. A critical challenge in the estimation and inference of this model is the fact that its penalized maximum likelihood estimation involves minimizing a non-convex objective function. To address it, this paper makes two contributions: (i) In spite of the non-convexity of this estimation problem, we prove that an alternating minimization algorithm, which iteratively estimates each sparse precision matrix while fixing the others, attains an estimator with an optimal statistical rate of convergence. (ii) We propose a de-biased statistical inference procedure for testing hypotheses on the true support of the sparse precision matrices, and employ it for testing a growing number of hypotheses with false discovery rate (FDR) control. The asymptotic normality of our test statistic and the consistency of the FDR control procedure are established. Our theoretical results are backed up by thorough numerical studies, and our real applications on neuroimaging studies of Autism spectrum disorder and users' advertising click analysis bring new scientific findings and business insights. The proposed methods are implemented in a publicly available R package Tlasso.
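To make the Kronecker product assumption concrete, here is a minimal NumPy sketch for the two-way (matrix-valued) case. It is not taken from the paper or from Tlasso; the sizes, the helper `random_spd`, and all variable names are assumptions chosen for illustration. It checks that a draw of the form $X = A Z B^\top$ has a vectorized covariance equal to the Kronecker product of the two mode-wise covariances, and that the precision matrix factorizes the same way.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(p, rng):
    """Return a random symmetric positive-definite p x p matrix (illustrative helper)."""
    m = rng.standard_normal((p, p))
    return m @ m.T + p * np.eye(p)

# Mode-wise covariances for a 3 x 4 matrix-valued observation (hypothetical sizes).
sigma1, sigma2 = random_spd(3, rng), random_spd(4, rng)
a, b = np.linalg.cholesky(sigma1), np.linalg.cholesky(sigma2)

# One draw: X = A Z B^T, where Z has i.i.d. standard normal entries.
z = rng.standard_normal((3, 4))
x = a @ z @ b.T

# Row-major vectorization satisfies vec(A Z B^T) = (A kron B) vec(Z),
# so Cov(vec(X)) = (A kron B)(A kron B)^T = sigma1 kron sigma2.
lhs = x.ravel()
rhs = np.kron(a, b) @ z.ravel()
cov_vec = np.kron(a, b) @ np.kron(a, b).T

# The precision matrix of vec(X) is the Kronecker product of the mode-wise precisions.
precision_vec = np.kron(np.linalg.inv(sigma1), np.linalg.inv(sigma2))
```

The factorized precision is what lets an alternating scheme update one small precision matrix at a time instead of a single 12 x 12 one.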
• ### Distributed Generalized Cross-Validation for Divide-and-Conquer Kernel Ridge Regression and its Asymptotic Optimality(1612.05907)

Feb. 18, 2019 stat.ML
Tuning parameter selection is of critical importance for kernel ridge regression. To date, a data-driven tuning method for divide-and-conquer kernel ridge regression (d-KRR) has been lacking in the literature, which limits the applicability of d-KRR to large data sets. In this paper, by modifying the Generalized Cross-validation (GCV, Wahba, 1990) score, we propose a distributed Generalized Cross-Validation (dGCV) as a data-driven tool for selecting the tuning parameters in d-KRR. Not only is the proposed dGCV computationally scalable for massive data sets, it is also shown, under mild conditions, to be asymptotically optimal in the sense that minimizing the dGCV score is equivalent to minimizing the true global conditional empirical loss of the averaged function estimator, extending the existing optimality results of GCV to the divide-and-conquer framework.
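The two ingredients named above can be sketched as follows: the averaged d-KRR estimator and the classical single-sample GCV score. This is only an illustration under stated assumptions: the Gaussian kernel, its bandwidth, the simulated data, and every function name are hypothetical, and the paper's dGCV modification itself is not reproduced here.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=0.2):
    """Gaussian kernel matrix between 1-D input arrays x and y (assumed kernel)."""
    diff = x[:, None] - y[None, :]
    return np.exp(-diff**2 / (2.0 * bandwidth**2))

def krr_fit_predict(x_train, y_train, x_test, lam):
    """Kernel ridge regression: solve (K + n*lam*I) alpha = y, then predict."""
    n = len(x_train)
    k = gaussian_kernel(x_train, x_train)
    alpha = np.linalg.solve(k + n * lam * np.eye(n), y_train)
    return gaussian_kernel(x_test, x_train) @ alpha

def dkrr_predict(x, y, x_test, lam, n_splits, rng):
    """Average the KRR predictions fitted on n_splits random sub-samples."""
    idx = rng.permutation(len(x))
    preds = [krr_fit_predict(x[part], y[part], x_test, lam)
             for part in np.array_split(idx, n_splits)]
    return np.mean(preds, axis=0)

def gcv_score(x, y, lam):
    """Classical GCV: (RSS / n) / (trace(I - A) / n)^2 with A = K (K + n*lam*I)^{-1}."""
    n = len(x)
    k = gaussian_kernel(x, x)
    smoother = k @ np.linalg.inv(k + n * lam * np.eye(n))
    resid = y - smoother @ y
    return (resid @ resid / n) / (1.0 - np.trace(smoother) / n) ** 2

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 600)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(600)
x_test = np.linspace(0.05, 0.95, 50)

y_hat = dkrr_predict(x, y, x_test, lam=1e-3, n_splits=4, rng=rng)
max_err = np.max(np.abs(y_hat - np.sin(2 * np.pi * x_test)))
gcv_small, gcv_large = gcv_score(x, y, 1e-3), gcv_score(x, y, 10.0)
```

Note that `gcv_score` forms the full n x n smoother matrix, which is the cost that becomes prohibitive on massive data and that motivates a distributed score.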
• ### Distributed inference for quantile regression processes(1701.06088)

April 10, 2018 math.ST, stat.TH, stat.ME
The increased availability of massive data sets provides a unique opportunity to discover subtle patterns in their distributions, but also imposes overwhelming computational challenges. To fully utilize the information contained in big data, we propose a two-step procedure: (i) estimate conditional quantile functions at different levels in a parallel computing environment; (ii) construct a conditional quantile regression process through projection based on these estimated quantile curves. Our general quantile regression framework covers both linear models with fixed or growing dimension and series approximation models. We prove that the proposed procedure does not sacrifice any statistical inferential accuracy provided that the number of distributed computing units and quantile levels are chosen properly. In particular, a sharp upper bound for the former and a sharp lower bound for the latter are derived to capture the minimal computational cost from a statistical perspective. As an important application, the statistical inference on conditional distribution functions is considered. Moreover, we propose computationally efficient approaches to conducting inference in the distributed estimation setting described above. Those approaches directly utilize the availability of estimators from sub-samples and can be carried out at almost no additional computational cost. Simulations confirm our statistical inferential theory.
• ### Nonparametric Testing under Random Projection(1802.06308)

Feb. 17, 2018 math.ST, stat.TH, stat.ME, stat.ML
A common challenge in nonparametric inference is its high computational complexity when data volume is large. In this paper, we develop computationally efficient nonparametric testing by employing a random projection strategy. In the specific kernel ridge regression setup, a simple distance-based test statistic is proposed. Notably, we derive the minimum number of random projections that is sufficient for achieving testing optimality in terms of the minimax rate. An adaptive testing procedure is further established without prior knowledge of regularity. One technical contribution is to establish upper bounds for a range of tail sums of empirical kernel eigenvalues. Simulations and real data analysis are conducted to support our theory.
• ### Sparse and Low-rank Tensor Estimation via Cubic Sketchings(1801.09326)

Jan. 29, 2018 math.ST, stat.TH, stat.ML
In this paper, we propose a general framework for sparse and low-rank tensor estimation from cubic sketchings. A two-stage non-convex implementation is developed based on sparse tensor decomposition and thresholded gradient descent, which ensures exact recovery in the noiseless case and stable recovery in the noisy case with high probability. The non-asymptotic analysis sheds light on an interplay between optimization error and statistical error. The proposed procedure is shown to be rate-optimal under certain conditions. As a technical by-product, novel high-order concentration inequalities are derived for studying high-moment sub-Gaussian tensors. An interesting tensor formulation illustrates the potential application to high-order interaction pursuit in high-dimensional linear regression.
• ### Minimax Optimal Estimation in Partially Linear Additive Models under High Dimension(1612.05906)

Jan. 14, 2018 math.ST, stat.TH
In this paper, we derive minimax rates for estimating both parametric and nonparametric components in partially linear additive models with high dimensional sparse vectors and smooth functional components. The minimax lower bound for Euclidean components is the typical sparse estimation rate that is independent of nonparametric smoothness indices. However, the minimax lower bound for each component function exhibits an interplay between the dimensionality and sparsity of the parametric component and the smoothness of the relevant nonparametric component. Indeed, the minimax risk for smooth nonparametric estimation can be slowed down to the sparse estimation rate whenever the smoothness of the nonparametric component or dimensionality of the parametric component is sufficiently large. In the above setting, we demonstrate that penalized least squares estimators can nearly achieve minimax lower bounds.
• ### Simultaneous Clustering and Estimation of Heterogeneous Graphical Models(1611.09391)

Jan. 12, 2018 math.ST, stat.TH, stat.ML
We consider joint estimation of multiple graphical models arising from heterogeneous and high-dimensional observations. Unlike most previous approaches which assume that the cluster structure is given in advance, an appealing feature of our method is to learn cluster structure while estimating heterogeneous graphical models. This is achieved via a high dimensional version of Expectation Conditional Maximization (ECM) algorithm (Meng and Rubin, 1993). A joint graphical lasso penalty is imposed on the conditional maximization step to extract both homogeneity and heterogeneity components across all clusters. Our algorithm is computationally efficient due to fast sparse learning routines and can be implemented without unsupervised learning knowledge. The superior performance of our method is demonstrated by extensive experiments and its application to a Glioblastoma cancer dataset reveals some new insights in understanding the Glioblastoma cancer. In theory, a non-asymptotic error bound is established for the output directly from our high dimensional ECM algorithm, and it consists of two quantities: statistical error (statistical accuracy) and optimization error (computational complexity). Such a result gives a theoretical guideline in terminating our ECM iterations.
• ### High Dimensional Inference in Partially Linear Models(1708.02564)

Aug. 8, 2017 math.ST, stat.TH
We propose two semiparametric versions of the debiased Lasso procedure for the model $Y_i = X_i\beta_0 + g_0(Z_i) + \epsilon_i$, where $\beta_0$ is high dimensional but sparse (exactly or approximately). Both versions are shown to have the same asymptotic normal distribution and do not require the minimal signal condition for statistical inference of any component in $\beta_0$. Our method also works when $Z_i$ is high dimensional provided that the function classes to which the $E(X_{ij} |Z_i)$s and $E(Y_i|Z_i)$ belong exhibit certain sparsity features, e.g., a sparse additive decomposition structure. We further develop a simultaneous hypothesis testing procedure based on multiplier bootstrap. Our testing method automatically takes into account the dependence structure within the debiased estimates, and allows the number of tested components to be exponentially high.
• ### Quantile Processes for Semi and Nonparametric Regression(1604.02130)

July 21, 2017 math.ST, stat.TH
A collection of quantile curves provides a complete picture of conditional distributions. Properly centered and scaled versions of estimated curves at various quantile levels give rise to the so-called quantile regression process (QRP). In this paper, we establish weak convergence of QRP in a general series approximation framework, which includes linear models with increasing dimension, nonparametric models and partial linear models. An interesting consequence is obtained in the last class of models, where parametric and non-parametric estimators are shown to be asymptotically independent. Applications of our general process convergence results include the construction of non-crossing quantile curves and the estimation of conditional distribution functions. As a result of independent interest, we obtain a series of Bahadur representations with exponential bounds for tail probabilities of all remainder terms. Bounds of this kind are potentially useful in analyzing statistical inference procedures under a divide-and-conquer setup.
• ### Computational Limits of A Distributed Algorithm For Smoothing Spline(1512.09226)

July 21, 2017 math.ST, stat.TH
In this paper, we explore the statistical versus computational trade-off to address a basic question in the application of a distributed algorithm: what is the minimal computational cost in obtaining statistical optimality? In the smoothing spline setup, we observe a phase transition phenomenon for the number of deployed machines that ends up being a simple proxy for computing cost. Specifically, a sharp upper bound for the number of machines is established: when the number is below this bound, statistical optimality (in terms of nonparametric estimation or testing) is achievable; otherwise, statistical optimality becomes impossible. These sharp bounds partly capture intrinsic computational limits of the distributed algorithm considered in this paper, and turn out to be fully determined by the smoothness of the regression function. As a side remark, we argue that sample splitting may be viewed as an alternative form of regularization, playing a similar role to the smoothing parameter.
• ### Nonparametric Bayesian Aggregation for Massive Data(1508.04175)

Sept. 4, 2019 math.ST, stat.TH
We develop a set of scalable Bayesian inference procedures for a general class of nonparametric regression models. Specifically, nonparametric Bayesian inferences are separately performed on each subset randomly split from a massive dataset, and then the obtained local results are aggregated into global counterparts. This aggregation step is explicit without involving any additional computation cost. By a careful partition, we show that our aggregated inference results obtain an oracle rule in the sense that they are equivalent to those obtained directly from the entire data (which are computationally prohibitive). For example, an aggregated credible ball achieves desirable credibility level and also frequentist coverage while possessing the same radius as the oracle ball.
• ### Non-asymptotic theory for nonparametric testing(1702.01330)

Feb. 4, 2017 math.ST, stat.TH
We consider nonparametric testing in a non-asymptotic framework. Our statistical guarantees are exact in the sense that Type I and II errors are controlled for any finite sample size. Meanwhile, one proposed test is shown to achieve minimax optimality in the asymptotic sense. An important consequence of this non-asymptotic theory is a new and practically useful formula for selecting the optimal smoothing parameter in nonparametric testing. The leading example in this paper is smoothing spline models under Gaussian errors. The results obtained therein can be further generalized to the kernel ridge regression framework under possibly non-Gaussian errors. Simulations demonstrate that our proposed test improves over the conventional asymptotic test when sample size is small to moderate.
• ### Stability Enhanced Large-Margin Classifier Selection(1701.05672)

Jan. 20, 2017 stat.ML
Stability is an important aspect of a classification procedure because unstable predictions can potentially reduce users' trust in a classification system and also harm the reproducibility of scientific conclusions. The major goal of our work is to introduce a novel concept of classification instability, i.e., decision boundary instability (DBI), and incorporate it with the generalization error (GE) as a standard for selecting the most accurate and stable classifier. Specifically, we implement a two-stage algorithm: (i) initially select a subset of classifiers whose estimated GEs are not significantly different from the minimal estimated GE among all the candidate classifiers; (ii) the optimal classifier is chosen as the one achieving the minimal DBI among the subset selected in stage (i). This general selection principle applies to both linear and nonlinear classifiers. Large-margin classifiers are used as a prototypical example to illustrate the above idea. Our selection method is shown to be consistent in the sense that the optimal classifier simultaneously achieves the minimal GE and the minimal DBI. Various simulations and real examples further demonstrate the advantage of our method over several alternative approaches.
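The two-stage rule described above can be sketched in a few lines. The abstract does not say how "not significantly different" is decided, so this sketch assumes a simple cutoff of `z` standard errors above the minimal estimated GE; that cutoff, the inputs `ge_se`, and the function name `select_classifier` are hypothetical and stand in for whatever test the paper actually uses.

```python
import numpy as np

def select_classifier(ge, ge_se, dbi, z=1.645):
    """Two-stage selection over candidate classifiers (illustrative sketch).

    ge    : estimated generalization errors, one per candidate
    ge_se : standard errors of those estimates (assumed to be available)
    dbi   : estimated decision boundary instabilities
    z     : assumed cutoff, in standard errors, for "not significantly different"
    """
    ge, ge_se, dbi = map(np.asarray, (ge, ge_se, dbi))
    best = np.argmin(ge)
    # Stage (i): keep candidates whose GE is within z standard errors of the minimum.
    keep = np.flatnonzero(ge - ge[best] <= z * ge_se)
    # Stage (ii): among the kept candidates, pick the one with the smallest DBI.
    return keep[np.argmin(dbi[keep])]

# Four hypothetical candidates: the third is stable but clearly less accurate.
ge = [0.120, 0.125, 0.180, 0.122]
ge_se = [0.010, 0.010, 0.010, 0.010]
dbi = [0.30, 0.10, 0.05, 0.20]
chosen = select_classifier(ge, ge_se, dbi)
```

In this toy input, candidates 0, 1 and 3 survive stage (i), and candidate 1 is chosen because it has the smallest DBI among them; the most stable candidate overall (index 2) is excluded for its larger error.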
• ### Gaussian Approximation of General Nonparametric Posterior Distributions(1411.3686)

Nov. 15, 2016 math.ST, stat.TH
In a general class of Bayesian nonparametric models, we prove that the posterior distribution can be asymptotically approximated by a Gaussian process. Our results apply to nonparametric exponential family that contains both Gaussian and non-Gaussian regression, whether the design is fixed or random, and also hold for both efficient (root-n) and inefficient (non root-n) estimation. Our general approximation theorem does not rely on posterior conjugacy, and can be verified in a class of Gaussian process priors that has a smoothing spline interpretation [51, 36]. In particular, the limiting posterior measure becomes prior-free under a Bayesian version of "undersmoothing" condition. Finally, we apply our approximation theorem to examine the asymptotic frequentist properties of Bayesian procedures such as credible regions and credible intervals.
• ### Embracing the Blessing of Dimensionality in Factor Models(1610.07697)

Oct. 25, 2016 math.ST, stat.TH, stat.ME
Factor modeling is an essential tool for exploring intrinsic dependence structures among high-dimensional random variables. Much progress has been made for estimating the covariance matrix from a high-dimensional factor model. However, the blessing of dimensionality has not yet been fully embraced in the literature: much of the available data is often ignored in constructing covariance matrix estimates. If our goal is to accurately estimate a covariance matrix of a set of targeted variables, shall we employ additional data, which are beyond the variables of interest, in the estimation? In this paper, we provide sufficient conditions for an affirmative answer, and further quantify its gain in terms of Fisher information and convergence rate. In fact, even an oracle-like result (as if all the factors were known) can be achieved when a sufficiently large number of variables is used. The idea of utilizing data as much as possible brings computational challenges. A divide-and-conquer algorithm is thus proposed to alleviate the computational burden, and also shown not to sacrifice any statistical accuracy in comparison with a pooled analysis. Simulation studies further confirm our advocacy for the use of full data, and demonstrate the effectiveness of the above algorithm. Our proposal is applied to a microarray data example that shows empirical benefits of using more data.
• ### Provable Sparse Tensor Decomposition(1502.01425)

May 2, 2016 stat.ML
We propose a novel sparse tensor decomposition method, namely Tensor Truncated Power (TTP) method, that incorporates variable selection into the estimation of decomposition components. The sparsity is achieved via an efficient truncation step embedded in the tensor power iteration. Our method applies to a broad family of high dimensional latent variable models, including high dimensional Gaussian mixture and mixtures of sparse regressions. A thorough theoretical investigation is further conducted. In particular, we show that the final decomposition estimator is guaranteed to achieve a local statistical rate, and further strengthen it to the global statistical rate by introducing a proper initialization procedure. In high dimensional regimes, the obtained statistical rate significantly improves those shown in the existing non-sparse decomposition methods. The empirical advantages of TTP are confirmed in extensive simulated results and two real applications of click-through rate prediction and high-dimensional gene clustering.
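The idea of embedding a truncation step in the tensor power iteration can be sketched as follows for the simplest case, a noisy rank-one symmetric third-order tensor. This is a minimal illustration, not the paper's TTP implementation: the function names, the all-ones starting vector, the problem sizes, and the noise level are assumptions, and the paper's initialization procedure and multi-component decomposition are not covered.

```python
import numpy as np

def truncate(u, s):
    """Keep the s largest-magnitude entries of u, zero the rest, and renormalize."""
    out = np.zeros_like(u)
    top = np.argsort(np.abs(u))[-s:]
    out[top] = u[top]
    return out / np.linalg.norm(out)

def truncated_power(tensor, s, u0, n_iter=20):
    """Power iteration u <- T(I, u, u) with a truncation step after each update."""
    u = u0 / np.linalg.norm(u0)
    for _ in range(n_iter):
        u = np.einsum("ijk,j,k->i", tensor, u, u)
        u = truncate(u, s)
    return u

rng = np.random.default_rng(0)
d, s = 30, 4
v = np.zeros(d)
v[:s] = 0.5  # sparse unit vector with s nonzero entries

# Rank-one signal 3 * v (x) v (x) v plus small Gaussian noise (hypothetical setup).
noise = 0.01 * rng.standard_normal((d, d, d))
tensor = 3.0 * np.einsum("i,j,k->ijk", v, v, v) + noise

u_hat = truncated_power(tensor, s, u0=np.ones(d))
alignment = abs(u_hat @ v)
```

Because truncation zeroes all but `s` coordinates at every step, noise outside the estimated support does not accumulate in the iterate, which is the intuition behind the improved rate in sparse regimes.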
• ### Simultaneous Inference for High-dimensional Linear Models(1603.01295)

March 3, 2016 math.ST, stat.TH
This paper proposes a bootstrap-assisted procedure to conduct simultaneous inference for high dimensional sparse linear models based on the recent de-sparsifying Lasso estimator (van de Geer et al. 2014). Our procedure allows the dimension of the parameter vector of interest to be exponentially larger than sample size, and it automatically accounts for the dependence within the de-sparsifying Lasso estimator. Moreover, our simultaneous testing method can be naturally coupled with the margin screening (Fan and Lv 2008) to enhance its power in sparse testing with a reduced computational cost, or with the step-down method (Romano and Wolf 2005) to provide a strong control for the family-wise error rate. In theory, we prove that our simultaneous testing procedure asymptotically achieves the pre-specified significance level, and enjoys certain optimality in terms of its power even when the model errors are non-Gaussian. Our general theory is also useful in studying the support recovery problem. To broaden the applicability, we further extend our main results to generalized linear models with convex loss functions. The effectiveness of our methods is demonstrated via simulation studies.
• ### A Partially Linear Framework for Massive Heterogeneous Data(1410.8570)

Jan. 25, 2016 math.ST, stat.TH
We consider a partially linear framework for modelling massive heterogeneous data. The major goal is to extract common features across all sub-populations while exploring heterogeneity of each sub-population. In particular, we propose an aggregation type estimator for the commonality parameter that possesses the (non-asymptotic) minimax optimal bound and asymptotic distribution as if there were no heterogeneity. This oracular result holds when the number of sub-populations does not grow too fast. A plug-in estimator for the heterogeneity parameter is further constructed, and shown to possess the asymptotic distribution as if the commonality information were available. We also test the heterogeneity among a large number of sub-populations. All the above results require regularizing each sub-estimation as though it had the entire sample size. Our general theory applies to the divide-and-conquer approach that is often used to deal with massive homogeneous data. A technical by-product of this paper is the statistical inferences for the general kernel ridge regression. Thorough numerical results are also provided to back up our theory.
• ### Nonparametric Heterogeneity Testing For Massive Data(1601.06212)

Jan. 23, 2016 math.ST, stat.TH
A massive dataset often consists of a growing number of (potentially) heterogeneous sub-populations. This paper is concerned with testing various forms of heterogeneity arising from massive data. In a general nonparametric framework, a set of testing procedures are designed to accommodate a growing number of sub-populations, denoted as $s$, with computational feasibility. In theory, their null limit distributions are derived as being nearly Chi-square with diverging degrees of freedom as long as $s$ does not grow too fast. Interestingly, we find that a lower bound on $s$ needs to be set for obtaining a sufficiently powerful testing result, so-called "blessing of aggregation." As a by-product, a type of homogeneity testing is also proposed with a test statistic being aggregated over all sub-populations. Numerical results are presented to support our theory.
• ### Stabilized Nearest Neighbor Classifier and Its Statistical Properties(1405.6642)

Aug. 30, 2015 cs.LG, stat.ML
The stability of statistical analysis is an important indicator for reproducibility, which is one main principle of the scientific method. It entails that similar statistical conclusions can be reached based on independent samples from the same underlying population. In this paper, we introduce a general measure of classification instability (CIS) to quantify the sampling variability of the prediction made by a classification method. Interestingly, the asymptotic CIS of any weighted nearest neighbor classifier turns out to be proportional to the Euclidean norm of its weight vector. Based on this concise form, we propose a stabilized nearest neighbor (SNN) classifier, which distinguishes itself from other nearest neighbor classifiers by taking the stability into consideration. In theory, we prove that SNN attains the minimax optimal convergence rate in risk, and a sharp convergence rate in CIS. The latter rate result is established for general plug-in classifiers under a low-noise condition. Extensive simulated and real examples demonstrate that SNN achieves a considerable improvement in CIS over existing nearest neighbor classifiers, with comparable classification accuracy. We implement the algorithm in a publicly available R package snn.
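A small sketch may help fix the notion of a weighted nearest neighbor rule and the weight-vector norm that the asymptotic CIS is proportional to. It is not the snn package: the function names, the one-dimensional toy data, and the 0.5 voting threshold are assumptions, and the SNN weights themselves are not reproduced. For ordinary k-NN the weight vector puts 1/k on each of the k nearest points, so its Euclidean norm is $1/\sqrt{k}$.

```python
import numpy as np

def weighted_nn_predict(x_train, y_train, x_query, weights):
    """Weighted nearest neighbor rule for labels in {0, 1}.

    weights[i] is the weight on the (i+1)-th nearest neighbor; weights sum to 1.
    """
    dist = np.linalg.norm(x_train - x_query, axis=1)
    order = np.argsort(dist)
    score = np.dot(weights, y_train[order])
    return int(score >= 0.5)

def knn_weights(n, k):
    """k-NN as a weighted rule: weight 1/k on the k nearest points, 0 elsewhere."""
    w = np.zeros(n)
    w[:k] = 1.0 / k
    return w

n = 200
# Deterministic toy data on a grid: the label is 1 exactly when the feature is positive.
x_train = np.linspace(-1.0, 1.0, n).reshape(-1, 1)
y_train = (x_train[:, 0] > 0).astype(float)

pred = weighted_nn_predict(x_train, y_train, np.array([0.5]), knn_weights(n, 5))

norm_k5 = np.linalg.norm(knn_weights(n, 5))    # 1 / sqrt(5)
norm_k20 = np.linalg.norm(knn_weights(n, 20))  # 1 / sqrt(20)
```

Spreading the weight over more neighbors shrinks the norm, which is one way to see why stability can be traded off against accuracy through the choice of weights.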
• ### Nonparametric inference in generalized functional linear models(1405.6655)

July 30, 2015 math.ST, stat.TH
We propose a roughness regularization approach in making nonparametric inference for generalized functional linear models. In a reproducing kernel Hilbert space framework, we construct asymptotically valid confidence intervals for regression mean, prediction intervals for future response and various statistical procedures for hypothesis testing. In particular, one procedure for testing global behaviors of the slope function is adaptive to the smoothness of the slope function and to the structure of the predictors. As a by-product, a new type of Wilks phenomenon [Ann. Math. Stat. 9 (1938) 60-62; Ann. Statist. 29 (2001) 153-193] is discovered when testing the functional linear models. Despite the generality, our inference procedures are easy to implement. Numerical examples are provided to demonstrate the empirical advantages over the competing methods. A collection of technical tools such as integro-differential equation techniques [Trans. Amer. Math. Soc. (1927) 29 755-800; Trans. Amer. Math. Soc. (1928) 30 453-471; Trans. Amer. Math. Soc. (1930) 32 860-868], Stein's method [Ann. Statist. 41 (2013) 2786-2819] [Stein, Approximate Computation of Expectations (1986) IMS] and functional Bahadur representation [Ann. Statist. 41 (2013) 2608-2638] are employed in this paper.
• ### Active Clinical Trials for Personalized Medicine(1404.2971)

June 29, 2015 stat.ME
Individualized treatment rules (ITRs) tailor treatments according to individual patient characteristics. They can significantly improve patient care and are thus becoming increasingly popular. The data collected during randomized clinical trials are often used to estimate the optimal ITRs. However, these trials are generally expensive to run, and, moreover, they are not designed to efficiently estimate ITRs. In this paper, we propose a cost-effective estimation method from an active learning perspective. In particular, our method recruits only the "most informative" patients (in terms of learning the optimal ITRs) from an ongoing clinical trial. Simulation studies and real-data examples show that our active clinical trial method significantly improves on competing methods. We derive risk bounds and show that they support these observed empirical advantages.
• ### Joint asymptotics for semi-nonparametric regression models with partially linear structure(1311.2628)

June 3, 2015 math.ST, stat.TH
We consider a joint asymptotic framework for studying semi-nonparametric regression models where (finite-dimensional) Euclidean parameters and (infinite-dimensional) functional parameters are both of interest. The class of models under consideration shares a partially linear structure and is estimated in two general contexts: (i) quasi-likelihood and (ii) true likelihood. We first show that the Euclidean estimator and (pointwise) functional estimator, which are re-scaled at different rates, jointly converge to a zero-mean Gaussian vector. This weak convergence result reveals a surprising joint asymptotics phenomenon: these two estimators are asymptotically independent. A major goal of this paper is to gain first-hand insights into the above phenomenon. Moreover, a likelihood ratio test is proposed for a set of joint local hypotheses, where a new version of the Wilks phenomenon [Ann. Math. Stat. 9 (1938) 60-62; Ann. Statist. 29 (2001) 153-193] is unveiled. A novel technical tool, called a joint Bahadur representation, is developed for studying these joint asymptotics results.
• ### Semiparametric Bernstein-von Mises Theorem: Second Order Studies(1503.04493)

March 16, 2015 math.ST, stat.TH
The major goal of this paper is to study the second order frequentist properties of the marginal posterior distribution of the parametric component in semiparametric Bayesian models, in particular, a second order semiparametric Bernstein-von Mises (BvM) Theorem. Our first contribution is to discover an interesting interference phenomenon between Bayesian estimation and frequentist inferential accuracy: more accurate Bayesian estimation on the nuisance function leads to higher frequentist inferential accuracy on the parametric component. As the second contribution, we propose a new class of dependent priors under which Bayesian inference procedures for the parametric component are not only efficient but also adaptive (w.r.t. the smoothness of nonparametric component) up to the second order frequentist validity. However, commonly used independent priors may even fail to produce a desirable root-n contraction rate for the parametric component in this adaptive case unless some stringent assumption is imposed. Three important classes of semiparametric models are examined, and extensive simulations are also provided.
• ### Optimal Bayesian estimation in random covariate design with a rescaled Gaussian process prior(1411.7420)

March 5, 2015 math.ST, stat.TH
In Bayesian nonparametric models, Gaussian processes provide a popular prior choice for regression function estimation. Existing literature on the theoretical investigation of the resulting posterior distribution almost exclusively assumes a fixed design for covariates. The only random design result we are aware of (van der Vaart & van Zanten, 2011) assumes the assigned Gaussian process to be supported on the smoothness class specified by the true function with probability one. This is a fairly restrictive assumption as it essentially rules out the Gaussian process prior with a squared exponential kernel when modeling rougher functions. In this article, we show that an appropriate rescaling of the above Gaussian process leads to a rate-optimal posterior distribution even when the covariates are independently realized from a known density on a compact set. The proofs are based on deriving sharp concentration inequalities for frequentist kernel estimators; the results might be of independent interest.