• ### Shape-Constrained Univariate Density Estimation(1804.01458)

April 4, 2018 stat.ME
While the problem of estimating a probability density function (pdf) from its observations is classical, the estimation under additional shape constraints is both important and challenging. We introduce an efficient, geometric approach for estimating pdfs given the number of their modes. This approach explores the space of constrained pdfs using an action of the diffeomorphism group that preserves their shapes. It starts with an initial template, with the desired number of modes and arbitrarily chosen heights at the critical points, and transforms it via: (1) composition by diffeomorphisms and (2) normalization to obtain the final density estimate. The search for the optimal diffeomorphism is performed under the maximum-likelihood criterion and is accomplished by mapping diffeomorphisms to the tangent space of a Hilbert sphere, a vector space whose elements can be expressed using an orthogonal basis. This framework is first applied to shape-constrained univariate, unconditional pdf estimation and then extended to conditional pdf estimation. We derive asymptotic convergence rates of the estimator and demonstrate this approach using a synthetic dataset involving speed distributions for different traffic flows on Californian driveways.
• ### Probabilistic community detection with unknown number of communities(1602.08062)

March 29, 2018 math.ST, stat.TH, stat.ME
A fundamental problem in network analysis is clustering the nodes into groups which share a similar connectivity pattern. Existing algorithms for community detection assume the knowledge of the number of clusters or estimate it a priori using various selection criteria and subsequently estimate the community structure. Ignoring the uncertainty in the first stage may lead to erroneous clustering, particularly when the community structure is vague. We instead propose a coherent probabilistic framework for simultaneous estimation of the number of communities and the community structure, adapting recently developed Bayesian nonparametric techniques to network models. An efficient Markov chain Monte Carlo (MCMC) algorithm is proposed which obviates the need to perform reversible jump MCMC on the number of clusters. The methodology is shown to outperform recently developed community detection algorithms in a variety of synthetic data examples and in benchmark real datasets. Using an appropriate metric on the space of all configurations, we develop non-asymptotic Bayes risk bounds even when the number of clusters is unknown. En route, we develop concentration properties of non-linear functions of Bernoulli random variables, which may be of independent interest.
• ### $\alpha$-Variational Inference with Statistical Guarantees(1710.03266)

We propose a family of variational approximations to Bayesian posterior distributions, called $\alpha$-VB, with provable statistical guarantees. The standard variational approximation is a special case of $\alpha$-VB with $\alpha=1$. When $\alpha \in(0,1]$, a novel class of variational inequalities is developed for linking the Bayes risk under the variational approximation to the objective function in the variational optimization problem, implying that maximizing the evidence lower bound in variational inference has the effect of minimizing the Bayes risk within the variational density family. Operating in a frequentist setup, the variational inequalities imply that point estimates constructed from the $\alpha$-VB procedure converge at an optimal rate to the true parameter in a wide range of problems. We illustrate our general theory with a number of examples, including the mean-field variational approximation to (low)-high-dimensional Bayesian linear regression with spike and slab priors, mixture of Gaussian models, latent Dirichlet allocation, and (mixture of) Gaussian variational approximation in regular parametric models.
• ### On Statistical Optimality of Variational Bayes(1712.08983)

Dec. 25, 2017 math.ST, stat.TH, stat.ML
The article addresses a long-standing open problem on the justification of using variational Bayes methods for parameter estimation. We provide general conditions for obtaining optimal risk bounds for point estimates acquired from mean-field variational Bayesian inference. The conditions pertain to the existence of certain test functions for the distance metric on the parameter space and minimal assumptions on the prior. A general recipe for verification of the conditions is outlined which is broadly applicable to existing Bayesian models with or without latent variables. As illustrations, specific applications to Latent Dirichlet Allocation and Gaussian mixture models are discussed.
• ### Frequentist coverage and sup-norm convergence rate in Gaussian process regression(1708.04753)

Aug. 16, 2017 stat.CO, math.ST, stat.TH, stat.ML
Gaussian process (GP) regression is a powerful interpolation technique due to its flexibility in capturing non-linearity. In this paper, we provide a general framework for understanding the frequentist coverage of point-wise and simultaneous Bayesian credible sets in GP regression. As an intermediate result, we develop a Bernstein-von Mises type result under supremum norm in random design GP regression. Identifying both the mean and covariance function of the posterior distribution of the Gaussian process as regularized $M$-estimators, we show that the sampling distribution of the posterior mean function and the centered posterior distribution can be respectively approximated by two population level GPs. By developing a comparison inequality between two GPs, we provide exact characterization of frequentist coverage probabilities of Bayesian point-wise credible intervals and simultaneous credible bands of the regression function. Our results show that inference based on GP regression tends to be conservative; when the prior is under-smoothed, the resulting credible intervals and bands have minimax-optimal sizes, with their frequentist coverage converging to a non-degenerate value between their nominal level and one. As a byproduct of our theory, we show that the GP regression also yields minimax-optimal posterior contraction rate relative to the supremum norm, which provides positive evidence on the long-standing problem of optimal supremum norm contraction rate in GP regression.
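For readers unfamiliar with the objects discussed in this abstract, the following is a minimal sketch of zero-mean GP regression with a squared-exponential kernel, computing the posterior mean and a nominal 95% point-wise credible band. The kernel choice, the length scale, the noise variance, and the simulated data are all illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=0.3):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=0.01, length_scale=0.3):
    """Posterior mean and point-wise variance of a zero-mean GP regression."""
    k_tt = sq_exp_kernel(x_train, x_train, length_scale)
    k_st = sq_exp_kernel(x_test, x_train, length_scale)
    k_ss = sq_exp_kernel(x_test, x_test, length_scale)
    # Solve against (K + noise * I) rather than forming an explicit inverse.
    gram = k_tt + noise_var * np.eye(len(x_train))
    mean = k_st @ np.linalg.solve(gram, y_train)
    cov = k_ss - k_st @ np.linalg.solve(gram, k_st.T)
    # Clip tiny negative values caused by floating-point round-off.
    return mean, np.clip(np.diag(cov), 0.0, None)

# Illustrative random-design data (assumed, not from the paper).
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=30)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(30)
x_test = np.linspace(0.0, 1.0, 50)

mean, var = gp_posterior(x_train, y_train, x_test)
half_width = 1.96 * np.sqrt(var)  # nominal 95% point-wise band
lower, upper = mean - half_width, mean + half_width
```

The band here is purely the Bayesian credible band; whether its frequentist coverage matches the nominal level is exactly the question the abstract addresses, and this sketch does not attempt to evaluate it.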
• ### Bayesian Variable Selection for Skewed Heteroscedastic Response(1602.09100)

July 3, 2017 math.ST, stat.TH, stat.ME
In this article, we propose new Bayesian methods for selecting and estimating a sparse coefficient vector for skewed heteroscedastic response. Our novel Bayesian procedures effectively estimate the median and other quantile functions, accommodate non-local priors for regression effects without compromising ease of implementation via sampling based tools, and asymptotically select the true set of predictors even when the number of covariates increases at the same order as the sample size. We also extend our method to deal with some observations with very large errors. Via simulation studies and a re-analysis of a medical cost study with a large number of potential predictors, we illustrate the ease of implementation and other practical advantages of our approach compared to existing methods for such studies.
• ### Compressed Covariance Estimation With Automated Dimension Learning(1704.00247)

April 2, 2017 stat.ME
We propose a method for estimating a covariance matrix that can be represented as a sum of a low-rank matrix and a diagonal matrix. The proposed method compresses high-dimensional data, computes the sample covariance in the compressed space, and lifts it back to the ambient space via a decompression operation. A salient feature of our approach relative to existing literature on combining sparsity and low-rank structures in covariance matrix estimation is that we do not require the low-rank component to be sparse. A principled framework for estimating the compressed dimension using Stein's Unbiased Risk Estimation theory is demonstrated. Experimental simulation results demonstrate the efficacy and scalability of our proposed approach.
• ### Adaptive posterior convergence rates in non-linear latent variable models(1701.07572)

Jan. 26, 2017 math.ST, stat.TH
Non-linear latent variable models have become increasingly popular in a variety of applications. However, there has been little study on theoretical properties of these models. In this article, we study rates of posterior contraction in univariate density estimation for a class of non-linear latent variable models where unobserved U(0,1) latent variables are related to the response variables via a random non-linear regression with an additive error. Our approach relies on characterizing the space of densities induced by the above model as kernel convolutions with a general class of continuous mixing measures. The literature on posterior rates of contraction in density estimation almost entirely focuses on finite or countably infinite mixture models. We develop approximation results for our class of continuous mixing measures. Using an appropriate Gaussian process prior on the unknown regression function, we obtain the optimal frequentist rate up to a logarithmic factor under standard regularity conditions on the true density.
• ### A Geometric Framework For Density Modeling(1701.05656)

Jan. 20, 2017 stat.ME
We introduce a geometric approach for estimating a probability density function (pdf) given its samples. The procedure involves obtaining an initial estimate of the pdf and then transforming it via a warping function to reach the final estimate. The initial estimate is intended to be computationally fast, albeit suboptimal, but its warping creates a larger, flexible class of density functions, resulting in substantially improved estimation. The warping is accomplished by mapping diffeomorphic functions to the tangent space of a Hilbert sphere, a vector space whose elements can be expressed using an orthogonal basis. Using a truncated basis expansion, we estimate the optimal warping and, thus, the optimal density estimate. This framework is introduced for univariate, unconditional pdf estimation and then extended to conditional pdf estimation. The approach avoids many of the computational pitfalls associated with current methods without losing on estimation performance. In the presence of irrelevant predictors, the approach achieves both statistical and computational efficiency compared to classical approaches for conditional density estimation. We derive asymptotic convergence rates of the density estimator and demonstrate this approach using synthetic datasets, and a case study to understand the association of a toxic metabolite with preterm birth.
• ### Bayesian model selection consistency and oracle inequality with intractable marginal likelihood(1701.00311)

Jan. 9, 2017 math.ST, stat.TH, stat.ME, stat.ML
In this article, we investigate large sample properties of model selection procedures in a general Bayesian framework when a closed form expression of the marginal likelihood function is not available or a local asymptotic quadratic approximation of the log-likelihood function does not exist. Under appropriate identifiability assumptions on the true model, we provide sufficient conditions for a Bayesian model selection procedure to be consistent and obey the Occam's razor phenomenon, i.e., the probability of selecting the "smallest" model that contains the truth tends to one as the sample size goes to infinity. In order to show that a Bayesian model selection procedure selects the smallest model containing the truth, we impose a prior anti-concentration condition, requiring the prior mass assigned by large models to a neighborhood of the truth to be sufficiently small. In a more general setting where the strong model identifiability assumption may not hold, we introduce the notion of local Bayesian complexity and develop oracle inequalities for Bayesian model selection procedures. Our Bayesian oracle inequality characterizes a trade-off between the approximation error and a Bayesian characterization of the local complexity of the model, illustrating the adaptive nature of averaging-based Bayesian procedures towards achieving an optimal rate of posterior convergence. Specific applications of the model selection theory are discussed in the context of high-dimensional nonparametric regression and density regression where the regression function or the conditional density is assumed to depend on a fixed subset of predictors. As a result of independent interest, we propose a general technique for obtaining upper bounds of certain small ball probability of stationary Gaussian processes.
• ### A Divide and Conquer Strategy for High Dimensional Bayesian Factor Models(1612.02875)

Dec. 29, 2016 stat.ME
We propose a distributed computing framework, based on a divide and conquer strategy and hierarchical modeling, to accelerate posterior inference for high-dimensional Bayesian factor models. Our approach distributes the task of high-dimensional covariance matrix estimation to multiple cores, solves each subproblem separately via a latent factor model, and then combines these estimates to produce a global estimate of the covariance matrix. Existing divide and conquer methods focus exclusively on dividing the total number of observations $n$ into subsamples while keeping the dimension $p$ fixed. Our approach is novel in this regard: it includes all of the $n$ samples in each subproblem and, instead, splits the dimension $p$ into smaller subsets for each subproblem. The subproblems themselves can be challenging to solve when $p$ is large due to the dependencies across dimensions. To circumvent this issue, we specify a novel hierarchical structure on the latent factors that allows for flexible dependencies across dimensions, while still maintaining computational efficiency. Our approach is readily parallelizable and is shown to have computational efficiency of several orders of magnitude in comparison to fitting a full factor model. We report the performance of our method in synthetic examples and a genomics application.
• ### Exact tests for stochastic block models(1612.06040)

Dec. 19, 2016 math.ST, stat.TH, stat.ME
We develop a finite-sample goodness-of-fit test for latent-variable block models for networks and test it on simulated and real data sets. The main building block for the latent block assignment model test is the exact test for the model with observed block assignment. The latter is implemented using algebraic statistics. While we focus on three variants of the stochastic block model, the methodology extends to any mixture of log-linear models on discrete data.
• ### Bayesian Semiparametric Multivariate Density Deconvolution(1404.6462)

Dec. 5, 2016 stat.ME
We consider the problem of multivariate density deconvolution when the interest lies in estimating the distribution of a vector-valued random variable but precise measurements of the variable of interest are not available, observations being contaminated with additive measurement errors. The existing sparse literature on the problem assumes the density of the measurement errors to be completely known. We propose robust Bayesian semiparametric multivariate deconvolution approaches when the measurement error density is not known but replicated proxies are available for each unobserved value of the random vector. Additionally, we allow the variability of the measurement errors to depend on the associated unobserved value of the vector of interest through unknown relationships, which also automatically includes the case of multivariate multiplicative measurement errors. Basic properties of finite mixture models, multivariate normal kernels and exchangeable priors are exploited in many novel ways to meet the modeling and computational challenges. Theoretical results that show the flexibility of the proposed methods are provided. We illustrate the efficiency of the proposed methods in recovering the true density of interest through simulation experiments. The methodology is applied to estimate the joint consumption pattern of different dietary components from contaminated 24-hour recalls.
• ### Bayesian fractional posteriors(1611.01125)

Nov. 7, 2016 math.ST, stat.TH
We consider the fractional posterior distribution that is obtained by updating a prior distribution via Bayes' theorem with a fractional likelihood function, a usual likelihood function raised to a fractional power. First, we analyze the contraction property of the fractional posterior in a general misspecified framework. Our contraction results only require a prior mass condition on certain Kullback-Leibler (KL) neighborhood of the true parameter (or the KL divergence minimizer in the misspecified case), and obviate constructions of test functions and sieves commonly used in the literature for analyzing the contraction property of a regular posterior. We show through a counterexample that some condition controlling the complexity of the parameter space is necessary for the regular posterior to contract, rendering additional flexibility on the choice of the prior for the fractional posterior. Second, we derive a novel Bayesian oracle inequality based on a PAC-Bayes inequality in misspecified models. Our derivation reveals several advantages of averaging-based Bayesian procedures over optimization-based frequentist procedures. As an application of the Bayesian oracle inequality, we derive a sharp oracle inequality in the convex regression problem under an arbitrary dimension. We also illustrate the theory in Gaussian process regression and density estimation problems.
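The definition of a fractional posterior can be made concrete in a conjugate toy model. The sketch below assumes a Beta prior with a Bernoulli likelihood, which is an illustrative choice and not an example from the paper: raising the likelihood $\theta^{s}(1-\theta)^{f}$ to a power $\alpha$ and multiplying by a Beta$(a, b)$ prior gives a Beta$(a + \alpha s,\, b + \alpha f)$ density. The function names are hypothetical.

```python
def fractional_beta_posterior(a, b, successes, failures, alpha):
    """Update a Beta(a, b) prior with a Bernoulli likelihood raised to the power alpha.

    prior(theta) * [theta^s * (1 - theta)^f]^alpha is proportional to
    a Beta(a + alpha * s, b + alpha * f) density.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return a + alpha * successes, b + alpha * failures

def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

# alpha = 1 recovers the regular posterior; alpha = 0.5 tempers the likelihood.
regular = fractional_beta_posterior(1.0, 1.0, successes=30, failures=10, alpha=1.0)
fractional = fractional_beta_posterior(1.0, 1.0, successes=30, failures=10, alpha=0.5)
```

In this toy model the fractional posterior behaves as if only a fraction $\alpha$ of the data had been observed, so it is more spread out than the regular posterior.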
• ### Sparse additive Gaussian process with soft interactions(1607.02670)

July 9, 2016 stat.ML
Additive nonparametric regression models provide an attractive tool for variable selection in high dimensions when the relationship between the response and predictors is complex. They offer greater flexibility compared to parametric non-linear regression models and better interpretability and scalability than the non-parametric regression models. However, achieving sparsity simultaneously in the number of nonparametric components as well as in the variables within each nonparametric component poses a stiff computational challenge. In this article, we develop a novel Bayesian additive regression model using a combination of hard and soft shrinkages to separately control the number of additive components and the variables within each component. An efficient algorithm is developed to select the important variables and estimate the interaction network. Excellent performance is obtained in simulated and real data examples.
• ### Sub-optimality of some continuous shrinkage priors(1605.05671)

May 18, 2016 stat.CO, math.ST, stat.TH
Two-component mixture priors provide a traditional way to induce sparsity in high-dimensional Bayes models. However, several aspects of such a prior, including computational complexities in high dimensions, interpretation of exact zeros and non-sparse posterior summaries under standard loss functions, have motivated an amazing variety of continuous shrinkage priors, which can be expressed as global-local scale mixtures of Gaussians. Interestingly, we demonstrate that many commonly used shrinkage priors, including the Bayesian Lasso, do not have adequate posterior concentration in high-dimensional settings.
• ### Adaptive Bayesian Estimation of Conditional Densities(1408.5355)

Jan. 19, 2016 math.ST, stat.TH
We consider a non-parametric Bayesian model for conditional densities. The model is a finite mixture of normal distributions with covariate dependent multinomial logit mixing probabilities. A prior for the number of mixture components is specified on positive integers. The marginal distribution of covariates is not modeled. We study asymptotic frequentist behavior of the posterior in this model. Specifically, we show that when the true conditional density has a certain smoothness level, then the posterior contraction rate around the truth is equal up to a log factor to the frequentist minimax rate of estimation. An extension to the case when the covariate space is unbounded is also established. As our result holds without a priori knowledge of the smoothness level of the true density, the established posterior contraction rates are adaptive. Moreover, we show that the rate is not affected by inclusion of irrelevant covariates in the model. In Monte Carlo simulations, a version of the model compares favorably to a cross-validated kernel conditional density estimator.
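The model class described here, a finite mixture of normals whose mixing probabilities depend on the covariate through a multinomial logit, can be sketched as a conditional density function. The linear form of the logits, the scalar covariate, the common standard deviation, and all parameter values below are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np

def conditional_density(y, x, intercepts, slopes, means, sigma):
    """Evaluate f(y | x) = sum_k pi_k(x) * Normal(y; means[k], sigma^2).

    pi_k(x) is a softmax over the (assumed linear) logits intercepts[k] + slopes[k] * x.
    """
    logits = intercepts + slopes * x
    logits = logits - logits.max()  # stabilise the softmax numerically
    weights = np.exp(logits) / np.exp(logits).sum()
    y = np.atleast_1d(y)[:, None]
    kernels = np.exp(-0.5 * ((y - means) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return kernels @ weights

# Hypothetical three-component model with a scalar covariate.
intercepts = np.array([0.0, 0.5, -0.5])
slopes = np.array([1.0, 0.0, -1.0])
means = np.array([-2.0, 0.0, 2.0])

y_grid = np.linspace(-10.0, 10.0, 4001)
density = conditional_density(y_grid, x=0.7, intercepts=intercepts,
                              slopes=slopes, means=means, sigma=0.8)
total_mass = density.sum() * (y_grid[1] - y_grid[0])  # Riemann-sum check
```

Because the mixing weights sum to one for every covariate value, the function is a proper density in `y` for each fixed `x`, which the Riemann sum above checks numerically.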
• ### Posterior contraction in Gaussian process regression using Wasserstein approximations(1502.02336)

Oct. 2, 2015 math.ST, stat.TH
We study posterior rates of contraction in Gaussian process regression with unbounded covariate domain. Our argument relies on developing a Gaussian approximation to the posterior of the leading coefficients of a Karhunen–Loève expansion of the Gaussian process. The salient feature of our result is deriving such an approximation in the $L^2$ Wasserstein distance and relating the speed of the approximation to the posterior contraction rate using a coupling argument. Specific illustrations are provided for the Gaussian or squared-exponential covariance kernel.
• ### Optimal Bayesian estimation in stochastic block models(1505.06794)

May 26, 2015 math.ST, stat.TH
With the advent of structured data in the form of social networks, genetic circuits and protein interaction networks, statistical analysis of networks has gained popularity over recent years. The stochastic block model constitutes a classical cluster-exhibiting random graph model for networks. There is a substantial amount of literature devoted to proposing strategies for estimating and inferring parameters of the model, both from classical and Bayesian viewpoints. Unlike the classical counterpart, there is however a dearth of theoretical results on the accuracy of estimation in the Bayesian setting. In this article, we undertake a theoretical investigation of the posterior distribution of the parameters in a stochastic block model. In particular, we show that one obtains optimal rates of posterior convergence with routinely used multinomial-Dirichlet priors on cluster indicators and uniform priors on the probabilities of the random edge indicators. En route, we develop geometric embedding techniques to exploit the lower dimensional structure of the parameter space which may be of independent interest.
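As a reminder of the data-generating model behind this abstract, the sketch below draws an undirected network from a stochastic block model: each pair of nodes is connected independently with a probability determined only by the two nodes' cluster labels. The function name, the two-cluster setup, and the probability values are illustrative assumptions.

```python
import numpy as np

def sample_sbm(labels, block_probs, rng):
    """Draw a symmetric 0/1 adjacency matrix without self-loops from a stochastic block model."""
    labels = np.asarray(labels)
    n = len(labels)
    # Edge probability for every pair of nodes, looked up from their cluster labels.
    edge_probs = block_probs[labels[:, None], labels[None, :]]
    # Sample only the strict upper triangle, then mirror it to keep the graph undirected.
    upper = np.triu(rng.random((n, n)) < edge_probs, k=1)
    return (upper | upper.T).astype(int)

rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 20)                       # two clusters of 20 nodes
block_probs = np.array([[0.8, 0.1], [0.1, 0.7]])     # within/between edge probabilities
adjacency = sample_sbm(labels, block_probs, rng)
```

The inference problem studied in the abstract runs in the opposite direction: given only `adjacency`, recover the posterior over `labels` and `block_probs`.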
• ### Bayesian Clustering of Shapes of Curves(1504.00377)

April 1, 2015 cs.LG, stat.ML
Unsupervised clustering of curves according to their shapes is an important problem with broad scientific applications. The existing model-based clustering techniques either rely on simple probability models (e.g., Gaussian) that are not generally valid for shape analysis or assume the number of clusters. We develop an efficient Bayesian method to cluster curve data using an elastic shape metric that is based on joint registration and comparison of shapes of curves. The elastic-inner product matrix obtained from the data is modeled using a Wishart distribution whose parameters are assigned carefully chosen prior distributions to allow for automatic inference on the number of clusters. The posterior is sampled through an efficient Markov chain Monte Carlo procedure based on the Chinese restaurant process to infer (1) the posterior distribution on the number of clusters, and (2) the clustering configuration of shapes. This method is demonstrated on a variety of synthetic data and real data examples on protein structure analysis, cell shape analysis in microscopy images, and clustering of shapes from the MPEG7 database.
• ### Variable Selection Using Shrinkage Priors(1503.04303)

March 22, 2015 stat.ME
Variable selection has received widespread attention over the last decade as we routinely encounter high-throughput datasets in complex biological and environmental research. Most Bayesian variable selection methods are restricted to mixture priors having separate components for characterizing the signal and the noise. However, such priors encounter computational issues in high dimensions. This has motivated continuous shrinkage priors, which resemble the two-component priors while facilitating computation and interpretability. While such priors are widely used for estimating high-dimensional sparse vectors, selecting a subset of variables remains a daunting task. In this article, we propose a general approach for variable selection with shrinkage priors. The presence of very few tuning parameters makes our method attractive in comparison to ad hoc thresholding approaches. The applicability of the approach is not limited to continuous shrinkage priors, but can be used along with any shrinkage prior. Theoretical properties for near-collinear design matrices are investigated and the method is shown to have good performance in a wide range of synthetic data examples.
• ### Optimal Bayesian estimation in random covariate design with a rescaled Gaussian process prior(1411.7420)

March 5, 2015 math.ST, stat.TH
In Bayesian nonparametric models, Gaussian processes provide a popular prior choice for regression function estimation. Existing literature on the theoretical investigation of the resulting posterior distribution almost exclusively assumes a fixed design for covariates. The only random design result we are aware of (van der Vaart & van Zanten, 2011) assumes the assigned Gaussian process to be supported on the smoothness class specified by the true function with probability one. This is a fairly restrictive assumption as it essentially rules out the Gaussian process prior with a squared exponential kernel when modeling rougher functions. In this article, we show that an appropriate rescaling of the above Gaussian process leads to a rate-optimal posterior distribution even when the covariates are independently realized from a known density on a compact set. The proofs are based on deriving sharp concentration inequalities for frequentist kernel estimators; the results might be of independent interest.
• ### A location-mixture autoregressive model for online forecasting of lung tumor motion(1309.4144)

Nov. 5, 2014 stat.ME, stat.AP
Lung tumor tracking for radiotherapy requires real-time, multiple-step ahead forecasting of a quasi-periodic time series recording instantaneous tumor locations. We introduce a location-mixture autoregressive (LMAR) process that admits multimodal conditional distributions, fast approximate inference using the EM algorithm and accurate multiple-step ahead predictive distributions. LMAR outperforms several commonly used methods in terms of out-of-sample prediction accuracy using clinical data from lung tumor patients. With its superior predictive performance and real-time computation, the LMAR model could be effectively implemented for use in current tumor tracking systems.
• ### Posterior contraction in sparse Bayesian factor models for massive covariance matrices(1206.3627)

June 2, 2014 math.ST, stat.TH
Sparse Bayesian factor models are routinely implemented for parsimonious dependence modeling and dimensionality reduction in high-dimensional applications. We provide theoretical understanding of such Bayesian procedures in terms of posterior convergence rates in inferring high-dimensional covariance matrices where the dimension can be larger than the sample size. Under relevant sparsity assumptions on the true covariance matrix, we show that commonly-used point mass mixture priors on the factor loadings lead to consistent estimation in the operator norm even when $p\gg n$. One of our major contributions is to develop a new class of continuous shrinkage priors and provide insights into their concentration around sparse vectors. Using such priors for the factor loadings, we obtain a similar rate of convergence as obtained with point mass mixture priors. To obtain the convergence rates, we construct test functions to separate points in the space of high-dimensional covariance matrices using insights from random matrix theory; the tools developed may be of independent interest. We also derive minimax rates and show that the Bayesian posterior rates of convergence coincide with the minimax rates up to a $\sqrt{\log n}$ term.
• ### Anisotropic function estimation using multi-bandwidth Gaussian processes(1111.1044)

March 21, 2014 math.ST, stat.TH
In nonparametric regression problems involving multiple predictors, there is typically interest in estimating an anisotropic multivariate regression surface in the important predictors while discarding the unimportant ones. Our focus is on defining a Bayesian procedure that leads to the minimax optimal rate of posterior contraction (up to a log factor) adapting to the unknown dimension and anisotropic smoothness of the true surface. We propose such an approach based on a Gaussian process prior with dimension-specific scalings, which are assigned carefully-chosen hyperpriors. We additionally show that using a homogeneous Gaussian process with a single bandwidth leads to a sub-optimal rate in anisotropic cases.