• ### Correcting boundary over-exploration deficiencies in Bayesian optimization with virtual derivative sign observations(1704.00963)

Sept. 21, 2018 stat.CO, stat.ME, stat.ML
Bayesian optimization (BO) is a global optimization strategy designed to find the minimum of an expensive black-box function, typically defined on a compact subset of $\mathcal{R}^d$, by using a Gaussian process (GP) as a surrogate model for the objective. Although currently available acquisition functions address this goal with different degree of success, an over-exploration effect of the contour of the search space is typically observed. However, in problems like the configuration of machine learning algorithms, the function domain is conservatively large and with a high probability the global minimum does not sit on the boundary of the domain. We propose a method to incorporate this knowledge into the search process by adding virtual derivative observations in the \gp at the boundary of the search space. We use the properties of GPs to impose conditions on the partial derivatives of the objective. The method is applicable with any acquisition function, it is easy to use and consistently reduces the number of evaluations required to optimize the objective irrespective of the acquisition used. We illustrate the benefits of our approach in an extensive experimental comparison.
• ### Efficient acquisition rules for model-based approximate Bayesian computation(1704.00520)

Aug. 8, 2018 stat.CO, stat.ME, stat.ML
Approximate Bayesian computation (ABC) is a method for Bayesian inference when the likelihood is unavailable but simulating from the model is possible. However, many ABC algorithms require a large number of simulations, which can be costly. To reduce the computational cost, Bayesian optimisation (BO) and surrogate models such as Gaussian processes have been proposed. Bayesian optimisation enables one to intelligently decide where to evaluate the model next but common BO strategies are not designed for the goal of estimating the posterior distribution. Our paper addresses this gap in the literature. We propose to compute the uncertainty in the ABC posterior density, which is due to a lack of simulations to estimate this quantity accurately, and define a loss function that measures this uncertainty. We then propose to select the next evaluation location to minimise the expected loss. Experiments show that the proposed method often produces the most accurate approximations as compared to common BO strategies.
• ### ELFI: Engine for Likelihood-Free Inference(1708.00707)

July 5, 2018 cs.MS, stat.CO, stat.ML
Engine for Likelihood-Free Inference (ELFI) is a Python software library for performing likelihood-free inference (LFI). ELFI provides a convenient syntax for arranging components in LFI, such as priors, simulators, summaries or distances, to a network called ELFI graph. The components can be implemented in a wide variety of languages. The stand-alone ELFI graph can be used with any of the available inference methods without modifications. A central method implemented in ELFI is Bayesian Optimization for Likelihood-Free Inference (BOLFI), which has recently been shown to accelerate likelihood-free inference up to several orders of magnitude by surrogate-modelling the distance. ELFI also has an inbuilt support for output data storing for reuse and analysis, and supports parallelization of computation from multiple cores up to a cluster environment. ELFI is designed to be extensible and provides interfaces for widening its functionality. This makes the adding of new inference methods to ELFI straightforward and automatically compatible with the inbuilt features.
• ### Visualization in Bayesian workflow(1709.01449)

June 9, 2018 stat.ME, stat.AP
Bayesian data analysis is about more than just computing a posterior distribution, and Bayesian visualization is about more than trace plots of Markov chains. Practical Bayesian data analysis, like all data analysis, is an iterative process of model building, inference, model checking and evaluation, and model expansion. Visualization is helpful in each of these stages of the Bayesian workflow and it is indispensable when drawing inferences from the types of modern, high-dimensional models that are used by applied researchers.
• ### Validating Bayesian Inference Algorithms with Simulation-Based Calibration(1804.06788)

April 18, 2018 stat.ME
Verifying the correctness of Bayesian computation is challenging. This is especially true for complex models that are common in practice, as these require sophisticated model implementations and algorithms. In this paper we introduce \emph{simulation-based calibration} (SBC), a general procedure for validating inferences from Bayesian algorithms capable of generating posterior samples. This procedure not only identifies inaccurate computation and inconsistencies in model implementations but also provides graphical summaries that can indicate the nature of the problems that arise. We argue that SBC is a critical part of a robust Bayesian workflow, as well as being a useful tool for those developing computational algorithms and statistical software.
• ### Bayesian Estimation of Gaussian Graphical Models with Projection Predictive Selection(1801.05725)

April 6, 2018 stat.ME, stat.AP
Gaussian graphical models are used for determining conditional relationships between variables. This is accomplished by identifying off-diagonal elements in the inverse-covariance matrix that are non-zero. When the ratio of variables (p) to observations (n) approaches one, the maximum likelihood estimator of the covariance matrix becomes unstable and requires shrinkage estimation. Whereas several classical (frequentist) methods have been introduced to address this issue, Bayesian methods remain relatively uncommon in practice and methodological literatures. Here we introduce a Bayesian method for estimating sparse matrices, in which conditional relationships are determined with projection predictive selection. This method uses Kullback-Leibler divergence and cross-validation for variable selection, in addition to the horseshoe prior for regularization. Through simulation and an applied example, we demonstrate that the proposed method often outperforms classical methods, such as the graphical lasso, as well as an alternative Bayesian method with respect to edge identification and frequentist risk. Further, projection predictive selection consistently had the lowest false positive rate, both with simulated and real data. We end by discussing future directions and contributions to the Bayesian literature on the topic of sparsity.
• ### Expectation propagation as a way of life: A framework for Bayesian inference on partitioned data(1412.4869)

Nov. 30, 2019 stat.CO, stat.ME, stat.ML
A common divide-and-conquer approach for Bayesian computation with big data is to partition the data, perform local inference for each piece separately, and combine the results to obtain a global posterior approximation. While being conceptually and computationally appealing, this method involves the problematic need to also split the prior for the local inferences; these weakened priors may not provide enough regularization for each separate computation, thus eliminating one of the key advantages of Bayesian methods. To resolve this dilemma while still retaining the generalizability of the underlying local inference method, we apply the idea of expectation propagation (EP) as a framework for distributed Bayesian inference. The central idea is to iteratively update approximations to the local likelihoods given the state of the other approximations and the prior. The present paper has two roles: we review the steps that are needed to keep EP algorithms numerically stable, and we suggest a general approach, inspired by EP, for approaching data partitioning problems in a way that achieves the computational benefits of parallelism while allowing each local update to make use of relevant information from the other sites. In addition, we demonstrate how the method can be applied in a hierarchical context to make use of partitioning of both data and parameters. The paper describes a general algorithmic framework, rather than a specific algorithm, and presents an example implementation for it.
• ### User Modelling for Avoiding Overfitting in Interactive Knowledge Elicitation for Prediction(1710.04881)

March 9, 2018 cs.HC, cs.LG, stat.ML
In human-in-the-loop machine learning, the user provides information beyond that in the training data. Many algorithms and user interfaces have been designed to optimize and facilitate this human--machine interaction; however, fewer studies have addressed the potential defects the designs can cause. Effective interaction often requires exposing the user to the training data or its statistics. The design of the system is then critical, as this can lead to double use of data and overfitting, if the user reinforces noisy patterns in the data. We propose a user modelling methodology, by assuming simple rational behaviour, to correct the problem. We show, in a user study with 48 participants, that the method improves predictive performance in a sparse linear regression sentiment analysis task, where graded user knowledge on feature relevance is elicited. We believe that the key idea of inferring user knowledge with probabilistic user models has general applicability in guarding against overfitting and improving interactive machine learning.
• ### Gaussian process modeling in approximate Bayesian computation to estimate horizontal gene transfer in bacteria(1610.06462)

Feb. 16, 2018 stat.ME, stat.AP, stat.ML
Approximate Bayesian computation (ABC) can be used for model fitting when the likelihood function is intractable but simulating from the model is feasible. However, even a single evaluation of a complex model may take several hours, limiting the number of model evaluations available. Modelling the discrepancy between the simulated and observed data using a Gaussian process (GP) can be used to reduce the number of model evaluations required by ABC, but the sensitivity of this approach to a specific GP formulation has not yet been thoroughly investigated. We begin with a comprehensive empirical evaluation of using GPs in ABC, including various transformations of the discrepancies and two novel GP formulations. Our results indicate the choice of GP may significantly affect the accuracy of the estimated posterior distribution. Selection of an appropriate GP model is thus important. We formulate expected utility to measure the accuracy of classifying discrepancies below or above the ABC threshold, and show that it can be used to automate the GP model selection step. Finally, based on the understanding gained with toy examples, we fit a population genetic model for bacteria, providing insight into horizontal gene transfer events within the population and from external origins.
• ### Yes, but Did It Work?: Evaluating Variational Inference(1802.02538)

Feb. 7, 2018 stat.CO, stat.ML
While it's always possible to compute a variational approximation to a posterior distribution, it can be difficult to discover problems with this approximation". We propose two diagnostic algorithms to alleviate this problem. The Pareto-smoothed importance sampling (PSIS) diagnostic gives a goodness of fit measurement for joint distributions, while simultaneously improving the error in the estimate. The variational simulation-based calibration (VSBC) assesses the average performance of point estimates.
• ### Bayesian inference for spatio-temporal spike-and-slab priors(1509.04752)

Dec. 1, 2017 stat.CO, stat.ME, stat.ML
In this work, we address the problem of solving a series of underdetermined linear inverse problems subject to a sparsity constraint. We generalize the spike-and-slab prior distribution to encode a priori correlation of the support of the solution in both space and time by imposing a transformed Gaussian process on the spike-and-slab probabilities. An expectation propagation (EP) algorithm for posterior inference under the proposed model is derived. For large scale problems, the standard EP algorithm can be prohibitively slow. We therefore introduce three different approximation schemes to reduce the computational complexity. Finally, we demonstrate the proposed model using numerical experiments based on both synthetic and real data sets.
• ### Iterative Supervised Principal Components(1710.06229)

Oct. 17, 2017 stat.ME
In high-dimensional prediction problems, where the number of features may greatly exceed the number of training instances, fully Bayesian approach with a sparsifying prior is known to produce good results but is computationally challenging. To alleviate this computational burden, we propose to use a preprocessing step where we first apply a dimension reduction to the original data to reduce the number of features to something that is computationally conveniently handled by Bayesian methods. To do this, we propose a new dimension reduction technique, called iterative supervised principal components (ISPC), which combines variable screening and dimension reduction and can be considered as an extension to the existing technique of supervised principal components (SPCs). Our empirical evaluations confirm that, although not foolproof, the proposed approach provides very good results on several microarray benchmark datasets with very affordable computation time, and can also be very useful for visualizing high-dimensional data.
• ### Using stacking to average Bayesian predictive distributions(1704.02030)

Sept. 16, 2017 stat.CO, stat.ME
The widely recommended procedure of Bayesian model averaging is flawed in the M-open setting in which the true data-generating process is not one of the candidate models being fit. We take the idea of stacking from the point estimation literature and generalize to the combination of predictive distributions, extending the utility function to any proper scoring rule, using Pareto smoothed importance sampling to efficiently compute the required leave-one-out posterior distributions and regularization to get more stability. We compare stacking of predictive distributions to several alternatives: stacking of means, Bayesian model averaging (BMA), pseudo-BMA using AIC-type weighting, and a variant of pseudo-BMA that is stabilized using the Bayesian bootstrap. Based on simulations and real-data applications, we recommend stacking of predictive distributions, with BB-pseudo-BMA as an approximate alternative when computation cost is an issue.
• ### Nudged elastic band calculations accelerated with Gaussian process regression(1706.04606)

Minimum energy paths for transitions such as atomic and/or spin rearrangements in thermalized systems are the transition paths of largest statistical weight. Such paths are frequently calculated using the nudged elastic band method, where an initial path is iteratively shifted to the nearest minimum energy path. The computational effort can be large, especially when ab initio or electron density functional calculations are used to evaluate the energy and atomic forces. Here, we show how the number of such evaluations can be reduced by an order of magnitude using a Gaussian process regression approach where an approximate energy surface is generated and refined in each iteration. When the goal is to evaluate the transition rate within harmonic transition state theory, the evaluation of the Hessian matrix at the initial and final state minima can be carried out beforehand and used as input in the minimum energy path calculation, thereby improving stability and reducing the number of iterations needed for convergence. A Gaussian process model also provides an uncertainty estimate for the approximate energy surface, and this can be used to focus the calculations on the lesser-known part of the path, thereby reducing the number of needed energy and force evaluations to a half in the present calculations. The methodology is illustrated using the two-dimensional M\"uller-Brown potential surface and performance assessed on an established benchmark involving 13 rearrangement transitions of a heptamer island on a solid surface.
• ### Sparsity information and regularization in the horseshoe and other shrinkage priors(1707.01694)

July 6, 2017 stat.ME
The horseshoe prior has proven to be a noteworthy alternative for sparse Bayesian estimation, but has previously suffered from two problems. First, there has been no systematic way of specifying a prior for the global shrinkage hyperparameter based on the prior information about the degree of sparsity in the parameter vector. Second, the horseshoe prior has the undesired property that there is no possibility of specifying separately information about sparsity and the amount of regularization for the largest coefficients, which can be problematic with weakly identified parameters, such as the logistic regression coefficients in the case of data separation. This paper proposes solutions to both of these problems. We introduce a concept of effective number of nonzero parameters, show an intuitive way of formulating the prior for the global hyperparameter based on the sparsity assumptions, and argue that the previous default choices are dubious based on their tendency to favor solutions with more unshrunk parameters than we typically expect a priori. Moreover, we introduce a generalization to the horseshoe prior, called the regularized horseshoe, that allows us to specify a minimum level of regularization to the largest values. We show that the new prior can be considered as the continuous counterpart of the spike-and-slab prior with a finite slab width, whereas the original horseshoe resembles the spike-and-slab with an infinitely wide slab. Numerical experiments on synthetic and real world data illustrate the benefit of both of these theoretical advances.
• ### On the Hyperprior Choice for the Global Shrinkage Parameter in the Horseshoe Prior(1610.05559)

April 27, 2017 stat.ME
The horseshoe prior has proven to be a noteworthy alternative for sparse Bayesian estimation, but as shown in this paper, the results can be sensitive to the prior choice for the global shrinkage hyperparameter. We argue that the previous default choices are dubious due to their tendency to favor solutions with more unshrunk coefficients than we typically expect a priori. This can lead to bad results if this parameter is not strongly identified by data. We derive the relationship between the global parameter and the effective number of nonzeros in the coefficient vector, and show an easy and intuitive way of setting up the prior for the global parameter based on our prior beliefs about the number of nonzero coefficients in the model. The results on real world data show that one can benefit greatly -- in terms of improved parameter estimates, prediction accuracy, and reduced computation time -- from transforming even a crude guess for the number of nonzero coefficients into the prior for the global parameter using our framework.
• ### Minimum energy path calculations with Gaussian process regression(1703.10423)

The calculation of minimum energy paths for transitions such as atomic and/or spin re-arrangements is an important task in many contexts and can often be used to determine the mechanism and rate of transitions. An important challenge is to reduce the computational effort in such calculations, especially when ab initio or electron density functional calculations are used to evaluate the energy since they can require large computational effort. Gaussian process regression is used here to reduce significantly the number of energy evaluations needed to find minimum energy paths of atomic rearrangements. By using results of previous calculations to construct an approximate energy surface and then converge to the minimum energy path on that surface in each Gaussian process iteration, the number of energy evaluations is reduced significantly as compared with regular nudged elastic band calculations. For a test problem involving rearrangements of a heptamer island on a crystal surface, the number of energy evaluations is reduced to less than a fifth. The scaling of the computational effort with the number of degrees of freedom as well as various possible further improvements to this approach are discussed.
• ### Automatic monotonicity detection for Gaussian Processes(1610.05440)

Oct. 18, 2016 stat.ME
We propose a new method for automatically detecting monotonic input-output relationships from data using Gaussian Process (GP) models with virtual derivative observations. Our results on synthetic and real datasets show that the proposed method detects monotonic directions from input spaces with high accuracy. We expect the method to be useful especially for improving explainability of the models and improving the accuracy of regression and classification tasks, especially near the edges of the data or when extrapolating.
• ### Pareto Smoothed Importance Sampling(1507.02646)

July 2, 2019 stat.CO, stat.ME, stat.ML
Importance weighting is a general way to adjust Monte Carlo integration to account for draws from the wrong distribution, but the resulting estimate can be noisy when the importance ratios have a heavy right tail. This routinely occurs when there are aspects of the target distribution that are not well captured by the approximating distribution, in which case more stable estimates can be obtained by modifying extreme importance ratios. We present a new method for stabilizing importance weights using a generalized Pareto distribution fit to the upper tail of the distribution of the simulated importance ratios. The method, which empirically performs better than existing methods for stabilizing importance sampling estimates, includes stabilized effective sample size estimates, Monte Carlo error estimates and convergence diagnostics.
• ### Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC(1507.04544)

Sept. 12, 2016 stat.CO, stat.ME
Leave-one-out cross-validation (LOO) and the widely applicable information criterion (WAIC) are methods for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model using the log-likelihood evaluated at the posterior simulations of the parameter values. LOO and WAIC have various advantages over simpler estimates of predictive error such as AIC and DIC but are less used in practice because they involve additional computational steps. Here we lay out fast and stable computations for LOO and WAIC that can be performed using existing simulation draws. We introduce an efficient computation of LOO using Pareto-smoothed importance sampling (PSIS), a new procedure for regularizing importance weights. Although WAIC is asymptotically equal to LOO, we demonstrate that PSIS-LOO is more robust in the finite case with weak priors or influential observations. As a byproduct of our calculations, we also obtain approximate standard errors for estimated predictive errors and for comparing of predictive errors between two models. We implement the computations in an R package called 'loo' and demonstrate using models fit with the Bayesian inference package Stan.
• ### Projection predictive model selection for Gaussian processes(1510.04813)

Aug. 2, 2016 stat.ME
We propose a new method for simplification of Gaussian process (GP) models by projecting the information contained in the full encompassing model and selecting a reduced number of variables based on their predictive relevance. Our results on synthetic and real world datasets show that the proposed method improves the assessment of variable relevance compared to the automatic relevance determination (ARD) via the length-scale parameters. We expect the method to be useful for improving explainability of the models, reducing the future measurement costs and reducing the computation time for making new predictions.
• ### Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models(1412.7461)

May 23, 2016 stat.CO, stat.ML
The future predictive performance of a Bayesian model can be estimated using Bayesian cross-validation. In this article, we consider Gaussian latent variable models where the integration over the latent values is approximated using the Laplace method or expectation propagation (EP). We study the properties of several Bayesian leave-one-out (LOO) cross-validation approximations that in most cases can be computed with a small additional cost after forming the posterior approximation given the full data. Our main objective is to assess the accuracy of the approximative LOO cross-validation estimators. That is, for each method (Laplace and EP) we compare the approximate fast computation with the exact brute force LOO computation. Secondarily, we evaluate the accuracy of the Laplace and EP approximations themselves against a ground truth established through extensive Markov chain Monte Carlo simulation. Our empirical results show that the approach based upon a Gaussian approximation to the LOO marginal distribution (the so-called cavity distribution) gives the most accurate and reliable results among the fast methods.
• ### Chained Gaussian Processes(1604.05263)

April 18, 2016 cs.LG, stat.ML
Gaussian process models are flexible, Bayesian non-parametric approaches to regression. Properties of multivariate Gaussians mean that they can be combined linearly in the manner of additive models and via a link function (like in generalized linear models) to handle non-Gaussian data. However, the link function formalism is restrictive, link functions are always invertible and must convert a parameter of interest to a linear combination of the underlying processes. There are many likelihoods and models where a non-linear combination is more appropriate. We term these more general models Chained Gaussian Processes: the transformation of the GPs to the likelihood parameters will not generally be invertible, and that implies that linearisation would only be possible with multiple (localized) links, i.e. a chain. We develop an approximate inference procedure for Chained GPs that is scalable and applicable to any factorized likelihood. We demonstrate the approximation on a range of likelihood functions.
• ### Comparison of Bayesian predictive methods for model selection(1503.08650)

March 23, 2016 stat.ME, cs.LG
The goal of this paper is to compare several widely used Bayesian model selection methods in practical model selection problems, highlight their differences and give recommendations about the preferred approaches. We focus on the variable subset selection for regression and classification and perform several numerical experiments using both simulated and real world data. The results show that the optimization of a utility estimate such as the cross-validation (CV) score is liable to finding overfitted models due to relatively high variance in the utility estimates when the data is scarce. This can also lead to substantial selection induced bias and optimism in the performance evaluation for the selected model. From a predictive viewpoint, best results are obtained by accounting for model uncertainty by forming the full encompassing model, such as the Bayesian model averaging solution over the candidate models. If the encompassing model is too complex, it can be robustly simplified by the projection method, in which the information of the full model is projected onto the submodels. This approach is substantially less prone to overfitting than selection based on CV-score. Overall, the projection method appears to outperform also the maximum a posteriori model and the selection of the most probable variables. The study also demonstrates that the model selection can greatly benefit from using cross-validation outside the searching process both for guiding the model size selection and assessing the predictive performance of the finally selected model.
• ### Hierarchical expectation propagation for Bayesian aggregation of average data(1602.02055)

Feb. 5, 2016 stat.ME, stat.AP
We often wish to use external data to improve the precision of an inference, but concerns arise when the different datasets have been collected under different conditions so that we do not want to simply pool the information. This is the well-known problem of meta-analysis, for which Bayesian methods have long been used to achieve partial pooling. Here we consider the challenge when the external data are averages rather than raw data. We provide a Bayesian solution by using simulation to approximate the likelihood of the external summary, and by allowing the parameters in the model to vary under the different conditions. Inferences are constructed using importance sampling from an approximate distribution determined by an expectation propagation-like algorithm. We demonstrate with the problem that motivated this research, a hierarchical nonlinear model in pharmacometrics, implementing the computation in Stan.