
Bayesian optimization (BO) is a global optimization strategy designed to find
the minimum of an expensive blackbox function, typically defined on a compact
subset of $\mathcal{R}^d$, by using a Gaussian process (GP) as a surrogate
model for the objective. Although currently available acquisition functions
address this goal with different degree of success, an overexploration effect
of the contour of the search space is typically observed. However, in problems
like the configuration of machine learning algorithms, the function domain is
conservatively large and with a high probability the global minimum does not
sit on the boundary of the domain. We propose a method to incorporate this
knowledge into the search process by adding virtual derivative observations in
the \gp at the boundary of the search space. We use the properties of GPs to
impose conditions on the partial derivatives of the objective. The method is
applicable with any acquisition function, it is easy to use and consistently
reduces the number of evaluations required to optimize the objective
irrespective of the acquisition used. We illustrate the benefits of our
approach in an extensive experimental comparison.

Approximate Bayesian computation (ABC) is a method for Bayesian inference
when the likelihood is unavailable but simulating from the model is possible.
However, many ABC algorithms require a large number of simulations, which can
be costly. To reduce the computational cost, Bayesian optimisation (BO) and
surrogate models such as Gaussian processes have been proposed. Bayesian
optimisation enables one to intelligently decide where to evaluate the model
next but common BO strategies are not designed for the goal of estimating the
posterior distribution. Our paper addresses this gap in the literature. We
propose to compute the uncertainty in the ABC posterior density, which is due
to a lack of simulations to estimate this quantity accurately, and define a
loss function that measures this uncertainty. We then propose to select the
next evaluation location to minimise the expected loss. Experiments show that
the proposed method often produces the most accurate approximations as compared
to common BO strategies.

Engine for LikelihoodFree Inference (ELFI) is a Python software library for
performing likelihoodfree inference (LFI). ELFI provides a convenient syntax
for arranging components in LFI, such as priors, simulators, summaries or
distances, to a network called ELFI graph. The components can be implemented in
a wide variety of languages. The standalone ELFI graph can be used with any of
the available inference methods without modifications. A central method
implemented in ELFI is Bayesian Optimization for LikelihoodFree Inference
(BOLFI), which has recently been shown to accelerate likelihoodfree inference
up to several orders of magnitude by surrogatemodelling the distance. ELFI
also has an inbuilt support for output data storing for reuse and analysis, and
supports parallelization of computation from multiple cores up to a cluster
environment. ELFI is designed to be extensible and provides interfaces for
widening its functionality. This makes the adding of new inference methods to
ELFI straightforward and automatically compatible with the inbuilt features.

Bayesian data analysis is about more than just computing a posterior
distribution, and Bayesian visualization is about more than trace plots of
Markov chains. Practical Bayesian data analysis, like all data analysis, is an
iterative process of model building, inference, model checking and evaluation,
and model expansion. Visualization is helpful in each of these stages of the
Bayesian workflow and it is indispensable when drawing inferences from the
types of modern, highdimensional models that are used by applied researchers.

Verifying the correctness of Bayesian computation is challenging. This is
especially true for complex models that are common in practice, as these
require sophisticated model implementations and algorithms. In this paper we
introduce \emph{simulationbased calibration} (SBC), a general procedure for
validating inferences from Bayesian algorithms capable of generating posterior
samples. This procedure not only identifies inaccurate computation and
inconsistencies in model implementations but also provides graphical summaries
that can indicate the nature of the problems that arise. We argue that SBC is a
critical part of a robust Bayesian workflow, as well as being a useful tool for
those developing computational algorithms and statistical software.

Gaussian graphical models are used for determining conditional relationships
between variables. This is accomplished by identifying offdiagonal elements in
the inversecovariance matrix that are nonzero. When the ratio of variables
(p) to observations (n) approaches one, the maximum likelihood estimator of the
covariance matrix becomes unstable and requires shrinkage estimation. Whereas
several classical (frequentist) methods have been introduced to address this
issue, Bayesian methods remain relatively uncommon in practice and
methodological literatures. Here we introduce a Bayesian method for estimating
sparse matrices, in which conditional relationships are determined with
projection predictive selection. This method uses KullbackLeibler divergence
and crossvalidation for variable selection, in addition to the horseshoe prior
for regularization. Through simulation and an applied example, we demonstrate
that the proposed method often outperforms classical methods, such as the
graphical lasso, as well as an alternative Bayesian method with respect to edge
identification and frequentist risk. Further, projection predictive selection
consistently had the lowest false positive rate, both with simulated and real
data. We end by discussing future directions and contributions to the Bayesian
literature on the topic of sparsity.

A common divideandconquer approach for Bayesian computation with big data
is to partition the data, perform local inference for each piece separately,
and combine the results to obtain a global posterior approximation. While being
conceptually and computationally appealing, this method involves the
problematic need to also split the prior for the local inferences; these
weakened priors may not provide enough regularization for each separate
computation, thus eliminating one of the key advantages of Bayesian methods. To
resolve this dilemma while still retaining the generalizability of the
underlying local inference method, we apply the idea of expectation propagation
(EP) as a framework for distributed Bayesian inference. The central idea is to
iteratively update approximations to the local likelihoods given the state of
the other approximations and the prior. The present paper has two roles: we
review the steps that are needed to keep EP algorithms numerically stable, and
we suggest a general approach, inspired by EP, for approaching data
partitioning problems in a way that achieves the computational benefits of
parallelism while allowing each local update to make use of relevant
information from the other sites. In addition, we demonstrate how the method
can be applied in a hierarchical context to make use of partitioning of both
data and parameters. The paper describes a general algorithmic framework,
rather than a specific algorithm, and presents an example implementation for
it.

In humanintheloop machine learning, the user provides information beyond
that in the training data. Many algorithms and user interfaces have been
designed to optimize and facilitate this humanmachine interaction; however,
fewer studies have addressed the potential defects the designs can cause.
Effective interaction often requires exposing the user to the training data or
its statistics. The design of the system is then critical, as this can lead to
double use of data and overfitting, if the user reinforces noisy patterns in
the data. We propose a user modelling methodology, by assuming simple rational
behaviour, to correct the problem. We show, in a user study with 48
participants, that the method improves predictive performance in a sparse
linear regression sentiment analysis task, where graded user knowledge on
feature relevance is elicited. We believe that the key idea of inferring user
knowledge with probabilistic user models has general applicability in guarding
against overfitting and improving interactive machine learning.

Approximate Bayesian computation (ABC) can be used for model fitting when the
likelihood function is intractable but simulating from the model is feasible.
However, even a single evaluation of a complex model may take several hours,
limiting the number of model evaluations available. Modelling the discrepancy
between the simulated and observed data using a Gaussian process (GP) can be
used to reduce the number of model evaluations required by ABC, but the
sensitivity of this approach to a specific GP formulation has not yet been
thoroughly investigated. We begin with a comprehensive empirical evaluation of
using GPs in ABC, including various transformations of the discrepancies and
two novel GP formulations. Our results indicate the choice of GP may
significantly affect the accuracy of the estimated posterior distribution.
Selection of an appropriate GP model is thus important. We formulate expected
utility to measure the accuracy of classifying discrepancies below or above the
ABC threshold, and show that it can be used to automate the GP model selection
step. Finally, based on the understanding gained with toy examples, we fit a
population genetic model for bacteria, providing insight into horizontal gene
transfer events within the population and from external origins.

While it's always possible to compute a variational approximation to a
posterior distribution, it can be difficult to discover problems with this
approximation". We propose two diagnostic algorithms to alleviate this problem.
The Paretosmoothed importance sampling (PSIS) diagnostic gives a goodness of
fit measurement for joint distributions, while simultaneously improving the
error in the estimate. The variational simulationbased calibration (VSBC)
assesses the average performance of point estimates.

In this work, we address the problem of solving a series of underdetermined
linear inverse problems subject to a sparsity constraint. We generalize the
spikeandslab prior distribution to encode a priori correlation of the support
of the solution in both space and time by imposing a transformed Gaussian
process on the spikeandslab probabilities. An expectation propagation (EP)
algorithm for posterior inference under the proposed model is derived. For
large scale problems, the standard EP algorithm can be prohibitively slow. We
therefore introduce three different approximation schemes to reduce the
computational complexity. Finally, we demonstrate the proposed model using
numerical experiments based on both synthetic and real data sets.

In highdimensional prediction problems, where the number of features may
greatly exceed the number of training instances, fully Bayesian approach with a
sparsifying prior is known to produce good results but is computationally
challenging. To alleviate this computational burden, we propose to use a
preprocessing step where we first apply a dimension reduction to the original
data to reduce the number of features to something that is computationally
conveniently handled by Bayesian methods. To do this, we propose a new
dimension reduction technique, called iterative supervised principal components
(ISPC), which combines variable screening and dimension reduction and can be
considered as an extension to the existing technique of supervised principal
components (SPCs). Our empirical evaluations confirm that, although not
foolproof, the proposed approach provides very good results on several
microarray benchmark datasets with very affordable computation time, and can
also be very useful for visualizing highdimensional data.

The widely recommended procedure of Bayesian model averaging is flawed in the
Mopen setting in which the true datagenerating process is not one of the
candidate models being fit. We take the idea of stacking from the point
estimation literature and generalize to the combination of predictive
distributions, extending the utility function to any proper scoring rule, using
Pareto smoothed importance sampling to efficiently compute the required
leaveoneout posterior distributions and regularization to get more stability.
We compare stacking of predictive distributions to several alternatives:
stacking of means, Bayesian model averaging (BMA), pseudoBMA using AICtype
weighting, and a variant of pseudoBMA that is stabilized using the Bayesian
bootstrap. Based on simulations and realdata applications, we recommend
stacking of predictive distributions, with BBpseudoBMA as an approximate
alternative when computation cost is an issue.

Minimum energy paths for transitions such as atomic and/or spin
rearrangements in thermalized systems are the transition paths of largest
statistical weight. Such paths are frequently calculated using the nudged
elastic band method, where an initial path is iteratively shifted to the
nearest minimum energy path. The computational effort can be large, especially
when ab initio or electron density functional calculations are used to evaluate
the energy and atomic forces. Here, we show how the number of such evaluations
can be reduced by an order of magnitude using a Gaussian process regression
approach where an approximate energy surface is generated and refined in each
iteration. When the goal is to evaluate the transition rate within harmonic
transition state theory, the evaluation of the Hessian matrix at the initial
and final state minima can be carried out beforehand and used as input in the
minimum energy path calculation, thereby improving stability and reducing the
number of iterations needed for convergence. A Gaussian process model also
provides an uncertainty estimate for the approximate energy surface, and this
can be used to focus the calculations on the lesserknown part of the path,
thereby reducing the number of needed energy and force evaluations to a half in
the present calculations. The methodology is illustrated using the
twodimensional M\"ullerBrown potential surface and performance assessed on an
established benchmark involving 13 rearrangement transitions of a heptamer
island on a solid surface.

The horseshoe prior has proven to be a noteworthy alternative for sparse
Bayesian estimation, but has previously suffered from two problems. First,
there has been no systematic way of specifying a prior for the global shrinkage
hyperparameter based on the prior information about the degree of sparsity in
the parameter vector. Second, the horseshoe prior has the undesired property
that there is no possibility of specifying separately information about
sparsity and the amount of regularization for the largest coefficients, which
can be problematic with weakly identified parameters, such as the logistic
regression coefficients in the case of data separation. This paper proposes
solutions to both of these problems. We introduce a concept of effective number
of nonzero parameters, show an intuitive way of formulating the prior for the
global hyperparameter based on the sparsity assumptions, and argue that the
previous default choices are dubious based on their tendency to favor solutions
with more unshrunk parameters than we typically expect a priori. Moreover, we
introduce a generalization to the horseshoe prior, called the regularized
horseshoe, that allows us to specify a minimum level of regularization to the
largest values. We show that the new prior can be considered as the continuous
counterpart of the spikeandslab prior with a finite slab width, whereas the
original horseshoe resembles the spikeandslab with an infinitely wide slab.
Numerical experiments on synthetic and real world data illustrate the benefit
of both of these theoretical advances.

The horseshoe prior has proven to be a noteworthy alternative for sparse
Bayesian estimation, but as shown in this paper, the results can be sensitive
to the prior choice for the global shrinkage hyperparameter. We argue that the
previous default choices are dubious due to their tendency to favor solutions
with more unshrunk coefficients than we typically expect a priori. This can
lead to bad results if this parameter is not strongly identified by data. We
derive the relationship between the global parameter and the effective number
of nonzeros in the coefficient vector, and show an easy and intuitive way of
setting up the prior for the global parameter based on our prior beliefs about
the number of nonzero coefficients in the model. The results on real world data
show that one can benefit greatly  in terms of improved parameter estimates,
prediction accuracy, and reduced computation time  from transforming even a
crude guess for the number of nonzero coefficients into the prior for the
global parameter using our framework.

The calculation of minimum energy paths for transitions such as atomic and/or
spin rearrangements is an important task in many contexts and can often be
used to determine the mechanism and rate of transitions. An important challenge
is to reduce the computational effort in such calculations, especially when ab
initio or electron density functional calculations are used to evaluate the
energy since they can require large computational effort. Gaussian process
regression is used here to reduce significantly the number of energy
evaluations needed to find minimum energy paths of atomic rearrangements. By
using results of previous calculations to construct an approximate energy
surface and then converge to the minimum energy path on that surface in each
Gaussian process iteration, the number of energy evaluations is reduced
significantly as compared with regular nudged elastic band calculations. For a
test problem involving rearrangements of a heptamer island on a crystal
surface, the number of energy evaluations is reduced to less than a fifth. The
scaling of the computational effort with the number of degrees of freedom as
well as various possible further improvements to this approach are discussed.

We propose a new method for automatically detecting monotonic inputoutput
relationships from data using Gaussian Process (GP) models with virtual
derivative observations. Our results on synthetic and real datasets show that
the proposed method detects monotonic directions from input spaces with high
accuracy. We expect the method to be useful especially for improving
explainability of the models and improving the accuracy of regression and
classification tasks, especially near the edges of the data or when
extrapolating.

Importance weighting is a general way to adjust Monte Carlo integration to
account for draws from the wrong distribution, but the resulting estimate can
be noisy when the importance ratios have a heavy right tail. This routinely
occurs when there are aspects of the target distribution that are not well
captured by the approximating distribution, in which case more stable estimates
can be obtained by modifying extreme importance ratios. We present a new method
for stabilizing importance weights using a generalized Pareto distribution fit
to the upper tail of the distribution of the simulated importance ratios. The
method, which empirically performs better than existing methods for stabilizing
importance sampling estimates, includes stabilized effective sample size
estimates, Monte Carlo error estimates and convergence diagnostics.

Leaveoneout crossvalidation (LOO) and the widely applicable information
criterion (WAIC) are methods for estimating pointwise outofsample prediction
accuracy from a fitted Bayesian model using the loglikelihood evaluated at the
posterior simulations of the parameter values. LOO and WAIC have various
advantages over simpler estimates of predictive error such as AIC and DIC but
are less used in practice because they involve additional computational steps.
Here we lay out fast and stable computations for LOO and WAIC that can be
performed using existing simulation draws. We introduce an efficient
computation of LOO using Paretosmoothed importance sampling (PSIS), a new
procedure for regularizing importance weights. Although WAIC is asymptotically
equal to LOO, we demonstrate that PSISLOO is more robust in the finite case
with weak priors or influential observations. As a byproduct of our
calculations, we also obtain approximate standard errors for estimated
predictive errors and for comparing of predictive errors between two models. We
implement the computations in an R package called 'loo' and demonstrate using
models fit with the Bayesian inference package Stan.

We propose a new method for simplification of Gaussian process (GP) models by
projecting the information contained in the full encompassing model and
selecting a reduced number of variables based on their predictive relevance.
Our results on synthetic and real world datasets show that the proposed method
improves the assessment of variable relevance compared to the automatic
relevance determination (ARD) via the lengthscale parameters. We expect the
method to be useful for improving explainability of the models, reducing the
future measurement costs and reducing the computation time for making new
predictions.

The future predictive performance of a Bayesian model can be estimated using
Bayesian crossvalidation. In this article, we consider Gaussian latent
variable models where the integration over the latent values is approximated
using the Laplace method or expectation propagation (EP). We study the
properties of several Bayesian leaveoneout (LOO) crossvalidation
approximations that in most cases can be computed with a small additional cost
after forming the posterior approximation given the full data. Our main
objective is to assess the accuracy of the approximative LOO crossvalidation
estimators. That is, for each method (Laplace and EP) we compare the
approximate fast computation with the exact brute force LOO computation.
Secondarily, we evaluate the accuracy of the Laplace and EP approximations
themselves against a ground truth established through extensive Markov chain
Monte Carlo simulation. Our empirical results show that the approach based upon
a Gaussian approximation to the LOO marginal distribution (the socalled cavity
distribution) gives the most accurate and reliable results among the fast
methods.

Gaussian process models are flexible, Bayesian nonparametric approaches to
regression. Properties of multivariate Gaussians mean that they can be combined
linearly in the manner of additive models and via a link function (like in
generalized linear models) to handle nonGaussian data. However, the link
function formalism is restrictive, link functions are always invertible and
must convert a parameter of interest to a linear combination of the underlying
processes. There are many likelihoods and models where a nonlinear combination
is more appropriate. We term these more general models Chained Gaussian
Processes: the transformation of the GPs to the likelihood parameters will not
generally be invertible, and that implies that linearisation would only be
possible with multiple (localized) links, i.e. a chain. We develop an
approximate inference procedure for Chained GPs that is scalable and applicable
to any factorized likelihood. We demonstrate the approximation on a range of
likelihood functions.

The goal of this paper is to compare several widely used Bayesian model
selection methods in practical model selection problems, highlight their
differences and give recommendations about the preferred approaches. We focus
on the variable subset selection for regression and classification and perform
several numerical experiments using both simulated and real world data. The
results show that the optimization of a utility estimate such as the
crossvalidation (CV) score is liable to finding overfitted models due to
relatively high variance in the utility estimates when the data is scarce. This
can also lead to substantial selection induced bias and optimism in the
performance evaluation for the selected model. From a predictive viewpoint,
best results are obtained by accounting for model uncertainty by forming the
full encompassing model, such as the Bayesian model averaging solution over the
candidate models. If the encompassing model is too complex, it can be robustly
simplified by the projection method, in which the information of the full model
is projected onto the submodels. This approach is substantially less prone to
overfitting than selection based on CVscore. Overall, the projection method
appears to outperform also the maximum a posteriori model and the selection of
the most probable variables. The study also demonstrates that the model
selection can greatly benefit from using crossvalidation outside the searching
process both for guiding the model size selection and assessing the predictive
performance of the finally selected model.

We often wish to use external data to improve the precision of an inference,
but concerns arise when the different datasets have been collected under
different conditions so that we do not want to simply pool the information.
This is the wellknown problem of metaanalysis, for which Bayesian methods
have long been used to achieve partial pooling. Here we consider the challenge
when the external data are averages rather than raw data. We provide a Bayesian
solution by using simulation to approximate the likelihood of the external
summary, and by allowing the parameters in the model to vary under the
different conditions. Inferences are constructed using importance sampling from
an approximate distribution determined by an expectation propagationlike
algorithm. We demonstrate with the problem that motivated this research, a
hierarchical nonlinear model in pharmacometrics, implementing the computation
in Stan.