
Approximate Bayesian computation (ABC) is a method for Bayesian inference
when the likelihood is unavailable but simulating from the model is possible.
However, many ABC algorithms require a large number of simulations, which can
be costly. To reduce the computational cost, Bayesian optimisation (BO) and
surrogate models such as Gaussian processes have been proposed. Bayesian
optimisation enables one to intelligently decide where to evaluate the model
next, but common BO strategies are not designed for the goal of estimating the
posterior distribution. Our paper addresses this gap in the literature. We
propose to compute the uncertainty in the ABC posterior density, which is due
to a lack of simulations to estimate this quantity accurately, and define a
loss function that measures this uncertainty. We then propose to select the
next evaluation location to minimise the expected loss. Experiments show that
the proposed method often produces the most accurate approximations when
compared with common BO strategies.
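
For context, the basic ABC rejection scheme whose simulation cost such methods
aim to reduce can be sketched as follows. This is a minimal illustration and
not the method proposed above; the Gaussian toy simulator, the uniform prior,
the threshold value, and all function names are assumptions made for the
example.

```python
import numpy as np

def simulate(theta, rng, n=50):
    # Hypothetical toy simulator: n draws from N(theta, 1).
    return rng.normal(theta, 1.0, size=n)

def discrepancy(sim, obs):
    # Absolute difference between sample means (the summary statistic).
    return abs(sim.mean() - obs.mean())

def abc_rejection(obs, n_sim, threshold, rng):
    # Draw from the prior, simulate, keep draws with a small discrepancy.
    accepted = []
    for _ in range(n_sim):
        theta = rng.uniform(-5.0, 5.0)  # assumed uniform prior
        if discrepancy(simulate(theta, rng), obs) < threshold:
            accepted.append(theta)
    return np.array(accepted)

rng = np.random.default_rng(0)
observed = simulate(2.0, rng)
samples = abc_rejection(observed, n_sim=5000, threshold=0.2, rng=rng)
```

Most of the 5000 simulations are rejected, which is the cost that surrogate
models and a careful choice of evaluation locations try to avoid.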

Engine for Likelihood-Free Inference (ELFI) is a Python software library for
performing likelihood-free inference (LFI). ELFI provides a convenient syntax
for arranging components in LFI, such as priors, simulators, summaries or
distances, into a network called the ELFI graph. The components can be
implemented in a wide variety of languages. The stand-alone ELFI graph can be
used with any of the available inference methods without modification. A
central method implemented in ELFI is Bayesian Optimization for Likelihood-Free
Inference (BOLFI), which has recently been shown to accelerate likelihood-free
inference by up to several orders of magnitude by surrogate-modelling the
distance. ELFI also has built-in support for storing output data for reuse and
analysis, and supports parallelization of computation from multiple cores up to
a cluster environment. ELFI is designed to be extensible and provides
interfaces for widening its functionality. This makes adding new inference
methods to ELFI straightforward and automatically compatible with the built-in
features.
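
The idea of arranging components into a graph can be illustrated with a toy
sketch in plain Python. This is not ELFI's API: the `Node` class, the
`rejection` function, and every other name below are hypothetical, and the
example only shows how a prior, a simulator, a summary and a distance can be
wired together so that an inference method needs nothing but the final node.

```python
import numpy as np

class Node:
    """Hypothetical graph node: a function plus its parent nodes."""
    def __init__(self, fn, *parents):
        self.fn = fn
        self.parents = parents

    def generate(self, rng, cache=None):
        # Evaluate parents first; cache so shared parents are computed once.
        cache = {} if cache is None else cache
        if self not in cache:
            inputs = [p.generate(rng, cache) for p in self.parents]
            cache[self] = self.fn(rng, *inputs)
        return cache[self]

observed = np.array([1.8, 2.3, 2.1, 1.9, 2.4])

# Components: prior -> simulator -> summary -> distance.
prior = Node(lambda rng: rng.uniform(0.0, 4.0))
sim = Node(lambda rng, mu: rng.normal(mu, 0.5, size=observed.size), prior)
summary = Node(lambda rng, y: y.mean(), sim)
distance = Node(lambda rng, s: abs(s - observed.mean()), summary)

def rejection(distance_node, param_node, n, threshold, rng):
    # A generic inference method that only needs the distance node
    # and the parameter node whose values it should report.
    kept = []
    for _ in range(n):
        cache = {}
        if distance_node.generate(rng, cache) < threshold:
            kept.append(cache[param_node])
    return np.array(kept)

posterior = rejection(distance, prior, n=4000, threshold=0.1,
                      rng=np.random.default_rng(1))
```

Because `rejection` sees only nodes, the same graph could be handed to a
different inference routine without changing the component definitions.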

Approximate Bayesian computation (ABC) can be used for model fitting when the
likelihood function is intractable but simulating from the model is feasible.
However, even a single evaluation of a complex model may take several hours,
limiting the number of model evaluations available. Modelling the discrepancy
between the simulated and observed data using a Gaussian process (GP) can be
used to reduce the number of model evaluations required by ABC, but the
sensitivity of this approach to a specific GP formulation has not yet been
thoroughly investigated. We begin with a comprehensive empirical evaluation of
using GPs in ABC, including various transformations of the discrepancies and
two novel GP formulations. Our results indicate that the choice of GP may
significantly affect the accuracy of the estimated posterior distribution.
Selection of an appropriate GP model is thus important. We formulate an expected
utility to measure the accuracy of classifying discrepancies below or above the
ABC threshold, and show that it can be used to automate the GP model selection
step. Finally, based on the understanding gained with toy examples, we fit a
population genetic model for bacteria, providing insight into horizontal gene
transfer events within the population and from external origins.
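
To make the classification view concrete, the sketch below fits a simple GP
regression to simulated discrepancies and computes the model-based probability
that the discrepancy at each parameter value falls below the ABC threshold. It
is a minimal illustration under assumptions not taken from the text: a
squared-exponential kernel with fixed hyperparameters, a constant prior mean, a
toy discrepancy function, and hypothetical names throughout.

```python
import math
import numpy as np

def rbf(a, b, length=1.0, var=1.0):
    # Squared-exponential kernel between 1-D input arrays.
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / length) ** 2)

def gp_predict(x_train, y_train, x_test, noise_var=0.05):
    # Standard GP regression with a constant mean equal to the training mean.
    mean = y_train.mean()
    K = rbf(x_train, x_train) + noise_var * np.eye(x_train.size)
    Ks = rbf(x_test, x_train)
    alpha = np.linalg.solve(K, y_train - mean)
    mu = mean + Ks @ alpha
    v = np.linalg.solve(K, Ks.T)
    var = rbf(x_test, x_test).diagonal() - np.sum(Ks * v.T, axis=1)
    return mu, np.maximum(var, 1e-12)

def prob_below(mu, var, threshold, noise_var=0.05):
    # P(discrepancy < threshold) under the Gaussian predictive distribution.
    z = (threshold - mu) / np.sqrt(var + noise_var)
    return 0.5 * (1.0 + np.array([math.erf(t / math.sqrt(2.0)) for t in z]))

rng = np.random.default_rng(0)
theta = rng.uniform(-3.0, 3.0, size=40)               # evaluated parameters
disc = np.abs(theta - 1.0) + rng.normal(0, 0.2, 40)   # toy discrepancies
grid = np.linspace(-3.0, 3.0, 61)
mu, var = gp_predict(theta, disc, grid)
p_accept = prob_below(mu, var, threshold=0.5)
best = grid[np.argmax(p_accept)]
```

Different GP formulations would change `mu` and `var`, and therefore which
parameter values are classified as falling below the threshold, which is why
the model choice matters for the estimated posterior.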

Predicting the efficacy of a drug for a given individual, using
high-dimensional genomic measurements, is at the core of precision medicine.
However, identifying features on which to base the predictions remains a
challenge, especially when the sample size is small. Incorporating expert
knowledge offers a promising alternative to improve a prediction model, but
collecting such knowledge is laborious for the expert if the number of candidate
features is very large. We introduce a probabilistic model that can incorporate
expert feedback about the impact of genomic measurements on the sensitivity of
a cancer cell for a given drug. We also present two methods to intelligently
collect this feedback from the expert, using experimental design and
multi-armed bandit models. In a multiple myeloma blood cancer data set (n=51),
expert knowledge decreased the prediction error by 8%. Furthermore, the
intelligent approaches can be used to reduce the workload of feedback
collection to less than 30% on average compared to a naive approach.
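
As a rough illustration of how a multi-armed bandit can prioritise which
features to ask the expert about, the sketch below runs Thompson sampling with
Beta-Bernoulli arms. It is not the model described above: the simulated expert,
the per-feature relevance probabilities, and all names are hypothetical.

```python
import numpy as np

def thompson_queries(true_relevance, n_queries, rng):
    # One arm per feature; reward 1 if the simulated expert says "relevant".
    n = len(true_relevance)
    alpha = np.ones(n)  # Beta posterior: 1 + number of "relevant" answers
    beta = np.ones(n)   # Beta posterior: 1 + number of "not relevant" answers
    counts = np.zeros(n, dtype=int)
    for _ in range(n_queries):
        draws = rng.beta(alpha, beta)   # sample a plausible relevance per arm
        arm = int(np.argmax(draws))     # query the most promising feature
        reward = float(rng.random() < true_relevance[arm])
        alpha[arm] += reward
        beta[arm] += 1.0 - reward
        counts[arm] += 1
    return counts

rng = np.random.default_rng(0)
relevance = np.array([0.05, 0.1, 0.9, 0.1, 0.05])  # hypothetical expert
counts = thompson_queries(relevance, n_queries=200, rng=rng)
```

The queries concentrate on the feature the simulated expert most often marks as
relevant, so fewer questions are spent on uninformative features than with a
uniform sweep.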

Providing accurate predictions is challenging for machine learning algorithms
when the number of features is larger than the number of samples in the data.
Prior knowledge can improve machine learning models by indicating relevant
variables and parameter values. Yet, this prior knowledge is often tacit and
only available from domain experts. We present a novel approach that uses
interactive visualization to elicit the tacit prior knowledge and uses it to
improve the accuracy of prediction models. The main component of our approach
is a user model that models the domain expert's knowledge of the relevance of
different features for a prediction task. In particular, based on the expert's
earlier input, the user model guides the selection of the features on which to
elicit the user's knowledge next. The results of a controlled user study show that
the user model significantly improves prior knowledge elicitation and
prediction accuracy, when predicting the relative citation counts of scientific
documents in a specific domain.

In high-dimensional data, structured noise caused by observed and unobserved
factors affecting multiple target variables simultaneously poses a serious
challenge for modeling by masking the often weak signal. Therefore, (1)
explaining away the structured noise in multiple-output regression is of
paramount importance. Additionally, (2) assumptions about the correlation
structure of the regression weights are needed. We note that both can be
formulated in a natural way in a latent variable model, in which both the
interesting signal and the noise are mediated through the same latent factors.
Under this assumption, the signal model then borrows strength from the noise
model by encouraging similar effects on correlated targets. We introduce a
hyperparameter for the \emph{latent signal-to-noise ratio} which turns out to
be important for modelling weak signals, and an ordered infinite-dimensional
shrinkage prior that resolves the rotational unidentifiability in reduced-rank
regression models. Simulations and prediction experiments with metabolite, gene
expression, fMRI measurement, and macroeconomic time series data show that our
model equals or exceeds the state-of-the-art performance and, in particular,
outperforms the standard approach of assuming independent noise and signal
models.
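
One way to write a model in which signal and noise are mediated through the
same latent factors, given here as an assumption for illustration rather than
the exact formulation, is Y = (X Psi + Omega) Gamma + E, where Omega holds the
latent noise factors and Gamma is a loading matrix shared by signal and noise.
The sketch below simulates from this form. The dimensions, the scalar `snr`
standing in for the latent signal-to-noise ratio, and all variable names are
hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, k = 500, 10, 8, 2   # samples, covariates, targets, latent factors
snr = 0.1                    # assumed latent signal-to-noise ratio

X = rng.normal(size=(n, p))
Psi = rng.normal(size=(p, k)) * np.sqrt(snr / p)  # covariates -> latent factors
Omega = rng.normal(size=(n, k))                   # latent (structured) noise
Gamma = rng.normal(size=(k, q))                   # shared latent -> target loadings
E = 0.1 * rng.normal(size=(n, q))                 # independent residual noise

signal = X @ Psi @ Gamma
structured_noise = Omega @ Gamma
Y = signal + structured_noise + E

# Both the regression weights and the structured noise have rank at most k,
# and they act on the targets through the same rows of Gamma.
weights_rank = np.linalg.matrix_rank(Psi @ Gamma)
noise_rank = np.linalg.matrix_rank(structured_noise)
explained = signal.var() / Y.var()   # share of target variance due to covariates
```

Because the signal and the structured noise share `Gamma`, targets that are
correlated through the noise also receive similar regression effects, which is
the sense in which the signal model borrows strength from the noise model.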

We consider the prediction of weak effects in a multiple-output regression
setup, when covariates are expected to explain a small amount, less than
$\approx 1\%$, of the variance of the target variables. To facilitate the
prediction of the weak effects, we constrain our model structure by introducing
a novel Bayesian approach of sharing information between the regression model
and the noise model. Further reduction of the effective number of parameters is
achieved by introducing an infinite shrinkage prior and group sparsity in the
context of the Bayesian reduced-rank regression, and using the Bayesian
infinite factor model as a flexible low-rank noise model. In our experiments
the model incorporating the novelties outperformed alternatives in genomic
prediction of rich phenotype data. In particular, the information sharing
between the noise and regression models led to significant improvement in
prediction accuracy.

High-dimensional phenotypes hold promise for richer findings in association
studies, but testing of several phenotype traits aggravates the grand challenge
of association studies, that of multiple testing. Several methods have recently
been proposed for testing jointly all traits in a high-dimensional vector of
phenotypes, with the prospect of increased power to detect small effects that would
be missed if tested individually. However, the methods have rarely been
compared to the extent of enabling assessment of their relative merits and
setting up guidelines on which method to use, and how to use it. We compare the
methods on simulated data and with a real metabolomics data set comprising 137
highly correlated variables and approximately 550,000 SNPs.
Applying the methods to genome-wide data with hundreds of thousands of
markers inevitably requires division of the problem into manageable parts
facilitating parallel processing, parts corresponding to individual genetic
variants, pathways, or genes, for example. Here we utilize a straightforward
formulation according to which the genome is divided into blocks of nearby
correlated genetic markers, tested jointly for association with the phenotypes.
This formulation is computationally feasible, reduces the number of tests, and
lets the methods take advantage of combining information over several
correlated variables not only on the phenotype side, but also on the genotype
side.
Our experiments show that canonical correlation analysis has higher power
than alternative methods, while remaining computationally tractable for routine
use in the GWAS setting, provided the number of samples is sufficient compared
to the numbers of phenotype and genotype variables tested. Sparse canonical
correlation analysis and regression models with latent confounding factors show
promising performance when the number of samples is small.
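
A block-wise analysis of this kind can be sketched with canonical correlation
analysis computed from QR and singular value decompositions. This is a minimal
sketch under assumptions: the simulated marker block and phenotypes use
continuous stand-ins rather than real genotype data, the effect size and the
function name are hypothetical, and no significance test is included.

```python
import numpy as np

def canonical_correlations(X, Y):
    # Centre both views, orthonormalise their column spaces with QR;
    # the singular values of Qx^T Qy are then the canonical correlations.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

rng = np.random.default_rng(0)
n = 2000
# Hypothetical block of 5 correlated markers driven by one shared factor.
shared = rng.normal(size=(n, 1))
genotypes = shared + 0.5 * rng.normal(size=(n, 5))
# Hypothetical phenotypes: 4 correlated traits, affected by the block or not.
trait_factor = rng.normal(size=(n, 1))
phenotypes = 0.5 * shared + trait_factor + 0.5 * rng.normal(size=(n, 4))
null_phenotypes = trait_factor + 0.5 * rng.normal(size=(n, 4))

rho_assoc = canonical_correlations(genotypes, phenotypes)
rho_null = canonical_correlations(genotypes, null_phenotypes)
```

The leading canonical correlation pools information across the correlated
markers and the correlated traits, so it is clearly larger for the associated
block than for the null block. This relies on the number of samples being large
relative to the number of variables on both sides.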