
Inferring the relations between two images is an important class of tasks in
computer vision. Examples of such tasks include computing optical flow and
stereo disparity. We treat relation inference tasks as machine learning
problems and tackle them with neural networks. A key to the problem is learning a
representation of relations. We propose a new neural network module, contrast
association unit (CAU), which explicitly models the relations between two sets
of input variables. Due to the nonnegativity of the weights in CAU, we adopt a
multiplicative update algorithm for learning these weights. Experiments show
that neural networks with CAUs are more effective in learning five fundamental
image transformations than conventional neural networks.

Bayesian matrix factorization (BMF) is a powerful tool for producing low-rank
representations of matrices and for predicting missing values and providing
confidence intervals. Scaling up the posterior inference for massive-scale
matrices is challenging and requires distributing both data and computation
over many workers, making communication the main computational bottleneck.
Embarrassingly parallel inference would remove the communication needed, by
using completely independent computations on different data subsets, but it
suffers from the inherent unidentifiability of BMF solutions. We introduce a
hierarchical decomposition of the joint posterior distribution, which couples
the subset inferences, allowing for embarrassingly parallel computations in a
sequence of at most three stages. Using an efficient approximate
implementation, we show improvements empirically on both real and simulated
data. Our distributed approach is able to achieve a speedup of almost an order
of magnitude over the full posterior, with a negligible effect on predictive
accuracy. Our method outperforms state-of-the-art embarrassingly parallel MCMC
methods in accuracy, and achieves results competitive with other available
distributed and parallel implementations of BMF.

We consider the problem of parametric statistical inference when likelihood
computations are prohibitively expensive but sampling from the model is
possible. Several so-called likelihood-free methods have been developed to
perform inference in the absence of a likelihood function. The popular
synthetic likelihood approach infers the parameters by modelling summary
statistics of the data by a Gaussian probability distribution. In another
popular approach called approximate Bayesian computation, the inference is
performed by identifying parameter values for which the summary statistics of
the simulated data are close to those of the observed data. Synthetic
likelihood is easier to use as no measure of `closeness' is required but the
Gaussianity assumption is often limiting. Moreover, both approaches require
judiciously chosen summary statistics. We here present an alternative inference
approach that is as easy to use as synthetic likelihood but not as restricted
in its assumptions, and that, in a natural way, enables automatic selection of
relevant summary statistics from a large set of candidates. The basic idea is to
frame the problem of estimating the posterior as a problem of estimating the
ratio between the data generating distribution and the marginal distribution.
This problem can be solved by logistic regression, and including regularising
penalty terms enables automatic selection of the summary statistics relevant to
the inference task. We illustrate the general theory on canonical examples and
employ it to perform inference for challenging stochastic nonlinear dynamical
systems and high-dimensional summary statistics.
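
The ratio-estimation idea can be illustrated with a toy sketch. Everything
below is an assumption made for illustration and is not taken from the paper:
the Gaussian simulator, the uniform prior, the two candidate summaries, and the
names `simulate`, `summaries` and `fit_log_ratio` are hypothetical, and a plain
proximal gradient loop stands in for whatever solver the authors use. With equal
class sizes, the fitted log-odds of the logistic regression approximate the log
ratio between the data generating distribution at a candidate parameter and the
marginal distribution.

```python
import math
import random

def simulate(theta, rng):
    """Hypothetical simulator: one draw from N(theta, 1)."""
    return rng.gauss(theta, 1.0)

def summaries(x):
    """Candidate summary statistics (an illustrative choice)."""
    return [x, x * x]

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_log_ratio(theta, rng, n=300, steps=400, lr=0.5, l1=0.01):
    """Fit an L1-penalised logistic regression and return its log-odds function."""
    # Class 1: data simulated at the candidate parameter value.
    pos = [summaries(simulate(theta, rng)) for _ in range(n)]
    # Class 0: data from the marginal (parameter drawn from the assumed prior).
    neg = [summaries(simulate(rng.uniform(-5.0, 5.0), rng)) for _ in range(n)]
    X, y = pos + neg, [1.0] * n + [0.0] * n
    d = len(X[0])
    # Standardise the summaries so that one step size suits all of them.
    mean = [sum(r[j] for r in X) / len(X) for j in range(d)]
    std = [math.sqrt(sum((r[j] - mean[j]) ** 2 for r in X) / len(X)) for j in range(d)]
    Z = [[(r[j] - mean[j]) / std[j] for j in range(d)] for r in X]
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * d, 0.0
        for z, t in zip(Z, y):
            err = sigmoid(b + sum(wj * zj for wj, zj in zip(w, z))) - t
            gb += err
            for j in range(d):
                gw[j] += err * z[j]
        b -= lr * gb / len(Z)
        for j in range(d):
            wj = w[j] - lr * gw[j] / len(Z)
            # Soft-thresholding: the proximal step for the L1 penalty.
            w[j] = math.copysign(max(abs(wj) - lr * l1, 0.0), wj)

    def log_ratio(x):
        z = [(s - m) / sd for s, m, sd in zip(summaries(x), mean, std)]
        return b + sum(wj * zj for wj, zj in zip(w, z))

    return log_ratio

rng = random.Random(0)
x_obs = 1.0
near = fit_log_ratio(1.0, rng)(x_obs)  # candidate parameter close to the data
far = fit_log_ratio(4.0, rng)(x_obs)   # candidate parameter far from the data
```

Adding the log prior density to the estimated log ratio gives an estimate of
the log posterior, up to the approximation error of the classifier. The L1
penalty can shrink the weights of uninformative summaries to exactly zero,
which is the sense in which summary selection becomes automatic.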

Engine for Likelihood-Free Inference (ELFI) is a Python software library for
performing likelihood-free inference (LFI). ELFI provides a convenient syntax
for arranging components in LFI, such as priors, simulators, summaries or
distances, into a network called the ELFI graph. The components can be implemented in
a wide variety of languages. The standalone ELFI graph can be used with any of
the available inference methods without modifications. A central method
implemented in ELFI is Bayesian Optimization for Likelihood-Free Inference
(BOLFI), which has recently been shown to accelerate likelihood-free inference
by up to several orders of magnitude by surrogate-modelling the distance. ELFI
also has inbuilt support for output data storing for reuse and analysis, and
supports parallelization of computation from multiple cores up to a cluster
environment. ELFI is designed to be extensible and provides interfaces for
widening its functionality. This makes adding new inference methods to
ELFI straightforward and automatically compatible with the inbuilt features.

Inverse reinforcement learning (IRL) aims to explain observed strategic
behavior by fitting reinforcement learning models to behavioral data. However,
traditional IRL methods are only applicable when the observations are in the
form of state-action paths. This assumption may not hold in many real-world
modeling settings, where only partial or summarized observations are available.
In general, we may assume that there is a summarizing function $\sigma$, which
acts as a filter between us and the true state-action paths that constitute the
demonstration. Some initial approaches to extending IRL to such situations have
been presented, but with very specific assumptions about the structure of
$\sigma$, such as that only certain state observations are missing. This paper
instead focuses on the most general case of the problem, where no assumptions
are made about the summarizing function, except that it can be evaluated. We
demonstrate that inference is still possible. The paper presents exact and
approximate inference algorithms that allow full posterior inference, which is
particularly important for assessing parameter uncertainty in this challenging
inference situation. Empirical scalability is demonstrated to reasonably sized
problems, and practical applicability is demonstrated by estimating the
posterior for a cognitive science RL model based on an observed user's task
completion time only.

Metabolic flux balance analyses are a standard tool in analysing metabolic
reaction rates compatible with measurements, steady-state and the metabolic
reaction network stoichiometry. Flux analysis methods commonly place
unrealistic assumptions on fluxes due to the convenience of formulating the
problem as a linear programming model, and most methods ignore the notable
uncertainty in flux estimates. We introduce a novel paradigm of Bayesian
metabolic flux analysis that models the reactions of the whole genome-scale
cellular system in probabilistic terms, and can infer the full flux vector
distribution of genome-scale metabolic systems based on exchange and
intracellular (e.g. 13C) flux measurements, steady-state assumptions, and
target function assumptions. The Bayesian model couples all fluxes jointly
together in a simple truncated multivariate posterior distribution, which
reveals informative flux couplings. Our model is a plug-in replacement for
conventional metabolic balance methods, such as flux balance analysis (FBA).
Our experiments indicate that we can characterise the genome-scale flux
covariances, reveal flux couplings, and determine more intracellular unobserved
fluxes in C. acetobutylicum from 13C data than flux variability analysis. The
COBRA-compatible software is available at github.com/markusheinonen/bamfa.

Zero-inflated datasets, which have an excess of zero outputs, are commonly
encountered in problems such as climate or rare event modelling. Conventional
machine learning approaches tend to overestimate the non-zeros, leading to poor
performance. We propose a novel model family of zero-inflated Gaussian
processes (ZiGP) for such zero-inflated datasets, produced by sparse kernels
through learning a latent probit Gaussian process that can zero out kernel rows
and columns whenever the signal is absent. The ZiGPs are particularly useful
for making the powerful Gaussian process networks more interpretable. We
introduce sparse GP networks where variable-order latent modelling is achieved
through sparse mixing signals. We derive the nontrivial stochastic variational
inference tractably for scalable learning of the sparse kernels in both models.
The novel output-sparse approach improves both prediction of zero-inflated data
and interpretability of latent mixing models.

In human-in-the-loop machine learning, the user provides information beyond
that in the training data. Many algorithms and user interfaces have been
designed to optimize and facilitate this human-machine interaction; however,
fewer studies have addressed the potential defects the designs can cause.
Effective interaction often requires exposing the user to the training data or
its statistics. The design of the system is then critical, as this can lead to
double use of data and overfitting, if the user reinforces noisy patterns in
the data. We propose a user modelling methodology, by assuming simple rational
behaviour, to correct the problem. We show, in a user study with 48
participants, that the method improves predictive performance in a sparse
linear regression sentiment analysis task, where graded user knowledge on
feature relevance is elicited. We believe that the key idea of inferring user
knowledge with probabilistic user models has general applicability in guarding
against overfitting and improving interactive machine learning.

We introduce a novel kernel that models input-dependent couplings across
multiple latent processes. The pairwise joint kernel measures covariance along
inputs and across different latent signals in a mutually dependent fashion. A
latent correlation Gaussian process (LCGP) model combines these nonstationary
latent components into multiple outputs by an input-dependent mixing matrix.
Probit classification and support for multiple observation sets are derived by
variational Bayesian inference. Results on several datasets indicate that the
LCGP model can recover the correlations between latent signals while
simultaneously achieving state-of-the-art performance. We highlight the latent
covariances with an EEG classification dataset where latent brain processes and
their couplings simultaneously emerge from the model.

Prediction in a small-sized sample with a large number of covariates, the
"small n, large p" problem, is challenging. This setting is encountered in
multiple applications, such as precision medicine, where obtaining additional
samples can be extremely costly or even impossible, and extensive research
effort has recently been dedicated to finding principled solutions for accurate
prediction. However, a valuable source of additional information, domain
experts, has not yet been efficiently exploited. We formulate knowledge
elicitation generally as a probabilistic inference process, where expert
knowledge is sequentially queried to improve predictions. In the specific case
of sparse linear regression, where we assume the expert has knowledge about the
values of the regression coefficients or about the relevance of the features,
we propose an algorithm and computational approximation for fast and efficient
interaction, which sequentially identifies the most informative features on
which to query expert knowledge. Evaluations of our method in experiments with
simulated and real users show improved prediction accuracy already with a small
effort from the expert.

Users of a personalised recommendation system face a dilemma: recommendations
can be improved by learning from data, but only if the other users are willing
to share their private information. Good personalised predictions are vitally
important in precision medicine, but genomic information on which the
predictions are based is also particularly sensitive, as it directly identifies
the patients and hence cannot easily be anonymised. Differential privacy has
emerged as a potentially promising solution: privacy is considered sufficient
if the presence of individual patients cannot be distinguished. However,
differentially private learning with current methods does not improve
predictions with feasible data sizes and dimensionalities. Here we show that
useful predictors can be learned under powerful differential privacy
guarantees, and even from moderately sized data sets, by demonstrating
significant improvements with a new robust private regression method in the
accuracy of private drug sensitivity prediction. The method combines two key
properties not present even in recent proposals, which can be generalised to
other predictors: we prove it is asymptotically consistently and efficiently
private, and demonstrate that it performs well on finite data. Good finite data
performance is achieved by limiting the sharing of private information by
decreasing the dimensionality and by projecting outliers to fit tighter bounds,
therefore needing to add less noise for equal privacy. As already the
simple-to-implement method shows promise on the challenging genomic data, we
anticipate rapid progress towards practical applications in many fields, such
as mobile sensing and social media, in addition to the badly needed precision
medicine solutions.

Bayesian networks, and especially their structures, are powerful tools for
representing conditional independencies and dependencies between random
variables. In applications where related variables form a priori known groups,
chosen to represent different "views" to or aspects of the same entities, one
may be more interested in modeling dependencies between groups of variables
rather than between individual variables. Motivated by this, we study prospects
of representing relationships between variable groups using Bayesian network
structures. We show that for dependency structures between groups to be
expressible exactly, the data have to satisfy the so-called groupwise
faithfulness assumption. We also show that one cannot learn causal relations
between groups using only groupwise conditional independencies, but also
variablewise relations are needed. Additionally, we present algorithms for
finding the groupwise dependency structures.

Many applications of machine learning, for example in health care, would
benefit from methods that can guarantee privacy of data subjects. Differential
privacy (DP) has become established as a standard for protecting learning
results. The standard DP algorithms require a single trusted party to have
access to the entire data, which is a clear weakness. We consider DP Bayesian
learning in a distributed setting, where each party only holds a single sample
or a few samples of the data. We propose a learning strategy based on a secure
multiparty sum function for aggregating summaries from data holders and the
Gaussian mechanism for DP. Our method builds on an asymptotically optimal and
practically efficient DP Bayesian inference with rapidly diminishing extra
cost.
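
A minimal sketch of the aggregation step, under stated assumptions: each party
clips its summary to a known bound, the classical Gaussian-mechanism calibration
(valid for epsilon below 1) sets the total noise scale, and each party adds an
equal share of that noise before the values are summed. A plain `sum` stands in
for the secure multi-party sum, all parties are assumed honest, and the function
names are hypothetical. This is not the paper's protocol.

```python
import math
import random

def gaussian_sigma(epsilon, delta, sensitivity):
    """Classical Gaussian-mechanism noise scale (valid for 0 < epsilon < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def party_message(value, bound, sigma_total, n_parties, rng):
    """Clip one party's summary and add its share of the total noise."""
    clipped = max(-bound, min(bound, value))
    # Independent shares with variance sigma_total^2 / n add up to sigma_total^2.
    return clipped + rng.gauss(0.0, sigma_total / math.sqrt(n_parties))

def dp_sum(values, bound, epsilon, delta, rng):
    # Sensitivity 2 * bound: replacing one clipped value moves the sum by at most this.
    sigma = gaussian_sigma(epsilon, delta, sensitivity=2.0 * bound)
    messages = [party_message(v, bound, sigma, len(values), rng) for v in values]
    # A plain sum stands in for the secure multi-party sum (assumption).
    return sum(messages), sigma

rng = random.Random(0)
values = [rng.uniform(-1.5, 1.5) for _ in range(1000)]
noisy_total, sigma = dp_sum(values, bound=1.0, epsilon=0.5, delta=1e-5, rng=rng)
```

Because the noise scale depends only on the bound and the privacy parameters,
not on the number of parties, its relative effect on the aggregated summary
shrinks as more data holders contribute.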

We propose nonstationary spectral kernels for Gaussian process regression.
We propose to model the spectral density of a nonstationary kernel function as
a mixture of input-dependent Gaussian process frequency density surfaces. We
solve the generalised Fourier transform with such a model, and present a family
of nonstationary and nonmonotonic kernels that can learn input-dependent and
potentially long-range, nonmonotonic covariances between inputs. We derive
efficient inference using model whitening and marginalized posterior, and show
with case studies that these kernels are necessary when modelling even rather
simple time series, image or geospatial data with nonstationary
characteristics.

Predicting the efficacy of a drug for a given individual, using
high-dimensional genomic measurements, is at the core of precision medicine.
However, identifying features on which to base the predictions remains a
challenge, especially when the sample size is small. Incorporating expert
knowledge offers a promising alternative to improve a prediction model, but
collecting such knowledge is laborious to the expert if the number of candidate
features is very large. We introduce a probabilistic model that can incorporate
expert feedback about the impact of genomic measurements on the sensitivity of
a cancer cell for a given drug. We also present two methods to intelligently
collect this feedback from the expert, using experimental design and
multi-armed bandit models. In a multiple myeloma blood cancer data set (n=51),
expert knowledge decreased the prediction error by 8%. Furthermore, the
intelligent approaches can be used to reduce the workload of feedback
collection to less than 30% on average compared to a naive approach.

Increasingly complex generative models are being used across disciplines as
they allow for realistic characterization of data, but a common difficulty with
them is the prohibitively large computational cost to evaluate the likelihood
function and thus to perform likelihood-based statistical inference. A
likelihood-free inference framework has emerged where the parameters are
identified by finding values that yield simulated data resembling the observed
data. While widely applicable, a major difficulty in this framework is how to
measure the discrepancy between the simulated and observed data. Transforming
the original problem into a problem of classifying the data into simulated
versus observed, we find that classification accuracy can be used to assess the
discrepancy. The complete arsenal of classification methods becomes thereby
available for inference of intractable generative models. We validate our
approach using theory and simulations for both point estimation and Bayesian
inference, and demonstrate its use on real data by inferring an
individual-based epidemiological model for bacterial infections in child care
centers.
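
As a rough illustration of using classification accuracy as a discrepancy, the
sketch below trains a nearest-class-mean classifier on half of the observed and
simulated data and reports its accuracy on the other half. The Gaussian
simulator, the choice of classifier and the function names are assumptions made
for this example only. Accuracy near 0.5 means the two data sets are hard to
tell apart, and accuracy near 1 means they are easy to separate.

```python
import random
import statistics

def simulate(theta, n, rng):
    """Hypothetical simulator: n draws from N(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]

def classification_discrepancy(observed, simulated):
    """Held-out accuracy of a nearest-class-mean classifier."""
    half_o, half_s = len(observed) // 2, len(simulated) // 2
    # Training: estimate one mean per class from the first half of each set.
    mu_o = statistics.mean(observed[:half_o])
    mu_s = statistics.mean(simulated[:half_s])
    # Testing: count held-out points assigned to their own class.
    correct = sum(abs(x - mu_o) <= abs(x - mu_s) for x in observed[half_o:])
    correct += sum(abs(x - mu_s) < abs(x - mu_o) for x in simulated[half_s:])
    total = (len(observed) - half_o) + (len(simulated) - half_s)
    return correct / total

rng = random.Random(0)
observed = simulate(0.0, 400, rng)
d_near = classification_discrepancy(observed, simulate(0.0, 400, rng))
d_far = classification_discrepancy(observed, simulate(3.0, 400, rng))
```

Minimising this discrepancy over the parameter gives a point estimate, and the
same quantity can serve as the distance in an approximate Bayesian computation
scheme.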

Regression under the "small $n$, large $p$" conditions, of small sample size
$n$ and large number of features $p$ in the learning data set, is a recurring
setting in which learning from data is difficult. With prior knowledge about
relationships of the features, $p$ can effectively be reduced, but explicating
such prior knowledge is difficult for experts. In this paper we introduce a new
method for eliciting expert prior knowledge about the similarity of the roles
of features in the prediction task. The key idea is to use an interactive
multidimensional-scaling (MDS) type scatterplot display of the features to
elicit the similarity relationships, and then use the elicited relationships in
the prior distribution of prediction parameters. Specifically, for learning to
predict a target variable with Bayesian linear regression, the feature
relationships are used to construct a Gaussian prior with a full covariance
matrix for the regression coefficients. Evaluation of our method in experiments
with simulated and real users on text data confirms that prior elicitation of
feature similarities improves prediction accuracy. Furthermore, elicitation
with an interactive scatterplot display outperforms straightforward elicitation
where the users choose feature pairs from a feature list.

We propose an inlier-based outlier detection method capable of both
identifying the outliers and explaining why they are outliers, by identifying
the outlier-specific features. Specifically, we employ an inlier-based outlier
detection criterion, which uses the ratio of inlier and test probability
densities as a measure of plausibility of being an outlier. For estimating the
density ratio function, we propose a localized logistic regression algorithm.
Thanks to the locality of the model, variable selection can be
outlier-specific, and will help interpret why points are outliers in a
high-dimensional space. Through synthetic experiments, we show that the
proposed algorithm can successfully detect the important features for outliers.
Moreover, we show that the proposed algorithm tends to outperform existing
algorithms in benchmark datasets.

We consider regression under the "extremely small $n$ large $p$" condition,
where the number of samples $n$ is so small compared to the dimensionality $p$
that predictors cannot be estimated without prior knowledge. This setup occurs
in personalized medicine, for instance, when predicting treatment outcomes for
an individual patient based on noisy high-dimensional genomics data. A
remaining source of information is expert knowledge, which has received
relatively little attention in recent years. We formulate the inference problem
of asking expert feedback on features on a budget, propose an elicitation
strategy for a simple "small $n$" setting, and derive conditions under which
the elicitation strategy is optimal. Experiments on simulated experts, both on
synthetic and genomics data, demonstrate that the proposed strategy can
drastically improve prediction accuracy.

Providing accurate predictions is challenging for machine learning algorithms
when the number of features is larger than the number of samples in the data.
Prior knowledge can improve machine learning models by indicating relevant
variables and parameter values. Yet, this prior knowledge is often tacit and
only available from domain experts. We present a novel approach that uses
interactive visualization to elicit the tacit prior knowledge and uses it to
improve the accuracy of prediction models. The main component of our approach
is a user model that models the domain expert's knowledge of the relevance of
different features for a prediction task. In particular, based on the expert's
earlier input, the user model guides the selection of the features on which to
elicit the user's knowledge next. The results of a controlled user study show that
the user model significantly improves prior knowledge elicitation and
prediction accuracy, when predicting the relative citation counts of scientific
documents in a specific domain.

An important problem for HCI researchers is to estimate the parameter values
of a cognitive model from behavioral data. This is a difficult problem, because
of the substantial complexity and variety in human behavioral strategies. We
report an investigation into a new approach using approximate Bayesian
computation (ABC) to condition model parameters to data and prior knowledge. As
the case study we examine menu interaction, where we have click time data only
to infer a cognitive model that implements a search behaviour with parameters
such as fixation duration and recall probability. Our results demonstrate that
ABC (i) improves estimates of model parameter values, (ii) enables meaningful
comparisons between model variants, and (iii) supports fitting models to
individual users. ABC provides ample opportunities for theoretical HCI research
by allowing principled inference of model parameter values and their
uncertainty.
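
For readers unfamiliar with ABC, the simplest variant (rejection ABC) can be
sketched as follows. The abstract does not say which ABC variant was used, so
this is only a generic illustration. The click-time simulator, its single
parameter and the uniform prior are hypothetical stand-ins for a real cognitive
model.

```python
import random
import statistics

def simulate_click_times(fixation_ms, n_trials, rng):
    """Hypothetical stand-in for a cognitive model: click time is roughly
    three fixations plus Gaussian noise (an assumption for illustration)."""
    return [rng.gauss(3.0 * fixation_ms, 100.0) for _ in range(n_trials)]

def rejection_abc(observed, prior_draw, n_draws, keep_fraction, rng):
    """Keep the prior draws whose simulated summary is closest to the observed one."""
    obs_summary = statistics.mean(observed)
    scored = []
    for _ in range(n_draws):
        theta = prior_draw(rng)
        sim = simulate_click_times(theta, len(observed), rng)
        # Discrepancy: absolute difference of mean click times.
        scored.append((abs(statistics.mean(sim) - obs_summary), theta))
    scored.sort()
    n_keep = max(1, int(keep_fraction * n_draws))
    return [theta for _, theta in scored[:n_keep]]

rng = random.Random(1)
observed = simulate_click_times(300.0, 50, rng)  # synthetic "observed" data
posterior = rejection_abc(
    observed, lambda r: r.uniform(100.0, 500.0), 2000, 0.05, rng
)
posterior_mean = statistics.mean(posterior)
```

The retained draws form an approximate posterior sample, so their spread gives
a direct, if crude, picture of parameter uncertainty.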

The R package GFA provides a full pipeline for factor analysis of multiple
data sources that are represented as matrices with co-occurring samples. It
allows learning dependencies between subsets of the data sources, decomposed
into latent factors. The package also implements sparse priors for the
factorization, providing interpretable biclusters of the multi-source data.

We introduce the localized Lasso, which is suited for learning models that
are both interpretable and have a high predictive power in problems with high
dimensionality $d$ and small sample size $n$. More specifically, we consider a
function defined by local sparse models, one at each data point. We introduce
sample-wise network regularization to borrow strength across the models, and
sample-wise exclusive group sparsity (a.k.a., $\ell_{1,2}$ norm) to introduce
diversity into the choice of feature sets in the local models. The local models
are interpretable in terms of similarity of their sparsity patterns. The cost
function is convex, and thus has a globally optimal solution. Moreover, we
propose a simple yet efficient iterative least-squares based optimization
procedure for the localized Lasso, which does not need a tuning parameter, and
is guaranteed to converge to a globally optimal solution. The solution is
empirically shown to outperform alternatives for both simulated and genomic
personalized medicine data.
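
The abstract does not spell out the cost function. The sketch below evaluates
one plausible form of it, assumed here for illustration: a squared-error term
with one weight vector per sample, a network term that penalises differences
between the weights of linked samples, and an exclusive group sparsity term
given by the squared $\ell_1$ norm of each local weight vector. The function
name, the link matrix `R` and the parameters `lam_net` and `lam_exc` are
hypothetical.

```python
import math

def localized_lasso_objective(W, X, y, R, lam_net, lam_exc):
    """Evaluate an assumed localized-Lasso-style cost for local weights W."""
    n = len(y)
    # Squared error of each sample under its own local model w_i.
    loss = sum(
        (y[i] - sum(w * x for w, x in zip(W[i], X[i]))) ** 2 for i in range(n)
    )
    # Network term: linked samples (R[i][j] > 0) are pulled towards similar weights.
    network = sum(
        R[i][j] * math.sqrt(sum((a - b) ** 2 for a, b in zip(W[i], W[j])))
        for i in range(n)
        for j in range(n)
    )
    # Exclusive group sparsity: squared L1 norm of every local weight vector.
    exclusive = sum(sum(abs(w) for w in W[i]) ** 2 for i in range(n))
    return loss + lam_net * network + lam_exc * exclusive

W = [[1.0, -2.0], [0.0, 3.0]]   # one weight vector per sample
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
R = [[0.0, 1.0], [1.0, 0.0]]    # the two samples are linked
value = localized_lasso_objective(W, X, y, R, lam_net=0.5, lam_exc=0.1)
```

All three terms are convex in the weights, which is consistent with the
abstract's statement that the cost function has a globally optimal solution.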

We introduce Bayesian multi-tensor factorization, a model that is the first
Bayesian formulation for joint factorization of multiple matrices and tensors.
The research problem generalizes the joint matrix-tensor factorization problem
to arbitrary sets of tensors of any depth, including matrices, can be
interpreted as unsupervised multi-view learning from multiple data tensors, and
can be generalized to relax the usual trilinear tensor factorization
assumptions. The result is a factorization of the set of tensors into factors
shared by any subsets of the tensors, and factors private to individual
tensors. We demonstrate the performance against existing baselines in multiple
tensor factorization tasks in structural toxicogenomics and functional
neuroimaging.

We propose the convex factorization machine (CFM), which is a convex variant
of the widely used Factorization Machines (FMs). Specifically, we employ a
linear+quadratic model and regularize the linear term with the
$\ell_2$-regularizer and the quadratic term with the trace norm regularizer.
Then, we formulate the CFM optimization as a semidefinite programming problem
and propose an efficient optimization procedure with Hazan's algorithm. A key
advantage of CFM over existing FMs is that it can find a globally optimal
solution, while FMs may get a poor locally optimal solution since the objective
function of FMs is nonconvex. In addition, the proposed algorithm is simple
yet effective and can be implemented easily. Finally, CFM is a general
factorization method and can also be used for other factorization problems
including multi-view matrix factorization and tensor completion
problems. Through synthetic and MovieLens datasets, we first show that the
proposed CFM achieves results competitive with FMs. Furthermore, in a
toxicogenomics prediction task, we show that CFM outperforms a state-of-the-art
tensor factorization method.