
Tailoring the functional properties of advanced organic/inorganic
heterogeonous devices to their intended technological applications requires
knowledge and control of the microscopic structure inside the device. Atomistic
quantum mechanical simulation methods deliver accurate energies and properties
for individual configurations, however, finding the most favourable
configurations remains computationally prohibitive. We propose a 'building
block'based Bayesian Optimisation Structure Search (BOSS) approach for
addressing extended organic/inorganic interface problems and demonstrate its
feasibility in a molecular surface adsorption study. In BOSS, a likelihoodfree
Bayesian scheme accelerates the identification of material energy landscapes
with the number of sampled configurations during active learning, enabling
structural inference with high chemical accuracy and featuring large simulation
cells. This allowed us to identify several most favourable molecular adsorption
configurations for $\mathrm{C}_{60}$ on the (101) surface of $\mathrm{TiO}_2$
anatase and clarify the key moleculesurface interactions governing structural
assembly. Inferred structures were in good agreement with detailed experimental
images of this surface adsorbate, demonstrating good predictive power of BOSS
and opening the route towards largescale surface adsorption studies of
molecular aggregates and films.

We consider the problem of parametric statistical inference when likelihood
computations are prohibitively expensive but sampling from the model is
possible. Several socalled likelihoodfree methods have been developed to
perform inference in the absence of a likelihood function. The popular
synthetic likelihood approach infers the parameters by modelling summary
statistics of the data by a Gaussian probability distribution. In another
popular approach called approximate Bayesian computation, the inference is
performed by identifying parameter values for which the summary statistics of
the simulated data are close to those of the observed data. Synthetic
likelihood is easier to use as no measure of `closeness' is required but the
Gaussianity assumption is often limiting. Moreover, both approaches require
judiciously chosen summary statistics. We here present an alternative inference
approach that is as easy to use as synthetic likelihood but not as restricted
in its assumptions, and that, in a natural way, enables automatic selection of
relevant summary statistic from a large set of candidates. The basic idea is to
frame the problem of estimating the posterior as a problem of estimating the
ratio between the data generating distribution and the marginal distribution.
This problem can be solved by logistic regression, and including regularising
penalty terms enables automatic selection of the summary statistics relevant to
the inference task. We illustrate the general theory on canonical examples and
employ it to perform inference for challenging stochastic nonlinear dynamical
systems and highdimensional summary statistics.

Stochastic Neighbor Embedding (SNE) methods minimize the divergence between
the similarity matrix of a highdimensional data set and its counterpart from a
lowdimensional embedding, leading to widely applied tools for data
visualization. Despite their popularity, the current SNE methods experience a
crowding problem when the data include highly imbalanced similarities. This
implies that the data points with higher total similarity tend to get crowded
around the display center. To solve this problem, we introduce a fast
normalization method and normalize the similarity matrix to be doubly
stochastic such that all the data points have equal total similarities.
Furthermore, we show empirically and theoretically that the doubly
stochasticity constraint often leads to embeddings which are approximately
spherical. This suggests replacing a flat space with spheres as the embedding
space. The spherical embedding eliminates the discrepancy between the center
and the periphery in visualization, which efficiently resolves the crowding
problem. We compared the proposed method (DOSNES) with the stateoftheart SNE
method on three realworld datasets and the results clearly indicate that our
method is more favorable in terms of visualization quality.

Engine for LikelihoodFree Inference (ELFI) is a Python software library for
performing likelihoodfree inference (LFI). ELFI provides a convenient syntax
for arranging components in LFI, such as priors, simulators, summaries or
distances, to a network called ELFI graph. The components can be implemented in
a wide variety of languages. The standalone ELFI graph can be used with any of
the available inference methods without modifications. A central method
implemented in ELFI is Bayesian Optimization for LikelihoodFree Inference
(BOLFI), which has recently been shown to accelerate likelihoodfree inference
up to several orders of magnitude by surrogatemodelling the distance. ELFI
also has an inbuilt support for output data storing for reuse and analysis, and
supports parallelization of computation from multiple cores up to a cluster
environment. ELFI is designed to be extensible and provides interfaces for
widening its functionality. This makes the adding of new inference methods to
ELFI straightforward and automatically compatible with the inbuilt features.

We consider the statistical properties of interaction parameter estimates
obtained by the direct coupling analysis (DCA) approach to learning
interactions from large data sets. Assuming that the data are generated from a
random background distribution, we determine the distribution of inferred
interactions. Two inference methods are considered: the L2 regularized naive
meanfield inference procedure (regularized least squares, RLS), and the
pseudolikelihood maximization (plmDCA). For RLS we also study a model where
the data matrix elements are real numbers, identically and independently
generated from a Gaussian distribution; in this setting we analytically find
that the distribution of the inferred interactions is Gaussian. For data of
Boolean type, more realistic in practice, the inferred interactions do not
generally follow a Gaussian. However, extensive numerical simulations indicate
that their distribution can be characterized by a single function determined by
a few system parameters after normalization by the standard deviation. This
property holds for both RLS and plmDCA and may be exploitable for inferring the
distribution of extremely large interactions from simulations for smaller
system sizes.

Increasingly complex generative models are being used across disciplines as
they allow for realistic characterization of data, but a common difficulty with
them is the prohibitively large computational cost to evaluate the likelihood
function and thus to perform likelihoodbased statistical inference. A
likelihoodfree inference framework has emerged where the parameters are
identified by finding values that yield simulated data resembling the observed
data. While widely applicable, a major difficulty in this framework is how to
measure the discrepancy between the simulated and observed data. Transforming
the original problem into a problem of classifying the data into simulated
versus observed, we find that classification accuracy can be used to assess the
discrepancy. The complete arsenal of classification methods becomes thereby
available for inference of intractable generative models. We validate our
approach using theory and simulations for both point estimation and Bayesian
inference, and demonstrate its use on real data by inferring an
individualbased epidemiological model for bacterial infections in child care
centers.

An important problem for HCI researchers is to estimate the parameter values
of a cognitive model from behavioral data. This is a difficult problem, because
of the substantial complexity and variety in human behavioral strategies. We
report an investigation into a new approach using approximate Bayesian
computation (ABC) to condition model parameters to data and prior knowledge. As
the case study we examine menu interaction, where we have click time data only
to infer a cognitive model that implements a search behaviour with parameters
such as fixation duration and recall probability. Our results demonstrate that
ABC (i) improves estimates of model parameter values, (ii) enables meaningful
comparisons between model variants, and (iii) supports fitting models to
individual users. ABC provides ample opportunities for theoretical HCI research
by allowing principled inference of model parameter values and their
uncertainty.

Efficient index structures for fast approximate nearest neighbor queries are
required in many applications such as recommendation systems. In
highdimensional spaces, many conventional methods suffer from excessive usage
of memory and slow response times. We propose a method where multiple random
projection trees are combined by a novel voting scheme. The key idea is to
exploit the redundancy in a large number of candidate sets obtained by
independently generated random projections in order to reduce the number of
expensive exact distance evaluations. The method is straightforward to
implement using sparse projections which leads to a reduced memory footprint
and fast index construction. Furthermore, it enables grouping of the required
computations into big matrix multiplications, which leads to additional savings
due to cache effects and lowlevel parallelization. We demonstrate by extensive
experiments on a wide variety of data sets that the method is faster than
existing partitioning tree or hashing based approaches, making it the fastest
available technique on high accuracy levels.

Various $\ell_1$penalised estimation methods such as graphical lasso and
CLIME are widely used for sparse precision matrix estimation. Many of these
methods have been shown to be consistent under various quantitative assumptions
about the underlying true covariance matrix. Intuitively, these conditions are
related to situations where the penalty term will dominate the optimisation. In
this paper, we explore the consistency of $\ell_1$based methods for a class of
sparse latent variable like models, which are strongly motivated by several
types of applications. We show that all $\ell_1$based methods fail
dramatically for models with nearly linear dependencies between the variables.
We also study the consistency on models derived from real gene expression data
and note that the assumptions needed for consistency never hold even for modest
sized gene networks and $\ell_1$based methods also become unreliable in
practice for larger networks.

We propose a Bayesian approximate inference method for learning the
dependence structure of a Gaussian graphical model. Using pseudolikelihood, we
derive an analytical expression to approximate the marginal likelihood for an
arbitrary graph structure without invoking any assumptions about
decomposability. The majority of the existing methods for learning Gaussian
graphical models are either restricted to decomposable graphs or require
specification of a tuning parameter that may have a substantial impact on
learned structures. By combining a simple sparsity inducing prior for the graph
structures with a default reference prior for the model parameters, we obtain a
fast and easily applicable scoring function that works well for even
highdimensional data. We demonstrate the favourable performance of our
approach by largescale comparisons against the leading methods for learning
nondecomposable Gaussian graphical models. A theoretical justification for our
method is provided by showing that it yields a consistent estimator of the
graph structure.

Rapidly assaying the diversity of a bacterial species present in a sample
obtained from a hospital patient or an evironmental source has become possible
after recent technological advances in DNA sequencing. For several applications
it is important to accurately identify the presence and estimate relative
abundances of the target organisms from short sequence reads obtained from a
sample. This task is particularly challenging when the set of interest includes
very closely related organisms, such as different strains of pathogenic
bacteria, which can vary considerably in terms of virulence, resistance and
spread. Using advanced Bayesian statistical modelling and computation
techniques we introduce a novel pipeline for bacterial identification that is
shown to outperform the currently leading pipeline for this purpose. Our
approach enables fast and accurate sequencebased identification of bacterial
strains while using only modest computational resources. Hence it provides a
useful tool for a wide spectrum of applications, including rapid clinical
diagnostics to distinguish among closely related strains causing nosocomial
infections. The software implementation is available at
https://github.com/PROBIC/BIB

Probability distributions in Stiefel manifold such as the vonMises Fisher
and Bingham distributions find diverse applications in signal processing and
other applied sciences. Use of these statistical models in practice is
complicated by the difficulties in numerical evaluation of their normalization
constants. In this letter, we derive asymptotical approximations to the
normalization constants via recent results in random matrix theory. The derived
approximations take simple forms and are reasonably accurate in regimes of
practical interest. As an application, we show that the proposed analytical
results lead to a remarkably reduction of the sampling complexity compared to
existing simulation based approaches.

Our paper deals with inferring simulatorbased statistical models given some
observed data. A simulatorbased model is a parametrized mechanism which
specifies how data are generated. It is thus also referred to as generative
model. We assume that only a finite number of parameters are of interest and
allow the generative process to be very general; it may be a noisy nonlinear
dynamical system with an unrestricted number of hidden variables. This weak
assumption is useful for devising realistic models but it renders statistical
inference very difficult. The main challenge is the intractability of the
likelihood function. Several likelihoodfree inference methods have been
proposed which share the basic idea of identifying the parameters by finding
values for which the discrepancy between simulated and observed data is small.
A major obstacle to using these methods is their computational cost. The cost
is largely due to the need to repeatedly simulate data sets and the lack of
knowledge about how the parameters affect the discrepancy. We propose a
strategy which combines probabilistic modeling of the discrepancy with
optimization to facilitate likelihoodfree inference. The strategy is
implemented using Bayesian optimization and is shown to accelerate the
inference through a reduction in the number of required simulations by several
orders of magnitude.

Volume of metric balls relates to ratedistortion theory and packing bounds
on codes. In this paper, the volume of balls in complex Grassmann manifolds is
evaluated for an arbitrary radius. The ball is defined as a set of hyperplanes
of a fixed dimension with reference to a center of possibly different
dimension, and a generalized chordal distance for unequal dimensional subspaces
is used. First, the volume is reduced to onedimensional integral
representation. The overall problem boils down to evaluating a determinant of a
matrix of the same size as the subspace dimensionality. Interpreting this
determinant as a characteristic function of the Jacobi ensemble, an asymptotic
analysis is carried out. The obtained asymptotic volume is moreover refined
using momentmatching techniques to provide a tighter approximation in
finitesize regimes. Lastly, the pertinence of the derived results is shown by
ratedistortion analysis of source coding on Grassmann manifolds.

Volume estimates of metric balls in manifolds find diverse applications in
information and coding theory. In this paper, some new results for the volume
of a metric ball in unitary group are derived via various tools from random
matrix theory. The first result is an integral representation of the exact
volume, which involves a Toeplitz determinant of Bessel functions. The
connection to matrixvariate hypergeometric functions and Szeg\H{o}'s strong
limit theorem lead independently from the finite size formula to an asymptotic
one. The convergence of the limiting formula is exceptionally fast due to an
underlying mockGaussian behavior. The proposed volume estimate enables simple
but accurate analytical evaluation of codingtheoretic bounds of unitary codes.
In particular, the GilbertVarshamov lower bound and the Hamming upper bound on
cardinality as well as the resulting bounds on code rate and minimum distance
are derived. Moreover, bounds on the scaling law of code rate are found.
Lastly, a closedform bound on diversity sum relevant to unitary spacetime
codes is obtained, which was only computed numerically in literature.

Some statistical models are specified via a data generating process for which
the likelihood function cannot be computed in closed form. Standard
likelihoodbased inference is then not feasible but the model parameters can be
inferred by finding the values which yield simulated data that resemble the
observed data. This approach faces at least two major difficulties: The first
difficulty is the choice of the discrepancy measure which is used to judge
whether the simulated data resemble the observed data. The second difficulty is
the computationally efficient identification of regions in the parameter space
where the discrepancy is low. We give here an introduction to our recent work
where we tackle the two difficulties through classification and Bayesian
optimization.

The Rayleigh product channel model is useful in capturing the performance
degradation due to rank deficiency of MIMO channels. In this paper, such a
performance degradation is investigated via the channel outage probability
assuming slowly varying channel with delayconstrained decoding. Using
techniques of free probability theory, the asymptotic variance of channel
capacity is derived when the dimensions of the channel matrices approach
infinity. In this asymptotic regime, the channel capacity is rigorously proven
to be Gaussian distributed. Using the obtained results, a fundamental tradeoff
between multiplexing gain and diversity gain of Rayleigh product channels can
be characterized by closedform expression at any finite signaltonoise ratio.
Numerical results are provided to compare the relative outage performance
between Rayleigh product channels and conventional Rayleigh MIMO channels.

Undirected graphical models known as Markov networks are popular for a wide
variety of applications ranging from statistical physics to computational
biology. Traditionally, learning of the network structure has been done under
the assumption of chordality which ensures that efficient scoring methods can
be used. In general, nonchordal graphs have intractable normalizing constants
which renders the calculation of Bayesian and other scores difficult beyond
very smallscale systems. Recently, there has been a surge of interest towards
the use of regularized pseudolikelihood methods for structural learning of
largescale Markov network models, as such an approach avoids the assumption of
chordality. The currently available methods typically necessitate the use of a
tuning parameter to adapt the level of regularization for a particular dataset,
which can be optimized for example by crossvalidation. Here we introduce a
Bayesian version of pseudolikelihood scoring of Markov networks, which enables
an automatic regularization through marginalization over the nuisance
parameters in the model. We prove consistency of the resulting MPL estimator
for the network structure via comparison with the pseudo information criterion.
Identification of the MPLoptimal network on a prescanned graph space is
considered with both greedy hill climbing and exact pseudoBoolean optimization
algorithms. We find that for reasonable sample sizes the hill climbing approach
most often identifies networks that are at a negligible distance from the
restricted global optimum. Using synthetic and existing benchmark networks, the
marginal pseudolikelihood method is shown to generally perform favorably
against recent popular inference methods for Markov networks.

Loglinear models are the popular workhorses of analyzing contingency tables.
A loglinear parameterization of an interaction model can be more expressive
than a direct parameterization based on probabilities, leading to a powerful
way of defining restrictions derived from marginal, conditional and
contextspecific independence. However, parameter estimation is often simpler
under a direct parameterization, provided that the model enjoys certain
decomposability properties. Here we introduce a cyclical projection algorithm
for obtaining maximum likelihood estimates of loglinear parameters under an
arbitrary contextspecific graphical loglinear model, which needs not satisfy
criteria of decomposability. We illustrate that lifting the restriction of
decomposability makes the models more expressive, such that additional
contextspecific independencies embedded in real data can be identified. It is
also shown how a contextspecific graphical model can correspond to a
nonhierarchical loglinear parameterization with a concise interpretation.
This observation can pave way to further development of nonhierarchical
loglinear models, which have been largely neglected due to their believed lack
of interpretability.

Gaussian graphical models represent the backbone of the statistical toolbox
for analyzing continuous multivariate systems. However, due to the intrinsic
properties of the multivariate normal distribution, use of this model family
may hide certain forms of contextspecific independence that are natural to
consider from an applied perspective. Such independencies have been earlier
introduced to generalize discrete graphical models and Bayesian networks into
more flexible model families. Here we adapt the idea of contextspecific
independence to Gaussian graphical models by introducing a stratification of
the Euclidean space such that a conditional independence may hold in certain
segments but be absent elsewhere. It is shown that the stratified models define
a curved exponential family, which retains considerable tractability for
parameter estimation and model selection.

Motivation: Estimation of bacterial community composition from a
highthroughput sequenced sample is an important task in metagenomics
applications. Since the sample sequence data typically harbors reads of
variable lengths and different levels of biological and technical noise,
accurate statistical analysis of such data is challenging. Currently popular
estimation methods are typically very time consuming in a desktop computing
environment.
Results: Using sparsity enforcing methods from the general sparse signal
processing field (such as compressed sensing), we derive a solution to the
community composition estimation problem by a simultaneous assignment of all
sample reads to a preprocessed reference database. A general statistical model
based on kernel density estimation techniques is introduced for the assignment
task and the model solution is obtained using convex optimization tools.
Further, we design a greedy algorithm solution for a fast solution. Our
approach offers a reasonably fast community composition estimation method which
is shown to be more robust to input data variation than a recently introduced
related method.
Availability: A platformindependent Matlab implementation of the method is
freely available at http://www.ee.kth.se/ctsoftware; source code that does not
require access to Matlab is currently being tested and will be made available
later through the above website.

We review a success story regarding Bayesian inference in fisheries
management in the Baltic Sea. The management of salmon fisheries is currently
based on the results of a complex Bayesian population dynamic model, and
managers and stakeholders use the probabilities in their discussions. We also
discuss the technical and human challenges in using Bayesian modeling to give
practical advice to the public and to government officials and suggest future
areas in which it can be applied. In particular, large databases in fisheries
science offer flexible ways to use hierarchical models to learn the population
dynamics parameters for those bycatch species that do not have similar large
stockspecific data sets like those that exist for many target species. This
information is required if we are to understand the future ecosystem risks of
fisheries.

Multiple cluster scattering MIMO channel is a useful model for picocellular
MIMO networks. In this paper, orthogonal spacetime block coded transmission
over such a channel is considered, where the effective channel equals the
product of n complex Gaussian matrices. A simple and accurate closedform
approximation to the channel outage capacity has been derived in this setting.
The result is valid for an arbitrary number of clusters n1 of scatterers and
an arbitrary antenna configuration. Numerical results are provided to study the
relative outage performance between the multicluster and the Rayleighfading
MIMO channels for which n=1.

We introduce stamp chains. A stamp chain is a finite integer sequence that is
both an addition chain and an additive 2basis, i.e., a solution to the postage
stamp problem. We provide a simple method for converting known postage stamp
solutions of length k into stamp chains of length k+1. Using stamp chains, we
construct an algorithm that computes u(x^i) for i=1,...,n in less than n1
multiplications, if u is a function that can be computed at zero cost, and if
there exists another zerocost function v such that v(a,b) = u(ab). This can
substantially reduce the computational cost of repeated multiplication, as
illustrated by application examples related to matrix multiplication and data
clustering using subset convolution. In addition, we report the extremal
postage stamp solutions of length k=24.

An inductive probabilistic classification rule must generally obey the
principles of Bayesian predictive inference, such that all observed and
unobserved stochastic quantities are jointly modeled and the parameter
uncertainty is fully acknowledged through the posterior predictive
distribution. Several such rules have been recently considered and their
asymptotic behavior has been characterized under the assumption that the
observed features or variables used for building a classifier are conditionally
independent given a simultaneous labeling of both the training samples and
those from an unknown origin. Here we extend the theoretical results to
predictive classifiers acknowledging feature dependencies either through
graphical models or sparser alternatives defined as stratified graphical
models. We also show through experimentation with both synthetic and real data
that the predictive classifiers based on stratified graphical models have
consistently best accuracy compared with the predictive classifiers based on
either conditionally independent features or on ordinary graphical models.