
Bayesian matrix factorization (BMF) is a powerful tool for producing low-rank
representations of matrices and for predicting missing values and providing
confidence intervals. Scaling up the posterior inference for massive-scale
matrices is challenging and requires distributing both data and computation
over many workers, making communication the main computational bottleneck.
Embarrassingly parallel inference would remove the communication needed, by
using completely independent computations on different data subsets, but it
suffers from the inherent unidentifiability of BMF solutions. We introduce a
hierarchical decomposition of the joint posterior distribution, which couples
the subset inferences, allowing for embarrassingly parallel computations in a
sequence of at most three stages. Using an efficient approximate
implementation, we show improvements empirically on both real and simulated
data. Our distributed approach is able to achieve a speedup of almost an order
of magnitude over the full posterior, with a negligible effect on predictive
accuracy. Our method outperforms state-of-the-art embarrassingly parallel MCMC
methods in accuracy, and achieves results competitive with other available
distributed and parallel implementations of BMF.
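
The unidentifiability that hampers embarrassingly parallel inference can be seen in a small numerical sketch. This is an illustration only, not the method of the abstract: the names `U`, `V`, and `Q`, the matrix sizes, and the rotation angle are all assumptions chosen for the example. It shows that rotating both factor matrices by the same orthogonal matrix leaves the reconstructed matrix unchanged, so independently inferred factors need not be aligned, and naively averaging them can destroy the signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols, rank = 6, 5, 2  # illustrative sizes (assumed)

# A rank-2 factorization X = U V^T.
U = rng.normal(size=(n_rows, rank))
V = rng.normal(size=(n_cols, rank))
X = U @ V.T

# Any orthogonal Q gives an equally valid factorization:
# (U Q)(V Q)^T = U Q Q^T V^T = U V^T.
angle = 0.7  # arbitrary rotation angle (assumed)
Q = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
U_rot, V_rot = U @ Q, V @ Q

same_product = np.allclose(X, U_rot @ V_rot.T)
factors_differ = not np.allclose(U, U_rot)

# Extreme case: a worker that returns the sign-flipped solution (Q = -I).
# Averaging its factors with the original ones cancels them completely.
U_flip = -U
naive_average = (U + U_flip) / 2.0

print("same reconstruction:", same_product)
print("factors differ:", factors_differ)
print("largest entry of averaged factors:", np.abs(naive_average).max())
```

Both factorizations explain the data equally well, which is why results from independent subsets cannot simply be averaged without some coupling between the subset inferences.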

A common approach for Bayesian computation with big data is to partition the
data into smaller pieces, perform local inference for each piece separately,
and finally combine the results to obtain an approximation to the global
posterior. Looking at this from the bottom up, one can perform separate
analyses on individual sources of data and then combine these in a larger
Bayesian model. In either case, the idea of distributed modeling and inference
has both conceptual and computational appeal, but from the Bayesian perspective
there is no general way of handling the prior distribution: if the prior is
included in each separate inference, it will be multiply-counted when the
inferences are combined; but if the prior is itself divided into pieces, it may
not provide enough regularization for each separate computation, thus
eliminating one of the key advantages of Bayesian methods. To resolve this
dilemma, we propose expectation propagation (EP) as a general prototype for
distributed Bayesian inference. The central idea is to factor the likelihood
according to the data partitions, and to iteratively combine each factor with
an approximate model of the prior and all other parts of the data, thus
producing an overall approximation to the global posterior at convergence. In
this paper, we give an introduction to EP and an overview of some recent
developments of the method, with particular emphasis on its use in combining
inferences from partitioned data. In addition to distributed modeling of large
datasets, our unified treatment also includes hierarchical modeling of data
with a naturally partitioned structure. The paper describes a general
algorithmic framework, rather than a specific algorithm, and presents an
example implementation for it.
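
The prior-counting dilemma and the cavity-based update described above can be sketched in a toy conjugate setting. This is a minimal illustration under strong assumptions, not the paper's implementation: a Gaussian mean with known noise variance, a Gaussian prior, and three data partitions, all chosen here for the example. In this special case each site approximation is exact, so a single pass over the partitions recovers the full posterior; in non-conjugate models the tilted distribution would have to be approximated (for example by moment matching) and the updates iterated.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 1.0           # known noise variance (assumed)
mu0, tau2 = 0.0, 4.0   # prior mean and variance (assumed)
y = rng.normal(loc=1.5, scale=np.sqrt(sigma2), size=60)
parts = np.array_split(y, 3)
K = len(parts)

# Gaussians in natural parameters: q = precision, r = precision * mean.
prior_q, prior_r = 1.0 / tau2, mu0 / tau2

def likelihood_natural(y_k):
    """Natural parameters contributed by one partition's likelihood."""
    return len(y_k) / sigma2, y_k.sum() / sigma2

# Exact posterior from all data at once.
full_q = prior_q + len(y) / sigma2
full_r = prior_r + y.sum() / sigma2

# Naive combination: every local posterior includes the full prior,
# so multiplying them counts the prior K times.
naive_q = sum(prior_q + likelihood_natural(p)[0] for p in parts)
naive_r = sum(prior_r + likelihood_natural(p)[1] for p in parts)

# EP-style pass: global approximation = prior * product of sites.
site_q, site_r = np.zeros(K), np.zeros(K)
glob_q, glob_r = prior_q, prior_r
for k, part in enumerate(parts):
    # Cavity: global approximation with site k removed.
    cav_q, cav_r = glob_q - site_q[k], glob_r - site_r[k]
    # Tilted distribution: cavity times the exact likelihood of partition k.
    lik_q, lik_r = likelihood_natural(part)
    tilt_q, tilt_r = cav_q + lik_q, cav_r + lik_r
    # New site = tilted / cavity; the global approximation becomes the tilted.
    site_q[k], site_r[k] = tilt_q - cav_q, tilt_r - cav_r
    glob_q, glob_r = tilt_q, tilt_r

print("exact mean:", full_r / full_q)
print("EP-style mean:", glob_r / glob_q)
print("naive mean:", naive_r / naive_q)
```

The naive product overstates the posterior precision by `(K - 1)` prior precisions, whereas the cavity construction includes the prior exactly once.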

Hierarchical models are versatile tools for joint modeling of data sets
arising from different, but related, sources. Fully Bayesian inference may,
however, become computationally prohibitive if the source-specific data models
are complex, or if the number of sources is very large. To facilitate
computation, we propose an approach in which inference is first made
independently for the parameters of each data set, whereupon the obtained
posterior samples are used as observed data in a substitute hierarchical model,
based on a scaled likelihood function. Compared to direct inference in a full
hierarchical model, the approach has the advantage of being able to speed up
convergence by breaking down the initial large inference problem into smaller
individual subproblems with better convergence properties. Moreover, it enables
parallel processing of the possibly complex inferences of the source-specific
parameters, which may otherwise create a computational bottleneck if processed
jointly as part of a hierarchical model. The approach is illustrated with both
simulated and real data.

Motivation: Public and private repositories of experimental data are growing
to sizes that require dedicated methods for finding relevant data. To improve
on the state of the art of keyword searches from annotations, methods for
content-based retrieval have been proposed. In the context of gene expression
experiments, most methods retrieve gene expression profiles, requiring each
experiment to be expressed as a single profile, typically of case vs. control.
A more general, recently suggested alternative is to retrieve experiments whose
models are good for modelling the query dataset. However, for very noisy and
high-dimensional query data, this retrieval criterion turns out to be very
noisy as well.
Results: We propose doing retrieval using a denoised model of the query
dataset, instead of the original noisy dataset itself. To this end, we
introduce a general probabilistic framework, where each experiment is modelled
separately and the retrieval is done by finding related models. For retrieval
of gene expression experiments, we use a probabilistic model called the product
partition model, which induces a clustering of genes that show similar
expression patterns across a number of samples. The suggested metric for
retrieval using clusterings is the normalized information distance. Empirical
results finally suggest that inference for the full probabilistic model can be
approximated with good performance using computationally faster heuristic
clustering approaches (e.g. $k$-means). The method is highly scalable and
straightforward to apply to construct a general-purpose gene expression
experiment retrieval method.
Availability: The method can be implemented using standard clustering
algorithms and normalized information distance, available in many statistical
software packages.
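
As an illustration of the retrieval metric, the sketch below computes a normalized information distance between two clusterings of the same genes. It assumes the common definition NID = 1 - I(A;B) / max(H(A), H(B)), where I is mutual information and H is entropy; the abstract does not spell out the exact variant, so this choice, the function names, and the toy label vectors are assumptions. The labels could come from any clustering method, such as $k$-means.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a probability vector, ignoring zeros."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def normalized_information_distance(labels_a, labels_b):
    """NID between two clusterings of the same items (assumed definition)."""
    a = np.asarray(labels_a).ravel()
    b = np.asarray(labels_b).ravel()
    if a.shape != b.shape:
        raise ValueError("both clusterings must label the same items")
    # Map arbitrary labels to 0..k-1 and build the contingency table.
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    table = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(table, (a_idx, b_idx), 1)
    joint = table / table.sum()
    h_a = entropy(joint.sum(axis=1))
    h_b = entropy(joint.sum(axis=0))
    mutual_info = h_a + h_b - entropy(joint.ravel())
    denom = max(h_a, h_b)
    if denom == 0.0:
        return 0.0  # both clusterings put everything in one cluster
    return 1.0 - mutual_info / denom

# Toy example: cluster labels for four genes (made-up values).
query = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]    # same partition, different label names
unrelated = [0, 1, 0, 1]    # statistically independent partition

print(normalized_information_distance(query, relabeled))
print(normalized_information_distance(query, unrelated))
```

Under this definition the distance depends only on the partitions and not on the label names, so experiments can be ranked by their distance to the query clustering.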

We introduce a fully length-based Bayesian model for the population dynamics
of northern shrimp (Pandalus borealis). This has the advantage of structuring
the population in terms of a directly observable quantity, requiring no
indirect estimation of age distributions from measurements of size. The
introduced model is intended as a simplistic prototype around which further
developments and refinements can be built. As a case study, we use the model to
analyze the population of Skagerrak and the Norwegian Deep in the years
1988-2012.