
Bayesian matrix factorization (BMF) is a powerful tool for producing low-rank
representations of matrices and for predicting missing values and providing
confidence intervals. Scaling up the posterior inference for massive-scale
matrices is challenging and requires distributing both data and computation
over many workers, making communication the main computational bottleneck.
Embarrassingly parallel inference would remove the communication needed, by
using completely independent computations on different data subsets, but it
suffers from the inherent unidentifiability of BMF solutions. We introduce a
hierarchical decomposition of the joint posterior distribution, which couples
the subset inferences, allowing for embarrassingly parallel computations in a
sequence of at most three stages. Using an efficient approximate
implementation, we show improvements empirically on both real and simulated
data. Our distributed approach is able to achieve a speedup of almost an order
of magnitude over the full posterior, with a negligible effect on predictive
accuracy. Our method outperforms state-of-the-art embarrassingly parallel MCMC
methods in accuracy, and achieves results competitive with other available
distributed and parallel implementations of BMF.
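
The unidentifiability mentioned above can be illustrated with a small sketch. It is not the method of the paper; the factor matrices `U` and `V`, the rotation angle, and all helper names are hypothetical and chosen only for illustration. Rotating both factors by the same orthogonal matrix leaves the product unchanged, so independently fitted subsets may return differently rotated solutions, and naively averaging them does not reproduce the matrix.

```python
import math

def matmul(a, b):
    # Plain nested-list matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def rotation(theta):
    # 2x2 orthogonal rotation matrix.
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Hypothetical rank-2 factors of a 3x3 matrix X = U V^T.
U = [[1.0, 0.5], [0.0, 2.0], [-1.0, 1.0]]
V = [[2.0, 0.0], [1.0, 1.0], [0.5, -1.0]]
X = matmul(U, transpose(V))

# A second, equally valid solution: (U R)(V R)^T = U R R^T V^T = X.
R = rotation(0.7)
U_rot, V_rot = matmul(U, R), matmul(V, R)
X_rot = matmul(U_rot, transpose(V_rot))
rotation_error = max_abs_diff(X, X_rot)   # zero up to rounding

# Naively averaging the two factor solutions does not recover X.
U_avg = [[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(U, U_rot)]
V_avg = [[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(V, V_rot)]
averaging_error = max_abs_diff(X, matmul(U_avg, transpose(V_avg)))

print(rotation_error, averaging_error)
```

This is why subset inferences need to be coupled before their results are combined.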

A common divide-and-conquer approach for Bayesian computation with big data
is to partition the data, perform local inference for each piece separately,
and combine the results to obtain a global posterior approximation. While being
conceptually and computationally appealing, this method involves the
problematic need to also split the prior for the local inferences; these
weakened priors may not provide enough regularization for each separate
computation, thus eliminating one of the key advantages of Bayesian methods. To
resolve this dilemma while still retaining the generalizability of the
underlying local inference method, we apply the idea of expectation propagation
(EP) as a framework for distributed Bayesian inference. The central idea is to
iteratively update approximations to the local likelihoods given the state of
the other approximations and the prior. The present paper has two roles: we
review the steps that are needed to keep EP algorithms numerically stable, and
we suggest a general approach, inspired by EP, for approaching data
partitioning problems in a way that achieves the computational benefits of
parallelism while allowing each local update to make use of relevant
information from the other sites. In addition, we demonstrate how the method
can be applied in a hierarchical context to make use of partitioning of both
data and parameters. The paper describes a general algorithmic framework,
rather than a specific algorithm, and presents an example implementation for
it.
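
The central idea of updating each local approximation given the other approximations and the prior can be sketched for a toy case. This is a minimal illustration under strong assumptions, not the implementation presented in the paper: the unknown is a single Gaussian mean with known noise variance, so each tilted distribution is Gaussian and its moments are available in closed form. The function name and all parameters are hypothetical. In a realistic model the tilted moments would have to be approximated at each site, for example by MCMC.

```python
import random

def ep_gaussian_mean(partitions, prior_mean, prior_var, noise_var, n_sweeps=2):
    """EP-style distributed inference for a Gaussian mean (toy sketch).

    Gaussians are stored as natural parameters (precision, precision * mean).
    """
    prior_r, prior_b = 1.0 / prior_var, prior_mean / prior_var
    sites = [(0.0, 0.0) for _ in partitions]   # site approximations start flat
    for _ in range(n_sweeps):
        # Global approximation = prior times all site approximations.
        glob_r = prior_r + sum(r for r, _ in sites)
        glob_b = prior_b + sum(b for _, b in sites)
        new_sites = []
        # Each iteration of this loop could run on a separate worker.
        for (site_r, site_b), data in zip(sites, partitions):
            # Cavity: global approximation with this site's factor removed.
            cav_r, cav_b = glob_r - site_r, glob_b - site_b
            # Tilted: cavity times the exact local likelihood (Gaussian here).
            tilt_r = cav_r + len(data) / noise_var
            tilt_b = cav_b + sum(data) / noise_var
            # New site approximation = tilted divided by cavity.
            new_sites.append((tilt_r - cav_r, tilt_b - cav_b))
        sites = new_sites
    glob_r = prior_r + sum(r for r, _ in sites)
    glob_b = prior_b + sum(b for _, b in sites)
    return glob_b / glob_r, 1.0 / glob_r       # posterior mean and variance

random.seed(0)
data = [random.gauss(1.5, 2.0) for _ in range(60)]
parts = [data[0:20], data[20:40], data[40:60]]
ep_mean, ep_var = ep_gaussian_mean(parts, prior_mean=0.0, prior_var=10.0, noise_var=4.0)
print(ep_mean, ep_var)
```

Because every site sees the full prior through the cavity distribution, the prior does not need to be split across partitions. In this conjugate toy case the result coincides with the exact posterior.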

Hierarchical models are versatile tools for joint modeling of data sets
arising from different, but related, sources. Fully Bayesian inference may,
however, become computationally prohibitive if the source-specific data models
are complex, or if the number of sources is very large. To facilitate
computation, we propose an approach in which inference is first made
independently for the parameters of each data set, whereupon the obtained
posterior samples are used as observed data in a substitute hierarchical model,
based on a scaled likelihood function. Compared to direct inference in a full
hierarchical model, the approach has the advantage of being able to speed up
convergence by breaking down the initial large inference problem into smaller
individual subproblems with better convergence properties. Moreover, it enables
parallel processing of the possibly complex inferences of the source-specific
parameters, which may otherwise create a computational bottleneck if processed
jointly as part of a hierarchical model. The approach is illustrated with both
simulated and real data.

Motivation: Public and private repositories of experimental data are growing
to sizes that require dedicated methods for finding relevant data. To improve
on the state of the art of keyword searches from annotations, methods for
content-based retrieval have been proposed. In the context of gene expression
experiments, most methods retrieve gene expression profiles, requiring each
experiment to be expressed as a single profile, typically of case vs. control.
A more general, recently suggested alternative is to retrieve experiments whose
models are good for modelling the query dataset. However, for very noisy and
high-dimensional query data, this retrieval criterion turns out to be very
noisy as well.
Results: We propose doing retrieval using a denoised model of the query
dataset, instead of the original noisy dataset itself. To this end, we
introduce a general probabilistic framework, where each experiment is modelled
separately and the retrieval is done by finding related models. For retrieval
of gene expression experiments, we use a probabilistic model called the product
partition model, which induces a clustering of genes that show similar
expression patterns across a number of samples. The suggested metric for
retrieval using clusterings is the normalized information distance. Empirical
results finally suggest that inference for the full probabilistic model can be
approximated with good performance using computationally faster heuristic
clustering approaches (e.g. $k$-means). The method is highly scalable and
straightforward to apply to construct a general-purpose gene expression
experiment retrieval method.
Availability: The method can be implemented using standard clustering
algorithms and normalized information distance, available in many statistical
software packages.
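
As a rough sketch of retrieval by comparing clusterings, the snippet below ranks stored experiments by their distance to a query clustering. It assumes the common definition of the normalized information distance between two clusterings, 1 - I(U;V) / max(H(U), H(V)); the abstract does not spell out the formula, so this is an assumption. The cluster labels and the experiment names (`exp_a`, `exp_b`, `exp_c`) are made up for illustration.

```python
from collections import Counter
from math import log

def entropy(labels):
    # Shannon entropy of a clustering given as a list of cluster labels.
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(u, v):
    # Mutual information between two clusterings of the same items.
    n = len(u)
    cu, cv, cuv = Counter(u), Counter(v), Counter(zip(u, v))
    return sum((c / n) * log((c / n) / ((cu[a] / n) * (cv[b] / n)))
               for (a, b), c in cuv.items())

def normalized_information_distance(u, v):
    # Assumed definition: 1 - I(U;V) / max(H(U), H(V)).
    if len(u) != len(v):
        raise ValueError("clusterings must label the same items")
    denom = max(entropy(u), entropy(v))
    if denom == 0.0:
        return 0.0   # both clusterings put every item in one cluster
    return 1.0 - mutual_information(u, v) / denom

# Hypothetical gene clusterings: one label per gene, same gene order throughout.
query = [0, 0, 1, 1, 2, 2]
database = {
    "exp_a": [1, 1, 0, 0, 2, 2],   # same partition as the query, relabeled
    "exp_b": [0, 1, 2, 0, 1, 2],
    "exp_c": [0, 0, 0, 1, 1, 1],
}
ranked = sorted(database,
                key=lambda name: normalized_information_distance(query, database[name]))
print(ranked)
```

The distance depends only on the partition and not on the label names, so clusterings produced by different algorithms or runs can be compared directly.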

We introduce a fully length-based Bayesian model for the population dynamics
of northern shrimp (Pandalus borealis). This has the advantage of structuring
the population in terms of a directly observable quantity, requiring no
indirect estimation of age distributions from measurements of size. The
introduced model is intended as a simplistic prototype around which further
developments and refinements can be built. As a case study, we use the model to
analyze the population of Skagerrak and the Norwegian Deep in the years
1988-2012.