
A common approach for Bayesian computation with big data is to partition the
data into smaller pieces, perform local inference for each piece separately,
and finally combine the results to obtain an approximation to the global
posterior. Looking at this from the bottom up, one can perform separate
analyses on individual sources of data and then combine these in a larger
Bayesian model. In either case, the idea of distributed modeling and inference
has both conceptual and computational appeal, but from the Bayesian perspective
there is no general way of handling the prior distribution: if the prior is
included in each separate inference, it will be multiplycounted when the
inferences are combined; but if the prior is itself divided into pieces, it may
not provide enough regularization for each separate computation, thus
eliminating one of the key advantages of Bayesian methods. To resolve this
dilemma, we propose expectation propagation (EP) as a general prototype for
distributed Bayesian inference. The central idea is to factor the likelihood
according to the data partitions, and to iteratively combine each factor with
an approximate model of the prior and all other parts of the data, thus
producing an overall approximation to the global posterior at convergence. In
this paper, we give an introduction to EP and an overview of some recent
developments of the method, with particular emphasis on its use in combining
inferences from partitioned data. In addition to distributed modeling of large
datasets, our unified treatment also includes hierarchical modeling of data
with a naturally partitioned structure. The paper describes a general
algorithmic framework, rather than a specific algorithm, and presents an
example implementation for it.

Linear Mixed Models (LMMs) are important tools in statistical genetics. When
used for feature selection, they allow to find a sparse set of genetic traits
that best predict a continuous phenotype of interest, while simultaneously
correcting for various confounding factors such as age, ethnicity and
population structure. Formulated as models for linear regression, LMMs have
been restricted to continuous phenotypes. We introduce the Sparse Probit Linear
Mixed Model (ProbitLMM), where we generalize the LMM modeling paradigm to
binary phenotypes. As a technical challenge, the model no longer possesses a
closedform likelihood function. In this paper, we present a scalable
approximate inference algorithm that lets us fit the model to highdimensional
data sets. We show on three realworld examples from different domains that in
the setup of binary labels, our algorithm leads to better prediction accuracies
and also selects features which show less correlation with the confounding
factors.

Maximum entropy modeling is a flexible and popular framework for formulating
statistical models given partial knowledge. In this paper, rather than the
traditional method of optimizing over the continuous density directly, we learn
a smooth and invertible transformation that maps a simple distribution to the
desired maximum entropy distribution. Doing so is nontrivial in that the
objective being maximized (entropy) is a function of the density itself. By
exploiting recent developments in normalizing flow networks, we cast the
maximum entropy problem into a finitedimensional constrained optimization, and
solve the problem by combining stochastic optimization with the augmented
Lagrangian method. Simulation results demonstrate the effectiveness of our
method, and applications to finance and computer vision show the flexibility
and accuracy of using maximum entropy flow networks.

Reconstruction of neuroanatomy is a fundamental problem in neuroscience.
Stochastic expression of colors in individual cells is a promising tool,
although its use in the nervous system has been limited due to various sources
of variability in expression. Moreover, the intermingled anatomy of neuronal
trees is challenging for existing segmentation algorithms. Here, we propose a
method to automate the segmentation of neurons in such (potentially
pseudocolored) images. The method uses spatiocolor relations between the
voxels, generates supervoxels to reduce the problem size by four orders of
magnitude before the final segmentation, and is parallelizable over the
supervoxels. To quantify performance and gain insight, we generate simulated
images, where the noise level and characteristics, the density of expression,
and the number of fluorophore types are variable. We also present segmentations
of real Brainbow images of the mouse hippocampus, which reveal many of the
dendritic segments.

A body of recent work in modeling neural activity focuses on recovering
lowdimensional latent features that capture the statistical structure of
largescale neural populations. Most such approaches have focused on linear
generative models, where inference is computationally tractable. Here, we
propose fLDS, a general class of nonlinear generative models that permits the
firing rate of each neuron to vary as an arbitrary smooth function of a latent,
linear dynamical state. This extra flexibility allows the model to capture a
richer set of neural variability than a purely linear model, but retains an
easily visualizable lowdimensional latent space. To fit this class of
nonconjugate models we propose a variational inference scheme, along with a
novel approximate posterior capable of capturing rich temporal correlations
across time. We show that our techniques permit inference in a wide class of
generative models.We also show in application to two neural datasets that,
compared to stateoftheart neural population models, fLDS captures a much
larger proportion of neural variability with a small number of latent
dimensions, providing superior predictive performance and interpretability.

Kernel methods are one of the mainstays of machine learning, but the problem
of kernel learning remains challenging, with only a few heuristics and very
little theory. This is of particular importance in methods based on estimation
of kernel mean embeddings of probability measures. For characteristic kernels,
which include most commonly used ones, the kernel mean embedding uniquely
determines its probability measure, so it can be used to design a powerful
statistical testing framework, which includes nonparametric twosample and
independence tests. In practice, however, the performance of these tests can be
very sensitive to the choice of kernel and its lengthscale parameters. To
address this central issue, we propose a new probabilistic model for kernel
mean embeddings, the Bayesian Kernel Embedding model, combining a Gaussian
process prior over the Reproducing Kernel Hilbert Space containing the mean
embedding with a conjugate likelihood function, thus yielding a closed form
posterior over the mean embedding. The posterior mean of our model is closely
related to recently proposed shrinkage estimators for kernel mean embeddings,
while the posterior uncertainty is a new, interesting feature with various
possible applications. Critically for the purposes of kernel learning, our
model gives a simple, closed form marginal pseudolikelihood of the observed
data given the kernel hyperparameters. This marginal pseudolikelihood can
either be optimized to inform the hyperparameter choice or fully Bayesian
inference can be used.

The computational and storage complexity of kernel machines presents the
primary barrier to their scaling to large, modern, datasets. A common way to
tackle the scalability issue is to use the conjugate gradient algorithm, which
relieves the constraints on both storage (the kernel matrix need not be stored)
and computation (both stochastic gradients and parallelization can be used).
Even so, conjugate gradient is not without its own issues: the conditioning of
kernel matrices is often such that conjugate gradients will have poor
convergence in practice. Preconditioning is a common approach to alleviating
this issue. Here we propose preconditioned conjugate gradients for kernel
machines, and develop a broad range of preconditioners particularly useful for
kernel matrices. We describe a scalable approach to both solving kernel
machines and learning their hyperparameters. We show this approach is exact in
the limit of iterations and outperforms stateoftheart approximations for a
given computational budget.

Linear dimensionality reduction methods are a cornerstone of analyzing high
dimensional data, due to their simple geometric interpretations and typically
attractive computational properties. These methods capture many data features
of interest, such as covariance, dynamical structure, correlation between data
sets, inputoutput relationships, and margin between data classes. Methods have
been developed with a variety of names and motivations in many fields, and
perhaps as a result the connections between all these methods have not been
highlighted. Here we survey methods from this disparate literature as
optimization programs over matrix manifolds. We discuss principal component
analysis, factor analysis, linear multidimensional scaling, Fisher's linear
discriminant analysis, canonical correlations analysis, maximum autocorrelation
factors, slow feature analysis, sufficient dimensionality reduction,
undercomplete independent component analysis, linear regression, distance
metric learning, and more. This optimization framework gives insight to some
rarely discussed shortcomings of wellknown methods, such as the suboptimality
of certain eigenvector solutions. Modern techniques for optimization over
matrix manifolds enable a generic linear dimensionality reduction solver, which
accepts as input data and an objective to be optimized, and returns, as output,
an optimal lowdimensional projection of the data. This simple optimization
framework further allows straightforward generalizations and novel variants of
classical methods, which we demonstrate here by creating an
orthogonalprojection canonical correlations analysis. More broadly, this
survey and generic solver suggest that linear dimensionality reduction can move
toward becoming a blackbox, objectiveagnostic numerical technology.

Neuroprosthetic braincomputer interfaces function via an algorithm which
decodes neural activity of the user into movements of an end effector, such as
a cursor or robotic arm. In practice, the decoder is often learned by updating
its parameters while the user performs a task. When the user's intention is not
directly observable, recent methods have demonstrated value in training the
decoder against a surrogate for the user's intended movement. We describe how
training a decoder in this way is a novel variant of an imitation learning
problem, where an oracle or expert is employed for supervised training in lieu
of direct observations, which are not available. Specifically, we describe how
a generic imitation learning metaalgorithm, dataset aggregation (DAgger, [1]),
can be adapted to train a generic braincomputer interface. By deriving
existing learning algorithms for braincomputer interfaces in this framework,
we provide a novel analysis of regret (an important metric of learning
efficacy) for braincomputer interfaces. This analysis allows us to
characterize the space of algorithmic variants and bounds on their regret
rates. Existing approaches for decoder learning have been performed in the
cursor control setting, but the available design principles for these decoders
are such that it has been impossible to scale them to naturalistic settings.
Leveraging our findings, we then offer an algorithm that combines imitation
learning with optimal control, which should allow for training of arbitrary
effectors for which optimal control can generate goaloriented control. We
demonstrate this novel and general BCI algorithm with simulated neuroprosthetic
control of a 26 degreeoffreedom model of an arm, a sophisticated and
realistic end effector.

Gaussian processes are typically used for smoothing and interpolation on
small datasets. We introduce a new Bayesian nonparametric framework  GPatt 
enabling automatic pattern extrapolation with Gaussian processes on large
multidimensional datasets. GPatt unifies and extends highly expressive kernels
and fast exact inference techniques. Without human intervention  no hand
crafting of kernel features, and no sophisticated initialisation procedures 
we show that GPatt can solve large scale pattern extrapolation, inpainting, and
kernel discovery problems, including a problem with 383400 training points. We
find that GPatt significantly outperforms popular alternative scalable Gaussian
process methods in speed and accuracy. Moreover, we discover profound
differences between each of these methods, suggesting expressive kernels,
nonparametric representations, and exact inference are useful for modelling
large scale multidimensional patterns.

While Gaussian probability densities are omnipresent in applied mathematics,
Gaussian cumulative probabilities are hard to calculate in any but the
univariate case. We study the utility of Expectation Propagation (EP) as an
approximate integration method for this problem. For rectangular integration
regions, the approximation is highly accurate. We also extend the derivations
to the more general case of polyhedral integration regions. However, we find
that in this polyhedral case, EP's answer, though often accurate, can be almost
arbitrarily wrong. We consider these unexpected results empirically and
theoretically, both for the problem of Gaussian probabilities and for EP more
generally. These results elucidate an interesting and nonobvious feature of EP
not yet studied in detail.

Exact Gaussian Process (GP) regression has O(N^3) runtime for data size N,
making it intractable for large N. Many algorithms for improving GP scaling
approximate the covariance with lower rank matrices. Other work has exploited
structure inherent in particular covariance functions, including GPs with
implied Markov structure, and equispaced inputs (both enable O(N) runtime).
However, these GP advances have not been extended to the multidimensional input
setting, despite the preponderance of multidimensional applications. This paper
introduces and tests novel extensions of structured GPs to multidimensional
inputs. We present new methods for additive GPs, showing a novel connection
between the classic backfitting method and the Bayesian framework. To achieve
optimal accuracycomplexity tradeoff, we extend this model with a novel variant
of projection pursuit regression. Our primary result  projection pursuit
Gaussian Process Regression  shows orders of magnitude speedup while
preserving high accuracy. The natural second and third steps include
nonGaussian observations and higher dimensional equispaced grid methods. We
introduce novel techniques to address both of these necessary directions. We
thoroughly illustrate the power of these three advances on several datasets,
achieving close performance to the naive Full GP at orders of magnitude less
cost.