
Divide-and-conquer methods for Bayesian inference provide a general
approach for tractable posterior inference when the sample size is large. These
methods divide the data into smaller subsets, sample from the posterior
distribution of parameters in parallel on all the subsets, and combine
posterior samples from all the subsets to approximate the full data posterior
distribution. The smaller size of any subset compared to the full data implies
that posterior sampling on any subset is computationally more efficient than
sampling from the true posterior distribution. Since the combination step takes
negligible time relative to sampling, posterior computations can be scaled to
massive data by dividing the full data into a sufficiently large number of data
subsets. One such approach relies on the geometry of posterior distributions
estimated across different subsets and combines them through their barycenter
in a Wasserstein space of probability measures. We provide theoretical
guarantees on the accuracy of approximation that are valid in many
applications. We show that the geometric method approximates the full data
posterior distribution better than its competitors across diverse simulations
and reproduces known results when applied to a movie ratings database.
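
For reference, the barycenter used in the combination step can be written down explicitly. The display below is the standard definition, under the assumption that the Wasserstein distance is of order 2 (the text above does not state the order); the symbols $K$ (number of subsets), $\Pi_j$ (the $j$-th subset posterior), and $\mathcal{P}_2(\Theta)$ (measures on the parameter space with finite second moment) are notation introduced here for illustration.

```latex
\overline{\Pi}
  \;=\; \operatorname*{arg\,min}_{\Pi \in \mathcal{P}_2(\Theta)}
        \frac{1}{K} \sum_{j=1}^{K} W_2^2(\Pi, \Pi_j),
\qquad
W_2^2(\mu, \nu)
  \;=\; \inf_{\gamma \in \Gamma(\mu, \nu)}
        \int \lVert x - y \rVert^2 \, d\gamma(x, y),
```

where $\Gamma(\mu, \nu)$ denotes the set of couplings of $\mu$ and $\nu$.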

Bayesian sparse factor models have proven useful for characterizing
dependence in multivariate data, but scaling computation to large numbers of
samples and dimensions is problematic. We propose expandable factor analysis
for scalable inference in factor models when the number of factors is unknown.
The method relies on a continuous shrinkage prior for efficient maximum a
posteriori estimation of a low-rank and sparse loadings matrix. The structure
of the prior leads to an estimation algorithm that accommodates uncertainty in
the number of factors. We propose an information criterion to select the
hyperparameters of the prior. Expandable factor analysis has better false
discovery rates and true positive rates than its competitors across diverse
simulations. We apply the proposed approach to a gene expression study of aging
in mice, illustrating superior results relative to four competing methods.

Flexible hierarchical Bayesian modeling of massive data is challenging due to
poorly scaling computations in large sample size settings. This article is
motivated by spatial process models for analyzing geostatistical data, which
typically entail computations that become prohibitive as the number of spatial
locations becomes large. We propose a three-step divide-and-conquer strategy
within the Bayesian paradigm to achieve massive scalability for any spatial
process model. We partition the data into a large number of subsets, apply a
readily available Bayesian spatial process model on every subset in parallel,
and optimally combine the posterior distributions estimated across all the
subsets into a pseudo-posterior distribution that conditions on the entire
data. The combined pseudo-posterior distribution is used for predicting the
responses at arbitrary locations and for performing posterior inference on the
model parameters and the residual spatial surface. We call this approach
"Distributed Kriging" (DISK). It offers significant advantages in applications
where the entire data are or can be stored on multiple machines. Under the
standard theoretical setup, we show that if the number of subsets is not too
large, then the Bayes risk of estimating the true residual spatial surface
using the DISK posterior distribution decays to zero at a nearly optimal rate.
While DISK is a general approach to distributed nonparametric regression, we
focus on its applications in spatial statistics and demonstrate its empirical
performance using a stationary full-rank and a nonstationary low-rank model
based on a Gaussian process (GP) prior. A variety of simulations and a
geostatistical analysis of the Pacific Ocean sea surface temperature data
validate our theoretical results.
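
The first of the three steps, partitioning the data, can be sketched as follows. This is a minimal illustration that assumes a uniformly random partition of the location indices; the text above does not say how the subsets are formed, and the function name `random_partition` and its parameters are hypothetical.

```python
import numpy as np


def random_partition(n_locations, n_subsets, seed=0):
    """Split the indices 0..n_locations-1 into disjoint random subsets.

    Assumption: a uniformly random partition; other schemes are possible.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_locations)
    # array_split tolerates subset sizes that differ by at most one.
    return [np.sort(part) for part in np.array_split(shuffled, n_subsets)]


# Each index array would select the rows passed to one parallel model fit.
subsets = random_partition(n_locations=10, n_subsets=3)
```

Each spatial location lands in exactly one subset, so the subset-level fits can run independently before their posteriors are combined.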

There is a lack of simple and scalable algorithms for uncertainty
quantification. Bayesian methods quantify uncertainty through posterior and
predictive distributions, but it is difficult to rapidly estimate summaries of
these distributions, such as quantiles and intervals. Variational Bayes
approximations are widely used, but may badly underestimate posterior
covariance. Typically, the focus of Bayesian inference is on point and interval
estimates for one-dimensional functionals of interest. In small-scale problems,
Markov chain Monte Carlo algorithms remain the gold standard, but such
algorithms face major problems in scaling up to big data. Various modifications
have been proposed, using parallelization and approximations based on
subsamples, but such approaches are either highly complex or lack theoretical
support and/or good performance outside of narrow settings. We propose a very
simple and general posterior interval estimation algorithm, which is based on
running Markov chain Monte Carlo in parallel for subsets of the data and
averaging quantiles estimated from each subset. We provide strong theoretical
guarantees and illustrate performance in several applications.
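
The quantile-averaging step can be sketched in a few lines. This is a minimal illustration, assuming the per-subset Markov chain Monte Carlo draws of a scalar functional are already available as arrays; how each subset posterior is constructed is not specified above, and the function names are hypothetical.

```python
import numpy as np


def average_subset_quantiles(subset_draws, probs):
    """Average the quantiles of a scalar functional across subsets.

    subset_draws: sequence of 1-D arrays, one array of draws per subset.
    probs: quantile levels in [0, 1].
    """
    probs = np.asarray(probs, dtype=float)
    per_subset = np.stack(
        [np.quantile(np.asarray(d, dtype=float), probs) for d in subset_draws]
    )
    # One row per subset, one column per level: average down the rows.
    return per_subset.mean(axis=0)


def interval_from_subsets(subset_draws, level=0.95):
    """Equal-tailed interval built from the averaged subset quantiles."""
    alpha = 1.0 - level
    lower, upper = average_subset_quantiles(
        subset_draws, [alpha / 2.0, 1.0 - alpha / 2.0]
    )
    return float(lower), float(upper)
```

The subsets may contain different numbers of draws, since only the quantiles are combined.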

The United States Bureau of Labor Statistics collects data using survey
instruments under informative sampling designs that assign probabilities of
inclusion to be correlated with the response. The bureau extensively uses
Bayesian hierarchical models and posterior sampling to impute missing items in
respondent-level data and to infer population parameters. Posterior sampling
for survey data collected under informative designs is computationally
expensive and does not support the bureau's production schedules. Motivated by
this problem, we propose a new method to scale Bayesian computations in
informative sampling designs. Our method divides the data into smaller subsets,
performs posterior sampling in parallel for every subset, and combines the
collection of posterior samples from all the subsets through their mean in the
Wasserstein space of order 2. Theoretically, we construct conditions on a class
of sampling designs where posterior consistency of the proposed method is
achieved. Empirically, we demonstrate that our method is competitive with
traditional methods while being significantly faster in many simulations and in
the Current Employment Statistics survey conducted by the bureau.
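
For a one-dimensional parameter, the order-2 Wasserstein mean of empirical measures has a simple closed form when every subset contributes the same number of draws: sort each subset's draws and average them rank by rank. The sketch below illustrates only this special case; the function name is hypothetical, and the multivariate case needs a numerical solver that is not shown.

```python
import numpy as np


def wasserstein2_mean_1d(subset_draws):
    """Atoms of the order-2 Wasserstein mean of 1-D empirical measures.

    subset_draws: array-like of shape (n_subsets, n_draws).
    Assumption: every subset supplies the same number of draws.
    """
    draws = np.asarray(subset_draws, dtype=float)
    if draws.ndim != 2:
        raise ValueError("expected shape (n_subsets, n_draws)")
    # Sorting gives each subset's order statistics; averaging them across
    # subsets averages the empirical quantile functions.
    return np.sort(draws, axis=1).mean(axis=0)


combined = wasserstein2_mean_1d([[3.0, 1.0, 2.0], [10.0, 30.0, 20.0]])
```

The returned array can be treated as draws from the combined distribution for a scalar parameter.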

We propose a novel approach to Bayesian analysis that is provably robust to
outliers in the data and often has computational advantages over standard
methods. Our technique is based on splitting the data into non-overlapping
subgroups, evaluating the posterior distribution given each independent
subgroup, and then combining the resulting measures. The main novelty of our
approach is the proposed aggregation step, which is based on the evaluation of
a median in the space of probability measures equipped with a suitable
collection of distances that can be quickly and efficiently evaluated in
practice. We present both theoretical and numerical evidence illustrating the
improvements achieved by our method.
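
To see why a median-type aggregation resists outliers, consider a simplified stand-in: the geometric median of finite-dimensional vectors, computed with Weiszfeld's algorithm. Here each vector plays the role of a summary of one subgroup posterior (for example, its mean). This is an illustration only, not the aggregation over probability measures described above, and the function name and parameters are hypothetical.

```python
import numpy as np


def geometric_median(points, n_iter=200, tol=1e-10, eps=1e-12):
    """Weiszfeld iteration for the geometric median of the rows of points."""
    pts = np.asarray(points, dtype=float)
    current = pts.mean(axis=0)  # start from the ordinary mean
    for _ in range(n_iter):
        # Distances to the current estimate, floored to avoid dividing by 0.
        dist = np.maximum(np.linalg.norm(pts - current, axis=1), eps)
        weights = 1.0 / dist
        updated = (weights[:, None] * pts).sum(axis=0) / weights.sum()
        if np.linalg.norm(updated - current) < tol:
            return updated
        current = updated
    return current


# Four subgroup summaries agree; a fifth is corrupted by outliers.
summaries = np.array(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [100.0, 100.0]]
)
robust = geometric_median(summaries)
naive = summaries.mean(axis=0)
```

In this toy input the ordinary mean is pulled to (20.4, 20.4) by the corrupted summary, while the geometric median stays inside the cluster formed by the other four.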