-
We study full Bayesian procedures for sparse linear regression when errors
have a symmetric but otherwise unknown distribution. The unknown error
distribution is endowed with a symmetrized Dirichlet process mixture of
Gaussians. For the prior on regression coefficients, a mixture of point masses
at zero and continuous distributions is considered. We study the behavior of the
posterior as the number of predictors diverges. Conditions are provided for
consistency in the mean Hellinger distance. The compatibility and restricted
eigenvalue conditions yield the minimax convergence rate of the regression
coefficients in $\ell_1$- and $\ell_2$-norms, respectively. The convergence
rate is adaptive to both the unknown sparsity level and the unknown symmetric
error density under compatibility conditions. In addition, strong model
selection consistency and a semi-parametric Bernstein-von Mises theorem are
proven under slightly stronger conditions.
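
As a minimal sketch of the coefficient prior described above (a mixture of a
point mass at zero and a continuous distribution), the following draws
coefficients from such a prior. The Laplace slab, the inclusion probability and
the function name are illustrative assumptions, not choices stated in the
abstract.

```python
import numpy as np

def sample_spike_and_slab(p, inclusion_prob, slab_scale, rng):
    """Draw p coefficients from a point-mass-at-zero / continuous-slab mixture.

    Each coefficient is exactly zero with probability 1 - inclusion_prob and is
    otherwise drawn from a Laplace(0, slab_scale) slab (an assumed slab choice).
    """
    included = rng.random(p) < inclusion_prob
    slab_draws = rng.laplace(loc=0.0, scale=slab_scale, size=p)
    return np.where(included, slab_draws, 0.0)

rng = np.random.default_rng(0)
beta = sample_spike_and_slab(p=1000, inclusion_prob=0.05, slab_scale=1.0, rng=rng)
print("nonzero coefficients:", np.count_nonzero(beta))
```

Most draws are exactly zero, which is what makes model selection statements
about the support of the coefficient vector meaningful under this prior.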
-
Data lying in a high dimensional ambient space are commonly thought to have a
much lower intrinsic dimension. In particular, the data may be concentrated
near a lower-dimensional subspace or manifold. There is an immense literature
focused on approximating the unknown subspace, and in exploiting such
approximations in clustering, data compression, and building of predictive
models. Most of the literature relies on approximating subspaces using a
locally linear, and potentially multiscale, dictionary. In this article, we
propose a simple and general alternative, which instead uses pieces of spheres,
or spherelets, to locally approximate the unknown subspace. Theory is developed
showing that spherelets can produce lower covering numbers and MSEs for many
manifolds. We develop spherical principal components analysis (SPCA). Results
relative to state-of-the-art competitors show gains in ability to accurately
approximate the subspace with fewer components. In addition, unlike most
competitors, our approach can be used for data denoising and can efficiently
embed new data without retraining. The methods are illustrated with standard
toy manifold learning examples, and applications to multiple real data sets.
-
Divide-and-conquer based methods for Bayesian inference provide a general
approach for tractable posterior inference when the sample size is large. These
methods divide the data into smaller subsets, sample from the posterior
distribution of parameters in parallel on all the subsets, and combine
posterior samples from all the subsets to approximate the full data posterior
distribution. The smaller size of any subset compared to the full data implies
that posterior sampling on any subset is computationally more efficient than
sampling from the true posterior distribution. Since the combination step takes
negligible time relative to sampling, posterior computations can be scaled to
massive data by dividing the full data into a sufficiently large number of data
subsets. One such approach relies on the geometry of posterior distributions
estimated across different subsets and combines them through their barycenter
in a Wasserstein space of probability measures. We provide theoretical
guarantees on the accuracy of approximation that are valid in many
applications. We show that the geometric method approximates the full data
posterior distribution better than its competitors across diverse simulations
and reproduces known results when applied to a movie ratings database.
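
The combination step can be sketched for a scalar parameter, where the
2-Wasserstein barycenter of equally weighted empirical measures with the same
number of atoms is obtained by averaging their sorted atoms. The toy model
(normal mean, unit variance, flat prior), the direct sampling that stands in
for MCMC, and the choice to raise each subset likelihood to the power of the
number of subsets are assumptions made for illustration only.

```python
import numpy as np

def subset_posterior_draws(y_subset, n_subsets, n_draws, rng):
    """Draws from one subset posterior for a normal mean (unit variance, flat prior).

    Assumption: the subset likelihood is raised to the power n_subsets, so the
    subset posterior variance is 1 / (n_subsets * subset size).
    """
    post_var = 1.0 / (n_subsets * len(y_subset))
    return rng.normal(y_subset.mean(), np.sqrt(post_var), size=n_draws)

def barycenter_1d(draws_per_subset):
    """1-D Wasserstein-2 barycenter: average the order statistics across subsets."""
    sorted_draws = np.sort(np.asarray(draws_per_subset), axis=1)
    return sorted_draws.mean(axis=0)

rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.0, size=10_000)
K = 10
subsets = np.array_split(rng.permutation(y), K)
draws = [subset_posterior_draws(s, K, 2000, rng) for s in subsets]
wasp_draws = barycenter_1d(draws)
print("barycenter mean and sd:", wasp_draws.mean(), wasp_draws.std())
```

In higher dimensions the barycenter no longer has this closed form and has to
be computed numerically.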
-
Bayesian sparse factor models have proven useful for characterizing
dependence in multivariate data, but scaling computation to large numbers of
samples and dimensions is problematic. We propose expandable factor analysis
for scalable inference in factor models when the number of factors is unknown.
The method relies on a continuous shrinkage prior for efficient maximum a
posteriori estimation of a low-rank and sparse loadings matrix. The structure
of the prior leads to an estimation algorithm that accommodates uncertainty in
the number of factors. We propose an information criterion to select the
hyperparameters of the prior. Expandable factor analysis has better false
discovery rates and true positive rates than its competitors across diverse
simulations. We apply the proposed approach to a gene expression study of aging
in mice, illustrating superior results relative to four competing methods.
-
We propose a new approach for assigning weights to models using a
divergence-based method ({\em D-probabilities}), relying on evaluating
parametric models relative to a nonparametric Bayesian reference using
Kullback-Leibler divergence. D-probabilities are useful in goodness-of-fit
assessments, in comparing imperfect models, and in providing model weights to
be used in model aggregation. D-probabilities avoid some of the disadvantages
of Bayesian model probabilities, such as large sensitivity to prior choice, and
tend to place higher weight on a greater diversity of models. In an application
to linear model selection against a Gaussian process reference, we provide
simple analytic forms for routine implementation and show that D-probabilities
automatically penalize model complexity. Some asymptotic properties are
described, and we provide interesting probabilistic interpretations of the
proposed model weights. The framework is illustrated through simulation
examples and an ozone data application.
-
Asymptotic theory of tail index estimation has been studied extensively in
the frequentist literature on extreme values, but rarely in the Bayesian
context. We investigate whether popular Bayesian kernel mixture models are able
to support heavy tailed distributions and consistently estimate the tail index.
We show that posterior inconsistency in tail index is surprisingly common for
both parametric and nonparametric mixture models. We then present a set of
sufficient conditions under which posterior consistency in tail index can be
achieved, and verify these conditions for Pareto mixture models under general
mixing priors.
-
This article is motivated by soccer positional passing networks collected
across multiple games. We refer to these data as replicated spatial passing
networks---to accurately model such data it is necessary to take into account
the spatial positions of the passer and receiver for each passing event. This
spatial registration and replicates that occur across games represent key
differences with usual social network data. As a key step before investigating
how the passing dynamics influence team performance, we focus on developing
methods for summarizing different teams' passing strategies. Our proposed
approach relies on a novel multiresolution data representation framework and
a Poisson nonnegative block term decomposition model, which automatically
produces coarse-to-fine low-rank network motifs. The proposed methods are
applied to detailed passing record data collected from the 2014 FIFA World Cup.
-
Discrete random structures are important tools in Bayesian nonparametrics and
the resulting models have proven effective in density estimation, clustering,
topic modeling and prediction, among others. In this paper, we consider nested
processes and study the dependence structures they induce. Dependence ranges
between homogeneity, corresponding to full exchangeability, and maximum
heterogeneity, corresponding to (unconditional) independence across samples.
The popular nested Dirichlet process is shown to degenerate to the fully
exchangeable case when there are ties across samples at the observed or latent
level. To overcome this drawback, inherent to nesting general discrete random
measures, we introduce a novel class of latent nested processes. These are
obtained by adding common and group-specific completely random measures and,
then, normalising to yield dependent random probability measures. We provide
results on the partition distributions induced by latent nested processes, and
develop a Markov chain Monte Carlo sampler for Bayesian inference. A test for
distributional homogeneity across groups is obtained as a by-product. The
results and their inferential implications are showcased on synthetic and real
data.
-
Motivation: Although there is a rich literature on methods for assessing the
impact of functional predictors, the focus has been on approaches for dimension
reduction that can fail dramatically in certain applications. Examples of
standard approaches include functional linear models, functional principal
components regression, and cluster-based approaches, such as latent trajectory
analysis. This article is motivated by applications in which the dynamics in a
predictor, across times when the value is relatively extreme, are particularly
informative about the response. For example, physicians are interested in
relating the dynamics of blood pressure changes during surgery to post-surgery
adverse outcomes, and it is thought that the dynamics are more important when
blood pressure is significantly elevated or lowered.
Methods: We propose a novel class of extrema-weighted feature (XWF)
extraction models. Key components in defining XWFs include the marginal density
of the predictor, a function up-weighting values at high quantiles of this
marginal, and functionals characterizing local dynamics. Algorithms are
proposed for fitting of XWF-based regression and classification models, and are
compared with current methods for functional predictors in simulations and a
blood pressure during surgery application.
Results: XWFs find features of intraoperative blood pressure trajectories
that are predictive of postoperative mortality. By their nature, most of these
features cannot be found by previous methods.
-
High throughput screening of compounds (chemicals) is an essential part of
drug discovery [7], involving thousands to millions of compounds, with the
purpose of identifying candidate hits. Most statistical tools, including the
industry standard B-score method, work on individual compound plates and do not
exploit cross-plate correlation or statistical strength among plates. We
present a new statistical framework for high throughput screening of compounds
based on Bayesian nonparametric modeling. The proposed approach is able to
identify candidate hits from multiple plates simultaneously, sharing
statistical strength among plates and providing more robust estimates of
compound activity. It can flexibly accommodate arbitrary distributions of
compound activities and is applicable to any plate geometry. The algorithm
provides a principled statistical approach for hit identification and false
discovery rate control. Experiments demonstrate significant improvements in hit
identification sensitivity and specificity over the B-score method, which is
highly sensitive to threshold choice. The framework is implemented as an
efficient R extension package BHTSpack and is suitable for large scale data
sets.
-
There has been considerable interest in making Bayesian inference more
scalable. In big data settings, most literature focuses on reducing the
computing time per iteration, with less focus on reducing the number of
iterations needed in Markov chain Monte Carlo (MCMC). This article focuses on
data augmentation MCMC (DA-MCMC), a widely used technique. DA-MCMC samples tend
to become highly autocorrelated in large data samples, due to a miscalibration
problem in which conditional posterior distributions given augmented data are
too concentrated. This makes it necessary to collect very long MCMC paths to
obtain acceptably low MC error. To combat this inefficiency, we propose a
family of calibrated data augmentation algorithms, which appropriately adjust
the variance of conditional posterior distributions. A Metropolis-Hastings step
is used to eliminate bias in the stationary distribution of the resulting
sampler. Compared to existing alternatives, this approach can dramatically
reduce MC error by reducing autocorrelation and increasing the effective number
of DA-MCMC samples per computing time. The approach is simple and applicable to
a broad variety of existing data augmentation algorithms, and we focus on three
popular models: probit, logistic and Poisson log-linear. Dramatic gains in
computational efficiency are shown in applications.
-
Many modern applications collect highly imbalanced categorical data, with
some categories relatively rare. Bayesian hierarchical models combat data
sparsity by borrowing information, while also quantifying uncertainty. However,
posterior computation presents a fundamental barrier to routine use; a single
class of algorithms does not work well in all settings and practitioners waste
time trying different types of MCMC approaches. This article was motivated by
an application to quantitative advertising in which we encountered extremely
poor computational performance for common data augmentation MCMC algorithms but
obtained excellent performance for adaptive Metropolis. To obtain a deeper
understanding of this behavior, we give strong theoretical results on computational
complexity in an infinitely imbalanced asymptotic regime. Our results show
computational complexity of Metropolis is logarithmic in sample size, while
data augmentation is polynomial in sample size. The root cause of poor
performance of data augmentation is a discrepancy between the rates at which
the target density and MCMC step sizes concentrate. In general, MCMC algorithms
that have a similar discrepancy will fail in large samples, a result with
substantial practical impact.
-
With the routine collection of massive-dimensional predictors in many
application areas, screening methods that rapidly identify a small subset of
promising predictors have become commonplace. We propose a new MOdular Bayes
Screening (MOBS) approach, which involves several novel characteristics that
can potentially lead to improved performance. MOBS first applies a Bayesian
mixture model to the marginal distribution of the response, obtaining posterior
samples of mixture weights, cluster-specific parameters, and cluster
allocations for each subject. Hypothesis tests are then introduced,
corresponding to whether or not to include a given predictor, with posterior
probabilities for each hypothesis available analytically conditionally on
unknowns sampled in the first stage and tuning parameters controlling borrowing
of information across tests. By marginalizing over the first stage posterior
samples, we avoid under-estimation of uncertainty typical of two-stage methods.
We greatly simplify the model specification and reduce computational complexity
by using {\em modularization}. We provide basic theoretical support for this
approach, and illustrate excellent performance relative to competitors in
simulation studies and the ability to capture complex shifts beyond simple
differences in means. The method is illustrated with applications to genomics
by using a very high-dimensional cis-eQTL dataset with roughly 38 million SNPs.
-
There has been substantial recent interest in record linkage, attempting to
group the records pertaining to the same entities from a large database lacking
unique identifiers. This can be viewed as a type of "microclustering," with few
observations per cluster and a very large number of clusters. A variety of
methods have been proposed, but there is a lack of literature providing
theoretical guarantees on performance. We show that the problem is
fundamentally hard from a theoretical perspective, and even in idealized cases,
accurate entity resolution is effectively impossible when the number of
entities is small relative to the number of records and/or the separation among
records from different entities is not extremely large. To characterize the
fundamental difficulty, we focus on entity resolution based on multivariate
Gaussian mixture models, but our conclusions apply broadly and are supported by
simulation studies inspired by human rights applications. These results suggest
conservatism in interpretation of the results of record linkage, support
collection of additional data to more accurately disambiguate the entities, and
motivate a focus on coarser inference. For example, results from a simulation
study suggest that sometimes one may obtain accurate results for population
size estimation even when fine scale entity resolution is inaccurate.
-
Variational inference (VI) provides fast approximations of a Bayesian
posterior in part because it formulates posterior approximation as an
optimization problem: to find the closest distribution to the exact posterior
over some family of distributions. For practical reasons, the family of
distributions in VI is usually constrained so that it does not include the
exact posterior, even as a limit point. Thus, no matter how long VI is run, the
resulting approximation will not approach the exact posterior. We propose to
instead consider a more flexible approximating family consisting of all
possible finite mixtures of a parametric base distribution (e.g., Gaussian).
For efficient inference, we borrow ideas from gradient boosting to develop an
algorithm we call boosting variational inference (BVI). BVI iteratively
improves the current approximation by mixing it with a new component from the
base distribution family and thereby yields progressively more accurate
posterior approximations as more computing time is spent. Unlike a number of
common VI variants including mean-field VI, BVI is able to capture
multimodality, general posterior covariance, and nonstandard posterior shapes.
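
The mixing step described above can be sketched as follows for
one-dimensional Gaussian components. Only the bookkeeping of the update
q_new = (1 - alpha) * q_old + alpha * h is shown; how the new component h and
the step size alpha are chosen is not specified in the abstract, so the values
used here are placeholders, and all function names are hypothetical.

```python
import numpy as np

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return np.exp(-0.5 * z**2) / (std * np.sqrt(2.0 * np.pi))

def mix_in_component(weights, means, stds, new_mean, new_std, alpha):
    """Return the mixture (1 - alpha) * q_old + alpha * N(new_mean, new_std^2)."""
    weights = np.append((1.0 - alpha) * np.asarray(weights, dtype=float), alpha)
    means = np.append(np.asarray(means, dtype=float), new_mean)
    stds = np.append(np.asarray(stds, dtype=float), new_std)
    return weights, means, stds

def mixture_pdf(x, weights, means, stds):
    """Evaluate the mixture density at x (scalar or array)."""
    x = np.asarray(x, dtype=float)[..., None]
    return (weights * normal_pdf(x, means, stds)).sum(axis=-1)

# Start from a single Gaussian and mix in a second mode (placeholder values).
weights, means, stds = np.array([1.0]), np.array([-2.0]), np.array([1.0])
weights, means, stds = mix_in_component(
    weights, means, stds, new_mean=2.0, new_std=0.5, alpha=0.3
)
print("weights:", weights)
```

Because old weights are rescaled by 1 - alpha, the weights stay on the simplex
after every update, so each iterate is itself a valid finite mixture.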
-
There is increasing interest in learning how human brain networks vary as a
function of a continuous trait, but flexible and efficient procedures to
accomplish this goal are limited. We develop a Bayesian semiparametric model,
which combines low-rank factorizations and flexible Gaussian process priors to
learn changes in the conditional expectation of a network-valued random
variable across the values of a continuous predictor, while including
subject-specific random effects. The formulation leads to a general framework
for inference on changes in brain network structures across human traits,
facilitating borrowing of information and coherently characterizing
uncertainty. We provide an efficient Gibbs sampler for posterior computation
along with simple procedures for inference, prediction and goodness-of-fit
assessments. The model is applied to learn how human brain networks vary across
individuals with different intelligence scores. Results provide interesting
insights on the association between intelligence and brain connectivity, while
demonstrating good predictive performance.
-
There is a lack of simple and scalable algorithms for uncertainty
quantification. Bayesian methods quantify uncertainty through posterior and
predictive distributions, but it is difficult to rapidly estimate summaries of
these distributions, such as quantiles and intervals. Variational Bayes
approximations are widely used, but may badly underestimate posterior
covariance. Typically, the focus of Bayesian inference is on point and interval
estimates for one-dimensional functionals of interest. In small scale problems,
Markov chain Monte Carlo algorithms remain the gold standard, but such
algorithms face major problems in scaling up to big data. Various modifications
have been proposed based on parallelization and approximations based on
subsamples, but such approaches are either highly complex or lack theoretical
support and/or good performance outside of narrow settings. We propose a very
simple and general posterior interval estimation algorithm, which is based on
running Markov chain Monte Carlo in parallel for subsets of the data and
averaging quantiles estimated from each subset. We provide strong theoretical
guarantees and illustrate performance in several applications.
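
A minimal sketch of the quantile-averaging step, under assumptions that go
beyond the abstract: a Poisson rate with a Gamma prior serves as a toy model,
direct conjugate sampling stands in for the MCMC runs on each subset, and each
subset likelihood is raised to the power of the number of subsets. All names
are illustrative.

```python
import numpy as np

def subset_posterior_draws(counts, n_subsets, n_draws, rng, a=1.0, b=1.0):
    """Stand-in for one subset's MCMC output: Poisson rate, Gamma(a, b) prior.

    Assumption: the subset likelihood is raised to the power n_subsets so that
    each subset posterior has roughly the spread of the full-data posterior.
    """
    shape = a + n_subsets * counts.sum()
    rate = b + n_subsets * len(counts)
    return rng.gamma(shape, 1.0 / rate, size=n_draws)

def averaged_interval(draws_per_subset, level=0.95):
    """Average the lower and upper posterior quantiles across subsets."""
    tail = (1.0 - level) / 2.0
    endpoints = np.array(
        [np.quantile(d, [tail, 1.0 - tail]) for d in draws_per_subset]
    )
    return endpoints.mean(axis=0)

rng = np.random.default_rng(2)
counts = rng.poisson(3.0, size=5000)
K = 10
subsets = np.array_split(rng.permutation(counts), K)
draws = [subset_posterior_draws(s, K, 4000, rng) for s in subsets]
lower, upper = averaged_interval(draws)
print("averaged 95% interval:", lower, upper)
```

Each subset can be processed on a separate worker, and only two numbers per
subset need to be communicated for each interval of interest.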
-
Studying the neurological, genetic and evolutionary basis of human vocal
communication mechanisms using animal vocalization models is an important field
of neuroscience. The data sets typically comprise structured sequences of
syllables or `songs' produced by animals from different genotypes under
different social contexts. We develop a novel Bayesian semiparametric framework
for inference in such data sets. Our approach is built on a novel class of
mixed effects Markov transition models for the songs that accommodates
exogenous influences of genotype and context as well as animal-specific
heterogeneity. We design efficient Markov chain Monte Carlo algorithms for
posterior computation. Crucial advantages of the proposed approach include its
ability to provide insights into key scientific queries related to global and
local influences of the exogenous predictors on the transition dynamics via
automated tests of hypotheses. The methodology is illustrated using simulation
experiments and the aforementioned motivating application in neuroscience.
-
In studying structural inter-connections in the human brain, it is common to
first estimate fiber bundles connecting different regions of the brain relying
on diffusion MRI. These fiber bundles act as highways for neural activity and
communication, snaking through the brain and connecting different regions.
Current statistical methods for analyzing these fibers reduce the rich
information into an adjacency matrix, with the elements containing a count of
the number of fibers or a mean diffusion feature (such as fractional
anisotropy) along the fibers. The goal of this article is to avoid discarding
the rich functional data on the shape, size and orientation of fibers,
developing flexible models for characterizing the population distribution of
fibers between brain regions of interest within and across different
individuals. We start by decomposing each fiber in each individual's brain into
a corresponding rotation matrix, shape and translation from a global reference
curve. These components can be viewed as data lying on a product space composed
of different Euclidean spaces and manifolds. To non-parametrically model the
distribution within and across individuals, we rely on a hierarchical mixture
of product kernels specific to the component spaces. Taking a Bayesian approach
to inference, we develop an efficient method for posterior sampling. The
approach automatically produces clusters of fibers within and across
individuals, and yields interesting new insights into variation in fiber
curves, while providing a useful starting point for more elaborate models
relating fibers to covariates and neuropsychiatric traits.
-
Our focus is on realistically modeling and forecasting dynamic networks of
face-to-face contacts among individuals. Important aspects of such data that
lead to problems with current methods include the tendency of the contacts to
move between periods of slow and rapid changes, and the dynamic heterogeneity
in the actors' connectivity behaviors. Motivated by this application, we
develop a novel method for Locally Adaptive DYnamic (LADY) network inference.
The proposed model relies on a dynamic latent space representation in which
each actor's position evolves in time via stochastic differential equations.
Using a state space representation for these stochastic processes and
P\'olya-gamma data augmentation, we develop an efficient MCMC algorithm for
posterior inference along with tractable procedures for online updating and
forecasting of future networks. We evaluate performance in simulation studies,
and consider an application to face-to-face contacts among individuals in a
primary school.
-
Network data are increasingly collected along with other variables of
interest. Our motivation is drawn from neurophysiology studies measuring brain
connectivity networks for a sample of individuals along with their membership
in a low or high creative reasoning group. It is of paramount importance to
develop statistical methods for testing of global and local changes in the
structural interconnections among brain regions across groups. We develop a
general Bayesian procedure for inference and testing of group differences in
the network structure, which relies on a nonparametric representation for the
conditional probability mass function associated with a network-valued random
variable. By leveraging a mixture of low-rank factorizations, we allow simple
global and local hypothesis testing adjusting for multiplicity. An efficient
Gibbs sampler is defined for posterior computation. We provide theoretical
results on the flexibility of the model and assess testing performance in
simulations. The approach is applied to provide novel insights on the
relationships between human brain networks and creativity.
-
In cargo logistics, a key performance measure is transport risk, defined as
the deviation of the actual arrival time from the planned arrival time. Neither
earliness nor tardiness is desirable for customers and freight forwarders. In
this paper, we investigate ways to assess and forecast transport risks using a
half-year of air cargo data, provided by a leading forwarder on 1336 routes
served by 20 airlines. Interestingly, our preliminary data analysis shows a
strong multimodal feature in the transport risks, driven by unobserved events,
such as cargo missing flights. To accommodate this feature, we introduce a
Bayesian nonparametric model -- the probit stick-breaking process (PSBP)
mixture model -- for flexible estimation of the conditional (i.e.,
state-dependent) density function of transport risk. We demonstrate that using
simpler methods, such as OLS linear regression, can lead to misleading
inferences. Our model provides a tool for the forwarder to offer customized
price and service quotes. It can also generate baseline airline performance to
enable fair supplier evaluation. Furthermore, the method allows us to separate
recurrent risks from disruption risks. This is important, because hedging
strategies for these two kinds of risks are often drastically different.
-
Replicated network data are increasingly available in many research fields.
In connectomic applications, inter-connections among brain regions are
collected for each patient under study, motivating statistical models which can
flexibly characterize the probabilistic generative mechanism underlying these
network-valued data. Available models for a single network are not designed
specifically for inference on the entire probability mass function of a
network-valued random variable and therefore lack flexibility in characterizing
the distribution of relevant topological structures. We propose a flexible
Bayesian nonparametric approach for modeling the population distribution of
network-valued data. The joint distribution of the edges is defined via a
mixture model which reduces dimensionality and efficiently incorporates network
information within each mixture component by leveraging latent space
representations. The formulation leads to an efficient Gibbs sampler and
provides simple and coherent strategies for inference and goodness-of-fit
assessments. We provide theoretical results on the flexibility of our model and
illustrate improved performance --- compared to state-of-the-art models --- in
simulations and application to human brain networks.
-
We propose a novel approach to Bayesian analysis that is provably robust to
outliers in the data and often has computational advantages over standard
methods. Our technique is based on splitting the data into non-overlapping
subgroups, evaluating the posterior distribution given each independent
subgroup, and then combining the resulting measures. The main novelty of our
approach is the proposed aggregation step, which is based on the evaluation of
a median in the space of probability measures equipped with a suitable
collection of distances that can be quickly and efficiently evaluated in
practice. We present both theoretical and numerical evidence illustrating the
improvements achieved by our method.
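
The aggregation step can be illustrated with a geometric median computed by
Weiszfeld's algorithm. The abstract does not name the distance or the
algorithm, so both are assumptions here: subset posteriors are represented by
samples, distances are those induced by an RBF kernel embedding of the
empirical measures, and the bandwidth and toy data are placeholders.

```python
import numpy as np

def rbf_gram(x, y, bandwidth):
    """RBF kernel matrix between two 1-D samples."""
    d2 = (x[:, None] - y[None, :]) ** 2
    return np.exp(-d2 / (2.0 * bandwidth**2))

def embedding_inner_products(draws, bandwidth):
    """G[i, j] = inner product of the kernel mean embeddings of samples i and j."""
    m = len(draws)
    G = np.empty((m, m))
    for i in range(m):
        for j in range(i, m):
            G[i, j] = G[j, i] = rbf_gram(draws[i], draws[j], bandwidth).mean()
    return G

def geometric_median_weights(G, n_iter=200, eps=1e-12):
    """Weiszfeld iterations; the median is the mixture sum_j w[j] * Q_j."""
    m = G.shape[0]
    w = np.full(m, 1.0 / m)
    for _ in range(n_iter):
        # Squared distance from the current mixture to each subset measure.
        d2 = w @ G @ w - 2.0 * (G @ w) + np.diag(G)
        inv = 1.0 / np.sqrt(np.maximum(d2, eps))
        w = inv / inv.sum()
    return w

rng = np.random.default_rng(3)
# Nine well-behaved subset posteriors and one corrupted by outliers (toy data).
draws = [rng.normal(rng.normal(0.0, 0.05), 0.1, size=300) for _ in range(9)]
draws.append(rng.normal(10.0, 0.1, size=300))
G = embedding_inner_products(draws, bandwidth=1.0)
weights = geometric_median_weights(G)
print("weight on the corrupted subset:", weights[-1])
```

The corrupted subset receives a small weight in the combined measure, which is
the robustness property the median is meant to provide; a plain average would
give it weight 1/10.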
-
Two-component mixture priors provide a traditional way to induce sparsity in
high-dimensional Bayes models. However, several aspects of such a prior,
including computational complexities in high dimensions, interpretation of
exact zeros and non-sparse posterior summaries under standard loss functions,
have motivated an amazing variety of continuous shrinkage priors, which can be
expressed as global-local scale mixtures of Gaussians. Interestingly, we
demonstrate that many commonly used shrinkage priors, including the Bayesian
Lasso, do not have adequate posterior concentration in high-dimensional
settings.
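
For reference, the global-local scale mixture form mentioned above is usually
written as follows. The symbols $\lambda_j$ (local scale), $\tau$ (global
scale) and the mixing densities $f$, $g$ are notation assumed here, since the
abstract does not fix any.

```latex
\beta_j \mid \lambda_j, \tau \sim \mathrm{N}\!\left(0,\ \lambda_j^2 \tau^2\right),
\qquad \lambda_j \sim f,
\qquad \tau \sim g,
\qquad j = 1, \dots, p .
```

Different choices of $f$ and $g$ give different priors; for example, an
exponential mixing density on $\lambda_j^2$ yields the Laplace marginal that
underlies the Bayesian Lasso.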