• We study full Bayesian procedures for sparse linear regression when errors have a symmetric but otherwise unknown distribution. The unknown error distribution is endowed with a symmetrized Dirichlet process mixture of Gaussians. For the prior on regression coefficients, a mixture of point masses at zero and continuous distributions is considered. We study behavior of the posterior with diverging number of predictors. Conditions are provided for consistency in the mean Hellinger distance. The compatibility and restricted eigenvalue conditions yield the minimax convergence rate of the regression coefficients in $\ell_1$- and $\ell_2$-norms, respectively. The convergence rate is adaptive to both the unknown sparsity level and the unknown symmetric error density under compatibility conditions. In addition, strong model selection consistency and a semi-parametric Bernstein-von Mises theorem are proven under slightly stronger conditions.
  • Data lying in a high dimensional ambient space are commonly thought to have a much lower intrinsic dimension. In particular, the data may be concentrated near a lower-dimensional subspace or manifold. There is an immense literature focused on approximating the unknown subspace, and in exploiting such approximations in clustering, data compression, and building of predictive models. Most of the literature relies on approximating subspaces using a locally linear, and potentially multiscale, dictionary. In this article, we propose a simple and general alternative, which instead uses pieces of spheres, or spherelets, to locally approximate the unknown subspace. Theory is developed showing that spherelets can produce lower covering numbers and MSEs for many manifolds. We develop spherical principal components analysis (SPCA). Results relative to state-of-the-art competitors show gains in ability to accurately approximate the subspace with fewer components. In addition, unlike most competitors, our approach can be used for data denoising and can efficiently embed new data without retraining. The methods are illustrated with standard toy manifold learning examples, and applications to multiple real data sets.
  • Divide-and-conquer based methods for Bayesian inference provide a general approach for tractable posterior inference when the sample size is large. These methods divide the data into smaller subsets, sample from the posterior distribution of parameters in parallel on all the subsets, and combine posterior samples from all the subsets to approximate the full data posterior distribution. The smaller size of any subset compared to the full data implies that posterior sampling on any subset is computationally more efficient than sampling from the true posterior distribution. Since the combination step takes negligible time relative to sampling, posterior computations can be scaled to massive data by dividing the full data into a sufficiently large number of data subsets. One such approach relies on the geometry of posterior distributions estimated across different subsets and combines them through their barycenter in a Wasserstein space of probability measures. We provide theoretical guarantees on the accuracy of approximation that are valid in many applications. We show that the geometric method approximates the full data posterior distribution better than its competitors across diverse simulations and reproduces known results when applied to a movie ratings database.
  • Bayesian sparse factor models have proven useful for characterizing dependence in multivariate data, but scaling computation to large numbers of samples and dimensions is problematic. We propose expandable factor analysis for scalable inference in factor models when the number of factors is unknown. The method relies on a continuous shrinkage prior for efficient maximum a posteriori estimation of a low-rank and sparse loadings matrix. The structure of the prior leads to an estimation algorithm that accommodates uncertainty in the number of factors. We propose an information criterion to select the hyperparameters of the prior. Expandable factor analysis has better false discovery rates and true positive rates than its competitors across diverse simulations. We apply the proposed approach to a gene expression study of aging in mice, illustrating superior results relative to four competing methods.
  • We propose a new approach for assigning weights to models using a divergence-based method ({\em D-probabilities}), relying on evaluating parametric models relative to a nonparametric Bayesian reference using Kullback-Leibler divergence. D-probabilities are useful in goodness-of-fit assessments, in comparing imperfect models, and in providing model weights to be used in model aggregation. D-probabilities avoid some of the disadvantages of Bayesian model probabilities, such as large sensitivity to prior choice, and tend to place higher weight on a greater diversity of models. In an application to linear model selection against a Gaussian process reference, we provide simple analytic forms for routine implementation and show that D-probabilities automatically penalize model complexity. Some asymptotic properties are described, and we provide interesting probabilistic interpretations of the proposed model weights. The framework is illustrated through simulation examples and an ozone data application.
  • Asymptotic theory of tail index estimation has been studied extensively in the frequentist literature on extreme values, but rarely in the Bayesian context. We investigate whether popular Bayesian kernel mixture models are able to support heavy tailed distributions and consistently estimate the tail index. We show that posterior inconsistency in tail index is surprisingly common for both parametric and nonparametric mixture models. We then present a set of sufficient conditions under which posterior consistency in tail index can be achieved, and verify these conditions for Pareto mixture models under general mixing priors.
  • This article is motivated by soccer positional passing networks collected across multiple games. We refer to these data as replicated spatial passing networks---to accurately model such data it is necessary to take into account the spatial positions of the passer and receiver for each passing event. This spatial registration and replicates that occur across games represent key differences with usual social network data. As a key step before investigating how the passing dynamics influence team performance, we focus on developing methods for summarizing different team's passing strategies. Our proposed approach relies on a novel multiresolution data representation framework and Poisson nonnegative block term decomposition model, which automatically produces coarse-to-fine low-rank network motifs. The proposed methods are applied to detailed passing record data collected from the 2014 FIFA World Cup.
  • Discrete random structures are important tools in Bayesian nonparametrics and the resulting models have proven effective in density estimation, clustering, topic modeling and prediction, among others. In this paper, we consider nested processes and study the dependence structures they induce. Dependence ranges between homogeneity, corresponding to full exchangeability, and maximum heterogeneity, corresponding to (unconditional) independence across samples. The popular nested Dirichlet process is shown to degenerate to the fully exchangeable case when there are ties across samples at the observed or latent level. To overcome this drawback, inherent to nesting general discrete random measures, we introduce a novel class of latent nested processes. These are obtained by adding common and group-specific completely random measures and, then, normalising to yield dependent random probability measures. We provide results on the partition distributions induced by latent nested processes, and develop an Markov Chain Monte Carlo sampler for Bayesian inferences. A test for distributional homogeneity across groups is obtained as a by product. The results and their inferential implications are showcased on synthetic and real data.
  • Motivation: Although there is a rich literature on methods for assessing the impact of functional predictors, the focus has been on approaches for dimension reduction that can fail dramatically in certain applications. Examples of standard approaches include functional linear models, functional principal components regression, and cluster-based approaches, such as latent trajectory analysis. This article is motivated by applications in which the dynamics in a predictor, across times when the value is relatively extreme, are particularly informative about the response. For example, physicians are interested in relating the dynamics of blood pressure changes during surgery to post-surgery adverse outcomes, and it is thought that the dynamics are more important when blood pressure is significantly elevated or lowered. Methods: We propose a novel class of extrema-weighted feature (XWF) extraction models. Key components in defining XWFs include the marginal density of the predictor, a function up-weighting values at high quantiles of this marginal, and functionals characterizing local dynamics. Algorithms are proposed for fitting of XWF-based regression and classification models, and are compared with current methods for functional predictors in simulations and a blood pressure during surgery application. Results: XWFs find features of intraoperative blood pressure trajectories that are predictive of postoperative mortality. By their nature, most of these features cannot be found by previous methods.
  • High throughput screening of compounds (chemicals) is an essential part of drug discovery [7], involving thousands to millions of compounds, with the purpose of identifying candidate hits. Most statistical tools, including the industry standard B-score method, work on individual compound plates and do not exploit cross-plate correlation or statistical strength among plates. We present a new statistical framework for high throughput screening of compounds based on Bayesian nonparametric modeling. The proposed approach is able to identify candidate hits from multiple plates simultaneously, sharing statistical strength among plates and providing more robust estimates of compound activity. It can flexibly accommodate arbitrary distributions of compound activities and is applicable to any plate geometry. The algorithm provides a principled statistical approach for hit identification and false discovery rate control. Experiments demonstrate significant improvements in hit identification sensitivity and specificity over the B-score method, which is highly sensitive to threshold choice. The framework is implemented as an efficient R extension package BHTSpack and is suitable for large scale data sets.
  • There has been considerable interest in making Bayesian inference more scalable. In big data settings, most literature focuses on reducing the computing time per iteration, with less focused on reducing the number of iterations needed in Markov chain Monte Carlo (MCMC). This article focuses on data augmentation MCMC (DA-MCMC), a widely used technique. DA-MCMC samples tend to become highly autocorrelated in large data samples, due to a miscalibration problem in which conditional posterior distributions given augmented data are too concentrated. This makes it necessary to collect very long MCMC paths to obtain acceptably low MC error. To combat this inefficiency, we propose a family of calibrated data augmentation algorithms, which appropriately adjust the variance of conditional posterior distributions. A Metropolis-Hastings step is used to eliminate bias in the stationary distribution of the resulting sampler. Compared to existing alternatives, this approach can dramatically reduce MC error by reducing autocorrelation and increasing the effective number of DA-MCMC samples per computing time. The approach is simple and applicable to a broad variety of existing data augmentation algorithms, and we focus on three popular models: probit, logistic and Poisson log-linear. Dramatic gains in computational efficiency are shown in applications.
  • Many modern applications collect highly imbalanced categorical data, with some categories relatively rare. Bayesian hierarchical models combat data sparsity by borrowing information, while also quantifying uncertainty. However, posterior computation presents a fundamental barrier to routine use; a single class of algorithms does not work well in all settings and practitioners waste time trying different types of MCMC approaches. This article was motivated by an application to quantitative advertising in which we encountered extremely poor computational performance for common data augmentation MCMC algorithms but obtained excellent performance for adaptive Metropolis. To obtain a deeper understanding of this behavior, we give strong theory results on computational complexity in an infinitely imbalanced asymptotic regime. Our results show computational complexity of Metropolis is logarithmic in sample size, while data augmentation is polynomial in sample size. The root cause of poor performance of data augmentation is a discrepancy between the rates at which the target density and MCMC step sizes concentrate. In general, MCMC algorithms that have a similar discrepancy will fail in large samples - a result with substantial practical impact.
  • With the routine collection of massive-dimensional predictors in many application areas, screening methods that rapidly identify a small subset of promising predictors have become commonplace. We propose a new MOdular Bayes Screening (MOBS) approach, which involves several novel characteristics that can potentially lead to improved performance. MOBS first applies a Bayesian mixture model to the marginal distribution of the response, obtaining posterior samples of mixture weights, cluster-specific parameters, and cluster allocations for each subject. Hypothesis tests are then introduced, corresponding to whether or not to include a given predictor, with posterior probabilities for each hypothesis available analytically conditionally on unknowns sampled in the first stage and tuning parameters controlling borrowing of information across tests. By marginalizing over the first stage posterior samples, we avoid under-estimation of uncertainty typical of two-stage methods. We greatly simplify the model specification and reduce computational complexity by using {\em modularization}. We provide basic theoretical support for this approach, and illustrate excellent performance relative to competitors in simulation studies and the ability to capture complex shifts beyond simple differences in means. The method is illustrated with applications to genomics by using a very high-dimensional cis-eQTL dataset with roughly 38 million SNPs.
  • There has been substantial recent interest in record linkage, attempting to group the records pertaining to the same entities from a large database lacking unique identifiers. This can be viewed as a type of "microclustering," with few observations per cluster and a very large number of clusters. A variety of methods have been proposed, but there is a lack of literature providing theoretical guarantees on performance. We show that the problem is fundamentally hard from a theoretical perspective, and even in idealized cases, accurate entity resolution is effectively impossible when the number of entities is small relative to the number of records and/or the separation among records from different entities is not extremely large. To characterize the fundamental difficulty, we focus on entity resolution based on multivariate Gaussian mixture models, but our conclusions apply broadly and are supported by simulation studies inspired by human rights applications. These results suggest conservatism in interpretation of the results of record linkage, support collection of additional data to more accurately disambiguate the entities, and motivate a focus on coarser inference. For example, results from a simulation study suggest that sometimes one may obtain accurate results for population size estimation even when fine scale entity resolution is inaccurate.
  • Variational inference (VI) provides fast approximations of a Bayesian posterior in part because it formulates posterior approximation as an optimization problem: to find the closest distribution to the exact posterior over some family of distributions. For practical reasons, the family of distributions in VI is usually constrained so that it does not include the exact posterior, even as a limit point. Thus, no matter how long VI is run, the resulting approximation will not approach the exact posterior. We propose to instead consider a more flexible approximating family consisting of all possible finite mixtures of a parametric base distribution (e.g., Gaussian). For efficient inference, we borrow ideas from gradient boosting to develop an algorithm we call boosting variational inference (BVI). BVI iteratively improves the current approximation by mixing it with a new component from the base distribution family and thereby yields progressively more accurate posterior approximations as more computing time is spent. Unlike a number of common VI variants including mean-field VI, BVI is able to capture multimodality, general posterior covariance, and nonstandard posterior shapes.
  • There is increasing interest in learning how human brain networks vary as a function of a continuous trait, but flexible and efficient procedures to accomplish this goal are limited. We develop a Bayesian semiparametric model, which combines low-rank factorizations and flexible Gaussian process priors to learn changes in the conditional expectation of a network-valued random variable across the values of a continuous predictor, while including subject-specific random effects. The formulation leads to a general framework for inference on changes in brain network structures across human traits, facilitating borrowing of information and coherently characterizing uncertainty. We provide an efficient Gibbs sampler for posterior computation along with simple procedures for inference, prediction and goodness-of-fit assessments. The model is applied to learn how human brain networks vary across individuals with different intelligence scores. Results provide interesting insights on the association between intelligence and brain connectivity, while demonstrating good predictive performance.
  • There is a lack of simple and scalable algorithms for uncertainty quantification. Bayesian methods quantify uncertainty through posterior and predictive distributions, but it is difficult to rapidly estimate summaries of these distributions, such as quantiles and intervals. Variational Bayes approximations are widely used, but may badly underestimate posterior covariance. Typically, the focus of Bayesian inference is on point and interval estimates for one-dimensional functionals of interest. In small scale problems, Markov chain Monte Carlo algorithms remain the gold standard, but such algorithms face major problems in scaling up to big data. Various modifications have been proposed based on parallelization and approximations based on subsamples, but such approaches are either highly complex or lack theoretical support and/or good performance outside of narrow settings. We propose a very simple and general posterior interval estimation algorithm, which is based on running Markov chain Monte Carlo in parallel for subsets of the data and averaging quantiles estimated from each subset. We provide strong theoretical guarantees and illustrate performance in several applications.
  • Studying the neurological, genetic and evolutionary basis of human vocal communication mechanisms using animal vocalization models is an important field of neuroscience. The data sets typically comprise structured sequences of syllables or `songs' produced by animals from different genotypes under different social contexts. We develop a novel Bayesian semiparametric framework for inference in such data sets. Our approach is built on a novel class of mixed effects Markov transition models for the songs that accommodates exogenous influences of genotype and context as well as animal-specific heterogeneity. We design efficient Markov chain Monte Carlo algorithms for posterior computation. Crucial advantages of the proposed approach include its ability to provide insights into key scientific queries related to global and local influences of the exogenous predictors on the transition dynamics via automated tests of hypotheses. The methodology is illustrated using simulation experiments and the aforementioned motivating application in neuroscience.
  • In studying structural inter-connections in the human brain, it is common to first estimate fiber bundles connecting different regions of the brain relying on diffusion MRI. These fiber bundles act as highways for neural activity and communication, snaking through the brain and connecting different regions. Current statistical methods for analyzing these fibers reduce the rich information into an adjacency matrix, with the elements containing a count of the number of fibers or a mean diffusion feature (such as fractional anisotropy) along the fibers. The goal of this article is to avoid discarding the rich functional data on the shape, size and orientation of fibers, developing flexible models for characterizing the population distribution of fibers between brain regions of interest within and across different individuals. We start by decomposing each fiber in each individual's brain into a corresponding rotation matrix, shape and translation from a global reference curve. These components can be viewed as data lying on a product space composed of different Euclidean spaces and manifolds. To non-parametrically model the distribution within and across individuals, we rely on a hierarchical mixture of product kernels specific to the component spaces. Taking a Bayesian approach to inference, we develop an efficient method for posterior sampling. The approach automatically produces clusters of fibers within and across individuals, and yields interesting new insights into variation in fiber curves, while providing a useful starting point for more elaborate models relating fibers to covariates and neuropsychiatric traits.
  • Our focus is on realistically modeling and forecasting dynamic networks of face-to-face contacts among individuals. Important aspects of such data that lead to problems with current methods include the tendency of the contacts to move between periods of slow and rapid changes, and the dynamic heterogeneity in the actors' connectivity behaviors. Motivated by this application, we develop a novel method for Locally Adaptive DYnamic (LADY) network inference. The proposed model relies on a dynamic latent space representation in which each actor's position evolves in time via stochastic differential equations. Using a state space representation for these stochastic processes and P\'olya-gamma data augmentation, we develop an efficient MCMC algorithm for posterior inference along with tractable procedures for online updating and forecasting of future networks. We evaluate performance in simulation studies, and consider an application to face-to-face contacts among individuals in a primary school.
  • Network data are increasingly collected along with other variables of interest. Our motivation is drawn from neurophysiology studies measuring brain connectivity networks for a sample of individuals along with their membership to a low or high creative reasoning group. It is of paramount importance to develop statistical methods for testing of global and local changes in the structural interconnections among brain regions across groups. We develop a general Bayesian procedure for inference and testing of group differences in the network structure, which relies on a nonparametric representation for the conditional probability mass function associated with a network-valued random variable. By leveraging a mixture of low-rank factorizations, we allow simple global and local hypothesis testing adjusting for multiplicity. An efficient Gibbs sampler is defined for posterior computation. We provide theoretical results on the flexibility of the model and assess testing performance in simulations. The approach is applied to provide novel insights on the relationships between human brain networks and creativity.
  • In cargo logistics, a key performance measure is transport risk, defined as the deviation of the actual arrival time from the planned arrival time. Neither earliness nor tardiness is desirable for customer and freight forwarders. In this paper, we investigate ways to assess and forecast transport risks using a half-year of air cargo data, provided by a leading forwarder on 1336 routes served by 20 airlines. Interestingly, our preliminary data analysis shows a strong multimodal feature in the transport risks, driven by unobserved events, such as cargo missing flights. To accommodate this feature, we introduce a Bayesian nonparametric model -- the probit stick-breaking process (PSBP) mixture model -- for flexible estimation of the conditional (i.e., state-dependent) density function of transport risk. We demonstrate that using simpler methods, such as OLS linear regression, can lead to misleading inferences. Our model provides a tool for the forwarder to offer customized price and service quotes. It can also generate baseline airline performance to enable fair supplier evaluation. Furthermore, the method allows us to separate recurrent risks from disruption risks. This is important, because hedging strategies for these two kinds of risks are often drastically different.
  • Replicated network data are increasingly available in many research fields. In connectomic applications, inter-connections among brain regions are collected for each patient under study, motivating statistical models which can flexibly characterize the probabilistic generative mechanism underlying these network-valued data. Available models for a single network are not designed specifically for inference on the entire probability mass function of a network-valued random variable and therefore lack flexibility in characterizing the distribution of relevant topological structures. We propose a flexible Bayesian nonparametric approach for modeling the population distribution of network-valued data. The joint distribution of the edges is defined via a mixture model which reduces dimensionality and efficiently incorporates network information within each mixture component by leveraging latent space representations. The formulation leads to an efficient Gibbs sampler and provides simple and coherent strategies for inference and goodness-of-fit assessments. We provide theoretical results on the flexibility of our model and illustrate improved performance --- compared to state-of-the-art models --- in simulations and application to human brain networks.
  • We propose a novel approach to Bayesian analysis that is provably robust to outliers in the data and often has computational advantages over standard methods. Our technique is based on splitting the data into non-overlapping subgroups, evaluating the posterior distribution given each independent subgroup, and then combining the resulting measures. The main novelty of our approach is the proposed aggregation step, which is based on the evaluation of a median in the space of probability measures equipped with a suitable collection of distances that can be quickly and efficiently evaluated in practice. We present both theoretical and numerical evidence illustrating the improvements achieved by our method.
  • Two-component mixture priors provide a traditional way to induce sparsity in high-dimensional Bayes models. However, several aspects of such a prior, including computational complexities in high-dimensions, interpretation of exact zeros and non-sparse posterior summaries under standard loss functions, has motivated an amazing variety of continuous shrinkage priors, which can be expressed as global-local scale mixtures of Gaussians. Interestingly, we demonstrate that many commonly used shrinkage priors, including the Bayesian Lasso, do not have adequate posterior concentration in high-dimensional settings.