• ### Bayes Shrinkage at GWAS scale: Convergence and Approximation Theory of a Scalable MCMC Algorithm for the Horseshoe Prior(1705.00841)

Oct. 15, 2018 stat.CO
The horseshoe prior is frequently employed in Bayesian analysis of high-dimensional models, and has been shown to achieve minimax optimal risk properties when the truth is sparse. While optimization-based algorithms for the extremely popular Lasso and elastic net procedures can scale to dimension in the hundreds of thousands, algorithms for the horseshoe that use Markov chain Monte Carlo (MCMC) for computation are limited to problems an order of magnitude smaller. This is due to high computational cost per step and growth of the variance of time-averaging estimators as a function of dimension. We propose two new MCMC algorithms for computation in these models that have improved performance compared to existing alternatives. One of the algorithms also approximates an expensive matrix product to give orders of magnitude speedup in high-dimensional applications. We prove that the exact algorithm is geometrically ergodic, and give guarantees for the accuracy of the approximate algorithm using perturbation theory. Versions of the approximation algorithm that gradually decrease the approximation error as the chain extends are shown to be exact. The scalability of the algorithm is illustrated in simulations with problem size as large as $N=5,000$ observations and $p=50,000$ predictors, and an application to a genome-wide association study with $N=2,267$ and $p=98,385$. The empirical results also show that the new algorithm yields estimates with lower mean squared error, intervals with better coverage, and elucidates features of the posterior that were often missed by previous algorithms in high dimensions, including bimodality of posterior marginals indicating uncertainty about which covariates belong in the model.
• ### Model-free inference on extreme dependence via waiting times(1512.07848)

May 21, 2018 math.PR, math.ST, stat.TH, stat.ME
A variety of methods have been proposed for inference about extreme dependence for multivariate or spatially-indexed stochastic processes and time series. Most of these proceed by first transforming data to some specific extreme value marginal distribution, often the unit Fr\'echet, then fitting a family of max-stable processes to the transformed data and exploring dependence within the framework of that model. The marginal transformation, model selection, and model fitting are all possible sources of misspecification in this approach. We propose an alternative model-free approach, based on the idea that substantial information on the strength of tail dependence and its temporal structure are encoded in the distribution of the waiting times between exceedances of high thresholds at different locations. We propose quantifying the strength of extremal dependence and assessing uncertainty by using statistics based on these waiting times. The method does not rely on any specific underlying model for the process, nor on asymptotic distribution theory. The method is illustrated by applications to climatological, financial, and electrophysiology data. To put the proposed approach within the context of the existing literature, we construct a class of spacetime-indexed stochastic processes whose waiting time distributions are available in closed form by endowing the support points in de Haan's spectral representation of max-stable processes with random birth times, velocities, and lifetimes, and applying Smith's model to these processes. We show that waiting times in this model are stochatically decreasing in mean speed, and the sample mean of the waiting times obeys a central limit theorem with a uniform convergence rate under mild conditions. This indicates that our procedure can be implemented in this setting using standard $t$ statistics and associated hypothesis tests.
• ### Exact Limits of Inference in Coalescent Models(1711.05724)

Jan. 16, 2018 math.ST, stat.TH
Recovery of population size history from molecular sequence data is an important problem in population genetics. Inference commonly relies on a coalescent model linking the population size history to genealogies. The high computational cost of estimating parameters from these models usually compels researchers to select a subset of the available data or to rely on non-sufficient summary statistics for statistical inference. We consider the problem of recovering the true population size history from two possible alternatives on the basis of coalescent time data. We give exact expressions for the probability of selecting the correct alternative in a variety of biologically interesting cases as a function of the separation between the alternative size histories, the number of individuals, loci, and the sampling times. The results are applied to human population history. This work has significant implications for optimal design when the inferential goal is to test scientific hypotheses about population size trajectories in coalescent models with and without recombination.
• ### A Decision Theoretic Approach to A/B Testing(1710.03410)

Oct. 10, 2017 math.ST, stat.TH, stat.ME
A/B testing is ubiquitous within the machine learning and data science operations of internet companies. Generically, the idea is to perform a statistical test of the hypothesis that a new feature is better than the existing platform---for example, it results in higher revenue. If the p value for the test is below some pre-defined threshold---often, 0.05---the new feature is implemented. The difficulty of choosing an appropriate threshold has been noted before, particularly because dependent tests are often done sequentially, leading some to propose control of the false discovery rate (FDR) rather than use of a single, universal threshold. However, it is still necessary to make an arbitrary choice of the level at which to control FDR. Here we suggest a decision-theoretic approach to determining whether to adopt a new feature, which enables automated selection of an appropriate threshold. Our method has the basic ingredients of any decision-theory problem: a loss function, action space, and a notion of optimality, for which we choose Bayes risk. However, the loss function and the action space differ from the typical choices made in the literature, which has focused on the theory of point estimation. We give some basic results for Bayes-optimal thresholding rules for the feature adoption decision, and give some examples using eBay data. The results suggest that the 0.05 p-value threshold may be too conservative in some settings, but that its widespread use may reflect an ad-hoc means of controlling multiplicity in the common case of repeatedly testing variants of an experiment when the threshold is not reached.
• ### Scaling up Data Augmentation MCMC via Calibration(1703.03123)

Sept. 26, 2017 stat.ME
There has been considerable interest in making Bayesian inference more scalable. In big data settings, most literature focuses on reducing the computing time per iteration, with less focused on reducing the number of iterations needed in Markov chain Monte Carlo (MCMC). This article focuses on data augmentation MCMC (DA-MCMC), a widely used technique. DA-MCMC samples tend to become highly autocorrelated in large data samples, due to a miscalibration problem in which conditional posterior distributions given augmented data are too concentrated. This makes it necessary to collect very long MCMC paths to obtain acceptably low MC error. To combat this inefficiency, we propose a family of calibrated data augmentation algorithms, which appropriately adjust the variance of conditional posterior distributions. A Metropolis-Hastings step is used to eliminate bias in the stationary distribution of the resulting sampler. Compared to existing alternatives, this approach can dramatically reduce MC error by reducing autocorrelation and increasing the effective number of DA-MCMC samples per computing time. The approach is simple and applicable to a broad variety of existing data augmentation algorithms, and we focus on three popular models: probit, logistic and Poisson log-linear. Dramatic gains in computational efficiency are shown in applications.
• ### Optimal approximating Markov chains for Bayesian inference(1508.03387)

Aug. 29, 2017 stat.CO, math.ST, stat.TH
The Markov Chain Monte Carlo method is the dominant paradigm for posterior computation in Bayesian analysis. It is common to control computation time by making approximations to the Markov transition kernel. Comparatively little attention has been paid to computational optimality in these approximating Markov Chains, or when such approximations are justified relative to obtaining shorter paths from the exact kernel. We give simple, sharp bounds for uniform approximations of uniformly mixing Markov chains. We then suggest a notion of optimality that incorporates computation time and approximation error, and use our bounds to make generalizations about properties of good approximations in the uniformly mixing setting. The relevance of these properties is demonstrated in applications to a minibatching-based approximate MCMC algorithm for large $n$ logistic regression and low-rank approximations for Gaussian processes.
• ### MCMC for Imbalanced Categorical Data(1605.05798)

June 26, 2017 stat.CO, math.ST, stat.TH, cs.CC
Many modern applications collect highly imbalanced categorical data, with some categories relatively rare. Bayesian hierarchical models combat data sparsity by borrowing information, while also quantifying uncertainty. However, posterior computation presents a fundamental barrier to routine use; a single class of algorithms does not work well in all settings and practitioners waste time trying different types of MCMC approaches. This article was motivated by an application to quantitative advertising in which we encountered extremely poor computational performance for common data augmentation MCMC algorithms but obtained excellent performance for adaptive Metropolis. To obtain a deeper understanding of this behavior, we give strong theory results on computational complexity in an infinitely imbalanced asymptotic regime. Our results show computational complexity of Metropolis is logarithmic in sample size, while data augmentation is polynomial in sample size. The root cause of poor performance of data augmentation is a discrepancy between the rates at which the target density and MCMC step sizes concentrate. In general, MCMC algorithms that have a similar discrepancy will fail in large samples - a result with substantial practical impact.
• ### Coupling and Decoupling to bound an approximating Markov Chain(1706.02040)

June 7, 2017 math.PR
This simple note lays out a few observations which are well known in many ways but may not have been said in quite this way before. The basic idea is that when comparing two different Markov chains it is useful to couple them is such a way that they agree as often as possible. We construct such a coupling and analyze it by a simple dominating chain which registers if the two processes agree or disagree. We find that this imagery is useful when thinking about such problems. We are particularly interested in comparing the invariant measures and long time averages of the processes. However, since the paths agree for long runs, it also provides estimates on various stopping times such as hitting or exit times. We also show that certain bounds are tight. Finally, we provide a simple application to a Markov Chain Monte Carlo algorithm and show numerically that the results of the paper show a good level of approximation at considerable speed up by using an approximating chain rather than the original sampling chain.
• ### Theoretical Limits of Record Linkage and Microclustering(1703.04955)

March 15, 2017 math.ST, stat.TH
There has been substantial recent interest in record linkage, attempting to group the records pertaining to the same entities from a large database lacking unique identifiers. This can be viewed as a type of "microclustering," with few observations per cluster and a very large number of clusters. A variety of methods have been proposed, but there is a lack of literature providing theoretical guarantees on performance. We show that the problem is fundamentally hard from a theoretical perspective, and even in idealized cases, accurate entity resolution is effectively impossible when the number of entities is small relative to the number of records and/or the separation among records from different entities is not extremely large. To characterize the fundamental difficulty, we focus on entity resolution based on multivariate Gaussian mixture models, but our conclusions apply broadly and are supported by simulation studies inspired by human rights applications. These results suggest conservatism in interpretation of the results of record linkage, support collection of additional data to more accurately disambiguate the entities, and motivate a focus on coarser inference. For example, results from a simulation study suggest that sometimes one may obtain accurate results for population size estimation even when fine scale entity resolution is inaccurate.
• ### An algorithm for removing sensitive information: application to race-independent recidivism prediction(1703.04957)

March 15, 2017 stat.AP
Predictive modeling is increasingly being employed to assist human decision-makers. One purported advantage of replacing or augmenting human judgment with computer models in high stakes settings-- such as sentencing, hiring, policing, college admissions, and parole decisions-- is the perceived "neutrality" of computers. It is argued that because computer models do not hold personal prejudice, the predictions they produce will be equally free from prejudice. There is growing recognition that employing algorithms does not remove the potential for bias, and can even amplify it if the training data were generated by a process that is itself biased. In this paper, we provide a probabilistic notion of algorithmic bias. We propose a method to eliminate bias from predictive models by removing all information regarding protected variables from the data to which the models will ultimately be trained. Unlike previous work in this area, our framework is general enough to accommodate data on any measurement scale. Motivated by models currently in use in the criminal justice system that inform decisions on pre-trial release and parole, we apply our proposed method to a dataset on the criminal histories of individuals at the time of sentencing to produce "race-neutral" predictions of re-arrest. In the process, we demonstrate that a common approach to creating "race-neutral" models-- omitting race as a covariate-- still results in racially disparate predictions. We then demonstrate that the application of our proposed method to these data removes racial disparities from predictions with minimal impact on predictive accuracy.
• ### Estimating the observable population size from biased samples: a new approach to population estimation with capture heterogeneity(1606.02235)

June 7, 2016 stat.ME
Capture-recapture methods aim to estimate the size of a closed population on the basis of multiple incomplete enumerations of individuals. In many applications, the individual probability of being recorded is heterogeneous in the population. Previous studies have suggested that it is not possible to reliably estimate the total population size when capture heterogeneity exists. Here we approach population estimation in the presence of capture heterogeneity as a latent length biased nonparametric density estimation problem on the unit interval. We show that in this setting it is generally impossible to estimate the density on the entire unit interval in finite samples, and that estimators of the population size have high and sometimes unbounded risk when the density has significant mass near zero. As an alternative, we propose estimating the population of individuals with capture probability exceeding some threshold. We provide methods for selecting an appropriate threshold, and show that this approach results in estimators with substantially lower risk than estimators of the total population size, with correspondingly smaller uncertainty, even when the parameter of interest is the total population. The alternative paradigm is demonstrated in extensive simulation studies and an application to snowshoe hare multiple recapture data.
• ### Optimal Gaussian approximations to the posterior for log-linear models with Diaconis-Ylvisaker priors(1511.00764)

Nov. 3, 2015 stat.ME
In contingency table analysis, sparse data is frequently encountered for even modest numbers of variables, resulting in non-existence of maximum likelihood estimates. A common solution is to obtain regularized estimates of the parameters of a log-linear model. Bayesian methods provide a coherent approach to regularization, but are often computationally intensive. Conjugate priors ease computational demands, but the conjugate Diaconis-Ylvisaker priors for the parameters of log-linear models do not give rise to closed form credible regions, complicating posterior inference. Here we derive the optimal Gaussian approximation to the posterior for log-linear models with Diaconis-Ylvisaker priors, and provide convergence rate and finite-sample bounds for the Kullback-Leibler divergence between the exact posterior and the optimal Gaussian approximation. We demonstrate empirically in simulations and a real data application that the approximation is highly accurate, even in relatively small samples. The proposed approximation provides a computationally scalable and principled approach to regularized estimation and approximate Bayesian inference for log-linear models.
• ### Tensor decompositions and sparse log-linear models(1404.0396)

April 1, 2014 math.ST, stat.TH
Contingency table analysis routinely relies on log linear models, with latent structure analysis providing a common alternative. Latent structure models lead to a low rank tensor factorization of the probability mass function for multivariate categorical data, while log linear models achieve dimensionality reduction through sparsity. Little is known about the relationship between these notions of dimensionality reduction in the two paradigms. We derive several results relating the support of a log-linear model to the nonnegative rank of the associated probability tensor. Motivated by these findings, we propose a new collapsed Tucker class of tensor decompositions, which bridge existing PARAFAC and Tucker decompositions, providing a more flexible framework for parsimoniously characterizing multivariate categorical data. Taking a Bayesian approach to inference, we illustrate advantages of the new decompositions in simulations and an application to functional disability data.