• We discuss problems the null hypothesis significance testing (NHST) paradigm poses for replication and more broadly in the biomedical and social sciences as well as how these problems remain unresolved by proposals involving modified p-value thresholds, confidence intervals, and Bayes factors. We then discuss our own proposal, which is to abandon statistical significance. We recommend dropping the NHST paradigm--and the p-value thresholds intrinsic to it--as the default statistical paradigm for research, publication, and discovery in the biomedical and social sciences. Specifically, we propose that the p-value be demoted from its threshold screening role and instead, treated continuously, be considered along with currently subordinate factors (e.g., related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain) as just one among many pieces of evidence. We have no desire to "ban" p-values or other purely statistical measures. Rather, we believe that such measures should not be thresholded and that, thresholded or not, they should not take priority over the currently subordinate factors. We also argue that it seldom makes sense to calibrate evidence as a function of p-values or other purely statistical measures. We offer recommendations for how our proposal can be implemented in the scientific publication process as well as in statistical decision making more broadly.
  • Replication is complicated in psychological research because studies of a given psychological phenomenon can never be direct or exact replications of one another, and thus effect sizes vary from one study of the phenomenon to the next--an issue of clear importance for replication. Current large scale replication projects represent an important step forward for assessing replicability, but provide only limited information because they have thus far been designed in a manner such that heterogeneity either cannot be assessed or is intended to be eliminated. Consequently, the non-trivial degree of heterogeneity found in these projects represents a lower bound on heterogeneity. We recommend enriching large scale replication projects going forward by embracing heterogeneity. We argue this is key for assessing replicability: if effect sizes are sufficiently heterogeneous--even if the sign of the effect is consistent--the phenomenon in question does not seem particularly replicable and the theory underlying it seems poorly constructed and in need of enrichment. Uncovering why and revising theory in light of it will lead to improved theory that explains heterogeneity and increases replicability. Given this, large scale replication projects can play an important role not only in assessing replicability but also in advancing theory.
  • Bayesian data analysis is about more than just computing a posterior distribution, and Bayesian visualization is about more than trace plots of Markov chains. Practical Bayesian data analysis, like all data analysis, is an iterative process of model building, inference, model checking and evaluation, and model expansion. Visualization is helpful in each of these stages of the Bayesian workflow and it is indispensable when drawing inferences from the types of modern, high-dimensional models that are used by applied researchers.
  • Verifying the correctness of Bayesian computation is challenging. This is especially true for complex models that are common in practice, as these require sophisticated model implementations and algorithms. In this paper we introduce simulation-based calibration (SBC), a general procedure for validating inferences from Bayesian algorithms capable of generating posterior samples. This procedure not only identifies inaccurate computation and inconsistencies in model implementations but also provides graphical summaries that can indicate the nature of the problems that arise. We argue that SBC is a critical part of a robust Bayesian workflow, as well as being a useful tool for those developing computational algorithms and statistical software.
  • We analyzed 2012 and 2016 YouGov pre-election polls in order to understand how different population groups voted in the 2012 and 2016 elections. We broke the data down by demographics and state. We display our findings with a series of graphs and maps. The R code associated with this project is available at https://github.com/rtrangucci/mrp_2016_election/.
  • A common approach for Bayesian computation with big data is to partition the data into smaller pieces, perform local inference for each piece separately, and finally combine the results to obtain an approximation to the global posterior. Looking at this from the bottom up, one can perform separate analyses on individual sources of data and then combine these in a larger Bayesian model. In either case, the idea of distributed modeling and inference has both conceptual and computational appeal, but from the Bayesian perspective there is no general way of handling the prior distribution: if the prior is included in each separate inference, it will be multiply-counted when the inferences are combined; but if the prior is itself divided into pieces, it may not provide enough regularization for each separate computation, thus eliminating one of the key advantages of Bayesian methods. To resolve this dilemma, we propose expectation propagation (EP) as a general prototype for distributed Bayesian inference. The central idea is to factor the likelihood according to the data partitions, and to iteratively combine each factor with an approximate model of the prior and all other parts of the data, thus producing an overall approximation to the global posterior at convergence. In this paper, we give an introduction to EP and an overview of some recent developments of the method, with particular emphasis on its use in combining inferences from partitioned data. In addition to distributed modeling of large datasets, our unified treatment also includes hierarchical modeling of data with a naturally partitioned structure. The paper describes a general algorithmic framework, rather than a specific algorithm, and presents an example implementation for it.
  • While it's always possible to compute a variational approximation to a posterior distribution, it can be difficult to discover problems with this approximation. We propose two diagnostic algorithms to alleviate this problem. The Pareto-smoothed importance sampling (PSIS) diagnostic gives a goodness of fit measurement for joint distributions, while simultaneously reducing the error in the estimate. The variational simulation-based calibration (VSBC) assesses the average performance of point estimates.
  • Cluster sampling is common in survey practice, and the corresponding inference has been predominantly design-based. We develop a Bayesian framework for cluster sampling and account for the design effect in the outcome modeling. We consider a two-stage cluster sampling design where the clusters are first selected with probability proportional to cluster size, and then units are randomly sampled inside selected clusters. Challenges arise when the sizes of nonsampled clusters are unknown. We propose nonparametric and parametric Bayesian approaches for predicting the unknown cluster sizes, with this inference performed simultaneously with the model for survey outcome. Simulation studies show that the integrated Bayesian approach outperforms classical methods with efficiency gains. We use Stan for computing and apply the proposal to the Fragile Families and Child Wellbeing study as an illustration of complex survey inference in health surveys.
  • The widely recommended procedure of Bayesian model averaging is flawed in the M-open setting in which the true data-generating process is not one of the candidate models being fit. We take the idea of stacking from the point estimation literature and generalize to the combination of predictive distributions, extending the utility function to any proper scoring rule, using Pareto smoothed importance sampling to efficiently compute the required leave-one-out posterior distributions and regularization to get more stability. We compare stacking of predictive distributions to several alternatives: stacking of means, Bayesian model averaging (BMA), pseudo-BMA using AIC-type weighting, and a variant of pseudo-BMA that is stabilized using the Bayesian bootstrap. Based on simulations and real-data applications, we recommend stacking of predictive distributions, with BB-pseudo-BMA as an approximate alternative when computation cost is an issue.
  • A key sticking point of Bayesian analysis is the choice of prior distribution, and there is a vast literature on potential defaults including uniform priors, Jeffreys' priors, reference priors, maximum entropy priors, and weakly informative priors. These methods, however, often manifest a key conceptual tension in prior modeling: a model encoding true prior information should be chosen without reference to the model of the measurement process, but almost all common prior modeling techniques are implicitly motivated by a reference likelihood. In this paper we resolve this apparent paradox by placing the choice of prior into the context of the entire Bayesian analysis, from inference to prediction to model evaluation.
  • We combine Bayesian prediction and weighted inference as a unified approach to survey inference. The general principles of Bayesian analysis imply that models for survey outcomes should be conditional on all variables that affect the probability of inclusion. We incorporate the weighting variables under the framework of multilevel regression and poststratification, as a byproduct generating model-based weights after smoothing. We investigate deep interactions and introduce structured prior distributions for smoothing and stability of estimates. The computation is done via Stan and implemented in the open source R package "rstanarm" ready for public use. Simulation studies illustrate that model-based prediction and weighting inference outperform classical weighting. We apply the proposal to the New York Longitudinal Study of Wellbeing. The new approach generates robust weights and increases efficiency for finite population inference, especially for subsets of the population.
  • We show that publishing results using the statistical significance filter--publishing only when the p-value is less than 0.05--leads to a vicious cycle of overoptimistic expectation of the replicability of results. First, we show analytically that when true statistical power is relatively low, computing power based on statistically significant results will lead to overestimates of power. Then, we present a case study using 10 experimental comparisons drawn from a recently published meta-analysis in psycholinguistics (Jäger et al., 2017). We show that the statistically significant results yield an illusion of replicability. This illusion holds even if the researcher doesn't conduct any formal power analysis but just uses statistical significance to informally assess robustness (i.e., replicability) of results.
  • Stata users have access to two easy-to-use implementations of Bayesian inference: Stata's native bayesmh function and StataStan, which calls the general Bayesian engine Stan. We compare these on two models that are important for education research: the Rasch model and the hierarchical Rasch model. Stan (as called from Stata) fits a more general range of models than can be fit by bayesmh and is also more scalable, in that it could easily fit models with at least ten times more parameters than could be fit using Stata's native Bayesian implementation. In addition, Stan runs between two and ten times faster than bayesmh as measured in effective sample size per second: that is, compared to Stan, it takes Stata's built-in Bayesian engine twice to ten times as long to get inferences with equivalent precision. We attribute Stan's advantage in flexibility to its general modeling language, and its advantages in scalability and speed to an efficient sampling algorithm: Hamiltonian Monte Carlo using the no-U-turn sampler. In order to further investigate scalability, we also compared to the package Jags, which performed better than Stata's native Bayesian engine but not as well as StataStan. Given its advantages in speed, generality, and scalability, and that Stan is open-source and can be run directly from Stata using StataStan, we recommend that Stata users adopt Stan as their Bayesian inference engine of choice.
  • Importance weighting is a general way to adjust Monte Carlo integration to account for draws from the wrong distribution, but the resulting estimate can be noisy when the importance ratios have a heavy right tail. This routinely occurs when there are aspects of the target distribution that are not well captured by the approximating distribution, in which case more stable estimates can be obtained by modifying extreme importance ratios. We present a new method for stabilizing importance weights using a generalized Pareto distribution fit to the upper tail of the distribution of the simulated importance ratios. The method, which empirically performs better than existing methods for stabilizing importance sampling estimates, includes stabilized effective sample size estimates, Monte Carlo error estimates and convergence diagnostics.
  • Leave-one-out cross-validation (LOO) and the widely applicable information criterion (WAIC) are methods for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model using the log-likelihood evaluated at the posterior simulations of the parameter values. LOO and WAIC have various advantages over simpler estimates of predictive error such as AIC and DIC but are less used in practice because they involve additional computational steps. Here we lay out fast and stable computations for LOO and WAIC that can be performed using existing simulation draws. We introduce an efficient computation of LOO using Pareto-smoothed importance sampling (PSIS), a new procedure for regularizing importance weights. Although WAIC is asymptotically equal to LOO, we demonstrate that PSIS-LOO is more robust in the finite case with weak priors or influential observations. As a byproduct of our calculations, we also obtain approximate standard errors for estimated predictive errors and for comparisons of predictive errors between two models. We implement the computations in an R package called 'loo' and demonstrate using models fit with the Bayesian inference package Stan.
  • We present an increasingly stringent set of replications of Ghitza & Gelman (2013), a multilevel regression and poststratification analysis of polls from the 2008 U.S. presidential election campaign, focusing on a set of plots showing the estimated Republican vote share for whites and for all voters, as a function of income level in each of the states. We start with a nearly-exact duplication that uses the posted code and changes only the model-fitting algorithm; we then replicate using already-analyzed data from 2004; and finally we set up preregistered replications using two surveys from 2008 that we had not previously looked at. We have already learned from our preliminary, non-preregistered replication, which has revealed a potential problem with the published analysis of Ghitza & Gelman (2013); it appears that our model may not sufficiently account for nonsampling error, and that some of the patterns presented in that earlier paper may simply reflect noise. In addition to the substantive interest in validating earlier findings about demographics, geography, and voting, the present project serves as a demonstration of preregistration in a setting where the subject matter is historical (and thus the replication data exist before the preregistration plan is written) and where the analysis is exploratory (and thus a replication cannot be simply deemed successful or unsuccessful based on the statistical significance of some particular comparison).
  • Probabilistic modeling is iterative. A scientist posits a simple model, fits it to her data, refines it according to her analysis, and repeats. However, fitting complex models to large data is a bottleneck in this process. Deriving algorithms for new models can be both mathematically and computationally challenging, which makes it difficult to efficiently cycle through the steps. To this end, we develop automatic differentiation variational inference (ADVI). Using our method, the scientist only provides a probabilistic model and a dataset, nothing else. ADVI automatically derives an efficient variational inference algorithm, freeing the scientist to refine and explore many models. ADVI supports a broad class of models--no conjugacy assumptions are required. We study ADVI across ten different models and apply it to a dataset with millions of observations. ADVI is integrated into Stan, a probabilistic programming system; it is available for immediate use.
  • We often wish to use external data to improve the precision of an inference, but concerns arise when the different datasets have been collected under different conditions so that we do not want to simply pool the information. This is the well-known problem of meta-analysis, for which Bayesian methods have long been used to achieve partial pooling. Here we consider the challenge when the external data are averages rather than raw data. We provide a Bayesian solution by using simulation to approximate the likelihood of the external summary, and by allowing the parameters in the model to vary under the different conditions. Inferences are constructed using importance sampling from an approximate distribution determined by an expectation propagation-like algorithm. We demonstrate with the problem that motivated this research, a hierarchical nonlinear model in pharmacometrics, implementing the computation in Stan.
  • In a recent article in PNAS, Case and Deaton show a figure illustrating "a marked increase in the all-cause mortality of middle-aged white non-Hispanic men and women in the United States between 1999 and 2013." The authors state that their numbers "are not age-adjusted within the 10-y 45-54 age group." They calculated the mortality rate each year by dividing the total number of deaths for the age group by the population of the age group. We suspected an aggregation bias. After adjusting for changes in age composition, we find there is no longer a steady increase in mortality rates for this age group. Instead there is an increasing trend from 1999-2005 and a constant trend thereafter. Moreover, stratifying age-adjusted mortality rates by sex shows a marked increase only for women and not men, contrary to the article's headline. We stress that this does not change a key finding of the Case and Deaton paper: the comparison of non-Hispanic U.S. middle-aged whites to other countries and other ethnic groups. These comparisons hold up after our age adjustment. While we do not believe that age-adjustment invalidates comparisons between countries, it does affect claims concerning the absolute increase in mortality among U.S. middle-aged white non-Hispanics. Breaking down the trends in this group by region of the country shows other interesting patterns: since 1999 there has been an increase in death rates among women in the south. In contrast, death rates for both sexes have been declining in the northeast, the region where mortality rates were lowest to begin with. These graphs demonstrate the value of this sort of data exploration, and we are grateful to Case and Deaton for focusing attention on these mortality trends.
  • Quantifying long-term historical climate is fundamental to understanding recent climate change. Most instrumentally recorded climate data are only available for the past 200 years, so proxy observations from natural archives are often considered. We describe a model-based approach to reconstructing climate defined in terms of raw tree-ring measurement data that simultaneously accounts for non-climatic and climatic variability. In this approach we specify a joint model for the tree-ring data and climate variable that we fit using Bayesian inference. We consider a range of prior densities and compare the modeling approach to current methodology using an example case of Scots pine from Tornetrask, Sweden to reconstruct growing season temperature. We describe how current approaches translate into particular model assumptions. We explore how changes to various components in the model-based approach affect the resulting reconstruction. We show that minor changes in model specification can have little effect on model fit but lead to large changes in the predictions. In particular, the periods of relatively warmer and cooler temperatures are robust between models, but the magnitude of the resulting temperatures is highly model dependent. Such sensitivity may not be apparent with traditional approaches because the underlying statistical model is often hidden or poorly described.
  • It has historically been a challenge to perform Bayesian inference in a design-based survey context. The present paper develops a Bayesian model for sampling inference in the presence of inverse-probability weights. We use a hierarchical approach in which we model the distribution of the weights of the nonsampled units in the population and simultaneously include them as predictors in a nonparametric Gaussian process regression. We use simulation studies to evaluate the performance of our procedure and compare it to the classical design-based estimator. We apply our method to the Fragile Families and Child Wellbeing Study. Our studies find the Bayesian nonparametric finite population estimator to be more robust than the classical design-based estimator without loss in efficiency. This works because we induce regularization for small cells, which automatically smooths the highly variable weights.
  • We argue that the words "objectivity" and "subjectivity" in statistics discourse are used in a mostly unhelpful way, and we propose to replace each of them with broader collections of attributes, with objectivity replaced by transparency, consensus, impartiality, and correspondence to observable reality, and subjectivity replaced by awareness of multiple perspectives and context dependence. The advantage of these reformulations is that the replacement terms do not oppose each other. Instead of debating over whether a given statistical method is subjective or objective (or normatively debating the relative merits of subjectivity and objectivity in statistical practice), we can recognize desirable attributes such as transparency and acknowledgment of multiple perspectives as complementary goals. We demonstrate the implications of our proposal with recent applied examples from pharmacology, election polling, and socioeconomic stratification.
  • The Millennium Villages Project (MVP) is a ten-year integrated rural development project implemented in ten sub-Saharan African sites. At its conclusion we will conduct an evaluation of its causal effect on a variety of development outcomes, measured via household surveys in treatment and comparison areas. Outcomes are measured by six survey modules, with sample sizes for each demographic group determined by budget, logistics, and the group's vulnerability. We design a sampling plan that aims to reduce effort for survey enumerators and maximize precision for all outcomes. We propose two-stage sampling designs, sampling households at the first stage, followed by a second stage sample that differs across demographic groups. Two-stage designs are usually constructed by simple random sampling (SRS) of households and proportional within-household sampling, or probability proportional to size sampling (PPS) of households with fixed sampling within each. No measure of household size is proportional for all demographic groups, putting PPS schemes at a disadvantage. The SRS schemes have the disadvantage that multiple individuals sampled per household decreases efficiency due to intra-household correlation. We conduct a simulation study (using both design- and model-based survey inference) to understand these tradeoffs and recommend a sampling plan for the Millennium Villages Project. Similar design issues arise in other studies with surveys that target different demographic groups.
  • Variational inference is a scalable technique for approximate Bayesian inference. Deriving variational inference algorithms requires tedious model-specific calculations; this makes it difficult to automate. We propose an automatic variational inference algorithm, automatic differentiation variational inference (ADVI). The user only provides a Bayesian model and a dataset; nothing else. We make no conjugacy assumptions and support a broad class of models. The algorithm automatically determines an appropriate variational family and optimizes the variational objective. We implement ADVI in Stan (code available now), a probabilistic programming framework. We compare ADVI to MCMC sampling across hierarchical generalized linear models, nonconjugate matrix factorization, and a mixture model. We train the mixture model on a quarter million images. With ADVI we can use variational inference on any model we write in Stan.
  • The only acceptable form of polling in the multi-billion dollar survey research field utilizes representative samples. We argue that with proper statistical adjustment, non-representative polling can provide accurate predictions, and often in a much more timely and cost-effective fashion. We demonstrate this by applying multilevel regression and post-stratification (MRP) to a 2012 election survey on the Xbox gaming platform. Not only do the transformed top-line projections from this data closely trend standard indicators, but we use the unique nature of the data's size and panel to answer a meaningful political puzzle. We find that reported swings in public opinion polls are generally not due to actual shifts in vote intention, but rather are the result of temporary periods of relatively low response rates among supporters of the reportedly slumping candidate. This work shows great promise for using non-representative polling to measure public opinion and the first product of this new polling technique raises the possibility that decades of large, reported swings in public opinion--including the perennial "convention bounce"--are mostly artifacts of sampling bias.
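
The poststratification step that the MRP abstracts above rely on can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `poststratify`, the two age cells, and the toy responses are all hypothetical, and raw cell means stand in for the multilevel regression predictions that a full MRP analysis would use. The sketch only shows how reweighting cell-level estimates by known population shares corrects a non-representative sample.

```python
from collections import defaultdict


def poststratify(responses, population_shares):
    """Weight per-cell sample means by known population shares.

    responses: iterable of (cell, outcome) pairs from the survey.
    population_shares: dict mapping cell -> population share (summing to 1).
    In full MRP the per-cell means would come from a multilevel model;
    plain sample means are used here as a simplifying assumption.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cell, outcome in responses:
        totals[cell] += outcome
        counts[cell] += 1

    estimate = 0.0
    for cell, share in population_shares.items():
        if counts[cell] == 0:
            raise ValueError(f"no respondents in cell {cell!r}")
        estimate += share * totals[cell] / counts[cell]
    return estimate


# Hypothetical toy data: the sample is 80% "young", the population only 30%.
responses = (
    [("young", 1)] * 60 + [("young", 0)] * 20
    + [("old", 1)] * 5 + [("old", 0)] * 15
)
population_shares = {"young": 0.3, "old": 0.7}

raw_mean = sum(outcome for _, outcome in responses) / len(responses)
adjusted = poststratify(responses, population_shares)

print(f"raw sample mean:         {raw_mean:.3f}")
print(f"poststratified estimate: {adjusted:.3f}")
```

With sparse cells the raw cell means become noisy or undefined, which is where the multilevel regression part of MRP is meant to help by partially pooling estimates across cells.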