
We discuss problems the null hypothesis significance testing (NHST) paradigm
poses for replication and more broadly in the biomedical and social sciences as
well as how these problems remain unresolved by proposals involving modified
p-value thresholds, confidence intervals, and Bayes factors. We then discuss
our own proposal, which is to abandon statistical significance. We recommend
dropping the NHST paradigm, and the p-value thresholds intrinsic to it, as the
default statistical paradigm for research, publication, and discovery in the
biomedical and social sciences. Specifically, we propose that the p-value be
demoted from its threshold screening role and instead, treated continuously, be
considered along with currently subordinate factors (e.g., related prior
evidence, plausibility of mechanism, study design and data quality, real world
costs and benefits, novelty of finding, and other factors that vary by research
domain) as just one among many pieces of evidence. We have no desire to "ban"
p-values or other purely statistical measures. Rather, we believe that such
measures should not be thresholded and that, thresholded or not, they should
not take priority over the currently subordinate factors. We also argue that it
seldom makes sense to calibrate evidence as a function of p-values or other
purely statistical measures. We offer recommendations for how our proposal can
be implemented in the scientific publication process as well as in statistical
decision making more broadly.

Replication is complicated in psychological research because studies of a
given psychological phenomenon can never be direct or exact replications of one
another, and thus effect sizes vary from one study of the phenomenon to the
next, an issue of clear importance for replication. Current large scale
replication projects represent an important step forward for assessing
replicability, but provide only limited information because they have thus far
been designed in a manner such that heterogeneity either cannot be assessed or
is intended to be eliminated. Consequently, the nontrivial degree of
heterogeneity found in these projects represents a lower bound on
heterogeneity. We recommend enriching large scale replication projects going
forward by embracing heterogeneity. We argue this is key for assessing
replicability: if effect sizes are sufficiently heterogeneous, even if the sign
of the effect is consistent, the phenomenon in question does not seem
particularly replicable and the theory underlying it seems poorly constructed
and in need of enrichment. Uncovering why and revising theory in light of it
will lead to improved theory that explains heterogeneity and increases
replicability. Given this, large scale replication projects can play an
important role not only in assessing replicability but also in advancing
theory.

Bayesian data analysis is about more than just computing a posterior
distribution, and Bayesian visualization is about more than trace plots of
Markov chains. Practical Bayesian data analysis, like all data analysis, is an
iterative process of model building, inference, model checking and evaluation,
and model expansion. Visualization is helpful in each of these stages of the
Bayesian workflow and it is indispensable when drawing inferences from the
types of modern, high-dimensional models that are used by applied researchers.

Verifying the correctness of Bayesian computation is challenging. This is
especially true for complex models that are common in practice, as these
require sophisticated model implementations and algorithms. In this paper we
introduce \emph{simulation-based calibration} (SBC), a general procedure for
validating inferences from Bayesian algorithms capable of generating posterior
samples. This procedure not only identifies inaccurate computation and
inconsistencies in model implementations but also provides graphical summaries
that can indicate the nature of the problems that arise. We argue that SBC is a
critical part of a robust Bayesian workflow, as well as being a useful tool for
those developing computational algorithms and statistical software.
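
As a minimal sketch of the idea (not the authors' implementation), the
following Python example runs simulation-based calibration on a conjugate
normal model whose exact posterior is known. The model, the function names,
and the deliberately broken sampler are all assumptions made for illustration.
If the posterior sampler is correct, the rank of each prior draw among the
corresponding posterior draws should be uniformly distributed.

```python
import math
import random


def sbc_ranks(posterior_sampler, n_sims=1000, n_obs=5, n_draws=19, seed=1):
    """Rank of each prior draw among posterior draws for data simulated from it."""
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_sims):
        theta_true = rng.gauss(0.0, 1.0)                      # draw from prior N(0, 1)
        y = [rng.gauss(theta_true, 1.0) for _ in range(n_obs)]  # simulate data
        draws = posterior_sampler(y, n_draws, rng)            # algorithm under test
        ranks.append(sum(d < theta_true for d in draws))
    return ranks


def exact_posterior(y, n_draws, rng):
    """Exact posterior for theta ~ N(0, 1), y_i ~ N(theta, 1)."""
    n = len(y)
    mean = sum(y) / (n + 1)
    sd = math.sqrt(1.0 / (n + 1))
    return [rng.gauss(mean, sd) for _ in range(n_draws)]


def buggy_posterior(y, n_draws, rng):
    """Hypothetical bug: passes the variance where the sd is needed (too narrow)."""
    n = len(y)
    mean = sum(y) / (n + 1)
    sd = 1.0 / (n + 1)
    return [rng.gauss(mean, sd) for _ in range(n_draws)]


def rank_histogram(ranks, n_draws):
    """Count how often each rank 0..n_draws occurs."""
    counts = [0] * (n_draws + 1)
    for r in ranks:
        counts[r] += 1
    return counts


def chi_square_uniform(counts):
    """Chi-square discrepancy between the rank counts and a uniform histogram."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


for name, sampler in [("exact", exact_posterior), ("buggy", buggy_posterior)]:
    counts = rank_histogram(sbc_ranks(sampler), n_draws=19)
    print(name, "chi-square vs uniform:", round(chi_square_uniform(counts), 1))
```

In this sketch the too-narrow sampler piles ranks up at both ends of the
histogram, which is the kind of graphical signature the abstract refers to.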

We analyzed 2012 and 2016 YouGov pre-election polls in order to understand
how different population groups voted in the 2012 and 2016 elections. We broke
the data down by demographics and state. We display our findings with a series
of graphs and maps. The R code associated with this project is available at
https://github.com/rtrangucci/mrp_2016_election/.

A common approach for Bayesian computation with big data is to partition the
data into smaller pieces, perform local inference for each piece separately,
and finally combine the results to obtain an approximation to the global
posterior. Looking at this from the bottom up, one can perform separate
analyses on individual sources of data and then combine these in a larger
Bayesian model. In either case, the idea of distributed modeling and inference
has both conceptual and computational appeal, but from the Bayesian perspective
there is no general way of handling the prior distribution: if the prior is
included in each separate inference, it will be multiply-counted when the
inferences are combined; but if the prior is itself divided into pieces, it may
not provide enough regularization for each separate computation, thus
eliminating one of the key advantages of Bayesian methods. To resolve this
dilemma, we propose expectation propagation (EP) as a general prototype for
distributed Bayesian inference. The central idea is to factor the likelihood
according to the data partitions, and to iteratively combine each factor with
an approximate model of the prior and all other parts of the data, thus
producing an overall approximation to the global posterior at convergence. In
this paper, we give an introduction to EP and an overview of some recent
developments of the method, with particular emphasis on its use in combining
inferences from partitioned data. In addition to distributed modeling of large
datasets, our unified treatment also includes hierarchical modeling of data
with a naturally partitioned structure. The paper describes a general
algorithmic framework, rather than a specific algorithm, and presents an
example implementation for it.
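
The prior-counting dilemma can be seen in a toy Gaussian case. The sketch
below is not expectation propagation; it is only an illustration, under the
assumption of a normal prior and normal likelihood with known variance, where
the correction is available in closed form. The data values and function names
are hypothetical. Multiplying local posteriors corresponds to adding their
natural parameters, so a prior included in each of K pieces is counted K times
unless K - 1 copies are removed.

```python
def natural_params(mean, var):
    """Gaussian natural parameters: (precision, precision * mean)."""
    return 1.0 / var, mean / var


def local_posterior(y_piece, prior, noise_var):
    """Natural parameters of the posterior for one data piece, prior included."""
    prior_prec, prior_shift = prior
    return (prior_prec + len(y_piece) / noise_var,
            prior_shift + sum(y_piece) / noise_var)


def combine(pieces, prior, noise_var, correct_prior):
    """Multiply local posteriors; optionally remove the extra copies of the prior."""
    local = [local_posterior(p, prior, noise_var) for p in pieces]
    prec = sum(loc[0] for loc in local)
    shift = sum(loc[1] for loc in local)
    if correct_prior:
        extra = len(pieces) - 1
        prec -= extra * prior[0]
        shift -= extra * prior[1]
    return shift / prec, 1.0 / prec  # posterior mean, posterior variance


# Hypothetical data split into K = 3 pieces; prior N(0, 1), noise variance 1.
pieces = [[2.1, 1.9], [2.4, 1.6], [2.0, 2.2]]
all_data = [y for piece in pieces for y in piece]
prior = natural_params(0.0, 1.0)

full = combine([all_data], prior, 1.0, correct_prior=False)
naive = combine(pieces, prior, 1.0, correct_prior=False)
corrected = combine(pieces, prior, 1.0, correct_prior=True)

print("full data:      ", full)
print("naive product:  ", naive)
print("prior corrected:", corrected)
```

The naive product is shrunk too strongly toward the prior mean and is
overconfident. Outside conjugate cases no such exact correction exists, which
is the gap the iterative approximation described above is meant to fill.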

While it's always possible to compute a variational approximation to a
posterior distribution, it can be difficult to discover problems with this
approximation. We propose two diagnostic algorithms to alleviate this problem.
The Pareto-smoothed importance sampling (PSIS) diagnostic gives a goodness of
fit measurement for joint distributions, while simultaneously improving the
error in the estimate. The variational simulation-based calibration (VSBC)
assesses the average performance of point estimates.

Cluster sampling is common in survey practice, and the corresponding
inference has been predominantly design-based. We develop a Bayesian framework
for cluster sampling and account for the design effect in the outcome modeling.
We consider a two-stage cluster sampling design where the clusters are first
selected with probability proportional to cluster size, and then units are
randomly sampled inside selected clusters. Challenges arise when the sizes of
nonsampled clusters are unknown. We propose nonparametric and parametric
Bayesian approaches for predicting the unknown cluster sizes, with this
inference performed simultaneously with the model for survey outcome.
Simulation studies show that the integrated Bayesian approach outperforms
classical methods with efficiency gains. We use Stan for computing and apply
the proposal to the Fragile Families and Child Wellbeing study as an
illustration of complex survey inference in health surveys.
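
To see why unknown cluster sizes matter, consider the inclusion probabilities
under the two-stage design. The sketch below is an illustration only: it
assumes probability-proportional-to-size selection without replacement at the
first stage and a fixed number of units sampled per selected cluster, and the
cluster sizes and function name are hypothetical. The first-stage probability
depends on the total population size, which requires the sizes of all
clusters, sampled or not.

```python
def two_stage_inclusion_probs(cluster_sizes, n_clusters, n_per_cluster):
    """Unit-level inclusion probability in each cluster under PPS + fixed take."""
    total = sum(cluster_sizes)  # needs the sizes of ALL clusters
    probs = []
    for size in cluster_sizes:
        p_cluster = n_clusters * size / total  # stage 1: proportional to size
        if p_cluster > 1.0:
            raise ValueError("cluster too large for PPS without replacement")
        p_unit = min(1.0, n_per_cluster / size)  # stage 2: random sample in cluster
        probs.append(p_cluster * p_unit)
    return probs


sizes = [120, 80, 200, 50, 150]  # hypothetical cluster sizes
probs = two_stage_inclusion_probs(sizes, n_clusters=2, n_per_cluster=10)
print(probs)
```

Under these assumptions every unit has the same inclusion probability,
n_clusters * n_per_cluster / total, regardless of its cluster.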

The widely recommended procedure of Bayesian model averaging is flawed in the
M-open setting in which the true data-generating process is not one of the
candidate models being fit. We take the idea of stacking from the point
estimation literature and generalize to the combination of predictive
distributions, extending the utility function to any proper scoring rule, using
Pareto smoothed importance sampling to efficiently compute the required
leave-one-out posterior distributions and regularization to get more stability.
We compare stacking of predictive distributions to several alternatives:
stacking of means, Bayesian model averaging (BMA), pseudo-BMA using AIC-type
weighting, and a variant of pseudo-BMA that is stabilized using the Bayesian
bootstrap. Based on simulations and real-data applications, we recommend
stacking of predictive distributions, with BB-pseudo-BMA as an approximate
alternative when computation cost is an issue.
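
The difference between pseudo-BMA weighting and stacking can be sketched for
two models. The example assumes the pointwise leave-one-out predictive
densities have already been computed; the numbers below are hypothetical, and
the grid search stands in for a proper optimizer. Pseudo-BMA weights are
proportional to the exponentiated summed log densities, while stacking (with
the log score) chooses the weight that maximizes the log density of the
combined predictive distribution.

```python
import math


def pseudo_bma_weights(elpds):
    """Weights proportional to exp(elpd), computed stably."""
    top = max(elpds)
    unnorm = [math.exp(e - top) for e in elpds]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def stacking_objective(w, dens_a, dens_b):
    """Summed log density of the mixture w * A + (1 - w) * B at each point."""
    return sum(math.log(w * a + (1.0 - w) * b) for a, b in zip(dens_a, dens_b))


def stack_two_models(dens_a, dens_b, grid_size=1001):
    """Grid search over the weight on model A (adequate for two models)."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda w: stacking_objective(w, dens_a, dens_b))


# Hypothetical leave-one-out predictive densities p(y_i | y_-i, M_k).
loo_dens_a = [0.30, 0.25, 0.40, 0.05, 0.35]
loo_dens_b = [0.10, 0.20, 0.15, 0.45, 0.20]

elpd_a = sum(math.log(d) for d in loo_dens_a)
elpd_b = sum(math.log(d) for d in loo_dens_b)

bma_like = pseudo_bma_weights([elpd_a, elpd_b])
w_stack = stack_two_models(loo_dens_a, loo_dens_b)

print("pseudo-BMA weights:", bma_like)
print("stacking weight on model A:", w_stack)
```

Model B predicts the fourth observation much better than model A, so stacking
keeps a substantial weight on it even though model A has the higher total.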

A key sticking point of Bayesian analysis is the choice of prior
distribution, and there is a vast literature on potential defaults including
uniform priors, Jeffreys' priors, reference priors, maximum entropy priors, and
weakly informative priors. These methods, however, often manifest a key
conceptual tension in prior modeling: a model encoding true prior information
should be chosen without reference to the model of the measurement process, but
almost all common prior modeling techniques are implicitly motivated by a
reference likelihood. In this paper we resolve this apparent paradox by placing
the choice of prior into the context of the entire Bayesian analysis, from
inference to prediction to model evaluation.

We combine Bayesian prediction and weighted inference as a unified approach
to survey inference. The general principles of Bayesian analysis imply that
models for survey outcomes should be conditional on all variables that affect
the probability of inclusion. We incorporate the weighting variables under the
framework of multilevel regression and poststratification, as a byproduct
generating model-based weights after smoothing. We investigate deep
interactions and introduce structured prior distributions for smoothing and
stability of estimates. The computation is done via Stan and implemented in the
open source R package "rstanarm" ready for public use. Simulation studies
illustrate that model-based prediction and weighting inference outperform
classical weighting. We apply the proposal to the New York Longitudinal Study
of Wellbeing. The new approach generates robust weights and increases
efficiency for finite population inference, especially for subsets of the
population.
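
The two ingredients of multilevel regression and poststratification can be
sketched with a toy example. This is a simplification, not the method of the
paper: the cells, counts, and hyperparameters are hypothetical, and the
hyperparameters are fixed by hand, whereas a real analysis would estimate them
(for example in Stan). Cell estimates are partially pooled toward a common
mean and then averaged using known population counts.

```python
def partial_pool(cell_mean, n, grand_mean, sigma2, tau2):
    """Precision-weighted compromise between a cell mean and the grand mean."""
    data_prec = n / sigma2
    prior_prec = 1.0 / tau2
    return (data_prec * cell_mean + prior_prec * grand_mean) / (data_prec + prior_prec)


def poststratify(cell_estimates, pop_counts):
    """Population-weighted average of cell estimates."""
    weighted = sum(e * count for e, count in zip(cell_estimates, pop_counts))
    return weighted / sum(pop_counts)


# Hypothetical cells: (sample size, sample mean, population count).
cells = [(5, 0.70, 400), (20, 0.50, 350), (75, 0.30, 250)]
sample_sizes = [c[0] for c in cells]
cell_means = [c[1] for c in cells]
pop_counts = [c[2] for c in cells]

raw_sample_mean = poststratify(cell_means, sample_sizes)  # ignores the population
ps_raw = poststratify(cell_means, pop_counts)

# Assumed hyperparameters, fixed for illustration only.
grand_mean, sigma2, tau2 = 0.5, 0.25, 0.01
pooled = [partial_pool(m, n, grand_mean, sigma2, tau2)
          for n, m, _ in cells]
ps_pooled = poststratify(pooled, pop_counts)

print("raw sample mean:", raw_sample_mean)
print("poststratified, raw cells:", ps_raw)
print("poststratified, pooled cells:", ps_pooled)
```

The small cell is pulled most strongly toward the common mean, which is the
smoothing that stabilizes estimates for sparse subsets of the population.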

We show that publishing results using the statistical significance
filter (publishing only when the p-value is less than 0.05) leads to a
vicious cycle of overoptimistic expectation of the replicability of results.
First, we show analytically that when true statistical power is relatively low,
computing power based on statistically significant results will lead to
overestimates of power. Then, we present a case study using 10 experimental
comparisons drawn from a recently published meta-analysis in psycholinguistics
(J\"ager et al., 2017). We show that the statistically significant results
yield an illusion of replicability. This illusion holds even if the researcher
doesn't conduct any formal power analysis but just uses statistical
significance to informally assess robustness (i.e., replicability) of results.

Stata users have access to two easy-to-use implementations of Bayesian
inference: Stata's native {\tt bayesmh} function and StataStan, which calls the
general Bayesian engine Stan. We compare these on two models that are important
for education research: the Rasch model and the hierarchical Rasch model. Stan
(as called from Stata) fits a more general range of models than can be fit by
{\tt bayesmh} and is also more scalable, in that it could easily fit models
with at least ten times more parameters than could be fit using Stata's native
Bayesian implementation. In addition, Stan runs between two and ten times
faster than {\tt bayesmh} as measured in effective sample size per second: that
is, compared to Stan, it takes Stata's built-in Bayesian engine twice to ten
times as long to get inferences with equivalent precision. We attribute Stan's
advantage in flexibility to its general modeling language, and its advantages
in scalability and speed to an efficient sampling algorithm: Hamiltonian Monte
Carlo using the no-U-turn sampler. In order to further investigate scalability,
we also compared to the package Jags, which performed better than Stata's
native Bayesian engine but not as well as StataStan.
Given its advantages in speed, generality, and scalability, and that Stan is
open-source and can be run directly from Stata using StataStan, we recommend
that Stata users adopt Stan as their Bayesian inference engine of choice.

Importance weighting is a general way to adjust Monte Carlo integration to
account for draws from the wrong distribution, but the resulting estimate can
be noisy when the importance ratios have a heavy right tail. This routinely
occurs when there are aspects of the target distribution that are not well
captured by the approximating distribution, in which case more stable estimates
can be obtained by modifying extreme importance ratios. We present a new method
for stabilizing importance weights using a generalized Pareto distribution fit
to the upper tail of the distribution of the simulated importance ratios. The
method, which empirically performs better than existing methods for stabilizing
importance sampling estimates, includes stabilized effective sample size
estimates, Monte Carlo error estimates and convergence diagnostics.
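
The heavy-tail problem can be illustrated with a standard normal target and
two normal proposals. The sketch below does not implement Pareto smoothing; it
only computes importance ratios, the usual effective sample size, and a simple
truncation rule (capping ratios at the square root of the number of draws
times their mean) as one example of modifying extreme ratios. The proposals,
sample size, and function names are assumptions for illustration.

```python
import math
import random


def log_normal_pdf(x, sd):
    """Log density of N(0, sd^2) at x."""
    return -0.5 * (x / sd) ** 2 - math.log(sd * math.sqrt(2.0 * math.pi))


def importance_ratios(proposal_sd, n_draws=2000, seed=3):
    """Draw from N(0, proposal_sd^2) and compute ratios for a N(0, 1) target."""
    rng = random.Random(seed)
    draws = [rng.gauss(0.0, proposal_sd) for _ in range(n_draws)]
    ratios = [math.exp(log_normal_pdf(x, 1.0) - log_normal_pdf(x, proposal_sd))
              for x in draws]
    return draws, ratios


def effective_sample_size(ratios):
    """Standard approximation: (sum r)^2 / sum r^2."""
    return sum(ratios) ** 2 / sum(r * r for r in ratios)


def truncate(ratios):
    """Cap ratios at sqrt(S) * mean ratio (a simple rule, not Pareto smoothing)."""
    cap = math.sqrt(len(ratios)) * sum(ratios) / len(ratios)
    return [min(r, cap) for r in ratios], cap


def second_moment(draws, ratios):
    """Self-normalized importance sampling estimate of E[x^2] under the target."""
    return sum(r * x * x for x, r in zip(draws, ratios)) / sum(ratios)


wide_draws, wide_ratios = importance_ratios(proposal_sd=1.5)
narrow_draws, narrow_ratios = importance_ratios(proposal_sd=0.5)
capped_ratios, cap = truncate(narrow_ratios)

print("ESS, wide proposal:  ", round(effective_sample_size(wide_ratios)))
print("ESS, narrow proposal:", round(effective_sample_size(narrow_ratios)))
print("largest narrow ratio:", round(max(narrow_ratios), 1), "cap:", round(cap, 1))
print("E[x^2], wide proposal:", round(second_moment(wide_draws, wide_ratios), 3))
```

The narrow proposal misses the tails of the target, so a few draws carry very
large ratios and the effective sample size drops well below that of the wide
proposal.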

Leave-one-out cross-validation (LOO) and the widely applicable information
criterion (WAIC) are methods for estimating pointwise out-of-sample prediction
accuracy from a fitted Bayesian model using the log-likelihood evaluated at the
posterior simulations of the parameter values. LOO and WAIC have various
advantages over simpler estimates of predictive error such as AIC and DIC but
are less used in practice because they involve additional computational steps.
Here we lay out fast and stable computations for LOO and WAIC that can be
performed using existing simulation draws. We introduce an efficient
computation of LOO using Pareto-smoothed importance sampling (PSIS), a new
procedure for regularizing importance weights. Although WAIC is asymptotically
equal to LOO, we demonstrate that PSIS-LOO is more robust in the finite case
with weak priors or influential observations. As a byproduct of our
calculations, we also obtain approximate standard errors for estimated
predictive errors and for comparing predictive errors between two models. We
implement the computations in an R package called 'loo' and demonstrate using
models fit with the Bayesian inference package Stan.
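
Both quantities can be computed from a matrix of pointwise log-likelihood
values, one row per posterior draw. The sketch below is not the 'loo' package
and does not apply Pareto smoothing: it uses raw importance sampling for LOO,
which is the unregularized estimate that PSIS is meant to stabilize. The toy
normal model, the simulated data, and the function names are assumptions for
illustration.

```python
import math
import random
import statistics


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))


def waic_and_is_loo(log_lik):
    """log_lik[s][i] = log p(y_i | theta_s) for draw s and observation i."""
    n_obs = len(log_lik[0])
    lppd = elpd_waic = elpd_loo = 0.0
    for i in range(n_obs):
        column = [row[i] for row in log_lik]
        lppd_i = log_mean_exp(column)              # log pointwise predictive density
        p_waic_i = statistics.variance(column)     # effective number of parameters
        loo_i = -log_mean_exp([-v for v in column])  # raw importance sampling LOO
        lppd += lppd_i
        elpd_waic += lppd_i - p_waic_i
        elpd_loo += loo_i
    return {"lppd": lppd, "elpd_waic": elpd_waic, "elpd_is_loo": elpd_loo}


# Toy model: y_i ~ N(mu, 1) with a flat prior, so mu | y ~ N(mean(y), 1/n).
rng = random.Random(11)
y = [rng.gauss(0.5, 1.0) for _ in range(20)]
y_bar = sum(y) / len(y)
mu_draws = [rng.gauss(y_bar, 1.0 / math.sqrt(len(y))) for _ in range(2000)]
log_lik = [[-0.5 * math.log(2.0 * math.pi) - 0.5 * (yi - mu) ** 2 for yi in y]
           for mu in mu_draws]

result = waic_and_is_loo(log_lik)
for key, value in result.items():
    print(key, round(value, 2))
```

In a well-behaved case like this one the two estimates are close; the
differences discussed above arise with weak priors or influential
observations, where the raw importance ratios become heavy-tailed.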

We present an increasingly stringent set of replications of Ghitza & Gelman
(2013), a multilevel regression and poststratification analysis of polls from
the 2008 U.S. presidential election campaign, focusing on a set of plots
showing the estimated Republican vote share for whites and for all voters, as a
function of income level in each of the states.
We start with a nearly-exact duplication that uses the posted code and
changes only the model-fitting algorithm; we then replicate using
already-analyzed data from 2004; and finally we set up preregistered
replications using two surveys from 2008 that we had not previously looked at.
We have already learned from our preliminary, non-preregistered replication,
which has revealed a potential problem with the published analysis of Ghitza &
Gelman (2013); it appears that our model may not sufficiently account for
nonsampling error, and that some of the patterns presented in that earlier
paper may simply reflect noise.
In addition to the substantive interest in validating earlier findings about
demographics, geography, and voting, the present project serves as a
demonstration of preregistration in a setting where the subject matter is
historical (and thus the replication data exist before the preregistration plan
is written) and where the analysis is exploratory (and thus a replication
cannot be simply deemed successful or unsuccessful based on the statistical
significance of some particular comparison).

Probabilistic modeling is iterative. A scientist posits a simple model, fits
it to her data, refines it according to her analysis, and repeats. However,
fitting complex models to large data is a bottleneck in this process. Deriving
algorithms for new models can be both mathematically and computationally
challenging, which makes it difficult to efficiently cycle through the steps.
To this end, we develop automatic differentiation variational inference (ADVI).
Using our method, the scientist only provides a probabilistic model and a
dataset, nothing else. ADVI automatically derives an efficient variational
inference algorithm, freeing the scientist to refine and explore many models.
ADVI supports a broad class of models; no conjugacy assumptions are required. We
study ADVI across ten different models and apply it to a dataset with millions
of observations. ADVI is integrated into Stan, a probabilistic programming
system; it is available for immediate use.

We often wish to use external data to improve the precision of an inference,
but concerns arise when the different datasets have been collected under
different conditions so that we do not want to simply pool the information.
This is the well-known problem of meta-analysis, for which Bayesian methods
have long been used to achieve partial pooling. Here we consider the challenge
when the external data are averages rather than raw data. We provide a Bayesian
solution by using simulation to approximate the likelihood of the external
summary, and by allowing the parameters in the model to vary under the
different conditions. Inferences are constructed using importance sampling from
an approximate distribution determined by an expectation propagation-like
algorithm. We demonstrate with the problem that motivated this research, a
hierarchical nonlinear model in pharmacometrics, implementing the computation
in Stan.

In a recent article in PNAS, Case and Deaton show a figure illustrating "a
marked increase in the all-cause mortality of middle-aged white non-Hispanic
men and women in the United States between 1999 and 2013." The authors state
that their numbers "are not age-adjusted within the 10-y 45-54 age group." They
calculated the mortality rate each year by dividing the total number of deaths
for the age group by the population of the age group.
We suspected an aggregation bias. After adjusting for changes in age
composition, we find there is no longer a steady increase in mortality rates
for this age group. Instead there is an increasing trend from 1999 to 2005 and a
constant trend thereafter. Moreover, stratifying age-adjusted mortality rates
by sex shows a marked increase only for women and not men, contrary to the
article's headline.
We stress that this does not change a key finding of the Case and Deaton
paper: the comparison of non-Hispanic U.S. middle-aged whites to other
countries and other ethnic groups. These comparisons hold up after our age
adjustment. While we do not believe that age-adjustment invalidates comparisons
between countries, it does affect claims concerning the absolute increase in
mortality among U.S. middle-aged white non-Hispanics. Breaking down the trends
in this group by region of the country shows other interesting patterns: since
1999 there has been an increase in death rates among women in the south. In
contrast, death rates for both sexes have been declining in the northeast, the
region where mortality rates were lowest to begin with. These graphs
demonstrate the value of this sort of data exploration, and we are grateful to
Case and Deaton for focusing attention on these mortality trends.
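
The aggregation bias can be reproduced with made-up numbers. In the sketch
below, which uses hypothetical populations and death counts rather than the
actual mortality data, the death rate within each age band is identical in the
two periods, yet the crude rate for the combined group rises because the group
has become older. Direct standardization to a fixed reference population
removes the artifact.

```python
def crude_rate(deaths, population):
    """Deaths per 100,000 for the whole group, ignoring age composition."""
    return 1e5 * sum(deaths.values()) / sum(population.values())


def age_adjusted_rate(deaths, population, reference_population):
    """Directly standardized rate: age-specific rates weighted by a fixed reference."""
    ref_total = sum(reference_population.values())
    return sum(
        1e5 * deaths[age] / population[age] * reference_population[age] / ref_total
        for age in reference_population
    )


# Hypothetical data: the same age-specific rates (300 and 500 per 100,000) in
# both periods, but the population shifts toward the older band.
pop_early = {"45-49": 600_000, "50-54": 400_000}
pop_late = {"45-49": 400_000, "50-54": 600_000}
deaths_early = {"45-49": 1_800, "50-54": 2_000}
deaths_late = {"45-49": 1_200, "50-54": 3_000}

crude_early = crude_rate(deaths_early, pop_early)
crude_late = crude_rate(deaths_late, pop_late)
adj_early = age_adjusted_rate(deaths_early, pop_early, pop_early)
adj_late = age_adjusted_rate(deaths_late, pop_late, pop_early)

print("crude rates:       ", crude_early, crude_late)
print("age-adjusted rates:", adj_early, adj_late)
```

The choice of the early period as the reference population is arbitrary here;
any fixed reference removes the composition effect.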

Quantifying long-term historical climate is fundamental to understanding
recent climate change. Most instrumentally recorded climate data are only
available for the past 200 years, so proxy observations from natural archives
are often considered. We describe a model-based approach to reconstructing
climate defined in terms of raw tree-ring measurement data that simultaneously
accounts for non-climatic and climatic variability. In this approach we specify
a joint model for the tree-ring data and climate variable that we fit using
Bayesian inference. We consider a range of prior densities and compare the
modeling approach to current methodology using an example case of Scots pine
from Torneträsk, Sweden to reconstruct growing season temperature. We describe
how current approaches translate into particular model assumptions. We explore
how changes to various components in the model-based approach affect the
resulting reconstruction. We show that minor changes in model specification can
have little effect on model fit but lead to large changes in the predictions.
In particular, the periods of relatively warmer and cooler temperatures are
robust between models, but the magnitude of the resulting temperatures is
highly model dependent. Such sensitivity may not be apparent with traditional
approaches because the underlying statistical model is often hidden or poorly
described.

It has historically been a challenge to perform Bayesian inference in a
design-based survey context. The present paper develops a Bayesian model for
sampling inference in the presence of inverse-probability weights. We use a
hierarchical approach in which we model the distribution of the weights of the
nonsampled units in the population and simultaneously include them as
predictors in a nonparametric Gaussian process regression. We use simulation
studies to evaluate the performance of our procedure and compare it to the
classical design-based estimator. We apply our method to the Fragile Families and
Child Wellbeing Study. Our studies find the Bayesian nonparametric finite
population estimator to be more robust than the classical design-based
estimator without loss in efficiency. This works because we induce
regularization for small cells, which automatically smooths the highly
variable weights.

We argue that the words "objectivity" and "subjectivity" in statistics
discourse are used in a mostly unhelpful way, and we propose to replace each of
them with broader collections of attributes, with objectivity replaced by
transparency, consensus, impartiality, and correspondence to observable
reality, and subjectivity replaced by awareness of multiple perspectives and
context dependence. The advantage of these reformulations is that the
replacement terms do not oppose each other. Instead of debating over whether a
given statistical method is subjective or objective (or normatively debating
the relative merits of subjectivity and objectivity in statistical practice),
we can recognize desirable attributes such as transparency and acknowledgment
of multiple perspectives as complementary goals. We demonstrate the
implications of our proposal with recent applied examples from pharmacology,
election polling, and socioeconomic stratification.

The Millennium Villages Project (MVP) is a ten-year integrated rural
development project implemented in ten sub-Saharan African sites. At its
conclusion we will conduct an evaluation of its causal effect on a variety of
development outcomes, measured via household surveys in treatment and
comparison areas. Outcomes are measured by six survey modules, with sample
sizes for each demographic group determined by budget, logistics, and the
group's vulnerability. We design a sampling plan that aims to reduce effort for
survey enumerators and maximize precision for all outcomes. We propose
two-stage sampling designs, sampling households at the first stage, followed by
a second stage sample that differs across demographic groups. Two-stage designs
are usually constructed by simple random sampling (SRS) of households and
proportional within-household sampling, or probability proportional to size
sampling (PPS) of households with fixed sampling within each. No measure of
household size is proportional for all demographic groups, putting PPS schemes
at a disadvantage. The SRS schemes have the disadvantage that multiple
individuals sampled per household decreases efficiency due to intra-household
correlation. We conduct a simulation study (using both design- and model-based
survey inference) to understand these tradeoffs and recommend a sampling plan
for the Millennium Villages Project. Similar design issues arise in other
studies with surveys that target different demographic groups.

Variational inference is a scalable technique for approximate Bayesian
inference. Deriving variational inference algorithms requires tedious
model-specific calculations; this makes it difficult to automate. We propose an
automatic variational inference algorithm, automatic differentiation
variational inference (ADVI). The user only provides a Bayesian model and a
dataset; nothing else. We make no conjugacy assumptions and support a broad
class of models. The algorithm automatically determines an appropriate
variational family and optimizes the variational objective. We implement ADVI
in Stan (code available now), a probabilistic programming framework. We compare
ADVI to MCMC sampling across hierarchical generalized linear models,
nonconjugate matrix factorization, and a mixture model. We train the mixture
model on a quarter million images. With ADVI we can use variational inference
on any model we write in Stan.

The only acceptable form of polling in the multi-billion dollar survey
research field utilizes representative samples. We argue that with proper
statistical adjustment, nonrepresentative polling can provide accurate
predictions, and often in a much more timely and cost-effective fashion. We
demonstrate this by applying multilevel regression and poststratification
(MRP) to a 2012 election survey on the Xbox gaming platform. Not only do the
transformed topline projections from this data closely trend standard
indicators, but we use the unique nature of the data's size and panel to answer
a meaningful political puzzle. We find that reported swings in public opinion
polls are generally not due to actual shifts in vote intention, but rather are
the result of temporary periods of relatively low response rates among
supporters of the reportedly slumping candidate. This work shows great promise
for using nonrepresentative polling to measure public opinion and the first
product of this new polling technique raises the possibility that decades of
large, reported swings in public opinion (including the perennial "convention
bounce") are mostly artifacts of sampling bias.