
Stochastic neural net weights are used in a variety of contexts, including
regularization, Bayesian neural nets, exploration in reinforcement learning,
and evolution strategies. Unfortunately, due to the large number of weights,
all the examples in a minibatch typically share the same weight perturbation,
thereby limiting the variance reduction effect of large minibatches. We
introduce flipout, an efficient method for decorrelating the gradients within a
minibatch by implicitly sampling pseudo-independent weight perturbations for
each example. Empirically, flipout achieves the ideal linear variance reduction
for fully connected networks, convolutional networks, and RNNs. We find
significant speedups in training neural networks with multiplicative Gaussian
perturbations. We show that flipout is effective at regularizing LSTMs, and
outperforms previous methods. Flipout also enables us to vectorize evolution
strategies: in our experiments, a single GPU with flipout can handle the same
throughput as at least 40 CPU cores using existing methods, equivalent to a
factor-of-4 cost reduction on Amazon Web Services.
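
The implicit per-example sampling can be illustrated with a small dependency-free sketch. It assumes a rank-one sign-flip construction: one shared perturbation `dW` is multiplied elementwise by the outer product of two random sign vectors `r` and `s` drawn per example. The function names, shapes, and the use of plain Python lists are illustrative assumptions rather than the authors' implementation; the point is only that the perturbed output can be computed without ever materializing a per-example weight matrix.

```python
import random

def matvec_T(W, x):
    """Return W^T x for W stored as a list of rows (shape d_in x d_out)."""
    d_out = len(W[0])
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(d_out)]

def random_signs(n, rng):
    """Draw n independent signs from {-1, +1}."""
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def flipout_output(x, W_mean, dW, r, s):
    """Sign-flip form: W_mean^T x + (dW^T (x * r)) * s, with no per-example matrix."""
    base = matvec_T(W_mean, x)
    xr = [xi * ri for xi, ri in zip(x, r)]
    pert = matvec_T(dW, xr)
    return [b + p * sj for b, p, sj in zip(base, pert, s)]

def explicit_output(x, W_mean, dW, r, s):
    """Reference: build W_n = W_mean + dW * (r s^T) explicitly, then apply it."""
    W_n = [[W_mean[i][j] + dW[i][j] * r[i] * s[j] for j in range(len(s))]
           for i in range(len(r))]
    return matvec_T(W_n, x)

rng = random.Random(0)
d_in, d_out = 4, 3
W_mean = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
dW = [[rng.gauss(0, 0.1) for _ in range(d_out)] for _ in range(d_in)]  # shared perturbation
batch = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(5)]
signs = [(random_signs(d_in, rng), random_signs(d_out, rng)) for _ in batch]

fast = [flipout_output(x, W_mean, dW, r, s) for x, (r, s) in zip(batch, signs)]
slow = [explicit_output(x, W_mean, dW, r, s) for x, (r, s) in zip(batch, signs)]
max_gap = max(abs(a - b) for fa, sl in zip(fast, slow) for a, b in zip(fa, sl))
```

Here `max_gap` measures the largest disagreement between the two forms and should be at floating-point rounding level. In a vectorized setting the same identity turns into two matrix products over the whole minibatch.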

Variational inference is an umbrella term for algorithms which cast Bayesian
inference as optimization. Classically, variational inference uses the
Kullback-Leibler divergence to define the optimization. Though this divergence
has been widely used, the resultant posterior approximation can suffer from
undesirable statistical properties. To address this, we reexamine variational
inference from its roots as an optimization problem. We use operators, or
functions of functions, to design variational objectives. As one example, we
design a variational objective with a Langevin-Stein operator. We develop a
black box algorithm, operator variational inference (OPVI), for optimizing any
operator objective. Importantly, operators enable us to make explicit the
statistical and computational tradeoffs for variational inference. We can
characterize different properties of variational objectives, such as objectives
that admit data subsampling (allowing inference to scale to massive data) as
well as objectives that admit variational programs (a rich class of posterior
approximations that does not require a tractable density). We illustrate the
benefits of OPVI on a mixture model and a generative model of images.

A common approach for Bayesian computation with big data is to partition the
data into smaller pieces, perform local inference for each piece separately,
and finally combine the results to obtain an approximation to the global
posterior. Looking at this from the bottom up, one can perform separate
analyses on individual sources of data and then combine these in a larger
Bayesian model. In either case, the idea of distributed modeling and inference
has both conceptual and computational appeal, but from the Bayesian perspective
there is no general way of handling the prior distribution: if the prior is
included in each separate inference, it will be multiply-counted when the
inferences are combined; but if the prior is itself divided into pieces, it may
not provide enough regularization for each separate computation, thus
eliminating one of the key advantages of Bayesian methods. To resolve this
dilemma, we propose expectation propagation (EP) as a general prototype for
distributed Bayesian inference. The central idea is to factor the likelihood
according to the data partitions, and to iteratively combine each factor with
an approximate model of the prior and all other parts of the data, thus
producing an overall approximation to the global posterior at convergence. In
this paper, we give an introduction to EP and an overview of some recent
developments of the method, with particular emphasis on its use in combining
inferences from partitioned data. In addition to distributed modeling of large
datasets, our unified treatment also includes hierarchical modeling of data
with a naturally partitioned structure. The paper describes a general
algorithmic framework, rather than a specific algorithm, and presents an
example implementation for it.

Implicit probabilistic models are a flexible class for modeling data. They
define a process to simulate observations, and unlike traditional models, they
do not require a tractable likelihood function. In this paper, we develop two
families of models: hierarchical implicit models and deep implicit models. They
combine the idea of implicit densities with hierarchical Bayesian modeling and
deep neural networks. The use of implicit models with Bayesian analysis has
been limited by our ability to perform accurate and scalable inference. We
develop likelihood-free variational inference (LFVI). Key to LFVI is specifying
a variational family that is also implicit. This matches the model's
flexibility and allows for accurate approximation of the posterior. Our work
scales up implicit models to sizes previously not possible and advances their
modeling design. We demonstrate diverse applications: a large-scale physical
simulator for predator-prey populations in ecology; a Bayesian generative
adversarial network for discrete data; and a deep implicit model for text
generation.

We propose Edward, a Turing-complete probabilistic programming language.
Edward defines two compositional representations: random variables and
inference. By treating inference as a first class citizen, on a par with
modeling, we show that probabilistic programming can be as flexible and
computationally efficient as traditional deep learning. For flexibility, Edward
makes it easy to fit the same model using a variety of composable inference
methods, ranging from point estimation to variational inference to MCMC. In
addition, Edward can reuse the modeling representation as part of inference,
facilitating the design of rich variational models and generative adversarial
networks. For efficiency, Edward is integrated into TensorFlow, providing
significant speedups over existing probabilistic systems. For example, we show
on a benchmark logistic regression task that Edward is at least 35x faster than
Stan and 6x faster than PyMC3. Further, Edward incurs no runtime overhead: it
is as fast as handwritten TensorFlow.

Variational inference enables Bayesian analysis for complex probabilistic
models with massive data sets. It posits a family of approximating
distributions and finds the member closest to the posterior. While successful,
variational inference methods can run into pathologies; for example, they
typically underestimate posterior uncertainty. In this paper we propose CHIVI,
a complementary algorithm to traditional variational inference. CHIVI is a
black box algorithm that minimizes the $\chi$-divergence from the posterior to
the family of approximating distributions and provides an upper bound of the
model evidence. We studied CHIVI in several scenarios. On Bayesian probit
regression and Gaussian process classification it yielded better classification
error rates than expectation propagation (EP) and classical variational
inference (VI). When modeling basketball data with a Cox process, it gave
better estimates of posterior uncertainty. Finally, we show how to use the
CHIVI upper bound and classical VI lower bound to sandwich estimate the model
evidence.
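
The sandwich estimate can be written out as a short worked inequality. The notation below is an assumption introduced for illustration: $w(z)$ denotes the importance weight of the joint against the approximation $q(z; \lambda)$, and $n \ge 1$ is the order of the upper bound. Both sides follow from Jensen's inequality applied to $\mathbb{E}_q[w(z)] = p(x)$.

```latex
\begin{align}
  w(z) &= \frac{p(x, z)}{q(z; \lambda)},
  \qquad \mathbb{E}_{q}\big[w(z)\big] = p(x), \\
  \underbrace{\mathbb{E}_{q}\big[\log w(z)\big]}_{\text{lower bound (classical VI)}}
  \;&\le\; \log p(x) \;\le\;
  \underbrace{\tfrac{1}{n} \log \mathbb{E}_{q}\big[w(z)^{n}\big]}_{\text{upper bound},\; n \ge 1}, \\
  \tfrac{1}{2} \log \mathbb{E}_{q}\big[w(z)^{2}\big]
  &= \log p(x) + \tfrac{1}{2} \log\big(1 + \chi^{2}(p(z \mid x) \,\|\, q)\big).
\end{align}
```

The last line is the $n = 2$ case: the gap between the upper bound and the log evidence shrinks exactly as the $\chi^2$-divergence from the posterior to the approximation shrinks.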

Probabilistic modeling is a powerful approach for analyzing empirical
information. We describe Edward, a library for probabilistic modeling. Edward's
design reflects an iterative process pioneered by George Box: build a model of
a phenomenon, make inferences about the model given data, and criticize the
model's fit to the data. Edward supports a broad class of probabilistic models,
efficient algorithms for inference, and many techniques for model criticism.
The library builds on top of TensorFlow to support distributed training and
hardware such as GPUs. Edward enables the development of complex probabilistic
models and their algorithms at a massive scale.

The goal of causal inference is to understand the outcome of alternative
courses of action. However, all causal inference requires assumptions. Such
assumptions can be more influential than in typical tasks for probabilistic
modeling, and testing those assumptions is important to assess the validity of
causal inference. We develop model criticism for Bayesian causal inference,
building on the idea of posterior predictive checks to assess model fit. Our
approach involves decomposing the problem, separately criticizing the model of
treatment assignments and the model of outcomes. Conditioned on the assumption
of unconfoundedness (that the treatments are assigned independently of the
potential outcomes), we show how to check any additional modeling assumption.
Our approach provides a foundation for diagnosing model-based causal
inferences.

Discussion paper on "Fast Approximate Inference for Arbitrarily Large
Semiparametric Regression Models via Message Passing" by Wand
[arXiv:1602.07412].

Iterative procedures for parameter estimation based on stochastic gradient
descent allow the estimation to scale to massive data sets. However, in both
theory and practice, they suffer from numerical instability. Moreover, they are
statistically inefficient as estimators of the true parameter value. To address
these two issues, we propose a new iterative procedure termed averaged implicit
SGD (AISGD). For statistical efficiency, AISGD employs averaging of the
iterates, which achieves the optimal Cram\'{e}r-Rao bound under strong
convexity, i.e., it is an optimal unbiased estimator of the true parameter
value. For numerical stability, AISGD employs an implicit update at each
iteration, which is related to proximal operators in optimization. In practice,
AISGD achieves competitive performance with other state-of-the-art procedures.
Furthermore, it is more stable than averaging procedures that do not employ
proximal updates, and is simple to implement as it requires fewer tunable
hyperparameters than procedures that do employ proximal updates.
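
A minimal sketch of the two ingredients, implicit updates and iterate averaging, is given here for the special case of squared-error loss, where the implicit equation has a closed-form solution. The function name, the step-size schedule `gamma0 * n ** (-decay)`, and the synthetic data are assumptions chosen for illustration, not the procedure's reference implementation.

```python
import random

def aisgd_least_squares(data, dim, gamma0=5.0, decay=0.6):
    """Averaged implicit SGD for squared-error loss (illustrative sketch)."""
    theta = [0.0] * dim       # current iterate
    theta_bar = [0.0] * dim   # running average of the iterates
    for n, (x, y) in enumerate(data, start=1):
        gamma = gamma0 * n ** (-decay)  # assumed step-size schedule
        residual = y - sum(t * xi for t, xi in zip(theta, x))
        # Closed-form solution of the implicit equation
        #   theta_new = theta + gamma * (y - x . theta_new) * x
        step = gamma / (1.0 + gamma * sum(xi * xi for xi in x))
        theta = [t + step * residual * xi for t, xi in zip(theta, x)]
        # Incremental mean of all iterates seen so far
        theta_bar = [tb + (t - tb) / n for tb, t in zip(theta_bar, theta)]
    return theta, theta_bar

rng = random.Random(0)
true_theta = [2.0, -1.0]
data = []
for _ in range(2000):
    x = [rng.gauss(0, 1), rng.gauss(0, 1)]
    y = sum(t * xi for t, xi in zip(true_theta, x)) + rng.gauss(0, 0.1)
    data.append((x, y))

last, averaged = aisgd_least_squares(data, dim=2)
```

The initial step size is deliberately large: the effective step `gamma / (1 + gamma * ||x||^2)` is always bounded by `1 / ||x||^2`, so the iterates do not blow up the way an explicit update with the same `gamma0` could.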

Black box variational inference allows researchers to easily prototype and
evaluate an array of models. Recent advances allow such algorithms to scale to
high dimensions. However, a central question remains: How to specify an
expressive variational distribution that maintains efficient computation? To
address this, we develop hierarchical variational models (HVMs). HVMs augment a
variational approximation with a prior on its parameters, which allows it to
capture complex structure for both discrete and continuous latent variables.
The algorithm we develop is black box, can be used for any HVM, and has the
same computational efficiency as the original approximation. We study HVMs on a
variety of deep discrete latent variable models. HVMs generalize other
expressive variational distributions and maintain higher fidelity to the
posterior.

Variational inference is a powerful tool for approximate inference, and it
has been recently applied for representation learning with deep generative
models. We develop the variational Gaussian process (VGP), a Bayesian
nonparametric variational family, which adapts its shape to match complex
posterior distributions. The VGP generates approximate posterior samples by
generating latent inputs and warping them through random nonlinear mappings;
the distribution over random mappings is learned during inference, enabling the
transformed outputs to adapt to varying complexity. We prove a universal
approximation theorem for the VGP, demonstrating its representative power for
learning any model. For inference we present a variational objective inspired
by autoencoders and perform black box inference over a wide class of models.
The VGP achieves new state-of-the-art results for unsupervised learning,
inferring models such as the deep latent Gaussian model and the recently
proposed DRAW.

Method of moment estimators exhibit appealing statistical properties, such as
asymptotic unbiasedness, for nonconvex problems. However, they typically
require a large number of samples and are extremely sensitive to model
misspecification. In this paper, we apply the framework of M-estimation to
develop both a generalized method of moments procedure and a principled method
for regularization. Our proposed M-estimator obtains optimal sample efficiency
rates (in the class of moment-based estimators) and the same well-known rates
on prediction accuracy as other spectral estimators. It also makes it
straightforward to incorporate regularization into the sample moment
conditions. We demonstrate empirically the gains in sample efficiency from our
approach on hidden Markov models.

Probabilistic modeling is iterative. A scientist posits a simple model, fits
it to her data, refines it according to her analysis, and repeats. However,
fitting complex models to large data is a bottleneck in this process. Deriving
algorithms for new models can be both mathematically and computationally
challenging, which makes it difficult to efficiently cycle through the steps.
To this end, we develop automatic differentiation variational inference (ADVI).
Using our method, the scientist only provides a probabilistic model and a
dataset, nothing else. ADVI automatically derives an efficient variational
inference algorithm, freeing the scientist to refine and explore many models.
ADVI supports a broad class of models: no conjugacy assumptions are required. We
study ADVI across ten different models and apply it to a dataset with millions
of observations. ADVI is integrated into Stan, a probabilistic programming
system; it is available for immediate use.

We develop a general variational inference method that preserves dependency
among the latent variables. Our method uses copulas to augment the families of
distributions used in mean-field and structured approximations. Copulas model
the dependency that is not captured by the original variational distribution,
and thus the augmented variational family guarantees better approximations to
the posterior. With stochastic optimization, inference on the augmented
distribution is scalable. Furthermore, our strategy is generic: it can be
applied to any inference procedure that currently uses the mean-field or
structured approach. Copula variational inference has many advantages: it
reduces bias; it is less sensitive to local optima; it is less sensitive to
hyperparameters; and it helps characterize and interpret the dependency among
the latent variables.

We develop methods for parameter estimation in settings with large-scale data
sets, where traditional methods are no longer tenable. Our methods rely on
stochastic approximations, which are computationally efficient as they maintain
one iterate as a parameter estimate, and successively update that iterate based
on a single data point. When the update is based on a noisy gradient, the
stochastic approximation is known as standard stochastic gradient descent,
which has been fundamental in modern applications with large data sets.
Additionally, our methods are numerically stable because they employ implicit
updates of the iterates. Intuitively, an implicit update is a shrunken version
of a standard one, where the shrinkage factor depends on the observed Fisher
information at the corresponding data point. This shrinkage prevents numerical
divergence of the iterates, which can be caused either by excess noise or
outliers. Our sgd package in R offers the most extensive and robust
implementation of stochastic gradient descent methods. We demonstrate that sgd
dominates alternative software in runtime for several estimation problems with
massive data sets. Our applications include the wide class of generalized
linear models as well as M-estimation for robust regression.
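
The shrinkage interpretation can be made concrete in the simplest case, a linear model with squared-error loss. The symbols $\theta_n$ (iterate), $\gamma_n$ (learning rate), and $(x_n, y_n)$ (data point) are notation assumed here for illustration. In the implicit update the new iterate appears on both sides; taking the inner product with $x_n$ and solving yields an explicit form:

```latex
\begin{align}
  \theta_n &= \theta_{n-1} + \gamma_n \,\big(y_n - x_n^\top \theta_n\big)\, x_n
    && \text{(implicit update)} \\
  y_n - x_n^\top \theta_n
    &= \frac{y_n - x_n^\top \theta_{n-1}}{1 + \gamma_n \|x_n\|^2}
    && \text{(take the inner product with } x_n\text{)} \\
  \theta_n &= \theta_{n-1}
    + \frac{1}{1 + \gamma_n \|x_n\|^2}\;
      \gamma_n \,\big(y_n - x_n^\top \theta_{n-1}\big)\, x_n
    && \text{(shrunken standard update)}
\end{align}
```

In this special case the shrinkage factor is $1 / (1 + \gamma_n \|x_n\|^2)$, and $\|x_n\|^2$ is the trace of $x_n x_n^\top$, the Fisher information contributed by that data point under a unit-variance Gaussian model. Large learning rates or large covariates therefore shrink the step more strongly.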

We develop a robust convex algorithm to select the regularization parameter
in model selection. In practice this would be automated in order to save
practitioners time from having to tune it manually. In particular, we implement
and test the convex method for $K$-fold cross-validation on ridge regression,
although the same concept extends to more complex models. We then compare its
performance with standard methods.
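
For reference, the kind of standard baseline such a method would be compared against can be sketched as a plain grid search over the regularization parameter with $K$-fold cross-validation. This is not the convex algorithm described above. The one-feature ridge model, the grid, and the synthetic data are assumptions made to keep the sketch short and dependency-free.

```python
import random

def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate for a one-feature model without intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def kfold_cv_error(xs, ys, lam, k):
    """Mean validation squared error of ridge with penalty lam over k folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        val_idx = set(range(fold, n, k))  # interleaved folds
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in val_idx]
        val = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i in val_idx]
        beta = ridge_slope([x for x, _ in train], [y for _, y in train], lam)
        fold_errors.append(sum((y - beta * x) ** 2 for x, y in val) / len(val))
    return sum(fold_errors) / k

rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(60)]
ys = [1.5 * x + rng.gauss(0, 0.5) for x in xs]

grid = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]  # candidate regularization parameters
cv_errors = [kfold_cv_error(xs, ys, lam, k=5) for lam in grid]
best_lam = grid[cv_errors.index(min(cv_errors))]
```

The grid search evaluates every candidate independently, so its cost grows with the grid size times $K$; an automated selection procedure aims to avoid that manual tuning loop.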

We examine how symplectic cohomology may be used as an invariant on
symplectic structures, and investigate the nonuniqueness of these structures
on Liouville domains, a field which has seen much development in the past
decade. Notably, we prove the existence of infinitely many nonstandard
symplectic structures on finite type Liouville manifolds for dimensions $n\geq
6$. To do this, we build up notions of Liouville domains, Lefschetz fibrations,
and symplectic cohomology.

This paper examines the broad structure on Stein manifolds and how it
generalizes the notion of a domain of holomorphy in $\mathbb C^n$. Along with
this generalization, we see that Stein manifolds share key properties from
domains of holomorphy, and we prove one of these major consequences. In
particular, we investigate an equivalence, similar to domains of holomorphy and
pseudoconvexity, on the class of manifolds. Then, we examine the canonical
symplectic structure of Stein manifolds inherited from this equivalence, and
how its symplectic topology develops.