
Many applications of Bayesian data analysis involve sensitive information,
motivating methods which ensure that privacy is protected. We introduce a
general privacy-preserving framework for Variational Bayes (VB), a widely used
optimization-based Bayesian inference method. Our framework respects
differential privacy, the gold-standard privacy criterion, and encompasses a
large class of probabilistic models, called the Conjugate Exponential (CE)
family. We observe that we can straightforwardly privatise VB's approximate
posterior distributions for models in the CE family, by perturbing the expected
sufficient statistics of the complete-data likelihood. For a broadly used class
of non-CE models, those with binomial likelihoods, we show how to bring such
models into the CE family, such that inferences in the modified model resemble
the private variational Bayes algorithm as closely as possible, using the
Pólya-Gamma data augmentation scheme. The iterative nature of variational Bayes
presents a further challenge since iterations increase the amount of noise
needed. We overcome this by combining: (1) an improved composition method for
differential privacy, called the moments accountant, which provides a tight
bound on the privacy cost of multiple VB iterations and thus significantly
decreases the amount of additive noise; and (2) the privacy amplification
effect of subsampling mini-batches from large-scale data in stochastic
learning. We empirically demonstrate the effectiveness of our method in CE and
non-CE models including latent Dirichlet allocation, Bayesian logistic
regression, and sigmoid belief networks, evaluated on real-world datasets.
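
As an illustration of perturbing sufficient statistics, the sketch below privatises the posterior of the simplest conjugate model, a Beta-Bernoulli model with no latent variables, so the sufficient statistic is just the count of ones. It is an assumption-laden toy and not the framework's algorithm: the function name `private_beta_posterior` is hypothetical, the Laplace mechanism is swapped in for simplicity, the dataset size is treated as public, and neighbouring datasets are assumed to differ in one record.

```python
import numpy as np

def private_beta_posterior(x, epsilon, a0=1.0, b0=1.0, rng=None):
    """Beta posterior parameters computed from a noised sufficient statistic.

    x is a 1-D array of 0/1 observations. Replacing one record changes the
    count of ones by at most 1, so Laplace noise with scale 1/epsilon makes
    the released count epsilon-differentially private (assumption: the
    number of records n is public).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    n = x.size
    noisy_count = x.sum() + rng.laplace(loc=0.0, scale=1.0 / epsilon)
    # Clipping is post-processing: it keeps the parameters valid at no privacy cost.
    noisy_count = float(np.clip(noisy_count, 0.0, n))
    return a0 + noisy_count, b0 + (n - noisy_count)
```

Everything downstream of the noisy count, including the posterior itself, inherits the same privacy guarantee by post-processing.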

We develop a privatised stochastic variational inference method for Latent
Dirichlet Allocation (LDA). The iterative nature of stochastic variational
inference presents challenges: multiple iterations are required to obtain
accurate posterior distributions, yet each iteration increases the amount of
noise that must be added to achieve a reasonable degree of privacy. We propose
a practical algorithm that overcomes this challenge by combining: (1) an
improved composition method for differential privacy, called the moments
accountant, which provides a tight bound on the privacy cost of multiple
variational inference iterations and thus significantly decreases the amount of
additive noise; and (2) privacy amplification resulting from subsampling of
large-scale data. Because we focus on conjugate exponential family models, our
private variational inference privatises all posterior distributions simply by
perturbing the expected sufficient statistics. Using Wikipedia
data, we illustrate the effectiveness of our algorithm for large-scale data.
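
The effect of subsampling on the privacy cost can be sketched with the standard amplification bound: if a mechanism is epsilon-differentially private and is run on a random subsample drawn with sampling rate q, the overall guarantee improves to log(1 + q(exp(epsilon) - 1)). This is the textbook per-iteration bound, not the moments accountant itself, and the function name `amplified_epsilon` is hypothetical.

```python
import math

def amplified_epsilon(epsilon, q):
    """Privacy parameter after running an epsilon-DP mechanism on a subsample.

    q is the sampling rate (probability that a given record is included).
    Uses the standard bound eps' = log(1 + q * (exp(eps) - 1)).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("sampling rate q must lie in [0, 1]")
    # log1p/expm1 keep the computation accurate for small epsilon and q.
    return math.log1p(q * math.expm1(epsilon))
```

For small q the amplified value is roughly q times (exp(epsilon) - 1), which is why small mini-batches drawn from large datasets reduce the noise needed per iteration.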

We present extraction of tree structures, such as airways, from image data as
a graph refinement task. To this end, we propose a graph autoencoder model
that uses an encoder based on graph neural networks (GNNs) to learn embeddings
from input node features and a decoder to predict connections between nodes.
Performance of the GNN model is compared with mean-field networks in their
ability to extract airways from 3D chest CT scans.

We present tree extraction in 3D images as a graph refinement task, that of
obtaining a subgraph from an overcomplete input graph. To this end, we
formulate an approximate Bayesian inference framework on undirected graphs
using mean field approximation (MFA). Mean field networks are used for
inference based on the interpretation that iterations of MFA can be seen as
feedforward operations in a neural network. This allows us to learn the model
parameters from training data using the backpropagation algorithm. We demonstrate
the usefulness of the model for extracting airway trees from 3D chest CT data. We first
obtain probability images using a voxel classifier that distinguishes airways
from background and use Bayesian smoothing to model individual airway branches.
This yields joint Gaussian density estimates of position, orientation and
scale as node features of the input graph. Performance of the method is
compared with two methods: the first uses probability images from a trained
voxel classifier with region growing, which is similar to one of the best
performing methods in the EXACT'09 airway challenge, and the second method is based
on Bayesian smoothing on these probability images. Using centerline distance as
the error measure, the presented method shows significant improvement compared to
these two methods.

We propose Graphical Generative Adversarial Networks (Graphical-GAN) to model
structured data. Graphical-GAN combines the power of Bayesian networks in
compactly representing the dependency structures among random variables with
that of generative adversarial networks in learning expressive dependency
functions. We introduce a structured recognition model to infer the posterior
distribution of latent variables given observations. We propose two alternative
divergence minimization approaches to learn the generative model and
recognition model jointly. The first one treats all variables as a whole, while
the second one utilizes the structural information by checking the individual
local factors defined by the generative model and works better in practice.
Finally, we present two important instances of Graphical-GAN, i.e. Gaussian
Mixture GAN (GMGAN) and State Space GAN (SSGAN), which can successfully learn
the discrete and temporal structures on visual datasets, respectively.

Variational inference relies on flexible approximate posterior distributions.
Normalizing flows provide a general recipe to construct flexible variational
posteriors. We introduce Sylvester normalizing flows, which can be seen as a
generalization of planar flows. Sylvester normalizing flows remove the
well-known single-unit bottleneck from planar flows, making a single
transformation much more flexible. We compare the performance of Sylvester
normalizing flows against planar flows and inverse autoregressive flows and
demonstrate that they compare favorably on several datasets.
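
The reason a multi-unit transformation can stay cheap is Sylvester's determinant identity, det(I_D + A C) = det(I_M + C A), which replaces a D x D determinant with an M x M one. The sketch below assumes the transformation has the form z' = z + A tanh(B z + b) with A of shape (D, M) and B of shape (M, D); the function name `sylvester_flow` is hypothetical, and no constraint is imposed here to guarantee invertibility.

```python
import numpy as np

def sylvester_flow(z, A, B, b):
    """Apply z' = z + A tanh(B z + b) and return (z', log|det Jacobian|).

    Shapes: z (D,), A (D, M), B (M, D), b (M,).
    The Jacobian is I_D + A diag(h') B, with h' the derivative of tanh.
    """
    pre = B @ z + b
    z_new = z + A @ np.tanh(pre)
    h_prime = 1.0 - np.tanh(pre) ** 2
    M = A.shape[1]
    # Sylvester's identity: det(I_D + A diag(h') B) = det(I_M + diag(h') B A).
    small = np.eye(M) + h_prime[:, None] * (B @ A)
    _, logabsdet = np.linalg.slogdet(small)
    return z_new, logabsdet
```

With M = 1 this reduces to the planar-flow form, where A and B are a single column and a single row.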

The effectiveness of Convolutional Neural Networks stems in large part from
their ability to exploit the translation invariance that is inherent in many
learning problems. Recently, it was shown that CNNs can exploit other
invariances, such as rotation invariance, by using group convolutions instead
of planar convolutions. However, for reasons of performance and ease of
implementation, it has been necessary to limit the group convolution to
transformations that can be applied to the filters without interpolation. Thus,
for images with square pixels, only integer translations, rotations by
multiples of 90 degrees, and reflections are admissible.
Whereas the square tiling provides a 4-fold rotational symmetry, a hexagonal
tiling of the plane has a 6-fold rotational symmetry. In this paper we show how
one can efficiently implement planar convolution and group convolution over
hexagonal lattices, by reusing existing highly optimized convolution routines.
We find that, due to the reduced anisotropy of hexagonal filters, planar
HexaConv provides better accuracy than planar convolution with square filters,
given a fixed parameter budget. Furthermore, we find that the increased degree
of symmetry of the hexagonal grid increases the effectiveness of group
convolutions, by allowing for more parameter sharing. We show that our method
significantly outperforms conventional CNNs on the AID aerial scene
classification dataset, even outperforming ImageNet pretrained models.

Multiple instance learning (MIL) is a variation of supervised learning where
a single class label is assigned to a bag of instances. In this paper, we state
the MIL problem as learning the Bernoulli distribution of the bag label where
the bag label probability is fully parameterized by neural networks.
Furthermore, we propose a neural network-based permutation-invariant
aggregation operator that corresponds to the attention mechanism. Notably, an
application of the proposed attention-based operator provides insight into the
contribution of each instance to the bag label. We show empirically that our
approach achieves comparable performance to the best MIL methods on benchmark
MIL datasets and it outperforms other methods on an MNIST-based MIL dataset and
two real-life histopathology datasets without sacrificing interpretability.
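
A minimal sketch of an attention-based aggregation operator follows, assuming the weights take the form a_k = softmax_k(w^T tanh(V h_k)) and the bag embedding is the weighted sum of instance embeddings. The function name `attention_pool` and the parameter shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def attention_pool(H, V, w):
    """Aggregate K instance embeddings into one bag embedding.

    H: (K, D) instance embeddings, V: (L, D) and w: (L,) attention parameters.
    Returns the bag embedding (D,) and the attention weights (K,).
    """
    scores = np.tanh(H @ V.T) @ w          # one scalar score per instance
    scores = scores - scores.max()         # stabilise the softmax
    a = np.exp(scores)
    a = a / a.sum()                        # weights are positive and sum to 1
    return a @ H, a
```

Because each score depends only on its own instance and the result is a weighted sum, reordering the instances leaves the bag embedding unchanged, and the weights indicate how much each instance contributes.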

Many different methods to train deep generative models have been introduced
in the past. In this paper, we propose to extend the variational autoencoder
(VAE) framework with a new type of prior which we call "Variational Mixture of
Posteriors" prior, or VampPrior for short. The VampPrior consists of a mixture
distribution (e.g., a mixture of Gaussians) with components given by
variational posteriors conditioned on learnable pseudo-inputs. We further
extend this prior to a two-layer hierarchical model and show that this
architecture, with a coupled prior and posterior, learns significantly better
models. The model also avoids the usual local optima issues related to useless
latent dimensions that plague VAEs. We provide empirical studies on six
datasets, namely, static and binary MNIST, OMNIGLOT, Caltech 101 Silhouettes,
Frey Faces and Histopathology patches, and show that applying the hierarchical
VampPrior delivers state-of-the-art results on all datasets in the unsupervised
permutation-invariant setting, and the best results or results comparable to SOTA
methods for the approach with convolutional networks.
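
To make the mixture prior concrete, the sketch below evaluates the log-density of an equally weighted mixture of diagonal Gaussians, where each component stands for the variational posterior at one pseudo-input. It assumes the encoder outputs (means and log-variances at the K pseudo-inputs) are already available as arrays; the function name `log_vamp_prior` is hypothetical.

```python
import numpy as np

def log_vamp_prior(z, means, log_vars):
    """Log-density of p(z) = (1/K) * sum_k N(z; means[k], diag(exp(log_vars[k]))).

    z: (D,), means and log_vars: (K, D), assumed to be encoder outputs
    evaluated at K learnable pseudo-inputs.
    """
    diff = z[None, :] - means
    log_comp = -0.5 * np.sum(
        np.log(2.0 * np.pi) + log_vars + diff ** 2 / np.exp(log_vars), axis=1
    )
    # Numerically stable log of the mean of exp(log_comp).
    m = log_comp.max()
    return m + np.log(np.mean(np.exp(log_comp - m)))
```

In training, this quantity would replace the standard-normal prior term in the variational lower bound, so the pseudo-inputs receive gradients through the encoder.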

Convolutional Neural Networks (CNNs) have become the method of choice for
learning problems involving 2D planar images. However, a number of problems of
recent interest have created a demand for models that can analyze spherical
images. Examples include omnidirectional vision for drones, robots, and
autonomous cars, molecular regression problems, and global weather and climate
modelling. A naive application of convolutional networks to a planar projection
of the spherical signal is destined to fail, because the space-varying
distortions introduced by such a projection will make translational weight
sharing ineffective.
In this paper we introduce the building blocks for constructing spherical
CNNs. We propose a definition for the spherical cross-correlation that is both
expressive and rotation-equivariant. The spherical correlation satisfies a
generalized Fourier theorem, which allows us to compute it efficiently using a
generalized (non-commutative) Fast Fourier Transform (FFT) algorithm. We
demonstrate the computational efficiency, numerical accuracy, and effectiveness
of spherical CNNs applied to 3D model recognition and atomization energy
regression.

Interacting systems are prevalent in nature, from dynamical systems in
physics to complex societal dynamics. The interplay of components can give rise
to complex behavior, which can often be explained using a simple model of the
system's constituent parts. In this work, we introduce the neural relational
inference (NRI) model: an unsupervised model that learns to infer interactions
while simultaneously learning the dynamics purely from observational data. Our
model takes the form of a variational autoencoder, in which the latent code
represents the underlying interaction graph and the reconstruction is based on
graph neural networks. In experiments on simulated physical systems, we show
that our NRI model can accurately recover ground-truth interactions in an
unsupervised manner. We further demonstrate that we can find an interpretable
structure and predict complex dynamics in real motion capture and sports
tracking data.

The success of convolutional networks in learning problems involving planar
signals such as images is due to their ability to exploit the translation
symmetry of the data distribution through weight sharing. Many areas of science
and engineering deal with signals with other symmetries, such as rotation
invariant data on the sphere. Examples include climate and weather science,
astrophysics, and chemistry. In this paper we present spherical convolutional
networks. These networks use convolutions on the sphere and rotation group,
which results in rotational weight sharing and rotation equivariance. Using a
synthetic spherical MNIST dataset, we show that spherical convolutional
networks are very effective at dealing with rotationally invariant
classification problems.

Compression and computational efficiency in deep learning have become a
problem of great significance. In this work, we argue that the most principled
and effective way to attack this problem is by taking a Bayesian point of view,
where through sparsity-inducing priors we prune large parts of the network. We
introduce two novelties in this paper: 1) we use hierarchical priors to prune
nodes instead of individual weights, and 2) we use the posterior uncertainties
to determine the optimal fixed point precision to encode the weights. Both
factors significantly contribute to achieving the state of the art in terms of
compression rates, while still staying competitive with methods designed to
optimize for speed or energy efficiency.

We investigate the problem of learning representations that are invariant to
certain nuisance or sensitive factors of variation in the data while retaining
as much of the remaining information as possible. Our model is based on a
variational autoencoding architecture with priors that encourage independence
between sensitive and latent factors of variation. Any subsequent processing,
such as classification, can then be performed on this purged latent
representation. To remove any remaining dependencies we incorporate an
additional penalty term based on the "Maximum Mean Discrepancy" (MMD) measure.
We discuss how these architectures can be efficiently trained on data and show
in experiments that this method is more effective than previous work in
removing unwanted sources of variation while maintaining informative latent
representations.
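
The MMD penalty compares two sets of samples, for example latent codes from the two groups defined by a binary sensitive factor. The sketch below computes the biased estimate of the squared MMD with an RBF kernel; the kernel choice, the bandwidth parameter `gamma`, and the function names are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Pairwise k(x, y) = exp(-gamma * ||x - y||^2) for rows of X and Y."""
    sq = np.sum(X ** 2, axis=1)[:, None] + np.sum(Y ** 2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clamp tiny negative round-off

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy between X and Y."""
    return (
        rbf_kernel(X, X, gamma).mean()
        + rbf_kernel(Y, Y, gamma).mean()
        - 2.0 * rbf_kernel(X, Y, gamma).mean()
    )
```

Adding a multiple of this quantity to the training objective pushes the two sets of latent codes towards the same distribution.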

In this paper, we propose a new volume-preserving flow and show that it
performs similarly to the linear general normalizing flow. The idea is to
enrich a linear Inverse Autoregressive Flow by introducing multiple
lower-triangular matrices with ones on the diagonal and combining them using a
convex combination. In the experimental studies on MNIST and Histopathology
data we show that the proposed approach outperforms other volume-preserving
flows and is competitive with the current state-of-the-art linear normalizing flow.
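
Why the combination stays volume-preserving can be checked directly: a convex combination of lower-triangular matrices with unit diagonals is again lower-triangular with a unit diagonal, so its determinant is 1. The sketch below uses softmax weights as one way to obtain a convex combination; the function name `convex_combination_flow` and the use of fixed logits are assumptions for illustration.

```python
import numpy as np

def convex_combination_flow(z, Ls, logits):
    """Apply z' = (sum_k y_k L_k) z with convex weights y = softmax(logits).

    Ls: (K, D, D) lower-triangular matrices with ones on the diagonal.
    Returns the transformed vector and the combined matrix.
    """
    y = np.exp(logits - logits.max())
    y = y / y.sum()                      # non-negative weights summing to 1
    L = np.tensordot(y, Ls, axes=1)      # weighted sum over the K matrices
    return L @ z, L
```

Since the combined matrix has determinant 1, the log-determinant term of the flow is zero and needs no extra computation.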

Much of the recent research on solving iterative inference problems focuses
on moving away from hand-chosen inference algorithms and towards learned
inference. In the latter, the inference process is unrolled in time and
interpreted as a recurrent neural network (RNN) which allows for joint learning
of model and inference parameters with backpropagation through time. In this
framework, the RNN architecture is directly derived from a hand-chosen
inference algorithm, effectively limiting its capabilities. We propose a
learning framework, called Recurrent Inference Machines (RIM), in which we turn
algorithm construction the other way round: Given data and a task, train an RNN
to learn an inference algorithm. Because RNNs are Turing complete [1, 2], they
are capable of implementing any inference algorithm. The framework allows for an
abstraction which removes the need for domain knowledge. We demonstrate in
several image restoration experiments that this abstraction is effective,
allowing us to achieve state-of-the-art performance on image denoising and
super-resolution tasks and superior across-task generalization.

The vast majority of natural sensory data is temporally redundant. Video
frames or audio samples which are sampled at nearby points in time tend to have
similar values. Typically, deep learning algorithms take no advantage of this
redundancy to reduce computation. This can be an obscene waste of energy. We
present a variant on backpropagation for neural networks in which computation
scales with the rate of change of the data, not the rate at which we process
the data. We do this by having neurons communicate a combination of their
state and their temporal change in state. Intriguingly, this simple
communication rule gives rise to units that resemble biologically inspired leaky
integrate-and-fire neurons, and to a weight-update rule that is equivalent to a
form of Spike-Timing Dependent Plasticity (STDP), a synaptic learning rule
observed in the brain. We demonstrate that on MNIST and a temporal variant of
MNIST, our algorithm performs about as well as a Multilayer Perceptron trained
with backpropagation, despite only communicating discrete values between
layers.

We reinterpret multiplicative noise in neural networks as auxiliary random
variables that augment the approximate posterior in a variational setting for
Bayesian neural networks. We show that through this interpretation it is both
efficient and straightforward to improve the approximation by employing
normalizing flows while still allowing for local reparametrizations and a
tractable lower bound. In experiments we show that with this new approximation
we can significantly improve upon classical mean field for Bayesian neural
networks in both predictive accuracy and predictive uncertainty.

We present a method for visualising the response of a deep neural network to
a specific input. For image data, for instance, our method will highlight areas
that provide evidence in favor of, and against, choosing a certain class. The
method overcomes several shortcomings of previous methods and provides great
additional insight into the decision making process of convolutional networks,
which is important both to improve models and to accelerate the adoption of
such methods in e.g. medicine. In experiments on ImageNet data, we illustrate
how the method works and can be applied in different ways to understand deep
neural nets.

In this paper we revisit matrix completion for recommender systems from the
point of view of link prediction on graphs. Interaction data such as movie
ratings can be represented by a bipartite user-item graph with labeled edges
representing observed ratings. Building on recent progress in deep learning on
graph-structured data, we propose a graph autoencoder framework based on
differentiable message passing on the bipartite interaction graph. This
framework can be viewed as an important first step towards end-to-end learning
in settings where the interaction data is integrated into larger graphs such as
social networks or knowledge graphs, circumventing the need for multistage
frameworks. Our model achieves competitive performance on standard
collaborative filtering benchmarks, significantly outperforming related methods
in a recommendation task with side information.

Knowledge bases play a crucial role in many applications, for example
question answering and information retrieval. Despite the great effort invested
in creating and maintaining them, even the largest representatives (e.g., Yago,
DBPedia or Wikidata) are highly incomplete. We introduce relational graph
convolutional networks (R-GCNs) and apply them to two standard knowledge base
completion tasks: link prediction (recovery of missing facts,
i.e. subject-predicate-object triples) and entity classification (recovery of
missing attributes of entities). R-GCNs are a generalization of graph
convolutional networks, a recent class of neural networks operating on graphs,
and are developed specifically to deal with highly multi-relational data,
characteristic of realistic knowledge bases. Our methods achieve competitive
performance on standard benchmarks for both tasks, demonstrating especially
promising results on the challenging FB15k-237 subset of Freebase.
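
One way to picture a relational graph convolution is as a sum of per-relation neighbourhood averages plus a self-connection. The sketch below assumes the update h_i' = ReLU(W_0 h_i + sum_r (1/c_{i,r}) sum_{j in N_r(i)} W_r h_j) with c_{i,r} the number of neighbours of node i under relation r, and uses dense adjacency matrices for readability. The function name `rgcn_layer` is hypothetical, and weight-sharing schemes across relations are omitted.

```python
import numpy as np

def rgcn_layer(H, adjs, W_rel, W_self):
    """One relational graph convolution layer on dense inputs.

    H: (N, F) node features.
    adjs: list of (N, N) arrays, one per relation; adjs[r][i, j] = 1 means
          node j is a neighbour of node i under relation r.
    W_rel: list of (F, F_out) weight matrices, one per relation.
    W_self: (F, F_out) weight matrix for the self-connection.
    """
    out = H @ W_self
    for A, W in zip(adjs, W_rel):
        deg = A.sum(axis=1, keepdims=True)
        A_norm = A / np.maximum(deg, 1.0)   # rows without neighbours stay zero
        out = out + A_norm @ H @ W
    return np.maximum(out, 0.0)             # ReLU
```

Each relation type gets its own transformation, which is what distinguishes this layer from a plain graph convolution on a single adjacency matrix.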

Learning individual-level causal effects from observational data, such as
inferring the most effective medication for a specific patient, is a problem of
growing importance for policy makers. The most important aspect of inferring
causal effects from observational data is the handling of confounders, factors
that affect both an intervention and its outcome. A carefully designed
observational study attempts to measure all important confounders. However,
even if one does not have direct access to all confounders, there may exist
noisy and uncertain measurements of proxies for confounders. We build on recent
advances in latent variable modelling to simultaneously estimate the unknown
latent space summarizing the confounders and the causal effect. Our method is
based on Variational Autoencoders (VAE) which follow the causal structure of
inference with proxies. We show our method is significantly more robust than
existing methods, and matches the state of the art on previous benchmarks
focused on individual treatment effects.

The success of deep learning in numerous application domains has created the
desire to run and train deep networks on mobile devices. This, however, conflicts with
their computation-, memory-, and energy-intensive nature, leading to a growing
interest in compression. Recent work by Han et al. (2015a) proposes a pipeline
that involves retraining, pruning and quantization of neural network weights,
obtaining state-of-the-art compression rates. In this paper, we show that
competitive compression rates can be achieved by using a version of soft
weight-sharing (Nowlan & Hinton, 1992). Our method achieves both quantization
and pruning in one simple (re)training procedure. This point of view also
exposes the relation between compression and the minimum description length
(MDL) principle.

We present a scalable approach for semi-supervised learning on
graph-structured data that is based on an efficient variant of convolutional
neural networks which operate directly on graphs. We motivate the choice of our
convolutional architecture via a localized first-order approximation of
spectral graph convolutions. Our model scales linearly in the number of graph
edges and learns hidden layer representations that encode both local graph
structure and features of nodes. In a number of experiments on citation
networks and on a knowledge graph dataset we demonstrate that our approach
outperforms related methods by a significant margin.
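
A single layer of such a graph convolution can be sketched as follows, assuming the commonly used propagation rule H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W), where A is the adjacency matrix, I adds self-loops, and D is the degree matrix of A + I. The function name `gcn_layer` is hypothetical, and dense matrices are used only for readability; scaling linearly in the number of edges requires a sparse representation.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph convolution layer on dense inputs.

    A: (N, N) symmetric adjacency matrix without self-loops.
    H: (N, F) node features, W: (F, F_out) layer weights.
    """
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    # Symmetric normalisation D^{-1/2} A_hat D^{-1/2}.
    A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)    # ReLU
```

Each layer mixes a node's features with those of its direct neighbours, so stacking k layers lets information travel k hops across the graph.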

This article presents the prediction difference analysis method for
visualizing the response of a deep neural network to a specific input. When
classifying images, the method highlights areas in a given input image that
provide evidence for or against a certain class. It overcomes several
shortcomings of previous methods and provides great additional insight into the
decision making process of classifiers. Making neural network decisions
interpretable through visualization is important both to improve models and to
accelerate the adoption of black-box classifiers in application areas such as
medicine. We illustrate the method in experiments on natural images (ImageNet
data), as well as medical images (MRI brain scans).