
Methods of transfer learning try to combine knowledge from several related
tasks (or domains) to improve performance on a test task. Inspired by causal
methodology, we relax the usual covariate shift assumption and assume that it
holds true for a subset of predictor variables: the conditional distribution of
the target variable given this subset of predictors is invariant over all
tasks. We show how this assumption can be motivated from ideas in the field of
causality. We focus on the problem of Domain Generalization, in which no
examples from the test task are observed. We prove that in an adversarial
setting using this subset for prediction is optimal in Domain Generalization;
we further provide examples, in which the tasks are sufficiently diverse and
the estimator therefore outperforms pooling the data, even on average. If
examples from the test task are available, we also provide a method to transfer
knowledge from the training tasks and exploit all available features for
prediction. However, we provide no guarantees for this method. We introduce a
practical method which allows for automatic inference of the above subset and
provide corresponding code. We present results on synthetic data sets and a
gene deletion data set.

Structural causal models (SCMs), also known as (nonparametric) structural
equation models (SEMs), are widely used for causal modeling purposes. A large
body of theoretical results is available for the special case in which cycles
are absent (i.e., acyclic SCMs, also known as recursive SEMs). However, in many
application domains cycles are abundantly present, for example in the form of
feedback loops. In this paper, we provide a general and rigorous theory of
cyclic SCMs. The paper consists of two parts: the first part gives a rigorous
treatment of structural causal models, dealing with measuretheoretic and other
complications that arise in the presence of cycles. In contrast with the
acyclic case, in cyclic SCMs solutions may no longer exist, or if they exist,
they may no longer be unique, or even measurable in general. We give several
sufficient and necessary conditions for the existence of (unique) measurable
solutions. We show how causal reasoning proceeds in these models and how this
differs from the acyclic case. Moreover, we give an overview of the Markov
properties that hold for cyclic SCMs. In the second part, we address the
question of how one can marginalize an SCM (possibly with cycles) to a subset
of the endogenous variables. We show that under a certain condition, one can
effectively remove a subset of the endogenous variables from the model, leading
to a more parsimonious marginal SCM that preserves the causal and
counterfactual semantics of the original SCM on the remaining variables.
Moreover, we show how the marginalization relates to the latent projection and
to latent confounders, i.e. latent common causes.

We lay theoretical foundations for new database release mechanisms that allow
thirdparties to construct consistent estimators of population statistics,
while ensuring that the privacy of each individual contributing to the database
is protected. The proposed framework rests on two main ideas. First, releasing
(an estimate of) the kernel mean embedding of the data generating random
variable instead of the database itself still allows thirdparties to construct
consistent estimators of a wide class of population statistics. Second, the
algorithm can satisfy the definition of differential privacy by basing the
released kernel mean embedding on entirely synthetic data points, while
controlling accuracy through the metric available in a Reproducing Kernel
Hilbert Space. We describe two instantiations of the proposed framework,
suitable under different scenarios, and prove theoretical results guaranteeing
differential privacy of the resulting algorithms and the consistency of
estimators constructed from their outputs.

The modulation transfer function (MTF) is widely used to characterise the
performance of optical systems. Measuring it is costly and it is thus rarely
available for a given lens specimen. Instead, MTFs based on simulations or, at
best, MTFs measured on other specimens of the same lens are used. Fortunately,
images recorded through an optical system contain ample information about its
MTF, only that it is confounded with the statistics of the images. This work
presents a method to estimate the MTF of camera lens systems directly from
photographs, without the need for expensive equipment. We use a custom grid
display to accurately measure the point response of lenses to acquire ground
truth training data. We then use the same lenses to record natural images and
employ a datadriven supervised learning approach using a convolutional neural
network to estimate the MTF on small image patches, aggregating the information
into MTF charts over the entire field of view. It generalises to unseen lenses
and can be applied for single photographs, with the performance improving if
multiple photographs are available.

Clustering is a cornerstone of unsupervised learning which can be thought as
disentangling multiple generative mechanisms underlying the data. In this paper
we introduce an algorithmic framework to train mixtures of implicit generative
models which we particularize for variational autoencoders. Relying on an
additional set of discriminators, we propose a competitive procedure in which
the models only need to approximate the portion of the data distribution from
which they can produce realistic samples. As a byproduct, each model is simpler
to train, and a clustering interpretation arises naturally from the
partitioning of the training points among the models. We empirically show that
our approach splits the training distribution in a reasonable way and increases
the quality of the generated samples.

We consider a bivariate time series $(X_t,Y_t)$ that is given by a simple
linear autoregressive model. Assuming that the equations describing each
variable as a linear combination of past values are considered structural
equations, there is a clear meaning of how intervening on one particular $X_t$
influences $Y_{t'}$ at later times $t'>t$. In the present work, we describe
conditions under which one can define a causal model between variables that are
coarsegrained in time, thus admitting statements like `setting $X$ to $x$
changes $Y$ in a certain way' without referring to specific time instances. We
show that particularly simple statements follow in the frequency domain, thus
providing meaning to interventions on frequencies.

Two popular examples of firstorder optimization methods over linear spaces
are coordinate descent and matching pursuit algorithms, with their randomized
variants. While the former targets the optimization by moving along
coordinates, the latter considers a generalized notion of directions.
Exploiting the connection between the two algorithms, we present a unified
analysis of both, providing affine invariant sublinear $\mathcal{O}(1/t)$ rates
on smooth objectives and linear convergence on strongly convex objectives. As a
byproduct of our affine invariant analysis of matching pursuit, our rates for
steepest coordinate descent are the tightest known. Furthermore, we show the
first accelerated convergence rate $\mathcal{O}(1/t^2)$ for matching pursuit on
convex objectives.

We study machine learningbased assistants that support coordination between
humans in congested facilities via congestion forecasts. In our theoretical
analysis, we use game theory to study how an assistant's forecast that
influences the outcome relates to Nash equilibria, and how they can be reached
quickly in congestion gamelike settings. Using information theory, we
investigate approximations to given social choice functions under privacy
constraints w.r.t. assistants. And we study dynamics and training for a
specific exponential smoothingbased assistant via a linear dynamical systems
and causal analysis. We report experiments conducted on a real congested
cafeteria with about 400 daily customers where we evaluate this assistant and
prediction baselines to gain further insight.

The amount of digitally available but heterogeneous information about the
world is remarkable, and new technologies such as selfdriving cars, smart
homes, or the internet of things may further increase it. In this paper we
present preliminary ideas about certain aspects of the problem of how such
heterogeneous information can be harnessed by autonomous agents. After
discussing potentials and limitations of some existing approaches, we
investigate how \emph{experiments} can help to obtain a better understanding of
the problem. Specifically, we present a simple agent that integrates video data
from a different agent, and implement and evaluate a version of it on the novel
experimentation platform \emph{Malmo}. The focus of a second investigation is
on how information about the hardware of different agents, the agents' sensory
data, and \emph{causal} information can be utilized for knowledge transfer
between agents and subsequently more dataefficient decision making. Finally,
we discuss potential future steps w.r.t.\ theory and experimentation, and
formulate open questions.

The goal in extreme multilabel classification is to learn a classifier which
can assign a small subset of relevant labels to an instance from an extremely
large set of target labels. Datasets in extreme classification exhibit a long
tail of labels which have small number of positive training instances. In this
work, we pose the learning task in extreme classification with large number of
taillabels as learning in the presence of adversarial perturbations. This view
motivates a robust optimization framework and equivalence to a corresponding
regularized objective.
Under the proposed robustness framework, we demonstrate efficacy of Hamming
loss for taillabel detection in extreme classification. The equivalent
regularized objective, in combination with proximal gradient based
optimization, performs better than stateoftheart methods on propensity
scored versions of precision@k and nDCG@k(upto 20% relative improvement over
PFastreXML  a leading treebased approach and 60% relative improvement over
SLEEC  a leading labelembedding approach). Furthermore, we also highlight the
suboptimality of a sparse solver in a widely used package for largescale
linear classification, which is interesting in its own right. We also
investigate the spectral properties of label graphs for providing novel
insights towards understanding the conditions governing the performance of
Hamming loss based onevsrest scheme vis\`avis label embedding methods.

Generative adversarial networks (GANs) have been shown to produce realistic
samples from highdimensional distributions, but training them is considered
hard. A possible explanation for training instabilities is the inherent
imbalance between the networks: While the discriminator is trained directly on
both real and fake samples, the generator only has control over the fake
samples it produces since the real data distribution is fixed by the choice of
a given dataset. We propose a simple modification that gives the generator
control over the real samples which leads to a tempered learning process for
both generator and discriminator. The real data distribution passes through a
lens before being revealed to the discriminator, balancing the generator and
discriminator by gradually revealing more detailed features necessary to
produce highquality results. The proposed module automatically adjusts the
learning process to the current strength of the networks, yet is generic and
easy to add to any GAN variant. In a number of experiments, we show that this
can improve quality, stability and/or convergence speed across a range of
different GAN architectures (DCGAN, LSGAN, WGANGP).

We address the problem of inferring the causal relation between two variables
by comparing the leastsquares errors of the predictions in both possible
causal directions. Under the assumption of an independence between the function
relating cause and effect, the conditional noise distribution, and the
distribution of the cause, we show that the errors are smaller in causal
direction if both variables are equally scaled and the causal relation is close
to deterministic. Based on this, we provide an easily applicable algorithm that
only requires a regression in both possible causal directions and a comparison
of the errors. The performance of the algorithm is compared with different
related causal inference methods in various artificial and realworld data
sets.

Statistical learning relies upon data sampled from a distribution, and we
usually do not care what actually generated it in the first place. From the
point of view of causal modeling, the structure of each distribution is induced
by physical mechanisms that give rise to dependencies between observables.
Mechanisms, however, can be meaningful autonomous modules of generative models
that make sense beyond a particular entailed data distribution, lending
themselves to transfer between problems. We develop an algorithm to recover a
set of independent (inverse) mechanisms from a set of transformed data points.
The approach is unsupervised and based on a set of experts that compete for
data generated by the mechanisms, driving specialization. We analyze the
proposed method in a series of experiments on image data. Each expert learns to
map a subset of the transformed data back to a reference distribution. The
learned mechanisms generalize to novel domains. We discuss implications for
transfer learning and links to recent trends in generative modeling.

Over the past four years, neural networks have proven vulnerable to
adversarial images: targeted but imperceptible image perturbations lead to
drastically different predictions. We show that adversarial vulnerability
increases with the gradients of the training objective when seen as a function
of the inputs. For most current network architectures, we prove that the
$\ell_1$norm of these gradients grows as the square root of the inputsize.
These nets therefore become increasingly vulnerable with growing image size.
Over the course of our analysis we rediscover and generalize
doublebackpropagation, a technique that penalizes large gradients in the loss
surface to reduce adversarial vulnerability and increase generalization
performance. We show that this regularizationscheme is equivalent at first
order to training with adversarial noise. Finally, we demonstrate that
replacing strided by averagepooling layers decreases adversarial
vulnerability. Our proofs rely on the network's weightdistribution at
initialization, but extensive experiments confirm their conclusions after
training.

Recent work on fairness in machine learning has focused on various
statistical discrimination criteria and how they trade off. Most of these
criteria are observational: They depend only on the joint distribution of
predictor, protected attribute, features, and outcome. While convenient to work
with, observational criteria have severe inherent limitations that prevent them
from resolving matters of fairness conclusively.
Going beyond observational criteria, we frame the problem of discrimination
based on protected attributes in the language of causal reasoning. This
viewpoint shifts attention from "What is the right fairness criterion?" to
"What do we want to assume about the causal data generating process?" Through
the lens of causality, we make several contributions. First, we crisply
articulate why and when observational criteria fail, thus formalizing what was
before a matter of opinion. Second, our approach exposes previously ignored
subtleties and why they are fundamental to the problem. Finally, we put forward
natural causal nondiscrimination criteria and develop algorithms that satisfy
them.

Difference imaging or image subtraction is a method that measures
differential photometry by matching the pointing and pointspread function
(PSF) between image frames. It is used for the detection of timevariable
phenomena. Here we present a new category of methodCPM Difference Imaging,
in which differences are not measured between matched images but instead
between image frames and a datadriven predictive model that has been designed
only to predict the pointing, PSF, and detector effects but not astronomical
variability. In CPM Difference Imaging each pixel is modelled by the Causal
Pixel Model (CPM) originally built for modeling Kepler data, in which pixel
values are predicted by a linear combination of other pixels at the same epoch
but far enough away such that these pixels are causally disconnected,
astrophysically. It does not require that the user have any explicit model or
description of the pointing or pointspread function of any of the images. Its
principal drawback is thatin its current formit requires an imaging
campaign with many epochs and fairly stable telescope pointing. The method is
applied to simulated data and also the K2 Campaign 9 microlensing data. We show
that CPM Difference Imaging can detect variable objects and produce precise
differentiate photometry in a crowded field. CPM Difference Imaging is capable
of producing image differences at nearly photonnoise precision.

As handheld video cameras are now commonplace and available in every
smartphone, images and videos can be recorded almost everywhere at anytime.
However, taking a quick shot frequently yields a blurry result due to unwanted
camera shake during recording or moving objects in the scene. Removing these
artifacts from the blurry recordings is a highly illposed problem as neither
the sharp image nor the motion blur kernel is known. Propagating information
between multiple consecutive blurry observations can help restore the desired
sharp image or video. Solutions for blind deconvolution based on neural
networks rely on a massive amount of groundtruth data which is hard to
acquire. In this work, we propose an efficient approach to produce a
significant amount of realistic training data and introduce a novel recurrent
network architecture to deblur frames taking temporal information into account,
which can efficiently handle arbitrary spatial and temporal input sizes. We
demonstrate the versatility of our approach in a comprehensive comparison on a
number of challening realworld examples.

Single image superresolution is the task of inferring a highresolution
image from a single lowresolution input. Traditionally, the performance of
algorithms for this task is measured using pixelwise reconstruction measures
such as peak signaltonoise ratio (PSNR) which have been shown to correlate
poorly with the human perception of image quality. As a result, algorithms
minimizing these metrics tend to produce oversmoothed images that lack
highfrequency textures and do not look natural despite yielding high PSNR
values.
We propose a novel application of automated texture synthesis in combination
with a perceptual loss focusing on creating realistic textures rather than
optimizing for a pixelaccurate reproduction of ground truth images during
training. By using feedforward fully convolutional neural networks in an
adversarial training setting, we achieve a significant boost in image quality
at high magnification ratios. Extensive experiments on a number of datasets
show the effectiveness of our approach, yielding stateoftheart results in
both quantitative and qualitative benchmarks.

Complex systems can be modelled at various levels of detail. Ideally, causal
models of the same system should be consistent with one another in the sense
that they agree in their predictions of the effects of interventions. We
formalise this notion of consistency in the case of Structural Equation Models
(SEMs) by introducing exact transformations between SEMs. This provides a
general language to consider, for instance, the different levels of description
in the following three scenarios: (a) models with large numbers of variables
versus models in which the `irrelevant' or unobservable variables have been
marginalised out; (b) microlevel models versus macrolevel models in which the
macrovariables are aggregate features of the microvariables; (c) dynamical
time series models versus models of their stationary behaviour. Our analysis
stresses the importance of well specified interventions in the causal modelling
process and sheds light on the interpretation of cyclic SEMs.

This paper introduces a probabilistic framework for kshot image
classification. The goal is to generalise from an initial largescale
classification task to a separate task comprising new classes and small numbers
of examples. The new approach not only leverages the featurebased
representation learned by a neural network from the initial task
(representational transfer), but also information about the form of the classes
(concept transfer). The concept information is encapsulated in a probabilistic
model for the final layer weights of the neural network which then acts as a
prior when probabilistic kshot learning is performed. Surprisingly, simple
probabilistic models and inference schemes outperform many existing kshot
learning approaches and compare favourably with the stateoftheart method in
terms of errorrate. The new probabilistic methods are also able to accurately
model uncertainty, leading to well calibrated classifiers, and they are easily
extensible and flexible, unlike many recent approaches to kshot learning.

Offpolicy modelfree deep reinforcement learning methods using previously
collected data can improve sample efficiency over onpolicy policy gradient
techniques. On the other hand, onpolicy algorithms are often more stable and
easier to use. This paper examines, both theoretically and empirically,
approaches to merging on and offpolicy updates for deep reinforcement
learning. Theoretical results show that offpolicy updates with a value
function estimator can be interpolated with onpolicy policy gradient updates
whilst still satisfying performance bounds. Our analysis uses control variate
methods to produce a family of policy gradient algorithms, with several
recently proposed algorithms being special cases of this family. We then
provide an empirical comparison of these techniques with the remaining
algorithmic details fixed, and show how different mixing of offpolicy gradient
estimates with onpolicy samples contribute to improvements in empirical
performance. The final algorithm provides a generalization and unification of
existing deep policy gradient techniques, has theoretical guarantees on the
bias introduced by offpolicy updates, and improves on the stateoftheart
modelfree deep RL methods on a number of OpenAI Gym continuous control
benchmarks.

Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) are an
effective method for training generative models of complex data such as natural
images. However, they are notoriously hard to train and can suffer from the
problem of missing modes where the model is not able to produce examples in
certain regions of the space. We propose an iterative procedure, called AdaGAN,
where at every step we add a new component into a mixture model by running a
GAN algorithm on a reweighted sample. This is inspired by boosting algorithms,
where many potentially weak individual predictors are greedily aggregated to
form a strong composite predictor. We prove that such an incremental procedure
leads to convergence to the true distribution in a finite number of steps if
each step is optimal, and convergence at an exponential rate otherwise. We also
illustrate experimentally that this procedure addresses the problem of missing
modes.

Invariance to nuisance transformations is one of the desirable properties of
effective representations. We consider transformations that form a \emph{group}
and propose an approach based on kernel methods to derive local group invariant
representations. Locality is achieved by defining a suitable probability
distribution over the group which in turn induces distributions in the input
feature space. We learn a decision function over these distributions by
appealing to the powerful framework of kernel methods and generate local
invariant random feature maps via kernel approximations. We show uniform
convergence bounds for kernel approximation and provide excess risk bounds for
learning with these features. We evaluate our method on three real datasets,
including Rotated MNIST and CIFAR10, and observe that it outperforms competing
kernel based approaches. The proposed method also outperforms deep CNN on
RotatedMNIST and performs comparably to the recently proposed
groupequivariant CNN.

We introduce a novel framework for adversarial training where the target
distribution is annealed between the uniform distribution and the data
distribution. We posited a conjecture that learning under continuous annealing
in the nonparametric regime is stable irrespective of the divergence measures
in the objective function and proposed an algorithm, dubbed {\ss}GAN, in
corollary. In this framework, the fact that the initial support of the
generative network is the whole ambient space combined with annealing are key
to balancing the minimax game. In our experiments on synthetic data, MNIST, and
CelebA, {\ss}GAN with a fixed annealing schedule was stable and did not suffer
from mode collapse.

We propose to fuse two currently separate research lines on novel therapies
for stroke rehabilitation: braincomputer interface (BCI) training and
transcranial electrical stimulation (TES). Specifically, we show that BCI
technology can be used to learn personalized decoding models that relate the
global configuration of brain rhythms in individual subjects (as measured by
EEG) to their motor performance during 3D reaching movements. We demonstrate
that our models capture substantial acrosssubject heterogeneity, and argue
that this heterogeneity is a likely cause of limited effect sizes observed in
TES for enhancing motor performance. We conclude by discussing how our
personalized models can be used to derive optimal TES parameters, e.g.,
stimulation site and frequency, for individual patients.