
We analyze the variance of stochastic gradients along negative curvature
directions in certain nonconvex machine learning models and show that
stochastic gradients exhibit a strong component along these directions.
Furthermore, we show that, contrary to the case of isotropic noise, this
variance is proportional to the magnitude of the corresponding eigenvalues and
not decreasing in the dimensionality. Based upon this observation we propose a
new assumption under which we show that the injection of explicit, isotropic
noise usually applied to make gradient descent escape saddle points can
successfully be replaced by a simple SGD step. Additionally, and under the
same condition, we derive the first convergence rate for plain SGD to a
second-order stationary point in a number of iterations that is independent of
the problem dimension.

Dark matter in the universe evolves through gravity to form a complex network
of halos, filaments, sheets and voids known as the cosmic web.
Computational models of the underlying physical processes, such as classical
N-body simulations, are extremely resource intensive, as they track the action
of gravity in an expanding universe using billions of particles as tracers of
the cosmic matter distribution. Therefore, upcoming cosmology experiments will
face a computational bottleneck that may limit the exploitation of their full
scientific potential. To address this challenge, we demonstrate the application
of a machine learning technique called Generative Adversarial Networks (GAN) to
learn models that can efficiently generate new, physically realistic
realizations of the cosmic web. Our training set is a small, representative
sample of 2D image snapshots from N-body simulations of size 500 and 100 Mpc.
We show that the GAN-produced results are qualitatively and quantitatively very
similar to the originals. Generation of a new cosmic web realization with a GAN
takes a fraction of a second, compared to the many hours needed by the N-body
technique. We anticipate that GANs will therefore play an important role in
providing extremely fast and precise simulations of the cosmic web in the era of
large cosmological surveys, such as Euclid and LSST.

In implicit models, one often interpolates between sampled points in latent
space. As we show in this paper, care needs to be taken to match the
distributional assumptions on code vectors with the geometry of the
interpolating paths. Otherwise, typical assumptions about the quality and
semantics of in-between points may not be justified. Based on our analysis, we
propose to modify the prior code distribution to put significantly more
probability mass closer to the origin. As a result, linear interpolation paths
are not only shortest paths, but they are also guaranteed to pass through
high-density regions, irrespective of the dimensionality of the latent space.
Experiments on standard benchmark image datasets demonstrate clear visual
improvements in the quality of the generated samples and exhibit more
meaningful interpolation paths.
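
The mismatch between a standard prior and linear interpolation can be illustrated numerically. The sketch below is not the paper's method or experiment; it assumes a standard normal prior over code vectors, and the function name `atypical_midpoint_fraction` and the 1st-percentile threshold are illustrative choices. It measures how often the midpoint of a straight line between two sampled codes has a norm smaller than almost every sampled code.

```python
import numpy as np

def atypical_midpoint_fraction(dim, n_pairs=2000, seed=0):
    """Fraction of interpolation midpoints whose norm falls below the
    1st percentile of endpoint norms (standard normal prior assumed)."""
    rng = np.random.default_rng(seed)
    z0 = rng.standard_normal((n_pairs, dim))
    z1 = rng.standard_normal((n_pairs, dim))
    # Midpoint of the straight line between each pair of codes.
    midpoints = 0.5 * (z0 + z1)
    # Norms that are typical for codes drawn from the prior.
    endpoint_norms = np.linalg.norm(np.vstack([z0, z1]), axis=1)
    threshold = np.percentile(endpoint_norms, 1.0)
    midpoint_norms = np.linalg.norm(midpoints, axis=1)
    return float(np.mean(midpoint_norms < threshold))

if __name__ == "__main__":
    for dim in (2, 100, 1000):
        print(dim, atypical_midpoint_fraction(dim))
```

In low dimensions only a small share of midpoints is flagged, while in high dimensions nearly all of them are, because sample norms concentrate near $\sqrt{d}$ and midpoint norms near $\sqrt{d/2}$. This is the kind of effect that a prior with more mass near the origin is meant to avoid.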

Modeling the Point Spread Function (PSF) of wide-field surveys is vital for
many astrophysical applications and cosmological probes including weak
gravitational lensing. The PSF smears the image of any recorded object and
therefore needs to be taken into account when inferring properties of galaxies
from astronomical images. In the case of cosmic shear, the PSF is one of the
dominant sources of systematic errors and must be treated carefully to avoid
biases in cosmological parameters. Recently, forward modeling approaches to
calibrate shear measurements within the Monte-Carlo Control Loops ($MCCL$)
framework have been developed. These methods typically require simulating a
large number of wide-field images, so the simulations need to be very fast
yet have realistic properties in key features such as the PSF pattern. Hence,
such forward modeling approaches require a very flexible PSF model, which is
quick to evaluate and whose parameters can be estimated reliably from survey
data. We present a PSF model that meets these requirements based on a fast
deep-learning method to estimate its free parameters. We demonstrate our
approach on publicly available SDSS data. We extract the most important
features of the SDSS sample via principal component analysis. Next, we
construct our model based on perturbations of a fixed base profile, ensuring
that it captures these features. We then train a Convolutional Neural Network
to estimate the free parameters of the model from noisy images of the PSF. This
allows us to render a model image of each star, which we compare to the SDSS
stars to evaluate the performance of our method. We find that our approach is
able to accurately reproduce the SDSS PSF at the pixel level, which, due to the
speed of both the model evaluation and the parameter estimation, offers good
prospects for incorporating our method into the $MCCL$ framework.

We consider the problem of training generative models with deep neural
networks as generators, i.e. to map latent codes to data points. Whereas the
dominant paradigm combines simple priors over codes with complex deterministic
models, we argue that it might be advantageous to use more flexible code
distributions. We demonstrate how these distributions can be induced directly
from the data. The benefits include: more powerful generative models, better
modeling of latent structure and explicit control of the degree of
generalization.

We consider the problem of training generative models with deep neural
networks as generators, i.e. to map latent codes to data points. Whereas the
dominant paradigm combines simple priors over codes with complex deterministic
models, we propose instead to use more flexible code distributions. These
distributions are estimated nonparametrically by reversing the generator map
during training. The benefits include: more powerful generative models, better
modeling of latent structure and explicit control of the degree of
generalization.

This study deals with semantic segmentation of high-resolution (aerial)
images where a semantic class label is assigned to each pixel via supervised
classification as a basis for automatic map generation. Recently, deep
convolutional neural networks (CNNs) have shown impressive performance and have
quickly become the de facto standard for semantic segmentation, with the added
benefit that task-specific feature design is no longer necessary. However, a
major downside of deep learning methods is that they are extremely data-hungry,
thus aggravating the perennial bottleneck of supervised classification: obtaining
enough annotated training data. On the other hand, it has been observed
that they are rather robust against noise in the training labels. This opens up
the intriguing possibility to avoid annotating huge amounts of training data,
and instead train the classifier from existing legacy data or crowdsourced
maps which can exhibit high levels of noise. The question addressed in this
paper is: can training with large-scale, publicly available labels replace a
substantial part of the manual labeling effort and still achieve sufficient
performance? Such data will inevitably contain a significant portion of errors,
but in return virtually unlimited quantities of it are available in large
parts of the world. We adapt a state-of-the-art CNN architecture for semantic
segmentation of buildings and roads in aerial images, and compare its
performance when using different training data sets, ranging from manually
labeled, pixel-accurate ground truth of the same city to automatic training
data derived from OpenStreetMap data from distant locations. Our results
indicate that satisfactory performance can be obtained with
significantly less manual annotation effort by exploiting noisy large-scale
training data.

We demonstrate the potential of Deep Learning methods for measurements of
cosmological parameters from density fields, focusing on the extraction of
non-Gaussian information. We consider weak lensing mass maps as our dataset. We
aim for our method to be able to distinguish between five models, which were
chosen to lie along the $\sigma_8$-$\Omega_m$ degeneracy, and have nearly the
same two-point statistics. We design and implement a Deep Convolutional Neural
Network (DCNN) which learns the relation between five cosmological models and
the mass maps they generate. We develop a new training strategy which ensures
the good performance of the network for high levels of noise. We compare the
performance of this approach to commonly used non-Gaussian statistics, namely
the skewness and kurtosis of the convergence maps. We find that our
implementation of DCNN outperforms the skewness and kurtosis statistics,
especially for high noise levels. The network maintains a mean discrimination
efficiency greater than $85\%$ even for noise levels corresponding to
ground-based lensing observations, while the other statistics perform worse in this
setting, achieving efficiency less than $70\%$. This demonstrates the ability
of CNN-based methods to efficiently break the $\sigma_8$-$\Omega_m$
degeneracy with weak lensing mass maps alone. We discuss the potential of this
method to be applied to the analysis of real weak lensing data and other
datasets.
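
For readers unfamiliar with the baseline statistics, the following minimal sketch computes the skewness and excess kurtosis of the pixel values of a 2D map. It does not reproduce the paper's pipeline (no smoothing, masking, or noise model), and the function name `skewness_and_kurtosis` and the toy maps are assumptions made for illustration.

```python
import numpy as np

def skewness_and_kurtosis(kappa_map):
    """Return (skewness, excess kurtosis) of the pixel values of a 2D map."""
    values = np.asarray(kappa_map, dtype=float).ravel()
    centered = values - values.mean()
    sigma = centered.std()
    # Standardized third and fourth central moments.
    skew = np.mean(centered**3) / sigma**3
    excess_kurt = np.mean(centered**4) / sigma**4 - 3.0
    return skew, excess_kurt

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gaussian_map = rng.standard_normal((256, 256))
    # A toy non-Gaussian field: a pointwise transform of the Gaussian map.
    lognormal_map = np.exp(0.5 * gaussian_map)
    print(skewness_and_kurtosis(gaussian_map))
    print(skewness_and_kurtosis(lognormal_map))
```

Both statistics are close to zero for a Gaussian field and depart from zero for non-Gaussian fields, which is why they are used as simple probes of information beyond two-point statistics.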

We consider the minimization of nonconvex functions that typically arise in
machine learning. Specifically, we focus our attention on a variant of trust
region methods known as cubic regularization. This approach is particularly
attractive because it escapes strict saddle points and it provides stronger
convergence guarantees than first- and second-order as well as classical trust
region methods. However, it suffers from a high computational complexity that
makes it impractical for large-scale learning. Here, we propose a novel method
that uses subsampling to lower this computational cost. Using
concentration inequalities, we provide a sampling scheme that gives sufficiently
accurate gradient and Hessian approximations to retain the strong global and
local convergence guarantees of cubically regularized methods. To the best of
our knowledge this is the first work that gives global convergence guarantees
for a subsampled variant of cubic regularization on nonconvex functions.
Furthermore, we provide experimental results supporting our theory.

State-of-the-art approaches for image captioning require supervised training
data consisting of captions with paired image data. These methods are typically
unable to use unsupervised data such as textual data with no corresponding
images, which is a much more abundant commodity. Here we propose a novel way of
using such textual data by artificially generating missing visual information.
We evaluate this learning approach on a newly designed model that detects
visual concepts present in an image and feeds them to a reviewer-decoder
architecture with an attention mechanism. Unlike previous approaches that
encode visual concepts using word embeddings, we instead suggest using regional
image features which capture more intrinsic information. The main benefit of
this architecture is that it synthesizes meaningful thought vectors that
capture salient image properties and then applies a soft attentive decoder to
decode the thought vectors and generate image captions. We evaluate our model
on both Microsoft COCO and Flickr30K datasets and demonstrate that this model
combined with our semi-supervised learning method can substantially improve
performance and help the model to generate more accurate and diverse captions.

We consider the problem of training generative models with a Generative
Adversarial Network (GAN). Although GANs can accurately model complex
distributions, they are known to be difficult to train due to instabilities
caused by a difficult minimax optimization problem. In this paper, we view the
problem of training GANs as finding a mixed strategy in a zero-sum game.
Building on ideas from online learning, we propose a novel training method named
Chekhov GAN. On the theory side, we show that our method provably converges
to an equilibrium for semi-shallow GAN architectures, i.e. architectures where
the discriminator is a one-layer network and the generator is arbitrary. On the
practical side, we develop an efficient heuristic guided by our theoretical
results, which we apply to commonly used deep GAN architectures. On several
real-world tasks our approach exhibits improved stability and performance
compared to standard GAN training.

Deep generative models based on Generative Adversarial Networks (GANs) have
demonstrated impressive sample quality, but in order to work they require a
careful choice of architecture, parameter initialization, and selection of
hyperparameters. This fragility is in part due to a dimensional mismatch
between the model distribution and the true distribution, causing their density
ratio and the associated f-divergence to be undefined. We overcome this
fundamental limitation and propose a new regularization approach with low
computational cost that yields a stable GAN training procedure. We demonstrate
the effectiveness of this approach on several datasets including common
benchmark image generation tasks. Our approach turns GAN models into reliable
building blocks for deep learning.

This paper presents a novel approach for multilingual sentiment
classification in short texts. This is a challenging task as the amount of
training data in languages other than English is very limited. Previously
proposed multilingual approaches typically require establishing a
correspondence to English, for which powerful classifiers are already available.
In contrast, our method does not require such supervision. We leverage large
amounts of weakly supervised data in various languages to train a multilayer
convolutional network and demonstrate the importance of using pretraining of
such networks. We thoroughly evaluate our approach on various multilingual
datasets, including the recent SemEval-2016 sentiment prediction benchmark
(Task 4), where we achieved state-of-the-art performance. We also compare the
performance of our model trained individually for each language to a variant
trained for all languages at once. We show that the latter model reaches
slightly worse, but still acceptable, performance when compared to the single
language model, while benefiting from better generalization properties across
languages.

We propose a novel approach for mitigating radio frequency interference (RFI)
signals in radio data using the latest advances in deep learning. We employ a
special type of Convolutional Neural Network, the U-Net, that enables the
classification of clean signal and RFI signatures in 2D time-ordered data
acquired from a radio telescope. We train and assess the performance of this
network using the HIDE & SEEK radio data simulation and processing packages, as
well as early Science Verification data acquired with the 7m single-dish
telescope at the Bleien Observatory. We find that our U-Net implementation
achieves accuracy competitive with classical RFI mitigation algorithms such as
SEEK's SumThreshold implementation. We publish our U-Net software package on
GitHub under the GPLv3 license.

For many machine learning problems, data is abundant and it may be
prohibitive to make multiple passes through the full training set. In this
context, we investigate strategies for dynamically increasing the effective
sample size when using iterative methods such as stochastic gradient descent.
Our interest is motivated by the rise of variance-reduced methods, which
achieve linear convergence rates that scale favorably for smaller sample sizes.
Exploiting this feature, we show, theoretically and empirically, how to
obtain significant speedups with a novel algorithm that reaches statistical
accuracy on an $n$-sample in $2n$ instead of $n \log n$ steps.
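
The abstract does not spell out the sample-size schedule. As a purely arithmetic illustration of how a budget of order $2n$ can arise, the sketch below assumes a hypothetical schedule that starts from `n0` samples, doubles the sample size at each stage, and spends one step per sample in the current stage. The function names and the schedule itself are assumptions, not the paper's algorithm.

```python
import math

def doubling_schedule(n, n0=1):
    """Sample sizes n0, 2*n0, 4*n0, ..., n (hypothetical doubling schedule).

    Assumes n equals n0 times a power of two.
    """
    if n0 < 1 or n < n0:
        raise ValueError("need 1 <= n0 <= n")
    sizes = [n0]
    while sizes[-1] * 2 <= n:
        sizes.append(sizes[-1] * 2)
    if sizes[-1] != n:
        raise ValueError("n must equal n0 times a power of two")
    return sizes

def total_steps(n, n0=1):
    """Total steps when each stage costs as many steps as its sample size."""
    return sum(doubling_schedule(n, n0))

if __name__ == "__main__":
    n = 1024
    # Geometric series: the stage costs sum to less than 2n.
    print(total_steps(n), 2 * n, n * math.log(n))
```

Under this assumed schedule the stage costs form a geometric series, so the total stays below $2n$, whereas a budget of $n \log n$ steps grows faster than linearly in $n$.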

Newton's method is a fundamental technique in optimization with quadratic
convergence within a neighborhood around the optimum. However, reaching this
neighborhood is often slow and dominates the computational costs. We exploit
two properties specific to empirical risk minimization problems to accelerate
Newton's method, namely, subsampling training data and increasing strong
convexity through regularization. We propose a novel continuation method, where
we define a family of objectives over increasing sample sizes and with
decreasing regularization strength. Solutions on this path are tracked such
that the minimizer of the previous objective is guaranteed to be within the
quadratic convergence region of the next objective to be optimized. Thereby
every Newton iteration is guaranteed to achieve superlinear contractions with
regard to the chosen objective, which becomes a moving target. We provide a
theoretical analysis that motivates our algorithm, called DynaNewton, and
characterizes its speed of convergence. Experiments on a wide range of data
sets and problems consistently confirm the predicted computational savings.

Stochastic Gradient Descent (SGD) is a workhorse in machine learning, yet its
slow convergence can be a computational bottleneck. Variance reduction
techniques such as SAG, SVRG and SAGA have been proposed to overcome this
weakness, achieving linear convergence. However, these methods are either based
on computations of full gradients at pivot points, or on keeping per-data-point
corrections in memory. Therefore, speedups relative to SGD may need a minimal
number of epochs in order to materialize. This paper investigates algorithms
that can exploit neighborhood structure in the training data to share and
reuse information about past stochastic gradients across data points, which
offers advantages in the transient optimization phase. As a side product, we
provide a unified convergence analysis for a family of variance reduction
algorithms, which we call memorization algorithms. We provide experimental
results supporting our theory.
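
To make the "full gradients at pivot points" idea concrete, here is a minimal sketch of standard SVRG on a least-squares objective $\frac{1}{2n}\sum_i (a_i^\top w - b_i)^2$. It illustrates the baseline technique only, not the neighborhood-sharing algorithms studied in the paper; the function name, step size, and epoch length are illustrative assumptions.

```python
import numpy as np

def svrg_least_squares(A, b, step, n_epochs=30, seed=0):
    """Standard SVRG for (1/2n) * sum_i (a_i @ w - b_i)**2."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        # Pivot point: store the iterate and its full gradient.
        w_pivot = w.copy()
        full_grad = A.T @ (A @ w_pivot - b) / n
        for _ in range(n):
            i = rng.integers(n)
            a_i = A[i]
            grad_i = a_i * (a_i @ w - b[i])
            grad_i_pivot = a_i * (a_i @ w_pivot - b[i])
            # Variance-reduced gradient: unbiased, with variance that
            # shrinks as w and w_pivot both approach the optimum.
            w = w - step * (grad_i - grad_i_pivot + full_grad)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 5))
    b = A @ np.arange(1.0, 6.0) + 0.1 * rng.standard_normal(200)
    step = 0.1 / np.max(np.sum(A**2, axis=1))
    print(svrg_least_squares(A, b, step))
```

Each epoch costs one full pass for the pivot gradient before any variance-reduced step is taken, which is the start-up cost that the abstract says can delay speedups relative to plain SGD.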

Many fundamental problems in natural language processing rely on determining
what entities appear in a given text. Commonly referenced as entity linking,
this step is a fundamental component of many NLP tasks such as text
understanding, automatic summarization, semantic search or machine translation.
Name ambiguity, word polysemy, context dependencies and a heavy-tailed
distribution of entities contribute to the complexity of this problem.
Here we propose a probabilistic approach that makes use of an effective
graphical model to perform collective entity disambiguation. Input mentions
(i.e., linkable token spans) are disambiguated jointly across an entire
document by combining a document-level prior of entity co-occurrences with
local information captured from mentions and their surrounding context. The
model is based on simple sufficient statistics extracted from data, thus
relying on few parameters to be learned.
Our method does not require extensive feature engineering, nor an expensive
training procedure. We use loopy belief propagation to perform approximate
inference. The low complexity of our model makes this step sufficiently fast
for real-time usage. We demonstrate the accuracy of our approach on a wide
range of benchmark datasets, showing that it matches, and in many cases
outperforms, existing state-of-the-art methods.

Quasi-Newton methods are widely used in practice for convex loss minimization
problems. These methods exhibit good empirical performance on a wide variety of
tasks and enjoy superlinear convergence to the optimal solution. For
large-scale learning problems, stochastic Quasi-Newton methods have been
recently proposed. However, these typically only achieve sublinear convergence
rates and have not been shown to consistently perform well in practice since
noisy Hessian approximations can exacerbate the effect of high-variance
stochastic gradient estimates. In this work we propose Vite, a novel stochastic
Quasi-Newton algorithm that uses an existing first-order technique to reduce
this variance. Without exploiting the specific form of the approximate Hessian,
we show that Vite reaches the optimum at a geometric rate with a constant
step size when dealing with smooth strongly convex functions. Empirically, we
demonstrate improvements over existing stochastic Quasi-Newton and
variance-reduced stochastic gradient methods.