March 15, 2018 math.OC, cs.LG, stat.ML
We analyze the variance of stochastic gradients along negative curvature directions in certain non-convex machine learning models and show that stochastic gradients exhibit a strong component along these directions. Furthermore, we show that - contrary to the case of isotropic noise - this variance is proportional to the magnitude of the corresponding eigenvalues and not decreasing in the dimensionality. Based upon this observation we propose a new assumption under which we show that the injection of explicit, isotropic noise usually applied to make gradient descent escape saddle points can successfully be replaced by a simple SGD step. Additionally - and under the same condition - we derive the first convergence rate for plain SGD to a second-order stationary point in a number of iterations that is independent of the problem dimension.
• ### Fast Cosmic Web Simulations with Generative Adversarial Networks(1801.09070)

March 9, 2018 astro-ph.CO, stat.ML
Dark matter in the universe evolves through gravity to form a complex network of halos, filaments, sheets and voids, that is known as the cosmic web. Computational models of the underlying physical processes, such as classical N-body simulations, are extremely resource intensive, as they track the action of gravity in an expanding universe using billions of particles as tracers of the cosmic matter distribution. Therefore, upcoming cosmology experiments will face a computational bottleneck that may limit the exploitation of their full scientific potential. To address this challenge, we demonstrate the application of a machine learning technique called Generative Adversarial Networks (GAN) to learn models that can efficiently generate new, physically realistic realizations of the cosmic web. Our training set is a small, representative sample of 2D image snapshots from N-body simulations of size 500 and 100 Mpc. We show that the GAN-produced results are qualitatively and quantitatively very similar to the originals. Generation of a new cosmic web realization with a GAN takes a fraction of a second, compared to the many hours needed by the N-body technique. We anticipate that GANs will therefore play an important role in providing extremely fast and precise simulations of cosmic web in the era of large cosmological surveys, such as Euclid and LSST.
• ### Semantic Interpolation in Implicit Models(1710.11381)

Feb. 2, 2018 cs.LG, stat.ML
In implicit models, one often interpolates between sampled points in latent space. As we show in this paper, care needs to be taken to match-up the distributional assumptions on code vectors with the geometry of the interpolating paths. Otherwise, typical assumptions about the quality and semantics of in-between points may not be justified. Based on our analysis we propose to modify the prior code distribution to put significantly more probability mass closer to the origin. As a result, linear interpolation paths are not only shortest paths, but they are also guaranteed to pass through high-density regions, irrespective of the dimensionality of the latent space. Experiments on standard benchmark image datasets demonstrate clear visual improvements in the quality of the generated samples and exhibit more meaningful interpolation paths.
• ### Fast Point Spread Function Modeling with Deep Learning(1801.07615)

Jan. 23, 2018 stat.ML, astro-ph.IM
Modeling the Point Spread Function (PSF) of wide-field surveys is vital for many astrophysical applications and cosmological probes including weak gravitational lensing. The PSF smears the image of any recorded object and therefore needs to be taken into account when inferring properties of galaxies from astronomical images. In the case of cosmic shear, the PSF is one of the dominant sources of systematic errors and must be treated carefully to avoid biases in cosmological parameters. Recently, forward modeling approaches to calibrate shear measurements within the Monte-Carlo Control Loops ($MCCL$) framework have been developed. These methods typically require simulating a large amount of wide-field images, thus, the simulations need to be very fast yet have realistic properties in key features such as the PSF pattern. Hence, such forward modeling approaches require a very flexible PSF model, which is quick to evaluate and whose parameters can be estimated reliably from survey data. We present a PSF model that meets these requirements based on a fast deep-learning method to estimate its free parameters. We demonstrate our approach on publicly available SDSS data. We extract the most important features of the SDSS sample via principal component analysis. Next, we construct our model based on perturbations of a fixed base profile, ensuring that it captures these features. We then train a Convolutional Neural Network to estimate the free parameters of the model from noisy images of the PSF. This allows us to render a model image of each star, which we compare to the SDSS stars to evaluate the performance of our method. We find that our approach is able to accurately reproduce the SDSS PSF at the pixel level, which, due to the speed of both the model evaluation and the parameter estimation, offers good prospects for incorporating our method into the $MCCL$ framework.
• ### Flexible Prior Distributions for Deep Generative Models(1710.11383)

Jan. 7, 2018 cs.LG, stat.ML
We consider the problem of training generative models with deep neural networks as generators, i.e. to map latent codes to data points. Whereas the dominant paradigm combines simple priors over codes with complex deterministic models, we argue that it might be advantageous to use more flexible code distributions. We demonstrate how these distributions can be induced directly from the data. The benefits include: more powerful generative models, better modeling of latent structure and explicit control of the degree of generalization.
• ### Generator Reversal(1707.09241)

July 28, 2017 cs.LG, stat.ML
We consider the problem of training generative models with deep neural networks as generators, i.e. to map latent codes to data points. Whereas the dominant paradigm combines simple priors over codes with complex deterministic models, we propose instead to use more flexible code distributions. These distributions are estimated non-parametrically by reversing the generator map during training. The benefits include: more powerful generative models, better modeling of latent structure and explicit control of the degree of generalization.
• ### Learning Aerial Image Segmentation from Online Maps(1707.06879)

July 21, 2017 cs.CV
This study deals with semantic segmentation of high-resolution (aerial) images where a semantic class label is assigned to each pixel via supervised classification as a basis for automatic map generation. Recently, deep convolutional neural networks (CNNs) have shown impressive performance and have quickly become the de-facto standard for semantic segmentation, with the added benefit that task-specific feature design is no longer necessary. However, a major downside of deep learning methods is that they are extremely data-hungry, thus aggravating the perennial bottleneck of supervised classification, to obtain enough annotated training data. On the other hand, it has been observed that they are rather robust against noise in the training labels. This opens up the intriguing possibility to avoid annotating huge amounts of training data, and instead train the classifier from existing legacy data or crowd-sourced maps which can exhibit high levels of noise. The question addressed in this paper is: can training with large-scale, publicly available labels replace a substantial part of the manual labeling effort and still achieve sufficient performance? Such data will inevitably contain a significant portion of errors, but in return virtually unlimited quantities of it are available in larger parts of the world. We adapt a state-of-the-art CNN architecture for semantic segmentation of buildings and roads in aerial images, and compare its performance when using different training data sets, ranging from manually labeled, pixel-accurate ground truth of the same city to automatic training data derived from OpenStreetMap data from distant locations. We report our results that indicate that satisfying performance can be obtained with significantly less manual annotation effort, by exploiting noisy large-scale training data.
• ### Cosmological model discrimination with Deep Learning(1707.05167)

July 18, 2017 astro-ph.CO, stat.ML
We demonstrate the potential of Deep Learning methods for measurements of cosmological parameters from density fields, focusing on the extraction of non-Gaussian information. We consider weak lensing mass maps as our dataset. We aim for our method to be able to distinguish between five models, which were chosen to lie along the $\sigma_8$ - $\Omega_m$ degeneracy, and have nearly the same two-point statistics. We design and implement a Deep Convolutional Neural Network (DCNN) which learns the relation between five cosmological models and the mass maps they generate. We develop a new training strategy which ensures the good performance of the network for high levels of noise. We compare the performance of this approach to commonly used non-Gaussian statistics, namely the skewness and kurtosis of the convergence maps. We find that our implementation of DCNN outperforms the skewness and kurtosis statistics, especially for high noise levels. The network maintains the mean discrimination efficiency greater than $85\%$ even for noise levels corresponding to ground based lensing observations, while the other statistics perform worse in this setting, achieving efficiency less than $70\%$. This demonstrates the ability of CNN-based methods to efficiently break the $\sigma_8$ - $\Omega_m$ degeneracy with weak lensing mass maps alone. We discuss the potential of this method to be applied to the analysis of real weak lensing data and other datasets.
• ### Sub-sampled Cubic Regularization for Non-convex Optimization(1705.05933)

July 1, 2017 math.OC, cs.LG, stat.ML
We consider the minimization of non-convex functions that typically arise in machine learning. Specifically, we focus our attention on a variant of trust region methods known as cubic regularization. This approach is particularly attractive because it escapes strict saddle points and it provides stronger convergence guarantees than first- and second-order as well as classical trust region methods. However, it suffers from a high computational complexity that makes it impractical for large-scale learning. Here, we propose a novel method that uses sub-sampling to lower this computational cost. By the use of concentration inequalities we provide a sampling scheme that gives sufficiently accurate gradient and Hessian approximations to retain the strong global and local convergence guarantees of cubically regularized methods. To the best of our knowledge this is the first work that gives global convergence guarantees for a sub-sampled variant of cubic regularization on non-convex functions. Furthermore, we provide experimental results supporting our theory.
• ### A Semi-supervised Framework for Image Captioning(1611.05321)

June 24, 2017 cs.CV
State-of-the-art approaches for image captioning require supervised training data consisting of captions with paired image data. These methods are typically unable to use unsupervised data such as textual data with no corresponding images, which is a much more abundant commodity. We here propose a novel way of using such textual data by artificially generating missing visual information. We evaluate this learning approach on a newly designed model that detects visual concepts present in an image and feed them to a reviewer-decoder architecture with an attention mechanism. Unlike previous approaches that encode visual concepts using word embeddings, we instead suggest using regional image features which capture more intrinsic information. The main benefit of this architecture is that it synthesizes meaningful thought vectors that capture salient image properties and then applies a soft attentive decoder to decode the thought vectors and generate image captions. We evaluate our model on both Microsoft COCO and Flickr30K datasets and demonstrate that this model combined with our semi-supervised learning method can largely improve performance and help the model to generate more accurate and diverse captions.
• ### An Online Learning Approach to Generative Adversarial Networks(1706.03269)

June 10, 2017 cs.LG, stat.ML
We consider the problem of training generative models with a Generative Adversarial Network (GAN). Although GANs can accurately model complex distributions, they are known to be difficult to train due to instabilities caused by a difficult minimax optimization problem. In this paper, we view the problem of training GANs as finding a mixed strategy in a zero-sum game. Building on ideas from online learning we propose a novel training method named Chekhov GAN 1 . On the theory side, we show that our method provably converges to an equilibrium for semi-shallow GAN architectures, i.e. architectures where the discriminator is a one layer network and the generator is arbitrary. On the practical side, we develop an efficient heuristic guided by our theoretical results, which we apply to commonly used deep GAN architectures. On several real world tasks our approach exhibits improved stability and performance compared to standard GAN training.
• ### Stabilizing Training of Generative Adversarial Networks through Regularization(1705.09367)

May 25, 2017 cs.LG, stat.ML
Deep generative models based on Generative Adversarial Networks (GANs) have demonstrated impressive sample quality but in order to work they require a careful choice of architecture, parameter initialization, and selection of hyper-parameters. This fragility is in part due to a dimensional mismatch between the model distribution and the true distribution, causing their density ratio and the associated f-divergence to be undefined. We overcome this fundamental limitation and propose a new regularization approach with low computational cost that yields a stable GAN training procedure. We demonstrate the effectiveness of this approach on several datasets including common benchmark image generation tasks. Our approach turns GAN models into reliable building blocks for deep learning.
• ### Leveraging Large Amounts of Weakly Supervised Data for Multi-Language Sentiment Classification(1703.02504)

March 7, 2017 cs.CL, cs.LG, cs.IR
This paper presents a novel approach for multi-lingual sentiment classification in short texts. This is a challenging task as the amount of training data in languages other than English is very limited. Previously proposed multi-lingual approaches typically require to establish a correspondence to English for which powerful classifiers are already available. In contrast, our method does not require such supervision. We leverage large amounts of weakly-supervised data in various languages to train a multi-layer convolutional network and demonstrate the importance of using pre-training of such networks. We thoroughly evaluate our approach on various multi-lingual datasets, including the recent SemEval-2016 sentiment prediction benchmark (Task 4), where we achieved state-of-the-art performance. We also compare the performance of our model trained individually for each language to a variant trained for all languages at once. We show that the latter model reaches slightly worse - but still acceptable - performance when compared to the single language model, while benefiting from better generalization properties across languages.
• ### Radio frequency interference mitigation using deep convolutional neural networks(1609.09077)

Jan. 13, 2017 astro-ph.IM
We propose a novel approach for mitigating radio frequency interference (RFI) signals in radio data using the latest advances in deep learning. We employ a special type of Convolutional Neural Network, the U-Net, that enables the classification of clean signal and RFI signatures in 2D time-ordered data acquired from a radio telescope. We train and assess the performance of this network using the HIDE & SEEK radio data simulation and processing packages, as well as early Science Verification data acquired with the 7m single-dish telescope at the Bleien Observatory. We find that our U-Net implementation is showing competitive accuracy to classical RFI mitigation algorithms such as SEEK's SumThreshold implementation. We publish our U-Net software package on GitHub under GPLv3 license.
• ### Starting Small -- Learning with Adaptive Sample Sizes(1603.02839)

Oct. 7, 2016 cs.LG
For many machine learning problems, data is abundant and it may be prohibitive to make multiple passes through the full training set. In this context, we investigate strategies for dynamically increasing the effective sample size, when using iterative methods such as stochastic gradient descent. Our interest is motivated by the rise of variance-reduced methods, which achieve linear convergence rates that scale favorably for smaller sample sizes. Exploiting this feature, we show -- theoretically and empirically -- how to obtain significant speed-ups with a novel algorithm that reaches statistical accuracy on an $n$-sample in $2n$, instead of $n \log n$ steps.
• ### DynaNewton - Accelerating Newton's Method for Machine Learning(1605.06561)

May 20, 2016 cs.LG
Newton's method is a fundamental technique in optimization with quadratic convergence within a neighborhood around the optimum. However reaching this neighborhood is often slow and dominates the computational costs. We exploit two properties specific to empirical risk minimization problems to accelerate Newton's method, namely, subsampling training data and increasing strong convexity through regularization. We propose a novel continuation method, where we define a family of objectives over increasing sample sizes and with decreasing regularization strength. Solutions on this path are tracked such that the minimizer of the previous objective is guaranteed to be within the quadratic convergence region of the next objective to be optimized. Thereby every Newton iteration is guaranteed to achieve super-linear contractions with regard to the chosen objective, which becomes a moving target. We provide a theoretical analysis that motivates our algorithm, called DynaNewton, and characterizes its speed of convergence. Experiments on a wide range of data sets and problems consistently confirm the predicted computational savings.
• ### Variance Reduced Stochastic Gradient Descent with Neighbors(1506.03662)

Feb. 26, 2016 math.OC, cs.LG, stat.ML
Stochastic Gradient Descent (SGD) is a workhorse in machine learning, yet its slow convergence can be a computational bottleneck. Variance reduction techniques such as SAG, SVRG and SAGA have been proposed to overcome this weakness, achieving linear convergence. However, these methods are either based on computations of full gradients at pivot points, or on keeping per data point corrections in memory. Therefore speed-ups relative to SGD may need a minimal number of epochs in order to materialize. This paper investigates algorithms that can exploit neighborhood structure in the training data to share and re-use information about past stochastic gradients across data points, which offers advantages in the transient optimization phase. As a side-product we provide a unified convergence analysis for a family of variance reduction algorithms, which we call memorization algorithms. We provide experimental results supporting our theory.