-
Human visual object recognition is typically rapid and seemingly effortless,
as well as largely independent of viewpoint and object orientation. Until very
recently, animate visual systems were the only ones capable of this remarkable
computational feat. This has changed with the rise of a class of computer
vision algorithms called deep neural networks (DNNs) that achieve human-level
classification performance on object recognition tasks. Furthermore, a growing
number of studies report similarities in the way DNNs and the human visual
system process objects, suggesting that current DNNs may be good models of
human visual object recognition. Yet there clearly exist important
architectural and processing differences between state-of-the-art DNNs and the
primate visual system. The potential behavioural consequences of these
differences are not well understood. We aim to address this issue by comparing
human and DNN generalisation abilities towards image degradations. We find the
human visual system to be more robust to image manipulations like contrast
reduction, additive noise or novel eidolon-distortions. In addition, we find
progressively diverging classification error-patterns between humans and DNNs
when the signal gets weaker, indicating that there may still be marked
differences in the way humans and current DNNs perform visual object
recognition. We envision that our findings as well as our carefully measured
and freely available behavioural datasets provide a new useful benchmark for
the computer vision community to improve the robustness of DNNs and a
motivation for neuroscientists to search for mechanisms in the brain that could
facilitate this robustness.
-
Dozens of new models of fixation prediction are published every year and
compared on open benchmarks such as MIT300 and LSUN. However, progress in the
field can be difficult to judge because models are compared using a variety of
inconsistent metrics. Here we show that no single saliency map can perform well
under all metrics. Instead, we propose a principled approach to solve the
benchmarking problem by separating the notions of saliency models, maps and
metrics. Inspired by Bayesian decision theory, we define a saliency model to be
a probabilistic model of fixation density prediction and a saliency map to be a
metric-specific prediction derived from the model density which maximizes the
expected performance on that metric given the model density. We derive these
optimal saliency maps for the most commonly used saliency metrics (AUC, sAUC,
NSS, CC, SIM, KL-Div) and show that they can be computed analytically or
approximated with high precision. We show that this leads to consistent
rankings in all metrics and avoids the penalties of using one saliency map for
all metrics. Our method allows researchers to have their model compete on many
different metrics with the state of the art in those metrics: "good" models will
perform well in all metrics.
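
To illustrate why a single map cannot serve every metric, the following minimal sketch computes NSS (the mean z-scored map value at fixated pixels) for a toy density and for its logarithm. The density, the fixation coordinates and the function name are illustrative assumptions and not part of any benchmark. Both maps rank the pixels identically, so a purely rank-based score would treat them the same, yet their NSS values differ.

```python
import numpy as np

def nss(saliency_map, fixations):
    """Normalized Scanpath Saliency: mean z-scored map value at fixated pixels."""
    z = (saliency_map - saliency_map.mean()) / saliency_map.std()
    rows, cols = np.array(fixations).T
    return float(z[rows, cols].mean())

# Toy fixation density on a 4x4 grid (hypothetical data, sums to one).
density = np.arange(1, 17, dtype=float).reshape(4, 4)
density /= density.sum()
fixations = [(3, 3)]  # a single fixation on the most probable pixel

nss_density = nss(density, fixations)                # density used directly as the map
nss_log = nss(np.log(density), fixations)            # monotone transform, same pixel ranking
nss_affine = nss(3.0 * density + 1.0, fixations)     # affine rescaling of the density

print(nss_density, nss_log, nss_affine)
```

NSS is unchanged by the affine rescaling but changes under the logarithm, which is one reason a map tuned for one metric can be penalised under another.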
-
Quantifying behavior is crucial for many applications in neuroscience.
Videography provides easy methods for the observation and recording of animal
behavior in diverse settings, yet extracting particular aspects of a behavior
for further analysis can be highly time consuming. In motor control studies,
humans or other animals are often marked with reflective markers to assist with
computer-based tracking, yet markers are intrusive (especially for smaller
animals), and the number and location of the markers must be determined a
priori. Here, we present a highly efficient method for markerless tracking
based on transfer learning with deep neural networks that achieves excellent
results with minimal training data. We demonstrate the versatility of this
framework by tracking various body parts in a broad collection of experimental
settings: odor trail-tracking in mice, egg-laying behavior in Drosophila, and
mouse hand articulation in a skilled forelimb task. For example, during the
skilled reaching behavior, individual joints can be automatically tracked (and
a confidence score is reported). Remarkably, even when a small number of frames
are labeled ($\approx 200$), the algorithm achieves excellent tracking
performance on test frames that is comparable to human accuracy.
-
We tackle the problem of one-shot segmentation: finding and segmenting a
previously unseen object in a cluttered scene based on a single instruction
example. We propose a baseline architecture combining a Siamese embedding for
detection with a U-net for segmentation and evaluate it on a novel dataset,
which we call $\textit{cluttered Omniglot}$. Using oracle models with access to
various amounts of ground-truth information, we show that in this kind of
visual search task, detection and segmentation are two intertwined problems,
the solution to each of which helps solve the other. We therefore introduce
$\textit{MaskNet}$, an improved model that sequentially attends to different
locations, generates segmentation proposals to mask out background clutter and
selects among the segmented objects. Our findings suggest that such image
recognition models based on an iterative refinement of object detection and
foreground segmentation may help improve both detection and segmentation in
highly cluttered scenes.
-
An important preprocessing step in most data analysis pipelines aims to
extract a small set of sources that explain most of the data. Currently used
algorithms for blind source separation (BSS), however, often fail to extract
the desired sources and need extensive cross-validation. In contrast, their
rarely used probabilistic counterparts can get away with little
cross-validation and are more accurate and reliable but no simple and scalable
implementations are available. Here we present a novel probabilistic BSS
framework (DECOMPOSE) that can be flexibly adjusted to the data, is extensible
and easy to use, adapts to individual sources and handles large-scale data
through algorithmic efficiency. DECOMPOSE encompasses and generalises many
traditional BSS algorithms such as PCA, ICA and NMF and we demonstrate
substantial improvements in accuracy and robustness on artificial and real
data.
-
Even today's most advanced machine learning models are easily fooled by almost
imperceptible perturbations of their inputs. Foolbox is a new Python package to
generate such adversarial perturbations and to quantify and compare the
robustness of machine learning models. It is built around the idea that the
most comparable robustness measure is the minimum perturbation needed to craft
an adversarial example. To this end, Foolbox provides reference implementations
of most published adversarial attack methods alongside some new ones, all of
which perform internal hyperparameter tuning to find the minimum adversarial
perturbation. Additionally, Foolbox interfaces with most popular deep learning
frameworks such as PyTorch, Keras, TensorFlow, Theano and MXNet and allows
different adversarial criteria such as targeted misclassification and top-k
misclassification as well as different distance measures. The code is licensed
under the MIT license and is openly available at
https://github.com/bethgelab/foolbox . The most up-to-date documentation can be
found at http://foolbox.readthedocs.io .
-
Many machine learning algorithms are vulnerable to almost imperceptible
perturbations of their inputs. Until now it has been unclear how much risk adversarial
perturbations carry for the safety of real-world machine learning applications
because most methods used to generate such perturbations rely either on
detailed model information (gradient-based attacks) or on confidence scores
such as class probabilities (score-based attacks), neither of which is
available in most real-world scenarios. In many such cases one currently needs
to retreat to transfer-based attacks which rely on cumbersome substitute
models, need access to the training data and can be defended against. Here we
emphasise the importance of attacks which solely rely on the final model
decision. Such decision-based attacks (1) are applicable to real-world
black-box models such as autonomous cars, (2) need less knowledge and are
easier to apply than transfer-based attacks and (3) are more robust to simple
defences than gradient- or score-based attacks. Previous attacks in this
category were limited to simple models or simple datasets. Here we introduce
the Boundary Attack, a decision-based attack that starts from a large
adversarial perturbation and then seeks to reduce the perturbation while
staying adversarial. The attack is conceptually simple, requires close to no
hyperparameter tuning, does not rely on substitute models and is competitive
with the best gradient-based attacks in standard computer vision tasks like
ImageNet. We apply the attack to two black-box algorithms from Clarifai.com.
The Boundary Attack in particular and the class of decision-based attacks in
general open new avenues to study the robustness of machine learning models and
raise new questions regarding the safety of deployed machine learning systems.
An implementation of the attack is available as part of Foolbox at
https://github.com/bethgelab/foolbox .
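
The idea of starting from a large adversarial perturbation and shrinking it while staying adversarial can be sketched as a simple random walk. This is a heavily simplified illustration and not the Foolbox implementation: the toy `predict` function, the step sizes `noise` and `pull`, and the acceptance rule are all assumptions made for this example. The only information the walk uses is the final class label.

```python
import numpy as np

def predict(x):
    # Hypothetical black-box model: exposes only the final decision (a class label).
    return int(x[0] + x[1] > 1.0)

def decision_based_attack(original, start, n_steps=2000, noise=0.05, pull=0.05, seed=0):
    rng = np.random.default_rng(seed)
    label = predict(original)
    assert predict(start) != label, "start point must already be adversarial"
    adv = start.copy()
    for _ in range(n_steps):
        dist = np.linalg.norm(adv - original)
        # Random exploration step, scaled to the current distance.
        candidate = adv + noise * dist * rng.standard_normal(adv.shape)
        # Small step towards the original input to shrink the perturbation.
        candidate = candidate + pull * (original - candidate)
        # Accept only if the candidate is still adversarial and closer to the original.
        still_adversarial = predict(candidate) != label
        if still_adversarial and np.linalg.norm(candidate - original) < dist:
            adv = candidate
    return adv

original = np.array([0.2, 0.2])  # classified as 0 by the toy model
start = np.array([3.0, 3.0])     # classified as 1: a large adversarial perturbation
adversarial = decision_based_attack(original, start)

print(np.linalg.norm(start - original), np.linalg.norm(adversarial - original))
```

The sketch keeps the step sizes fixed; a practical attack would presumably adapt them as the walk approaches the decision boundary.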
-
Large-scale recordings of neuronal activity make it possible to gain insights
into the collective activity of neural ensembles. It has been hypothesized that
neural populations might be optimized to operate at a 'thermodynamic critical
point', and that this property has implications for information processing.
Support for this notion has come from a series of studies which identified
statistical signatures of criticality in the ensemble activity of retinal
ganglion cells. What are the underlying mechanisms that give rise to these
observations? Here we show that signatures of criticality arise even in simple
feed-forward models of retinal population activity. In particular, they occur
whenever neural population data exhibits correlations and is randomly
sub-sampled during data analysis. These results show that signatures of
criticality are not necessarily indicative of an optimized coding strategy, and
challenge the utility of analysis approaches based on equilibrium
thermodynamics for understanding partially observed biological systems.
-
Neuroscientists classify neurons into different types that perform similar
computations at different locations in the visual field. Traditional methods
for neural system identification do not capitalize on this separation of 'what'
and 'where'. Learning deep convolutional feature spaces that are shared among
many neurons provides an exciting path forward, but the architectural design
needs to account for data limitations: While new experimental techniques enable
recordings from thousands of neurons, experimental time is limited so that one
can sample only a small fraction of each neuron's response space. Here, we show
that a major bottleneck for fitting convolutional neural networks (CNNs) to
neural data is the estimation of the individual receptive field locations, a
problem whose surface has only been scratched thus far. We propose a CNN
architecture with a sparse readout layer factorizing the spatial (where) and
feature (what) dimensions. Our network scales well to thousands of neurons and
short recordings and can be trained end-to-end. We evaluate this architecture
on ground-truth data to explore the challenges and limitations of CNN-based
system identification. Moreover, we show that our network model outperforms
current state-of-the-art system identification models of mouse primary visual
cortex.
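
A readout that factorizes the spatial (where) and feature (what) dimensions can be sketched as follows. The array shapes, variable names and random values are assumptions for illustration; the sketch omits any output nonlinearity and any sparsity regularisation, and it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_features, height, width = 5, 4, 6, 6

# Shared convolutional feature maps for one image (hypothetical values).
features = rng.standard_normal((n_features, height, width))
# "Where": one spatial mask per neuron.
spatial_mask = rng.standard_normal((n_neurons, height, width))
# "What": one feature weight vector per neuron.
feature_weights = rng.standard_normal((n_neurons, n_features))

def factorized_readout(features, spatial_mask, feature_weights):
    # Pool every feature map with each neuron's spatial mask -> (n_neurons, n_features).
    pooled = np.einsum("nhw,khw->nk", spatial_mask, features)
    # Combine the pooled features with each neuron's feature weights -> (n_neurons,).
    return np.einsum("nk,nk->n", feature_weights, pooled)

responses = factorized_readout(features, spatial_mask, feature_weights)

# The same computation with an explicit, unfactorized readout tensor.
full_weights = np.einsum("nk,nhw->nkhw", feature_weights, spatial_mask)
responses_full = np.einsum("nkhw,khw->n", full_weights, features)

params_factorized = n_neurons * (n_features + height * width)
params_full = n_neurons * n_features * height * width
print(params_factorized, params_full)
```

The factorized form needs `n_features + height * width` parameters per neuron instead of `n_features * height * width`, which matters when each neuron's response space is sampled only sparsely.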
-
Neural Style Transfer has shown very exciting results enabling new forms of
image manipulation. Here we extend the existing method to introduce control
over spatial location, colour information and spatial scale. We
demonstrate how this enhances the method by allowing high-resolution controlled
stylisation and helps to alleviate common failure cases such as applying ground
textures to sky regions. Furthermore, by decomposing style into these
perceptual factors we enable the combination of style information from multiple
sources to generate new, perceptually appealing styles from existing ones. We
also describe how these methods can be used to more efficiently produce large,
high-quality stylisations. Finally, we show how the introduced control
measures can be applied in recent methods for Fast Neural Style Transfer.
-
A recent paper suggests that Deep Neural Networks can be protected from
gradient-based adversarial perturbations by driving the network activations
into a highly saturated regime. Here we analyse such saturated networks and
show that the attacks fail due to numerical limitations in the gradient
computations. A simple stabilisation of the gradient estimates enables
successful and efficient attacks. Thus, it has yet to be shown that the
robustness observed in highly saturated networks is not simply due to numerical
limitations.
-
Here we present a parametric model for dynamic textures. The model is based
on spatiotemporal summary statistics computed from the feature representations
of a Convolutional Neural Network (CNN) trained on object recognition. We
demonstrate how the model can be used to synthesise new samples of dynamic
textures and to predict motion in simple movies.
-
Here we present DeepGaze II, a model that predicts where people look in
images. The model uses the features from the VGG-19 deep neural network trained
to identify objects in images. Contrary to other saliency models that use deep
features, here we use the VGG features for saliency prediction with no
additional fine-tuning (rather, a few readout layers are trained on top of the
VGG features to predict saliency). The model is therefore a strong test of
transfer learning. After conservative cross-validation, DeepGaze II explains
about 87% of the explainable information gain in the patterns of fixations and
achieves top performance in area under the curve metrics on the MIT300 hold-out
benchmark. These results corroborate the finding from DeepGaze I (which
explained 56% of the explainable information gain), that deep features trained
on object recognition provide a versatile feature space for performing related
visual tasks. We explore the factors that contribute to this success and
present several informative image examples. A web service is available to
compute model predictions at http://deepgaze.bethgelab.org.
-
This note presents an extension to the neural artistic style transfer
algorithm (Gatys et al.). The original algorithm transforms an image to have
the style of another given image. For example, a photograph can be transformed
to have the style of a famous painting. Here we address a potential shortcoming
of the original method: the algorithm transfers the colors of the original
painting, which can alter the appearance of the scene in undesirable ways. We
describe simple linear methods for transferring style while preserving colors.
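
As an illustration of what a linear colour transformation can look like, the sketch below maps the pixels of one image so that their RGB mean and covariance match those of another. It is one possible linear method, assumed here for illustration rather than taken from the note; the function names, the `eps` regulariser and the random test images are hypothetical.

```python
import numpy as np

def sqrtm_psd(matrix):
    # Symmetric square root of a positive semi-definite matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(matrix)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def match_color_statistics(source, target, eps=1e-8):
    """Linearly transform source pixels so their RGB mean and covariance match target."""
    src = source.reshape(-1, 3)
    tgt = target.reshape(-1, 3)
    mu_s, mu_t = src.mean(axis=0), tgt.mean(axis=0)
    cov_s = np.cov(src, rowvar=False) + eps * np.eye(3)
    cov_t = np.cov(tgt, rowvar=False) + eps * np.eye(3)
    # A satisfies A @ cov_s @ A.T == cov_t.
    A = sqrtm_psd(cov_t) @ np.linalg.inv(sqrtm_psd(cov_s))
    out = (src - mu_s) @ A.T + mu_t
    return out.reshape(source.shape)

rng = np.random.default_rng(0)
style = rng.random((16, 16, 3))                  # stand-in for a painting
content = 0.5 * rng.random((16, 16, 3)) + 0.2    # stand-in for a photograph
recolored_style = match_color_statistics(style, content)
```

In a colour-preserving pipeline built on this idea, the recoloured style image would stand in for the original painting, so that the transferred style carries the photograph's colour statistics.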
-
Here we demonstrate that the feature space of random shallow convolutional
neural networks (CNNs) can serve as a surprisingly good model of natural
textures. Patches from the same texture are consistently classified as being
more similar than patches from different textures. Samples synthesized from the
model capture spatial correlations on scales much larger than the receptive
field size, and sometimes even rival or surpass the perceptual quality of
state-of-the-art texture models (but show less variability). The current state of the
art in parametric texture synthesis relies on the multi-layer feature space of
deep CNNs that were trained on natural images. Our finding suggests that such
optimized multi-layer feature spaces are not imperative for texture modeling.
Instead, much simpler shallow and convolutional networks can serve as the basis
for novel texture synthesis algorithms.
-
Probabilistic generative models can be used for compression, denoising,
inpainting, texture synthesis, semi-supervised learning, unsupervised feature
learning, and other tasks. Given this wide range of applications, it is not
surprising that a lot of heterogeneity exists in the way these models are
formulated, trained, and evaluated. As a consequence, direct comparison between
models is often difficult. This article reviews mostly known but often
underappreciated properties relating to the evaluation and interpretation of
generative models with a focus on image models. In particular, we show that
three of the currently most commonly used criteria---average log-likelihood,
Parzen window estimates, and visual fidelity of samples---are largely
independent of each other when the data is high-dimensional. Good performance
with respect to one criterion therefore need not imply good performance with
respect to the other criteria. Our results show that extrapolation from one
criterion to another is not warranted and generative models need to be
evaluated directly with respect to the application(s) they were intended for.
In addition, we provide examples demonstrating that Parzen window estimates
should generally be avoided.
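
For readers unfamiliar with the criterion, a Parzen window estimate places a kernel on each model sample and evaluates held-out data under the resulting density. The sketch below uses an isotropic Gaussian kernel; the function name, the bandwidth `sigma` and the toy data are assumptions for illustration only.

```python
import numpy as np

def parzen_log_likelihood(test_points, samples, sigma):
    """Average log-density of test_points under a Gaussian Parzen window on samples."""
    dim = samples.shape[1]
    # Squared distances between every test point and every sample.
    sq_dists = ((test_points[:, None, :] - samples[None, :, :]) ** 2).sum(axis=-1)
    log_kernels = -sq_dists / (2.0 * sigma**2) - 0.5 * dim * np.log(2.0 * np.pi * sigma**2)
    # Numerically stable log-mean-exp over the samples.
    max_log = log_kernels.max(axis=1, keepdims=True)
    log_density = max_log[:, 0] + np.log(np.exp(log_kernels - max_log).sum(axis=1))
    log_density -= np.log(samples.shape[0])
    return float(log_density.mean())

# Toy check: one sample at the origin, evaluated at the origin in two dimensions.
single = parzen_log_likelihood(np.zeros((1, 2)), np.zeros((1, 2)), sigma=1.0)

rng = np.random.default_rng(0)
samples = rng.standard_normal((500, 2))   # stand-in for samples drawn from a model
held_out = rng.standard_normal((100, 2))  # stand-in for test data
estimate = parzen_log_likelihood(held_out, samples, sigma=0.3)
```

The estimate depends on the choice of `sigma` and on the number of samples, both of which are free choices of the evaluator rather than properties of the model being evaluated.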
-
We study modeling and inference with the Elliptical Gamma Distribution (EGD).
We consider maximum likelihood (ML) estimation for EGD scatter matrices, a task
for which we develop new fixed-point algorithms. Our algorithms are efficient
and converge to global optima despite nonconvexity. Moreover, they turn out to
be much faster than both a well-known iterative algorithm of Kent & Tyler
(1991) and sophisticated manifold optimization algorithms. Subsequently, we
invoke our ML algorithms as subroutines for estimating parameters of a mixture
of EGDs. We illustrate our methods by applying them to model natural image
statistics---the proposed EGD mixture model yields the most parsimonious model
among several competing approaches.
-
Here we introduce a new model of natural textures based on the feature spaces
of convolutional neural networks optimised for object recognition. Samples from
the model are of high perceptual quality demonstrating the generative power of
neural networks trained in a purely discriminative fashion. Within the model,
textures are represented by the correlations between feature maps in several
layers of the network. We show that across layers the texture representations
increasingly capture the statistical properties of natural images while making
object information more and more explicit. The model provides a new tool to
generate stimuli for neuroscience and might offer insights into the deep
representations learned by convolutional neural networks.
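
The texture representation described above, correlations between feature maps, can be sketched for a single layer as follows. The feature maps here are random stand-ins for network activations, and the normalisation by the number of positions is an assumption of this example.

```python
import numpy as np

def feature_correlations(feature_maps):
    """Correlations between feature maps of shape (channels, height, width)."""
    channels = feature_maps.shape[0]
    flat = feature_maps.reshape(channels, -1)  # one row per feature map
    return flat @ flat.T / flat.shape[1]       # (channels, channels)

rng = np.random.default_rng(0)
maps = rng.standard_normal((8, 16, 16))        # hypothetical activations of one layer
gram = feature_correlations(maps)

# Shifting all maps by the same circular offset leaves the correlations unchanged.
shifted = np.roll(maps, shift=(3, 5), axis=(1, 2))
gram_shifted = feature_correlations(shifted)
```

Because the sum runs over all positions, the representation discards where in the image a feature occurred and keeps only which features co-occur, which is what makes it a summary statistic suited to textures.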
-
Modeling the distribution of natural images is challenging, partly because of
strong statistical dependencies which can extend over hundreds of pixels.
Recurrent neural networks have been successful in capturing long-range
dependencies in a number of problems but only recently have found their way
into generative image models. We here introduce a recurrent image model based
on multi-dimensional long short-term memory units which are particularly suited
for image modeling due to their spatial structure. Our model scales to images
of arbitrary size and its likelihood is computationally tractable. We find that
it outperforms the state of the art in quantitative comparisons on several
image datasets and produces promising results when used for texture synthesis
and inpainting.
-
In fine art, especially painting, humans have mastered the skill to create
unique visual experiences through composing a complex interplay between the
content and style of an image. Thus far the algorithmic basis of this process
is unknown and there exists no artificial system with similar capabilities.
However, in other key areas of visual perception such as object and face
recognition near-human performance was recently demonstrated by a class of
biologically inspired vision models called Deep Neural Networks. Here we
introduce an artificial system based on a Deep Neural Network that creates
artistic images of high perceptual quality. The system uses neural
representations to separate and recombine content and style of arbitrary
images, providing a neural algorithm for the creation of artistic images.
Moreover, in light of the striking similarities between performance-optimised
artificial neural networks and biological vision, our work offers a path
forward to an algorithmic understanding of how humans create and perceive
artistic imagery.
-
Natural images can be viewed as patchworks of different textures, where the
local image statistics are roughly stationary within a small neighborhood but
otherwise vary from region to region. In order to model this variability, we
first applied the parametric texture algorithm of Portilla and Simoncelli to
image patches of $64 \times 64$ pixels in a large database of natural images such that
each image patch is then described by 655 texture parameters which specify
certain statistics, such as variances and covariances of wavelet coefficients
or coefficient magnitudes within that patch.
To model the statistics of these texture parameters, we then developed
suitable nonlinear transformations of the parameters that allowed us to fit
their joint statistics with a multivariate Gaussian distribution. We find that
the first 200 principal components contain more than 99% of the variance and
are sufficient to generate textures that are perceptually extremely close to
those generated with all 655 components. We demonstrate the usefulness of the
model in several ways: (1) We sample ensembles of texture patches that can be
directly compared to samples of patches from the natural image database and can
to a high degree reproduce their perceptual appearance. (2) We further
developed an image compression algorithm which generates surprisingly accurate
images at bit rates as low as 0.14 bits/pixel. (3) Finally, we demonstrate how
our approach can be used for an efficient and objective evaluation of samples
generated with probabilistic models of natural images.
-
Recent results suggest that state-of-the-art saliency models perform far from
optimally in predicting fixations. This shortfall in performance has been attributed
to an inability to model the influence of high-level image features such as
objects. Recent seminal advances in applying deep neural networks to tasks like
object recognition suggest that they are able to capture this kind of
structure. However, the enormous amount of training data necessary to train
these networks makes them difficult to apply directly to saliency prediction.
We present a novel way of reusing existing neural networks that have been
pretrained on the task of object recognition in models of fixation prediction.
Using the well-known network of Krizhevsky et al. (2012), we develop a new
saliency model that significantly outperforms all state-of-the-art models on
the MIT Saliency Benchmark. We show that the structure of this network allows
new insights in the psychophysics of fixation selection and potentially their
neural implementation. To train our network, we build on recent work on the
modeling of saliency as point processes.
-
A fundamental challenge in calcium imaging has been to infer the timing of
action potentials from the measured noisy calcium fluorescence traces. We
systematically evaluate a range of spike inference algorithms on a large
benchmark dataset recorded from varying neural tissue (V1 and retina) using
different calcium indicators (OGB-1 and GCaMP6). We show that a new algorithm
based on supervised learning in flexible probabilistic models outperforms all
previously published techniques, setting a new standard for spike inference
from calcium signals. Importantly, it performs better than other algorithms
even on datasets not seen during training. Future data acquired in new
experimental conditions can easily be used to further improve its spike
prediction accuracy and generalization performance. Finally, we show that
comparing algorithms on artificial data is not informative about performance on
real population imaging data, suggesting that a benchmark dataset may greatly
facilitate future algorithmic developments.
-
Within the set of the many complex factors driving gaze placement, the
properties of an image that are associated with fixations under free viewing
conditions have been studied extensively. There is a general impression that
the field is close to understanding this particular association. Here we frame
saliency models probabilistically as point processes, allowing the calculation
of log-likelihoods and bringing saliency evaluation into the domain of
information. We compare the information gain of state-of-the-art models to a
gold standard and find that only one third of the explainable spatial
information is captured. We additionally provide a principled method to show
where and how models fail to capture information in the fixations. Thus,
contrary to previous assertions, purely spatial saliency remains a significant
challenge.
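
Once models are expressed as probability densities over fixation locations, information gain follows directly from log-likelihoods. The sketch below measures it in bits per fixation relative to a baseline density. The uniform baseline, the toy densities and the fixation list are assumptions made for illustration and do not reproduce the paper's numbers.

```python
import numpy as np

def information_gain(density, baseline, fixations):
    """Average log-likelihood gain over a baseline, in bits per fixation."""
    rows, cols = np.array(fixations).T
    return float(np.mean(np.log2(density[rows, cols]) - np.log2(baseline[rows, cols])))

# Toy 2x2 "image": each array is a probability distribution over pixels.
baseline = np.full((2, 2), 0.25)                      # assumed uniform baseline
gold_standard = np.array([[0.5, 0.25], [0.125, 0.125]])
model = np.array([[0.4, 0.2], [0.2, 0.2]])
fixations = [(0, 0), (0, 0), (0, 1), (1, 0)]          # hypothetical fixated pixels

ig_gold = information_gain(gold_standard, baseline, fixations)  # 0.25 bits per fixation here
ig_model = information_gain(model, baseline, fixations)
explained_fraction = ig_model / ig_gold

print(ig_model, ig_gold, explained_fraction)
```

The ratio `explained_fraction` expresses how much of the gold standard's information gain the model captures on this toy data.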
-
GRavitational lEnsing Accuracy Testing 2010 (GREAT10) is a public image
analysis challenge aimed at the development of algorithms to analyze
astronomical images. Specifically, the challenge is to measure varying image
distortions in the presence of a variable convolution kernel, pixelization and
noise. This is the second in a series of challenges set to the astronomy,
computer science and statistics communities, providing a structured environment
in which methods can be improved and tested in preparation for planned
astronomical surveys. GREAT10 extends previous work by introducing
variable fields into the challenge. The "Galaxy Challenge" involves the precise
measurement of galaxy shape distortions, quantified locally by two parameters
called shear, in the presence of a known convolution kernel. Crucially, the
convolution kernel and the simulated gravitational lensing shape distortion
both now vary as a function of position within the images, as is the case for
real data. In addition, we introduce the "Star Challenge" that concerns the
reconstruction of a variable convolution kernel, similar to that in a typical
astronomical observation. This document details the GREAT10 Challenge for
potential participants. Continually updated information is also available from
http://www.greatchallenges.info.