-
To see is to sketch -- free-hand sketching naturally builds ties between
human and machine vision. In this paper, we present a novel approach for
translating an object photo to a sketch, mimicking the human sketching process.
This is an extremely challenging task because the photo and sketch domains
differ significantly. Furthermore, human sketches exhibit various levels of
sophistication and abstraction even when depicting the same object instance in
a reference photo. This means that even if photo-sketch pairs are available,
they provide only a weak supervision signal for learning a translation model.
Compared with existing supervised approaches that solve the problem of
D(E(photo)) -> sketch, where E($\cdot$) and D($\cdot$) denote encoder and
decoder respectively, we take advantage of the inverse problem (e.g.,
D(E(sketch)) -> photo) and combine it with the unsupervised learning tasks of
within-domain reconstruction, all within a multi-task learning framework.
Compared with existing unsupervised approaches based on cycle consistency
(i.e., D(E(D(E(photo)))) -> photo), we introduce a shortcut consistency
enforced at the encoder bottleneck (e.g., D(E(photo)) -> photo) to exploit the
additional self-supervision. Both qualitative and quantitative results show
that the proposed model is superior to a number of state-of-the-art
alternatives. We also show that the synthetic sketches can be used to train a
better fine-grained sketch-based image retrieval (FG-SBIR) model, effectively
alleviating the problem of sketch data scarcity.
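
The four encoder-decoder paths described above can be combined into a single objective. The following is a minimal sketch only, under several assumptions that are not from the abstract: photos and sketches are represented as flat float vectors, every path is scored with mean squared error, and the name `shortcut_weight` is a hypothetical weighting factor.

```python
from typing import Callable, List

Vector = List[float]
Net = Callable[[Vector], Vector]


def l2(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multitask_loss(
    photo: Vector,
    sketch: Vector,
    enc_photo: Net,
    enc_sketch: Net,
    dec_photo: Net,
    dec_sketch: Net,
    shortcut_weight: float = 1.0,  # hypothetical weighting, not from the abstract
) -> float:
    """Sum the losses of the four encoder-decoder paths."""
    z_p, z_s = enc_photo(photo), enc_sketch(sketch)
    # Supervised cross-domain translation, in both directions.
    photo_to_sketch = l2(dec_sketch(z_p), sketch)
    sketch_to_photo = l2(dec_photo(z_s), photo)
    # Within-domain reconstruction straight from the bottleneck code.
    photo_to_photo = l2(dec_photo(z_p), photo)
    sketch_to_sketch = l2(dec_sketch(z_s), sketch)
    return (photo_to_sketch + sketch_to_photo
            + shortcut_weight * (photo_to_photo + sketch_to_sketch))
```

With identity functions standing in for all four networks, the reconstruction terms vanish and only the two translation terms contribute, which makes the role of each path easy to inspect.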
-
Contemporary deep learning techniques have made image recognition a
reasonably reliable technology. However, training effective photo classifiers
typically takes numerous examples, which limits image recognition's scalability
and applicability to scenarios where images may not be available. This has
motivated investigation into zero-shot learning, which addresses the issue via
knowledge transfer from other modalities such as text. In this paper, we
investigate an alternative approach: synthesizing image classifiers almost
directly from a user's imagination, via free-hand sketch. This approach doesn't
require the category to be nameable or describable via attributes as per
zero-shot learning. We achieve this by training a model regression network
to map from free-hand sketch space to the space of photo classifiers. It
turns out that this mapping can be learned in a category-agnostic way, allowing
photo classifiers for new categories to be synthesized by a user with no need for
annotated training photos. We also demonstrate that this modality of
classifier generation can be used to enhance the granularity of an
existing photo classifier, or as a complement to name-based zero-shot learning.
-
Human free-hand sketches have been studied in various contexts including
sketch recognition, synthesis and fine-grained sketch-based image retrieval
(FG-SBIR). A fundamental challenge for sketch analysis is to deal with
drastically different human drawing styles, particularly in terms of
abstraction level. In this work, we propose the first stroke-level sketch
abstraction model based on the insight of sketch abstraction as a process of
trading off between the recognizability of a sketch and the number of strokes
used to draw it. Concretely, we train a model for abstract sketch generation
through reinforcement learning of a stroke removal policy that learns to
predict which strokes can be safely removed without affecting recognizability.
We show that our abstraction model can be used for various sketch analysis
tasks including: (1) modeling stroke saliency and understanding the decision of
sketch recognition models, (2) synthesizing sketches of variable abstraction
for a given category, or reference object instance in a photo, and (3) training
an FG-SBIR model with photos only, bypassing the expensive photo-sketch pair
collection step.
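
The trade-off between recognizability and stroke count can be illustrated with a much simpler procedure than the learned policy. The sketch below swaps in a greedy search in place of reinforcement learning: it repeatedly deletes the stroke whose removal hurts a recognizer score the least, and stops when the score would fall below a threshold. The `recognizability` callable and the `min_score` threshold are hypothetical stand-ins for a trained sketch classifier and a chosen abstraction level.

```python
from typing import Callable, List, Sequence, Tuple

Stroke = Tuple[Tuple[float, float], ...]  # one stroke = a polyline of (x, y) points


def abstract_sketch(
    strokes: Sequence[Stroke],
    recognizability: Callable[[Sequence[Stroke]], float],
    min_score: float,
) -> List[Stroke]:
    """Greedily drop strokes while the recognizer score stays >= min_score."""
    kept = list(strokes)
    while len(kept) > 1:
        # Build every candidate sketch obtained by deleting exactly one stroke.
        candidates = [kept[:i] + kept[i + 1:] for i in range(len(kept))]
        best = max(candidates, key=recognizability)
        if recognizability(best) < min_score:
            break  # any further removal would hurt recognizability too much
        kept = best
    return kept
```

Lowering `min_score` yields more abstract sketches with fewer strokes, which mirrors the variable abstraction levels mentioned above.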
-
We propose a deep hashing framework for sketch retrieval that, for the first
time, works on a multi-million scale human sketch dataset. Leveraging this
large dataset, we explore a few sketch-specific traits that were otherwise
under-studied in prior literature. Instead of following the conventional sketch
recognition task, we introduce the novel problem of sketch hashing retrieval
which is not only more challenging, but also offers a better testbed for
large-scale sketch analysis, since: (i) more fine-grained sketch feature
learning is required to accommodate the large variations in style and
abstraction, and (ii) a compact binary code needs to be learned at the same
time to enable efficient retrieval. Key to our network design is the embedding
of unique characteristics of human sketch, where (i) a two-branch CNN-RNN
architecture is adapted to explore the temporal ordering of strokes, and (ii) a
novel hashing loss is specifically designed to accommodate both the temporal
and abstract traits of sketches. By working with a 3.8M sketch dataset, we show
that state-of-the-art hashing models specifically engineered for static images
fail to perform well on temporal sketch data. Our network, on the other hand, not
only offers the best retrieval performance on various code sizes, but also
yields the best generalization performance under a zero-shot setting and when
re-purposed for sketch recognition. This superior performance
demonstrates the benefit of our sketch-specific design.
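
To make the retrieval side concrete, the sketch below shows how compact binary codes enable efficient lookup in general. It is not the proposed network or its hashing loss: it assumes real-valued features are already available, binarizes them by sign (a common but here assumed choice), and ranks a gallery by Hamming distance. All function names are illustrative.

```python
from typing import List, Sequence


def binarize(features: Sequence[float]) -> int:
    """Pack the sign bits of a real-valued feature vector into an integer code."""
    code = 0
    for value in features:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def retrieve(query: int, gallery: Sequence[int], top_k: int) -> List[int]:
    """Indices of the top_k gallery codes nearest to the query in Hamming distance."""
    order = sorted(range(len(gallery)), key=lambda i: hamming(query, gallery[i]))
    return order[:top_k]
```

Because each comparison is an XOR followed by a bit count, ranking scales to very large galleries far more cheaply than comparing floating-point feature vectors.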
-
Many vision problems require matching images of object instances across
different domains. These include fine-grained sketch-based image retrieval
(FG-SBIR) and person re-identification (ReID). Existing approaches
attempt to learn a joint embedding space where images from different domains
can be directly compared. In most cases, this space is defined by the output of
the final layer of a deep neural network (DNN), which primarily contains
features of a high semantic level. In this paper, we argue that both high and
mid-level features are relevant for cross-domain instance matching (CDIM).
Importantly, mid-level features already exist in earlier layers of the DNN.
They just need to be extracted, represented, and fused properly with the final
layer. Based on this simple but powerful idea, we propose a unified framework
for CDIM. Instantiating our framework for FG-SBIR and ReID, we show that our
simple models can easily beat the state-of-the-art models, which are often
equipped with much more elaborate architectures.
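
One simple way to picture the fusion of mid-level and final-layer features is shown below. This is a sketch under assumptions, not the proposed framework: it represents a mid-level feature map by global average pooling, L2-normalizes both representations, and fuses them by concatenation. The pooling and concatenation choices, and all function names, are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def global_average_pool(feature_map: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a (channels x positions) feature map to one value per channel."""
    return [sum(channel) / len(channel) for channel in feature_map]


def fuse(mid_map: Sequence[Sequence[float]], final_vec: Sequence[float]) -> List[float]:
    """Concatenate the normalized mid-level and final-layer representations."""
    mid_vec = l2_normalize(global_average_pool(mid_map))
    return mid_vec + l2_normalize(final_vec)


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance used to compare fused embeddings across domains."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Normalizing each part before concatenation keeps one layer from dominating the distance simply because its activations have a larger scale.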
-
Domain shift refers to the well-known problem that a model trained in one
source domain performs poorly when applied to a target domain with different
statistics. Domain Generalization (DG) techniques attempt to alleviate this
issue by producing models which by design generalize well to novel testing
domains. We propose a novel meta-learning method for domain generalization.
Rather than designing a specific model that is robust to domain shift as in
most previous DG work, we propose a model agnostic training procedure for DG.
Our algorithm simulates train/test domain shift during training by synthesizing
virtual testing domains within each mini-batch. The meta-optimization objective
requires that steps to improve training domain performance should also improve
testing domain performance. This meta-learning procedure trains models with
good generalization ability to novel domains. We evaluate our method and
achieve state-of-the-art results on a recent cross-domain image classification
benchmark, as well as demonstrating its potential on two classic reinforcement
learning tasks.
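
The training procedure can be sketched on a toy problem. The code below is an illustration under assumptions rather than the authors' implementation: the model is a single parameter `theta` in y = theta * x, gradients are taken by finite differences instead of backpropagation, and the objective is written as the training loss plus `beta` times the virtual-test loss after one training step of size `alpha`. The names `alpha`, `beta`, and `lr`, and their default values, are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Domain = List[Tuple[float, float]]  # (x, y) pairs from one source domain


def mse(theta: float, data: Domain) -> float:
    """Loss of the one-parameter model y = theta * x on one domain."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)


def grad(f: Callable[[float], float], theta: float, eps: float = 1e-5) -> float:
    """Central finite-difference derivative (stand-in for backpropagation)."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)


def meta_step(
    theta: float,
    domains: List[Domain],
    rng: random.Random,
    alpha: float = 0.1,  # inner step size (assumed)
    beta: float = 1.0,   # weight of the virtual-test loss (assumed)
    lr: float = 0.1,     # outer learning rate (assumed)
) -> float:
    """One update that holds out a source domain as a virtual test domain."""
    held_out = rng.randrange(len(domains))
    train = [d for i, d in enumerate(domains) if i != held_out]
    test = domains[held_out]

    def train_loss(t: float) -> float:
        return sum(mse(t, d) for d in train) / len(train)

    def meta_objective(t: float) -> float:
        # Virtual-test loss is measured after one step on the training domains.
        adapted = t - alpha * grad(train_loss, t)
        return train_loss(t) + beta * mse(adapted, test)

    return theta - lr * grad(meta_objective, theta)
```

Because the held-out loss is evaluated at the adapted parameter, the outer gradient favors updates that reduce the loss on the training domains and on the virtual test domain at the same time.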
-
The problem of domain generalization is to learn from multiple training
domains, and extract a domain-agnostic model that can then be applied to an
unseen domain. Domain generalization (DG) has a clear motivation in contexts
where there are target domains with distinct characteristics, yet sparse data
for training. One example is recognition in sketch images, which are distinctly
more abstract and rarer than photos. Nevertheless, DG methods have primarily
been evaluated on photo-only benchmarks focusing on alleviating dataset
bias, where both problems of domain distinctiveness and data sparsity can be
minimal. We argue that these benchmarks are overly straightforward, and show
that simple deep learning baselines perform surprisingly well on them. In this
paper, we make two main contributions. First, we build upon the favorable
domain shift-robust properties of deep learning methods, and develop a low-rank
parameterized CNN model for end-to-end DG learning. Second, we develop a DG
benchmark dataset covering photo, sketch, cartoon and painting domains. This is
both more practically relevant, and harder (bigger domain shift) than existing
benchmarks. The results show that our method outperforms existing DG
alternatives, and our dataset provides a more significant DG challenge to drive
future research.
-
Sketch-based image retrieval (SBIR) is challenging due to the inherent
domain-gap between sketch and photo. Whereas photos are pixel-perfect
depictions, sketches are highly abstract, iconic renderings of the real world.
Therefore, matching sketch and photo directly using low-level visual cues is
insufficient, since a common low-level subspace that traverses semantically
across the two modalities is non-trivial to establish. Most existing SBIR
studies do not directly tackle this cross-modal problem. This naturally
motivates us to explore the effectiveness in SBIR of cross-modal retrieval
methods, which have been applied successfully to image-text matching. In this
paper, we introduce and compare a series of state-of-the-art cross-modal
subspace learning methods and benchmark them on two recently released
fine-grained SBIR datasets. Through a thorough examination of the experimental
results, we demonstrate that subspace learning can effectively model
the sketch-photo domain-gap. In addition, we draw a few key insights to drive
future research.
-
We present a generative model which can automatically summarize the stroke
composition of free-hand sketches of a given category. When our model is fit to
a collection of sketches with similar poses, it discovers and learns the
structure and appearance of a set of coherent parts, with each part represented
by a group of strokes. It represents both the consistent (topology) and the
diverse (structure and appearance variations) aspects of each sketch category.
Key to the success of our model are important insights learned from a
comprehensive study performed on human stroke data. By fitting this model to
images, we are able to synthesize visually similar and pleasant free-hand
sketches.
-
We propose a multi-scale multi-channel deep neural network framework that,
for the first time, yields sketch recognition performance surpassing that of
humans. Our superior performance is a result of explicitly embedding the unique
characteristics of sketches in our model: (i) a network architecture designed
for sketch rather than natural photo statistics, (ii) a multi-channel
generalisation that encodes sequential ordering in the sketching process, and
(iii) a multi-scale network ensemble with joint Bayesian fusion that accounts
for the different levels of abstraction exhibited in free-hand sketches. We
show that state-of-the-art deep networks specifically engineered for photos of
natural objects fail to perform well on sketch recognition, regardless of whether
they are trained using photos or sketches. Our network, on the other hand, not only
delivers the best performance on the largest human sketch dataset to date, but
is also small in size, making efficient training possible using just CPUs.
-
Heterogeneous face recognition (HFR) refers to matching face imagery across
different domains. It has received much interest from the research community as
a result of its profound implications in law enforcement. A wide variety of new
invariant features, cross-modality matching models and heterogeneous datasets
have been established in recent years. This survey provides a comprehensive review
of established techniques and recent developments in HFR. Moreover, we offer a
detailed account of datasets and benchmarks commonly used for evaluation. We
finish by assessing the state of the field and discussing promising directions
for future research.