The image biomarker standardisation initiative (IBSI) is an independent
international collaboration which works towards standardising the extraction of
image biomarkers from acquired imaging for the purpose of high-throughput
quantitative image analysis (radiomics). Lack of reproducibility and validation
of high-throughput quantitative image analysis studies is considered to be a
major challenge for the field. Part of this challenge lies in the scantiness of
consensus-based guidelines and definitions for the process of translating
acquired imaging into high-throughput image biomarkers. The IBSI therefore
seeks to provide image biomarker nomenclature and definitions, benchmark data
sets, and benchmark values to verify image processing and image biomarker
calculations, as well as reporting guidelines, for high-throughput image
In contrast with traditional video, omnidirectional video enables spherical
viewing direction with support for head-mounted displays, providing an
interactive and immersive experience. Unfortunately, to the best of our
knowledge, there are few visual quality assessment (VQA) methods, either
subjective or objective, for omnidirectional video coding. This paper proposes
both subjective and objective methods for assessing quality loss in encoding
omnidirectional video. Specifically, we first present a new database, which
includes the viewing direction data from several subjects watching
omnidirectional video sequences. Then, from our database, we find a high
consistency in viewing directions across different subjects. The viewing
directions are normally distributed in the center of the front regions, but
they sometimes fall into other regions, related to video content. Given this
finding, we present a subjective VQA method for measuring difference mean
opinion score (DMOS) of the whole and regional omnidirectional video, in terms
of overall DMOS (O-DMOS) and vectorized DMOS (V-DMOS), respectively. Moreover,
we propose two objective VQA methods for encoded omnidirectional video, in
light of human perception characteristics of omnidirectional video. One method
weighs the distortion of pixels with regard to their distances to the center of
front regions, which considers human preference in a panorama. The other method
predicts viewing directions according to video content, and then the predicted
viewing directions are leveraged to allocate weights to the distortion of each
pixel in our objective VQA method. Finally, our experimental results verify
that both the subjective and objective methods proposed in this paper advance
state-of-the-art VQA for omnidirectional video.
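The first objective method above, which weights pixel distortion by distance to the center of the front region, can be sketched as a weighted PSNR. This is only an illustration under stated assumptions: the abstract does not give the weighting function, so the Gaussian fall-off, the `sigma` value, the equirectangular frame layout, and the function names are all hypothetical.

```python
import math

def front_center_weight(row, col, height, width, sigma=math.pi / 4):
    """Weight of one pixel; Gaussian in angular distance (assumed form)."""
    # Pixel center -> longitude in (-pi, pi) and latitude in (-pi/2, pi/2),
    # assuming an equirectangular frame whose front center is (0, 0).
    lon = ((col + 0.5) / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - ((row + 0.5) / height) * math.pi
    # Great-circle distance from the pixel to the front center.
    cos_d = max(-1.0, min(1.0, math.cos(lat) * math.cos(lon)))
    dist = math.acos(cos_d)
    return math.exp(-dist * dist / (2.0 * sigma * sigma))

def weighted_psnr(reference, distorted, peak=255.0, sigma=math.pi / 4):
    """PSNR in dB where each squared error is weighted by front_center_weight."""
    height, width = len(reference), len(reference[0])
    num = 0.0
    den = 0.0
    for r in range(height):
        for c in range(width):
            w = front_center_weight(r, c, height, width, sigma)
            err = reference[r][c] - distorted[r][c]
            num += w * err * err
            den += w
    mse = num / den
    return math.inf if mse == 0.0 else 10.0 * math.log10(peak * peak / mse)
```

With this weighting, the same error costs more PSNR near the front center than near the back of the sphere. The second method described above would replace the fixed front center with predicted viewing directions.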
Objective: Optoacoustic (photoacoustic) tomography is aimed at reconstructing
maps of the initial pressure rise induced by the absorption of light pulses in
tissue. In practice, due to inaccurate assumptions in the forward model, noise
and other experimental factors, the images are often afflicted by artifacts,
occasionally manifested as negative values. The aim of the work is to develop
an inversion method which reduces the occurrence of negative values and
improves the quantitative performance of optoacoustic imaging. Methods: We
present a novel method for optoacoustic tomography based on an entropy
maximization algorithm, which uses logarithmic regularization for attaining
non-negative reconstructions. The reconstruction image quality is further
improved using structural prior based fluence correction. Results: We report
the performance achieved by the entropy maximization scheme on numerical
simulation, experimental phantoms and in-vivo samples. Conclusion: The proposed
algorithm demonstrates superior reconstruction performance by delivering
non-negative pixel values with no visible distortion of anatomical structures.
Significance: Our method can enable quantitative optoacoustic imaging, and has
the potential to improve pre-clinical and translational imaging applications.
The seminal work of Gatys et al. demonstrated the power of Convolutional
Neural Networks (CNNs) in creating artistic imagery by separating and
recombining image content and style. This process of using CNNs to render a
content image in different styles is referred to as Neural Style Transfer
(NST). Since then, NST has become a trending topic both in academic literature
and industrial applications. It is receiving increasing attention and a variety
of approaches are proposed to either improve or extend the original NST
algorithm. In this paper, we aim to provide a comprehensive overview of the
current progress towards NST. We first propose a taxonomy of current algorithms
in the field of NST. Then, we present several evaluation methods and compare
different NST algorithms both qualitatively and quantitatively. The review
concludes with a discussion of various applications of NST and open problems
for future research. A list of papers discussed in this review, corresponding
codes, pre-trained models and more comparison results are publicly available at
So far, the problem of unmixing large or multitemporal hyperspectral datasets
has been specifically addressed in the remote sensing literature only by a few
dedicated strategies. Among them, some attempts have been made within a
distributed estimation framework, in particular relying on the alternating
direction method of multipliers (ADMM). In this paper, we propose to study the
interest of a partially asynchronous distributed unmixing procedure based on a
recently proposed asynchronous algorithm. Under standard assumptions, the
proposed algorithm inherits its convergence properties from recent
contributions in non-convex optimization, while allowing the problem of
interest to be efficiently addressed. Comparisons with a distributed
synchronous counterpart of the proposed unmixing procedure allow its interest
to be assessed on synthetic and real data. Besides, thanks to its genericity
and flexibility, the procedure investigated in this work can be implemented to
address various matrix factorization problems.
Convolutional sparse representations are a form of sparse representation with
a dictionary that has a structure that is equivalent to convolution with a set
of linear filters. While effective algorithms have recently been developed for
the convolutional sparse coding problem, the corresponding dictionary learning
problem is substantially more challenging. Furthermore, although a number of
different approaches have been proposed, the absence of thorough comparisons
between them makes it difficult to determine which of them represents the
current state of the art. The present work both addresses this deficiency and
proposes some new approaches that outperform existing ones in certain contexts.
A thorough set of performance comparisons indicates a very wide range of
performance differences among the existing and proposed methods, and clearly
identifies those that are the most effective.
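For orientation, the convolutional sparse coding problem referred to above is commonly written as follows in the literature. The notation is an assumption, not taken from the abstract: $\mathbf{s}$ is the signal, $\mathbf{d}_m$ are the $M$ filters, $\mathbf{x}_m$ the coefficient maps, and $\lambda$ the sparsity weight.

```latex
\[
  \operatorname*{arg\,min}_{\{\mathbf{x}_m\}} \;
  \frac{1}{2} \Bigl\| \sum_{m=1}^{M} \mathbf{d}_m * \mathbf{x}_m - \mathbf{s} \Bigr\|_2^2
  + \lambda \sum_{m=1}^{M} \| \mathbf{x}_m \|_1
\]
```

The dictionary learning problem additionally minimizes over the filters $\mathbf{d}_m$, usually under a norm constraint such as $\|\mathbf{d}_m\|_2 = 1$, and sums the objective over a set of training signals.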
We present and discuss different algorithms for converting rectangular
imagery into elliptical regions. We mainly focus on methods that use
mathematical mappings with explicit and invertible equations. The key idea is
to start with invertible mappings between the square and the circular disc then
extend it to handle rectangles and ellipses. This extension can be done by
simply removing the eccentricity and reintroducing it back after using a chosen
square-to-disc mapping.
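The extension from squares and discs to rectangles and ellipses can be sketched as below. The elliptical grid mapping is used here as one well-known invertible square-to-disc mapping; the abstract does not say which mappings are covered, so this choice and all function names are assumptions.

```python
import math

def square_to_disc(x, y):
    """Elliptical grid mapping from the square [-1, 1]^2 onto the unit disc."""
    return x * math.sqrt(1.0 - y * y / 2.0), y * math.sqrt(1.0 - x * x / 2.0)

def disc_to_square(u, v):
    """Closed-form inverse of square_to_disc."""
    r2 = 2.0 * math.sqrt(2.0)

    def root(t):
        # Clamp tiny negative values caused by floating-point rounding.
        return math.sqrt(max(t, 0.0))

    x = 0.5 * (root(2.0 + u * u - v * v + r2 * u) - root(2.0 + u * u - v * v - r2 * u))
    y = 0.5 * (root(2.0 - u * u + v * v + r2 * v) - root(2.0 - u * u + v * v - r2 * v))
    return x, y

def rectangle_to_ellipse(x, y, half_width, half_height):
    """Remove the eccentricity, map square to disc, then reintroduce it."""
    u, v = square_to_disc(x / half_width, y / half_height)
    return u * half_width, v * half_height

def ellipse_to_rectangle(u, v, half_width, half_height):
    """Inverse of rectangle_to_ellipse, using the same normalization."""
    x, y = disc_to_square(u / half_width, v / half_height)
    return x * half_width, y * half_height
```

Here the rectangle is centered at the origin with the given half-width and half-height, and it maps to the ellipse with the same semi-axes.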
Convolutional sparse representations are a form of sparse representation with
a structured, translation invariant dictionary. Most convolutional dictionary
learning algorithms to date operate in batch mode, requiring simultaneous
access to all training images during the learning process, which results in
very high memory usage and severely limits the training data that can be used.
Very recently, however, a number of authors have considered the design of
online convolutional dictionary learning algorithms that offer far better
scaling of memory and computational cost with training set size than batch
methods. This paper extends our prior work, improving a number of aspects of
our previous algorithm; proposing an entirely new one, with better performance,
and that supports the inclusion of a spatial mask for learning from incomplete
data; and providing a rigorous theoretical analysis of these methods.
Supervised image segmentation assigns image voxels to a set of labels, as
defined by a specific labeling protocol. In this paper, we decompose
segmentation into two steps. The first step is what we call "primitive
segmentation", where voxels that form sub-parts (primitives) of the various
segmentation labels available in the training data, are grouped together. The
second step involves computing a protocol-specific label map based on the
primitive segmentation. Our core contribution is a novel loss function for the
first step, where a primitive segmentation model is trained. The proposed loss
function is the entropy of the (protocol-specific) "ground truth" label map
conditioned on the primitive segmentation. The conditional entropy loss enables
combining training datasets that have been manually labeled with different
protocols. Furthermore, as we show empirically, it facilitates an efficient
strategy for transfer learning via a lightweight protocol adaptation model that
can be trained with little manually labeled data. We apply the proposed
approach to the volumetric segmentation of brain MRI scans, where we achieve
promising results.
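The conditional entropy in the proposed loss can be illustrated on hard label maps. This sketch only computes the empirical entropy H(labels | primitives) from two flattened voxel sequences; the actual loss presumably operates on model outputs during training, and the function name and input format are assumptions.

```python
import math
from collections import Counter

def conditional_entropy(labels, primitives):
    """Empirical H(labels | primitives) in nats for equal-length sequences."""
    if len(labels) != len(primitives):
        raise ValueError("labels and primitives must have the same length")
    n = len(labels)
    joint = Counter(zip(primitives, labels))  # counts of (primitive, label) pairs
    marginal = Counter(primitives)            # counts of each primitive
    entropy = 0.0
    for (prim, _label), count in joint.items():
        # p(prim, label) * log p(label | prim)
        entropy -= (count / n) * math.log(count / marginal[prim])
    return entropy
```

The value is zero exactly when every primitive falls inside a single protocol label, so the label map can be computed from the primitive segmentation without ambiguity.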
There is debate over whether phoneme or viseme units are the most effective for a
lipreading system. Some studies use phoneme units even though phonemes describe
unique short sounds; other studies tried to improve lipreading accuracy by
focusing on visemes with varying results. We compare the performance of a
lipreading system by modeling visual speech using either 13 viseme or 38
phoneme units. We report the accuracy of our system at both word and unit
levels. The evaluation task is large vocabulary continuous speech using the
TCD-TIMIT corpus. We complete our visual speech modeling via hybrid DNN-HMMs
and our visual speech decoder is a Weighted Finite-State Transducer (WFST). We
use DCT and Eigenlips as representations of the mouth ROI image. The phoneme
lipreading system word accuracy outperforms the viseme based system word
accuracy. However, the phoneme system achieved lower accuracy at the unit level
which shows the importance of the dictionary for decoding classification
outputs into words.
Visemes are the visual equivalent of phonemes. Although not precisely
defined, a working definition of a viseme is "a set of phonemes which have
identical appearance on the lips". Therefore a phoneme falls into one viseme
class but a viseme may represent many phonemes: a many to one mapping. This
mapping introduces ambiguity between phonemes when using viseme classifiers.
Not only is this ambiguity damaging to the performance of audio-visual
classifiers operating on real expressive speech, there is also considerable
choice between possible mappings. In this paper we explore the issue of this
choice of viseme-to-phoneme map. We show that there is a definite difference in
performance between viseme-to-phoneme mappings and explore why some maps appear
to work better than others. We also devise a new algorithm for constructing
phoneme-to-viseme mappings from labeled speech data. These new visemes, `Bear'
visemes, are shown to perform better than previously known units.
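The many-to-one mapping and the ambiguity it introduces can be shown with a toy example. The table below is a small hypothetical map chosen for illustration only; it is not the `Bear' map or any published viseme set, and the class names are invented.

```python
# Hypothetical phoneme-to-viseme table (illustrative, not a published map).
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar",
    "ae": "V_open",
}

def to_visemes(phonemes):
    """Map a phoneme sequence to its viseme sequence (many-to-one)."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]

def confusable_phonemes(mapping):
    """Group phonemes by viseme class; each group is visually ambiguous."""
    groups = {}
    for phoneme, viseme in mapping.items():
        groups.setdefault(viseme, []).append(phoneme)
    return groups
```

Under this table, "pat" and "bad" produce the same viseme sequence, which is the ambiguity a viseme classifier cannot resolve without a dictionary or language model.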
Visual lip gestures observed whilst lipreading have a few working
definitions; the two most common are `the visual equivalent of a phoneme' and
`phonemes which are indistinguishable on the lips'. To date there is no formal
definition, in part because to date we have not established a two-way
relationship or mapping between visemes and phonemes. Some evidence suggests
that visual speech is highly dependent upon the speaker. So here, we use a
phoneme-clustering method to form new phoneme-to-viseme maps for both
individual and multiple speakers. We test these phoneme to viseme maps to
examine how similarly speakers talk visually and we use signed rank tests to
measure the distance between individuals. We conclude that, broadly speaking,
speakers have the same repertoire of mouth gestures; where they differ is in
the use of the gestures.
Digital cameras and mobile phones enable us to conveniently record precious
moments. While digital image quality is constantly being improved, taking
high-quality photos of digital screens still remains challenging because the
photos are often contaminated with moir\'{e} patterns, a result of the
interference between the pixel grids of the camera sensor and the device
screen. Moir\'{e} patterns can severely damage the visual quality of photos.
However, few studies have aimed to solve this problem. In this paper, we
introduce a novel multiresolution fully convolutional network for automatically
removing moir\'{e} patterns from photos. Since a moir\'{e} pattern spans over a
wide range of frequencies, our proposed network performs a nonlinear
multiresolution analysis of the input image before computing how to cancel
moir\'{e} artefacts within every frequency band. We also create a large-scale
benchmark dataset with $100,000^+$ image pairs for investigating and evaluating
moir\'{e} pattern removal algorithms. Our network achieves state-of-the-art
performance on this dataset in comparison to existing learning architectures
for image restoration problems.
We propose a 3D convolutional neural network to simultaneously segment and
detect cell nuclei in confocal microscopy images. Mirroring the co-dependency
of these tasks, our proposed model consists of two serial components: the first
part computes a segmentation of cell bodies, while the second module identifies
the centers of these cells. Our model is trained end-to-end from scratch on a
mouse parotid salivary gland stem cell nuclei dataset comprising 107 image
stacks from three independent cell preparations, each containing several
hundred individual cell nuclei in 3D. In our experiments, we conduct a thorough
evaluation of both detection accuracy and segmentation quality, on two
different datasets. The results show that the proposed method provides
significantly improved detection and segmentation accuracy compared to
state-of-the-art and benchmark algorithms. Finally, we use a previously
described test-time drop-out strategy to obtain uncertainty estimates on our
predictions and validate these estimates by demonstrating that they are
strongly correlated with accuracy.
This paper presents our approach to the One-Minute Gradual-Emotion
Recognition (OMG-Emotion) Challenge, focusing on dimensional emotion
recognition through visual analysis of the provided emotion videos. The
approach is based on a Convolutional and Recurrent (CNN-RNN) deep neural
architecture we have developed for the relevant large AffWild Emotion Database.
We extended and adapted this architecture, by letting a combination of multiple
features generated in the CNN component be explored by RNN subnets. Our target
has been to obtain the best performance on the OMG-Emotion visual validation
data set, while training on the respective visual training data set. Extensive
experimentation has led to the best architectures for estimating the values
of the valence and arousal emotion dimensions over these data sets.
Automatic understanding of human affect using visual signals is of great
importance in everyday human-machine interactions. Appraising human emotional
states, behaviors and reactions displayed in real-world settings, can be
accomplished using latent continuous dimensions (e.g., the circumplex model of
affect). Valence (i.e., how positive or negative is an emotion) and arousal
(i.e., power of the activation of the emotion) constitute the most popular and
effective affect representations. Nevertheless, the majority of collected
datasets thus far, although containing naturalistic emotional states, have been
captured in highly controlled recording conditions. In this paper, we introduce
the Aff-Wild benchmark for training and evaluating affect recognition
algorithms. We also report on the results of the First Affect-in-the-wild
Challenge (Aff-Wild Challenge) that was recently organized on the Aff-Wild
database, and was the first ever challenge on the estimation of valence and
arousal in-the-wild. Furthermore, we design and extensively train an end-to-end
deep neural architecture which performs prediction of continuous emotion
dimensions based on visual cues. The proposed deep learning architecture,
AffWildNet, includes convolutional and recurrent neural network (CNN-RNN)
layers, exploiting the invariant properties of convolutional features, while
also modeling temporal dynamics that arise in human behavior via the recurrent
layers. The AffWildNet produced state-of-the-art results on the Aff-Wild
Challenge. We then exploit the AffWild database for learning features, which
can be used as priors to achieve the best performance in both dimensional and
categorical emotion recognition on the RECOLA, AFEW-VA and EmotiW 2017
datasets, compared to all other methods designed for the same goal.
Quantitative visualization of shock-induced complex flow field emanating from
the open end of a miniaturized hand-driven shock tube (Reddy tube) is
presented. During operation, the planar shock wave of Mach number Mi=1.3 is
discharged through the low-pressure driven-section, kept open to ambient
atmosphere. From the moment of shock discharge, the evolving flow field that
follows is recorded quantitatively for 300 $\mu$s near the exit of the tube
using our newly developed in-house high-resolution (16 Mpixel) wavefront
measuring camera setup.
Extracting features from a huge amount of data for object recognition is a
challenging task. Convolutional neural networks can meet this challenge, but
they often require a large amount of computational resources. In this paper, a
computation-efficient convolutional module, named SdcBlock, is proposed and
based on it, the convolution network SdcNet is introduced for object
recognition tasks. In the proposed module, optimized successive depthwise
convolutions supported by appropriate data management are applied in order to
generate vectors containing denser and more varied feature
information. The hyperparameters can be easily adjusted to suit a variety of
tasks under different computation restrictions without significantly
jeopardizing the performance. The experiments have shown that SdcNet achieved
an error rate of 5.60% on CIFAR-10 with only 55M FLOPs and further reduced
the error rate to 5.24% using a moderate 103M FLOPs. The expected
computation efficiency of the SdcNet has been confirmed.
$L_1$ regularization is used for finding sparse solutions to an
underdetermined linear system. As sparse signals are widely expected in remote
sensing, this type of regularization scheme and its extensions have been widely
employed in many remote sensing problems, such as image fusion, target
detection, image super-resolution, and others and have led to promising
results. However, solving such sparse reconstruction problems is
computationally expensive and has limitations in its practical use. In this
paper, we propose a novel efficient algorithm for solving the complex-valued
$L_1$ regularized least squares problem. Taking the high-dimensional
tomographic synthetic aperture radar (TomoSAR) as a practical example, we
carried out extensive experiments, both with simulation data and real data, to
demonstrate that the proposed approach can retain the accuracy of second order
methods while dramatically speeding up the processing by one or two orders of
magnitude. Although we have chosen TomoSAR as the example, the proposed method
can be generally applied to any spectral estimation problem.
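To make the complex-valued $L_1$ regularized least squares problem concrete, the sketch below solves min_x 0.5*||Ax - b||_2^2 + lam*||x||_1 with plain ISTA (iterative soft-thresholding). ISTA is a textbook first-order baseline swapped in for illustration; it is not the algorithm proposed in the paper, and the function names, step-size rule, and iteration count are assumptions.

```python
def soft_threshold(z, tau):
    """Complex soft-thresholding: shrink the magnitude by tau, keep the phase."""
    mag = abs(z)
    return 0j if mag <= tau else z * (1.0 - tau / mag)

def ista_complex_l1(A, b, lam, iterations=500):
    """ISTA for 0.5*||Ax - b||^2 + lam*||x||_1; A is a nonzero list of rows."""
    m, n = len(A), len(A[0])
    # The squared Frobenius norm upper-bounds the Lipschitz constant ||A||_2^2,
    # so 1/lipschitz is a safe (if conservative) step size.
    lipschitz = sum(abs(A[i][j]) ** 2 for i in range(m) for j in range(n))
    step = 1.0 / lipschitz
    x = [0j] * n
    for _ in range(iterations):
        residual = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        # Gradient of the data term: A^H (Ax - b).
        grad = [sum(A[i][j].conjugate() * residual[i] for i in range(m))
                for j in range(n)]
        x = [soft_threshold(x[j] - step * grad[j], step * lam) for j in range(n)]
    return x
```

When A is the identity, the minimizer is the soft-thresholded measurement vector, which gives a simple sanity check. Small entries are set exactly to zero, which is the sparsity-promoting effect of the $L_1$ term.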
The integration of information across multiple modalities and across time is
a promising way to enhance the emotion recognition performance of affective
systems. Much previous work has focused on instantaneous emotion recognition.
The 2018 One-Minute Gradual-Emotion Recognition (OMG-Emotion) challenge, which
was held in conjunction with the IEEE World Congress on Computational
Intelligence, encouraged participants to address long-term emotion recognition
by integrating cues from multiple modalities, including facial expression,
audio and language. Intuitively, a multi-modal inference network should be able
to leverage information from each modality and their correlations to improve
recognition over that achievable by a single modality network. We describe here
a multi-modal neural architecture that integrates visual information over time
using an LSTM, and combines it with utterance level audio and text cues to
recognize human sentiment from multimodal clips. Our model outperforms the
unimodal baseline, achieving concordance correlation coefficients (CCC) of
0.400 on the arousal task, and 0.353 on the valence task.
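For readers unfamiliar with the metric, the concordance correlation coefficient can be computed as below. This follows the standard definition with population (1/n) variances; the function name is an assumption, and the challenge's exact evaluation script may differ in detail.

```python
def concordance_correlation(predictions, targets):
    """CCC = 2*cov / (var_p + var_t + (mean_p - mean_t)^2)."""
    n = len(predictions)
    if n == 0 or n != len(targets):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_p = sum(predictions) / n
    mean_t = sum(targets) / n
    var_p = sum((p - mean_p) ** 2 for p in predictions) / n
    var_t = sum((t - mean_t) ** 2 for t in targets) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predictions, targets)) / n
    # Undefined (division by zero) when both inputs are the same constant.
    return 2.0 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)
```

Unlike Pearson correlation, CCC also penalizes a constant offset or scale mismatch between predictions and annotations, so it reaches 1 only when the two sequences agree exactly.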
In the area of magnetic resonance imaging (MRI), an extensive range of
non-linear reconstruction algorithms have been proposed that can be used with
general Fourier subsampling patterns. However, the design of these subsampling
patterns has typically been considered in isolation from the reconstruction
rule and the anatomy under consideration. In this paper, we propose a
learning-based framework for optimizing MRI subsampling patterns for a specific
reconstruction rule and anatomy, considering both the noiseless and noisy
settings. Our learning algorithm has access to a representative set of training
signals, and searches for a sampling pattern that performs well on average for
the signals in this set. We present a novel parameter-free greedy mask
selection method, and show it to be effective for a variety of reconstruction
rules and performance metrics. Moreover we also support our numerical findings
by providing a rigorous justification of our framework via statistical learning
Feature extraction from infrared (IR) images remains a challenging task.
Learning based methods that can work on raw imagery/patches have therefore
assumed significance. We propose a novel multi-task extension of the widely
used sparse-representation-classification (SRC) method in both single and
multi-view set-ups. That is, the test sample could be a single IR image or
images from different views. When expanded in terms of a training dictionary,
the coefficient matrix in a multi-view scenario admits a sparse structure that
is not easily captured by traditional sparsity-inducing measures such as the
$l_0$-row pseudo norm. To that end, we employ collaborative spike and slab
priors on the coefficient matrix, which can capture fairly general sparse
structures. Our work involves joint parameter and sparse coefficient estimation
(JPCEM) which alleviates the need to handpick prior parameters before
classification. The experimental merits of JPCEM are substantiated through
comparisons with other state-of-the-art methods on a challenging mid-wave IR image
(MWIR) ATR database made available by the US Army Night Vision and Electronic
Sensors Directorate.
One of the most interesting challenges in Artificial Intelligence is to train
conditional generators which are able to provide labeled fake samples drawn
from a specific distribution. In this work, a new framework is presented to
train a deep conditional generator by placing a classifier in parallel with the
discriminator and back-propagating the classification error through the
generator network. The method is versatile, is applicable to any variation of
Generative Adversarial Network (GAN) implementation, and gives
superior results compared to similar methods.
We describe an end-to-end trainable model for image compression based on
variational autoencoders. The model incorporates a hyperprior to effectively
capture spatial dependencies in the latent representation. This hyperprior
relates to side information, a concept universal to virtually all modern image
codecs, but largely unexplored in image compression using artificial neural
networks (ANNs). Unlike existing autoencoder compression methods, our model
trains a complex prior jointly with the underlying autoencoder. We demonstrate
that this model leads to state-of-the-art image compression when measuring
visual quality using the popular MS-SSIM index, and yields rate-distortion
performance surpassing published ANN-based methods when evaluated using a more
traditional metric based on squared error (PSNR). Furthermore, we provide a
qualitative comparison of models trained for different distortion metrics.
Spectral variability is one of the major issues when conducting hyperspectral
unmixing. Within a given image composed of some elementary materials (herein
referred to as endmember classes), the spectral signature characterizing these
classes may spatially vary due to intrinsic component fluctuations or external
factors (illumination). These redundant multiple endmember spectra within each
class adversely affect the performance of unmixing methods. This paper proposes
a mixing model that explicitly incorporates a hierarchical structure of
redundant multiple spectra representing each class. The proposed method is
designed to promote sparsity on the selection of both spectra and classes
within each pixel. The resulting unmixing algorithm is able to adaptively
recover several bundles of endmember spectra associated with each class and
robustly estimate abundances. In addition, its flexibility allows a variable
number of classes to be present within each pixel of the hyperspectral image to
be unmixed. The proposed method is compared with other state-of-the-art
unmixing methods that incorporate sparsity using both simulated and real
hyperspectral data. The results show that the proposed method can successfully
determine the variable number of classes present within each pixel and estimate
the corresponding class abundances.