-
Combining deep neural networks with structured logic rules is desirable to
harness flexibility and reduce uninterpretability of the neural models. We
propose a general framework capable of enhancing various types of neural
networks (e.g., CNNs and RNNs) with declarative first-order logic rules.
Specifically, we develop an iterative distillation method that transfers the
structured information of logic rules into the weights of neural networks. We
deploy the framework on a CNN for sentiment analysis, and an RNN for named
entity recognition. With a few highly intuitive rules, we obtain substantial
improvements and achieve state-of-the-art or comparable results to previous
best-performing systems.
-
McKernel introduces a framework to use kernel approximates in the mini-batch
setting with Stochastic Gradient Descent (SGD) as an alternative to Deep
Learning. Based on Random Kitchen Sinks [Rahimi and Recht 2007], we provide a
C++ library for Large-scale Machine Learning. It contains a CPU optimized
implementation of the algorithm in [Le et al. 2013], that allows the
computation of approximated kernel expansions in log-linear time. The algorithm
requires to compute the product of matrices Walsh Hadamard. A cache friendly
Fast Walsh Hadamard that achieves compelling speed and outperforms current
state-of-the-art methods has been developed. McKernel establishes the
foundation of a new architecture of learning that allows to obtain large-scale
non-linear classification combining lightning kernel expansions and a linear
classifier. It travails in the mini-batch setting working analogously to Neural
Networks. We show the validity of our method through extensive experiments on
MNIST and FASHION MNIST [Xiao et al. 2017].
-
This paper is a survey and an analysis of different ways of using deep
learning (deep artificial neural networks) to generate musical content. We
propose a methodology based on five dimensions for our analysis:
Objective - What musical content is to be generated? Examples are: melody,
polyphony, accompaniment or counterpoint. - For what destination and for what
use? To be performed by a human(s) (in the case of a musical score), or by a
machine (in the case of an audio file).
Representation - What are the concepts to be manipulated? Examples are:
waveform, spectrogram, note, chord, meter and beat. - What format is to be
used? Examples are: MIDI, piano roll or text. - How will the representation be
encoded? Examples are: scalar, one-hot or many-hot.
Architecture - What type(s) of deep neural network is (are) to be used?
Examples are: feedforward network, recurrent network, autoencoder or generative
adversarial networks.
Challenge - What are the limitations and open challenges? Examples are:
variability, interactivity and creativity.
Strategy - How do we model and control the process of generation? Examples
are: single-step feedforward, iterative feedforward, sampling or input
manipulation.
For each dimension, we conduct a comparative analysis of various models and
techniques and we propose some tentative multidimensional typology. This
typology is bottom-up, based on the analysis of many existing deep-learning
based systems for music generation selected from the relevant literature. These
systems are described and are used to exemplify the various choices of
objective, representation, architecture, challenge and strategy. The last
section includes some discussion and some prospects.
-
Latent Dirichlet allocation (LDA) is useful in document analysis, image
processing, and many information systems; however, its generalization
performance has been left unknown because it is a singular learning machine to
which regular statistical theory can not be applied.
Stochastic matrix factorization (SMF) is a restricted matrix factorization in
which matrix factors are stochastic; the column of the matrix is in a simplex.
SMF is being applied to image recognition and text mining. We can understand
SMF as a statistical model by which a stochastic matrix of given data is
represented by a product of two stochastic matrices, whose generalization
performance has also been left unknown because of non-regularity.
In this paper, by using an algebraic and geometric method, we show the
analytic equivalence of LDA and SMF, both of which have the same real log
canonical threshold (RLCT), resulting in that they asymptotically have the same
Bayesian generalization error and the same log marginal likelihood. Moreover,
we derive the upper bound of the RLCT and prove that it is smaller than the
dimension of the parameter divided by two, hence the Bayesian generalization
errors of them are smaller than those of regular statistical models.
-
Background. A large number of algorithms is being developed to reconstruct
evolutionary models of individual tumours from genome sequencing data. Most
methods can analyze multiple samples collected either through bulk multi-region
sequencing experiments or the sequencing of individual cancer cells. However,
rarely the same method can support both data types.
Results. We introduce TRaIT, a computational framework to infer mutational
graphs that model the accumulation of multiple types of somatic alterations
driving tumour evolution. Compared to other tools, TRaIT supports multi-region
and single-cell sequencing data within the same statistical framework, and
delivers expressive models that capture many complex evolutionary phenomena.
TRaIT improves accuracy, robustness to data-specific errors and computational
complexity compared to competing methods.
Conclusions. We show that the application of TRaIT to single-cell and
multi-region cancer datasets can produce accurate and reliable models of
single-tumour evolution, quantify the extent of intra-tumour heterogeneity and
generate new testable experimental hypotheses.
-
Distributed machine learning algorithms enable learning of models from
datasets that are distributed over a network without gathering the data at a
centralized location. While efficient distributed algorithms have been
developed under the assumption of faultless networks, failures that can render
these algorithms nonfunctional occur frequently in the real world. This paper
focuses on the problem of Byzantine failures, which are the hardest to
safeguard against in distributed algorithms. While Byzantine fault tolerance
has a rich history, existing work does not translate into efficient and
practical algorithms for high-dimensional learning in fully distributed (also
known as decentralized) settings. In this paper, an algorithm termed
Byzantine-resilient distributed coordinate descent (ByRDiE) is developed and
analyzed that enables distributed learning in the presence of Byzantine
failures. Theoretical analysis (convex settings) and numerical experiments
(convex and nonconvex settings) highlight its usefulness for high-dimensional
distributed learning in the presence of Byzantine failures.
-
We discuss a multiple-play multi-armed bandit (MAB) problem in which several
arms are selected at each round. Recently, Thompson sampling (TS), a randomized
algorithm with a Bayesian spirit, has attracted much attention for its
empirically excellent performance, and it is revealed to have an optimal regret
bound in the standard single-play MAB problem. In this paper, we propose the
multiple-play Thompson sampling (MP-TS) algorithm, an extension of TS to the
multiple-play MAB problem, and discuss its regret analysis. We prove that MP-TS
for binary rewards has the optimal regret upper bound that matches the regret
lower bound provided by Anantharam et al. (1987). Therefore, MP-TS is the first
computationally efficient algorithm with optimal regret. A set of computer
simulations was also conducted, which compared MP-TS with state-of-the-art
algorithms. We also propose a modification of MP-TS, which is shown to have
better empirical performance.
-
In school, a teacher plays an important role in various classroom teaching
patterns. Likewise to this human learning activity, the learning using
privileged information (LUPI) paradigm provides additional information
generated by the teacher to 'teach' learning models during the training stage.
Therefore, this novel learning paradigm is a typical Teacher-Student
Interaction mechanism. This paper is the first to present a random vector
functional link network based on the LUPI paradigm, called RVFL+. Rather than
simply combining two existing approaches, the newly-derived RVFL+ fills the gap
between classical randomized neural networks and the newfashioned LUPI
paradigm, which offers an alternative way to train RVFL networks. Moreover, the
proposed RVFL+ can perform in conjunction with the kernel trick for highly
complicated nonlinear feature learning, which is termed KRVFL+. Furthermore,
the statistical property of the proposed RVFL+ is investigated, and we present
a sharp and high-quality generalization error bound based on the Rademacher
complexity. Competitive experimental results on 14 real-world datasets
illustrate the great effectiveness and efficiency of the novel RVFL+ and
KRVFL+, which can achieve better generalization performance than
state-of-the-art methods.
-
Consider the multivariate nonparametric regression model. It is shown that
estimators based on sparsely connected deep neural networks with ReLU
activation function and properly chosen network architecture achieve the
minimax rates of convergence (up to $\log n$-factors) under a general
composition assumption on the regression function. The framework includes many
well-studied structural constraints such as (generalized) additive models.
While there is a lot of flexibility in the network architecture, the tuning
parameter is the sparsity of the network. Specifically, we consider large
networks with number of potential network parameters exceeding the sample size.
The analysis gives some insights into why multilayer feedforward neural
networks perform well in practice. Interestingly, for ReLU activation function
the depth (number of layers) of the neural network architectures plays an
important role and our theory suggests that for nonparametric regression,
scaling the network depth with the sample size is natural. It is also shown
that under the composition assumption wavelet estimators can only achieve
suboptimal rates.
-
We analyze the learning properties of the stochastic gradient method when
multiple passes over the data and mini-batches are allowed. We study how
regularization properties are controlled by the step-size, the number of passes
and the mini-batch size. In particular, we consider the square loss and show
that for a universal step-size choice, the number of passes acts as a
regularization parameter, and optimal finite sample bounds can be achieved by
early-stopping. Moreover, we show that larger step-sizes are allowed when
considering mini-batches. Our analysis is based on a unifying approach,
encompassing both batch and stochastic gradient methods as special cases. As a
byproduct, we derive optimal convergence results for batch gradient methods
(even in the non-attainable cases).
-
We study high-dimensional distribution learning in an agnostic setting where
an adversary is allowed to arbitrarily corrupt an $\varepsilon$-fraction of the
samples. Such questions have a rich history spanning statistics, machine
learning and theoretical computer science. Even in the most basic settings, the
only known approaches are either computationally inefficient or lose
dimension-dependent factors in their error guarantees. This raises the
following question:Is high-dimensional agnostic distribution learning even
possible, algorithmically?
In this work, we obtain the first computationally efficient algorithms with
dimension-independent error guarantees for agnostically learning several
fundamental classes of high-dimensional distributions: (1) a single Gaussian,
(2) a product distribution on the hypercube, (3) mixtures of two product
distributions (under a natural balancedness condition), and (4) mixtures of
spherical Gaussians. Our algorithms achieve error that is independent of the
dimension, and in many cases scales nearly-linearly with the fraction of
adversarially corrupted samples. Moreover, we develop a general recipe for
detecting and correcting corruptions in high-dimensions, that may be applicable
to many other problems.
-
In many scientific and engineering applications, we are tasked with the
maximisation of an expensive to evaluate black box function $f$. Traditional
settings for this problem assume just the availability of this single function.
However, in many cases, cheap approximations to $f$ may be obtainable. For
example, the expensive real world behaviour of a robot can be approximated by a
cheap computer simulation. We can use these approximations to eliminate low
function value regions cheaply and use the expensive evaluations of $f$ in a
small but promising region and speedily identify the optimum. We formalise this
task as a \emph{multi-fidelity} bandit problem where the target function and
its approximations are sampled from a Gaussian process. We develop MF-GP-UCB, a
novel method based on upper confidence bound techniques. In our theoretical
analysis we demonstrate that it exhibits precisely the above behaviour, and
achieves better regret than strategies which ignore multi-fidelity information.
Empirically, MF-GP-UCB outperforms such naive strategies and other
multi-fidelity methods on several synthetic and real experiments.
-
How should a firm allocate its limited interviewing resources to select the
optimal cohort of new employees from a large set of job applicants? How should
that firm allocate cheap but noisy resume screenings and expensive but in-depth
in-person interviews? We view this problem through the lens of combinatorial
pure exploration (CPE) in the multi-armed bandit setting, where a central
learning agent performs costly exploration of a set of arms before selecting a
final subset with some combinatorial structure. We generalize a recent CPE
algorithm to the setting where arm pulls can have different costs and return
different levels of information. We then prove theoretical upper bounds for a
general class of arm-pulling strategies in this new setting. We apply our
general algorithm to a real-world problem with combinatorial structure:
incorporating diversity into university admissions. We take real data from
admissions at one of the largest US-based computer science graduate programs
and show that a simulation of our algorithm produces a cohort with hiring
overall utility while spending comparable budget to the current admissions
process at that university.
-
As one of the most important paradigms of recurrent neural networks, the echo
state network (ESN) has been applied to a wide range of fields, from robotics
to medicine, finance, and language processing. A key feature of the ESN
paradigm is its reservoir --- a directed and weighted network of neurons that
projects the input time series into a high dimensional space where linear
regression or classification can be applied. Despite extensive studies, the
impact of the reservoir network on the ESN performance remains unclear.
Combining tools from physics, dynamical systems and network science, we attempt
to open the black box of ESN and offer insights to understand the behavior of
general artificial neural networks. Through spectral analysis of the reservoir
network we reveal a key factor that largely determines the ESN memory capacity
and hence affects its performance. Moreover, we find that adding short loops to
the reservoir network can tailor ESN for specific tasks and optimize learning.
We validate our findings by applying ESN to forecast both synthetic and real
benchmark time series. Our results provide a new way to design task-specific
ESN. More importantly, it demonstrates the power of combining tools from
physics, dynamical systems and network science to offer new insights in
understanding the mechanisms of general artificial neural networks.
-
A defining feature of sampling-based motion planning is the reliance on an
implicit representation of the state space, which is enabled by a set of
probing samples. Traditionally, these samples are drawn either
probabilistically or deterministically to uniformly cover the state space. Yet,
the motion of many robotic systems is often restricted to "small" regions of
the state space, due to, for example, differential constraints or
collision-avoidance constraints. To accelerate the planning process, it is thus
desirable to devise non-uniform sampling strategies that favor sampling in
those regions where an optimal solution might lie. This paper proposes a
methodology for non-uniform sampling, whereby a sampling distribution is
learned from demonstrations, and then used to bias sampling. The sampling
distribution is computed through a conditional variational autoencoder,
allowing sample generation from the latent space conditioned on the specific
planning problem. This methodology is general, can be used in combination with
any sampling-based planner, and can effectively exploit the underlying
structure of a planning problem while maintaining the theoretical guarantees of
sampling-based approaches. Specifically, on several planning problems, the
proposed methodology is shown to effectively learn representations for the
relevant regions of the state space, resulting in an order of magnitude
improvement in terms of success rate and convergence to the optimal cost.
-
Machine learning (ML) techniques are increasingly common in security
applications, such as malware and intrusion detection. However, ML models are
often susceptible to evasion attacks, in which an adversary makes changes to
the input (such as malware) in order to avoid being detected. A conventional
approach to evaluate ML robustness to such attacks, as well as to design robust
ML, is by considering simplified feature-space models of attacks, where the
attacker changes ML features directly to effect evasion, while minimizing or
constraining the magnitude of this change. We investigate the effectiveness of
this approach to designing robust ML in the face of attacks that can be
realized in actual malware (realizable attacks). We demonstrate that in the
context of structure-based PDF malware detection, such techniques appear to
have limited effectiveness, but they are effective with content-based
detectors. In either case, we show that augmenting the feature space models
with conserved features (those that cannot be unilaterally modified without
compromising malicious functionality) significantly improves performance.
Finally, we show that feature space models enable generalized robustness when
faced with a variety of realizable attacks, as compared to classifiers which
are tuned to be robust to a specific realizable attack.
-
We investigate metric learning in the context of dynamic time warping (DTW),
the by far most popular dissimilarity measure used for the comparison and
analysis of motion capture data. While metric learning enables a
problem-adapted representation of data, the majority of methods has been
proposed for vectorial data only. In this contribution, we extend the popular
principle offered by the large margin nearest neighbors learner (LMNN) to DTW
by treating the resulting component-wise dissimilarity values as features. We
demonstrate that this principle greatly enhances the classification accuracy in
several benchmarks. Further, we show that recent auxiliary concepts such as
metric regularization can be transferred from the vectorial case to
component-wise DTW in a similar way. We illustrate that metric regularization
constitutes a crucial prerequisite for the interpretation of the resulting
relevance profiles.
-
Deep learning-based techniques have achieved state-of-the-art performance on
a wide variety of recognition and classification tasks. However, these networks
are typically computationally expensive to train, requiring weeks of
computation on many GPUs; as a result, many users outsource the training
procedure to the cloud or rely on pre-trained models that are then fine-tuned
for a specific task. In this paper we show that outsourced training introduces
new security risks: an adversary can create a maliciously trained network (a
backdoored neural network, or a \emph{BadNet}) that has state-of-the-art
performance on the user's training and validation samples, but behaves badly on
specific attacker-chosen inputs. We first explore the properties of BadNets in
a toy example, by creating a backdoored handwritten digit classifier. Next, we
demonstrate backdoors in a more realistic scenario by creating a U.S. street
sign classifier that identifies stop signs as speed limits when a special
sticker is added to the stop sign; we then show in addition that the backdoor
in our US street sign detector can persist even if the network is later
retrained for another task and cause a drop in accuracy of {25}\% on average
when the backdoor trigger is present. These results demonstrate that backdoors
in neural networks are both powerful and---because the behavior of neural
networks is difficult to explicate---stealthy. This work provides motivation
for further research into techniques for verifying and inspecting neural
networks, just as we have developed tools for verifying and debugging software.
-
The key challenge in multiagent learning is learning a best response to the
behaviour of other agents, which may be non-stationary: if the other agents
adapt their strategy as well, the learning target moves. Disparate streams of
research have approached non-stationarity from several angles, which make a
variety of implicit assumptions that make it hard to keep an overview of the
state of the art and to validate the innovation and significance of new works.
This survey presents a coherent overview of work that addresses
opponent-induced non-stationarity with tools from game theory, reinforcement
learning and multi-armed bandits. Further, we reflect on the principle
approaches how algorithms model and cope with this non-stationarity, arriving
at a new framework and five categories (in increasing order of sophistication):
ignore, forget, respond to target models, learn models, and theory of mind. A
wide range of state-of-the-art algorithms is classified into a taxonomy, using
these categories and key characteristics of the environment (e.g.,
observability) and adaptation behaviour of the opponents (e.g., smooth,
abrupt). To clarify even further we present illustrative variations of one
domain, contrasting the strengths and limitations of each category. Finally, we
discuss in which environments the different approaches yield most merit, and
point to promising avenues of future research.
-
We propose a simple yet effective technique for neural network learning. The
forward propagation is computed as usual. In back propagation, only a small
subset of the full gradient is computed to update the model parameters. The
gradient vectors are sparsified in such a way that only the top-$k$ elements
(in terms of magnitude) are kept. As a result, only $k$ rows or columns
(depending on the layout) of the weight matrix are modified, leading to a
linear reduction ($k$ divided by the vector dimension) in the computational
cost. Surprisingly, experimental results demonstrate that we can update only
1-4% of the weights at each back propagation pass. This does not result in a
larger number of training iterations. More interestingly, the accuracy of the
resulting models is actually improved rather than degraded, and a detailed
analysis is given. The code is available at https://github.com/lancopku/meProp
-
Inferring the relations between two images is an important class of tasks in
computer vision. Examples of such tasks include computing optical flow and
stereo disparity. We treat the relation inference tasks as a machine learning
problem and tackle it with neural networks. A key to the problem is learning a
representation of relations. We propose a new neural network module, contrast
association unit (CAU), which explicitly models the relations between two sets
of input variables. Due to the non-negativity of the weights in CAU, we adopt a
multiplicative update algorithm for learning these weights. Experiments show
that neural networks with CAUs are more effective in learning five fundamental
image transformations than conventional neural networks.
-
Bipartite ranking is a fundamental machine learning and data mining problem.
It commonly concerns the maximization of the AUC metric. Recently, a number of
studies have proposed online bipartite ranking algorithms to learn from massive
streams of class-imbalanced data. These methods suggest both linear and
kernel-based bipartite ranking algorithms based on first and second-order
online learning. Unlike kernelized ranker, linear ranker is more scalable
learning algorithm. The existing linear online bipartite ranking algorithms
lack either handling non-separable data or constructing adaptive large margin.
These limitations yield unreliable bipartite ranking performance. In this work,
we propose a linear online confidence-weighted bipartite ranking algorithm
(CBR) that adopts soft confidence-weighted learning. The proposed algorithm
leverages the same properties of soft confidence-weighted learning in a
framework for bipartite ranking. We also develop a diagonal variation of the
proposed confidence-weighted bipartite ranking algorithm to deal with
high-dimensional data by maintaining only the diagonal elements of the
covariance matrix. We empirically evaluate the effectiveness of the proposed
algorithms on several benchmark and high-dimensional datasets. The experimental
results validate the reliability of the proposed algorithms. The results also
show that our algorithms outperform or are at least comparable to the competing
online AUC maximization methods.
-
We study stochastic optimization of nonconvex loss functions, which are
typical objectives for training neural networks. We propose stochastic
approximation algorithms which optimize a series of regularized, nonlinearized
losses on large minibatches of samples, using only first-order gradient
information. Our algorithms provably converge to an approximate critical point
of the expected objective with faster rates than minibatch stochastic gradient
descent, and facilitate better parallelization by allowing larger minibatches.
-
Due to the big size of data and limited data storage volume of a single
computer or a single server, data are often stored in a distributed manner.
Thus, performing large-scale machine learning operations with the distributed
datasets through communication networks is often required. In this paper, we
study the convergence rate of the distributed dual coordinate ascent for
distributed machine learning problems in a general tree-structured network.
Since a tree network model can be understood as the generalization of a star
network model, our algorithm can be thought of as the generalization of the
distributed dual coordinate ascent in a star network model. We provide the
convergence rate of the distributed dual coordinate ascent over a general tree
network in a recursive manner and analyze the network effect on the convergence
rate. Secondly, by considering network communication delays, we optimize the
distributed dual coordinate ascent algorithm to maximize its convergence speed.
From our analytical result, we can choose the optimal number of local
iterations depending on the communication delay severity to achieve the fastest
convergence speed. In numerical experiments, we consider machine learning
scenarios over communication networks, where local workers cannot directly
reach to a central node due to constraints in communication, and demonstrate
that the usability of our distributed dual coordinate ascent algorithm in tree
networks. Additionally, we show that adapting number of local and global
iterations to network communication delays in the distributed dual coordinated
ascent algorithm can improve its convergence speed.
-
Ensembling multiple predictions is a widely used technique for improving the
accuracy of various machine learning tasks. One obvious drawback of ensembling
is its higher execution cost during inference. In this paper, we first describe
our insights on the relationship between the probability of prediction and the
effect of ensembling with current deep neural networks; ensembling does not
help mispredictions for inputs predicted with a high probability even when
there is a non-negligible number of mispredicted inputs. This finding motivated
us to develop a way to adaptively control the ensembling. If the prediction for
an input reaches a high enough probability, i.e., the output from the softmax
function, on the basis of the confidence level, we stop ensembling for this
input to avoid wasting computation power. We evaluated the adaptive ensembling
by using various datasets and showed that it reduces the computation cost
significantly while achieving accuracy similar to that of static ensembling
using a pre-defined number of local predictions. We also show that our
statistically rigorous confidence-level-based early-exit condition reduces the
burden of task-dependent threshold tuning better compared with naive early exit
based on a pre-defined threshold in addition to yielding a better accuracy with
the same cost.