
Combining deep neural networks with structured logic rules is desirable to
harness flexibility and reduce uninterpretability of the neural models. We
propose a general framework capable of enhancing various types of neural
networks (e.g., CNNs and RNNs) with declarative firstorder logic rules.
Specifically, we develop an iterative distillation method that transfers the
structured information of logic rules into the weights of neural networks. We
deploy the framework on a CNN for sentiment analysis, and an RNN for named
entity recognition. With a few highly intuitive rules, we obtain substantial
improvements and achieve stateoftheart or comparable results to previous
bestperforming systems.

McKernel introduces a framework to use kernel approximates in the minibatch
setting with Stochastic Gradient Descent (SGD) as an alternative to Deep
Learning. Based on Random Kitchen Sinks [Rahimi and Recht 2007], we provide a
C++ library for Largescale Machine Learning. It contains a CPU optimized
implementation of the algorithm in [Le et al. 2013], that allows the
computation of approximated kernel expansions in loglinear time. The algorithm
requires to compute the product of matrices Walsh Hadamard. A cache friendly
Fast Walsh Hadamard that achieves compelling speed and outperforms current
stateoftheart methods has been developed. McKernel establishes the
foundation of a new architecture of learning that allows to obtain largescale
nonlinear classification combining lightning kernel expansions and a linear
classifier. It travails in the minibatch setting working analogously to Neural
Networks. We show the validity of our method through extensive experiments on
MNIST and FASHION MNIST [Xiao et al. 2017].

This paper is a survey and an analysis of different ways of using deep
learning (deep artificial neural networks) to generate musical content. We
propose a methodology based on five dimensions for our analysis:
Objective  What musical content is to be generated? Examples are: melody,
polyphony, accompaniment or counterpoint.  For what destination and for what
use? To be performed by a human(s) (in the case of a musical score), or by a
machine (in the case of an audio file).
Representation  What are the concepts to be manipulated? Examples are:
waveform, spectrogram, note, chord, meter and beat.  What format is to be
used? Examples are: MIDI, piano roll or text.  How will the representation be
encoded? Examples are: scalar, onehot or manyhot.
Architecture  What type(s) of deep neural network is (are) to be used?
Examples are: feedforward network, recurrent network, autoencoder or generative
adversarial networks.
Challenge  What are the limitations and open challenges? Examples are:
variability, interactivity and creativity.
Strategy  How do we model and control the process of generation? Examples
are: singlestep feedforward, iterative feedforward, sampling or input
manipulation.
For each dimension, we conduct a comparative analysis of various models and
techniques and we propose some tentative multidimensional typology. This
typology is bottomup, based on the analysis of many existing deeplearning
based systems for music generation selected from the relevant literature. These
systems are described and are used to exemplify the various choices of
objective, representation, architecture, challenge and strategy. The last
section includes some discussion and some prospects.

Latent Dirichlet allocation (LDA) is useful in document analysis, image
processing, and many information systems; however, its generalization
performance has been left unknown because it is a singular learning machine to
which regular statistical theory can not be applied.
Stochastic matrix factorization (SMF) is a restricted matrix factorization in
which matrix factors are stochastic; the column of the matrix is in a simplex.
SMF is being applied to image recognition and text mining. We can understand
SMF as a statistical model by which a stochastic matrix of given data is
represented by a product of two stochastic matrices, whose generalization
performance has also been left unknown because of nonregularity.
In this paper, by using an algebraic and geometric method, we show the
analytic equivalence of LDA and SMF, both of which have the same real log
canonical threshold (RLCT), resulting in that they asymptotically have the same
Bayesian generalization error and the same log marginal likelihood. Moreover,
we derive the upper bound of the RLCT and prove that it is smaller than the
dimension of the parameter divided by two, hence the Bayesian generalization
errors of them are smaller than those of regular statistical models.

Background. A large number of algorithms is being developed to reconstruct
evolutionary models of individual tumours from genome sequencing data. Most
methods can analyze multiple samples collected either through bulk multiregion
sequencing experiments or the sequencing of individual cancer cells. However,
rarely the same method can support both data types.
Results. We introduce TRaIT, a computational framework to infer mutational
graphs that model the accumulation of multiple types of somatic alterations
driving tumour evolution. Compared to other tools, TRaIT supports multiregion
and singlecell sequencing data within the same statistical framework, and
delivers expressive models that capture many complex evolutionary phenomena.
TRaIT improves accuracy, robustness to dataspecific errors and computational
complexity compared to competing methods.
Conclusions. We show that the application of TRaIT to singlecell and
multiregion cancer datasets can produce accurate and reliable models of
singletumour evolution, quantify the extent of intratumour heterogeneity and
generate new testable experimental hypotheses.

Distributed machine learning algorithms enable learning of models from
datasets that are distributed over a network without gathering the data at a
centralized location. While efficient distributed algorithms have been
developed under the assumption of faultless networks, failures that can render
these algorithms nonfunctional occur frequently in the real world. This paper
focuses on the problem of Byzantine failures, which are the hardest to
safeguard against in distributed algorithms. While Byzantine fault tolerance
has a rich history, existing work does not translate into efficient and
practical algorithms for highdimensional learning in fully distributed (also
known as decentralized) settings. In this paper, an algorithm termed
Byzantineresilient distributed coordinate descent (ByRDiE) is developed and
analyzed that enables distributed learning in the presence of Byzantine
failures. Theoretical analysis (convex settings) and numerical experiments
(convex and nonconvex settings) highlight its usefulness for highdimensional
distributed learning in the presence of Byzantine failures.

We discuss a multipleplay multiarmed bandit (MAB) problem in which several
arms are selected at each round. Recently, Thompson sampling (TS), a randomized
algorithm with a Bayesian spirit, has attracted much attention for its
empirically excellent performance, and it is revealed to have an optimal regret
bound in the standard singleplay MAB problem. In this paper, we propose the
multipleplay Thompson sampling (MPTS) algorithm, an extension of TS to the
multipleplay MAB problem, and discuss its regret analysis. We prove that MPTS
for binary rewards has the optimal regret upper bound that matches the regret
lower bound provided by Anantharam et al. (1987). Therefore, MPTS is the first
computationally efficient algorithm with optimal regret. A set of computer
simulations was also conducted, which compared MPTS with stateoftheart
algorithms. We also propose a modification of MPTS, which is shown to have
better empirical performance.

In school, a teacher plays an important role in various classroom teaching
patterns. Likewise to this human learning activity, the learning using
privileged information (LUPI) paradigm provides additional information
generated by the teacher to 'teach' learning models during the training stage.
Therefore, this novel learning paradigm is a typical TeacherStudent
Interaction mechanism. This paper is the first to present a random vector
functional link network based on the LUPI paradigm, called RVFL+. Rather than
simply combining two existing approaches, the newlyderived RVFL+ fills the gap
between classical randomized neural networks and the newfashioned LUPI
paradigm, which offers an alternative way to train RVFL networks. Moreover, the
proposed RVFL+ can perform in conjunction with the kernel trick for highly
complicated nonlinear feature learning, which is termed KRVFL+. Furthermore,
the statistical property of the proposed RVFL+ is investigated, and we present
a sharp and highquality generalization error bound based on the Rademacher
complexity. Competitive experimental results on 14 realworld datasets
illustrate the great effectiveness and efficiency of the novel RVFL+ and
KRVFL+, which can achieve better generalization performance than
stateoftheart methods.

Consider the multivariate nonparametric regression model. It is shown that
estimators based on sparsely connected deep neural networks with ReLU
activation function and properly chosen network architecture achieve the
minimax rates of convergence (up to $\log n$factors) under a general
composition assumption on the regression function. The framework includes many
wellstudied structural constraints such as (generalized) additive models.
While there is a lot of flexibility in the network architecture, the tuning
parameter is the sparsity of the network. Specifically, we consider large
networks with number of potential network parameters exceeding the sample size.
The analysis gives some insights into why multilayer feedforward neural
networks perform well in practice. Interestingly, for ReLU activation function
the depth (number of layers) of the neural network architectures plays an
important role and our theory suggests that for nonparametric regression,
scaling the network depth with the sample size is natural. It is also shown
that under the composition assumption wavelet estimators can only achieve
suboptimal rates.

We analyze the learning properties of the stochastic gradient method when
multiple passes over the data and minibatches are allowed. We study how
regularization properties are controlled by the stepsize, the number of passes
and the minibatch size. In particular, we consider the square loss and show
that for a universal stepsize choice, the number of passes acts as a
regularization parameter, and optimal finite sample bounds can be achieved by
earlystopping. Moreover, we show that larger stepsizes are allowed when
considering minibatches. Our analysis is based on a unifying approach,
encompassing both batch and stochastic gradient methods as special cases. As a
byproduct, we derive optimal convergence results for batch gradient methods
(even in the nonattainable cases).

We study highdimensional distribution learning in an agnostic setting where
an adversary is allowed to arbitrarily corrupt an $\varepsilon$fraction of the
samples. Such questions have a rich history spanning statistics, machine
learning and theoretical computer science. Even in the most basic settings, the
only known approaches are either computationally inefficient or lose
dimensiondependent factors in their error guarantees. This raises the
following question:Is highdimensional agnostic distribution learning even
possible, algorithmically?
In this work, we obtain the first computationally efficient algorithms with
dimensionindependent error guarantees for agnostically learning several
fundamental classes of highdimensional distributions: (1) a single Gaussian,
(2) a product distribution on the hypercube, (3) mixtures of two product
distributions (under a natural balancedness condition), and (4) mixtures of
spherical Gaussians. Our algorithms achieve error that is independent of the
dimension, and in many cases scales nearlylinearly with the fraction of
adversarially corrupted samples. Moreover, we develop a general recipe for
detecting and correcting corruptions in highdimensions, that may be applicable
to many other problems.

In many scientific and engineering applications, we are tasked with the
maximisation of an expensive to evaluate black box function $f$. Traditional
settings for this problem assume just the availability of this single function.
However, in many cases, cheap approximations to $f$ may be obtainable. For
example, the expensive real world behaviour of a robot can be approximated by a
cheap computer simulation. We can use these approximations to eliminate low
function value regions cheaply and use the expensive evaluations of $f$ in a
small but promising region and speedily identify the optimum. We formalise this
task as a \emph{multifidelity} bandit problem where the target function and
its approximations are sampled from a Gaussian process. We develop MFGPUCB, a
novel method based on upper confidence bound techniques. In our theoretical
analysis we demonstrate that it exhibits precisely the above behaviour, and
achieves better regret than strategies which ignore multifidelity information.
Empirically, MFGPUCB outperforms such naive strategies and other
multifidelity methods on several synthetic and real experiments.

How should a firm allocate its limited interviewing resources to select the
optimal cohort of new employees from a large set of job applicants? How should
that firm allocate cheap but noisy resume screenings and expensive but indepth
inperson interviews? We view this problem through the lens of combinatorial
pure exploration (CPE) in the multiarmed bandit setting, where a central
learning agent performs costly exploration of a set of arms before selecting a
final subset with some combinatorial structure. We generalize a recent CPE
algorithm to the setting where arm pulls can have different costs and return
different levels of information. We then prove theoretical upper bounds for a
general class of armpulling strategies in this new setting. We apply our
general algorithm to a realworld problem with combinatorial structure:
incorporating diversity into university admissions. We take real data from
admissions at one of the largest USbased computer science graduate programs
and show that a simulation of our algorithm produces a cohort with hiring
overall utility while spending comparable budget to the current admissions
process at that university.

As one of the most important paradigms of recurrent neural networks, the echo
state network (ESN) has been applied to a wide range of fields, from robotics
to medicine, finance, and language processing. A key feature of the ESN
paradigm is its reservoir  a directed and weighted network of neurons that
projects the input time series into a high dimensional space where linear
regression or classification can be applied. Despite extensive studies, the
impact of the reservoir network on the ESN performance remains unclear.
Combining tools from physics, dynamical systems and network science, we attempt
to open the black box of ESN and offer insights to understand the behavior of
general artificial neural networks. Through spectral analysis of the reservoir
network we reveal a key factor that largely determines the ESN memory capacity
and hence affects its performance. Moreover, we find that adding short loops to
the reservoir network can tailor ESN for specific tasks and optimize learning.
We validate our findings by applying ESN to forecast both synthetic and real
benchmark time series. Our results provide a new way to design taskspecific
ESN. More importantly, it demonstrates the power of combining tools from
physics, dynamical systems and network science to offer new insights in
understanding the mechanisms of general artificial neural networks.

A defining feature of samplingbased motion planning is the reliance on an
implicit representation of the state space, which is enabled by a set of
probing samples. Traditionally, these samples are drawn either
probabilistically or deterministically to uniformly cover the state space. Yet,
the motion of many robotic systems is often restricted to "small" regions of
the state space, due to, for example, differential constraints or
collisionavoidance constraints. To accelerate the planning process, it is thus
desirable to devise nonuniform sampling strategies that favor sampling in
those regions where an optimal solution might lie. This paper proposes a
methodology for nonuniform sampling, whereby a sampling distribution is
learned from demonstrations, and then used to bias sampling. The sampling
distribution is computed through a conditional variational autoencoder,
allowing sample generation from the latent space conditioned on the specific
planning problem. This methodology is general, can be used in combination with
any samplingbased planner, and can effectively exploit the underlying
structure of a planning problem while maintaining the theoretical guarantees of
samplingbased approaches. Specifically, on several planning problems, the
proposed methodology is shown to effectively learn representations for the
relevant regions of the state space, resulting in an order of magnitude
improvement in terms of success rate and convergence to the optimal cost.

Machine learning (ML) techniques are increasingly common in security
applications, such as malware and intrusion detection. However, ML models are
often susceptible to evasion attacks, in which an adversary makes changes to
the input (such as malware) in order to avoid being detected. A conventional
approach to evaluate ML robustness to such attacks, as well as to design robust
ML, is by considering simplified featurespace models of attacks, where the
attacker changes ML features directly to effect evasion, while minimizing or
constraining the magnitude of this change. We investigate the effectiveness of
this approach to designing robust ML in the face of attacks that can be
realized in actual malware (realizable attacks). We demonstrate that in the
context of structurebased PDF malware detection, such techniques appear to
have limited effectiveness, but they are effective with contentbased
detectors. In either case, we show that augmenting the feature space models
with conserved features (those that cannot be unilaterally modified without
compromising malicious functionality) significantly improves performance.
Finally, we show that feature space models enable generalized robustness when
faced with a variety of realizable attacks, as compared to classifiers which
are tuned to be robust to a specific realizable attack.

We investigate metric learning in the context of dynamic time warping (DTW),
the by far most popular dissimilarity measure used for the comparison and
analysis of motion capture data. While metric learning enables a
problemadapted representation of data, the majority of methods has been
proposed for vectorial data only. In this contribution, we extend the popular
principle offered by the large margin nearest neighbors learner (LMNN) to DTW
by treating the resulting componentwise dissimilarity values as features. We
demonstrate that this principle greatly enhances the classification accuracy in
several benchmarks. Further, we show that recent auxiliary concepts such as
metric regularization can be transferred from the vectorial case to
componentwise DTW in a similar way. We illustrate that metric regularization
constitutes a crucial prerequisite for the interpretation of the resulting
relevance profiles.

Deep learningbased techniques have achieved stateoftheart performance on
a wide variety of recognition and classification tasks. However, these networks
are typically computationally expensive to train, requiring weeks of
computation on many GPUs; as a result, many users outsource the training
procedure to the cloud or rely on pretrained models that are then finetuned
for a specific task. In this paper we show that outsourced training introduces
new security risks: an adversary can create a maliciously trained network (a
backdoored neural network, or a \emph{BadNet}) that has stateoftheart
performance on the user's training and validation samples, but behaves badly on
specific attackerchosen inputs. We first explore the properties of BadNets in
a toy example, by creating a backdoored handwritten digit classifier. Next, we
demonstrate backdoors in a more realistic scenario by creating a U.S. street
sign classifier that identifies stop signs as speed limits when a special
sticker is added to the stop sign; we then show in addition that the backdoor
in our US street sign detector can persist even if the network is later
retrained for another task and cause a drop in accuracy of {25}\% on average
when the backdoor trigger is present. These results demonstrate that backdoors
in neural networks are both powerful andbecause the behavior of neural
networks is difficult to explicatestealthy. This work provides motivation
for further research into techniques for verifying and inspecting neural
networks, just as we have developed tools for verifying and debugging software.

The key challenge in multiagent learning is learning a best response to the
behaviour of other agents, which may be nonstationary: if the other agents
adapt their strategy as well, the learning target moves. Disparate streams of
research have approached nonstationarity from several angles, which make a
variety of implicit assumptions that make it hard to keep an overview of the
state of the art and to validate the innovation and significance of new works.
This survey presents a coherent overview of work that addresses
opponentinduced nonstationarity with tools from game theory, reinforcement
learning and multiarmed bandits. Further, we reflect on the principle
approaches how algorithms model and cope with this nonstationarity, arriving
at a new framework and five categories (in increasing order of sophistication):
ignore, forget, respond to target models, learn models, and theory of mind. A
wide range of stateoftheart algorithms is classified into a taxonomy, using
these categories and key characteristics of the environment (e.g.,
observability) and adaptation behaviour of the opponents (e.g., smooth,
abrupt). To clarify even further we present illustrative variations of one
domain, contrasting the strengths and limitations of each category. Finally, we
discuss in which environments the different approaches yield most merit, and
point to promising avenues of future research.

We propose a simple yet effective technique for neural network learning. The
forward propagation is computed as usual. In back propagation, only a small
subset of the full gradient is computed to update the model parameters. The
gradient vectors are sparsified in such a way that only the top$k$ elements
(in terms of magnitude) are kept. As a result, only $k$ rows or columns
(depending on the layout) of the weight matrix are modified, leading to a
linear reduction ($k$ divided by the vector dimension) in the computational
cost. Surprisingly, experimental results demonstrate that we can update only
14% of the weights at each back propagation pass. This does not result in a
larger number of training iterations. More interestingly, the accuracy of the
resulting models is actually improved rather than degraded, and a detailed
analysis is given. The code is available at https://github.com/lancopku/meProp

Inferring the relations between two images is an important class of tasks in
computer vision. Examples of such tasks include computing optical flow and
stereo disparity. We treat the relation inference tasks as a machine learning
problem and tackle it with neural networks. A key to the problem is learning a
representation of relations. We propose a new neural network module, contrast
association unit (CAU), which explicitly models the relations between two sets
of input variables. Due to the nonnegativity of the weights in CAU, we adopt a
multiplicative update algorithm for learning these weights. Experiments show
that neural networks with CAUs are more effective in learning five fundamental
image transformations than conventional neural networks.

Bipartite ranking is a fundamental machine learning and data mining problem.
It commonly concerns the maximization of the AUC metric. Recently, a number of
studies have proposed online bipartite ranking algorithms to learn from massive
streams of classimbalanced data. These methods suggest both linear and
kernelbased bipartite ranking algorithms based on first and secondorder
online learning. Unlike kernelized ranker, linear ranker is more scalable
learning algorithm. The existing linear online bipartite ranking algorithms
lack either handling nonseparable data or constructing adaptive large margin.
These limitations yield unreliable bipartite ranking performance. In this work,
we propose a linear online confidenceweighted bipartite ranking algorithm
(CBR) that adopts soft confidenceweighted learning. The proposed algorithm
leverages the same properties of soft confidenceweighted learning in a
framework for bipartite ranking. We also develop a diagonal variation of the
proposed confidenceweighted bipartite ranking algorithm to deal with
highdimensional data by maintaining only the diagonal elements of the
covariance matrix. We empirically evaluate the effectiveness of the proposed
algorithms on several benchmark and highdimensional datasets. The experimental
results validate the reliability of the proposed algorithms. The results also
show that our algorithms outperform or are at least comparable to the competing
online AUC maximization methods.

We study stochastic optimization of nonconvex loss functions, which are
typical objectives for training neural networks. We propose stochastic
approximation algorithms which optimize a series of regularized, nonlinearized
losses on large minibatches of samples, using only firstorder gradient
information. Our algorithms provably converge to an approximate critical point
of the expected objective with faster rates than minibatch stochastic gradient
descent, and facilitate better parallelization by allowing larger minibatches.

Due to the size of data and the limited data storage space in a single local
computer, data can often be stored in a distributed manner. In order to use the
distributed big data in machine learning, performing largescale machine
learning from the distributed data through communication networks is
inevitable. In this paper, we investigate the impact of network communication
constraints on the convergence speed of distributed machine learning
optimization algorithms. Firstly, we study the convergence rate of the
distributed dual coordinate ascent in a general tree structured network, since
every connected communication network can have a spanning tree, and a tree
network can be understood as the generalization of a star network. Secondly, by
considering network communication delays, we optimize the networkconstrained
dual coordinate ascent to maximize its convergence speed in terms of operation
time. Through numerical experiments, we demonstrate that under different
network communication delays, the delaydependent number of local and global
iterations in distributed dual coordinated ascent can play a significant role
in the achievement of maximum convergence speed.

Ensembling multiple predictions is a widely used technique for improving the
accuracy of various machine learning tasks. One obvious drawback of ensembling
is its higher execution cost during inference. In this paper, we first describe
our insights on the relationship between the probability of prediction and the
effect of ensembling with current deep neural networks; ensembling does not
help mispredictions for inputs predicted with a high probability even when
there is a nonnegligible number of mispredicted inputs. This finding motivated
us to develop a way to adaptively control the ensembling. If the prediction for
an input reaches a high enough probability, i.e., the output from the softmax
function, on the basis of the confidence level, we stop ensembling for this
input to avoid wasting computation power. We evaluated the adaptive ensembling
by using various datasets and showed that it reduces the computation cost
significantly while achieving accuracy similar to that of static ensembling
using a predefined number of local predictions. We also show that our
statistically rigorous confidencelevelbased earlyexit condition reduces the
burden of taskdependent threshold tuning better compared with naive early exit
based on a predefined threshold in addition to yielding a better accuracy with
the same cost.