
Most commonly used distributed machine learning systems are either
synchronous or centralized asynchronous. Synchronous algorithms like
AllReduceSGD perform poorly in a heterogeneous environment, while asynchronous
algorithms using a parameter server suffer from 1) communication bottleneck at
parameter servers when workers are many, and 2) significantly worse convergence
when the traffic to parameter server is congested. Can we design an algorithm
that is robust in a heterogeneous environment, while being communication
efficient and maintaining the bestpossible convergence rate? In this paper, we
propose an asynchronous decentralized stochastic gradient decent algorithm
(ADPSGD) satisfying all above expectations. Our theoretical analysis shows
ADPSGD converges at the optimal $O(1/\sqrt{K})$ rate as SGD and has linear
speedup w.r.t. number of workers. Empirically, ADPSGD outperforms the best of
decentralized parallel SGD (DPSGD), asynchronous parallel SGD (APSGD), and
standard data parallel SGD (AllReduceSGD), often by orders of magnitude in a
heterogeneous environment. When training ResNet50 on ImageNet with up to 128
GPUs, ADPSGD converges (w.r.t epochs) similarly to the AllReduceSGD, but each
epoch can be up to 48X faster than its synchronous counterparts in a
networksharing HPC environment. To the best of our knowledge, ADPSGD is the
first asynchronous algorithm that achieves a similar epochwise convergence
rate as AllReduceSGD, at an over 100GPU scale.

Bayesian optimization is the core technique behind the emergence of AutoML,
which holds the promise of automatically searching for models and
hyperparameters to make machine learning techniques more accessible. As such
services are moving towards the cloud, we ask  {\em When multiple AutoML
users share the same computational infrastructure, how should we allocate
resources to maximize the "global happiness" of all users?}
We focus on GPEI, one of the most popular algorithms for automatic model
selection and hyperparameter tuning, and develop a novel multidevice,
multitenant extension that is aware of \emph{multiple} computation devices and
multiple users sharing the same set of computation devices. Theoretically,
given $N$ users and $M$ devices, we obtain a regret bound of $O((\text{\bf
{MIU}}(T,K) + M)\frac{N^2}{M})$, where $\text{\bf {MIU}}(T,K)$ refers to the
maximal incremental uncertainty up to time $T$ for the covariance matrix $K$.
Empirically, we evaluate our algorithm on two applications of automatic model
selection, and show that our algorithm significantly outperforms the strategy
of serving users independently. Moreover, when multiple computation devices are
available, we achieve nearlinear speedup when the number of users is much
larger than the number of devices.

While training a machine learning model using multiple workers, each of which
collects data from their own data sources, it would be most useful when the
data collected from different workers can be {\em unique} and {\em different}.
Ironically, recent analysis of decentralized parallel stochastic gradient
descent (DPSGD) relies on the assumption that the data hosted on different
workers are {\em not too different}. In this paper, we ask the question: {\em
Can we design a decentralized parallel stochastic gradient descent algorithm
that is less sensitive to the data variance across workers?} In this paper, we
present D$^2$, a novel decentralized parallel stochastic gradient descent
algorithm designed for large data variance \xr{among workers} (imprecisely,
"decentralized" data). The core of D$^2$ is a variance blackuction extension of
the standard DPSGD algorithm, which improves the convergence rate from
$O\left({\sigma \over \sqrt{nT}} + {(n\zeta^2)^{\frac{1}{3}} \over
T^{2/3}}\right)$ to $O\left({\sigma \over \sqrt{nT}}\right)$ where $\zeta^{2}$
denotes the variance among data on different workers. As a result, D$^2$ is
robust to data variance among workers. We empirically evaluated D$^2$ on image
classification tasks where each worker has access to only the data of a limited
set of labels, and find that D$^2$ significantly outperforms DPSGD.

Optimizing distributed learning systems is an art of balancing between
computation and communication. There have been two lines of research that try
to deal with slower networks: {\em quantization} for low bandwidth networks,
and {\em decentralization} for high latency networks. In this paper, we explore
a natural question: {\em can the combination of both decentralization and
quantization lead to a system that is robust to both bandwidth and latency?}
Although the system implication of such combination is trivial, the
underlying theoretical principle and algorithm design is challenging: simply
quantizing data sent in a decentralized training algorithm would accumulate the
error. In this paper, we develop a framework of quantized, decentralized
training and propose two different strategies, which we call {\em extrapolation
compression} and {\em difference compression}. We analyze both algorithms and
prove both converge at the rate of $O(1/\sqrt{nT})$ where $n$ is the number of
workers and $T$ is the number of iterations, matching the {\rc convergence}
rate for full precision, centralized training. We evaluate our algorithms on
training deep learning models, and find that our proposed algorithm outperforms
the best of merely decentralized and merely quantized algorithm significantly
for networks with {\em both} high latency and low bandwidth.

Reliably detecting relevant relations between entities in unstructured text
is a valuable resource for knowledge extraction, which is why it has awaken
significant interest in the field of Natural Language Processing. In this
paper, we present a system for relation classification and extraction based on
an ensemble of convolutional and recurrent neural networks that ranked first in
3 out of the 4 subtasks at SemEval 2018 Task 7. We provide detailed
explanations and grounds for the design choices behind the most relevant
features and analyze their importance.

The study of unobscured active galactic nuclei (AGN) and quasars depends on
the reliable decomposition of the light from the AGN point source and the
extended host galaxy light. The problem is typically approached using
parametric fitting routines using separate models for the host galaxy and the
point spread function (PSF). We present a new approach using a Generative
Adversarial Network (GAN) trained on galaxy images. We test the method using
Sloan Digital Sky Survey (SDSS) rband images with artificial AGN point sources
added which are then removed using the GAN and with parametric methods using
GALFIT. When the AGN point source PS is more than twice as bright as the host
galaxy, we find that our method, PSFGAN, can recover PS and host galaxy
magnitudes with smaller systematic error and a lower average scatter ($49\%$).
PSFGAN is more tolerant to poor knowledge of the PSF than parametric methods.
Our tests show that PSFGAN is robust against a broadening in the PSF width of
$\pm 50\%$ if it is trained on multiple PSF's. We demonstrate that while a
matched training set does improve performance, we can still subtract point
sources using a PSFGAN trained on nonastronomical images. While initial
training is computationally expensive, evaluating PSFGAN on data is more than
$40$ times faster than GALFIT fitting two components. Finally, PSFGAN it is
more robust and easy to use than parametric methods as it requires no input
parameters.

We present a framework to link and describe AGN variability on a wide range
of timescales, from days to billions of years. In particular, we concentrate on
the AGN variability features related to changes in black hole fuelling and
accretion rate. In our framework, the variability features observed in
different AGN at different timescales may be explained as realisations of the
same underlying statistical properties. In this context, we propose a model to
simulate the evolution of AGN light curves with time based on the probability
density function (PDF) and power spectral density (PSD) of the Eddington ratio
($L/L_{\rm Edd}$) distribution. Motivated by general galaxy population
properties, we propose that the PDF may be inspired by the $L/L_{\rm Edd}$
distribution function (ERDF), and that a single (or limited number of) ERDF+PSD
set may explain all observed variability features. After outlining the
framework and the model, we compile a set of variability measurements in terms
of structure function (SF) and magnitude difference. We then combine the
variability measurements on a SF plot ranging from days to Gyr. The proposed
framework enables constraints on the underlying PSD and the ability to link AGN
variability on different timescales, therefore providing new insights into AGN
variability and black hole growth phenomena.

Modern scientific instruments produce vast amounts of data, which can
overwhelm the processing ability of computer systems. Lossy compression of data
is an intriguing solution but comes with its own dangers, such as potential
signal loss, and the need for careful parameter optimization. In this work, we
focus on a setting where this problem is especially acute compressive sensing
frameworks for radio astronomy and ask: Can the precision of the data
representation be lowered for all input data, with recovery guarantees and good
practical performance? Our first contribution is a theoretical analysis of the
Iterative Hard Thresholding (IHT) algorithm when all input data, that is, the
measurement matrix and the observation, are quantized aggressively, to as
little as 2 bits per value. Under reasonable constraints, we show that there
exists a variant of low precision IHT which can still provide recovery
guarantees. The second contribution is a tailored analysis of our general
quantized framework to radio astronomy, showing that its conditions are
satisfied in this case. We evaluate our approach using an FPGA implementation,
and show that it can achieve up to 9.19x speed up with negligible loss of
recovery quality, on real telescope data

It is safe to assume that, for the foreseeable future, machine learning,
especially deep learning will remain both data and computationhungry. In this
paper, we ask: Can we build a global exchange where everyone can contribute
computation and data to train the next generation of machine learning
applications?
We present an early, but running prototype of DataBright, a system that turns
the creation of training examples and the sharing of computation into an
investment mechanism. Unlike most crowdsourcing platforms, where the
contributor gets paid when they submit their data, DataBright pays dividends
whenever a contributor's data or hardware is used by someone to train a machine
learning model. The contributor becomes a shareholder in the dataset they
created. To enable the measurement of usage, a computation platform that
contributors can trust is also necessary. DataBright thus merges both a data
market and a trusted computation market.
We illustrate that trusted computation can enable the creation of an AI
market, where each data point has an exact value that should be paid to its
creator. DataBright allows data creators to retain ownership of their
contribution and attaches to it a measurable value. The value of the data is
given by its utility in subsequent distributed computation done on the
DataBright computation market. The computation market allocates tasks and
subsequent payments to pooled hardware. This leads to the creation of a
decentralized AI cloud. Our experiments show that trusted hardware such as
Intel SGX can be added to the usual ML pipeline with no additional costs. We
use this setting to orchestrate distributed computation that enables the
creation of a computation market. DataBright is available for download at
https://github.com/ds3lab/databright.

We conduct an empirical study of machine learning functionalities provided by
major cloud service providers, which we call machine learning clouds. Machine
learning clouds hold the promise of hiding all the sophistication of running
largescale machine learning: Instead of specifying how to run a machine
learning task, users only specify what machine learning task to run and the
cloud figures out the rest. Raising the level of abstraction, however, rarely
comes free  a performance penalty is possible. How good, then, are current
machine learning clouds on realworld machine learning workloads?
We study this question with a focus on binary classication problems. We
present mlbench, a novel benchmark constructed by harvesting datasets from
Kaggle competitions. We then compare the performance of the top winning code
available from Kaggle with that of running machine learning clouds from both
Azure and Amazon on mlbench. Our comparative study reveals the strength and
weakness of existing machine learning clouds and points out potential future
directions for improvement.

For Markov chain Monte Carlo methods, one of the greatest discrepancies
between theory and system is the scan order  while most theoretical
development on the mixing time analysis deals with random updates, realworld
systems are implemented with systematic scans. We bridge this gap for models
that exhibit a bipartite structure, including, most notably, the
Restricted/Deep Boltzmann Machine. The de facto implementation for these models
scans variables in a layerwise fashion. We show that the Gibbs sampler with a
layerwise alternating scan order has its relaxation time (in terms of epochs)
no larger than that of a randomupdate Gibbs sampler (in terms of variable
updates). We also construct examples to show that this bound is asymptotically
tight. Through standard inequalities, our result also implies a comparison on
the mixing times.

Most distributed machine learning systems nowadays, including TensorFlow and
CNTK, are built in a centralized fashion. One bottleneck of centralized
algorithms lies on high communication cost on the central node. Motivated by
this, we ask, can decentralized algorithms be faster than its centralized
counterpart?
Although decentralized PSGD (DPSGD) algorithms have been studied by the
control community, existing analysis and theory do not show any advantage over
centralized PSGD (CPSGD) algorithms, simply assuming the application scenario
where only the decentralized network is available. In this paper, we study a
DPSGD algorithm and provide the first theoretical analysis that indicates a
regime in which decentralized algorithms might outperform centralized
algorithms for distributed stochastic gradient descent. This is because DPSGD
has comparable total computational complexities to CPSGD but requires much
less communication cost on the busiest node. We further conduct an empirical
study to validate our theoretical analysis across multiple frameworks (CNTK and
Torch), different network configurations, and computation platforms up to 112
GPUs. On network configurations with low bandwidth or high latency, DPSGD can
be up to one order of magnitude faster than its welloptimized centralized
counterparts.

We present ease.ml, a declarative machine learning service platform we built
to support more than ten research groups outside the computer science
departments at ETH Zurich for their machine learning needs. With ease.ml, a
user defines the highlevel schema of a machine learning application and
submits the task via a Web interface. The system automatically deals with the
rest, such as model selection and data movement. In this paper, we describe the
ease.ml architecture and focus on a novel technical problem introduced by
ease.ml regarding resource allocation. We ask, as a "service provider" that
manages a shared cluster of machines among all our users running machine
learning workloads, what is the resource allocation strategy that maximizes the
global satisfaction of all our users?
Resource allocation is a critical yet subtle issue in this multitenant
scenario, as we have to balance between efficiency and fairness. We first
formalize the problem that we call multitenant model selection, aiming for
minimizing the total regret of all users running automatic model selection
tasks. We then develop a novel algorithm that combines multiarmed bandits with
Bayesian optimization and prove a regret bound under the multitenant setting.
Finally, we report our evaluation of ease.ml on synthetic data and on one
service we are providing to our users, namely, image classification with deep
neural networks. Our experimental evaluation results show that our proposed
solution can be up to 9.8x faster in achieving the same global quality for all
users as the two popular heuristics used by our users before ease.ml.

Recently there has been significant interest in training machinelearning
models at low precision: by reducing precision, one can reduce computation and
communication by one order of magnitude. We examine training at reduced
precision, both from a theoretical and practical perspective, and ask: is it
possible to train models at endtoend low precision with provable guarantees?
Can this lead to consistent orderofmagnitude speedups? We present a framework
called ZipML to answer these questions. For linear models, the answer is yes.
We develop a simple framework based on one simple but novel strategy called
double sampling. Our framework is able to execute training at low precision
with no bias, guaranteeing convergence, whereas naive quantization would
introduce significant bias. We validate our framework across a range of
applications, and show that it enables an FPGA prototype that is up to 6.5x
faster than an implementation using full 32bit precision. We further develop a
varianceoptimal stochastic quantization strategy and show that it can make a
significant difference in a variety of settings. When applied to linear models
together with double sampling, we save up to another 1.7x in data movement
compared with uniform quantization. When training deep networks with quantized
models, we achieve higher accuracy than the stateoftheart XNORNet. Finally,
we extend our framework through approximation to nonlinear models, such as
SVM. We show that, although using lowprecision data induces bias, we can
appropriately bound and control the bias. We find in practice 8bit precision
is often sufficient to converge to the correct solution. Interestingly,
however, in practice we notice that our framework does not always outperform
the naive rounding approach. We discuss this negative result in detail.

Observations of astrophysical objects such as galaxies are limited by various
sources of random and systematic noise from the sky background, the optical
system of the telescope and the detector used to record the data. Conventional
deconvolution techniques are limited in their ability to recover features in
imaging data by the ShannonNyquist sampling theorem. Here we train a
generative adversarial network (GAN) on a sample of $4,550$ images of nearby
galaxies at $0.01<z<0.02$ from the Sloan Digital Sky Survey and conduct
$10\times$ cross validation to evaluate the results. We present a method using
a GAN trained on galaxy images that can recover features from artificially
degraded images with worse seeing and higher noise than the original with a
performance which far exceeds simple deconvolution. The ability to better
recover detailed features such as galaxy morphology from lowsignaltonoise
and low angular resolution imaging data significantly increases our ability to
study existing data sets of astrophysical objects as well as future
observations with observatories such as the Large Synoptic Sky Telescope (LSST)
and the Hubble and James Webb space telescopes.

Asynchronous methods are widely used in deep learning, but have limited
theoretical justification when applied to nonconvex problems. We show that
running stochastic gradient descent (SGD) in an asynchronous manner can be
viewed as adding a momentumlike term to the SGD iteration. Our result does not
assume convexity of the objective function, so it is applicable to deep
learning systems. We observe that a standard queuing model of asynchrony
results in a form of momentum that is commonly used by deep learning
practitioners. This forges a link between queuing theory and asynchrony in deep
learning systems, which could be useful for systems builders. For convolutional
neural networks, we experimentally validate that the degree of asynchrony
directly correlates with the momentum, confirming our main result. An important
implication is that tuning the momentum parameter is important when considering
different levels of asynchrony. We assert that properly tuned momentum reduces
the number of steps required for convergence. Finally, our theory suggests new
ways of counteracting the adverse effects of asynchrony: a simple mechanism
like using negative algorithmic momentum can improve performance under high
asynchrony. Since asynchronous methods have better hardware efficiency, this
result may shed light on when asynchronous execution is more efficient for deep
learning systems.

We study the factors affecting training time in multidevice deep learning
systems. Given a specification of a convolutional neural network, our goal is
to minimize the time to train this model on a cluster of commodity CPUs and
GPUs. We first focus on the singlenode setting and show that by using standard
batching and dataparallel techniques, throughput can be improved by at least
5.5x over stateoftheart systems on CPUs. This ensures an endtoend training
speed directly proportional to the throughput of a device regardless of its
underlying hardware, allowing each node in the cluster to be treated as a black
box. Our second contribution is a theoretical and empirical study of the
tradeoffs affecting endtoend training time in a multipledevice setting. We
identify the degree of asynchronous parallelization as a key factor affecting
both hardware and statistical efficiency. We see that asynchrony can be viewed
as introducing a momentum term. Our results imply that tuning momentum is
critical in asynchronous parallel configurations, and suggest that published
results that have not been fully tuned might report suboptimal performance for
some configurations. For our third contribution, we use our novel understanding
of the interaction between system and optimization dynamics to provide an
efficient hyperparameter optimizer. Our optimizer involves a predictive model
for the total time to convergence and selects an allocation of resources to
minimize that time. We demonstrate that the most popular distributed deep
learning systems fall within our tradeoff space, but do not optimize within the
space. By doing this optimization, our prototype runs 1.9x to 12x faster than
the fastest stateoftheart systems.

We present CYCLADES, a general framework for parallelizing stochastic
optimization algorithms in a shared memory setting. CYCLADES is asynchronous
during shared model updates, and requires no memory locking mechanisms, similar
to HOGWILD!type algorithms. Unlike HOGWILD!, CYCLADES introduces no conflicts
during the parallel execution, and offers a blackbox analysis for provable
speedups across a large family of algorithms. Due to its inherent conflictfree
nature and cache locality, our multicore implementation of CYCLADES
consistently outperforms HOGWILD!type algorithms on sufficiently sparse
datasets, leading to up to 40% speedup gains compared to the HOGWILD!
implementation of SGD, and up to 5x gains over asynchronous implementations of
variance reduction algorithms.

The complexity of the visual world creates significant challenges for
comprehensive visual understanding. In spite of recent successes in visual
recognition, today's vision systems would still struggle to deal with visual
queries that require a deeper reasoning. We propose a knowledge base (KB)
framework to handle an assortment of visual queries, without the need to train
new classifiers for new tasks. Building such a largescale multimodal KB
presents a major challenge of scalability. We cast a largescale MRF into a KB
representation, incorporating visual, textual and structured data, as well as
their diverse relations. We introduce a scalable knowledge base construction
system that is capable of building a KB with half billion variables and
millions of parameters in a few hours. Our system achieves competitive results
compared to purposebuilt models on standard recognition and retrieval tasks,
while exhibiting greater flexibility in answering richer visual queries.

Gibbs sampling on factor graphs is a widely used inference technique, which
often produces good empirical results. Theoretical guarantees for its
performance are weak: even for tree structured graphs, the mixing time of Gibbs
may be exponential in the number of variables. To help understand the behavior
of Gibbs sampling, we introduce a new (hyper)graph property, called hierarchy
width. We show that under suitable conditions on the weights, bounded hierarchy
width ensures polynomial mixing time. Our study of hierarchy width is in part
motivated by a class of factor graph templates, hierarchical templates, which
have bounded hierarchy widthregardless of the data used to instantiate them.
We demonstrate a rich application from natural language processing in which
Gibbs sampling provably mixes rapidly and achieves accuracy that exceeds human
volunteers.

Stochastic gradient descent (SGD) is a ubiquitous algorithm for a variety of
machine learning problems. Researchers and industry have developed several
techniques to optimize SGD's runtime performance, including asynchronous
execution and reduced precision. Our main result is a martingalebased analysis
that enables us to capture the rich noise models that may arise from such
techniques. Specifically, we use our new analysis in three ways: (1) we derive
convergence rates for the convex case (Hogwild!) with relaxed assumptions on
the sparsity of the problem; (2) we analyze asynchronous SGD algorithms for
nonconvex matrix problems including matrix completion; and (3) we design and
analyze an asynchronous SGD algorithm, called Buckwild!, that uses
lowerprecision arithmetic. We show experimentally that our algorithms run
efficiently for a variety of problems on modern hardware.

Populating a database with unstructured information is a longstanding
problem in industry and research that encompasses problems of extraction,
cleaning, and integration. Recent names used for this problem include dealing
with dark data and knowledge base construction (KBC). In this work, we describe
DeepDive, a system that combines database and machine learning ideas to help
develop KBC systems, and we present techniques to make the KBC process more
efficient. We observe that the KBC process is iterative, and we develop
techniques to incrementally produce inference results for KBC systems. We
propose two methods for incremental inference, based respectively on sampling
and variational techniques. We also study the tradeoff space of these methods
and develop a simple rulebased optimizer. DeepDive includes all of these
contributions, and we evaluate DeepDive on five KBC systems, showing that it
can speed up KBC inference tasks by up to two orders of magnitude with
negligible impact on quality.

We present Caffe con Troll (CcT), a fully compatible endtoend version of
the popular framework Caffe with rebuilt internals. We built CcT to examine the
performance characteristics of training and deploying generalpurpose
convolutional neural networks across different hardware architectures. We find
that, by employing standard batching optimizations for CPU training, we achieve
a 4.5x throughput improvement over Caffe on popular networks like CaffeNet.
Moreover, with these improvements, the endtoend training time for CNNs is
directly proportional to the FLOPS delivered by the CPU, which enables us to
efficiently train hybrid CPUGPU systems for CNNs.

Knowledge base construction (KBC) is the process of populating a knowledge
base, i.e., a relational database together with inference rules, with
information extracted from documents and structured sources. KBC blurs the
distinction between two traditional database problems, information extraction
and information integration. For the last several years, our group has been
building knowledge bases with scientific collaborators. Using our approach, we
have built knowledge bases that have comparable and sometimes better quality
than those constructed by human volunteers. In contrast to these knowledge
bases, which took experts a decade or more human years to construct, many of
our projects are constructed by a single graduate student.
Our approach to KBC is based on joint probabilistic inference and learning,
but we do not see inference as either a panacea or a magic bullet: inference is
a tool that allows us to be systematic in how we construct, debug, and improve
the quality of such systems. In addition, inference allows us to construct
these systems in a more loosely coupled way than traditional approaches. To
support this idea, we have built the DeepDive system, which has the design goal
of letting the user "think about featuresnot algorithms." We think of
DeepDive as declarative in that one specifies what they want but not how to get
it. We describe our approach with a focus on feature engineering, which we
argue is an understudied problem relative to its importance to endtoend
quality.

Many aspects of macroevolutionary theory and our understanding of biotic
responses to global environmental change derive from literaturebased
compilations of palaeontological data. Existing manually assembled databases
are, however, incomplete and difficult to assess and enhance. Here, we develop
and validate the quality of a machine reading system, PaleoDeepDive, that
automatically locates and extracts data from heterogeneous text, tables, and
figures in publications. PaleoDeepDive performs comparably to humans in complex
data extraction and inference tasks and generates congruent synthetic
macroevolutionary results. Unlike traditional databases, PaleoDeepDive produces
a probabilistic database that systematically improves as information is added.
We also show that the system can readily accommodate sophisticated data types,
such as morphological data in biological illustrations and associated textual
descriptions. Our machine reading approach to scientific data integration and
synthesis brings within reach many questions that are currently underdetermined
and does so in ways that may stimulate entirely new modes of inquiry.