
The recent tremendous success of unsupervised word embeddings in a multitude
of applications raises the obvious question if similar methods could be derived
to improve embeddings (i.e. semantic representations) of word sequences as
well. We present a simple but efficient unsupervised objective to train
distributed representations of sentences. Our method outperforms the
stateoftheart unsupervised models on most benchmark tasks, highlighting the
robustness of the produced generalpurpose sentence embeddings.

The scale of modern datasets necessitates the development of efficient
distributed optimization methods for machine learning. We present a
generalpurpose framework for distributed computing environments, CoCoA, that
has an efficient communication scheme and is applicable to a wide variety of
problems in machine learning and signal processing. We extend the framework to
cover general nonstronglyconvex regularizers, including L1regularized
problems like lasso, sparse logistic regression, and elastic net
regularization, and show how earlier work can be derived as a special case. We
provide convergence guarantees for the class of convex regularized loss
minimization objectives, leveraging a novel approach in handling
nonstronglyconvex regularizers and nonsmooth loss functions. The resulting
framework has markedly improved performance over stateoftheart methods, as
we illustrate with an extensive set of experiments on real distributed
datasets.

We consider learning of fundamental properties of communities in large noisy
networks, in the prototypical situation where the nodes or users are split into
two classes according to a binary property, e.g., according to their opinions
or preferences on a topic. For learning these properties, we propose a
nonparametric, unsupervised, and scalable graph scan procedure that is, in
addition, robust against a class of powerful adversaries. In our setup, one of
the communities can fall under the influence of a knowledgeable adversarial
leader, who knows the full network structure, has unlimited computational
resources and can completely foresee our planned actions on the network. We
prove strong consistency of our results in this setup with minimal assumptions.
In particular, the learning procedure estimates the baseline activity of normal
users asymptotically correctly with probability 1; the only assumption being
the existence of a single implicit community of asymptotically negligible
logarithmic size. We provide experiments on real and synthetic data to
illustrate the performance of our method, including examples with adversaries.

DNNs are ubiquitous datacenter workloads, requiring orders of magnitude more
computing power from servers than traditional workloads. As such, datacenter
operators are forced to adopt domainspecific accelerators that employ
halfprecision floatingpoint (FP) numeric representations to improve
arithmetic density. Unfortunately, even these representations are not dense
enough, and are, therefore, suboptimal for DNNs. We propose a hybrid approach
that employs dense block floatingpoint (BFP) arithmetic on dot product
computations and FP arithmetic elsewhere. While using BFP improves the
performance of dot product operations, that compose most of DNN computations,
allowing values to freely float between dot product operations leads to a
better choice of tensor exponents when converting values to back BFP. We show
that models trained with hybrid BFPFP arithmetic either match or outperform
their FP32 counterparts, leading to more compact models and denser arithmetic
in computing platforms.

Two popular examples of firstorder optimization methods over linear spaces
are coordinate descent and matching pursuit algorithms, with their randomized
variants. While the former targets the optimization by moving along
coordinates, the latter considers a generalized notion of directions.
Exploiting the connection between the two algorithms, we present a unified
analysis of both, providing affine invariant sublinear $\mathcal{O}(1/t)$ rates
on smooth objectives and linear convergence on strongly convex objectives. As a
byproduct of our affine invariant analysis of matching pursuit, our rates for
steepest coordinate descent are the tightest known. Furthermore, we show the
first accelerated convergence rate $\mathcal{O}(1/t^2)$ for matching pursuit on
convex objectives.

Keyphrase extraction is the task of automatically selecting a small set of
phrases that best describe a given free text document. Keyphrases can be used
for indexing, searching, aggregating and summarizing text documents, serving
many automatic as well as humanfacing use cases. Existing supervised systems
for keyphrase extraction require large amounts of labeled training data and
generalize very poorly outside the domain of the training data. At the same
time, unsupervised systems found in the literature have poor accuracy, and
often do not generalize well, as they require the input document to belong to a
larger corpus also given as input. Furthermore, both supervised and
unsupervised methods are often too slow for realtime scenarios and suffer from
overgeneration. Addressing these drawbacks, in this paper, we introduce an
unsupervised method for keyphrase extraction from single documents that
leverages sentence embeddings. By selecting phrases whose semantic embeddings
are close to the embeddings of the whole document, we are able to separate the
best candidate phrases from the rest. We show that our embeddingbased method
is not only simpler, but also more effective than graphbased state of the art
systems, achieving higher Fscores on standard datasets. Simplicity is a
significant advantage, especially when processing large amounts of documents
from the Web, resulting in considerable speed gains. Moreover, we describe how
to increase coverage and diversity among the selected keyphrases by introducing
an embeddingbased maximal marginal relevance (MMR) for new phrases. A user
study including over 200 votes showed that, although reducing the phrase
semantic overlap leads to no gains in terms of Fscore, our diversity enriched
selection is preferred by humans.

We propose a generic algorithmic building block to accelerate training of
machine learning models on heterogenous compute systems. The scheme allows to
efficiently employ compute accelerators such as GPUs and FPGAs for the training
of largescale machine learning models, when the training data exceeds their
memory capacity. Also, it provides adaptivity to any system's memory hierarchy
in terms of size and processing speed. Our technique builds upon primaldual
coordinate methods, and uses duality gap information to dynamically decide
which part of the data should be made available for fast processing. We provide
a strong theoretical motivation for our gapbased selection scheme and provide
an efficient practical implementation thereof. To illustrate the power of our
approach we demonstrate its performance for training of generalized linear
models on large scale datasets exceeding the memory size of a modern GPU,
showing an orderofmagnitude speedup over existing approaches.

Greedy optimization methods such as Matching Pursuit (MP) and FrankWolfe
(FW) algorithms regained popularity in recent years due to their simplicity,
effectiveness and theoretical guarantees. MP and FW address optimization over
the linear span and the convex hull of a set of atoms, respectively. In this
paper, we consider the intermediate case of optimization over the convex cone,
parametrized as the conic hull of a generic atom set, leading to the first
principled definitions of nonnegative MP algorithms for which we give explicit
convergence rates and demonstrate excellent empirical performance. In
particular, we derive sublinear ($\mathcal{O}(1/t)$) convergence on general
smooth and convex objectives, and linear convergence ($\mathcal{O}(e^{t})$) on
strongly convex objectives, in both cases for general sets of atoms.
Furthermore, we establish a clear correspondence of our algorithms to known
algorithms from the MP and FW literature. Our novel algorithms and analyses
target general atom sets and general objective functions, and hence are
directly applicable to a large variety of learning settings.

This study deals with semantic segmentation of highresolution (aerial)
images where a semantic class label is assigned to each pixel via supervised
classification as a basis for automatic map generation. Recently, deep
convolutional neural networks (CNNs) have shown impressive performance and have
quickly become the defacto standard for semantic segmentation, with the added
benefit that taskspecific feature design is no longer necessary. However, a
major downside of deep learning methods is that they are extremely datahungry,
thus aggravating the perennial bottleneck of supervised classification, to
obtain enough annotated training data. On the other hand, it has been observed
that they are rather robust against noise in the training labels. This opens up
the intriguing possibility to avoid annotating huge amounts of training data,
and instead train the classifier from existing legacy data or crowdsourced
maps which can exhibit high levels of noise. The question addressed in this
paper is: can training with largescale, publicly available labels replace a
substantial part of the manual labeling effort and still achieve sufficient
performance? Such data will inevitably contain a significant portion of errors,
but in return virtually unlimited quantities of it are available in larger
parts of the world. We adapt a stateoftheart CNN architecture for semantic
segmentation of buildings and roads in aerial images, and compare its
performance when using different training data sets, ranging from manually
labeled, pixelaccurate ground truth of the same city to automatic training
data derived from OpenStreetMap data from distant locations. We report our
results that indicate that satisfying performance can be obtained with
significantly less manual annotation effort, by exploiting noisy largescale
training data.

We propose a new selection rule for the coordinate selection in coordinate
descent methods for hugescale optimization. The efficiency of this novel
scheme is provably better than the efficiency of uniformly random selection,
and can reach the efficiency of steepest coordinate descent (SCD), enabling an
acceleration of a factor of up to $n$, the number of coordinates. In many
practical applications, our scheme can be implemented at no extra cost and
computational efficiency very close to the faster uniform selection. Numerical
experiments with Lasso and Ridge regression show promising improvements, in
line with our theoretical guarantees.

Motivated by concerns for user privacy, we design a steganographic system
("stegosystem") that enables two users to exchange encrypted messages without
an adversary detecting that such an exchange is taking place. We propose a new
linguistic stegosystem based on a Long ShortTerm Memory (LSTM) neural network.
We demonstrate our approach on the Twitter and Enron email datasets and show
that it yields highquality steganographic text while significantly improving
capacity (encrypted bits per word) relative to the stateoftheart.

This paper presents a novel approach for multilingual sentiment
classification in short texts. This is a challenging task as the amount of
training data in languages other than English is very limited. Previously
proposed multilingual approaches typically require to establish a
correspondence to English for which powerful classifiers are already available.
In contrast, our method does not require such supervision. We leverage large
amounts of weaklysupervised data in various languages to train a multilayer
convolutional network and demonstrate the importance of using pretraining of
such networks. We thoroughly evaluate our approach on various multilingual
datasets, including the recent SemEval2016 sentiment prediction benchmark
(Task 4), where we achieved stateoftheart performance. We also compare the
performance of our model trained individually for each language to a variant
trained for all languages at once. We show that the latter model reaches
slightly worse  but still acceptable  performance when compared to the single
language model, while benefiting from better generalization properties across
languages.

Coordinate descent methods employ random partial updates of decision
variables in order to solve hugescale convex optimization problems. In this
work, we introduce new adaptive rules for the random selection of their
updates. By adaptive, we mean that our selection rules are based on the dual
residual or the primaldual gap estimates and can change at each iteration. We
theoretically characterize the performance of our selection rules and
demonstrate improvements over the stateoftheart, and extend our theory and
algorithms to general convex objectives. Numerical evidence with hingeloss
support vector machines and Lasso confirm that the practice follows the theory.

Two of the most fundamental prototypes of greedy optimization are the
matching pursuit and FrankWolfe algorithms. In this paper, we take a unified
view on both classes of methods, leading to the first explicit convergence
rates of matching pursuit methods in an optimization sense, for general sets of
atoms. We derive sublinear ($1/t$) convergence for both classes on general
smooth objectives, and linear convergence on strongly convex objectives, as
well as a clear correspondence of algorithm variants. Our presented algorithms
and rates are affine invariant, and do not need any incoherence or sparsity
assumptions.

We formulate an affine invariant implementation of the accelerated
firstorder algorithm in Nesterov (1983). Its complexity bound is proportional
to an affine invariant regularity constant defined with respect to the
Minkowski gauge of the feasible set. We extend these results to more general
problems, optimizing H\"older smooth functions using $p$uniformly convex prox
terms, and derive an algorithm whose complexity better fits the geometry of the
feasible set and adapts to both the best H\"older smoothness parameter and the
best gradient Lipschitz constant. Finally, we detail matching complexity lower
bounds when the feasible set is an $\ell_p$ ball. In this setting, our upper
bounds on iteration complexity for the algorithm in Nesterov (1983) are thus
optimal in terms of target precision, smoothness and problem dimension.

We propose a new framework for deriving screening rules for convex
optimization problems. Our approach covers a large class of constrained and
penalized optimization formulations, and works in two steps. First, given any
approximate point, the structure of the objective function and the duality gap
is used to gather information on the optimal solution. In the second step, this
information is used to produce screening rules, i.e. safely identifying
unimportant weight variables of the optimal solution. Our general framework
leads to a large variety of useful existing as well as new screening rules for
many applications. For example, we provide new screening rules for general
simplex and $L_1$constrained problems, Elastic Net, squaredloss Support
Vector Machines, minimum enclosing ball, as well as structured norm regularized
problems, such as group lasso.

With the growth of data and necessity for distributed optimization methods,
solvers that work well on a single machine must be redesigned to leverage
distributed computation. Recent work in this area has been limited by focusing
heavily on developing highly specific methods for the distributed environment.
These specialpurpose methods are often unable to fully leverage the
competitive performance of their welltuned and customized single machine
counterparts. Further, they are unable to easily integrate improvements that
continue to be made to single machine methods. To this end, we present a
framework for distributed optimization that both allows the flexibility of
arbitrary solvers to be used on each (single) machine locally, and yet
maintains competitive performance against other stateoftheart
specialpurpose distributed methods. We give strong primaldual convergence
rate guarantees for our framework that hold for arbitrary local solvers. We
demonstrate the impact of local solver selection both theoretically and in an
extensive experimental comparison. Finally, we provide thorough implementation
details for our framework, highlighting areas for practical performance gains.

We propose an algorithmindependent framework to equip existing optimization
methods with primaldual certificates. Such certificates and corresponding rate
of convergence guarantees are important for practitioners to diagnose progress,
in particular in machine learning applications. We obtain new primaldual
convergence rates, e.g., for the Lasso as well as many L1, Elastic Net, group
Lasso and TVregularized problems. The theory applies to any normregularized
generalized linear model. Our approach provides efficiently computable duality
gaps which are globally defined, without modifying the original problems in the
region of interest.

Despite the importance of sparsity in many largescale applications, there
are few methods for distributed optimization of sparsityinducing objectives.
In this paper, we present a communicationefficient framework for
L1regularized optimization in the distributed environment. By viewing
classical objectives in a more general primaldual setting, we develop a new
class of methods that can be efficiently distributed and applied to common
sparsityinducing models, such as Lasso, sparse logistic regression, and
elastic netregularized problems. We provide theoretical convergence guarantees
for our framework, and demonstrate its efficiency and flexibility with a
thorough experimental comparison on Amazon EC2. Our proposed framework yields
speedups of up to 50x as compared to current stateoftheart methods for
distributed L1regularized optimization.

Efficiently representing real world data in a succinct and parsimonious
manner is of central importance in many fields. We present a generalized greedy
pursuit framework, allowing us to efficiently solve structured matrix
factorization problems, where the factors are allowed to be from arbitrary sets
of structured vectors. Such structure may include sparsity, nonnegativeness,
order, or a combination thereof. The algorithm approximates a given matrix by a
linear combination of few rank1 matrices, each factorized into an outer
product of two vector atoms of the desired structure. For the nonconvex
subproblems of obtaining good rank1 structured matrix atoms, we employ and
analyze a general atomic power method. In addition to the above applications,
we prove linear convergence for generalized pursuit variants in Hilbert spaces
 for the task of approximation over the linear span of arbitrary dictionaries
 which generalizes OMP and is useful beyond matrix problems. Our experiments
on real datasets confirm both the efficiency and also the broad applicability
of our framework in practice.

The FrankWolfe (FW) optimization algorithm has lately regained popularity
thanks in particular to its ability to nicely handle the structured constraints
appearing in machine learning applications. However, its convergence rate is
known to be slow (sublinear) when the solution lies at the boundary. A simple
lessknown fix is to add the possibility to take 'away steps' during
optimization, an operation that importantly does not require a feasibility
oracle. In this paper, we highlight and clarify several variants of the
FrankWolfe optimization algorithm that have been successfully applied in
practice: awaysteps FW, pairwise FW, fullycorrective FW and Wolfe's minimum
norm point algorithm, and prove for the first time that they all enjoy global
linear convergence, under a weaker condition than strong convexity of the
objective. The constant in the convergence rate has an elegant interpretation
as the product of the (classical) condition number of the function with a novel
geometric quantity that plays the role of a 'condition number' of the
constraint set. We provide pointers to where these algorithms have made a
difference in practice, in particular with the flow polytope, the marginal
polytope and the base polytope for submodular optimization.

Distributed optimization methods for largescale machine learning suffer from
a communication bottleneck. It is difficult to reduce this bottleneck while
still efficiently and accurately aggregating partial work from different
machines. In this paper, we present a novel generalization of the recent
communicationefficient primaldual framework (CoCoA) for distributed
optimization. Our framework, CoCoA+, allows for additive combination of local
updates to the global parameters at each iteration, whereas previous schemes
with convergence guarantees only allow conservative averaging. We give stronger
(primaldual) convergence rate guarantees for both CoCoA as well as our new
variants, and generalize the theory for both methods to cover nonsmooth convex
loss functions. We provide an extensive experimental comparison that shows the
markedly improved performance of CoCoA+ on several realworld distributed
datasets, especially when scaling up the number of machines.

Communication remains the most significant bottleneck in the performance of
distributed optimization algorithms for largescale machine learning. In this
paper, we propose a communicationefficient framework, CoCoA, that uses local
computation in a primaldual setting to dramatically reduce the amount of
necessary communication. We provide a strong convergence rate analysis for this
class of algorithms, as well as experiments on realworld distributed datasets
with implementations in Spark. In our experiments, we find that as compared to
stateoftheart minibatch versions of SGD and SDCA algorithms, CoCoA
converges to the same .001accurate solution quality on average 25x as quickly.

We investigate the relation of two fundamental tools in machine learning and
signal processing, that is the support vector machine (SVM) for classification,
and the Lasso technique used in regression. We show that the resulting
optimization problems are equivalent, in the following sense. Given any
instance of an $\ell_2$loss softmargin (or hardmargin) SVM, we construct a
Lasso instance having the same optimal solutions, and vice versa.
As a consequence, many existing optimization algorithms for both SVMs and
Lasso can also be applied to the respective other problem instances. Also, the
equivalence allows for many known theoretical insights for SVM and Lasso to be
translated between the two settings. One such implication gives a simple
kernelized version of the Lasso, analogous to the kernels used in the SVM
setting. Another consequence is that the sparsity of a Lasso solution is equal
to the number of support vectors for the corresponding SVM instance, and that
one can use screening rules to prune the set of support vectors. Furthermore,
we can relate sublinear time algorithms for the two problems, and give a new
such algorithm variant for the Lasso. We also study the regularization paths
for both methods.

We study the linear convergence of variants of the FrankWolfe algorithms for
some classes of strongly convex problems, using only affineinvariant
quantities. As in Guelat & Marcotte (1986), we show the linear convergence of
the standard FrankWolfe algorithm when the solution is in the interior of the
domain, but with affine invariant constants. We also show the linear
convergence of the awaysteps variant of the FrankWolfe algorithm, but with
constants which only depend on the geometry of the domain, and not any property
of the location of the optimal solution. Running these algorithms does not
require knowing any problem specific parameters.