
Prospection, the act of predicting the consequences of many possible futures,
is intrinsic to human planning and action, and may even be at the root of
consciousness. Surprisingly, this idea has been explored comparatively little
in robotics. In this work, we propose a neural network architecture and
associated planning algorithm that (1) learns a representation of the world
useful for generating prospective futures after the application of highlevel
actions, (2) uses this generative model to simulate the result of sequences of
highlevel actions in a variety of environments, and (3) uses this same
representation to evaluate these actions and perform tree search to find a
sequence of highlevel actions in a new environment. Models are trained via
imitation learning on a variety of domains, including navigation,
pickandplace, and a surgical robotics task. Our approach allows us to
visualize intermediate motion goals and learn to plan complex activity from
visual information.

In this paper we investigate the family of functions representable by deep
neural networks (DNN) with rectified linear units (ReLU). We give an algorithm
to train a ReLU DNN with one hidden layer to *global optimality* with runtime
polynomial in the data size albeit exponential in the input dimension. Further,
we improve on the known lower bounds on size (from exponential to super
exponential) for approximating a ReLU deep net function by a shallower ReLU
net. Our gap theorems hold for smoothly parametrized families of "hard"
functions, contrary to countable, discrete families known in the literature. An
example consequence of our gap theorems is the following: for every natural
number $k$ there exists a function representable by a ReLU DNN with $k^2$
hidden layers and total size $k^3$, such that any ReLU DNN with at most $k$
hidden layers will require at least $\frac{1}{2}k^{k+1}1$ total nodes.
Finally, for the family of $\mathbb{R}^n\to \mathbb{R}$ DNNs with ReLU
activations, we show a new lowerbound on the number of affine pieces, which is
larger than previous constructions in certain regimes of the network
architecture and most distinctively our lowerbound is demonstrated by an
explicit construction of a *smoothly parameterized* family of functions
attaining this scaling. Our construction utilizes the theory of zonotopes from
polyhedral theory.

We propose novel firstorder stochastic approximation algorithms for
canonical correlation analysis (CCA). Algorithms presented are instances of
inexact matrix stochastic gradient (MSG) and inexact matrix exponentiated
gradient (MEG), and achieve $\epsilon$suboptimality in the population
objective in $\operatorname{poly}(\frac{1}{\epsilon})$ iterations. We also
consider practical variants of the proposed algorithms and compare them with
other methods for CCA both theoretically and empirically.

Many machine learning techniques sacrifice convenient computational
structures to gain estimation robustness and modeling flexibility. However, by
exploring the modeling structures, we find these "sacrifices" do not always
require more computational efforts. To shed light on such a "freelunch"
phenomenon, we study the squarerootLasso (SQRTLasso) type regression
problem. Specifically, we show that the nonsmooth loss functions of SQRTLasso
type regression ease tuning effort and gain adaptivity to inhomogeneous noise,
but is not necessarily more challenging than Lasso in computation. We can
directly apply proximal algorithms (e.g. proximal gradient descent, proximal
Newton, and proximal QuasiNewton algorithms) without worrying the
nonsmoothness of the loss function. Theoretically, we prove that the proximal
algorithms combined with the pathwise optimization scheme enjoy fast
convergence guarantees with high probability. Numerical results are provided to
support our theory.

We propose a general theory for studying the \xl{landscape} of nonconvex
\xl{optimization} with underlying symmetric structures \tz{for a class of
machine learning problems (e.g., lowrank matrix factorization, phase
retrieval, and deep linear neural networks)}. In specific, we characterize the
locations of stationary points and the null space of Hessian matrices \xl{of
the objective function} via the lens of invariant groups\removed{for associated
optimization problems, including lowrank matrix factorization, phase
retrieval, and deep linear neural networks}. As a major motivating example, we
apply the proposed general theory to characterize the global \xl{landscape} of
the \xl{nonconvex optimization in} lowrank matrix factorization problem. In
particular, we illustrate how the rotational symmetry group gives rise to
infinitely many nonisolated strict saddle points and equivalent global minima
of the objective function. By explicitly identifying all stationary points, we
divide the entire parameter space into three regions: ($\cR_1$) the region
containing the neighborhoods of all strict saddle points, where the objective
has negative curvatures; ($\cR_2$) the region containing neighborhoods of all
global minima, where the objective enjoys strong convexity along certain
directions; and ($\cR_3$) the complement of the above regions, where the
gradient has sufficiently large magnitudes. We further extend our result to the
matrix sensing problem. Such global landscape implies strong global convergence
guarantees for popular iterative algorithms with arbitrary initial solutions.

We propose a stochastic variance reduced optimization algorithm for solving
sparse learning problems with cardinality constraints. Sufficient conditions
are provided, under which the proposed algorithm enjoys strong linear
convergence guarantees and optimal estimation accuracy in high dimensions. We
further extend the proposed algorithm to an asynchronous parallel variant with
a near linear speedup. Numerical experiments demonstrate the efficiency of our
algorithm in terms of both parameter estimation and computational performance.

We present Deep Generalized Canonical Correlation Analysis (DGCCA)  a
method for learning nonlinear transformations of arbitrarily many views of
data, such that the resulting transformations are maximally informative of each
other. While methods for nonlinear twoview representation learning (Deep CCA,
(Andrew et al., 2013)) and linear manyview representation learning
(Generalized CCA (Horst, 1961)) exist, DGCCA is the first CCAstyle multiview
representation learning technique that combines the flexibility of nonlinear
(deep) representation learning with the statistical power of incorporating
information from many independent sources, or views. We present the DGCCA
formulation as well as an efficient stochastic optimization algorithm for
solving it. We learn DGCCA representations on two distinct datasets for three
downstream tasks: phonetic transcription from acoustic and articulatory
measurements, and recommending hashtags and friends on a dataset of Twitter
users. We find that DGCCA representations soundly beat existing methods at
phonetic transcription and hashtag recommendation, and in general perform no
worse than standard linear manyview techniques.

The cyclic block coordinate descenttype (CBCDtype) methods, which performs
iterative updates for a few coordinates (a block) simultaneously throughout the
procedure, have shown remarkable computational performance for solving strongly
convex minimization problems. Typical applications include many popular
statistical machine learning methods such as elasticnet regression, ridge
penalized logistic regression, and sparse additive regression. Existing
optimization literature has shown that for strongly convex minimization, the
CBCDtype methods attain iteration complexity of $\cO(p\log(1/\epsilon))$,
where $\epsilon$ is a prespecified accuracy of the objective value, and $p$ is
the number of blocks. However, such iteration complexity explicitly depends on
$p$, and therefore is at least $p$ times worse than the complexity
$\cO(\log(1/\epsilon))$ of gradient descent (GD) methods. To bridge this
theoretical gap, we propose an improved convergence analysis for the CBCDtype
methods. In particular, we first show that for a family of quadratic
minimization problems, the iteration complexity
$\cO(\log^2(p)\cdot\log(1/\epsilon))$ of the CBCDtype methods matches that of
the GD methods in term of dependency on $p$, up to a $\log^2 p$ factor. Thus
our complexity bounds are sharper than the existing bounds by at least a factor
of $p/\log^2(p)$. We also provide a lower bound to confirm that our improved
complexity bounds are tight (up to a $\log^2 p$ factor), under the assumption
that the largest and smallest eigenvalues of the Hessian matrix do not scale
with $p$. Finally, we generalize our analysis to other strongly convex
minimization problems beyond quadratic ones.

Medical researchers are coming to appreciate that many diseases are in fact
complex, heterogeneous syndromes composed of subpopulations that express
different variants of a related complication. Time series data extracted from
individual electronic health records (EHR) offer an exciting new way to study
subtle differences in the way these diseases progress over time. In this paper,
we focus on answering two questions that can be asked using these databases of
time series. First, we want to understand whether there are individuals with
similar disease trajectories and whether there are a small number of degrees of
freedom that account for differences in trajectories across the population.
Second, we want to understand how important clinical outcomes are associated
with disease trajectories. To answer these questions, we propose the Disease
Trajectory Map (DTM), a novel probabilistic model that learns lowdimensional
representations of sparse and irregularly sampled time series. We propose a
stochastic variational inference algorithm for learning the DTM that allows the
model to scale to large modern medical datasets. To demonstrate the DTM, we
analyze data collected on patients with the complex autoimmune disease,
scleroderma. We find that DTM learns meaningful representations of disease
trajectories and that the representations are significantly associated with
important clinical outcomes.

Modern NLP models rely heavily on engineered features, which often combine
word and contextual information into complex lexical features. Such combination
results in large numbers of features, which can lead to overfitting. We
present a new model that represents complex lexical featurescomprised of
parts for words, contextual information and labelsin a tensor that captures
conjunction information among these parts. We apply lowrank tensor
approximations to the corresponding parameter tensors to reduce the parameter
space and improve prediction speed. Furthermore, we investigate two methods for
handling features that include $n$grams of mixed lengths. Our model achieves
stateoftheart results on tasks in relation extraction, PPattachment, and
preposition disambiguation.

We consider learning representations (features) in the setting in which we
have access to multiple unlabeled views of the data for learning while only one
view is available for downstream tasks. Previous work on this problem has
proposed several techniques based on deep neural networks, typically involving
either autoencoderlike networks with a reconstruction objective or paired
feedforward networks with a batchstyle correlationbased objective. We analyze
several techniques based on prior work, as well as new variants, and compare
them empirically on image, speech, and text tasks. We find an advantage for
correlationbased representation learning, while the best results on most tasks
are obtained with our new variant, deep canonically correlated autoencoders
(DCCAE). We also explore a stochastic optimization procedure for minibatch
correlationbased objectives and discuss the time/performance tradeoffs for
kernelbased and neural networkbased implementations.

Deep CCA is a recently proposed deep neural network extension to the
traditional canonical correlation analysis (CCA), and has been successful for
multiview representation learning in several domains. However, stochastic
optimization of the deep CCA objective is not straightforward, because it does
not decouple over training examples. Previous optimizers for deep CCA are
either batchbased algorithms or stochastic optimization using large
minibatches, which can have high memory consumption. In this paper, we tackle
the problem of stochastic optimization for deep CCA with small minibatches,
based on an iterative solution to the CCA objective, and show that we can
achieve as good performance as previous optimizers and thus alleviate the
memory requirement.

We study PCA as a stochastic optimization problem and propose a novel
stochastic approximation algorithm which we refer to as "Matrix Stochastic
Gradient" (MSG), as well as a practical variant, Capped MSG. We study the
method both theoretically and empirically.

We consider a Markov decision process with deterministic state transition
dynamics, adversarially generated rewards that change arbitrarily from round to
round, and a bandit feedback model in which the decision maker only observes
the rewards it receives. In this setting, we present a novel and efficient
online decision making algorithm named MarcoPolo. Under mild assumptions on the
structure of the transition dynamics, we prove that MarcoPolo enjoys a regret
of O(T^(3/4)sqrt(log(T))) against the best deterministic policy in hindsight.
Specifically, our analysis does not rely on the stringent unichain assumption,
which dominates much of the previous work on this topic.

Online learning algorithms are designed to learn even when their input is
generated by an adversary. The widelyaccepted formal definition of an online
algorithm's ability to learn is the gametheoretic notion of regret. We argue
that the standard definition of regret becomes inadequate if the adversary is
allowed to adapt to the online algorithm's actions. We define the alternative
notion of policy regret, which attempts to provide a more meaningful way to
measure an online algorithm's performance against adaptive adversaries.
Focusing on the online bandit setting, we show that no bandit algorithm can
guarantee a sublinear policy regret against an adaptive adversary with
unbounded memory. On the other hand, if the adversary's memory is bounded, we
present a general technique that converts any bandit algorithm with a sublinear
regret bound into an algorithm with a sublinear policy regret bound. We extend
this result to other variants of regret, such as switching regret, internal
regret, and swap regret.