
Training generative adversarial networks is unstable in highdimensions as
the true data distribution tends to be concentrated in a small fraction of the
ambient space. The discriminator is then quickly able to classify nearly all
generated samples as fake, leaving the generator without meaningful gradients
and causing it to deteriorate after a point in training. In this work, we
propose training a single generator simultaneously against an array of
discriminators, each of which looks at a different random lowdimensional
projection of the data. Individual discriminators, now provided with restricted
views of the input, are unable to reject generated samples perfectly and
continue to provide meaningful gradients to the generator throughout training.
Meanwhile, the generator learns to produce samples consistent with the full
data distribution to satisfy all discriminators simultaneously. We demonstrate
the practical utility of this approach experimentally, and show that it is able
to produce image samples with higher quality than traditional training with a
single discriminator.

Semidefinite programs (SDP) are important in learning and combinatorial
optimization with numerous applications. In pursuit of lowrank solutions and
low complexity algorithms, we consider the BurerMonteiro factorization
approach for solving SDPs. We show that all approximate local optima are global
optima for the penalty formulation of appropriately rankconstrained SDPs as
long as the number of constraints scales subquadratically with the desired
rank of the optimal solution. Our result is based on a simple penalty function
formulation of the rankconstrained SDP along with a smoothed analysis to avoid
worstcase cost matrices. We particularize our results to two applications,
namely, MaxCut and matrix completion.

We present a generalization bound for feedforward neural networks in terms of
the product of the spectral norm of the layers and the Frobenius norm of the
weights. The generalization bound is derived using a PACBayes analysis.

With a goal of understanding what drives generalization in deep networks, we
consider several recently suggested explanations, including normbased control,
sharpness and robustness. We study how these measures can ensure
generalization, highlighting the importance of scale normalization, and making
a connection between sharpness and PACBayes theory. We then investigate how
well the measures explain different observed phenomena.

We study implicit regularization when optimizing an underdetermined quadratic
objective over a matrix $X$ with gradient descent on a factorization of $X$. We
conjecture and provide empirical and theoretical evidence that with small
enough step sizes and initialization close enough to the origin, gradient
descent on a full dimensional factorization converges to the minimum nuclear
norm solution.

In this paper we present a new algorithm for computing a low rank
approximation of the product $A^TB$ by taking only a single pass of the two
matrices $A$ and $B$. The straightforward way to do this is to (a) first sketch
$A$ and $B$ individually, and then (b) find the top components using PCA on the
sketch. Our algorithm in contrast retains additional summary information about
$A,B$ (e.g. row and column norms etc.) and uses this additional information to
obtain an improved approximation from the sketches. Our main analytical result
establishes a comparable spectral norm guarantee to existing twopass methods;
in addition we also provide results from an Apache Spark implementation that
shows better computational and statistical performance on realworld and
synthetic evaluation datasets.

We study the projected gradient descent method on lowrank matrix problems
with a strongly convex objective. We use the BurerMonteiro factorization
approach to implicitly enforce lowrankness; such factorization introduces
nonconvexity in the objective. We focus on constraint sets that include both
positive semidefinite (PSD) constraints and specific matrix normconstraints.
Such criteria appear in quantum state tomography and phase retrieval
applications.
We show that nonconvex projected gradient descent favors local linear
convergence in the factored space. We build our theory on a novel descent
lemma, that nontrivially extends recent results on the unconstrained problem.
The resulting algorithm is Projected Factored Gradient Descent, abbreviated as
ProjFGD, and shows superior performance compared to state of the art on quantum
state tomography and sparse phase retrieval applications.

We show that there are no spurious local minima in the nonconvex factorized
parametrization of lowrank matrix recovery from incoherent linear
measurements. With noisy measurements we show all local minima are very close
to a global optimum. Together with a curvature bound at saddle points, this
yields a polynomial time global convergence guarantee for stochastic gradient
descent {\em from random initialization}.

We study the minimization of a convex function $f(X)$ over the set of
$n\times n$ positive semidefinite matrices, but when the problem is recast as
$\min_U g(U) := f(UU^\top)$, with $U \in \mathbb{R}^{n \times r}$ and $r \leq
n$. We study the performance of gradient descent on $g$which we refer to as
Factored Gradient Descent (FGD)under standard assumptions on the original
function $f$.
We provide a rule for selecting the step size and, with this choice, show
that the local convergence rate of FGD mirrors that of standard gradient
descent on the original $f$: i.e., after $k$ steps, the error is $O(1/k)$ for
smooth $f$, and exponentially small in $k$ when $f$ is (restricted) strongly
convex. In addition, we provide a procedure to initialize FGD for (restricted)
strongly convex objectives and when one only has access to $f$ via a
firstorder oracle; for several problem instances, such proper initialization
leads to global convergence guarantees.
FGD and similar procedures are widely used in practice for problems that can
be posed as matrix factorization. To the best of our knowledge, this is the
first paper to provide precise convergence rate guarantees for general convex
functions under standard convex assumptions.

In this paper we propose new techniques to sample arbitrary thirdorder
tensors, with an objective of speeding up tensor algorithms that have recently
gained popularity in machine learning. Our main contribution is a new way to
select, in a biased random way, only $O(n^{1.5}/\epsilon^2)$ of the possible
$n^3$ elements while still achieving each of the three goals: \\ {\em (a)
tensor sparsification}: for a tensor that has to be formed from arbitrary
samples, compute very few elements to get a good spectral approximation, and
for arbitrary orthogonal tensors {\em (b) tensor completion:} recover an
exactly lowrank tensor from a small number of samples via alternating least
squares, or {\em (c) tensor factorization:} approximating factors of a lowrank
tensor corrupted by noise. \\ Our sampling can be used along with existing
tensorbased algorithms to speed them up, removing the computational bottleneck
in these methods.

In this work, we propose a new randomized algorithm for computing a lowrank
approximation to a given matrix. Taking an approach different from existing
literature, our method first involves a specific biased sampling, with an
element being chosen based on the leverage scores of its row and column, and
then involves weighted alternating minimization over the factored form of the
intended lowrank matrix, to minimize error only on these samples. Our method
can leverage input sparsity, yet produce approximations in {\em spectral} (as
opposed to the weaker Frobenius) norm; this combines the best aspects of
otherwise disparate current results, but with a dependence on the condition
number $\kappa = \sigma_1/\sigma_r$. In particular we require $O(nnz(M) +
\frac{n\kappa^2 r^5}{\epsilon^2})$ computations to generate a rank$r$
approximation to $M$ in spectral norm. In contrast, the best existing method
requires $O(nnz(M)+ \frac{nr^2}{\epsilon^4})$ time to compute an approximation
in Frobenius norm. Besides the tightness in spectral norm, we have a better
dependence on the error $\epsilon$. Our method is naturally and highly
parallelizable.
Our new approach enables two extensions that are interesting on their own.
The first is a new method to directly compute a lowrank approximation (in
efficient factored form) to the product of two given matrices; it computes a
small random set of entries of the product, and then executes weighted
alternating minimization (as before) on these. The sampling strategy is
different because now we cannot access leverage scores of the product matrix
(but instead have to work with input matrices). The second extension is an
improved algorithm with smaller communication complexity for the distributed
PCA setting (where each server has small set of rows of the matrix, and want to
compute low rank approximation with small amount of communication with other
servers).

Matrix completion, i.e., the exact and provable recovery of a lowrank matrix
from a small subset of its elements, is currently only known to be possible if
the matrix satisfies a restrictive structural constraintknown as {\em
incoherence}on its row and column spaces. In these cases, the subset of
elements is sampled uniformly at random.
In this paper, we show that {\em any} rank$ r $ $ n$by$ n $ matrix can be
exactly recovered from as few as $O(nr \log^2 n)$ randomly chosen elements,
provided this random choice is made according to a {\em specific biased
distribution}: the probability of any element being sampled should be
proportional to the sum of the leverage scores of the corresponding row, and
column. Perhaps equally important, we show that this specific form of sampling
is nearly necessary, in a natural precise sense; this implies that other
perhaps more intuitive sampling schemes fail.
We further establish three ways to use the above result for the setting when
leverage scores are not known \textit{a priori}: (a) a sampling strategy for
the case when only one of the row or column spaces are incoherent, (b) a
twophase sampling procedure for general matrices that first samples to
estimate leverage scores followed by sampling for exact recovery, and (c) an
analysis showing the advantages of weighted nuclear/tracenorm minimization
over the vanilla unweighted formulation for the case of nonuniform sampling.

The problem of lowrank matrix completion has recently generated a lot of
interest leading to several results that offer exact solutions to the problem.
However, in order to do so, these methods make assumptions that can be quite
restrictive in practice. More specifically, the methods assume that: a) the
observed indices are sampled uniformly at random, and b) for every new matrix,
the observed indices are sampled afresh. In this work, we address these issues
by providing a universal recovery guarantee for matrix completion that works
for a variety of sampling schemes. In particular, we show that if the set of
sampled indices come from the edges of a bipartite graph with large spectral
gap (i.e. gap between the first and the second singular value), then the
nuclear norm minimization based method exactly recovers all lowrank matrices
that satisfy certain incoherence properties. Moreover, we also show that under
certain stricter incoherence conditions, $O(nr^2)$ uniformly sampled entries
are enough to recover any rank$r$ $n\times n$ matrix, in contrast to the
$O(nr\log n)$ sample complexity required by other matrix completion algorithms
as well as existing analyses of the nuclear norm method.