
We consider the task of learning the parameters of a {\em single} component
of a mixture model, for the case when we are given {\em side information} about
that component, we call this the "search problem" in mixture models. We would
like to solve this with computational and sample complexity lower than solving
the overall original problem, where one learns parameters of all components.
Our main contributions are the development of a simple but general model for
the notion of side information, and a corresponding simple matrixbased
algorithm for solving the search problem in this general setting. We then
specialize this model and algorithm to four common scenarios: Gaussian mixture
models, LDA topic models, subspace clustering, and mixed linear regression. For
each one of these we show that if (and only if) the side information is
informative, we obtain parameter estimates with greater accuracy, and also
improved computation complexity than existing moment based mixture model
algorithms (e.g. tensor methods). We also illustrate several natural ways one
can obtain such side information, for specific problem instances. Our
experiments on real data sets (NY Times, Yelp, BSDS500) further demonstrate the
practicality of our algorithms showing significant improvement in runtime and
accuracy.

We consider support recovery in the quadratic logistic regression setting 
where the target depends on both p linear terms $x_i$ and up to $p^2$ quadratic
terms $x_i x_j$. Quadratic terms enable prediction/modeling of higherorder
effects between features and the target, but when incorporated naively may
involve solving a very large regression problem. We consider the sparse case,
where at most $s$ terms (linear or quadratic) are nonzero, and provide a new
faster algorithm. It involves (a) identifying the weak support (i.e. all
relevant variables) and (b) standard logistic regression optimization only on
these chosen variables. The first step relies on a novel insight about
correlation tests in the presence of nonlinearity, and takes $O(pn)$ time for
$n$ samples  giving potentially huge computational gains over the naive
approach. Motivated by insights from the boolean case, we propose a nonlinear
correlation test for nonbinary finite support case that involves hashing a
variable and then correlating with the output variable. We also provide
experimental results to demonstrate the effectiveness of our methods.

A rank$r$ matrix $X \in \mathbb{R}^{m \times n}$ can be written as a product
$U V^\top$, where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n
\times r}$. One could exploit this observation in optimization: e.g., consider
the minimization of a convex function $f(X)$ over rank$r$ matrices, where the
set of rank$r$ matrices is modeled via the factorization $UV^\top$. Though
such parameterization reduces the number of variables, and is more
computationally efficient (of particular interest is the case $r \ll \min\{m,
n\}$), it comes at a cost: $f(UV^\top)$ becomes a nonconvex function w.r.t.
$U$ and $V$.
We study such parameterization for optimization of generic convex objectives
$f$, and focus on firstorder, gradient descent algorithmic solutions. We
propose the BiFactored Gradient Descent (BFGD) algorithm, an efficient
firstorder method that operates on the $U, V$ factors. We show that when $f$
is (restricted) smooth, BFGD has local sublinear convergence, and linear
convergence when $f$ is both (restricted) smooth and (restricted) strongly
convex. For several key applications, we provide simple and efficient
initialization schemes that provide approximate solutions good enough for the
above convergence results to hold.

In this paper we present a new algorithm for computing a low rank
approximation of the product $A^TB$ by taking only a single pass of the two
matrices $A$ and $B$. The straightforward way to do this is to (a) first sketch
$A$ and $B$ individually, and then (b) find the top components using PCA on the
sketch. Our algorithm in contrast retains additional summary information about
$A,B$ (e.g. row and column norms etc.) and uses this additional information to
obtain an improved approximation from the sketches. Our main analytical result
establishes a comparable spectral norm guarantee to existing twopass methods;
in addition we also provide results from an Apache Spark implementation that
shows better computational and statistical performance on realworld and
synthetic evaluation datasets.

We study the projected gradient descent method on lowrank matrix problems
with a strongly convex objective. We use the BurerMonteiro factorization
approach to implicitly enforce lowrankness; such factorization introduces
nonconvexity in the objective. We focus on constraint sets that include both
positive semidefinite (PSD) constraints and specific matrix normconstraints.
Such criteria appear in quantum state tomography and phase retrieval
applications.
We show that nonconvex projected gradient descent favors local linear
convergence in the factored space. We build our theory on a novel descent
lemma, that nontrivially extends recent results on the unconstrained problem.
The resulting algorithm is Projected Factored Gradient Descent, abbreviated as
ProjFGD, and shows superior performance compared to state of the art on quantum
state tomography and sparse phase retrieval applications.

We consider the nonsquare matrix sensing problem, under restricted isometry
property (RIP) assumptions. We focus on the nonconvex formulation, where any
rank$r$ matrix $X \in \mathbb{R}^{m \times n}$ is represented as $UV^\top$,
where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$. In
this paper, we complement recent findings on the nonconvex geometry of the
analogous PSD setting [5], and show that matrix factorization does not
introduce any spurious local minima, under RIP.

We consider the problem of solving mixed random linear equations with $k$
components. This is the noiseless setting of mixed linear regression. The goal
is to estimate multiple linear models from mixed samples in the case where the
labels (which sample corresponds to which model) are not observed. We give a
tractable algorithm for the mixed linear equation problem, and show that under
some technical conditions, our algorithm is guaranteed to solve the problem
exactly with sample complexity linear in the dimension, and polynomial in $k$,
the number of components. Previous approaches have required either exponential
dependence on $k$, or superlinear dependence on the dimension. The proposed
algorithm is a combination of tensor decomposition and alternating
minimization. Our analysis involves proving that the initialization provided by
the tensor method allows alternating minimization, which is equivalent to EM in
our setting, to converge to the global optimum at a linear rate.

This paper considers the recovery of a rank $r$ positive semidefinite matrix
$X X^T\in\mathbb{R}^{n\times n}$ from $m$ scalar measurements of the form $y_i
:= a_i^T X X^T a_i$ (i.e., quadratic measurements of $X$). Such problems arise
in a variety of applications, including covariance sketching of
highdimensional data streams, quadratic regression, quantum state tomography,
among others. A natural approach to this problem is to minimize the loss
function $f(U) = \sum_i (y_i  a_i^TUU^Ta_i)^2$ which has an entire manifold of
solutions given by $\{XO\}_{O\in\mathcal{O}_r}$ where $\mathcal{O}_r$ is the
orthogonal group of $r\times r$ orthogonal matrices; this is {\it nonconvex}
in the $n\times r$ matrix $U$, but methods like gradient descent are simple and
easy to implement (as compared to semidefinite relaxation approaches).
In this paper we show that once we have $m \geq C nr \log^2(n)$ samples from
isotropic gaussian $a_i$, with high probability {\em (a)} this function admits
a dimensionindependent region of {\em local strong convexity} on lines
perpendicular to the solution manifold, and {\em (b)} with an additional
polynomial factor of $r$ samples, a simple spectral initialization will land
within the region of convexity with high probability. Together, this implies
that gradient descent with initialization (but no resampling) will converge
linearly to the correct $X$, up to an orthogonal transformation. We believe
that this general technique (local convexity reachable by spectral
initialization) should prove applicable to a broader class of nonconvex
optimization problems.

This paper considers the problem of matrix completion when some number of the
columns are completely and arbitrarily corrupted, potentially by a malicious
adversary. It is wellknown that standard algorithms for matrix completion can
return arbitrarily poor results, if even a single column is corrupted. One
direct application comes from robust collaborative filtering. Here, some number
of users are socalled manipulators who try to skew the predictions of the
algorithm by calibrating their inputs to the system. In this paper, we develop
an efficient algorithm for this problem based on a combination of a trimming
procedure and a convex program that minimizes the nuclear norm and the
$\ell_{1,2}$ norm. Our theoretical results show that given a vanishing fraction
of observed entries, it is nevertheless possible to complete the underlying
matrix even when the number of corrupted columns grows. Significantly, our
results hold without any assumptions on the locations or values of the observed
entries of the manipulated columns. Moreover, we show by an
informationtheoretic argument that our guarantees are nearly optimal in terms
of the fraction of sampled entries on the authentic columns, the fraction of
corrupted columns, and the rank of the underlying matrix. Our results therefore
sharply characterize the tradeoffs between sample, robustness and rank in
matrix completion.

We study the minimization of a convex function $f(X)$ over the set of
$n\times n$ positive semidefinite matrices, but when the problem is recast as
$\min_U g(U) := f(UU^\top)$, with $U \in \mathbb{R}^{n \times r}$ and $r \leq
n$. We study the performance of gradient descent on $g$which we refer to as
Factored Gradient Descent (FGD)under standard assumptions on the original
function $f$.
We provide a rule for selecting the step size and, with this choice, show
that the local convergence rate of FGD mirrors that of standard gradient
descent on the original $f$: i.e., after $k$ steps, the error is $O(1/k)$ for
smooth $f$, and exponentially small in $k$ when $f$ is (restricted) strongly
convex. In addition, we provide a procedure to initialize FGD for (restricted)
strongly convex objectives and when one only has access to $f$ via a
firstorder oracle; for several problem instances, such proper initialization
leads to global convergence guarantees.
FGD and similar procedures are widely used in practice for problems that can
be posed as matrix factorization. To the best of our knowledge, this is the
first paper to provide precise convergence rate guarantees for general convex
functions under standard convex assumptions.

Stochastic gradient descent is the method of choice for largescale machine
learning problems, by virtue of its light complexity per iteration. However, it
lags behind its nonstochastic counterparts with respect to the convergence
rate, due to high variance introduced by the stochastic updates. The popular
Stochastic VarianceReduced Gradient (SVRG) method mitigates this shortcoming,
introducing a new update rule which requires infrequent passes over the entire
input dataset to compute the fullgradient.
In this work, we propose CheapSVRG, a stochastic variancereduction
optimization scheme. Our algorithm is similar to SVRG but instead of the full
gradient, it uses a surrogate which can be efficiently computed on a small
subset of the input data. It achieves a linear convergence rate up to some
error level, depending on the nature of the optimization problemand features
a tradeoff between the computational complexity and the convergence rate.
Empirical evaluation shows that CheapSVRG performs at least competitively
compared to the state of the art.

In this paper we consider the collaborative ranking setting: a pool of users
each provides a small number of pairwise preferences between $d$ possible
items; from these we need to predict preferences of the users for items they
have not yet seen. We do so by fitting a rank $r$ score matrix to the pairwise
data, and provide two main contributions: (a) we show that an algorithm based
on convex optimization provides good generalization guarantees once each user
provides as few as $O(r\log^2 d)$ pairwise comparisons  essentially matching
the sample complexity required in the related matrix completion setting (which
uses actual numerical as opposed to pairwise information), and (b) we develop a
largescale nonconvex implementation, which we call AltSVM, that trains a
factored form of the matrix via alternating minimization (which we show reduces
to alternating SVM problems), and scales and parallelizes very well to large
problem settings. It also outperforms common baselines on many moderately large
popular collaborative filtering datasets in both NDCG and in other measures of
ranking performance.

Phase retrieval problems involve solving linear equations, but with missing
sign (or phase, for complex numbers) information. More than four decades after
it was first proposed, the seminal error reduction algorithm of (Gerchberg and
Saxton 1972) and (Fienup 1982) is still the popular choice for solving many
variants of this problem. The algorithm is based on alternating minimization;
i.e. it alternates between estimating the missing phase information, and the
candidate solution. Despite its wide usage in practice, no global convergence
guarantees for this algorithm are known. In this paper, we show that a
(resampling) variant of this approach converges geometrically to the solution
of one such problem  finding a vector $\mathbf{x}$ from
$\mathbf{y},\mathbf{A}$, where $\mathbf{y} =
\left\mathbf{A}^{\top}\mathbf{x}\right$ and $\mathbf{z}$ denotes a vector
of elementwise magnitudes of $\mathbf{z}$  under the assumption that
$\mathbf{A}$ is Gaussian.
Empirically, we demonstrate that alternating minimization performs similar to
recently proposed convex techniques for this problem (which are based on
"lifting" to a convex matrix problem) in sample complexity and robustness to
noise. However, it is much more efficient and can scale to large problems.
Analytically, for a resampling version of alternating minimization, we show
geometric convergence to the solution, and sample complexity that is off by log
factors from obvious lower bounds. We also establish close to optimal scaling
for the case when the unknown vector is sparse. Our work represents the first
theoretical guarantee for alternating minimization (albeit with resampling) for
any variant of phase retrieval problems in the nonconvex setting.

An active learner is given a class of models, a large set of unlabeled
examples, and the ability to interactively query labels of a subset of these
examples; the goal of the learner is to learn a model in the class that fits
the data well.
Previous theoretical work has rigorously characterized label complexity of
active learning, but most of this work has focused on the PAC or the agnostic
PAC model. In this paper, we shift our attention to a more general setting 
maximum likelihood estimation. Provided certain conditions hold on the model
class, we provide a twostage active learning algorithm for this problem. The
conditions we require are fairly general, and cover the widely popular class of
Generalized Linear Models, which in turn, include models for binary and
multiclass classification, regression, and conditional random fields.
We provide an upper bound on the label requirement of our algorithm, and a
lower bound that matches it up to lower order terms. Our analysis shows that
unlike binary classification in the realizable case, just a single extra round
of interaction is sufficient to achieve nearoptimal performance in maximum
likelihood estimation. On the empirical side, the recent work in
~\cite{Zhang12} and~\cite{Zhang14} (on active linear and logistic regression)
shows the promise of this approach.

In this paper we propose new techniques to sample arbitrary thirdorder
tensors, with an objective of speeding up tensor algorithms that have recently
gained popularity in machine learning. Our main contribution is a new way to
select, in a biased random way, only $O(n^{1.5}/\epsilon^2)$ of the possible
$n^3$ elements while still achieving each of the three goals: \\ {\em (a)
tensor sparsification}: for a tensor that has to be formed from arbitrary
samples, compute very few elements to get a good spectral approximation, and
for arbitrary orthogonal tensors {\em (b) tensor completion:} recover an
exactly lowrank tensor from a small number of samples via alternating least
squares, or {\em (c) tensor factorization:} approximating factors of a lowrank
tensor corrupted by noise. \\ Our sampling can be used along with existing
tensorbased algorithms to speed them up, removing the computational bottleneck
in these methods.

Graph clustering involves the task of dividing nodes into clusters, so that
the edge density is higher within clusters as opposed to across clusters. A
natural, classic and popular statistical setting for evaluating solutions to
this problem is the stochastic block model, also referred to as the planted
partition model.
In this paper we present a new algorithma convexified version of Maximum
Likelihoodfor graph clustering. We show that, in the classic stochastic block
model setting, it outperforms existing methods by polynomial factors when the
cluster size is allowed to have general scalings. In fact, it is within
logarithmic factors of known lower bounds for spectral methods, and there is
evidence suggesting that no polynomial time algorithm would do significantly
better.
We then show that this guarantee carries over to a more general extension of
the stochastic block model. Our method can handle the settings of semirandom
graphs, heterogeneous degree distributions, unequal cluster sizes, unaffiliated
nodes, partially observed graphs and planted clique/coloring etc. In
particular, our results provide the best exact recovery guarantees to date for
the planted partition, planted kdisjointcliques and planted noisy coloring
models with general cluster sizes; in other settings, we match the best
existing results up to logarithmic factors.

In this paper we look at content placement in the highdimensional regime:
there are n servers, and O(n) distinct types of content. Each server can store
and serve O(1) types at any given time. Demands for these content types arrive,
and have to be served in an online fashion; over time, there are a total of
O(n) of these demands. We consider the algorithmic task of content placement:
determining which types of content should be on which server at any given time,
in the setting where the demand statistics (i.e. the relative popularity of
each type of content) are not known apriori, but have to be inferred from the
very demands we are trying to satisfy. This is the high dimensional regime
because this scaling (everything being O(n)) prevents consistent estimation of
demand statistics; it models many modern settings where large numbers of users,
servers and videos/webpages interact in this way. We characterize the
performance of any scheme that separates learning and placement (i.e. which use
a portion of the demands to gain some estimate of the demand statistics, and
then uses the same for the remaining demands), showing it is orderwise
strictly suboptimal. We then study a simple adaptive scheme  which myopically
attempts to store the most recently requested content on idle servers  and
show it outperforms schemes that separate learning and placement. Our results
also generalize to the setting where the demand statistics change with time.
Overall, our results demonstrate that separating the estimation of demand, and
the subsequent use of the same, is strictly suboptimal.

A common phenomena in modern recommendation systems is the use of feedback
from one user to infer the `value' of an item to other users. This results in
an exploration vs. exploitation tradeoff, in which items of possibly low value
have to be presented to users in order to ascertain their value. Existing
approaches to solving this problem focus on the case where the number of items
are small, or admit some underlying structure  it is unclear, however, if
good recommendation is possible when dealing with contentrich settings with
unstructured content.
We consider this problem under a simple natural model, wherein the number of
items and the number of itemviews are of the same order, and an `accessgraph'
constrains which user is allowed to see which item. Our main insight is that
the presence of the accessgraph in fact makes good recommendation possible 
however this requires the exploration policy to be designed to take advantage
of the accessgraph. Our results demonstrate the importance of `serendipity' in
exploration, and how higher graphexpansion translates to a higher quality of
recommendations; it also suggests a reason why in some settings, simple
policies like Twitter's `LatestFirst' policy achieve a good performance.
From a technical perspective, our model presents a way to study
explorationexploitation tradeoffs in settings where the number of `trials' and
`strategies' are large (potentially infinite), and more importantly, of the
same order. Our algorithms admit competitiveratio guarantees which hold for
the worstcase user, under both finitepopulation and infinitehorizon
settings, and are parametrized in terms of properties of the underlying graph.
Conversely, we also demonstrate that improperlydesigned policies can be highly
suboptimal, and that in many settings, our results are orderwise optimal.

We consider the problem of subspace clustering: given points that lie on or
near the union of many lowdimensional linear subspaces, recover the subspaces.
To this end, one first identifies sets of points close to the same subspace and
uses the sets to estimate the subspaces. As the geometric structure of the
clusters (linear subspaces) forbids proper performance of general distance
based approaches such as Kmeans, many modelspecific methods have been
proposed. In this paper, we provide new simple and efficient algorithms for
this problem. Our statistical analysis shows that the algorithms are guaranteed
exact (perfect) clustering performance under certain conditions on the number
of points and the affinity between subspaces. These conditions are weaker than
those considered in the standard statistical literature. Experimental results
on synthetic data generated from the standard unions of subspaces model
demonstrate our theory. We also show that our algorithm performs competitively
against stateoftheart algorithms on realworld applications such as motion
segmentation and face clustering, with much simpler implementation and lower
computational cost.

We propose a new method for robust PCA  the task of recovering a lowrank
matrix from sparse corruptions that are of unknown value and support. Our
method involves alternating between projecting appropriate residuals onto the
set of lowrank matrices, and the set of sparse matrices; each projection is
{\em nonconvex} but easy to compute. In spite of this nonconvexity, we
establish exact recovery of the lowrank matrix, under the same conditions that
are required by existing methods (which are based on convex optimization). For
an $m \times n$ input matrix ($m \leq n)$, our method has a running time of
$O(r^2mn)$ per iteration, and needs $O(\log(1/\epsilon))$ iterations to reach
an accuracy of $\epsilon$. This is close to the running time of simple PCA via
the power method, which requires $O(rmn)$ per iteration, and
$O(\log(1/\epsilon))$ iterations. In contrast, existing methods for robust PCA,
which are based on convex optimization, have $O(m^2n)$ complexity per
iteration, and take $O(1/\epsilon)$ iterations, i.e., exponentially more
iterations for the same accuracy.
Experiments on both synthetic and real data establishes the improved speed
and accuracy of our method over existing convex implementations.

In this work, we propose a new randomized algorithm for computing a lowrank
approximation to a given matrix. Taking an approach different from existing
literature, our method first involves a specific biased sampling, with an
element being chosen based on the leverage scores of its row and column, and
then involves weighted alternating minimization over the factored form of the
intended lowrank matrix, to minimize error only on these samples. Our method
can leverage input sparsity, yet produce approximations in {\em spectral} (as
opposed to the weaker Frobenius) norm; this combines the best aspects of
otherwise disparate current results, but with a dependence on the condition
number $\kappa = \sigma_1/\sigma_r$. In particular we require $O(nnz(M) +
\frac{n\kappa^2 r^5}{\epsilon^2})$ computations to generate a rank$r$
approximation to $M$ in spectral norm. In contrast, the best existing method
requires $O(nnz(M)+ \frac{nr^2}{\epsilon^4})$ time to compute an approximation
in Frobenius norm. Besides the tightness in spectral norm, we have a better
dependence on the error $\epsilon$. Our method is naturally and highly
parallelizable.
Our new approach enables two extensions that are interesting on their own.
The first is a new method to directly compute a lowrank approximation (in
efficient factored form) to the product of two given matrices; it computes a
small random set of entries of the product, and then executes weighted
alternating minimization (as before) on these. The sampling strategy is
different because now we cannot access leverage scores of the product matrix
(but instead have to work with input matrices). The second extension is an
improved algorithm with smaller communication complexity for the distributed
PCA setting (where each server has small set of rows of the matrix, and want to
compute low rank approximation with small amount of communication with other
servers).

This paper considers the problem of clustering a partially observed
unweighted graphi.e., one where for some node pairs we know there is an edge
between them, for some others we know there is no edge, and for the remaining
we do not know whether or not there is an edge. We want to organize the nodes
into disjoint clusters so that there is relatively dense (observed)
connectivity within clusters, and sparse across clusters.
We take a novel yet natural approach to this problem, by focusing on finding
the clustering that minimizes the number of "disagreements"i.e., the sum of
the number of (observed) missing edges within clusters, and (observed) present
edges across clusters. Our algorithm uses convex optimization; its basis is a
reduction of disagreement minimization to the problem of recovering an
(unknown) lowrank matrix and an (unknown) sparse matrix from their partially
observed sum. We evaluate the performance of our algorithm on the classical
Planted Partition/Stochastic Block Model. Our main theorem provides sufficient
conditions for the success of our algorithm as a function of the minimum
cluster size, edge density and observation probability; in particular, the
results characterize the tradeoff between the observation probability and the
edge density gap. When there are a constant number of clusters of equal size,
our results are optimal up to logarithmic factors.

Matrix completion, i.e., the exact and provable recovery of a lowrank matrix
from a small subset of its elements, is currently only known to be possible if
the matrix satisfies a restrictive structural constraintknown as {\em
incoherence}on its row and column spaces. In these cases, the subset of
elements is sampled uniformly at random.
In this paper, we show that {\em any} rank$ r $ $ n$by$ n $ matrix can be
exactly recovered from as few as $O(nr \log^2 n)$ randomly chosen elements,
provided this random choice is made according to a {\em specific biased
distribution}: the probability of any element being sampled should be
proportional to the sum of the leverage scores of the corresponding row, and
column. Perhaps equally important, we show that this specific form of sampling
is nearly necessary, in a natural precise sense; this implies that other
perhaps more intuitive sampling schemes fail.
We further establish three ways to use the above result for the setting when
leverage scores are not known \textit{a priori}: (a) a sampling strategy for
the case when only one of the row or column spaces are incoherent, (b) a
twophase sampling procedure for general matrices that first samples to
estimate leverage scores followed by sampling for exact recovery, and (c) an
analysis showing the advantages of weighted nuclear/tracenorm minimization
over the vanilla unweighted formulation for the case of nonuniform sampling.

Mixed linear regression involves the recovery of two (or more) unknown
vectors from unlabeled linear measurements; that is, where each sample comes
from exactly one of the vectors, but we do not know which one. It is a classic
problem, and the natural and empirically most popular approach to its solution
has been the EM algorithm. As in other settings, this is prone to bad local
minima; however, each iteration is very fast (alternating between guessing
labels, and solving with those labels).
In this paper we provide a new initialization procedure for EM, based on
finding the leading two eigenvectors of an appropriate matrix. We then show
that with this, a resampled version of the EM algorithm provably converges to
the correct vectors, under natural assumptions on the sampling distribution,
and with nearly optimal (unimprovable) sample complexity. This provides not
only the first characterization of EM's performance, but also much lower sample
complexity as compared to both standard (randomly initialized) EM, and other
methods for this problem.

Alternating minimization represents a widely applicable and empirically
successful approach for finding lowrank matrices that best fit the given data.
For example, for the problem of lowrank matrix completion, this method is
believed to be one of the most accurate and efficient, and formed a major
component of the winning entry in the Netflix Challenge.
In the alternating minimization approach, the lowrank target matrix is
written in a bilinear form, i.e. $X = UV^\dag$; the algorithm then alternates
between finding the best $U$ and the best $V$. Typically, each alternating step
in isolation is convex and tractable. However the overall problem becomes
nonconvex and there has been almost no theoretical understanding of when this
approach yields a good result.
In this paper we present first theoretical analysis of the performance of
alternating minimization for matrix completion, and the related problem of
matrix sensing. For both these problems, celebrated recent results have shown
that they become wellposed and tractable once certain (now standard)
conditions are imposed on the problem. We show that alternating minimization
also succeeds under similar conditions. Moreover, compared to existing results,
our paper shows that alternating minimization guarantees faster (in particular,
geometric) convergence to the true matrix, while allowing a simpler analysis.