
In this paper we study the problem of learning a shallow artificial neural
network that best fits a training data set. We study this problem in the
overparameterized regime where the number of observations are fewer than the
number of parameters in the model. We show that with quadratic activations the
optimization landscape of training such shallow neural networks has certain
favorable characteristics that allow globally optimal models to be found
efficiently using a variety of local search heuristics. This result holds for
an arbitrary training data of input/output pairs. For differentiable activation
functions we also show that gradient descent, when suitably initialized,
converges at a linear rate to a globally optimal model. This result focuses on
a realizable model where the inputs are chosen i.i.d. from a Gaussian
distribution and the labels are generated according to planted weight
coefficients.

Dealing with the shear size and complexity of today's massive data sets
requires computational platforms that can analyze data in a parallelized and
distributed fashion. A major bottleneck that arises in such modern distributed
computing environments is that some of the worker nodes may run slow. These
nodes a.k.a.~stragglers can significantly slow down computation as the slowest
node may dictate the overall computational time. A recent computational
framework, called encoded optimization, creates redundancy in the data to
mitigate the effect of stragglers. In this paper we develop novel mathematical
understanding for this framework demonstrating its effectiveness in much
broader settings than was previously understood. We also analyze the
convergence behavior of iterative encoded optimization algorithms, allowing us
to characterize fundamental tradeoffs between convergence rate, size of data
set, accuracy, computational load (or data redundancy), and straggler
toleration in this framework.

Fourier ptychographic microscopy enables gigapixelscale imaging, with both
large fieldofview and high resolution. Using a set of lowresolution images
that are recorded under varying illumination angles, the goal is to
computationally reconstruct highresolution phase and amplitude images. To
increase temporal resolution, one may use multiplexed measurements where the
sample is illuminated simultaneously from a subset of the angles. In this
paper, we develop an algorithm for Fourier ptychographic microscopy with such
multiplexed illumination. Specifically, we consider gradient descent type
updates and propose an analytical step size that ensures the convergence of the
iterates to a stationary point. Furthermore, we propose an accelerated version
of our algorithm (with the same step size) which significantly improves the
convergence speed. We demonstrate that the practical performance of our
algorithm is identical to the case where the step size is manually tuned.
Finally, we apply our parameterfree approach to real data and validate its
applicability.

In this paper, we study the problem of maximizing continuous submodular
functions that naturally arise in many learning applications such as those
involving utility functions in active learning and sensing, matrix
approximations and network inference. Despite the apparent lack of convexity in
such functions, we prove that stochastic projected gradient methods can provide
strong approximation guarantees for maximizing continuous submodular functions
with convex constraints. More specifically, we prove that for monotone
continuous DRsubmodular functions, all fixed points of projected gradient
ascent provide a factor $1/2$ approximation to the global maxima. We also study
stochastic gradient and mirror methods and show that after
$\mathcal{O}(1/\epsilon^2)$ iterations these methods reach solutions which
achieve in expectation objective values exceeding
$(\frac{\text{OPT}}{2}\epsilon)$. An immediate application of our results is
to maximize submodular functions that are defined stochastically, i.e. the
submodular function is defined as an expectation over a family of submodular
functions with an unknown distribution. We will show how stochastic gradient
methods are naturally wellsuited for this setting, leading to a factor $1/2$
approximation when the function is monotone. In particular, it allows us to
approximately maximize discrete, monotone submodular optimization problems via
projected gradient descent on a continuous relaxation, directly connecting the
discrete and continuous domains. Finally, experiments on real data demonstrate
that our projected gradient methods consistently achieve the best utility
compared to other continuous baselines while remaining competitive in terms of
computational effort.

In this paper we study the problem of learning Rectified Linear Units (ReLUs)
which are functions of the form $max(0,<w,x>)$ with $w$ denoting the weight
vector. We study this problem in the highdimensional regime where the number
of observations are fewer than the dimension of the weight vector. We assume
that the weight vector belongs to some closed set (convex or nonconvex) which
captures known sideinformation about its structure. We focus on the realizable
model where the inputs are chosen i.i.d.~from a Gaussian distribution and the
labels are generated according to a planted weight vector. We show that
projected gradient descent, when initialization at 0, converges at a linear
rate to the planted model with a number of samples that is optimal up to
numerical constants. Our results on the dynamics of convergence of these very
shallow neural nets may provide some insights towards understanding the
dynamics of deeper architectures.

This paper concerns the problem of recovering an unknown but structured
signal $x \in R^n$ from $m$ quadratic measurements of the form
$y_r=<a_r,x>^2$ for $r=1,2,...,m$. We focus on the underdetermined setting
where the number of measurements is significantly smaller than the dimension of
the signal ($m<<n$). We formulate the recovery problem as a nonconvex
optimization problem where prior structural information about the signal is
enforced through constrains on the optimization variables. We prove that
projected gradient descent, when initialized in a neighborhood of the desired
signal, converges to the unknown signal at a linear rate. These results hold
for any constraint set (convex or nonconvex) providing convergence guarantees
to the global optimum even when the objective function and constraint set is
nonconvex. Furthermore, these results hold with a number of measurements that
is only a constant factor away from the minimal number of measurements required
to uniquely identify the unknown signal. Our results provide the first provably
tractable algorithm for this datapoor regime, breaking local sample complexity
barriers that have emerged in recent literature. In a companion paper we
demonstrate favorable properties for the optimization problem that may enable
similar results to continue to hold more globally (over the entire ambient
space). Collectively these two papers utilize and develop powerful tools for
uniform convergence of empirical processes that may have broader implications
for rigorous understanding of constrained nonconvex optimization heuristics.
The mathematical results in this paper also pave the way for a new generation
of datadriven phaseless imaging systems that can utilize prior information to
significantly reduce acquisition time and enhance image reconstruction,
enabling nanoscale imaging at unprecedented speeds and resolutions.

In this paper we study the problem of recovering a structured but unknown
parameter ${\bf{\theta}}^*$ from $n$ nonlinear observations of the form
$y_i=f(\langle {\bf{x}}_i,{\bf{\theta}}^*\rangle)$ for $i=1,2,\ldots,n$. We
develop a framework for characterizing timedata tradeoffs for a variety of
parameter estimation algorithms when the nonlinear function $f$ is unknown.
This framework includes many popular heuristics such as projected/proximal
gradient descent and stochastic schemes. For example, we show that a projected
gradient descent scheme converges at a linear rate to a reliable solution with
a near minimal number of samples. We provide a sharp characterization of the
convergence rate of such algorithms as a function of sample size, amount of
aprior knowledge available about the parameter and a measure of the
nonlinearity of the function $f$. These results provide a precise understanding
of the various tradeoffs involved between statistical and computational
resources as well as aprior side information available for such nonlinear
parameter estimation problems.

Line spectral estimation is the problem of recovering the frequencies and
amplitudes of a mixture of a few sinusoids from equispaced samples. However, in
a variety of signal processing problems arising in imaging, radar, and
localization we do not have access directly to such equispaced samples. Rather
we only observe a severely undersampled version of these observations through
linear measurements. This paper is about such generalized line spectral
estimation problems. We reformulate these problems as sparse signal recovery
problems over a continuously indexed dictionary which can be solved via a
convex program. We prove that the frequencies and amplitudes of the components
of the mixture can be recovered perfectly from a nearminimal number of
observations via this convex program. This result holds provided the
frequencies are sufficiently separated, and the linear measurements obey
natural conditions that are satisfied in a variety of applications.

Highdimensional data often lie in lowdimensional subspaces corresponding to
different classes they belong to. Finding sparse representations of data points
in a dictionary built using the collection of data helps to uncover
lowdimensional subspaces and address problems such as clustering,
classification, subset selection and more. In this paper, we address the
problem of recovering sparse representations for noisy data points in a
dictionary whose columns correspond to corrupted data lying close to a union of
subspaces. We consider a constrained $\ell_1$minimization and study conditions
under which the solution of the proposed optimization satisfies the approximate
subspacesparse recovery condition. More specifically, we show that each noisy
data point, perturbed from a subspace by a noise of the magnitude of
$\varepsilon$, will be reconstructed using data points from the same subspace
with a small error of the order of $O(\varepsilon)$ and that the coefficients
corresponding to data points in other subspaces will be sufficiently small,
\ie, of the order of $O(\varepsilon)$. We do not impose any randomness
assumption on the arrangement of subspaces or distribution of data points in
each subspace. Our framework is based on a novel generalization of the
nullspace property to the setting where data lie in multiple subspaces, the
number of data points in each subspace exceeds the dimension of the subspace,
and all data points are corrupted by noise. Moreover, assuming a random
distribution for data points, we further show that coefficients from the
desired support not only reconstruct a given point with high accuracy, but also
have sufficiently large values, \ie, of the order of $O(1)$.

In this paper we study the problem of recovering a lowrank matrix from
linear measurements. Our algorithm, which we call Procrustes Flow, starts from
an initial estimate obtained by a thresholding scheme followed by gradient
descent on a nonconvex objective. We show that as long as the measurements
obey a standard restricted isometry property, our algorithm converges to the
unknown matrix at a geometric rate. In the case of Gaussian measurements, such
convergence occurs for a $n_1 \times n_2$ matrix of rank $r$ when the number of
measurements exceeds a constant times $(n_1+n_2)r$.

In this paper we characterize sharp timedata tradeoffs for optimization
problems used for solving linear inverse problems. We focus on the minimization
of a leastsquares objective subject to a constraint defined as the sublevel
set of a penalty function. We present a unified convergence analysis of the
gradient projection algorithm applied to such problems. We sharply characterize
the convergence rate associated with a wide variety of random measurement
ensembles in terms of the number of measurements and structural complexity of
the signal with respect to the chosen penalty function. The results apply to
both convex and nonconvex constraints, demonstrating that a linear convergence
rate is attainable even though the least squares objective is not strongly
convex in these settings. When specialized to Gaussian measurements our results
show that such linear convergence occurs when the number of measurements is
merely 4 times the minimal number required to recover the desired signal at all
(a.k.a. the phase transition). We also achieve a slower but geometric rate of
convergence precisely above the phase transition point. Extensive numerical
results suggest that the derived rates exactly match the empirical performance.

Fourier ptychography is a new computational microscopy technique that
provides gigapixelscale intensity and phase images with both wide
fieldofview and high resolution. By capturing a stack of lowresolution
images under different illumination angles, a nonlinear inverse algorithm can
be used to computationally reconstruct the highresolution complex field. Here,
we compare and classify multiple proposed inverse algorithms in terms of
experimental robustness. We find that the main sources of error are noise,
aberrations and miscalibration (i.e. model mismatch). Using simulations and
experiments, we demonstrate that the choice of cost function plays a critical
role, with amplitudebased cost functions performing better than
intensitybased ones. The reason for this is that Fourier ptychography datasets
consist of images from both brightfield and darkfield illumination,
representing a large range of measured intensities. Both noise (e.g. Poisson
noise) and model mismatch errors are shown to scale with intensity. Hence,
algorithms that use an appropriate cost function will be more tolerant to both
noise and model mismatch. Given these insights, we propose a global Newton's
method algorithm which is robust and computationally efficient. Finally, we
discuss the impact of procedures for algorithmic correction of aberrations and
miscalibration.

We study the problem of recovering the phase from magnitude measurements;
specifically, we wish to reconstruct a complexvalued signal x of C^n about
which we have phaseless samples of the form y_r = < a_r,x >^2, r = 1,2,...,m
(knowledge of the phase of these samples would yield a linear system). This
paper develops a nonconvex formulation of the phase retrieval problem as well
as a concrete solution algorithm. In a nutshell, this algorithm starts with a
careful initialization obtained by means of a spectral method, and then refines
this initial estimate by iteratively applying novel update rules, which have
low computational complexity, much like in a gradient descent scheme. The main
contribution is that this algorithm is shown to rigorously allow the exact
retrieval of phase information from a nearly minimal number of random
measurements. Indeed, the sequence of successive iterates provably converges to
the solution at a geometric rate so that the proposed scheme is efficient both
in terms of computational and data resources. In theory, a variation on this
scheme leads to a nearlinear time algorithm for a physically realizable model
based on coded diffraction patterns. We illustrate the effectiveness of our
methods with various experiments on image data. Underlying our analysis are
insights for the analysis of nonconvex optimization schemes that may have
implications for computational problems beyond phase retrieval.

In this paper we show that for the purposes of dimensionality reduction
certain class of structured random matrices behave similarly to random Gaussian
matrices. This class includes several matrices for which matrixvector multiply
can be computed in loglinear time, providing efficient dimensionality
reduction of general sets. In particular, we show that using such matrices any
set from high dimensions can be embedded into lower dimensions with near
optimal distortion. We obtain our results by connecting dimensionality
reduction of any set to dimensionality reduction of sparse vectors via a
chaining argument.

In this paper we study the identification of a timevarying linear system
from its response to a known input signal. More specifically, we consider
systems whose response to the input signal is given by a weighted superposition
of delayed and Doppler shifted versions of the input. This problem arises in a
multitude of applications such as wireless communications and radar imaging.
Due to practical constraints, the input signal has finite bandwidth B, and the
received signal is observed over a finite time interval of length T only. This
gives rise to a delay and Doppler resolution of 1/B and 1/T. We show that this
resolution limit can be overcome, i.e., we can exactly recover the continuous
delayDoppler pairs and the corresponding attenuation factors, by solving a
convex optimization problem. This result holds provided that the distance
between the delayDoppler pairs is at least 2.37/B in time or 2.37/T in
frequency. Furthermore, this result allows the total number of delayDoppler
pairs to be linear up to a logfactor in BT, the dimensionality of the response
of the system, and thereby the limit for identifiability. Stated differently,
we show that we can estimate the timefrequency components of a signal that is
Ssparse in the continuous dictionary of timefrequency shifts of a random
window function, from a number of measurements, that is linear up to a
logfactor in S.

Subspace clustering refers to the task of finding a multisubspace
representation that best fits a collection of points taken from a
highdimensional space. This paper introduces an algorithm inspired by sparse
subspace clustering (SSC) [In IEEE Conference on Computer Vision and Pattern
Recognition, CVPR (2009) 27902797] to cluster noisy data, and develops some
novel theory demonstrating its correctness. In particular, the theory uses
ideas from geometric functional analysis to show that the algorithm can
accurately recover the underlying subspaces under minimal requirements on their
orientation, and on the number of samples per subspace. Synthetic as well as
real data experiments complement our theoretical study, illustrating our
approach and demonstrating its effectiveness.

This paper considers the question of recovering the phase of an object from
intensityonly measurements, a problem which naturally appears in Xray
crystallography and related disciplines. We study a physically realistic setup
where one can modulate the signal of interest and then collect the intensity of
its diffraction pattern, each modulation thereby producing a sort of coded
diffraction pattern. We show that PhaseLift, a recent convex programming
technique, recovers the phase information exactly from a number of random
modulations, which is polylogarithmic in the number of unknowns. Numerical
experiments with noiseless and noisy data complement our theoretical analysis
and illustrate our approach.

This paper considers the problem of clustering a collection of unlabeled data
points assumed to lie near a union of lowerdimensional planes. As is common in
computer vision or unsupervised learning applications, we do not know in
advance how many subspaces there are nor do we have any information about their
dimensions. We develop a novel geometric analysis of an algorithm named sparse
subspace clustering (SSC) [In IEEE Conference on Computer Vision and Pattern
Recognition, 2009. CVPR 2009 (2009) 27902797. IEEE], which significantly
broadens the range of problems where it is provably effective. For instance, we
show that SSC can recover multiple subspaces, each of dimension comparable to
the ambient dimension. We also prove that SSC can correctly cluster data points
even when the subspaces of interest intersect. Further, we develop an extension
of SSC that succeeds when the data set is corrupted with possibly
overwhelmingly many outliers. Underlying our analysis are clear geometric
insights, which may bear on other sparse recovery problems. A numerical study
complements our theoretical analysis and demonstrates the effectiveness of
these methods.

Discussion of "Latent variable graphical model selection via convex
optimization" by Venkat Chandrasekaran, Pablo A. Parrilo and Alan S. Willsky
[arXiv:1008.1290].

Wireless OFDM channels can be approximated by a time varying filter with
sparse time domain taps. Recent achievements in sparse signal processing such
as compressed sensing have facilitated the use of sparsity in estimation, which
improves the performance significantly. The problem of these sparsebased
methods is the need for a stable transformation matrix which is not fulfilled
in the current transmission setups. To assist the analog filtering at the
receiver, the transmitter leaves some of the subcarriers at both edges of the
bandwidth unused which results in an illconditioned DFT submatrix. To overcome
this difficulty we propose Adaptive Thresholding for Sparse Signal Detection
(ATSSD). Simulation results confirm that the proposed method works well in
timeinvariant and specially timevarying channels where other methods may not
work as well.

In this paper we introduce a new class of codes for overloaded synchronous
wireless CDMA systems which increases the number of users for a fixed number of
chips without introducing any errors. In addition these codes support active
user detection. We derive an upper bound on the number of users with a fixed
spreading factor. Also we propose an ML decoder for a subclass of these codes
that is computationally implementable. Although for our simulations we consider
a scenario that is worse than what occurs in practice, simulation results
indicate that this coding/decoding scheme is robust against additive noise. As
an example, for 64 chips and 88 users we propose a coding/decoding scheme that
can obtain an arbitrary small probability of error which is computationally
feasible and can detect active users. Furthermore, we prove that for this to be
possible the number of users cannot be beyond 230.