
Binary embedding is a nonlinear dimension reduction methodology where high
dimensional data are embedded into the Hamming cube while preserving the
structure of the original space. Specifically, for an arbitrary $N$ distinct
points in $\mathbb{S}^{p1}$, our goal is to encode each point using
$m$dimensional binary strings such that we can reconstruct their geodesic
distance up to $\delta$ uniform distortion. Existing binary embedding
algorithms either lack theoretical guarantees or suffer from running time
$O\big(mp\big)$. We make three contributions: (1) we establish a lower bound
that shows any binary embedding oblivious to the set of points requires $m =
\Omega(\frac{1}{\delta^2}\log{N})$ bits and a similar lower bound for
nonoblivious embeddings into Hamming distance; (2) [DELETED, see comment]; (3)
we also provide an analytic result about embedding a general set of points $K
\subseteq \mathbb{S}^{p1}$ with even infinite size. Our theoretical findings
are supported through experiments on both synthetic and real data sets.

We introduce a new malware detector  ShapeGD  that aggregates permachine
detectors into a robust global detector. ShapeGD is based on two insights: 1.
Structural: actions such as visiting a website (waterhole attack) by nodes
correlate well with malware spread, and create dynamic neighborhoods of nodes
that were exposed to the same attack vector. However, neighborhood sizes vary
unpredictably and require aggregating an unpredictable number of local
detectors' outputs into a global alert. 2. Statistical: feature vectors
corresponding to true and false positives of local detectors have markedly
different conditional distributions  i.e. their shapes differ. The shape of
neighborhoods can identify infected neighborhoods without having to estimate
neighborhood sizes  on 5 years of Symantec detectors' logs, ShapeGD reduces
false positives from ~1M down to ~110K and raises alerts 345 days (on average)
before commercial antivirus products; in a waterhole attack simulated using
Yahoo webservice logs, ShapeGD detects infected machines when only ~100 of
~550K are compromised.

Motivated by the study of controlling (curing) epidemics, we consider the
spread of an SI process on a known graph, where we have a limited budget to use
to transition infected nodes back to the susceptible state (i.e., to cure
nodes). Recent work has demonstrated that under perfect and instantaneous
information (which nodes are/are not infected), the budget required for curing
a graph precisely depends on a combinatorial property called the CutWidth. We
show that this assumption is in fact necessary: even a minor degradation of
perfect information, e.g., a diagnostic test that is 99% accurate, drastically
alters the landscape. Infections that could previously be cured in sublinear
time now may require exponential time, or orderwise larger budget to cure. The
crux of the issue comes down to a tension not present in the full information
case: if a node is suspected (but not certain) to be infected, do we risk
wasting our budget to try to cure an uninfected node, or increase our certainty
by longer observation, at the risk that the infection spreads further? Our
results present fundamental, algorithmindependent bounds that tradeoff budget
required vs. uncertainty.

Behavioral malware detectors promise to expose previously unknown malware and
are an important security primitive. However, even the best behavioral
detectors suffer from high false positives and negatives. In this paper, we
address the challenge of aggregating weak perdevice behavioral detectors in
noisy communities (i.e., ones that produce alerts at unpredictable rates) into
an accurate and robust global anomaly detector (GD).
Our system  Shape GD  combines two insights: Structural: actions such as
visiting a website (waterhole attack) or membership in a shared email thread
(phishing attack) by nodes correlate well with malware spread, and create
dynamic neighborhoods of nodes that were exposed to the same attack vector; and
Statistical: feature vectors corresponding to true and false positives of local
detectors have markedly different conditional distributions. We use
neighborhoods to amplify the transient lowdimensional structure that is latent
in highdimensional feature vectors  but neighborhoods vary unpredictably, and
we use shape to extract robust neighborhoodlevel features that identify
infected neighborhoods.
Unlike prior works that aggregate local detectors' alert bitstreams or
cluster the feature vectors, Shape GD analyzes the feature vectors that led to
local alerts (alertFVs) to separate true and false positives. Shape GD first
filters these alertFVs into neighborhoods and efficiently maps a
neighborhood's alertFVs' statistical shapes into a scalar score. Shape GD then
acts as a neighborhood level anomaly detector  training on benign program
traces to learn the ShapeScore of false positive neighborhoods, and classifying
neighborhoods with anomalous ShapeScores as malicious.
Shape GD detects malware early (~100 infected nodes in a ~100K node system
for waterhole and ~10 of 1000 for phishing) and robustly (with ~100% global TP
and ~1% global FP rates).

We present a novel method for frequentist statistical inference in
$M$estimation problems, based on stochastic gradient descent (SGD) with a
fixed step size: we demonstrate that the average of such SGD sequences can be
used for statistical inference, after proper scaling. An intuitive analysis
using the OrnsteinUhlenbeck process suggests that such averages are
asymptotically normal. From a practical perspective, our SGDbased inference
procedure is a first order method, and is wellsuited for large scale problems.
To show its merits, we apply it to both synthetic and real datasets, and
demonstrate that its accuracy is comparable to classical statistical methods,
while requiring potentially far less computation.

A rank$r$ matrix $X \in \mathbb{R}^{m \times n}$ can be written as a product
$U V^\top$, where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n
\times r}$. One could exploit this observation in optimization: e.g., consider
the minimization of a convex function $f(X)$ over rank$r$ matrices, where the
set of rank$r$ matrices is modeled via the factorization $UV^\top$. Though
such parameterization reduces the number of variables, and is more
computationally efficient (of particular interest is the case $r \ll \min\{m,
n\}$), it comes at a cost: $f(UV^\top)$ becomes a nonconvex function w.r.t.
$U$ and $V$.
We study such parameterization for optimization of generic convex objectives
$f$, and focus on firstorder, gradient descent algorithmic solutions. We
propose the BiFactored Gradient Descent (BFGD) algorithm, an efficient
firstorder method that operates on the $U, V$ factors. We show that when $f$
is (restricted) smooth, BFGD has local sublinear convergence, and linear
convergence when $f$ is both (restricted) smooth and (restricted) strongly
convex. For several key applications, we provide simple and efficient
initialization schemes that provide approximate solutions good enough for the
above convergence results to hold.

We study the projected gradient descent method on lowrank matrix problems
with a strongly convex objective. We use the BurerMonteiro factorization
approach to implicitly enforce lowrankness; such factorization introduces
nonconvexity in the objective. We focus on constraint sets that include both
positive semidefinite (PSD) constraints and specific matrix normconstraints.
Such criteria appear in quantum state tomography and phase retrieval
applications.
We show that nonconvex projected gradient descent favors local linear
convergence in the factored space. We build our theory on a novel descent
lemma, that nontrivially extends recent results on the unconstrained problem.
The resulting algorithm is Projected Factored Gradient Descent, abbreviated as
ProjFGD, and shows superior performance compared to state of the art on quantum
state tomography and sparse phase retrieval applications.

We consider the nonsquare matrix sensing problem, under restricted isometry
property (RIP) assumptions. We focus on the nonconvex formulation, where any
rank$r$ matrix $X \in \mathbb{R}^{m \times n}$ is represented as $UV^\top$,
where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$. In
this paper, we complement recent findings on the nonconvex geometry of the
analogous PSD setting [5], and show that matrix factorization does not
introduce any spurious local minima, under RIP.

We consider the problem of Robust PCA in the fully and partially observed
settings. Without corruptions, this is the wellknown matrix completion
problem. From a statistical standpoint this problem has been recently
wellstudied, and conditions on when recovery is possible (how many
observations do we need, how many corruptions can we tolerate) via
polynomialtime algorithms is by now understood. This paper presents and
analyzes a nonconvex optimization approach that greatly reduces the
computational complexity of the above problems, compared to the best available
algorithms. In particular, in the fully observed case, with $r$ denoting rank
and $d$ dimension, we reduce the complexity from
$\mathcal{O}(r^2d^2\log(1/\varepsilon))$ to
$\mathcal{O}(rd^2\log(1/\varepsilon))$  a big savings when the rank is big.
For the partially observed case, we show the complexity of our algorithm is no
more than $\mathcal{O}(r^4d \log d \log(1/\varepsilon))$. Not only is this the
bestknown runtime for a provable algorithm under partial observation, but in
the setting where $r$ is small compared to $d$, it also allows for
nearlinearin$d$ runtime that can be exploited in the fullyobserved case as
well, by simply running our algorithm on a subset of the observations.

We consider the problem of solving mixed random linear equations with $k$
components. This is the noiseless setting of mixed linear regression. The goal
is to estimate multiple linear models from mixed samples in the case where the
labels (which sample corresponds to which model) are not observed. We give a
tractable algorithm for the mixed linear equation problem, and show that under
some technical conditions, our algorithm is guaranteed to solve the problem
exactly with sample complexity linear in the dimension, and polynomial in $k$,
the number of components. Previous approaches have required either exponential
dependence on $k$, or superlinear dependence on the dimension. The proposed
algorithm is a combination of tensor decomposition and alternating
minimization. Our analysis involves proving that the initialization provided by
the tensor method allows alternating minimization, which is equivalent to EM in
our setting, to converge to the global optimum at a linear rate.

This paper considers the problem of matrix completion when some number of the
columns are completely and arbitrarily corrupted, potentially by a malicious
adversary. It is wellknown that standard algorithms for matrix completion can
return arbitrarily poor results, if even a single column is corrupted. One
direct application comes from robust collaborative filtering. Here, some number
of users are socalled manipulators who try to skew the predictions of the
algorithm by calibrating their inputs to the system. In this paper, we develop
an efficient algorithm for this problem based on a combination of a trimming
procedure and a convex program that minimizes the nuclear norm and the
$\ell_{1,2}$ norm. Our theoretical results show that given a vanishing fraction
of observed entries, it is nevertheless possible to complete the underlying
matrix even when the number of corrupted columns grows. Significantly, our
results hold without any assumptions on the locations or values of the observed
entries of the manipulated columns. Moreover, we show by an
informationtheoretic argument that our guarantees are nearly optimal in terms
of the fraction of sampled entries on the authentic columns, the fraction of
corrupted columns, and the rank of the underlying matrix. Our results therefore
sharply characterize the tradeoffs between sample, robustness and rank in
matrix completion.

Two key traits of 5G cellular networks are much higher base station (BS)
densities  especially in the case of lowpower BSs  and the use of massive
MIMO at these BSs. This paper explores how massive MIMO can be used to jointly
maximize the offloading gains and minimize the interference challenges arising
from adding small cells. We consider two interference management approaches:
joint transmission (JT) with local precoding, where users are served
simultaneously by multiple BSs without requiring channel state information
exchanges among cooperating BSs, and resource blanking, where some macro BS
resources are left blank to reduce the interference in the small cell downlink.
A key advantage offered by massive MIMO is channel hardening, which enables to
predict instantaneous rates a priori. This allows us to develop a unified
framework, where resource allocation is cast as a network utility maximization
(NUM) problem, and to demonstrate large gains in celledge rates based on the
NUM solution. We propose an efficient dual subgradient based algorithm, which
converges towards the NUM solution. A scheduling scheme is also proposed to
approach the NUM solution. Simulations illustrate more than 2x rate gain for
10th percentile users vs. an optimal association without interference
management.

Latent variable models are a fundamental modeling tool in machine learning
applications, but they present significant computational and analytical
challenges. The popular EM algorithm and its variants, is a much used
algorithmic tool; yet our rigorous understanding of its performance is highly
incomplete. Recently, work in Balakrishnan et al. (2014) has demonstrated that
for an important class of problems, EM exhibits linear local convergence. In
the highdimensional setting, however, the Mstep may not be well defined. We
address precisely this setting through a unified treatment using
regularization. While regularization for highdimensional problems is by now
well understood, the iterative EM algorithm requires a careful balancing of
making progress towards the solution while identifying the right structure
(e.g., sparsity or lowrank). In particular, regularizing the Mstep using the
stateoftheart highdimensional prescriptions (e.g., Wainwright (2014)) is
not guaranteed to provide this balance. Our algorithm and analysis are linked
in a way that reveals the balance between optimization and statistical errors.
We specialize our general framework to sparse gaussian mixture models,
highdimensional mixed regression, and regression with missing variables,
obtaining statistical guarantees for each of these examples.

Linear regression studies the problem of estimating a model parameter
$\beta^* \in \mathbb{R}^p$, from $n$ observations
$\{(y_i,\mathbf{x}_i)\}_{i=1}^n$ from linear model $y_i = \langle
\mathbf{x}_i,\beta^* \rangle + \epsilon_i$. We consider a significant
generalization in which the relationship between $\langle \mathbf{x}_i,\beta^*
\rangle$ and $y_i$ is noisy, quantized to a single bit, potentially nonlinear,
noninvertible, as well as unknown. This model is known as the singleindex
model in statistics, and, among other things, it represents a significant
generalization of onebit compressed sensing. We propose a novel spectralbased
estimation procedure and show that we can recover $\beta^*$ in settings (i.e.,
classes of link function $f$) where previous algorithms fail. In general, our
algorithm requires only very mild restrictions on the (unknown) functional
relationship between $y_i$ and $\langle \mathbf{x}_i,\beta^* \rangle$. We also
consider the high dimensional setting where $\beta^*$ is sparse ,and introduce
a twostage nonconvex framework that addresses estimation challenges in high
dimensional regimes where $p \gg n$. For a broad class of link functions
between $\langle \mathbf{x}_i,\beta^* \rangle$ and $y_i$, we establish minimax
lower bounds that demonstrate the optimality of our estimators in both the
classical and high dimensional regimes.

We consider a server serving a timeslotted queued system of multiple
packetbased flows, with exogenous packet arrivals and timevarying service
rates. At each time, the server can observe instantaneous service rates for
only a subset of flows (from within a fixed collection of observable subsets)
before scheduling a flow in the subset for service. We are interested in
queuelength aware scheduling to keep the queues short, and develop scheduling
algorithms that use only partial service rate information from subsets of
channels to minimize the likelihood of queue overflow in the system.
Specifically, we present a new joint subsetsampling and scheduling algorithm
called MaxExp that uses only the current queue lengths to pick a subset of
flows, and subsequently schedules a flow using the Exponential rule. When the
collection of observable subsets is disjoint, we show that MaxExp achieves the
best exponential decay rate, among all scheduling algorithms using partial
information, of the tail of the longest queue in the system. Towards this, we
employ novel analytical techniques for studying the performance of scheduling
algorithms using partial state, which may be of independent interest. These
include new samplepath large deviations results for processes obtained by
nonrandom, predictable sampling of sequences of independent and identically
distributed random variables. A consequence of these results is that scheduling
with partial state information yields a rate function significantly different
from scheduling with full channel information. In the special case when the
observable subsets are singleton flows, i.e., when there is effectively no a
priori channelstate information, MaxExp reduces to simply serving the flow
with the longest queue; thus, our results show that to always serve the longest
queue in the absence of any channelstate information is largedeviations
optimal.

Many models for sparse regression typically assume that the covariates are
known completely, and without noise. Particularly in highdimensional
applications, this is often not the case. This paper develops efficient
OMPlike algorithms to deal with precisely this setting. Our algorithms are as
efficient as OMP, and improve on the bestknown results for missing and noisy
data in regression, both in the highdimensional setting where we seek to
recover a sparse vector from only a few measurements, and in the classical
lowdimensional setting where we recover an unstructured regressor. In the
highdimensional setting, our supportrecovery algorithm requires no knowledge
of even the statistics of the noise. Along the way, we also obtain improved
performance guarantees for OMP for the standard sparse regression problem with
Gaussian noise.

We propose FrogWild, a novel algorithm for fast approximation of high
PageRank vertices, geared towards reducing network costs of running traditional
PageRank algorithms. Our algorithm can be seen as a quantized version of power
iteration that performs multiple parallel random walks over a directed graph.
One important innovation is that we introduce a modification to the GraphLab
framework that only partially synchronizes mirror vertices. This partial
synchronization vastly reduces the network traffic generated by traditional
PageRank algorithms, thus greatly reducing the periteration cost of PageRank.
On the other hand, this partial synchronization also creates dependencies
between the random walks used to estimate PageRank. Our main theoretical
innovation is the analysis of the correlations introduced by this partial
synchronization process and a bound establishing that our approximation is
close to the true PageRank vector.
We implement our algorithm in GraphLab and compare it against the default
PageRank implementation. We show that our algorithm is very fast, performing
each iteration in less than one second on the Twitter graph and can be up to 7x
faster compared to the standard GraphLab PageRank implementation.

We consider the mixed regression problem with two components, under
adversarial and stochastic noise. We give a convex optimization formulation
that provably recovers the true solution, and provide upper bounds on the
recovery errors for both arbitrary noise and stochastic noise settings. We also
give matching minimax lower bounds (up to log factors), showing that under
certain assumptions, our algorithm is informationtheoretically optimal. Our
results represent the first tractable algorithm guaranteeing successful
recovery with tight bounds on recovery errors and sample complexity.

Cellular network performance can significantly benefit from direct
devicetodevice (D2D) communication, but interference from cochannel D2D
communication limits the performance gain. In hybrid networks consisting of D2D
and cellular links, finding the optimal interference management is challenging.
In particular, we show that the problem of maximizing network throughput while
guaranteeing predefined service levels to cellular users is non convex and
hence intractable. Instead, we adopt a distributed approach that is
computationally extremely efficient, and requires minimal coordination,
communication and cooperation among the nodes. The key algorithmic idea is a
signaling mechanism that can be seen as a fictional pricing mechanism, that the
base stations optimize and transmit to the D2D users, who then play a best
response (i.e., selfishly) to this signal. Numerical results show that our
algorithms converge quickly, have low overhead, and achieve a significant
throughput gain, while maintaining the quality of cellular links at a
predefined service level.

We consider the problem of subspace clustering: given points that lie on or
near the union of many lowdimensional linear subspaces, recover the subspaces.
To this end, one first identifies sets of points close to the same subspace and
uses the sets to estimate the subspaces. As the geometric structure of the
clusters (linear subspaces) forbids proper performance of general distance
based approaches such as Kmeans, many modelspecific methods have been
proposed. In this paper, we provide new simple and efficient algorithms for
this problem. Our statistical analysis shows that the algorithms are guaranteed
exact (perfect) clustering performance under certain conditions on the number
of points and the affinity between subspaces. These conditions are weaker than
those considered in the standard statistical literature. Experimental results
on synthetic data generated from the standard unions of subspaces model
demonstrate our theory. We also show that our algorithm performs competitively
against stateoftheart algorithms on realworld applications such as motion
segmentation and face clustering, with much simpler implementation and lower
computational cost.

We develop a flexible and accurate framework for devicetodevice (D2D)
communication in the context of a conventional cellular network, which allows
for timefrequency resources to be either shared or orthogonally partitioned
between the two networks. Using stochastic geometry, we provide accurate
expressions for SINR distributions and average rates, under an assumption of
interference randomization via time and/or frequency hopping, for both
dedicated and shared spectrum approaches. We obtain analytical results in
closed or semiclosed form in high SNR regime, that allow us to easily explore
the impact of key parameters (e.g., the load and hopping probabilities) on the
network performance. In particular, unlike other models, the expressions we
obtain are tractable, i.e., they can be efficiently optimized without extensive
simulation. Using these, we optimize the hopping probabilities for the D2D
links, i.e., how often they should request a time or frequency slot. This can
be viewed as an optimized lower bound to other more sophisticated scheduling
schemes. We also investigate the optimal resource partitions between D2D and
cellular networks when they use orthogonal resources.

Mixed linear regression involves the recovery of two (or more) unknown
vectors from unlabeled linear measurements; that is, where each sample comes
from exactly one of the vectors, but we do not know which one. It is a classic
problem, and the natural and empirically most popular approach to its solution
has been the EM algorithm. As in other settings, this is prone to bad local
minima; however, each iteration is very fast (alternating between guessing
labels, and solving with those labels).
In this paper we provide a new initialization procedure for EM, based on
finding the leading two eigenvectors of an appropriate matrix. We then show
that with this, a resampled version of the EM algorithm provably converges to
the correct vectors, under natural assumptions on the sampling distribution,
and with nearly optimal (unimprovable) sample complexity. This provides not
only the first characterization of EM's performance, but also much lower sample
complexity as compared to both standard (randomly initialized) EM, and other
methods for this problem.

We consider the problem of detecting an epidemic in a population where
individual diagnoses are extremely noisy. The motivation for this problem is
the plethora of examples (influenza strains in humans, or computer viruses in
smartphones, etc.) where reliable diagnoses are scarce, but noisy data
plentiful. In flu/phoneviruses, exceedingly few infected people/phones are
professionally diagnosed (only a small fraction go to a doctor) but less
reliable secondary signatures (e.g., people staying home, or
greaterthantypical upload activity) are more readily available. These
secondary data are often plagued by unreliability: many people with the flu do
not stay home, and many people that stay home do not have the flu. This paper
identifies the precise regime where knowledge of the contact network enables
finding the needle in the haystack: we provide a distributed, efficient and
robust algorithm that can correctly identify the existence of a spreading
epidemic from highly unreliable local data. Our algorithm requires only
localneighbor knowledge of this graph, and in a broad array of settings that
we describe, succeeds even when false negatives and false positives make up an
overwhelming fraction of the data available. Our results show it succeeds in
the presence of partial information about the contact network, and also when
there is not a single "patient zero", but rather many (hundreds, in our
examples) of initial patientzeroes, spread across the graph.

Newly developed HTTPbased video streaming technologies enable flexible
rateadaptation under varying channel conditions. Accurately predicting the
users' Quality of Experience (QoE) for rateadaptive HTTP video streams is thus
critical to achieve efficiency. An important aspect of understanding and
modeling QoE is predicting the uptothemoment subjective quality of a video
as it is played, which is difficult due to hysteresis effects and
nonlinearities in human behavioral responses. This paper presents a
HammersteinWiener model for predicting the timevarying subjective quality
(TVSQ) of rateadaptive videos. To collect data for model parameterization and
validation, a database of longerduration videos with timevarying distortions
was built and the TVSQs of the videos were measured in a largescale subjective
study. The proposed method is able to reliably predict the TVSQ of rate
adaptive videos. Since the HammersteinWiener model has a very simple
structure, the proposed method is suitable for online TVSQ prediction in HTTP
based streaming.

Realtime video demands qualityofservice (QoS) guarantees such as delay
bounds for enduser satisfaction. Furthermore, the tolerable delay varies
depending on the use case such as live streaming or twoway video conferencing.
Due to the inherently stochastic nature of wireless fading channels,
deterministic delay bounds are difficult to guarantee. Instead, we propose
providing statistical delay guarantees using the concept of effective capacity.
We consider a multiuser setup whereby different users have (possibly different)
delay QoS constraints. We derive the resource allocation policy that maximizes
the sum video quality and applies to any quality metric with concave
ratequality mapping. We show that the optimal operating point per user is such
that the ratedistortion slope is the inverse of the supported video source
rate per unit bandwidth, a key metric we refer to as the source spectral
efficiency. We also solve the alternative problem of fairnessbased resource
allocation whereby the objective is to maximize the minimum video quality
across users. Finally, we derive user admission and scheduling policies that
enable selecting a maximal user subset such that all selected users can meet
their statistical delay requirement. Results show that video users with
differentiated QoS requirements can achieve similar video quality with vastly
different resource requirements. Thus, QoSaware scheduling and resource
allocation enable supporting significantly more users under the same resource
constraints.