
Distributed machine learning algorithms enable learning of models from
datasets that are distributed over a network without gathering the data at a
centralized location. While efficient distributed algorithms have been
developed under the assumption of faultless networks, failures that can render
these algorithms nonfunctional occur frequently in the real world. This paper
focuses on the problem of Byzantine failures, which are the hardest to
safeguard against in distributed algorithms. While Byzantine fault tolerance
has a rich history, existing work does not translate into efficient and
practical algorithms for highdimensional learning in fully distributed (also
known as decentralized) settings. In this paper, an algorithm termed
Byzantineresilient distributed coordinate descent (ByRDiE) is developed and
analyzed that enables distributed learning in the presence of Byzantine
failures. Theoretical analysis (convex settings) and numerical experiments
(convex and nonconvex settings) highlight its usefulness for highdimensional
distributed learning in the presence of Byzantine failures.

Motivated by machine learning applications in networks of sensors,
internetofthings (IoT) devices, and autonomous agents, we propose techniques
for distributed stochastic convex learning from highrate data streams. The
setup involves a network of nodeseach one of which has a stream of data
arriving at a constant ratethat solve a stochastic convex optimization
problem by collaborating with each other over ratelimited communication links.
To this end, we present and analyze two algorithmstermed distributed
stochastic approximation mirror descent (DSAMD) and accelerated distributed
stochastic approximation mirror descent (ADSAMD)that are based on two
stochastic variants of mirror descent and in which nodes collaborate via
approximate averaging of the local, noisy subgradients using distributed
consensus. Our main contributions are (i) bounds on the convergence rates of
DSAMD and ADSAMD in terms of the number of nodes, network topology, and ratio
of the data streaming and communication rates, and (ii) sufficient conditions
for orderoptimum convergence of these algorithms. In particular, we show that
for sufficiently wellconnected networks, distributed learning schemes can
obtain orderoptimum convergence even if the communications rate is small.
Further we find that the use of accelerated methods significantly enlarges the
regime in which orderoptimum convergence is achieved; this is in contrast to
the centralized setting, where accelerated methods usually offer only a modest
improvement. Finally, we demonstrate the effectiveness of the proposed
algorithms using numerical experiments.

This paper derives sufficient conditions for local recovery of coordinate
dictionaries comprising a Kroneckerstructured dictionary that is used for
representing $K$thorder tensor data. Tensor observations are assumed to be
generated from a Kroneckerstructured dictionary multiplied by sparse
coefficient tensors that follow the separable sparsity model. This work
provides sufficient conditions on the underlying coordinate dictionaries,
coefficient and noise distributions, and number of samples that guarantee
recovery of the individual coordinate dictionaries up to a specified error, as
a local minimum of the objective function, with high probability. In
particular, the sample complexity to recover $K$ coordinate dictionaries with
dimensions $m_k \times p_k$ up to estimation error $\varepsilon_k$ is shown to
be $\max_{k \in [K]}\mathcal{O}(m_kp_k^3\varepsilon_k^{2})$.

This paper provides fundamental limits on the sample complexity of estimating
dictionaries for tensor data. The specific focus of this work is on $K$thorder
tensor data and the case where the underlying dictionary can be expressed in
terms of $K$ smaller dictionaries. It is assumed the data are generated by
linear combinations of these structured dictionary atoms and observed through
white Gaussian noise. This work first provides a general lower bound on the
minimax risk of dictionary learning for such tensor data and then adapts the
proof techniques for specialized results in the case of sparse and
sparseGaussian linear combinations. The results suggest the sample complexity
of dictionary learning for tensor data can be significantly lower than that for
unstructured data: for unstructured data it scales linearly with the product of
the dictionary dimensions, whereas for tensorstructured data the bound scales
linearly with the sum of the product of the dimensions of the (smaller)
component dictionaries. A partial converse is provided for the case of
2ndorder tensor data to show that the bounds in this paper can be tight. This
involves developing an algorithm for learning highlystructured dictionaries
from noisy tensor data. Finally, numerical experiments highlight the advantages
associated with explicitly accounting for tensor data structure during
dictionary learning.

Statistical inference can be computationally prohibitive in
ultrahighdimensional linear models. Correlationbased variable screening, in
which one leverages marginal correlations for removal of irrelevant variables
from the model prior to statistical inference, can be used to overcome this
challenge. Prior works on correlationbased variable screening either impose
strong statistical priors on the linear model or assume specific postscreening
inference methods. This paper first extends the analysis of correlationbased
variable screening to arbitrary linear models and postscreening inference
techniques. In particular, ($i$) it shows that a conditiontermed the
screening conditionis sufficient for successful correlationbased screening
of linear models, and ($ii$) it provides insights into the dependence of
marginal correlationbased screening on different problem parameters. Numerical
experiments confirm that these insights are not mere artifacts of analysis;
rather, they are reflective of the challenges associated with marginal
correlationbased variable screening. Second, the paper explicitly derives the
screening condition for two families of linear models, namely, subGaussian
linear models and arbitrary (random or deterministic) linear models. In the
process, it establishes thatunder appropriate conditionsit is possible to
reduce the dimension of an ultrahighdimensional, arbitrary linear model to
almost the sample size even when the number of active variables scales almost
linearly with the sample size.

In this paper, we propose a general framework that transforms the problems of
designing sparse finiteimpulseresponse linear equalizers and nonlinear
decisionfeedback equalizers, for multiple antenna systems, into the problem of
sparsestapproximation of a vector in different dictionaries. In addition, we
investigate several choices of the sparsifying dictionaries under this
framework. Furthermore, the worstcase coherences of these dictionaries, which
determine their sparsifying effectiveness, are analytically and/or numerically
evaluated. Moreover, we show how to reduce the computational complexity of the
designed sparse equalizer filters by exploiting the asymptotic equivalence of
Toeplitz and circulant matrices. Finally, the superiority of our proposed
framework over conventional methods is demonstrated through numerical
experiments.

Representation of human actions as a sequence of human body movements or
action attributes enables the development of models for human activity
recognition and summarization. We present an extension of the lowrank
representation (LRR) model, termed the clusteringaware structureconstrained
lowrank representation (CSLRR) model, for unsupervised learning of human
action attributes from video data. Our model is based on the unionofsubspaces
(UoS) framework, and integrates spectral clustering into the LRR optimization
problem for better subspace clustering results. We lay out an efficient linear
alternating direction method to solve the CSLRR optimization problem. We also
introduce a hierarchical subspace clustering approach, termed hierarchical
CSLRR, to learn the attributes without the need for a priori specification of
their number. By visualizing and labeling these action attributes, the
hierarchical model can be used to semantically summarize long video sequences
of human actions at multiple resolutions. A human action or activity can also
be uniquely represented as a sequence of transitions from one action attribute
to another, which can then be used for human action recognition. We demonstrate
the effectiveness of the proposed model for semantic summarization and action
recognition through comprehensive experiments on five realworld human action
datasets.

Subspacebased signal processing traditionally focuses on problems involving
a few subspaces. Recently, a number of problems in different application areas
have emerged that involve a significantly larger number of subspaces relative
to the ambient dimension. It becomes imperative in such settings to first
identify a smaller set of active subspaces that contribute to the observation
before further processing can be carried out. This problem of identification of
a small set of active subspaces among a huge collection of subspaces from a
single (noisy) observation in the ambient space is termed subspace unmixing.
This paper formally poses the subspace unmixing problem under the parsimonious
subspacesum (PS3) model, discusses connections of the PS3 model to problems in
wireless communications, hyperspectral imaging, highdimensional statistics and
compressed sensing, and proposes a lowcomplexity algorithm, termed marginal
subspace detection (MSD), for subspace unmixing. The MSD algorithm turns the
subspace unmixing problem for the PS3 model into a multiple hypothesis testing
(MHT) problem and its analysis in the paper helps control the familywise error
rate of this MHT problem at any level $\alpha \in [0,1]$ under two random
signal generation models. Some other highlights of the analysis of the MSD
algorithm include: (i) it is applicable to an arbitrary collection of subspaces
on the Grassmann manifold; (ii) it relies on properties of the collection of
subspaces that are computable in polynomial time; and ($iii$) it allows for
linear scaling of the number of active subspaces as a function of the ambient
dimension. Finally, numerical results are presented in the paper to better
understand the performance of the MSD algorithm.

The linear model, in which a set of observations is assumed to be given by a
linear combination of columns of a matrix, has long been the mainstay of the
statistics and signal processing literature. One particular challenge for
inference under linear models is understanding the conditions on the dictionary
under which reliable inference is possible. This challenge has attracted
renewed attention in recent years since many modern inference problems deal
with the "underdetermined" setting, in which the number of observations is much
smaller than the number of columns in the dictionary. This paper makes several
contributions for this setting when the set of observations is given by a
linear combination of a small number of groups of columns of the dictionary,
termed the "blocksparse" case. First, it specifies conditions on the
dictionary under which most block subdictionaries are well conditioned. This
result is fundamentally different from prior work on blocksparse inference
because (i) it provides conditions that can be explicitly computed in
polynomial time, (ii) the given conditions translate into nearoptimal scaling
of the number of columns of the block subdictionaries as a function of the
number of observations for a large class of dictionaries, and (iii) it suggests
that the spectral norm and the quadraticmean block coherence of the dictionary
(rather than the worstcase coherences) fundamentally limit the scaling of
dimensions of the wellconditioned block subdictionaries. Second, this paper
investigates the problems of blocksparse recovery and blocksparse regression
in underdetermined settings. Nearoptimal blocksparse recovery and regression
are possible for certain dictionaries as long as the dictionary satisfies
easily computable conditions and the coefficients describing the linear
combination of groups of columns can be modeled through a mild statistical
prior.

Dictionary learning is the problem of estimating the collection of atomic
elements that provide a sparse representation of measured/collected signals or
data. This paper finds fundamental limits on the sample complexity of
estimating dictionaries for tensor data by proving a lower bound on the minimax
risk. This lower bound depends on the dimensions of the tensor and parameters
of the generative model. The focus of this paper is on secondorder tensor
data, with the underlying dictionaries constructed by taking the Kronecker
product of two smaller dictionaries and the observed data generated by sparse
linear combinations of dictionary atoms observed through white Gaussian noise.
In this regard, the paper provides a general lower bound on the minimax risk
and also adapts the proof techniques for equivalent results using sparse and
Gaussian coefficient models. The reported results suggest that the sample
complexity of dictionary learning for tensor data can be significantly lower
than that for unstructured data.

A major performance and complexity limitation in broadband communications is
the long channel delay spread which results in a highlyfrequencyselective
channel frequency response. Channel shortening equalizers (CSEs) are used to
ensure that the cascade of a long channel impulse response (CIR) and the CSE is
approximately equivalent to a target impulse response (TIR) with much shorter
delay spread. In this paper, we propose a general framework that transforms the
problems of design of sparse CSE and TIR finite impulse response (FIR) filters
into the problem of sparsestapproximation of a vector in different
dictionaries. In addition, we compare several choices of sparsifying
dictionaries under this framework. Furthermore, the worstcase coherence of
these dictionaries, which determines their sparsifying effectiveness, are
analytically and/or numerically evaluated. Finally, the usefulness of the
proposed framework for the design of sparse CSE and TIR filters is validated
through numerical experiments.

Complexity of linear finiteimpulseresponse (FIR) equalizers is proportional
to the square of the number of nonzero taps in the filter. This makes
equalization of channels with long impulse responses using either zeroforcing
or minimum mean square error (MMSE) filters computationally expensive. Sparse
equalization is a widelyused technique to solve this problem. In this paper, a
general framework is provided that transforms the problem of sparse linear
equalizers (LEs) design into the problem of sparsestapproximation of a vector
in different dictionaries. In addition, some possible choices of sparsifying
dictionaries in this framework are discussed. Furthermore, the worstcase
coherence of some of these dictionaries, which determines their sparsifying
strength, are analytically and/or numerically evaluated. Finally, the
usefulness of the proposed framework for the design of sparse FIR LEs is
validated through numerical experiments.

This paper studies the problem of dataadaptive representations for big,
distributed data. It is assumed that a number of geographicallydistributed,
interconnected sites have massive local data and they are interested in
collaboratively learning a lowdimensional geometric structure underlying these
data. In contrast to previous works on subspacebased data representations,
this paper focuses on the geometric structure of a union of subspaces (UoS). In
this regard, it proposes a distributed algorithmtermed cloud KSVDfor
collaborative learning of a UoS structure underlying distributed data of
interest. The goal of cloud KSVD is to learn a common overcomplete dictionary
at each individual site such that every sample in the distributed data can be
represented through a small number of atoms of the learned dictionary. Cloud
KSVD accomplishes this goal without requiring exchange of individual samples
between sites. This makes it suitable for applications where sharing of raw
data is discouraged due to either privacy concerns or large volumes of data.
This paper also provides an analysis of cloud KSVD that gives insights into
its properties as well as deviations of the dictionaries learned at individual
sites from a centralized solution in terms of different measures of
local/global data and topology of interconnections. Finally, the paper
numerically illustrates the efficacy of cloud KSVD on real and synthetic
distributed data.

Modern information processing relies on the axiom that highdimensional data
lie near lowdimensional geometric structures. This paper revisits the problem
of datadriven learning of these geometric structures and puts forth two new
nonlinear geometric models for data describing "related" objects/phenomena. The
first one of these models straddles the two extremes of the subspace model and
the unionofsubspaces model, and is termed the metricconstrained
unionofsubspaces (MCUoS) model. The second one of these modelssuited for
data drawn from a mixture of nonlinear manifoldsgeneralizes the kernel
subspace model, and is termed the metricconstrained kernel unionofsubspaces
(MCKUoS) model. The main contributions of this paper in this regard include
the following. First, it motivates and formalizes the problems of MCUoS and
MCKUoS learning. Second, it presents algorithms that efficiently learn an
MCUoS or an MCKUoS underlying data of interest. Third, it extends these
algorithms to the case when parts of the data are missing. Last, but not least,
it reports the outcomes of a series of numerical experiments involving both
synthetic and real data that demonstrate the superiority of the proposed
geometric models and learning algorithms over existing approaches in the
literature. These experiments also help clarify the connections between this
work and the literature on (subspace and kernel kmeans) clustering.

Linear, timevarying (LTV) systems composed of time shifts, frequency shifts,
and complex amplitude scalings are operators that act on continuous
finiteenergy waveforms. This paper presents a novel, resourceefficient method
for identifying the parametric description of such systems, i.e., the time
shifts, frequency shifts, and scalings, from the sampled response to linear
frequency modulated (LFM) waveforms, with emphasis on the application to radar
processing. If the LTV operator is probed with a sufficiently diverse set of
LFM waveforms, then the system can be identified with high accuracy. In the
case of noiseless measurements, the identification is perfect, while in the
case of noisy measurements, the accuracy is inversely proportional to the noise
level. The use of parametric estimation techniques with recently proposed
denoising algorithms allows the estimation of the parameters with high
accuracy.

In a typical MIMO radar scenario, transmit nodes transmit orthogonal
waveforms, while each receive node performs matched filtering with the known
set of transmit waveforms, and forwards the results to the fusion center. Based
on the data it receives from multiple antennas, the fusion center formulates a
matrix, which, in conjunction with standard array processing schemes, such as
MUSIC, leads to target detection and parameter estimation. In MIMO radars with
compressive sensing (MIMOCS), the data matrix is formulated by each receive
node forwarding a small number of compressively obtained samples. In this
paper, it is shown that under certain conditions, in both sampling cases, the
data matrix at the fusion center is lowrank, and thus can be recovered based
on knowledge of a small subset of its entries via matrix completion (MC)
techniques. Leveraging the lowrank property of that matrix, we propose a new
MIMO radar approach, termed, MIMOMC radar, in which each receive node either
performs matched filtering with a small number of randomly selected dictionary
waveforms or obtains subNyquist samples of the received signal at random
sampling instants, and forwards the results to a fusion center. Based on the
received samples, and with knowledge of the sampling scheme, the fusion center
partially fills the data matrix and subsequently applies MC techniques to
estimate the full matrix. MIMOMC radars share the advantages of the recently
proposed MIMOCS radars, i.e., high resolution with reduced amounts of data,
but unlike MIMOCS radars do not require grid discretization. The MIMOMC radar
concept is illustrated through a linear uniform array configuration, and its
target estimation performance is demonstrated via simulations.

This paper investigates the capacity of a channel in which information is
conveyed by the timing of consecutive packets passing through a queue with
independent and identically distributed service times. Such timing channels are
commonly studied under the assumption of a workconserving queue. In contrast,
this paper studies the case of a bufferless queue that drops arriving packets
while a packet is in service. Under this bufferless model, the paper provides
upper bounds on the capacity of timing channels and establishes achievable
rates for the case of bufferless M/M/1 and M/G/1 queues. In particular, it is
shown that a bufferless M/M/1 queue at worst suffers less than 10% reduction in
capacity when compared to an M/M/1 workconserving queue.

We investigate the asymptotic behavior of posterior distributions of
regression coefficients in highdimensional linear models as the number of
dimensions grows with the number of observations. We show that the posterior
distribution concentrates in neighborhoods of the true parameter under simple
sufficient conditions. These conditions hold under popular shrinkage priors
given some sparsity assumptions.

Estimation of the level set of a function (i.e., regions where the function
exceeds some value) is an important problem with applications in digital
elevation mapping, medical imaging, astronomy, etc. In many applications, the
function of interest is not observed directly. Rather, it is acquired through
(linear) projection measurements, such as tomographic projections,
interferometric measurements, codedaperture measurements, and random
projections associated with compressed sensing. This paper describes a new
methodology for rapid and accurate estimation of the level set from such
projection measurements. The key defining characteristic of the proposed
method, called the projective level set estimator, is its ability to estimate
the level set from projection measurements without an intermediate
reconstruction step. This leads to significantly faster computation relative to
heuristic "plugin" methods that first estimate the function, typically with an
iterative algorithm, and then threshold the result. The paper also includes a
rigorous theoretical analysis of the proposed method, which utilizes the recent
results from the nonasymptotic theory of random matrices results from the
literature on concentration of measure and characterizes the estimator's
performance in terms of geometry of the measurement operator and 1norm of the
discretized function.

We consider a colocated MIMO radar scenario, in which the receive antennas
forward their measurements to a fusion center. Based on the received data, the
fusion center formulates a matrix which is then used for target parameter
estimation. When the receive antennas sample the target returns at Nyquist
rate, and assuming that there are more receive antennas than targets, the data
matrix at the fusion center is lowrank. When each receive antenna sends to the
fusion center only a small number of samples, along with the sample index, the
receive data matrix has missing elements, corresponding to the samples that
were not forwarded. Under certain conditions, matrix completion techniques can
be applied to recover the full receive data matrix, which can then be used in
conjunction with array processing techniques, e.g., MUSIC, to obtain target
information. Numerical results indicate that good target recovery can be
achieved with occupancy of the receive data matrix as low as 50%.

In this paper, motivated by the setting of whitespace detection [1], we
present theoretical and empirical results for detection of the zerosupport E
of x \in Cp (xi = 0 for i \in E) with reduceddimension linear measurements. We
propose two low complexity algorithms based on onestep thresholding [2] for
this purpose. The second algorithm is a variant of the first that further
assumes the presence of groupstructure in the target signal [3] x. Performance
guarantees for both algorithms based on the worst case and average coherence
(group coherence) of the measurement matrix is presented along with the
empirical performance of the algorithms.

We carry out a comprehensive study of the resource cost of averaging
consensus in wireless networks. Most previous approaches suppose a graphical
network, which abstracts away crucial features of the wireless medium, and
measure resource consumption only in terms of the total number of transmissions
required to achieve consensus. Under a pathloss dominated model, we study the
resource requirements of consensus with respect to three wirelessappropriate
metrics: total transmit energy, elapsed time, and timebandwidth product. First
we characterize the performance of several popular gossip algorithms, showing
that they may be orderoptimal with respect to transmit energy but are strictly
suboptimal with respect to elapsed time and timebandwidth product. Further, we
propose a new consensus scheme, termed hierarchical averaging, and show that it
is nearly orderoptimal with respect to all three metrics. Finally, we examine
the effects of quantization, showing that hierarchical averaging provides a
nearly orderoptimal tradeoff between resource consumption and quantization
error.

This paper presents a significant modification to the Random Demodulator (RD)
of Tropp et al. for subNyquist sampling of frequencysparse signals. The
modification, termed constrained random demodulator, involves replacing the
random waveform, essential to the operation of the RD, with a constrained
random waveform that has limits on its switching rate because fast switching
waveforms may be hard to generate cleanly. The result is a relaxation on the
hardware requirements with a slight, but manageable, decrease in the recovery
guarantees. The paper also establishes the importance of properly choosing the
statistics of the constrained random waveform. If the power spectrum of the
random waveform matches the distribution on the tones of the input signal
(i.e., the distribution is proportional to the power spectrum), then recovery
of the input signal tones is improved. The theoretical guarantees provided in
the paper are validated through extensive numerical simulations and phase
transition plots.

Group model selection is the problem of determining a small subset of groups
of predictors (e.g., the expression data of genes) that are responsible for
majority of the variation in a response variable (e.g., the malignancy of a
tumor). This paper focuses on group model selection in highdimensional linear
models, in which the number of predictors far exceeds the number of samples of
the response variable. Existing works on highdimensional group model selection
either require the number of samples of the response variable to be
significantly larger than the total number of predictors contributing to the
response or impose restrictive statistical priors on the predictors and/or
nonzero regression coefficients. This paper provides comprehensive
understanding of a lowcomplexity approach to group model selection that avoids
some of these limitations. The proposed approach, termed Group Thresholding
(GroTh), is based on thresholding of marginal correlations of groups of
predictors with the response variable and is reminiscent of existing
thresholdingbased approaches in the literature. The most important
contribution of the paper in this regard is relating the performance of GroTh
to a polynomialtime verifiable property of the predictors for the general case
of arbitrary (random or deterministic) predictors and arbitrary nonzero
regression coefficients.

This paper investigates two parameters that measure the coherence of a frame:
worstcase and average coherence. We first use worstcase and average coherence
to derive nearoptimal probabilistic guarantees on both sparse signal detection
and reconstruction in the presence of noise. Next, we provide a catalog of
nearly tight frames with small worstcase and average coherence. Later, we find
a new lower bound on worstcase coherence; we compare it to the Welch bound and
use it to interpret recently reported signal reconstruction results. Finally,
we give an algorithm that transforms frames in a way that decreases average
coherence without changing the spectral norm or worstcase coherence.