
There is growing interest in estimating and analyzing heterogeneous treatment
effects in experimental and observational studies. We describe a number of
metaalgorithms that can take advantage of any supervised learning or
regression method in machine learning and statistics to estimate the
Conditional Average Treatment Effect (CATE) function. Metaalgorithms build on
base algorithmssuch as Random Forests (RF), Bayesian Additive Regression
Trees (BART) or neural networksto estimate the CATE, a function that the
base algorithms are not designed to estimate directly. We introduce a new
metaalgorithm, the Xlearner, that is provably efficient when the number of
units in one treatment group is much larger than in the other, and can exploit
structural properties of the CATE function. For example, if the CATE function
is linear and the response functions in treatment and control are Lipschitz
continuous, the Xlearner can still achieve the parametric rate under
regularity conditions. We then introduce versions of the Xlearner that use RF
and BART as base learners. In extensive simulation studies, the Xlearner
performs favorably, although none of the metalearners is uniformly the best.
In two persuasion field experiments from political science, we demonstrate how
our new Xlearner can be used to target treatment regimes and to shed light on
underlying mechanisms. A software package is provided that implements our
methods.

The driving force behind the recent success of LSTMs has been their ability
to learn complex and nonlinear relationships. Consequently, our inability to
describe these relationships has led to LSTMs being characterized as black
boxes. To this end, we introduce contextual decomposition (CD), an
interpretation algorithm for analysing individual predictions made by standard
LSTMs, without any changes to the underlying model. By decomposing the output
of a LSTM, CD captures the contributions of combinations of words or variables
to the final prediction of an LSTM. On the task of sentiment analysis with the
Yelp and SST data sets, we show that CD is able to reliably identify words and
phrases of contrasting sentiment, and how they are combined to yield the LSTM's
final prediction. Using the phraselevel labels in SST, we also demonstrate
that CD is able to successfully extract positive and negative negations from an
LSTM, something which has not previously been done.

The overall performance or expected excess risk of an iterative machine
learning algorithm can be decomposed into training error and generalization
error. While the former is controlled by its convergence analysis, the latter
can be tightly handled by algorithmic stability. The machine learning community
has a rich history investigating convergence and stability separately. However,
the question about the tradeoff between these two quantities remains open. In
this paper, we show that for any iterative algorithm at any iteration, the
overall performance is lower bounded by the minimax statistical error over an
appropriately chosen loss function class. This implies an important tradeoff
between convergence and stability of the algorithm  a faster converging
algorithm has to be less stable, and vice versa. As a direct consequence of
this fundamental tradeoff, new convergence lower bounds can be derived for
classes of algorithms constrained with different stability bounds. In
particular, when the loss function is convex (or strongly convex) and smooth,
we discuss the stability upper bounds of gradient descent (GD) and stochastic
gradient descent and their variants with decreasing step sizes. For Nesterov's
accelerated gradient descent (NAG) and heavy ball method (HB), we provide
stability upper bounds for the quadratic loss function. Applying existing
stability upper bounds for the gradient methods in our tradeoff framework, we
obtain lower bounds matching the wellestablished convergence upper bounds up
to constants for these algorithms and conjecture similar lower bounds for NAG
and HB. Finally, we numerically demonstrate the tightness of our stability
bounds in terms of exponents in the rate and also illustrate via a simulated
logistic regression problem that our stability bounds reflect the
generalization errors better than the simple uniform convergence bounds for GD
and NAG.

Two aspects of improvements are proposed for the OpenCLbased implementation
of the social field pedestrian model. In the aspect of algorithm, a method
based on the idea of divideandconquer is devised in order to overcome the
problem of global memory depletion when fields are of a larger size. This is of
importance for the study of finer pedestrian walking behavior, which usually
requires larger fields. In the aspect of computation, the OpenCL heterogeneous
framework is thoroughly studied. Factors that may affect the numerical
efficiency are evaluated, with regarding to the social field model previously
proposed. This includes usage of local memory, deliberate patch of data
structures for avoidance of bank conflicts, and so on. Numerical experiments
disclose that the numerical efficiency is brought to an even higher level.
Compared to the CPU model and the previous GPU model, the current GPU model can
be at most 71.56 and 13.3 times faster respectively so that it is more
qualified to be a core engine for analysis of superlarge scale crowd.

Nonorthogonal multiple access (NoMA) as an efficient way of radio resource
sharing can root back to the network information theory. For generations of
wireless communication systems design, orthogonal multiple access (OMA) schemes
in time, frequency, or code domain have been the main choices due to the
limited processing capability in the transceiver hardware, as well as the
modest traffic demands in both latency and connectivity. However, for the next
generation radio systems, given its vision to connect everything and the much
evolved hardware capability, NoMA has been identified as a promising technology
to help achieve all the targets in system capacity, user connectivity, and
service latency. This article will provide a systematic overview of the
stateoftheart design of the NoMA transmission based on a unified transceiver
design framework, the related standardization progress, and some promising use
cases in future cellular networks, based on which the interested researchers
can get a quick start in this area.

A linear program with linear complementarity constraints (LPCC) requires the
minimization of a linear objective over a set of linear constraints together
with additional linear complementarity constraints. This class has emerged as a
modeling paradigm for a broad collection of problems, including bilevel
programs, Stackelberg games, inverse quadratic programs, and problems involving
equilibrium constraints. The presence of the complementarity constraints
results in a nonconvex optimization problem. We develop a branchandcut
algorithm to find a global optimum for this class of optimization problems,
where we branch directly on complementarities. We develop branching rules and
feasibility recovery procedures and demonstrate their computational
effectiveness in a comparison with CPLEX. The implementation builds on CPLEX
through the use of callback routines. The computational results show that our
approach is a strong alternative to constructing an integer programming
formulation using big$M$ terms to represent bounds for variables, with testing
conducted on general LPCCs as well as on instances generated from bilevel
programs with convex quadratic lower level problems.

We consider the problem of sampling from a strongly logconcave density in
$\mathbb{R}^d$, and prove a nonasymptotic upper bound on the mixing time of
the Metropolisadjusted Langevin algorithm (MALA). The method draws samples by
running a Markov chain obtained from the discretization of an appropriate
Langevin diffusion, combined with an acceptreject step to ensure the correct
stationary distribution. Relative to known guarantees for the unadjusted
Langevin algorithm (ULA), our bounds show that the use of an acceptreject step
in MALA leads to an exponentially improved dependence on the errortolerance.
Concretely, in order to obtain samples with TV error at most $\delta$ for a
density with condition number $\kappa$, we show that MALA requires $\mathcal{O}
\big(\kappa d \log(1/\delta) \big)$ steps, as compared to the $\mathcal{O}
\big(\kappa^2 d/\delta^2 \big)$ steps established in past work on ULA. We also
demonstrate the gains of MALA over ULA for weakly logconcave densities.
Furthermore, we derive mixing time bounds for a zerothorder method
Metropolized random walk (MRW) and show that it mixes $\mathcal{O}(\kappa d)$
slower than MALA. We provide numerical examples that support our theoretical
findings, and demonstrate the potential gains of MetropolisHastings adjustment
for Langevintype algorithms.

Genomics has revolutionized biology, enabling the interrogation of whole
transcriptomes, genomewide binding sites for proteins, and many other
molecular processes. However, individual genomic assays measure elements that
interact in vivo as components of larger molecular machines. Understanding how
these highorder interactions drive gene expression presents a substantial
statistical challenge. Building on Random Forests (RF), Random Intersection
Trees (RITs), and through extensive, biologically inspired simulations, we
developed the iterative Random Forest algorithm (iRF). iRF trains a
featureweighted ensemble of decision trees to detect stable, highorder
interactions with same order of computational cost as RF. We demonstrate the
utility of iRF for highorder interaction discovery in two prediction problems:
enhancer activity in the early Drosophila embryo and alternative splicing of
primary transcripts in human derived cell lines. In Drosophila, among the 20
pairwise transcription factor interactions iRF identifies as stable (returned
in more than half of bootstrap replicates), 80% have been previously reported
as physical interactions. Moreover, novel thirdorder interactions, e.g.
between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest highorder
relationships that are candidates for followup experiments. In humanderived
cells, iRF rediscovered a central role of H3K36me3 in chromatinmediated
splicing regulation, and identified novel 5th and 6th order interactions,
indicative of multivalent nucleosomes with specific roles in splicing
regulation. By decoupling the order of interactions from the computational cost
of identification, iRF opens new avenues of inquiry into the molecular
mechanisms underlying genome biology.

Recent research in Aerosol Optical Depth (AOD) retrieval algorithms for
Multiangle Imaging SpectroRadiometer (MISR) proposed a hierarchical Bayesian
model. However the inference algorithm used in their work was Markov Chain
Monte Carlo (MCMC), which was reported prohibitively slow. The poor speed of
MCMC dramatically limited the production feasibility of the Bayesian framework
if large scale (e.g. global scale) of aerosol retrieval is desired.
In this paper, we present an alternative optimization method to mitigate the
speed problem. In particular we adopt Maximize a Posteriori (MAP) approach, and
apply a gradientfree "hillclimbing" algorithm: the coordinatewise
stochasticsearch. Our method has shown to be much (about 100 times) faster
than MCMC, easier to converge, and insensitive to hyper parameters. To further
scale our approach, we parallelized our method using Apache Spark, which
achieves linear speedup w.r.t number of CPU cores up to 16. Due to these
efforts, we are able to retrieve AOD at much finer resolution (1.1km) with a
tiny fraction of time consumption compared with existing methods.
During our research, we find that in low AOD levels, the Bayesian network
tends to produce overestimated retrievals. We also find that high absorbing
aerosol types are retrieved at the same time. This is likely caused by the
Dirichlet prior for aerosol types, as it is shown to encourage selecting
absorbing types in practice. After changing Dirichlet to uniform, the AOD
retrievals show excellent agreement with ground measurement in all levels.

Deep convolutional neural networks (CNNs) have been successful in many tasks
in machine vision, however, millions of weights in the form of thousands of
convolutional filters in CNNs makes them difficult for human intepretation or
understanding in science. In this article, we introduce CAR, a greedy
structural compression scheme to obtain smaller and more interpretable CNNs,
while achieving close to original accuracy. The compression is based on pruning
filters with the least contribution to the classification accuracy. We
demonstrate the interpretability of CARcompressed CNNs by showing that our
algorithm prunes filters with visually redundant functionalities such as color
filters. These compressed networks are easier to interpret because they retain
the filter diversity of uncompressed networks with order of magnitude less
filters. Finally, a variant of CAR is introduced to quantify the importance of
each image category to each CNN filter. Specifically, the most and the least
important class labels are shown to be meaningful interpretations of each
filter.

This paper is devoted to discussing affine Hirsch foliations on
$3$manifolds. First, we prove that up to isotopic leafconjugacy, every closed
orientable $3$manifold $M$ admits $0$, $1$ or $2$ affine Hirsch foliations.
Furthermore, every case is possible.
Then, we analyze the $3$manifolds admitting two affine Hirsch foliations
(abbreviated as Hirsch manifolds). On the one hand, we construct Hirsch
manifolds by using exchangeable braided links (abbreviated as DEBL Hirsch
manifolds); on the other hand, we show that every Hirsch manifold virtually is
a DEBL Hirsch manifold.
Finally, we show that for every $n\in \mathbb{N}$, there are only finitely
many Hirsch manifolds with strand number $n$. Here the strand number of a
Hirsch manifold $M$ is a positive integer defined by using strand numbers of
braids.

The technological advancements of the modern era have enabled the collection
of huge amounts of data in science and beyond. Extracting useful information
from such massive datasets is an ongoing challenge as traditional data
visualization tools typically do not scale well in highdimensional settings.
An existing visualization technique that is particularly well suited to
visualizing large datasets is the heatmap. Although heatmaps are extremely
popular in fields such as bioinformatics for visualizing large gene expression
datasets, they remain a severely underutilized visualization tool in modern
data analysis. In this paper we introduce superheat, a new R package that
provides an extremely flexible and customizable platform for visualizing large
datasets using extendable heatmaps. Superheat enhances the traditional heatmap
by providing a platform to visualize a wide range of data types simultaneously,
adding to the heatmap a response variable as a scatterplot, model results as
boxplots, correlation information as barplots, text information, and more.
Superheat allows the user to explore their data to greater depths and to take
advantage of the heterogeneity present in the data to inform analysis
decisions. The goal of this paper is twofold: (1) to demonstrate the potential
of the heatmap as a default visualization method for a wide range of data types
using reproducible examples, and (2) to highlight the customizability and ease
of implementation of the superheat package in R for creating beautiful and
extendable heatmaps. The capabilities and fundamental applicability of the
superheat package will be explored via three case studies, each based on
publicly available data sources and accompanied by a file outlining the
stepbystep analytic pipeline (with code).

Several theories of early sensory processing suggest that it whitens sensory
stimuli. Here, we test three key predictions of the whitening theory using
recordings from 152 ganglion cells in salamander retina responding to natural
movies. We confirm the previous finding that firing rates of ganglion cells are
less correlated compared to natural scenes, although significant correlations
remain. We show that while the power spectrum of ganglion cells decays less
steeply than that of natural scenes, it is not completely flattened. Finally,
we find evidence that only the top principal components of the visual stimulus
are transmitted.

The sizes of Markov equivalence classes of directed acyclic graphs play
important roles in measuring the uncertainty and complexity in causal learning.
A Markov equivalence class can be represented by an essential graph and its
undirected subgraphs determine the size of the class. In this paper, we develop
a method to derive the formulas for counting the sizes of Markov equivalence
classes. We first introduce a new concept of core graph. The size of a Markov
equivalence class of interest is a polynomial of the number of vertices given
its core graph. Then, we discuss the recursive and explicit formula of the
polynomial, and provide an algorithm to derive the size formula via symbolic
computation for any given core graph. The proposed size formula derivation
sheds light on the relationships between the size of a Markov equivalence class
and its representation graph, and makes size counting efficient, even when the
essential graphs contain nonsparse undirected subgraphs.

We study the theoretical properties of learning a dictionary from $N$ signals
$\mathbf x_i\in \mathbb R^K$ for $i=1,...,N$ via $l_1$minimization. We assume
that $\mathbf x_i$'s are $i.i.d.$ random linear combinations of the $K$ columns
from a complete (i.e., square and invertible) reference dictionary $\mathbf D_0
\in \mathbb R^{K\times K}$. Here, the random linear coefficients are generated
from either the $s$sparse Gaussian model or the BernoulliGaussian model.
First, for the population case, we establish a sufficient and almost necessary
condition for the reference dictionary $\mathbf D_0$ to be locally
identifiable, i.e., a local minimum of the expected $l_1$norm objective
function. Our condition covers both sparse and dense cases of the random linear
coefficients and significantly improves the sufficient condition by Gribonval
and Schnass (2010). In addition, we show that for a complete $\mu$coherent
reference dictionary, i.e., a dictionary with absolute pairwise column
innerproduct at most $\mu\in[0,1)$, local identifiability holds even when the
random linear coefficient vector has up to $O(\mu^{2})$ nonzeros on average.
Moreover, our local identifiability results also translate to the finite sample
case with high probability provided that the number of signals $N$ scales as
$O(K\log K)$.

We provide a principled way for investigators to analyze randomized
experiments when the number of covariates is large. Investigators often use
linear multivariate regression to analyze randomized experiments instead of
simply reporting the difference of means between treatment and control groups.
Their aim is to reduce the variance of the estimated treatment effect by
adjusting for covariates. If there are a large number of covariates relative to
the number of observations, regression may perform poorly because of
overfitting. In such cases, the Lasso may be helpful. We study the resulting
Lassobased treatment effect estimator under the NeymanRubin model of
randomized experiments. We present theoretical conditions that guarantee that
the estimator is more efficient than the simple differenceofmeans estimator,
and we provide a conservative estimator of the asymptotic variance, which can
yield tighter confidence intervals than the differenceofmeans estimator.
Simulation and data examples show that Lassobased adjustment can be
advantageous even when the number of covariates is less than the number of
observations. Specifically, a variant using Lasso for selection and OLS for
estimation performs particularly well, and it chooses a smoothing parameter
based on combined performance of Lasso and OLS.

A significant hurdle for analyzing large sample data is the lack of effective
statistical computing and inference methods. An emerging powerful approach for
analyzing large sample data is subsampling, by which one takes a random
subsample from the original full sample and uses it as a surrogate for
subsequent computation and estimation. In this paper, we study subsampling
methods under two scenarios: approximating the full sample ordinary
leastsquare (OLS) estimator and estimating the coefficients in linear
regression. We present two algorithms, weighted estimation algorithm and
unweighted estimation algorithm, and analyze asymptotic behaviors of their
resulting subsample estimators under general conditions. For the weighted
estimation algorithm, we propose a criterion for selecting the optimal sampling
probability by making use of the asymptotic results. On the basis of the
criterion, we provide two novel subsampling methods, the optimal subsampling
and the predictor length subsampling methods. The predictorlength subsampling
method is based on the L2 norm of predictors rather than leverage scores. Its
computational cost is scalable. For unweighted estimation algorithm, we show
that its resulting subsample estimator is not consistent to the full sample OLS
estimator. However, it has better performance than the weighted estimation
algorithm for estimating the coefficients. Simulation studies and a real data
example are used to demonstrate the effectiveness of our proposed subsampling
methods.

Crossvalidation (CV) is often used to select the regularization parameter in
high dimensional problems. However, when applied to the sparse modeling method
Lasso, CV leads to models that are unstable in highdimensions, and
consequently not suited for reliable interpretation. In this paper, we propose
a modelfree criterion ESCV based on a new estimation stability (ES) metric and
CV. Our proposed ESCV finds a locally ESoptimal model smaller than the CV
choice so that the it fits the data and also enjoys estimation stability
property. We demonstrate that ESCV is an effective alternative to CV at a
similar easily parallelizable computational cost. In particular, we compare the
two approaches with respect to several performance measures when applied to the
Lasso on both simulated and real data sets. For dependent predictors common in
practice, our main finding is that, ESCV cuts down false positive rates often
by a large margin, while sacrificing little of true positive rates. ESCV
usually outperforms CV in terms of parameter estimation while giving similar
performance as CV in terms of prediction. For the two real data sets from
neuroscience and cell biology, the models found by ESCV are less than half of
the model sizes by CV. Judged based on subject knowledge, they are more
plausible than those by CV as well. We also discuss some regularization
parameter alignment issues that come up in both approaches.

Filter bank multicarrier with offset quadrature amplitude modulation
(FBMC/OQAM) has been heavily studied as an alternative waveform for 5G systems.
Its advantages of higher spectrum efficiency, localized frequency response and
insensitivity to synchronization errors may enable promising performance when
orthogonal frequency division multiplexing (OFDM) fails. However, performing
channel estimation under the intrinsic interference has been a fundamental
obstacle towards adopting FBMC/OQMA in a practical system. Several schemes are
available but the performance is far from satisfaction. In this paper, we will
show the existing methods are trapped by the paradigm that a clean pilot is
mandatory so as to explicitly carry a reference symbol to the receiver for the
purpose of channel estimation. By breaking this paradigm, a novel dual
dependent pilot scheme is proposed, which gives up the independent pilot and
derives dual pilots from the imposed interference. By doing this, the
interference between pilots can be fully utilized. Consequentially, the new
scheme significantly outperforms existing solutions and the simulation results
show FBMC/OQAM can achieve closetoOFDM performance in a practical system even
with the presence of strong intrinsic interference.

Given a (transitive or nontransitive) Anosov vector field $X$ on a closed
threedimensional manifold $M$, one may try to decompose $(M,X)$ by cutting $M$
along twotori transverse to $X$. We prove that one can find a finite
collection $\{T_1,\dots,T_n\}$ of pairwise disjoint, pairwise nonparallel
incompressible tori transverse to $X$, such that the maximal invariant sets
$\Lambda_1,\dots,\Lambda_m$ of the connected components $V_1,\dots,V_m$ of
$M(T_1\cup\dots\cup T_n)$ satisfy the following properties: 1, each
$\Lambda_i$ is a compact invariant locally maximal transitive set for $X$, 2,
the collection $\{\Lambda_1,\dots,\Lambda_m\}$ is canonically attached to the
pair $(M,X)$ (i.e., it can be defined independently of the collection of tori
$\{T_1,\dots,T_n\}$), 3, the $\Lambda_i$'s are the smallest possible: for
every (possibly infinite) collection $\{S_i\}_{i\in I}$ of tori transverse to
$X$, the $\Lambda_i$'s are contained in the maximal invariant set of $M\cup_i
S_i$. To a certain extent, the sets $\Lambda_1,\dots,\Lambda_m$ are analogs
(for Anosov vector field in dimension 3) of the basic pieces which appear in
the spectral decomposition of a nontransitive axiom A vector field. Then we
discuss the uniqueness of such a decomposition: we prove that the pieces of the
decomposition $V_1,\dots,V_m$, equipped with the restriction of the Anosov
vector field $X$, are "almost unique up to topological equivalence".

Clustering of data sets is a standard problem in many areas of science and
engineering. The method of spectral clustering is based on embedding the data
set using a kernel function, and using the top eigenvectors of the normalized
Laplacian to recover the connected components. We study the performance of
spectral clustering in recovering the latent labels of i.i.d. samples from a
finite mixture of nonparametric distributions. The difficulty of this label
recovery problem depends on the overlap between mixture components and how
easily a mixture component is divided into two nonoverlapping components. When
the overlap is small compared to the indivisibility of the mixture components,
the principal eigenspace of the populationlevel normalized Laplacian operator
is approximately spanned by the squareroot kernelized component densities. In
the finite sample setting, and under the same assumption, embedded samples from
different components are approximately orthogonal with high probability when
the sample size is large. As a corollary we control the fraction of samples
mislabeled by spectral clustering under finite mixtures with nonparametric
components.

Directed graphs have asymmetric connections, yet the current graph clustering
methodologies cannot identify the potentially global structure of these
asymmetries. We give a spectral algorithm called disim that builds on a dual
measure of similarity that correspond to how a node (i) sends and (ii) receives
edges. Using disim, we analyze the global asymmetries in the networks of Enron
emails, political blogs, and the c elegans neural connectome. In each example,
a small subset of nodes have persistent asymmetries; these nodes send edges
with one cluster, but receive edges with another cluster. Previous approaches
would have assigned these asymmetric nodes to only one cluster, failing to
identify their sending/receiving asymmetries. Regularization and "projection"
are two steps of disim that are essential for spectral clustering algorithms
to work in practice. The theoretical results show that these steps make the
algorithm weakly consistent under the degree corrected Stochastic
coBlockmodel, a model that generalizes the Stochastic Blockmodel to allow for
both (i) degree heterogeneity and (ii) the global asymmetries that we intend to
detect. The theoretical results make no assumptions on the smallest degree
nodes. Instead, the theorem requires that the average degree grows sufficiently
fast and that the weak consistency only applies to the subset of the nodes with
sufficiently large leverage scores. The results results also apply to bipartite
graphs.

Crowdsourcing has become an effective and popular tool for humanpowered
computation to label large datasets. Since the workers can be unreliable, it is
common in crowdsourcing to assign multiple workers to one task, and to
aggregate the labels in order to obtain results of high quality. In this paper,
we provide finitesample exponential bounds on the error rate (in probability
and in expectation) of general aggregation rules under the DawidSkene
crowdsourcing model. The bounds are derived for multiclass labeling, and can
be used to analyze many aggregation methods, including majority voting,
weighted majority voting and the oracle Maximum A Posteriori (MAP) rule. We
show that the oracle MAP rule approximately optimizes our upper bound on the
mean error rate of weighted majority voting in certain setting. We propose an
iterative weighted majority voting (IWMV) method that optimizes the error rate
bound and approximates the oracle MAP rule. Its one step version has a provable
theoretical guarantee on the error rate. The IWMV method is intuitive and
computationally simple. Experimental results on simulated and real data show
that IWMV performs at least on par with the stateoftheart methods, and it
has a much lower computational cost (around one hundred times faster) than the
stateoftheart methods.

We develop a general framework for proving rigorous guarantees on the
performance of the EM algorithm and a variant known as gradient EM. Our
analysis is divided into two parts: a treatment of these algorithms at the
population level (in the limit of infinite data), followed by results that
apply to updates based on a finite set of samples. First, we characterize the
domain of attraction of any global maximizer of the population likelihood. This
characterization is based on a novel view of the EM updates as a perturbed form
of likelihood ascent, or in parallel, of the gradient EM updates as a perturbed
form of standard gradient ascent. Leveraging this characterization, we then
provide nonasymptotic guarantees on the EM and gradient EM algorithms when
applied to a finite set of samples. We develop consequences of our general
theory for three canonical examples of incompletedata problems: mixture of
Gaussians, mixture of regressions, and linear regression with covariates
missing completely at random. In each case, our theory guarantees that with a
suitable initialization, a relatively small number of EM (or gradient EM) steps
will yield (with high probability) an estimate that is within statistical error
of the MLE. We provide simulations to confirm this theoretically predicted
behavior.

The performance of spectral clustering can be considerably improved via
regularization, as demonstrated empirically in Amini et. al (2012). Here, we
provide an attempt at quantifying this improvement through theoretical
analysis. Under the stochastic block model (SBM), and its extensions, previous
results on spectral clustering relied on the minimum degree of the graph being
sufficiently large for its good performance. By examining the scenario where
the regularization parameter $\tau$ is large we show that the minimum degree
assumption can potentially be removed. As a special case, for an SBM with two
blocks, the results require the maximum degree to be large (grow faster than
$\log n$) as opposed to the minimum degree.
More importantly, we show the usefulness of regularization in situations
where not all nodes belong to welldefined clusters. Our results rely on a
`biasvariance'like tradeoff that arises from understanding the concentration
of the sample Laplacian and the eigen gap as a function of the regularization
parameter. As a byproduct of our bounds, we propose a datadriven technique
\textit{DKest} (standing for estimated DavisKahan bounds) for choosing the
regularization parameter. This technique is shown to work well through
simulations and on a real data set.