
In this paper, we investigate community detection in networks in the presence
of node covariates. In many instances, covariates and networks individually
only give a partial view of the cluster structure. One needs to jointly infer
the full cluster structure by considering both. In statistics, an emerging body
of work has been focused on combining information from both the edges in the
network and the node covariates to infer community memberships. However, so far
the theoretical guarantees have been established in the dense regime, where the
network can lead to perfect clustering under a broad parameter regime, and
hence the role of covariates is often not clear. In this paper, we examine
sparse networks in conjunction with finite dimensional subgaussian mixtures as
covariates under moderate separation conditions. In this setting each
individual source can only cluster a nonvanishing fraction of nodes correctly.
We propose a simple optimization framework which provably improves clustering
accuracy when the two sources carry partial information about the cluster
memberships, and hence perform poorly on their own. Our optimization problem
can be solved using scalable convex optimization algorithms. Using a variety of
simulated and real data examples, we show that the proposed method outperforms
other existing methodology.

Community detection is a fundamental unsupervised learning problem for
unlabeled networks which has a broad range of applications. Many community
detection algorithms assume that the number of clusters $r$ is known apriori.
In this paper, we propose an approach based on semidefinite relaxations, which
does not require prior knowledge of model parameters like many existing convex
relaxation methods and recovers the number of clusters and the clustering
matrix exactly under a broad parameter regime, with probability tending to one.
On a variety of simulated and real data experiments, we show that the proposed
method often outperforms stateoftheart techniques for estimating the number
of clusters.

We consider the problem of estimating community memberships of nodes in a
network, where every node is associated with a vector determining its degree of
membership in each community. Existing provably consistent algorithms often
require strong assumptions about the population, are computationally expensive,
and only provide an overall error bound for the whole community membership
matrix. This paper provides uniform rates of convergence for the inferred
community membership vector of each node in a network generated from the Mixed
Membership Stochastic Blockmodel (MMSB); to our knowledge, this is the first
work to establish pernode rates for overlapping community detection in
networks. We achieve this by establishing sharp rowwise eigenvector deviation
bounds for MMSB. Based on the simplex structure inherent in the
eigendecomposition of the population matrix, we build on established
cornerfinding algorithms from the optimization community to infer the
community membership vectors. Our results hold over a broad parameter regime
where the average degree only grows polylogarithmically with the number of
nodes. Using experiments with simulated and real datasets, we show that our
method achieves better error with lower variability over competing methods, and
processes real world networks of up to 100,000 nodes within tens of seconds.

In this article, we advance divideandconquer strategies for solving the
community detection problem in networks. We propose two algorithms which
perform clustering on a number of small subgraphs and finally patches the
results into a single clustering. The main advantage of these algorithms is
that they bring down significantly the computational cost of traditional
algorithms, including spectral clustering, semidefinite programs, modularity
based methods, likelihood based methods etc., without losing on accuracy and
even improving accuracy at times. These algorithms are also, by nature,
parallelizable. Thus, exploiting the facts that most traditional algorithms are
accurate and the corresponding optimization problems are much simpler in small
problems, our divideandconquer methods provide an omnibus recipe for scaling
traditional algorithms up to large networks. We prove consistency of these
algorithms under various subgraph selection procedures and perform extensive
simulations and realdata analysis to understand the advantages of the
divideandconquer approach in various settings.

Chromosome conformation capture experiments such as HiC are used to map the
threedimensional spatial organization of genomes. One specific feature of the
3D organization is known as topologically associating domains (TADs), which are
densely interacting, contiguous chromatin regions playing important roles in
regulating gene expression. A few algorithms have been proposed to detect TADs.
In particular, the structure of HiC data naturally inspires application of
community detection methods. However, one of the drawbacks of community
detection is that most methods take exchangeability of the nodes in the network
for granted; whereas the nodes in this case, i.e. the positions on the
chromosomes, are not exchangeable. We propose a network model for detecting
TADs using HiC data that takes into account this nonexchangeability. In
addition, our model explicitly makes use of celltype specific CTCF binding
sites as biological covariates and can be used to identify conserved TADs
across multiple cell types. The model leads to a likelihood objective that can
be efficiently optimized via relaxation. We also prove that when suitably
initialized, this model finds the underlying TAD structure with high
probability. Using simulated data, we show the advantages of our method and the
caveats of popular community detection methods, such as spectral clustering, in
this application. Applying our method to real HiC data, we demonstrate the
domains identified have desirable epigenetic features and compare them across
different cell types.

The problem of finding overlapping communities in networks has gained much
attention recently. Optimizationbased approaches use nonnegative matrix
factorization (NMF) or variants, but the global optimum cannot be provably
attained in general. Modelbased approaches, such as the popular
mixedmembership stochastic blockmodel or MMSB (Airoldi et al., 2008), use
parameters for each node to specify the overlapping communities, but standard
inference techniques cannot guarantee consistency. We link the two approaches,
by (a) establishing sufficient conditions for the symmetric NMF optimization to
have a unique solution under MMSB, and (b) proposing a computationally
efficient algorithm called GeoNMF that is provably optimal and hence consistent
for a broad parameter regime. We demonstrate its accuracy on both simulated and
realworld datasets.

In this paper, we study convergence properties of the gradient
ExpectationMaximization algorithm \cite{lange1995gradient} for Gaussian
Mixture Models for general number of clusters and mixing coefficients. We
derive the convergence rate depending on the mixing coefficients, minimum and
maximum pairwise distances between the true centers and dimensionality and
number of components; and obtain a nearoptimal local contraction radius. While
there have been some recent notable works that derive local convergence rates
for EM in the two equal mixture symmetric GMM, in the more general case, the
derivations need structurally different and nontrivial arguments. We use
recent tools from learning theory and empirical processes to achieve our
theoretical results.

Clustering is one of the most important unsupervised problems in machine
learning and statistics. Among many existing algorithms, kernel kmeans has
drawn much research attention due to its ability to find nonlinear cluster
boundaries and its inherent simplicity. There are two main approaches for
kernel kmeans: SVD of the kernel matrix and convex relaxations. Despite the
attention kernel clustering has received both from theoretical and applied
quarters, not much is known about robustness of the methods. In this paper we
first introduce a semidefinite programming relaxation for the kernel clustering
problem, then prove that under a suitable model specification, both the KSVD
and SDP approaches are consistent in the limit, albeit SDP is strongly
consistent, i.e. achieves exact recovery, whereas KSVD is weakly consistent,
i.e. the fraction of misclassified nodes vanish.

Community detection, which focuses on clustering nodes or detecting
communities in (mostly) a single network, is a problem of considerable
practical interest and has received a great deal of attention in the research
community. While being able to cluster within a network is important, there are
emerging needs to be able to cluster multiple networks. This is largely
motivated by the routine collection of network data that are generated from
potentially different populations, such as brain networks of subjects from
different disease groups, genders, or biological networks generated under
different experimental conditions, etc. We propose a simple and general
framework for clustering multiple networks based on a mixture model on
graphons. Our clustering method employs graphon estimation as a first step and
performs spectral clustering on the matrix of distances between estimated
graphons. This is illustrated through both simulated and real data sets, and
theoretical justification of the algorithm is given in terms of consistency.

Spectral clustering is a technique that clusters elements using the top few
eigenvectors of their (possibly normalized) similarity matrix. The quality of
spectral clustering is closely tied to the convergence properties of these
principal eigenvectors. This rate of convergence has been shown to be identical
for both the normalized and unnormalized variants in recent random matrix
theory literature. However, normalization for spectral clustering is commonly
believed to be beneficial [Stat. Comput. 17 (2007) 395416]. Indeed, our
experiments show that normalization improves prediction accuracy. In this
paper, for the popular stochastic blockmodel, we theoretically show that
normalization shrinks the spread of points in a class by a constant fraction
under a broad parameter regime. As a byproduct of our work, we also obtain
sharp deviation bounds of empirical principal eigenvalues of graphs generated
from a stochastic blockmodel.

Crowdsourcing has become a popular means of acquiring labeled data for a
wide variety of tasks where humans are more accurate than computers, e.g.,
labeling images, matching objects, or analyzing sentiment. However, relying
solely on the crowd is often impractical even for data sets with thousands of
items, due to time and cost constraints of acquiring human input (which cost
pennies and minutes per label). In this paper, we propose algorithms for
integrating machine learning into crowdsourced databases, with the goal of
allowing crowdsourcing applications to scale, i.e., to handle larger datasets
at lower costs. The key observation is that, in many of the above tasks, humans
and machine learning algorithms can be complementary, as humans are often more
accurate but slow and expensive, while algorithms are usually less accurate,
but faster and cheaper.
Based on this observation, we present two new active learning algorithms to
combine humans and algorithms together in a crowdsourced database. Our
algorithms are based on the theory of nonparametric bootstrap, which makes our
results applicable to a broad class of machine learning models. Our results, on
three reallife datasets collected with Amazon's Mechanical Turk, and on 15
wellknown UCI data sets, show that our methods on average ask humans to label
one to two orders of magnitude fewer items to achieve the same accuracy as a
baseline that labels random images, and two to eight times fewer questions than
previous active learning schemes.

Community detection in networks is a key exploratory tool with applications
in a diverse set of areas, ranging from finding communities in social and
biological networks to identifying link farms in the World Wide Web. The
problem of finding communities or clusters in a network has received much
attention from statistics, physics and computer science. However, most
clustering algorithms assume knowledge of the number of clusters k. In this
paper we propose to automatically determine k in a graph generated from a
Stochastic Blockmodel. Our main contribution is twofold; first, we
theoretically establish the limiting distribution of the principal eigenvalue
of the suitably centered and scaled adjacency matrix, and use that distribution
for our hypothesis test. Secondly, we use this test to design a recursive
bipartitioning algorithm. Using quantifiable classification tasks on real world
networks with ground truth, we show that our algorithm outperforms existing
probabilistic models for learning overlapping clusters, and on unlabeled
networks, we show that we uncover nested community structure.

We propose a nonparametric approach to link prediction in largescale dynamic
networks. Our model uses graphbased features of pairs of nodes as well as
those of their local neighborhoods to predict whether those nodes will be
linked at each time step. The model allows for different types of evolution in
different parts of the graph (e.g, growing or shrinking communities). We focus
on largescale graphs and present an implementation of our model that makes use
of localitysensitive hashing to allow it to be scaled to large problems.
Experiments with simulated data as well as five realworld dynamic graphs show
that we outperform the state of the art, especially when sharp fluctuations or
nonlinearities are present. We also establish theoretical properties of our
estimator, in particular consistency and weak convergence, the latter making
use of an elaboration of Stein's method for dependency graphs.

The bootstrap provides a simple and powerful means of assessing the quality
of estimators. However, in settings involving large datasetswhich are
increasingly prevalentthe computation of bootstrapbased quantities can be
prohibitively demanding computationally. While variants such as subsampling and
the $m$ out of $n$ bootstrap can be used in principle to reduce the cost of
bootstrap computations, we find that these methods are generally not robust to
specification of hyperparameters (such as the number of subsampled data
points), and they often require use of more prior information (such as rates of
convergence of estimators) than the bootstrap. As an alternative, we introduce
the Bag of Little Bootstraps (BLB), a new procedure which incorporates features
of both the bootstrap and subsampling to yield a robust, computationally
efficient means of assessing the quality of estimators. BLB is well suited to
modern parallel and distributed computing architectures and furthermore retains
the generic applicability and statistical efficiency of the bootstrap. We
demonstrate BLB's favorable statistical performance via a theoretical analysis
elucidating the procedure's properties, as well as a simulation study comparing
BLB to the bootstrap, the $m$ out of $n$ bootstrap, and subsampling. In
addition, we present results from a largescale distributed implementation of
BLB demonstrating its computational superiority on massive data, a method for
adaptively selecting BLB's hyperparameters, an empirical study applying BLB to
several real datasets, and an extension of BLB to time series data.

We propose a nonparametric link prediction algorithm for a sequence of graph
snapshots over time. The model predicts links based on the features of its
endpoints, as well as those of the local neighborhood around the endpoints.
This allows for different types of neighborhoods in a graph, each with its own
dynamics (e.g, growing or shrinking communities). We prove the consistency of
our estimator, and give a fast implementation based on localitysensitive
hashing. Experiments with simulated as well as five realworld dynamic graphs
show that we outperform the state of the art, especially when sharp
fluctuations or nonlinearities are present.

The bootstrap provides a simple and powerful means of assessing the quality
of estimators. However, in settings involving large datasets, the computation
of bootstrapbased quantities can be prohibitively demanding. As an
alternative, we present the Bag of Little Bootstraps (BLB), a new procedure
which incorporates features of both the bootstrap and subsampling to obtain a
robust, computationally efficient means of assessing estimator quality. BLB is
well suited to modern parallel and distributed computing architectures and
retains the generic applicability, statistical efficiency, and favorable
theoretical properties of the bootstrap. We provide the results of an extensive
empirical and theoretical investigation of BLB's behavior, including a study of
its statistical correctness, its largescale implementation and performance,
selection of hyperparameters, and performance on real data.

Recently there has been much interest in graphbased learning, with
applications in collaborative filtering for recommender networks, link
prediction for social networks and fraud detection. These networks can consist
of millions of entities, and so it is very important to develop highly
efficient techniques. We are especially interested in accelerating random walk
approaches to compute some very interesting proximity measures of these kinds
of graphs. These measures have been shown to do well empirically (LibenNowell
& Kleinberg, 2003; Brand, 2005). We introduce a truncated variation on a
wellknown measure, namely commute times arising from random walks on graphs.
We present a very novel algorithm to compute all interesting pairs of
approximate nearest neighbors in truncated commute times, without computing it
between all pairs. We show results on both simulated and real graphs of size up
to 100; 000 entities, which indicate nearlinear scaling in computation time.

Hybrid human/computer systems promise to greatly expand the usefulness of
query processing by incorporating the crowd for data gathering and other tasks.
Such systems raise many database system implementation questions. Perhaps most
fundamental is that the closed world assumption underlying relational query
semantics does not hold in such systems. As a consequence the meaning of even
simple queries can be called into question. Furthermore query progress
monitoring becomes difficult due to nonuniformities in the arrival of
crowdsourced data and peculiarities of how people work in crowdsourcing
systems. To address these issues, we develop statistical tools that enable
users and systems developers to reason about tradeoffs between time/cost and
completeness. These tools can also help drive query execution and crowdsourcing
strategies. We evaluate our techniques using experiments on a popular
crowdsourcing platform.