
While many statistical models and methods are now available for network
analysis, resampling network data remains a challenging problem.
Crossvalidation is a useful general tool for model selection and parameter
tuning, but is not directly applicable to networks since splitting network
nodes into groups requires deleting edges and destroys some of the network
structure. Here we propose a new network resampling strategy based on splitting
node pairs rather than nodes applicable to crossvalidation for a wide range of
network model selection tasks. We provide a theoretical justification for our
method in a general setting and examples of how our method can be used in
specific network model selection and parameter tuning tasks. Numerical results
on simulated networks and on a citation network of statisticians show that this
crossvalidation approach works well for model selection.

The stochastic block model is widely used for detecting community structures
in network data. How to test the goodnessoffit of the model is one of the
fundamental problems and has gained growing interests in recent years. In this
article, we propose a novel goodnessoffit test based on the maximum entry of
the centered and rescaled adjacency matrix for the stochastic block model. One
noticeable advantage of the proposed test is that the number of communities can
be allowed to grow linearly with the number of nodes ignoring a logarithmic
factor. We prove that the null distribution of the test statistic converges in
distribution to a Gumbel distribution, and we show that both the number of
communities and the membership vector can be tested via the proposed method.
Further, we show that the proposed test has asymptotic power guarantee against
a class of alternatives. We also demonstrate that the proposed method can be
extended to the degreecorrected stochastic block model. Both simulation
studies and realworld data examples indicate that the proposed method works
well.

Prediction algorithms typically assume the training data are independent
samples, but in many modern applications samples come from individuals
connected by a network. For example, in adolescent health studies of
risktaking behaviors, information on the subjects' social network is often
available and plays an important role through network cohesion, the empirically
observed phenomenon of friends behaving similarly. Taking cohesion into account
in prediction models should allow us to improve their performance. Here we
propose a networkbased penalty on individual node effects to encourage
similarity between predictions for linked nodes, and show that incorporating it
into prediction leads to improvement over traditional models both theoretically
and empirically when network cohesion is present. The penalty can be used with
many lossbased prediction methods, such as regression, generalized linear
models, and Cox's proportional hazard model. Applications to predicting levels
of recreational activity and marijuana usage among teenagers from the AddHealth
study based on both demographic covariates and friendship networks are
discussed in detail and show that our approach to taking friendships into
account can significantly improve predictions of behavior while providing
interpretable estimates of covariate effects.

Link prediction in networks is typically accomplished by estimating or
ranking the probabilities of edges for all pairs of nodes. In practice,
especially for social networks, the data are often collected by egocentric
sampling, which means selecting a subset of nodes and recording all of their
edges. This sampling mechanism requires different prediction tools than the
typical assumption of links missing at random. We propose a new computationally
efficient link prediction algorithm for egocentrically sampled networks, which
estimates the underlying probability matrix by estimating its row space. For
networks created by sampling rows, our method outperforms many popular link
prediction and graphon estimation techniques.

Estimating the number of principal components is one of the fundamental
problems in many scientific fields such as signal processing (or the spiked
covariance model). In this paper, for fixed $p$, we select the number of
signals $k$ by the iterated logarithm penalty (ILP) and demonstrate that the
ILP is consistent. Moreover, we demonstrate that any penalty term of the form
$k'(pk'/2+1/2)C_n$ may lead to an asymptotically consistent estimator under
the condition that $C_n\to\infty$ and $C_n/n\to0$. We also extend our results
to the case $n,p\to\infty$, with $p/n\to c>0$. In this case, for
$k=o(n^{\frac{1}{3}})$, we first investigate the limiting law for the leading
eigenvalues of the sample covariance matrix $S_n$ under the condition that
$\lambda_k>1+\sqrt{c}$. At low SNR, since the AIC tends to underestimate the
number of signals $k$, the AIC should be redefined in this case. As a natural
extension of the AIC for fixed $p$, we propose the extended AIC (EAIC), i.e.,
the AICtype method with tuning parameter
$\gamma=\varphi(c)=1/2+\sqrt{1/c}\log(1+\sqrt{c})/c$, and demonstrate that the
EAICtype method, i.e., the AICtype method with tuning parameter
$\gamma>\varphi(c)$, can select the number of signals $k$ consistently. In the
following two cases, (1) $p$ fixed, $n\to\infty$, (2) $n,p\to\infty$ with
$p/n\to 0$, if the AIC is defined as the degeneration of the EAIC in the case
$n,p\to\infty$ with $p/n\to c>0$, i.e., $\gamma=\lim_{c\rightarrow
0+0}\varphi(c)=1$, then we have essentially demonstrated that, to achieve the
consistency of the AICtype method in the above two cases, $\gamma>1$ is
required. Moreover, we show that the EAICtype method is essentially
tuningfree and outperforms the wellknown KN estimator proposed in Kritchman
and Nadler (2008) and the BCF estimator proposed in Bai, Choi and Fujikoshi
(2018). Numerical studies indicate that the proposed method works well.

The estimation of probabilities of network edges from the observed adjacency
matrix has important applications to predicting missing links and network
denoising. It has usually been addressed by estimating the graphon, a function
that determines the matrix of edge probabilities, but this is illdefined
without strong assumptions on the network structure. Here we propose a novel
computationally efficient method, based on neighborhood smoothing to estimate
the expectation of the adjacency matrix directly, without making the structural
assumptions that graphon estimation requires. The neighborhood smoothing method
requires little tuning, has a competitive meansquared error rate, and
outperforms many benchmark methods on link prediction in simulated and real
networks.

The ExpectationMaximization (EM) algorithm is an iterative method to
maximize the loglikelihood function for parameter estimation. Previous works
on the convergence analysis of the EM algorithm have established results on the
asymptotic (population level) convergence rate of the algorithm. In this paper,
we give a dataadaptive analysis of the sample level local convergence rate of
the EM algorithm. In particular, we show that the local convergence rate of the
EM algorithm is a random variable $\overline{K}_{n}$ derived from the data
generating distribution, which adaptively yields the convergence rate of the EM
algorithm on each finite sample data set from the same population distribution.
We then give a nonasymptotic concentration bound of $\overline{K}_{n}$ on the
population level optimal convergence rate $\overline{\kappa}$ of the EM
algorithm, which implies that $\overline{K}_{n}\to\overline{\kappa}$ in
probability as the sample size $n\to\infty$. Our theory identifies the effect
of sample size on the convergence behavior of sample EM sequence, and explains
a surprising phenomenon in applications of the EM algorithm, i.e. the finite
sample version of the algorithm sometimes converges faster even than the
population version. We apply our theory to the EM algorithm on three canonical
models and obtain specific forms of the adaptive convergence theorem for each
model.

Networks are a useful representation for data on connections between units of
interests, but the observed connections are often noisy and/or include missing
values. One common approach to network analysis is to treat the network as a
realization from a random graph model, and estimate the underlying edge
probability matrix, which is sometimes referred to as network denoising. Here
we propose a generalized linear model with low rank effects to model network
edges. This model can be applied to various types of networks, including
directed and undirected, binary and weighted, and it can naturally utilize
additional information such as node and/or edge covariates. We develop an
efficient projected gradient ascent algorithm to fit the model, establish
asymptotic consistency, and demonstrate empirical performance of the method on
both simulated and real networks.

Although much progress has been made in classification with highdimensional
features \citep{Fan_Fan:2008, JGuo:2010, CaiSun:2014, PRXu:2014},
classification with ultrahighdimensional features, wherein the features much
outnumber the sample size, defies most existing work. This paper introduces a
novel and computationally feasible multivariate screening and classification
method for ultrahighdimensional data. Leveraging interfeature correlations,
the proposed method enables detection of marginally weak and sparse signals and
recovery of the true informative feature set, and achieves asymptotic optimal
misclassification rates. We also show that the proposed procedure provides more
powerful discovery boundaries compared to those in \citet{CaiSun:2014} and
\citet{JJin:2009}. The performance of the proposed procedure is evaluated using
simulation studies and demonstrated via classification of patients with
different posttransplantation renal functional types.

While graphical models for continuous data (Gaussian graphical models) and
discrete data (Ising models) have been extensively studied, there is little
work on graphical models linking both continuous and discrete variables (mixed
data), which are common in many scientific applications. We propose a novel
graphical model for mixed data, which is simple enough to be suitable for
highdimensional data, yet flexible enough to represent all possible graph
structures. We develop a computationally efficient regressionbased algorithm
for fitting the model by focusing on the conditional loglikelihood of each
variable given the rest. The parameters have a natural group structure, and
sparsity in the fitted graph is attained by incorporating a group lasso
penalty, approximated by a weighted $\ell_1$ penalty for computational
efficiency. We demonstrate the effectiveness of our method through an extensive
simulation study and apply it to a music annotation data set (CAL500),
obtaining a sparse and interpretable graphical model relating the continuous
features of the audio signal to categorical variables such as genre, emotions,
and usage associated with particular songs. While we focus on binary discrete
variables, we also show that the proposed methodology can be easily extended to
general discrete variables.

Although asymptotic analyses of undirected network models based on degree
sequences have started to appear in recent literature, it remains an open
problem to study statistical properties of directed network models. In this
paper, we provide for the first time a rigorous analysis of directed
exponential random graph models using the indegrees and outdegrees as
sufficient statistics with binary as well as continuous weighted edges. We
establish the uniform consistency and the asymptotic normality for the maximum
likelihood estimate, when the number of parameters grows and only one realized
observation of the graph is available. One key technique in the proofs is to
approximate the inverse of the Fisher information matrix using a simple matrix
with high accuracy. Numerical studies confirm our theoretical findings.

We consider the problem of jointly estimating a collection of graphical
models for discrete data, corresponding to several categories that share some
common structure. An example for such a setting is voting records of
legislators on different issues, such as defense, energy, and healthcare. We
develop a Markov graphical model to characterize the heterogeneous dependence
structures arising from such data. The model is fitted via a joint estimation
method that preserves the underlying common graph structure, but also allows
for differences between the networks. The method employs a group penalty that
targets the common zero interaction effects across all the networks. We apply
the method to describe the internal networks of the U.S. Senate on several
important issues. Our analysis reveals individual structure for each issue,
distinct from the underlying wellknown bipartisan structure common to all
categories which we are able to extract separately. We also establish
consistency of the proposed method both for parameter estimation and model
selection, and evaluate its numerical performance on a number of simulated
examples.

Many methods have been proposed for community detection in networks, but most
of them do not take into account additional information on the nodes that is
often available in practice. In this paper, we propose a new joint community
detection criterion that uses both the network edge information and the node
features to detect community structures. One advantage our method has over
existing joint detection approaches is the flexibility of learning the impact
of different features which may differ across communities. Another advantage is
the flexibility of choosing the amount of influence the feature information has
on communities. The method is asymptotically consistent under the block model
with additional assumptions on the feature distributions, and performs well on
simulated and real networks.

Community detection is a fundamental problem in network analysis, with
applications in many diverse areas. The stochastic block model is a common tool
for modelbased community detection, and asymptotic tools for checking
consistency of community detection under the block model have been recently
developed. However, the block model is limited by its assumption that all nodes
within a community are stochastically equivalent, and provides a poor fit to
networks with hubs or highly varying node degrees within communities, which are
common in practice. The degreecorrected stochastic block model was proposed to
address this shortcoming and allows variation in node degrees within a
community while preserving the overall block community structure. In this paper
we establish general theory for checking consistency of community detection
under the degreecorrected stochastic block model and compare several community
detection criteria under both the standard and the degreecorrected models. We
show which criteria are consistent under which models and constraints, as well
as compare their relative performance in practice. We find that methods based
on the degreecorrected block model, which includes the standard block model as
a special case, are consistent under a wider class of models and that
modularitytype methods require parameter constraints for consistency, whereas
likelihoodbased methods do not. On the other hand, in practice, the degree
correction involves estimating many more parameters, and empirically we find it
is only worth doing if the node degrees within communities are indeed highly
variable. We illustrate the methods on simulated networks and on a network of
political blogs.

Community detection is a fundamental problem in network analysis which is
made more challenging by overlaps between communities which often occur in
practice. Here we propose a general, flexible, and interpretable generative
model for overlapping communities, which can be thought of as a generalization
of the degreecorrected stochastic block model. We develop an efficient
spectral algorithm for estimating the community memberships, which deals with
the overlaps by employing the Kmedians algorithm rather than the usual Kmeans
for clustering in the spectral domain. We show that the algorithm is
asymptotically consistent when networks are not too sparse and the overlaps
between communities not too large. Numerical experiments on both simulated
networks and many real social networks demonstrate that our method performs
very well compared to a number of benchmark methods for overlapping community
detection.

In this paper, we show the Wilks type of results for the BradleyTerry model.
Specifically, for some simple and composite null hypotheses of interest, we
show that the likelihood ratio test statistic $\Lambda$ enjoys a chisquare
approximation in the sense that $(2p)^{1/2}(2\log \Lambda
p)\stackrel{L}{\rightarrow}N(0,1)$ as $p$ goes to infinity, where $p$ is the
corresponding degrees of freedom. Simulation studies and an application to NBA
data illustrate the theoretical results.

The primary motivation and application in this article come from brain
imaging studies on cognitive impairment in elderly subjects with brain
disorders. We propose a regularized Haar waveletbased approach for the
analysis of threedimensional brain image data in the framework of functional
data analysis, which automatically takes into account the spatial information
among neighboring voxels. We conduct extensive simulation studies to evaluate
the prediction performance of the proposed approach and its ability to identify
related regions to the outcome of interest, with the underlying assumption that
only few relatively small subregions are truly predictive of the outcome of
interest. We then apply the proposed approach to searching for brain subregions
that are associated with cognition using PET images of patients with
Alzheimer's disease, patients with mild cognitive impairment and normal
controls.

This paper presents an asynchronous distributed algorithm to manage multiple
trees for peertopeer streaming in a flow level model. It is assumed that
videos are cut into substreams, with or without source coding, to be
distributed to all nodes. The algorithm guarantees that each node receives
sufficiently many substreams within delay logarithmic in the number of peers.
The algorithm works by constantly updating the topology so that each substream
is distributed through trees to as many nodes as possible without interference.
Competition among trees for limited upload capacity is managed so that both
coverage and balance are achieved. The algorithm is robust in that it
efficiently eliminates cycles and maintains tree structures in a distributed
way. The algorithm favors nodes with higher degree, so it not only works for
live streaming and video on demand, but also in the case a few nodes with large
degree act as servers and other nodes act as clients.
A proof of convergence of the algorithm is given assuming instantaneous
update of depth information, and for the case of a single tree it is shown that
the convergence time is stochastically tightly bounded by a small constant
times the log of the number of nodes. These theoretical results are
complemented by simulations showing that the algorithm works well even when
most assumptions for the theoretical tractability do not hold.

In this paper we study the effective degrees of freedom of a general class of
reduced rank estimators for multivariate regression in the framework of Stein's
unbiased risk estimation (SURE). We derive a finitesample exact unbiased
estimator that admits a closedform expression in terms of the singular values
or thresholded singular values of the least squares solution and hence readily
computable. The results continue to hold in the highdimensional scenario when
both the predictor and response dimensions are allowed to be larger than the
sample size. The derived analytical form facilitates the investigation of its
theoretical properties and provides new insights into the empirical behaviors
of the degrees of freedom. In particular, we examine the differences and
connections between the proposed estimator and a commonlyused naive estimator,
i.e., the number of free parameters. The use of the proposed estimator leads to
efficient and accurate prediction risk estimation and model selection, as
demonstrated by simulation studies and a data example.

Link prediction is one of the fundamental problems in network analysis. In
many applications, notably in genetics, a partially observed network may not
contain any negative examples of absent edges, which creates a difficulty for
many existing supervised learning approaches. We develop a new method which
treats the observed network as a sample of the true network with different
sampling rates for positive and negative examples. We obtain a relative ranking
of potential links by their probabilities, utilizing information on node
covariates as well as on network topology. Empirically, the method performs
well under many settings, including when the observed network is sparse. We
apply the method to a proteinprotein interaction network and a school
friendship network.

There has been a lot of work fitting Ising models to multivariate binary data
in order to understand the conditional dependency relationships between the
variables. However, additional covariates are frequently recorded together with
the binary data, and may influence the dependence relationships. Motivated by
such a dataset on genomic instability collected from tumor samples of several
types, we propose a sparse covariate dependent Ising model to study both the
conditional dependency within the binary data and its relationship with the
additional covariates. This results in subjectspecific Ising models, where the
subject's covariates influence the strength of association between the genes.
As in all exploratory data analysis, interpretability of results is important,
and we use L1 penalties to induce sparsity in the fitted graphs and in the
number of selected covariates. Two algorithms to fit the model are proposed and
compared on a set of simulated data, and asymptotic results are established.
The results on the tumor dataset and their biological significance are
discussed in detail.

This paper focuses on the stationary portion of file download in an
unstructured peertopeer network, which typically follows for many hours after
a flash crowd initiation. The model includes the case that peers can have some
pieces at the time of arrival. The contribution of the paper is to identify how
much help is needed from the seeds, either fixed seeds or peer seeds (which are
peers remaining in the system after obtaining a complete collection) to
stabilize the system. The dominant cause for instability is the missing piece
syndrome, whereby one piece becomes very rare in the network. It is shown that
stability can be achieved with only a small amount of help from peer
seedseven with very little help from a fixed seed, peers need dwell as peer
seeds on average only long enough to upload one additional piece. The region of
stability is insensitive to the piece selection policy. Network coding can
substantially increase the region of stability in case a portion of the new
peers arrive with randomly coded pieces.

Typical protocols for peertopeer file sharing over the Internet divide
files to be shared into pieces. New peers strive to obtain a complete
collection of pieces from other peers and from a seed. In this paper we
investigate a problem that can occur if the seeding rate is not large enough.
The problem is that, even if the statistics of the system are symmetric in the
pieces, there can be symmetry breaking, with one piece becoming very rare. If
peers depart after obtaining a complete collection, they can tend to leave
before helping other peers receive the rare piece. Assuming that peers arrive
with no pieces, there is a single seed, random peer contacts are made, random
useful pieces are downloaded, and peers depart upon receiving the complete
file, the system is stable if the seeding rate (in pieces per time unit) is
greater than the arrival rate, and is unstable if the seeding rate is less than
the arrival rate. The result persists for any piece selection policy that
selects from among useful pieces, such as rarest first, and it persists with
the use of network coding.

We propose a computationally intensive method, the random lasso method, for
variable selection in linear models. The method consists of two major steps. In
step 1, the lasso method is applied to many bootstrap samples, each using a set
of randomly selected covariates. A measure of importance is yielded from this
step for each covariate. In step 2, a similar procedure to the first step is
implemented with the exception that for each bootstrap sample, a subset of
covariates is randomly selected with unequal selection probabilities determined
by the covariates' importance. Adaptive lasso may be used in the second step
with weights determined by the importance measures. The final set of covariates
and their coefficients are determined by averaging bootstrap results obtained
from step 2. The proposed method alleviates some of the limitations of lasso,
elasticnet and related methods noted especially in the context of microarray
data analysis: it tends to remove highly correlated variables altogether or
select them all, and maintains maximal flexibility in estimating their
coefficients, particularly with different signs; the number of selected
variables is no longer limited by the sample size; and the resulting prediction
accuracy is competitive or superior compared to the alternatives. We illustrate
the proposed method by extensive simulation studies. The proposed method is
also applied to a Glioblastoma microarray data analysis.

In many organisms the expression levels of each gene are controlled by the
activation levels of known "Transcription Factors" (TF). A problem of
considerable interest is that of estimating the "Transcription Regulation
Networks" (TRN) relating the TFs and genes. While the expression levels of
genes can be observed, the activation levels of the corresponding TFs are
usually unknown, greatly increasing the difficulty of the problem. Based on
previous experimental work, it is often the case that partial information about
the TRN is available. For example, certain TFs may be known to regulate a given
gene or in other cases a connection may be predicted with a certain
probability. In general, the biology of the problem indicates there will be
very few connections between TFs and genes. Several methods have been proposed
for estimating TRNs. However, they all suffer from problems such as unrealistic
assumptions about prior knowledge of the network structure or computational
limitations. We propose a new approach that can directly utilize prior
information about the network structure in conjunction with observed gene
expression data to estimate the TRN. Our approach uses $L_1$ penalties on the
network to ensure a sparse structure. This has the advantage of being
computationally efficient as well as making many fewer assumptions about the
network structure. We use our methodology to construct the TRN for E. coli and
show that the estimate is biologically sensible and compares favorably with
previous estimates.