
This paper studies model selection consistency for high dimensional sparse
regression when data exhibits both crosssectional and serial dependency. Most
commonlyused model selection methods fail to consistently recover the true
model when the covariates are highly correlated. Motivated by econometric
studies, we consider the case where covariate dependence can be reduced through
factor model, and propose a consistent strategy named FactorAdjusted
Regularized Model Selection (FarmSelect). By separating the latent factors from
idiosyncratic components, we transform the problem from model selection with
highly correlated covariates to that with weakly correlated variables. Model
selection consistency as well as optimal rates of convergence are obtained
under mild conditions. Numerical studies demonstrate the nice finite sample
performance in terms of both model selection and outofsample prediction.
Moreover, our method is flexible in a sense that it pays no price for weakly
correlated and uncorrelated cases. Our method is applicable to a wide range of
high dimensional sparse regression problems. An Rpackage FarmSelect is also
provided for implementation.

This paper is concerned with the problem of top$K$ ranking from pairwise
comparisons. Given a collection of $n$ items and a few pairwise comparisons
across them, one wishes to identify the set of $K$ items that receive the
highest ranks. To tackle this problem, we adopt the logistic parametric model
 the BradleyTerryLuce model, where each item is assigned a latent
preference score, and where the outcome of each pairwise comparison depends
solely on the relative scores of the two items involved. Recent works have made
significant progress towards characterizing the performance (e.g. the mean
square error for estimating the scores) of several classical methods, including
the spectral method and the maximum likelihood estimator (MLE). However, where
they stand regarding top$K$ ranking remains unsettled.
We demonstrate that under a natural random sampling model, the spectral
method alone, or the regularized MLE alone, is minimax optimal in terms of the
sample complexity  the number of paired comparisons needed to ensure exact
top$K$ identification, for the fixed dynamic range regime. This is
accomplished via optimal control of the entrywise error of the score estimates.
We complement our theoretical studies by numerical experiments, confirming that
both methods yield low entrywise errors for estimating the underlying scores.
Our theory is established via a novel leaveoneout trick, which proves
effective for analyzing both iterative and noniterative procedures. Along the
way, we derive an elementary eigenvector perturbation bound for probability
transition matrices, which parallels the DavisKahan $\sin\Theta$ theorem for
symmetric matrices. This also allows us to close the gap between the $\ell_2$
error upper bound for the spectral method and the minimax lower limit.

Principal component analysis (PCA) is fundamental to statistical machine
learning. It extracts latent principal factors that contribute to the most
variation of the data. When data are stored across multiple machines, however,
communication cost can prohibit the computation of PCA in a central location
and distributed algorithms for PCA are thus needed. This paper proposes and
studies a distributed PCA algorithm: each node machine computes the top $K$
eigenvectors and transmits them to the central server; the central server then
aggregates the information from all the node machines and conducts a PCA based
on the aggregated information. We investigate the bias and variance for the
resulting distributed estimator of the top $K$ eigenvectors. In particular, we
show that for distributions with symmetric innovation, the empirical top
eigenspaces are unbiased and hence the distributed PCA is "unbiased". We derive
the rate of convergence for distributed PCA estimators, which depends
explicitly on the effective rank of covariance, eigengap, and the number of
machines. We show that when the number of machines is not unreasonably large,
the distributed PCA performs as well as the whole sample PCA, even without full
access of whole data. The theoretical results are verified by an extensive
simulation study. We also extend our analysis to the heterogeneous case where
the population covariance matrices are different across local machines but
share similar top eigenstructures.

Recovering lowrank structures via eigenvector perturbation analysis is a
common problem in statistical machine learning, such as in factor analysis,
community detection, ranking, matrix completion, among others. While a large
variety of bounds are available for average errors between empirical and
population statistics of eigenvectors, few results are tight for entrywise
analyses, which are critical for a number of problems such as community
detection.
This paper investigates entrywise behaviors of eigenvectors for a large class
of random matrices whose expectations are lowrank, which helps settle the
conjecture in Abbe et al. (2014b) that the spectral algorithm achieves exact
recovery in the stochastic block model without any trimming or cleaning steps.
The key is a firstorder approximation of eigenvectors under the $\ell_\infty$
norm: $$u_k \approx \frac{A u_k^*}{\lambda_k^*},$$ where $\{u_k\}$ and
$\{u_k^*\}$ are eigenvectors of a random matrix $A$ and its expectation
$\mathbb{E} A$, respectively. The fact that the approximation is both tight and
linear in $A$ facilitates sharp comparisons between $u_k$ and $u_k^*$. In
particular, it allows for comparing the signs of $u_k$ and $u_k^*$ even if $\
u_k  u_k^*\_{\infty}$ is large. The results are further extended to
perturbations of eigenspaces, yielding new $\ell_\infty$type bounds for
synchronization ($\mathbb{Z}_2$spiked Wigner model) and noisy matrix
completion.

This paper is devoted to an interacting particle system that provides
probabilistic interpretation of the wave equation on graphs. A FeynmanKactype
formula is established, connecting the expectation of the process with the wave
equation on graphs. Nonasymptotic $L^2$ estimates are presented. It is then
shown that the highdensity hydrodynamic limit of the system is given by the
wave equation in Euclidean space. The sharpness of scaling limit result is
demonstrated by a phase transition phenomenon.