
Analysis of sleep for the diagnosis of sleep disorders such as Type-1
Narcolepsy (T1N) currently requires visual inspection of polysomnography
records by trained scoring technicians. Here, we used neural networks in
approximately 3,000 normal and abnormal sleep recordings to automate sleep
stage scoring, producing a hypnodensity graph, a probability distribution
conveying more information than classical hypnograms. Accuracy of sleep stage
scoring was validated in 70 subjects assessed by six scorers. The best model
performed better than any individual scorer (87% versus consensus). It also
reliably scores sleep down to 5-second instead of 30-second scoring epochs. A T1N
marker based on unusual sleep-stage overlaps achieved a specificity of 96% and
a sensitivity of 91%, validated in independent datasets. Addition of
HLA-DQB1*06:02 typing increased specificity to 99%. Our method can reduce time
spent in sleep clinics and automates T1N diagnosis. It also opens the
possibility of diagnosing T1N using home sleep studies.
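
As a minimal illustration of the distinction drawn above, the sketch below
collapses a hypnodensity (one probability distribution per scoring epoch) into
a classical hypnogram by keeping only the most probable stage per epoch. The
stage labels, the function name, and the probabilities are illustrative
assumptions, not values or code from the study.

```python
# Assumed stage labels; the study's own label set may differ.
STAGES = ["W", "N1", "N2", "N3", "REM"]


def hypnogram_from_hypnodensity(hypnodensity):
    """Collapse each epoch's stage distribution to its most likely stage."""
    labels = []
    for probs in hypnodensity:
        if abs(sum(probs) - 1.0) > 1e-6:
            raise ValueError("each epoch must be a probability distribution")
        best = max(range(len(STAGES)), key=lambda k: probs[k])
        labels.append(STAGES[best])
    return labels


# Hypothetical hypnodensity for three epochs (rows sum to 1).
hypnodensity = [
    [0.90, 0.05, 0.03, 0.01, 0.01],
    [0.40, 0.35, 0.05, 0.00, 0.20],
    [0.05, 0.10, 0.25, 0.00, 0.60],
]
hypnogram = hypnogram_from_hypnodensity(hypnodensity)
```

The second epoch shows what the hypnogram discards: wake is the most probable
stage, yet N1 and REM together carry more than half of the probability mass.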

Consider a sequence of real data points $X_1,\ldots, X_n$ with underlying
means $\theta^*_1,\dots,\theta^*_n$. This paper begins by studying the
setting in which $\theta^*_i$ is both piecewise constant and monotone as a function
of the index $i$. For this, we establish the exact minimax rate of estimating
such monotone functions, and thus give a nontrivial answer to an open problem
in the shape-constrained analysis literature. The minimax rate involves an
interesting iterated logarithmic dependence on the dimension, a phenomenon that
is revealed through characterizing the interplay between the isotonic shape
constraint and model selection complexity. We then develop a penalized
least-squares procedure for estimating the vector
$\theta^*=(\theta^*_1,\dots,\theta^*_n)^T$. This estimator is shown to achieve
the derived minimax rate adaptively. For the proposed estimator, we further
allow the model to be misspecified and derive oracle inequalities with the
optimal rates, and show there exists a computationally efficient algorithm to
compute the exact solution.
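
The monotone constraint can be illustrated with the classical
pool-adjacent-violators algorithm, which computes the unpenalized least-squares
nondecreasing fit. This is not the penalized estimator proposed here, only a
baseline sketch, and the function name is hypothetical. Its output is itself
piecewise constant, since every pooled block shares one fitted value.

```python
def isotonic_fit(x):
    """Least-squares nondecreasing fit via pool-adjacent-violators."""
    blocks = []  # each block stores [sum of values, number of values]
    for value in x:
        blocks.append([float(value), 1])
        # Pool while the previous block's mean exceeds the last block's mean
        # (means are compared by cross-multiplication to avoid division).
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit


fitted = isotonic_fit([1, 3, 2, 4])
```

Here the violating pair (3, 2) is pooled into a single block with mean 2.5.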

Motivated by challenges in studying a new correlation measure being
popularized in evaluating online ranking algorithms' performance, this
manuscript explores the validity of uncertainty assessment for weighted
U-statistics. Without any commonly adopted assumption, we verify the
inferential validity of Efron's bootstrap and of a new resampling procedure.
Specifically, in its full generality, our theory allows both kernels and
weights to be asymmetric and data points not to be identically distributed,
all of which are new issues that historically have not been addressed. To
achieve strict generalization, for example, we have to carefully control the
order of the "degenerate" term in U-statistics, which is no longer degenerate
under the empirical measure for non-i.i.d. data. Our result applies to the
motivating task, giving the region
at which solid statistical inference can be made.

There has been an increasing interest in testing the equality of large
Pearson's correlation matrices. However, in many applications it is more
important to test the equality of large rank-based correlation matrices since
they are more robust to outliers and nonlinearity. Unlike the Pearson's case,
testing the equality of large rank-based statistics has not been well explored
and requires us to develop new methods and theory. In this paper, we provide a
framework for testing the equality of two large U-statistic-based correlation
matrices, which include the rank-based correlation matrices as special cases.
Our approach exploits extreme value statistics and the Jackknife estimator for
uncertainty assessment and is valid under a fully nonparametric model.
Theoretically, we develop a theory for testing the equality of
U-statistic-based correlation matrices. We then apply this theory to study the problem of
testing large Kendall's tau correlation matrices and demonstrate its
optimality. For proving this optimality, a novel construction of least
favourable distributions is developed for the correlation matrix comparison.

This paper proposes a regularized pairwise difference approach for estimating
the linear component coefficient in a partially linear model, with consistency
and exact rates of convergence obtained in high dimensions under mild scaling
requirements. Our analysis reveals interesting features such as (i) the
bandwidth parameter automatically adapts to the model and is actually
tuning-insensitive; and (ii) the procedure can even maintain a fast rate of
convergence for the $\alpha$-H\"older class with $\alpha\leq 1/2$. Simulation studies
show the advantage of the proposed method, and an application of our approach to
brain imaging data reveals some biological patterns that fail to be recovered
using competing methods.

In many applications, linear models fit the data poorly. This article studies
an appealing alternative, the generalized regression model. This model only
assumes that there exists an unknown monotonically increasing link function
connecting the response $Y$ to a single index $X^T\beta^*$ of explanatory
variables $X\in\mathbb{R}^d$. The generalized regression model is flexible and
covers many widely used statistical models. It fits the data generating
mechanisms well in many real problems, which makes it useful in a variety of
applications where regression models are regularly employed. In low dimensions,
rank-based M-estimators are recommended to deal with the generalized regression
model, giving root-$n$ consistent estimators of $\beta^*$. Applications of
these estimators to high dimensional data, however, are questionable. This
article studies, both theoretically and practically, a simple yet powerful
smoothing approach to handle the high dimensional generalized regression model.
Theoretically, a family of smoothing functions is provided, and the amount of
smoothing necessary for efficient inference is carefully calculated.
Practically, our study is motivated by an important and challenging scientific
problem: decoding gene regulation by predicting transcription factors that bind
to cis-regulatory elements. Applying our proposed method to this problem shows
substantial improvement over the state-of-the-art alternative in real data.

We consider the testing of mutual independence among all entries in a
$d$-dimensional random vector based on $n$ independent observations. We study
two families of distribution-free test statistics, which include Kendall's tau
and Spearman's rho as important examples. We show that under the null
hypothesis the test statistics of these two families converge weakly to Gumbel
distributions, and propose tests that control the type I error in the
high-dimensional setting where $d>n$. We further show that the two tests are
rate-optimal in terms of power against sparse alternatives, and outperform
competitors in simulations, especially when $d$ is large.
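
The following sketch computes the raw ingredient of such a test in the
Kendall's tau case: all pairwise sample Kendall's tau coefficients and their
largest absolute value. The centering and scaling needed for the Gumbel limit
are not reproduced here, and the function names and data are illustrative
assumptions.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Sample Kendall's tau (tau-a) for two equal-length samples without ties."""
    n = len(x)
    score = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        # +1 for a concordant pair, -1 for a discordant pair.
        score += (prod > 0) - (prod < 0)
    return 2.0 * score / (n * (n - 1))


def max_abs_tau(columns):
    """Largest absolute pairwise Kendall's tau across the coordinates."""
    return max(abs(kendall_tau(a, b)) for a, b in combinations(columns, 2))


# Hypothetical data: three coordinates observed on four samples.
columns = [[1, 2, 3, 4], [2, 1, 4, 3], [4, 3, 2, 1]]
statistic = max_abs_tau(columns)
```

A max-type statistic of this kind is sensitive to a few strongly dependent
pairs, which matches the sparse alternatives considered above.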

Recently, Chernozhukov, Chetverikov, and Kato [Ann. Statist. 42 (2014)
1564-1597] developed a new Gaussian comparison inequality for approximating
the suprema of empirical processes. This paper exploits this technique to
devise sharp inference on spectra of large random matrices. In particular, we
show that two longstanding problems in random matrix theory can be solved: (i)
simple bootstrap inference on sample eigenvalues when true eigenvalues are
tied; (ii) conducting two-sample Roy's covariance test in high dimensions. To
establish the asymptotic results, a generalized $\epsilon$-net argument
regarding the matrix rescaled spectral norm and several new empirical process
bounds are developed and are of independent interest.

The relationship of scientific knowledge development to technological
development is widely recognized as one of the most important and complex
aspects of technological evolution. This paper adds to our understanding of the
relationship through use of a more rigorous structure for differentiating among
technologies based upon technological domains (defined as consisting of the
artifacts over time that fulfill a specific generic function using a specific
body of technical knowledge).

The family of U-statistics plays a fundamental role in statistics. This paper
proves a novel exponential inequality for U-statistics under the time series
setting. Explicit mixing conditions are given for guaranteeing fast
convergence, the bound proves to be analogous to the one under independence,
and extension to nonstationary time series is straightforward. The proof
relies on a novel decomposition of U-statistics via exploiting the temporal
correlatedness structure. Such results are of interest in many fields where
high dimensional time series data are present. In particular, applications to
high dimensional time series inference are discussed.

We present a robust alternative to principal component analysis (PCA),
called elliptical component analysis (ECA), for analyzing high dimensional,
elliptically distributed data. ECA estimates the eigenspace of the covariance
matrix of the elliptical data. To cope with heavy-tailed elliptical
distributions, a multivariate rank statistic is exploited. At the model level,
we consider two settings: either that the leading eigenvectors of the
covariance matrix are non-sparse or that they are sparse. Methodologically, we
propose ECA procedures for both non-sparse and sparse settings. Theoretically,
we provide both non-asymptotic and asymptotic analyses quantifying the
theoretical performances of ECA. In the non-sparse setting, we show that ECA's
performance is highly related to the effective rank of the covariance matrix.
In the sparse setting, the results are twofold: (i) We show that the sparse ECA
estimator based on a combinatoric program attains the optimal rate of
convergence; (ii) Based on some recent developments in estimating sparse
leading eigenvectors, we show that a computationally efficient sparse ECA
estimator attains the optimal rate of convergence under a suboptimal scaling.

Correlation matrices play a key role in many multivariate methods (e.g.,
graphical model estimation and factor analysis). The current state-of-the-art
in estimating large correlation matrices focuses on the use of Pearson's sample
correlation matrix. Although Pearson's sample correlation matrix enjoys various
good properties under Gaussian models, it is not an effective estimator when
facing heavy-tailed distributions. As a robust alternative, Han and Liu [J. Am.
Stat. Assoc. 109 (2015) 275-287] advocated the use of a transformed version of
the Kendall's tau sample correlation matrix in estimating the high dimensional
latent generalized correlation matrix under the transelliptical distribution
family (or elliptical copula). The transelliptical family assumes that after
unspecified marginal monotone transformations, the data follow an elliptical
distribution. In this paper, we study the theoretical properties of the
Kendall's tau sample correlation matrix and its transformed version proposed in
Han and Liu [J. Am. Stat. Assoc. 109 (2015) 275-287] for estimating the
population Kendall's tau correlation matrix and the latent Pearson's
correlation matrix under both spectral and restricted spectral norms. With
regard to the spectral norm, we highlight the role of "effective rank" in
quantifying the rate of convergence. With regard to the restricted spectral
norm, we for the first time present a "sign sub-Gaussian condition" which is
sufficient to guarantee that the rank-based correlation matrix estimator
attains the fast rate of convergence. In both cases, we do not need any moment
condition.
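
For concreteness, the transformation in question is commonly written as the
entrywise map $\hat{R}_{jk}=\sin(\frac{\pi}{2}\hat{\tau}_{jk})$, which takes a
Kendall's tau value to a latent correlation value. The sketch below applies it
to a small Kendall's tau matrix; the function name and the numerical values
are hypothetical.

```python
import math


def latent_correlation_from_tau(tau_matrix):
    """Apply the entrywise map tau -> sin(pi/2 * tau) to a Kendall's tau matrix."""
    return [[math.sin(0.5 * math.pi * t) for t in row] for row in tau_matrix]


# Hypothetical 3x3 Kendall's tau matrix (illustrative values only).
tau_hat = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, -0.2],
    [0.0, -0.2, 1.0],
]
r_hat = latent_correlation_from_tau(tau_hat)
```

Because the map depends on the data only through ranks, the resulting estimator
is unaffected by monotone transformations of the marginals.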

We propose a bootstrap-based robust high-confidence level upper bound (Robust
H-CLUB) for assessing the risks of large portfolios. The proposed approach
exploits rank-based and quantile-based estimators, and can be viewed as a
robust extension of the H-CLUB method (Fan et al., 2015). Such an extension
allows us to handle possibly misspecified models and heavy-tailed data. Under
mixing conditions, we analyze the proposed approach and demonstrate its
advantage over the H-CLUB. We further provide thorough numerical results to
back up the developed theory. We also apply the proposed method to analyze a
stock market dataset.

Big Data bring new opportunities to modern society and challenges to data
scientists. On the one hand, Big Data hold great promise for discovering subtle
population patterns and heterogeneities that are not possible with small-scale
data. On the other hand, the massive sample size and high dimensionality of Big
Data introduce unique computational and statistical challenges, including
scalability and storage bottleneck, noise accumulation, spurious correlation,
incidental endogeneity, and measurement errors. These challenges are
distinctive and require a new computational and statistical paradigm. This
article gives an overview of the salient features of Big Data and how these
features drive a paradigm change in statistical and computational methods as
well as computing architectures. We also provide various new perspectives on
Big Data analysis and computation. In particular, we emphasize the
viability of the sparsest solution in a high-confidence set and point out that
exogenous assumptions in most statistical methods for Big Data cannot be
validated due to incidental endogeneity. They can lead to wrong statistical
inferences and consequently wrong scientific conclusions.

The vector autoregressive (VAR) model is a powerful tool in modeling complex
time series and has been exploited in many fields. However, fitting a high
dimensional VAR model poses some unique challenges: On the one hand, the
dimensionality, caused by modeling a large number of time series and higher
order autoregressive processes, is usually much higher than the time series
length; On the other hand, the temporal dependence structure in the VAR model
gives rise to extra theoretical challenges. In high dimensions, one popular
approach is to assume the transition matrix is sparse and fit the VAR model
using the "least squares" method with a lasso-type penalty. In this manuscript,
we propose an alternative way of estimating the VAR model. The main idea is,
via exploiting the temporal dependence structure, to formulate the estimation
problem as a linear program. There is an immediate advantage of the proposed
approach over the lasso-type estimators: The estimation equation can be
decomposed into multiple sub-equations and accordingly can be efficiently
solved in a parallel fashion. In addition, our method brings new theoretical
insights into the VAR model analysis. So far the theoretical results developed
in high dimensions (e.g., Song and Bickel (2011) and Kock and Callot (2012))
mainly pose assumptions on the design matrix of the formulated regression
problems. Such conditions are indirect about the transition matrices and not
transparent. In contrast, our results show that the operator norm of the
transition matrices plays an important role in estimation accuracy. We provide
explicit rates of convergence for both estimation and prediction. In addition,
we provide thorough experiments on both synthetic and real-world equity data to
show that there are empirical advantages of our method over the lasso-type
estimators in both parameter estimation and forecasting.

Statisticians and quantitative neuroscientists have actively promoted the use
of independence relationships for investigating brain networks, genomic
networks, and other measurement technologies. Estimation of these graphs
depends on two steps. The first is feature extraction, which summarizes measurements
within a parcellation, regional, or set definition to create nodes. Second,
these summaries are then used to create a graph representing relationships of
interest. In this manuscript we study the impact of dimension reduction on
graphs that describe different notions of relations among a set of random
variables. We are particularly interested in undirected graphs that capture the
random variables' independence and conditional independence relations. A
dimension reduction procedure can be any mapping from high dimensional spaces
to low dimensional spaces. We exploit a general framework for modeling the raw
data and advocate that in estimating the undirected graphs, any acceptable
dimension reduction procedure should be a graph-homotopic mapping, i.e., the
graphical structure of the data after dimension reduction should inherit the
main characteristics of the graphical structure of the raw data. We show that,
in terms of inferring undirected graphs that characterize the conditional
independence relations among random variables, many dimension reduction
procedures, such as the mean, median, or principal components, cannot be
theoretically guaranteed to be a graph-homotopic mapping. The implications of
this work are broad. In the most charitable setting for researchers, where the
correct node definition is known, graphical relationships can be contaminated
merely via the dimension reduction. The manuscript ends with a concrete
example, characterizing a subset of graphical structures such that the
dimension reduction procedure using the principal components can be a
graph-homotopic mapping.

In this manuscript we consider the problem of jointly estimating multiple
graphical models in high dimensions. We assume that the data are collected from
n subjects, each of which consists of T possibly dependent observations. The
graphical models of subjects vary, but are assumed to change smoothly
corresponding to a measure of closeness between subjects. We propose a kernel
based method for jointly estimating all graphical models. Theoretically, under
a double asymptotic framework, where both (T,n) and the dimension d can
increase, we provide the explicit rate of convergence in parameter estimation.
It characterizes the strength one can borrow across different individuals and
impact of data dependence on parameter estimation. Empirically, experiments on
both synthetic and real resting state functional magnetic resonance imaging
(rs-fMRI) data illustrate the effectiveness of the proposed method.

We propose a new high dimensional semiparametric principal component analysis
(PCA) method, named Copula Component Analysis (COCA). The semiparametric model
assumes that, after unspecified marginally monotone transformations, the
distributions are multivariate Gaussian. COCA improves upon PCA and sparse PCA
in three aspects: (i) It is robust to modeling assumptions; (ii) It is robust
to outliers and data contamination; (iii) It is scale-invariant and yields more
interpretable results. We prove that the COCA estimators obtain fast estimation
rates and are feature selection consistent when the dimension is nearly
exponentially large relative to the sample size. Careful experiments confirm
that COCA outperforms sparse PCA on both synthetic and real-world datasets.

In this manuscript a unified framework for conducting inference on complex
aggregated data in high dimensional settings is proposed. The data are assumed
to be a collection of multiple non-Gaussian realizations with underlying
undirected graphical structures. Utilizing the concept of median graphs in
summarizing the commonality across these graphical structures, a novel
semiparametric approach to modeling such complex aggregated data is provided
along with robust estimation of the median graph, which is assumed to be
sparse. The estimator is proved to be consistent in graph recovery and an upper
bound on the rate of convergence is given. Experiments on both synthetic and
real datasets are conducted to illustrate the empirical usefulness of the
proposed models and methods.

We study sparse principal component analysis for high dimensional vector
autoregressive time series under a doubly asymptotic framework, which allows
the dimension $d$ to scale with the series length $T$. We treat the transition
matrix of time series as a nuisance parameter and directly apply sparse
principal component analysis on multivariate time series as if the data are
independent. We provide explicit non-asymptotic rates of convergence for
leading eigenvector estimation and extend this result to principal subspace
estimation. Our analysis illustrates that the spectral norm of the transition
matrix plays an essential role in determining the final rates. We also
characterize sufficient conditions under which sparse principal component
analysis attains the optimal parametric rate. Our theoretical results are
backed up by thorough numerical studies.

We study the feasibility of a sterile neutrino search at the China Advanced
Research Reactor by measuring $\bar {\nu}_e$ survival probability with a
baseline of less than 15 m. Both hydrogen and deuteron have been considered as
potential targets. The sensitivity to sterile-to-regular neutrino mixing is
investigated under the "3(active)+1(sterile)" framework. We find that the
mixing parameter $\sin^2(2\theta_{14})$ can be severely constrained by such a
measurement if the mass-squared difference $\Delta m_{14}^2$ is of the order of
$\sim$1 eV$^2$.
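
In the short-baseline limit of the 3+1 framework, the survival probability is
commonly approximated by
$P \approx 1-\sin^2(2\theta_{14})\sin^2(1.267\,\Delta m_{14}^2 L/E)$, with
$\Delta m_{14}^2$ in eV$^2$, $L$ in m, and $E$ in MeV. The sketch below
evaluates this two-flavor approximation. The function name and the example
inputs are illustrative assumptions, not values from the study.

```python
import math


def survival_probability(sin2_2theta14, dm2_ev2, baseline_m, energy_mev):
    """Two-flavor short-baseline approximation of the survival probability.

    Units: dm2_ev2 in eV^2, baseline_m in meters, energy_mev in MeV.
    """
    phase = 1.267 * dm2_ev2 * baseline_m / energy_mev
    return 1.0 - sin2_2theta14 * math.sin(phase) ** 2


# Hypothetical inputs: mixing 0.1, mass-squared difference 1 eV^2,
# a 10 m baseline, and a 4 MeV antineutrino.
probability = survival_probability(0.1, 1.0, 10.0, 4.0)
```

For a mass-squared difference near 1 eV$^2$ and reactor energies of a few MeV,
the phase changes by order one over a few meters, which is why a baseline
below 15 m is sensitive to this mixing.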

In this paper, we propose a semiparametric approach, named nonparanormal
skeptic, for efficiently and robustly estimating high dimensional undirected
graphical models. To achieve modeling flexibility, we consider Gaussian Copula
graphical models (or the nonparanormal) as proposed by Liu et al. (2009). To
achieve estimation robustness, we exploit nonparametric rank-based correlation
coefficient estimators, including Spearman's rho and Kendall's tau. In high
dimensional settings, we prove that the nonparanormal skeptic achieves the
optimal parametric rate of convergence in both graph and parameter estimation.
This result suggests that the Gaussian copula graphical models can
be used as a safe replacement of the popular Gaussian graphical models, even
when the data are truly Gaussian. Besides theoretical analysis, we also conduct
thorough numerical simulations to compare different estimators for their graph
recovery performance under both ideal and noisy settings. The proposed methods
are then applied on a large-scale genomic dataset to illustrate their empirical
usefulness. The R language software package huge implementing the proposed
methods is available on the Comprehensive R Archive Network:
http://cran.r-project.org/.