• Analysis of sleep for the diagnosis of sleep disorders such as Type-1 Narcolepsy (T1N) currently requires visual inspection of polysomnography records by trained scoring technicians. Here, we used neural networks in approximately 3,000 normal and abnormal sleep recordings to automate sleep stage scoring, producing a hypnodensity graph - a probability distribution conveying more information than classical hypnograms. Accuracy of sleep stage scoring was validated in 70 subjects assessed by six scorers. The best model performed better than any individual scorer (87% versus consensus). It also reliably scores sleep at resolutions down to 5-second, instead of 30-second, scoring epochs. A T1N marker based on unusual sleep-stage overlaps achieved a specificity of 96% and a sensitivity of 91%, validated in independent datasets. Addition of HLA-DQB1*06:02 typing increased specificity to 99%. Our method can reduce time spent in sleep clinics and automates T1N diagnosis. It also opens the possibility of diagnosing T1N using home sleep studies.
  • Consider a sequence of real data points $X_1,\ldots, X_n$ with underlying means $\theta^*_1,\dots,\theta^*_n$. This paper first studies the setting in which $\theta^*_i$ is both piecewise constant and monotone as a function of the index $i$. For this setting, we establish the exact minimax rate of estimating such monotone functions, and thus give a non-trivial answer to an open problem in the shape-constrained analysis literature. The minimax rate involves an interesting iterated logarithmic dependence on the dimension, a phenomenon that is revealed through characterizing the interplay between the isotonic shape constraint and model selection complexity. We then develop a penalized least-squares procedure for estimating the vector $\theta^*=(\theta^*_1,\dots,\theta^*_n)^T$. This estimator is shown to achieve the derived minimax rate adaptively. For the proposed estimator, we further allow the model to be misspecified and derive oracle inequalities with the optimal rates, and show there exists a computationally efficient algorithm to compute the exact solution.
  • Motivated by challenges in studying a new correlation measure that is gaining popularity for evaluating the performance of online ranking algorithms, this manuscript explores the validity of uncertainty assessment for weighted U-statistics. Without any of the commonly adopted assumptions, we verify the inferential validity of Efron's bootstrap and of a new resampling procedure. Specifically, in its full generality, our theory allows both kernels and weights to be asymmetric and data points to be non-identically distributed, all of which are new issues that have not previously been addressed. To achieve this strict generalization, for example, we have to carefully control the order of the "degenerate" terms in U-statistics, which are no longer degenerate under the empirical measure for non-i.i.d. data. Our result applies to the motivating task, giving the region in which solid statistical inference can be made.
  • There has been an increasing interest in testing the equality of large Pearson's correlation matrices. However, in many applications it is more important to test the equality of large rank-based correlation matrices since they are more robust to outliers and nonlinearity. Unlike the Pearson's case, testing the equality of large rank-based statistics has not been well explored and requires us to develop new methods and theory. In this paper, we provide a framework for testing the equality of two large U-statistic based correlation matrices, which include the rank-based correlation matrices as special cases. Our approach exploits extreme value statistics and the Jackknife estimator for uncertainty assessment and is valid under a fully nonparametric model. Theoretically, we develop a theory for testing the equality of U-statistic based correlation matrices. We then apply this theory to study the problem of testing large Kendall's tau correlation matrices and demonstrate its optimality. For proving this optimality, a novel construction of least favourable distributions is developed for the correlation matrix comparison.
  • This paper proposes a regularized pairwise difference approach for estimating the linear component coefficient in a partially linear model, with consistency and exact rates of convergence obtained in high dimensions under mild scaling requirements. Our analysis reveals interesting features such as (i) the bandwidth parameter automatically adapts to the model and is actually tuning-insensitive; and (ii) the procedure can even maintain a fast rate of convergence for the $\alpha$-H\"older class with $\alpha\leq1/2$. Simulation studies show the advantage of the proposed method, and application of our approach to a brain imaging dataset reveals some biological patterns which fail to be recovered using competing methods.
  • In many applications, linear models fit the data poorly. This article studies an appealing alternative, the generalized regression model. This model only assumes that there exists an unknown monotonically increasing link function connecting the response $Y$ to a single index $X^T\beta^*$ of explanatory variables $X\in\mathbb{R}^d$. The generalized regression model is flexible and covers many widely used statistical models. It fits the data generating mechanisms well in many real problems, which makes it useful in a variety of applications where regression models are regularly employed. In low dimensions, rank-based M-estimators are recommended to deal with the generalized regression model, giving root-$n$ consistent estimators of $\beta^*$. Applications of these estimators to high dimensional data, however, are questionable. This article studies, both theoretically and practically, a simple yet powerful smoothing approach to handle the high dimensional generalized regression model. Theoretically, a family of smoothing functions is provided, and the amount of smoothing necessary for efficient inference is carefully calculated. Practically, our study is motivated by an important and challenging scientific problem: decoding gene regulation by predicting transcription factors that bind to cis-regulatory elements. Applying our proposed method to this problem shows substantial improvement over the state-of-the-art alternative in real data.
  • We consider the testing of mutual independence among all entries in a $d$-dimensional random vector based on $n$ independent observations. We study two families of distribution-free test statistics, which include Kendall's tau and Spearman's rho as important examples. We show that under the null hypothesis the test statistics of these two families converge weakly to Gumbel distributions, and propose tests that control the type I error in the high-dimensional setting where $d>n$. We further show that the two tests are rate-optimal in terms of power against sparse alternatives, and outperform competitors in simulations, especially when $d$ is large.
  • Recently, Chernozhukov, Chetverikov, and Kato [Ann. Statist. 42 (2014) 1564--1597] developed a new Gaussian comparison inequality for approximating the suprema of empirical processes. This paper exploits this technique to devise sharp inference on spectra of large random matrices. In particular, we show that two long-standing problems in random matrix theory can be solved: (i) simple bootstrap inference on sample eigenvalues when true eigenvalues are tied; (ii) conducting two-sample Roy's covariance test in high dimensions. To establish the asymptotic results, a generalized $\epsilon$-net argument regarding the matrix rescaled spectral norm and several new empirical process bounds are developed; these are of independent interest.
  • The relationship of scientific knowledge development to technological development is widely recognized as one of the most important and complex aspects of technological evolution. This paper adds to our understanding of the relationship through use of a more rigorous structure for differentiating among technologies based upon technological domains (defined as consisting of the artifacts over time that fulfill a specific generic function using a specific body of technical knowledge).
  • The family of U-statistics plays a fundamental role in statistics. This paper proves a novel exponential inequality for U-statistics under the time series setting. Explicit mixing conditions are given for guaranteeing fast convergence, the bound proves to be analogous to the one under independence, and extension to non-stationary time series is straightforward. The proof relies on a novel decomposition of U-statistics via exploiting the temporal correlatedness structure. Such results are of interest in many fields where high dimensional time series data are present. In particular, applications to high dimensional time series inference are discussed.
  • We present a robust alternative to principal component analysis (PCA) --- called elliptical component analysis (ECA) --- for analyzing high dimensional, elliptically distributed data. ECA estimates the eigenspace of the covariance matrix of the elliptical data. To cope with heavy-tailed elliptical distributions, a multivariate rank statistic is exploited. At the model-level, we consider two settings: either that the leading eigenvectors of the covariance matrix are non-sparse or that they are sparse. Methodologically, we propose ECA procedures for both non-sparse and sparse settings. Theoretically, we provide both non-asymptotic and asymptotic analyses quantifying the theoretical performances of ECA. In the non-sparse setting, we show that ECA's performance is highly related to the effective rank of the covariance matrix. In the sparse setting, the results are twofold: (i) We show that the sparse ECA estimator based on a combinatoric program attains the optimal rate of convergence; (ii) Based on some recent developments in estimating sparse leading eigenvectors, we show that a computationally efficient sparse ECA estimator attains the optimal rate of convergence under a suboptimal scaling.
  • Correlation matrices play a key role in many multivariate methods (e.g., graphical model estimation and factor analysis). The current state-of-the-art in estimating large correlation matrices focuses on the use of Pearson's sample correlation matrix. Although Pearson's sample correlation matrix enjoys various good properties under Gaussian models, it is not an effective estimator when facing heavy-tailed distributions. As a robust alternative, Han and Liu [J. Am. Stat. Assoc. 109 (2015) 275-287] advocated the use of a transformed version of the Kendall's tau sample correlation matrix in estimating high dimensional latent generalized correlation matrix under the transelliptical distribution family (or elliptical copula). The transelliptical family assumes that after unspecified marginal monotone transformations, the data follow an elliptical distribution. In this paper, we study the theoretical properties of the Kendall's tau sample correlation matrix and its transformed version proposed in Han and Liu [J. Am. Stat. Assoc. 109 (2015) 275-287] for estimating the population Kendall's tau correlation matrix and the latent Pearson's correlation matrix under both spectral and restricted spectral norms. With regard to the spectral norm, we highlight the role of "effective rank" in quantifying the rate of convergence. With regard to the restricted spectral norm, we for the first time present a "sign sub-Gaussian condition" which is sufficient to guarantee that the rank-based correlation matrix estimator attains the fast rate of convergence. In both cases, we do not need any moment condition.
  • We propose a bootstrap-based robust high-confidence level upper bound (Robust H-CLUB) for assessing the risks of large portfolios. The proposed approach exploits rank-based and quantile-based estimators, and can be viewed as a robust extension of the H-CLUB method (Fan et al., 2015). Such an extension allows us to handle possibly misspecified models and heavy-tailed data. Under mixing conditions, we analyze the proposed approach and demonstrate its advantage over the H-CLUB. We further provide thorough numerical results to back up the developed theory. We also apply the proposed method to analyze a stock market dataset.
  • Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that cannot be detected with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This article gives an overview of the salient features of Big Data and of how these features drive paradigm changes in statistical and computational methods as well as in computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogenous assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity. They can lead to wrong statistical inferences and consequently wrong scientific conclusions.
  • The vector autoregressive (VAR) model is a powerful tool in modeling complex time series and has been exploited in many fields. However, fitting a high dimensional VAR model poses some unique challenges: on the one hand, the dimensionality, caused by modeling a large number of time series and higher order autoregressive processes, is usually much higher than the time series length; on the other hand, the temporal dependence structure in the VAR model gives rise to extra theoretical challenges. In high dimensions, one popular approach is to assume the transition matrix is sparse and fit the VAR model using the "least squares" method with a lasso-type penalty. In this manuscript, we propose an alternative way of estimating the VAR model. The main idea is, by exploiting the temporal dependence structure, to formulate the estimation problem as a linear program. The proposed approach has an immediate advantage over the lasso-type estimators: the estimation equation can be decomposed into multiple sub-equations and accordingly can be efficiently solved in a parallel fashion. In addition, our method brings new theoretical insights into the VAR model analysis. So far the theoretical results developed in high dimensions (e.g., Song and Bickel (2011) and Kock and Callot (2012)) mainly pose assumptions on the design matrix of the formulated regression problems. Such conditions are indirect about the transition matrices and not transparent. In contrast, our results show that the operator norm of the transition matrices plays an important role in estimation accuracy. We provide explicit rates of convergence for both estimation and prediction. In addition, we provide thorough experiments on both synthetic and real-world equity data to show that there are empirical advantages of our method over the lasso-type estimators in both parameter estimation and forecasting.
  • Statisticians and quantitative neuroscientists have actively promoted the use of independence relationships for investigating brain networks, genomic networks, and other measurement technologies. Estimation of these graphs depends on two steps. The first is feature extraction, which summarizes measurements within a parcellation, regional, or set definition to create nodes. In the second, these summaries are used to create a graph representing relationships of interest. In this manuscript we study the impact of dimension reduction on graphs that describe different notions of relations among a set of random variables. We are particularly interested in undirected graphs that capture the random variables' independence and conditional independence relations. A dimension reduction procedure can be any mapping from high dimensional spaces to low dimensional spaces. We exploit a general framework for modeling the raw data and advocate that in estimating the undirected graphs, any acceptable dimension reduction procedure should be a graph-homotopic mapping, i.e., the graphical structure of the data after dimension reduction should inherit the main characteristics of the graphical structure of the raw data. We show that, in terms of inferring undirected graphs that characterize the conditional independence relations among random variables, many dimension reduction procedures, such as the mean, median, or principal components, cannot be theoretically guaranteed to be a graph-homotopic mapping. The implications of this work are broad. In the most charitable setting for researchers, where the correct node definition is known, graphical relationships can be contaminated merely via the dimension reduction. The manuscript ends with a concrete example, characterizing a subset of graphical structures such that the dimension reduction procedure using the principal components can be a graph-homotopic mapping.
  • In this manuscript we consider the problem of jointly estimating multiple graphical models in high dimensions. We assume that the data are collected from n subjects, each of which consists of T possibly dependent observations. The graphical models of subjects vary, but are assumed to change smoothly corresponding to a measure of closeness between subjects. We propose a kernel based method for jointly estimating all graphical models. Theoretically, under a double asymptotic framework, where both (T,n) and the dimension d can increase, we provide the explicit rate of convergence in parameter estimation. It characterizes the strength one can borrow across different individuals and impact of data dependence on parameter estimation. Empirically, experiments on both synthetic and real resting state functional magnetic resonance imaging (rs-fMRI) data illustrate the effectiveness of the proposed method.
  • We propose a new high dimensional semiparametric principal component analysis (PCA) method, named Copula Component Analysis (COCA). The semiparametric model assumes that, after unspecified marginally monotone transformations, the distributions are multivariate Gaussian. COCA improves upon PCA and sparse PCA in three aspects: (i) It is robust to modeling assumptions; (ii) It is robust to outliers and data contamination; (iii) It is scale-invariant and yields more interpretable results. We prove that the COCA estimators obtain fast estimation rates and are feature selection consistent when the dimension is nearly exponentially large relative to the sample size. Careful experiments confirm that COCA outperforms sparse PCA on both synthetic and real-world datasets.
  • In this manuscript a unified framework for conducting inference on complex aggregated data in high dimensional settings is proposed. The data are assumed to be a collection of multiple non-Gaussian realizations with underlying undirected graphical structures. Utilizing the concept of median graphs in summarizing the commonality across these graphical structures, a novel semiparametric approach to modeling such complex aggregated data is provided along with robust estimation of the median graph, which is assumed to be sparse. The estimator is proved to be consistent in graph recovery and an upper bound on the rate of convergence is given. Experiments on both synthetic and real datasets are conducted to illustrate the empirical usefulness of the proposed models and methods.
  • We study sparse principal component analysis for high dimensional vector autoregressive time series under a doubly asymptotic framework, which allows the dimension $d$ to scale with the series length $T$. We treat the transition matrix of time series as a nuisance parameter and directly apply sparse principal component analysis on multivariate time series as if the data are independent. We provide explicit non-asymptotic rates of convergence for leading eigenvector estimation and extend this result to principal subspace estimation. Our analysis illustrates that the spectral norm of the transition matrix plays an essential role in determining the final rates. We also characterize sufficient conditions under which sparse principal component analysis attains the optimal parametric rate. Our theoretical results are backed up by thorough numerical studies.
  • We study the feasibility of a sterile neutrino search at the China Advanced Research Reactor by measuring the $\bar {\nu}_e$ survival probability with a baseline of less than 15 m. Both hydrogen and deuteron have been considered as potential targets. The sensitivity to sterile-to-regular neutrino mixing is investigated under the "3(active)+1(sterile)" framework. We find that the mixing parameter $\sin^2(2\theta_{14})$ can be severely constrained by such a measurement if the mass-squared difference $\Delta m_{14}^2$ is of the order of 1 eV$^2$.
  • In this paper, we propose a semiparametric approach, named nonparanormal skeptic, for efficiently and robustly estimating high dimensional undirected graphical models. To achieve modeling flexibility, we consider Gaussian copula graphical models (or the nonparanormal) as proposed by Liu et al. (2009). To achieve estimation robustness, we exploit nonparametric rank-based correlation coefficient estimators, including Spearman's rho and Kendall's tau. In high dimensional settings, we prove that the nonparanormal skeptic achieves the optimal parametric rate of convergence in both graph and parameter estimation. This notable result suggests that the Gaussian copula graphical models can be used as a safe replacement for the popular Gaussian graphical models, even when the data are truly Gaussian. Besides theoretical analysis, we also conduct thorough numerical simulations to compare different estimators for their graph recovery performance under both ideal and noisy settings. The proposed methods are then applied to a large-scale genomic dataset to illustrate their empirical usefulness. The R language software package huge implementing the proposed methods is available on the Comprehensive R Archive Network: http://cran.r-project.org/.
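
As a rough illustration of the rank-based correlation step mentioned in the nonparanormal skeptic abstract above, the sketch below computes Kendall's tau for each pair of variables and maps it to a correlation estimate. The mapping sin(pi/2 * tau) is the standard Kendall's-tau-to-correlation relation under a Gaussian copula; the abstract does not spell it out, so its use here is an assumption. The function names `kendall_tau` and `skeptic_correlation` are hypothetical and are not the API of the huge package. The later step of estimating a sparse graph from this matrix is omitted.

```python
import math


def kendall_tau(x, y):
    """Kendall's tau (tau-a, no tie correction) for two equal-length sequences."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie
            s += (prod > 0) - (prod < 0)
    return 2.0 * s / (n * (n - 1))


def skeptic_correlation(columns):
    """Rank-based correlation matrix from a list of variables (one list per variable).

    Off-diagonal entries are sin(pi/2 * tau); diagonal entries are set to 1.
    """
    d = len(columns)
    corr = [[1.0] * d for _ in range(d)]
    for j in range(d):
        for k in range(j + 1, d):
            r = math.sin(math.pi / 2 * kendall_tau(columns[j], columns[k]))
            corr[j][k] = corr[k][j] = r
    return corr


if __name__ == "__main__":
    x = [0.1, 0.5, 0.3, 0.9, 0.7]
    z = [1.0, 2.0, 4.0, 3.0, 5.0]
    # A monotone transformation of x (here exp) leaves the estimate unchanged,
    # because Kendall's tau depends only on the ordering of the observations.
    print(skeptic_correlation([x, z]))
    print(skeptic_correlation([[math.exp(v) for v in x], z]))
```

Because the estimate depends on the data only through ranks, it is unaffected by unspecified monotone marginal transformations, which is the property the abstract relies on for robustness.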