• Although a generalized spike population model has been actively studied in random matrix theory, its application to real data has rarely been explored. We find that most methods for determining the number of spikes based on Johnstone's spike population model choose far too many spikes in RNA-seq gene expression data, or often fail to determine the number of spikes by indicating that all components are spikes. In this paper, we propose a new algorithm for the estimation of the number of spikes based on a generalized spike population model. Also, we suggest a new noise model for RNA-seq data based on population spectral distribution ideas, which provides a biologically reasonable number of spikes using the proposed algorithm. Furthermore, we propose a graphical tool for assessing the performance of the underlying noise model.
  • Integrative analysis of disparate data blocks measured on a common set of experimental subjects is a major challenge in modern data analysis. This data structure naturally motivates the simultaneous exploration of the joint and individual variation within each data block, resulting in new insights. For instance, there is a strong desire to integrate the multiple genomic data sets in The Cancer Genome Atlas to characterize the common and also the unique aspects of cancer genetics and cell biology for each source. In this paper we introduce Angle-Based Joint and Individual Variation Explained, which captures both joint and individual variation within each data block. This is a major improvement over earlier approaches to this challenge in terms of a new conceptual understanding, much better adaptation to data heterogeneity and a fast linear algebra computation. Important mathematical contributions are the use of score subspaces as the principal descriptors of variation structure and the use of perturbation theory as the guide for variation segmentation. This leads to an exploratory data analysis method which is insensitive to the heterogeneity among data blocks and does not require separate normalization. An application to cancer data reveals different behaviors of each type of signal in characterizing tumor subtypes. An application to a mortality data set reveals interesting historical lessons. Software and data are available at GitHub <https://github.com/MeileiJiang/AJIVE_Project>.
  • Recent interest in treespaces as well-founded mathematical domains for phylogenetic inference and statistical analysis for populations of anatomical trees has motivated research into efficient and rigorous methods for optimization problems on treespaces. A central problem in this area is computing an average of phylogenetic trees, which is equivalently characterized as the minimizer of the Fréchet function. The Fréchet mean can be used for statistical inference and exploratory data analysis: for example, it can be leveraged as a test statistic to compare groups via permutation tests, or to find trends in data over time via kernel smoothing. By analyzing the differential properties of the Fréchet function along geodesics in treespace we obtained a theorem describing a decomposition of the derivative along a geodesic. This decomposition theorem is used to formulate optimality conditions which are used as a logical basis for an algorithm to verify relative optimality at points where the Fréchet function gradient does not exist.
  • We illustrate the advantages of distance weighted discrimination for classification and feature extraction in a High Dimension Low Sample Size (HDLSS) situation. The HDLSS context is a gender classification problem of face images in which the dimension of the data is several orders of magnitude larger than the sample size. We compare distance weighted discrimination with Fisher's linear discriminant, support vector machines, and principal component analysis by exploring their classification interpretation through insightful visuanimations and by examining the classifiers' discriminant errors. This analysis enables us to make new contributions to the understanding of the drivers of human discrimination between males and females.
  • Motivated by the challenge of using DNA-seq data to identify viruses in human blood samples, we propose a novel classification algorithm called "Radial Distance Weighted Discrimination" (or Radial DWD). This classifier is designed for binary classification, assuming one class is surrounded by the other class in very diverse radial directions, which is seen to be typical for our virus detection data. This separation of the two classes in multiple radial directions naturally motivates the development of Radial DWD. While classical machine learning methods such as the Support Vector Machine and linear Distance Weighted Discrimination can sometimes give reasonable answers for a given data set, their generalizability is severely compromised because of the linear separating boundary. Radial DWD addresses this challenge by using a more appropriate (in this particular case) spherical separating boundary. Simulations show that for appropriate radial contexts, this gives much better generalizability than linear methods, and also much better than conventional kernel based (nonlinear) Support Vector Machines, because the latter methods essentially use much of the information in the data for determining the shape of the separating boundary. The effectiveness of Radial DWD is demonstrated for real virus detection.
  • The abundance of functional observations in scientific endeavors has led to a significant development in tools for functional data analysis (FDA). This kind of data comes with several challenges: infinite-dimensionality of function spaces, observation noise, and so on. However, there is another interesting phenomenon that creates problems in FDA. Functional data often come with lateral displacements/deformations in curves, a phenomenon which is different from the height or amplitude variability and is termed phase variation. The presence of phase variability often artificially inflates data variance, blurs underlying data structures, and distorts principal components. While the separation and/or removal of phase from amplitude data is desirable, this is a difficult problem. In particular, a commonly used alignment procedure, based on minimizing the $\mathbb{L}^2$ norm between functions, does not provide satisfactory results. In this paper we motivate the importance of dealing with the phase variability and summarize several current ideas for separating phase and amplitude components. These approaches differ in the following: (1) the definition and mathematical representation of phase variability, (2) the objective functions that are used in functional data alignment, and (3) the algorithmic tools for solving estimation/optimization problems. We use simple examples to illustrate various approaches and to provide useful contrast between them.
  • Cluster analysis has proved to be an invaluable tool for the exploratory and unsupervised analysis of high dimensional datasets. Among methods for clustering, hierarchical approaches have enjoyed substantial popularity in genomics and other fields for their ability to simultaneously uncover multiple layers of clustering structure. A critical and challenging question in cluster analysis is whether the identified clusters represent important underlying structure or are artifacts of natural sampling variation. Few approaches have been proposed for addressing this problem in the context of hierarchical clustering, for which the problem is further complicated by the natural tree structure of the partition, and the multiplicity of tests required to parse the layers of nested clusters. In this paper, we propose a Monte Carlo based approach for testing statistical significance in hierarchical clustering which addresses these issues. The approach is implemented as a sequential testing procedure guaranteeing control of the family-wise error rate. Theoretical justification is provided for our approach, and its power to detect true clustering structure is illustrated through several simulation studies and applications to two cancer gene expression datasets.
  • Binary classification is a common statistical learning problem in which a model is estimated on a set of covariates for some outcome indicating the membership of one of two classes. In the literature, there exists a distinction between hard and soft classification. In soft classification, the conditional class probability is modeled as a function of the covariates. In contrast, hard classification methods only target the optimal prediction boundary. While hard and soft classification methods have been studied extensively, not much work has been done to compare the actual tasks of hard and soft classification. In this paper we propose a spectrum of statistical learning problems which span the hard and soft classification tasks based on fitting multiple decision rules to the data. By doing so, we reveal a novel collection of learning tasks of increasing complexity. We study the problems using the framework of large-margin classifiers and a class of piecewise linear convex surrogates, for which we derive statistical properties and a corresponding sub-gradient descent algorithm. We conclude by applying our approach to simulation settings and a magnetic resonance imaging (MRI) dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study.
  • The goal of this article is to select important variables that can distinguish one class of data from another. A marginal variable selection method ranks the marginal effects for classification of individual variables, and is a useful and efficient approach for variable selection. Our focus here is to consider the bivariate effect, in addition to the marginal effect. In particular, we are interested in those pairs of variables that can lead to accurate classification predictions when they are viewed jointly. To accomplish this, we propose a permutation test called Significance test of Joint Effect (SigJEff). In the absence of joint effect in the data, SigJEff is similar or equivalent to many marginal methods. However, when joint effects exist, our method can significantly boost the performance of variable selection. Such joint effects can help to provide additional, and sometimes dominating, advantage for classification. We illustrate and validate our approach using both a simulated example and a real glioblastoma multiforme data set, both of which provide promising results.
  • Principal Components Analysis (PCA) is a common way to study the sources of variation in a high-dimensional data set. Typically, the leading principal components are used to understand the variation in the data or to reduce the dimension of the data for subsequent analysis. The remaining principal components are ignored since they explain little of the variation in the data. However, evolutionary biologists gain important insights from these low variation directions. Specifically, they are interested in directions of low genetic variability that are biologically interpretable. These directions are called genetic constraints and indicate directions in which a trait cannot evolve through selection. Here, we propose studying the subspace spanned by low variance principal components by determining vectors in this subspace that are simplest. Our method and accompanying graphical displays enhance the biologist's ability to visualize the subspace and identify interpretable directions of low genetic variability that align with simple directions.
  • Given a probability distribution on an open book (a metric space obtained by gluing a disjoint union of copies of a half-space along their boundary hyperplanes), we define a precise concept of when the Fréchet mean (barycenter) is sticky. This nonclassical phenomenon is quantified by a law of large numbers (LLN) stating that the empirical mean eventually almost surely lies on the (codimension $1$ and hence measure $0$) spine that is the glued hyperplane, and a central limit theorem (CLT) stating that the limiting distribution is Gaussian and supported on the spine. We also state versions of the LLN and CLT for the cases where the mean is nonsticky (i.e., not lying on the spine) and partly sticky (i.e., on the spine but not sticky).
  • A general asymptotic framework is developed for studying consistency properties of principal component analysis (PCA). Our framework includes several previously studied domains of asymptotics as special cases and allows one to investigate interesting connections and transitions among the various domains. More importantly, it enables us to investigate asymptotic scenarios that have not been considered before, and gain new insights into the consistency, subspace consistency and strong inconsistency regions of PCA and the boundaries among them. We also establish the corresponding convergence rate within each region. Under general spike covariance models, the dimension (or the number of variables) discourages the consistency of PCA, while the sample size and spike information (the relative size of the population eigenvalues) encourage PCA consistency. Our framework nicely illustrates the relationship among these three types of information in terms of dimension, sample size and spike size, and rigorously characterizes how their relationships affect PCA consistency.
  • Motivated by the analysis of nonnegative data objects, a novel Nested Nonnegative Cone Analysis (NNCA) approach is proposed to overcome some drawbacks of existing methods. Applying the traditional PCA/SVD method to nonnegative data often causes the approximation matrix to leave the nonnegative cone, which leads to non-interpretable and sometimes nonsensical results. The nonnegative matrix factorization (NMF) approach overcomes this issue; however, the NMF approximation matrices suffer from several drawbacks: 1) the factorization may not be unique, 2) the resulting approximation matrix at a specific rank may not be unique, and 3) the subspaces spanned by the approximation matrices at different ranks may not be nested. These drawbacks cause trouble in determining the number of components and in multi-scale (in ranks) interpretability. The NNCA approach proposed in this paper naturally generates a nested structure, and is shown to be unique at each rank. Simulations are used in this paper to illustrate the drawbacks of the traditional methods, and the usefulness of the NNCA method.
  • Clustering methods have led to a number of important discoveries in bioinformatics and beyond. A major challenge in their use is determining which clusters represent important underlying structure, as opposed to spurious sampling artifacts. This challenge is especially serious, and very few methods are available when the data are very high in dimension. Statistical Significance of Clustering (SigClust) is a recently developed cluster evaluation tool for high dimensional low sample size data. An important component of the SigClust approach is the very definition of a single cluster as a subset of data sampled from a multivariate Gaussian distribution. The implementation of SigClust requires the estimation of the eigenvalues of the covariance matrix for the null multivariate Gaussian distribution. We show that the original eigenvalue estimation can lead to a test that suffers from severe inflation of type-I error, in the important case where there are huge single spikes in the eigenvalues. This paper addresses this critical challenge using a novel likelihood based soft thresholding approach to estimate these eigenvalues which leads to a much improved SigClust. These major improvements in SigClust performance are shown by both theoretical work and an extensive simulation study. Applications to some cancer genomic data further demonstrate the usefulness of these improvements.
  • Research in several fields now requires the analysis of data sets in which multiple high-dimensional types of data are available for a common set of objects. In particular, The Cancer Genome Atlas (TCGA) includes data from several diverse genomic technologies on the same cancerous tumor samples. In this paper we introduce Joint and Individual Variation Explained (JIVE), a general decomposition of variation for the integrated analysis of such data sets. The decomposition consists of three terms: a low-rank approximation capturing joint variation across data types, low-rank approximations for structured variation individual to each data type, and residual noise. JIVE quantifies the amount of joint variation between data types, reduces the dimensionality of the data and provides new directions for the visual exploration of joint and individual structures. The proposed method represents an extension of Principal Component Analysis and has clear advantages over popular two-block methods such as Canonical Correlation Analysis and Partial Least Squares. A JIVE analysis of gene expression and miRNA data on Glioblastoma Multiforme tumor samples reveals gene-miRNA associations and provides better characterization of tumor types. Data and software are available at https://genome.unc.edu/jive/
  • There are often two important types of variation in functional data: the horizontal (or phase) variation and the vertical (or amplitude) variation. These two types of variation have been appropriately separated and modeled through a domain warping method (or curve registration) based on the Fisher-Rao metric. This paper focuses on the analysis of the horizontal variation, captured by the domain warping functions. The square-root velocity function representation transforms the manifold of the warping functions to a Hilbert sphere. Motivated by recent results on manifold analogs of principal component analysis, we propose to analyze the horizontal variation via a Principal Nested Spheres approach. Compared with earlier approaches, such as approximating tangent plane principal component analysis, this is seen to be the most efficient and interpretable approach to decompose the horizontal variation in some examples.
  • Diffusion tensor imaging provides important information on tissue structure and orientation of fiber tracts in brain white matter in vivo. It results in diffusion tensors, which are $3\times3$ symmetric positive definite (SPD) matrices, along fiber bundles. This paper develops a functional data analysis framework to model diffusion tensors along fiber tracts as functional data in a Riemannian manifold with a set of covariates of interest, such as age and gender. We propose a statistical model with varying coefficient functions to characterize the dynamic association between functional SPD matrix-valued responses and covariates. We calculate weighted least squares estimators of the varying coefficient functions for the log-Euclidean metric in the space of SPD matrices. We also develop a global test statistic to test specific hypotheses about these coefficient functions and construct their simultaneous confidence bands. Simulated data are further used to examine the finite sample performance of the estimated varying coefficient functions. We apply our model to study potential gender differences and find a statistically significant aspect of the development of diffusion tensors along the right internal capsule tract in a clinical study of neurodevelopment.
  • The aim of this paper is to establish several deep theoretical properties of principal component analysis for multiple-component spike covariance models. Our new results reveal a surprising asymptotic conical structure in critical sample eigendirections under the spike models with distinguishable (or indistinguishable) eigenvalues, when the sample size and/or the number of variables (or dimension) tend to infinity. The consistency of the sample eigenvectors relative to their population counterparts is determined by the ratio between the dimension and the product of the sample size with the spike size. When this ratio converges to a nonzero constant, the sample eigenvector converges to a cone, with a certain angle to its corresponding population eigenvector. In the High Dimension, Low Sample Size case, the angle between the sample eigenvector and its population counterpart converges to a limiting distribution. Several generalizations of the multi-spike covariance models are also explored, and additional theoretical results are presented.
  • Object oriented data analysis (OODA) aims at statistically analyzing populations of complicated objects. This paper is motivated by a study of cell images in cell culture biology, which highlights a common critical issue: choice of data objects. Instead of conventionally treating either the individual cells or the wells (a well being a container in which the cells are grown) as data objects, a new type of data object is proposed, namely the union of a well with its corresponding set of cells. This paper contains two parts. The first part is the image data analysis, which suggests empirically that the cell-well unions can be a better choice of data objects than the cells or the wells alone. The second part discusses the benefit of choosing cell-well unions as data objects using an illustrative example and simulations. This research suggests that OODA is not simply a framework for understanding the structure of the data analysis. It leads to useful interdisciplinary discussion that gives better results through more appropriate choice of data objects, especially for complex data analyses.
  • Principal component analysis is a useful dimension reduction and data visualization method. However, in high dimension, low sample size asymptotic contexts, where the sample size is fixed and the dimension goes to infinity, a paradox has arisen. In particular, despite the useful real data insights commonly obtained from principal component score visualization, these scores are not consistent even when the sample eigen-vectors are consistent. This paradox is resolved by asymptotic study of the ratio between the sample and population principal component scores. In particular, it is seen that this proportion converges to a non-degenerate random variable. The realization is the same for each data point, i.e. there is a common random rescaling, which appears for each eigen-direction. This then gives inconsistent axis labels for the standard scores plot, yet the relative positions of the points (typically the main visual content) are consistent. This paradox disappears when the sample size goes to infinity.
  • Drug discovery is the process of identifying compounds which have potentially meaningful biological activity. A major challenge that arises is that the number of compounds to search over can be quite large, sometimes numbering in the millions, making experimental testing intractable. For this reason computational methods are employed to filter out those compounds which do not exhibit strong biological activity. This filtering step, also called virtual screening, reduces the search space, allowing for the remaining compounds to be experimentally tested. In this paper we propose several novel approaches to the problem of virtual screening based on Canonical Correlation Analysis (CCA) and on a kernel-based extension. Spectral learning ideas motivate our proposed new method called Indefinite Kernel CCA (IKCCA). We show the strong performance of this approach both for a toy problem and on real world data, with dramatic improvements in predictive accuracy of virtual screening over an existing methodology.
  • We introduce a novel geometric framework for separating the phase and the amplitude variability in functional data of the type frequently studied in growth curve analysis. This framework uses the Fisher-Rao Riemannian metric to derive a proper distance on the quotient space of functions modulo the time-warping group. A convenient square-root velocity function (SRVF) representation transforms the Fisher-Rao metric into the standard $\mathbb{L}^2$ metric, simplifying the computations. This distance is then used to define a Karcher mean template and warp the individual functions to align them with the Karcher mean template. The strength of this framework is demonstrated by deriving a consistent estimator of a signal observed under random warping, scaling, and vertical translation. These ideas are demonstrated using both simulated and real data from different application domains: the Berkeley growth study, handwritten signature curves, neuroscience spike trains, and gene expression signals. The proposed method is empirically shown to be superior in performance to several recently published methods for functional alignment.
  • We propose a new approach to analyze data that naturally lie on manifolds. We focus on a special class of manifolds, called direct product manifolds, whose intrinsic dimension could be very high. Our method finds a low-dimensional representation of the manifold that can be used to find and visualize the principal modes of variation of the data, as Principal Component Analysis (PCA) does in linear spaces. The proposed method improves upon earlier manifold extensions of PCA by more concisely capturing important nonlinear modes. For the special case of data on a sphere, variation following nongeodesic arcs is captured in a single mode, compared to the two modes needed by previous methods. Several computational and statistical challenges are resolved. The development on spheres forms the basis of principal arc analysis on more complicated manifolds. The benefits of the method are illustrated by a data example using medial representations in image analysis.
  • In this paper we examine rigorously the evidence for dependence among data size, transfer rate and duration in Internet flows. We emphasize two statistical approaches for studying dependence, including Pearson's correlation coefficient and the extremal dependence analysis method. We apply these methods to large data sets of packet traces from three networks. Our major results show that Pearson's correlation coefficients between size and duration are much smaller than one might expect. We also find that correlation coefficients between size and rate are generally small and can be strongly affected by applying thresholds to size or duration. Based on Transmission Control Protocol connection startup mechanisms, we argue that thresholds on size should be more useful than thresholds on duration in the analysis of correlations. Using extremal dependence analysis, we draw a similar conclusion, finding remarkable independence for extremal values of size and rate.
  • Principal Component Analysis (PCA) is an important tool of dimension reduction, especially when the dimension (or the number of variables) is very high. Asymptotic studies where the sample size is fixed and the dimension grows [i.e., High Dimension, Low Sample Size (HDLSS)] are becoming increasingly relevant. We investigate the asymptotic behavior of the Principal Component (PC) directions. HDLSS asymptotics are used to study consistency, strong inconsistency and subspace consistency. We show that if the first few eigenvalues of a population covariance matrix are large enough compared to the others, then the corresponding estimated PC directions are consistent or converge to the appropriate subspace (subspace consistency) and most other PC directions are strongly inconsistent. Broad sets of sufficient conditions for each of these cases are specified and the main theorem gives a catalogue of possible combinations. In preparation for these results, we show that the geometric representation of HDLSS data holds under general conditions, which include a $\rho$-mixing condition and a broad range of sphericity measures of the covariance matrix.
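
The HDLSS consistency statement in the last abstract (a sufficiently large leading eigenvalue makes the first estimated PC direction consistent, while a weak one does not) can be illustrated with a small simulation. The following is only a minimal sketch under assumptions that are not taken from the abstracts: a single-spike Gaussian toy model with covariance diag(d^alpha, 1, ..., 1), a known zero mean (so no centering), and a hypothetical helper name `first_pc_alignment`. It uses the n x n Gram matrix and power iteration, and relies only on the Python standard library.

```python
import math
import random


def first_pc_alignment(d, n=10, alpha=1.5, seed=0):
    """Return |cos(angle)| between the first sample PC direction and e_1.

    Assumed toy model (illustration only): n independent Gaussian rows with
    covariance diag(d**alpha, 1, ..., 1) and a known mean of zero.
    """
    rng = random.Random(seed)
    spike_sd = math.sqrt(d ** alpha)
    # Data matrix X with n rows (samples) and d columns (variables).
    X = [[rng.gauss(0.0, 1.0) * (spike_sd if j == 0 else 1.0)
          for j in range(d)]
         for _ in range(n)]
    # n x n Gram matrix G = X X^T; if v is its top eigenvector,
    # then X^T v is proportional to the first sample PC direction.
    G = [[sum(a * b for a, b in zip(X[i], X[k])) for k in range(n)]
         for i in range(n)]
    # Power iteration for the leading eigenvector of G.
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(200):
        w = [sum(G[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Map back to the d-dimensional direction u = X^T v and normalize.
    u = [sum(v[i] * X[i][j] for i in range(n)) for j in range(d)]
    u_norm = math.sqrt(sum(x * x for x in u))
    return abs(u[0]) / u_norm


if __name__ == "__main__":
    for alpha in (0.5, 1.5):
        for d in (100, 1000):
            print(alpha, d, round(first_pc_alignment(d, alpha=alpha), 3))
```

In this toy setting, with the sample size held fixed, the alignment for the strong spike (alpha = 1.5) should move toward 1 as d grows, while the weak spike (alpha = 0.5) should give a visibly smaller alignment. The d^alpha parametrization is a common convention in the HDLSS literature and is used here only for illustration.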