-
Although a generalized spike population model has been actively studied in
random matrix theory, its application to real data has rarely been explored. We
find that most methods for determining the number of spikes based on
Johnstone's spike population model choose far too many spikes in RNA-seq gene
expression data or often fail to determine the number of spikes by indicating
that all components are spikes. In this paper, we propose a new algorithm for
the estimation of the number of spikes based on a generalized spike population
model. Also, we suggest a new noise model for RNA-seq data based on population
spectral distribution ideas, which provides a biologically reasonable number of
spikes using the proposed algorithm. Furthermore, we propose a graphical tool
for assessing the performance of the underlying noise model.
-
Integrative analysis of disparate data blocks measured on a common set of
experimental subjects is a major challenge in modern data analysis. This data
structure naturally motivates the simultaneous exploration of the joint and
individual variation within each data block, resulting in new insights. For
instance, there is a strong desire to integrate the multiple genomic data sets
in The Cancer Genome Atlas to characterize the common and also the unique
aspects of cancer genetics and cell biology for each source. In this paper we
introduce Angle-Based Joint and Individual Variation Explained, which captures
both joint and individual variation within each data block. This is a major
improvement over earlier approaches to this challenge in terms of a new
conceptual understanding, much better adaptation to data heterogeneity and a fast
linear algebra computation. Important mathematical contributions are the use of
score subspaces as the principal descriptors of variation structure and the use
of perturbation theory as the guide for variation segmentation. This leads to
an exploratory data analysis method which is insensitive to the heterogeneity
among data blocks and does not require separate normalization. An application
to cancer data reveals different behaviors of each type of signal in
characterizing tumor subtypes. An application to a mortality data set reveals
interesting historical lessons. Software and data are available at GitHub
<https://github.com/MeileiJiang/AJIVE_Project>.
-
Recent interest in treespaces as well-founded mathematical domains for
phylogenetic inference and statistical analysis for populations of anatomical
trees has motivated research into efficient and rigorous methods for
optimization problems on treespaces. A central problem in this area is
computing an average of phylogenetic trees, which is equivalently characterized
as the minimizer of the Fr\'echet function. The Fr\'echet mean can be used for
statistical inference and exploratory data analysis: for example it can be
leveraged as a test statistic to compare groups via permutation tests, or to
find trends in data over time via kernel smoothing. By analyzing the
differential properties of the Fr\'echet function along geodesics in treespace,
we obtain a theorem describing a decomposition of the derivative along a
geodesic. This decomposition theorem is used to formulate optimality conditions
which are used as a logical basis for an algorithm to verify relative
optimality at points where the Fr\'echet function gradient does not exist.
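
For reference, the Fr\'echet function mentioned above can be written as follows. The notation is assumed here for illustration and is not taken from the abstract: $T_1, \dots, T_n$ are the sample trees, $d$ is the geodesic distance on treespace, and the sum is left unnormalized (a constant factor such as $1/n$ does not change the minimizer).

```latex
F(X) = \sum_{i=1}^{n} d(X, T_i)^2,
\qquad
\bar{T} = \operatorname*{arg\,min}_{X} F(X).
```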
-
We illustrate the advantages of distance weighted discrimination for
classification and feature extraction in a High Dimension Low Sample Size
(HDLSS) situation. The HDLSS context is a gender classification problem of face
images in which the dimension of the data is several orders of magnitude larger
than the sample size. We compare distance weighted discrimination with Fisher's
linear discriminant, support vector machines, and principal component analysis
by exploring their classification interpretation through insightful
visuanimations and by examining the classifiers' discriminant errors. This
analysis enables us to make new contributions to the understanding of the
drivers of human discrimination between males and females.
-
Motivated by the challenge of using DNA-seq data to identify viruses in human
blood samples, we propose a novel classification algorithm called "Radial
Distance Weighted Discrimination" (or Radial DWD). This classifier is designed
for binary classification, assuming one class is surrounded by the other class
in very diverse radial directions, which is seen to be typical for our virus
detection data. This separation of the two classes in multiple radial directions
naturally motivates the development of Radial DWD. While classical machine
learning methods such as the Support Vector Machine and linear Distance
Weighted Discrimination can sometimes give reasonable answers for a given data
set, their generalizability is severely compromised because of the linear
separating boundary. Radial DWD addresses this challenge by using a more
appropriate (in this particular case) spherical separating boundary.
Simulations show that for appropriate radial contexts, this gives much better
generalizability than linear methods, and also much better than conventional
kernel based (nonlinear) Support Vector Machines, because the latter methods
essentially use much of the information in the data for determining the shape
of the separating boundary. The effectiveness of Radial DWD is demonstrated for
real virus detection.
-
The abundance of functional observations in scientific endeavors has led to a
significant development in tools for functional data analysis (FDA). This kind
of data comes with several challenges: infinite-dimensionality of function
spaces, observation noise, and so on. However, there is another interesting
phenomenon that creates problems in FDA. Functional data often come with
lateral displacements/deformations in curves, a phenomenon which is different
from the height or amplitude variability and is termed phase variation. The
presence of phase variability often artificially inflates data variance, blurs
underlying data structures, and distorts principal components. While the
separation and/or removal of phase from amplitude data is desirable, this is a
difficult problem. In particular, a commonly used alignment procedure, based on
minimizing the $\mathbb{L}^2$ norm between functions, does not provide
satisfactory results. In this paper we motivate the importance of dealing with
the phase variability and summarize several current ideas for separating phase
and amplitude components. These approaches differ in the following: (1) the
definition and mathematical representation of phase variability, (2) the
objective functions that are used in functional data alignment, and (3) the
algorithmic tools for solving estimation/optimization problems. We use simple
examples to illustrate various approaches and to provide useful contrast
between them.
-
Cluster analysis has proved to be an invaluable tool for the exploratory and
unsupervised analysis of high dimensional datasets. Among methods for
clustering, hierarchical approaches have enjoyed substantial popularity in
genomics and other fields for their ability to simultaneously uncover multiple
layers of clustering structure. A critical and challenging question in cluster
analysis is whether the identified clusters represent important underlying
structure or are artifacts of natural sampling variation. Few approaches have
been proposed for addressing this problem in the context of hierarchical
clustering, for which the problem is further complicated by the natural tree
structure of the partition, and the multiplicity of tests required to parse the
layers of nested clusters. In this paper, we propose a Monte Carlo based
approach for testing statistical significance in hierarchical clustering which
addresses these issues. The approach is implemented as a sequential testing
procedure guaranteeing control of the family-wise error rate. Theoretical
justification is provided for our approach, and its power to detect true
clustering structure is illustrated through several simulation studies and
applications to two cancer gene expression datasets.
-
Binary classification is a common statistical learning problem in which a
model is estimated on a set of covariates for some outcome indicating the
membership of one of two classes. In the literature, there exists a distinction
between hard and soft classification. In soft classification, the conditional
class probability is modeled as a function of the covariates. In contrast, hard
classification methods only target the optimal prediction boundary. While hard
and soft classification methods have been studied extensively, not much work
has been done to compare the actual tasks of hard and soft classification. In
this paper we propose a spectrum of statistical learning problems which span
the hard and soft classification tasks based on fitting multiple decision rules
to the data. By doing so, we reveal a novel collection of learning tasks of
increasing complexity. We study the problems using the framework of
large-margin classifiers and a class of piecewise linear convex surrogates, for
which we derive statistical properties and a corresponding sub-gradient descent
algorithm. We conclude by applying our approach to simulation settings and a
magnetic resonance imaging (MRI) dataset from the Alzheimer's Disease
Neuroimaging Initiative (ADNI) study.
-
The goal of this article is to select important variables that can
distinguish one class of data from another. A marginal variable selection
method ranks the marginal effects for classification of individual variables,
and is a useful and efficient approach for variable selection. Our focus here
is to consider the bivariate effect, in addition to the marginal effect. In
particular, we are interested in those pairs of variables that can lead to
accurate classification predictions when they are viewed jointly. To accomplish
this, we propose a permutation test called Significance test of Joint Effect
(SigJEff). In the absence of joint effects in the data, SigJEff is similar or
equivalent to many marginal methods. However, when joint effects exist, our
method can significantly boost the performance of variable selection. Such
joint effects can help to provide additional, and sometimes dominating,
advantage for classification. We illustrate and validate our approach using
both a simulated example and a real glioblastoma multiforme data set, which
provide promising results.
-
Principal Components Analysis (PCA) is a common way to study the sources of
variation in a high-dimensional data set. Typically, the leading principal
components are used to understand the variation in the data or to reduce the
dimension of the data for subsequent analysis. The remaining principal
components are ignored since they explain little of the variation in the data.
However, evolutionary biologists gain important insights from these low
variation directions. Specifically, they are interested in directions of low
genetic variability that are biologically interpretable. These directions are
called genetic constraints and indicate directions in which a trait cannot
evolve through selection. Here, we propose studying the subspace spanned by low
variance principal components by determining vectors in this subspace that are
simplest. Our method and accompanying graphical displays enhance the
biologist's ability to visualize the subspace and identify interpretable
directions of low genetic variability that align with simple directions.
-
Given a probability distribution on an open book (a metric space obtained by
gluing a disjoint union of copies of a half-space along their boundary
hyperplanes), we define a precise concept of when the Fr\'{e}chet mean
(barycenter) is sticky. This nonclassical phenomenon is quantified by a law of
large numbers (LLN) stating that the empirical mean eventually almost surely
lies on the (codimension $1$ and hence measure $0$) spine that is the glued
hyperplane, and a central limit theorem (CLT) stating that the limiting
distribution is Gaussian and supported on the spine. We also state versions of
the LLN and CLT for the cases where the mean is nonsticky (i.e., not lying on
the spine) and partly sticky (i.e., on the spine but not sticky).
-
A general asymptotic framework is developed for studying consistency
properties of principal component analysis (PCA). Our framework includes
several previously studied domains of asymptotics as special cases and allows
one to investigate interesting connections and transitions among the various
domains. More importantly, it enables us to investigate asymptotic scenarios
that have not been considered before, and gain new insights into the
consistency, subspace consistency and strong inconsistency regions of PCA and
the boundaries among them. We also establish the corresponding convergence rate
within each region. Under general spike covariance models, the dimension (or
the number of variables) discourages the consistency of PCA, while the sample
size and spike information (the relative size of the population eigenvalues)
encourage PCA consistency. Our framework nicely illustrates the relationship
among these three types of information in terms of dimension, sample size and
spike size, and rigorously characterizes how their relationships affect PCA
consistency.
-
Motivated by the analysis of nonnegative data objects, a novel Nested
Nonnegative Cone Analysis (NNCA) approach is proposed to overcome some
drawbacks of existing methods. Applying traditional PCA/SVD methods to
nonnegative data often causes the approximation matrix to leave the nonnegative
cone, which leads to non-interpretable and sometimes nonsensical results. The
nonnegative matrix factorization (NMF) approach overcomes this issue; however,
the NMF approximation matrices suffer from several drawbacks: 1) the factorization
may not be unique, 2) the resulting approximation matrix at a specific rank may
not be unique, and 3) the subspaces spanned by the approximation matrices at
different ranks may not be nested. These drawbacks cause difficulties in
determining the number of components and in multi-scale (in ranks)
interpretability. The NNCA approach proposed in this paper naturally generates
a nested structure, and is shown to be unique at each rank. Simulations are
used in this paper to illustrate the drawbacks of the traditional methods, and
the usefulness of the NNCA method.
-
Clustering methods have led to a number of important discoveries in
bioinformatics and beyond. A major challenge in their use is determining which
clusters represent important underlying structure, as opposed to spurious
sampling artifacts. This challenge is especially serious, and very few methods
are available, when the data are very high in dimension. Statistical
Significance of Clustering (SigClust) is a recently developed cluster
evaluation tool for high dimensional low sample size data. An important
component of the SigClust approach is the very definition of a single cluster
as a subset of data sampled from a multivariate Gaussian distribution. The
implementation of SigClust requires the estimation of the eigenvalues of the
covariance matrix for the null multivariate Gaussian distribution. We show that
the original eigenvalue estimation can lead to a test that suffers from severe
inflation of type-I error, in the important case where there are huge single
spikes in the eigenvalues. This paper addresses this critical challenge using a
novel likelihood based soft thresholding approach to estimate these eigenvalues
which leads to a much improved SigClust. These major improvements in SigClust
performance are shown by both theoretical work and an extensive simulation
study. Applications to some cancer genomic data further demonstrate the
usefulness of these improvements.
-
Research in several fields now requires the analysis of data sets in which
multiple high-dimensional types of data are available for a common set of
objects. In particular, The Cancer Genome Atlas (TCGA) includes data from
several diverse genomic technologies on the same cancerous tumor samples. In
this paper we introduce Joint and Individual Variation Explained (JIVE), a
general decomposition of variation for the integrated analysis of such data
sets. The decomposition consists of three terms: a low-rank approximation
capturing joint variation across data types, low-rank approximations for
structured variation individual to each data type, and residual noise. JIVE
quantifies the amount of joint variation between data types, reduces the
dimensionality of the data and provides new directions for the visual
exploration of joint and individual structures. The proposed method represents
an extension of Principal Component Analysis and has clear advantages over
popular two-block methods such as Canonical Correlation Analysis and Partial
Least Squares. A JIVE analysis of gene expression and miRNA data on
Glioblastoma Multiforme tumor samples reveals gene-miRNA associations and
provides better characterization of tumor types. Data and software are
available at https://genome.unc.edu/jive/
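
The three-term decomposition described above can be written schematically as follows. The symbols are assumptions introduced for illustration: $X_i$ is the $i$-th of $k$ data blocks, $J_i$ is its part of the low-rank joint structure, $A_i$ is its low-rank individual structure, and $E_i$ is residual noise.

```latex
X_i = J_i + A_i + E_i, \qquad i = 1, \dots, k.
```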
-
There are often two important types of variation in functional data: the
horizontal (or phase) variation and the vertical (or amplitude) variation.
These two types of variation have been appropriately separated and modeled
through a domain warping method (or curve registration) based on the Fisher-Rao
metric. This paper focuses on the analysis of the horizontal variation,
captured by the domain warping functions. The square-root velocity function
representation transforms the manifold of the warping functions to a Hilbert
sphere. Motivated by recent results on manifold analogs of principal component
analysis, we propose to analyze the horizontal variation via a Principal Nested
Spheres approach. Compared with earlier approaches, such as approximating
tangent plane principal component analysis, this is seen to be the most
efficient and interpretable approach to decompose the horizontal variation in
some examples.
-
Diffusion tensor imaging provides important information on tissue structure
and orientation of fiber tracts in brain white matter in vivo. It results in
diffusion tensors, which are $3\times3$ symmetric positive definite (SPD)
matrices, along fiber bundles. This paper develops a functional data analysis
framework to model diffusion tensors along fiber tracts as functional data in a
Riemannian manifold with a set of covariates of interest, such as age and
gender. We propose a statistical model with varying coefficient functions to
characterize the dynamic association between functional SPD matrix-valued
responses and covariates. We calculate weighted least squares estimators of the
varying coefficient functions for the log-Euclidean metric in the space of SPD
matrices. We also develop a global test statistic to test specific hypotheses
about these coefficient functions and construct their simultaneous confidence
bands. Simulated data are further used to examine the finite sample performance
of the estimated varying coefficient functions. We apply our model to study
potential gender differences and find a statistically significant aspect of the
development of diffusion tensors along the right internal capsule tract in a
clinical study of neurodevelopment.
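
As a small illustration of the log-Euclidean metric mentioned above, the sketch below computes the distance between two SPD matrices as the Frobenius norm of the difference of their matrix logarithms. The function names `spd_log` and `log_euclidean_distance` are hypothetical, and this is not the estimation procedure of the paper, only the underlying distance.

```python
import numpy as np


def spd_log(matrix):
    """Matrix logarithm of a symmetric positive definite matrix.

    Uses the eigendecomposition M = V diag(w) V^T, so log(M) = V diag(log w) V^T.
    """
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)
    return (eigenvectors * np.log(eigenvalues)) @ eigenvectors.T


def log_euclidean_distance(a, b):
    """Frobenius norm of log(a) - log(b) for SPD matrices a and b."""
    return float(np.linalg.norm(spd_log(a) - spd_log(b), ord="fro"))


# Toy 3x3 example: log(I) = 0 and log(e * I) = I, so the distance is ||I||_F.
identity = np.eye(3)
scaled = np.e * np.eye(3)
distance = log_euclidean_distance(identity, scaled)
```

Working in the log domain turns the SPD cone into a vector space, which is what makes ordinary weighted least squares applicable to the transformed responses.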
-
The aim of this paper is to establish several deep theoretical properties of
principal component analysis for multiple-component spike covariance models.
Our new results reveal a surprising asymptotic conical structure in critical
sample eigendirections under the spike models with distinguishable (or
indistinguishable) eigenvalues, when the sample size and/or the number of
variables (or dimension) tend to infinity. The consistency of the sample
eigenvectors relative to their population counterparts is determined by the
ratio between the dimension and the product of the sample size with the spike
size. When this ratio converges to a nonzero constant, the sample eigenvector
converges to a cone, with a certain angle to its corresponding population
eigenvector. In the High Dimension, Low Sample Size case, the angle between the
sample eigenvector and its population counterpart converges to a limiting
distribution. Several generalizations of the multi-spike covariance models are
also explored, and additional theoretical results are presented.
-
Object oriented data analysis (OODA) aims at statistically analyzing
populations of complicated objects. This paper is motivated by a study of cell
images in cell culture biology, which highlights a common critical issue:
choice of data objects. Instead of conventionally treating either the
individual cells or the wells (containers in which the cells are grown) as
data objects, a new type of data object is proposed, that is the union of a
well with its corresponding set of cells. This paper contains two parts. The
first part is the image data analysis, which suggests empirically that the
cell-well unions can be a better choice of data objects than the cells or the
wells alone. The second part discusses the benefit of choosing cell-well unions
as data objects using an illustrative example and simulations. This research
suggests that OODA is not simply a framework for understanding the structure
of the data analysis. It leads to useful interdisciplinary discussion that
gives better results through more appropriate choice of data objects,
especially for complex data analyses.
-
Principal component analysis is a useful dimension reduction and data
visualization method. However, in high dimension, low sample size asymptotic
contexts, where the sample size is fixed and the dimension goes to infinity, a
paradox has arisen. In particular, despite the useful real data insights
commonly obtained from principal component score visualization, these scores
are not consistent even when the sample eigen-vectors are consistent. This
paradox is resolved by asymptotic study of the ratio between the sample and
population principal component scores. In particular, it is seen that this
proportion converges to a non-degenerate random variable. The realization is
the same for each data point, i.e. there is a common random rescaling, which
appears for each eigen-direction. This then gives inconsistent axis labels for
the standard scores plot, yet the relative positions of the points (typically
the main visual content) are consistent. This paradox disappears when the
sample size goes to infinity.
-
Drug discovery is the process of identifying compounds which have potentially
meaningful biological activity. A major challenge that arises is that the
number of compounds to search over can be quite large, sometimes numbering in
the millions, making experimental testing intractable. For this reason
computational methods are employed to filter out those compounds which do not
exhibit strong biological activity. This filtering step, also called virtual
screening, reduces the search space, allowing the remaining compounds to be
experimentally tested. In this paper we propose several novel approaches to the
problem of virtual screening based on Canonical Correlation Analysis (CCA) and
on a kernel-based extension. Spectral learning ideas motivate our proposed new
method called Indefinite Kernel CCA (IKCCA). We show the strong performance of
this approach both on a toy problem and on real-world data, with
dramatic improvements in predictive accuracy of virtual screening over an
existing methodology.
-
We introduce a novel geometric framework for separating the phase and the
amplitude variability in functional data of the type frequently studied in
growth curve analysis. This framework uses the Fisher-Rao Riemannian metric to
derive a proper distance on the quotient space of functions modulo the
time-warping group. A convenient square-root velocity function (SRVF)
representation transforms the Fisher-Rao metric into the standard $\mathbb{L}^2$
metric, simplifying the computations. This distance is then used to define a
Karcher mean template and warp the individual functions to align them with the
Karcher mean template. The strength of this framework is demonstrated by
deriving a consistent estimator of a signal observed under random warping,
scaling, and vertical translation. These ideas are demonstrated using both
simulated and real data from different application domains: the Berkeley growth
study, handwritten signature curves, neuroscience spike trains, and gene
expression signals. The proposed method is empirically shown to be superior
in performance to several recently published methods for functional alignment.
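
The SRVF representation mentioned above can be sketched numerically. The code assumes the usual definition $q(t) = \mathrm{sign}(f'(t))\sqrt{|f'(t)|}$ and approximates the derivative by finite differences on a sampling grid; the function name `srvf` and the test curves are hypothetical, and the alignment and Karcher mean steps are not shown.

```python
import numpy as np


def srvf(values, times):
    """Square-root velocity function of a sampled curve.

    The derivative is approximated with finite differences (np.gradient),
    then transformed as sign(f') * sqrt(|f'|).
    """
    derivative = np.gradient(values, times)
    return np.sign(derivative) * np.sqrt(np.abs(derivative))


# For a straight line with slope 2 the SRVF is the constant sqrt(2);
# for slope -2 it is the constant -sqrt(2).
times = np.linspace(0.0, 1.0, 101)
q_increasing = srvf(2.0 * times, times)
q_decreasing = srvf(-2.0 * times, times)
```

Because the SRVF depends only on the derivative, adding a constant to a curve (vertical translation) leaves its SRVF unchanged.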
-
We propose a new approach to analyze data that naturally lie on manifolds. We
focus on a special class of manifolds, called direct product manifolds, whose
intrinsic dimension could be very high. Our method finds a low-dimensional
representation of the manifold that can be used to find and visualize the
principal modes of variation of the data, as Principal Component Analysis (PCA)
does in linear spaces. The proposed method improves upon earlier manifold
extensions of PCA by more concisely capturing important nonlinear modes. For
the special case of data on a sphere, variation following nongeodesic arcs is
captured in a single mode, compared to the two modes needed by previous
methods. Several computational and statistical challenges are resolved. The
development on spheres forms the basis of principal arc analysis on more
complicated manifolds. The benefits of the method are illustrated by a data
example using medial representations in image analysis.
-
In this paper we examine rigorously the evidence for dependence among data
size, transfer rate and duration in Internet flows. We emphasize two
statistical approaches for studying dependence: Pearson's correlation
coefficient and the extremal dependence analysis method. We apply these methods
to large data sets of packet traces from three networks. Our major results show
that Pearson's correlation coefficients between size and duration are much
smaller than one might expect. We also find that correlation coefficients
between size and rate are generally small and can be strongly affected by
applying thresholds to size or duration. Based on Transmission Control Protocol
connection startup mechanisms, we argue that thresholds on size should be more
useful than thresholds on duration in the analysis of correlations. Using
extremal dependence analysis, we draw a similar conclusion, finding remarkable
independence for extremal values of size and rate.
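
The effect of thresholding on a correlation coefficient can be explored with a sketch like the one below. It rests on assumptions not stated in the abstract: rate is taken to be size divided by duration, the flows are synthetic draws rather than real packet traces, and the helper names and the threshold value are hypothetical. No particular numerical outcome is implied.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def size_rate_correlation(flow_size, flow_duration, min_size=0.0):
    """Correlation between size and rate for flows with size >= min_size.

    Returns the coefficient and the number of flows that pass the threshold.
    """
    keep = flow_size >= min_size
    rate = flow_size[keep] / flow_duration[keep]  # assumed definition of rate
    return pearson(flow_size[keep], rate), int(keep.sum())


# Synthetic flows: heavy-tailed sizes and lognormal durations (illustrative only).
rng = np.random.default_rng(0)
flow_size = rng.pareto(1.5, 5000) + 1.0
flow_duration = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

r_all, n_all = size_rate_correlation(flow_size, flow_duration)
r_large, n_large = size_rate_correlation(flow_size, flow_duration, min_size=5.0)
```

Comparing `r_all` with `r_large` shows how much a size threshold alone can move the coefficient on a given data set.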
-
Principal Component Analysis (PCA) is an important tool for dimension
reduction, especially when the dimension (or the number of variables) is very
high. Asymptotic studies where the sample size is fixed, and the dimension
grows [i.e., High Dimension, Low Sample Size (HDLSS)] are becoming increasingly
relevant. We investigate the asymptotic behavior of the Principal Component
(PC) directions. HDLSS asymptotics are used to study consistency, strong
inconsistency and subspace consistency. We show that if the first few
eigenvalues of a population covariance matrix are large enough compared to the
others, then the corresponding estimated PC directions are consistent or
converge to the appropriate subspace (subspace consistency) and most other PC
directions are strongly inconsistent. Broad sets of sufficient conditions for
each of these cases are specified and the main theorem gives a catalogue of
possible combinations. In preparation for these results, we show that the
geometric representation of HDLSS data holds under general conditions, which
includes a $\rho$-mixing condition and a broad range of sphericity measures of
the covariance matrix.