• With the rapid growth of modern technology, many large-scale biomedical studies have been/are being/will be conducted to collect massive datasets with large volumes of multi-modality imaging, genetic, neurocognitive, and clinical information from increasingly large cohorts. Simultaneously extracting and integrating rich and diverse heterogeneous information in neuroimaging and/or genomics from these big datasets could transform our understanding of how genetic variants impact brain structure and function, cognitive function, and brain-related disease risk across the lifespan. Such understanding is critical for diagnosis, prevention, and treatment of numerous complex brain-related disorders (e.g., schizophrenia and Alzheimer). However, the development of analytical methods for the joint analysis of both high-dimensional imaging phenotypes and high-dimensional genetic data, called big data squared (BD$^2$), presents major computational and theoretical challenges for existing analytical methods. Besides the high-dimensional nature of BD$^2$, various neuroimaging measures often exhibit strong spatial smoothness and dependence and genetic markers may have a natural dependence structure arising from linkage disequilibrium. We review some recent developments of various statistical techniques for the joint analysis of BD$^2$, including massive univariate and voxel-wise approaches, reduced rank regression, mixture models, and group sparse multi-task regression. By doing so, we hope that this review may encourage others in the statistical community to enter into this new and exciting field of research.
  • This paper investigates the central limit theorem for linear spectral statistics of high dimensional sample covariance matrices of the form $\mathbf{B}_n=n^{-1}\sum_{j=1}^{n}\mathbf{Q}\mathbf{x}_j\mathbf{x}_j^{*}\mathbf{Q}^{*}$ where $\mathbf{Q}$ is a nonrandom matrix of dimension $p\times k$, and $\{\mathbf{x}_j\}$ is a sequence of independent $k$-dimensional random vector with independent entries, under the assumption that $p/n\to y>0$. A key novelty here is that the dimension $k\ge p$ can be arbitrary, possibly infinity. This new model of sample covariance matrices $\mathbf{B}_n$ covers most of the known models as its special cases. For example, standard sample covariance matrices are obtained with $k=p$ and $\mathbf{Q}=\mathbf{T}_n^{1/2}$ for some positive definite Hermitian matrix $\mathbf{T}_n$. Also with $k=\infty$ our model covers the case of repeated linear processes considered in recent high-dimensional time series literature. The CLT found in this paper substantially generalizes the seminal CLT in Bai and Silverstein (2004). Applications of this new CLT are proposed for testing the structure of a high-dimensional covariance matrix. The derived tests are then used to analyse a large fMRI data set regarding its temporary correlation structure.
  • Ambulatory cardiovascular (CV) measurements provide valuable insights into individuals' health conditions in "real-life," everyday settings. Current methods of modeling ambulatory CV data do not consider the dynamic characteristics of the full data set and their relationships with covariates such as caffeine use and stress. We propose a stochastic differential equation (SDE) in the form of a dual nonlinear Ornstein--Uhlenbeck (OU) model with person-specific covariates to capture the morning surge and nighttime dipping dynamics of ambulatory CV data. To circumvent the data analytic constraint that empirical measurements are typically collected at irregular and much larger time intervals than those evaluated in simulation studies of SDEs, we adopt a Bayesian approach with a regularized Brownian Bridge sampler (RBBS) and an efficient multiresolution (MR) algorithm to fit the proposed SDE. The MR algorithm can produce more efficient MCMC samples that is crucial for valid parameter estimation and inference. Using this model and algorithm to data from the Duke Behavioral Investigation of Hypertension Study, results indicate that age, caffeine intake, gender and race have effects on distinct dynamic characteristics of the participants' CV trajectories.
  • The aim of this paper is to develop a class of spatial transformation models (STM) to spatially model the varying association between imaging measures in a three-dimensional (3D) volume (or 2D surface) and a set of covariates. Our STMs include a varying Box-Cox transformation model for dealing with the issue of non-Gaussian distributed imaging data and a Gaussian Markov Random Field model for incorporating spatial smoothness of the imaging data. Posterior computation proceeds via an efficient Markov chain Monte Carlo algorithm. Simulations and real data analysis demonstrate that the STM significantly outperforms the voxel-wise linear model with Gaussian noise in recovering meaningful geometric patterns. Our STM is able to reveal important brain regions with morphological changes in children with attention deficit hyperactivity disorder.
  • Identification of regions of interest (ROI) associated with certain disease has a great impact on public health. Imposing sparsity of pixel values and extracting active regions simultaneously greatly complicate the image analysis. We address these challenges by introducing a novel region-selection penalty in the framework of image-on-scalar regression. Our penalty combines the Smoothly Clipped Absolute Deviation (SCAD) regularization, enforcing sparsity, and the SCAD of total variation (TV) regularization, enforcing spatial contiguity, into one group, which segments contiguous spatial regions against zero-valued background. Efficient algorithm is based on the alternative direction method of multipliers (ADMM) which decomposes the non-convex problem into two iterative optimization problems with explicit solutions. Another virtue of the proposed method is that a divide and conquer learning algorithm is developed, thereby allowing scaling to large images. Several examples are presented and the experimental results are compared with other state-of-the-art approaches.
  • We propose an extrinsic regression framework for modeling data with manifold valued responses and Euclidean predictors. Regression with manifold responses has wide applications in shape analysis, neuroscience, medical imaging and many other areas. Our approach embeds the manifold where the responses lie onto a higher dimensional Euclidean space, obtains a local regression estimate in that space, and then projects this estimate back onto the image of the manifold. Outside the regression setting both intrinsic and extrinsic approaches have been proposed for modeling i.i.d manifold-valued data. However, to our knowledge our work is the first to take an extrinsic approach to the regression problem. The proposed extrinsic regression framework is general, computationally efficient and theoretically appealing. Asymptotic distributions and convergence rates of the extrinsic regression estimates are derived and a large class of examples are considered indicating the wide applicability of our approach.
  • Many neuroimaging studies have collected ultra-high dimensional imaging data in order to identify imaging biomarkers that are related to normal biological processes, diseases, and the response to treatment, among many others. These imaging data are often represented in the form of a multi-dimensional array, called a tensor. Existing statistical methods are insufficient for analysis of these tensor data due to their ultra-high dimensionality as well as their complex structure. The aim of this paper is to develop a tensor partition regression modeling (TPRM) framework to establish an association between low-dimensional clinical outcomes and ultra-high dimensional tensor covariates. Our TPRM is a hierarchical model and efficiently integrates four components: (i) a partition model; (ii) a canonical polyadic decomposition model; (iii) a factor model; and (iv) a generalized linear model. Under this framework, ultra-high dimensionality is not only reduced to a manageable level, resulting inefficient estimation, but also prediction accuracy is optimized to search for informative sub-tensors. Posterior computation proceeds via an efficient Markov chain Monte Carlo algorithm. Simulation shows that TPRM outperforms several other competing methods. We apply TPRM to predict disease status (Alzheimer versus control) by using structural magnetic resonance imaging data obtained from Alzheimer's Disease Neuroimaging Initiative (ADNI) study.
  • In the event-related functional magnetic resonance imaging (fMRI) data analysis, there is an extensive interest in accurately and robustly estimating the hemodynamic response function (HRF) and its associated statistics (e.g., the magnitude and duration of the activation). Most methods to date are developed in the time domain and they have utilized almost exclusively the temporal information of fMRI data without accounting for the spatial information. The aim of this paper is to develop a multiscale adaptive smoothing model (MASM) in the frequency domain by integrating the spatial and frequency information to adaptively and accurately estimate HRFs pertaining to each stimulus sequence across all voxels in a three-dimensional (3D) volume. We use two sets of simulation studies and a real data set to examine the finite sample performance of MASM in estimating HRFs. Our real and simulated data analyses confirm that MASM outperforms several other state-of-the-art methods, such as the smooth finite impulse response (sFIR) model.
  • Motivated by recent work on studying massive imaging data in various neuroimaging studies, we propose a novel spatially varying coefficient model (SVCM) to spatially model the varying association between imaging measures in a three-dimensional (3D) volume (or 2D surface) with a set of covariates. Two key features of most neuorimaging data are the presence of multiple piecewise smooth regions with unknown edges and jumps and substantial spatial correlations. To specifically account for these two features, SVCM includes a measurement model with multiple varying coefficient functions, a jumping surface model for each varying coefficient function, and a functional principal component model. We develop a three-stage estimation procedure to simultaneously estimate the varying coefficient functions and the spatial correlations. The estimation procedure includes a fast multiscale adaptive estimation and testing procedure to independently estimate each varying coefficient function, while preserving its edges among different piecewise-smooth regions. We systematically investigate the asymptotic properties (e.g., consistency and asymptotic normality) of the multiscale adaptive parameter estimates. We also establish the uniform convergence rate of the estimated spatial covariance function and its associated eigenvalue and eigenfunctions. Our Monte Carlo simulation and real data analysis have confirmed the excellent performance of SVCM.
  • Diffusion tensor imaging provides important information on tissue structure and orientation of fiber tracts in brain white matter in vivo. It results in diffusion tensors, which are $3\times3$ symmetric positive definite (SPD) matrices, along fiber bundles. This paper develops a functional data analysis framework to model diffusion tensors along fiber tracts as functional data in a Riemannian manifold with a set of covariates of interest, such as age and gender. We propose a statistical model with varying coefficient functions to characterize the dynamic association between functional SPD matrix-valued responses and covariates. We calculate weighted least squares estimators of the varying coefficient functions for the log-Euclidean metric in the space of SPD matrices. We also develop a global test statistic to test specific hypotheses about these coefficient functions and construct their simultaneous confidence bands. Simulated data are further used to examine the finite sample performance of the estimated varying coefficient functions. We apply our model to study potential gender differences and find a statistically significant aspect of the development of diffusion tensors along the right internal capsule tract in a clinical study of neurodevelopment.
  • The aim of this paper is to establish several deep theoretical properties of principal component analysis for multiple-component spike covariance models. Our new results reveal a surprising asymptotic conical structure in critical sample eigendirections under the spike models with distinguishable (or indistinguishable) eigenvalues, when the sample size and/or the number of variables (or dimension) tend to infinity. The consistency of the sample eigenvectors relative to their population counterparts is determined by the ratio between the dimension and the product of the sample size with the spike size. When this ratio converges to a nonzero constant, the sample eigenvector converges to a cone, with a certain angle to its corresponding population eigenvector.In the High Dimension, Low Sample Size case, the angle between the sample eigenvector and its population counterpart converges to a limiting distribution.Several generalizations of the multi-spike covariance models are also explored, and additional theoretical results are presented.
  • Motivated by recent work studying massive imaging data in the neuroimaging literature, we propose multivariate varying coefficient models (MVCM) for modeling the relation between multiple functional responses and a set of covariates. We develop several statistical inference procedures for MVCM and systematically study their theoretical properties. We first establish the weak convergence of the local linear estimate of coefficient functions, as well as its asymptotic bias and variance, and then we derive asymptotic bias and mean integrated squared error of smoothed individual functions and their uniform convergence rate. We establish the uniform convergence rate of the estimated covariance function of the individual functions and its associated eigenvalue and eigenfunctions. We propose a global test for linear hypotheses of varying coefficient functions, and derive its asymptotic distribution under the null hypothesis. We also propose a simultaneous confidence band for each individual effect curve. We conduct Monte Carlo simulation to examine the finite-sample performance of the proposed procedures. We apply MVCM to investigate the development of white matter diffusivities along the genu tract of the corpus callosum in a clinical study of neurodevelopment.
  • Principal component analysis is a useful dimension reduction and data visualization method. However, in high dimension, low sample size asymptotic contexts, where the sample size is fixed and the dimension goes to infinity,a paradox has arisen. In particular, despite the useful real data insights commonly obtained from principal component score visualization, these scores are not consistent even when the sample eigen-vectors are consistent. This paradox is resolved by asymptotic study of the ratio between the sample and population principal component scores. In particular, it is seen that this proportion converges to a non-degenerate random variable. The realization is the same for each data point, i.e. there is a common random rescaling, which appears for each eigen-direction. This then gives inconsistent axis labels for the standard scores plot, yet the relative positions of the points (typically the main visual content) are consistent. This paradox disappears when the sample size goes to infinity.
  • Cook's distance [Technometrics 19 (1977) 15-18] is one of the most important diagnostic tools for detecting influential individual or subsets of observations in linear regression for cross-sectional data. However, for many complex data structures (e.g., longitudinal data), no rigorous approach has been developed to address a fundamental issue: deleting subsets with different numbers of observations introduces different degrees of perturbation to the current model fitted to the data, and the magnitude of Cook's distance is associated with the degree of the perturbation. The aim of this paper is to address this issue in general parametric models with complex data structures. We propose a new quantity for measuring the degree of the perturbation introduced by deleting a subset. We use stochastic ordering to quantify the stochastic relationship between the degree of the perturbation and the magnitude of Cook's distance. We develop several scaled Cook's distances to resolve the comparison of Cook's distance for different subset deletions. Theoretical and numerical examples are examined to highlight the broad spectrum of applications of these scaled Cook's distances in a formal influence analysis.
  • Classical regression methods treat covariates as a vector and estimate a corresponding vector of regression coefficients. Modern applications in medical imaging generate covariates of more complex form such as multidimensional arrays (tensors). Traditional statistical and computational methods are proving insufficient for analysis of these high-throughput data due to their ultrahigh dimensionality as well as complex structure. In this article, we propose a new family of tensor regression models that efficiently exploit the special structure of tensor covariates. Under this framework, ultrahigh dimensionality is reduced to a manageable level, resulting in efficient estimation and prediction. A fast and highly scalable estimation algorithm is proposed for maximum likelihood estimation and its associated asymptotic properties are studied. Effectiveness of the new methods is demonstrated on both synthetic and real MRI imaging data.
  • Longitudinal imaging studies are essential to understanding the neural development of neuropsychiatric disorders, substance use disorders, and the normal brain. The main objective of this paper is to develop a two-stage adjusted exponentially tilted empirical likelihood (TETEL) for the spatial analysis of neuroimaging data from longitudinal studies. The TETEL method as a frequentist approach allows us to efficiently analyze longitudinal data without modeling temporal correlation and to classify different time-dependent covariate types. To account for spatial dependence, the TETEL method developed here specifically combines all the data in the closest neighborhood of each voxel (or pixel) on a 3-dimensional (3D) volume (or 2D surface) with appropriate weights to calculate adaptive parameter estimates and adaptive test statistics. Simulation studies are used to examine the finite sample performance of the adjusted exponential tilted likelihood ratio statistic and TETEL. We demonstrate the application of our statistical methods to the detection of the difference in the morphological changes of the hippocampus across time between schizophrenia patients and healthy subjects in a longitudinal schizophrenia study.
  • Cook's [J. Roy. Statist. Soc. Ser. B 48 (1986) 133--169] local influence approach based on normal curvature is an important diagnostic tool for assessing local influence of minor perturbations to a statistical model. However, no rigorous approach has been developed to address two fundamental issues: the selection of an appropriate perturbation and the development of influence measures for objective functions at a point with a nonzero first derivative. The aim of this paper is to develop a differential--geometrical framework of a perturbation model (called the perturbation manifold) and utilize associated metric tensor and affine curvatures to resolve these issues. We will show that the metric tensor of the perturbation manifold provides important information about selecting an appropriate perturbation of a model. Moreover, we will introduce new influence measures that are applicable to objective functions at any point. Examples including linear regression models and linear mixed models are examined to demonstrate the effectiveness of using new influence measures for the identification of influential observations.
  • Many important problems in psychology and biomedical studies require testing for overdispersion, correlation and heterogeneity in mixed effects and latent variable models, and score tests are particularly useful for this purpose. But the existing testing procedures depend on restrictive assumptions. In this paper we propose a class of test statistics based on a general mixed effects model to test the homogeneity hypothesis that all of the variance components are zero. Under some mild conditions, not only do we derive asymptotic distributions of the test statistics, but also propose a resampling procedure for approximating their asymptotic distributions conditional on the observed data. To overcome the technical challenge, we establish an invariance principle for random quadratic forms indexed by a parameter. A simulation study is conducted to investigate the empirical performance of the test statistics. A real data set is analyzed to illustrate the application of our theoretical results.