
With the rapid growth of modern technology, many largescale biomedical
studies have been/are being/will be conducted to collect massive datasets with
large volumes of multimodality imaging, genetic, neurocognitive, and clinical
information from increasingly large cohorts. Simultaneously extracting and
integrating rich and diverse heterogeneous information in neuroimaging and/or
genomics from these big datasets could transform our understanding of how
genetic variants impact brain structure and function, cognitive function, and
brainrelated disease risk across the lifespan. Such understanding is critical
for diagnosis, prevention, and treatment of numerous complex brainrelated
disorders (e.g., schizophrenia and Alzheimer). However, the development of
analytical methods for the joint analysis of both highdimensional imaging
phenotypes and highdimensional genetic data, called big data squared (BD$^2$),
presents major computational and theoretical challenges for existing analytical
methods. Besides the highdimensional nature of BD$^2$, various neuroimaging
measures often exhibit strong spatial smoothness and dependence and genetic
markers may have a natural dependence structure arising from linkage
disequilibrium. We review some recent developments of various statistical
techniques for the joint analysis of BD$^2$, including massive univariate and
voxelwise approaches, reduced rank regression, mixture models, and group
sparse multitask regression. By doing so, we hope that this review may
encourage others in the statistical community to enter into this new and
exciting field of research.

This paper investigates the central limit theorem for linear spectral
statistics of high dimensional sample covariance matrices of the form
$\mathbf{B}_n=n^{1}\sum_{j=1}^{n}\mathbf{Q}\mathbf{x}_j\mathbf{x}_j^{*}\mathbf{Q}^{*}$
where $\mathbf{Q}$ is a nonrandom matrix of dimension $p\times k$, and
$\{\mathbf{x}_j\}$ is a sequence of independent $k$dimensional random vector
with independent entries, under the assumption that $p/n\to y>0$. A key novelty
here is that the dimension $k\ge p$ can be arbitrary, possibly infinity. This
new model of sample covariance matrices $\mathbf{B}_n$ covers most of the known
models as its special cases. For example, standard sample covariance matrices
are obtained with $k=p$ and $\mathbf{Q}=\mathbf{T}_n^{1/2}$ for some positive
definite Hermitian matrix $\mathbf{T}_n$. Also with $k=\infty$ our model covers
the case of repeated linear processes considered in recent highdimensional
time series literature. The CLT found in this paper substantially generalizes
the seminal CLT in Bai and Silverstein (2004). Applications of this new CLT are
proposed for testing the structure of a highdimensional covariance matrix. The
derived tests are then used to analyse a large fMRI data set regarding its
temporary correlation structure.

Ambulatory cardiovascular (CV) measurements provide valuable insights into
individuals' health conditions in "reallife," everyday settings. Current
methods of modeling ambulatory CV data do not consider the dynamic
characteristics of the full data set and their relationships with covariates
such as caffeine use and stress. We propose a stochastic differential equation
(SDE) in the form of a dual nonlinear OrnsteinUhlenbeck (OU) model with
personspecific covariates to capture the morning surge and nighttime dipping
dynamics of ambulatory CV data. To circumvent the data analytic constraint that
empirical measurements are typically collected at irregular and much larger
time intervals than those evaluated in simulation studies of SDEs, we adopt a
Bayesian approach with a regularized Brownian Bridge sampler (RBBS) and an
efficient multiresolution (MR) algorithm to fit the proposed SDE. The MR
algorithm can produce more efficient MCMC samples that is crucial for valid
parameter estimation and inference. Using this model and algorithm to data from
the Duke Behavioral Investigation of Hypertension Study, results indicate that
age, caffeine intake, gender and race have effects on distinct dynamic
characteristics of the participants' CV trajectories.

The aim of this paper is to develop a class of spatial transformation models
(STM) to spatially model the varying association between imaging measures in a
threedimensional (3D) volume (or 2D surface) and a set of covariates. Our STMs
include a varying BoxCox transformation model for dealing with the issue of
nonGaussian distributed imaging data and a Gaussian Markov Random Field model
for incorporating spatial smoothness of the imaging data. Posterior computation
proceeds via an efficient Markov chain Monte Carlo algorithm. Simulations and
real data analysis demonstrate that the STM significantly outperforms the
voxelwise linear model with Gaussian noise in recovering meaningful geometric
patterns. Our STM is able to reveal important brain regions with morphological
changes in children with attention deficit hyperactivity disorder.

Identification of regions of interest (ROI) associated with certain disease
has a great impact on public health. Imposing sparsity of pixel values and
extracting active regions simultaneously greatly complicate the image analysis.
We address these challenges by introducing a novel regionselection penalty in
the framework of imageonscalar regression. Our penalty combines the Smoothly
Clipped Absolute Deviation (SCAD) regularization, enforcing sparsity, and the
SCAD of total variation (TV) regularization, enforcing spatial contiguity, into
one group, which segments contiguous spatial regions against zerovalued
background. Efficient algorithm is based on the alternative direction method of
multipliers (ADMM) which decomposes the nonconvex problem into two iterative
optimization problems with explicit solutions. Another virtue of the proposed
method is that a divide and conquer learning algorithm is developed, thereby
allowing scaling to large images. Several examples are presented and the
experimental results are compared with other stateoftheart approaches.

We propose an extrinsic regression framework for modeling data with manifold
valued responses and Euclidean predictors. Regression with manifold responses
has wide applications in shape analysis, neuroscience, medical imaging and many
other areas. Our approach embeds the manifold where the responses lie onto a
higher dimensional Euclidean space, obtains a local regression estimate in that
space, and then projects this estimate back onto the image of the manifold.
Outside the regression setting both intrinsic and extrinsic approaches have
been proposed for modeling i.i.d manifoldvalued data. However, to our
knowledge our work is the first to take an extrinsic approach to the regression
problem. The proposed extrinsic regression framework is general,
computationally efficient and theoretically appealing. Asymptotic distributions
and convergence rates of the extrinsic regression estimates are derived and a
large class of examples are considered indicating the wide applicability of our
approach.

Many neuroimaging studies have collected ultrahigh dimensional imaging data
in order to identify imaging biomarkers that are related to normal biological
processes, diseases, and the response to treatment, among many others. These
imaging data are often represented in the form of a multidimensional array,
called a tensor. Existing statistical methods are insufficient for analysis of
these tensor data due to their ultrahigh dimensionality as well as their
complex structure. The aim of this paper is to develop a tensor partition
regression modeling (TPRM) framework to establish an association between
lowdimensional clinical outcomes and ultrahigh dimensional tensor covariates.
Our TPRM is a hierarchical model and efficiently integrates four components:
(i) a partition model; (ii) a canonical polyadic decomposition model; (iii) a
factor model; and (iv) a generalized linear model. Under this framework,
ultrahigh dimensionality is not only reduced to a manageable level, resulting
inefficient estimation, but also prediction accuracy is optimized to search for
informative subtensors. Posterior computation proceeds via an efficient Markov
chain Monte Carlo algorithm. Simulation shows that TPRM outperforms several
other competing methods. We apply TPRM to predict disease status (Alzheimer
versus control) by using structural magnetic resonance imaging data obtained
from Alzheimer's Disease Neuroimaging Initiative (ADNI) study.

In the eventrelated functional magnetic resonance imaging (fMRI) data
analysis, there is an extensive interest in accurately and robustly estimating
the hemodynamic response function (HRF) and its associated statistics (e.g.,
the magnitude and duration of the activation). Most methods to date are
developed in the time domain and they have utilized almost exclusively the
temporal information of fMRI data without accounting for the spatial
information. The aim of this paper is to develop a multiscale adaptive
smoothing model (MASM) in the frequency domain by integrating the spatial and
frequency information to adaptively and accurately estimate HRFs pertaining to
each stimulus sequence across all voxels in a threedimensional (3D) volume. We
use two sets of simulation studies and a real data set to examine the finite
sample performance of MASM in estimating HRFs. Our real and simulated data
analyses confirm that MASM outperforms several other stateoftheart methods,
such as the smooth finite impulse response (sFIR) model.

Motivated by recent work on studying massive imaging data in various
neuroimaging studies, we propose a novel spatially varying coefficient model
(SVCM) to spatially model the varying association between imaging measures in a
threedimensional (3D) volume (or 2D surface) with a set of covariates. Two key
features of most neuorimaging data are the presence of multiple piecewise
smooth regions with unknown edges and jumps and substantial spatial
correlations. To specifically account for these two features, SVCM includes a
measurement model with multiple varying coefficient functions, a jumping
surface model for each varying coefficient function, and a functional principal
component model. We develop a threestage estimation procedure to
simultaneously estimate the varying coefficient functions and the spatial
correlations. The estimation procedure includes a fast multiscale adaptive
estimation and testing procedure to independently estimate each varying
coefficient function, while preserving its edges among different
piecewisesmooth regions. We systematically investigate the asymptotic
properties (e.g., consistency and asymptotic normality) of the multiscale
adaptive parameter estimates. We also establish the uniform convergence rate of
the estimated spatial covariance function and its associated eigenvalue and
eigenfunctions. Our Monte Carlo simulation and real data analysis have
confirmed the excellent performance of SVCM.

Diffusion tensor imaging provides important information on tissue structure
and orientation of fiber tracts in brain white matter in vivo. It results in
diffusion tensors, which are $3\times3$ symmetric positive definite (SPD)
matrices, along fiber bundles. This paper develops a functional data analysis
framework to model diffusion tensors along fiber tracts as functional data in a
Riemannian manifold with a set of covariates of interest, such as age and
gender. We propose a statistical model with varying coefficient functions to
characterize the dynamic association between functional SPD matrixvalued
responses and covariates. We calculate weighted least squares estimators of the
varying coefficient functions for the logEuclidean metric in the space of SPD
matrices. We also develop a global test statistic to test specific hypotheses
about these coefficient functions and construct their simultaneous confidence
bands. Simulated data are further used to examine the finite sample performance
of the estimated varying coefficient functions. We apply our model to study
potential gender differences and find a statistically significant aspect of the
development of diffusion tensors along the right internal capsule tract in a
clinical study of neurodevelopment.

The aim of this paper is to establish several deep theoretical properties of
principal component analysis for multiplecomponent spike covariance models.
Our new results reveal a surprising asymptotic conical structure in critical
sample eigendirections under the spike models with distinguishable (or
indistinguishable) eigenvalues, when the sample size and/or the number of
variables (or dimension) tend to infinity. The consistency of the sample
eigenvectors relative to their population counterparts is determined by the
ratio between the dimension and the product of the sample size with the spike
size. When this ratio converges to a nonzero constant, the sample eigenvector
converges to a cone, with a certain angle to its corresponding population
eigenvector.In the High Dimension, Low Sample Size case, the angle between the
sample eigenvector and its population counterpart converges to a limiting
distribution.Several generalizations of the multispike covariance models are
also explored, and additional theoretical results are presented.

Motivated by recent work studying massive imaging data in the neuroimaging
literature, we propose multivariate varying coefficient models (MVCM) for
modeling the relation between multiple functional responses and a set of
covariates. We develop several statistical inference procedures for MVCM and
systematically study their theoretical properties. We first establish the weak
convergence of the local linear estimate of coefficient functions, as well as
its asymptotic bias and variance, and then we derive asymptotic bias and mean
integrated squared error of smoothed individual functions and their uniform
convergence rate. We establish the uniform convergence rate of the estimated
covariance function of the individual functions and its associated eigenvalue
and eigenfunctions. We propose a global test for linear hypotheses of varying
coefficient functions, and derive its asymptotic distribution under the null
hypothesis. We also propose a simultaneous confidence band for each individual
effect curve. We conduct Monte Carlo simulation to examine the finitesample
performance of the proposed procedures. We apply MVCM to investigate the
development of white matter diffusivities along the genu tract of the corpus
callosum in a clinical study of neurodevelopment.

Principal component analysis is a useful dimension reduction and data
visualization method. However, in high dimension, low sample size asymptotic
contexts, where the sample size is fixed and the dimension goes to infinity,a
paradox has arisen. In particular, despite the useful real data insights
commonly obtained from principal component score visualization, these scores
are not consistent even when the sample eigenvectors are consistent. This
paradox is resolved by asymptotic study of the ratio between the sample and
population principal component scores. In particular, it is seen that this
proportion converges to a nondegenerate random variable. The realization is
the same for each data point, i.e. there is a common random rescaling, which
appears for each eigendirection. This then gives inconsistent axis labels for
the standard scores plot, yet the relative positions of the points (typically
the main visual content) are consistent. This paradox disappears when the
sample size goes to infinity.

Cook's distance [Technometrics 19 (1977) 1518] is one of the most important
diagnostic tools for detecting influential individual or subsets of
observations in linear regression for crosssectional data. However, for many
complex data structures (e.g., longitudinal data), no rigorous approach has
been developed to address a fundamental issue: deleting subsets with different
numbers of observations introduces different degrees of perturbation to the
current model fitted to the data, and the magnitude of Cook's distance is
associated with the degree of the perturbation. The aim of this paper is to
address this issue in general parametric models with complex data structures.
We propose a new quantity for measuring the degree of the perturbation
introduced by deleting a subset. We use stochastic ordering to quantify the
stochastic relationship between the degree of the perturbation and the
magnitude of Cook's distance. We develop several scaled Cook's distances to
resolve the comparison of Cook's distance for different subset deletions.
Theoretical and numerical examples are examined to highlight the broad spectrum
of applications of these scaled Cook's distances in a formal influence
analysis.

Classical regression methods treat covariates as a vector and estimate a
corresponding vector of regression coefficients. Modern applications in medical
imaging generate covariates of more complex form such as multidimensional
arrays (tensors). Traditional statistical and computational methods are proving
insufficient for analysis of these highthroughput data due to their ultrahigh
dimensionality as well as complex structure. In this article, we propose a new
family of tensor regression models that efficiently exploit the special
structure of tensor covariates. Under this framework, ultrahigh dimensionality
is reduced to a manageable level, resulting in efficient estimation and
prediction. A fast and highly scalable estimation algorithm is proposed for
maximum likelihood estimation and its associated asymptotic properties are
studied. Effectiveness of the new methods is demonstrated on both synthetic and
real MRI imaging data.

Longitudinal imaging studies are essential to understanding the neural
development of neuropsychiatric disorders, substance use disorders, and the
normal brain. The main objective of this paper is to develop a twostage
adjusted exponentially tilted empirical likelihood (TETEL) for the spatial
analysis of neuroimaging data from longitudinal studies. The TETEL method as a
frequentist approach allows us to efficiently analyze longitudinal data without
modeling temporal correlation and to classify different timedependent
covariate types. To account for spatial dependence, the TETEL method developed
here specifically combines all the data in the closest neighborhood of each
voxel (or pixel) on a 3dimensional (3D) volume (or 2D surface) with
appropriate weights to calculate adaptive parameter estimates and adaptive test
statistics. Simulation studies are used to examine the finite sample
performance of the adjusted exponential tilted likelihood ratio statistic and
TETEL. We demonstrate the application of our statistical methods to the
detection of the difference in the morphological changes of the hippocampus
across time between schizophrenia patients and healthy subjects in a
longitudinal schizophrenia study.

Cook's [J. Roy. Statist. Soc. Ser. B 48 (1986) 133169] local influence
approach based on normal curvature is an important diagnostic tool for
assessing local influence of minor perturbations to a statistical model.
However, no rigorous approach has been developed to address two fundamental
issues: the selection of an appropriate perturbation and the development of
influence measures for objective functions at a point with a nonzero first
derivative. The aim of this paper is to develop a differentialgeometrical
framework of a perturbation model (called the perturbation manifold) and
utilize associated metric tensor and affine curvatures to resolve these issues.
We will show that the metric tensor of the perturbation manifold provides
important information about selecting an appropriate perturbation of a model.
Moreover, we will introduce new influence measures that are applicable to
objective functions at any point. Examples including linear regression models
and linear mixed models are examined to demonstrate the effectiveness of using
new influence measures for the identification of influential observations.

Many important problems in psychology and biomedical studies require testing
for overdispersion, correlation and heterogeneity in mixed effects and latent
variable models, and score tests are particularly useful for this purpose. But
the existing testing procedures depend on restrictive assumptions. In this
paper we propose a class of test statistics based on a general mixed effects
model to test the homogeneity hypothesis that all of the variance components
are zero. Under some mild conditions, not only do we derive asymptotic
distributions of the test statistics, but also propose a resampling procedure
for approximating their asymptotic distributions conditional on the observed
data. To overcome the technical challenge, we establish an invariance principle
for random quadratic forms indexed by a parameter. A simulation study is
conducted to investigate the empirical performance of the test statistics. A
real data set is analyzed to illustrate the application of our theoretical
results.