• We propose a general framework for nonasymptotic covariance matrix estimation making use of concentration inequality-based confidence sets. We specify this framework for the estimation of large sparse covariance matrices through incorporation of past thresholding estimators with key emphasis on support recovery. This technique goes beyond past results for thresholding estimators by allowing for a wide range of distributional assumptions beyond merely sub-Gaussian tails. This methodology can furthermore be adapted to a wide range of other estimators and settings. The usage of nonasymptotic dimension-free confidence sets yields good theoretical performance. Through extensive simulations, it is demonstrated to have superior performance when compared with other such methods. In the context of support recovery, we are able to specify a false positive rate and optimize to maximize the true recoveries.
  • With the rapid growth of modern technology, many large-scale biomedical studies have been, are being, or will be conducted to collect massive datasets with large volumes of multi-modality imaging, genetic, neurocognitive, and clinical information from increasingly large cohorts. Simultaneously extracting and integrating rich and diverse heterogeneous information in neuroimaging and/or genomics from these big datasets could transform our understanding of how genetic variants impact brain structure and function, cognitive function, and brain-related disease risk across the lifespan. Such understanding is critical for diagnosis, prevention, and treatment of numerous complex brain-related disorders (e.g., schizophrenia and Alzheimer's disease). However, the development of analytical methods for the joint analysis of both high-dimensional imaging phenotypes and high-dimensional genetic data, called big data squared (BD$^2$), presents major computational and theoretical challenges for existing analytical methods. Besides the high-dimensional nature of BD$^2$, various neuroimaging measures often exhibit strong spatial smoothness and dependence, and genetic markers may have a natural dependence structure arising from linkage disequilibrium. We review some recent developments of various statistical techniques for the joint analysis of BD$^2$, including massive univariate and voxel-wise approaches, reduced rank regression, mixture models, and group sparse multi-task regression. By doing so, we hope that this review may encourage others in the statistical community to enter into this new and exciting field of research.
  • In this paper, we study a new type of spatial sparse recovery problem, that is to infer the fine-grained spatial distribution of certain density data in a region only based on the aggregate observations recorded for each of its subregions. One typical example of this spatial sparse recovery problem is to infer spatial distribution of cellphone activities based on aggregate mobile traffic volumes observed at sparsely scattered base stations. We propose a novel Constrained Spatial Smoothing (CSS) approach, which exploits the local continuity that exists in many types of spatial data to perform sparse recovery via finite-element methods, while enforcing the aggregated observation constraints through an innovative use of the ADMM algorithm. We also improve the approach to further utilize additional geographical attributes. Extensive evaluations based on a large dataset of phone call records and a demographical dataset from the city of Milan show that our approach significantly outperforms various state-of-the-art approaches, including Spatial Spline Regression (SSR).
  • Online real-estate information systems such as Zillow and Trulia have gained increasing popularity in recent years. One important feature offered by these systems is the online home price estimate through automated data-intensive computation based on housing information and comparative market value analysis. State-of-the-art approaches model house prices as a combination of a latent land desirability surface and a regression from house features. However, by using uniformly damping kernels, they are unable to handle irregularly shaped regions or capture land value discontinuities within the same region due to the existence of implicit sub-communities, which are common in real-world scenarios. In this paper, we explore the novel application of recent advances in spatial functional analysis to house price modeling and propose the Hierarchical Spatial Functional Model (HSFM), which decomposes house values into land desirability at both the global scale and hidden local scales as well as the feature regression component. We propose statistical learning algorithms based on finite-element spatial functional analysis and spatial constrained clustering to train our model. Extensive evaluations based on housing data in a major Canadian city show that our proposed approach can reduce the mean relative house price estimation error down to 6.60%.
  • We describe our experience of implementing a news content organization system at Tencent that discovers events from vast streams of breaking news and evolves news story structures in an online fashion. Our real-world system has distinct requirements in contrast to previous studies on topic detection and tracking (TDT) and event timeline or graph generation, in that we 1) need to accurately and quickly extract distinguishable events from massive streams of long text documents that cover diverse topics and contain highly redundant information, and 2) must develop the structures of event stories in an online manner, without repeatedly restructuring previously formed stories, in order to guarantee a consistent user viewing experience. In solving these challenges, we propose Story Forest, a set of online schemes that automatically clusters streaming documents into events, while connecting related events in growing trees to tell evolving stories. We conducted extensive evaluations on 60 GB of real-world Chinese news data and through detailed pilot user experience studies; our ideas are not language-dependent and can easily be extended to other languages. The results demonstrate the superior capability of Story Forest to accurately identify events and organize news text into a logical structure that is appealing to human readers, compared to multiple existing algorithm frameworks.
  • The cqrReg package for R is the first to introduce a family of robust, high-dimensional regression models for quantile and composite quantile regression, both with and without an adaptive lasso penalty for variable selection. In this paper, we reformulate these quantile regression problems and present the estimators we implement in cqrReg using alternating direction method of multipliers (ADMM), majorize-minimization (MM), and coordinate descent (CD) algorithms. Our new approaches address the lack of publicly available methods for (composite) quantile regression, both with and without regularization. We demonstrate the need for a variety of algorithms in later simulation studies. For comparison, we also introduce the widely used interior point (IP) formulation and test our methods against the advanced IP algorithms in the existing quantreg package. Our simulation studies show that each of our methods excels in a different setting, with MM and CD performing particularly well on large and high-dimensional data sets, respectively, and that they outperform the methods currently implemented in quantreg. ADMM offers particular promise for future developments because of its amenability to parallelization.
  • We have previously proposed the partial quantile regression (PQR) prediction procedure for the functional linear model by using partial quantile covariance techniques and developed the simple partial quantile regression (SIMPQR) algorithm to efficiently extract the PQR basis for estimating functional coefficients. However, although the PQR approach is considered an attractive alternative to projections onto the principal component basis, there are certain limitations to uncovering the corresponding asymptotic properties, mainly because of its iterative nature and the non-differentiability of the quantile loss function. In this article, we propose and implement an alternative formulation of partial quantile regression (APQR) for the functional linear model by using the block relaxation method and finite smoothing techniques. The proposed reformulation leads to insightful results and motivates new theory, demonstrating consistency and establishing convergence rates by applying advanced techniques from empirical process theory. Two simulation studies and two real data sets, from the ADHD-200 sample and ADNI, are investigated to show the superiority of our proposed methods.
  • In this manuscript, we study quantile regression in the partial functional linear model, where the response is scalar and the predictors include both scalars and multiple functions. Wavelet bases are adopted to better approximate the functional slopes while effectively detecting local features. The sparse group lasso penalty is imposed to select important functional predictors while capturing shared information among them. The estimation problem can be reformulated as a standard second-order cone program and then solved by an interior point method. We also give a novel algorithm using the alternating direction method of multipliers (ADMM), which has recently been employed by many researchers in solving penalized quantile regression problems. Asymptotic properties such as the convergence rate and the prediction error bound are established. Simulations and a real data analysis of ADHD-200 fMRI data are presented to show the superiority of our proposed method.
  • This paper studies a \textit{partial functional partially linear single-index model} that consists of a functional linear component as well as a linear single-index component. This model generalizes many well-known existing models and is suitable for more complicated data structures. However, its estimation inherits the difficulties and complexities from both components and makes it a challenging problem, which calls for new methodology. We propose a novel profile B-spline method to estimate the parameters by approximating the unknown nonparametric link function in the single-index component part with B-spline, while the linear slope function in the functional component part is estimated by the functional principal component basis. The consistency and asymptotic normality of the parametric estimators are derived, and the global convergence of the proposed estimator of the linear slope function is also established. More excitingly, the latter convergence is optimal in the minimax sense. A two-stage procedure is implemented to estimate the nonparametric link function, and the resulting estimator possesses the optimal global rate of convergence. Furthermore, the convergence rate of the mean squared prediction error for a predictor is also obtained. Empirical properties of the proposed procedures are studied through Monte Carlo simulations. A real data example is also analyzed to illustrate the power and flexibility of the proposed methodology.
  • Matrix factorization is a popular approach to solving matrix estimation problems based on partial observations. Existing matrix factorization is based on least squares and aims to yield a low-rank matrix to interpret the conditional sample means given the observations. However, in many real applications with skewed and extreme data, least squares cannot explain their central tendency or tail distributions, yielding undesired estimates. In this paper, we propose \emph{expectile matrix factorization} by introducing asymmetric least squares, a key concept in expectile regression analysis, into the matrix factorization framework. We propose an efficient algorithm to solve the new problem based on alternating minimization and quadratic programming. We prove that our algorithm converges to a global optimum and exactly recovers the true underlying low-rank matrices when noise is zero. For synthetic data with skewed noise and a real-world dataset containing web service response times, the proposed scheme achieves lower recovery errors than the existing matrix factorization method based on least squares in a wide range of settings.
  • Identification of regions of interest (ROI) associated with certain diseases has a great impact on public health. Imposing sparsity of pixel values and extracting active regions simultaneously greatly complicate the image analysis. We address these challenges by introducing a novel region-selection penalty in the framework of image-on-scalar regression. Our penalty combines the Smoothly Clipped Absolute Deviation (SCAD) regularization, enforcing sparsity, and the SCAD of total variation (TV) regularization, enforcing spatial contiguity, into one group, which segments contiguous spatial regions against zero-valued background. An efficient algorithm is developed based on the alternating direction method of multipliers (ADMM), which decomposes the non-convex problem into two iterative optimization problems with explicit solutions. Another virtue of the proposed method is that a divide-and-conquer learning algorithm is developed, thereby allowing scaling to large images. Several examples are presented and the experimental results are compared with other state-of-the-art approaches.
  • We propose a bivariate quantile regression method for the bivariate varying coefficient model through a directional approach. The varying coefficients are approximated by the B-spline basis and an $L_{2}$ type penalty is imposed to achieve desired smoothness. We develop a multistage estimation procedure based on the Propagation-Separation (PS) approach to borrow information from nearby directions. The PS method is capable of handling the computational complexity raised by simultaneously considering multiple directions to efficiently estimate varying coefficients while guaranteeing certain smoothness along directions. We reformulate the optimization problem and solve it by the Alternating Direction Method of Multipliers (ADMM), which is implemented in R with the core written in C for speed. Simulation studies are conducted to confirm the finite sample performance of our proposed method. A real data set on Diffusion Tensor Imaging (DTI) properties from a clinical study on neurodevelopment is analyzed.
  • We propose a prediction procedure for the functional linear quantile regression model by using partial quantile covariance techniques and develop a simple partial quantile regression (SIMPQR) algorithm to efficiently extract the partial quantile regression (PQR) basis for estimating functional coefficients. We further extend our partial quantile covariance techniques to functional composite quantile regression (CQR) by defining partial composite quantile covariance. There are three major contributions. (1) We define partial quantile covariance between two scalar variables through linear quantile regression. We compute the PQR basis by sequentially maximizing the partial quantile covariance between the response and projections of functional covariates. (2) In order to efficiently extract the PQR basis, we develop a SIMPQR algorithm analogous to simple partial least squares (SIMPLS). (3) Under the homoscedasticity assumption, we extend our techniques to partial composite quantile covariance and use it to find the partial composite quantile regression (PCQR) basis. The SIMPQR algorithm is then modified to obtain the SIMPCQR algorithm. Two simulation studies show the superiority of our proposed methods. Two real data sets, from the ADHD-200 sample and ADNI, are analyzed using our proposed methods.
  • Genetic studies often involve quantitative traits. Identifying genetic features that influence quantitative traits can help to uncover the etiology of diseases. The quantile regression method considers the conditional quantiles of the response variable, and is able to characterize the underlying regression structure in a more comprehensive manner. On the other hand, genetic studies often involve high-dimensional genomic features, and the underlying regression structure may be heterogeneous in terms of both effect sizes and sparsity. To account for the potential genetic heterogeneity, including the heterogeneous sparsity, a regularized quantile regression method is introduced. The theoretical properties of the proposed method are investigated, and its performance is examined through a series of simulation studies. A real dataset is analyzed to demonstrate the application of the proposed method.
  • We give methods for the construction of designs for linear models, when the purpose of the investigation is the estimation of the conditional quantile function and the estimation method is quantile regression. The designs are robust against misspecified response functions, and against unanticipated heteroscedasticity. The methods are illustrated by example, and in a case study in which they are applied to growth charts.
  • The use of quantiles to obtain insights about multivariate data is addressed. It is argued that incisive insights can be obtained by considering directional quantiles, the quantiles of projections. Directional quantile envelopes are proposed as a way to condense this kind of information; it is demonstrated that they are essentially halfspace (Tukey) depth level sets, coinciding for elliptic distributions (in particular multivariate normal) with density contours. Relevant questions concerning their indexing, the possibility of the reverse retrieval of directional quantile information, invariance with respect to affine transformations, and approximation/asymptotic properties are studied. It is argued that the analysis in terms of directional quantiles and their envelopes offers a straightforward probabilistic interpretation and thus conveys a concrete quantitative meaning; the directional definition can be adapted to elaborate frameworks, like estimation of extreme quantiles and directional quantile regression, the regression of depth contours on covariates. The latter facilitates the construction of multivariate growth charts, the question that motivated all the development.
  • Motivated by recent work on studying massive imaging data in various neuroimaging studies, we propose a novel spatially varying coefficient model (SVCM) to spatially model the varying association between imaging measures in a three-dimensional (3D) volume (or 2D surface) with a set of covariates. Two key features of most neuroimaging data are the presence of multiple piecewise smooth regions with unknown edges and jumps and substantial spatial correlations. To specifically account for these two features, SVCM includes a measurement model with multiple varying coefficient functions, a jumping surface model for each varying coefficient function, and a functional principal component model. We develop a three-stage estimation procedure to simultaneously estimate the varying coefficient functions and the spatial correlations. The estimation procedure includes a fast multiscale adaptive estimation and testing procedure to independently estimate each varying coefficient function, while preserving its edges among different piecewise-smooth regions. We systematically investigate the asymptotic properties (e.g., consistency and asymptotic normality) of the multiscale adaptive parameter estimates. We also establish the uniform convergence rate of the estimated spatial covariance function and its associated eigenvalues and eigenfunctions. Our Monte Carlo simulation and real data analysis have confirmed the excellent performance of SVCM.
  • Motivated by recent work studying massive imaging data in the neuroimaging literature, we propose multivariate varying coefficient models (MVCM) for modeling the relation between multiple functional responses and a set of covariates. We develop several statistical inference procedures for MVCM and systematically study their theoretical properties. We first establish the weak convergence of the local linear estimate of coefficient functions, as well as its asymptotic bias and variance, and then we derive asymptotic bias and mean integrated squared error of smoothed individual functions and their uniform convergence rate. We establish the uniform convergence rate of the estimated covariance function of the individual functions and its associated eigenvalue and eigenfunctions. We propose a global test for linear hypotheses of varying coefficient functions, and derive its asymptotic distribution under the null hypothesis. We also propose a simultaneous confidence band for each individual effect curve. We conduct Monte Carlo simulation to examine the finite-sample performance of the proposed procedures. We apply MVCM to investigate the development of white matter diffusivities along the genu tract of the corpus callosum in a clinical study of neurodevelopment.
  • Discussion of "Multivariate quantiles and multiple-output regression quantiles: From $L_1$ optimization to halfspace depth" by M. Hallin, D. Paindaveine and M. Siman [arXiv:1002.4486]
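
Several of the abstracts above rely on the quantile (check) loss, and the expectile matrix factorization abstract relies on asymmetric least squares. As a minimal illustrative sketch, not taken from any of the listed papers or packages (all function names here are hypothetical), the two losses and the constants that minimize them over a sample could be written in plain Python as follows, assuming `0 < tau < 1`:

```python
from typing import Sequence


def check_loss(residual: float, tau: float) -> float:
    """Quantile (check) loss: rho_tau(r) = r * (tau - 1[r < 0])."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def expectile_loss(residual: float, tau: float) -> float:
    """Asymmetric least squares loss: |tau - 1[r < 0]| * r**2."""
    weight = (1.0 - tau) if residual < 0 else tau
    return weight * residual ** 2


def sample_quantile(y: Sequence[float], tau: float) -> float:
    """Constant minimizing the total check loss (hypothetical helper).

    The objective is piecewise linear and convex with kinks at the data
    points, so a minimizer is attained at one of the observations.
    """
    return min(y, key=lambda c: sum(check_loss(v - c, tau) for v in y))


def sample_expectile(y: Sequence[float], tau: float, iters: int = 100) -> float:
    """Constant minimizing the total asymmetric least squares loss,
    computed by iteratively reweighted averaging (hypothetical helper)."""
    mu = sum(y) / len(y)
    for _ in range(iters):
        # Observations above the current fit get weight tau, the rest 1 - tau.
        w = [tau if v > mu else 1.0 - tau for v in y]
        new_mu = sum(wi * v for wi, v in zip(w, y)) / sum(w)
        if abs(new_mu - mu) < 1e-12:
            break
        mu = new_mu
    return mu


y = [1.0, 2.0, 3.0, 4.0, 100.0]
median = sample_quantile(y, 0.5)   # check loss at tau = 0.5 gives the median
mean = sample_expectile(y, 0.5)    # ALS loss at tau = 0.5 gives the mean
```

On this small sample with one outlier, `median` is 3.0 while `mean` is 22.0, which illustrates why quantile-based losses are described as robust and why least squares can misrepresent the central tendency of skewed data.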