• It is well recognised that animal and plant pathogens form complex ecological communities of interacting organisms within their hosts. Although community ecology approaches have been applied to determine pathogen interactions at the within-host scale, methodologies enabling robust inference of the epidemiological impact of pathogen interactions are lacking. Here we developed a novel statistical framework to identify statistical covariances from the infection time-series of multiple pathogens simultaneously. Our framework extends Bayesian multivariate disease mapping models to analyse multivariate time series data by accounting for within- and between-year dependencies in infection risk and incorporating a between-pathogen covariance matrix which we estimate. Importantly, our approach accounts for possible confounding drivers of temporal patterns in pathogen infection frequencies, enabling robust inference of pathogen-pathogen interactions. We illustrate the validity of our statistical framework using simulated data and applied it to diagnostic data available for five respiratory viruses co-circulating in a major urban population between 2005 and 2013: adenovirus, human coronavirus, human metapneumovirus, influenza B virus and respiratory syncytial virus. We found positive and negative covariances indicative of epidemiological interactions among specific virus pairs. This statistical framework enables a community ecology perspective to be applied to infectious disease epidemiology with important utility for public health planning and preparedness.
  • In the genomic era, the identification of gene signatures associated with disease is of significant interest. Such signatures are often used to predict clinical outcomes in new patients and aid clinical decision-making. However, recent studies have shown that gene signatures are often not replicable. This occurrence has practical implications regarding the generalizability and clinical applicability of such signatures. To improve replicability, we introduce a novel approach to select gene signatures from multiple datasets whose effects are consistently non-zero and account for between-study heterogeneity. We build our model upon some rank-based quantities, facilitating integration over different genomic datasets. A high dimensional penalized Generalized Linear Mixed Model (pGLMM) is used to select gene signatures and address data heterogeneity. We compare our method to some commonly used strategies that select gene signatures ignoring between-study heterogeneity. We provide asymptotic results justifying the performance of our method and demonstrate its advantage in the presence of heterogeneity through thorough simulation studies. Lastly, we motivate our method through a case study subtyping pancreatic cancer patients from four gene expression studies.
  • We propose a general framework for nonasymptotic covariance matrix estimation making use of concentration inequality-based confidence sets. We specify this framework for the estimation of large sparse covariance matrices through incorporation of past thresholding estimators with key emphasis on support recovery. This technique goes beyond past results for thresholding estimators by allowing for a wide range of distributional assumptions beyond merely sub-Gaussian tails. This methodology can furthermore be adapted to a wide range of other estimators and settings. The usage of nonasymptotic dimension-free confidence sets yields good theoretical performance. Through extensive simulations, it is demonstrated to have superior performance when compared with other such methods. In the context of support recovery, we are able to specify a false positive rate and optimize to maximize the true recoveries.
  • Large-scale multiple testing with highly correlated test statistics arises frequently in many scientific research. Incorporating correlation information in estimating false discovery proportion has attracted increasing attention in recent years. When the covariance matrix of test statistics is known, Fan, Han & Gu (2012) provided a consistent estimate of False Discovery Proportion (FDP) under arbitrary dependence structure. However, the covariance matrix is often unknown in many applications and such dependence information has to be estimated before estimating FDP (Efron, 2010). The estimation accuracy can greatly affect the convergence result of FDP or even violate its consistency. In the current paper, we provide methodological modification and theoretical investigations for estimation of FDP with unknown covariance. First we develop requirements for estimates of eigenvalues and eigenvectors such that we can obtain a consistent estimate of FDP. Secondly we give conditions on the dependence structures such that the estimate of FDP is consistent. Such dependence structures include sparse covariance matrices, which have been popularly considered in the contemporary random matrix theory. When data are sampled from an approximate factor model, which encompasses most practical situations, we provide a consistent estimate of FDP via exploiting this specific dependence structure. The results are further demonstrated by simulation studies and some real data applications.
  • The literature on regression kink designs develops identification results for average effects of continuous treatments (Card, Lee, Pei, and Weber, 2015), average effects of binary treatments (Dong, 2018), and quantile-wise effects of continuous treatments (Chiang and Sasaki, 2019), but there has been no identification result for quantile-wise effects of binary treatments to date. In this paper, we fill this void in the literature by providing an identification of quantile treatment effects in regression kink designs with binary treatment variables. For completeness, we also develop large sample theories for statistical inference and a practical guideline on estimation and inference.
  • While many statistical models and methods are now available for network analysis, resampling network data remains a challenging problem. Cross-validation is a useful general tool for model selection and parameter tuning, but is not directly applicable to networks since splitting network nodes into groups requires deleting edges and destroys some of the network structure. Here we propose a new network resampling strategy based on splitting node pairs rather than nodes applicable to cross-validation for a wide range of network model selection tasks. We provide a theoretical justification for our method in a general setting and examples of how our method can be used in specific network model selection and parameter tuning tasks. Numerical results on simulated networks and on a citation network of statisticians show that this cross-validation approach works well for model selection.
  • In this paper we present a novel inference methodology to perform Bayesian inference for spatiotemporal Cox processes where the intensity function depends on a multivariate Gaussian process. Dynamic Gaussian processes are introduced to allow for evolution of the intensity function over discrete time. The novelty of the method lies on the fact that no discretisation error is involved despite the non-tractability of the likelihood function and infinite dimensionality of the problem. The method is based on a Markov chain Monte Carlo algorithm that samples from the joint posterior distribution of the parameters and latent variables of the model. The models are defined in a general and flexible way but they are amenable to direct sampling from the relevant distributions, due to careful characterisation of its components. The models also allow for the inclusion of regression covariates and/or temporal components to explain the variability of the intensity function. These components may be subject to relevant interaction with space and/or time. Real and simulated examples illustrate the methodology, followed by concluding remarks.
  • There is a growing need for the ability to analyse interval-valued data. However, existing descriptive frameworks to achieve this ignore the process by which interval-valued data are typically constructed; namely by the aggregation of real-valued data generated from some underlying process. In this article we develop the foundations of likelihood based statistical inference for random intervals that directly incorporates the underlying generative procedure into the analysis. That is, it permits the direct fitting of models for the underlying real-valued data given only the random interval-valued summaries. This generative approach overcomes several problems associated with existing methods, including the rarely satisfied assumption of within-interval uniformity. The new methods are illustrated by simulated and real data analyses.
  • This paper studies non-separable models with a continuous treatment when the dimension of the control variables is high and potentially larger than the effective sample size. We propose a three-step estimation procedure to estimate the average, quantile, and marginal treatment effects. In the first stage we estimate the conditional mean, distribution, and density objects by penalized local least squares, penalized local maximum likelihood estimation, and numerical differentiation, respectively, where control variables are selected via a localized method of L1-penalization at each value of the continuous treatment. In the second stage we estimate the average and marginal distribution of the potential outcome via the plug-in principle. In the third stage, we estimate the quantile and marginal treatment effects by inverting the estimated distribution function and using the local linear regression, respectively. We study the asymptotic properties of these estimators and propose a weighted-bootstrap method for inference. Using simulated and real datasets, we demonstrate that the proposed estimators perform well in finite samples.
  • We consider joint selection of fixed and random effects in general mixed-effects models. The interpretation of estimated mixed-effects models is challenging since changing the structure of one set of effects can lead to different choices of important covariates in the model. We propose a stepwise selection algorithm to perform simultaneous selection of the fixed and random effects. It is based on BIC-type criteria whose penalties are adapted to mixed-effects models. The proposed procedure performs model selection in both linear and nonlinear models. It should be used in the low-dimension setting where the number of covariates and the number of random effects are moderate with respect to the total number of observations. The performance of the algorithm is assessed via a simulation study, that includes also a comparative study with alternatives when available in the literature. The use of the method is illustrated in the clinical study of an antibiotic agent kinetics.
  • Dimension reduction provides a useful tool for analyzing high dimensional data. The recently developed \textit{Envelope} method is a parsimonious version of the classical multivariate regression model through identifying a minimal reducing subspace of the responses. However, existing envelope methods assume an independent error structure in the model. While the assumption of independence is convenient, it does not address the additional complications associated with spatial or temporal correlations in the data. In this article, we introduce a \textit{Spatial Envelope} method for dimension reduction in the presence of dependencies across space. We study the asymptotic properties of the proposed estimators and show that the asymptotic variance of the estimated regression coefficients under the spatial envelope model is smaller than that from the traditional maximum likelihood estimation. Furthermore, we present a computationally efficient approach for inference. The efficacy of the new approach is investigated through simulation studies and an analysis of an Air Quality Standard (AQS) dataset from the Environmental Protection Agency (EPA).
  • Estimation of the number of species or unobserved classes from a random sample of the underlying population is a ubiquitous problem in statistics. In classical settings, the size of the sample is usually small. New technologies such as high-throughput DNA sequencing have allowed for the sampling of extremely large and heterogeneous populations at scales not previously attainable or even considered. New algorithms are required that take advantage of the size of the data to account for heterogeneity, but are also sufficiently fast and scale well with large data. We present a non-parametric moment-based estimator that is both computationally efficient and is sufficiently flexible to account for heterogeneity in the abundances of underlying population. This estimator is based on an extension of a popular moment-based lower bound (Chao, 1984), originally developed by Harris (1959) but unattainable due to the lack of economical algorithms to solve the system of nonlinear equation required for estimation. We apply results from the classical moment problem to show that solutions can be obtained efficiently, allowing for estimators that are simultaneously conservative and use more information. This is critical for modern genomic applications, where there may be many large experiments that require the application of species estimation. We present applications of our estimator to estimating T-Cell receptor repertoire and dropout in single cell RNA-seq experiments.
  • The goal of this paper is to contrast and survey the major advances in two of the most commonly used high-dimensional techniques, namely, the Lasso and horseshoe regularization. Lasso is a gold standard for predictor selection while horseshoe is a state-of-the-art Bayesian estimator for sparse signals. Lasso is fast and scalable and uses convex optimization whilst the horseshoe is non-convex. Our novel perspective focuses on three aspects: (i) theoretical optimality in high dimensional inference for the Gaussian sparse model and beyond, (ii) efficiency and scalability of computation and (iii) methodological development and performance.
  • We propose an optimal experimental design for a curvilinear regression model that minimizes the band-width of simultaneous confidence bands. Simultaneous confidence bands for curvilinear regression are constructed by evaluating the volume of a tube about a curve that is defined as a trajectory of a regression basis vector (Naiman, 1986). The proposed criterion is constructed based on the volume of a tube, and the corresponding optimal design that minimizes the volume of tube is referred to as the tube-volume optimal (TV-optimal) design. For Fourier and weighted polynomial regressions, the problem is formalized as one of minimization over the cone of Hankel positive definite matrices, and the criterion to minimize is expressed as an elliptic integral. We show that the M\"obius group keeps our problem invariant, and hence, minimization can be conducted over cross-sections of orbits. We demonstrate that for the weighted polynomial regression and the Fourier regression with three bases, the tube-volume optimal design forms an orbit of the M\"obius group containing D-optimal designs as representative elements.
  • In this article, we propose a factor-adjusted multiple testing (FAT) procedure based on factor-adjusted p-values in a linear factor model involving some observable and unobservable factors, for the purpose of selecting skilled funds in empirical finance. The factor-adjusted p-values were obtained after extracting the latent common factors by the principal component method. Under some mild conditions, the false discovery proportion can be consistently estimated even if the idiosyncratic errors are allowed to be weakly correlated across units. Furthermore, by appropriately setting a sequence of threshold values approaching zero, the proposed FAT procedure enjoys model selection consistency. Extensive simulation studies and a real data analysis for selecting skilled funds in the U.S. financial market are presented to illustrate the practical utility of the proposed method. Supplementary materials for this article are available online.
  • Feature selection is a standard approach to understanding and modeling high-dimensional classification data, but the corresponding statistical methods hinge on tuning parameters that are difficult to calibrate. In particular, existing calibration schemes in the logistic regression framework lack any finite sample guarantees. In this paper, we introduce a novel calibration scheme for $\ell_1$-penalized logistic regression. It is based on simple tests along the tuning parameter path and is equipped with optimal guarantees for feature selection. It is also amenable to easy and efficient implementations, and it rivals or outmatches existing methods in simulations and real data applications.
  • Bayesian matrix factorization (BMF) is a powerful tool for producing low-rank representations of matrices and for predicting missing values and providing confidence intervals. Scaling up the posterior inference for massive-scale matrices is challenging and requires distributing both data and computation over many workers, making communication the main computational bottleneck. Embarrassingly parallel inference would remove the communication needed, by using completely independent computations on different data subsets, but it suffers from the inherent unidentifiability of BMF solutions. We introduce a hierarchical decomposition of the joint posterior distribution, which couples the subset inferences, allowing for embarrassingly parallel computations in a sequence of at most three stages. Using an efficient approximate implementation, we show improvements empirically on both real and simulated data. Our distributed approach is able to achieve a speed-up of almost an order of magnitude over the full posterior, with a negligible effect on predictive accuracy. Our method outperforms state-of-the-art embarrassingly parallel MCMC methods in accuracy, and achieves results competitive to other available distributed and parallel implementations of BMF.
  • We propose new tests for assessing whether covariates in a treatment group and matched control group are balanced in observational studies. The tests exhibit high power under a wide range of multivariate alternatives, some of which existing tests have little power for. The asymptotic permutation null distributions of the proposed tests are studied and the p-values calculated through the asymptotic results work well in finite samples, facilitating the application of the test to large data sets. The tests are illustrated in a study of the effect of smoking on blood lead levels. The proposed tests are implemented in an R package BalanceCheck.
  • Bayesian inference for factorial hidden Markov models is challenging due to the exponentially sized latent variable space. Standard Monte Carlo samplers can have difficulties effectively exploring the posterior landscape and are often restricted to exploration around localised regions that depend on initialisation. We introduce a general purpose ensemble Markov Chain Monte Carlo (MCMC) technique to improve on existing poorly mixing samplers. This is achieved by combining parallel tempering and an auxiliary variable scheme to exchange information between the chains in an efficient way. The latter exploits a genetic algorithm within an augmented Gibbs sampler. We compare our technique with various existing samplers in a simulation study as well as in a cancer genomics application, demonstrating the improvements obtained by our augmented ensemble approach.
  • We consider the estimation and inference of graphical models that characterize the dependency structure of high-dimensional tensor-valued data. To facilitate the estimation of the precision matrix corresponding to each way of the tensor, we assume the data follow a tensor normal distribution whose covariance has a Kronecker product structure. A critical challenge in the estimation and inference of this model is the fact that its penalized maximum likelihood estimation involves minimizing a non-convex objective function. To address it, this paper makes two contributions: (i) In spite of the non-convexity of this estimation problem, we prove that an alternating minimization algorithm, which iteratively estimates each sparse precision matrix while fixing the others, attains an estimator with an optimal statistical rate of convergence. (ii) We propose a de-biased statistical inference procedure for testing hypotheses on the true support of the sparse precision matrices, and employ it for testing a growing number of hypothesis with false discovery rate (FDR) control. The asymptotic normality of our test statistic and the consistency of FDR control procedure are established. Our theoretical results are backed up by thorough numerical studies and our real applications on neuroimaging studies of Autism spectrum disorder and users' advertising click analysis bring new scientific findings and business insights. The proposed methods are encoded into a publicly available R package Tlasso.
  • The practical importance of inference with robustness against large bandwidths for causal effects in regression discontinuity and kink designs is widely recognized. Existing robust methods cover many cases, but do not handle uniform inference for CDF and quantile processes in fuzzy designs, despite its use in the recent literature in empirical microeconomics. In this light, this paper extends the literature by developing a unified framework of inference with robustness against large bandwidths that applies to uniform inference for quantile treatment effects in fuzzy designs, as well as all the other cases of sharp/fuzzy mean/quantile regression discontinuity/kink designs. We present Monte Carlo simulation studies and an empirical application for evaluations of the Oklahoma pre-K program.
  • Statistics derived from the eigenvalues of sample covariance matrices are called spectral statistics, and they play a central role in multivariate testing. Although bootstrap methods are an established approach to approximating the laws of spectral statistics in low-dimensional problems, these methods are relatively unexplored in the high-dimensional setting. The aim of this paper is to focus on linear spectral statistics as a class of prototypes for developing a new bootstrap in high-dimensions --- and we refer to this method as the Spectral Bootstrap. In essence, the method originates from the parametric bootstrap, and is motivated by the notion that, in high dimensions, it is difficult to obtain a non-parametric approximation to the full data-generating distribution. From a practical standpoint, the method is easy to use, and allows the user to circumvent the difficulties of complex asymptotic formulas for linear spectral statistics. In addition to proving the consistency of the proposed method, we provide encouraging empirical results in a variety of settings. Lastly, and perhaps most interestingly, we show through simulations that the method can be applied successfully to statistics outside the class of linear spectral statistics, such as the largest sample eigenvalue and others.
  • Consider the problem of modeling memory effects in discrete-state random walks using higher-order Markov chains. This paper explores cross validation and information criteria as proxies for a model's predictive accuracy. Our objective is to select, from data, the number of prior states of recent history upon which a trajectory is statistically dependent. Through simulations, I evaluate these criteria in the case where data are drawn from systems with fixed orders of history, noting trends in the relative performance of the criteria. As a real-world illustrative example of these methods, this manuscript evaluates the problem of detecting statistical dependencies in shot outcomes in free throw shooting. Over three NBA seasons analyzed, several players exhibited statistical dependencies in free throw hitting probability of various types - hot handedness, cold handedness, and error correction. For the 2013-2014 through 2015-2016 NBA seasons, I detected statistical dependencies in 23% of all player-seasons. Focusing on a single player, in two of these three seasons, LeBron James shot a better percentage after an immediate miss than otherwise. In those seasons, conditioning on the previous outcome makes for a more predictive model than treating free throw makes as independent. When extended to data from the 2016-2017 NBA season specifically for LeBron James, a model depending on the previous shot (single-step Markovian) does not clearly beat a model with independent outcomes. An error-correcting variable length model of two parameters, where James shoots a higher percentage after a missed free throw than otherwise, is more predictive than either model.
  • The spectral distribution $f(\omega)$ of a stationary time series $\{Y_t\}_{t\in\mathbb{Z}}$ can be used to investigate whether or not periodic structures are present in $\{Y_t\}_{t\in\mathbb{Z}}$, but $f(\omega)$ has some limitations due to its dependence on the autocovariances $\gamma(h)$. For example, $f(\omega)$ can not distinguish white i.i.d. noise from GARCH-type models (whose terms are dependent, but uncorrelated), which implies that $f(\omega)$ can be an inadequate tool when $\{Y_t\}_{t\in\mathbb{Z}}$ contains asymmetries and nonlinear dependencies. Asymmetries between the upper and lower tails of a time series can be investigated by means of the local Gaussian autocorrelations introduced in Tj{\o}stheim and Hufthammer (2013), and these local measures of dependence can be used to construct the local Gaussian spectral density presented in this paper. A key feature of the new local spectral density is that it coincides with $f(\omega)$ for Gaussian time series, which implies that it can be used to detect non-Gaussian traits in the time series under investigation. In particular, if $f(\omega)$ is flat, then peaks and troughs of the new local spectral density can indicate nonlinear traits, which potentially might discover local periodic phenomena that remain undetected in an ordinary spectral analysis.
  • Hypothesis testing in the linear regression model is a fundamental statistical problem. We consider linear regression in the high-dimensional regime where the number of parameters exceeds the number of samples ($p> n$). In order to make informative inference, we assume that the model is approximately sparse, that is the effect of covariates on the response can be well approximated by conditioning on a relatively small number of covariates whose identities are unknown. We develop a framework for testing very general hypotheses regarding the model parameters. Our framework encompasses testing whether the parameter lies in a convex cone, testing the signal strength, and testing arbitrary functionals of the parameter. We show that the proposed procedure controls the type I error, and also analyze the power of the procedure. Our numerical experiments confirm our theoretical findings and demonstrate that we control false positive rate (type I error) near the nominal level, and have high power. By duality between hypotheses testing and confidence intervals, the proposed framework can be used to obtain valid confidence intervals for various functionals of the model parameters. For linear functionals, the length of confidence intervals is shown to be minimax rate optimal.