• In many applications, linear models fit the data poorly. This article studies an appealing alternative, the generalized regression model. This model only assumes that there exists an unknown monotonically increasing link function connecting the response $Y$ to a single index $X^T\beta^*$ of explanatory variables $X\in\mathbb{R}^d$. The generalized regression model is flexible and covers many widely used statistical models. It fits the data generating mechanisms well in many real problems, which makes it useful in a variety of applications where regression models are regularly employed. In low dimensions, rank-based M-estimators are recommended for the generalized regression model and give root-$n$ consistent estimators of $\beta^*$. Applications of these estimators to high-dimensional data, however, are questionable. This article studies, both theoretically and practically, a simple yet powerful smoothing approach to handle the high-dimensional generalized regression model. Theoretically, a family of smoothing functions is provided, and the amount of smoothing necessary for efficient inference is carefully calculated. Practically, our study is motivated by an important and challenging scientific problem: decoding gene regulation by predicting transcription factors that bind to cis-regulatory elements. Applying our proposed method to this problem shows substantial improvement over the state-of-the-art alternative on real data.
  • Analyses of high-throughput genomic data often lead to ranked lists of genomic loci. How to characterize concordant signals between two rank lists is a common problem with many applications. One example is measuring the reproducibility between two replicate experiments. Another is characterizing the interaction and co-binding between two transcription factors (TFs) based on the overlap between their binding sites. As an exploratory tool, the simple Venn diagram approach can be used to show the common loci between two lists. However, this approach does not account for changes in overlap with decreasing ranks, which may contain useful information for studying similarities or dissimilarities of the two lists. The recently proposed irreproducible discovery rate (IDR) approach compares two rank lists using a copula mixture model. This model considers the rank correlation between two lists. However, it only analyzes the genomic loci that appear in both lists, thereby only measuring signal concordance in the overlapping set of the two lists. When two lists have little overlap but loci in their overlapping set have high concordance in terms of rank, the original IDR approach may misleadingly claim that the two rank lists are highly reproducible when in fact they are not. In this article, we propose to address these issues by translating the problem into a bivariate survival problem. A survival copula mixture model is developed to characterize concordant signals in two rank lists. The effectiveness of this approach is demonstrated using both simulations and real data.
  • The standard methods for detecting differential gene expression are mostly designed for analyzing a single gene expression experiment. When data from multiple related gene expression studies are available, separately analyzing each study is not an ideal strategy as it may fail to detect important genes with consistent but relatively weak differential signals in multiple studies. Jointly modeling all data allows one to borrow information across studies to improve the analysis. However, a simple concordance model, in which each gene is assumed to be differential in either all studies or none of the studies, is incapable of handling genes with study-specific differential expression. In contrast, a model that naively enumerates and analyzes all possible differential patterns across all studies can deal with study-specificity and allow information pooling, but the complexity of its parameter space grows exponentially as the number of studies increases. Here we propose a "correlation motif" approach to address this dilemma. This approach automatically searches for a small number of latent probability vectors called "correlation motifs" to capture the major correlation patterns among multiple studies. The motifs provide the basis for sharing information among studies and genes. The approach improves detection of differential expression and overcomes the barrier of exponentially growing parameter space. It is capable of handling all possible study-specific differential patterns in a large number of studies. The advantages of this new approach over existing methods are illustrated using both simulated and real data.
  • Background: Significance analysis plays a major role in identifying and ranking genes, transcription factor binding sites, DNA methylation regions, and other high-throughput features for association with disease. We propose a new approach, called gene set bagging, for measuring the stability of ranking procedures using predefined gene sets. Gene set bagging involves resampling the original high-throughput data, performing gene set analysis on the resampled data, and confirming that biological categories replicate. This procedure can be thought of as bootstrapping gene set analysis and can be used to determine which gene sets are the most reproducible. Results: Here we apply this approach to two common genomics applications: gene expression and DNA methylation. Even with state-of-the-art statistical ranking procedures, significant categories in a gene set enrichment analysis may be unstable when subjected to resampling. Conclusions: We demonstrate that gene lists are not necessarily stable, and therefore additional steps like gene set bagging can improve biological inference from gene set analysis.
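
The gene set bagging procedure described in the last abstract (resample the data, rerun the gene set analysis, check which categories replicate) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-gene statistic (an absolute Welch-style t statistic), the set-level score (mean statistic over member genes), the top-k calling rule, the stratified bootstrap of samples, and all function names and toy data are hypothetical choices made for the example.

```python
import random
from statistics import mean, variance


def gene_scores(expr, labels):
    """Absolute Welch-style t statistic per gene (assumed ranking statistic).

    expr is a list of rows, one row of sample values per gene;
    labels holds 0 or 1 for each sample (column).
    """
    idx0 = [j for j, g in enumerate(labels) if g == 0]
    idx1 = [j for j, g in enumerate(labels) if g == 1]
    scores = []
    for row in expr:
        a = [row[j] for j in idx0]
        b = [row[j] for j in idx1]
        se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
        # Guard against a degenerate resample with zero variance in both groups.
        scores.append(abs(mean(a) - mean(b)) / se if se > 0 else 0.0)
    return scores


def top_sets(expr, labels, gene_sets, k):
    """Call the k gene sets with the highest mean gene score (assumed rule)."""
    scores = gene_scores(expr, labels)
    set_score = {name: mean(scores[g] for g in genes)
                 for name, genes in gene_sets.items()}
    ranked = sorted(set_score, key=set_score.get, reverse=True)
    return set(ranked[:k])


def gene_set_bagging(expr, labels, gene_sets, k=1, n_boot=100, seed=0):
    """Fraction of bootstrap resamples in which each gene set is called."""
    rng = random.Random(seed)
    idx0 = [j for j, g in enumerate(labels) if g == 0]
    idx1 = [j for j, g in enumerate(labels) if g == 1]
    counts = {name: 0 for name in gene_sets}
    for _ in range(n_boot):
        # Resample samples with replacement within each group, so the
        # group sizes of the original design are preserved.
        cols = [rng.choice(idx0) for _ in idx0] + [rng.choice(idx1) for _ in idx1]
        boot_expr = [[row[j] for j in cols] for row in expr]
        boot_labels = [labels[j] for j in cols]
        for name in top_sets(boot_expr, boot_labels, gene_sets, k):
            counts[name] += 1
    return {name: c / n_boot for name, c in counts.items()}


# Toy data (hypothetical): 20 genes, 10 samples per group; genes 0-4 are
# shifted upward in group 1, all other genes are pure noise.
data_rng = random.Random(1)
labels = [0] * 10 + [1] * 10
expr = [[data_rng.gauss(0.0, 1.0) + (3.0 if g < 5 and labels[j] == 1 else 0.0)
         for j in range(20)]
        for g in range(20)]
gene_sets = {
    "signal": [0, 1, 2, 3, 4],
    "noise_a": [5, 6, 7, 8, 9],
    "noise_b": [10, 11, 12, 13, 14],
}

replication = gene_set_bagging(expr, labels, gene_sets, k=1, n_boot=100)
for name, prop in sorted(replication.items()):
    print(f"{name}: called in {prop:.0%} of resamples")
```

Sets whose replication proportion stays close to 1 are stable under resampling. Sets that are called in the original data but only occasionally in the resamples are the unstable categories the abstract warns about.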