• ### High Dimensional Estimation and Multi-Factor Models(1804.08472)

April 23, 2018 q-fin.ST, stat.ML
The purpose of this paper is to re-investigate the estimation of multiple factor models by relaxing the convention that the number of factors is small. We first obtain the collection of all possible factors and we provide a simultaneous test, security by security, of which factors are significant. Since the collection of risk factors selected for investigation is large and highly correlated, we use dimension reduction methods, including the Least Absolute Shrinkage and Selection Operator (LASSO) and prototype clustering, to perform the investigation. For comparison with the existing literature, we compare the multi-factor model's performance with the Fama-French 5-factor model. We find that both the Fama-French 5-factor and the multi-factor model are consistent with the behavior of "large-time scale" security returns. In a goodness-of-fit test comparing the Fama-French 5-factor with the multi-factor model, the multi-factor model has a substantially larger adjusted $R^{2}$. Robustness tests confirm that the multi-factor model provides a reasonable characterization of security returns.
• ### A Scalable Empirical Bayes Approach to Variable Selection in Generalized Linear Models(1803.09735)

March 26, 2018 stat.ME
A new empirical Bayes approach to variable selection in the context of generalized linear models is developed. The proposed algorithm scales to situations in which the number of putative explanatory variables is very large, possibly much larger than the number of responses. The coefficients in the linear predictor are modeled as a three-component mixture allowing the explanatory variables to have a random positive effect on the response, a random negative effect, or no effect. A key assumption is that only a small (but unknown) fraction of the candidate variables have a non-zero effect. This assumption, together with treating the coefficients as random effects, facilitates an approach that is computationally efficient. In particular, the number of parameters that have to be estimated is small, and remains constant regardless of the number of explanatory variables. The model parameters are estimated using a modified form of the EM algorithm, which is scalable and leads to significantly faster convergence compared with simulation-based fully Bayesian methods.
• ### The middle-scale asymptotics of Wishart matrices(1705.03510)

May 9, 2017 math.PR, math.ST, stat.TH
We study the behavior of a real $p$-dimensional Wishart random matrix with $n$ degrees of freedom when $n,p\rightarrow\infty$ but $p/n\rightarrow 0$. We establish the existence of phase transitions when $p$ grows at the order $n^{(K+1)/(K+3)}$ for every $K\in\mathbb{N}$, and derive expressions for approximating densities between every two phase transitions. To do this, we make use of a novel tool we call the G-transform of a distribution, which is closely related to the characteristic function. We also derive an extension of the $t$-distribution to the real symmetric matrices, which naturally appears as the conjugate distribution to the Wishart under a G-transformation, and show its empirical spectral distribution obeys a semicircle law when $p/n\rightarrow 0$. Finally, we discuss how the phase transitions of the Wishart distribution might originate from changes in rates of convergence of symmetric $t$ statistics.
• ### Tree Space Prototypes: Another Look at Making Tree Ensembles Interpretable(1611.07115)

Nov. 24, 2019 cs.LG, stat.ML
Ensembles of decision trees are known to perform well on many problems, but are not interpretable. In contrast to existing explanations of tree ensembles that explain relationships between features and predictions, we propose an alternative approach to interpreting tree ensembles by surfacing representative points for each class, in which we explain a prediction by presenting points with similar predictions -- prototypes. We introduce a new distance for Gradient Boosted Tree models, and propose new prototype selection methods with theoretical guarantees, with the flexibility to choose a different number of prototypes in each class. We demonstrate our methods on random forests and gradient boosted trees, showing that our found prototypes perform as well as or even better than the original tree ensemble when used as a nearest-prototype classifier. We also present a use case of debugging dataset errors using our proposed methods.
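The nearest-prototype idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract proposes a distance derived from the tree ensemble; the Euclidean distance below is only a stand-in assumption, and the function names and toy data are hypothetical rather than taken from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance; a stand-in for a model-specific distance."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def nearest_prototype(prototypes, x, distance=euclidean):
    """Return (label, point) of the prototype closest to x.

    prototypes: list of (point, label) pairs; classes may have
    different numbers of prototypes.
    """
    if not prototypes:
        raise ValueError("at least one prototype is required")
    point, label = min(prototypes, key=lambda pl: distance(pl[0], x))
    return label, point


# Hypothetical toy data: two prototypes for class "a", one for class "b".
prototypes = [((0.0, 0.0), "a"), ((1.0, 5.0), "a"), ((10.0, 10.0), "b")]
label, witness = nearest_prototype(prototypes, (1.0, 2.0))
print(label, witness)
```

The returned prototype doubles as the explanation: the prediction is justified by pointing at the representative point it most resembles. Any other distance can be swapped in through the `distance` argument.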
• ### On the Domain of Attraction of a Tracy-Widom Law with Applications to Testing Multiple Largest Roots(1510.08873)

Oct. 29, 2015 math.ST, stat.TH
The greatest root statistic arises as the test statistic in several multivariate analysis settings. Suppose there is a global null hypothesis that consists of different independent sub-null hypotheses, and suppose the greatest root statistic is used as the test statistic for each sub-null hypothesis. Such problems may arise when conducting a batch MANOVA or several batches of pairwise testing for equality of covariance matrices. Using the union-intersection testing approach and by letting the problem dimension tend to infinity faster than the number of batches, we show that the global null can be tested using a Gumbel distribution to approximate the critical values. Although the theoretical results are asymptotic, simulation studies indicate that the approximations are very good even for small to moderate dimensions. The results are general and can be applied in any setting where the greatest root statistic is used, not just for the two methods we use for illustrative purposes.
• ### A Scalable Empirical Bayes Approach to Variable Selection(1510.03781)

Oct. 13, 2015 stat.ME
We develop a model-based empirical Bayes approach to variable selection problems in which the number of predictors is very large, possibly much larger than the number of responses (the so-called 'large p, small n' problem). We consider the multiple linear regression setting, where the response is assumed to be a continuous variable and it is a linear function of the predictors plus error. The explanatory variables in the linear model can have a positive effect on the response, a negative effect, or no effect. We model the effects of the linear predictors as a three-component mixture in which a key assumption is that only a small (unknown) fraction of the candidate predictors have a non-zero effect on the response variable. By treating the coefficients as random effects we develop an approach that is computationally efficient because the number of parameters that have to be estimated is small, and remains constant regardless of the number of explanatory variables. The model parameters are estimated using the EM algorithm which is scalable and leads to significantly faster convergence, compared with simulation-based methods.
• ### Improved Second Order Estimation in the Singular Multivariate Normal Model(1509.02451)

Sept. 8, 2015 math.ST, stat.TH
We consider the problem of estimating covariance and precision matrices, and their associated discriminant coefficients, from normal data when the rank of the covariance matrix is strictly smaller than its dimension and the available sample size. Using unbiased risk estimation, we construct novel estimators by minimizing upper bounds on the difference in risk over several classes. Our proposed estimators are empirically demonstrated to offer substantial improvement over classical approaches.
• ### Penalized versus constrained generalized eigenvalue problems(1410.6131)

May 4, 2015 stat.CO, stat.ML
We investigate the difference between using an $\ell_1$ penalty versus an $\ell_1$ constraint in generalized eigenvalue problems, such as principal component analysis and discriminant analysis. Our main finding is that an $\ell_1$ penalty may fail to provide very sparse solutions, a severe disadvantage for variable selection that can be remedied by using an $\ell_1$ constraint. Our claims are supported both by empirical evidence and theoretical analysis. Finally, we illustrate the advantages of an $\ell_1$ constraint in the context of discriminant analysis and principal component analysis.
• ### Simultaneous sparse estimation of canonical vectors in the p>>N setting(1403.6095)

April 30, 2015 stat.ME, stat.ML
This article considers the problem of sparse estimation of canonical vectors in linear discriminant analysis when $p\gg N$. Several methods have been proposed in the literature that estimate one canonical vector in the two-group case. However, $G-1$ canonical vectors can be considered if the number of groups is $G$. In the multi-group context, it is common to estimate canonical vectors in a sequential fashion. Moreover, separate prior estimation of the covariance structure is often required. We propose a novel methodology for direct estimation of canonical vectors. In contrast to existing techniques, the proposed method estimates all canonical vectors at once, performs variable selection across all the vectors and comes with theoretical guarantees on the variable selection and classification consistency. First, we highlight the fact that in the $N>p$ setting the canonical vectors can be expressed in a closed form up to an orthogonal transformation. Secondly, we propose an extension of this form to the $p\gg N$ setting and achieve feature selection by using a group penalty. The resulting optimization problem is convex and can be solved using a block-coordinate descent algorithm. The practical performance of the method is evaluated through simulation studies as well as real data applications.
• ### Supervised Classification Using Sparse Fisher's LDA(1301.4976)

Sept. 16, 2014 stat.CO, stat.ML
It is well known that in a supervised classification setting when the number of features is smaller than the number of observations, Fisher's linear discriminant rule is asymptotically Bayes. However, there are numerous modern applications where classification is needed in the high-dimensional setting. Naive implementation of Fisher's rule in this case fails to provide good results because the sample covariance matrix is singular. Moreover, when a classifier relies on all features, interpreting the results is challenging. Our goal is to provide robust classification that relies only on a small subset of important features and accounts for the underlying correlation structure. We apply a lasso-type penalty to the discriminant vector to ensure sparsity of the solution and use a shrinkage type estimator for the covariance matrix. The resulting optimization problem is solved using an iterative coordinate ascent algorithm. Furthermore, we analyze the effect of nonconvexity on the sparsity level of the solution and highlight the difference between the penalized and the constrained versions of the problem. The simulation results show that the proposed method performs favorably in comparison to alternatives. The method is used to classify leukemia patients based on DNA methylation features.
• ### Noise Estimation in the Spiked Covariance Model(1408.6440)

Aug. 27, 2014 math.ST, stat.TH, stat.ME
The problem of estimating a spiked covariance matrix in high dimensions under Frobenius loss, and the parallel problem of estimating the noise in spiked PCA, are investigated. We propose an estimator of the noise parameter by minimizing an unbiased estimator of the invariant Frobenius risk using calculus of variations. The resulting estimator is shown, using random matrix theory, to be strongly consistent and essentially asymptotically normal and minimax for the noise estimation problem. We use this construction to obtain a robust spiked covariance matrix estimator with consistent eigenvalues.
• ### AIC, Cp and estimators of loss for elliptically symmetric distributions(1308.2766)

May 24, 2014 math.ST, stat.TH
In this article, we develop a modern perspective on Akaike's Information Criterion and Mallows' Cp for model selection. Despite the differences in their respective motivation, they are equivalent in the special case of Gaussian linear regression. In this case they are also equivalent to a third criterion, an unbiased estimator of the quadratic prediction loss, derived from loss estimation theory. Our first contribution is to provide an explicit link between loss estimation and model selection through a new oracle inequality. We then show that the form of the unbiased estimator of the quadratic prediction loss under a Gaussian assumption still holds under a more general distributional assumption, the family of spherically symmetric distributions. One of the features of our results is that our criterion does not rely on the specificity of the distribution, but only on its spherical symmetry. Also this family of laws offers some dependence property between the observations, a case not often studied.
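The equivalence stated in the abstract can be made concrete in the textbook Gaussian case. The following is a sketch under stated assumptions rather than the paper's own derivation: the noise variance $\sigma^2$ is taken as known, $n$ is the number of observations, $k$ is the number of regressors in the candidate model, and $\hat\beta$ is its least-squares estimator (all notation introduced here for illustration).

$$\delta_0 = \|Y - X\hat\beta\|^2 + (2k - n)\,\sigma^2, \qquad C_p = \frac{\|Y - X\hat\beta\|^2}{\sigma^2} + 2k - n = \frac{\delta_0}{\sigma^2},$$

$$\mathrm{AIC} = \frac{\|Y - X\hat\beta\|^2}{\sigma^2} + n\log(2\pi\sigma^2) + 2k = C_p + n + n\log(2\pi\sigma^2).$$

Here $\delta_0$ is unbiased for the quadratic prediction loss $\|X\hat\beta - X\beta\|^2$. Because the three criteria differ only by a positive scaling and by terms that do not depend on the candidate model, they rank models identically.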
• ### Improved multivariate normal mean estimation with unknown covariance when p is greater than n(1302.6746)

Feb. 27, 2013 math.ST, stat.TH
We consider the problem of estimating the mean vector of a p-variate normal $(\theta,\Sigma)$ distribution under invariant quadratic loss, $(\delta-\theta)'\Sigma^{-1}(\delta-\theta)$, when the covariance is unknown. We propose a new class of estimators that dominate the usual estimator $\delta^0(X)=X$. The proposed estimators of $\theta$ depend upon X and an independent Wishart matrix S with n degrees of freedom; however, S is singular almost surely when p>n. The proof of domination involves the development of some new unbiased estimators of risk for the p>n setting. We also find some relationships between the amount of domination and the magnitudes of n and p.
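For readers unfamiliar with what it means to dominate $\delta^0(X)=X$, the classical James-Stein estimator gives the simplest instance. It is not the estimator proposed in this paper: the sketch below assumes a known identity covariance and $p \geq 3$, whereas the abstract treats an unknown covariance with $p>n$. The function name is hypothetical.

```python
def james_stein(x):
    """Classical James-Stein estimate of a normal mean.

    Assumes X ~ N_p(theta, I) with known identity covariance and p >= 3.
    Returns (1 - (p - 2) / ||x||^2) * x.
    """
    p = len(x)
    if p < 3:
        raise ValueError("James-Stein shrinkage requires p >= 3")
    norm_sq = sum(v * v for v in x)
    if norm_sq == 0.0:
        # The shrinkage factor is undefined at the origin; return the origin.
        return [0.0] * p
    shrink = 1.0 - (p - 2) / norm_sq
    return [shrink * v for v in x]


# Toy observation with p = 4 and ||x||^2 = 25, so the factor is 1 - 2/25.
estimate = james_stein([3.0, 4.0, 0.0, 0.0])
print(estimate)
```

The estimator shrinks the observation toward the origin, and the amount of shrinkage decreases as $\|x\|^2$ grows.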
• ### On Improved Loss Estimation for Shrinkage Estimators(1203.4989)

March 22, 2012 stat.ME
Let $X$ be a random vector with distribution $P_{\theta}$ where $\theta$ is an unknown parameter. When estimating $\theta$ by some estimator $\varphi(X)$ under a loss function $L(\theta,\varphi)$, classical decision theory advocates that such a decision rule should be used if it has suitable properties with respect to the frequentist risk $R(\theta,\varphi)$. However, after having observed $X=x$, instances arise in practice in which $\varphi$ is to be accompanied by an assessment of its loss, $L(\theta,\varphi(x))$, which is unobservable since $\theta$ is unknown. A common approach to this assessment is to consider estimation of $L(\theta,\varphi(x))$ by an estimator $\delta$, called a loss estimator. We present an expository development of loss estimation with substantial emphasis on the setting where the distributional context is normal and its extension to the case where the underlying distribution is spherically symmetric. Our overview covers improved loss estimators for least squares but primarily focuses on shrinkage estimators. Bayes estimation is also considered and comparisons are made with unbiased estimation.
• ### MM Algorithms for Minimizing Nonsmoothly Penalized Objective Functions(1001.4776)

Jan. 21, 2011 stat.CO, math.ST, stat.TH
In this paper, we propose a general class of algorithms for optimizing an extensive variety of nonsmoothly penalized objective functions that satisfy certain regularity conditions. The proposed framework utilizes the majorization-minimization (MM) algorithm as its core optimization engine. The resulting algorithms rely on iterated soft-thresholding, implemented componentwise, allowing for fast, stable updating that avoids the need for any high-dimensional matrix inversion. We establish a local convergence theory for this class of algorithms under weaker assumptions than previously considered in the statistical literature. We also demonstrate the exceptional effectiveness of new acceleration methods, originally proposed for the EM algorithm, in this class of problems. Simulation results and a microarray data example are provided to demonstrate the algorithm's capabilities and versatility.
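Componentwise iterated soft-thresholding can be sketched for the simplest case, the lasso objective $\frac{1}{2}\|y - Xb\|^2 + \lambda\|b\|_1$. This is a generic illustration and not the paper's algorithm class or its acceleration scheme; the function names and the `step_bound` parameter are assumptions. Each update minimizes a quadratic majorizer of the smooth part, so no matrix inversion is needed.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning zero when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_mm(X, y, lam, step_bound, n_iter=200):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by iterated soft-thresholding.

    step_bound must be at least the largest eigenvalue of X'X so that each
    update minimizes a quadratic majorizer of the smooth term (an MM step).
    X is a list of rows; y is a list of responses.
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_iter):
        # Residuals and gradient of the smooth least-squares term.
        resid = [y[i] - sum(X[i][j] * b[j] for j in range(p)) for i in range(n)]
        grad = [-sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]
        # Componentwise update: gradient step followed by soft-thresholding.
        b = [
            soft_threshold(b[j] - grad[j] / step_bound, lam / step_bound)
            for j in range(p)
        ]
    return b


# With an identity design the minimizer is soft_threshold(y_j, lam) per coordinate.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, -0.5]
coef = lasso_mm(X, y, lam=1.0, step_bound=1.0)
print(coef)
```

For a general design, `step_bound` can be any upper bound on the largest eigenvalue of $X'X$, for example the sum of squared entries of $X$; a looser bound keeps the iteration stable but slows convergence.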
• ### Laplace Approximated EM Microarray Analysis: An Empirical Bayes Approach for Comparative Microarray Experiments(1101.0905)

Jan. 5, 2011 stat.ME
A two-groups mixed-effects model for the comparison of (normalized) microarray data from two treatment groups is considered. Most competing parametric methods that have appeared in the literature are obtained as special cases or by minor modification of the proposed model. Approximate maximum likelihood fitting is accomplished via a fast and scalable algorithm, which we call LEMMA (Laplace approximated EM Microarray Analysis). The posterior odds of treatment $\times$ gene interactions, derived from the model, involve shrinkage estimates of both the interactions and of the gene specific error variances. Genes are classified as being associated with treatment based on the posterior odds and the local false discovery rate (f.d.r.) with a fixed cutoff. Our model-based approach also allows one to declare the non-null status of a gene by controlling the false discovery rate (FDR). It is shown in a detailed simulation study that the approach outperforms well-known competitors. We also apply the proposed methodology to two previously analyzed microarray examples. Extensions of the proposed method to paired treatments and multiple treatments are also discussed.
• ### A Conversation with Shayle R. Searle(1001.3272)

Jan. 19, 2010 stat.ME
Born in New Zealand, Shayle Robert Searle earned a bachelor's degree (1949) and a master's degree (1950) from Victoria University, Wellington, New Zealand. After working for an actuary, Searle went to Cambridge University where he earned a Diploma in mathematical statistics in 1953. Searle won a Fulbright travel award to Cornell University, where he earned a doctorate in animal breeding, with a strong minor in statistics in 1959, studying under Professor Charles Henderson. In 1962, Cornell invited Searle to work in the university's computing center, and he soon joined the faculty as an assistant professor of biological statistics. He was promoted to associate professor in 1965, and became a professor of biological statistics in 1970. Searle has also been a visiting professor at Texas A&M University, Florida State University, Universität Augsburg and the University of Auckland. He has published several statistics textbooks and has authored more than 165 papers. Searle is a Fellow of the American Statistical Association, the Royal Statistical Society, and he is an elected member of the International Statistical Institute. He also has received the prestigious Alexander von Humboldt U.S. Senior Scientist Award, is an Honorary Fellow of the Royal Society of New Zealand and was recently awarded the D.Sc. Honoris Causa by his alma mater, Victoria University of Wellington, New Zealand.
• ### A Multivariate Variance Components Model for Analysis of Covariance in Designed Experiments(1001.3011)

Jan. 18, 2010 stat.ME
Traditional methods for covariate adjustment of treatment means in designed experiments are inherently conditional on the observed covariate values. In order to develop a coherent general methodology for analysis of covariance, we propose a multivariate variance components model for the joint distribution of the response and covariates. It is shown that, if the design is orthogonal with respect to (random) blocking factors, then appropriate adjustments to treatment means can be made using the univariate variance components model obtained by conditioning on the observed covariate values. However, it is revealed that some widely used models are incorrectly specified, leading to biased estimates and incorrect standard errors. The approach clarifies some issues that have been the source of ongoing confusion in the statistics literature.