• In contemporary scientific research, it is of great interest to predict a categorical response based on a high-dimensional tensor (i.e. multi-dimensional array) and additional covariates. This mixture of different types of data leads to challenges in statistical analysis. Motivated by applications in science and engineering, we propose a comprehensive and interpretable discriminant analysis model, called CATCH model (in short for Covariate-Adjusted Tensor Classification in High-dimensions), which efficiently integrates the covariates and the tensor to predict the categorical outcome. The CATCH model jointly models the relationships among the covariates, the tensor predictor, and the categorical response. More importantly, it preserves and utilizes the structures of the data for maximum interpretability and optimal prediction. To tackle the new computational and statistical challenges arising from the intimidating tensor dimensions, we propose a penalized approach to select a subset of tensor predictor entries that has direct discriminative effect after adjusting for covariates. We further develop an efficient algorithm that takes advantage of the tensor structure. Theoretical results confirm that our method achieves variable selection consistency and optimal classification error, even when the tensor dimension is much larger than the sample size. The superior performance of our method over existing methods is demonstrated in extensive simulated and real data examples.
  • An envelope is a targeted dimension reduction subspace for simultaneously achieving dimension reduction and improving parameter estimation efficiency. While many envelope methods have been proposed in recent years, all envelope methods hinge on the knowledge of a key hyperparameter, the structural dimension of the envelope. How to estimate the envelope dimension consistently is of substantial interest from both theoretical and practical aspects. Moreover, very recent advances in the literature have generalized envelope as a model-free method, which makes selecting the envelope dimension even more challenging. Likelihood-based approaches such as information criteria and likelihood-ratio tests either cannot be directly applied or have no theoretical justification. To address this critical issue of dimension selection, we propose two unified approaches -- called FG and 1D selections -- for determining the envelope dimension that can be applied to any envelope models and methods. The two model-free selection approaches are based on the two different envelope optimization procedures: the full Grassmannian (FG) optimization and the 1D algorithm (Cook and Zhang, 2016), and are shown to be capable of correctly identifying the structural dimension with a probability tending to 1 under mild moment conditions as the sample size increases. While the FG selection unifies and generalizes the BIC and modified BIC approaches that existing in the literature, and hence provides the theoretical justification of them under weak moment condition and model-free context, the 1D selection is computationally more stable and efficient in finite sample. Extensive simulations and a real data analysis demonstrate the superb performance of our proposals.
  • A new model-free screening method called the fused Kolmogorov filter is proposed for high-dimensional data analysis. This new method is fully nonparametric and can work with many types of covariates and response variables, including continuous, discrete and categorical variables. We apply the fused Kolmogorov filter to deal with variable screening problems emerging from a wide range of applications, such as multiclass classification, nonparametric regression and Poisson regression, among others. It is shown that the fused Kolmogorov filter enjoys the sure screening property under weak regularity conditions that are much milder than those required for many existing nonparametric screening methods. In particular, the fused Kolmogorov filter can still be powerful when covariates are strongly dependent on each other. We further demonstrate the superior performance of the fused Kolmogorov filter over existing screening methods by simulations and real data examples.
  • In recent years many sparse linear discriminant analysis methods have been proposed for high-dimensional classification and variable selection. However, most of these proposals focus on binary classification and they are not directly applicable to multiclass classification problems. There are two sparse discriminant analysis methods that can handle multiclass classification problems, but their theoretical justifications remain unknown. In this paper, we propose a new multiclass sparse discriminant analysis method that estimates all discriminant directions simultaneously. We show that when applied to the binary case our proposal yields a classification direction that is equivalent to those by two successful binary sparse LDA methods in the literature. An efficient algorithm is developed for computing our method with high-dimensional data. Variable selection consistency and rates of convergence are established under the ultrahigh dimensionality setting. We further demonstrate the superior performance of our proposal over the existing methods on simulated and real data.
  • In recent years, a considerable amount of work has been devoted to generalizing linear discriminant analysis to overcome its incompetence for high-dimensional classification (Witten & Tibshirani 2011, Cai & Liu 2011, Mai et al. 2012, Fan et al. 2012). In this paper, we develop high-dimensional semiparametric sparse discriminant analysis (HD-SeSDA) that generalizes the normal-theory discriminant analysis in two ways: it relaxes the Gaussian assumptions and can handle non-polynomial (NP) dimension classification problems. If the underlying Bayes rule is sparse, HD-SeSDA can estimate the Bayes rule and select the true features simultaneously with overwhelming probability, as long as the logarithm of dimension grows slower than the cube root of sample size. Simulated and real examples are used to demonstrate the finite sample performance of HD-SeSDA. At the core of the theory is a new exponential concentration bound for semiparametric Gaussian copulas, which is of independent interest.