• We consider the estimation and inference of graphical models that characterize the dependency structure of high-dimensional tensor-valued data. To facilitate the estimation of the precision matrix corresponding to each way of the tensor, we assume the data follow a tensor normal distribution whose covariance has a Kronecker product structure. A critical challenge in the estimation and inference of this model is the fact that its penalized maximum likelihood estimation involves minimizing a non-convex objective function. To address it, this paper makes two contributions: (i) In spite of the non-convexity of this estimation problem, we prove that an alternating minimization algorithm, which iteratively estimates each sparse precision matrix while fixing the others, attains an estimator with an optimal statistical rate of convergence. (ii) We propose a de-biased statistical inference procedure for testing hypotheses on the true support of the sparse precision matrices, and employ it for testing a growing number of hypothesis with false discovery rate (FDR) control. The asymptotic normality of our test statistic and the consistency of FDR control procedure are established. Our theoretical results are backed up by thorough numerical studies and our real applications on neuroimaging studies of Autism spectrum disorder and users' advertising click analysis bring new scientific findings and business insights. The proposed methods are encoded into a publicly available R package Tlasso.
  • Sparse generalized eigenvalue problem (GEP) plays a pivotal role in a large family of high-dimensional statistical models, including sparse Fisher's discriminant analysis, canonical correlation analysis, and sufficient dimension reduction. Sparse GEP involves solving a non-convex optimization problem. Most existing methods and theory in the context of specific statistical models that are special cases of the sparse GEP require restrictive structural assumptions on the input matrices. In this paper, we propose a two-stage computational framework to solve the sparse GEP. At the first stage, we solve a convex relaxation of the sparse GEP. Taking the solution as an initial value, we then exploit a nonconvex optimization perspective and propose the truncated Rayleigh flow method (Rifle) to estimate the leading generalized eigenvector. We show that Rifle converges linearly to a solution with the optimal statistical rate of convergence for many statistical models. Theoretically, our method significantly improves upon the existing literature by eliminating structural assumptions on the input matrices for both stages. To achieve this, our analysis involves two key ingredients: (i) a new analysis of the gradient based method on nonconvex objective functions, and (ii) a fine-grained characterization of the evolution of sparsity patterns along the solution path. Thorough numerical studies are provided to validate the theoretical results.
  • This paper explores the information-theoretic limitations of graph property testing in zero-field Ising models. Instead of learning the entire graph structure, sometimes testing a basic graph property such as connectivity, cycle presence or maximum clique size is a more relevant and attainable objective. Since property testing is more fundamental than graph recovery, any necessary conditions for property testing imply corresponding conditions for graph recovery, while custom property tests can be statistically and/or computationally more efficient than graph recovery based algorithms. Understanding the statistical complexity of property testing requires the distinction of ferromagnetic (i.e., positive interactions only) and general Ising models. Using combinatorial constructs such as graph packing and strong monotonicity, we characterize how target properties affect the corresponding minimax upper and lower bounds within the realm of ferromagnets. On the other hand, by studying the detection of an antiferromagnetic (i.e., negative interactions only) Curie-Weiss model buried in Rademacher noise, we show that property testing is strictly more challenging over general Ising models. In terms of methodological development, we propose two types of correlation based tests: computationally efficient screening for ferromagnets, and score type tests for general models, including a fast cycle presence test. Our correlation screening tests match the information-theoretic bounds for property testing in ferromagnets.
  • We consider estimating the parametric components of semi-parametric multiple index models in a high-dimensional and non-Gaussian setting. Such models form a rich class of non-linear models with applications to signal processing, machine learning and statistics. Our estimators leverage the score function based first and second-order Stein's identities and do not require the covariates to satisfy Gaussian or elliptical symmetry assumptions common in the literature. Moreover, to handle score functions and responses that are heavy-tailed, our estimators are constructed via carefully thresholding their empirical counterparts. We show that our estimator achieves near-optimal statistical rate of convergence in several settings. We supplement our theoretical results via simulation experiments that confirm the theory.
  • User and item features of side information are crucial for accurate recommendation. However, the large number of feature dimensions, e.g., usually larger than 10^7, results in expensive storage and computational cost. This prohibits fast recommendation especially on mobile applications where the computational resource is very limited. In this paper, we develop a generic feature-based recommendation model, called Discrete Factorization Machine (DFM), for fast and accurate recommendation. DFM binarizes the real-valued model parameters (e.g., float32) of every feature embedding into binary codes (e.g., boolean), and thus supports efficient storage and fast user-item score computation. To avoid the severe quantization loss of the binarization, we propose a convergent updating rule that resolves the challenging discrete optimization of DFM. Through extensive experiments on two real-world datasets, we show that 1) DFM consistently outperforms state-of-the-art binarized recommendation models, and 2) DFM shows very competitive performance compared to its real-valued version (FM), demonstrating the minimized quantization loss.
  • There has been an increasing interest in testing the equality of large Pearson's correlation matrices. However, in many applications it is more important to test the equality of large rank-based correlation matrices since they are more robust to outliers and nonlinearity. Unlike the Pearson's case, testing the equality of large rank-based statistics has not been well explored and requires us to develop new methods and theory. In this paper, we provide a framework for testing the equality of two large U-statistic based correlation matrices, which include the rank-based correlation matrices as special cases. Our approach exploits extreme value statistics and the Jackknife estimator for uncertainty assessment and is valid under a fully nonparametric model. Theoretically, we develop a theory for testing the equality of U-statistic based correlation matrices. We then apply this theory to study the problem of testing large Kendall's tau correlation matrices and demonstrate its optimality. For proving this optimality, a novel construction of least favourable distributions is developed for the correlation matrix comparison.
  • Offensive or antagonistic language targeted at individuals and social groups based on their personal characteristics (also known as cyber hate speech or cyberhate) has been frequently posted and widely circulated viathe World Wide Web. This can be considered as a key risk factor for individual and societal tension linked toregional instability. Automated Web-based cyberhate detection is important for observing and understandingcommunity and regional societal tension - especially in online social networks where posts can be rapidlyand widely viewed and disseminated. While previous work has involved using lexicons, bags-of-words orprobabilistic language parsing approaches, they often suffer from a similar issue which is that cyberhate can besubtle and indirect - thus depending on the occurrence of individual words or phrases can lead to a significantnumber of false negatives, providing inaccurate representation of the trends in cyberhate. This problemmotivated us to challenge thinking around the representation of subtle language use, such as references toperceived threats from "the other" including immigration or job prosperity in a hateful context. We propose anovel framework that utilises language use around the concept of "othering" and intergroup threat theory toidentify these subtleties and we implement a novel classification method using embedding learning to computesemantic distances between parts of speech considered to be part of an "othering" narrative. To validate ourapproach we conduct several experiments on different types of cyberhate, namely religion, disability, race andsexual orientation, with F-measure scores for classifying hateful instances obtained through applying ourmodel of 0.93, 0.86, 0.97 and 0.98 respectively, providing a significant improvement in classifier accuracy overthe state-of-the-art
  • We consider the problem of \emph{fully decentralized} multi-agent reinforcement learning (MARL), where the agents are located at the nodes of a time-varying communication network. Specifically, we assume that the reward functions of the agents might correspond to different tasks, and are only known to the corresponding agent. Moreover, each agent makes individual decisions based on both the information observed locally and the messages received from its neighbors over the network. Within this setting, the collective goal of the agents is to maximize the globally averaged return over the network through exchanging information with their neighbors. To this end, we propose two decentralized actor-critic algorithms with function approximation, which are applicable to large-scale MARL problems where both the number of states and the number of agents are massively large. Under the decentralized structure, the actor step is performed individually by each agent with no need to infer the policies of others. For the critic step, we propose a consensus update via communication over the network. Our algorithms are fully incremental and can be implemented in an online fashion. Convergence analyses of the algorithms are provided when the value functions are approximated within the class of linear functions. Extensive simulation results with both linear and nonlinear function approximations are presented to validate the proposed algorithms. Our work appears to be the first study of fully decentralized MARL algorithms for networked agents with function approximation, with provable convergence guarantees.
  • Many machine learning techniques sacrifice convenient computational structures to gain estimation robustness and modeling flexibility. However, by exploring the modeling structures, we find these "sacrifices" do not always require more computational efforts. To shed light on such a "free-lunch" phenomenon, we study the square-root-Lasso (SQRT-Lasso) type regression problem. Specifically, we show that the nonsmooth loss functions of SQRT-Lasso type regression ease tuning effort and gain adaptivity to inhomogeneous noise, but is not necessarily more challenging than Lasso in computation. We can directly apply proximal algorithms (e.g. proximal gradient descent, proximal Newton, and proximal Quasi-Newton algorithms) without worrying the nonsmoothness of the loss function. Theoretically, we prove that the proximal algorithms combined with the pathwise optimization scheme enjoy fast convergence guarantees with high probability. Numerical results are provided to support our theory.
  • We propose a new family of combinatorial inference problems for graphical models. Unlike classical statistical inference where the main interest is point estimation or parameter testing, combinatorial inference aims at testing the global structure of the underlying graph. Examples include testing the graph connectivity, the presence of a cycle of certain size, or the maximum degree of the graph. To begin with, we develop a unified theory for the fundamental limits of a large family of combinatorial inference problems. We propose new concepts including structural packing and buffer entropies to characterize how the complexity of combinatorial graph structures impacts the corresponding minimax lower bounds. On the other hand, we propose a family of novel and practical structural testing algorithms to match the lower bounds. We provide thorough numerical results on both synthetic graphical models and brain networks to illustrate the usefulness of these proposed methods.
  • We propose a novel class of time-varying nonparanormal graphical models, which allows us to model high dimensional heavy-tailed systems and the evolution of their latent network structures. Under this model, we develop statistical tests for presence of edges both locally at a fixed index value and globally over a range of values. The tests are developed for a high-dimensional regime, are robust to model selection mistakes and do not require commonly assumed minimum signal strength. The testing procedures are based on a high dimensional, debiasing-free moment estimator, which uses a novel kernel smoothed Kendall's tau correlation matrix as an input statistic. The estimator consistently estimates the latent inverse Pearson correlation matrix uniformly in both the index variable and kernel bandwidth. Its rate of convergence is shown to be minimax optimal. Our method is supported by thorough numerical simulations and an application to a neural imaging data set.
  • We develop a novel procedure for constructing confidence bands for components of a sparse additive model. Our procedure is based on a new kernel-sieve hybrid estimator that combines two most popular nonparametric estimation methods in the literature, the kernel regression and the spline method, and is of interest in its own right. Existing methods for fitting sparse additive model are primarily based on sieve estimators, while the literature on confidence bands for nonparametric models are primarily based upon kernel or local polynomial estimators. Our kernel-sieve hybrid estimator combines the best of both worlds and allows us to provide a simple procedure for constructing confidence bands in high-dimensional sparse additive models. We prove that the confidence bands are asymptotically honest by studying approximation with a Gaussian process. Thorough numerical results on both synthetic data and real-world neuroscience data are provided to demonstrate the efficacy of the theory.
  • We propose a general theory for studying the \xl{landscape} of nonconvex \xl{optimization} with underlying symmetric structures \tz{for a class of machine learning problems (e.g., low-rank matrix factorization, phase retrieval, and deep linear neural networks)}. In specific, we characterize the locations of stationary points and the null space of Hessian matrices \xl{of the objective function} via the lens of invariant groups\removed{for associated optimization problems, including low-rank matrix factorization, phase retrieval, and deep linear neural networks}. As a major motivating example, we apply the proposed general theory to characterize the global \xl{landscape} of the \xl{nonconvex optimization in} low-rank matrix factorization problem. In particular, we illustrate how the rotational symmetry group gives rise to infinitely many nonisolated strict saddle points and equivalent global minima of the objective function. By explicitly identifying all stationary points, we divide the entire parameter space into three regions: ($\cR_1$) the region containing the neighborhoods of all strict saddle points, where the objective has negative curvatures; ($\cR_2$) the region containing neighborhoods of all global minima, where the objective enjoys strong convexity along certain directions; and ($\cR_3$) the complement of the above regions, where the gradient has sufficiently large magnitudes. We further extend our result to the matrix sensing problem. Such global landscape implies strong global convergence guarantees for popular iterative algorithms with arbitrary initial solutions.
  • We propose a stochastic variance reduced optimization algorithm for solving sparse learning problems with cardinality constraints. Sufficient conditions are provided, under which the proposed algorithm enjoys strong linear convergence guarantees and optimal estimation accuracy in high dimensions. We further extend the proposed algorithm to an asynchronous parallel variant with a near linear speedup. Numerical experiments demonstrate the efficiency of our algorithm in terms of both parameter estimation and computational performance.
  • In this paper, we generalize (accelerated) Newton's method with cubic regularization under inexact second-order information for (strongly) convex optimization problems. Under mild assumptions, we provide global rate of convergence of these methods and show the explicit dependence of the rate of convergence on the problem parameters. While the complexity bounds of our presented algorithms are theoretically worse than those of their exact counterparts, they are at least as good as those of the optimal first-order methods. Our numerical experiments also show that using inexact Hessians can significantly speed up the algorithms in practice.
  • Principal component analysis (PCA) has been a prominent tool for high-dimensional data analysis. Online algorithms that estimate the principal component by processing streaming data are of tremendous practical and theoretical interests. Despite its rich applications, theoretical convergence analysis remains largely open. In this paper, we cast online PCA into a stochastic nonconvex optimization problem, and we analyze the online PCA algorithm as a stochastic approximation iteration. The stochastic approximation iteration processes data points incrementally and maintains a running estimate of the principal component. We prove for the first time a nearly optimal finite-sample error bound for the online PCA algorithm. Under the subgaussian assumption, we show that the finite-sample error bound closely matches the minimax information lower bound.
  • We develop a new modeling framework for Inter-Subject Analysis (ISA). The goal of ISA is to explore the dependency structure between different subjects with the intra-subject dependency as nuisance. It has important applications in neuroscience to explore the functional connectivity between brain regions under natural stimuli. Our framework is based on the Gaussian graphical models, under which ISA can be converted to the problem of estimation and inference of the inter-subject precision matrix. The main statistical challenge is that we do not impose sparsity constraint on the whole precision matrix and we only assume the inter-subject part is sparse. For estimation, we propose to estimate an alternative parameter to get around the non-sparse issue and it can achieve asymptotic consistency even if the intra-subject dependency is dense. For inference, we propose an "untangle and chord" procedure to de-bias our estimator. It is valid without the sparsity assumption on the inverse Hessian of the log-likelihood function. This inferential method is general and can be applied to many other statistical problems, thus it is of independent theoretical interest. Numerical experiments on both simulated and brain imaging data validate our methods and theory.
  • We consider the problem of undirected graphical model inference. In many applications, instead of perfectly recovering the unknown graph structure, a more realistic goal is to infer some graph invariants (e.g., the maximum degree, the number of connected subgraphs, the number of isolated nodes). In this paper, we propose a new inferential framework for testing nested multiple hypotheses and constructing confidence intervals of the unknown graph invariants under undirected graphical models. Compared to perfect graph recovery, our methods require significantly weaker conditions. This paper makes two major contributions: (i) Methodologically, for testing nested multiple hypotheses, we propose a skip-down algorithm on the whole family of monotone graph invariants (The invariants which are non-decreasing under addition of edges). We further show that the same skip-down algorithm also provides valid confidence intervals for the targeted graph invariants. (ii) Theoretically, we prove that the length of the obtained confidence intervals are optimal and adaptive to the unknown signal strength. We also prove generic lower bounds for the confidence interval length for various invariants. Numerical results on both synthetic simulations and a brain imaging dataset are provided to illustrate the usefulness of the proposed method.
  • We consider the testing of mutual independence among all entries in a $d$-dimensional random vector based on $n$ independent observations. We study two families of distribution-free test statistics, which include Kendall's tau and Spearman's rho as important examples. We show that under the null hypothesis the test statistics of these two families converge weakly to Gumbel distributions, and propose tests that control the type I error in the high-dimensional setting where $d>n$. We further show that the two tests are rate-optimal in terms of power against sparse alternatives, and outperform competitors in simulations, especially when $d$ is large.
  • We consider the problem of learning high-dimensional Gaussian graphical models. The graphical lasso is one of the most popular methods for estimating Gaussian graphical models. However, it does not achieve the oracle rate of convergence. In this paper, we propose the graphical nonconvex optimization for optimal estimation in Gaussian graphical models, which is then approximated by a sequence of convex programs. Our proposal is computationally tractable and produces an estimator that achieves the oracle rate of convergence. The statistical error introduced by the sequential approximation using the convex programs are clearly demonstrated via a contraction property. The rate of convergence can be further improved using the notion of sparsity pattern. The proposed methodology is then extended to semiparametric graphical models. We show through numerical studies that the proposed estimator outperforms other popular methods for estimating Gaussian graphical models.
  • Developments in deep generative models have allowed for tractable learning of high-dimensional data distributions. While the employed learning procedures typically assume that training data is drawn i.i.d. from the distribution of interest, it may be desirable to model distinct distributions which are observed sequentially, such as when different classes are encountered over time. Although conditional variations of deep generative models permit multiple distributions to be modeled by a single network in a disentangled fashion, they are susceptible to catastrophic forgetting when the distributions are encountered sequentially. In this paper, we adapt recent work in reducing catastrophic forgetting to the task of training generative adversarial networks on a sequence of distinct distributions, enabling continual generative modeling.
  • This paper studies the asymptotic behavior of smooth solutions to the generalized Hall-magneto-hydrodynamics system (1.1) with one single diffusion on the whole space $\mathbb R^3$. We establish that, in the inviscid resistive case, the energy $\|b(t)\|_2^2$ vanishes and $\|u(t)\|_2^2$ converges to a constant as time tends to infinity provided the velocity is bounded in $W^{1-\alpha,\frac3\alpha}(\mathbb R^3)$; in the viscous non-resistive case, the energy $\|u(t)\|_2^2$ vanishes and $\|b(t)\|_2^2$ converges to a constant provided the magnetic field is bounded in $W^{1-\beta,\infty}(\mathbb R^3)$. In summary, one single diffusion, being as weak as $(-\Delta)^\alpha b$ or $(-\Delta)^\beta u$ with small enough $\alpha, \beta$, is sufficient to prevent asymptotic energy oscillations for certain smooth solutions to the system.
  • We propose a computational framework named iterative local adaptive majorize-minimization (I-LAMM) to simultaneously control algorithmic complexity and statistical error when fitting high dimensional models. I-LAMM is a two-stage algorithmic implementation of the local linear approximation to a family of folded concave penalized quasi-likelihood. The first stage solves a convex program with a crude precision tolerance to obtain a coarse initial estimator, which is further refined in the second stage by iteratively solving a sequence of convex programs with smaller precision tolerances. Theoretically, we establish a phase transition: the first stage has a sublinear iteration complexity, while the second stage achieves an improved linear rate of convergence. Though this framework is completely algorithmic, it provides solutions with optimal statistical performances and controlled algorithmic complexity for a large family of nonconvex optimization problems. The iteration effects on statistical errors are clearly demonstrated via a contraction property. Our theory relies on a localized version of the sparse/restricted eigenvalue condition, which allows us to analyze a large family of loss and penalty functions and provide optimality guarantees under very weak assumptions (For example, I-LAMM requires much weaker minimal signal strength than other procedures). Thorough numerical results are provided to support the obtained theory.
  • High dimensional sparse learning has imposed a great computational challenge to large scale data analysis. In this paper, we are interested in a broad class of sparse learning approaches formulated as linear programs parametrized by a {\em regularization factor}, and solve them by the parametric simplex method (PSM). Our parametric simplex method offers significant advantages over other competing methods: (1) PSM naturally obtains the complete solution path for all values of the regularization parameter; (2) PSM provides a high precision dual certificate stopping criterion; (3) PSM yields sparse solutions through very few iterations, and the solution sparsity significantly reduces the computational cost per iteration. Particularly, we demonstrate the superiority of PSM over various sparse learning approaches, including Dantzig selector for sparse linear regression, LAD-Lasso for sparse robust linear regression, CLIME for sparse precision matrix estimation, sparse differential network estimation, and sparse Linear Programming Discriminant (LPD) analysis. We then provide sufficient conditions under which PSM always outputs sparse solutions such that its computational performance can be significantly boosted. Thorough numerical experiments are provided to demonstrate the outstanding performance of the PSM method.
  • The cyclic block coordinate descent-type (CBCD-type) methods, which performs iterative updates for a few coordinates (a block) simultaneously throughout the procedure, have shown remarkable computational performance for solving strongly convex minimization problems. Typical applications include many popular statistical machine learning methods such as elastic-net regression, ridge penalized logistic regression, and sparse additive regression. Existing optimization literature has shown that for strongly convex minimization, the CBCD-type methods attain iteration complexity of $\cO(p\log(1/\epsilon))$, where $\epsilon$ is a pre-specified accuracy of the objective value, and $p$ is the number of blocks. However, such iteration complexity explicitly depends on $p$, and therefore is at least $p$ times worse than the complexity $\cO(\log(1/\epsilon))$ of gradient descent (GD) methods. To bridge this theoretical gap, we propose an improved convergence analysis for the CBCD-type methods. In particular, we first show that for a family of quadratic minimization problems, the iteration complexity $\cO(\log^2(p)\cdot\log(1/\epsilon))$ of the CBCD-type methods matches that of the GD methods in term of dependency on $p$, up to a $\log^2 p$ factor. Thus our complexity bounds are sharper than the existing bounds by at least a factor of $p/\log^2(p)$. We also provide a lower bound to confirm that our improved complexity bounds are tight (up to a $\log^2 p$ factor), under the assumption that the largest and smallest eigenvalues of the Hessian matrix do not scale with $p$. Finally, we generalize our analysis to other strongly convex minimization problems beyond quadratic ones.