
We consider the estimation and inference of graphical models that
characterize the dependency structure of highdimensional tensorvalued data.
To facilitate the estimation of the precision matrix corresponding to each way
of the tensor, we assume the data follow a tensor normal distribution whose
covariance has a Kronecker product structure. A critical challenge in the
estimation and inference of this model is the fact that its penalized maximum
likelihood estimation involves minimizing a nonconvex objective function. To
address it, this paper makes two contributions: (i) In spite of the
nonconvexity of this estimation problem, we prove that an alternating
minimization algorithm, which iteratively estimates each sparse precision
matrix while fixing the others, attains an estimator with an optimal
statistical rate of convergence. (ii) We propose a debiased statistical
inference procedure for testing hypotheses on the true support of the sparse
precision matrices, and employ it for testing a growing number of hypothesis
with false discovery rate (FDR) control. The asymptotic normality of our test
statistic and the consistency of FDR control procedure are established. Our
theoretical results are backed up by thorough numerical studies and our real
applications on neuroimaging studies of Autism spectrum disorder and users'
advertising click analysis bring new scientific findings and business insights.
The proposed methods are encoded into a publicly available R package Tlasso.

Sparse generalized eigenvalue problem (GEP) plays a pivotal role in a large
family of highdimensional statistical models, including sparse Fisher's
discriminant analysis, canonical correlation analysis, and sufficient dimension
reduction. Sparse GEP involves solving a nonconvex optimization problem. Most
existing methods and theory in the context of specific statistical models that
are special cases of the sparse GEP require restrictive structural assumptions
on the input matrices. In this paper, we propose a twostage computational
framework to solve the sparse GEP. At the first stage, we solve a convex
relaxation of the sparse GEP. Taking the solution as an initial value, we then
exploit a nonconvex optimization perspective and propose the truncated Rayleigh
flow method (Rifle) to estimate the leading generalized eigenvector. We show
that Rifle converges linearly to a solution with the optimal statistical rate
of convergence for many statistical models. Theoretically, our method
significantly improves upon the existing literature by eliminating structural
assumptions on the input matrices for both stages. To achieve this, our
analysis involves two key ingredients: (i) a new analysis of the gradient based
method on nonconvex objective functions, and (ii) a finegrained
characterization of the evolution of sparsity patterns along the solution path.
Thorough numerical studies are provided to validate the theoretical results.

This paper explores the informationtheoretic limitations of graph property
testing in zerofield Ising models. Instead of learning the entire graph
structure, sometimes testing a basic graph property such as connectivity, cycle
presence or maximum clique size is a more relevant and attainable objective.
Since property testing is more fundamental than graph recovery, any necessary
conditions for property testing imply corresponding conditions for graph
recovery, while custom property tests can be statistically and/or
computationally more efficient than graph recovery based algorithms.
Understanding the statistical complexity of property testing requires the
distinction of ferromagnetic (i.e., positive interactions only) and general
Ising models. Using combinatorial constructs such as graph packing and strong
monotonicity, we characterize how target properties affect the corresponding
minimax upper and lower bounds within the realm of ferromagnets. On the other
hand, by studying the detection of an antiferromagnetic (i.e., negative
interactions only) CurieWeiss model buried in Rademacher noise, we show that
property testing is strictly more challenging over general Ising models. In
terms of methodological development, we propose two types of correlation based
tests: computationally efficient screening for ferromagnets, and score type
tests for general models, including a fast cycle presence test. Our correlation
screening tests match the informationtheoretic bounds for property testing in
ferromagnets.

We consider estimating the parametric components of semiparametric multiple
index models in a highdimensional and nonGaussian setting. Such models form a
rich class of nonlinear models with applications to signal processing, machine
learning and statistics. Our estimators leverage the score function based first
and secondorder Stein's identities and do not require the covariates to
satisfy Gaussian or elliptical symmetry assumptions common in the literature.
Moreover, to handle score functions and responses that are heavytailed, our
estimators are constructed via carefully thresholding their empirical
counterparts. We show that our estimator achieves nearoptimal statistical rate
of convergence in several settings. We supplement our theoretical results via
simulation experiments that confirm the theory.

User and item features of side information are crucial for accurate
recommendation. However, the large number of feature dimensions, e.g., usually
larger than 10^7, results in expensive storage and computational cost. This
prohibits fast recommendation especially on mobile applications where the
computational resource is very limited. In this paper, we develop a generic
featurebased recommendation model, called Discrete Factorization Machine
(DFM), for fast and accurate recommendation. DFM binarizes the realvalued
model parameters (e.g., float32) of every feature embedding into binary codes
(e.g., boolean), and thus supports efficient storage and fast useritem score
computation. To avoid the severe quantization loss of the binarization, we
propose a convergent updating rule that resolves the challenging discrete
optimization of DFM. Through extensive experiments on two realworld datasets,
we show that 1) DFM consistently outperforms stateoftheart binarized
recommendation models, and 2) DFM shows very competitive performance compared
to its realvalued version (FM), demonstrating the minimized quantization loss.

There has been an increasing interest in testing the equality of large
Pearson's correlation matrices. However, in many applications it is more
important to test the equality of large rankbased correlation matrices since
they are more robust to outliers and nonlinearity. Unlike the Pearson's case,
testing the equality of large rankbased statistics has not been well explored
and requires us to develop new methods and theory. In this paper, we provide a
framework for testing the equality of two large Ustatistic based correlation
matrices, which include the rankbased correlation matrices as special cases.
Our approach exploits extreme value statistics and the Jackknife estimator for
uncertainty assessment and is valid under a fully nonparametric model.
Theoretically, we develop a theory for testing the equality of Ustatistic
based correlation matrices. We then apply this theory to study the problem of
testing large Kendall's tau correlation matrices and demonstrate its
optimality. For proving this optimality, a novel construction of least
favourable distributions is developed for the correlation matrix comparison.

Offensive or antagonistic language targeted at individuals and social groups
based on their personal characteristics (also known as cyber hate speech or
cyberhate) has been frequently posted and widely circulated viathe World Wide
Web. This can be considered as a key risk factor for individual and societal
tension linked toregional instability. Automated Webbased cyberhate detection
is important for observing and understandingcommunity and regional societal
tension  especially in online social networks where posts can be rapidlyand
widely viewed and disseminated. While previous work has involved using
lexicons, bagsofwords orprobabilistic language parsing approaches, they often
suffer from a similar issue which is that cyberhate can besubtle and indirect 
thus depending on the occurrence of individual words or phrases can lead to a
significantnumber of false negatives, providing inaccurate representation of
the trends in cyberhate. This problemmotivated us to challenge thinking around
the representation of subtle language use, such as references toperceived
threats from "the other" including immigration or job prosperity in a hateful
context. We propose anovel framework that utilises language use around the
concept of "othering" and intergroup threat theory toidentify these subtleties
and we implement a novel classification method using embedding learning to
computesemantic distances between parts of speech considered to be part of an
"othering" narrative. To validate ourapproach we conduct several experiments on
different types of cyberhate, namely religion, disability, race andsexual
orientation, with Fmeasure scores for classifying hateful instances obtained
through applying ourmodel of 0.93, 0.86, 0.97 and 0.98 respectively, providing
a significant improvement in classifier accuracy overthe stateoftheart

We consider the problem of \emph{fully decentralized} multiagent
reinforcement learning (MARL), where the agents are located at the nodes of a
timevarying communication network. Specifically, we assume that the reward
functions of the agents might correspond to different tasks, and are only known
to the corresponding agent. Moreover, each agent makes individual decisions
based on both the information observed locally and the messages received from
its neighbors over the network. Within this setting, the collective goal of the
agents is to maximize the globally averaged return over the network through
exchanging information with their neighbors. To this end, we propose two
decentralized actorcritic algorithms with function approximation, which are
applicable to largescale MARL problems where both the number of states and the
number of agents are massively large. Under the decentralized structure, the
actor step is performed individually by each agent with no need to infer the
policies of others. For the critic step, we propose a consensus update via
communication over the network. Our algorithms are fully incremental and can be
implemented in an online fashion. Convergence analyses of the algorithms are
provided when the value functions are approximated within the class of linear
functions. Extensive simulation results with both linear and nonlinear function
approximations are presented to validate the proposed algorithms. Our work
appears to be the first study of fully decentralized MARL algorithms for
networked agents with function approximation, with provable convergence
guarantees.

Many machine learning techniques sacrifice convenient computational
structures to gain estimation robustness and modeling flexibility. However, by
exploring the modeling structures, we find these "sacrifices" do not always
require more computational efforts. To shed light on such a "freelunch"
phenomenon, we study the squarerootLasso (SQRTLasso) type regression
problem. Specifically, we show that the nonsmooth loss functions of SQRTLasso
type regression ease tuning effort and gain adaptivity to inhomogeneous noise,
but is not necessarily more challenging than Lasso in computation. We can
directly apply proximal algorithms (e.g. proximal gradient descent, proximal
Newton, and proximal QuasiNewton algorithms) without worrying the
nonsmoothness of the loss function. Theoretically, we prove that the proximal
algorithms combined with the pathwise optimization scheme enjoy fast
convergence guarantees with high probability. Numerical results are provided to
support our theory.

We propose a new family of combinatorial inference problems for graphical
models. Unlike classical statistical inference where the main interest is point
estimation or parameter testing, combinatorial inference aims at testing the
global structure of the underlying graph. Examples include testing the graph
connectivity, the presence of a cycle of certain size, or the maximum degree of
the graph. To begin with, we develop a unified theory for the fundamental
limits of a large family of combinatorial inference problems. We propose new
concepts including structural packing and buffer entropies to characterize how
the complexity of combinatorial graph structures impacts the corresponding
minimax lower bounds. On the other hand, we propose a family of novel and
practical structural testing algorithms to match the lower bounds. We provide
thorough numerical results on both synthetic graphical models and brain
networks to illustrate the usefulness of these proposed methods.

We propose a novel class of timevarying nonparanormal graphical models,
which allows us to model high dimensional heavytailed systems and the
evolution of their latent network structures. Under this model, we develop
statistical tests for presence of edges both locally at a fixed index value and
globally over a range of values. The tests are developed for a highdimensional
regime, are robust to model selection mistakes and do not require commonly
assumed minimum signal strength. The testing procedures are based on a high
dimensional, debiasingfree moment estimator, which uses a novel kernel
smoothed Kendall's tau correlation matrix as an input statistic. The estimator
consistently estimates the latent inverse Pearson correlation matrix uniformly
in both the index variable and kernel bandwidth. Its rate of convergence is
shown to be minimax optimal. Our method is supported by thorough numerical
simulations and an application to a neural imaging data set.

We develop a novel procedure for constructing confidence bands for components
of a sparse additive model. Our procedure is based on a new kernelsieve hybrid
estimator that combines two most popular nonparametric estimation methods in
the literature, the kernel regression and the spline method, and is of interest
in its own right. Existing methods for fitting sparse additive model are
primarily based on sieve estimators, while the literature on confidence bands
for nonparametric models are primarily based upon kernel or local polynomial
estimators. Our kernelsieve hybrid estimator combines the best of both worlds
and allows us to provide a simple procedure for constructing confidence bands
in highdimensional sparse additive models. We prove that the confidence bands
are asymptotically honest by studying approximation with a Gaussian process.
Thorough numerical results on both synthetic data and realworld neuroscience
data are provided to demonstrate the efficacy of the theory.

We propose a general theory for studying the \xl{landscape} of nonconvex
\xl{optimization} with underlying symmetric structures \tz{for a class of
machine learning problems (e.g., lowrank matrix factorization, phase
retrieval, and deep linear neural networks)}. In specific, we characterize the
locations of stationary points and the null space of Hessian matrices \xl{of
the objective function} via the lens of invariant groups\removed{for associated
optimization problems, including lowrank matrix factorization, phase
retrieval, and deep linear neural networks}. As a major motivating example, we
apply the proposed general theory to characterize the global \xl{landscape} of
the \xl{nonconvex optimization in} lowrank matrix factorization problem. In
particular, we illustrate how the rotational symmetry group gives rise to
infinitely many nonisolated strict saddle points and equivalent global minima
of the objective function. By explicitly identifying all stationary points, we
divide the entire parameter space into three regions: ($\cR_1$) the region
containing the neighborhoods of all strict saddle points, where the objective
has negative curvatures; ($\cR_2$) the region containing neighborhoods of all
global minima, where the objective enjoys strong convexity along certain
directions; and ($\cR_3$) the complement of the above regions, where the
gradient has sufficiently large magnitudes. We further extend our result to the
matrix sensing problem. Such global landscape implies strong global convergence
guarantees for popular iterative algorithms with arbitrary initial solutions.

We propose a stochastic variance reduced optimization algorithm for solving
sparse learning problems with cardinality constraints. Sufficient conditions
are provided, under which the proposed algorithm enjoys strong linear
convergence guarantees and optimal estimation accuracy in high dimensions. We
further extend the proposed algorithm to an asynchronous parallel variant with
a near linear speedup. Numerical experiments demonstrate the efficiency of our
algorithm in terms of both parameter estimation and computational performance.

In this paper, we generalize (accelerated) Newton's method with cubic
regularization under inexact secondorder information for (strongly) convex
optimization problems. Under mild assumptions, we provide global rate of
convergence of these methods and show the explicit dependence of the rate of
convergence on the problem parameters. While the complexity bounds of our
presented algorithms are theoretically worse than those of their exact
counterparts, they are at least as good as those of the optimal firstorder
methods. Our numerical experiments also show that using inexact Hessians can
significantly speed up the algorithms in practice.

Principal component analysis (PCA) has been a prominent tool for
highdimensional data analysis. Online algorithms that estimate the principal
component by processing streaming data are of tremendous practical and
theoretical interests. Despite its rich applications, theoretical convergence
analysis remains largely open. In this paper, we cast online PCA into a
stochastic nonconvex optimization problem, and we analyze the online PCA
algorithm as a stochastic approximation iteration. The stochastic approximation
iteration processes data points incrementally and maintains a running estimate
of the principal component. We prove for the first time a nearly optimal
finitesample error bound for the online PCA algorithm. Under the subgaussian
assumption, we show that the finitesample error bound closely matches the
minimax information lower bound.

We develop a new modeling framework for InterSubject Analysis (ISA). The
goal of ISA is to explore the dependency structure between different subjects
with the intrasubject dependency as nuisance. It has important applications in
neuroscience to explore the functional connectivity between brain regions under
natural stimuli. Our framework is based on the Gaussian graphical models, under
which ISA can be converted to the problem of estimation and inference of the
intersubject precision matrix. The main statistical challenge is that we do
not impose sparsity constraint on the whole precision matrix and we only assume
the intersubject part is sparse. For estimation, we propose to estimate an
alternative parameter to get around the nonsparse issue and it can achieve
asymptotic consistency even if the intrasubject dependency is dense. For
inference, we propose an "untangle and chord" procedure to debias our
estimator. It is valid without the sparsity assumption on the inverse Hessian
of the loglikelihood function. This inferential method is general and can be
applied to many other statistical problems, thus it is of independent
theoretical interest. Numerical experiments on both simulated and brain imaging
data validate our methods and theory.

We consider the problem of undirected graphical model inference. In many
applications, instead of perfectly recovering the unknown graph structure, a
more realistic goal is to infer some graph invariants (e.g., the maximum
degree, the number of connected subgraphs, the number of isolated nodes). In
this paper, we propose a new inferential framework for testing nested multiple
hypotheses and constructing confidence intervals of the unknown graph
invariants under undirected graphical models. Compared to perfect graph
recovery, our methods require significantly weaker conditions. This paper makes
two major contributions: (i) Methodologically, for testing nested multiple
hypotheses, we propose a skipdown algorithm on the whole family of monotone
graph invariants (The invariants which are nondecreasing under addition of
edges). We further show that the same skipdown algorithm also provides valid
confidence intervals for the targeted graph invariants. (ii) Theoretically, we
prove that the length of the obtained confidence intervals are optimal and
adaptive to the unknown signal strength. We also prove generic lower bounds for
the confidence interval length for various invariants. Numerical results on
both synthetic simulations and a brain imaging dataset are provided to
illustrate the usefulness of the proposed method.

We consider the testing of mutual independence among all entries in a
$d$dimensional random vector based on $n$ independent observations. We study
two families of distributionfree test statistics, which include Kendall's tau
and Spearman's rho as important examples. We show that under the null
hypothesis the test statistics of these two families converge weakly to Gumbel
distributions, and propose tests that control the type I error in the
highdimensional setting where $d>n$. We further show that the two tests are
rateoptimal in terms of power against sparse alternatives, and outperform
competitors in simulations, especially when $d$ is large.

We consider the problem of learning highdimensional Gaussian graphical
models. The graphical lasso is one of the most popular methods for estimating
Gaussian graphical models. However, it does not achieve the oracle rate of
convergence. In this paper, we propose the graphical nonconvex optimization for
optimal estimation in Gaussian graphical models, which is then approximated by
a sequence of convex programs. Our proposal is computationally tractable and
produces an estimator that achieves the oracle rate of convergence. The
statistical error introduced by the sequential approximation using the convex
programs are clearly demonstrated via a contraction property. The rate of
convergence can be further improved using the notion of sparsity pattern. The
proposed methodology is then extended to semiparametric graphical models. We
show through numerical studies that the proposed estimator outperforms other
popular methods for estimating Gaussian graphical models.

Developments in deep generative models have allowed for tractable learning of
highdimensional data distributions. While the employed learning procedures
typically assume that training data is drawn i.i.d. from the distribution of
interest, it may be desirable to model distinct distributions which are
observed sequentially, such as when different classes are encountered over
time. Although conditional variations of deep generative models permit multiple
distributions to be modeled by a single network in a disentangled fashion, they
are susceptible to catastrophic forgetting when the distributions are
encountered sequentially. In this paper, we adapt recent work in reducing
catastrophic forgetting to the task of training generative adversarial networks
on a sequence of distinct distributions, enabling continual generative
modeling.

This paper studies the asymptotic behavior of smooth solutions to the
generalized Hallmagnetohydrodynamics system (1.1) with one single diffusion
on the whole space $\mathbb R^3$. We establish that, in the inviscid resistive
case, the energy $\b(t)\_2^2$ vanishes and $\u(t)\_2^2$ converges to a
constant as time tends to infinity provided the velocity is bounded in
$W^{1\alpha,\frac3\alpha}(\mathbb R^3)$; in the viscous nonresistive case,
the energy $\u(t)\_2^2$ vanishes and $\b(t)\_2^2$ converges to a constant
provided the magnetic field is bounded in $W^{1\beta,\infty}(\mathbb R^3)$. In
summary, one single diffusion, being as weak as $(\Delta)^\alpha b$ or
$(\Delta)^\beta u$ with small enough $\alpha, \beta$, is sufficient to prevent
asymptotic energy oscillations for certain smooth solutions to the system.

We propose a computational framework named iterative local adaptive
majorizeminimization (ILAMM) to simultaneously control algorithmic complexity
and statistical error when fitting high dimensional models. ILAMM is a
twostage algorithmic implementation of the local linear approximation to a
family of folded concave penalized quasilikelihood. The first stage solves a
convex program with a crude precision tolerance to obtain a coarse initial
estimator, which is further refined in the second stage by iteratively solving
a sequence of convex programs with smaller precision tolerances. Theoretically,
we establish a phase transition: the first stage has a sublinear iteration
complexity, while the second stage achieves an improved linear rate of
convergence. Though this framework is completely algorithmic, it provides
solutions with optimal statistical performances and controlled algorithmic
complexity for a large family of nonconvex optimization problems. The iteration
effects on statistical errors are clearly demonstrated via a contraction
property. Our theory relies on a localized version of the sparse/restricted
eigenvalue condition, which allows us to analyze a large family of loss and
penalty functions and provide optimality guarantees under very weak assumptions
(For example, ILAMM requires much weaker minimal signal strength than other
procedures). Thorough numerical results are provided to support the obtained
theory.

High dimensional sparse learning has imposed a great computational challenge
to large scale data analysis. In this paper, we are interested in a broad class
of sparse learning approaches formulated as linear programs parametrized by a
{\em regularization factor}, and solve them by the parametric simplex method
(PSM). Our parametric simplex method offers significant advantages over other
competing methods: (1) PSM naturally obtains the complete solution path for all
values of the regularization parameter; (2) PSM provides a high precision dual
certificate stopping criterion; (3) PSM yields sparse solutions through very
few iterations, and the solution sparsity significantly reduces the
computational cost per iteration. Particularly, we demonstrate the superiority
of PSM over various sparse learning approaches, including Dantzig selector for
sparse linear regression, LADLasso for sparse robust linear regression, CLIME
for sparse precision matrix estimation, sparse differential network estimation,
and sparse Linear Programming Discriminant (LPD) analysis. We then provide
sufficient conditions under which PSM always outputs sparse solutions such that
its computational performance can be significantly boosted. Thorough numerical
experiments are provided to demonstrate the outstanding performance of the PSM
method.

The cyclic block coordinate descenttype (CBCDtype) methods, which performs
iterative updates for a few coordinates (a block) simultaneously throughout the
procedure, have shown remarkable computational performance for solving strongly
convex minimization problems. Typical applications include many popular
statistical machine learning methods such as elasticnet regression, ridge
penalized logistic regression, and sparse additive regression. Existing
optimization literature has shown that for strongly convex minimization, the
CBCDtype methods attain iteration complexity of $\cO(p\log(1/\epsilon))$,
where $\epsilon$ is a prespecified accuracy of the objective value, and $p$ is
the number of blocks. However, such iteration complexity explicitly depends on
$p$, and therefore is at least $p$ times worse than the complexity
$\cO(\log(1/\epsilon))$ of gradient descent (GD) methods. To bridge this
theoretical gap, we propose an improved convergence analysis for the CBCDtype
methods. In particular, we first show that for a family of quadratic
minimization problems, the iteration complexity
$\cO(\log^2(p)\cdot\log(1/\epsilon))$ of the CBCDtype methods matches that of
the GD methods in term of dependency on $p$, up to a $\log^2 p$ factor. Thus
our complexity bounds are sharper than the existing bounds by at least a factor
of $p/\log^2(p)$. We also provide a lower bound to confirm that our improved
complexity bounds are tight (up to a $\log^2 p$ factor), under the assumption
that the largest and smallest eigenvalues of the Hessian matrix do not scale
with $p$. Finally, we generalize our analysis to other strongly convex
minimization problems beyond quadratic ones.