
This paper explores the informationtheoretic limitations of graph property
testing in zerofield Ising models. Instead of learning the entire graph
structure, sometimes testing a basic graph property such as connectivity, cycle
presence or maximum clique size is a more relevant and attainable objective.
Since property testing is more fundamental than graph recovery, any necessary
conditions for property testing imply corresponding conditions for graph
recovery, while custom property tests can be statistically and/or
computationally more efficient than graph recovery based algorithms.
Understanding the statistical complexity of property testing requires the
distinction of ferromagnetic (i.e., positive interactions only) and general
Ising models. Using combinatorial constructs such as graph packing and strong
monotonicity, we characterize how target properties affect the corresponding
minimax upper and lower bounds within the realm of ferromagnets. On the other
hand, by studying the detection of an antiferromagnetic (i.e., negative
interactions only) CurieWeiss model buried in Rademacher noise, we show that
property testing is strictly more challenging over general Ising models. In
terms of methodological development, we propose two types of correlation based
tests: computationally efficient screening for ferromagnets, and score type
tests for general models, including a fast cycle presence test. Our correlation
screening tests match the informationtheoretic bounds for property testing in
ferromagnets.

We consider the recovery of regression coefficients, denoted by
$\boldsymbol{\beta}_0$, for a single index model (SIM) relating a binary
outcome $Y$ to a set of possibly high dimensional covariates $\boldsymbol{X}$,
based on a large but 'unlabeled' dataset $\mathcal{U}$, with $Y$ never
observed. On $\mathcal{U}$, we fully observe $\boldsymbol{X}$ and additionally,
a surrogate $S$ which, while not being strongly predictive of $Y$ throughout
the entirety of its support, can forecast it with high accuracy when it assumes
extreme values. Such datasets arise naturally in modern studies involving large
databases such as electronic medical records (EMR) where $Y$, unlike
$(\boldsymbol{X}, S)$, is difficult and/or expensive to obtain. In EMR studies,
an example of $Y$ and $S$ would be the true disease phenotype and the count of
the associated diagnostic codes respectively. Assuming another SIM for $S$
given $\boldsymbol{X}$, we show that under sparsity assumptions, we can recover
$\boldsymbol{\beta}_0$ proportionally by simply fitting a least squares LASSO
estimator to the subset of the observed data on $(\boldsymbol{X}, S)$
restricted to the extreme sets of $S$, with $Y$ imputed using the surrogacy of
$S$. We obtain sharp finite sample performance bounds for our estimator,
including deterministic deviation bounds and probabilistic guarantees. We
demonstrate the effectiveness of our approach through multiple simulation
studies, as well as by application to real data from an EMR study conducted at
the Partners HealthCare Systems.

We propose a new family of combinatorial inference problems for graphical
models. Unlike classical statistical inference where the main interest is point
estimation or parameter testing, combinatorial inference aims at testing the
global structure of the underlying graph. Examples include testing the graph
connectivity, the presence of a cycle of certain size, or the maximum degree of
the graph. To begin with, we develop a unified theory for the fundamental
limits of a large family of combinatorial inference problems. We propose new
concepts including structural packing and buffer entropies to characterize how
the complexity of combinatorial graph structures impacts the corresponding
minimax lower bounds. On the other hand, we propose a family of novel and
practical structural testing algorithms to match the lower bounds. We provide
thorough numerical results on both synthetic graphical models and brain
networks to illustrate the usefulness of these proposed methods.

We consider the problem of undirected graphical model inference. In many
applications, instead of perfectly recovering the unknown graph structure, a
more realistic goal is to infer some graph invariants (e.g., the maximum
degree, the number of connected subgraphs, the number of isolated nodes). In
this paper, we propose a new inferential framework for testing nested multiple
hypotheses and constructing confidence intervals of the unknown graph
invariants under undirected graphical models. Compared to perfect graph
recovery, our methods require significantly weaker conditions. This paper makes
two major contributions: (i) Methodologically, for testing nested multiple
hypotheses, we propose a skipdown algorithm on the whole family of monotone
graph invariants (The invariants which are nondecreasing under addition of
edges). We further show that the same skipdown algorithm also provides valid
confidence intervals for the targeted graph invariants. (ii) Theoretically, we
prove that the length of the obtained confidence intervals are optimal and
adaptive to the unknown signal strength. We also prove generic lower bounds for
the confidence interval length for various invariants. Numerical results on
both synthetic simulations and a brain imaging dataset are provided to
illustrate the usefulness of the proposed method.

It is known that for a certain class of single index models (SIMs) $Y =
f(\boldsymbol{X}_{p \times 1}^\intercal\boldsymbol{\beta}_0, \varepsilon)$,
support recovery is impossible when $\boldsymbol{X} \sim \mathcal{N}(0,
\mathbb{I}_{p \times p})$ and a model complexity adjusted sample size is below
a critical threshold. Recently, optimal algorithms based on Sliced Inverse
Regression (SIR) were suggested. These algorithms work provably under the
assumption that the design $\boldsymbol{X}$ comes from an i.i.d. Gaussian
distribution. In the present paper we analyze algorithms based on covariance
screening and least squares with $L_1$ penalization (i.e. LASSO) and
demonstrate that they can also enjoy optimal (up to a scalar) rescaled sample
size in terms of support recovery, albeit under slightly different assumptions
on $f$ and $\varepsilon$ compared to the SIR based algorithms. Furthermore, we
show more generally, that LASSO succeeds in recovering the signed support of
$\boldsymbol{\beta}_0$ if $\boldsymbol{X} \sim \mathcal{N}(0,
\boldsymbol{\Sigma})$, and the covariance $\boldsymbol{\Sigma}$ satisfies the
irrepresentable condition. Our work extends existing results on the support
recovery of LASSO for the linear model, to a more general class of SIMs.

In this paper we study the support recovery problem for single index models
$Y=f(\boldsymbol{X}^{\intercal} \boldsymbol{\beta},\varepsilon)$, where $f$ is
an unknown link function, $\boldsymbol{X}\sim N_p(0,\mathbb{I}_{p})$ and
$\boldsymbol{\beta}$ is an $s$sparse unit vector such that
$\boldsymbol{\beta}_{i}\in \{\pm\frac{1}{\sqrt{s}},0\}$. In particular, we look
into the performance of two computationally inexpensive algorithms: (a) the
diagonal thresholding sliced inverse regression (DTSIR) introduced by Lin et
al. (2015); and (b) a semidefinite programming (SDP) approach inspired by
Amini & Wainwright (2008). When $s=O(p^{1\delta})$ for some $\delta>0$, we
demonstrate that both procedures can succeed in recovering the support of
$\boldsymbol{\beta}$ as long as the rescaled sample size
$\kappa=\frac{n}{s\log(ps)}$ is larger than a certain critical threshold. On
the other hand, when $\kappa$ is smaller than a critical value, any algorithm
fails to recover the support with probability at least $\frac{1}{2}$
asymptotically. In other words, we demonstrate that both DTSIR and the SDP
approach are optimal (up to a scalar) for recovering the support of
$\boldsymbol{\beta}$ in terms of sample size. We provide extensive simulations,
as well as a real dataset application to help verify our theoretical
observations.

We propose a new inferential framework for constructing confidence regions
and testing hypotheses in statistical models specified by a system of high
dimensional estimating equations. We construct an influence function by
projecting the fitted estimating equations to a sparse direction obtained by
solving a largescale linear program. Our main theoretical contribution is to
establish a unified Zestimation theory of confidence regions for high
dimensional problems.
Different from existing methods, all of which require the specification of
the likelihood or pseudolikelihood, our framework is likelihoodfree. As a
result, our approach provides valid inference for a broad class of high
dimensional constrained estimating equation problems, which are not covered by
existing methods.
Such examples include, noisy compressed sensing, instrumental variable
regression, undirected graphical models, discriminant analysis and vector
autoregressive models. We present detailed theoretical results for all these
examples. Finally, we conduct thorough numerical simulations, and a real
dataset analysis to back up the developed theoretical results.