
Modern high-dimensional data have renewed interest in instrumental
variables (IV) analysis, with most work focusing on estimating the effects of
endogenous variables and paying little attention to specification tests.
This paper studies in high dimensions the Durbin-Wu-Hausman (DWH) test, a
popular specification test for endogeneity in IV regression. We show,
surprisingly, that the DWH test maintains its size in high dimensions, but at
the expense of power. We propose a new test that remedies this issue and has
better power than the DWH test. Simulation studies reveal that our test
achieves near-oracle performance in detecting endogeneity.
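
For orientation, the following is a minimal low-dimensional sketch of one common regression-based form of an endogeneity check in the spirit of the DWH test: regress the treatment on the instruments, then test whether the first-stage residuals enter the outcome regression. It is not the new test proposed above. The function names `dwh_t_statistic` and `simulate`, the simulated data-generating process, and the parameter `rho` (strength of unmeasured confounding) are all assumptions made for illustration.

```python
import numpy as np

def dwh_t_statistic(y, d, Z):
    """t-statistic on the first-stage residuals in the augmented outcome regression."""
    n = len(y)
    Z1 = np.column_stack([np.ones(n), Z])            # instruments plus intercept
    gamma, *_ = np.linalg.lstsq(Z1, d, rcond=None)   # first stage: d on Z
    v_hat = d - Z1 @ gamma                           # first-stage residuals
    X = np.column_stack([np.ones(n), d, v_hat])      # outcome on d and residuals
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - X.shape[1])        # homoskedastic error variance
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[2] / np.sqrt(cov[2, 2])              # large |t| suggests endogeneity

def simulate(rho, n=2000, p=5, seed=0):
    """Hypothetical data: u confounds d and y whenever rho != 0."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, p))
    u = rng.normal(size=n)                           # unmeasured confounder
    d = Z @ np.full(p, 0.5) + u + rng.normal(size=n)
    y = 1.0 * d + rho * u + rng.normal(size=n)
    return y, d, Z

t_endog = dwh_t_statistic(*simulate(rho=2.0))        # endogenous treatment
t_exog = dwh_t_statistic(*simulate(rho=0.0))         # exogenous treatment
print(f"endogenous: t = {t_endog:.2f}, exogenous: t = {t_exog:.2f}")
```

In this sketch the number of instruments is small relative to the sample size, so the first stage is ordinary least squares; the high-dimensional regime studied above is exactly where this simple form is expected to lose power.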

A major challenge in instrumental variables (IV) analysis is to find
instruments that are valid, that is, that have no direct effect on the outcome
and are ignorable. Typically one is unsure whether all of the putative IVs are
in fact valid. We propose a general inference procedure in the presence of
invalid IVs, called Two-Stage Hard Thresholding (TSHT) with voting. TSHT uses two hard
thresholding steps to select strong instruments and generate candidate sets of
valid IVs. Voting takes the candidate sets and uses majority and plurality
rules to determine the true set of valid IVs. In low dimensions, if the
sufficient and necessary identification condition under invalid instruments is
met, which is more general than the so-called 50% rule or the majority rule,
our proposal (i) correctly selects valid IVs, (ii) consistently estimates the
causal effect, (iii) produces valid confidence intervals for the causal effect,
and (iv) has oracle-optimal width. In high dimensions, we establish nearly
identical results without oracle-optimality. In simulations, our proposal
outperforms traditional and recent methods in the invalid IV literature. We
also apply our method to reanalyze the causal effect of education on earnings.
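
The voting idea can be illustrated with a deliberately simplified sketch. It assumes noiseless reduced-form coefficients and a fixed tolerance `tol` in place of the data-driven hard thresholds, so it is not the TSHT procedure itself. Each instrument implies a ratio estimate of the causal effect, instruments whose ratios agree vote for each other, and the largest agreeing group is taken as the valid set. The function name `plurality_vote`, the tolerance, and the toy numbers are all hypothetical.

```python
import numpy as np

def plurality_vote(Gamma, gamma, tol=0.1):
    """Pick the instruments whose ratio estimates Gamma_j / gamma_j agree most often.

    Gamma: reduced-form effects of each instrument on the outcome.
    gamma: effects of each instrument on the treatment (assumed nonzero, i.e. strong).
    """
    ratios = Gamma / gamma                              # per-instrument effect estimates
    agree = np.abs(ratios[:, None] - ratios[None, :]) <= tol
    votes = agree.sum(axis=1)                           # each IV also votes for itself
    winners = np.flatnonzero(votes == votes.max())      # plurality winners
    return winners, ratios[winners].mean()

# Toy example: true effect 2.0, six instruments, only the first two valid.
beta_true = 2.0
gamma = np.ones(6)
pi = np.array([0.0, 0.0, 1.0, 3.0, 5.0, 7.0])           # direct effects (0 = valid)
Gamma = gamma * beta_true + pi

winners, beta_hat = plurality_vote(Gamma, gamma)
print(winners, beta_hat)
```

In this toy example fewer than half of the instruments are valid, so a majority rule would not apply, but the valid instruments still form the largest group with a common ratio, which is the situation a plurality rule is meant to cover.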

This paper considers point and interval estimation of the $\ell_q$ loss of an
estimator in high-dimensional linear regression with random design. We
establish the minimax rate for estimating the $\ell_{q}$ loss and the minimax
expected length of confidence intervals for the $\ell_{q}$ loss of rate-optimal
estimators of the regression vector, including commonly used estimators such as
Lasso, scaled Lasso, square-root Lasso and Dantzig Selector. Adaptivity of the
confidence intervals for the $\ell_{q}$ loss is also studied. Both the setting
of known identity design covariance matrix and known noise level and the
setting of unknown design covariance matrix and unknown noise level are
studied. The results reveal interesting and significant differences between
estimating the $\ell_2$ loss and $\ell_q$ loss with $1\le q <2$ as well as
between the two settings.
New technical tools are developed to establish rate-sharp lower bounds for
the minimax estimation error and the expected length of minimax and adaptive
confidence intervals for the $\ell_q$ loss. A significant difference between
loss estimation and the traditional parameter estimation is that for loss
estimation the constraint is on the performance of the estimator of the
regression vector, but the lower bounds are on the difficulty of estimating its
$\ell_q$ loss. The technical tools developed in this paper can also be of
independent interest.

Mediation analysis seeks to understand the mechanism by which a treatment
affects an outcome. Count or zero-inflated count outcomes are common in many
studies in which mediation analysis is of interest. For example, in dental
studies, outcomes such as decayed, missing and filled teeth are typically
zero-inflated. Existing mediation analysis approaches for count data assume
sequential ignorability of the mediator. This is often not plausible because
the mediator is not randomized so that there are unmeasured confounders
associated with the mediator and the outcome. In this paper, we develop causal
methods based on instrumental variable (IV) approaches for mediation analysis
for count data, possibly with many zeros, that do not require the assumption
of sequential ignorability. We first define the direct and indirect effect
ratios for those data, and then propose estimating equations and use empirical
likelihood to estimate the direct and indirect effects consistently. A
sensitivity analysis is proposed for violations of the IV exclusion restriction
assumption. Simulation studies demonstrate that our method works well for
different types of outcomes under different settings. Our method is applied to
a randomized dental caries prevention trial and a study of the effect of a
massive flood in Bangladesh on children's diarrhea.

Coheritability is an important concept that characterizes the genetic
associations within pairs of quantitative traits. There has been significant
recent interest in estimating the coheritability based on data from the
genome-wide association studies (GWAS). This paper introduces two measures of
coheritability in the high-dimensional linear model framework: the inner
product of the two regression vectors and that inner product normalized by
their lengths. Functional debiased estimators (FDEs) are developed to estimate
these two coheritability measures. In addition, estimators of quadratic
functionals of the regression vectors are proposed. Both theoretical and
numerical properties of the estimators are investigated. In particular, minimax
rates of convergence are established and the proposed estimators of the inner
product, the quadratic functionals and the normalized inner product are shown
to be rate-optimal. Simulation results show that the FDEs significantly
outperform the naive plug-in estimates. The FDEs are also applied to analyze a
yeast segregant data set with multiple traits to estimate heritability and
coheritability among the traits.
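
As a notational sketch of the two measures described above, write $\beta$ and $\gamma$ for the $p$-dimensional regression vectors of the two traits. The symbols $\beta$, $\gamma$, $I$ and $R$ are assumed here for illustration and are not fixed by the text. The inner product and its normalized version can then be written as

$$
I(\beta,\gamma) = \langle \beta, \gamma \rangle = \sum_{j=1}^{p} \beta_j \gamma_j,
\qquad
R(\beta,\gamma) = \frac{\langle \beta, \gamma \rangle}{\|\beta\|_2 \, \|\gamma\|_2},
$$

where $R$ is defined when both vectors are nonzero. By the Cauchy-Schwarz inequality, $R(\beta,\gamma)$ lies in $[-1, 1]$, and the natural quadratic functionals in this notation are $\|\beta\|_2^2$ and $\|\gamma\|_2^2$.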

The instrumental variable method consistently estimates the effect of a
treatment when there is unmeasured confounding and a valid instrumental
variable. A valid instrumental variable is a variable that is independent of
unmeasured confounders and affects the treatment but does not have a direct
effect on the outcome beyond its effect on the treatment. Two commonly used
estimators for using an instrumental variable to estimate a treatment effect
are the two stage least squares estimator and the control function estimator.
For linear causal effect models, these two estimators are equivalent, but for
nonlinear causal effect models, the estimators are different. We provide a
systematic comparison of these two estimators for nonlinear causal effect
models and develop an approach to combining the two estimators that generally
performs better than either one alone. We show that the control function
estimator is a two stage least squares estimator with an augmented set of
instrumental variables. If these augmented instrumental variables are valid,
then the control function estimator can be much more efficient than the usual
two stage least squares without the augmented instrumental variables; if they
are not valid, the control function estimator may be inconsistent while the
usual two stage least squares remains consistent. We apply the Hausman test to
test whether the augmented
instrumental variables are valid and construct a pretest estimator based on
this test. The pretest estimator is shown to work well in a simulation study.
An application to the effect of exposure to violence on time preference is
considered.
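
The equivalence of the two estimators in the linear case can be checked numerically. The sketch below, on simulated data, computes the two stage least squares estimate (outcome regressed on the fitted treatment) and the control function estimate (outcome regressed on the treatment and the first-stage residual) and compares the treatment coefficients. The helper names `ols` and `tsls_and_control_function`, the data-generating process, and the true effect of 1.5 are assumptions for illustration; the nonlinear models studied above are not covered by this sketch.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def tsls_and_control_function(y, d, Z):
    """Return the treatment coefficient from 2SLS and from the control function fit."""
    n = len(y)
    ones = np.ones(n)
    Z1 = np.column_stack([ones, Z])
    d_hat = Z1 @ ols(Z1, d)                   # first-stage fitted treatment
    v_hat = d - d_hat                         # first-stage residual (control function)
    beta_2sls = ols(np.column_stack([ones, d_hat]), y)[1]
    beta_cf = ols(np.column_stack([ones, d, v_hat]), y)[1]
    return beta_2sls, beta_cf

# Hypothetical linear model with an unmeasured confounder u.
rng = np.random.default_rng(1)
n = 1000
Z = rng.normal(size=(n, 2))
u = rng.normal(size=n)
d = Z @ np.array([1.0, 1.0]) + u + rng.normal(size=n)
y = 1.5 * d + u + rng.normal(size=n)

beta_2sls, beta_cf = tsls_and_control_function(y, d, Z)
print(beta_2sls, beta_cf)
```

The two coefficients agree up to floating-point error because the first-stage residual is orthogonal to the intercept and the fitted treatment, so adding it to the regression does not change the coefficient on the treatment.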

An important concern in an observational study is whether or not there is
unmeasured confounding, i.e., unmeasured ways in which the treatment and
control groups differ before treatment that affect the outcome. We develop a
test of whether there is unmeasured confounding when an instrumental variable
(IV) is available. An IV is a variable that is independent of the unmeasured
confounding and encourages a subject to take one treatment level vs. another,
while having no effect on the outcome beyond its encouragement of a certain
treatment level. We show what types of unmeasured confounding can be tested for
with an IV and develop a test for this type of unmeasured confounding that has
correct type I error rate. We show that the widely used Durbin-Wu-Hausman (DWH)
test can have inflated type I error rates when there is treatment effect
heterogeneity. Additionally, we show that our test provides more insight into
the nature of the unmeasured confounding than the DWH test. We apply our test
to an observational study of the effect of a premature infant being delivered
in a high-level neonatal intensive care unit (one with mechanical assisted
ventilation and high volume) vs. a lower-level unit, using as an IV the excess
travel time from a mother's residence to the nearest high-level unit compared
with the nearest lower-level unit.

Confidence sets play a fundamental role in statistical inference. In this
paper, we consider confidence intervals for high-dimensional linear regression
with random design. We first establish the convergence rates of the minimax
expected length for confidence intervals in the oracle setting where the
sparsity parameter is given. The focus is then on the problem of adaptation to
sparsity for the construction of confidence intervals. Ideally, an adaptive
confidence interval should have its length automatically adjusted to the
sparsity of the unknown regression vector, while maintaining a prespecified
coverage probability. It is shown that such a goal is in general not
attainable, except when the sparsity parameter is restricted to a small region
over which the confidence intervals have the optimal length of the usual
parametric rate. It is further demonstrated that the lack of adaptivity is not
due to the conservativeness of the minimax framework, but is fundamentally
caused by the difficulty of learning the bias accurately.

For a family of domains in the Sierpinski gasket, we study harmonic functions
of finite energy, characterizing them in terms of their boundary values, and
study their normal derivatives on the boundary. We characterize those domains
for which there is an extension operator for functions of finite energy. We
give an explicit construction of the Green's function for these domains.