
We propose new tests for assessing whether covariates in a treatment group
and matched control group are balanced in observational studies. The tests
exhibit high power under a wide range of multivariate alternatives, some of
which existing tests have little power for. The asymptotic permutation null
distributions of the proposed tests are studied and the p-values calculated
through the asymptotic results work well in finite samples, facilitating the
application of the test to large data sets. The tests are illustrated in a
study of the effect of smoking on blood lead levels. The proposed tests are
implemented in an R package BalanceCheck.

Effect modification occurs when the effect of the treatment on an outcome
varies according to the level of other covariates and often has important
implications in decision making. When there are tens or hundreds of covariates,
it becomes necessary to use the observed data to select a simpler model for
effect modification and then make valid statistical inference. We propose a
two-stage procedure to solve this problem. First, we use Robinson's transformation
to decouple the nuisance parameters from the treatment effect of interest and
use machine learning algorithms to estimate the nuisance parameters. Next,
after plugging in the estimates of the nuisance parameters, we use the Lasso to
choose a low-complexity model for effect modification. Compared to a full model
consisting of all the covariates, the selected model is much more
interpretable. Compared to the univariate subgroup analyses, the selected model
greatly reduces the number of false discoveries. We show that the conditional
selective inference for the selected model is asymptotically valid given the
rate assumptions in classical semiparametric regression. Extensive simulation
studies are conducted to verify the asymptotic results and an epidemiological
application is used to demonstrate the method.

Instrumental variable analysis is a widely used method to estimate causal
effects in the presence of unmeasured confounding. When the instruments,
exposure and outcome are not measured in the same sample, Angrist and Krueger
(1992) suggested using two-sample instrumental variable (TSIV) estimators that
use sample moments from an instrument-exposure sample and an instrument-outcome
sample. However, this method is biased if the two samples are from
heterogeneous populations so that the distributions of the instruments are
different. In linear structural equation models, we derive a new class of TSIV
estimators that are robust to heterogeneous samples under the key assumption
that the structural relations in the two samples are the same. The widely used
two-sample two-stage least squares estimator belongs to this class. It is
generally not asymptotically efficient, although we find that it performs
similarly to the optimal TSIV estimator in most practical situations. We then
attempt to relax the linearity assumption. We find that, unlike one-sample
analyses, the TSIV estimator is not robust to a misspecified exposure model.
Additionally, to nonparametrically identify the magnitude of the causal effect,
the noise in the exposure must have the same distribution in the two samples.
However, this assumption is in general untestable because the exposure is not
observed in one sample. Nonetheless, we may still identify the sign of the
causal effect in the absence of homogeneity of the noise.
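
A minimal sketch of the two-sample two-stage least squares idea with a single instrument may help fix ideas. The function names and the noiseless toy data are illustrative assumptions, not taken from the paper: the first stage is fit in the instrument-exposure sample, the exposure is predicted in the instrument-outcome sample, and the outcome is regressed on that prediction.

```python
from statistics import mean


def ols_intercept_slope(x, y):
    """Simple least squares fit of y on x; returns (intercept, slope)."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def ts2sls(z_a, x_a, z_b, y_b):
    """Two-sample 2SLS with one instrument (hypothetical helper).

    Sample a holds (instrument, exposure); sample b holds (instrument, outcome).
    """
    a0, a1 = ols_intercept_slope(z_a, x_a)      # first stage, fit in sample a
    x_hat_b = [a0 + a1 * z for z in z_b]        # predicted exposure in sample b
    _, beta = ols_intercept_slope(x_hat_b, y_b)  # second stage, fit in sample b
    return beta


# Toy data: the instrument is distributed differently in the two samples,
# but the structural relations (x = 1 + 0.5 z, y = 3 + 2 x) are shared.
z_a = [0.0, 1.0, 2.0, 3.0]
x_a = [1.0 + 0.5 * z for z in z_a]
z_b = [1.0, 2.0, 4.0, 7.0]
y_b = [3.0 + 2.0 * (1.0 + 0.5 * z) for z in z_b]

beta_hat = ts2sls(z_a, x_a, z_b, y_b)
print(beta_hat)
```

Because the toy data are noiseless and the structural relations agree across samples, the sketch recovers the causal slope even though the instrument distributions differ.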

Instrumental variables are commonly used to estimate effects of a treatment
afflicted by unmeasured confounding, and in practice instruments are often
continuous (e.g., measures of distance, or treatment preference). However,
available methods for continuous instruments have important limitations: they
either require restrictive parametric assumptions for identification, or else
rely on modeling both the outcome and treatment process well (and require
modeling effect modification by all adjustment covariates). In this work we
develop the first semiparametric doubly robust estimators of the local
instrumental variable effect curve, i.e., the effect among those who would take
treatment for instrument values above some threshold and not below. In addition
to being robust to misspecification of either the instrument or
treatment/outcome processes, our approach also incorporates information about
the instrument mechanism and allows for flexible data-adaptive estimation of
effect modification. We discuss asymptotic properties under weak conditions,
and use the methods to study infant mortality effects of neonatal intensive
care units with high versus low technical capacity, using travel time as an
instrument.

In matched observational studies where treatment assignment is not
randomized, sensitivity analysis helps investigators determine how sensitive
their estimated treatment effect is to some unmeasured confounder. The
standard approach calibrates the sensitivity analysis according to the worst
case bias in a pair. This approach will result in a conservative sensitivity
analysis if the worst case bias does not hold in every pair. In this paper, we
show that for binary data, the standard approach can be calibrated in terms of
the average bias in a pair rather than worst case bias. When the worst case
bias and average bias differ, the average bias interpretation results in a less
conservative sensitivity analysis and more power. In many studies, the average
case calibration may also carry a more natural interpretation than the worst
case calibration and may also allow researchers to incorporate additional data
to establish an empirical basis with which to calibrate a sensitivity analysis.
We illustrate this with a study of the effects of cellphone use on the
incidence of automobile accidents. Finally, we extend the average case
calibration to the sensitivity analysis of confidence intervals for
attributable effects.
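
As a rough illustration, assume the standard approach for binary outcomes refers to the usual worst-case bound for McNemar's test: with sensitivity parameter Gamma, the one-sided p-value is bounded by a binomial tail with success probability Gamma / (1 + Gamma) over the discordant pairs. The function name and the counts below are hypothetical, and the sketch does not implement the average-case calibration itself, which concerns how Gamma is interpreted.

```python
from math import comb


def sensitivity_pvalue_bound(n_discordant, n_treated_events, gamma):
    """Upper bound on the one-sided McNemar p-value at bias level gamma.

    n_discordant: pairs in which exactly one unit had the event.
    n_treated_events: discordant pairs in which the treated unit had it.
    """
    p_plus = gamma / (1.0 + gamma)
    return sum(
        comb(n_discordant, k) * p_plus ** k * (1.0 - p_plus) ** (n_discordant - k)
        for k in range(n_treated_events, n_discordant + 1)
    )


# Hypothetical counts: 10 discordant pairs, 9 with the event in the treated unit.
bounds = {g: sensitivity_pvalue_bound(10, 9, g) for g in (1.0, 2.0, 3.0)}
for g, p in bounds.items():
    print(g, p)
```

At Gamma = 1 the bound reduces to the randomization p-value, and it grows as Gamma increases.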

Mendelian randomization (MR) is an instrumental variable method of estimating
the causal effect of risk exposures in epidemiology, where genetic variants are
used as instruments. With the increasing availability of large-scale
genome-wide association studies, it is now possible to greatly improve the
power of MR by using genetic variants that are only weakly relevant. We
consider how to increase the efficiency of Mendelian randomization by a
genome-wide design where more than a thousand genetic instruments are used. An
empirical partially Bayes estimator is proposed, where weaker instruments are
shrunken more heavily and thus bring less variation to the MR estimate. This
is generally more efficient than the profile-likelihood-based estimator, which
gives no shrinkage to weak instruments. We apply our method to estimate the
causal effect of blood lipids on cardiovascular diseases. We find high-density
lipoprotein cholesterol (HDL-c) has a significantly protective effect on heart
diseases, while previous MR studies reported null findings.

It is common in instrumental variable studies for instrument values to be
missing, for example when the instrument is a genetic test in Mendelian
randomization studies. In this paper we discuss two apparent paradoxes that
arise in so-called single consent designs where there is one-sided
noncompliance, i.e., where unencouraged units cannot access treatment. The
first paradox is that, even under a missing completely at random assumption, a
complete-case analysis is biased when knowledge of one-sided noncompliance is
taken into account; this is not the case when such information is disregarded.
This occurs because incorporating information about one-sided noncompliance
induces a dependence between the missingness and treatment. The second paradox
is that, although incorporating such information does not lead to efficiency
gains without missing data, the story is different when instrument values are
missing: there, incorporating such information changes the efficiency bound,
allowing possible efficiency gains. This is because some of the missing values
can be filled in, based on the fact that anyone who received treatment must
have been encouraged by the instrument (since the unencouraged cannot access
treatment).

Effect modification means the magnitude or stability of a treatment effect
varies as a function of an observed covariate. Generally, larger and more
stable treatment effects are insensitive to larger biases from unmeasured
covariates, so a causal conclusion may be considerably firmer if this pattern
is noted when it occurs. We propose a new strategy, called the submax-method,
that combines exploratory and confirmatory efforts to determine whether there
is stronger evidence of causality (that is, greater insensitivity to
unmeasured confounding) in some subgroups of individuals. It uses the joint
distribution of test statistics that split the data in various ways based on
certain observed covariates. For $L$ binary covariates, the method splits the
population $L$ times into two subpopulations, perhaps first men and women,
perhaps then smokers and nonsmokers, computing a test statistic from each
subpopulation, and appends the test statistic for the whole population, making
$2L+1$ test statistics in total. Although $L$ binary covariates define $2^{L}$
interaction groups, only $2L+1$ tests are performed, and at least $L+1$ of
these tests use at least half of the data. The submax-method achieves the
highest design sensitivity and the highest Bahadur efficiency of its component
tests. Moreover, the form of the test is sufficiently tractable that its large
sample power may be studied analytically. A simulation suggests that the
submax-method exhibits superior performance, in comparison with an approach
using CART, when there is effect modification of moderate size. Using data from
the NHANES I Epidemiologic Follow-Up Survey, an observational study of the
effects of physical activity on survival is used to illustrate the method. The
method is implemented in the $\texttt{R}$ package $\texttt{submax}$ which
contains the NHANES example.
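
The counting argument above can be sketched as follows. This is only an illustration of which subpopulations are tested, with hypothetical covariate names; it does not compute the test statistics or their joint distribution, and it is not the interface of the submax package.

```python
def submax_groups(covariate_names):
    """List the subpopulations tested: the whole population, plus the two
    levels of each binary covariate taken one covariate at a time."""
    groups = [("all", None)]
    for name in covariate_names:
        groups.append((name, 0))
        groups.append((name, 1))
    return groups


# Hypothetical binary covariates (L = 5).
covariates = ["female", "smoker", "age_over_50", "college", "urban"]
groups = submax_groups(covariates)

n_tests = len(groups)                    # 2L + 1
n_interaction_cells = 2 ** len(covariates)  # 2^L
print(n_tests, n_interaction_cells)
```

With five binary covariates the sketch lists 11 tests, whereas the full interaction would define 32 cells.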

Modern high-dimensional data have renewed interest in instrumental
variables (IV) analysis, primarily focusing on estimation of effects of
endogenous variables and paying little attention to specification tests.
This paper studies in high dimensions the Durbin-Wu-Hausman (DWH) test, a
popular specification test for endogeneity in IV regression. We show,
surprisingly, that the DWH test maintains its size in high dimensions, but at
the expense of power. We propose a new test that remedies this issue and has
better power than the DWH test. Simulation studies reveal that our test
achieves near-oracle performance in detecting endogeneity.

Two problems that arise in making causal inferences for non-mortality
outcomes such as bronchopulmonary dysplasia (BPD) are unmeasured confounding
and censoring by death, i.e., the outcome is only observed when subjects
survive. In randomized experiments with noncompliance, instrumental variable
methods can be used to control for the unmeasured confounding without censoring
by death. But when there is censoring by death, the average causal treatment
effect cannot be identified under usual assumptions, but can be studied for a
specific subpopulation by using sensitivity analysis with additional
assumptions. However, in observational studies, evaluation of the local average
treatment effect (LATE) in censoring by death problems with unmeasured
confounding is not well studied. We develop a novel sensitivity analysis method
based on instrumental variable models for studying the LATE. Specifically, we
present the identification results under an additional assumption, and propose
a three-step procedure for the LATE estimation. Also, we propose an improved
two-step procedure by simultaneously estimating the instrument propensity score
(i.e., the probability of instrument given covariates) and the parameters
induced by the assumption. We have shown with simulation studies that the
two-step procedure can be more robust and efficient than the three-step
procedure. Finally, we apply our sensitivity analysis methods to a study of the
effect of delivery at high-level neonatal intensive care units on the risk of
BPD.

Studies have shown that exposure to air pollution, even at low levels,
significantly increases mortality. As regulatory actions are becoming
prohibitively expensive, robust evidence to guide the development of targeted
interventions to reduce air pollution exposure is needed. In this paper, we
introduce a novel statistical method that splits the data into two subsamples:
(a) Using the first subsample, we consider a data-driven search for $\textit{de
novo}$ discovery of subgroups that could have exposure effects that differ from
the population mean; and then (b) using the second subsample, we quantify
evidence of effect modification among the subgroups with nonparametric
randomization-based tests. We also develop a sensitivity analysis method to
assess the robustness of the conclusions to unmeasured confounding bias. Via
simulation studies and theoretical arguments, we demonstrate that since we
discover the subgroups in the first subsample, hypothesis testing on the second
subsample can focus on these subgroups only, thus substantially increasing the
statistical power of the test. We apply our method to the data of 1,612,414
Medicare beneficiaries in the New England region of the United States for the
period 2000 to 2006. We find that seniors aged 81-85 with low income
and seniors aged above 85 have statistically significantly higher causal effects
of exposure to PM$_{2.5}$ on the 5-year mortality rate compared to the population
mean.

Mendelian randomization (MR) is a method of exploiting genetic variation to
unbiasedly estimate a causal effect in the presence of unmeasured confounding. MR
is being widely used in epidemiology and other related areas of population
science. In this paper, we study statistical inference in the increasingly
popular two-sample summary-data MR design. We show a linear model for the
observed associations approximately holds in a wide variety of settings when
all the genetic variants satisfy the exclusion restriction assumption, or in
genetic terms, when there is no pleiotropy. In this scenario, we derive a
maximum profile likelihood estimator with provable consistency and asymptotic
normality. However, through analyzing real datasets, we find strong evidence of
both systematic and idiosyncratic pleiotropy in MR, echoing some recent
discoveries in statistical genetics. We model the systematic pleiotropy by a
random effects model, where no genetic variant satisfies the exclusion
restriction condition exactly. In this case we propose a consistent and
asymptotically normal estimator by adjusting the profile score. We then tackle
the idiosyncratic pleiotropy by robustifying the adjusted profile score. We
demonstrate the robustness and efficiency of the proposed methods using several
simulated and real datasets.
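
For the no-pleiotropy case, a profile likelihood can be sketched as below. The formula used is the usual errors-in-variables profile log-likelihood for summary data with known standard errors; it is an assumption here, since the abstract does not state it, and the variable names, toy data, and grid search are all hypothetical. The sketch covers neither the adjusted profile score nor its robustified version.

```python
def profile_loglik(beta, gamma_hat, Gamma_hat, se_x, se_y):
    """Profile log-likelihood for the causal slope beta (assumed form).

    gamma_hat: variant-exposure associations, with standard errors se_x.
    Gamma_hat: variant-outcome associations, with standard errors se_y.
    """
    return -0.5 * sum(
        (G - beta * g) ** 2 / (sy ** 2 + beta ** 2 * sx ** 2)
        for g, G, sx, sy in zip(gamma_hat, Gamma_hat, se_x, se_y)
    )


def maximize_on_grid(gamma_hat, Gamma_hat, se_x, se_y, grid):
    """Crude grid search; a real analysis would use a numerical optimizer."""
    return max(
        grid,
        key=lambda b: profile_loglik(b, gamma_hat, Gamma_hat, se_x, se_y),
    )


# Toy summary statistics in which Gamma_j = 2 * gamma_j holds exactly.
gamma_hat = [0.10, 0.20, 0.15, 0.05]
Gamma_hat = [2.0 * g for g in gamma_hat]
se_x = [0.01] * 4
se_y = [0.02] * 4

grid = [i / 100.0 for i in range(-500, 501)]
beta_hat = maximize_on_grid(gamma_hat, Gamma_hat, se_x, se_y, grid)
print(beta_hat)
```

A grid is used only to keep the sketch dependency-free; the profile likelihood need not be unimodal in general, so a local optimizer would need sensible starting values.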

In the evaluation of treatment effects, it is of major policy interest to
know if the treatment is beneficial for some and harmful for others, a
phenomenon known as qualitative interaction. We formulate this question as a
multiple testing problem with many conservative null $p$-values, in which the
classical multiple testing methods may lose power substantially. We propose a
simple technique, conditioning, to improve the power. A crucial assumption we
need is uniform conservativeness, meaning for any conservative $p$-value $p$,
the conditional distribution $(p/\tau) \mid p \le \tau$ is stochastically larger
than the uniform distribution on $(0,1)$ for any $\tau$. We show this property
holds for one-sided tests in a one-dimensional exponential family (e.g.\
testing for qualitative interaction) as well as testing $\mu\le\eta$ using a
statistic $X \sim \mathrm{N}(\mu,1)$ (e.g.\ testing for practical importance
with threshold $\eta$). We propose an adaptive method to select the threshold
$\tau$. Our theoretical and simulation results suggest the proposed tests gain
significant power when many $p$-values are uniformly conservative and lose
little power when no $p$-value is uniformly conservative. We apply our method
to two educational intervention datasets.
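
The conditioning idea can be sketched for a fixed threshold $\tau$: keep only the $p$-values at or below $\tau$, rescale them by $\tau$, and apply a multiple testing correction to the survivors. The sketch below uses Bonferroni, assumes independent $p$-values, and does not implement the adaptive choice of $\tau$; the function names and numbers are hypothetical.

```python
def conditional_bonferroni(p_values, tau, alpha=0.05):
    """Indices rejected after conditioning on p <= tau (fixed tau)."""
    selected = [(i, p) for i, p in enumerate(p_values) if p <= tau]
    if not selected:
        return []
    m = len(selected)
    # p / tau is the conditional p-value given p <= tau.
    return [i for i, p in selected if p / tau <= alpha / m]


def plain_bonferroni(p_values, alpha=0.05):
    """Ordinary Bonferroni over all hypotheses, for comparison."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


# One small p-value among many very conservative ones.
p_values = [0.01, 0.90, 0.95, 0.99, 0.80, 0.70]

rejected_conditional = conditional_bonferroni(p_values, tau=0.5)
rejected_plain = plain_bonferroni(p_values)
print(rejected_conditional, rejected_plain)
```

In this toy input the five conservative $p$-values are screened out, so the remaining hypothesis is tested at level 0.05 on its conditional $p$-value of 0.02, while ordinary Bonferroni compares 0.01 with 0.05/6 and rejects nothing.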

A major challenge in instrumental variables (IV) analysis is to find
instruments that are valid, or have no direct effect on the outcome and are
ignorable. Typically one is unsure whether all of the putative IVs are in fact
valid. We propose a general inference procedure in the presence of invalid IVs,
called Two-Stage Hard Thresholding (TSHT) with voting. TSHT uses two hard
thresholding steps to select strong instruments and generate candidate sets of
valid IVs. Voting takes the candidate sets and uses majority and plurality
rules to determine the true set of valid IVs. In low dimensions, if the
sufficient and necessary identification condition under invalid instruments is
met, which is more general than the so-called 50% rule or the majority rule,
our proposal (i) correctly selects valid IVs, (ii) consistently estimates the
causal effect, (iii) produces valid confidence intervals for the causal effect,
and (iv) has oracle-optimal width. In high dimensions, we establish nearly
identical results without oracle-optimality. In simulations, our proposal
outperforms traditional and recent methods in the invalid IV literature. We
also apply our method to reanalyze the causal effect of education on earnings.

In observational studies, the causal effect of a treatment on the
distribution of outcomes is of interest beyond the average treatment effect.
Instrumental variable methods allow for causal inference by controlling for
unmeasured confounding. The existing nonparametric method for estimating the
effect of the treatment on the distribution of outcomes for compliers has
several drawbacks, such as producing estimates that violate the nondecreasing
and nonnegative properties of cumulative distribution functions. In this
paper, we propose a novel nonparametric composite likelihood approach, referred
to as the binomial likelihood (BL) method, which overcomes the limitations of
the previous techniques and utilizes the advantage of likelihood methods. We
show the consistency of the maximum binomial likelihood (MBL) estimators and
derive their asymptotic distributions. Next, we develop a computationally
efficient algorithm for computing the MBL estimates by combining the
expectation-maximization (EM) and the pool-adjacent-violators algorithms
(PAVA). Moreover, the BL method can be used to construct a binomial
likelihood-ratio test (BLRT) for the null hypothesis of no distributional
treatment effect. An asymptotic expansion of the BLRT is derived and the
performance of the BL method is demonstrated in simulation studies. Finally, we
apply our method to a study of the effect of Vietnam veteran status on the
distribution of civilian annual earnings.
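
To illustrate how a nondecreasing estimate can be enforced, here is a generic pool-adjacent-violators routine. It is a textbook version of PAVA written for this illustration, not the paper's combined EM and PAVA algorithm, and the input values are made up.

```python
def pava(values, weights=None):
    """Weighted least squares fit of a nondecreasing sequence to `values`."""
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block is [block mean, total weight, number of points]
    for v, w in zip(values, weights):
        blocks.append([float(v), float(w), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            w_new = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / w_new, w_new, c1 + c2])
    fitted = []
    for block_mean, _, count in blocks:
        fitted.extend([block_mean] * count)
    return fitted


# A made-up raw CDF estimate that dips in two places.
raw_cdf = [0.10, 0.30, 0.20, 0.60, 0.50, 0.90]
monotone_cdf = pava(raw_cdf)
print(monotone_cdf)
```

Each pair of adjacent violators is replaced by its weighted average, so the output is nondecreasing and has the same length as the input.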

Causal effects are commonly defined as comparisons of the potential outcomes
under treatment and control, but this definition is threatened by the
possibility that the treatment or control condition is not well-defined,
existing instead in more than one version. A simple, widely applicable analysis
is proposed to address the possibility that the treatment or control condition
exists in two versions with two different treatment effects. This analysis
loses no power in the main comparison of treatment and control, provides
additional information about version effects, and controls the familywise
error rate in several comparisons. The method is motivated and illustrated
using an ongoing study of the possibility that repeated head trauma in high
school football causes an increase in risk of early onset dementia.

We discuss observational studies that test many causal hypotheses, either
hypotheses about many outcomes or many treatments. To be credible, an
observational study that tests many causal hypotheses must demonstrate that its
conclusions are neither artifacts of multiple testing nor of small biases from
nonrandom treatment assignment. In a sense that needs to be defined carefully,
hidden within a sensitivity analysis for nonrandom assignment is an enormous
correction for multiple testing: in the absence of bias, it is extremely
improbable that multiple testing alone would create an association insensitive
to moderate biases. We propose a new strategy called "cross-screening",
different from but motivated by recent work of Bogomolov and Heller on
replicability. Cross-screening splits the data in half at random, uses the
first half to plan a study carried out on the second half, then uses the second
half to plan a study carried out on the first half, and reports the more
favorable conclusions of the two studies, correcting with the Bonferroni
inequality for having done two studies. If the two studies happen to concur,
then they achieve Bogomolov-Heller replicability; however, importantly,
replicability is not required for strong control of the familywise error rate,
and either study alone suffices for firm conclusions. In randomized studies
with a few hypotheses, cross-split screening is not an attractive method when
compared with conventional methods of multiplicity control, but it can become
attractive when hundreds or thousands of hypotheses are subjected to
sensitivity analyses in an observational study. We illustrate the technique by
comparing 46 biomarkers in individuals who consume large quantities of fish
versus little or no fish.
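
The two-way use of the split halves can be sketched with ordinary p-values standing in for the sensitivity analyses. Everything below is a simplified illustration under stated assumptions: the screening threshold, the function name, and the p-values are hypothetical, "planning" is reduced to selecting hypotheses whose p-value in the planning half falls below the threshold, and each direction is tested at level alpha/2 with a Bonferroni correction over the selected hypotheses.

```python
def cross_screening(p_half1, p_half2, alpha=0.05, screen=0.10):
    """Return the set of hypothesis indices rejected by either direction."""

    def one_direction(p_plan, p_test):
        # Plan on one half: keep hypotheses that look promising there.
        selected = [h for h, p in enumerate(p_plan) if p <= screen]
        if not selected:
            return set()
        # Test on the other half at alpha/2, Bonferroni over the selection.
        level = (alpha / 2.0) / len(selected)
        return {h for h in selected if p_test[h] <= level}

    return one_direction(p_half1, p_half2) | one_direction(p_half2, p_half1)


# Hypothetical p-values for four hypotheses, computed separately in each half.
p_half1 = [0.001, 0.50, 0.03, 0.05]
p_half2 = [0.004, 0.60, 0.40, 0.002]

rejected = cross_screening(p_half1, p_half2)
print(sorted(rejected))
```

In this toy input, hypothesis 0 is rejected in both directions, while hypothesis 3 is rejected only when the first half plans and the second half tests, which is enough for it to be reported.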

An experimental unit is an opportunity to randomly apply or withhold a
treatment. There is interference between units if the application of the
treatment to one unit may also affect other units. In cognitive neuroscience, a
common form of experiment presents a sequence of stimuli or requests for
cognitive activity at random to each experimental subject and measures
biological aspects of brain activity that follow these requests. Each subject
is then many experimental units, and interference between units within an
experimental subject is likely, in part because the stimuli follow one another
quickly and in part because human subjects learn or become experienced or
primed or bored as the experiment proceeds. We use a recent fMRI experiment
concerned with the inhibition of motor activity to illustrate and further
develop recently proposed methodology for inference in the presence of
interference. A simulation evaluates the power of competing procedures.

Optogenetics is a new tool to study neuronal circuits that have been
genetically modified to allow stimulation by flashes of light. We study
recordings from single neurons within neural circuits under optogenetic
stimulation. The data from these experiments present a statistical challenge of
modeling a high frequency point process (neuronal spikes) while the input is
another high frequency point process (light flashes). We further develop a
generalized linear model approach to model the relationships between two point
processes, employing additive point-process response functions. The resulting
model, Point-process Responses for Optogenetics (PRO), provides explicit
nonlinear transformations to link the input point process with the output one.
Such response functions may provide important and interpretable scientific
insights into the properties of the biophysical process that governs neural
spiking in response to optogenetic stimulation. We validate and compare the PRO
model using a real dataset and simulations, and our model yields a superior
area-under-the-curve value as high as 93% for predicting every future spike.
For our experiment on the recurrent layer V circuit in the prefrontal cortex,
the PRO model provides evidence that neurons integrate their inputs in a
sophisticated manner. Another use of the model is that it enables understanding
how neural circuits are altered under various disease conditions and/or
experimental conditions by comparing the PRO parameters.

There is effect modification if the magnitude or stability of a treatment
effect varies systematically with the level of an observed covariate. A larger
or more stable treatment effect is typically less sensitive to bias from
unmeasured covariates, so it is important to recognize effect modification when
it is present. We illustrate a recent proposal for conducting a sensitivity
analysis that empirically discovers effect modification by exploratory methods,
but controls the familywise error rate in discovered groups. The example
concerns a study of mortality and use of the intensive care unit in 23,715
matched pairs of two Medicare patients, one of whom underwent surgery at a
hospital identified for superior nursing, the other at a conventional hospital.
The pairs were matched exactly for 130 four-digit ICD-9 surgical procedure
codes and balanced for 172 observed covariates. The pairs were then split into five
groups of pairs by CART in its effort to locate effect modification. The
evidence of a beneficial effect of magnet hospitals on mortality is least
sensitive to unmeasured biases in a large group of patients undergoing rather
serious surgical procedures, but in the absence of other life-threatening
conditions, such as a comorbidity of congestive heart failure or an emergency
admission leading to surgery.

Continuous treatments (e.g., doses) arise often in practice, but many
available causal effect estimators are limited by either requiring parametric
models for the effect curve, or by not allowing doubly robust covariate
adjustment. We develop a novel kernel smoothing approach that requires only
mild smoothness assumptions on the effect curve, and still allows for
misspecification of either the treatment density or outcome regression. We
derive asymptotic properties and give a procedure for data-driven bandwidth
selection. The methods are illustrated via simulation and in a study of the
effect of nurse staffing on hospital readmissions penalties.

Instrumental variables have been widely used to estimate the causal effect of
a treatment on an outcome. Existing confidence intervals for causal effects
based on instrumental variables assume that all of the putative instrumental
variables are valid; a valid instrumental variable is a variable that affects
the outcome only by affecting the treatment and is not related to unmeasured
confounders. However, in practice, some of the putative instrumental variables
are likely to be invalid. This paper presents a simple and general approach to
construct a confidence interval that is robust to possibly invalid instruments.
The robust confidence interval has theoretical guarantees on having the correct
coverage and can also be used to assess the sensitivity of inference when
instrumental variables assumptions are violated. The paper also shows that the
robust confidence interval outperforms traditional confidence intervals popular
in instrumental variables literature when invalid instruments are present. The
new approach is applied to a developmental economics study of the causal effect
of income on food expenditures.

Mediation analysis seeks to understand the mechanism by which a treatment
affects an outcome. Count or zero-inflated count outcomes are common in many
studies in which mediation analysis is of interest. For example, in dental
studies, outcomes such as decayed, missing and filled teeth are typically
zero-inflated. Existing mediation analysis approaches for count data assume
sequential ignorability of the mediator. This is often not plausible because
the mediator is not randomized so that there are unmeasured confounders
associated with the mediator and the outcome. In this paper, we develop causal
methods based on instrumental variable (IV) approaches for mediation analysis
for count data possibly with a lot of zeros that do not require the assumption
of sequential ignorability. We first define the direct and indirect effect
ratios for those data, and then propose estimating equations and use empirical
likelihood to estimate the direct and indirect effects consistently. A
sensitivity analysis is proposed for violations of the IV exclusion restriction
assumption. Simulation studies demonstrate that our method works well for
different types of outcomes under different settings. Our method is applied to
a randomized dental caries prevention trial and a study of the effect of a
massive flood in Bangladesh on children's diarrhea.

A potential causal relationship between head injuries sustained by NFL
players and later-life neurological decline may have broad implications for
participants in youth and high school football programs. However, brain trauma
risk at the professional level may be different than that at the youth and high
school levels and the long-term effects of participation at these levels are
as yet unclear. To investigate the effect of playing high school football on
later-life depression and cognitive functioning, we propose a retrospective
observational study using data from the Wisconsin Longitudinal Study (WLS) of
graduates from Wisconsin high schools in 1957.
We compare 1,153 high school males who played varsity football to 2,751 male
students who did not. 1,951 of the control subjects did not play any sport and
the remaining 800 controls played a non-contact sport. We focus on two primary
outcomes measured at age 65: a composite cognitive outcome measuring verbal
fluency and memory and the modified CES-D depression score. To control for
potential confounders we adjust for pre-exposure covariates such as IQ with
matching and model-based covariate adjustment. We will conduct an ordered
testing procedure that uses all 2,751 controls while controlling for possible
unmeasured differences between students who played sports and those who did
not. We will quantitatively assess the sensitivity of the results to potential
unmeasured confounding. The study will also consider several secondary outcomes
of clinical interest such as aggression and heavy drinking. The rich set of
pre-exposure variables, relatively unbiased sampling, and longitudinal nature
of the WLS dataset make the proposed analysis unique among related studies that
rely primarily on convenience samples of football players with reported
neurological symptoms.

Malaria is a parasitic disease that is a major health problem in many
tropical regions. The most characteristic symptom of malaria is fever. The
fraction of fevers that are attributable to malaria, the malaria attributable
fever fraction (MAFF), is an important public health measure for assessing the
effect of malaria control programs and other purposes. Estimating the MAFF is
not straightforward because there is no gold standard diagnosis of a malaria
attributable fever; an individual can have malaria parasites in her blood and a
fever, but the individual may have developed partial immunity that allows her
to tolerate the parasites and the fever is being caused by another infection.
We define the MAFF using the potential outcome framework for causal inference
and show what assumptions underlie current estimation methods. Current
estimation methods rely on an assumption that the parasite density is correctly
measured. However, this assumption does not generally hold because (i) fever
kills some parasites and (ii) the measurement of parasite density has
measurement error. In the presence of these problems, we show current
estimation methods do not perform well. We propose a novel maximum likelihood
estimation method based on exponential family g-modeling. Under the assumption
that the measurement error mechanism and the magnitude of the fever killing
effect are known, we show that our proposed method provides approximately
unbiased estimates of the MAFF in simulation studies. A sensitivity analysis
can be used to assess the impact of different magnitudes of fever killing and
different measurement error mechanisms. We apply our proposed method to
estimate the MAFF in Kilombero, Tanzania.