• ### Testing Endogeneity with High Dimensional Covariates(1609.06713)

March 8, 2018 math.ST, stat.TH
Modern high-dimensional data has renewed interest in instrumental variables (IV) analysis, with work focusing primarily on estimating the effects of endogenous variables and paying little attention to specification tests. This paper studies, in high dimensions, the Durbin-Wu-Hausman (DWH) test, a popular specification test for endogeneity in IV regression. We show, surprisingly, that the DWH test maintains its size in high dimensions, but at the expense of power. We propose a new test that remedies this issue and has better power than the DWH test. Simulation studies reveal that our test achieves near-oracle performance in detecting endogeneity.
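For orientation, the classical low-dimensional DWH idea can be sketched in its augmented-regression form: regress the treatment on the instruments, add the first-stage residual to the outcome regression, and test whether its coefficient is zero. This is only a textbook-style illustration under assumed simulated data, not the high-dimensional test proposed in the paper. The function name `dwh_augmented_regression`, the data-generating model, and the normal approximation for the p-value are all assumptions made for this sketch.

```python
import math
import numpy as np

def dwh_augmented_regression(y, d, Z):
    """Augmented-regression endogeneity check (hypothetical helper).

    Returns the t-statistic and a two-sided normal-approximation p-value
    for the coefficient on the first-stage residual.
    """
    n = len(y)
    ones = np.ones((n, 1))
    # First stage: regress the treatment d on an intercept and the instruments Z.
    Z1 = np.column_stack([ones, Z])
    v = d - Z1 @ np.linalg.lstsq(Z1, d, rcond=None)[0]
    # Second stage: regress y on an intercept, d, and the first-stage residual v.
    X = np.column_stack([ones, d, v])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t = coef[2] / math.sqrt(cov[2, 2])
    p = math.erfc(abs(t) / math.sqrt(2))  # two-sided, normal approximation
    return t, p

# Assumed simulation: one instrument z, unmeasured confounder u.
rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)
u = rng.normal(size=n)
d = 0.8 * z + u + rng.normal(size=n)

# Endogenous case: u enters both the treatment and the outcome.
y_endog = 1.5 * d + 2.0 * u + rng.normal(size=n)
t_endog, p_endog = dwh_augmented_regression(y_endog, d, z)

# Exogenous case: the outcome noise is independent of the treatment.
y_exog = 1.5 * d + rng.normal(size=n)
t_exog, p_exog = dwh_augmented_regression(y_exog, d, z)

print(f"endogenous: t={t_endog:.2f}, p={p_endog:.3g}")
print(f"exogenous:  t={t_exog:.2f}, p={p_exog:.3g}")
```

In the endogenous case the residual coefficient is far from zero and the test rejects. In the exogenous case the p-value is simply a draw from its null distribution, so no particular value should be expected.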
• ### Confidence Intervals for Causal Effects with Invalid Instruments using Two-Stage Hard Thresholding with Voting(1603.05224)

Aug. 8, 2017 math.ST, stat.TH, stat.ME
A major challenge in instrumental variables (IV) analysis is to find instruments that are valid, that is, instruments that have no direct effect on the outcome and are ignorable. Typically one is unsure whether all of the putative IVs are in fact valid. We propose a general inference procedure in the presence of invalid IVs, called Two-Stage Hard Thresholding (TSHT) with voting. TSHT uses two hard thresholding steps to select strong instruments and generate candidate sets of valid IVs. Voting takes the candidate sets and uses majority and plurality rules to determine the true set of valid IVs. In low dimensions, if the sufficient and necessary identification condition under invalid instruments is met, which is more general than the so-called 50% rule or the majority rule, our proposal (i) correctly selects valid IVs, (ii) consistently estimates the causal effect, (iii) produces valid confidence intervals for the causal effect, and (iv) has oracle-optimal width. In high dimensions, we establish nearly identical results without oracle-optimality. In simulations, our proposal outperforms traditional and recent methods in the invalid IV literature. We also apply our method to re-analyze the causal effect of education on earnings.
• ### Accuracy Assessment for High-dimensional Linear Regression(1603.03474)

Sept. 24, 2016 math.ST, stat.TH
This paper considers point and interval estimation of the $\ell_q$ loss of an estimator in high-dimensional linear regression with random design. We establish the minimax rate for estimating the $\ell_{q}$ loss and the minimax expected length of confidence intervals for the $\ell_{q}$ loss of rate-optimal estimators of the regression vector, including commonly used estimators such as Lasso, scaled Lasso, square-root Lasso and Dantzig Selector. Adaptivity of the confidence intervals for the $\ell_{q}$ loss is also studied. Both the setting of known identity design covariance matrix and known noise level and the setting of unknown design covariance matrix and unknown noise level are studied. The results reveal interesting and significant differences between estimating the $\ell_2$ loss and $\ell_q$ loss with $1\le q <2$ as well as between the two settings. New technical tools are developed to establish rate sharp lower bounds for the minimax estimation error and the expected length of minimax and adaptive confidence intervals for the $\ell_q$ loss. A significant difference between loss estimation and the traditional parameter estimation is that for loss estimation the constraint is on the performance of the estimator of the regression vector, but the lower bounds are on the difficulty of estimating its $\ell_q$ loss. The technical tools developed in this paper can also be of independent interest.
• ### Mediation Analysis for Count and Zero-Inflated Count Data without Sequential Ignorability and Its Application in Dental Studies(1601.06294)

July 9, 2016 stat.ME, stat.AP
Mediation analysis seeks to understand the mechanism by which a treatment affects an outcome. Count or zero-inflated count outcomes are common in many studies in which mediation analysis is of interest. For example, in dental studies, outcomes such as decayed, missing and filled teeth are typically zero-inflated. Existing mediation analysis approaches for count data assume sequential ignorability of the mediator. This is often not plausible because the mediator is not randomized, so there may be unmeasured confounders associated with both the mediator and the outcome. In this paper, we develop causal methods for mediation analysis of count data, possibly with many zeros, that are based on instrumental variable (IV) approaches and do not require the assumption of sequential ignorability. We first define the direct and indirect effect ratios for such data, and then propose estimating equations and use empirical likelihood to estimate the direct and indirect effects consistently. A sensitivity analysis is proposed for violations of the IV exclusion restriction assumption. Simulation studies demonstrate that our method works well for different types of outcomes under different settings. Our method is applied to a randomized dental caries prevention trial and a study of the effect of a massive flood in Bangladesh on children's diarrhea.
• ### Optimal Estimation of Co-heritability in High-dimensional Linear Models(1605.07244)

May 24, 2016 math.ST, stat.TH, stat.ME
Co-heritability is an important concept that characterizes the genetic associations within pairs of quantitative traits. There has been significant recent interest in estimating co-heritability based on data from genome-wide association studies (GWAS). This paper introduces two measures of co-heritability in the high-dimensional linear model framework: the inner product of the two regression vectors, and the inner product normalized by their lengths. Functional de-biased estimators (FDEs) are developed to estimate these two co-heritability measures. In addition, estimators of quadratic functionals of the regression vectors are proposed. Both theoretical and numerical properties of the estimators are investigated. In particular, minimax rates of convergence are established and the proposed estimators of the inner product, the quadratic functionals and the normalized inner product are shown to be rate-optimal. Simulation results show that the FDEs significantly outperform the naive plug-in estimates. The FDEs are also applied to analyze a yeast segregant data set with multiple traits to estimate heritability and co-heritability among the traits.
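As a reading aid, the two co-heritability measures described above can be written out as follows. The notation is an assumption made here and may differ from the paper's: two traits $y$ and $w$ follow linear models with regression vectors $\beta$ and $\gamma$, and "length" is taken to mean the $\ell_2$ norm.

$$
y = X\beta + \epsilon, \qquad w = Z\gamma + \delta,
$$

$$
\mathrm{I}(\beta,\gamma) = \langle \beta, \gamma \rangle, \qquad
\mathrm{R}(\beta,\gamma) = \frac{\langle \beta, \gamma \rangle}{\|\beta\|_2 \, \|\gamma\|_2}.
$$

Under the same assumed notation, the quadratic functionals mentioned in the abstract would be $\|\beta\|_2^2$ and $\|\gamma\|_2^2$, which are the quantities appearing in the denominator of the normalized measure.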
• ### Control Function Instrumental Variable Estimation of Nonlinear Causal Effect Models(1602.01051)

Feb. 2, 2016 math.ST, stat.TH, stat.ME
The instrumental variable method consistently estimates the effect of a treatment when there is unmeasured confounding and a valid instrumental variable. A valid instrumental variable is a variable that is independent of unmeasured confounders and affects the treatment but does not have a direct effect on the outcome beyond its effect on the treatment. Two commonly used estimators that use an instrumental variable to estimate a treatment effect are the two stage least squares estimator and the control function estimator. For linear causal effect models, these two estimators are equivalent, but for nonlinear causal effect models, they differ. We provide a systematic comparison of these two estimators for nonlinear causal effect models and develop an approach to combining the two estimators that generally performs better than either one alone. We show that the control function estimator is a two stage least squares estimator with an augmented set of instrumental variables. If these augmented instrumental variables are valid, then the control function estimator can be much more efficient than the usual two stage least squares estimator without the augmented instrumental variables. If they are not valid, then the control function estimator may be inconsistent while the usual two stage least squares estimator remains consistent. We apply the Hausman test to test whether the augmented instrumental variables are valid and construct a pretest estimator based on this test. The pretest estimator is shown to work well in a simulation study. An application to the effect of exposure to violence on time preference is considered.
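The equivalence of the two estimators in the linear case can be checked numerically. The sketch below is a minimal illustration on assumed simulated data with a single instrument, not the paper's nonlinear setting or its combined or pretest estimator. The helper `ols` and the data-generating coefficients are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients of y on the columns of X (hypothetical helper)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Assumed linear model: instrument z, unmeasured confounder u, true effect 1.5.
rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)
u = rng.normal(size=n)
d = 0.8 * z + u + rng.normal(size=n)        # treatment
y = 1.5 * d + 2.0 * u + rng.normal(size=n)  # outcome

ones = np.ones(n)
Z = np.column_stack([ones, z])

# First stage: fitted treatment and first-stage residual.
d_hat = Z @ ols(Z, d)
v = d - d_hat

# Two stage least squares: regress y on the fitted treatment.
beta_2sls = ols(np.column_stack([ones, d_hat]), y)[1]

# Control function: regress y on the observed treatment and the residual.
beta_cf = ols(np.column_stack([ones, d, v]), y)[1]

# Ordinary least squares ignores the confounding and is biased here.
beta_ols = ols(np.column_stack([ones, d]), y)[1]

print(f"2SLS: {beta_2sls:.4f}  control function: {beta_cf:.4f}  OLS: {beta_ols:.4f}")
```

The two IV estimates agree up to floating-point error because the first-stage residual is orthogonal to the intercept and the fitted treatment, so adding it to the regression leaves the coefficient on the treatment equal to the two stage least squares coefficient.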
• ### Using an Instrumental Variable to Test for Unmeasured Confounding(1601.06288)

Jan. 23, 2016 stat.ME, stat.AP
An important concern in an observational study is whether or not there is unmeasured confounding, i.e., unmeasured ways in which the treatment and control groups differ before treatment that affect the outcome. We develop a test of whether there is unmeasured confounding when an instrumental variable (IV) is available. An IV is a variable that is independent of the unmeasured confounding and encourages a subject to take one treatment level vs. another, while having no effect on the outcome beyond its encouragement of a certain treatment level. We show what types of unmeasured confounding can be tested for with an IV and develop a test for this type of unmeasured confounding that has correct type I error rate. We show that the widely used Durbin-Wu-Hausman (DWH) test can have inflated type I error rates when there is treatment effect heterogeneity. Additionally, we show that our test provides more insight into the nature of the unmeasured confounding than the DWH test. We apply our test to an observational study of the effect of a premature infant being delivered in a high-level neonatal intensive care unit (one with mechanical assisted ventilation and high volume) vs. a lower level unit, using the excess travel time a mother lives from the nearest high-level unit to the nearest lower-level unit as an IV.
• ### Confidence Intervals for High-Dimensional Linear Regression: Minimax Rates and Adaptivity(1506.05539)

Nov. 27, 2015 math.ST, stat.TH
Confidence sets play a fundamental role in statistical inference. In this paper, we consider confidence intervals for high dimensional linear regression with random design. We first establish the convergence rates of the minimax expected length for confidence intervals in the oracle setting where the sparsity parameter is given. The focus is then on the problem of adaptation to sparsity for the construction of confidence intervals. Ideally, an adaptive confidence interval should have its length automatically adjusted to the sparsity of the unknown regression vector, while maintaining a prespecified coverage probability. It is shown that such a goal is in general not attainable, except when the sparsity parameter is restricted to a small region over which the confidence intervals have the optimal length of the usual parametric rate. It is further demonstrated that the lack of adaptivity is not due to the conservativeness of the minimax framework, but is fundamentally caused by the difficulty of learning the bias accurately.
• ### Boundary Value Problems for a Family of Domains in the Sierpinski Gasket(1310.6463)

Oct. 24, 2013 math.CA, math.FA
For a family of domains in the Sierpinski gasket, we study harmonic functions of finite energy, characterizing them in terms of their boundary values, and study their normal derivatives on the boundary. We characterize those domains for which there is an extension operator for functions of finite energy. We give an explicit construction of the Green's function for these domains.