
Causal inference is known to be very challenging when only observational data
are available. Randomized experiments are often costly and impractical, and in
instrumental variable regression the number of instruments has to exceed the
number of causal predictors. It was recently shown in Peters et al. [2016] that
causal inference for the full model is possible when data from distinct
observational environments are available, exploiting that the conditional
distribution of a response variable is invariant under the correct causal
model. Two shortcomings of such an approach are the high computational effort
for large-scale data and the assumed absence of hidden confounders. Here we
show that these two shortcomings can be addressed if one is willing to make a
more restrictive assumption on the type of interventions that generate
different environments. Thereby, we look at a different notion of invariance,
namely inner-product invariance. By avoiding a computationally cumbersome
reverse-engineering approach such as in Peters et al. [2016], it allows for
large-scale causal inference in linear structural equation models. We discuss
identifiability conditions for the causal parameter and derive asymptotic
confidence intervals in the low-dimensional setting. In the case of
non-identifiability we show that the solution set of causal Dantzig has
predictive guarantees under certain interventions. We derive finite-sample
bounds in the high-dimensional setting and investigate its performance on
simulated datasets.
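
As a rough illustration of inner-product invariance, the sketch below assumes
two environments that differ only by independent additive shift noise on the
predictors, plus one hidden confounder. Under these assumptions
$E[X (Y - X^\top \beta)]$ takes the same value in both environments, so $\beta$
solves $(G_2 - G_1)\beta = Z_2 - Z_1$ with $G_e = E[X X^\top]$ and
$Z_e = E[X Y]$ in environment $e$. The simulation design, the function names
and the plain unregularized solve are illustrative assumptions, not the
implementation from the paper.

```python
import numpy as np

def simulate_environment(n, shift_sd, beta, rng):
    """One environment: a hidden confounder h drives both x and y."""
    h = rng.normal(size=n)                                   # hidden confounder
    shift = rng.normal(scale=shift_sd, size=(n, len(beta)))  # environment-specific shift noise
    x = h[:, None] + shift
    y = x @ beta + h + rng.normal(scale=0.5, size=n)
    return x, y

def inner_product_difference_estimate(x1, y1, x2, y2):
    """Solve (G2 - G1) beta = Z2 - Z1 with G = X'X/n and Z = X'y/n."""
    g_diff = x2.T @ x2 / len(y2) - x1.T @ x1 / len(y1)
    z_diff = x2.T @ y2 / len(y2) - x1.T @ y1 / len(y1)
    return np.linalg.solve(g_diff, z_diff)

rng = np.random.default_rng(0)
beta_true = np.array([1.0, 0.5])
x1, y1 = simulate_environment(50_000, 1.0, beta_true, rng)
x2, y2 = simulate_environment(50_000, 2.0, beta_true, rng)

beta_hat = inner_product_difference_estimate(x1, y1, x2, y2)
# Pooled least squares for comparison; it is biased by the hidden confounder.
beta_ols = np.linalg.lstsq(np.vstack([x1, x2]), np.concatenate([y1, y2]), rcond=None)[0]
print("invariance-based:", beta_hat)
print("pooled OLS:      ", beta_ols)
```

In this toy design the confounder contributes the same term to $E[X Y]$ in
both environments, so it cancels in the difference, while pooled least squares
absorbs it into the coefficients.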

We investigate the problem of inferring the causal predictors of a response
$Y$ from a set of $d$ explanatory variables $(X^1,\dots,X^d)$. Classical
ordinary least squares regression includes all predictors that reduce the
variance of $Y$. Using only the causal predictors instead leads to models that
have the advantage of remaining invariant under interventions; loosely speaking,
they lead to invariance across different "environments" or "heterogeneity
patterns". More precisely, the conditional distribution of $Y$ given its causal
predictors remains invariant for all observations. Recent work exploits such a
stability to infer causal relations from data with different but known
environments. We show that even without having knowledge of the environments or
heterogeneity pattern, inferring causal relations is possible for time-ordered
(or any other type of sequentially ordered) data. In particular, this allows
detecting instantaneous causal relations in multivariate linear time series,
which is usually not possible with Granger causality. Besides novel methodology,
we provide statistical confidence bounds and asymptotic detection results for
inferring causal predictors, and present an application to monetary policy in
macroeconomics.

We provide a view on high-dimensional statistical inference for genome-wide
association studies (GWAS). It is in part a review but also covers new
developments for meta-analysis with multiple studies and novel software in
terms of an R-package hierinf. Inference and assessment of significance is
based on very high-dimensional multivariate (generalized) linear models: in
contrast to often used marginal approaches, this provides a step towards more
causal-oriented inference.

This is a preliminary draft of "Anchor regression: heterogeneous data meets
causality".

We propose and study properties of maximum likelihood estimators in the class
of conditional transformation models. Based on a suitable explicit
parameterisation of the unconditional or conditional transformation function,
we establish a cascade of increasingly complex transformation models that can
be estimated, compared and analysed in the maximum likelihood framework. Models
for the unconditional or conditional distribution function of any univariate
response variable can be set up and estimated in the same theoretical and
computational framework simply by choosing an appropriate transformation
function and parameterisation thereof. The ability to evaluate the distribution
function directly allows us to estimate models based on the exact likelihood,
especially in the presence of random censoring or truncation. For discrete and
continuous responses, we establish the asymptotic normality of the proposed
estimators. A reference software implementation of maximum likelihood-based
estimation for conditional transformation models allowing the same flexibility
as the theory developed here was employed to illustrate the wide range of
possible applications.

In this work we propose a framework for constructing goodness of fit tests in
both low- and high-dimensional linear models. We advocate applying regression
methods to the scaled residuals following either an ordinary least squares or
Lasso fit to the data, and using some proxy for prediction error as the final
test statistic. We call this family Residual Prediction (RP) tests. We show
that simulation can be used to obtain the critical values for such tests in the
low-dimensional setting, and demonstrate using both theoretical results and
extensive numerical studies that some form of the parametric bootstrap can do
the same when the high-dimensional linear model is under consideration. We show
that RP tests can be used to test for significance of groups or individual
variables as special cases, and here they compare favourably with
state-of-the-art methods, but we also argue that they can be designed to test
for model misspecifications as diverse as heteroscedasticity and nonlinearity.

We consider the problem of structure learning for bow-free acyclic path
diagrams (BAPs). BAPs can be viewed as a generalization of linear Gaussian DAG
models that allow for certain hidden variables. We present a first method for
this problem using a greedy score-based search algorithm. We also prove some
necessary and some sufficient conditions for distributional equivalence of BAPs
which are used in an algorithmic approach to compute (nearly) equivalent model
structures, allowing us to infer lower bounds of causal effects. The application
of our method to datasets reveals that BAP models can represent the data much
better than DAG models in these cases.

Causal inference from observational data is an ambitious but highly relevant
task, with diverse applications ranging from natural to social sciences. Within
the scope of nonparametric time series, causal inference defined through
interventions (cf. Pearl (2000)) is largely unexplored, although time order
simplifies the problem substantially. We consider a marginal integration scheme
for inferring causal effects from observational time series data, MINTT
(marginal integration in time series), which is an adaptation for time series
of a method proposed by Ernest and B\"{u}hlmann (Electron. J. Statist, pp.
3155-3194, vol. 9, 2015) for the case of independent data. Our approach for
stationary stochastic processes is fully nonparametric and, assuming no
instantaneous effects, consistently recovers the total causal effect of a single
intervention with optimal one-dimensional nonparametric convergence rate
$n^{-2/5}$ assuming regularity conditions and twice differentiability of a
certain corresponding regression function. Therefore, MINTT remains largely
unaffected by the curse of dimensionality as long as smoothness conditions hold
in higher dimensions and it is feasible for a large class of stationary time
series, including nonlinear and multivariate processes. For the case with
instantaneous effects, we provide a procedure which guards against false
positive causal statements.

We investigate the problem of testing whether $d$ random variables, which may
or may not be continuous, are jointly (or mutually) independent. Our method
builds on ideas of the two-variable Hilbert-Schmidt independence criterion
(HSIC) but allows for an arbitrary number of variables. We embed the
$d$-dimensional joint distribution and the product of the marginals into a
reproducing kernel Hilbert space and define the $d$-variable Hilbert-Schmidt
independence criterion (dHSIC) as the squared distance between the embeddings.
In the population case, the value of dHSIC is zero if and only if the $d$
variables are jointly independent, as long as the kernel is characteristic.
Based on an empirical estimate of dHSIC, we define three different
nonparametric hypothesis tests: a permutation test, a bootstrap test and a
test based on a Gamma approximation. We prove that the permutation test
achieves the significance level and that the bootstrap test achieves pointwise
asymptotic significance level as well as pointwise asymptotic consistency
(i.e., it is able to detect any type of fixed dependence in the large sample
limit). The Gamma approximation does not come with these guarantees; however,
it is computationally very fast and for small $d$, it performs well in
practice. Finally, we apply the test to a problem in causal discovery.
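
The squared distance between the two embeddings can be estimated directly from
Gram matrices. The sketch below is a minimal illustration, assuming
one-dimensional variables, a Gaussian kernel with a fixed bandwidth, and the
plug-in (V-statistic) form of the estimator; the function names, the bandwidth
and the number of permutations are illustrative choices rather than those of
the paper. The permutation test shuffles all variables but the first
independently of each other.

```python
import numpy as np

def gaussian_gram(x, bandwidth):
    """Gram matrix of a Gaussian kernel for a one-dimensional sample."""
    diff = x[:, None] - x[None, :]
    return np.exp(-diff**2 / (2.0 * bandwidth**2))

def dhsic(samples, bandwidth=1.0):
    """Plug-in estimate of the squared distance between the embedded joint
    distribution and the embedded product of marginals."""
    n = len(samples[0])
    joint = np.ones((n, n))   # elementwise product of the d Gram matrices
    marg = 1.0                # product of the d overall Gram means
    cross = np.ones(n)        # product of the d row-mean vectors
    for x in samples:
        k = gaussian_gram(np.asarray(x, dtype=float), bandwidth)
        joint *= k
        marg *= k.mean()
        cross *= k.mean(axis=1)
    return joint.mean() + marg - 2.0 * cross.mean()

def dhsic_permutation_pvalue(samples, n_perm=200, bandwidth=1.0, seed=0):
    """Permutation p-value for the null hypothesis of joint independence."""
    rng = np.random.default_rng(seed)
    observed = dhsic(samples, bandwidth)
    exceed = 0
    for _ in range(n_perm):
        permuted = [samples[0]] + [rng.permutation(x) for x in samples[1:]]
        if dhsic(permuted, bandwidth) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = rng.normal(size=100)
z = x + y + 0.1 * rng.normal(size=100)   # z depends on both x and y
stat = dhsic([x, y, z])
p_value = dhsic_permutation_pvalue([x, y, z])
print(stat, p_value)
```

Each call costs on the order of $d n^2$ operations because of the Gram
matrices, so the permutation loop dominates the run time for larger samples.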

We consider the identifiability and estimation of partially linear additive
structural equation models with Gaussian noise (PLSEMs). Existing
identifiability results in the framework of additive SEMs with Gaussian noise
are limited to linear and nonlinear SEMs, which can be considered as special
cases of PLSEMs with vanishing nonparametric or parametric part, respectively.
We close the wide gap between these two special cases by providing a
comprehensive theory of the identifiability of PLSEMs by means of (A) a
graphical, (B) a transformational, (C) a functional and (D) a causal ordering
characterization of PLSEMs that generate a given distribution. In particular,
the characterizations (C) and (D) answer the fundamental question of the
extent to which nonlinear functions in additive SEMs with Gaussian noise
restrict the set of potential causal models and hence influence the
identifiability.
On the basis of the transformational characterization (B) we provide a
score-based estimation procedure that outputs the graphical representation (A)
of the distribution equivalence class of a given PLSEM. We derive its
(high-dimensional) consistency and demonstrate its performance on simulated
datasets.

We propose a residual and wild bootstrap methodology for individual and
simultaneous inference in high-dimensional linear models with possibly
non-Gaussian and heteroscedastic errors. We establish asymptotic consistency
for simultaneous inference for parameters in groups $G$, where $p \gg n$, $s_0
= o(n^{1/2}/\{\log(p) \log(G)^{1/2}\})$ and $\log(G) = o(n^{1/7})$, with
$p$ the number of variables, $n$ the sample size and $s_0$ denoting the
sparsity. The theory is complemented by many empirical results. Our proposed
procedures are implemented in the R-package hdi.

Large-scale sequential data is often exposed to some degree of inhomogeneity
in the form of sudden changes in the parameters of the data-generating process.
We consider the problem of detecting such structural changes in a
high-dimensional regression setting. We propose a joint estimator of the number
and the locations of the change points and of the parameters in the
corresponding segments. The estimator can be computed using dynamic programming
or, as we emphasize here, it can be approximated using a binary search
algorithm with $O(n \log(n) \mathrm{Lasso}(n))$ computational operations while
still enjoying essentially the same theoretical properties; here
$\mathrm{Lasso}(n)$ denotes the computational cost of computing the Lasso for
sample size $n$. We establish oracle inequalities for the estimator as well as
for its binary search approximation, covering also the case with a large
(asymptotically growing) number of change points. We evaluate the performance
of the proposed estimation algorithms on simulated data and apply the
methodology to real data.
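
To make the idea of recursive splitting concrete, the following sketch detects
change points in a low-dimensional regression by greedy binary segmentation. It
is a simplification under stated assumptions: ordinary least squares stands in
for the Lasso, every admissible split position is scanned exhaustively (so the
cost is higher than the $O(n \log(n) \mathrm{Lasso}(n))$ quoted above), and the
names `min_len` and `penalty` are hypothetical tuning parameters chosen for
this toy example.

```python
import numpy as np

def rss(x, y):
    """Residual sum of squares of a least-squares fit on one segment."""
    coef = np.linalg.lstsq(x, y, rcond=None)[0]
    resid = y - x @ coef
    return float(resid @ resid)

def best_split(x, y, start, end, min_len):
    """Best single split of [start, end) and the RSS reduction it achieves."""
    base = rss(x[start:end], y[start:end])
    best_gain, best_pos = 0.0, None
    for pos in range(start + min_len, end - min_len + 1):
        gain = base - rss(x[start:pos], y[start:pos]) - rss(x[pos:end], y[pos:end])
        if gain > best_gain:
            best_gain, best_pos = gain, pos
    return best_pos, best_gain

def binary_segmentation(x, y, min_len=20, penalty=50.0):
    """Split recursively while the RSS reduction exceeds the penalty."""
    found, stack = [], [(0, len(y))]
    while stack:
        start, end = stack.pop()
        if end - start < 2 * min_len:
            continue
        pos, gain = best_split(x, y, start, end, min_len)
        if pos is not None and gain > penalty:
            found.append(pos)
            stack.extend([(start, pos), (pos, end)])
    return sorted(found)

rng = np.random.default_rng(2)
n, change = 200, 100
x = rng.normal(size=(n, 3))
beta_left, beta_right = np.array([2.0, 0.0, -1.0]), np.array([-2.0, 1.0, 0.0])
y = np.where(np.arange(n) < change, x @ beta_left, x @ beta_right)
y = y + 0.5 * rng.normal(size=n)
change_points = binary_segmentation(x, y)
print(change_points)
```

The penalty plays the role of a model-selection threshold: a split is kept
only if it reduces the residual sum of squares by more than that amount.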

We present a (selective) review of recent frequentist high-dimensional
inference methods for constructing $p$-values and confidence intervals in
linear and generalized linear models. We include a broad, comparative empirical
study which complements the viewpoint from statistical methodology and theory.
Furthermore, we introduce and illustrate the R-package hdi which easily allows
the use of different methods and supports reproducibility.

What is the difference between a prediction made with a causal model and one
made with a non-causal model? Suppose we intervene on the predictor variables
or change the whole environment. The predictions from a causal model will in
general work as well under interventions as for observational data. In
contrast, predictions from a non-causal model can potentially be very wrong if
we actively intervene
on variables. Here, we propose to exploit this invariance of a prediction under
a causal model for causal inference: given different experimental settings (for
example various interventions) we collect all models that do show invariance in
their predictive accuracy across settings and interventions. The causal model
will be a member of this set of models with high probability. This approach
yields valid confidence intervals for the causal relationships in quite general
scenarios. We examine the example of structural equation models in more detail
and provide sufficient assumptions under which the set of causal predictors
becomes identifiable. We further investigate robustness properties of our
approach under model misspecification and discuss possible extensions. The
empirical properties are studied for various data sets, including large-scale
gene perturbation experiments.

We consider the problem of inferring the total causal effect of a single
variable intervention on a (response) variable of interest. We propose a
certain marginal integration regression technique for a very general class of
potentially nonlinear structural equation models (SEMs) with known structure,
or at least known superset of adjustment variables: we call the procedure
S-mint regression. We easily derive that it achieves the same convergence rate
as nonparametric regression: for example, single variable intervention effects
can be estimated with convergence rate $n^{-2/5}$ assuming smoothness with
twice differentiable functions. Our result can also be seen as a major
robustness property with respect to model misspecification which goes much
beyond the notion of double robustness. Furthermore, when the structure of the
SEM is not known, we can estimate (the equivalence class of) the directed
acyclic graph corresponding to the SEM, and then proceed by using S-mint based
on these estimates. We empirically compare the S-mint regression method with
more classical approaches and argue that the former is indeed more robust, more
reliable and substantially simpler.
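
Marginal integration rests on the adjustment formula
$E[Y \mid do(X = x)] = E_S\{E[Y \mid X = x, S]\}$ for a valid adjustment set
$S$: a regression of $Y$ on $(X, S)$ is evaluated at the intervention value and
averaged over the observed values of $S$. The sketch below assumes a single
adjustment variable and uses a Nadaraya-Watson smoother with one common
bandwidth; the smoother, the bandwidth and the simulated model are illustrative
assumptions and not necessarily what the paper uses.

```python
import numpy as np

def gauss(u):
    """Unnormalized Gaussian kernel."""
    return np.exp(-0.5 * u**2)

def marginal_integration_effect(x0, x, s, y, h):
    """Estimate E[Y | do(X = x0)] by averaging a kernel regression over S."""
    wx = gauss((x - x0) / h)                    # closeness of each x_j to x0
    ws = gauss((s[:, None] - s[None, :]) / h)   # ws[i, j]: closeness of s_j to s_i
    w = ws * wx[None, :]
    m_hat = (w @ y) / w.sum(axis=1)             # m_hat[i] estimates E[Y | X = x0, S = s_i]
    return float(m_hat.mean())                  # integrate over the empirical law of S

def naive_regression(x0, x, y, h):
    """Kernel estimate of E[Y | X = x0], ignoring the adjustment variable."""
    w = gauss((x - x0) / h)
    return float(w @ y / w.sum())

rng = np.random.default_rng(3)
n = 1500
s = rng.normal(size=n)                          # common cause of x and y
x = 0.5 * s + rng.normal(size=n)
y = np.sin(x) + s + 0.5 * rng.normal(size=n)    # true effect of do(X = x) is sin(x)

x0, h = 1.0, 0.3
effect_mi = marginal_integration_effect(x0, x, s, y, h)
effect_naive = naive_regression(x0, x, y, h)
print(effect_mi, effect_naive, np.sin(x0))
```

In this simulated model the naive regression picks up the association induced
by the common cause, whereas the integrated estimate targets $\sin(x_0)$.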

Large-scale data are often characterized by some degree of inhomogeneity as
data are either recorded in different time regimes or taken from multiple
sources. We look at regression models and the effect of randomly changing
coefficients, where the change is either smooth in time or some other
dimension or even without any such structure. Fitting varying-coefficient
models or mixture models can be appropriate solutions but are computationally
very demanding and often return more information than necessary. If we just ask
for a model estimator that shows good predictive properties for all regimes of
the data, then we are aiming for a simple linear model that is reliable for all
possible subsets of the data. We propose the concept of "maximin effects" and a
suitable estimator and look at its prediction accuracy from a theoretical point
of view in a mixture model with known or unknown group structure. Under certain
circumstances the estimator can be computed orders of magnitude faster than
standard penalized regression estimators, making computations on large-scale
data feasible. Empirical examples complement the novel methodology and theory.

Given data sampled from a number of variables, one is often interested in the
underlying causal relationships in the form of a directed acyclic graph. In the
general case, without interventions on some of the variables it is only
possible to identify the graph up to its Markov equivalence class. However, in
some situations one can find the true causal graph just from observational
data, for example in structural equation models with additive noise and
nonlinear edge functions. Most current methods for achieving this rely on
nonparametric independence tests. One of the problems there is that the null
hypothesis is independence, which is what one would like to get evidence for.
We take a different approach in our work by using a penalized likelihood as a
score for model selection. This is practically feasible in many settings and
has the advantage of yielding a natural ranking of the candidate models. When
making smoothness assumptions on the probability density space, we prove
consistency of the penalized maximum likelihood estimator. We also present
empirical results for simulated scenarios and real two-dimensional data sets
(cause-effect pairs) where we obtain results similar to those of other
state-of-the-art methods.

We consider high-dimensional inference when the assumed linear model is
misspecified. We describe some correct interpretations and corresponding
sufficient assumptions for valid asymptotic inference of the model parameters,
which still have a useful meaning when the model is misspecified. We largely
focus on the de-sparsified Lasso procedure but we also indicate some
implications for (multiple) sample splitting techniques. In view of available
methods and software, our results contribute to robustness considerations with
respect to model misspecification.

One challenge of large-scale data analysis is that the assumption of an
identical distribution for all samples is often not realistic. An optimal
linear regression might, for example, be markedly different for distinct groups
of the data. Maximin effects have been proposed as a computationally attractive
way to estimate effects that are common across all data without fitting a
mixture distribution explicitly. So far just point estimators of the common
maximin effects have been proposed in Meinshausen and B\"uhlmann (2014). Here
we propose asymptotically valid confidence regions for these effects.

We propose a general, modular method for significance testing of groups (or
clusters) of variables in a high-dimensional linear model. In the presence of
high correlations among the covariables, due to serious problems of
identifiability, it is indispensable to focus on detecting groups of variables
rather than singletons. We propose an inference method which allows
hierarchical structures to be built in. It relies on repeated sample splitting
and sequential
rejection, and we prove that it asymptotically controls the familywise error
rate. It can be implemented on any collection of clusters and leads to improved
power in comparison to more standard non-sequential rejection methods. We
complete the theoretical analysis with empirical results for simulated and real
data.

We develop estimation for potentially high-dimensional additive structural
equation models. A key component of our approach is to decouple order search
among the variables from feature or edge selection in a directed acyclic graph
encoding the causal structure. We show that the former can be done with
non-regularized (restricted) maximum likelihood estimation while the latter can
be efficiently addressed using sparse regression techniques. Thus, we
substantially simplify the problem of structure search and estimation for an
important class of causal models. We establish consistency of the (restricted)
maximum likelihood estimator for low- and high-dimensional scenarios, and we
also allow for misspecification of the error distribution. Furthermore, we
develop an efficient computational algorithm which can deal with many
variables, and the new method's accuracy and performance are illustrated on
simulated and real data.

Large-scale data analysis poses both statistical and computational problems
which need to be addressed simultaneously. A solution is often straightforward
if the data are homogeneous: one can use classical ideas of subsampling and
mean aggregation to get a computationally efficient solution with acceptable
statistical accuracy, where the aggregation step simply averages the results
obtained on distinct subsets of the data. However, if the data exhibit
inhomogeneities (and typically they do), the same approach will be inadequate,
as it will be unduly influenced by effects that are not persistent across all
the data due to, for example, outliers or time-varying effects. We show that a
tweak to the aggregation step can produce an estimator of effects which are
common to all data, and hence interesting for interpretation and often leading
to better prediction than pooled effects.
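
One way such a modified aggregation step could look is sketched below: rather
than averaging the group-wise estimates, take the convex combination whose
fitted values have the smallest norm. Treating this as the tweak meant in the
abstract is an assumption made here for illustration. The sketch is restricted
to two groups, where the optimal weight has a closed form, and the function
name and simulated data are hypothetical.

```python
import numpy as np

def maximin_aggregate_two_groups(x, beta_a, beta_b):
    """Convex combination of two estimates with minimal norm of fitted values."""
    fit_a, fit_b = x @ beta_a, x @ beta_b
    diff = fit_b - fit_a
    denom = diff @ diff
    if denom == 0.0:
        w_a = 0.5   # identical fits: any weight gives the same result
    else:
        # Minimizer of ||w * fit_a + (1 - w) * fit_b||^2, clipped to [0, 1].
        w_a = float(np.clip((fit_b @ diff) / denom, 0.0, 1.0))
    return w_a * beta_a + (1.0 - w_a) * beta_b, w_a

rng = np.random.default_rng(4)
beta_a, beta_b = np.array([1.0, 2.0]), np.array([1.0, -0.5])   # first effect is shared
x_a, x_b = rng.normal(size=(2000, 2)), rng.normal(size=(2000, 2))
y_a = x_a @ beta_a + rng.normal(size=2000)
y_b = x_b @ beta_b + rng.normal(size=2000)

# Group-wise least-squares estimates, then the two aggregation rules.
est_a = np.linalg.lstsq(x_a, y_a, rcond=None)[0]
est_b = np.linalg.lstsq(x_b, y_b, rcond=None)[0]
x_all = np.vstack([x_a, x_b])
beta_maximin, weight_a = maximin_aggregate_two_groups(x_all, est_a, est_b)
beta_mean = 0.5 * (est_a + est_b)
print("maximin-type aggregation:", beta_maximin)
print("mean aggregation:        ", beta_mean)
```

In this example the second coefficient changes sign between the groups, so
the minimum-norm combination shrinks it towards zero while keeping the shared
first coefficient; plain averaging retains a nonzero second coefficient.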

We propose a method for testing whether hierarchically ordered groups of
potentially correlated variables are significant for explaining a response in a
high-dimensional linear model. In the presence of highly correlated variables, as
is very common in high-dimensional data, it seems indispensable to go beyond an
approach of inferring individual regression coefficients, and we show that
detecting smallest groups of variables (MTDs: minimal true detections) is
realistic. Thanks to the hierarchy among the groups of variables, powerful
multiple testing adjustment is possible which leads to a data-driven choice of
the resolution level for the groups. Our procedure, based on repeated sample
splitting, is shown to asymptotically control the familywise error rate and we
provide empirical results for simulated and real data which complement the
theoretical analysis. Supplementary materials for this article are available
after the References.

We propose a general method for constructing confidence intervals and
statistical tests for single or low-dimensional components of a large parameter
vector in a high-dimensional model. It can be easily adjusted for multiplicity
taking dependence among tests into account. For linear models, our method is
essentially the same as in Zhang and Zhang [J. R. Stat. Soc. Ser. B Stat.
Methodol. 76 (2014) 217-242]: we analyze its asymptotic properties and
establish its asymptotic optimality in terms of semiparametric efficiency. Our
method naturally extends to generalized linear models with convex loss
functions. We develop the corresponding theory which includes a careful
analysis for Gaussian, sub-Gaussian and bounded correlated designs.

Discussion of "A significance test for the lasso" by Richard Lockhart,
Jonathan Taylor, Ryan J. Tibshirani, Robert Tibshirani [arXiv:1301.7161].