
We propose an adversarial training procedure for learning a causal implicit
generative model for a given causal graph. We show that adversarial training
can be used to learn a generative model with true observational and
interventional distributions if the generator architecture is consistent with
the given causal graph. We consider the application of generating faces based
on given binary labels where the dependency structure between the labels is
preserved with a causal graph. This problem can be seen as learning a causal
implicit generative model for the image and labels. We devise a twostage
procedure for this problem. First we train a causal implicit generative model
over binary labels using a neural network consistent with a causal graph as the
generator. We empirically show that WassersteinGAN can be used to output
discrete labels. Later, we propose two new conditional GAN architectures, which
we call CausalGAN and CausalBEGAN. We show that the optimal generator of the
CausalGAN, given the labels, samples from the image distributions conditioned
on these labels. The conditional GAN combined with a trained causal implicit
generative model for the labels is then a causal implicit generative model over
the labels and the generated image. We show that the proposed architectures can
be used to sample from observational and interventional image distributions,
even for interventions which do not naturally occur in the dataset.

We consider the problem of learning a causal graph over a set of variables
with interventions. We study the costoptimal causal graph learning problem:
For a given skeleton (undirected version of the causal graph), design the set
of interventions with minimum total cost, that can uniquely identify any causal
graph with the given skeleton. We show that this problem is solvable in
polynomial time. Later, we consider the case when the number of interventions
is limited. For this case, we provide polynomial time algorithms when the
skeleton is a tree or a clique tree. For a general chordal skeleton, we develop
an efficient greedy algorithm, which can be improved when the causal graph
skeleton is an interval graph.

We consider support recovery in the quadratic logistic regression setting 
where the target depends on both p linear terms $x_i$ and up to $p^2$ quadratic
terms $x_i x_j$. Quadratic terms enable prediction/modeling of higherorder
effects between features and the target, but when incorporated naively may
involve solving a very large regression problem. We consider the sparse case,
where at most $s$ terms (linear or quadratic) are nonzero, and provide a new
faster algorithm. It involves (a) identifying the weak support (i.e. all
relevant variables) and (b) standard logistic regression optimization only on
these chosen variables. The first step relies on a novel insight about
correlation tests in the presence of nonlinearity, and takes $O(pn)$ time for
$n$ samples  giving potentially huge computational gains over the naive
approach. Motivated by insights from the boolean case, we propose a nonlinear
correlation test for nonbinary finite support case that involves hashing a
variable and then correlating with the output variable. We also provide
experimental results to demonstrate the effectiveness of our methods.

We study the problem of identifying the causal relationship between two
discrete random variables from observational data. We recently proposed a novel
framework called entropic causality that works in a very general functional
model but makes the assumption that the unobserved exogenous variable has small
entropy in the true causal direction.
This framework requires the solution of a minimum entropy coupling problem:
Given marginal distributions of m discrete random variables, each on n states,
find the joint distribution with minimum entropy, that respects the given
marginals. This corresponds to minimizing a concave function of nm variables
over a convex polytope defined by nm linear constraints, called a
transportation polytope. Unfortunately, it was recently shown that this minimum
entropy coupling problem is NPhard, even for 2 variables with n states. Even
representing points (joint distributions) over this space can require
exponential complexity (in n, m) if done naively.
In our recent work we introduced an efficient greedy algorithm to find an
approximate solution for this problem. In this paper we analyze this algorithm
and establish two results: that our algorithm always finds a local minimum and
also is within an additive approximation error from the unknown global optimum.

We consider the problem of identifying the causal direction between two
discrete random variables using observational data. Unlike previous work, we
keep the most general functional model but make an assumption on the unobserved
exogenous variable: Inspired by Occam's razor, we assume that the exogenous
variable is simple in the true causal direction. We quantify simplicity using
R\'enyi entropy. Our main result is that, under natural assumptions, if the
exogenous variable has low $H_0$ entropy (cardinality) in the true direction,
it must have high $H_0$ entropy in the wrong direction. We establish several
algorithmic hardness results about estimating the minimum entropy exogenous
variable. We show that the problem of finding the exogenous variable with
minimum entropy is equivalent to the problem of finding minimum joint entropy
given $n$ marginal distributions, also known as minimum entropy coupling
problem. We propose an efficient greedy algorithm for the minimum entropy
coupling problem, that for $n=2$ provably finds a local optimum. This gives a
greedy algorithm for finding the exogenous variable with minimum $H_1$ (Shannon
Entropy). Our greedy entropybased causal inference algorithm has similar
performance to the state of the art additive noise models in real datasets. One
advantage of our approach is that we make no use of the values of random
variables but only their distributions. Our method can therefore be used for
causal inference for both ordinal and also categorical data, unlike additive
noise models.

Motivated by online recommendation and advertising systems, we consider a
causal model for stochastic contextual bandits with a latent lowdimensional
confounder. In our model, there are $L$ observed contexts and $K$ arms of the
bandit. The observed context influences the reward obtained through a latent
confounder variable with cardinality $m$ ($m \ll L,K$). The arm choice and the
latent confounder causally determines the reward while the observed context is
correlated with the confounder. Under this model, the $L \times K$ mean reward
matrix $\mathbf{U}$ (for each context in $[L]$ and each arm in $[K]$)
factorizes into nonnegative factors $\mathbf{A}$ ($L \times m$) and
$\mathbf{W}$ ($m \times K$). This insight enables us to propose an
$\epsilon$greedy NMFBandit algorithm that designs a sequence of interventions
(selecting specific arms), that achieves a balance between learning this
lowdimensional structure and selecting the best arm to minimize regret. Our
algorithm achieves a regret of $\mathcal{O}\left(L\mathrm{poly}(m, \log K) \log
T \right)$ at time $T$, as compared to $\mathcal{O}(LK\log T)$ for conventional
contextual bandits, assuming a constant gap between the best arm and the rest
for each context. These guarantees are obtained under mild sufficiency
conditions on the factors that are weaker versions of the wellknown
Statistical RIP condition. We further propose a class of generative models that
satisfy our sufficient conditions, and derive a lower bound of
$\mathcal{O}\left(Km\log T\right)$. These are the first regret guarantees for
online matrix completion with bandit feedback, when the rank is greater than
one. We further compare the performance of our algorithm with the state of the
art, on synthetic and real world datasets.

We consider the problem of learning causal networks with interventions, when
each intervention is limited in size under Pearl's Structural Equation Model
with independent errors (SEMIE). The objective is to minimize the number of
experiments to discover the causal directions of all the edges in a causal
graph. Previous work has focused on the use of separating systems for complete
graphs for this task. We prove that any deterministic adaptive algorithm needs
to be a separating system in order to learn complete graphs in the worst case.
In addition, we present a novel separating system construction, whose size is
close to optimal and is arguably simpler than previous work in combinatorics.
We also develop a novel information theoretic lower bound on the number of
interventions that applies in full generality, including for randomized
adaptive learning algorithms.
For general chordal graphs, we derive worst case lower bounds on the number
of interventions. Building on observations about induced trees, we give a new
deterministic adaptive algorithm to learn directions on any chordal skeleton
completely. In the worst case, our achievable scheme is an
$\alpha$approximation algorithm where $\alpha$ is the independence number of
the graph. We also show that there exist graph classes for which the sufficient
number of experiments is close to the lower bound. In the other extreme, there
are graph classes for which the required number of experiments is
multiplicatively $\alpha$ away from our lower bound.
In simulations, our algorithm almost always performs very close to the lower
bound, while the approach based on separating systems for complete graphs is
significantly worse for random chordal graphs.

Let $f:\{1,1\}^n$ be a polynomial with at most $s$ nonzero real
coefficients. We give an algorithm for exactly reconstructing f given random
examples from the uniform distribution on $\{1,1\}^n$ that runs in time
polynomial in $n$ and $2s$ and succeeds if the function satisfies the unique
sign property: there is one output value which corresponds to a unique set of
values of the participating parities. This sufficient condition is satisfied
when every coefficient of f is perturbed by a small random noise, or satisfied
with high probability when s parity functions are chosen randomly or when all
the coefficients are positive. Learning sparse polynomials over the Boolean
domain in time polynomial in $n$ and $2s$ is considered notoriously hard in the
worstcase. Our result shows that the problem is tractable for almost all
sparse polynomials. Then, we show an application of this result to hypergraph
sketching which is the problem of learning a sparse (both in the number of
hyperedges and the size of the hyperedges) hypergraph from uniformly drawn
random cuts. We also provide experimental results on a real world dataset.