• We propose an adversarial training procedure for learning a causal implicit generative model for a given causal graph. We show that adversarial training can be used to learn a generative model with true observational and interventional distributions if the generator architecture is consistent with the given causal graph. We consider the application of generating faces based on given binary labels where the dependency structure between the labels is preserved with a causal graph. This problem can be seen as learning a causal implicit generative model for the image and labels. We devise a two-stage procedure for this problem. First we train a causal implicit generative model over binary labels using a neural network consistent with a causal graph as the generator. We empirically show that WassersteinGAN can be used to output discrete labels. Later, we propose two new conditional GAN architectures, which we call CausalGAN and CausalBEGAN. We show that the optimal generator of the CausalGAN, given the labels, samples from the image distributions conditioned on these labels. The conditional GAN combined with a trained causal implicit generative model for the labels is then a causal implicit generative model over the labels and the generated image. We show that the proposed architectures can be used to sample from observational and interventional image distributions, even for interventions which do not naturally occur in the dataset.
  • We consider the problem of learning a causal graph over a set of variables with interventions. We study the cost-optimal causal graph learning problem: For a given skeleton (undirected version of the causal graph), design the set of interventions with minimum total cost, that can uniquely identify any causal graph with the given skeleton. We show that this problem is solvable in polynomial time. Later, we consider the case when the number of interventions is limited. For this case, we provide polynomial time algorithms when the skeleton is a tree or a clique tree. For a general chordal skeleton, we develop an efficient greedy algorithm, which can be improved when the causal graph skeleton is an interval graph.
  • We consider support recovery in the quadratic logistic regression setting - where the target depends on both p linear terms $x_i$ and up to $p^2$ quadratic terms $x_i x_j$. Quadratic terms enable prediction/modeling of higher-order effects between features and the target, but when incorporated naively may involve solving a very large regression problem. We consider the sparse case, where at most $s$ terms (linear or quadratic) are non-zero, and provide a new faster algorithm. It involves (a) identifying the weak support (i.e. all relevant variables) and (b) standard logistic regression optimization only on these chosen variables. The first step relies on a novel insight about correlation tests in the presence of non-linearity, and takes $O(pn)$ time for $n$ samples - giving potentially huge computational gains over the naive approach. Motivated by insights from the boolean case, we propose a non-linear correlation test for non-binary finite support case that involves hashing a variable and then correlating with the output variable. We also provide experimental results to demonstrate the effectiveness of our methods.
  • We study the problem of identifying the causal relationship between two discrete random variables from observational data. We recently proposed a novel framework called entropic causality that works in a very general functional model but makes the assumption that the unobserved exogenous variable has small entropy in the true causal direction. This framework requires the solution of a minimum entropy coupling problem: Given marginal distributions of m discrete random variables, each on n states, find the joint distribution with minimum entropy, that respects the given marginals. This corresponds to minimizing a concave function of nm variables over a convex polytope defined by nm linear constraints, called a transportation polytope. Unfortunately, it was recently shown that this minimum entropy coupling problem is NP-hard, even for 2 variables with n states. Even representing points (joint distributions) over this space can require exponential complexity (in n, m) if done naively. In our recent work we introduced an efficient greedy algorithm to find an approximate solution for this problem. In this paper we analyze this algorithm and establish two results: that our algorithm always finds a local minimum and also is within an additive approximation error from the unknown global optimum.
  • We consider the problem of identifying the causal direction between two discrete random variables using observational data. Unlike previous work, we keep the most general functional model but make an assumption on the unobserved exogenous variable: Inspired by Occam's razor, we assume that the exogenous variable is simple in the true causal direction. We quantify simplicity using R\'enyi entropy. Our main result is that, under natural assumptions, if the exogenous variable has low $H_0$ entropy (cardinality) in the true direction, it must have high $H_0$ entropy in the wrong direction. We establish several algorithmic hardness results about estimating the minimum entropy exogenous variable. We show that the problem of finding the exogenous variable with minimum entropy is equivalent to the problem of finding minimum joint entropy given $n$ marginal distributions, also known as minimum entropy coupling problem. We propose an efficient greedy algorithm for the minimum entropy coupling problem, that for $n=2$ provably finds a local optimum. This gives a greedy algorithm for finding the exogenous variable with minimum $H_1$ (Shannon Entropy). Our greedy entropy-based causal inference algorithm has similar performance to the state of the art additive noise models in real datasets. One advantage of our approach is that we make no use of the values of random variables but only their distributions. Our method can therefore be used for causal inference for both ordinal and also categorical data, unlike additive noise models.
  • Motivated by online recommendation and advertising systems, we consider a causal model for stochastic contextual bandits with a latent low-dimensional confounder. In our model, there are $L$ observed contexts and $K$ arms of the bandit. The observed context influences the reward obtained through a latent confounder variable with cardinality $m$ ($m \ll L,K$). The arm choice and the latent confounder causally determines the reward while the observed context is correlated with the confounder. Under this model, the $L \times K$ mean reward matrix $\mathbf{U}$ (for each context in $[L]$ and each arm in $[K]$) factorizes into non-negative factors $\mathbf{A}$ ($L \times m$) and $\mathbf{W}$ ($m \times K$). This insight enables us to propose an $\epsilon$-greedy NMF-Bandit algorithm that designs a sequence of interventions (selecting specific arms), that achieves a balance between learning this low-dimensional structure and selecting the best arm to minimize regret. Our algorithm achieves a regret of $\mathcal{O}\left(L\mathrm{poly}(m, \log K) \log T \right)$ at time $T$, as compared to $\mathcal{O}(LK\log T)$ for conventional contextual bandits, assuming a constant gap between the best arm and the rest for each context. These guarantees are obtained under mild sufficiency conditions on the factors that are weaker versions of the well-known Statistical RIP condition. We further propose a class of generative models that satisfy our sufficient conditions, and derive a lower bound of $\mathcal{O}\left(Km\log T\right)$. These are the first regret guarantees for online matrix completion with bandit feedback, when the rank is greater than one. We further compare the performance of our algorithm with the state of the art, on synthetic and real world data-sets.
  • We consider the problem of learning causal networks with interventions, when each intervention is limited in size under Pearl's Structural Equation Model with independent errors (SEM-IE). The objective is to minimize the number of experiments to discover the causal directions of all the edges in a causal graph. Previous work has focused on the use of separating systems for complete graphs for this task. We prove that any deterministic adaptive algorithm needs to be a separating system in order to learn complete graphs in the worst case. In addition, we present a novel separating system construction, whose size is close to optimal and is arguably simpler than previous work in combinatorics. We also develop a novel information theoretic lower bound on the number of interventions that applies in full generality, including for randomized adaptive learning algorithms. For general chordal graphs, we derive worst case lower bounds on the number of interventions. Building on observations about induced trees, we give a new deterministic adaptive algorithm to learn directions on any chordal skeleton completely. In the worst case, our achievable scheme is an $\alpha$-approximation algorithm where $\alpha$ is the independence number of the graph. We also show that there exist graph classes for which the sufficient number of experiments is close to the lower bound. In the other extreme, there are graph classes for which the required number of experiments is multiplicatively $\alpha$ away from our lower bound. In simulations, our algorithm almost always performs very close to the lower bound, while the approach based on separating systems for complete graphs is significantly worse for random chordal graphs.
  • Let $f:\{-1,1\}^n$ be a polynomial with at most $s$ non-zero real coefficients. We give an algorithm for exactly reconstructing f given random examples from the uniform distribution on $\{-1,1\}^n$ that runs in time polynomial in $n$ and $2s$ and succeeds if the function satisfies the unique sign property: there is one output value which corresponds to a unique set of values of the participating parities. This sufficient condition is satisfied when every coefficient of f is perturbed by a small random noise, or satisfied with high probability when s parity functions are chosen randomly or when all the coefficients are positive. Learning sparse polynomials over the Boolean domain in time polynomial in $n$ and $2s$ is considered notoriously hard in the worst-case. Our result shows that the problem is tractable for almost all sparse polynomials. Then, we show an application of this result to hypergraph sketching which is the problem of learning a sparse (both in the number of hyperedges and the size of the hyperedges) hypergraph from uniformly drawn random cuts. We also provide experimental results on a real world dataset.