
A classical problem in causal inference is that of matching, where treatment
units need to be matched to control units based on covariate information. In
this work, we propose a method that computes high quality almostexact matches
for highdimensional categorical datasets. This method, called FLAME (Fast
Largescale Almost Matching Exactly), learns a distance metric for matching
using a holdout training data set. In order to perform matching efficiently
for large datasets, FLAME leverages techniques that are natural for query
processing in the area of database management, and two implementations of FLAME
are provided: the first uses SQL queries and the second uses bitvector
techniques. The algorithm starts by constructing matches of the highest quality
(exact matches on all covariates), and successively eliminates variables in
order to match exactly on as many variables as possible, while still
maintaining interpretable highquality matches and balance between treatment
and control groups. We leverage these high quality matches to estimate
conditional average treatment effects (CATEs). Our experiments show that FLAME
scales to huge datasets with millions of observations where existing
stateoftheart methods fail, and that it achieves significantly better
performance than other matching methods.

There are many approaches is use today to either prevent or minimize the
impact of interquery interactions on a shared cluster. Despite these measures,
performance issues due to concurrent executions of mixed workloads still
prevail causing undue waiting times for queries. Analyzing these resource
interferences is thus critical in order to answer time sensitive questions like
'who is causing my query to slowdown' in a multitenant environment. More
importantly, dignosing whether the slowdown of a query is a result of resource
contentions caused by other queries or some other external factor can help an
admin narrow down the many possibilities of performance degradation. This
process of investigating the symptoms of resource contentions and attributing
blame to concurrent queries is nontrivial and tedious, and involves hours of
manually debugging through a cycle of query interactions.
In this paper, we present ProtoXplore  a Proto or first system to eXplore
contentions, that helps administrators determine whether the blame for resource
bottlenecks can be attributed to concurrent queries, and uses a methodology
called Resource Acquire Time Penalty (RATP) to quantify this blame towards
contentious sources accurately. Further, ProtoXplore builds on the theory of
explanations and enables a stepwise deep exploration of various levels of
performance bottlenecks faced by a query during its execution using a
multilevel directed acyclic graph called ProtoGraph. Our experimental
evaluation uses ProtoXplore to analyze the interactions between TPCDS queries
on Apache Spark to show how ProtoXplore provides explanations that help in
diagnosing contention related issues and better managing a changing mixed
workload in a shared cluster.

The study of causality or causal inference  how much a given treatment
causally affects a given outcome in a population  goes way beyond correlation
or association analysis of variables, and is critical in making sound data
driven decisions and policies in a multitude of applications. The gold standard
in causal inference is performing "controlled experiments", which often is not
possible due to logistical or ethical reasons. As an alternative, inferring
causality on "observational data" based on the "NeymanRubin potential outcome
model" has been extensively used in statistics, economics, and social sciences
over several decades. In this paper, we present a formal framework for sound
causal analysis on observational datasets that are given as multiple relations
and where the population under study is obtained by joining these base
relations. We study a crucial condition for inferring causality from
observational data, called the "strong ignorability assumption" (the treatment
and outcome variables should be independent in the joined relation given the
observed covariates), using known conditional independences that hold in the
base relations. We also discuss how the structure of the conditional
independences in base relations given as graphical models help infer new
conditional independences in the joined relation. The proposed framework
combines concepts from databases, statistics, and graphical models, and aims to
initiate new research directions spanning these fields to facilitate powerful
datadriven decisions in today's big data world.

In this paper, we study the complexity of answering conjunctive queries (CQ)
with inequalities). In particular, we are interested in comparing the
complexity of the query with and without inequalities. The main contribution of
our work is a novel combinatorial technique that enables us to use any
SelectProjectJoin query plan for a given CQ without inequalities in answering
the CQ with inequalities, with an additional factor in running time that only
depends on the query. The key idea is to define a new projection operator,
which keeps a small representation (independent of the size of the database) of
the set of input tuples that map to each tuple in the output of the projection;
this representation is used to evaluate all the inequalities in the query.
Second, we generalize a result by PapadimitriouYannakakis [17] and give an
alternative algorithm based on the colorcoding technique [4] to evaluate a CQ
with inequalities by using an algorithm for the CQ without inequalities. Third,
we investigate the structure of the query graph, inequality graph, and the
augmented query graph with inequalities, and show that even if the query and
the inequality graphs have bounded treewidth, the augmented graph not only can
have an unbounded treewidth but can also be NPhard to evaluate. Further, we
illustrate classes of queries and inequalities where the augmented graphs have
unbounded treewidth, but the CQ with inequalities can be evaluated in
polytime. Finally, we give necessary properties and sufficient properties that
allow a class of CQs to have polytime combined complexity with respect to any
inequality pattern. We also illustrate classes of queries where our
queryplanbased technique outperforms the alternative approaches discussed in
the paper.

Query evaluation in tupleindependent probabilistic databases is the problem
of computing the probability of an answer to a query given independent
probabilities of the individual tuples in a database instance. There are two
main approaches to this problem: (1) in `grounded inference' one first obtains
the lineage for the query and database instance as a Boolean formula, then
performs weighted model counting on the lineage (i.e., computes the probability
of the lineage given probabilities of its independent Boolean variables); (2)
in methods known as `lifted inference' or `extensional query evaluation', one
exploits the highlevel structure of the query as a firstorder formula.
Although it is widely believed that lifted inference is strictly more powerful
than grounded inference on the lineage alone, no formal separation has
previously been shown for query evaluation. In this paper we show such a formal
separation for the first time.
We exhibit a class of queries for which model counting can be done in
polynomial time using extensional query evaluation, whereas the algorithms used
in stateoftheart exact model counters on their lineages provably require
exponential time. Our lower bounds on the running times of these exact model
counters follow from new exponential size lower bounds on the kinds of dDNNF
representations of the lineages that these model counters (either explicitly or
implicitly) produce. Though some of these queries have been studied before, no
nontrivial lower bounds on the sizes of these representations for these
queries were previously known.

The best current methods for exactly computing the number of satisfying
assignments, or the satisfying probability, of Boolean formulas can be seen,
either directly or indirectly, as building 'decisionDNNF' (decision
decomposable negation normal form) representations of the input Boolean
formulas. DecisionDNNFs are a special case of 'dDNNF's where 'd' stands for
'deterministic'. We show that any decisionDNNF can be converted into an
equivalent 'FBDD' (free binary decision diagram)  also known as a 'readonce
branching program' (ROBP or 1BP)  with only a quasipolynomial increase in
representation size in general, and with only a polynomial increase in size in
the special case of monotone kDNF formulas. Leveraging known exponential lower
bounds for FBDDs, we then obtain similar exponential lower bounds for
decisionDNNFs which provide lower bounds for the recent algorithms. We also
separate the power of decisionDNNFs from dDNNFs and a generalization of
decisionDNNFs known as ANDFBDDs. Finally we show how these imply exponential
lower bounds for natural problems associated with probabilistic databases.

We study the problem of concealing functionality of a proprietary or private
module when provenance information is shown over repeated executions of a
workflow which contains both `public' and `private' modules. Our approach is to
use `provenance views' to hide carefully chosen subsets of data over all
executions of the workflow to ensure Gprivacy: for each private module and
each input x, the module's output f(x) is indistinguishable from G 1 other
possible values given the visible data in the workflow executions. We show that
Gprivacy cannot be achieved simply by combining solutions for individual
private modules; data hiding must also be `propagated' through public modules.
We then examine how much additional data must be hidden and when it is safe to
stop propagating data hiding. The answer depends strongly on the workflow
topology as well as the behavior of public modules on the visible data. In
particular, for a class of workflows (which include the common tree and chain
workflows), taking private solutions for each private module, augmented with a
`public closure' that is `upstreamdownstream safe', ensures Gprivacy. We
define these notions formally and show that the restrictions are necessary. We
also study the related optimization problems of minimizing the amount of hidden
data.

Scientific workflow systems increasingly store provenance information about
the module executions used to produce a data item, as well as the parameter
settings and intermediate data items passed between module executions. However,
authors/owners of workflows may wish to keep some of this information
confidential. In particular, a module may be proprietary, and users should not
be able to infer its behavior by seeing mappings between all data inputs and
outputs. The problem we address in this paper is the following: Given a
workflow, abstractly modeled by a relation R, a privacy requirement \Gamma and
costs associated with data. The owner of the workflow decides which data
(attributes) to hide, and provides the user with a view R' which is the
projection of R over attributes which have not been hidden. The goal is to
minimize the cost of hidden data while guaranteeing that individual modules are
\Gamma private. We call this the "secureview" problem. We formally define the
problem, study its complexity, and offer algorithmic solutions.

A boolean expression is in readonce form if each of its variables appears
exactly once. When the variables denote independent events in a probability
space, the probability of the event denoted by the whole expression in
readonce form can be computed in polynomial time (whereas the general problem
for arbitrary expressions is #Pcomplete). Known approaches to checking
readonce property seem to require putting these expressions in disjunctive
normal form. In this paper, we tell a better story for a large subclass of
boolean event expressions: those that are generated by conjunctive queries
without selfjoins and on tupleindependent probabilistic databases. We first
show that given a tupleindependent representation and the provenance graph of
an SPJ query plan without selfjoins, we can, without using the DNF of a result
event expression, efficiently compute its cooccurrence graph. From this, the
readonce form can already, if it exists, be computed efficiently using
existing techniques. Our second and key contribution is a complete, efficient,
and simple to implement algorithm for computing the readonce forms (whenever
they exist) directly, using a new concept, that of cotable graph, which can be
significantly smaller than the cooccurrence graph.