
This document defines extensions of the RDF data model and of the SPARQL
query language that capture an alternative approach to represent
statementlevel metadata. While this alternative approach is backwards
compatible with RDF reification as defined by the RDF standard, the approach
aims to address usability and data management shortcomings of RDF reification.
One of the great advantages of the proposed approach is that it clarifies a
means to (i) understand sparse matrices, the property graph model, hypergraphs,
and other data structures with an emphasis on link attributes, (ii) map such
data onto RDF, and (iii) query such data using SPARQL. Further, the proposal
greatly expands both the freedom that database designers enjoy when creating
physical indexing schemes and query plans for graph data annotated with link
attributes and the interoperability of those database solutions.

Exact Maximum Inner Product Search (MIPS) is an important task that is widely
pertinent to recommender systems and highdimensional similarity search. The
bruteforce approach to solving exact MIPS is computationally expensive, thus
spurring recent development of novel indexes and pruning techniques for this
task. In this paper, we show that a hardwareefficient bruteforce approach,
blocked matrix multiply (BMM), can outperform the stateoftheart MIPS solvers
by over an order of magnitude, for some  but not all  inputs.
In this paper, we also present a novel MIPS solution, MAXIMUS, that takes
advantage of hardware efficiency and pruning of the search space. Like BMM,
MAXIMUS is faster than other solvers by up to an order of magnitude, but again
only for some inputs. Since no single solution offers the best runtime
performance for all inputs, we introduce a new datadependent optimizer,
OPTIMUS, that selects online with minimal overhead the best MIPS solver for a
given input. Together, OPTIMUS and MAXIMUS outperform stateoftheart MIPS
solvers by 3.2$\times$ on average, and up to 10.9$\times$, on widely studied
MIPS datasets.

We study rewritability of monadic disjunctive Datalog programs, (the
complements of) MMSNP sentences, and ontologymediated queries (OMQs) based on
expressive description logics of the ALC family and on conjunctive queries. We
show that rewritability into FO and into monadic Datalog (MDLog) are decidable,
and that rewritability into Datalog is decidable when the original query
satisfies a certain condition related to equality. We establish
2NExpTimecompleteness for all studied problems except rewritability into MDLog
for which there remains a gap between 2NExpTime and 3ExpTime. We also analyze
the shape of rewritings, which in the MMSNP case correspond to obstructions,
and give a new construction of canonical Datalog programs that is more
elementary than existing ones and also applies to formulas with free variables.

Data exchange is the problem of transforming data that is structured under a
source schema into data structured under another schema, called the target
schema, so that both the source and target data satisfy the relationship
between the schemas. Even though the formal framework of data exchange for
relational database systems is wellestablished, it does not immediately carry
over to the settings of temporal data, which necessitates reasoning over
unbounded periods of time. In this work, we study data exchange for temporal
data. We first motivate the need for two views of temporal data: the concrete
view, which depicts how temporal data is compactly represented and on which the
implementations are based, and the abstract view, which defines the semantics
of temporal data as a sequence of snapshots. We first extend the chase
procedure for the abstract view to have a conceptual basis for the data
exchange for temporal databases. Considering nontemporal sourcetotarget
tuple generating dependencies and equality generating dependencies, the chase
algorithm can be applied on each snapshot independently. Then we define a chase
procedure (called cchase) on concrete instances and show the result of cchase
on a concrete instance is semantically aligned with the result of chase on the
corresponding abstract instance. In order to interpret intervals as constants
while checking if a dependency or a query is satisfied by a concrete database,
we will normalize the instance with respect to the dependency or the query. To
obtain the semantic alignment, the nulls in the concrete view are annotated
with temporal information. Furthermore, we show that the result of the concrete
chase provides a foundation for query answering. We define naive evaluation on
the result of the cchase and show it produces certain answers.

We study a natural variant of the implicational fragment of propositional
logic. Its formulas are pairs of conjunctions of positive literals, related
together by an implicationallike connective; the semantics of this sort of
implication is defined in terms of a threshold on a conditional probability of
the consequent, given the antecedent: we are dealing with what the data
analysis community calls confidence of partial implications or association
rules. Existing studies of redundancy among these partial implications have
characterized so far only entailment from one premise and entailment from two
premises, both in the standalone case and in the case of presence of
additional classical implications (this is what we call "relative entailment").
By exploiting a previously noted alternative view of the entailment in terms of
linear programming duality, we characterize exactly the cases of entailment
from arbitrary numbers of premises, again both in the standalone case and in
the case of presence of additional classical implications. As a result, we
obtain decision algorithms of better complexity; additionally, for each
potential case of entailment, we identify a critical confidence threshold and
show that it is, actually, intrinsic to each set of premises and antecedent of
the conclusion.

A classical problem in causal inference is that of matching, where treatment
units need to be matched to control units based on covariate information. In
this work, we propose a method that computes high quality almostexact matches
for highdimensional categorical datasets. This method, called FLAME (Fast
Largescale Almost Matching Exactly), learns a distance metric for matching
using a holdout training data set. In order to perform matching efficiently
for large datasets, FLAME leverages techniques that are natural for query
processing in the area of database management, and two implementations of FLAME
are provided: the first uses SQL queries and the second uses bitvector
techniques. The algorithm starts by constructing matches of the highest quality
(exact matches on all covariates), and successively eliminates variables in
order to match exactly on as many variables as possible, while still
maintaining interpretable highquality matches and balance between treatment
and control groups. We leverage these high quality matches to estimate
conditional average treatment effects (CATEs). Our experiments show that FLAME
scales to huge datasets with millions of observations where existing
stateoftheart methods fail, and that it achieves significantly better
performance than other matching methods.

Data compression is a popular technique for improving the efficiency of data
processing workloads such as SQL queries and more recently, machine learning
(ML) with classical batch gradient methods. But the efficacy of such ideas for
minibatch stochastic gradient descent (MGD), arguably the workhorse algorithm
of modern ML, is an open question. MGD's unique data access pattern renders
prior art, including those designed for batch gradient methods, less effective.
We fill this crucial research gap by proposing a new lossless compression
scheme we call tupleoriented compression (TOC) that is inspired by an unlikely
source, the string/text compression scheme LempelZivWelch, but tailored to
MGD in a way that preserves tuple boundaries within minibatches. We then
present a suite of novel compressed matrix operation execution techniques
tailored to the TOC compression scheme that operate directly over the
compressed data representation and avoid decompression overheads. An extensive
empirical evaluation with realworld datasets shows that TOC consistently
achieves substantial compression ratios by up to 51x and reduces runtimes for
MGD workloads by up to 10.2x in popular ML systems.

Visual query systems (VQSs) empower users to interactively search for line
charts with desired visual patterns, typically specified using intuitive
sketchbased interfaces. Despite decades of past work on VQSs, these efforts
have not translated to adoption in practice, possibly because VQSs are largely
evaluated in unrealistic labbased settings. To remedy this gap in adoption, we
collaborated with experts from three diverse domainsastronomy, genetics, and
material sciencevia a yearlong usercentered design process to develop a
VQS that supports their workflow and analytical needs, and evaluate how VQSs
can be used in practice. Our study results reveal that adhoc sketchonly
querying is not as commonly used as prior work suggests, since analysts are
often unable to precisely express their patterns of interest. In addition, we
characterize three essential sensemaking processes supported by our enhanced
VQS. We discover that participants employ all three processes, but in different
proportions, depending on the analytical needs in each domain. Our findings
suggest that all three sensemaking processes must be integrated in order to
make future VQSs useful for a wide range of analytical inquiries.

Hypertree decompositions, as well as the more powerful generalized hypertree
decompositions (GHDs), and the yet more general fractional hypertree
decompositions (FHD) are hypergraph decomposition methods successfully used for
answering conjunctive queries and for the solution of constraint satisfaction
problems. Every hypergraph H has a width relative to each of these
decomposition methods: its hypertree width hw(H), its generalized hypertree
width ghw(H), and its fractional hypertree width fhw(H), respectively.
It is known that hw(H) <= k can be checked in polynomial time for fixed k,
while checking ghw(H) <= k is NPcomplete for any k greater than or equal to 3.
The complexity of checking fhw(H) <= k for a fixed k has been open for more
than a decade.
We settle this open problem by showing that checking fhw(H) <= k is
NPcomplete, even for k=2. The same construction allows us to prove also the
NPcompleteness of checking ghw(H) <= k for k=2. After proving these hardness
results, we identify meaningful restrictions, for which checking for bounded
ghw or fhw becomes tractable.

Integrated solutions for analytics over relational databases are of great
practical importance as they avoid the costly repeated loop data scientists
have to deal with on a daily basis: select features from data residing in
relational databases using feature extraction queries involving joins,
projections, and aggregations; export the training dataset defined by such
queries; convert this dataset into the format of an external learning tool; and
train the desired model using this tool. These integrated solutions are also a
fertile ground of theoretically fundamental and challenging problems at the
intersection of relational and statistical data models.
This article introduces a unified framework for training and evaluating a
class of statistical learning models over relational databases. This class
includes ridge linear regression, polynomial regression, factorization
machines, and principal component analysis. We show that, by synergizing key
tools from database theory such as schema information, query structure,
functional dependencies, recent advances in query evaluation algorithms, and
from linear algebra such as tensor and matrix operations, one can formulate
relational analytics problems and design efficient (query and data)
structureaware algorithms to solve them.
This theoretical development informed the design and implementation of the
AC/DC system for structureaware learning. We benchmark the performance of
AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting
and advertisement planning applications, AC/DC can learn polynomial regression
models and factorization machines with at least the same accuracy as its
competitors and up to three orders of magnitude faster than its competitors
whenever they do not run out of memory, exceed 24hour timeout, or encounter
internal design limitations.

The study of node selection query languages for (finite) trees has been a
major topic in the recent research on query languages for Web documents. On one
hand, there has been an extensive study of XPath and its various extensions. On
the other hand, query languages based on classical logics, such as firstorder
logic (FO) or Monadic SecondOrder Logic (MSO), have been considered. Results
in this area typically relate an XPathbased language to a classical logic.
What has yet to emerge is an XPathrelated language that is as expressive as
MSO, and at the same time enjoys the computational properties of XPath, which
are linear time query evaluation and exponential time querycontainment test.
In this paper we propose muXPath, which is the alternationfree fragment of
XPath extended with fixpoint operators. Using twoway alternating automata, we
show that this language does combine desired expressiveness and computational
properties, placing it as an attractive candidate for the definite
nodeselection query language for trees.

Singlesource and top$k$ SimRank queries are two important types of
similarity search in graphs with numerous applications in web mining, social
network analysis, spam detection, etc. A plethora of techniques have been
proposed for these two types of queries, but very few can efficiently support
similarity search over large dynamic graphs, due to either significant
preprocessing time or large space overheads.
This paper presents ProbeSim, an indexfree algorithm for singlesource and
top$k$ SimRank queries that provides a nontrivial theoretical guarantee in
the absolute error of query results. ProbeSim estimates SimRank similarities
without precomputing any indexing structures, and thus can naturally support
realtime SimRank queries on dynamic graphs. Besides the theoretical guarantee,
ProbeSim also offers satisfying practical efficiency and effectiveness due to
several nontrivial optimizations. We conduct extensive experiments on a number
of benchmark datasets, which demonstrate that our solutions significantly
outperform the existing methods in terms of efficiency and effectiveness.
Notably, our experiments include the first empirical study that evaluates the
effectiveness of SimRank algorithms on graphs with billion edges, using the
idea of pooling.

XML query can be modeled by twig pattern query (TPQ) specifying predicates on
XML nodes and XPath relationships satisfied between them. A lot of TPQ types
have been proposed; this paper takes into account a TPQ model extended by a
specification of output and nonoutput query nodes since it complies with the
XQuery semantics and, in many cases, it leads to a more efficient query
processing. In general, there are two approaches to process the TPQ: holistic
joins and binary joins. Whereas the binary join approach builds a query plan as
a tree of interconnected binary operators, the holistic join approach evaluates
a whole query using one operator (i.e., using one complex algorithm).
Surprisingly, a thorough analytical and experimental comparison is still
missing despite an enormous research effort in this area. In this paper, we try
to fill this gap; we analytically and experimentally show that the binary joins
used in a fullypipelined plan (i.e., the plan where each join operation does
not wait for the complete result of the previous operation and no explicit
sorting is used) can often outperform the holistic joins, especially for TPQs
with a higher ratio of nonoutput query nodes. The main contributions of this
paper can be summarized as follows: (i) we introduce several improvements of
existing binary join approaches allowing to build a fullypipelined plan for a
TPQ considering nonoutput query nodes, (ii) we prove that for a certain class
of TPQs such a plan has the linear time complexity with respect to the size of
the input and output as well as the linear space complexity with respect to the
XML document depth (i.e., the same complexity as the holistic join approaches),
(iii) we show that our improved binary join approach outperforms the holistic
join approaches in many situations, and (iv) we propose a simple combined
approach that uses advantages of both types of approaches.

The $k$ nearest neighbor ($k$NN) query is a fundamental problem in databases.
Given a set of multidimensional data points and a query point, $k$NN returns
the $k$ nearest neighbors based on a scoring function such as weighted sum
given an attribute weight vector. However, the attribute weight vector can be
difficult to specify in practice. Skyline returns the points including all
possible nearest neighbors without requiring the exact attribute weight vector
or a scoring function but the number of returned points can be prohibitively
large for practical use.
In this paper, we propose a novel \emph{eclipse} definition which provides a
more flexible and customizable definition than the classic $1$NN and skyline.
In eclipse, users can specify a range of attribute weights and control the
number of returned points. We show that both $1$NN and skyline are
instantiations of eclipse. To compute eclipse points, we propose a baseline
algorithm with time complexity of $O(n^22^{d1})$, and an improved $O(n\log
^{d1}n)$ time transformationbased algorithm by transforming the eclipse
problem to the skyline problem, where $n$ is the number of points and $d$ is
the number of dimensions. Furthermore, we propose a novel indexbased algorithm
utilizing duality transform with much better efficiency. The experimental
results on the real NBA dataset and the synthetic datasets demonstrate the
effectiveness and efficiency of our eclipse algorithms.

We present fast strongly universal string hashing families: they can process
data at a rate of 0.2 CPU cycle per byte. Maybe surprisingly, we find that
these familiesthough they require a large buffer of random numbersare
often faster than popular hash functions with weaker theoretical guarantees.
Moreover, conventional wisdom is that hash functions with fewer multiplications
are faster. Yet we find that they may fail to be faster due to operation
pipelining. We present experimental results on several processors including
lowpowered processors. Our tests include hash functions designed for
processors with the CarryLess Multiplication (CLMUL) instruction set. We also
prove, using accessible proofs, the strong universality of our families.

With the advent of Big Data, nowadays in many applications databases
containing large quantities of similar time series are available. Forecasting
time series in these domains with traditional univariate forecasting procedures
leaves great potentials for producing accurate forecasts untapped. Recurrent
neural networks (RNNs), and in particular Long ShortTerm Memory (LSTM)
networks, have proven recently that they are able to outperform
stateoftheart univariate time series forecasting methods in this context
when trained across all available time series. However, if the time series
database is heterogeneous, accuracy may degenerate, so that on the way towards
fully automatic forecasting methods in this space, a notion of similarity
between the time series needs to be built into the methods. To this end, we
present a prediction model that can be used with different types of RNN models
on subgroups of similar time series, which are identified by time series
clustering techniques. We assess our proposed methodology using LSTM networks,
a widely popular RNN variant. Our method achieves competitive results on
benchmarking datasets under competition evaluation procedures. In particular,
in terms of mean sMAPE accuracy, it consistently outperforms the baseline LSTM
model and outperforms all other methods on the CIF2016 forecasting competition
dataset.

The kregret query aims to return a sizek subset S of a database D such
that, for any query user that selects a data object from this sizek subset S
rather than from database D, her regret ratio is minimized. The regret ratio
here is modeled by the relative difference in the optimality between the
locally optimal object in S and the globally optimal object in D. The
optimality of a data object in turn is modeled by a utility function of the
query user. Unlike traditional topk queries, the kregret query does not
minimize the regret ratio for a specific utility function. Instead, it
considers a family of infinite utility functions F, and aims to find a sizek
subset that minimizes the maximum regret ratio of any utility function in F.
Studies on kregret queries have focused on the family of additive utility
functions, which have limitations in modeling individuals' preferences and
decision making processes, especially for a common observation called the
diminishing marginal rate of substitution (DMRS). We introduce kregret queries
with multiplicative utility functions, which are more expressive in modeling
the DMRS, to overcome those limitations. We propose a query algorithm with
bounded regret ratios. To showcase the applicability of the algorithm, we apply
it to a special family of multiplicative utility functions, the CobbDouglas
family of utility functions, and a closely related family of utility functions,
the Constant Elasticity of Substitution family of utility functions, both of
which are frequently used utility functions in microeconomics. After a further
study of the query properties, we propose a heuristic algorithm that produces
even smaller regret ratios in practice. Extensive experiments on the proposed
algorithms confirm that they consistently achieve small maximum regret ratios.

The volume of time series data has exploded due to the popularity of new
applications, such as data center management and IoT. Subsequence matching is a
fundamental task in mining time series data. All indexbased approaches only
consider raw subsequence matching (RSM) and do not support subsequence
normalization. UCR Suite can deal with normalized subsequence match problem
(NSM), but it needs to scan full time series. In this paper, we propose a novel
problem, named constrained normalized subsequence matching problem (cNSM),
which adds some constraints to NSM problem. The cNSM problem provides a knob to
flexibly control the degree of offset shifting and amplitude scaling, which
enables users to build the index to process the query. We propose a new index
structure, KVindex, and the matching algorithm, KVmatch. With a single index,
our approach can support both RSM and cNSM problems under either ED or DTW
distance. KVindex is a keyvalue structure, which can be easily implemented on
local files or HBase tables. To support the query of arbitrary lengths, we
extend KVmatch to KVmatch$_{DP}$, which utilizes multiple variedlength
indexes to process the query. We conduct extensive experiments on synthetic and
realworld datasets. The results verify the effectiveness and efficiency of our
approach.

Interactive dataintensive applications are becoming ever more pervasive in
domains such as finance, web applications, mobile computing, and Internet of
Things. Increasingly, these applications are being deployed in sophisticated
parallel and distributed hardware infrastructures. With this growing diversity
of the software and hardware landscape, there is a pressure on programming
models and systems to enable developers to design modular, scalable, efficient,
and consistent dataintensive applications. In response to this challenge,
recent research has advocated the integration of actor programming models and
database management. This integration promises to help developers build
logically distributed microapplications well adapted to modern hardware trends
as opposed to existing approaches targeted at optimizing monolithic
applications.
Towards this aim, in this paper we analyze, make the case for, and present a
broad vision of actorrelational database systems. We argue why the time is
ripe today to examine the research opportunities afforded by this emerging
system paradigm. Based on this discussion, we present design principles as well
as candidate feature sets to help concretize the vision for such systems. To
illustrate the usefulness of the proposed feature set and motivate the need for
this class of systems, we show a detailed case study inspired by a smart
supermarket application with selfcheckout, along with evidence for performance
benefits on modern hardware.

Recently, with the rapid development of mobile devices and the crowdsourcing
platforms, the spatial crowdsourcing has attracted much attention from the
database community. Specifically, spatial crowdsourcing refers to sending a
locationbased request to workers according to their positions, and workers
need to physically move to specified locations to conduct tasks. Many works
have studied task assignment problems in spatial crowdsourcing, however, their
problem settings are different from each other. Thus, it is hard to compare the
performances of existing algorithms on task assignment in spatial
crowdsourcing. In this paper, we present a comprehensive experimental
comparison of most existing algorithms on task assignment in spatial
crowdsourcing. Specifically, we first give general definitions about spatial
workers and spatial tasks based on definitions in the existing works such that
the existing algorithms can be applied on the same synthetic and real data
sets. Then, we provide an uniform implementation for all the tested algorithms
of task assignment problems in spatial crowdsourcing (open sourced). Finally,
based on the results on both synthetic and real data sets, we discuss the
strengths and weaknesses of tested algorithms, which can guide future research
on the same area and practical implementations of spatial crowdsourcing
systems.

Differential privacy promises to enable general data analytics while
protecting individual privacy, but existing differential privacy mechanisms do
not support the wide variety of features and databases used in realworld
SQLbased analytics systems.
This paper presents the first practical approach for differential privacy of
SQL queries. Using 8.1 million realworld queries, we conduct an empirical
study to determine the requirements for practical differential privacy, and
discuss limitations of previous approaches in light of these requirements. To
meet these requirements we propose elastic sensitivity, a novel method for
approximating the local sensitivity of queries with general equijoins. We prove
that elastic sensitivity is an upper bound on local sensitivity and can
therefore be used to enforce differential privacy using any local
sensitivitybased mechanism.
We build FLEX, a practical endtoend system to enforce differential privacy
for SQL queries using elastic sensitivity. We demonstrate that FLEX is
compatible with any existing database, can enforce differential privacy for
realworld SQL queries, and incurs negligible (0.03%) performance overhead.

We consider answering queries where the underlying data is available only
over limited interfaces which provide lookup access to the tuples matching a
given binding, but possibly restricting the number of output tuples returned.
Interfaces imposing such "result bounds" are common in accessing data via the
web. Given a query over a set of relations as well as some integrity
constraints that relate the queried relations to the data sources, we examine
the problem of deciding if the query is answerable over the interfaces; that
is, whether there exists a plan that returns all answers to the query, assuming
the source data satisfies the integrity constraints.
The first component of our analysis of answerability is a reduction to a
query containment problem with constraints. The second component is a set of
"schema simplification" theorems capturing limitations on how interfaces with
result bounds can be useful to obtain complete answers to queries. These
results also help to show decidability for the containment problem that
captures answerability, for many classes of constraints. The final component in
our analysis of answerability is a "linearization" method, showing that query
containment with certain guarded dependencies  including those that emerge
from answerability problems  can be reduced to query containment for a
wellbehaved class of linear dependencies. Putting these components together,
we get a detailed picture of how to check answerability over resultbounded
services.

Complex Event Processing (CEP) has emerged as the unifying field for
technologies that require processing and correlating distributed data sources
in realtime. CEP finds applications in diverse domains, which has resulted in
a large number of proposals for expressing and processing complex events.
However, existing CEP languages lack from a clear semantics, making them hard
to understand and generalize. Moreover, there are no general techniques for
evaluating CEP query languages with clear performance guarantees.
In this paper we embark on the task of giving a rigorous and efficient
framework to CEP. We propose a formal language for specifying complex events,
called CEL, that contains the main features used in the literature and has a
denotational and compositional semantics. We also formalize the socalled
selection strategies, which had only been presented as bydesign extensions to
existing frameworks. With a welldefined semantics at hand, we study how to
efficiently evaluate CEL for processing complex events in the case of unary
filters. We start by studying the syntactical properties of CEL and propose
rewriting optimization techniques for simplifying the evaluation of formulas.
Then, we introduce a formal computational model for CEP, called complex event
automata (CEA), and study how to compile CEL formulas into CEA. Furthermore, we
provide efficient algorithms for evaluating CEA over event streams using
constant time per event followed by constantdelay enumeration of the results.
By gathering these results together, we propose a framework for efficiently
evaluating CEL with unary filters. Finally, we show experimentally that this
framework consistently outperforms the competition, and even over trivial
queries can be orders of magnitude more efficient.

People learn whenever and wherever possible, and whatever they like or
encounterMathematics, Drama, Art, Languages, Physics, Philosophy, and so on.
With the bursting of knowledge, evaluation of one's understanding of conceptual
knowledge becomes increasingly difficult. There are a lot of demands for
evaluating one's understanding of a piece of knowledge, e.g., facilitating
personalized recommendations; discovering one's expertises and deficiencies in
a field; recommending topics for a conversation between people with different
educational or cultural backgrounds in their first encounter; recommending a
learning material to practice a meaningful learning etc. Assessment of
understanding of knowledge is conventionally practiced through tests or
interviews, but they have some limitations such as lowefficiency and
notcomprehensive. We propose a method to estimate one's understanding of
conceptual knowledge, by keeping track of his/her learning activities. It
overcomes some limitations of traditional methods, hence complements
traditional methods.

We propose a novel generic inverted index framework on the GPU (called
GENIE), aiming to reduce the programming complexity of the GPU for parallel
similarity search of different data types. Not every data type and similarity
measure are supported by GENIE, but many popular ones are. We present the
system design of GENIE, and demonstrate similarity search with GENIE on several
data types along with a theoretical analysis of search results. A new concept
of locality sensitive hashing (LSH) named $\tau$ANN search, and a novel data
structure cPQ on the GPU are also proposed for achieving this purpose.
Extensive experiments on different reallife datasets demonstrate the
efficiency and effectiveness of our framework. The implemented system has been
released as open source.