• This document defines extensions of the RDF data model and of the SPARQL query language that capture an alternative approach to represent statement-level metadata. While this alternative approach is backwards compatible with RDF reification as defined by the RDF standard, the approach aims to address usability and data management shortcomings of RDF reification. One of the great advantages of the proposed approach is that it clarifies a means to (i) understand sparse matrices, the property graph model, hypergraphs, and other data structures with an emphasis on link attributes, (ii) map such data onto RDF, and (iii) query such data using SPARQL. Further, the proposal greatly expands both the freedom that database designers enjoy when creating physical indexing schemes and query plans for graph data annotated with link attributes and the interoperability of those database solutions.
  • Exact Maximum Inner Product Search (MIPS) is an important task that is widely pertinent to recommender systems and high-dimensional similarity search. The brute-force approach to solving exact MIPS is computationally expensive, thus spurring recent development of novel indexes and pruning techniques for this task. In this paper, we show that a hardware-efficient brute-force approach, blocked matrix multiply (BMM), can outperform the state-of-the-art MIPS solvers by over an order of magnitude, for some -- but not all -- inputs. In this paper, we also present a novel MIPS solution, MAXIMUS, that takes advantage of hardware efficiency and pruning of the search space. Like BMM, MAXIMUS is faster than other solvers by up to an order of magnitude, but again only for some inputs. Since no single solution offers the best runtime performance for all inputs, we introduce a new data-dependent optimizer, OPTIMUS, that selects online with minimal overhead the best MIPS solver for a given input. Together, OPTIMUS and MAXIMUS outperform state-of-the-art MIPS solvers by 3.2$\times$ on average, and up to 10.9$\times$, on widely studied MIPS datasets.
  • We study rewritability of monadic disjunctive Datalog programs, (the complements of) MMSNP sentences, and ontology-mediated queries (OMQs) based on expressive description logics of the ALC family and on conjunctive queries. We show that rewritability into FO and into monadic Datalog (MDLog) are decidable, and that rewritability into Datalog is decidable when the original query satisfies a certain condition related to equality. We establish 2NExpTime-completeness for all studied problems except rewritability into MDLog for which there remains a gap between 2NExpTime and 3ExpTime. We also analyze the shape of rewritings, which in the MMSNP case correspond to obstructions, and give a new construction of canonical Datalog programs that is more elementary than existing ones and also applies to formulas with free variables.
  • Data exchange is the problem of transforming data that is structured under a source schema into data structured under another schema, called the target schema, so that both the source and target data satisfy the relationship between the schemas. Even though the formal framework of data exchange for relational database systems is well-established, it does not immediately carry over to the settings of temporal data, which necessitates reasoning over unbounded periods of time. In this work, we study data exchange for temporal data. We first motivate the need for two views of temporal data: the concrete view, which depicts how temporal data is compactly represented and on which the implementations are based, and the abstract view, which defines the semantics of temporal data as a sequence of snapshots. We first extend the chase procedure for the abstract view to have a conceptual basis for the data exchange for temporal databases. Considering non-temporal source-to-target tuple generating dependencies and equality generating dependencies, the chase algorithm can be applied on each snapshot independently. Then we define a chase procedure (called c-chase) on concrete instances and show the result of c-chase on a concrete instance is semantically aligned with the result of chase on the corresponding abstract instance. In order to interpret intervals as constants while checking if a dependency or a query is satisfied by a concrete database, we will normalize the instance with respect to the dependency or the query. To obtain the semantic alignment, the nulls in the concrete view are annotated with temporal information. Furthermore, we show that the result of the concrete chase provides a foundation for query answering. We define naive evaluation on the result of the c-chase and show it produces certain answers.
  • We study a natural variant of the implicational fragment of propositional logic. Its formulas are pairs of conjunctions of positive literals, related together by an implicational-like connective; the semantics of this sort of implication is defined in terms of a threshold on a conditional probability of the consequent, given the antecedent: we are dealing with what the data analysis community calls confidence of partial implications or association rules. Existing studies of redundancy among these partial implications have characterized so far only entailment from one premise and entailment from two premises, both in the stand-alone case and in the case of presence of additional classical implications (this is what we call "relative entailment"). By exploiting a previously noted alternative view of the entailment in terms of linear programming duality, we characterize exactly the cases of entailment from arbitrary numbers of premises, again both in the stand-alone case and in the case of presence of additional classical implications. As a result, we obtain decision algorithms of better complexity; additionally, for each potential case of entailment, we identify a critical confidence threshold and show that it is, actually, intrinsic to each set of premises and antecedent of the conclusion.
  • A classical problem in causal inference is that of matching, where treatment units need to be matched to control units based on covariate information. In this work, we propose a method that computes high quality almost-exact matches for high-dimensional categorical datasets. This method, called FLAME (Fast Large-scale Almost Matching Exactly), learns a distance metric for matching using a hold-out training data set. In order to perform matching efficiently for large datasets, FLAME leverages techniques that are natural for query processing in the area of database management, and two implementations of FLAME are provided: the first uses SQL queries and the second uses bit-vector techniques. The algorithm starts by constructing matches of the highest quality (exact matches on all covariates), and successively eliminates variables in order to match exactly on as many variables as possible, while still maintaining interpretable high-quality matches and balance between treatment and control groups. We leverage these high quality matches to estimate conditional average treatment effects (CATEs). Our experiments show that FLAME scales to huge datasets with millions of observations where existing state-of-the-art methods fail, and that it achieves significantly better performance than other matching methods.
  • Data compression is a popular technique for improving the efficiency of data processing workloads such as SQL queries and more recently, machine learning (ML) with classical batch gradient methods. But the efficacy of such ideas for mini-batch stochastic gradient descent (MGD), arguably the workhorse algorithm of modern ML, is an open question. MGD's unique data access pattern renders prior art, including those designed for batch gradient methods, less effective. We fill this crucial research gap by proposing a new lossless compression scheme we call tuple-oriented compression (TOC) that is inspired by an unlikely source, the string/text compression scheme Lempel-Ziv-Welch, but tailored to MGD in a way that preserves tuple boundaries within mini-batches. We then present a suite of novel compressed matrix operation execution techniques tailored to the TOC compression scheme that operate directly over the compressed data representation and avoid decompression overheads. An extensive empirical evaluation with real-world datasets shows that TOC consistently achieves substantial compression ratios by up to 51x and reduces runtimes for MGD workloads by up to 10.2x in popular ML systems.
  • Visual query systems (VQSs) empower users to interactively search for line charts with desired visual patterns, typically specified using intuitive sketch-based interfaces. Despite decades of past work on VQSs, these efforts have not translated to adoption in practice, possibly because VQSs are largely evaluated in unrealistic lab-based settings. To remedy this gap in adoption, we collaborated with experts from three diverse domains---astronomy, genetics, and material science---via a year-long user-centered design process to develop a VQS that supports their workflow and analytical needs, and evaluate how VQSs can be used in practice. Our study results reveal that ad-hoc sketch-only querying is not as commonly used as prior work suggests, since analysts are often unable to precisely express their patterns of interest. In addition, we characterize three essential sensemaking processes supported by our enhanced VQS. We discover that participants employ all three processes, but in different proportions, depending on the analytical needs in each domain. Our findings suggest that all three sensemaking processes must be integrated in order to make future VQSs useful for a wide range of analytical inquiries.
  • Hypertree decompositions, as well as the more powerful generalized hypertree decompositions (GHDs), and the yet more general fractional hypertree decompositions (FHD) are hypergraph decomposition methods successfully used for answering conjunctive queries and for the solution of constraint satisfaction problems. Every hypergraph H has a width relative to each of these decomposition methods: its hypertree width hw(H), its generalized hypertree width ghw(H), and its fractional hypertree width fhw(H), respectively. It is known that hw(H) <= k can be checked in polynomial time for fixed k, while checking ghw(H) <= k is NP-complete for any k greater than or equal to 3. The complexity of checking fhw(H) <= k for a fixed k has been open for more than a decade. We settle this open problem by showing that checking fhw(H) <= k is NP-complete, even for k=2. The same construction allows us to prove also the NP-completeness of checking ghw(H) <= k for k=2. After proving these hardness results, we identify meaningful restrictions, for which checking for bounded ghw or fhw becomes tractable.
  • Integrated solutions for analytics over relational databases are of great practical importance as they avoid the costly repeated loop data scientists have to deal with on a daily basis: select features from data residing in relational databases using feature extraction queries involving joins, projections, and aggregations; export the training dataset defined by such queries; convert this dataset into the format of an external learning tool; and train the desired model using this tool. These integrated solutions are also a fertile ground of theoretically fundamental and challenging problems at the intersection of relational and statistical data models. This article introduces a unified framework for training and evaluating a class of statistical learning models over relational databases. This class includes ridge linear regression, polynomial regression, factorization machines, and principal component analysis. We show that, by synergizing key tools from database theory such as schema information, query structure, functional dependencies, recent advances in query evaluation algorithms, and from linear algebra such as tensor and matrix operations, one can formulate relational analytics problems and design efficient (query and data) structure-aware algorithms to solve them. This theoretical development informed the design and implementation of the AC/DC system for structure-aware learning. We benchmark the performance of AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting and advertisement planning applications, AC/DC can learn polynomial regression models and factorization machines with at least the same accuracy as its competitors and up to three orders of magnitude faster than its competitors whenever they do not run out of memory, exceed 24-hour timeout, or encounter internal design limitations.
  • The study of node selection query languages for (finite) trees has been a major topic in the recent research on query languages for Web documents. On one hand, there has been an extensive study of XPath and its various extensions. On the other hand, query languages based on classical logics, such as first-order logic (FO) or Monadic Second-Order Logic (MSO), have been considered. Results in this area typically relate an XPath-based language to a classical logic. What has yet to emerge is an XPath-related language that is as expressive as MSO, and at the same time enjoys the computational properties of XPath, which are linear time query evaluation and exponential time query-containment test. In this paper we propose muXPath, which is the alternation-free fragment of XPath extended with fixpoint operators. Using two-way alternating automata, we show that this language does combine desired expressiveness and computational properties, placing it as an attractive candidate for the definite node-selection query language for trees.
  • Single-source and top-$k$ SimRank queries are two important types of similarity search in graphs with numerous applications in web mining, social network analysis, spam detection, etc. A plethora of techniques have been proposed for these two types of queries, but very few can efficiently support similarity search over large dynamic graphs, due to either significant preprocessing time or large space overheads. This paper presents ProbeSim, an index-free algorithm for single-source and top-$k$ SimRank queries that provides a non-trivial theoretical guarantee in the absolute error of query results. ProbeSim estimates SimRank similarities without precomputing any indexing structures, and thus can naturally support real-time SimRank queries on dynamic graphs. Besides the theoretical guarantee, ProbeSim also offers satisfying practical efficiency and effectiveness due to several non-trivial optimizations. We conduct extensive experiments on a number of benchmark datasets, which demonstrate that our solutions significantly outperform the existing methods in terms of efficiency and effectiveness. Notably, our experiments include the first empirical study that evaluates the effectiveness of SimRank algorithms on graphs with billion edges, using the idea of pooling.
  • XML query can be modeled by twig pattern query (TPQ) specifying predicates on XML nodes and XPath relationships satisfied between them. A lot of TPQ types have been proposed; this paper takes into account a TPQ model extended by a specification of output and non-output query nodes since it complies with the XQuery semantics and, in many cases, it leads to a more efficient query processing. In general, there are two approaches to process the TPQ: holistic joins and binary joins. Whereas the binary join approach builds a query plan as a tree of interconnected binary operators, the holistic join approach evaluates a whole query using one operator (i.e., using one complex algorithm). Surprisingly, a thorough analytical and experimental comparison is still missing despite an enormous research effort in this area. In this paper, we try to fill this gap; we analytically and experimentally show that the binary joins used in a fully-pipelined plan (i.e., the plan where each join operation does not wait for the complete result of the previous operation and no explicit sorting is used) can often outperform the holistic joins, especially for TPQs with a higher ratio of non-output query nodes. The main contributions of this paper can be summarized as follows: (i) we introduce several improvements of existing binary join approaches allowing to build a fully-pipelined plan for a TPQ considering non-output query nodes, (ii) we prove that for a certain class of TPQs such a plan has the linear time complexity with respect to the size of the input and output as well as the linear space complexity with respect to the XML document depth (i.e., the same complexity as the holistic join approaches), (iii) we show that our improved binary join approach outperforms the holistic join approaches in many situations, and (iv) we propose a simple combined approach that uses advantages of both types of approaches.
  • The $k$ nearest neighbor ($k$NN) query is a fundamental problem in databases. Given a set of multidimensional data points and a query point, $k$NN returns the $k$ nearest neighbors based on a scoring function such as weighted sum given an attribute weight vector. However, the attribute weight vector can be difficult to specify in practice. Skyline returns the points including all possible nearest neighbors without requiring the exact attribute weight vector or a scoring function but the number of returned points can be prohibitively large for practical use. In this paper, we propose a novel \emph{eclipse} definition which provides a more flexible and customizable definition than the classic $1$NN and skyline. In eclipse, users can specify a range of attribute weights and control the number of returned points. We show that both $1$NN and skyline are instantiations of eclipse. To compute eclipse points, we propose a baseline algorithm with time complexity of $O(n^22^{d-1})$, and an improved $O(n\log ^{d-1}n)$ time transformation-based algorithm by transforming the eclipse problem to the skyline problem, where $n$ is the number of points and $d$ is the number of dimensions. Furthermore, we propose a novel index-based algorithm utilizing duality transform with much better efficiency. The experimental results on the real NBA dataset and the synthetic datasets demonstrate the effectiveness and efficiency of our eclipse algorithms.
  • We present fast strongly universal string hashing families: they can process data at a rate of 0.2 CPU cycle per byte. Maybe surprisingly, we find that these families---though they require a large buffer of random numbers---are often faster than popular hash functions with weaker theoretical guarantees. Moreover, conventional wisdom is that hash functions with fewer multiplications are faster. Yet we find that they may fail to be faster due to operation pipelining. We present experimental results on several processors including low-powered processors. Our tests include hash functions designed for processors with the Carry-Less Multiplication (CLMUL) instruction set. We also prove, using accessible proofs, the strong universality of our families.
  • With the advent of Big Data, nowadays in many applications databases containing large quantities of similar time series are available. Forecasting time series in these domains with traditional univariate forecasting procedures leaves great potentials for producing accurate forecasts untapped. Recurrent neural networks (RNNs), and in particular Long Short-Term Memory (LSTM) networks, have proven recently that they are able to outperform state-of-the-art univariate time series forecasting methods in this context when trained across all available time series. However, if the time series database is heterogeneous, accuracy may degenerate, so that on the way towards fully automatic forecasting methods in this space, a notion of similarity between the time series needs to be built into the methods. To this end, we present a prediction model that can be used with different types of RNN models on subgroups of similar time series, which are identified by time series clustering techniques. We assess our proposed methodology using LSTM networks, a widely popular RNN variant. Our method achieves competitive results on benchmarking datasets under competition evaluation procedures. In particular, in terms of mean sMAPE accuracy, it consistently outperforms the baseline LSTM model and outperforms all other methods on the CIF2016 forecasting competition dataset.
  • The k-regret query aims to return a size-k subset S of a database D such that, for any query user that selects a data object from this size-k subset S rather than from database D, her regret ratio is minimized. The regret ratio here is modeled by the relative difference in the optimality between the locally optimal object in S and the globally optimal object in D. The optimality of a data object in turn is modeled by a utility function of the query user. Unlike traditional top-k queries, the k-regret query does not minimize the regret ratio for a specific utility function. Instead, it considers a family of infinite utility functions F, and aims to find a size-k subset that minimizes the maximum regret ratio of any utility function in F. Studies on k-regret queries have focused on the family of additive utility functions, which have limitations in modeling individuals' preferences and decision making processes, especially for a common observation called the diminishing marginal rate of substitution (DMRS). We introduce k-regret queries with multiplicative utility functions, which are more expressive in modeling the DMRS, to overcome those limitations. We propose a query algorithm with bounded regret ratios. To showcase the applicability of the algorithm, we apply it to a special family of multiplicative utility functions, the Cobb-Douglas family of utility functions, and a closely related family of utility functions, the Constant Elasticity of Substitution family of utility functions, both of which are frequently used utility functions in microeconomics. After a further study of the query properties, we propose a heuristic algorithm that produces even smaller regret ratios in practice. Extensive experiments on the proposed algorithms confirm that they consistently achieve small maximum regret ratios.
  • The volume of time series data has exploded due to the popularity of new applications, such as data center management and IoT. Subsequence matching is a fundamental task in mining time series data. All index-based approaches only consider raw subsequence matching (RSM) and do not support subsequence normalization. UCR Suite can deal with normalized subsequence match problem (NSM), but it needs to scan full time series. In this paper, we propose a novel problem, named constrained normalized subsequence matching problem (cNSM), which adds some constraints to NSM problem. The cNSM problem provides a knob to flexibly control the degree of offset shifting and amplitude scaling, which enables users to build the index to process the query. We propose a new index structure, KV-index, and the matching algorithm, KV-match. With a single index, our approach can support both RSM and cNSM problems under either ED or DTW distance. KV-index is a key-value structure, which can be easily implemented on local files or HBase tables. To support the query of arbitrary lengths, we extend KV-match to KV-match$_{DP}$, which utilizes multiple varied-length indexes to process the query. We conduct extensive experiments on synthetic and real-world datasets. The results verify the effectiveness and efficiency of our approach.
  • Interactive data-intensive applications are becoming ever more pervasive in domains such as finance, web applications, mobile computing, and Internet of Things. Increasingly, these applications are being deployed in sophisticated parallel and distributed hardware infrastructures. With this growing diversity of the software and hardware landscape, there is a pressure on programming models and systems to enable developers to design modular, scalable, efficient, and consistent data-intensive applications. In response to this challenge, recent research has advocated the integration of actor programming models and database management. This integration promises to help developers build logically distributed micro-applications well adapted to modern hardware trends as opposed to existing approaches targeted at optimizing monolithic applications. Towards this aim, in this paper we analyze, make the case for, and present a broad vision of actor-relational database systems. We argue why the time is ripe today to examine the research opportunities afforded by this emerging system paradigm. Based on this discussion, we present design principles as well as candidate feature sets to help concretize the vision for such systems. To illustrate the usefulness of the proposed feature set and motivate the need for this class of systems, we show a detailed case study inspired by a smart supermarket application with self-checkout, along with evidence for performance benefits on modern hardware.
  • Recently, with the rapid development of mobile devices and the crowdsourcing platforms, the spatial crowdsourcing has attracted much attention from the database community. Specifically, spatial crowdsourcing refers to sending a location-based request to workers according to their positions, and workers need to physically move to specified locations to conduct tasks. Many works have studied task assignment problems in spatial crowdsourcing, however, their problem settings are different from each other. Thus, it is hard to compare the performances of existing algorithms on task assignment in spatial crowdsourcing. In this paper, we present a comprehensive experimental comparison of most existing algorithms on task assignment in spatial crowdsourcing. Specifically, we first give general definitions about spatial workers and spatial tasks based on definitions in the existing works such that the existing algorithms can be applied on the same synthetic and real data sets. Then, we provide an uniform implementation for all the tested algorithms of task assignment problems in spatial crowdsourcing (open sourced). Finally, based on the results on both synthetic and real data sets, we discuss the strengths and weaknesses of tested algorithms, which can guide future research on the same area and practical implementations of spatial crowdsourcing systems.
  • Differential privacy promises to enable general data analytics while protecting individual privacy, but existing differential privacy mechanisms do not support the wide variety of features and databases used in real-world SQL-based analytics systems. This paper presents the first practical approach for differential privacy of SQL queries. Using 8.1 million real-world queries, we conduct an empirical study to determine the requirements for practical differential privacy, and discuss limitations of previous approaches in light of these requirements. To meet these requirements we propose elastic sensitivity, a novel method for approximating the local sensitivity of queries with general equijoins. We prove that elastic sensitivity is an upper bound on local sensitivity and can therefore be used to enforce differential privacy using any local sensitivity-based mechanism. We build FLEX, a practical end-to-end system to enforce differential privacy for SQL queries using elastic sensitivity. We demonstrate that FLEX is compatible with any existing database, can enforce differential privacy for real-world SQL queries, and incurs negligible (0.03%) performance overhead.
  • We consider answering queries where the underlying data is available only over limited interfaces which provide lookup access to the tuples matching a given binding, but possibly restricting the number of output tuples returned. Interfaces imposing such "result bounds" are common in accessing data via the web. Given a query over a set of relations as well as some integrity constraints that relate the queried relations to the data sources, we examine the problem of deciding if the query is answerable over the interfaces; that is, whether there exists a plan that returns all answers to the query, assuming the source data satisfies the integrity constraints. The first component of our analysis of answerability is a reduction to a query containment problem with constraints. The second component is a set of "schema simplification" theorems capturing limitations on how interfaces with result bounds can be useful to obtain complete answers to queries. These results also help to show decidability for the containment problem that captures answerability, for many classes of constraints. The final component in our analysis of answerability is a "linearization" method, showing that query containment with certain guarded dependencies -- including those that emerge from answerability problems -- can be reduced to query containment for a well-behaved class of linear dependencies. Putting these components together, we get a detailed picture of how to check answerability over result-bounded services.
  • Complex Event Processing (CEP) has emerged as the unifying field for technologies that require processing and correlating distributed data sources in real-time. CEP finds applications in diverse domains, which has resulted in a large number of proposals for expressing and processing complex events. However, existing CEP languages lack from a clear semantics, making them hard to understand and generalize. Moreover, there are no general techniques for evaluating CEP query languages with clear performance guarantees. In this paper we embark on the task of giving a rigorous and efficient framework to CEP. We propose a formal language for specifying complex events, called CEL, that contains the main features used in the literature and has a denotational and compositional semantics. We also formalize the so-called selection strategies, which had only been presented as by-design extensions to existing frameworks. With a well-defined semantics at hand, we study how to efficiently evaluate CEL for processing complex events in the case of unary filters. We start by studying the syntactical properties of CEL and propose rewriting optimization techniques for simplifying the evaluation of formulas. Then, we introduce a formal computational model for CEP, called complex event automata (CEA), and study how to compile CEL formulas into CEA. Furthermore, we provide efficient algorithms for evaluating CEA over event streams using constant time per event followed by constant-delay enumeration of the results. By gathering these results together, we propose a framework for efficiently evaluating CEL with unary filters. Finally, we show experimentally that this framework consistently outperforms the competition, and even over trivial queries can be orders of magnitude more efficient.
  • People learn whenever and wherever possible, and whatever they like or encounter--Mathematics, Drama, Art, Languages, Physics, Philosophy, and so on. With the bursting of knowledge, evaluation of one's understanding of conceptual knowledge becomes increasingly difficult. There are a lot of demands for evaluating one's understanding of a piece of knowledge, e.g., facilitating personalized recommendations; discovering one's expertises and deficiencies in a field; recommending topics for a conversation between people with different educational or cultural backgrounds in their first encounter; recommending a learning material to practice a meaningful learning etc. Assessment of understanding of knowledge is conventionally practiced through tests or interviews, but they have some limitations such as low-efficiency and not-comprehensive. We propose a method to estimate one's understanding of conceptual knowledge, by keeping track of his/her learning activities. It overcomes some limitations of traditional methods, hence complements traditional methods.
  • We propose a novel generic inverted index framework on the GPU (called GENIE), aiming to reduce the programming complexity of the GPU for parallel similarity search of different data types. Not every data type and similarity measure are supported by GENIE, but many popular ones are. We present the system design of GENIE, and demonstrate similarity search with GENIE on several data types along with a theoretical analysis of search results. A new concept of locality sensitive hashing (LSH) named $\tau$-ANN search, and a novel data structure c-PQ on the GPU are also proposed for achieving this purpose. Extensive experiments on different real-life datasets demonstrate the efficiency and effectiveness of our framework. The implemented system has been released as open source.