• ### To Index or Not to Index: Optimizing Exact Maximum Inner Product Search(1706.01449)

March 15, 2019 cs.PF, cs.DB, cs.DS, cs.IR
Exact Maximum Inner Product Search (MIPS) is an important task that is widely pertinent to recommender systems and high-dimensional similarity search. The brute-force approach to solving exact MIPS is computationally expensive, thus spurring recent development of novel indexes and pruning techniques for this task. In this paper, we show that a hardware-efficient brute-force approach, blocked matrix multiply (BMM), can outperform the state-of-the-art MIPS solvers by over an order of magnitude, for some -- but not all -- inputs. In this paper, we also present a novel MIPS solution, MAXIMUS, that takes advantage of hardware efficiency and pruning of the search space. Like BMM, MAXIMUS is faster than other solvers by up to an order of magnitude, but again only for some inputs. Since no single solution offers the best runtime performance for all inputs, we introduce a new data-dependent optimizer, OPTIMUS, that selects online with minimal overhead the best MIPS solver for a given input. Together, OPTIMUS and MAXIMUS outperform state-of-the-art MIPS solvers by 3.2$\times$ on average, and up to 10.9$\times$, on widely studied MIPS datasets.
• ### ObliDB: Oblivious Query Processing for Secure Databases(1710.00458)

Sept. 19, 2019 cs.CR
Hardware enclaves such as Intel SGX are a promising technology for improving the security of databases outsourced to the cloud. These enclaves provide an execution environment isolated from the hypervisor/OS, and encrypt data in RAM. However, for applications that use large amounts of memory, including most databases, enclaves do not protect against access pattern leaks, which let attackers gain a large amount of information about the data. Moreover,the naive way to address this issue, using Oblivious RAM (ORAM) primitives from the security literature, adds substantial overhead. A number of recent works explore trusted hardware enclaves as a path toward secure, access-pattern oblivious outsourcing of data storage and analysis. While these works efficiently solve specific subproblems (e.g. building secure indexes or running analytics queries that always scan entire tables), no prior work has supported oblivious query processing for general query workloads on a DBMS engine with multiple access methods. Moreover, applying these techniques individually does not guarantee that an end-to-end workload, such as a complex SQL query over multiple tables, will be oblivious. In this paper, we introduce ObliDB, an oblivious database engine design that is the first system to provide obliviousness for general database read workloads over multiple access methods. ObliDB supports a broad range of queries, including aggregation, joins, insertions, deletions and point queries. We implement ObliDB and show that, on analytics work-loads, ObliDB ranges from 1.1-19x faster than Opaque,a previous oblivious, enclave-based system designed only for analytics, and comes within 2.6x of Spark SQL. ObliDB supports point queries with 3-10ms latency, which runs over 7x faster than HIRB, a previous encryption-based oblivious index system.
• ### BlazeIt: Fast Exploratory Video Queries using Neural Networks(1805.01046)

May 2, 2018 cs.DB
• ### Infrastructure for Usable Machine Learning: The Stanford DAWN Project(1705.07538)

June 9, 2017 cs.DB, cs.LG, stat.ML
Despite incredible recent advances in machine learning, building machine learning applications remains prohibitively time-consuming and expensive for all but the best-trained, best-funded engineering organizations. This expense comes not from a need for new and improved statistical models but instead from a lack of systems and tools for supporting end-to-end machine learning application development, from data preparation and labeling to productionization and monitoring. In this document, we outline opportunities for infrastructure supporting usable, end-to-end machine learning applications in the context of the nascent DAWN (Data Analytics for What's Next) project at Stanford.
• ### Matrix Computations and Optimization in Apache Spark(1509.02256)

July 13, 2016 cs.DC
We describe matrix computations available in the cluster programming framework, Apache Spark. Out of the box, Spark provides abstractions and implementations for distributed matrices and optimization routines using these matrices. When translating single-node algorithms to run on a distributed cluster, we observe that often a simple idea is enough: separating matrix operations from vector operations and shipping the matrix operations to be ran on the cluster, while keeping vector operations local to the driver. In the case of the Singular Value Decomposition, by taking this idea to an extreme, we are able to exploit the computational power of a cluster, while running code written decades ago for a single core. Another example is our Spark port of the popular TFOCS optimization package, originally built for MATLAB, which allows for solving Linear programs as well as a variety of other convex programs. We conclude with a comprehensive set of benchmarks for hardware accelerated matrix computations from the JVM, which is interesting in its own right, as many cluster programming frameworks use the JVM. The contributions described in this paper are already merged into Apache Spark and available on Spark installations by default, and commercially supported by a slew of companies which provide further services.
• ### MLlib: Machine Learning in Apache Spark(1505.06807)

May 26, 2015 cs.DC, cs.MS, cs.LG, stat.ML
Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced a rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.
• ### Large Scale Estimation in Cyberphysical Systems using Streaming Data: a Case Study with Smartphone Traces(1212.3393)

Dec. 14, 2012 cs.SE, cs.RO
Controlling and analyzing cyberphysical and robotics systems is increasingly becoming a Big Data challenge. Pushing this data to, and processing in the cloud is more efficient than on-board processing. However, current cloud-based solutions are not suitable for the latency requirements of these applications. We present a new concept, Discretized Streams or D-Streams, that enables massively scalable computations on streaming data with latencies as short as a second. We experiment with an implementation of D-Streams on top of the Spark computing framework. We demonstrate the usefulness of this concept with a novel algorithm to estimate vehicular traffic in urban networks. Our online EM algorithm can estimate traffic on a very large city network (the San Francisco Bay Area) by processing tens of thousands of observations per second, with a latency of a few seconds.
• ### Shark: SQL and Rich Analytics at Scale(1211.6176)

Nov. 27, 2012 cs.DB
Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g., iterative machine learning) at scale, and efficiently recovers from failures mid-query. This allows Shark to run SQL queries up to 100x faster than Apache Hive, and machine learning programs up to 100x faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like execution engine, and the fine-grained fault tolerance properties that such engines provide. It extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL. The result is a system that matches the speedups reported for MPP analytic databases over MapReduce, while offering fault tolerance properties and complex analytics capabilities that they lack.
• ### Faster and More Accurate Sequence Alignment with SNAP(1111.5572)

Nov. 23, 2011 q-bio.GN, cs.DS
We present the Scalable Nucleotide Alignment Program (SNAP), a new short and long read aligner that is both more accurate (i.e., aligns more reads with fewer errors) and 10-100x faster than state-of-the-art tools such as BWA. Unlike recent aligners based on the Burrows-Wheeler transform, SNAP uses a simple hash index of short seed sequences from the genome, similar to BLAST's. However, SNAP greatly reduces the number and cost of local alignment checks performed through several measures: it uses longer seeds to reduce the false positive locations considered, leverages larger memory capacities to speed index lookup, and excludes most candidate locations without fully computing their edit distance to the read. The result is an algorithm that scales well for reads from one hundred to thousands of bases long and provides a rich error model that can match classes of mutations (e.g., longer indels) that today's fast aligners ignore. We calculate that SNAP can align a dataset with 30x coverage of a human genome in less than an hour for a cost of \$2 on Amazon EC2, with higher accuracy than BWA. Finally, we describe ongoing work to further improve SNAP.
• ### ICTD for Healthcare in Ghana: Two Parallel Case Studies(0905.0203)

May 2, 2009 cs.HC
This paper examines two parallel case studies to promote remote medical consultation in Ghana. These projects, initiated independently by different researchers in different organizations, both deployed ICT solutions in the same medical community in the same year. The Ghana Consultation Network currently has over 125 users running a Web-based application over a delay-tolerant network of servers. OneTouch MedicareLine is currently providing 1700 doctors in Ghana with free mobile phone calls and text messages to other members of the medical community. We present the consequences of (1) the institutional context and identity of the investigators, as well as specific decisions made with respect to (2) partnerships formed, (3) perceptions of technological infrastructure, and (4) high-level design decisions. In concluding, we discuss lessons learned and high-level implications for future ICTD research agendas.