• We present and study the Static-Routing-Resiliency problem, motivated by routing on the Internet: Given a graph $G$, a unique destination vertex $d$, and an integer constant $c>0$, does there exist a static and destination-based routing scheme such that the correct delivery of packets from any source $s$ to the destination $d$ is guaranteed so long as (1) no more than $c$ edges fail and (2) there exists a physical path from $s$ to $d$? We embark upon a systematic exploration of this fundamental question in a variety of models (deterministic routing, randomized routing, with packet-duplication, with packet-header-rewriting) and present both positive and negative results that relate the edge-connectivity of a graph, i.e., the minimum number of edges whose deletion partitions $G$, to its resiliency.
  • Recent work has made great progress in verifying the forwarding correctness of networks . However, these approaches cannot be used to verify networks containing middleboxes, such as caches and firewalls, whose forwarding behavior depends on previously observed traffic. We explore how to verify reachability properties for networks that include such "mutable datapath" elements. We want our verification results to hold not just for the given network, but also in the presence of failures. The main challenge lies in scaling the approach to handle large and complicated networks, We address by developing and leveraging the concept of slices, which allow network-wide verification to only require analyzing small portions of the network. We show that with slices the time required to verify an invariant on many production networks is independent of the size of the network itself.
  • Control planes for global carrier networks should be programmable (so that new functionality can be easily introduced) and scalable (so they can handle the numerical scale and geographic scope of these networks). Neither traditional control planes nor new SDN-based control planes meet both of these goals. In this paper, we propose a framework for recursive routing computations that combines the best of SDN (programmability) and traditional networks (scalability through hierarchy) to achieve these two desired properties. Through simulation on graphs of up to 10,000 nodes, we evaluate our design's ability to support a variety of routing and traffic engineering solutions, while incorporating a fast failure recovery mechanism.
  • Great progress has been made recently in verifying the correctness of router forwarding tables. However, these approaches do not work for networks containing middleboxes such as caches and firewalls whose forwarding behavior depends on previously observed traffic. We explore how to verify isolation properties in networks that include such "dynamic datapath" elements using model checking. Our work leverages recent advances in SMT solvers, and the main challenge lies in scaling the approach to handle large and complicated networks. While the straightforward application of model checking to this problem can only handle very small networks (if at all), our approach can verify simple realistic invariants on networks containing 30,000 middleboxes in a few minutes.
  • Many modern network designs incorporate "failover" paths into routers' forwarding tables. We initiate the theoretical study of the conditions under which such resilient routing tables can guarantee delivery of packets.
  • In this paper, we present BlinkDB, a massively parallel, sampling-based approximate query engine for running ad-hoc, interactive SQL queries on large volumes of data. The key insight that BlinkDB builds on is that one can often make reasonable decisions in the absence of perfect answers. For example, reliably detecting a malfunctioning server using a distributed collection of system logs does not require analyzing every request processed by the system. Based on this insight, BlinkDB allows one to trade-off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses two key ideas that differentiate it from previous work in this area: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional, multi-resolution samples from original data over time, and (2) a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy and/or response time requirements. We have built an open-source version of BlinkDB and validated its effectiveness using the well-known TPC-H benchmark as well as a real-world analytic workload derived from Conviva Inc. Our experiments on a 100 node cluster show that BlinkDB can answer a wide range of queries from a real-world query trace on up to 17 TBs of data in less than 2 seconds (over 100\times faster than Hive), within an error of 2 - 10%.