We present and study the Static-Routing-Resiliency problem, motivated by
routing on the Internet: Given a graph $G$, a unique destination vertex $d$,
and an integer constant $c>0$, does there exist a static and destination-based
routing scheme such that the correct delivery of packets from any source $s$ to
the destination $d$ is guaranteed so long as (1) no more than $c$ edges fail
and (2) there exists a physical path from $s$ to $d$? We embark upon a
systematic exploration of this fundamental question in a variety of models
(deterministic routing, randomized routing, with packet-duplication, with
packet-header-rewriting) and present both positive and negative results that
relate the edge-connectivity of a graph, i.e., the minimum number of edges
whose deletion partitions $G$, to its resiliency.
Recent work has made great progress in verifying the forwarding correctness
of networks . However, these approaches cannot be used to verify networks
containing middleboxes, such as caches and firewalls, whose forwarding behavior
depends on previously observed traffic. We explore how to verify reachability
properties for networks that include such "mutable datapath" elements. We want
our verification results to hold not just for the given network, but also in
the presence of failures. The main challenge lies in scaling the approach to
handle large and complicated networks, We address by developing and leveraging
the concept of slices, which allow network-wide verification to only require
analyzing small portions of the network. We show that with slices the time
required to verify an invariant on many production networks is independent of
the size of the network itself.
Control planes for global carrier networks should be programmable (so that
new functionality can be easily introduced) and scalable (so they can handle
the numerical scale and geographic scope of these networks). Neither
traditional control planes nor new SDN-based control planes meet both of these
goals. In this paper, we propose a framework for recursive routing computations
that combines the best of SDN (programmability) and traditional networks
(scalability through hierarchy) to achieve these two desired properties.
Through simulation on graphs of up to 10,000 nodes, we evaluate our design's
ability to support a variety of routing and traffic engineering solutions,
while incorporating a fast failure recovery mechanism.
Great progress has been made recently in verifying the correctness of router
forwarding tables. However, these approaches do not work for networks
containing middleboxes such as caches and firewalls whose forwarding behavior
depends on previously observed traffic. We explore how to verify isolation
properties in networks that include such "dynamic datapath" elements using
model checking. Our work leverages recent advances in SMT solvers, and the main
challenge lies in scaling the approach to handle large and complicated
networks. While the straightforward application of model checking to this
problem can only handle very small networks (if at all), our approach can
verify simple realistic invariants on networks containing 30,000 middleboxes in
a few minutes.
Many modern network designs incorporate "failover" paths into routers'
forwarding tables. We initiate the theoretical study of the conditions under
which such resilient routing tables can guarantee delivery of packets.
In this paper, we present BlinkDB, a massively parallel, sampling-based
approximate query engine for running ad-hoc, interactive SQL queries on large
volumes of data. The key insight that BlinkDB builds on is that one can often
make reasonable decisions in the absence of perfect answers. For example,
reliably detecting a malfunctioning server using a distributed collection of
system logs does not require analyzing every request processed by the system.
Based on this insight, BlinkDB allows one to trade-off query accuracy for
response time, enabling interactive queries over massive data by running
queries on data samples and presenting results annotated with meaningful error
bars. To achieve this, BlinkDB uses two key ideas that differentiate it from
previous work in this area: (1) an adaptive optimization framework that builds
and maintains a set of multi-dimensional, multi-resolution samples from
original data over time, and (2) a dynamic sample selection strategy that
selects an appropriately sized sample based on a query's accuracy and/or
response time requirements. We have built an open-source version of BlinkDB and
validated its effectiveness using the well-known TPC-H benchmark as well as a
real-world analytic workload derived from Conviva Inc. Our experiments on a 100
node cluster show that BlinkDB can answer a wide range of queries from a
real-world query trace on up to 17 TBs of data in less than 2 seconds (over
100\times faster than Hive), within an error of 2 - 10%.