• Read-Copy Update (RCU) is a scalable, high-performance Linux-kernel synchronization mechanism that runs low-overhead readers concurrently with updaters. Production-quality RCU implementations for multi-core systems are decidedly non-trivial. Giving the ubiquity of Linux, a rare "million-year" bug can occur several times per day across the installed base. Stringent validation of RCU's complex behaviors is thus critically important. Exhaustive testing is infeasible due to the exponential number of possible executions, which suggests use of formal verification. Previous verification efforts on RCU either focus on simple implementations or use modeling languages, the latter requiring error-prone manual translation that must be repeated frequently due to regular changes in the Linux kernel's RCU implementation. In this paper, we first describe the implementation of Tree RCU in the Linux kernel. We then discuss how to construct a model directly from Tree RCU's source code in C, and use the CBMC model checker to verify its safety and liveness properties. To our best knowledge, this is the first verification of a significant part of RCU's source code, and is an important step towards integration of formal verification into the Linux kernel's regression test suite.
  • With the emergence of Non-Volatile Memories (NVMs) and their shortcomings such as limited endurance and high power consumption in write requests, several studies have suggested hybrid memory architecture employing both Dynamic Random Access Memory (DRAM) and NVM in a memory system. By conducting a comprehensive experiments, we have observed that such studies lack to consider very important aspects of hybrid memories including the effect of: a) data migrations on performance, b) data migrations on power, and c) the granularity of data migration. This paper presents an efficient data migration scheme at the Operating System level in a hybrid DRAMNVM memory architecture. In the proposed scheme, two Least Recently Used (LRU) queues, one for DRAM section and one for NVM section, are used for the sake of data migration. With careful characterization of the workloads obtained from PARSEC benchmark suite, the proposed scheme prevents unnecessary migrations and only allows migrations which benefits the system in terms of power and performance. The experimental results show that the proposed scheme can reduce the power consumption up to 79% compared to DRAM-only memory and up to 48% compared to the state-of-the art techniques.
  • Modern GPUs face a trade-off on how the page size used for memory management affects address translation and demand paging. Support for multiple page sizes can help relax the page size trade-off so that address translation and demand paging optimizations work together synergistically. However, existing page coalescing and splintering policies require costly base page migrations that undermine the benefits multiple page sizes provide. In this paper, we observe that GPGPU applications present an opportunity to support multiple page sizes without costly data migration, as the applications perform most of their memory allocation en masse (i.e., they allocate a large number of base pages at once). We show that this en masse allocation allows us to create intelligent memory allocation policies which ensure that base pages that are contiguous in virtual memory are allocated to contiguous physical memory pages. As a result, coalescing and splintering operations no longer need to migrate base pages. We introduce Mosaic, a GPU memory manager that provides application-transparent support for multiple page sizes. Mosaic uses base pages to transfer data over the system I/O bus, and allocates physical memory in a way that (1) preserves base page contiguity and (2) ensures that a large page frame contains pages from only a single memory protection domain. This mechanism allows the TLB to use large pages, reducing address translation overhead. During data transfer, this mechanism enables the GPU to transfer only the base pages that are needed by the application over the system I/O bus, keeping demand paging overhead low.
  • Integrated CPU-GPU architecture provides excellent acceleration capabilities for data parallel applications on embedded platforms while meeting the size, weight and power (SWaP) requirements. However, sharing of main memory between CPU applications and GPU kernels can severely affect the execution of GPU kernels and diminish the performance gain provided by GPU. For example, in the NVIDIA Tegra K1 platform which has the integrated CPU-GPU architecture, we noticed that in the worst case scenario, the GPU kernels can suffer as much as 4X slowdown in the presence of co-running memory intensive CPU applications compared to their solo execution. In this paper, we propose a software mechanism, which we call BWLOCK++, to protect the performance of GPU kernels from co-scheduled memory intensive CPU applications.
  • Poor time predictability of multicore processors has been a long-standing challenge in the real-time systems community. In this paper, we make a case that a fundamental problem that prevents efficient and predictable real-time computing on multicore is the lack of a proper memory abstraction to express memory criticality, which cuts across various layers of the system: the application, OS, and hardware. We, therefore, propose a new holistic resource management approach driven by a new memory abstraction, which we call Deterministic Memory. The key characteristic of deterministic memory is that the platform - the OS and hardware - guarantees small and tightly bounded worst-case memory access timing. In contrast, we call the conventional memory abstraction as best-effort memory in which only highly pessimistic worst-case bounds can be achieved. We propose to utilize both abstractions to achieve high time predictability but without significantly sacrificing performance. We present deterministic memory-aware OS and architecture designs, including OS-level page allocator, hardware-level cache, and DRAM controller designs. We implement the proposed OS and architecture extensions on Linux and gem5 simulator. Our evaluation results, using a set of synthetic and real-world benchmarks, demonstrate the feasibility and effectiveness of our approach.
  • Reproducing executions of multithreaded programs is very challenging due to many intrinsic and external non-deterministic factors. Existing RnR systems achieve significant progress in terms of performance overhead, but none targets the in-situ setting, in which replay occurs within the same process as the recording process. Also, most existing work cannot achieve identical replay, which may prevent the reproduction of some errors. This paper presents iReplayer, which aims to identically replay multithreaded programs in the original process (under the "in-situ" setting). The novel in-situ and identical replay of iReplayer makes it more likely to reproduce errors, and allows it to directly employ debugging mechanisms (e.g. watchpoints) to aid failure diagnosis. Currently, iReplayer only incurs 3% performance overhead on average, which allows it to be always enabled in the production environment. iReplayer enables a range of possibilities, and this paper presents three examples: two automatic tools for detecting buffer overflows and use-after-free bugs, and one interactive debugging tool that is integrated with GDB.
  • Applications in the AI and HPC fields require much memory capacity, and the amount of energy consumed by main memory of server machines is ever increasing. Energy consumption of main memory can be greatly reduced by applying approximate computing in exchange for increased bit error rates. AI and HPC applications are to some extent robust to bit errors because small numerical errors are amortized by their iterative nature. However, a single occurrence of a NaN due to bit-flips corrupts the whole calculation result. The issue is that fixing every bit-flip using ECC incurs too much overhead because the bit error rate is much higher than in normal environments. We propose a low-overhead method to fix NaNs when approximate computing is applied to main memory. The main idea is to reactively repair NaNs while leaving other non-fatal numerical errors as-is to reduce the overhead. We implemented a prototype by leveraging floating-point exceptions of x86 CPUs, and the preliminary evaluations showed that our method incurs negligible overhead.
  • In this work we present the Secure Machine, SeM for short, a CPU architecture extension for secure computing. SeM uses a small amount of in-chip additional hardware that monitors key communication channels inside the CPU chip, and only acts when required. SeM provides confidentiality and integrity for a secure program without trusting the platform software or any off-chip hardware. SeM supports existing binaries of single- and multi-threaded applications running on single- or multi-core, multi-CPU. The performance reduction caused by it is only few percent, most of which is due to the memory encryption layer that is commonly used in many secure architectures. We also developed SeM-Prepare, a software tool that automatically instruments existing applications (binaries) with additional instructions so they can be securely executed on our architecture without requiring any programming efforts or the availability of the desired program`s source code. To enable secure data sharing in shared memory environments, we developed Secure Distributed Shared Memory (SDSM), an efficient (time and memory) algorithm for allowing thousands of compute nodes to share data securely while running on an untrusted computing environment. SDSM shows a negligible reduction in performance, and it requires negligible and hardware resources. We developed Distributed Memory Integrity Trees, a method for enhancing single node integrity trees for preserving the integrity of a distributed application running on an untrusted computing environment. We show that our method is applicable to existing single node integrity trees such as Merkle Tree, Bonsai Merkle Tree, and Intel`s SGX memory integrity engine. All these building blocks may be used together to form a practical secure system, and some can be used in conjunction with other secure systems.
  • The sporadic task model is often used to analyze recurrent execution of identical tasks in real-time systems. A sporadic task defines an infinite sequence of task instances, also called jobs, that arrive under the minimum inter-arrival time constraint. To ensure the system safety, timeliness has to be guaranteed in addition to functional correctness, i.e., all jobs of all tasks have to be finished before the job deadlines. We focus on analyzing arbitrary-deadline task sets on a homogeneous (identical) multiprocessor system under any given global fixed-priority scheduling approach and provide a series of schedulability tests with different tradeoffs between their time complexity and their accuracy. Under the arbitrary-deadline setting, the relative deadline of a task can be longer than the minimum inter-arrival time of the jobs of the task. We show that global deadline-monotonic (DM) scheduling has a speedup bound of $3-1/M$ against any optimal scheduling algorithms, where $M$ is the number of identical processors, and prove that this bound is asymptotically tight.
  • Commodity OS kernels are known to have broad attack surfaces due to the large code base and the numerous features such as device drivers. For a real-world use case (e.g., an Apache Server), many kernel services are unused and only a small amount of kernel code is used. Within the used code, a certain part is invoked only at the runtime phase while the rest are executed at startup and/or shutdown phases in the kernel's lifetime run. In this paper, we propose a reliable and practical system, named KASR, which transparently reduces attack surfaces of commodity OS kernels at the runtime phase without requiring their source code. The KASR system, residing in a trusted hypervisor, achieves the attack surface reduction through two reliable steps: (1) deprives unused code of executable permissions, and (2) transparently segments used code and selectively activates them according to their phases. KASR also prevents unknown code from executing in the kernel space, and thus it is able to defend against all kernel code injection attacks. We implement a prototype of KASR on the Xen-4.8.2 hypervisor and evaluate its security effectiveness on Linux kernel-4.4.0-87-generic. Our evaluation shows that KASR reduces kernel attack surface by 64%, trims off 40% of in-memory CVE vulnerabilities and 66% of system calls, and prohibits 99% of on-disk device drivers (including their related CVEs) from executing. Besides, KASR successfully detects and blocks all 6 real-world kernel rootkits. We measure its performance overhead with three benchmark tools (i.e., SPECINT, httperf and bonnie++). The experimental results indicate that KASR imposes less than 1% averaged performance overhead compared to an unmodified Xen hypervisor.
  • Timing guarantees are crucial to cyber-physical applications that must bound the end-to-end delay between sensing, processing and actuation. For example, in a flight controller for a multirotor drone, the data from a gyro or inertial sensor must be gathered and processed to determine the attitude of the aircraft. Sensor data fusion is followed by control decisions that adjust the flight of a drone by altering motor speeds. If the processing pipeline between sensor input and actuation is not bounded, the drone will lose control and possibly fail to maintain flight. Motivated by the implementation of a multithreaded drone flight controller on the Quest RTOS, we develop a composable pipe model based on the system's task, scheduling and communication abstractions. This pipe model is used to analyze two semantics of end-to-end time: reaction time and freshness time. We also argue that end-to-end timing properties should be factored in at the early stage of application design. Thus, we provide a mathematical framework to derive feasible task periods that satisfy both a given set of end-to-end timing constraints and the schedulability requirement. We demonstrate the applicability of our design approach by using it to port the Cleanflight flight controller firmware to Quest on the Intel Aero board. Experiments show that Cleanflight ported to Quest is able to achieve end-to-end latencies within the predicted time bounds derived by analysis.
  • Time awareness is critical to a broad range of emerging applications -- in Cyber-Physical Systems and Internet of Things -- running on commodity platforms and operating systems. Traditionally, time is synchronized across devices through a best-effort background service whose performance is neither observable nor controllable, thus consuming system resources independently of application needs while not allowing the applications and OS services to adapt to changes in uncertainty in system time. We advocate for rethinking how time is managed in a system stack. In this paper, we propose a new clock model that characterizes various sources of timing uncertainties in true time. We then present a Kalman filter based time synchronization protocol that adapts to the uncertainties exposed by the clock model. Our realization of a uncertainty-aware clock model and synchronization protocol is based on a standard embedded Linux platform.
  • Creating a model of a computer system that can be used for tasks such as predicting future resource usage and detecting anomalies is a challenging problem. Most current systems rely on heuristics and overly simplistic assumptions about the workloads and system statistics. These heuristics are typically a one-size-fits-all solution so as to be applicable in a wide range of applications and systems environments. With this paper, we present our ongoing work of integrating systems telemetry ranging from standard resource usage statistics to kernel and library calls of applications into a machine learning model. Intuitively, such a ML model approximates, at any point in time, the state of a system and allows us to solve tasks such as resource usage prediction and anomaly detection. To achieve this goal, we leverage readily-available information that does not require any changes to the applications run on the system. We train recurrent neural networks to learn a model of the system under consideration. As a proof of concept, we train models specifically to predict future resource usage of running applications.
  • This paper introduces the concept of size-aware sharding to improve tail latencies for in-memory key-value stores, and describes its implementation in the Minos key-value store. Tail latencies are crucial in distributed applications with high fan-out ratios, because overall response time is determined by the slowest response. Size-aware sharding distributes requests for keys to cores according to the size of the item associated with the key. In particular, requests for small and large items are sent to disjoint subsets of cores. Size-aware sharding improves tail latencies by avoiding head-of-line blocking, in which a request for a small item gets queued behind a request for a large item. Alternative size-unaware approaches to sharding, such as keyhash-based sharding, request dispatching and stealing do not avoid head-of-line blocking, and therefore exhibit worse tail latencies. The challenge in implementing size-aware sharding is to maintain high throughput by avoiding the cost of software dispatching and by achieving load balancing between different cores. Minos uses hardware dispatch for all requests for small items, which form the very large majority of all requests. It achieves load balancing by adapting the number of cores handling requests for small and large items to their relative presence in the workload. We compare Minos to three state-of-the-art designs of in-memory KV stores. Compared to its closest competitor, Minos achieves a 99th percentile latency that is up to two orders of magnitude lower. Put differently, for a given value for the 99th percentile latency equal to 10 times the mean service time, Minos achieves a throughput that is up to 7.4 times higher.
  • Efficient, reliable trapping of execution in a program at the desired location is a hot area of research for security professionals. The progression of debuggers and malware is akin to a game of cat and mouse - each are constantly in a state of trying to thwart one another. At the core of most efficient debuggers today is a combination of virtual machines and traditional binary modification breakpoints (int3). In this paper, we present a design for Virtual Breakpoints, a modification to the x86 MMU which brings breakpoint management into hardware alongside page tables. We demonstrate the fundamental abstraction failures of current trapping methods, and rebuild the mechanism from the ground up. Our design delivers fast, reliable trapping without the pitfalls of binary modification.
  • Basic mirroring (BM) classified as RAID level 1 replicates data on two disks, thus doubling disk access bandwidth for read requests. RAID1/0 is an array of BM pairs with balanced loads due to striping. When a disk fails the read load on its pair is doubled, which results in halving the maximum attainable bandwidth. We review RAID1 organizations which attain a balanced load upon disk failure, but as shown by reliability analysis tend to be less reliable than RAID1/0. Hybrid disk arrays which store XORed instead of replicated data tend to have a higher reliability than mirrored disks, but incur a higher overhead in updating data. Read request response time can be improved by processing them at a higher priority than writes, since they have a direct effect on application response time. Shortest seek distance and affinity based routing both shorten seek time. Anticipatory arm placement places arms optimally to minimize the seek distance. The analysis of RAID1 in normal, degraded, and rebuild mode is provided to quantify RAID1/0 performance. We compare the reliability of mirrored disk organizations against each other and hybrid disks and erasure coded disk arrays.
  • Many applications have service requirements that are not easily met by existing operating systems. Real-time and security-critical tasks, for example, often require custom OSes to meet their needs. However, development of special purpose OSes is a time-consuming and difficult exercise. Drivers, libraries and applications have to be written from scratch or ported from existing sources. Many researchers have tackled this problem by developing ways to extend existing systems with application-specific services. However, it is often difficult to ensure an adequate degree of separation between legacy and new services, especially when security and timing requirements are at stake. Virtualization, for example, supports logical isolation of separate guest services, but suffers from inadequate temporal isolation of time-critical code required for real-time systems. This paper presents vLibOS, a master-slave paradigm for new systems, whose services are built on legacy code that is temporally and spatially isolated in separate VM domains. Existing OSes are treated as sandboxed libraries, providing legacy services that are requested by inter-VM calls, which execute with the time budget of the caller. We evaluate a real-time implementation of vLibOS. Empirical results show that vLibOS achieves as much as a 50\% reduction in performance slowdown for real-time threads, when competing for a shared memory bus with a Linux VM.
  • To satisfy increasing storage demands in both capacity and performance, industry has turned to multiple storage technologies, including Flash SSDs and SMR disks. These devices employ a translation layer that conceals the idiosyncrasies of their mediums and enables random access. Device translation layers are, however, inherently constrained: resources on the drive are scarce, they cannot be adapted to application requirements, and lack visibility across multiple devices. As a result, performance and durability of many storage devices is severely degraded. In this paper, we present SALSA: a translation layer that executes on the host and allows unmodified applications to better utilize commodity storage. SALSA supports a wide range of single- and multi-device optimizations and, because is implemented in software, can adapt to specific workloads. We describe SALSA's design, and demonstrate its significant benefits using microbenchmarks and case studies based on three applications: MySQL, the Swift object store, and a video server.
  • Data retrieval systems such as online search engines and online social networks must comply with the privacy policies of personal and selectively shared data items, regulatory policies regarding data retention and censorship, and the provider's own policies regarding data use. Enforcing these policies is difficult and error-prone. Systematic techniques to enforce policies are either limited to type-based policies that apply uniformly to all data of the same type, or incur significant runtime overhead. This paper presents Shai, the first system that systematically enforces data-specific policies with near-zero overhead in the common case. Shai's key idea is to push as many policy checks as possible to an offline, ahead-of-time analysis phase, often relying on predicted values of runtime parameters such as the state of access control lists or connected users' attributes. Runtime interception is used sparingly, only to verify these predictions and to make any remaining policy checks. Our prototype implementation relies on efficient, modern OS primitives for sandboxing and isolation. We present the design of Shai and quantify its overheads on an experimental data indexing and search pipeline based on the popular search engine Apache Lucene.
  • The Internet of Things (IoT) is rapidly evolving based on low-power compliant protocol standards that extend the Internet into the embedded world. Pioneering implementations have proven it is feasible to inter-network very constrained devices, but had to rely on peculiar cross-layered designs and offer a minimalistic set of features. In the long run, however, professional use and massive deployment of IoT devices require full-featured, cleanly composed, and flexible network stacks. This paper introduces the networking architecture that turns RIOT into a powerful IoT system, to enable low-power wireless scenarios. RIOT networking offers (i) a modular architecture with generic interfaces for plugging in drivers, protocols, or entire stacks, (ii) support for multiple heterogeneous interfaces and stacks that can concurrently operate, and (iii) GNRC, its cleanly layered, recursively composed default network stack. We contribute an in-depth analysis of the communication performance and resource efficiency of RIOT, both on a micro-benchmarking level as well as by comparing IoT communication across different platforms. Our findings show that, though it is based on significantly different design trade-offs, the networking subsystem of RIOT achieves a performance equivalent to that of Contiki and TinyOS, the two operating systems which pioneered IoT software platforms.
  • Modern Operating Systems are typically POSIX-compliant. The system calls are the fundamental layer of interaction between user-space applications and the OS kernel and its implementation of fundamental abstractions and primitives used in modern computing. The next generation of NVM/SCM memory raises critical questions about the efficiency of modern OS architecture. This paper investigates how the POSIX API drives performance for a system with NVM/SCM memory. We show that OS and metadata related system calls represent the most important area of optimization. However, the synchronization related system calls (poll(), futex(), wait4()) are the most time-consuming overhead that even a RAMdisk platform fails to eliminate. Attempting to preserve the POSIX-based approach will likely result in fundamental inefficiencies for any future applications of NVM/SCM memory.
  • The security of billions of devices worldwide depends on the security and robustness of the mainline Linux kernel. However, the increasing number of kernel-specific vulnerabilities, especially memory safety vulnerabilities, shows that the kernel is a popular and practically exploitable target. Two major causes of memory safety vulnerabilities are reference counter overflows (temporal memory errors) and lack of pointer bounds checking (spatial memory errors). To succeed in practice, security mechanisms for critical systems like the Linux kernel must also consider performance and deployability as critical design objectives. We present and systematically analyze two such mechanisms for improving memory safety in the Linux kernel: (a) an overflow-resistant reference counter data structure designed to accommodate typical reference counter usage in kernel source code, and (b) runtime pointer bounds checking using Intel MPX in the kernel.
  • Computing technology has gotten cheaper and more powerful, allowing users to have a growing number of personal computing devices at their disposal. While this trend is beneficial for the user, it also creates a growing management burden for the user. Each device must be managed independently and users must repeat the same management tasks on the each device, such as updating software, changing configurations, backup, and replicating data for availability. To prevent the management burden from increasing with the number of devices, we propose that all devices run a single system image called a personal computing image. Personal computing images export a device-specific user interface on each device, but provide a consistent view of application and operating state across all devices. As a result, management tasks can be performed once on any device and will be automatically propagated to all other devices belonging to the user. We discuss evolutionary steps that can be taken to achieve personal computing images for devices and elaborate on challenges that we believe building such systems will face.
  • Virtualization technologies have evolved along with the development of computational environments since virtualization offered needed features at that time such as isolation, accountability, resource allocation, resource fair sharing and so on. Novel processor technologies bring to commodity computers the possibility to emulate diverse environments where a wide range of computational scenarios can be run. Along with processors evolution, system developers have created different virtualization mechanisms where each new development enhanced the performance of previous virtualized environments. Recently, operating system-based virtualization technologies captured the attention of communities abroad (from industry to academy and research) because their important improvements on performance area. In this paper, the features of three container-based operating systems virtualization tools (LXC, Docker and Singularity) are presented. LXC, Docker, Singularity and bare metal are put under test through a customized single node HPL-Benchmark and a MPI-based application for the multi node testbed. Also the disk I/O performance, Memory (RAM) performance, Network bandwidth and GPU performance are tested for the COS technologies vs bare metal. Preliminary results and conclusions around them are presented and discussed.
  • In spite of years of improvements to software security, heap-related attacks still remain a severe threat. One reason is that many existing memory allocators fall short in a variety of aspects. For instance, performance-oriented allocators are designed with very limited countermeasures against attacks, but secure allocators generally suffer from significant performance overhead, e.g., running up to 10x slower. This paper, therefore, introduces FreeGuard, a secure memory allocator that prevents or reduces a wide range of heap-related attacks, such as heap overflows, heap over-reads, use-after-frees, as well as double and invalid frees. FreeGuard has similar performance to the default Linux allocator, with less than 2% overhead on average, but provides significant improvement to security guarantees. FreeGuard also addresses multiple implementation issues of existing secure allocators, such as the issue of scalability. Experimental results demonstrate that FreeGuard is very effective in defending against a variety of heap-related attacks.