
Most of today's highspeed switches and routers adopt an inputqueued
crossbar switch architecture. Such a switch needs to compute a matching
(crossbar schedule) between the input ports and output ports during each
switching cycle (time slot). A key research challenge in designing large (in
number of input/output ports $N$) inputqueued crossbar switches is to develop
crossbar scheduling algorithms that can compute "high quality" matchings 
i.e., those that result in high switch throughput (ideally $100\%$) and low
queueing delays for packets  at line rates. SERENA is one such algorithm: it
outputs excellent matching decisions that result in $100\%$ switch throughput
and reasonably good queueing delays. However, since SERENA is a centralized
algorithm with $O(N)$ computational complexity, it cannot support switches that
both are large and have a very high line rate per port. In this work, we
propose SERENADE (SERENA, the Distributed Edition), a parallel iterative
algorithm that emulates SERENA in only $O(\log N)$ iterations between input
ports and output ports, and hence has a time complexity of only $O(\log N)$ per
port. We prove that SERENADE can exactly emulate SERENA. We also propose an
earlystop version of SERENADE, called OSERENADE, to only approximately
emulate SERENA. Through extensive simulations, we show that OSERENADE can
achieve 100\% throughput and that it has similar as or slightly better delay
performance than SERENA under various load conditions and traffic patterns.

Exact Maximum Inner Product Search (MIPS) is an important task that is widely
pertinent to recommender systems and highdimensional similarity search. The
bruteforce approach to solving exact MIPS is computationally expensive, thus
spurring recent development of novel indexes and pruning techniques for this
task. In this paper, we show that a hardwareefficient bruteforce approach,
blocked matrix multiply (BMM), can outperform the stateoftheart MIPS solvers
by over an order of magnitude, for some  but not all  inputs.
In this paper, we also present a novel MIPS solution, MAXIMUS, that takes
advantage of hardware efficiency and pruning of the search space. Like BMM,
MAXIMUS is faster than other solvers by up to an order of magnitude, but again
only for some inputs. Since no single solution offers the best runtime
performance for all inputs, we introduce a new datadependent optimizer,
OPTIMUS, that selects online with minimal overhead the best MIPS solver for a
given input. Together, OPTIMUS and MAXIMUS outperform stateoftheart MIPS
solvers by 3.2$\times$ on average, and up to 10.9$\times$, on widely studied
MIPS datasets.

We first consider the static problem of allocating resources to ( i.e. ,
scheduling) multiple distributed application framework s, possibly with
different priorities and server preferences , in a private cloud with
heterogeneous servers. Several fai r scheduling mechanisms have been proposed
for this purpose. We extend pr ior results on maxmin and proportional fair
scheduling to t his constrained multiresource and multiserver case for generi c
fair scheduling criteria. The task efficiencies (a metric r elated to
proportional fairness) of maxmin fair allocations found b y progressive
filling are compared by illustrative examples . They show that "server
specific" fairness criteria and those that are b ased on residual (unreserved)
resources are more efficient.

Motivated by recent concerns that queuing delays in the Internet are on the
rise, we conduct a performance evaluation of Compound TCP (CTCP) in two
topologies: a single bottleneck and a multibottleneck topology, under
different traffic scenarios. The first topology consists of a single bottleneck
router, and the second consists of two distinct sets of TCP flows, regulated by
two edge routers, feeding into a common core router. We focus on some dynamical
and statistical properties of the underlying system. From a dynamical
perspective, we develop fluid models in a regime wherein the number of flows is
large, bandwidthdelay product is high, buffers are dimensioned small
(independent of the bandwidthdelay product) and routers deploy a DropTail
queue policy. A detailed local stability analysis for these models yields the
following key insight: smaller buffers favour stability. Additionally, we
highlight that larger buffers, in addition to increasing latency, are prone to
inducing limit cycles in the system dynamics, via a Hopf bifurcation. These
limit cycles in turn cause synchronisation among the TCP flows, and also result
in a loss of link utilisation. For the topologies considered, we also
empirically analyse some statistical properties of the bottleneck queues. These
statistical analyses serve to validate an important modelling assumption: that
in the regime considered, each bottleneck queue may be approximated as either
an $M/M/1/B$ or an $M/D/1/B$ queue. This immediately makes the modelling
perspective attractive and the analysis tractable. Finally, we show that
smaller buffers, in addition to ensuring stability and low latency, would also
yield fairly good system performance, in terms of throughput and flow
completion times.

A performance regression is a degradation of software performance due to a
change in code. A common way to identify such regressions is to manually create
tests for them. This is known as regression testing. Although widely used,
regression testing can be slow and errorprone due to its historically manual
nature. Moreover, such tests' utility is usually constrained by the expertise
of the developer creating them. Due to these limitations, researchers have
turned their attention to the automatic generation of regression tests. In this
paper, we present AutoPerf  a novel approach to automate regression testing
that utilizes three core techniques: (i) zeropositive learning, (ii)
autoencoders, and (iii) hardware telemetry. We demonstrate AutoPerf's
generality and efficacy against 3 types of performance regressions across 10
real performance bugs in 7 benchmark and opensource programs. On average,
AutoPerf exhibits 4% profiling overhead and accurately diagnoses more
performance bugs than prior stateoftheart approaches.

Deep Neural Networks (DNNs) require very large amounts of computation both
for training and for inference when deployed in the field. Many different
algorithms have been proposed to implement the most computationally expensive
layers of DNNs. Further, each of these algorithms has a large number of
variants, which offer different tradeoffs of parallelism, data locality,
memory footprint, and execution time. In addition, specific algorithms operate
much more efficiently on specialized data layouts and formats.
We state the problem of optimal primitive selection in the presence of data
format transformations, and show that it is NPhard by demonstrating an
embedding in the Partitioned Boolean Quadratic Assignment problem (PBQP).
We propose an analytic solution via a PBQP solver, and evaluate our approach
experimentally by optimizing several popular DNNs using a library of more than
70 DNN primitives, on an embedded platform and a general purpose platform. We
show experimentally that significant gains are possible versus the state of the
art vendor libraries by using a principled analytic solution to the problem of
layout selection in the presence of data format transformations.

Ultrareliable PointtoMultipoint (PtM) communications are expected to
become pivotal in networks offering future dependable services for smart
cities. In this regard, sparse Random Linear Network Coding (RLNC) techniques
have been widely employed to provide an efficient way to improve the
reliability of broadcast and multicast data streams. This paper addresses the
pressing concern of providing a tight approximation to the probability of a
user recovering a data stream protected by this kind of coding technique. In
particular, by exploiting the SteinChen method, we provide a novel and
general performance framework applicable to any combination of system and
service parameters, such as finite field sizes, lengths of the data stream and
level of sparsity. The deviation of the proposed approximation from Monte Carlo
simulations is negligible, improving significantly on the state of the art
performance bounds.

The massive quantities of genomic data being made available through gene
sequencing techniques are enabling breakthroughs in genomic science in many
areas such as medical advances in the diagnosis and treatment of diseases.
Analyzing this data, however, is a computational challenge insofar as the
computational costs of the relevant algorithms can grow with quadratic, cubic
or higher complexityleading to the need for leadership scale computing. In
this paper we describe a new approach to calculations of the Custom Correlation
Coefficient (CCC) between Single Nucleotide Polymorphisms (SNPs) across a
population, suitable for parallel systems equipped with graphics processing
units (GPUs) or Intel Xeon Phi processors. We describe the mapping of the
algorithms to accelerated processors, techniques used for eliminating redundant
calculations due to symmetries, and strategies for efficient mapping of the
calculations to manynode parallel systems. Results are presented demonstrating
high pernode performance and nearideal parallel scalability with rates of
more than nine quadrillion elementwise comparisons achieved per second with the
latest optimized code on the ORNL Titan system, this being orders of magnitude
faster than rates achieved using other codes and platforms as reported in the
literature. Also it is estimated that as many as 90 quadrillion comparisons per
second may be achievable on the upcoming ORNL Summit system, an additional 10X
performance increase. In a companion paper we describe corresponding techniques
applied to calculations of the Proportional Similarity metric for comparative
genomics applications.

We study delay of jobs that consist of multiple parallel tasks, which is a
critical performance metric in a wide range of applications such as data file
retrieval in coded storage systems and parallel computing. In this problem,
each job is completed only when all of its tasks are completed, so the delay of
a job is the maximum of the delays of its tasks. Despite the wide attention
this problem has received, tight analysis is still largely unknown since
analyzing job delay requires characterizing the complicated correlation among
task delays, which is hard to do.
We first consider an asymptotic regime where the number of servers, $n$, goes
to infinity, and the number of tasks in a job, $k^{(n)}$, is allowed to
increase with $n$. We establish the asymptotic independence of any $k^{(n)}$
queues under the condition $k^{(n)} = o(n^{1/4})$. This greatly generalizes the
asymptoticindependence type of results in the literature where asymptotic
independence is shown only for a fixed constant number of queues. As a
consequence of our independence result, the job delay converges to the maximum
of independent task delays.
We next consider the nonasymptotic regime. Here we prove that independence
yields a stochastic upper bound on job delay for any $n$ and any $k^{(n)}$ with
$k^{(n)}\le n$. The key component of our proof is a new technique we develop,
called "Poisson oversampling". Our approach converts the job delay problem into
a corresponding ballsandbins problem. However, in contrast with typical
ballsandbins problems where there is a negative correlation among bins, we
prove that our variant exhibits positive correlation.

This paper presents a methodology for simulating the Internet of Things (IoT)
using multilevel simulation models. With respect to conventional simulators,
this approach allows us to tune the level of detail of different parts of the
model without compromising the scalability of the simulation. As a use case, we
have developed a twolevel simulator to study the deployment of smart services
over rural territories. The higher level is base on a coarse grained,
agentbased adaptive parallel and distributed simulator. When needed, this
simulator spawns OMNeT++ model instances to evaluate in more detail the issues
concerned with wireless communications in restricted areas of the simulated
world. The performance evaluation confirms the viability of multilevel
simulations for IoT environments.

PageRank is a fundamental link analysis algorithm that also functions as a
key representative of the performance of Sparse MatrixVector (SpMV)
multiplication. The traditional PageRank implementation generates fine
granularity random memory accesses resulting in large amount of wasteful DRAM
traffic and poor bandwidth utilization. In this paper, we present a novel
PartitionCentric Processing Methodology (PCPM) to compute PageRank, that
drastically reduces the amount of DRAM communication while achieving high
sustained memory bandwidth. PCPM uses a Partitioncentric abstraction coupled
with the GatherApplyScatter (GAS) programming model. By carefully examining
how a PCPM based implementation impacts communication characteristics of the
algorithm, we propose several system optimizations that improve the execution
time substantially. More specifically, we develop (1) a new data layout that
significantly reduces communication and random DRAM accesses, and (2) branch
avoidance mechanisms to get rid of unpredictable datadependent branches.
We perform detailed analytical and experimental evaluation of our approach
using 6 large graphs and demonstrate an average 2.7x speedup in execution time
and 1.7x reduction in communication volume, compared to the stateoftheart.
We also show that unlike other GAS based implementations, PCPM is able to
further reduce main memory traffic by taking advantage of intelligent node
labeling that enhances locality. Although we use PageRank as the target
application in this paper, our approach can be applied to generic SpMV
computation.

The modeling and analysis of an LRU cache is extremely challenging as exact
results for the main performance metrics (e.g. hit rate) are either lacking or
cannot be used because of their high computational complexity for large caches.
As a result, various approximations have been proposed. The stateoftheart
method is the socalled TTL approximation, first proposed and shown to be
asymptotically exact for IRM requests by Fagin. It has been applied to various
other workload models and numerically demonstrated to be accurate but without
theoretical justification. In this paper we provide theoretical justification
for the approximation in the case where distinct contents are described by
independent stationary and ergodic processes. We show that this approximation
is exact as the cache size and the number of contents go to infinity. This
extends earlier results for the independent reference model. Moreover, we
establish results not only for the aggregate cache hit probability but also for
every individual content. Last, we obtain bounds on the rate of convergence.

We present FLASH (\textbf{F}ast \textbf{L}SH \textbf{A}lgorithm for
\textbf{S}imilarity search accelerated with \textbf{H}PC), a similarity search
system for ultrahigh dimensional datasets on a single machine, that does not
require similarity computations and is tailored for highperformance computing
platforms. By leveraging a LSH style randomized indexing procedure and
combining it with several principled techniques, such as reservoir sampling,
recent advances in onepass minwise hashing, and count based estimations, we
reduce the computational and parallelization costs of similarity search, while
retaining sound theoretical guarantees.
We evaluate FLASH on several real, highdimensional datasets from different
domains, including text, malicious URL, clickthrough prediction, social
networks, etc. Our experiments shed new light on the difficulties associated
with datasets having several million dimensions. Current stateoftheart
implementations either fail on the presented scale or are orders of magnitude
slower than FLASH. FLASH is capable of computing an approximate kNN graph,
from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than
10 seconds. Computing a full kNN graph in less than 10 seconds on the webspam
dataset, using bruteforce ($n^2D$), will require at least 20 teraflops. We
provide CPU and GPU implementations of FLASH for replicability of our results.

The goal of privacy metrics is to measure the degree of privacy enjoyed by
users in a system and the amount of protection offered by privacyenhancing
technologies. In this way, privacy metrics contribute to improving user privacy
in the digital world. The diversity and complexity of privacy metrics in the
literature makes an informed choice of metrics challenging. As a result,
instead of using existing metrics, new metrics are proposed frequently, and
privacy studies are often incomparable. In this survey we alleviate these
problems by structuring the landscape of privacy metrics. To this end, we
explain and discuss a selection of over eighty privacy metrics and introduce
categorizations based on the aspect of privacy they measure, their required
inputs, and the type of data that needs protection. In addition, we present a
method on how to choose privacy metrics based on nine questions that help
identify the right privacy metrics for a given scenario, and highlight topics
where additional work on privacy metrics is needed. Our survey spans multiple
privacy domains and can be understood as a general framework for privacy
measurement.

Web developers use base64 formats to include images, fonts, sounds and other
resources directly inside HTML, JavaScript, JSON and XML files. We estimate
that billions of base64 messages are decoded every day. We are motivated to
improve the efficiency of base64 encoding and decoding. Compared to
stateoftheart implementations, we multiply the speeds of both the encoding
(~10x) and the decoding (~7x). We achieve these good results by using the
singleinstructionmultipledata (SIMD) instructions available on recent Intel
processors (AVX2). Our accelerated software abides by the specification and
reports errors when encountering characters outside of the base64 set. It is
available online as free software under a liberal license.

Graphics Processing Units (GPUs) support dynamic voltage and frequency
scaling (DVFS) in order to balance computational performance and energy
consumption. However, there still lacks simple and accurate performance
estimation of a given GPU kernel under different frequency settings on real
hardware, which is important to decide best frequency configuration for energy
saving. This paper reveals a finegrained model to estimate the execution time
of GPU kernels with both core and memory frequency scaling. Over a 2.5x range
of both core and memory frequencies among 12 GPU kernels, our model achieves
accurate results (within 3.5\%) on real hardware. Compared with the cyclelevel
simulators, our model only needs some simple microbenchmark to extract a set
of hardware parameters and performance counters of the kernels to produce this
high accuracy.

Tightening performance bounds of ring networks with cyclic dependencies is
still an open problem in the literature. In this paper, we tackle such a
challenging issue based on Network Calculus. First, we review the conventional
timing approaches in the area and identify their main limitations, in terms of
delay bounds pessimism. Afterwards, we have introduced a new concept called Pay
Multiplexing Only at Convergence points (PMOC) to overcome such limitations.
PMOC considers the flow serialization phenomena along the flow path, by paying
the bursts of interfering flows only at the convergence points. The guaranteed
endto end service curves under such a concept have been defined and proved for
monoring and multiplering networks, as well as under Arbitrary and Fixed
Priority multiplexing. A sensitivity analysis of the computed delay bounds for
mono and multiplering networks is conducted with respect to various flow and
network parameters, and their tightness is assessed in comparison with an
achievable worstcase delay. A noticeable enhancement of the delay bounds, thus
network resource efficiency and scalability, is highlighted under our proposal
with reference to conventional approaches. Finally, the efficiency of the PMOC
approach to provide timing guarantees is confirmed in the case of a realistic
avionics application.

Calibration is an important step towards building reliable IoT systems. For
example, accurate sensor reading requires ADC calibration, and power monitoring
chips must be calibrated before being used for measuring the energy consumption
of IoT devices. In this paper, we present ProCal, a lowcost, accurate, and
scalable power calibration tool. ProCal is a programmable platform which
provides dynamic voltage and current output for calibration. The basic idea is
to use a digital potentiometer connected to a parallel resistor network
controlled through digital switches. The resistance and output frequency of
ProCal is controlled by a software communicating with the board through the SPI
interface. Our design provides a simple synchronization mechanism which
prevents the need for accurate time synchronization. We present mathematical
modeling and validation of the tool by incorporating the concept of Fibonacci
sequence. Our extensive experimental studies show that this tool can
significantly improve measurement accuracy. For example, for ATMega2560, the
ADC error reduces from 0.2% to 0.01%. ProCal not only costs less than 2\% of
the current commercial solutions, it is also highly accurate by being able to
provide extensive range of current and voltage values.

The recent groundbreaking advances in deep learning networks ( DNNs ) make
them attractive for embedded systems. However, it can take a long time for DNNs
to make an inference on resourcelimited embedded devices. Offloading the
computation into the cloud is often infeasible due to privacy concerns, high
latency, or the lack of connectivity. As such, there is a critical need to find
a way to effectively execute the DNN models locally on the devices. This paper
presents an adaptive scheme to determine which DNN model to use for a given
input, by considering the desired accuracy and inference time. Our approach
employs machine learning to develop a predictive model to quickly select a
pretrained DNN to use for a given input and the optimization constraint. We
achieve this by first training offline a predictive model, and then use the
learnt model to select a DNN model to use for new, unseen inputs. We apply our
approach to the image classification task and evaluate it on a Jetson TX2
embedded deep learning platform using the ImageNet ILSVRC 2012 validation
dataset. We consider a range of influential DNN models. Experimental results
show that our approach achieves a 7.52% improvement in inference accuracy, and
a 1.8x reduction in inference time over the mostcapable single DNN model.

A key enabler for the emerging autonomous and cooperative driving services is
highthroughput and reliable VehicletoNetwork (V2N) communication. In this
respect, the millimeter wave (mmWave) frequencies hold great promises because
of the large available bandwidth which may provide the required link capacity.
However, this potential is hindered by the challenging propagation
characteristics of highfrequency channels and the dynamic topology of the
vehicular scenarios, which affect the reliability of the connection. Moreover,
mmWave transmissions typically leverage beamforming gain to compensate for the
increased path loss experienced at high frequencies. This, however, requires
fine alignment of the transmitting and receiving beams, which may be difficult
in vehicular scenarios. Those limitations may undermine the performance of V2N
communications and pose new challenges for proper vehicular communication
design. In this paper, we study by simulation the practical feasibility of some
mmWaveaware strategies to support V2N, in comparison to the traditional LTE
connectivity below 6 GHz. The results show that the orchestration among
different radios represents a viable solution to enable both highcapacity and
robust V2N communications.

A discreteevent simulation (DES) involves the execution of a sequence of
event handlers dynamically scheduled at runtime. As a consequence, a priori
knowledge of the control flow of the overall simulation program is limited. In
particular, powerful optimizations supported by modern compilers can only be
applied on the scope of individual event handlers, which frequently involve
only a few lines of code. We propose a method that extends the scope for
compiler optimizations in discreteevent simulations by generating batches of
multiple events that are subjected to compiler optimizations as contiguous
procedures. A runtime mechanism executes suitable batches at negligible
overhead. Our method does not require any compiler extensions and introduces
only minor additional effort during model development. The feasibility and
potential performance gains of the approach are illustrated on the example of
an idealized proofofconcept model. We believe that the applicability of the
approach extends to general eventdriven programs.

Linear algebraic expressions are the essence of many computationally
intensive problems, including scientific simulations and machine learning
applications. However, translating highlevel formulations of these expressions
to efficient machinelevel representations is far from trivial: developers
should be assisted by automatic optimization tools so that they can focus their
attention on highlevel problems, rather than lowlevel details. The
tractability of these optimizations is highly dependent on the choice of the
primitive constructs in terms of which the computations are to be expressed. In
this work we propose to describe operations on multidimensional arrays using a
selection of higherorder functions, inspired by functional programming, and we
present rewrite rules for these such that they can be automatically optimized
for modern hierarchical and heterogeneous architectures. Using this formalism
we systematically construct and analyse different subdivisions and permutations
of the dense matrix multiplication problem.

The sparse matrixvector product (SpMV) is a fundamental operation in many
scientific applications from various fields. The High Performance Computing
(HPC) community has therefore continuously invested a lot of effort to provide
an efficient SpMV kernel on modern CPU architectures. Although it has been
shown that blockbased kernels help to achieve high performance, they are
difficult to use in practice because of the zero padding they require. In the
current paper, we propose new kernels using the AVX512 instruction set, which
makes it possible to use a blocking scheme without any zero padding in the
matrix memory storage. We describe maskbased sparse matrix formats and their
corresponding SpMV kernels highly optimized in assembly language. Considering
that the optimal blocking size depends on the matrix, we also provide a method
to predict the best kernel to be used utilizing a simple interpolation of
results from previous executions. We compare the performance of our approach to
that of the Intel MKL CSR kernel and the CSR5 opensource package on a set of
standard benchmark matrices. We show that we can achieve significant
improvements in many cases, both for sequential and for parallel executions.
Finally, we provide the corresponding code in an open source library, called
SPC5.

For reasons of both performance and energy efficiency, highperformance
computing (HPC) hardware is becoming increasingly heterogeneous. The OpenCL
framework supports portable programming across a wide range of computing
devices and is gaining influence in programming nextgeneration accelerators.
To characterize the performance of these devices across a range of applications
requires a diverse, portable and configurable benchmark suite, and OpenCL is an
attractive programming model for this purpose. We present an extended and
enhanced version of the OpenDwarfs OpenCL benchmark suite, with a strong focus
placed on the robustness of applications, curation of additional benchmarks
with an increased emphasis on correctness of results and choice of problem
size. Preliminary results and analysis are reported for eight benchmark codes
on a diverse set of architectures  three Intel CPUs, five Nvidia GPUs, six
AMD GPUs and a Xeon Phi.

We study a parallel queueing system with multiple types of servers and
customers. A bipartite graph describes which pairs of customerserver types are
compatible. We consider the service policy that always assigns servers to the
first, longest waiting compatible customer, and that always assigns customers
to the longest idle compatible server if on arrival, multiple compatible
servers are available. For a general renewal stream of arriving customers and
general service time distributions, the behavior of such systems is very
complicated. In particular, the calculation of matching rates, the fraction of
services of customerserver type, is intractable. We suggest through a
heuristic argument that if the number of servers becomes large, the matching
rates are well approximated by matching rates calculated from the tractable
bipartite infinite matching model. We present simulation evidence to support
this heuristic argument, and show how this can be used to design systems with
desired performance requirements.