
Deep Neural Networks (DNNs) require very large amounts of computation both
for training and for inference when deployed in the field. Many different
algorithms have been proposed to implement the most computationally expensive
layers of DNNs. Further, each of these algorithms has a large number of
variants, which offer different tradeoffs of parallelism, data locality,
memory footprint, and execution time. In addition, specific algorithms operate
much more efficiently on specialized data layouts and formats.
We state the problem of optimal primitive selection in the presence of data
format transformations, and show that it is NPhard by demonstrating an
embedding in the Partitioned Boolean Quadratic Assignment problem (PBQP).
We propose an analytic solution via a PBQP solver, and evaluate our approach
experimentally by optimizing several popular DNNs using a library of more than
70 DNN primitives, on an embedded platform and a general purpose platform. We
show experimentally that significant gains are possible versus the state of the
art vendor libraries by using a principled analytic solution to the problem of
layout selection in the presence of data format transformations.

Convolutional neural networks (CNNs) are one of the most successful machine
learning techniques for image, voice and video processing. CNNs require large
amounts of processing capacity and memory bandwidth. Hardware accelerators have
been proposed for CNNs which typically contain large numbers of
multiplyaccumulate (MAC) units, the multipliers of which are large in an
integrated circuit (IC) gate count and power consumption. "Weight sharing"
accelerators have been proposed where the full range of weight values in a
trained CNN are compressed and put into bins and the bin index used to access
the weightshared value. We reduce power and area of the CNN by implementing
parallel accumulate shared MAC (PASM) in a weight shared CNN. PASM
rearchitects the MAC to instead count the frequency of each weight and place
it in a bin. The accumulated value is computed in a subsequent multiply phase,
significantly reducing gate count and power consumption of the CNN. In this
paper, we implement PASM in a weightshared CNN convolution hardware
accelerator and analyze its effectiveness. Experiments show that for a clock
speed 1GHz implemented on a 45nm ASIC process our approach results in fewer
gates, smaller logic, and reduced power with only a slight increase in latency.
We also show that the same weightsharedwithPASM CNN accelerator can be
implemented in resourceconstrained FPGAs, where the FPGA has limited numbers
of digital signal processor (DSP) units to accelerate the MAC operations.

Modern deep neural networks (DNNs) spend a large amount of their execution
time computing convolutions. Winograd's minimal algorithm for small
convolutions can greatly reduce the number of arithmetic operations. However, a
large reduction in floating point (FP) operations in these algorithms can
result in significantly reduced FP accuracy of the result. In this paper we
propose several methods for reducing the FP error of these algorithms. Minimal
convolution algorithms depend on the selection of several numeric
\textit{points} that have a large impact on the accuracy of the result. Some
points are known to be better than others, but there is no systematic method
selecting points for small convolutions. We show that there are a relatively
small number of important cases for DNN convolution, that can be searched
empirically. We compared both standard and modified versions of the Winograd
algorithm. Further, we demonstrate that both the ordering and value of the
points is important, and we propose a canonical evaluation ordering that both
reduces FP error and the size of the search space based on Huffman coding. We
find that good point selections depend on the values of the points themselves
and on symmetries between different points. We show that sets of points with
symmetric groups give better results. In addition, we explore other methods to
reduce FP error, including mixedprecision convolution, and pairwise addition
across DNN channels. Using our methods we can significantly reduce FP error for
a given Winograd convolution block size, which allows larger block sizes and
reduced computation.

Deep neural networks (DNNs) require very large amounts of computation both
for training and for inference when deployed in the field. A common approach to
implementing DNNs is to recast the most computationally expensive operations as
general matrix multiplication (GEMM). However, as we demonstrate in this paper,
there are a great many different ways to express DNN convolution operations
using GEMM. Although different approaches all perform the same number of
operations, the size of temporary data structures differs significantly.
Convolution of an input matrix with dimensions $C \times H \times W$, requires
$O(K^2CHW)$ additional space using the classical im2col approach. More recently
memoryefficient approaches requiring just $O(KCHW)$ auxiliary space have been
proposed.
We present two novel GEMMbased algorithms that require just $O(MHW)$ and
$O(KW)$ additional space respectively, where $M$ is the number of channels in
the result of the convolution. These algorithms dramatically reduce the space
overhead of DNN convolution, making it much more suitable for memorylimited
embedded systems. Experimental evaluation shows that our lowmemory algorithms
are just as fast as the best patchbuilding approaches despite requiring just a
fraction of the amount of additional memory. Our lowmemory algorithms have
excellent data locality which gives them a further edge over patchbuilding
algorithms when multiple cores are used. As a result, our low memory algorithms
often outperform the best patchbuilding algorithms using multiple threads.

Convolutional neural networks (CNNs) have emerged as one of the most
successful machine learning technologies for image and video processing. The
most computationally intensive parts of CNNs are the convolutional layers,
which convolve multichannel images with multiple kernels. A common approach to
implementing convolutional layers is to expand the image into a column matrix
(im2col) and perform Multiple Channel Multiple Kernel (MCMK) convolution using
an existing parallel General Matrix Multiplication (GEMM) library. This im2col
conversion greatly increases the memory footprint of the input matrix and
reduces data locality.
In this paper we propose a new approach to MCMK convolution that is based on
General Matrix Multiplication (GEMM), but not on im2col. Our algorithm
eliminates the need for data replication on the input thereby enabling us to
apply the convolution kernels on the input images directly. We have implemented
several variants of our algorithm on a CPU processor and an embedded ARM
processor. On the CPU, our algorithm is faster than im2col in most cases.

The critical path of a group of tasks is an important measure that is
commonly used to guide task allocation and scheduling on parallel computers.
The critical path is the longest chain of dependencies in an acyclic task
dependence graph. A problem arises on heterogeneous parallel machines where
computation and communication costs can vary between different types of
processor. Existing solutions for heterogeneous machines attempt to estimate
the critical path using average values of computation and communication costs.
However, this ignores opportunities to match specific tasks to specific classes
of processor and communication links, and can result in quite misleading paths
being identified as critical. We argue that an accurate critical path must
consider the mapping of tasks to classes of processor and communication links.
We formulate a polynomial time algorithm to find such a critical path. Our
Critical Earliest Finish Time (CEFT) algorithm finds both the length of the
critical path and an allocation of tasks to processors on that path. We
compared CEFT experimentally to existing approaches such as averaging execution
times across processors. The latter approach fails to accurately model the
execution cost of tasks, and as a result fails to identify a correct critical
path in 83.99% of cases in our experiments. We also adapted a critical
pathoriented scheduling algorithm (CPOP) to use our critical path algorithm
and found that the resulting schedules are faster.

Convolutional Neural Networks (CNNs) are one of the most successful deep
machine learning technologies for processing image, voice and video data. CNNs
require large amounts of processing capacity and memory, which can exceed the
resources of low power mobile and embedded systems. Several designs for
hardware accelerators have been proposed for CNNs which typically contain large
numbers of Multiply Accumulate (MAC) units. One approach to reducing data sizes
and memory traffic in CNN accelerators is "weight sharing", where the full
range of values in a trained CNN are put in bins and the bin index is stored
instead of the original weight value. In this paper we propose a novel MAC
circuit that exploits binning in weightsharing CNNs. Rather than computing the
MAC directly we instead count the frequency of each weight and place it in a
bin. We then compute the accumulated value in a subsequent multiply phase. This
allows hardware multipliers in the MAC circuit to be replaced with adders and
selection logic. Experiments show that for the same clock speed our approach
results in fewer gates, smaller logic, and reduced power.

Previous research has shown that computation of convolution in the frequency
domain provides a significant speedup versus traditional convolution network
implementations. However, this performance increase comes at the expense of
repeatedly computing the transform and its inverse in order to apply other
network operations such as activation, pooling, and dropout. We show,
mathematically, how convolution and activation can both be implemented in the
frequency domain using either the Fourier or Laplace transformation. The main
contributions are a description of spectral activation under the Fourier
transform and a further description of an efficient algorithm for computing
both convolution and activation under the Laplace transform. By computing both
the convolution and activation functions in the frequency domain, we can reduce
the number of transforms required, as well as reducing overall complexity. Our
description of a spectral activation function, together with existing spectral
analogs of other network functions may then be used to compose a fully spectral
implementation of a convolution network.

We propose a scheme for reducedprecision representation of floating point
data on a continuum between IEEE754 floating point types. Our scheme enables
the use of lower precision formats for a reduction in storage space
requirements and data transfer volume. We describe how our scheme can be
accelerated using existing hardware vector units on a generalpurpose processor
(GPP). Exploiting native vector hardware allows us to support reduced precision
floating point with low overhead. We demonstrate that supporting reduced
precision in the compiler as opposed to using a library approach can yield a
low overhead solution for GPPs.

Customizing the precision of data can provide attractive tradeoffs between
accuracy and hardware resources. We propose a novel form of vector computing
aimed at arrays of customprecision floating point data. We represent these
vectors in bitslice format. Bitwise instructions are used to implement
arithmetic circuits in software that operate on customized bitprecision.
Experiments show that this approach can be efficient for vectors of
lowprecision custom floating point types, while providing arbitrary bit
precision.

The subitemset isomorphism problem is really important and there are
excellent practical solutions described in the literature. However, the
computational complexity analysis and classification of the BZ (Bundala and
Zavodny) subitemset isomorphism problem is currently an open problem. In this
paper we prove that checking whether two sorting networks are BZ isomorphic to
each other is GIComplete; the general GI (Graph Isomorphism) problem is known
to be in NP and LWPP, but widely believed to be neither P nor NPComplete;
recent research suggests that the problem is in QP. Moreover, we state the BZ
sorting network isomorphism problem as a general isomorphism problem on
itemsets  because every sorting network is represented by Bundala and
Zavodny as an itemset. The complexity classification presented in this paper
applies sorting networks, as well as the general itemset isomorphism problem.
The main consequence of our work is that currently no polynomialtime algorithm
exists for solving the BZ sorting network subitemset isomorphism problem;
however the CM (Choi and Moon) sorting network isomorphism problem can be
efficiently solved in polynomial time.

The minimal sets within a collection of sets are defined as the ones which do
not have a proper subset within the collection, and the maximal sets are the
ones which do not have a proper superset within the collection. Identifying
extremal sets is a fundamental problem with a widerange of applications in SAT
solvers, datamining and social network analysis. In this paper, we present two
novel improvements of the highquality extremal set identification algorithm,
\textit{AMSLex}, described by Bayardo and Panda. The first technique uses
memoization to improve the execution time of the singlethreaded variant of the
AMSLex, whilst our second improvement uses parallel programming methods. In a
subset of the presented experiments our memoized algorithm executes more than
$400$ times faster than the highly efficient publicly available implementation
of AMSLex. Moreover, we show that our modified algorithm's speedup is not
bounded above by a constant and that it increases as the length of the common
prefixes in successive input \textit{itemsets} increases. We provide
experimental results using both realworld and synthetic data sets, and show
our multithreaded variant algorithm outperforming AMSLex by $3$ to $6$
times. We find that on synthetic input datasets when executed using $16$ CPU
cores of a $32$core machine, our multithreaded program executes about as fast
as the state of the art parallel GPUbased program using an NVIDIA GTX 580
graphics processing unit.

A complete set of filters $F_n$ for the optimaldepth $n$input sorting
network problem is such that if there exists an $n$input sorting network of
depth $d$ then there exists one of the form $C \oplus C'$ for some $C \in F_n$.
Previous work on the topic presents a method for finding complete set of
filters $R_{n, 1}$ and $R_{n, 2}$ that consists only of networks of depths one
and two respectively, whose outputs are minimal and representative up to
permutation and reflection. Our main contribution is a practical approach for
finding a complete set of filters $R_{n, 3}$ containing only networks of depth
three whose outputs are minimal and representative up to permutation and
reflection. In previous work, we have developed a highly efficient algorithm
for finding extremal sets ( i.e. outputs of comparator networks; itemsets; ) up
to permutation. In this paper we present a modification to this algorithm that
identifies the representative itemsets up to permutation and reflection. Hence,
the presented practical approach is the successful combination of known theory
and practice that we apply to the domain of sorting networks. For $n < 17$, we
empirically compute the complete set of filters $R_{n, 2}$, $R_{n, 3}$, $R_{n,
2} \upharpoonright w $ and $R_{n, 3}^w$ of the representative minimal up to
permutation and reflection $n$input networks, where all but $R_{n, 2}$ are
novel to this work.

In this paper we extend the knowledge on the problem of empirically searching
for sorting networks of minimal depth. We present new search space pruning
techniques for the last four levels of a candidate sorting network by
considering only the output set representation of a network. We present an
algorithm for checking whether an $n$input sorting network of depth $d$ exists
by considering the minimal up to permutation and reflection itemsets at each
level and using the pruning at the last four levels. We experimentally
evaluated this algorithm to find the optimal depth sorting networks for all $n
\leq 12$.