• Deep Neural Networks (DNNs) require very large amounts of computation both for training and for inference when deployed in the field. Many different algorithms have been proposed to implement the most computationally expensive layers of DNNs. Further, each of these algorithms has a large number of variants, which offer different trade-offs of parallelism, data locality, memory footprint, and execution time. In addition, specific algorithms operate much more efficiently on specialized data layouts and formats. We state the problem of optimal primitive selection in the presence of data format transformations, and show that it is NP-hard by demonstrating an embedding in the Partitioned Boolean Quadratic Assignment problem (PBQP). We propose an analytic solution via a PBQP solver, and evaluate our approach experimentally by optimizing several popular DNNs using a library of more than 70 DNN primitives, on an embedded platform and a general purpose platform. We show experimentally that significant gains are possible versus the state of the art vendor libraries by using a principled analytic solution to the problem of layout selection in the presence of data format transformations.
  • Convolutional neural networks (CNNs) are one of the most successful machine learning techniques for image, voice and video processing. CNNs require large amounts of processing capacity and memory bandwidth. Hardware accelerators have been proposed for CNNs which typically contain large numbers of multiply-accumulate (MAC) units, the multipliers of which are large in an integrated circuit (IC) gate count and power consumption. "Weight sharing" accelerators have been proposed where the full range of weight values in a trained CNN are compressed and put into bins and the bin index used to access the weight-shared value. We reduce power and area of the CNN by implementing parallel accumulate shared MAC (PASM) in a weight shared CNN. PASM re-architects the MAC to instead count the frequency of each weight and place it in a bin. The accumulated value is computed in a subsequent multiply phase, significantly reducing gate count and power consumption of the CNN. In this paper, we implement PASM in a weight-shared CNN convolution hardware accelerator and analyze its effectiveness. Experiments show that for a clock speed 1GHz implemented on a 45nm ASIC process our approach results in fewer gates, smaller logic, and reduced power with only a slight increase in latency. We also show that the same weight-shared-with-PASM CNN accelerator can be implemented in resource-constrained FPGAs, where the FPGA has limited numbers of digital signal processor (DSP) units to accelerate the MAC operations.
  • Modern deep neural networks (DNNs) spend a large amount of their execution time computing convolutions. Winograd's minimal algorithm for small convolutions can greatly reduce the number of arithmetic operations. However, a large reduction in floating point (FP) operations in these algorithms can result in significantly reduced FP accuracy of the result. In this paper we propose several methods for reducing the FP error of these algorithms. Minimal convolution algorithms depend on the selection of several numeric \textit{points} that have a large impact on the accuracy of the result. Some points are known to be better than others, but there is no systematic method selecting points for small convolutions. We show that there are a relatively small number of important cases for DNN convolution, that can be searched empirically. We compared both standard and modified versions of the Winograd algorithm. Further, we demonstrate that both the ordering and value of the points is important, and we propose a canonical evaluation ordering that both reduces FP error and the size of the search space based on Huffman coding. We find that good point selections depend on the values of the points themselves and on symmetries between different points. We show that sets of points with symmetric groups give better results. In addition, we explore other methods to reduce FP error, including mixed-precision convolution, and pairwise addition across DNN channels. Using our methods we can significantly reduce FP error for a given Winograd convolution block size, which allows larger block sizes and reduced computation.
  • Deep neural networks (DNNs) require very large amounts of computation both for training and for inference when deployed in the field. A common approach to implementing DNNs is to recast the most computationally expensive operations as general matrix multiplication (GEMM). However, as we demonstrate in this paper, there are a great many different ways to express DNN convolution operations using GEMM. Although different approaches all perform the same number of operations, the size of temporary data structures differs significantly. Convolution of an input matrix with dimensions $C \times H \times W$, requires $O(K^2CHW)$ additional space using the classical im2col approach. More recently memory-efficient approaches requiring just $O(KCHW)$ auxiliary space have been proposed. We present two novel GEMM-based algorithms that require just $O(MHW)$ and $O(KW)$ additional space respectively, where $M$ is the number of channels in the result of the convolution. These algorithms dramatically reduce the space overhead of DNN convolution, making it much more suitable for memory-limited embedded systems. Experimental evaluation shows that our low-memory algorithms are just as fast as the best patch-building approaches despite requiring just a fraction of the amount of additional memory. Our low-memory algorithms have excellent data locality which gives them a further edge over patch-building algorithms when multiple cores are used. As a result, our low memory algorithms often outperform the best patch-building algorithms using multiple threads.
  • Convolutional neural networks (CNNs) have emerged as one of the most successful machine learning technologies for image and video processing. The most computationally intensive parts of CNNs are the convolutional layers, which convolve multi-channel images with multiple kernels. A common approach to implementing convolutional layers is to expand the image into a column matrix (im2col) and perform Multiple Channel Multiple Kernel (MCMK) convolution using an existing parallel General Matrix Multiplication (GEMM) library. This im2col conversion greatly increases the memory footprint of the input matrix and reduces data locality. In this paper we propose a new approach to MCMK convolution that is based on General Matrix Multiplication (GEMM), but not on im2col. Our algorithm eliminates the need for data replication on the input thereby enabling us to apply the convolution kernels on the input images directly. We have implemented several variants of our algorithm on a CPU processor and an embedded ARM processor. On the CPU, our algorithm is faster than im2col in most cases.
  • The critical path of a group of tasks is an important measure that is commonly used to guide task allocation and scheduling on parallel computers. The critical path is the longest chain of dependencies in an acyclic task dependence graph. A problem arises on heterogeneous parallel machines where computation and communication costs can vary between different types of processor. Existing solutions for heterogeneous machines attempt to estimate the critical path using average values of computation and communication costs. However, this ignores opportunities to match specific tasks to specific classes of processor and communication links, and can result in quite misleading paths being identified as critical. We argue that an accurate critical path must consider the mapping of tasks to classes of processor and communication links. We formulate a polynomial time algorithm to find such a critical path. Our Critical Earliest Finish Time (CEFT) algorithm finds both the length of the critical path and an allocation of tasks to processors on that path. We compared CEFT experimentally to existing approaches such as averaging execution times across processors. The latter approach fails to accurately model the execution cost of tasks, and as a result fails to identify a correct critical path in 83.99% of cases in our experiments. We also adapted a critical path-oriented scheduling algorithm (CPOP) to use our critical path algorithm and found that the resulting schedules are faster.
  • Convolutional Neural Networks (CNNs) are one of the most successful deep machine learning technologies for processing image, voice and video data. CNNs require large amounts of processing capacity and memory, which can exceed the resources of low power mobile and embedded systems. Several designs for hardware accelerators have been proposed for CNNs which typically contain large numbers of Multiply Accumulate (MAC) units. One approach to reducing data sizes and memory traffic in CNN accelerators is "weight sharing", where the full range of values in a trained CNN are put in bins and the bin index is stored instead of the original weight value. In this paper we propose a novel MAC circuit that exploits binning in weight-sharing CNNs. Rather than computing the MAC directly we instead count the frequency of each weight and place it in a bin. We then compute the accumulated value in a subsequent multiply phase. This allows hardware multipliers in the MAC circuit to be replaced with adders and selection logic. Experiments show that for the same clock speed our approach results in fewer gates, smaller logic, and reduced power.
  • Previous research has shown that computation of convolution in the frequency domain provides a significant speedup versus traditional convolution network implementations. However, this performance increase comes at the expense of repeatedly computing the transform and its inverse in order to apply other network operations such as activation, pooling, and dropout. We show, mathematically, how convolution and activation can both be implemented in the frequency domain using either the Fourier or Laplace transformation. The main contributions are a description of spectral activation under the Fourier transform and a further description of an efficient algorithm for computing both convolution and activation under the Laplace transform. By computing both the convolution and activation functions in the frequency domain, we can reduce the number of transforms required, as well as reducing overall complexity. Our description of a spectral activation function, together with existing spectral analogs of other network functions may then be used to compose a fully spectral implementation of a convolution network.
  • We propose a scheme for reduced-precision representation of floating point data on a continuum between IEEE-754 floating point types. Our scheme enables the use of lower precision formats for a reduction in storage space requirements and data transfer volume. We describe how our scheme can be accelerated using existing hardware vector units on a general-purpose processor (GPP). Exploiting native vector hardware allows us to support reduced precision floating point with low overhead. We demonstrate that supporting reduced precision in the compiler as opposed to using a library approach can yield a low overhead solution for GPPs.
  • Customizing the precision of data can provide attractive trade-offs between accuracy and hardware resources. We propose a novel form of vector computing aimed at arrays of custom-precision floating point data. We represent these vectors in bitslice format. Bitwise instructions are used to implement arithmetic circuits in software that operate on customized bit-precision. Experiments show that this approach can be efficient for vectors of low-precision custom floating point types, while providing arbitrary bit precision.
  • The subitemset isomorphism problem is really important and there are excellent practical solutions described in the literature. However, the computational complexity analysis and classification of the BZ (Bundala and Zavodny) subitemset isomorphism problem is currently an open problem. In this paper we prove that checking whether two sorting networks are BZ isomorphic to each other is GI-Complete; the general GI (Graph Isomorphism) problem is known to be in NP and LWPP, but widely believed to be neither P nor NP-Complete; recent research suggests that the problem is in QP. Moreover, we state the BZ sorting network isomorphism problem as a general isomorphism problem on itemsets --- because every sorting network is represented by Bundala and Zavodny as an itemset. The complexity classification presented in this paper applies sorting networks, as well as the general itemset isomorphism problem. The main consequence of our work is that currently no polynomial-time algorithm exists for solving the BZ sorting network subitemset isomorphism problem; however the CM (Choi and Moon) sorting network isomorphism problem can be efficiently solved in polynomial time.
  • The minimal sets within a collection of sets are defined as the ones which do not have a proper subset within the collection, and the maximal sets are the ones which do not have a proper superset within the collection. Identifying extremal sets is a fundamental problem with a wide-range of applications in SAT solvers, data-mining and social network analysis. In this paper, we present two novel improvements of the high-quality extremal set identification algorithm, \textit{AMS-Lex}, described by Bayardo and Panda. The first technique uses memoization to improve the execution time of the single-threaded variant of the AMS-Lex, whilst our second improvement uses parallel programming methods. In a subset of the presented experiments our memoized algorithm executes more than $400$ times faster than the highly efficient publicly available implementation of AMS-Lex. Moreover, we show that our modified algorithm's speedup is not bounded above by a constant and that it increases as the length of the common prefixes in successive input \textit{itemsets} increases. We provide experimental results using both real-world and synthetic data sets, and show our multi-threaded variant algorithm out-performing AMS-Lex by $3$ to $6$ times. We find that on synthetic input datasets when executed using $16$ CPU cores of a $32$-core machine, our multi-threaded program executes about as fast as the state of the art parallel GPU-based program using an NVIDIA GTX 580 graphics processing unit.
  • A complete set of filters $F_n$ for the optimal-depth $n$-input sorting network problem is such that if there exists an $n$-input sorting network of depth $d$ then there exists one of the form $C \oplus C'$ for some $C \in F_n$. Previous work on the topic presents a method for finding complete set of filters $R_{n, 1}$ and $R_{n, 2}$ that consists only of networks of depths one and two respectively, whose outputs are minimal and representative up to permutation and reflection. Our main contribution is a practical approach for finding a complete set of filters $R_{n, 3}$ containing only networks of depth three whose outputs are minimal and representative up to permutation and reflection. In previous work, we have developed a highly efficient algorithm for finding extremal sets ( i.e. outputs of comparator networks; itemsets; ) up to permutation. In this paper we present a modification to this algorithm that identifies the representative itemsets up to permutation and reflection. Hence, the presented practical approach is the successful combination of known theory and practice that we apply to the domain of sorting networks. For $n < 17$, we empirically compute the complete set of filters $R_{n, 2}$, $R_{n, 3}$, $R_{n, 2} \upharpoonright w $ and $R_{n, 3}^w$ of the representative minimal up to permutation and reflection $n$-input networks, where all but $R_{n, 2}$ are novel to this work.
  • In this paper we extend the knowledge on the problem of empirically searching for sorting networks of minimal depth. We present new search space pruning techniques for the last four levels of a candidate sorting network by considering only the output set representation of a network. We present an algorithm for checking whether an $n$-input sorting network of depth $d$ exists by considering the minimal up to permutation and reflection itemsets at each level and using the pruning at the last four levels. We experimentally evaluated this algorithm to find the optimal depth sorting networks for all $n \leq 12$.