
A tight lower bound for required I/O when computing an ordinary matrixmatrix
multiplication on a processor with two layers of memory is established. Prior
work obtained weaker lower bounds by reasoning about the number of segments
needed to perform $C:=AB$, for distinct matrices $A$, $B$, and $C$, where each
segment is a series of operations involving $M$ reads and writes to and from
fast memory, and $M$ is the size of fast memory. A lower bound on the number of
segments was then determined by obtaining an upper bound on the number of
elementary multiplications performed per segment. This paper follows the same
high level approach, but improves the lower bound by (1) transforming
algorithms for MMM so that they perform all computation via fused multiplyadd
instructions (FMAs) and using this to reason about only the cost associated
with reading the matrices, and (2) decoupling the persegment I/O cost from the
size of fast memory. For $n \times n$ matrices, the lower bound's leadingorder
term is $2n^3/\sqrt{M}$. A theoretical algorithm whose leading terms attains
this is introduced. To what extent the stateoftheart Goto's Algorithm
attains the lower bound is discussed.

Dijkstra observed that verifying correctness of a program is difficult and
conjectured that derivation of a program handinhand with its proof of
correctness was the answer. We illustrate this goaloriented approach by
applying it to the domain of dense linear algebra libraries for distributed
memory parallel computers. We show that algorithms that underlie the
implementation of most functionality for this domain can be systematically
derived to be correct. The benefit is that an entire family of algorithms for
an operation is discovered so that the best algorithm for a given architecture
can be chosen. This approach is very practical: Ideas inspired by it have been
used to rewrite the dense linear algebra software stack starting below the
Basic Linear Algebra Subprograms (BLAS) and reaching up through the Elemental
distributed memory library, and every level in between. The paper demonstrates
how formal methods and rigorous mathematical techniques for correctness impact
HPC.

Tensor contraction (TC) is an important computational kernel widely used in
numerous applications. It is a multidimensional generalization of matrix
multiplication (GEMM). While Strassen's algorithm for GEMM is well studied in
theory and practice, extending it to accelerate TC has not been previously
pursued. Thus, we believe this to be the first paper to demonstrate how one can
in practice speed up tensor contraction with Strassen's algorithm. By adopting
a BlockScatterMatrix format, a novel matrixcentric tensor layout, we can
conceptually view TC as GEMM for a general stride storage, with an implicit
tensortomatrix transformation. This insight enables us to tailor a recent
stateoftheart implementation of Strassen's algorithm to TC, avoiding
explicit transpositions (permutations) and extra workspace, and reducing the
overhead of memory movement that is incurred. Performance benefits are
demonstrated with a performance model as well as in practice on modern single
core, multicore, and distributed memory parallel architectures, achieving up to
1.3x speedup. The resulting implementations can serve as a dropin replacement
for various applications with significant speedup.

Matrix multiplication (GEMM) is a core operation to numerous scientific
applications. Traditional implementations of Strassenlike fast matrix
multiplication (FMM) algorithms often do not perform well except for very large
matrix sizes, due to the increased cost of memory movement, which is
particularly noticeable for nonsquare matrices. Such implementations also
require considerable workspace and modifications to the standard BLAS
interface. We propose a code generator framework to automatically implement a
large family of FMM algorithms suitable for multiplications of arbitrary matrix
sizes and shapes. By representing FMM with a triple of matrices [U,V,W] that
capture the linear combinations of submatrices that are formed, we can use the
Kronecker product to define a multilevel representation of Strassenlike
algorithms. Incorporating the matrix additions that must be performed for
Strassenlike algorithms into the inherent packing and microkernel operations
inside GEMM avoids extra workspace and reduces the cost of memory movement.
Adopting the same loop structures as highperformance GEMM implementations
allows parallelization of all FMM algorithms with simple but efficient data
parallelism without the overhead of task parallelism. We present a simple
performance model for general FMM algorithms and compare actual performance of
20+ FMM algorithms to modeled predictions. Our implementations demonstrate a
performance benefit over conventional GEMM on single core and multicore
systems. This study shows that Strassenlike fast matrix multiplication can be
incorporated into libraries for practical use.

Matrixmatrix multiplication is a fundamental operation of great importance
to scientific computing and, increasingly, machine learning. It is a simple
enough concept to be introduced in a typical high school algebra course yet in
practice important enough that its implementation on computers continues to be
an active research topic. This note describes a set of exercises that use this
operation to illustrate how high performance can be attained on modern CPUs
with hierarchical memories (multiple caches). It does so by building on the
insights that underly the BLASlike Library Instantiation Software (BLIS)
framework by exposing a simplified "sandbox" that mimics the implementation in
BLIS. As such, it also becomes a vehicle for the "crowd sourcing" of the
optimization of BLIS. We call this set of exercises BLISlab.

We dispel with "street wisdom" regarding the practical implementation of
Strassen's algorithm for matrixmatrix multiplication (DGEMM). Conventional
wisdom: it is only practical for very large matrices. Our implementation is
practical for small matrices. Conventional wisdom: the matrices being
multiplied should be relatively square. Our implementation is practical for
rankk updates, where k is relatively small (a shape of importance for
libraries like LAPACK). Conventional wisdom: it inherently requires substantial
workspace. Our implementation requires no workspace beyond buffers already
incorporated into conventional highperformance DGEMM implementations.
Conventional wisdom: a Strassen DGEMM interface must pass in workspace. Our
implementation requires no such workspace and can be plugcompatible with the
standard DGEMM interface. Conventional wisdom: it is hard to demonstrate
speedup on multicore architectures. Our implementation demonstrates speedup
over conventional DGEMM even on an Intel(R) Xeon Phi(TM) coprocessor utilizing
240 threads. We show how a distributed memory matrixmatrix multiplication also
benefits from these advances.

Symmetric tensor operations arise in a wide variety of computations. However,
the benefits of exploiting symmetry in order to reduce storage and computation
is in conflict with a desire to simplify memory access patterns. In this paper,
we propose a blocked data structure (Blocked Compact Symmetric Storage) wherein
we consider the tensor by blocks and store only the unique blocks of a
symmetric tensor. We propose an algorithmbyblocks, already shown of benefit
for matrix computations, that exploits this storage format by utilizing a
series of temporary tensors to avoid redundant computation. Further, partial
symmetry within temporaries is exploited to further avoid redundant storage and
redundant computation. A detailed analysis shows that, relative to storing and
computing with tensors without taking advantage of symmetry and partial
symmetry, storage requirements are reduced by a factor of $ O\left( m! \right)$
and computational requirements by a factor of $O\left( (m+1)!/2^m \right)$,
where $ m $ is the order of the tensor. However, as the analysis shows, care
must be taken in choosing the correct block size to ensure these storage and
computational benefits are achieved (particularly for loworder tensors). An
implementation demonstrates that storage is greatly reduced and the complexity
introduced by storing and computing with tensors by blocks is manageable.
Preliminary results demonstrate that computational time is also reduced. The
paper concludes with a discussion of how insights in this paper point to
opportunities for generalizing recent advances in the domain of linear algebra
libraries to the field of multilinear computation.