While Strassen's matrix multiplication algorithm reduces the complexity of
naive matrix multiplication, general-purpose hardware is not suitable for
achieving the algorithm's promised theoretical speedups. This leaves open the
question of whether the algorithm could be better exploited in custom hardware
architectures designed specifically for executing it. However, prior work in
this area is limited, and it is not immediately clear how to derive such
architectures or whether they can ultimately deliver real improvements. We bridge
this gap, presenting and evaluating new systolic array architectures that
efficiently translate the theoretical complexity reductions of Strassen's
algorithm directly into hardware resource savings. Furthermore, the
architectures are multisystolic array designs that can multiply smaller
matrices with higher utilization than single-systolic array designs. The
proposed designs implemented on FPGA reduce DSP requirements by a factor of
$1.14^r$ for $r$ implemented Strassen recursion levels, and otherwise require
overall similar soft logic resources when instantiated to support matrix sizes
down to $32\times32$ and $24\times24$ at one and two levels of Strassen
recursion, respectively. We
evaluate the proposed designs both in isolation and within an end-to-end machine
learning accelerator, comparing against baseline designs and prior work, and
achieve state-of-the-art performance.
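As a reminder of where the $1.14^r$ factor originates (a sketch of the textbook algorithm, not the paper's hardware mapping): one level of Strassen recursion replaces the eight block multiplications of the naive $2\times2$ block method with seven, so $r$ levels cut the multiplier count by $(8/7)^r \approx 1.14^r$.

```python
import numpy as np

def strassen_one_level(A, B):
    """Multiply two even-sized square matrices using one Strassen level.

    Illustration only: the inner block products fall back to NumPy's
    built-in matrix multiply rather than recursing further.
    """
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    # Seven block products instead of the naive eight.
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    # Recombine the seven products into the four output blocks.
    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8))
assert np.allclose(strassen_one_level(A, B), A @ B)
# Multiplier savings per recursion level: 8/7 ~ 1.14
print(round(8 / 7, 2))  # → 1.14
```

The extra block additions and subtractions are the soft-logic cost that the abstract notes stays roughly comparable, while the reduction in multiplications is what maps onto DSP savings.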