Karatsuba Matrix Multiplication and its Efficient Custom Hardware Implementations
Abstract
While the Karatsuba algorithm reduces the complexity of large integer
multiplication, the extra additions required minimize its benefits for smaller
integers of more commonly-used bitwidths. In this work, we propose the
extension of the scalar Karatsuba multiplication algorithm to matrix
multiplication, showing how this maintains the reduction in multiplication
complexity of the original Karatsuba algorithm while reducing the complexity of
the extra additions. Furthermore, we propose new matrix multiplication hardware
architectures for efficiently exploiting this extension of the Karatsuba
algorithm in custom hardware. We show that the proposed algorithm and hardware
architectures can provide real area or execution time improvements for integer
matrix multiplication compared to scalar Karatsuba or conventional matrix
multiplication algorithms, while also supporting implementation through proven
systolic array and conventional multiplier architectures at the core. We
provide a complexity analysis of the algorithm and architectures and evaluate
the proposed designs both in isolation and in an end-to-end deep learning
accelerator system compared to baseline designs and prior state-of-the-art
works implemented on the same type of compute platform, demonstrating their
ability to increase the performance-per-area of matrix multiplication hardware.