HW/SW Collaborative Techniques for Accelerating TinyML Inference Time at No Cost

Abstract

With the unprecedented boom in TinyML development, optimizing Artificial Intelligence (AI) inference on resource-constrained microcontrollers (MCUs) is of paramount importance. Most existing works focus on reducing peak memory or computation, partitioning tasks in a patch-based or device-based manner during execution; however, this comes at the price of latency and communication overhead. In this paper, we propose several techniques to accelerate the Convolutional Neural Network (CNN) inference process. These techniques are both architecture- and application-aware. From the application perspective, we 1) maximize computation reuse through instruction reordering, 2) fuse several linear layers together to improve computation patterns, and 3) enable memory reuse of intermediate buffers to improve memory behavior. From the architecture perspective, we propose techniques that take into account knowledge of the MCU's underlying architecture, including 1) cache-aware and 2) multi-core parallelism-aware techniques. These solutions rely only on general MCU features and therefore generalize broadly across various networks and devices. They come at no additional cost: they improve inference latency without compromising model accuracy or model size. Our evaluation on a use case from the healthcare domain with a real data set for four CNNs - LeNet, AlexNet, ResNet20, and SqueezeNet - shows that we achieve up to a 71% reduction in inference latency.
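
As a concrete illustration of the layer-fusion technique mentioned in the abstract, the sketch below folds a batch-normalization layer into the weights and bias of the preceding convolution, so the fused pair runs as a single convolution at inference time. This is a minimal example in C under assumed conventions, not the paper's implementation: the function name fuse_conv_bn, the flat per-output-channel weight layout, and all parameter names are hypothetical.

#include <math.h>
#include <stddef.h>

/* Hypothetical sketch: fold BatchNorm parameters (gamma, beta, mean, var)
 * into the weights and bias of the preceding convolution, so that
 *   BN(Conv(x)) == Conv'(x)
 * where Conv' uses the fused weights and bias. Assumed layout: weights are
 * flattened per output channel, with one BN parameter set per channel. */
void fuse_conv_bn(float *weights, float *bias,
                  size_t out_ch, size_t weights_per_ch,
                  const float *gamma, const float *beta,
                  const float *mean, const float *var, float eps)
{
    for (size_t oc = 0; oc < out_ch; ++oc) {
        /* Per-channel scale applied by the BN layer. */
        float scale = gamma[oc] / sqrtf(var[oc] + eps);

        /* Scale every weight feeding this output channel. */
        for (size_t i = 0; i < weights_per_ch; ++i)
            weights[oc * weights_per_ch + i] *= scale;

        /* Fold the BN shift into the convolution bias. */
        bias[oc] = scale * (bias[oc] - mean[oc]) + beta[oc];
    }
}

Because the folding is purely algebraic, it leaves both accuracy and model size untouched; only the run-time computation pattern changes, consistent with the paper's "no cost" claim.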

Authors

Sun B; Hassan M

Volume

00

Pagination

pp. 512-520

Publisher

Institute of Electrical and Electronics Engineers (IEEE)

Publication Date

August 30, 2024

DOI

10.1109/dsd64264.2024.00074

Name of conference

2024 27th Euromicro Conference on Digital System Design (DSD)