DeepScale: Online Frame Size Adaptation for Multi-object Tracking on Smart Cameras and Edge Servers
Abstract
In surveillance and search-and-rescue applications, it is important to
perform multi-object tracking (MOT) in real time on low-end devices. Today's
MOT solutions employ deep neural networks, which tend to have high
computational complexity. Recognizing the effects of frame size on tracking
performance, we propose DeepScale, a model-agnostic frame-size selection
approach that operates on top of existing fully convolutional network-based
trackers to increase tracking throughput. In the training stage, we
incorporate detectability scores into a one-shot tracker architecture so that
DeepScale can learn representation estimation for different frame sizes in a
self-supervised manner. During inference, it can adapt frame sizes to the
complexity of visual content according to user-controlled parameters. To
leverage computation resources
on edge servers, we propose two computation partition schemes tailored for MOT,
namely, edge-server-only tracking with adaptive frame-size transmission and
edge-server-assisted tracking. Extensive experiments and benchmark tests on MOT
datasets demonstrate the effectiveness and flexibility of DeepScale. Compared
to a state-of-the-art tracker, DeepScale++, a variant of DeepScale, achieves a
1.57X speedup with only moderate degradation (~2.3%) in tracking accuracy on
the MOT15 dataset in one configuration. We have implemented and evaluated
DeepScale++ and the proposed computation partition schemes on a small-scale
testbed consisting of an NVIDIA Jetson TX2 board and a GPU server. The
experiments reveal non-trivial trade-offs between tracking performance and
latency relative to server-only and smart camera-only solutions.
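
To make the frame-size adaptation idea concrete, below is a minimal, hypothetical sketch (not the authors' code) of content-aware frame-size selection at inference time. The names `detectability`, `FRAME_SIZES`, and `alpha` are illustrative assumptions: the predictor stands in for the self-supervised detectability head described above, and `alpha` for the user-controlled accuracy/latency parameter.

```python
"""Minimal, hypothetical sketch of content-aware frame-size selection.

This is NOT the DeepScale implementation. `detectability`, `FRAME_SIZES`,
and `alpha` are illustrative assumptions standing in for the self-supervised
detectability head and the user-controlled accuracy/latency knob.
"""
from typing import Callable, List, Tuple

import numpy as np

# Candidate input resolutions, largest (most accurate, slowest) first.
FRAME_SIZES: List[Tuple[int, int]] = [(1088, 608), (864, 480), (704, 384), (576, 320)]


def select_frame_size(
    frame: np.ndarray,
    detectability: Callable[[np.ndarray, Tuple[int, int]], float],
    alpha: float = 0.9,
) -> Tuple[int, int]:
    """Return the smallest candidate size whose predicted detectability
    stays within a fraction `alpha` of the full-resolution score."""
    full_score = detectability(frame, FRAME_SIZES[0])
    chosen = FRAME_SIZES[0]
    for size in FRAME_SIZES[1:]:  # try progressively smaller sizes
        if detectability(frame, size) >= alpha * full_score:
            chosen = size  # smaller input still predicted "good enough"
        else:
            break  # assume detectability decreases with size; stop at first failure
    return chosen


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Toy stand-in predictor: larger inputs score higher, and visually
    # "busier" frames (higher pixel variance) are penalized more.
    def fake_detectability(frame: np.ndarray, size: Tuple[int, int]) -> float:
        w, h = size
        return (w * h) ** 0.25 / (1.0 + frame.var() / 1e4)

    frame = rng.integers(0, 255, size=(608, 1088, 3), dtype=np.uint8)
    print("selected frame size:", select_frame_size(frame, fake_detectability))
```

With `alpha` close to 1 the selector favors accuracy (larger frames); lowering it trades accuracy for throughput, mirroring the user-controlled parameters mentioned in the abstract.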