Efficient Action Recognition Using Confidence Distillation
Abstract
Modern neural networks are powerful predictive models. However, when it comes
to recognizing that they may be wrong about their predictions, they perform
poorly. For example, for one of the most common activation functions, the ReLU
and its variants, even a well-calibrated model can produce incorrect but high
confidence predictions. In the related task of action recognition, most current
classification methods are based on clip-level classifiers that densely sample
a given video for non-overlapping, same-sized clips and aggregate the results
using an aggregation function - typically averaging - to achieve video level
predictions. While this approach has shown to be effective, it is sub-optimal
in recognition accuracy and has a high computational overhead. To mitigate both
these issues, we propose the confidence distillation framework to teach a
representation of uncertainty of the teacher to the student sampler and divide
the task of full video prediction between the student and the teacher models.
We conduct extensive experiments on three action recognition datasets and
demonstrate that our framework achieves significant improvements in action
recognition accuracy (up to 20%) and computational efficiency (more than 40%).