Refining activation downsampling with SoftPool
- SoftPool is proposed as a replacement for average pooling or max pooling that retains more information in the downsampled activation maps.
- By using SoftPool, the performance of image classification and action recognition is improved.
SoftPool is influenced by the cortex neural simulations of Riesenhuber and Poggio, as well as by the early pooling experiments with hand-coded features of Boureau et al.
- The proposed method is based on the natural exponent (e) which ensures that large activation values will have greater effect on the output. The operation is differentiable.
Each activation a_i with index i is assigned a weight w_i, calculated as the ratio of the natural exponent of that activation to the sum of the natural exponents of all activations within the kernel neighborhood R:
$$ w_i = \frac{e^{a_i}}{\sum_{j \in R} e^{a_j}} $$
- i.e. softmax operation.
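As a minimal sketch of the weighting step just described, the snippet below computes the softmax weights for one flattened neighborhood R. The function name `softpool_weights` and the sample values are illustrative assumptions, not taken from the paper; subtracting the maximum before exponentiating is a common numerical-stability choice that cancels out in the ratio.

```python
import math


def softpool_weights(region):
    """Softmax weights w_i = exp(a_i) / sum_j exp(a_j) for a flat list of activations.

    `region` is a hypothetical flattened kernel neighborhood R.
    """
    # Shifting by the max avoids overflow and does not change the ratio.
    shift = max(region)
    exps = [math.exp(a - shift) for a in region]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative 2x2 neighborhood, flattened.
weights = softpool_weights([1.0, 2.0, 3.0, 4.0])
```

The weights sum to one, and larger activations receive larger weights, which is the behavior the exponential is meant to provide.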
The output value of the SoftPool operation is the sum of all weighted activations within the kernel neighborhood R:
$$ \tilde{a} = \sum_{i \in R} w_i \, a_i $$
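The weighted summation can be sketched for a single-channel 2D feature map as follows. This is a plain-Python illustration under stated assumptions (no padding, a square kernel, nested lists as the feature map), not the authors' implementation; `softpool_region`, `softpool2d`, and the sample map are hypothetical names and values.

```python
import math


def softpool_region(region):
    """Softmax-weighted sum of one flattened neighborhood."""
    shift = max(region)  # stability shift; cancels in the ratio
    exps = [math.exp(a - shift) for a in region]
    return sum(a * e for a, e in zip(region, exps)) / sum(exps)


def softpool2d(fmap, kernel=2, stride=2):
    """Apply SoftPool over a 2D map (list of rows), without padding."""
    rows, cols = len(fmap), len(fmap[0])
    out = []
    for r in range(0, rows - kernel + 1, stride):
        out_row = []
        for c in range(0, cols - kernel + 1, stride):
            # Gather the kernel x kernel neighborhood anchored at (r, c).
            region = [fmap[r + dr][c + dc]
                      for dr in range(kernel) for dc in range(kernel)]
            out_row.append(softpool_region(region))
        out.append(out_row)
    return out


fmap = [
    [1.0, 2.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0],
    [5.0, 5.0, 1.0, 9.0],
    [5.0, 5.0, 2.0, 3.0],
]
pooled = softpool2d(fmap)  # 2x2 output
```

For a constant neighborhood the output equals that constant, and for a mixed neighborhood it lies between the region's mean and its maximum.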
- SoftPool extends easily to a 3D version, e.g. for action recognition.
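A rough sketch of the 3D extension: the neighborhood simply gains a temporal dimension, and the same softmax weighting is applied over it. The layout `clip[t][h][w]`, the function name `softpool3d`, and the default kernel and stride are assumptions made for illustration only.

```python
import math


def softpool3d(clip, kernel=(2, 2, 2), stride=(2, 2, 2)):
    """SoftPool over a single-channel clip indexed as clip[t][h][w], without padding."""
    kt, kh, kw = kernel
    st, sh, sw = stride
    frames, height, width = len(clip), len(clip[0]), len(clip[0][0])
    out = []
    for t in range(0, frames - kt + 1, st):
        plane = []
        for r in range(0, height - kh + 1, sh):
            row = []
            for c in range(0, width - kw + 1, sw):
                # Spatio-temporal neighborhood of kt * kh * kw activations.
                region = [clip[t + dt][r + dr][c + dc]
                          for dt in range(kt)
                          for dr in range(kh)
                          for dc in range(kw)]
                shift = max(region)
                exps = [math.exp(a - shift) for a in region]
                row.append(sum(a * e for a, e in zip(region, exps)) / sum(exps))
            plane.append(row)
        out.append(plane)
    return out


# Toy clip: 2 frames of 4x4 activations.
clip = [[[float(t + r + c) for c in range(4)] for r in range(4)] for t in range(2)]
pooled = softpool3d(clip)  # shape 1 x 2 x 2
```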
1.2. Comparisons with Other Pooling Variants
- Several other pooling variants exist, and SoftPool is compared against them.
- In the max pooling case, discarding the majority of the activations presents the risk of losing important information.
- Conversely, average pooling gives every activation an equal contribution, which can reduce the local intensity of the overall regional feature.
Average pooling decreases the effect of all activations in the region equally, while max pooling selects only the single highest activation in the region.
SoftPool falls between the two: all activations in the region contribute to the final output, with higher activations being more dominant than lower ones. This balances the effects of average and max pooling while leveraging the beneficial properties of both.
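To make the comparison concrete, the sketch below applies the three operations to one illustrative neighborhood containing a single strong activation. The helper names and the sample values are assumptions for demonstration; only the three pooling definitions themselves come from the discussion above.

```python
import math


def max_pool(region):
    """Keep only the single highest activation."""
    return max(region)


def avg_pool(region):
    """Weight every activation equally."""
    return sum(region) / len(region)


def soft_pool(region):
    """Softmax-weighted sum: all activations contribute, larger ones more."""
    shift = max(region)  # stability shift; cancels in the ratio
    exps = [math.exp(a - shift) for a in region]
    return sum(a * e for a, e in zip(region, exps)) / sum(exps)


region = [0.1, 0.2, 0.3, 4.0]  # one strong activation among weak ones
results = {
    "avg": avg_pool(region),
    "soft": soft_pool(region),
    "max": max_pool(region),
}
```

Because the softmax weights increase with the activation value, the SoftPool output is never below the average and never above the maximum of the region.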
2.1. SSIM & PSNR
- Figure 5: The proposed SoftPool method can represent regions with borders between low and high frequencies better than other methods.
Table 1: SoftPool outperforms all other methods by a reasonable margin.
SoftPool achieves low inference times for both CPU and CUDA-based operations, while remaining memory-efficient.
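For reference, PSNR (one of the two metrics in this subsection's heading) follows the standard definition 10 * log10(MAX^2 / MSE). The sketch below is a generic implementation of that formula for two equally sized grayscale grids; how the paper pairs pooled outputs with reference images is not covered here, and the function name `psnr` and the 8-bit `max_value` default are assumptions.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized 2D grids."""
    flat_ref = [p for row in reference for p in row]
    flat_test = [p for row in test for p in row]
    if len(flat_ref) != len(flat_test):
        raise ValueError("inputs must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_test)) / len(flat_ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [[10.0, 20.0], [30.0, 40.0]]
shifted = [[11.0, 21.0], [31.0, 41.0]]  # every pixel off by one
score = psnr(reference, shifted)
```

Higher values indicate a closer match to the reference, and identical inputs give an infinite score.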
2.2. Image Classification
Networks trained from scratch with pooling layers replaced by SoftPool yield consistent accuracy improvements over the original networks.
- The same trend holds for pre-trained networks, i.e. models that were originally trained with their own pooling methods.
SoftPool does not include trainable variables and thus does not affect the number of parameters.
2.3. Action Recognition
For the SRTG model with ResNet-(2+1) backbone, the network with the proposed SoftPool achieves state-of-the-art performance on HACS. With spatio-temporal data as well, SoftPool does not add computational complexity (GFLOPs).
- SRTG r3d-101 with SoftPool is the best performing model with a top-1 accuracy of 98.06% and top-5 of 99.82%.