Review — S3D, S3D-G: Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification

S3D: Using Separable 3D Convolution; S3D-G: Further Improved With Spatio-Temporal Feature Gating

Sik-Ho Tsang
May 20, 2022

Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification, S3D, S3D-G, by Google Research, and University of California San Diego
2018 ECCV, Over 700 Citations (Sik-Ho Tsang @ Medium)
Video Classification, Action Recognition

  • 3D CNNs convolve jointly over time and space, but they are much more expensive than 2D CNNs and prone to overfitting.
  • The paper finds that many of the 3D convolutions, particularly those in the early (bottom) layers, can be replaced by low-cost 2D convolutions.

Outline

  1. I2D, I3D, Bottom-Heavy I3D, Top-Heavy I3D
  2. Separable 3D CNN (S3D)
  3. Experimental Results

1. I2D, I3D, Bottom-Heavy I3D, Top-Heavy I3D

4 main variants for video classification
  1. I2D, which is a 2D CNN, operating on multiple frames.
  2. I3D, which is a 3D CNN, convolving over space and time.
  3. Bottom-Heavy I3D, which uses 3D in the lower layers, and 2D in the higher layers.
  4. Top-Heavy I3D, which uses 2D in the lower (larger) layers, and 3D in the upper layers.
  • The architecture details are shown below:
4 main variants for video classification
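The four variants differ only in which layers convolve over time. A minimal sketch of that layout, assuming a simple stack of layers ordered from input (bottom) to output (top); the function name `layer_types` and the parameters `num_layers` and `k` are hypothetical and not taken from the paper's code:

```python
def layer_types(num_layers, k, variant):
    """Return the convolution type of each layer, ordered bottom (input) to top (output).

    variant: "I2D", "I3D", "bottom-heavy" or "top-heavy".
    k: number of 3D layers used by the bottom-heavy / top-heavy variants.
    """
    if not 0 <= k <= num_layers:
        raise ValueError("k must be between 0 and num_layers")
    if variant == "I2D":
        return ["2D"] * num_layers
    if variant == "I3D":
        return ["3D"] * num_layers
    if variant == "bottom-heavy":
        # 3D in the lowest k layers, 2D in the layers above them
        return ["3D"] * k + ["2D"] * (num_layers - k)
    if variant == "top-heavy":
        # 2D in the lower layers, 3D in the highest k layers
        return ["2D"] * (num_layers - k) + ["3D"] * k
    raise ValueError("unknown variant: " + variant)


print(layer_types(5, 2, "top-heavy"))
print(layer_types(5, 2, "bottom-heavy"))
```

In the top-heavy layout the 3D layers operate on the smaller feature maps near the output, which is why it is the cheaper of the two mixed variants.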

2. S3D

2.1. Separable 3D Convolution

(a) 2D Inception block, (b) 3D Inception block, (c) 3D temporal separable Inception block used in S3D networks
  • To separate space and time, each 3D convolution is replaced by a separable 3D convolution, i.e., a filter of the form kt×k×k is replaced by a 1×k×k spatial filter followed by a kt×1×1 temporal filter, where kt is the extent of the filter in time, and k is the height/width of the filter in space.
  • (For Inception, please feel free to read GoogLeNet / Inception-v1, BN-Inception / Inception-v2, Inception-v3, and Inception-v4)
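To see why the factorization is cheaper, the weight counts of the two forms can be compared. This is a rough sketch that ignores biases and assumes the intermediate channel count between the spatial and temporal filters equals the output channel count; the function names and the `c_mid` parameter are illustrative, not from the paper:

```python
def conv3d_params(c_in, c_out, kt, k):
    """Weights in a full kt x k x k 3D convolution (biases ignored)."""
    return kt * k * k * c_in * c_out


def separable_conv3d_params(c_in, c_out, kt, k, c_mid=None):
    """Weights in a 1 x k x k convolution followed by a kt x 1 x 1 convolution.

    c_mid is the channel count between the two convolutions; it defaults to
    c_out, which is an assumption made for this sketch.
    """
    if c_mid is None:
        c_mid = c_out
    spatial = 1 * k * k * c_in * c_mid      # 1 x k x k filter
    temporal = kt * 1 * 1 * c_mid * c_out   # kt x 1 x 1 filter
    return spatial + temporal


full = conv3d_params(64, 64, kt=3, k=3)
sep = separable_conv3d_params(64, 64, kt=3, k=3)
print(full, sep, sep / full)
```

For a 3×3×3 filter with 64 input and 64 output channels, the separable form needs 49,152 weights against 110,592 for the full 3D filter, i.e. about 44% of the original.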

2.2. Separable 3D CNN (S3D)

Separable 3D CNN (S3D)
  • The separable 3D convolution replaces the 3D convolutions in I3D.
  • The resulting model is called S3D, which stands for “Separable 3D CNN”.
  • (For separable convolution, please feel free to read Xception, MobileNetV1, MobileNetV2, and MobileNetV3)

2.3. Spatio-Temporal Feature Gating

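  • The accuracy of S3D is further improved by using feature gating, which reweights each feature channel with a gate computed from context pooled over space and time.
  • This gating module can be plugged into any layer of the network; the resulting model is called S3D-G.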
  • (For gating, please feel free to read Highway.)
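As I read the paper, the gating has the form Y = σ(W·pool(X) + b) ⊙ X, where pool averages each channel over space and time, σ is the sigmoid, and ⊙ scales each channel by its gate. A minimal pure-Python sketch under that reading; the data layout (one flat list of activations per channel) and the function names are assumptions made for illustration:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def feature_gating(x, w, b):
    """Apply channel-wise gating to a feature map.

    x: list of C channels, each a flat list of activations over all
       (time, height, width) positions (layout assumed for this sketch).
    w: C x C weight matrix as a list of rows; b: list of C biases.
    """
    # pool(X): average every channel over space and time
    pooled = [sum(channel) / len(channel) for channel in x]
    # one gate per channel: sigmoid(W . pooled + b)
    gates = [
        sigmoid(sum(w_ij * p for w_ij, p in zip(row, pooled)) + b_i)
        for row, b_i in zip(w, b)
    ]
    # scale every activation of a channel by that channel's gate
    return [[g * v for v in channel] for g, channel in zip(gates, x)]


x = [[1.0, 3.0], [2.0, 2.0]]          # 2 channels, 2 space-time positions each
w = [[0.0, 0.0], [0.0, 0.0]]          # zero weights give a gate of 0.5 everywhere
b = [0.0, 0.0]
print(feature_gating(x, w, b))
```

With zero weights and biases every gate is 0.5, so the output is simply half the input; trained weights let each channel be boosted or suppressed depending on the content of the whole clip.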

3. Experimental Results

Top-1 accuracy on Kinetics-Full and Something-something datasets

I2D underperforms I3D by a large margin.

Effect of separable convolution and feature gating on the Kinetics-Full validation set using RGB features

S3D and S3D-G outperform I3D by a large margin with fewer FLOPs.

Effect of separable convolution and feature gating on the Something-something validation and test sets using RGB features

S3D-G also outperforms S3D and I3D on Something-something.

Benefits of using optical flow on the Kinetics-Full validation set

Using optical flow features as input, the performance is competitive with recent Kinetics Challenge winners and concurrent works.

Results of various methods on action classification on the UCF-101 and HMDB-51 datasets

On UCF-101, the proposed S3D-G architecture, which only uses Kinetics for pretraining, outperforms I3D, and matches R(2+1)D, both of which use large-scale datasets (Kinetics and Sports-1M) for pretraining.

On HMDB-51, S3D-G outperforms all previous methods published to date.

Results of various methods on action detection in JHMDB and UCF101
  • The Faster R-CNN object detection algorithm is used to jointly perform person localization and action recognition.
  • The model uses a 2D ResNet-50 network that takes the annotated keyframe (the frame with box annotations) as input and extracts features for region proposal generation on the keyframe.
  • A 3D network (such as I3D or S3D-G) is then used to take the frames surrounding the keyframe as input, and feature maps are extracted, which are then pooled for bounding box classification.

Both 3D networks outperform previous architectures by large margins, while S3D-G is consistently better than I3D.
