Review — S3D, S3D-G: Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
S3D: Using Separable 3D Convolution; S3D-G: Further Improved With Spatio-Temporal Feature Gating
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification, S3D, S3D-G, by Google Research and University of California, San Diego
2018 ECCV, Over 700 Citations (Sik-Ho Tsang @ Medium)
Video Classification, Action Recognition
- 3D CNNs can convolve jointly over time and space, but they are much more expensive than 2D CNNs and more prone to overfitting.
- It is found that many of the 3D convolutions in the early stages can be replaced by low-cost 2D convolutions.
1. I2D, I3D, Bottom-Heavy I3D, Top-Heavy I3D
- I2D, a 2D CNN operating on multiple frames.
- I3D, a 3D CNN convolving over space and time.
- Bottom-Heavy I3D, which uses 3D convolutions in the lower layers and 2D convolutions in the higher layers.
- Top-Heavy I3D, which uses 2D convolutions in the lower (larger) layers and 3D convolutions in the upper layers.
- The four variants differ only in where the 2D and 3D convolutions are placed along the network, as illustrated in the sketch below.
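To make the four variants concrete, here is a minimal PyTorch sketch (my own construction with plain convolutional blocks, not the paper's Inception-based architecture) showing how 2D and 3D convolutions are swapped at different depths:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, temporal):
    # A "2D" block convolves each frame independently (1x3x3 kernel);
    # a "3D" block also convolves across time (3x3x3 kernel).
    kt = 3 if temporal else 1
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=(kt, 3, 3),
                  padding=(kt // 2, 1, 1), bias=False),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )

def make_variant(channels, first_3d, last_3d):
    # Blocks with index in [first_3d, last_3d) are 3D, the rest are 2D:
    #   I2D:          first_3d == last_3d             (no 3D blocks)
    #   I3D:          first_3d = 0, last_3d = len     (all 3D)
    #   Bottom-Heavy: first_3d = 0, last_3d < len     (3D at the bottom)
    #   Top-Heavy:    first_3d > 0, last_3d = len     (3D at the top)
    layers, in_ch = [], 3
    for i, out_ch in enumerate(channels):
        layers.append(conv_block(in_ch, out_ch, first_3d <= i < last_3d))
        in_ch = out_ch
    return nn.Sequential(*layers)

top_heavy = make_variant([64, 128, 256, 512], first_3d=2, last_3d=4)
x = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, time, height, width)
print(top_heavy(x).shape)            # torch.Size([1, 512, 16, 112, 112])
```

Top-heavy models are both faster (the expensive 3D convolutions run on spatially downsampled feature maps) and, as the experiments below show, more accurate than bottom-heavy ones.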
2. S3D
2.1. Separable 3D Convolution
- To separate space and time, standard 3D convolutions are replaced by separable 3D convolutions: each filter of the form kt×k×k is replaced by a 1×k×k spatial convolution followed by a kt×1×1 temporal convolution, where kt is the temporal extent of the filter and k is its spatial height/width.
- (For Inception, please feel free to read GoogLeNet / Inception-v1, BN-Inception / Inception-v2, Inception-v3, and Inception-v4)
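As a concrete illustration, here is a minimal PyTorch sketch of such a separable 3D convolution (the module name SepConv3d is my own; in the paper the replacement is applied inside the Inception blocks):

```python
import torch
import torch.nn as nn

class SepConv3d(nn.Module):
    """Separable 3D convolution: 1 x k x k (spatial) then kt x 1 x 1 (temporal)."""
    def __init__(self, in_ch, out_ch, k=3, kt=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2), bias=False)
        self.bn1 = nn.BatchNorm3d(out_ch)
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(kt, 1, 1),
                                  padding=(kt // 2, 0, 0), bias=False)
        self.bn2 = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (B, C, T, H, W)
        x = self.relu(self.bn1(self.spatial(x)))
        x = self.relu(self.bn2(self.temporal(x)))
        return x

x = torch.randn(1, 64, 16, 28, 28)
print(SepConv3d(64, 128)(x).shape)  # torch.Size([1, 128, 16, 28, 28])
```

Factorizing a kt×k×k filter into 1×k×k and kt×1×1 reduces the per-filter cost from roughly kt·k² to k²+kt multiplications per channel pair, which is where the FLOP savings come from.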
2.2. Separable 3D CNN (S3D)
- Separable 3D convolutions are used in place of the standard 3D convolutions of I3D.
- The resulting model is called S3D, which stands for “Separable 3D CNN”.
- (For separable convolution, please feel free to read Xception, MobileNetV1, MobileNetV2, and MobileNetV3)
2.3. Spatio-Temporal Feature Gating
- The accuracy of S3D is further improved by feature gating: a feature map X is re-weighted as Y = σ(W pool(X) + b) ⊙ X, where pool(·) averages X over space and time, σ is the sigmoid, and the learned gates are applied channel-wise.
- This gating module can be plugged into any layer of the network; adding it to S3D forms S3D-G.
- (For gating, please feel free to read Highway.)
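Here is a minimal PyTorch sketch of this self-gating mechanism (the module name is my own; it implements the formula Y = σ(W pool(X) + b) ⊙ X given above):

```python
import torch
import torch.nn as nn

class FeatureGating(nn.Module):
    """Self-gating: Y = sigmoid(W * pool(X) + b) * X, broadcast over T, H, W."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, channels)   # learnable W and b

    def forward(self, x):                         # x: (B, C, T, H, W)
        w = x.mean(dim=(2, 3, 4))                 # average pool over space-time -> (B, C)
        w = torch.sigmoid(self.fc(w))             # channel-wise gates in (0, 1)
        return x * w[:, :, None, None, None]      # re-weight every position

x = torch.randn(2, 64, 8, 14, 14)
print(FeatureGating(64)(x).shape)                 # torch.Size([2, 64, 8, 14, 14])
```

The gate lets each channel modulate itself based on global spatiotemporal context, at a negligible cost of one C×C linear layer per gated block.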
3. Experimental Results
- I2D underperforms I3D by a large margin.
- S3D and S3D-G outperform I3D by a large margin with fewer FLOPs.
- S3D-G also outperforms S3D and I3D on Something-Something.
- When optical flow features are also used as input, performance is competitive with recent Kinetics Challenge winners and concurrent works.
- On UCF-101, the proposed S3D-G architecture, which only uses Kinetics for pretraining, outperforms I3D and matches R(2+1)D, both of which use large-scale datasets (Kinetics and Sports-1M) for pretraining.
- On HMDB-51, S3D-G outperforms all previously published methods.
- The Faster R-CNN object detection algorithm is used to jointly perform person localization and action recognition.
- The model uses a 2D ResNet-50 network that takes the annotated keyframe (the frame with box annotations) as input and extracts features for region proposal generation on the keyframe.
- A 3D network (such as I3D or S3D-G) then takes the frames surrounding the keyframe as input; its feature maps are pooled over each proposal for bounding box classification, as in the sketch below.
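The per-box classification step can be sketched in PyTorch as follows; the module name, channel count, and class count are illustrative assumptions, and averaging over time before RoI pooling is one possible choice rather than the paper's exact pooling:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ActionHead(nn.Module):
    """Hypothetical sketch: pool 3D-backbone features over proposals, then classify.
    Proposal generation (2D ResNet-50 + RPN on the keyframe) is assumed given."""
    def __init__(self, channels=832, num_classes=80):  # illustrative sizes
        super().__init__()
        self.fc = nn.Linear(channels * 7 * 7, num_classes)

    def forward(self, feat3d, boxes):
        # feat3d: (B, C, T, H, W) features from I3D / S3D-G on the clip
        feat2d = feat3d.mean(dim=2)                    # average-pool over time
        pooled = roi_align(feat2d, boxes, output_size=(7, 7))
        return self.fc(pooled.flatten(1))              # per-box action logits

feat3d = torch.randn(1, 832, 4, 14, 14)
boxes = [torch.tensor([[2.0, 2.0, 10.0, 10.0]])]       # one proposal on sample 0
print(ActionHead()(feat3d, boxes).shape)               # torch.Size([1, 80])
```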
- Both 3D networks outperform previous architectures by large margins, while S3D-G is consistently better than I3D.
Reference
[2018 ECCV] [S3D, S3D-G]
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
Video Classification / Action Recognition
2014 [Deep Video] [Two-Stream ConvNet] 2015 [DevNet] [C3D] 2016 [TSN] 2017 [Temporal Modeling Approaches] [P3D] [I3D] 2018 [NL: Non-Local Neural Networks] [S3D, S3D-G]