Review — TSN: Temporal Segment Network (Video Classification)

Two-Stream ConvNet + Temporal segment network (TSN), for Video Classification/Action Recognition

5 min read · Jun 20, 2021

In this story, Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, (TSN), by ETH Zurich, The Chinese University of Hong Kong, and Chinese Academy of Sciences (CAS), is reviewed. In this paper:

Temporal segment network (TSN) is designed, which combines a sparse temporal sampling strategy and video-level supervision.
Some good practices for learning ConvNets are described.

This is a paper in 2016 ECCV with over 2100 citations. (Sik-Ho Tsang @ Medium)

Outline

Temporal segment network (TSN)
Some Practices in Learning ConvNets
Experimental Results

1. Temporal segment network (TSN)

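Formally, a video V is divided into K segments {S1, S2, …, SK} of equal duration, and one short snippet Tk is randomly sampled from each segment Sk. The temporal segment network then models the sequence of snippets (T1, T2, …, TK) as follows:

$$\mathrm{TSN}(T_1, T_2, \ldots, T_K) = H\big(G(F(T_1; W), F(T_2; W), \ldots, F(T_K; W))\big)$$

where F(Tk; W) is the convolutional network with parameters W, which produces class scores for snippet Tk, G is the aggregation (segmental consensus) function, and H is the softmax operation.
Each short snippet goes through the ConvNet F, the snippet-level scores are then aggregated by G, and H turns the aggregated scores into class probabilities.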
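Thus, the standard cross-entropy loss is applied to the aggregated scores G = G(F(T1; W), …, F(TK; W)):

$$L(y, G) = -\sum_{i=1}^{C} y_i \Big( G_i - \log \sum_{j=1}^{C} \exp G_j \Big)$$

where C is the number of action classes, yi is the ground-truth label for class i, and Gi is the aggregated score for class i.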
In experiments, the number of snippets K is set to 3.
The aggregation function G can be averaging, maximum, or weighted averaging. Simple averaging is found to be good enough.
Different networks are tried for F, and BN-Inception is found to work well.
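The pipeline above (sparse sampling, per-snippet scoring, averaging, softmax, cross-entropy) can be sketched in a few lines. This is only a minimal illustration in plain Python, not the authors' implementation: the function names are hypothetical, and the per-snippet class scores are passed in as plain lists instead of being computed by a real ConvNet F.

```python
import math
import random


def sample_snippet_indices(num_frames, num_segments=3, rng=random):
    """Pick one random frame index from each of K equal-duration segments."""
    if num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    indices = []
    for k in range(num_segments):
        start = k * num_frames // num_segments        # first frame of segment k
        end = (k + 1) * num_frames // num_segments    # exclusive end of segment k
        indices.append(rng.randrange(start, end))
    return indices


def average_consensus(snippet_scores):
    """G: average the class scores over all snippets (one score list per snippet)."""
    k = len(snippet_scores)
    return [sum(per_class) / k for per_class in zip(*snippet_scores)]


def softmax(scores):
    """H: numerically stable softmax over the aggregated class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(consensus, label):
    """Cross-entropy for a one-hot label: -(G_label - log sum_j exp G_j)."""
    m = max(consensus)
    log_sum_exp = m + math.log(sum(math.exp(g - m) for g in consensus))
    return -(consensus[label] - log_sum_exp)


# Usage: K = 3 snippets from a 30-frame video, 2 classes, made-up scores.
frame_ids = sample_snippet_indices(30, num_segments=3)
scores = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]   # stand-in for F(T_k; W)
consensus = average_consensus(scores)
probs = softmax(consensus)
loss = cross_entropy(consensus, label=0)
```

Because the loss is computed on the aggregated scores, the gradient flows back through every sampled snippet, which is what makes the supervision video-level instead of snippet-level.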

2. Some Practices in Learning ConvNets

2.1. Input Modalities for TSN

**Input modality: RGB images, RGB difference, optical flow fields (x,y directions), and warped optical flow fields (x,y directions)**

Besides the RGB images and optical flow fields used in Two-Stream ConvNet, two more modalities are tried: RGB difference and warped optical flow.
The RGB difference between the current frame and the previous frame describes the appearance change, which may correspond to the motion-salient region. However, the ablation study later shows that adding it does not improve the combined result.
The warped optical flow is obtained by first estimating the homography matrix and then compensating for camera motion. As shown in the above figure, the warped optical flow suppresses the background motion and concentrates the motion on the actor.
The extraction of optical flow and warped optical flow is done by the TVL1 optical flow algorithm implemented in OpenCV with CUDA.
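Of these modalities, RGB difference is the cheapest to compute, since it needs no flow estimation. The sketch below shows the idea on frames stored as nested lists of shape height × width × 3. The function names are hypothetical and the list-based frame format is an assumption for illustration; a real pipeline would work on image arrays.

```python
def rgb_difference(prev_frame, cur_frame):
    """Per-pixel, per-channel difference between two consecutive RGB frames."""
    return [
        [
            [cur - prev for prev, cur in zip(prev_px, cur_px)]
            for prev_px, cur_px in zip(prev_row, cur_row)
        ]
        for prev_row, cur_row in zip(prev_frame, cur_frame)
    ]


def stacked_rgb_difference(frames):
    """Differences between every pair of consecutive frames in a snippet."""
    return [rgb_difference(a, b) for a, b in zip(frames, frames[1:])]


# Usage: two tiny 1x2 frames (values are made up).
prev = [[[10, 20, 30], [0, 0, 0]]]
cur = [[[12, 18, 30], [5, 5, 5]]]
diff = rgb_difference(prev, cur)   # [[[2, -2, 0], [5, 5, 5]]]
```

Static background pixels cancel out to zero, so only regions whose appearance changes between frames survive, which is why this input can act as a rough motion cue.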

2.2. Some Training Details

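Cross-modality pre-training is used for the above inputs: the networks for the non-RGB modalities are initialized from RGB models pretrained on ImageNet.
Batch normalization, introduced with BN-Inception, is used, but the mean and variance parameters of all batch normalization layers are frozen except those of the first one. This is called partial BN here.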
An extra dropout layer is added after the global pooling layer.
Data augmentation is used. In the original Two-Stream ConvNet, random cropping and horizontal flipping are employed.
Here, two new data augmentation techniques are added: corner cropping and scale jittering.
In the corner cropping technique, the extracted regions are selected only from the corners or the center of the image, to avoid implicitly focusing on the center area of an image.
In the multi-scale cropping technique, the scale jittering used in VGGNet for ImageNet classification is adapted to action recognition.
(These points are covered in more detail in the paper. If interested, please feel free to read it.)
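The two augmentations can be combined into a single crop sampler, sketched below. The candidate crop sizes {256, 224, 192, 168} and the 340×256 input size are the values I recall from the paper and should be treated as assumptions here; the function names are hypothetical, and the final resize of each crop to the network input size is left out.

```python
import random

# Assumed candidate crop sizes (width and height are drawn independently).
CROP_SIZES = (256, 224, 192, 168)


def corner_crop_offsets(img_w, img_h, crop_w, crop_h):
    """(left, top) offsets for the four corners and the center of the image."""
    max_x = img_w - crop_w
    max_y = img_h - crop_h
    return [
        (0, 0),                      # top-left
        (max_x, 0),                  # top-right
        (0, max_y),                  # bottom-left
        (max_x, max_y),              # bottom-right
        (max_x // 2, max_y // 2),    # center
    ]


def sample_crop_box(img_w, img_h, sizes=CROP_SIZES, rng=random):
    """Scale jittering plus corner cropping: returns (left, top, right, bottom)."""
    fitting = [s for s in sizes if s <= min(img_w, img_h)]
    if not fitting:
        raise ValueError("image is smaller than every candidate crop size")
    crop_w = rng.choice(fitting)     # jitter the width ...
    crop_h = rng.choice(fitting)     # ... and the height independently
    left, top = rng.choice(corner_crop_offsets(img_w, img_h, crop_w, crop_h))
    return left, top, left + crop_w, top + crop_h


# Usage: sample one crop box for an assumed 340x256 frame.
box = sample_crop_box(340, 256)
```

Drawing width and height independently jitters the aspect ratio as well as the scale, and restricting the position to five fixed locations keeps the image borders as likely to be sampled as the center.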

3. Experimental Results

3.1. Datasets

Two datasets are used: UCF101 and HMDB51.
The UCF101 dataset contains 101 action classes and 13,320 video clips.
The HMDB51 dataset is composed of 6,766 video clips from 51 action categories.
The whole training time on UCF101 is around 2 hours for spatial TSNs and 9 hours for temporal TSNs with 4 TITAN-X GPUs.

3.2. Ablation Study

**Different training strategies for two-stream ConvNets on the UCF101 dataset (split 1).**

Training from scratch does not perform well.
With the spatial ConvNet pretrained on ImageNet, the model outperforms Two-Stream ConvNet [1].
When the ConvNet for the temporal input modalities is pretrained as well, the results are even better.
With partial BN and dropout, 92.0% accuracy is obtained.

**Exploration of different input modalities for two-stream ConvNets on the UCF101 dataset (split 1).**

Optical flow is better at capturing motion information, while RGB difference can be unstable.
Because RGB difference describes similar but less stable motion patterns, leaving it out and combining the other three modalities gives better recognition accuracy than combining all four (92.3% vs 91.7%).

**Exploration of different segmental consensus functions for temporal segment networks on the UCF101 dataset (split 1).**

Three candidates are evaluated: (1) max pooling, (2) average pooling, (3) weighted average, for the form of G.
Average pooling performs best for the two-stream version.
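The three candidates differ only in how the K per-snippet score vectors are reduced to one. A minimal sketch in plain Python follows; the function name and the `weights` argument are hypothetical, and in the real model the weights of the weighted average would be learned, not passed in by hand.

```python
def consensus(snippet_scores, kind="average", weights=None):
    """Reduce K per-snippet class-score lists to a single class-score list."""
    per_class = list(zip(*snippet_scores))   # regroup scores by class
    if kind == "max":
        return [max(col) for col in per_class]
    if kind == "average":
        return [sum(col) / len(col) for col in per_class]
    if kind == "weighted":
        if weights is None or len(weights) != len(snippet_scores):
            raise ValueError("need one weight per snippet")
        total = sum(weights)
        return [sum(w * s for w, s in zip(weights, col)) / total
                for col in per_class]
    raise ValueError("unknown consensus function: " + kind)


# Usage: K = 3 snippets, 2 classes, made-up scores.
scores = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0]]
consensus(scores, "max")                          # [5.0, 4.0]
consensus(scores, "average")                      # [3.0, 2.0]
consensus(scores, "weighted", weights=[1, 1, 2])  # [3.5, 1.5]
```

Max pooling lets a single snippet decide each class score, whereas averaging uses evidence from all snippets.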

**Exploration of different very deep ConvNet architectures on the UCF101 dataset (split 1).**

Using BN-Inception as the backbone gives an accuracy of 92.0%.
With TSN, which samples snippets from segments spanning the whole video, the accuracy rises to 93.5%, outperforming Two-Stream ConvNet [1].