Review — Two-Stream ConvNet: Spatial and Temporal Networks (Video Classification)

Video Classification/Action Recognition Using AlexNet-Like Two-Stream Spatial and Temporal Networks

4 min readJun 14, 2021

In this story, Two-Stream Convolutional Networks for Action Recognition in Videos, (Two-Stream ConvNet), by Visual Geometry Group, University of Oxford, is reviewed. Visual Geometry Group (VGG) is the famous research group. In this paper:

A two-stream ConvNet architecture which incorporates spatial and temporal networks.

This is a paper in 2014 NIPS with over 5400 citations. (Sik-Ho Tsang @ Medium)

Outline

Two-Stream CNN: Network Architecture
Experimental Results

1. Two-Stream CNN: Network Architecture

Video can be decomposed into spatial and temporal components.
The spatial part, in the form of individual frame appearance, carries information about scenes and objects.
The temporal part, in the form of motion across the frames, conveys the movement of the observer (the camera) and the objects.

1.1. Spatial Stream ConvNet

ImageNet-pretrained AlexNet-Like Network is used.

1.2. Temporal Stream ConvNet

**(a),(b): a pair of consecutive video frames, (c): a close-up of dense optical flow, (d): horizontal, (e) vertical component dx of the displacement vector field**

Optical flow is computed using the off-the-shelf GPU implementation of [2] from the OpenCV toolbox.
The horizontal and vertical components of the flow were linearly rescaled to a [0; 255] range and compressed using JPEG.
This reduced the flow size for the UCF-101 dataset from 1.5TB to 27GB.

**ConvNet input derivation from the multi-frame optical flow**

There are horizontal and vertical components of the flow at frame t, i.e. dxt and dyt. The flow channels dx,yt of L consecutive frames to form a total of 2L input channels.
A ConvNet input volume Iτ has the size of w×h×2L for an arbitrary frame is:

Mean flow subtraction is used to perform zero-centering of the network input, as it allows the model to better exploit the rectification non-linearities.
This multi-frame optical flow is input into the temporal stream ConvNet.

1.3. Multi-Task Learning

UCF-101 and HMDB-51 datasets, which have only: 9.5K and 3.7K videos respectively, are small.
Two softmax classification layers are on top of the last fully-connected layer: one softmax layer computes HMDB-51 classification scores, the other one — the UCF-101 scores.
Each of the layers is equipped with its own loss function, which operates only on the videos, coming from the respective dataset.
The overall training loss is computed as the sum of the individual tasks’ losses.

2. Experimental Results

2.1. Spatial Stream ConvNet

**Spatial Stream ConvNet UCF-101 (split 1).**

Interestingly, fine-tuning the whole network gives only marginal improvement over training the last layer only.
Network with training the last layer only is used.

2.2. Temporal Stream ConvNet

**Temporal Stream ConvNet UCF-101 (split 1).**

Mean subtraction is useful as improvement is always achieved.
The bi-directional optical flow is slightly better than a uni-directional forward flow.
But later on, the bi-directional optical flow is not used since the performance is dropped when the network is fused with spatial stream ConvNet.
(There are passages describing different types of optical flow as above. If interested, please feel free to read the paper.)

2.3. Multi-Task Learning

**Temporal ConvNet accuracy on HMDB-51**

Using Multi-Task Learning is better than trained from scratch, or pre-training on either dataset.

2.4. Two-Stream ConvNet

**Two-Stream ConvNet accuracy on UCF-101 (split 1)**

The softmax scores are fused using either averaging or a linear SVM.
SVM-based fusion of softmax scores outperforms fusion by averaging.
Using bi-directional flow is not beneficial in the case of ConvNet fusion.

2.5. SOTA Comparison

**Mean accuracy (over three splits) on UCF-101 and HMDB-51**

As can be seen from the above table, both the spatial and temporal nets alone outperform the deep architectures of [14, 16] by a large margin.
The combination of the two nets further improves the results (in line with the single-split experiments above), and is comparable to the very recent state-of-the-art hand-crafted models.

Reference

[2014 NIPS] [Two-Stream ConvNet]
Two-Stream Convolutional Networks for Action Recognition in Videos

Video Classification

2014 [Deep Video] [Two-Stream ConvNet] 2015 [DevNet] [C3D] 2017 [P3D]