Review — Two-Stream ConvNet: Spatial and Temporal Networks (Video Classification)

Video Classification/Action Recognition Using AlexNet-Like Two-Stream Spatial and Temporal Networks

Sik-Ho Tsang
Jun 14, 2021

In this story, Two-Stream Convolutional Networks for Action Recognition in Videos (Two-Stream ConvNet), by the Visual Geometry Group (VGG), University of Oxford, is reviewed. VGG is a well-known computer vision research group. In this paper:

  • A two-stream ConvNet architecture, which incorporates spatial and temporal networks, is proposed.

This is a 2014 NIPS paper with over 5400 citations. (Sik-Ho Tsang @ Medium)


  1. Two-Stream CNN: Network Architecture
  2. Experimental Results

1. Two-Stream CNN: Network Architecture

Two-Stream CNN: Network Architecture
  • Video can be decomposed into spatial and temporal components.
  • The spatial part, in the form of individual frame appearance, carries information about scenes and objects.
  • The temporal part, in the form of motion across the frames, conveys the movement of the observer (the camera) and the objects.

1.1. Spatial Stream ConvNet

  • An ImageNet-pretrained AlexNet-like network is used.

1.2. Temporal Stream ConvNet

(a), (b): a pair of consecutive video frames; (c): a close-up of dense optical flow; (d): horizontal component $d^x$ of the displacement vector field; (e): vertical component $d^y$ of the displacement vector field
  • Optical flow is computed using the off-the-shelf GPU implementation of [2] from the OpenCV toolbox.
  • The horizontal and vertical components of the flow were linearly rescaled to a [0, 255] range and compressed using JPEG.
  • This reduced the flow size for the UCF-101 dataset from 1.5TB to 27GB.
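The linear rescaling step can be illustrated with a minimal sketch. The clipping bound `bound` (20 pixels here) and the function names are assumptions made for illustration; the review does not state which range was mapped onto [0, 255], and the JPEG compression itself is not shown.

```python
def quantize_flow(value, bound=20.0):
    """Map a displacement in [-bound, bound] linearly onto an integer in [0, 255].

    `bound` is a hypothetical clipping range, not a value taken from the paper.
    """
    clipped = max(-bound, min(bound, value))
    return int(round((clipped + bound) * 255.0 / (2.0 * bound)))


def dequantize_flow(code, bound=20.0):
    """Invert quantize_flow, up to the rounding error of one quantization step."""
    return code * (2.0 * bound) / 255.0 - bound


samples = [-25.0, -20.0, -3.7, 0.0, 3.7, 20.0]
codes = [quantize_flow(v) for v in samples]
restored = [dequantize_flow(c) for c in codes]
```

Each displacement then fits in one byte per component, which is what allows the flow fields to be stored as ordinary 8-bit images before compression.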
ConvNet input derivation from the multi-frame optical flow
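  • At frame t, the optical flow has a horizontal component $d^x_t$ and a vertical component $d^y_t$. The flow channels $d^{x,y}_t$ of L consecutive frames are stacked to form a total of 2L input channels.
  • For an arbitrary frame τ, the ConvNet input volume $I_\tau \in \mathbb{R}^{w \times h \times 2L}$ is constructed as:

$$I_\tau(u, v, 2k-1) = d^x_{\tau+k-1}(u, v), \qquad I_\tau(u, v, 2k) = d^y_{\tau+k-1}(u, v), \qquad u \in [1, w],\ v \in [1, h],\ k \in [1, L].$$
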
  • Mean flow subtraction is used to perform zero-centering of the network input, as it allows the model to better exploit the rectification non-linearities.
  • This multi-frame optical flow is input into the temporal stream ConvNet.
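The stacking and zero-centering described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the channels are interleaved as horizontal then vertical for each frame, indices are 0-based, and the mean is subtracted per channel.

```python
def stack_flow(flows_x, flows_y, tau, L):
    """Build the 2L input channels for a clip starting at frame index tau.

    flows_x[t] and flows_y[t] are h-by-w nested lists with the horizontal and
    vertical flow at frame t. Channel order (assumed): dx, dy, dx, dy, ...
    """
    channels = []
    for k in range(L):
        channels.append(flows_x[tau + k])
        channels.append(flows_y[tau + k])
    return channels


def subtract_mean(channels):
    """Zero-center each channel by subtracting its own mean displacement."""
    centred = []
    for ch in channels:
        n = sum(len(row) for row in ch)
        mean = sum(sum(row) for row in ch) / n
        centred.append([[v - mean for v in row] for row in ch])
    return centred


# Toy data: 3 frames of 2x2 flow fields (values chosen only for illustration).
flows_x = [[[1.0, 3.0], [5.0, 7.0]], [[2.0, 2.0], [2.0, 2.0]], [[0.0, 4.0], [0.0, 4.0]]]
flows_y = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, -1.0], [1.0, -1.0]], [[10.0, 10.0], [10.0, 14.0]]]
volume = subtract_mean(stack_flow(flows_x, flows_y, tau=1, L=2))
```

With L = 2 the result has 4 channels, each of size h×w, and every channel has zero mean.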

1.3. Multi-Task Learning

  • The UCF-101 and HMDB-51 datasets are small, with only about 9.5K and 3.7K training videos respectively.
  • Two softmax classification layers are placed on top of the last fully-connected layer: one computes the HMDB-51 classification scores, the other the UCF-101 scores.
  • Each layer has its own loss function, which operates only on the videos coming from the respective dataset.
  • The overall training loss is computed as the sum of the individual tasks’ losses.
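A minimal sketch of this loss, assuming cross-entropy for each head and averaging within each dataset (both assumptions, as are the function names and the tiny head sizes used in the toy batch):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multitask_loss(batch):
    """Sum of per-dataset cross-entropy losses.

    batch: list of (dataset, logits_by_head, label), with dataset in
    {"ucf", "hmdb"}. Each head only sees samples from its own dataset.
    """
    losses = {}
    for dataset in ("ucf", "hmdb"):
        terms = [
            -math.log(softmax(logits[dataset])[label])
            for name, logits, label in batch
            if name == dataset
        ]
        losses[dataset] = sum(terms) / len(terms) if terms else 0.0
    return losses["ucf"] + losses["hmdb"], losses


# Toy batch: a 3-class "ucf" head and a 2-class "hmdb" head (illustrative sizes).
batch = [
    ("ucf", {"ucf": [0.0, 0.0, 0.0], "hmdb": [5.0, -5.0]}, 1),
    ("hmdb", {"ucf": [9.0, 0.0, 0.0], "hmdb": [0.0, 0.0]}, 0),
]
total, per_task = multitask_loss(batch)
```

In the toy batch, the HMDB-51 head's output for the UCF-101 sample (and vice versa) is ignored, so each sample contributes to exactly one of the two losses.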

2. Experimental Results

2.1. Spatial Stream ConvNet

Spatial Stream ConvNet UCF-101 (split 1).
  • Interestingly, fine-tuning the whole network gives only a marginal improvement over training the last layer only.
  • Therefore, the network with only the last layer trained is used.

2.2. Temporal Stream ConvNet

Temporal Stream ConvNet UCF-101 (split 1).
  • Mean subtraction is useful, as it always improves the accuracy.
  • Bi-directional optical flow is slightly better than uni-directional forward flow.
  • Later on, however, bi-directional optical flow is not used, since performance drops when the network is fused with the spatial stream ConvNet.
  • (There are passages describing different types of optical flow as above. If interested, please feel free to read the paper.)

2.3. Multi-Task Learning

Temporal ConvNet accuracy on HMDB-51
  • Multi-task learning performs better than training from scratch or pre-training on either dataset.

2.4. Two-Stream ConvNet

Two-Stream ConvNet accuracy on UCF-101 (split 1)
  • The softmax scores are fused using either averaging or a linear SVM.
  • SVM-based fusion of softmax scores outperforms fusion by averaging.
  • Using bi-directional flow is not beneficial in the case of ConvNet fusion.
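The simpler of the two fusion schemes, averaging, can be sketched as follows. The function name and the toy scores are hypothetical, and the SVM-based variant is not shown here.

```python
def fuse_by_averaging(spatial_scores, temporal_scores):
    """Average the per-class softmax scores of the two streams.

    Returns the fused scores and the index of the highest-scoring class.
    """
    fused = [(s + t) / 2.0 for s, t in zip(spatial_scores, temporal_scores)]
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return fused, predicted


# Toy 3-class example: the streams disagree, and the fused score decides.
spatial = [0.6, 0.3, 0.1]
temporal = [0.2, 0.7, 0.1]
fused, predicted = fuse_by_averaging(spatial, temporal)
```

In this toy case the spatial stream alone would pick class 0, while the fused scores favour class 1, which shows how the temporal stream can correct the spatial one.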

2.5. SOTA Comparison

Mean accuracy (over three splits) on UCF-101 and HMDB-51
  • As can be seen from the above table, both the spatial and temporal nets alone outperform the deep architectures of [14, 16] by a large margin.
  • The combination of the two nets further improves the results (in line with the single-split experiments above), and is comparable to the very recent state-of-the-art hand-crafted models.


