Review — Two-Stream ConvNet: Spatial and Temporal Networks (Video Classification)

Video Classification/Action Recognition Using AlexNet-Like Two-Stream Spatial and Temporal Networks

In this story, Two-Stream Convolutional Networks for Action Recognition in Videos (Two-Stream ConvNet), by the Visual Geometry Group (VGG), University of Oxford, is reviewed. VGG is a well-known research group. In this paper:

  • A two-stream ConvNet architecture, which incorporates spatial and temporal networks, is proposed.

This is a paper in 2014 NIPS with over 5400 citations. (Sik-Ho Tsang @ Medium)


  1. Two-Stream CNN: Network Architecture
  2. Experimental Results

1. Two-Stream CNN: Network Architecture

Two-Stream CNN: Network Architecture
  • Video can be decomposed into spatial and temporal components.
  • The spatial part, in the form of individual frame appearance, carries information about scenes and objects.
  • The temporal part, in the form of motion across the frames, conveys the movement of the observer (the camera) and the objects.

1.1. Spatial Stream ConvNet

  • An ImageNet-pretrained AlexNet-like network is used.

1.2. Temporal Stream ConvNet

(a), (b): a pair of consecutive video frames; (c): a close-up of dense optical flow; (d): horizontal component dx of the displacement vector field; (e): vertical component dy of the displacement vector field
  • Optical flow is computed using the off-the-shelf GPU implementation of [2] from the OpenCV toolbox.
  • The horizontal and vertical components of the flow were linearly rescaled to a [0, 255] range and compressed using JPEG.
  • This reduced the flow size for the UCF-101 dataset from 1.5TB to 27GB.
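The linear rescaling step above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names and the clipping limit `bound` are assumptions (the review does not state which flow range was mapped to [0, 255]), and the JPEG encoding itself is left to an image library and is not shown.

```python
import numpy as np

def quantize_flow(flow, bound=20.0):
    """Linearly map flow values from [-bound, bound] to uint8 in [0, 255].

    `bound` is an assumed clipping limit in pixels, chosen for illustration.
    """
    clipped = np.clip(flow, -bound, bound)
    scaled = (clipped + bound) * (255.0 / (2.0 * bound))
    return np.round(scaled).astype(np.uint8)

def dequantize_flow(quantized, bound=20.0):
    """Approximate inverse of quantize_flow (loses at most half a step)."""
    return quantized.astype(np.float32) * (2.0 * bound / 255.0) - bound

# One flow component for a tiny 2x2 "frame".
flow_x = np.array([[-20.0, 0.0], [5.0, 20.0]], dtype=np.float32)
q = quantize_flow(flow_x)        # 8-bit image, ready for JPEG compression
restored = dequantize_flow(q)    # close to flow_x, up to quantization error
```

Storing each component as an 8-bit image is what makes standard image compression applicable, at the cost of a small quantization error.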
ConvNet input derivation from the multi-frame optical flow
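  • The flow at frame t has a horizontal and a vertical component, d^x_t and d^y_t. The flow channels d^{x,y}_t of L consecutive frames are stacked to form a total of 2L input channels.
  • For an arbitrary frame τ, the ConvNet input volume I_τ has a size of w×h×2L and is constructed as: I_τ(u, v, 2k−1) = d^x_{τ+k−1}(u, v), I_τ(u, v, 2k) = d^y_{τ+k−1}(u, v), for u = 1, …, w, v = 1, …, h, k = 1, …, L.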
  • Mean flow subtraction is used to perform zero-centering of the network input, as it allows the model to better exploit the rectification non-linearities.
  • This multi-frame optical flow is input into the temporal stream ConvNet.
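The 2L-channel stacking and the mean flow subtraction described above can be sketched with NumPy. This is only an illustration under stated assumptions: the function names, the (T, h, w) array layout, the 0-based frame index, and the choice to subtract each channel's spatial mean are mine, and L = 10 is just an example value.

```python
import numpy as np

def stack_flow(dx, dy, tau, L):
    """Build an input volume of shape (h, w, 2L) starting at frame tau.

    dx, dy: arrays of shape (T, h, w) holding the horizontal and vertical
    flow for each frame. tau is 0-based here.
    """
    h, w = dx.shape[1:]
    volume = np.empty((h, w, 2 * L), dtype=np.float32)
    for k in range(L):
        volume[:, :, 2 * k] = dx[tau + k]      # horizontal component
        volume[:, :, 2 * k + 1] = dy[tau + k]  # vertical component
    return volume

def subtract_mean_flow(volume):
    """Zero-center every channel by subtracting its spatial mean."""
    return volume - volume.mean(axis=(0, 1), keepdims=True)

# Random data standing in for real optical flow.
T, h, w, L = 12, 4, 6, 10
rng = np.random.default_rng(0)
dx = rng.normal(size=(T, h, w)).astype(np.float32)
dy = rng.normal(size=(T, h, w)).astype(np.float32)

raw = stack_flow(dx, dy, tau=1, L=L)
volume = subtract_mean_flow(raw)   # shape (h, w, 2L)
```

Interleaving the horizontal and vertical channels keeps the two components of the same frame next to each other along the channel axis.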

1.3. Multi-Task Learning

  • The UCF-101 and HMDB-51 datasets are small, with only 9.5K and 3.7K videos respectively.
  • Two softmax classification layers are placed on top of the last fully-connected layer: one computes the HMDB-51 classification scores, the other computes the UCF-101 scores.
  • Each of the layers is equipped with its own loss function, which operates only on the videos coming from the respective dataset.
  • The overall training loss is computed as the sum of the individual tasks’ losses.
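A toy version of this two-head loss, in plain Python, might look like the sketch below. The dictionary layout of a sample, the head names `ucf101` and `hmdb51`, and the use of summed cross-entropy per head are assumptions made for illustration; the class counts are also shrunk to 3 and 2 to keep the example readable.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability of the true class."""
    return -math.log(softmax(logits)[label])

def multitask_loss(batch):
    """Return (total loss, per-task losses).

    Each sample is scored only by the head of its own dataset; the total
    is the sum of the two per-task losses.
    """
    losses = {"ucf101": 0.0, "hmdb51": 0.0}
    for sample in batch:
        head = sample["dataset"]
        losses[head] += cross_entropy(sample["logits"][head], sample["label"])
    return losses["ucf101"] + losses["hmdb51"], losses

batch = [
    {"dataset": "ucf101", "label": 0,
     "logits": {"ucf101": [2.0, 0.0, 0.0], "hmdb51": [0.0, 0.0]}},
    {"dataset": "hmdb51", "label": 1,
     "logits": {"ucf101": [0.0, 0.0, 0.0], "hmdb51": [0.0, 0.0]}},
]
total, per_task = multitask_loss(batch)
```

Because each video contributes to one head only, the shared layers see gradients from both datasets while each softmax layer is trained on its own labels.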

2. Experimental Results

2.1. Spatial Stream ConvNet

Spatial Stream ConvNet UCF-101 (split 1).
  • Interestingly, fine-tuning the whole network gives only marginal improvement over training the last layer only.
  • The network in which only the last layer is trained is used in the following experiments.

2.2. Temporal Stream ConvNet

Temporal Stream ConvNet UCF-101 (split 1).
  • Mean subtraction is useful: accuracy improves in every configuration where it is compared.
  • The bi-directional optical flow is slightly better than a uni-directional forward flow.
  • But later on, the bi-directional optical flow is not used, since performance drops when the network is fused with the spatial stream ConvNet.
  • (There are passages describing different types of optical flow as above. If interested, please feel free to read the paper.)

2.3. Multi-Task Learning

Temporal ConvNet accuracy on HMDB-51
  • Multi-task learning performs better than training from scratch or fine-tuning a pre-trained network.

2.4. Two-Stream ConvNet

Two-Stream ConvNet accuracy on UCF-101 (split 1)
  • The softmax scores are fused using either averaging or a linear SVM.
  • SVM-based fusion of softmax scores outperforms fusion by averaging.
  • Using bi-directional flow is not beneficial in the case of ConvNet fusion.
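Fusion by averaging is simple enough to sketch in a few lines of plain Python. The function names and the three-class scores below are made up for illustration; the SVM-based fusion, which performed better in the paper, would need a trained linear SVM on the stacked scores and is not shown here.

```python
def fuse_by_averaging(spatial_scores, temporal_scores):
    """Average the per-class softmax scores of the two streams."""
    if len(spatial_scores) != len(temporal_scores):
        raise ValueError("both streams must score the same classes")
    return [(s + t) / 2.0 for s, t in zip(spatial_scores, temporal_scores)]

def predict(scores):
    """Index of the highest-scoring class."""
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical softmax outputs for a 3-class problem.
spatial = [0.6, 0.3, 0.1]   # the spatial stream prefers class 0
temporal = [0.2, 0.7, 0.1]  # the temporal stream prefers class 1
fused = fuse_by_averaging(spatial, temporal)
label = predict(fused)
```

In this toy case the two streams disagree, and the fused score follows the stream that is more confident.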

2.5. SOTA Comparison

Mean accuracy (over three splits) on UCF-101 and HMDB-51
  • As can be seen from the above table, both the spatial and temporal nets alone outperform the deep architectures of [14, 16] by a large margin.
  • The combination of the two nets further improves the results (in line with the single-split experiments above), and is comparable to the very recent state-of-the-art hand-crafted models.
