Review — Two-Stream ConvNet: Spatial and Temporal Networks (Video Classification)
Video Classification/Action Recognition Using AlexNet-Like Two-Stream Spatial and Temporal Networks
4 min read · Jun 14, 2021
In this story, Two-Stream Convolutional Networks for Action Recognition in Videos (Two-Stream ConvNet), by the Visual Geometry Group, University of Oxford, is reviewed. The Visual Geometry Group (VGG) is a well-known research group in computer vision. In this paper:
- A two-stream ConvNet architecture which incorporates spatial and temporal networks is proposed.
This is a paper in 2014 NIPS with over 5400 citations. (Sik-Ho Tsang @ Medium)
Outline
- Two-Stream CNN: Network Architecture
- Experimental Results
1. Two-Stream CNN: Network Architecture
- Video can be decomposed into spatial and temporal components.
- The spatial part, in the form of individual frame appearance, carries information about scenes and objects.
- The temporal part, in the form of motion across the frames, conveys the movement of the observer (the camera) and the objects.
1.1. Spatial Stream ConvNet
- An ImageNet-pretrained AlexNet-like network, which operates on individual video frames, is used.
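Below is a minimal sketch of this setting (torchvision's AlexNet is used as a stand-in for the paper's AlexNet-like architecture; the training choice follows the "last layer only" setting discussed in the results, everything else is an assumption, not the authors' code):

```python
# Sketch only: torchvision's AlexNet as a stand-in for the AlexNet-like
# spatial stream; the ImageNet-pretrained layers are frozen and only a new
# last layer is trained (the "last layer only" setting).
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 101  # UCF-101

spatial_net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
for p in spatial_net.parameters():
    p.requires_grad = False  # keep the ImageNet-pretrained weights fixed

# Replace the 1000-way ImageNet classifier with a 101-way action classifier.
spatial_net.classifier[6] = nn.Linear(4096, NUM_CLASSES)

frame = torch.randn(1, 3, 224, 224)  # a single RGB frame
scores = spatial_net(frame)          # shape: (1, 101)
```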
1.2. Temporal Stream ConvNet
- Optical flow is computed using the off-the-shelf GPU implementation of [2] from the OpenCV toolbox.
- The horizontal and vertical components of the flow are linearly rescaled to a [0, 255] range and compressed using JPEG.
- This reduced the flow size for the UCF-101 dataset from 1.5TB to 27GB.
- At frame t, the flow has horizontal and vertical components, d^x_t and d^y_t. The flow channels d^{x,y}_t of L consecutive frames are stacked to form a total of 2L input channels.
- For an arbitrary frame τ, the ConvNet input volume I_τ has size w×h×2L, with channels I_τ(u, v, 2k−1) = d^x_{τ+k−1}(u, v) and I_τ(u, v, 2k) = d^y_{τ+k−1}(u, v), for k = 1, …, L (a code sketch of this construction is given after this list).
- Mean flow subtraction is used to perform zero-centering of the network input, as it allows the model to better exploit the rectification non-linearities.
- This multi-frame optical flow is input into the temporal stream ConvNet.
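As a rough illustration, the sketch below builds such a 2L-channel input volume (assumptions: OpenCV's Farneback flow in place of the GPU implementation of [2], L = 10, and a clipping bound of 20 pixels for the linear rescaling to [0, 255]):

```python
# Sketch only: build the 2L-channel temporal input volume I_tau by
# rescaling flow to [0, 255], stacking the x/y channels of L consecutive
# frames, and subtracting the mean flow for zero-centering.
import cv2
import numpy as np

L = 10  # number of consecutive flow frames to stack (assumption)

def flow_to_uint8(flow, bound=20.0):
    """Linearly rescale flow values from [-bound, bound] to [0, 255]."""
    flow = np.clip(flow, -bound, bound)
    return np.uint8(255 * (flow + bound) / (2 * bound))

def build_input_volume(frames):
    """frames: list of L+1 consecutive grayscale frames (H x W, uint8)."""
    channels = []
    for t in range(L):
        flow = cv2.calcOpticalFlowFarneback(
            frames[t], frames[t + 1], None,
            0.5, 3, 15, 3, 5, 1.2, 0)                 # H x W x 2 (dx, dy)
        channels.append(flow_to_uint8(flow[..., 0]))  # horizontal component d^x_t
        channels.append(flow_to_uint8(flow[..., 1]))  # vertical component   d^y_t
    volume = np.stack(channels, axis=-1).astype(np.float32)  # H x W x 2L
    volume -= volume.mean()  # mean flow subtraction (zero-centering)
    return volume
```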
1.3. Multi-Task Learning
- The UCF-101 and HMDB-51 datasets are small, with only 9.5K and 3.7K videos respectively.
- Two softmax classification layers are placed on top of the last fully-connected layer: one softmax layer computes the HMDB-51 classification scores, and the other one computes the UCF-101 scores.
- Each layer is equipped with its own loss function, which operates only on the videos coming from the respective dataset.
- The overall training loss is computed as the sum of the individual tasks’ losses.
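A minimal sketch of this multi-task setup (class and function names are assumptions, not the authors' code): a shared trunk with two classification heads, where each head's loss only sees videos from its own dataset and the two losses are summed:

```python
# Sketch only: shared ConvNet trunk with two classification heads,
# one per dataset, and a summed multi-task loss.
import torch
import torch.nn as nn

class TwoHeadClassifier(nn.Module):
    def __init__(self, trunk, feat_dim=4096):
        super().__init__()
        self.trunk = trunk                         # shared layers up to the last FC
        self.head_ucf = nn.Linear(feat_dim, 101)   # UCF-101 scores
        self.head_hmdb = nn.Linear(feat_dim, 51)   # HMDB-51 scores

    def forward(self, x):
        feats = self.trunk(x)
        return self.head_ucf(feats), self.head_hmdb(feats)

criterion = nn.CrossEntropyLoss()  # softmax + cross-entropy on each head

def multitask_loss(model, x_ucf, y_ucf, x_hmdb, y_hmdb):
    # Each head only sees videos from its own dataset;
    # the overall loss is the sum of the per-task losses.
    ucf_scores, _ = model(x_ucf)
    _, hmdb_scores = model(x_hmdb)
    return criterion(ucf_scores, y_ucf) + criterion(hmdb_scores, y_hmdb)
```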
2. Experimental Results
2.1. Spatial Stream ConvNet
- Interestingly, fine-tuning the whole network gives only marginal improvement over training the last layer only.
- Thus, the network with only the last layer trained is used.
2.2. Temporal Stream ConvNet
- Mean flow subtraction is helpful, as it consistently improves performance.
- The bi-directional optical flow is slightly better than a uni-directional forward flow.
- However, bi-directional optical flow is not used later on, since the performance drops when the temporal network is fused with the spatial stream ConvNet.
- (The paper describes further variants of the optical-flow input; if interested, please feel free to read the paper.)
2.3. Multi-Task Learning
- Using multi-task learning is better than training from scratch or pre-training on either dataset alone.
2.4. Two-Stream ConvNet
- The softmax scores are fused using either averaging or a linear SVM.
- SVM-based fusion of softmax scores outperforms fusion by averaging.
- Using bi-directional flow is not beneficial in the case of ConvNet fusion.
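A minimal sketch of both fusion variants (variable names are assumptions, and the exact SVM features used in the paper may differ): averaging the two streams' softmax scores, or training a linear SVM on the stacked score vectors:

```python
# Sketch only: late fusion of the two streams' softmax scores,
# either by averaging or by a linear SVM on the stacked scores.
import numpy as np
from sklearn.svm import LinearSVC

# spatial_scores, temporal_scores: (num_videos, num_classes) softmax outputs
def fuse_by_averaging(spatial_scores, temporal_scores):
    return (spatial_scores + temporal_scores) / 2.0

def fuse_by_svm(spatial_scores, temporal_scores, labels):
    features = np.hstack([spatial_scores, temporal_scores])  # (N, 2 * num_classes)
    svm = LinearSVC(C=1.0)
    svm.fit(features, labels)  # trained on held-out training videos
    return svm

# At test time: take the argmax of the averaged scores, or
# call svm.predict() on the stacked test-score features.
```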
2.5. SOTA Comparison
- As can be seen from the above table, both the spatial and temporal nets alone outperform the deep architectures of [14, 16] by a large margin.
- The combination of the two nets further improves the results (in line with the single-split experiments above), and is comparable to the very recent state-of-the-art hand-crafted models.
Reference
[2014 NIPS] [Two-Stream ConvNet]
Two-Stream Convolutional Networks for Action Recognition in Videos
Video Classification
2014 [Deep Video] [Two-Stream ConvNet] 2015 [DevNet] [C3D] 2017 [P3D]