In this story, Two-Stream Convolutional Networks for Action Recognition in Videos (Two-Stream ConvNet), by the Visual Geometry Group (VGG), University of Oxford, is reviewed. VGG is a famous research group. In this paper:
A two-stream ConvNet architecture which incorporates spatial and temporal networks is proposed.
This is a 2014 NIPS paper with over 5400 citations.
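Before going into details, here is a minimal PyTorch sketch of the two-stream idea; the layer sizes and the fusion at the end are illustrative placeholders, not the paper's exact CNN-M-2048 configuration:

```python
import torch
import torch.nn as nn

# Illustrative two-stream sketch: one ConvNet per stream, late fusion of
# softmax scores. Layer sizes are placeholders, not the paper's config.
class StreamConvNet(nn.Module):
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 96, kernel_size=7, stride=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

L = 10                                # number of stacked flow frames
spatial = StreamConvNet(3, 101)       # spatial stream: a single RGB frame
temporal = StreamConvNet(2 * L, 101)  # temporal stream: 2L flow channels

frame = torch.randn(1, 3, 224, 224)
flow = torch.randn(1, 2 * L, 224, 224)

# Late fusion: average the two streams' softmax scores.
scores = (spatial(frame).softmax(1) + temporal(flow).softmax(1)) / 2
```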
(a), (b): a pair of consecutive video frames, (c): a close-up of the dense optical flow, (d): horizontal component dx of the displacement vector field, (e): vertical component dy of the displacement vector field
Optical flow is computed using the off-the-shelf GPU implementation of [2] from the OpenCV toolbox.
The horizontal and vertical components of the flow were linearly rescaled to a [0, 255] range and compressed using JPEG.
This reduced the flow size for the UCF-101 dataset from 1.5TB to 27GB.
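A small sketch of this pre-processing step is below. The paper uses the GPU implementation of [2]; here cv2.calcOpticalFlowFarneback serves as a readily available CPU stand-in, and the per-frame min/max rescaling is one simple choice, not necessarily the paper's exact scheme:

```python
import cv2
import numpy as np

def flow_to_jpegs(prev_gray, next_gray, out_prefix):
    """prev_gray, next_gray: 8-bit single-channel consecutive frames."""
    # Dense optical flow (Farneback stand-in for the paper's method [2]).
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    for i, name in enumerate(("x", "y")):
        comp = flow[..., i]
        # Linearly rescale the component to [0, 255] and store as JPEG.
        lo, hi = comp.min(), comp.max()
        comp = (comp - lo) / max(hi - lo, 1e-6) * 255.0
        cv2.imwrite(f"{out_prefix}_{name}.jpg", comp.astype(np.uint8))
```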
ConvNet input derivation from the multi-frame optical flow
There are horizontal and vertical components of the flow at frame t, i.e. d_t^x and d_t^y. The flow channels d_t^{x,y} of L consecutive frames are stacked to form a total of 2L input channels.
For an arbitrary frame τ, the ConvNet input volume I_τ of size w×h×2L is constructed as:

I_τ(u, v, 2k-1) = d^x_{τ+k-1}(u, v),
I_τ(u, v, 2k) = d^y_{τ+k-1}(u, v), for u ∈ [1, w], v ∈ [1, h], k ∈ [1, L].
Mean flow subtraction is used to perform zero-centering of the network input, as it allows the model to better exploit the rectification non-linearities.
This multi-frame optical flow is input into the temporal stream ConvNet.
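A small NumPy sketch of this construction (function and variable names are my own), including the mean flow subtraction mentioned above:

```python
import numpy as np

# Build the input volume following the equation above: channel 2k-1 holds
# d^x_{τ+k-1} and channel 2k holds d^y_{τ+k-1} (1-indexed, as in the paper).
def build_flow_volume(flow_x, flow_y, tau, L):
    """flow_x, flow_y: lists of (h, w) flow-component arrays, one per frame."""
    channels = []
    for k in range(L):
        channels.append(flow_x[tau + k])  # horizontal component of frame tau+k
        channels.append(flow_y[tau + k])  # vertical component of frame tau+k
    volume = np.stack(channels, axis=-1).astype(np.float32)  # (h, w, 2L)
    # Mean flow subtraction: zero-center each channel (a simple stand-in
    # for the paper's camera-motion compensation).
    return volume - volume.mean(axis=(0, 1), keepdims=True)
```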
1.3. Multi-Task Learning
The UCF-101 and HMDB-51 datasets are small, with only 9.5K and 3.7K videos respectively.
Two softmax classification layers are put on top of the last fully-connected layer: one softmax layer computes the HMDB-51 classification scores, and the other computes the UCF-101 scores.
Each of the layers is equipped with its own loss function, which operates only on the videos coming from the respective dataset.
The overall training loss is computed as the sum of the individual tasks’ losses.
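A minimal PyTorch sketch of this multi-task setup (the names and the 2048-d feature size are illustrative):

```python
import torch
import torch.nn as nn

# One shared trunk feature, two dataset-specific softmax classifiers.
class MultiTaskHead(nn.Module):
    def __init__(self, feat_dim=2048):
        super().__init__()
        self.fc_ucf = nn.Linear(feat_dim, 101)   # UCF-101 scores
        self.fc_hmdb = nn.Linear(feat_dim, 51)   # HMDB-51 scores

    def forward(self, feats):
        return self.fc_ucf(feats), self.fc_hmdb(feats)

head = MultiTaskHead()
ce = nn.CrossEntropyLoss()

# Dummy batches standing in for features of videos from each dataset.
feats_ucf, y_ucf = torch.randn(8, 2048), torch.randint(0, 101, (8,))
feats_hmdb, y_hmdb = torch.randn(8, 2048), torch.randint(0, 51, (8,))

# Each loss only sees videos from its own dataset; the total is their sum.
ucf_scores, _ = head(feats_ucf)
_, hmdb_scores = head(feats_hmdb)
loss = ce(ucf_scores, y_ucf) + ce(hmdb_scores, y_hmdb)
```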
2. Experimental Results
2.1. Spatial Stream ConvNet
Spatial Stream ConvNet accuracy on UCF-101 (split 1).
Interestingly, fine-tuning the whole network gives only marginal improvement over training the last layer only.
The network with only the last layer trained is therefore used.
2.2. Temporal Stream ConvNet
Temporal Stream ConvNet accuracy on UCF-101 (split 1).
Mean flow subtraction is useful, as it consistently brings an improvement.
The bi-directional optical flow is slightly better than a uni-directional forward flow.
However, bi-directional optical flow is not used later on, since performance drops when the temporal network is fused with the spatial stream ConvNet.
(There are further passages in the paper describing the different types of optical flow mentioned above. If interested, please feel free to read the paper.)
2.3. Multi-Task Learning
Temporal ConvNet accuracy on HMDB-51
Using multi-task learning is better than training from scratch or pre-training on either dataset alone.
2.4. Two-Stream ConvNet
Two-Stream ConvNet accuracy on UCF-101 (split 1)
The softmax scores are fused using either averaging or a linear SVM.
SVM-based fusion of softmax scores outperforms fusion by averaging.
Using bi-directional flow is not beneficial in the case of ConvNet fusion.
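For concreteness, a scikit-learn sketch of the two fusion strategies (function names are my own; for the SVM variant the paper trains a multi-class linear SVM on stacked L2-normalised softmax scores):

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

# spatial_scores, temporal_scores: (n_videos, n_classes) softmax outputs.
def fuse_by_averaging(spatial_scores, temporal_scores):
    # Fusion by averaging the two streams' softmax scores.
    return (spatial_scores + temporal_scores) / 2

def fuse_by_svm(spatial_scores, temporal_scores, labels):
    # Fusion by training a multi-class linear SVM on stacked
    # L2-normalised softmax scores as features.
    X = np.hstack([normalize(spatial_scores), normalize(temporal_scores)])
    return LinearSVC(C=1.0).fit(X, labels)
```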
2.5. SOTA Comparison
Mean accuracy (over three splits) on UCF-101 and HMDB-51
As can be seen from the above table, both the spatial and temporal nets alone outperform the deep architectures of [14, 16] by a large margin.
The combination of the two nets further improves the results (in line with the single-split experiments above), and is comparable to the very recent state-of-the-art hand-crafted models.