[Paper] P3D: Pseudo-3D Residual Networks (Video Classification & Action Recognition)
In this story, Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks (P3D), by the University of Science and Technology of China and Microsoft Research, is briefly presented.
- 3D CNNs are computationally and memory expensive.
In this paper:
- 3×3×3 convolutions are simulated with 1×3×3 convolutional filters on the spatial domain (equivalent to 2D CNN) plus 3×1×1 convolutions that construct temporal connections across adjacent feature maps in time.
- These decoupled filters are assembled into bottleneck building blocks, which form a network named Pseudo-3D Residual Net (P3D ResNet).
This is a paper in 2017 ICCV with over 500 citations. (Sik-Ho Tsang @ Medium)
1. Pseudo-3D (P3D) Convolution
- The size of a 3D convolutional filter is denoted as d×k×k, where d is the temporal depth of the kernel and k is the kernel's spatial size.
- A 3D convolutional filter of size 3×3×3 can be naturally decoupled into a 1×3×3 convolutional filter, equivalent to a 2D CNN on the spatial domain, and a 3×1×1 convolutional filter, like a 1D CNN tailored to the temporal domain. (This idea is similar to the filter factorization in Inception-v3, which is its 2D counterpart.)
- Such decoupled 3D convolutions are regarded as a Pseudo-3D CNN.
- This not only reduces the model size significantly, but also enables pre-training the 2D spatial filters on image data, endowing the Pseudo-3D CNN with more power to leverage the knowledge of scenes and objects learnt from images.
- P3D-A: The first design considers a stacked architecture by making the temporal 1D filter (T) follow the spatial 2D filter (S) in a cascaded manner.
- P3D-B: The second design places the two filters on different pathways in a parallel fashion.
- P3D-C: The last design is a compromise between P3D-A and P3D-B, simultaneously building direct influences among S, T and the final output. (A sketch of all three variants follows below.)
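The decomposition and the three residual formulations can be made concrete with a short sketch. The following PyTorch code is a minimal illustration, not the authors' implementation: the 64-channel width, the module names (P3DUnit, S, T) and the test shapes are assumptions; only the residual forms for A, B and C follow the paper.

```python
# A minimal sketch of the pseudo-3D decomposition and the three block variants.
# S is the 1x3x3 spatial filter, T the 3x1x1 temporal filter.
import torch
import torch.nn as nn

class P3DUnit(nn.Module):
    def __init__(self, channels=64, variant="A"):  # channel width is an assumption
        super().__init__()
        self.variant = variant
        # Spatial 1x3x3 conv: acts like a 2D conv applied frame by frame,
        # so its weights can be initialized from an image-pre-trained 2D CNN.
        self.S = nn.Conv3d(channels, channels, (1, 3, 3), padding=(0, 1, 1))
        # Temporal 3x1x1 conv: a 1D conv linking adjacent frames in time.
        self.T = nn.Conv3d(channels, channels, (3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):
        if self.variant == "A":    # cascaded: T follows S
            return x + self.T(self.S(x))
        if self.variant == "B":    # parallel: S and T on separate pathways
            return x + self.S(x) + self.T(x)
        if self.variant == "C":    # compromise: S feeds both T and the output
            s = self.S(x)
            return x + s + self.T(s)

x = torch.randn(1, 64, 16, 160, 160)  # (batch, C, frames, H, W)
for v in "ABC":
    assert P3DUnit(variant=v)(x).shape == x.shape

# Parameter saving vs. a full 3x3x3 conv: 64*64*(9+3) vs. 64*64*27 weights.
```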
2. P3D ResNet Block Variants
- As in the ResNet bottleneck block, two 1×1×1 convolutions are additionally placed at both ends of the path; they are responsible for reducing and then restoring the channel dimensions (a sketch of such a block follows after this list).
- As a 2D baseline, the architecture of ResNet-50 is fine-tuned on UCF101 videos; the input is a 224×224 image randomly cropped from the resized 240×320 video frame.
- For each P3D ResNet variant, the input video clip dimension is set as 16×160×160, randomly cropped from a resized, non-overlapping 16-frame clip of size 16×182×242 (see the cropping sketch below).
- Overall, all three P3D ResNet variants (i.e., P3D-A ResNet, P3D-B ResNet and P3D-C ResNet) exhibit better performance than ResNet-50.
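To make the bottleneck design concrete, here is a hedged PyTorch sketch of a P3D-A style bottleneck. The 4× channel ratio and the BatchNorm/ReLU placement are assumptions in the spirit of ResNet; only the 1×1×1 reduce/restore convolutions around the decomposed S and T filters come from the description above.

```python
# A sketch of a P3D-A bottleneck: 1x1x1 convs reduce and then restore channels
# around the decomposed spatial (S) and temporal (T) filters.
import torch
import torch.nn as nn

class P3DABottleneck(nn.Module):
    def __init__(self, channels=256, reduced=64):  # 4x ratio assumed from ResNet
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(channels, reduced, kernel_size=1),                # reduce dims
            nn.BatchNorm3d(reduced), nn.ReLU(inplace=True),
            nn.Conv3d(reduced, reduced, (1, 3, 3), padding=(0, 1, 1)),  # S: spatial
            nn.BatchNorm3d(reduced), nn.ReLU(inplace=True),
            nn.Conv3d(reduced, reduced, (3, 1, 1), padding=(1, 0, 0)),  # T: temporal
            nn.BatchNorm3d(reduced), nn.ReLU(inplace=True),
            nn.Conv3d(reduced, channels, kernel_size=1),                # restore dims
            nn.BatchNorm3d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Residual connection around the bottleneck path, as in ResNet.
        return self.relu(x + self.block(x))

blk = P3DABottleneck()
out = blk(torch.randn(2, 256, 16, 56, 56))  # (batch, C, frames, H, W)
```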
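And a small illustrative sketch of the clip cropping described above, assuming a (C, frames, H, W) tensor layout (the function name and random source are hypothetical):

```python
# Randomly crop a resized 16x182x242 clip down to the 16x160x160 network input.
import torch

def random_crop_clip(clip, crop_h=160, crop_w=160):
    """clip: (C, T, H, W) tensor, e.g. (3, 16, 182, 242) -> (3, 16, 160, 160)."""
    _, _, h, w = clip.shape
    top = torch.randint(0, h - crop_h + 1, (1,)).item()
    left = torch.randint(0, w - crop_w + 1, (1,)).item()
    return clip[:, :, top:top + crop_h, left:left + crop_w]

clip = torch.randn(3, 16, 182, 242)  # a non-overlapping 16-frame clip, resized
print(random_crop_clip(clip).shape)  # torch.Size([3, 16, 160, 160])
```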
3. Experimental Results
3.1. Sports-1M
- For the full experiments, a deeper 152-layer ResNet is used.
- Sports-1M contains about 1.13 million videos annotated with 487 sports labels. The official split is used, i.e., 70%, 10% and 20% for the training, validation and test sets.
- P3D ResNet outperforms state-of-the-art approaches such as Deep Video and C3D.
3.2. Other Datasets
- Basically, P3D ResNet, which utilizes 2D spatial convolutions plus 1D temporal convolutions, exhibits significantly better performance than C3D, which directly uses 3D spatio-temporal convolutions.
- By pre-training the 2D spatial convolutions on image data and learning the 1D temporal convolutions on video data, P3D ResNet fully leverages the knowledge from both domains, successfully boosting performance.
- P3D ResNet performs consistently better than both hand-crafted features and CNN-based representations.