[Paper] P3D: Pseudo-3D Residual Networks (Video Classification & Action Recognition)

Factorized 3D Convolutions, Outperforms Deep Video & C3D

Sik-Ho Tsang
4 min read · Nov 8, 2020

In this story, Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks (P3D), by the University of Science and Technology of China and Microsoft Research, is briefly presented.

  • 3D CNNs are computationally and memory expensive.

In this paper:

  • 3×3×3 convolutions are factorized into 1×3×3 convolutional filters on the spatial domain (equivalent to a 2D CNN) plus 3×1×1 convolutions that build temporal connections across adjacent feature maps in time.
  • Bottleneck building blocks are constructed from these factorized convolutions, forming a network named Pseudo-3D Residual Net (P3D ResNet).
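
To see the size reduction concretely, here is a quick back-of-the-envelope count of the weights in one factorized pair versus one full 3×3×3 kernel (the channel count is illustrative, not from the paper):

```python
C = 256  # illustrative channel count (not from the paper)

params_3d = 3 * 3 * 3 * C * C                       # one full 3x3x3 kernel
params_p3d = 1 * 3 * 3 * C * C + 3 * 1 * 1 * C * C  # 1x3x3 spatial + 3x1x1 temporal

print(params_3d, params_p3d, params_3d / params_p3d)
# 1769472 786432 2.25 -> the factorized pair uses ~2.25x fewer weights
```

The same ratio holds at any channel width, since both counts scale with C².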

This is a paper in 2017 ICCV with over 500 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Pseudo-3D (P3D) Convolution
  2. P3D ResNet Block Variants
  3. Experimental Results

1. Pseudo-3D (P3D) Convolution

Pseudo-3D (P3D) Convolution blocks.
  • The size of 3D convolutional filters is denoted as d×k×k where d is the temporal depth of kernel and k is the kernel spatial size.
  • 3D convolutional filters of size 3×3×3 can be naturally decoupled into 1×3×3 convolutional filters (equivalent to a 2D CNN on the spatial domain) and 3×1×1 convolutional filters (like a 1D CNN tailored to the temporal domain). (The idea is similar to the spatial factorization of convolutions in Inception-v3, here applied across space and time.)
  • Such decoupled 3D convolutions are regarded as a Pseudo-3D CNN.
  • This not only reduces the model size significantly, but also enables pre-training the 2D spatial convolutions on image data, endowing the Pseudo-3D CNN with more power to leverage the knowledge of scenes and objects learnt from images.
  • P3D-A: The first design considers a stacked architecture, making the temporal 1D filters (T) follow the spatial 2D filters (S) in a cascaded manner.
  • P3D-B: The second design places both filters on different pathways in a parallel fashion.
  • P3D-C: The last design is a compromise between P3D-A and P3D-B, simultaneously building direct influences among S, T and the final output.
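
As a minimal PyTorch sketch (my own illustration; the channel count and shapes are hypothetical), the three variants differ only in how the spatial conv S and temporal conv T are composed inside the residual unit:

```python
import torch
import torch.nn as nn

C = 64  # hypothetical channel count

S = nn.Conv3d(C, C, kernel_size=(1, 3, 3), padding=(0, 1, 1))  # spatial, 2D-like
T = nn.Conv3d(C, C, kernel_size=(3, 1, 1), padding=(1, 0, 0))  # temporal, 1D-like

def p3d_a(x):              # cascaded: T follows S
    return x + T(S(x))

def p3d_b(x):              # parallel: independent pathways, summed
    return x + S(x) + T(x)

def p3d_c(x):              # compromise: S reaches both T and the output
    s = S(x)
    return x + s + T(s)

x = torch.randn(1, C, 8, 32, 32)  # (batch, channels, frames, height, width)
assert p3d_a(x).shape == p3d_b(x).shape == p3d_c(x).shape == x.shape
```

Nonlinearities and batch normalization between S and T are omitted here for brevity.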

2. P3D ResNet Block Variants

P3D ResNet Block
  • As in the bottleneck block of ResNet, two 1×1×1 convolutions are additionally placed at both ends of the path, responsible for first reducing and then restoring the dimensions (see the sketch after this list).
Model size, speed, and accuracy on UCF101 (split 1).
  • The architecture of ResNet-50 is fine-tuned on UCF101 videos. The input is a 224×224 image randomly cropped from the resized 240×320 video frame.
  • For each P3D ResNet variant, the dimension of the input video clip is set to 16×160×160, randomly cropped from a resized, non-overlapping 16-frame clip of size 16×182×242.
  • Overall, all three P3D ResNet variants (i.e., P3D-A ResNet, P3D-B ResNet and P3D-C ResNet) exhibit better performance than ResNet-50.
  • P3D ResNet: Residual Units are replaced with a chain of P3D blocks in the cyclic order P3D-A → P3D-B → P3D-C.
  • There are absolute improvements over P3D-A ResNet, P3D-B ResNet and P3D-C ResNet of 0.5%, 1.4% and 1.2% in accuracy respectively, indicating that enhancing structural diversity improves performance.
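
A sketch of one such bottleneck unit, in P3D-A form (a minimal illustration; the channel widths and BN/ReLU placement are my assumptions following the standard ResNet pattern, not necessarily the paper's exact configuration):

```python
import torch
import torch.nn as nn

class P3DABottleneck(nn.Module):
    """1x1x1 reduce -> 1x3x3 spatial -> 3x1x1 temporal -> 1x1x1 expand."""

    def __init__(self, channels: int, mid: int):
        super().__init__()
        self.reduce = nn.Conv3d(channels, mid, kernel_size=1)
        self.spatial = nn.Conv3d(mid, mid, (1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(mid, mid, (3, 1, 1), padding=(1, 0, 0))
        self.expand = nn.Conv3d(mid, channels, kernel_size=1)
        self.bns = nn.ModuleList([nn.BatchNorm3d(mid), nn.BatchNorm3d(mid),
                                  nn.BatchNorm3d(mid), nn.BatchNorm3d(channels)])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bns[0](self.reduce(x)))
        out = self.relu(self.bns[1](self.spatial(out)))   # S
        out = self.relu(self.bns[2](self.temporal(out)))  # T follows S (P3D-A)
        out = self.bns[3](self.expand(out))
        return self.relu(out + x)  # residual shortcut

block = P3DABottleneck(channels=256, mid=64)
clip = torch.randn(2, 256, 8, 32, 32)  # (N, C, frames, H, W)
print(block(clip).shape)               # torch.Size([2, 256, 8, 32, 32])
```

The full P3D ResNet would then interleave A-, B- and C-shaped units of this kind in the cyclic order described above.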

3. Experimental Results

3.1. Sports-1M

Top-1 clip-level accuracy and Top-1&5 video-level accuracy on Sports-1M.
  • A deeper 152-layer ResNet is used.
  • Sports-1M contains about 1.13 million videos annotated with 487 sports labels. The official split is used, i.e., 70%, 10% and 20% for the training, validation and test sets.
  • P3D outperforms SOTA approaches such as Deep Video and C3D.

3.2. Other Datasets

Performance comparisons with the state-of-the-art methods on UCF101 (3 splits).
Performance comparisons in terms of Top-1 & Top-3 classification accuracy, and mean AP on ActivityNet.
  • Basically, P3D ResNet, which utilizes 2D spatial convolutions plus 1D temporal convolutions, exhibits significantly better performance than C3D, which directly uses 3D spatio-temporal convolutions.
Action similarity labeling performance on the ASLAN benchmark.
  • P3D ResNet, which pre-trains the 2D spatial convolutions on image data and learns the 1D temporal convolutions on video data, fully leverages the knowledge from both domains, successfully boosting performance.
The accuracy performance of scene recognition on the Dynamic Scene and YUPENN sets.
  • P3D ResNet performs consistently better than both hand-crafted features and CNN-based representations.

3.3. Video Representation Embedding Visualization

  • The video-level representation is projected into 2-dimensional space using t-SNE.
  • It is clear that the video representations from P3D ResNet are semantically better separated than those of ResNet-152.
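
For reference, such a projection can be produced with scikit-learn roughly as follows (the features and labels below are random placeholders standing in for the video-level representations and UCF101 classes):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholders for video-level features and class labels.
features = np.random.randn(500, 2048)    # e.g., pooled activations per video
labels = np.random.randint(0, 101, 500)  # e.g., UCF101 class indices

# Project the high-dimensional features to 2D for visualization.
xy = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)

plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=4, cmap="tab20")
plt.title("Video-level representations (t-SNE)")
plt.show()
```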
