[Paper] P3D: Pseudo-3D Residual Networks (Video Classification & Action Recognition)

Factorized 3D Convolutions, Outperforms Deep Video & C3D

Sik-Ho Tsang
4 min read · Nov 8, 2020

In this story, Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks (P3D), by the University of Science and Technology of China and Microsoft Research, is briefly presented.

  • 3D CNNs are computationally and memory expensive.

In this paper:

  • 3×3×3 convolutions are factorized into 1×3×3 convolutional filters on the spatial domain (equivalent to a 2D CNN) plus 3×1×1 convolutions that build temporal connections across adjacent feature maps in time.
  • Bottleneck building blocks are constructed from these factorized convolutions, forming a network named Pseudo-3D Residual Net (P3D ResNet).
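
To see the size reduction concretely, here is a quick back-of-the-envelope count of the weights in one factorized pair versus one full 3×3×3 kernel (the channel count is illustrative, not from the paper):

```python
C = 256  # illustrative channel count (not from the paper)

params_3d = 3 * 3 * 3 * C * C                       # one full 3x3x3 kernel
params_p3d = 1 * 3 * 3 * C * C + 3 * 1 * 1 * C * C  # 1x3x3 spatial + 3x1x1 temporal

print(params_3d, params_p3d, params_3d / params_p3d)
# 1769472 786432 2.25 -> the factorized pair uses ~2.25x fewer weights
```

The same ratio holds at any channel width, since both counts scale with C².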

This is a paper in 2017 ICCV with over 500 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Pseudo-3D (P3D) Convolution
  2. P3D ResNet Block Variants
  3. Experimental Results

1. Pseudo-3D (P3D) Convolution

Pseudo-3D (P3D) Convolution blocks.
  • The size of 3D convolutional filters is denoted as d×k×k where d is the temporal depth of kernel and k is the kernel spatial size.
  • 3D convolutional filters of size 3×3×3 can be naturally decoupled into 1×3×3 convolutional filters (equivalent to a 2D CNN on the spatial domain) and 3×1×1 convolutional filters (like a 1D CNN tailored to the temporal domain). (The idea is similar to the spatial factorization of convolutions in Inception-v3, here applied across space and time.)
  • Such decoupled 3D convolutions are regarded as a Pseudo-3D CNN.
  • This not only reduces the model size significantly, but also enables pre-training the 2D spatial convolutions on image data, endowing the Pseudo-3D CNN with more power to leverage the knowledge of scenes and objects learnt from images.
  • P3D-A: The first design considers a stacked architecture, making the temporal 1D filters (T) follow the spatial 2D filters (S) in a cascaded manner.
  • P3D-B: The second design places both filters on different pathways in a parallel fashion.
  • P3D-C: The last design is a compromise between P3D-A and P3D-B, simultaneously building direct influences among S, T and the final output.
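
As a minimal PyTorch sketch (my own illustration; the channel count and shapes are hypothetical), the three variants differ only in how the spatial conv S and temporal conv T are composed inside the residual unit:

```python
import torch
import torch.nn as nn

C = 64  # hypothetical channel count

S = nn.Conv3d(C, C, kernel_size=(1, 3, 3), padding=(0, 1, 1))  # spatial, 2D-like
T = nn.Conv3d(C, C, kernel_size=(3, 1, 1), padding=(1, 0, 0))  # temporal, 1D-like

def p3d_a(x):              # cascaded: T follows S
    return x + T(S(x))

def p3d_b(x):              # parallel: independent pathways, summed
    return x + S(x) + T(x)

def p3d_c(x):              # compromise: S reaches both T and the output
    s = S(x)
    return x + s + T(s)

x = torch.randn(1, C, 8, 32, 32)  # (batch, channels, frames, height, width)
assert p3d_a(x).shape == p3d_b(x).shape == p3d_c(x).shape == x.shape
```

Nonlinearities and batch normalization between S and T are omitted here for brevity.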

2. P3D ResNet Block Variants

P3D ResNet Block
  • As in the bottleneck block of ResNet, two 1×1×1 convolutions are additionally placed at both ends of the path, responsible for first reducing and then restoring the dimensions (see the sketch after this list).
Model size, speed, and accuracy on UCF101 (split 1).
  • The architecture of ResNet-50 is fine-tuned on UCF101 videos. The input is a 224×224 image randomly cropped from the resized 240×320 video frame.
  • For each P3D ResNet variant, the dimension of the input video clip is set to 16×160×160, randomly cropped from a resized, non-overlapping 16-frame clip of size 16×182×242.
  • Overall, all three P3D ResNet variants (i.e., P3D-A ResNet, P3D-B ResNet and P3D-C ResNet) exhibit better performance than ResNet-50.
  • P3D ResNet: Residual Units are replaced with a chain of P3D blocks in the cyclic order P3D-A → P3D-B → P3D-C.
  • There are absolute improvements over P3D-A ResNet, P3D-B ResNet and P3D-C ResNet of 0.5%, 1.4% and 1.2% in accuracy respectively, indicating that enhancing structural diversity improves performance.
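
A sketch of one such bottleneck unit, in P3D-A form (a minimal illustration; the channel widths and BN/ReLU placement are my assumptions following the standard ResNet pattern, not necessarily the paper's exact configuration):

```python
import torch
import torch.nn as nn

class P3DABottleneck(nn.Module):
    """1x1x1 reduce -> 1x3x3 spatial -> 3x1x1 temporal -> 1x1x1 expand."""

    def __init__(self, channels: int, mid: int):
        super().__init__()
        self.reduce = nn.Conv3d(channels, mid, kernel_size=1)
        self.spatial = nn.Conv3d(mid, mid, (1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(mid, mid, (3, 1, 1), padding=(1, 0, 0))
        self.expand = nn.Conv3d(mid, channels, kernel_size=1)
        self.bns = nn.ModuleList([nn.BatchNorm3d(mid), nn.BatchNorm3d(mid),
                                  nn.BatchNorm3d(mid), nn.BatchNorm3d(channels)])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bns[0](self.reduce(x)))
        out = self.relu(self.bns[1](self.spatial(out)))   # S
        out = self.relu(self.bns[2](self.temporal(out)))  # T follows S (P3D-A)
        out = self.bns[3](self.expand(out))
        return self.relu(out + x)  # residual shortcut

block = P3DABottleneck(channels=256, mid=64)
clip = torch.randn(2, 256, 8, 32, 32)  # (N, C, frames, H, W)
print(block(clip).shape)               # torch.Size([2, 256, 8, 32, 32])
```

The full P3D ResNet would then interleave A-, B- and C-shaped units of this kind in the cyclic order described above.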

3. Experimental Results

3.1. Sports-1M

Top-1 clip-level accuracy and Top-1&5 video-level accuracy on Sports-1M.
  • A deeper 152-layer ResNet is used.
  • Sports-1M contains about 1.13 million videos annotated with 487 sports labels. The official split is used, i.e., 70%, 10% and 20% for the training, validation and test sets.
  • P3D outperforms SOTA approaches such as Deep Video and C3D.

3.2. Other Datasets

Performance comparisons with the state-of-the-art methods on UCF101 (3 splits).
Performance comparisons in terms of Top-1 & Top-3 classification accuracy, and mean AP on ActivityNet.
  • Basically, P3D ResNet, which utilizes 2D spatial convolutions plus 1D temporal convolutions, exhibits significantly better performance than C3D, which directly uses 3D spatio-temporal convolutions.
Action similarity labeling performance on the ASLAN benchmark.
  • P3D ResNet, which pre-trains the 2D spatial convolutions on image data and learns the 1D temporal convolutions on video data, fully leverages the knowledge from both domains, successfully boosting performance.
The accuracy performance of scene recognition on the Dynamic Scene and YUPENN sets.
  • P3D ResNet performs consistently better than both hand-crafted features and CNN-based representations.

3.3. Video Representation Embedding Visualization

  • The video-level representation is projected into 2-dimensional space using t-SNE.
  • It is clear that the video representations from P3D ResNet are semantically better separated than those of ResNet-152.
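
For reference, such a projection can be produced with scikit-learn roughly as follows (the features and labels below are random placeholders standing in for the video-level representations and UCF101 classes):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholders for video-level features and class labels.
features = np.random.randn(500, 2048)    # e.g., pooled activations per video
labels = np.random.randint(0, 101, 500)  # e.g., UCF101 class indices

# Project the high-dimensional features to 2D for visualization.
xy = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)

plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=4, cmap="tab20")
plt.title("Video-level representations (t-SNE)")
plt.show()
```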
