Review — Non-local Neural Networks (Video Classification)

Space, Time, & Spacetime Long-Range Interactions Captured by Non-local Neural Networks

A spacetime non-local operation in a network trained for video classification on Kinetics. In this example, computed by the proposed model, note how the operation relates the ball in the first frame to the ball in the last two frames.


1. Motivation of Non-Local Blocks

2. Generic Definition of Non-Local Blocks

2.1. Generic Definition
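The generic non-local operation computes each output position as a normalized, weighted sum over all input positions: y_i = (1 / C(x)) * sum_j f(x_i, x_j) * g(x_j). The following is a minimal sketch of that formula in plain Python, not the authors' implementation. The names `non_local`, `gaussian`, and `identity` are illustrative, features are assumed to be plain lists of floats, and the normalizer is assumed to be C(x) = sum_j f(x_i, x_j), the choice used for the Gaussian variants (the dot-product and concatenation variants use C(x) = N instead).

```python
import math

def non_local(x, f, g):
    """Generic non-local operation over a list of feature vectors x.

    y_i = (1 / C(x)) * sum_j f(x_i, x_j) * g(x_j)
    Assumption: C(x) = sum_j f(x_i, x_j), as for the Gaussian variants.
    """
    values = [g(xj) for xj in x]          # unary transform of every position
    dim = len(values[0])
    out = []
    for xi in x:
        # Pairwise affinity between position i and every position j.
        weights = [f(xi, xj) for xj in x]
        norm = sum(weights)               # normalization factor C(x)
        yi = [
            sum(w * v[d] for w, v in zip(weights, values)) / norm
            for d in range(dim)
        ]
        out.append(yi)
    return out

def gaussian(xi, xj):
    """Gaussian pairwise function: f(x_i, x_j) = exp(x_i . x_j)."""
    return math.exp(sum(a * b for a, b in zip(xi, xj)))

def identity(xj):
    """Simplest unary function for illustration: g(x_j) = x_j."""
    return list(xj)
```

For example, `non_local([[1.0, 0.0], [0.0, 1.0]], gaussian, identity)` returns one output vector per input position, each a weighted average of all positions. Because every position attends to every other position, the cost grows quadratically with the number of positions.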

2.2. Differences From Convolutional, Recurrent, Fully Connected Operations

3. Instantiations of Non-Local Blocks

3.1. Unary Function g(xj)

3.2. Pairwise Function f(xi, xj)

3.2.1. Gaussian

3.2.2. Embedded Gaussian

3.2.3. Dot Product

3.2.4. Concatenation

4. Non-Local Blocks

A spacetime non-local block.
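A non-local block wraps the non-local operation in a residual connection: z_i = W_z y_i + x_i. The following is a minimal sketch of that wrapping step only, assuming the non-local output y has already been computed and that W_z is given as a nested list (one row per output channel). The function name `non_local_block` and its parameter names are illustrative, not taken from the paper's code.

```python
def non_local_block(x, y, w_z):
    """Residual wrapping of a non-local output: z_i = W_z y_i + x_i.

    x   : list of input feature vectors (one per position)
    y   : list of non-local outputs, same number of positions as x
    w_z : weight matrix as nested lists; row count must match len(x[i])
    """
    z = []
    for xi, yi in zip(x, y):
        # Linear projection W_z y_i back to the input channel dimension.
        projected = [sum(w * v for w, v in zip(row, yi)) for row in w_z]
        # Residual connection adds the original input features.
        z.append([p + a for p, a in zip(projected, xi)])
    return z
```

If W_z is initialized to zero, the block reduces to the identity mapping, so it can be inserted into a pretrained network without changing its initial behavior.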

4.1. An Illustrative Example

4.2. Non-Local Blocks for Subsampling

5. Ablation Study on Video Classification

Baseline ResNet-50 C2D model

5.1. Instantiations


5.2. Which stage to add non-local blocks?


5.3. Going deeper with non-local blocks

Deeper non-local models

5.4. Non-local in spacetime

Space vs. time vs. spacetime

5.5. Non-local net vs. 3D ConvNet

Non-local net vs. 3D ConvNet

5.6. Non-local 3D ConvNet

Non-local 3D ConvNet

5.7. Longer Sequences

Longer clips

6. SOTA Comparison on Video Classification

6.1. Kinetics

Comparisons with state-of-the-art results in Kinetics, reported on the val and test sets

6.2. Charades

Classification mAP (%) in the Charades dataset

7. Experimental Results on COCO

7.1. Object Detection and Instance Segmentation

Adding 1 non-local block to Mask R-CNN for COCO object detection (Left) and instance segmentation (Right)

7.2. Keypoint Detection

Adding non-local blocks to Mask R-CNN for COCO keypoint detection.

