Review: FCGN — Fully Convolutional GoogLeNet (Human Pose Estimation)
Outperforms Tompson CVPR’15 and DeepPose, Comparable to CPM with Lower Memory Footprint
In this story, FCGN (Fully Convolutional GoogLeNet), by RWTH Aachen University and University of Bonn, is briefly reviewed.
- By modifying the batch-normalized GoogLeNet, i.e. BN-Inception / Inception-v2, into a fully convolutional network (FCN), the network can be trained on a mid-range GPU with a low memory footprint.
- Another advantage is that the network can be trained from scratch on the target dataset without any pre-training and still obtain good performance.
This is a 2016 BMVC paper with more than 50 citations. (Sik-Ho Tsang @ Medium)
Outline
- FCGN Architecture
- HalfRes FCGN & FullRes FCGN
- Training Loss & Augmentation
- Experimental Results
1. FCGN Architecture
- The first 17 layers of BN-Inception / Inception-v2 are used; the average pooling, dropout, linear and softmax layers from the last stages of the network are removed.
- A skip connection is added to combine feature maps from layer 13 with feature maps from layer 17.
- The feature maps from layer 17 are upsampled to the resolution of the feature maps from layer 13 by a deconvolution filter with kernel size and stride of 2.
- The output of FCGN thus consists of coarse feature maps from layers 13 and 17 that have 16× lower resolution than the input image, due to max/average pooling by an overall factor of 16.
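To make the skip connection concrete, below is a minimal PyTorch sketch of the trunk. It is not the authors' code: the placeholder stages, the channel counts (576 and 1024) and the use of concatenation to combine the layer-13 and layer-17 feature maps are all assumptions; only the 1/16 output stride and the size-2/stride-2 deconvolution follow the description above.

```python
import torch
import torch.nn as nn

class FCGNBackbone(nn.Module):
    """Sketch of the FCGN trunk: a BN-Inception-like feature extractor cut
    after its 17th layer, with a skip connection from layer 13 to layer 17.
    The placeholder stages and channel sizes are assumptions and do not
    reproduce the actual BN-Inception layers."""

    def __init__(self, c13=576, c17=1024):
        super().__init__()
        # Placeholder standing in for BN-Inception layers 1-13
        # (overall stride 16 with respect to the input).
        self.layers_1_to_13 = nn.Sequential(
            nn.Conv2d(3, c13, kernel_size=7, stride=16, padding=3),
            nn.BatchNorm2d(c13),
            nn.ReLU(inplace=True),
        )
        # Placeholder standing in for layers 14-17 (a further stride-2 pooling,
        # so these maps are at 1/32 of the input resolution).
        self.layers_14_to_17 = nn.Sequential(
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(c13, c17, kernel_size=3, padding=1),
            nn.BatchNorm2d(c17),
            nn.ReLU(inplace=True),
        )
        # Deconvolution with filter size 2 and stride 2 brings the layer-17
        # maps back to the layer-13 resolution, as described above.
        self.up_17_to_13 = nn.ConvTranspose2d(c17, c17, kernel_size=2, stride=2)

    def forward(self, x):
        f13 = self.layers_1_to_13(x)       # 1/16 of the input resolution
        f17 = self.layers_14_to_17(f13)    # 1/32 of the input resolution
        f17_up = self.up_17_to_13(f17)     # upsampled back to 1/16
        # Combining the two feature maps by concatenation is an assumption;
        # element-wise addition would be the other natural choice.
        return torch.cat([f13, f17_up], dim=1)
```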
2. HalfRes FCGN & FullRes FCGN
- Both the HalfRes and FullRes FCGNs share the same weights.
- Each FCGN takes the same image at a different resolution.
- The feature maps from the HalfRes FCGN are upsampled to the resolution of the FullRes FCGN feature maps by a deconvolution filter with filter size and stride of 2.
- The coarse feature maps from HalfRes FCGN and FullRes FCGN are then directly upsampled to belief maps for different body joints by using a larger deconvolution filter of size 32 and stride 16.
- The belief maps are then normalized by using a sigmoid.
- Spatial dropout is used before upsampling to further regularize the network.
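Here is a hedged sketch of how the two resolution streams share weights and how the belief-map head is attached, reusing the FCGNBackbone placeholder from Section 1. The joint count, dropout rate, feature channel count and the concatenation of the two streams are assumptions; the size-2/stride-2 and size-32/stride-16 deconvolutions, the spatial dropout and the sigmoid follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResFCGN(nn.Module):
    """Sketch of the HalfRes/FullRes FCGN pair and the belief-map head.
    The single `backbone` module is applied to both resolutions, which is
    what weight sharing amounts to in practice."""

    def __init__(self, backbone, feat_channels=1600, num_joints=14):
        super().__init__()
        self.backbone = backbone
        # Deconvolution with filter size 2 and stride 2 upsamples the HalfRes
        # feature maps to the resolution of the FullRes feature maps.
        self.up_half = nn.ConvTranspose2d(feat_channels, feat_channels,
                                          kernel_size=2, stride=2)
        # Spatial dropout before the final upsampling (rate is an assumption).
        self.spatial_dropout = nn.Dropout2d(p=0.5)
        # Large deconvolution (filter size 32, stride 16) maps the coarse
        # features directly to one belief map per body joint.
        self.to_beliefs = nn.ConvTranspose2d(2 * feat_channels, num_joints,
                                             kernel_size=32, stride=16,
                                             padding=8)

    def forward(self, img_full):
        img_half = F.interpolate(img_full, scale_factor=0.5,
                                 mode='bilinear', align_corners=False)
        f_full = self.backbone(img_full)                 # FullRes stream
        f_half = self.up_half(self.backbone(img_half))   # HalfRes stream, upsampled
        feats = torch.cat([f_full, f_half], dim=1)       # fuse the two streams
        feats = self.spatial_dropout(feats)
        return torch.sigmoid(self.to_beliefs(feats))     # normalized belief maps

# Usage sketch (assumes FCGNBackbone from the Section 1 sketch is in scope):
# model = MultiResFCGN(FCGNBackbone(), feat_channels=1600, num_joints=14)
# beliefs = model(torch.randn(1, 3, 256, 256))   # -> (1, 14, 256, 256)
```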
3. Training Loss & Augmentation
3.1. Loss Function
- A training example is a pair {I, {bj(x,y)}}, where I is the input image and bj(x,y) is the ground-truth 2D belief map for joint j, containing the per-pixel likelihood of joint j at each pixel (x,y) in image I of width w and height h.
- As we are mainly interested in learning the appearances of different body joints (1’s) and not background (0's), an error function with an inherent property of focusing on target values 1’s only is used.
- The binary cross entropy between the ground-truth and predicted belief maps for k joints in each training image I is minimized:
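The equation is not reproduced in this post; as a sketch built from the definitions above, a per-pixel binary cross entropy summed over the k joints would read (any normalization constant used by the authors is omitted):

$$
\mathcal{L}(I) = -\sum_{j=1}^{k}\sum_{x=1}^{w}\sum_{y=1}^{h}\Big[\,b_j(x,y)\,\log \hat{b}_j(x,y) + \big(1-b_j(x,y)\big)\,\log\big(1-\hat{b}_j(x,y)\big)\Big]
$$

where \(\hat{b}_j(x,y)\) is the sigmoid-normalized belief map predicted by the network for joint j.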
3.2. Data Augmentation
- A warp W, composed of three transformation matrices, is applied around the center position of the person of interest (a sketch of the composed warp is given after this list):
- The rightmost warp first shifts the perturbed center of the person (xp+tx, yp+ty) to the upper left corner of the image, which produces the effect of applying a random translation.
- The middle warp then applies random scaling s, rotation (cosθ, sinθ) and reflection r.
- The leftmost warp then shifts the warped person back to the center (cx, cy) of the cropped image.
- Scaling ∈ [0.5, 1.5], translation ∈ [-20, 20], rotation ∈ [-20°, 20°] and horizontal flipping with probability 0.5 are used for data augmentation.
- Some warped joints may fall outside the image boundary; only warps for which all warped joints end up inside the image boundary are kept.
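For reference, here is a sketch of the composed warp described in the bullets above; the exact placement of the reflection factor r inside the middle matrix is an assumption:

$$
W =
\begin{pmatrix} 1 & 0 & c_x \\ 0 & 1 & c_y \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} r\,s\cos\theta & -s\sin\theta & 0 \\ r\,s\sin\theta & s\cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 & -(x_p+t_x) \\ 0 & 1 & -(y_p+t_y) \\ 0 & 0 & 1 \end{pmatrix}
$$

The right matrix shifts the perturbed person center to the origin, the middle matrix scales, rotates and reflects, and the left matrix shifts the result back to the crop center (cx, cy).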
4. Experimental Results
4.1. FLIC Dataset
- The FLIC dataset consists of 3,987 training images and 1,016 test images.
- Training on the FLIC dataset takes 10 hours on a GTX 980 GPU.
- All training and test images are normalized to the same scale by rescaling the height of the detected torso in each image to 200 pixels (a small sketch of this step is given after this list).
- Using the scale information improves performance, especially for wrists, from 89.66 to 94.88.
- FCGN outperforms Tompson CVPR’15 [28] and DeepPose [29].
- The result is close to CPM [30], but FCGN has the advantages of not using any background model or implicit spatial dependencies, and its low memory footprint of 3 GB allows it to run on a moderate-range GPU, in contrast to the CPM model with its high memory footprint of 6 GB.
- If limited augmentation is used, with the ranges reduced to scaling ∈ [0.7, 1.3], translation ∈ [-5, 5] and rotation ∈ [-5°, 5°], performance drops from 96.06 to 92.37 for elbows and from 89.66 to 81.89 for wrists.
- If exponential learning rate decay is not used, performance drops from 96.06 to 94.53 for elbows and from 89.66 to 86.51 for wrists.
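A small illustrative sketch of the torso-based scale normalization mentioned above; the function name, interface and use of OpenCV are assumptions, not from the paper.

```python
import numpy as np
import cv2  # OpenCV, used here only for resizing

def normalize_scale(image, joints, torso_height_px, target_torso_px=200):
    """Rescale an image and its (k, 2) array of (x, y) joint annotations so
    that the detected torso height becomes `target_torso_px` pixels."""
    factor = target_torso_px / float(torso_height_px)
    h, w = image.shape[:2]
    resized = cv2.resize(image, (int(round(w * factor)), int(round(h * factor))))
    return resized, np.asarray(joints, dtype=np.float32) * factor
```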
4.2. Leeds Sports Pose (LSP) Dataset
- The LSP dataset consists of 1,000 training and 1,000 test images.
- The extended LSP dataset consists of an additional 10,000 training images, giving 11,000 training images from LSP and extended LSP combined.
- The performance of 83.86 is very close to the 84.32 of CPM [30] when CPM is trained only on the same dataset, even though CPM uses an additional background model and an implicit spatial model on top.
- Additionally, FCGN requires almost half as much memory as their model.
- Without the skip layer connection for the use of middle layer features, overall performance drops from 83.86 to 80.
- Without multi-resolution features, i.e. with only the FullRes FCGN, performance drops from 83.86 to 82.8.
4.3. MPII Human Pose Dataset
- 25,925 images are used for training and 2,958 images for validation.
- 7,247 Single Person test images are used for evaluation.
- Test images are cropped around the given rough person location.
- Both training and test images are normalized to the same scale by using the provided rough scale information.
- FCGN takes 3 days to train on a GTX 980 GPU.
- FCGN is competitive with CPM [30], without using any pre-training, post-processing, background models or any form of implicit/explicit spatial dependencies.
4.4. Qualitative Results for FLIC, LSP and MPII Datasets
Reference
[2016 BMVC] [FCGN]
An Efficient Convolutional Network for Human Pose Estimation
My Previous Reviews
Image Classification [LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [ResNet-38] [Shake-Shake] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2]
Object Detection [OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]
Semantic Segmentation [FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [GCN] [PSPNet] [DeepLabv3] [ResNet-38] [ResNet-DUC-HDC] [LC] [FC-DenseNet] [IDW-CNN] [DIS] [SDN] [DeepLabv3+]
Biomedical Image Segmentation [CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net] [3D U-Net] [M²FCN] [SA] [QSA+QNT] [3D U-Net+ResNet] [Cascaded 3D U-Net] [Attention U-Net] [RU-Net & R2U-Net] [VoxResNet] [DenseVoxNet][UNet++] [H-DenseUNet] [DUNet]
Instance Segmentation [SDS] [Hypercolumn] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]
Super Resolution [SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN] [DRRN] [LapSRN & MS-LapSRN] [SRDenseNet] [SR+STN]
Human Pose Estimation [DeepPose] [Tompson NIPS’14] [Tompson CVPR’15] [CPM] [FCGN]
Codec Post-Processing [ARCNN] [Lin DCC’16] [IFCNN] [Li ICME’17] [VRCNN] [DCAD] [DS-CNN]
Generative Adversarial Network [GAN]