Review: FCN — Fully Convolutional Network (Semantic Segmentation)

Sik-Ho Tsang
Towards Data Science
4 min read · Oct 5, 2018


In this story, the Fully Convolutional Network (FCN) for semantic segmentation is briefly reviewed. Compared with classification and detection, segmentation is a much more difficult task:

  • Image Classification: Classify the object (recognize the object class) within an image.
  • Object Detection: Classify and detect the object(s) within an image, with bounding box(es) around the object(s). That means we also need to know the class, position, and size of each object.
  • Semantic Segmentation: Classify the object class for each pixel within an image, i.e., there is a label for every pixel (the sketch below makes the difference in output shapes concrete).
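
Here is a minimal PyTorch sketch of the output shapes involved; the image size and the 21-class count (Pascal VOC style) are illustrative assumptions, not from the paper:

```python
import torch

# One 3-channel 224x224 image (batch of 1); 21 classes, Pascal-VOC style.
image = torch.randn(1, 3, 224, 224)

cls_scores = torch.randn(1, 21)            # classification: one score vector per image
seg_scores = torch.randn(1, 21, 224, 224)  # segmentation: one score vector per pixel

# The hard label map takes the argmax over the class dimension:
label_map = seg_scores.argmax(dim=1)
print(label_map.shape)  # torch.Size([1, 224, 224]): one class label per pixel
```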

An example of semantic segmentation is shown below:

An example of Semantic Segmentation
Original Image (Leftmost), Ground Truth Label Map (2nd Left), Predicted Label Map (2nd Right), Image Overlaid with Predicted Labels (Rightmost)

The paper was published in 2015 CVPR [1] and 2017 TPAMI [2], with more than 6000 citations while I was writing this story. It is thus one of the foundational papers for FCN-based semantic segmentation. (Sik-Ho Tsang @ Medium)

What Is Covered

  1. From Image Classification to Semantic Segmentation
  2. Upsampling Via Deconvolution
  3. Fusing the Output
  4. Results

1. From Image Classification to Semantic Segmentation

In classification, conventionally, an input image is downsized and goes through the convolution layers and fully connected (FC) layers, and the network outputs one predicted label for the whole input image, as follows:

Classification

Imagine we turn the FC layers into 1×1 convolutional layers:

All layers are convolutional layers
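
As a concrete illustration, here is a minimal PyTorch sketch of this conversion for one FC layer; the 4096-dimensional features and 21 classes are assumptions borrowed from a VGG-style head, not something the post prescribes:

```python
import torch
import torch.nn as nn

fc = nn.Linear(4096, 21)                   # an FC layer: 4096 features -> 21 class scores
conv = nn.Conv2d(4096, 21, kernel_size=1)  # its 1x1 convolutional equivalent

# The conversion is just a reshape of the FC weights into 1x1 kernels.
conv.weight.data = fc.weight.data.view(21, 4096, 1, 1)
conv.bias.data = fc.bias.data

# On a 1x1 spatial input, both produce identical scores...
x = torch.randn(1, 4096, 1, 1)
assert torch.allclose(conv(x).flatten(), fc(x.flatten(1)).flatten(), atol=1e-6)

# ...but the conv version also accepts larger feature maps, yielding a
# coarse spatial grid of class scores instead of a single prediction.
y = conv(torch.randn(1, 4096, 10, 10))
print(y.shape)  # torch.Size([1, 21, 10, 10])
```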

And if the image is not downsized, the output is no longer a single label. Instead, the output is a grid of predictions whose size is smaller than the input image (due to the max pooling layers):

All layers are convolutional layers

If we upsample the output above, then we can calculate the pixelwise output (label map) as below:

Upsampling at the last step
Feature Map / Filter Number Along Layers

2. Upsampling Via Deconvolution

Convolution is a process that makes the output size smaller, so the name deconvolution comes from wanting the opposite: an upsampling operation that makes the output size larger. (The name is unfortunate, since deconvolution is often misinterpreted as the mathematical inverse of convolution, which it is not.) It is also called up-convolution or transposed convolution, and fractionally strided convolution when a fractional stride is used.

Upsampling Via Deconvolution (Blue: Input, Green: Output)
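
Here is a minimal PyTorch sketch of such a 2× upsampling layer, assuming 21 class-score channels; initializing the transposed-convolution filters to perform bilinear upsampling follows the common FCN practice:

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    """Weights that make a transposed conv act as bilinear upsampling."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = torch.arange(kernel_size).float()
    filt = 1 - (og - center).abs() / factor
    kernel = filt[:, None] * filt[None, :]
    weight = torch.zeros(channels, channels, kernel_size, kernel_size)
    for c in range(channels):
        weight[c, c] = kernel  # each channel is upsampled independently
    return weight

# kernel_size=4, stride=2, padding=1 exactly doubles the spatial size.
up2 = nn.ConvTranspose2d(21, 21, kernel_size=4, stride=2, padding=1, bias=False)
up2.weight.data = bilinear_kernel(21, 4)

x = torch.randn(1, 21, 10, 10)
print(up2(x).shape)  # torch.Size([1, 21, 20, 20])
```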

3. Fusing the Output

After going through conv7 as below, the output size is small, so 32× upsampling is done to make the output the same size as the input image. But it also makes the output label map rough. This variant is called FCN-32s:

FCN-32s
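
A minimal sketch of what such a 32× upsampling head looks like, assuming a 224×224 input (so conv7 features sit at 7×7) and VGG-style channel counts; this is illustrative, not the authors' exact code, and the transposed conv is left randomly initialized for brevity (the paper initializes it as bilinear upsampling):

```python
import torch
import torch.nn as nn

num_classes = 21
score = nn.Conv2d(4096, num_classes, kernel_size=1)  # per-location class scores
up32 = nn.ConvTranspose2d(num_classes, num_classes,
                          kernel_size=64, stride=32, padding=16, bias=False)

conv7 = torch.randn(1, 4096, 7, 7)  # 224/32 = 7: features at 1/32 resolution
out = up32(score(conv7))
print(out.shape)  # torch.Size([1, 21, 224, 224]): back to input resolution
```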

The label map is rough because, although richer semantic features are obtained when going deeper, spatial location information is lost at the same time. That means outputs from shallower layers retain more location information. If we combine both, we can enhance the result.

To combine them, we fuse the outputs (by element-wise addition):

Fusing for FCN-16s and FCN-8s

FCN-16s: The output from conv7 is 2× upsampled, fused with the prediction from pool4 (by element-wise addition), and then 16× upsampled. Similar operations apply to FCN-8s, as in the figure above.
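
A minimal sketch of this fusion under the same assumptions as above (224×224 input, VGG-style channel counts; illustrative, not the authors' implementation, with random initialization for brevity):

```python
import torch
import torch.nn as nn

num_classes = 21
score_conv7 = nn.Conv2d(4096, num_classes, kernel_size=1)  # coarse scores at 1/32
score_pool4 = nn.Conv2d(512, num_classes, kernel_size=1)   # finer scores at 1/16
up2  = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1, bias=False)
up16 = nn.ConvTranspose2d(num_classes, num_classes, 32, stride=16, padding=8, bias=False)

conv7 = torch.randn(1, 4096, 7, 7)   # 1/32 of a 224x224 input
pool4 = torch.randn(1, 512, 14, 14)  # 1/16 of a 224x224 input

fused = up2(score_conv7(conv7)) + score_pool4(pool4)  # element-wise addition at 1/16
out = up16(fused)                                     # 16x up to full resolution
print(out.shape)  # torch.Size([1, 21, 224, 224])
```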

Comparison with different FCNs

The FCN-32s result is very rough due to the loss of location information, while FCN-8s gives the best result.

This fusing operation is actually just like the boosting / ensemble technique used in AlexNet, VGGNet, and GoogLeNet, where results from multiple models are added to make the prediction more accurate. But in this case, it is done for each pixel, and the added results come from different layers within a single model.

4. Results

Pascal VOC 2011 dataset (Left), NYUDv2 Dataset (Middle), SIFT Flow Dataset (Right)
  • FCN-8s is the best in Pascal VOC 2011.
  • FCN-16s is the best in NYUDv2.
  • FCN-16s is the best in SIFT Flow.
Visualized Results Compared with [Ref 15]

The fourth row shows a failure case: the net sees lifejackets in a boat as people.

I hope I can review more deep learning techniques for semantic segmentation in the future.
