Reading: DRRN, Zhang JNCA’20 — Deep Recursive Residual Network (Semantic Segmentation)

Recursive Blocks Improve FCN Predictions While Reducing Parameters, and Also Enhance Mask R-CNN

Sik-Ho Tsang
5 min read · May 22, 2020

In this story, Deep Recursive Residual Network for Image Semantic Segmentation (DRRN, Zhang JNCA’20) is briefly presented. In this paper:

  • DRRN for image semantic segmentation is proposed.
  • Recursive blocks are introduced in Residual Block (ResNet).
  • A concatenation layer is utilized to combine the output maps of the recursive convolution layer at different iterations (all with the same resolution), so that feature maps with different fields-of-view can be gathered.

This is a journal paper published in 2020 in JNCA (Journal of Network and Computer Applications), which has a high impact factor of 5.273. (Sik-Ho Tsang @ Medium)

Outline

  1. Residual Unit with Recursions
  2. Network Architecture Modified from FCN-8s
  3. Experimental Results

1. Residual Unit with Recursions

1.1. Unfold Residual Unit with Recursions

Residual Unit with 3 Times of Recursions
  • For a recursive unit, the output can become its input again, depending on the number of iterations allowed.
  • In the figure above, if a residual unit is allowed 3 recursions and we unfold it, it becomes the structure shown on the left.
  • If we separate it into paths, there are 2³ = 8 paths for the data to flow through.
  • In this way, the set of functions is expanded without adding any extra parameters, and the flexibility in choosing paths also eases the difficulty of updating the recursive convolution kernels. (A minimal code sketch follows this list.)
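
Below is a minimal PyTorch sketch of such a recursive residual unit (my own illustration, not the paper's code; the channel count, the BN/ReLU placement, and the class name are all assumptions):

```python
import torch
import torch.nn as nn

class RecursiveResidualUnit(nn.Module):
    """A residual unit whose convolution is applied recursively T times
    with shared weights, so recursion adds no extra parameters."""
    def __init__(self, channels: int = 64, num_recursions: int = 3):
        super().__init__()
        self.num_recursions = num_recursions
        # A single convolution whose weights are reused at every recursion.
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = x
        for _ in range(self.num_recursions):
            # A residual connection at every recursion: unfolding T such
            # steps yields 2**T distinct paths for the data to flow through.
            out = out + self.relu(self.bn(self.conv(out)))
        return out

# Usage: 3 recursions, identical output shape, one conv's worth of weights.
unit = RecursiveResidualUnit(channels=64, num_recursions=3)
y = unit(torch.randn(1, 64, 32, 32))  # -> (1, 64, 32, 32)
```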

1.2. Concatenation

  • At each iteration, since another convolution is performed, the receptive field is enlarged.
  • Concatenating all output maps from the recursive convolution layer combines large field-of-view, semantically strong features with small field-of-view, semantically weak features.
  • The concatenated maps contain rich semantic information but have a high channel dimension.
  • 1×1 convolution layers are used to reduce channel dimension as well as merge all information.
  • Therefore, with 3 recursions:

Z = f_d(concat(Φ¹(x), Φ²(x), Φ³(x)))

  • where Φᵗ is the function of the t-th step in the recursive layer, concat is the concatenation layer, f_d is the 1×1 convolution layer, and Z is the final result. (See the code sketch after the figure below.)
  • A more detailed figure is shown below:
Residual Unit with 3 Times of Recursions
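
In code, the formula above could look like the following PyTorch sketch (again my own illustration; the ReLU placement and the `fuse` name are assumptions). Every recursion's output map is kept, all maps are concatenated along the channel axis, and the 1×1 convolution f_d merges them back to the original channel dimension:

```python
import torch
import torch.nn as nn

class RecursiveUnitWithConcat(nn.Module):
    """Z = f_d(concat(Phi^1(x), Phi^2(x), Phi^3(x))) with shared conv weights."""
    def __init__(self, channels: int = 64, num_recursions: int = 3):
        super().__init__()
        self.num_recursions = num_recursions
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        # f_d: 1x1 convolution that reduces the concatenated channel dimension.
        self.fuse = nn.Conv2d(channels * num_recursions, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outputs, out = [], x
        for _ in range(self.num_recursions):
            out = self.relu(self.conv(out))  # Phi: one recursion step
            outputs.append(out)              # field-of-view grows each time
        return self.fuse(torch.cat(outputs, dim=1))  # concat, then 1x1 conv
```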

2. Network Architecture Modified from FCN-8s

  • The backbone of FCN-8s is the VGG-16 network.
  • Three variants are implemented: FCN-8s, simplified FCN, and simplified FCN with atrous convolution (atrous convolution is from DeepLabv1 & DeepLabv2, or DilatedNet).
  • Simplified FCN: the deconvolution and up-sampling layers of FCN-8s are removed, and bilinear interpolation is used to upscale by a factor of 16.
  • Simplified FCN with atrous convolution: the original convolutions are replaced with atrous convolutions in the fifth block, which enlarges the field-of-view of the filters.
  • Batch normalization is used with every convolutional operation.
  • ImageNet-pretrained weights are used with fine-tuning, with the 1000-way classification output at the end of the network replaced by the number of semantic classes of the target dataset.
  • The sum of the cross entropy at each location of the output feature map is used as the loss function (see the sketch after this list).
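
As a small sketch (the function name is mine), this loss amounts to one call in PyTorch:

```python
import torch
import torch.nn.functional as F

# logits: (N, num_classes, H, W) output map; target: (N, H, W) class indices.
# reduction="sum" adds the cross entropy over every spatial location.
def segmentation_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return F.cross_entropy(logits, target, reduction="sum")
```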

When the proposed recursive residual blocks are used, three convolution layers are replaced by one recursive convolution layer.
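
The sketch below shows how this replacement might look in the fifth VGG-16 block (a hypothetical reconstruction; the dilation rate and BN placement are assumptions). Sharing one 3×3, 512-to-512 convolution across three positions removes 2 × (3 × 3 × 512 × 512) = 4,718,592 weights, which matches the parameter saving reported in the experiments below:

```python
import torch.nn as nn

class RecursiveBlock5(nn.Module):
    """One shared convolution applied 3 times, replacing three separate convs."""
    def __init__(self, conv_kwargs: dict):
        super().__init__()
        self.conv = nn.Conv2d(512, 512, **conv_kwargs)
        self.bn = nn.BatchNorm2d(512)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        for _ in range(3):
            x = self.relu(self.bn(self.conv(x)))
        return x

def make_block5(recursive: bool, atrous: bool = True) -> nn.Module:
    """Fifth VGG-16 block of the simplified FCN.
    recursive=False: the original three 3x3 convolutions (three weight sets).
    recursive=True:  one shared convolution applied 3 times.
    atrous=True:     dilated (DeepLab-style) convs that enlarge the
                     field-of-view at no parameter cost."""
    dilation = 2 if atrous else 1
    conv_kwargs = dict(kernel_size=3, padding=dilation, dilation=dilation)
    if recursive:
        return RecursiveBlock5(conv_kwargs)
    layers = []
    for _ in range(3):  # three convs, three independent sets of weights
        layers += [nn.Conv2d(512, 512, **conv_kwargs),
                   nn.BatchNorm2d(512),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)
```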

3. Experimental Results

3.1. DeepFashion

left to right: image, ground truth, original FCN, proposed method
  • DeepFashion is a large collection of 289,222 fashion images with comprehensive annotations, released by The Chinese University of Hong Kong. 78,979 of the images have segmentation masks, and these are split into two parts: a training set (70,000 images) and a test set (8,979 images).
Results of Three FCN Variants
  • The modified network reduces the parameter count by about 4,718,592 yet performs better on segmentation for all three kinds of networks.

3.2. Cityscapes

left to right: image, ground truth, original FCN, proposed method
  • The training, validation, and test sets contain 2975, 500, and 1525 images, respectively, with 19 semantic labels.
Results of Two FCN Variants
  • The proposed network again performs slightly better with a smaller number of parameters.

3.3. PASCAL VOC 2012

left to right: image, ground truth, original FCN, proposed method
  • The original dataset contains 1464 training, 1449 validation, and 1456 test pixel-level labeled images, with 20 foreground labels.
Results of Two FCN Variants
  • The proposed network obtains a larger gain here than on the other two datasets.
  • This is because the targets have diverse shapes and sizes, so both small and large fields-of-view are required for better segmentation, and merging feature maps with different fields-of-view plays a bigger role.

3.4. Ablation Study

Results of Two FCN Variants on PASCAL VOC 2012
  • The above table shows that the concatenation layer indeed improves the performance of the network, while the recursive layer slightly lowers the mean pixel accuracy (PA) metric but greatly reduces the number of parameters.

3.5. Extensions for Mask R-CNN

Modified Mask R-CNN with Proposed Recursive Blocks
  • In the Mask R-CNN network, each RoI passes through four convolution layers to yield the segmentation mask.
  • The last three convolution layers are transformed into recursive layers, as illustrated in the figure above (and sketched in code below).
  • The 80-category COCO detection dataset and a ResNet-101 backbone are used.
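
A sketch of the modified mask head (my reconstruction; the 256-channel width and ReLU placement are assumptions based on the standard Mask R-CNN head):

```python
import torch.nn as nn

class RecursiveMaskHead(nn.Module):
    """Mask R-CNN mask head with the last three 3x3 convolutions replaced
    by a single shared convolution applied recursively three times."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)  # kept as-is
        self.recursive_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: RoI-aligned features, e.g. (N, 256, 14, 14)
        x = self.relu(self.conv1(x))
        for _ in range(3):  # shared weights stand in for convs 2-4
            x = self.relu(self.recursive_conv(x))
        return x  # Mask R-CNN then applies a deconv and a per-class 1x1 conv
```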
Top: Mask R-CNN, Bottom: Mask R-CNN with Proposed Recursive Blocks
[Failure Cases] Top: Mask R-CNN, Bottom: Mask R-CNN with Proposed Recursive Blocks
Segmentation metrics using Mask R-CNN on COCO minival
  • Overall, Mask R-CNN is enhanced by the proposed recursive blocks.
  • Also, from the table, we can see that the gain of the proposed method grows as the size of the segmented objects increases.

During the days of coronavirus, let me take on the challenge of writing 30 stories again this month. Is it good? This is the 29th story this month. 1 story to go. Thanks for visiting my story!
