Reading: DPN — Deep Parsing Network (Semantic Segmentation)

Outperforms DeepLabv1, CRF-RNN, SegNet, DilatedNet, & FCN

Sik-Ho Tsang
4 min read · Oct 10, 2020

In this story, Deep Parsing Network (DPN), by The Chinese University of Hong Kong, is briefly presented. In this paper:

  • DPN extends a CNN to model unary terms and additional layers are devised to approximate the mean field (MF) algorithm for pairwise terms.
  • In some previous approaches, non-deep-learning post-processing is performed to refine the result. Because the network approximates the pairwise terms, this post-processing stage is no longer needed.

This is a paper in 2015 ICCV (over 500 citations) as well as in 2018 TPAMI (over 60 citations). (Sik-Ho Tsang @ Medium)

Outline

  1. DPN in 2015 ICCV
  2. Spatial-Temporal DPN in 2018 TPAMI
  3. Experimental Results

1. DPN in 2015 ICCV

DPN Derived From VGG16
  • DPN is derived from VGG16 with modifications.

1.1. DPN for Unary Terms

  • First, increase resolution of VGG16 by removing its max pooling layers at a8 and a10.
  • Second, the two fully-connected layers at a11 are transformed into two convolutional layers at b9 and b10, respectively. Each filter of b10 is 1×1×4096.
  • Overall, b11 generates the unary labeling results, producing 21 512×512 feature maps, each of which represents the probabilistic label map of one category, i.e. the probability of that class at each location.
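The steps above can be sketched as follows. This is a hypothetical PyTorch sketch of the unary branch only: the two fully-connected layers become 1×1 convolutions (b9, b10), and a final 1×1 classifier (b11) outputs one probabilistic label map per class. Channel sizes and the 64×64 input feature map are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

# Illustrative unary head: fc6/fc7 of VGG16 recast as 1x1 convolutions (b9, b10),
# followed by a 1x1 classifier (b11) with one output map per PASCAL VOC class.
unary_head = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=1),   # b9: fc6 -> convolutional
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 4096, kernel_size=1),  # b10: fc7 -> 1x1x4096 filters
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 21, kernel_size=1),    # b11: 21 class score maps
)

x = torch.randn(1, 512, 64, 64)            # features from the modified VGG16 trunk
unary = unary_head(x).softmax(dim=1)       # per-pixel class probabilities
print(unary.shape)                         # torch.Size([1, 21, 64, 64])
```

Because every layer is 1×1 convolutional, the head runs at any spatial resolution, which is what allows dense 512×512 label maps.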

1.2. DPN for Pairwise Terms

  • ‘lconv’ in b12 indicates a locally convolutional layer.
  • b12 has 512×512 different filters and produces 21 output feature maps.
  • Therefore, at b12, the probability of an object being present at each position is updated by weighted averaging over the probabilities at its nearby positions.
  • b13 is a convolutional layer that generates 105 feature maps by using 105 filters of size 9×9×21.
  • b13 learns a filter for each category to penalize the probabilistic label maps of b12, corresponding to the local label contexts.
  • b14 is a block min pooling layer that pools over every 1×1 region with one stride across every 5 input channels, leading to 21 output channels, i.e. 105/5=21. Layer b14 activates the contextual pattern with the smallest penalty.
  • b15 combines both the unary and smoothness terms by summing the outputs of b11 and b14 in an element-wise manner.

In summary, after b11, we obtain the probabilistic label map of each category. Then, from b12 to b15, we refine the results by exploiting the relations between labels in the obtained label maps.
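A minimal sketch of the b12–b15 pipeline, under stated assumptions: the true b12 is a locally convolutional layer with a different filter at each of the 512×512 positions, which a single shared per-class (depthwise) convolution stands in for here; kernel sizes follow the text (9×9, 105 = 21 classes × 5 contexts), but the channel grouping for the block min pooling is an assumption.

```python
import torch
import torch.nn as nn

class PairwiseHead(nn.Module):
    """Illustrative stand-in for DPN layers b12-b15 (not the paper's code)."""
    def __init__(self, classes=21, contexts=5):
        super().__init__()
        self.classes, self.contexts = classes, contexts
        # Stand-in for b12: weighted averaging over nearby positions, per class.
        # (The real b12 uses a different filter at every spatial position.)
        self.b12 = nn.Conv2d(classes, classes, kernel_size=9, padding=4,
                             groups=classes, bias=False)
        # b13: 105 filters of size 9x9x21 -> one penalty map per
        # (category, local label context) pair.
        self.b13 = nn.Conv2d(classes, classes * contexts, kernel_size=9, padding=4)

    def forward(self, unary):
        smoothed = self.b12(unary)
        penalty = self.b13(smoothed)
        n, _, h, w = penalty.shape
        # b14: block min pooling -- min over every 5 consecutive channels
        # (105 / 5 = 21), activating the context with the smallest penalty.
        pooled = penalty.view(n, self.classes, self.contexts, h, w).min(dim=2).values
        # b15: element-wise sum of the unary (b11) and smoothness (b14) terms.
        return unary + pooled

head = PairwiseHead()
unary = torch.randn(1, 21, 64, 64)
out = head(unary)
print(out.shape)  # torch.Size([1, 21, 64, 64])
```

Note the branch is shape-preserving: it takes 21 probabilistic label maps in and returns 21 refined maps, so it can be bolted directly onto the unary output.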

  • To train the network, first, add a loss function to b11 and learn the weights from b1 to b11 in order to learn the unary terms.
  • Then, add b12-b15 to learn the later convolutions, and fine-tune the whole network.
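The two-stage schedule might look like the following sketch; the modules here are tiny placeholders for the real b1–b11 and b12–b15 stacks, and the learning rates are illustrative.

```python
import torch
import torch.nn as nn
import torch.optim as optim

unary_net = nn.Conv2d(3, 21, kernel_size=1)                  # placeholder for b1-b11
pairwise_head = nn.Conv2d(21, 21, kernel_size=9, padding=4)  # placeholder for b12-b15
criterion = nn.CrossEntropyLoss()                            # pixel-wise loss

img = torch.randn(2, 3, 32, 32)
labels = torch.randint(0, 21, (2, 32, 32))

# Stage 1: attach the loss to the unary output and train b1-b11 only.
opt1 = optim.SGD(unary_net.parameters(), lr=1e-3)
loss1 = criterion(unary_net(img), labels)
opt1.zero_grad(); loss1.backward(); opt1.step()

# Stage 2: append b12-b15 and fine-tune the whole network end to end,
# typically at a lower learning rate.
opt2 = optim.SGD(list(unary_net.parameters()) +
                 list(pairwise_head.parameters()), lr=1e-4)
loss2 = criterion(pairwise_head(unary_net(img)), labels)
opt2.zero_grad(); loss2.backward(); opt2.step()
```

Pretraining the unary terms first gives the pairwise layers a sensible label map to smooth, instead of learning both from scratch jointly.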

2. Spatial-Temporal DPN in 2018 TPAMI

DPN in 2018 TPAMI Derived From VGG16
  • This DPN is very similar to the one in 2015 ICCV, except that the input accepts multiple images, enabling video input.
  • Also, b12 and b13 are converted into 3D convolutions for video.
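The 2D-to-3D conversion can be illustrated as below; the temporal kernel size and the 5-frame clip are assumptions for the sketch, not values from the paper.

```python
import torch
import torch.nn as nn

# A 2D 9x9 pairwise filter over (C, H, W) becomes a 3D filter over (C, T, H, W),
# so the smoothing also spans neighboring frames of the video.
conv2d = nn.Conv2d(21, 105, kernel_size=9, padding=4)                # image DPN
conv3d = nn.Conv3d(21, 105, kernel_size=(3, 9, 9), padding=(1, 4, 4))  # video DPN

frames = torch.randn(1, 21, 5, 64, 64)   # a 5-frame clip of 21 label maps
out = conv3d(frames)
print(out.shape)  # torch.Size([1, 105, 5, 64, 64])
```

Padding the temporal dimension keeps the number of frames unchanged, so per-frame segmentations still come out one-for-one with the input clip.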

3. Experimental Results

3.1. PASCAL VOC 12

Per-class results on PASCAL VOC 12 test set, +: Trained by MS COCO as well

3.2. Cityscapes

Per-class results on Cityscapes
  • DPN achieves 66.8% on the Cityscapes dataset, the second-best result, close to the first-place DilatedNet at 67.1%.

3.3. CamVid

Per-class results on CamVid
  • DPN achieves much better performance than other methods such as SegNet.
  • Spatial-Temporal DPN further improves the results slightly.

3.4. Visual Quality

Visual quality comparison of different semantic image segmentation methods
Failure Cases

Written by Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.
