Reading: DPN — Deep Parsing Network (Semantic Segmentation)

Outperforms DeepLabv1, CRF-RNN, SegNet, DilatedNet, & FCN

Sik-Ho Tsang
4 min read · Oct 10, 2020

In this story, Deep Parsing Network (DPN), by The Chinese University of Hong Kong, is briefly presented. In this paper:

  • DPN extends a CNN to model unary terms and additional layers are devised to approximate the mean field (MF) algorithm for pairwise terms.
  • In some previous approaches, non-deep-learning post-processing is performed to refine the result. Because the network approximates the pairwise terms, this post-processing stage is no longer needed.

This is a paper in 2015 ICCV (over 500 citations) as well as in 2018 TPAMI (over 60 citations). (Sik-Ho Tsang @ Medium)

Outline

  1. DPN in 2015 ICCV
  2. Spatial-Temporal DPN in 2018 TPAMI
  3. Experimental Results

1. DPN in 2015 ICCV

DPN Derived From VGG16
  • DPN is derived from VGG16 with modifications.

1.1. DPN for Unary Terms

  • First, increase resolution of VGG16 by removing its max pooling layers at a8 and a10.
  • Second, the two fully-connected layers at a11 are transformed into two convolutional layers at b9 and b10, respectively. Each filter of b10 is 1×1×4096.
  • Overall, b11 generates the unary labeling results, producing 21 512×512 feature maps, each of which represents the probabilistic label map of one category, i.e. the probability of that class at each location.
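The steps above can be sketched as follows. This is a hypothetical PyTorch sketch of the unary branch only: the two fully-connected layers become 1×1 convolutions (b9, b10), and a final 1×1 classifier (b11) outputs one probabilistic label map per class. Channel sizes and the 64×64 input feature map are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

# Illustrative unary head: fc6/fc7 of VGG16 recast as 1x1 convolutions (b9, b10),
# followed by a 1x1 classifier (b11) with one output map per PASCAL VOC class.
unary_head = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=1),   # b9: fc6 -> convolutional
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 4096, kernel_size=1),  # b10: fc7 -> 1x1x4096 filters
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 21, kernel_size=1),    # b11: 21 class score maps
)

x = torch.randn(1, 512, 64, 64)            # features from the modified VGG16 trunk
unary = unary_head(x).softmax(dim=1)       # per-pixel class probabilities
print(unary.shape)                         # torch.Size([1, 21, 64, 64])
```

Because every layer is 1×1 convolutional, the head runs at any spatial resolution, which is what allows dense 512×512 label maps.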

1.2. DPN for Pairwise Terms

  • ‘lconv’ in b12 indicates a locally convolutional layer.
  • b12 has 512×512 different filters and produces 21 output feature maps.
  • Therefore, at b12, the probability of an object being present at each position is updated by weighted averaging over the probabilities at its nearby positions.
  • b13 is a convolutional layer that generates 105 feature maps by using 105 filters of size 9×9×21.
  • b13 learns a filter for each category to penalize the probabilistic label maps of b12, corresponding to the local label contexts.
  • b14 is a block min pooling layer that pools over every 1×1 region with one stride across every 5 input channels, leading to 21 output channels, i.e. 105/5=21. Layer b14 activates the contextual pattern with the smallest penalty.
  • b15 combines both the unary and smoothness terms by summing the outputs of b11 and b14 in an element-wise manner.

In summary, after b11, we obtain the probabilistic label map of each category. Then, from b12 to b15, we refine the results by exploiting the relations between labels in the obtained label maps.
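A minimal sketch of the b12–b15 pipeline, under stated assumptions: the true b12 is a locally convolutional layer with a different filter at each of the 512×512 positions, which a single shared per-class (depthwise) convolution stands in for here; kernel sizes follow the text (9×9, 105 = 21 classes × 5 contexts), but the channel grouping for the block min pooling is an assumption.

```python
import torch
import torch.nn as nn

class PairwiseHead(nn.Module):
    """Illustrative stand-in for DPN layers b12-b15 (not the paper's code)."""
    def __init__(self, classes=21, contexts=5):
        super().__init__()
        self.classes, self.contexts = classes, contexts
        # Stand-in for b12: weighted averaging over nearby positions, per class.
        # (The real b12 uses a different filter at every spatial position.)
        self.b12 = nn.Conv2d(classes, classes, kernel_size=9, padding=4,
                             groups=classes, bias=False)
        # b13: 105 filters of size 9x9x21 -> one penalty map per
        # (category, local label context) pair.
        self.b13 = nn.Conv2d(classes, classes * contexts, kernel_size=9, padding=4)

    def forward(self, unary):
        smoothed = self.b12(unary)
        penalty = self.b13(smoothed)
        n, _, h, w = penalty.shape
        # b14: block min pooling -- min over every 5 consecutive channels
        # (105 / 5 = 21), activating the context with the smallest penalty.
        pooled = penalty.view(n, self.classes, self.contexts, h, w).min(dim=2).values
        # b15: element-wise sum of the unary (b11) and smoothness (b14) terms.
        return unary + pooled

head = PairwiseHead()
unary = torch.randn(1, 21, 64, 64)
out = head(unary)
print(out.shape)  # torch.Size([1, 21, 64, 64])
```

Note the branch is shape-preserving: it takes 21 probabilistic label maps in and returns 21 refined maps, so it can be bolted directly onto the unary output.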

  • To train the network, first, add a loss function to b11 and learn the weights from b1 to b11 in order to learn the unary terms.
  • Then, add b12-b15 to learn the later convolutions, and fine-tune the whole network.
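The two-stage schedule might look like the following sketch; the modules here are tiny placeholders for the real b1–b11 and b12–b15 stacks, and the learning rates are illustrative.

```python
import torch
import torch.nn as nn
import torch.optim as optim

unary_net = nn.Conv2d(3, 21, kernel_size=1)                  # placeholder for b1-b11
pairwise_head = nn.Conv2d(21, 21, kernel_size=9, padding=4)  # placeholder for b12-b15
criterion = nn.CrossEntropyLoss()                            # pixel-wise loss

img = torch.randn(2, 3, 32, 32)
labels = torch.randint(0, 21, (2, 32, 32))

# Stage 1: attach the loss to the unary output and train b1-b11 only.
opt1 = optim.SGD(unary_net.parameters(), lr=1e-3)
loss1 = criterion(unary_net(img), labels)
opt1.zero_grad(); loss1.backward(); opt1.step()

# Stage 2: append b12-b15 and fine-tune the whole network end to end,
# typically at a lower learning rate.
opt2 = optim.SGD(list(unary_net.parameters()) +
                 list(pairwise_head.parameters()), lr=1e-4)
loss2 = criterion(pairwise_head(unary_net(img)), labels)
opt2.zero_grad(); loss2.backward(); opt2.step()
```

Pretraining the unary terms first gives the pairwise layers a sensible label map to smooth, instead of learning both from scratch jointly.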

2. Spatial-Temporal DPN in 2018 TPAMI

DPN in 2018 TPAMI Derived From VGG16
  • This DPN is very similar to the one in 2015 ICCV, except that the input accepts multiple images, enabling video input.
  • Also, b12 and b13 are converted into 3D convolutions for video.
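The 2D-to-3D conversion can be illustrated as below; the temporal kernel size and the 5-frame clip are assumptions for the sketch, not values from the paper.

```python
import torch
import torch.nn as nn

# A 2D 9x9 pairwise filter over (C, H, W) becomes a 3D filter over (C, T, H, W),
# so the smoothing also spans neighboring frames of the video.
conv2d = nn.Conv2d(21, 105, kernel_size=9, padding=4)                # image DPN
conv3d = nn.Conv3d(21, 105, kernel_size=(3, 9, 9), padding=(1, 4, 4))  # video DPN

frames = torch.randn(1, 21, 5, 64, 64)   # a 5-frame clip of 21 label maps
out = conv3d(frames)
print(out.shape)  # torch.Size([1, 105, 5, 64, 64])
```

Padding the temporal dimension keeps the number of frames unchanged, so per-frame segmentations still come out one-for-one with the input clip.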

3. Experimental Results

3.1. PASCAL VOC 12

Per-class results on PASCAL VOC 12 test set, +: Trained by MS COCO as well

3.2. Cityscapes

Per-class results on Cityscapes
  • DPN achieves 66.8% on the Cityscapes dataset, the second-best result, close to the first-place DilatedNet at 67.1%.

3.3. CamVid

Per-class results on CamVid
  • DPN achieves much better performance than other methods such as SegNet.
  • Spatial-Temporal DPN further improves the results slightly.

3.4. Visual Quality

Visual quality comparison of different semantic image segmentation methods
Failure Cases

Written by Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.
