Review — UPerNet: Unified Perceptual Parsing for Scene Understanding

UPerNet, Scene Parsing Using Multi-Task Datasets

Jun 12, 2022

**A network trained for Unified Perceptual Parsing can parse visual concepts at multiple perceptual levels, such as scene, objects, parts, textures, and materials, all at once.**

Unified Perceptual Parsing for Scene Understanding
UPerNet, by Peking University, MIT CSAIL, Bytedance Inc., and Megvii Inc.
2018 ECCV, Over 400 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Scene Parsing, Fully Convolutional Network, FCN

A new task called Unified Perceptual Parsing is proposed, which requires machine vision systems to recognize as many visual concepts as possible from a given image.
A multi-task framework called UPerNet and a training strategy are developed to learn from heterogeneous image annotations.

Outline

Broden+ Dataset
UPerNet
Experimental Results

1. Broden+ Dataset

**Statistics of each label type in the Broden+ dataset**

**(a) Sorted object classes by frequency: we show top 120 classes selected from the Broden+. (b) Frequency of parts grouped by objects**

The Broden dataset provides a wide range of visual concepts, but its samples are unbalanced across classes. The Broden dataset is therefore standardized into the Broden+ dataset to make it more suitable for training segmentation networks.

First, similar concepts are merged across different datasets. For example, objects and parts annotations in ADE20K, Pascal-Context, and Pascal-Part are merged and unified.
Second, only object classes that appear in at least 50 images and contain at least 50,000 pixels are included in the whole dataset. Object parts that appear in at least 20 images are considered valid parts. Objects and parts that are conceptually inconsistent are manually removed.
Third, under-sampled labels are manually merged in OpenSurfaces.
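
The frequency thresholds in the second step can be sketched as below. This is a minimal illustration, not the authors' preprocessing code: the `ClassStats` record, the two helper functions, and the toy numbers are hypothetical. Only the thresholds (50 images, 50,000 pixels, 20 images for parts) come from the text.

```python
from dataclasses import dataclass

# Thresholds quoted in the text above.
MIN_OBJECT_IMAGES = 50
MIN_OBJECT_PIXELS = 50_000
MIN_PART_IMAGES = 20


@dataclass
class ClassStats:
    """Hypothetical per-class statistics gathered over the merged dataset."""
    name: str
    num_images: int  # number of images in which the class appears
    num_pixels: int  # total number of pixels labeled with the class


def keep_object_classes(stats):
    """Keep object classes that meet both the image and the pixel threshold."""
    return [
        s.name
        for s in stats
        if s.num_images >= MIN_OBJECT_IMAGES and s.num_pixels >= MIN_OBJECT_PIXELS
    ]


def keep_part_classes(stats):
    """Keep part classes that appear in enough images."""
    return [s.name for s in stats if s.num_images >= MIN_PART_IMAGES]


# Toy statistics, invented for illustration only.
toy_objects = [
    ClassStats("wall", 9_000, 80_000_000),
    ClassStats("rare_thing", 12, 900_000),  # appears in too few images
    ClassStats("tiny_thing", 200, 30_000),  # covers too few pixels
]
print(keep_object_classes(toy_objects))
```

The manual steps (merging similar concepts and removing conceptually inconsistent labels) are not captured by this sketch.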

Broden+ contains 57,095 images in total, including 22,210 images from ADE20K, 10,103 images from Pascal-Context and Pascal-Part, 19,142 images from OpenSurfaces and 5,640 images from DTD.

2. UPerNet

2.1. Network Architecture

Top-left: A Feature Pyramid Network (FPN) is used. The Pyramid Pooling Module (PPM) from PSPNet is appended to the last layer of the backbone network, and its output is fed into the top-down branch of the FPN.
Top-right: Features at various semantic levels are used by different heads.

The scene head is attached to the feature map directly after the PPM, since image-level information is more suitable for scene classification.
The object and part heads are attached to the feature map fused from all the layers output by the FPN.
The material head is attached to the FPN feature map with the highest resolution.
The texture head is attached to the Res-2 block in ResNet, and is fine-tuned after the whole network finishes training on the other tasks.

Bottom: Illustrations of the different heads. All extra non-classifier convolutional layers, including those in the FPN, use batch normalization (BN) and output 512 channels. ReLU is applied after BN.
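
The head placement described above can be summarized as a small lookup table plus some shape bookkeeping. This is only a sketch under stated assumptions, not the actual network: the names `STRIDES`, `HEAD_INPUT`, `feature_sizes`, and `head_resolutions` are hypothetical, the strides 4/8/16/32 assume a standard ResNet-style backbone, and the fused map is assumed to live at the finest FPN resolution.

```python
import math

# Assumed downsampling rate of each backbone stage (standard ResNet layout).
STRIDES = {"res2": 4, "res3": 8, "res4": 16, "res5": 32}

# Which feature map each head reads, following the description above.
HEAD_INPUT = {
    "scene": "ppm",          # directly after the PPM (coarsest level)
    "object": "fused",       # fusion of all FPN output levels
    "part": "fused",
    "material": "fpn_res2",  # highest-resolution FPN level
    "texture": "res2",       # Res-2 block of the backbone itself
}


def feature_sizes(height, width):
    """Spatial size (H, W) of every feature map for a given input size."""
    sizes = {}
    for name, stride in STRIDES.items():
        hw = (math.ceil(height / stride), math.ceil(width / stride))
        sizes[name] = hw           # backbone feature
        sizes["fpn_" + name] = hw  # FPN output at the same level
    sizes["ppm"] = sizes["res5"]        # the PPM sits on the last backbone layer
    sizes["fused"] = sizes["fpn_res2"]  # assumption: fused at the finest level
    return sizes


def head_resolutions(height, width):
    """Spatial size of the feature map consumed by each head."""
    sizes = feature_sizes(height, width)
    return {head: sizes[source] for head, source in HEAD_INPUT.items()}


print(head_resolutions(512, 512))
```

Under these assumptions the scene head sees the coarsest map, while the material and texture heads both work at the finest resolution but on different features (FPN output versus raw backbone block).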

2.2. Training Strategy

If a mini-batch in one iteration were composed of images from several sources covering various tasks, the gradient with respect to any one task could be noisy, since the effective batch size for each task would be reduced.
Thus, a single data source is randomly sampled at each iteration based on the scale of each source, and only the path that infers the concepts related to the selected source is updated. For object and material, losses are not calculated on unlabeled areas.
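
A minimal sketch of this sampling scheme follows. The image counts are the Broden+ statistics quoted earlier; everything else is an assumption made for illustration: the `SOURCE_TASKS` mapping from source to supervised heads, the helper names, and the `ignore_label` value of -1 for unlabeled pixels. Since the texture head is described as being fine-tuned afterwards, whether DTD takes part in this sampling is not stated here; it is listed only for completeness.

```python
import random

# Image counts per data source, from the Broden+ statistics above.
SOURCE_SIZES = {
    "ADE20K": 22_210,
    "Pascal-Context/Pascal-Part": 10_103,
    "OpenSurfaces": 19_142,
    "DTD": 5_640,
}

# Assumed mapping from data source to the heads it can supervise.
SOURCE_TASKS = {
    "ADE20K": ("scene", "object", "part"),
    "Pascal-Context/Pascal-Part": ("object", "part"),
    "OpenSurfaces": ("material",),
    "DTD": ("texture",),
}


def sample_source(rng, sizes=SOURCE_SIZES):
    """Draw one data source with probability proportional to its size."""
    names = list(sizes)
    weights = [sizes[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def masked_mean_loss(per_pixel_loss, labels, ignore_label=-1):
    """Average per-pixel losses over labeled pixels only."""
    kept = [loss for loss, y in zip(per_pixel_loss, labels) if y != ignore_label]
    return sum(kept) / len(kept) if kept else 0.0


rng = random.Random(0)
for step in range(3):
    source = sample_source(rng)
    # Only the heads tied to the sampled source would receive gradients.
    print(step, source, "-> update heads:", SOURCE_TASKS[source])
```

Each iteration then sees a full mini-batch for one group of tasks, which is the point of the strategy: the gradient for a task is never computed from a diluted fraction of the batch.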

3. Experimental Results

3.1. SOTA Comparisons

**Detailed analysis of our framework based on ResNet-50 vs. state-of-the-art methods on the ADE20K dataset**

In general, FPN demonstrates competitive performance while requiring far fewer computational resources for semantic segmentation.

Adding the Pyramid Pooling Module (PPM) boosts performance by a margin of 4.87/3.09 (mean IoU/pixel accuracy).
Fusing features from all levels of the FPN yields the best performance.

3.2. Multi-Task Learning with Heterogeneous Annotations

**Results of Unified Perceptual Parsing on the Broden+ dataset** O: Object. P: Part. S: Scene. M: Material. T: Texture. mI.: mean IoU. P.A.: pixel accuracy. mI.(bg): mean IoU including background. T-1: top-1 accuracy.

The baseline for object parsing is the model trained on ADE20K and Pascal-Context. It yields mIoU and P.A. of 24.72/78.03.
When jointly training material with object, part, and scene classification, it yields a performance of 54.19/84.45 on material parsing, 23.36/77.09 on object parsing, and 28.75/46.92 on part parsing.

It is worth noting that object and part parsing both suffer a slight performance degradation due to heterogeneity, while material parsing enjoys a boost in performance compared with the model trained only on OpenSurfaces.
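
For readers unfamiliar with the two segmentation metrics reported above, here is a minimal pure-Python sketch of pixel accuracy (P.A.) and mean IoU on flattened label lists. It is an illustration only: the function names and the `ignore_label` value of -1 are hypothetical, and benchmark implementations may differ in details such as which classes are averaged or how background is handled.

```python
def pixel_accuracy(pred, target, ignore_label=-1):
    """Fraction of labeled pixels whose predicted class matches the target."""
    pairs = [(p, t) for p, t in zip(pred, target) if t != ignore_label]
    correct = sum(1 for p, t in pairs if p == t)
    return correct / len(pairs)


def mean_iou(pred, target, num_classes, ignore_label=-1):
    """Average intersection-over-union across classes that occur at all."""
    pairs = [(p, t) for p, t in zip(pred, target) if t != ignore_label]
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in pairs if p == c and t == c)
        union = sum(1 for p, t in pairs if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Toy example: six pixels, three classes, the last pixel is unlabeled.
target = [0, 0, 1, 1, 2, -1]
pred = [0, 1, 1, 1, 2, 2]
print(pixel_accuracy(pred, target))  # 4 of 5 labeled pixels are correct: 0.8
print(mean_iou(pred, target, num_classes=3))
```

In the toy example the per-class IoUs are 1/2, 2/3, and 1, so the mean IoU is lower than the pixel accuracy. This is why the two numbers in the table can move differently: mean IoU weights every class equally, while pixel accuracy is dominated by frequent classes.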

3.3. Qualitative Results

**Predictions on the validation set using UPerNet (ResNet-50)**

There are more results in the paper.

3.4. Discovering Visual Knowledge in Natural Scenes

**Visualization of scene-object relations**

For each scene, the number of times each object shows up is counted and normalized by the frequency of that scene. Following [44], the relation is formulated as a bipartite graph G=(V,E).
We can clearly see that the indoor scenes mostly share objects such as ceiling, floor, chair, or windowpane while the outdoor scenes mostly share objects such as sky, tree, building, or mountain.
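
The counting and normalization step can be sketched as follows. This is a hedged illustration rather than the paper's procedure: the function name `scene_object_edges` and the toy predictions are hypothetical, and it assumes an object is counted once per image in which it is predicted. The result is the weighted edge set of a bipartite graph between scenes and objects.

```python
from collections import Counter, defaultdict


def scene_object_edges(images):
    """Build weighted scene-object edges from per-image predictions.

    images: iterable of (scene_label, collection_of_object_labels).
    Returns {(scene, obj): fraction of that scene's images containing obj}.
    """
    scene_count = Counter()
    cooccur = defaultdict(Counter)
    for scene, objects in images:
        scene_count[scene] += 1
        for obj in set(objects):  # count each object once per image
            cooccur[scene][obj] += 1
    return {
        (scene, obj): n / scene_count[scene]
        for scene, objs in cooccur.items()
        for obj, n in objs.items()
    }


# Toy predictions, invented for illustration only.
toy = [
    ("bedroom", {"floor", "bed"}),
    ("bedroom", {"floor", "chair"}),
    ("street", {"sky", "building"}),
]
for edge, weight in sorted(scene_object_edges(toy).items()):
    print(edge, weight)
```

Normalizing by scene frequency keeps common scenes from dominating: an edge weight of 1.0 means the object appears in every image of that scene, regardless of how many such images exist.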

**Discovered visual knowledge by UPerNet**

For scene-object relations, the objects which appear in at least 30% of a scene are chosen.
For object-material, part-material and material-texture relations, at most top-3 candidates are chosen with filtering and frequency normalization.
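
The two selection rules above can be sketched on top of normalized relation weights. The helper names and the toy weights are hypothetical, and the unspecified "filtering" step for the top-3 relations is not reproduced here; only the 30% threshold and the top-3 cap come from the text.

```python
def frequent_objects(edges, min_ratio=0.3):
    """Keep scene-object edges whose normalized frequency is at least min_ratio."""
    return {pair: w for pair, w in edges.items() if w >= min_ratio}


def top_k_candidates(weights, k=3):
    """For each source concept, keep at most k targets with the largest weight.

    weights: {(source, target): normalized frequency}.
    Ties are broken alphabetically so the output is deterministic.
    """
    by_source = {}
    for (source, target), w in weights.items():
        by_source.setdefault(source, []).append((w, target))
    return {
        source: [t for _, t in sorted(cands, key=lambda x: (-x[0], x[1]))[:k]]
        for source, cands in by_source.items()
    }


# Toy weights, invented for illustration only.
scene_object = {("bedroom", "floor"): 1.0, ("bedroom", "vase"): 0.1}
object_material = {
    ("wall", "painted"): 0.6,
    ("wall", "brick"): 0.2,
    ("wall", "wood"): 0.1,
    ("wall", "tile"): 0.05,
    ("floor", "wood"): 0.7,
}
print(frequent_objects(scene_object))
print(top_k_candidates(object_material))
```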

This knowledge base provides rich information across various types of concepts.

UPerNet is later used as the segmentation framework in the Swin Transformer and ConvNeXt experiments.

Reference

[2018 ECCV] [UPerNet]
Unified Perceptual Parsing for Scene Understanding
