Review — UPerNet: Unified Perceptual Parsing for Scene Understanding
Unified Perceptual Parsing for Scene Understanding
UPerNet, by Peking University, MIT CSAIL, Bytedance Inc., and Megvii Inc.
2018 ECCV, Over 400 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Scene Parsing, Fully Convolutional Network, FCN
- A new task called Unified Perceptual Parsing is proposed, which requires the machine vision systems to recognize as many visual concepts as possible from a given image.
- A multi-task framework called UPerNet and a training strategy are developed to learn from heterogeneous image annotations.
- Broden+ Dataset
- Experimental Results
1. Broden+ Dataset
- The Broden dataset provides a wide range of visual concepts, but samples from different classes are unbalanced. The Broden dataset is therefore standardized into the Broden+ dataset to make it more suitable for training segmentation networks.
- First, similar concepts are merged across different datasets. For example, objects and parts annotations in ADE20K, Pascal-Context, and Pascal-Part are merged and unified.
- Second, only object classes which appear in at least 50 images and contain at least 50,000 pixels are included in the whole dataset. Also, object parts which appear in at least 20 images can be considered valid parts. Objects and parts that are conceptually inconsistent are manually removed.
- Third, under-sampled labels are manually merged in OpenSurfaces.
- Broden+ contains 57,095 images in total, including 22,210 images from ADE20K, 10,103 images from Pascal-Context and Pascal-Part, 19,142 images from OpenSurfaces and 5,640 images from DTD.
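The object-class filtering rule above (at least 50 images and at least 50,000 pixels) can be sketched as follows. This is only an illustration: the per-image annotation format (a dict from class name to pixel count) and the function name `select_object_classes` are assumptions, not part of the Broden+ tooling.

```python
from collections import defaultdict

MIN_IMAGES = 50       # an object class must appear in at least 50 images
MIN_PIXELS = 50_000   # and cover at least 50,000 pixels over the dataset


def select_object_classes(annotations, min_images=MIN_IMAGES, min_pixels=MIN_PIXELS):
    """Return the sorted class names that pass both thresholds.

    `annotations` is an iterable with one dict per image, mapping a class
    name to the number of pixels labeled with that class (hypothetical format).
    """
    image_count = defaultdict(int)
    pixel_count = defaultdict(int)
    for per_image in annotations:
        for name, pixels in per_image.items():
            if pixels > 0:
                image_count[name] += 1
                pixel_count[name] += pixels
    return sorted(
        name
        for name in image_count
        if image_count[name] >= min_images and pixel_count[name] >= min_pixels
    )


# Toy data: "wall" passes both tests, "lamp" has too few pixels,
# "rug" appears in too few images.
toy = [{"wall": 1000, "lamp": 10} for _ in range(60)]
toy += [{"rug": 10_000} for _ in range(10)]
kept = select_object_classes(toy)
```

The same pattern with a threshold of 20 images would cover the rule for object parts.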
2. UPerNet
2.1. Network Architecture
- Top-left: The Feature Pyramid Network (FPN), with the Pyramid Pooling Module (PPM) from PSPNet appended to the last layer of the backbone network before it is fed into the top-down branch of FPN.
- Top-right: Features are used at various semantic levels.
- Scene head is attached on the feature map directly after the PPM since image-level information is more suitable for scene classification.
- Object and part heads are attached on the feature map fused by all the layers output by FPN.
- Material head is attached on the feature map in FPN with the highest resolution.
- Texture head is attached on the Res-2 block in ResNet, and fine-tuned after the whole network finishes training on other tasks.
- Bottom: The illustrations of different heads. All extra non-classifier convolutional layers, including those in FPN, have batch normalization (BN) with 512-channel output. ReLU is applied after BN.
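As a compact summary of where each head reads its features, the mapping above can be written down as a small lookup table. This is a sketch, not the authors' code: the key names are made up for illustration, the stage strides of 4, 8, 16 and 32 are those of a standard (non-dilated) ResNet, and the stride of the fused FPN feature is assumed to equal that of the highest-resolution level.

```python
# Where each head is attached, following the description above.
# Key names are illustrative only.
HEAD_SOURCES = {
    "scene": "ppm",              # directly after the PPM on the last stage
    "object": "fpn_fused",       # fusion of all FPN output levels
    "part": "fpn_fused",
    "material": "fpn_highest",   # highest-resolution FPN level
    "texture": "res2",           # Res-2 block of the backbone
}

# Down-sampling rate of each feature source relative to the input image.
SOURCE_STRIDE = {
    "ppm": 32,          # last ResNet stage
    "fpn_fused": 4,     # assumption: fused at the highest resolution
    "fpn_highest": 4,
    "res2": 4,
}


def head_feature_size(head, height, width):
    """Spatial size of the feature map a head sees for a given input size."""
    stride = SOURCE_STRIDE[HEAD_SOURCES[head]]
    # Ceiling division, matching the usual padded strided layers in ResNet.
    return (-(-height // stride), -(-width // stride))


scene_size = head_feature_size("scene", 512, 512)        # coarse, image-level
material_size = head_feature_size("material", 512, 512)  # fine, pixel-level
```

The contrast between the two sizes reflects the design choice in the text: scene classification uses coarse, image-level features, while material and texture rely on high-resolution, low-level ones.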
2.2. Training Strategy
- During each iteration, if a mini-batch is composed of images from several sources on various tasks, the gradient with respect to a certain task can be noisy, since the real batch size of each task is in fact decreased.
- Thus, a data source is randomly sampled at each iteration based on the scale of each source, and only the path to infer the concepts related to the selected source is updated. For object and material, losses are not calculated on unlabeled area.
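The sampling strategy above can be sketched in a few lines. The image counts come from the Broden+ statistics in Section 1; the source-to-task mapping, the `ignore_label` value of -1 and all function names are assumptions made for illustration. DTD is left out here because the texture head is described as being fine-tuned after the other tasks.

```python
import random

# Image counts per source from the Broden+ statistics; the task lists
# are an assumed mapping, not taken from the authors' code.
SOURCES = {
    "ADE20K": {"images": 22210, "tasks": ("scene", "object", "part")},
    "Pascal": {"images": 10103, "tasks": ("object", "part")},
    "OpenSurfaces": {"images": 19142, "tasks": ("material",)},
}


def sample_source(rng):
    """Pick one data source per iteration, proportionally to its size."""
    names = list(SOURCES)
    weights = [SOURCES[name]["images"] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def active_tasks(source):
    """Heads whose path is updated when a batch comes from `source`."""
    return SOURCES[source]["tasks"]


def masked_mean_loss(per_pixel_loss, labels, ignore_label=-1):
    """Average the per-pixel loss over labeled pixels only."""
    kept = [loss for loss, y in zip(per_pixel_loss, labels) if y != ignore_label]
    return sum(kept) / len(kept) if kept else 0.0


rng = random.Random(0)
source = sample_source(rng)     # one source for the whole mini-batch
tasks = active_tasks(source)    # only these heads receive gradients
```

Because every image in a mini-batch then comes from the same source, each active task sees the full batch size, which is the motivation given above.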
3. Experimental Results
3.1. SOTA Comparisons
- In general, FPN demonstrates competitive performance while requiring much less computational resources for semantic segmentation.
Adding the Pyramid Pooling Module (PPM) boosts performance by a margin of 4.87/3.09 (mIoU/P.A.).
Fusing features from all levels of FPN yields best performance.
3.2. Multi-Task Learning with Heterogeneous Annotations
- The baseline of object parsing is the model trained on ADE20K and Pascal-Context. It yields mIoU and P.A. of 24.72/78.03.
- When jointly training material with object, part, and scene classification, it yields a performance of 54.19/84.45 on material parsing, 23.36/77.09 on object parsing, and 28.75/46.92 on part parsing.
It is worth noting that object and part parsing both suffer a slight performance degradation due to heterogeneity, while material parsing enjoys a boost in performance compared with the model trained only on OpenSurfaces.
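For readers unfamiliar with the two metrics reported as mIoU/P.A., a minimal sketch of how they are commonly computed from flattened prediction and label lists is given below. The function name and the `ignore_label` convention are assumptions, and the paper's exact evaluation protocol may differ in details such as which classes are averaged.

```python
def pixel_accuracy_and_miou(pred, target, num_classes, ignore_label=-1):
    """Return (pixel accuracy, mean IoU) over labeled pixels.

    `pred` and `target` are flat sequences of class indices in
    [0, num_classes); pixels whose target is `ignore_label` are skipped.
    """
    correct = 0
    labeled = 0
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue
        labeled += 1
        if p == t:
            correct += 1
            intersection[t] += 1
            union[t] += 1
        else:
            # The pixel belongs to the union of both the true class
            # and the wrongly predicted class.
            union[t] += 1
            union[p] += 1
    # Average IoU only over classes that occur in prediction or target.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    pa = correct / labeled if labeled else 0.0
    miou = sum(ious) / len(ious) if ious else 0.0
    return pa, miou


pa, miou = pixel_accuracy_and_miou([0, 0, 1, 1], [0, 1, 1, -1], num_classes=2)
```

The numbers in this section are these quantities expressed as percentages.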
3.3. Qualitative Results
- There are more results in the paper.
3.4. Discovering Visual Knowledge in Natural Scenes
- For each scene, the number of times each object shows up is counted and normalized by the frequency of this scene. The relation is then formulated as a bipartite graph G=(V,E).
- We can clearly see that the indoor scenes mostly share objects such as ceiling, floor, chair, or windowpane while the outdoor scenes mostly share objects such as sky, tree, building, or mountain.
- For scene-object relations, the objects which appear in at least 30% of the images of a scene are chosen.
- For object-material, part-material and material-texture relations, at most top-3 candidates are chosen with filtering and frequency normalization.
This knowledge base provides rich information across various types of concepts.
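The scene-object relation from this section can be sketched as a weighted bipartite edge list. This is an illustrative reading of the procedure, assuming that the input is one (scene, objects) pair per image and that the 30% threshold applies to the fraction of a scene's images containing the object; the function name is hypothetical.

```python
from collections import Counter, defaultdict


def scene_object_edges(samples, min_ratio=0.3):
    """Build edges (scene, object) -> normalized co-occurrence frequency.

    `samples` is an iterable of (scene_name, objects_in_image) pairs,
    one per image. An edge is kept when the object appears in at least
    `min_ratio` of the images of that scene.
    """
    scene_freq = Counter()
    cooccur = defaultdict(Counter)
    for scene, objects in samples:
        scene_freq[scene] += 1
        for obj in set(objects):   # count each object once per image
            cooccur[scene][obj] += 1
    edges = {}
    for scene, counts in cooccur.items():
        for obj, n in counts.items():
            ratio = n / scene_freq[scene]   # normalize by scene frequency
            if ratio >= min_ratio:
                edges[(scene, obj)] = ratio
    return edges


toy = [
    ("bedroom", {"bed", "floor"}),
    ("bedroom", {"bed", "lamp"}),
    ("bedroom", {"bed"}),
    ("bedroom", {"bed", "floor"}),
    ("street", {"sky", "car"}),
]
edges = scene_object_edges(toy)
```

In the toy data, "lamp" appears in only one of four bedroom images (25%), so it falls below the threshold and no bedroom-lamp edge is created.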