Brief Review — The Cityscapes Dataset for Semantic Urban Scene Understanding

Cityscapes, One of the Popular Semantic Segmentation Datasets

Sik-Ho Tsang
3 min read · Dec 22, 2022
Cityscapes Dataset (Figure from

The Cityscapes Dataset for Semantic Urban Scene Understanding,
Cityscapes, by Daimler AG R&D, TU Darmstadt, MPI Informatics, and TU Dresden, 2016 CVPR, Over 8600 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Dataset

  • Cityscapes comprises a large, diverse set of stereo video sequences recorded in the streets of 50 different cities.
  • 5000 of these images have high-quality pixel-level annotations.
  • 20,000 additional images have coarse annotations to enable methods that leverage large volumes of weakly-labeled data.


  1. Cityscapes Dataset
  2. Results

1. Cityscapes Dataset

  • Several hundred thousand frames were acquired from a moving vehicle over a span of several months, covering spring, summer, and fall in 50 cities, primarily in Germany but also in neighboring countries. Recording in adverse weather conditions was deliberately avoided.
  • 5000 images were manually selected from 27 cities for dense pixel-level annotation, aiming for high diversity of foreground objects, background, and overall scene layout. The annotations were done on the 20th frame of a 30-frame video snippet, and the full snippet is provided to supply context information.
  • For the remaining 23 cities, a single image every 20 s or 20 m of driving distance (whichever comes first) was selected for coarse annotation, yielding 20,000 images in total.
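The "every 20 s or 20 m, whichever comes first" rule for the coarse set can be sketched in a few lines. This is only an illustration, not the authors' tooling: the function name `select_frames`, the per-frame timestamp and odometer inputs, and the choice to always keep the first frame are all assumptions made for the example.

```python
from typing import List, Sequence


def select_frames(timestamps_s: Sequence[float],
                  odometer_m: Sequence[float],
                  max_gap_s: float = 20.0,
                  max_gap_m: float = 20.0) -> List[int]:
    """Pick a frame whenever 20 s or 20 m have passed since the last pick.

    timestamps_s: recording time of each frame in seconds (hypothetical input).
    odometer_m:   cumulative driving distance of each frame in meters
                  (hypothetical input).
    """
    if len(timestamps_s) != len(odometer_m):
        raise ValueError("timestamps and odometer readings must align")

    selected: List[int] = []
    for i, (t, d) in enumerate(zip(timestamps_s, odometer_m)):
        if not selected:
            # Assumption: the first frame of a recording is always kept.
            selected.append(i)
            continue
        last = selected[-1]
        elapsed_s = t - timestamps_s[last]
        driven_m = d - odometer_m[last]
        # "Whichever comes first": either threshold triggers a new pick.
        if elapsed_s >= max_gap_s or driven_m >= max_gap_m:
            selected.append(i)
    return selected
```

With this rule, a car waiting at a red light still contributes a frame every 20 s, while a fast-moving car contributes one every 20 m.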
Train Val Test Split
  • Densely annotated images are split into separate training, validation, and test sets.
  • Coarsely annotated images serve as additional training data only.
Number of pixels for each class
  • The above shows some statistics for each class in the dataset.
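Per-class pixel statistics of this kind boil down to counting label IDs over all annotation maps. Below is a minimal sketch, assuming each annotation is a 2D grid of integer class IDs and that one ID (255 here, an assumption for the example) marks pixels to ignore. The function name `class_pixel_counts` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def class_pixel_counts(label_maps: Iterable[Sequence[Sequence[int]]],
                       ignore_label: int = 255) -> Dict[int, int]:
    """Count annotated pixels per class ID over a collection of label maps."""
    counts: Counter = Counter()
    for label_map in label_maps:
        for row in label_map:
            counts.update(row)  # one increment per pixel
    # Drop the ignore label (assumed value) so it is not reported as a class.
    counts.pop(ignore_label, None)
    return dict(counts)


# Two tiny toy "images" with class IDs 0, 1, 2 and ignored pixels (255).
toy_maps = [
    [[0, 1, 1],
     [2, 255, 1]],
    [[0, 0],
     [1, 255]],
]
toy_counts = class_pixel_counts(toy_maps)
```

For real label images stored as NumPy arrays, `np.bincount` on the flattened array does the same job much faster.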

2. Results

Quantitative results of baselines for semantic labeling

FCN and other SOTA approaches of that year, such as DPN [40], CRF-RNN [81], DeepLabv1 [9], and DilatedNet [79], are used to benchmark the dataset. The IoU and iIoU scores are low, which indicates that the dataset is challenging.
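For reference, the class-level IoU is TP / (TP + FP + FN), computed per class from pixel counts. The sketch below derives it from a confusion matrix. The function names and the matrix layout (`confusion[g][p]` = pixels of ground-truth class g predicted as class p) are assumptions for illustration. The instance-weighted iIoU is not covered here, since it additionally needs per-instance sizes.

```python
from typing import List, Optional


def per_class_iou(confusion: List[List[int]]) -> List[Optional[float]]:
    """IoU = TP / (TP + FP + FN) for each class of a square confusion matrix.

    confusion[g][p] holds the number of pixels whose ground truth is g
    and whose prediction is p (layout assumed for this sketch).
    """
    n = len(confusion)
    ious: List[Optional[float]] = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # missed pixels of class c
        fp = sum(confusion[g][c] for g in range(n)) - tp   # pixels wrongly labeled c
        denom = tp + fp + fn
        # A class absent from both ground truth and prediction has no IoU.
        ious.append(tp / denom if denom else None)
    return ious


def mean_iou(confusion: List[List[int]]) -> float:
    """Average the per-class IoU over classes that have a defined score."""
    valid = [v for v in per_class_iou(confusion) if v is not None]
    if not valid:
        raise ValueError("no class has any pixels")
    return sum(valid) / len(valid)
```

For a toy two-class matrix `[[3, 1], [1, 5]]`, the per-class scores are 3/5 and 5/7.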

Quantitative results (avg. recall in percent) of a half-resolution FCN-8s model trained on Cityscapes images and tested on CamVid and KITTI.

An FCN-8s model trained on Cityscapes images and tested on CamVid and KITTI obtains reasonable performance, which suggests that the dataset integrates well with existing ones and allows for cross-dataset research.
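The metric in this cross-dataset table is average recall. A minimal sketch follows, assuming recall is TP / (TP + FN) per class and that the average is taken over the classes that actually occur in the ground truth. The function name and the confusion-matrix layout (`confusion[g][p]` = pixels of ground-truth class g predicted as class p) are assumptions for the example.

```python
from typing import List


def average_recall_percent(confusion: List[List[int]]) -> float:
    """Mean of per-class recall TP / (TP + FN), reported in percent."""
    recalls = []
    for c, row in enumerate(confusion):
        total = sum(row)  # all ground-truth pixels of class c
        if total:
            # Classes without ground-truth pixels are skipped (assumption).
            recalls.append(row[c] / total)
    if not recalls:
        raise ValueError("no class has ground-truth pixels")
    return 100.0 * sum(recalls) / len(recalls)
```

Unlike IoU, recall does not penalize false positives, so it is a more lenient score when label definitions differ slightly between datasets.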

Qualitative examples of selected baselines. From left to right: image with stereo depth maps partially overlaid, annotation, [48], [37], and DilatedNet [79].
  • The above shows some visualized results.


[2016 CVPR] [Cityscapes]
The Cityscapes Dataset for Semantic Urban Scene Understanding


1.5. Semantic Segmentation / Scene Parsing

2015 … 2016 … [Cityscapes] … 2020 [DRRN Zhang JNCA’20] [Trans10K, TransLab] [CCNet] 2021 [PVT, PVTv1] [SETR] 2022 [PVTv2]
