Brief Review — Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Visual Genome, A Dataset of Relationships Between Objects

Sik-Ho Tsang
2 min read · Dec 25, 2022
A dataset of images densely annotated with numerous region descriptions, objects, attributes, and relationships.

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations,
Visual Genome (VG), by Stanford University, Dresden University of Technology, Yahoo Inc., Snapchat Inc., and Centrum Wiskunde & Informatica (CWI)
2017 IJCV, Over 3500 Citations (Sik-Ho Tsang @ Medium)
Dataset, Vision Language Model, VLM, Question Answering

  • Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world.
  • The Visual Genome (VG) dataset enables the modeling of relationships, e.g., riding(man, carriage) and pulling(horse, carriage), so that a model can correctly answer that “the person is riding a horse-drawn carriage.” (A minimal sketch of such a representation follows below.)
  • This is a paper from Prof. Fei-Fei Li.

VG contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects.
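
To make this structure concrete, below is a minimal Python sketch, my own illustration rather than the official VG release format, of how one image's objects, attributes, and pairwise relationships could be represented; the class names, bounding boxes, and attribute values are hypothetical.

```python
# A minimal sketch (NOT the official Visual Genome data format) of one image's
# scene graph: objects with attributes, plus directed pairwise relationships
# such as riding(man, carriage) and pulling(horse, carriage).
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    bbox: tuple                     # (x, y, w, h) image region (made-up values below)
    attributes: list = field(default_factory=list)

@dataclass
class Relationship:
    predicate: str                  # e.g. "riding", "pulling"
    subject: SceneObject
    obj: SceneObject

man = SceneObject("man", (279, 266, 24, 37), ["sitting"])
horse = SceneObject("horse", (223, 347, 35, 65), ["brown"])
carriage = SceneObject("carriage", (180, 240, 80, 100), ["wooden"])

scene_graph = [
    Relationship("riding", man, carriage),
    Relationship("pulling", horse, carriage),
]

for r in scene_graph:
    print(f"{r.predicate}({r.subject.name}, {r.obj.name})")
# riding(man, carriage)
# pulling(horse, carriage)
```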

  • Below are some examples from the dataset:
3 region descriptions and their corresponding region graphs.
  • The relationships between objects can be used for question answering (a sketch of the two QA types appears after this list):
Each image contains region descriptions that describe localized portions of the image. VG collects two types of question-answer pairs (QAs): freeform QAs and region-based QAs.
  • A total of over 33,000 unique workers contributed to the dataset, with the geographic distribution shown below:
Crowd workers around the world.
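
As a rough illustration of the two QA types, the records below use made-up field names and values; the actual released JSON schema may differ.

```python
# Hypothetical examples of the two QA types collected by VG (field names are
# illustrative, not the released schema).

# Freeform QA: asked about the image as a whole.
freeform_qa = {
    "image_id": 1,
    "question": "What is the person doing?",
    "answer": "Riding a horse-drawn carriage.",
}

# Region-based QA: grounded in a localized region and its description.
region_qa = {
    "image_id": 1,
    "region": {"x": 180, "y": 240, "w": 80, "h": 100},
    "phrase": "a man riding a horse-drawn carriage",
    "question": "What is pulling the carriage?",
    "answer": "A horse.",
}
```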

With such dense and rich annotations, the dataset can have multiple purposes.

Just a very short review during the Christmas holiday.
Merry Christmas!!!

Reference

[2017 IJCV] [Visual Genome (VG)]
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

3.1. Visual/Vision/Video Language Model (VLM)

2017 [Visual Genome (VG)] 2018 [Conceptual Captions] 2019 [VideoBERT] [VisualBERT] [LXMERT] [ViLBERT] 2020 [ConVIRT] [VL-BERT] [OSCAR] 2021 [CLIP] [VinVL] [ALIGN] [VirTex] 2022 [FILIP] [Wukong]

==== My Other Previous Paper Readings ====
