Brief Review — Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Visual Genome, a Dataset of Relationships Between Objects

Sik-Ho Tsang
2 min read · Dec 25, 2022
A dataset of images densely annotated with numerous region descriptions, objects, attributes, and relationships.

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations,
Visual Genome (VG), by Stanford University, Dresden University of Technology, Yahoo Inc., Snapchat Inc., and Centrum Wiskunde & Informatica (CWI)
2017 IJCV, Over 3500 Citations (Sik-Ho Tsang @ Medium)
Dataset, Vision Language Model, VLM, Question Answering

  • Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world.
  • The Visual Genome (VG) dataset is proposed to enable the modeling of relationships, e.g. riding(man, carriage) and pulling(horse, carriage), so that a model can correctly answer that “the person is riding a horse-drawn carriage.” (A minimal sketch of such relationship triples is shown after this list.)
  • This is a paper from Prof. Fei-Fei Li.
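As a quick illustration (not the paper's code), a relationship such as riding(man, carriage) can be stored as a (subject, predicate, object) triple. The sketch below uses hypothetical names and plain Python to show how a tiny scene graph of such triples might be queried:

```python
from collections import namedtuple

# A relationship is a (subject, predicate, object) triple, e.g. riding(man, carriage).
Relationship = namedtuple("Relationship", ["subject", "predicate", "object"])

# Hypothetical toy scene graph for the carriage example in the paper.
scene_graph = [
    Relationship("man", "riding", "carriage"),
    Relationship("horse", "pulling", "carriage"),
]

def predicates_between(graph, subject, obj):
    """Return all predicates linking a given subject to a given object."""
    return [r.predicate for r in graph if r.subject == subject and r.object == obj]

print(predicates_between(scene_graph, "man", "carriage"))    # ['riding']
print(predicates_between(scene_graph, "horse", "carriage"))  # ['pulling']
```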

VG contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects.

  • Below are some examples from the dataset:
3 region descriptions and their corresponding region graphs.
  • The relationships between objects can be used for question answering:
Each image contains region descriptions that describe localized portions of the image. VG collects two types of question-answer pairs (QAs): freeform QAs and region-based QAs. (A hedged loading sketch follows this list.)
  • A total of over 33,000 unique workers contributed to the dataset, with their distribution around the world shown below:
Crowd Workers Over the World.
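As a rough sketch of how the annotations could be read, the snippet below assumes the publicly released VG JSON dumps (file names and field keys such as "regions", "phrase", "qas", "question", "answer" are assumptions here; check the actual download before relying on them):

```python
import json

# Hypothetical file names / field layout, based on the commonly distributed VG JSON files.
with open("region_descriptions.json") as f:
    region_data = json.load(f)

with open("question_answers.json") as f:
    qa_data = json.load(f)

# Each entry corresponds to one image; regions carry a phrase plus a bounding box.
first_image_regions = region_data[0].get("regions", [])
for region in first_image_regions[:3]:
    print(region.get("phrase"), region.get("x"), region.get("y"),
          region.get("width"), region.get("height"))

# QA pairs (freeform and region-based) are grouped per image as well.
first_image_qas = qa_data[0].get("qas", [])
for qa in first_image_qas[:3]:
    print(qa.get("question"), "->", qa.get("answer"))
```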

With such dense and rich annotations, the dataset can serve multiple purposes.

Just a very short review during the Christmas holiday.
Merry Christmas!!!

Reference

[2017 IJCV] [Visual Genome (VG)]
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

3.1. Visual/Vision/Video Language Model (VLM)

2017 [Visual Genome (VG)] 2018 [Conceptual Captions] 2019 [VideoBERT] [VisualBERT] [LXMERT] [ViLBERT] 2020 [ConVIRT] [VL-BERT] [OSCAR] 2021 [CLIP] [VinVL] [ALIGN] [VirTex] 2022 [FILIP] [Wukong]

==== My Other Previous Paper Readings ====
