Brief Review — Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
Visual Genome, A Dataset of Relationships Between Objects
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations,
Visual Genome (VG), by Stanford University, Dresden University of Technology, Yahoo Inc., Snapchat Inc., and Centrum Wiskunde & Informatica (CWI)
2017 IJCV, Over 3500 Citations (Sik-Ho Tsang @ Medium)
Dataset, Vision Language Model, VLM, Question Answering
- Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world.
- The Visual Genome (VG) dataset is proposed to enable the modeling of relationships between objects, e.g. riding(man, carriage) and pulling(horse, carriage), so that a model can correctly answer that “the person is riding a horse-drawn carriage” (see the sketch after this list).
- This is a paper from Prof. Fei-Fei Li.
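The relationship notation above is just a subject-predicate-object triple. Below is a minimal sketch (not from the paper; the class and field names are my own illustration) of how such a triple could be represented in Python:

```python
# Minimal sketch of a Visual Genome-style relationship as a
# (subject, predicate, object) triple; names are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Relationship:
    subject: str    # e.g. "man"
    predicate: str  # e.g. "riding"
    object: str     # e.g. "carriage"

# The two example relationships from the carriage image mentioned above.
scene_graph = [
    Relationship("man", "riding", "carriage"),
    Relationship("horse", "pulling", "carriage"),
]
```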
- VG contains over 108K images, where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects.
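As a rough sketch of how such per-image averages could be recomputed, assuming the publicly released VG JSON files (objects.json and relationships.json, each a list of per-image records; the file names and keys are assumptions and may differ across dataset versions):

```python
# Hedged sketch: average annotations per image from an assumed VG JSON layout.
# File names and keys ("objects", "relationships") are assumptions and may
# differ across Visual Genome release versions.
import json

def average_per_image(path, key):
    """Average number of entries stored under `key` across all image records."""
    with open(path) as f:
        records = json.load(f)
    return sum(len(r.get(key, [])) for r in records) / len(records)

print("objects per image:      ", average_per_image("objects.json", "objects"))
print("relationships per image:", average_per_image("relationships.json", "relationships"))
```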
- Some example annotations from the dataset are shown below:
- The relationships between objects can be used for question answering, as sketched after this list:
- A total of over 33,000 unique workers contributed to the dataset, with their worldwide distribution shown below:
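To make the question-answering point above concrete, here is a toy, self-contained sketch (plain tuples instead of the dataclass above; the lookup and question form are illustrative, not the paper's QA pipeline) of answering “What is the man riding?” from relationship triples:

```python
# Toy sketch: answer a simple question by looking up (subject, predicate, object)
# triples in a tiny scene graph; purely illustrative, not the paper's method.
triples = [
    ("man", "riding", "carriage"),
    ("horse", "pulling", "carriage"),
]

def answer(subject, predicate):
    """Return every object o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(answer("man", "riding"))     # ['carriage'] -> "the person is riding a carriage"
print(answer("horse", "pulling"))  # ['carriage']
```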
With such dense and rich annotations, the dataset can serve multiple purposes.
Just a very short review during the Christmas holiday.
Merry Christmas!!!
Reference
[2017 IJCV] [Visual Genome (VG)]
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
3.1. Visual/Vision/Video Language Model (VLM)
2017 [Visual Genome (VG)] 2018 [Conceptual Captions] 2019 [VideoBERT] [VisualBERT] [LXMERT] [ViLBERT] 2020 [ConVIRT] [VL-BERT] [OSCAR] 2021 [CLIP] [VinVL] [ALIGN] [VirTex] 2022 [FILIP] [Wukong]