Brief Review — Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning

Conceptual Captions, Over 3M <Image, Caption> Pairs

Aug 20, 2022

**Examples of images and image descriptions from the Conceptual Captions dataset**

Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning,
Conceptual Captions, by Google AI,
2018 ACL, Over 700 Citations (Sik-Ho Tsang @ Medium)
Vision Language Model (VLM), Image Captioning, Transformer

Conceptual Captions, an image captioning dataset, is proposed, which has an order of magnitude more images than the MS-COCO dataset.
An image captioning model is proposed as a baseline, where Inception-ResNet-v2, as in Inception-v4, is used for image-feature extraction and a Transformer for sequence modeling.

Outline

Conceptual Captions: Dataset Generation Process
Image Captioning Model
Results

1. Conceptual Captions: Dataset Generation Process

**Conceptual Captions pipeline steps with examples and final output**

This pipeline processes billions of Internet webpages in parallel. From these webpages, it extracts, filters, and processes candidate <image, caption> pairs.

1.1. Image-based Filtering

It only keeps JPEG images where both dimensions are greater than 400 pixels and the ratio of the larger to the smaller dimension is no more than 2. Images that trigger pornography or profanity detectors are excluded.
These filters discard more than 65% of the candidates.
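The size and shape rules above can be sketched as a small predicate. This is only an illustrative sketch: the function name and parameters are hypothetical, the ratio rule is read as larger-to-smaller dimension at most 2, and the pornography/profanity detectors are left out because they are not described here.

```python
def passes_image_filter(width, height, fmt, min_side=400, max_ratio=2.0):
    """Return True if an image meets the format, size, and aspect-ratio rules.

    Hypothetical helper for illustration; not the pipeline's actual code.
    """
    # Only JPEG images are kept.
    if fmt.upper() not in ("JPEG", "JPG"):
        return False
    # Both dimensions must be strictly greater than min_side pixels.
    if width <= min_side or height <= min_side:
        return False
    # The ratio of the larger to the smaller dimension must not exceed max_ratio.
    return max(width, height) / min(width, height) <= max_ratio
```

For instance, an 800x600 JPEG passes, while a 1000x450 JPEG fails the ratio rule and a 400x600 JPEG fails the size rule.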

1.2. Text-based Filtering

It harvests Alt-text from HTML webpages.
Google Cloud Natural Language APIs are used to analyze candidate Alt-text. Heuristics are also introduced.
These filters only allow around 3% of the incoming candidates to pass to the later stages.

1.3. Image & Text-based Filtering

Google Cloud Vision APIs are used to assign class labels to images. Images are generally assigned between 5 and 20 labels. These labels are matched against the candidate text.
Candidates for which none of the text tokens can be mapped to the content of the image are filtered out.
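A minimal sketch of this matching step, assuming the image labels are already available as a list of strings. Exact lowercase token overlap is used here as a stand-in for however the real pipeline maps text tokens to image content; the function name is hypothetical.

```python
def matches_image_labels(caption, labels):
    """Return True if at least one caption token also appears in an image label.

    Assumption: plain whitespace tokenization and exact lowercase matching.
    """
    caption_tokens = set(caption.lower().split())
    label_tokens = set()
    for label in labels:
        # Multi-word labels contribute each of their words.
        label_tokens.update(label.lower().split())
    return bool(caption_tokens & label_tokens)
```

A candidate such as "a dog running on the beach" with labels like "dog", "sand", "sea" would be kept, while generic Alt-text such as "click to enlarge" would be dropped.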

1.4. Text Transformation with Hypernymization

**Examples of Conceptual Captions as derived from their original Alt-text versions**

Some rules are shown above.
e.g.: named entities are identified, matched against the KG entries, and substituted with their hypernym, using the Google Knowledge Graph (KG) Search API.
Around 20% of samples are discarded during this transformation because it can leave sentences too short or inconsistent.
These remaining <image, caption> pairs contain around 16,000 entity types.
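To make the substitution idea concrete, here is a toy sketch. It replaces the Knowledge Graph lookup with a small hard-coded table (`HYPERNYMS`, hypothetical), and the `min_tokens` cutoff for "too short" captions is an assumed value, not one taken from the paper.

```python
import re

# Hypothetical entity -> hypernym table standing in for a KG lookup.
HYPERNYMS = {"harrison ford": "actor", "los angeles": "city"}


def hypernymize(caption, table=HYPERNYMS):
    """Replace each known named entity in the caption with its hypernym."""
    out = caption
    for entity, hypernym in table.items():
        pattern = r"\b" + re.escape(entity) + r"\b"
        out = re.sub(pattern, hypernym, out, flags=re.IGNORECASE)
    return out


def transform(caption, min_tokens=3):
    """Hypernymize a caption; return None if the result is too short to keep."""
    result = hypernymize(caption)
    if len(result.split()) < min_tokens:
        return None  # discarded, like samples dropped during this step
    return result
```

With this table, "Harrison Ford arrives in Los Angeles" becomes "actor arrives in city", while a caption consisting only of "Harrison Ford" is discarded as too short.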

1.5. Conceptual Captions Quality

**Human evaluation results on a sample from Conceptual Captions**

A random sample of 4K examples is extracted from the test split.
Out of 3 annotations, over 90% of the captions receive a majority (2+) of GOOD judgments. This indicates that the above pipeline produces high-quality image captions.
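The "majority of GOOD judgments" figure is a simple aggregate over per-caption votes. The sketch below shows that computation on made-up toy votes; the function name and the data are illustrative only and are not the paper's annotations.

```python
def majority_good_rate(annotations, threshold=2):
    """Fraction of captions with at least `threshold` GOOD votes.

    `annotations` is a list of per-caption vote tuples, where True means GOOD.
    """
    good = sum(1 for votes in annotations if sum(votes) >= threshold)
    return good / len(annotations)


# Toy data: four captions, three annotators each.
toy_votes = [
    (True, True, False),
    (True, True, True),
    (False, False, True),
    (True, False, True),
]
rate = majority_good_rate(toy_votes)
```

On this toy data, three of the four captions reach a 2+ majority.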

1.6. Dataset

**Statistics over Train/Validation/Test splits for Conceptual Captions**

The training set consists of slightly over 3.3M examples, while there are slightly over 28K examples in the validation set and 22.5K examples in the test set.

The size of the training set vocabulary (unique tokens) is 51,201.

2. Image Captioning Model

The baseline model has three components:
A deep CNN that takes a (preprocessed) image and outputs a vector of image embeddings X.
An Encoder module that takes the image embeddings and encodes them into a tensor H.
A Decoder model that generates an output z_t at each step t, conditioned on H as well as the decoder inputs Y_{1:t}.
Inception-ResNet-v2, as in Inception-v4, is used as the CNN component.
Two sequence models are tried: a modified Show and Tell (RNN-based) model and a Transformer-based model (T2T).

The goal is to produce baseline results.
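The data flow between the CNN, Encoder, and Decoder can be sketched as a short generation loop. This is a structural sketch only: `cnn`, `encoder`, and `decoder_step` are hypothetical callables supplied by the caller, and greedy step-by-step decoding is assumed for simplicity rather than taken from the paper.

```python
def caption_image(image, cnn, encoder, decoder_step, bos="<s>", eos="</s>", max_len=20):
    """Generate a caption by chaining CNN -> Encoder -> Decoder.

    All three model components are passed in as plain callables (assumed
    interfaces, for illustration only).
    """
    X = cnn(image)      # image embeddings X
    H = encoder(X)      # encoded representation H
    Y = [bos]           # decoder inputs so far
    for _ in range(max_len):
        z_t = decoder_step(H, Y)  # output at step t, conditioned on H and Y
        if z_t == eos:
            break
        Y.append(z_t)
    return Y[1:]        # drop the start token
```

Swapping the RNN-based decoder for the Transformer-based one only changes what is passed as `encoder` and `decoder_step`; the surrounding loop stays the same.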

3. Results

3.1. Qualitative Results

**Side by side comparison of model outputs under two training conditions**

e.g.: For the left-most image, COCO-trained models use “group of men” to refer to the people in the image; Conceptual-based models use the more appropriate and informative term “graduates”.