Brief Review — The “Something Something” Video Database for Learning and Evaluating Visual Common Sense

Moving Objects Dataset, Recognize Gestures in the Context of Everyday Objects

Sik-Ho Tsang
5 min readSep 11, 2022
Teams of TwentyBN (Image from here)

The “Something Something” Video Database for Learning and Evaluating Visual Common Sense, Something Something, by TwentyBN
2017 ICCV, Over 600 Citations (Sik-Ho Tsang @ Medium)
Video Classification, Action Recognition, Video Captioning, Dataset

  • This paper proposes a “something-something” database of video prediction tasks whose solutions require a common sense understanding of the depicted situation, which contains more than 100,000 videos across 174 classes, which are defined as caption-templates.
  • A news article (link shown at the bottom) about TwentyBN makes me read this paper!


  1. Something-Something Dataset
  2. Results
  3. Something-Something v2

1. Something-Something Dataset

1.1. Crowdsourcing

An example video from our database, captioned “Picking [something] up”

In this dataset, crowd-workers are asked to record videos and to complete caption-templates, by providing appropriate input-text for placeholders.

  • In this example, the text provided for placeholder “something” is “a shoe”.
  • Using natural language instead of an action provides a much weaker learning signal than a one-hot label.
  • The complexity and sophistication of caption-templates is increased over time, to the degree that models succeed at making predictions, finally having a “Something-Something v2” dataset later on.
  • (The paper talks about the detailed rules and procedures of constructing the datasets. Please feel free to read paper directly if interested.)

1.2. Dataset Summary

Dataset Summary
Video Durations
  • The version in the paper, contains 108,499 videos across 174 labels, with duration ranging from 2 to 6 seconds.
  • The dataset is split into train, validation and test-sets in the ratio of 8:1:1. All videos provided by the same worker occur only in one split.
Frequencies of occurrence of 15 most common objects
  • The dataset contains 23,137 distinct object names. The estimated number of actually distinct objects to be at least a few thousand.
Numbers of videos per class (truncated for better visualisation)
  • A truncated distribution of the number of videos per class is shown above, with an average of roughly 620 videos per class, a minimum of 77 for “Poking a hole into [some substance]” and a maximum of 986 for “Holding [something]”.
Example videos and corresponding descriptions. Object entries shown in italics
  • The above figure shows some of the example videos.
Dataset Comparison
  • The above shows the comparison of video datasets.

2. Results

2.1. Datasets

Subset of 10 hand-chosen “easy” classes
  • 10-Class Dataset: 41 “easy” classes are firstly pre-selected. Then, 10 classes are generated by grouping together one or more of the original 41 classes with similar semantics. The total number of videos in this case is 28198.
  • 40-Class Dataset: Keeping the above 10 groups, 30 additional common classes are selected. The total number of samples in this case is 53267.
  • 174-Class Dataset: Full dataset.

2.2. Baseline Results

Error rates on different subsets of the data
  • 2D-CNN+Avg: VGG-16 is used to represent individual frames and averaging the obtained features for each frame in the video to form the final encoding.
  • Pre-2D-CNN+Avg: ImageNet-Pretrained 2D-CNN+Avg.
  • Pre-2D-CNN+LSTM: ImageNet-Pretrained 2D-CNN but using LSTM. The last hidden state of the LSTM is used as the video encoding.
  • 3D-CNN+Stack: C3D with a size of 1024 units for the fully-connected layers and a clip size of 9 frames. Features are extracted from non-overlapping clips of size 9 frames. The obtained features are stacked to obtain a 4096 dimensional representation.
  • Pre-3D-CNN+Avg: Sport1M-Pretrained C3D with fine-tuning on this dataset.
  • 2D+3D-CNN: A combination of the best performing 2D-CNN and 3D-CNN trained models, obtained by concatenating the two resulting video-encodings.

The difficulty of the task grows significantly as the number of classes are increased.

  • On all 174 classes using a 3D-CNN model pre-trained on the 40 selected classes, and obtained error rates of top-1: 88.5%, top-5: 70.3%.
  • An informal human evaluation on the complete dataset (174 classes) with 10 individuals is also performed, with 700 test samples in total, about 60% accuracy.

3. Something-Something v2

Some Samples in the QualComm website
Something-Something v2 Summary in QualComm website

News Article:


[2017 ICCV] [Something-Something]
The “Something Something” Video Database for Learning and Evaluating Visual Common Sense

[Dataset] [Moving Objects Dataset: Something-Something v. 2]

[News Article] [Acquired by Qualcomm]

1.12. Video Classification / Action Recognition

2014 [Deep Video] [Two-Stream ConvNet] 2015 [DevNet] [C3D] [LRCN] 2016 [TSN] 2017 [Temporal Modeling Approaches] [4 Temporal Modeling Approaches] [P3D] [I3D] [Something Something] 2018 [NL: Non-Local Neural Networks] [S3D, S3D-G] 2019 [VideoBERT]

5.4. Video Captioning

2015 [LRCN] 2017 [Something Something] 2019 [VideoBERT]

My Other Previous Paper Readings



Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: for Twitter, LinkedIn, etc.