Brief Review — The “Something Something” Video Database for Learning and Evaluating Visual Common Sense

Moving Objects Dataset: Recognizing Gestures in the Context of Everyday Objects

--

Teams of TwentyBN (Image from here)

The “Something Something” Video Database for Learning and Evaluating Visual Common Sense, Something Something, by TwentyBN
2017 ICCV, Over 600 Citations (Sik-Ho Tsang @ Medium)
Video Classification, Action Recognition, Video Captioning, Dataset

  • This paper proposes the “something-something” database of video prediction tasks whose solutions require a common-sense understanding of the depicted situation. It contains more than 100,000 videos across 174 classes, which are defined as caption-templates.
  • A news article (link shown at the bottom) about TwentyBN made me read this paper!

Outline

  1. Something-Something Dataset
  2. Results
  3. Something-Something v2

1. Something-Something Dataset

1.1. Crowdsourcing

An example video from our database, captioned “Picking [something] up”

In this dataset, crowd-workers are asked to record videos and to complete caption-templates by providing appropriate input-text for the placeholders (a small sketch of this template filling is given after the list below).

  • In this example, the text provided for the placeholder “something” is “a shoe”.
  • Using natural language instead of a one-hot action label provides a much weaker learning signal.
  • The complexity and sophistication of the caption-templates is increased over time, as models succeed at making predictions, eventually leading to the “Something-Something v2” dataset later on.
  • (The paper describes the detailed rules and procedures for constructing the dataset. Please feel free to read the paper directly if interested.)
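
Below is a minimal Python sketch of the template-filling idea: a caption-template with “[something]” placeholders is combined with worker-provided object text to form the final caption. The helper function and its name are hypothetical, for illustration only; the paper does not specify an implementation.

# Minimal sketch (not the authors' code): filling a caption-template's
# "[something]" placeholders with worker-provided object text.
def fill_template(template, objects):
    """Replace each bracketed placeholder with the next provided object string."""
    caption = template
    for obj in objects:
        start = caption.find("[")
        end = caption.find("]", start)
        caption = caption[:start] + obj + caption[end + 1:]
    return caption

print(fill_template("Picking [something] up", ["a shoe"]))
# -> Picking a shoe up
print(fill_template("Putting [something] next to [something]", ["a cup", "a book"]))
# -> Putting a cup next to a book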

1.2. Dataset Summary

Dataset Summary
Video Durations
  • The version in the paper contains 108,499 videos across 174 labels, with durations ranging from 2 to 6 seconds.
  • The dataset is split into train, validation and test sets in the ratio 8:1:1. All videos provided by the same worker occur in only one split (a grouped-split sketch is given at the end of this subsection).
Frequencies of occurrence of 15 most common objects
  • The dataset contains 23,137 distinct object names. The estimated number of actually distinct objects is at least a few thousand.
Numbers of videos per class (truncated for better visualisation)
  • A truncated distribution of the number of videos per class is shown above, with an average of roughly 620 videos per class, a minimum of 77 for “Poking a hole into [some substance]” and a maximum of 986 for “Holding [something]”.
Example videos and corresponding descriptions. Object entries shown in italics
  • The above figure shows some of the example videos.
Dataset Comparison
  • The above table compares Something-Something with other video datasets.
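
Since all videos from one worker must end up in a single split, the 8:1:1 partition is effectively a grouped split over worker IDs. The following is a minimal sketch of such a split, assuming each video record carries a 'worker_id' field; the field name and procedure are illustrative, not the authors' actual pipeline.

import random
from collections import defaultdict

def grouped_split(videos, train_frac=0.8, val_frac=0.1, seed=0):
    """Split video records ~8:1:1 while keeping each worker's videos together.
    `videos` is a list of dicts with a 'worker_id' key (illustrative schema)."""
    by_worker = defaultdict(list)
    for v in videos:
        by_worker[v["worker_id"]].append(v)

    workers = list(by_worker)
    random.Random(seed).shuffle(workers)

    splits = {"train": [], "validation": [], "test": []}
    total = len(videos)
    for w in workers:
        # Fill train first, then validation; the remainder goes to test.
        if len(splits["train"]) < train_frac * total:
            splits["train"].extend(by_worker[w])
        elif len(splits["validation"]) < val_frac * total:
            splits["validation"].extend(by_worker[w])
        else:
            splits["test"].extend(by_worker[w])
    return splits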

2. Results

2.1. Datasets

Subset of 10 hand-chosen “easy” classes
  • 10-Class Dataset: 41 “easy” classes are first pre-selected. Then, 10 classes are generated by grouping together one or more of the original 41 classes with similar semantics (a toy label-mapping sketch follows this list). The total number of videos in this case is 28,198.
  • 40-Class Dataset: Keeping the above 10 groups, 30 additional common classes are selected. The total number of samples in this case is 53,267.
  • 174-Class Dataset: Full dataset.
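
The 10-class subset boils down to a many-to-one mapping from the original caption-templates to coarser group labels. A toy illustration is below; apart from “Picking [something] up”, the template strings and group names are hypothetical placeholders, not the paper's exact grouping.

# Toy many-to-one label mapping for building a grouped subset.
# Only "Picking [something] up" is a confirmed template from the paper;
# the other entries and group names are hypothetical examples.
GROUP_OF = {
    "Picking [something] up": "picking up",
    "Lifting [something]": "picking up",
    "Putting [something] down": "putting down",
    "Dropping [something]": "dropping",
}

def to_group_label(template):
    return GROUP_OF[template]

print(to_group_label("Picking [something] up"))  # -> picking up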

2.2. Baseline Results

Error rates on different subsets of the data
  • 2D-CNN+Avg: VGG-16 is used to represent individual frames, and the obtained per-frame features are averaged to form the final video encoding (a rough sketch follows this list).
  • Pre-2D-CNN+Avg: ImageNet-Pretrained 2D-CNN+Avg.
  • Pre-2D-CNN+LSTM: ImageNet-Pretrained 2D-CNN but using LSTM. The last hidden state of the LSTM is used as the video encoding.
  • 3D-CNN+Stack: C3D with 1024 units in the fully-connected layers. Features are extracted from non-overlapping clips of 9 frames each, and the obtained features are stacked to form a 4096-dimensional representation.
  • Pre-3D-CNN+Avg: Sport1M-Pretrained C3D with fine-tuning on this dataset.
  • 2D+3D-CNN: A combination of the best performing 2D-CNN and 3D-CNN trained models, obtained by concatenating the two resulting video-encodings.
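
As a rough illustration of the 2D-CNN+Avg baseline (per-frame VGG-16 features averaged over time into a single video encoding), here is a short PyTorch sketch. It is not the authors' implementation: the use of globally pooled conv features, the 512-dimensional encoding and the linear classifier head are simplifying assumptions (the paper's baseline may instead use VGG-16 fully-connected features).

import torch
import torch.nn as nn
from torchvision import models

class Frames2DAvg(nn.Module):
    """Sketch of 2D-CNN+Avg: encode each frame with VGG-16, average over time."""
    def __init__(self, num_classes=174, pretrained=False):
        super().__init__()
        weights = models.VGG16_Weights.IMAGENET1K_V1 if pretrained else None
        vgg = models.vgg16(weights=weights)       # "Pre-" variant uses ImageNet weights
        self.backbone = vgg.features              # convolutional feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)       # global pooling -> 512-d per frame
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, clip):                      # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        x = clip.flatten(0, 1)                    # treat frames as a batch: (B*T, 3, H, W)
        feats = self.pool(self.backbone(x)).flatten(1)  # (B*T, 512)
        video_enc = feats.view(b, t, -1).mean(1)        # temporal average: (B, 512)
        return self.classifier(video_enc)

logits = Frames2DAvg()(torch.randn(2, 8, 3, 224, 224))  # -> shape (2, 174)

For the Pre-2D-CNN+LSTM variant, the temporal average would be replaced by an LSTM running over the per-frame features, with the last hidden state used as the video encoding.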

The difficulty of the task grows significantly as the number of classes is increased.

  • On all 174 classes, using a 3D-CNN model pre-trained on the 40 selected classes, the obtained error rates are top-1: 88.5% and top-5: 70.3%.
  • An informal human evaluation on the complete dataset (174 classes) is also performed with 10 individuals and 700 test samples in total, yielding about 60% accuracy.
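
For reference, the top-1/top-5 error rates quoted above can be computed from model scores as in the generic sketch below (not tied to the paper's code).

import torch

def topk_error(logits, targets, k=5):
    """Fraction of samples whose ground-truth class is not among the top-k scores."""
    topk = logits.topk(k, dim=1).indices                 # (N, k) predicted class indices
    hit = (topk == targets.unsqueeze(1)).any(dim=1)      # (N,) true class in top-k?
    return 1.0 - hit.float().mean().item()

# top-1 error: topk_error(logits, targets, k=1); top-5: topk_error(logits, targets, k=5)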

3. Something-Something v2

Some samples shown on the Qualcomm website
Something-Something v2 summary on the Qualcomm website

News Article: https://seekingalpha.com/news/3716397-qualcomm-acquires-team-and-assets-from-ai-company-twenty-billion-neurons

References

[2017 ICCV] [Something-Something]
The “Something Something” Video Database for Learning and Evaluating Visual Common Sense

[Dataset] [Moving Objects Dataset: Something-Something v. 2]
https://developer.qualcomm.com/software/ai-datasets/something-something

[News Article] [Acquired by Qualcomm]
https://seekingalpha.com/news/3716397-qualcomm-acquires-team-and-assets-from-ai-company-twenty-billion-neurons

1.12. Video Classification / Action Recognition

2014 [Deep Video] [Two-Stream ConvNet] 2015 [DevNet] [C3D] [LRCN] 2016 [TSN] 2017 [Temporal Modeling Approaches] [4 Temporal Modeling Approaches] [P3D] [I3D] [Something Something] 2018 [NL: Non-Local Neural Networks] [S3D, S3D-G] 2019 [VideoBERT]

5.4. Video Captioning

2015 [LRCN] 2017 [Something Something] 2019 [VideoBERT]

My Other Previous Paper Readings

--

Written by Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.
