Brief Review — The “Something Something” Video Database for Learning and Evaluating Visual Common Sense
Moving Objects Dataset, Recognizing Gestures in the Context of Everyday Objects
The “Something Something” Video Database for Learning and Evaluating Visual Common Sense, Something Something, by TwentyBN
2017 ICCV, Over 600 Citations (Sik-Ho Tsang @ Medium)
Video Classification, Action Recognition, Video Captioning, Dataset
- This paper proposes the “something-something” database of video prediction tasks whose solutions require a common-sense understanding of the depicted situation. It contains more than 100,000 videos across 174 classes, which are defined as caption-templates.
- A news article (link shown at the bottom) about TwentyBN made me read this paper!
Outline
- Something-Something Dataset
- Results
- Something-Something v2
1. Something-Something Dataset
1.1. Crowdsourcing
In this dataset, crowd-workers are asked to record videos and to complete caption-templates by providing appropriate input-text for the placeholders (a minimal sketch of template filling is given after the list below).
- In this example, the text provided for placeholder “something” is “a shoe”.
- Using natural language instead of one-hot action labels provides a much weaker learning signal.
- The complexity and sophistication of the caption-templates is increased over time, as models succeed at making predictions, eventually leading to the “Something-Something v2” dataset later on.
- (The paper discusses the detailed rules and procedures for constructing the dataset. Please feel free to read the paper directly if interested.)
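To make the labeling scheme concrete, the following minimal Python sketch shows how a worker-provided object name fills a caption-template, while the template itself serves as the class label. The template string and helper function here are illustrative assumptions, not part of any official toolkit.

```python
# Minimal sketch: completing a caption-template with worker-provided text.
# The template (with placeholders) acts as the class label; the filled-in
# caption describes the recorded video. Names here are illustrative only.

def fill_template(template: str, objects: list) -> str:
    """Replace each '[something]' placeholder with the next object name."""
    caption = template
    for obj in objects:
        caption = caption.replace("[something]", obj, 1)
    return caption

template = "Putting [something] onto [something]"   # acts as one of the 174 labels
caption = fill_template(template, ["a shoe", "the table"])
print(caption)  # -> "Putting a shoe onto the table"
```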
1.2. Dataset Summary
- The version in the paper contains 108,499 videos across 174 labels, with durations ranging from 2 to 6 seconds.
- The dataset is split into train, validation and test sets in the ratio 8:1:1. All videos provided by the same worker occur in only one split (a worker-grouped split is sketched after this list).
- The dataset contains 23,137 distinct object names. The estimated number of actually distinct objects is at least a few thousand.
- A truncated distribution of the number of videos per class is shown above, with an average of roughly 620 videos per class, a minimum of 77 for “Poking a hole into [some substance]” and a maximum of 986 for “Holding [something]”.
- The above figure shows some of the example videos.
- The above shows a comparison with other video datasets.
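As a rough illustration of the worker-disjoint 8:1:1 split, here is a hedged Python sketch that deterministically assigns each worker, and hence all of that worker’s videos, to one split. The worker-id field and hashing scheme are assumptions for illustration, not the authors’ exact procedure.

```python
import hashlib

def split_for_worker(worker_id: str) -> str:
    """Deterministically assign a worker to train/validation/test (~8:1:1).

    All videos from the same worker land in the same split, mirroring the
    paper's constraint; the hashing scheme itself is only an illustration.
    """
    bucket = int(hashlib.md5(worker_id.encode()).hexdigest(), 16) % 10
    if bucket < 8:
        return "train"
    return "validation" if bucket == 8 else "test"

# Hypothetical video records: (video_id, worker_id)
videos = [("v0001", "worker_17"), ("v0002", "worker_17"), ("v0003", "worker_42")]
splits = {vid: split_for_worker(wid) for vid, wid in videos}
print(splits)  # both videos of worker_17 share the same split
```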
2. Results
2.1. Datasets
- 10-Class Dataset: 41 “easy” classes are first pre-selected. Then, 10 classes are generated by grouping together one or more of the original 41 classes with similar semantics (a remapping sketch is shown after this list). The total number of videos in this case is 28,198.
- 40-Class Dataset: Keeping the above 10 groups, 30 additional common classes are selected. The total number of samples in this case is 53,267.
- 174-Class Dataset: Full dataset.
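The grouping step boils down to a lookup table from fine-grained caption-templates to coarser group labels. The sketch below illustrates this; the group names and memberships are hypothetical examples, not the paper’s exact grouping.

```python
# Illustrative remapping of fine-grained classes to coarser groups.
# Group names and memberships below are hypothetical, not the paper's grouping.
GROUPS = {
    "Putting [something] somewhere": [
        "Putting [something] on a surface",
        "Putting [something] next to [something]",
    ],
    "Picking [something] up": [
        "Picking [something] up",
        "Lifting [something] up completely",
    ],
}

# Invert to a lookup table: fine-grained label -> group label.
FINE_TO_GROUP = {fine: group
                 for group, members in GROUPS.items()
                 for fine in members}

print(FINE_TO_GROUP["Lifting [something] up completely"])  # -> "Picking [something] up"
```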
2.2. Baseline Results
- 2D-CNN+Avg: VGG-16 is used to represent individual frames, and the obtained per-frame features are averaged to form the final video encoding (a sketch of this baseline is given after this list).
- Pre-2D-CNN+Avg: ImageNet-Pretrained 2D-CNN+Avg.
- Pre-2D-CNN+LSTM: ImageNet-Pretrained 2D-CNN but feeding the frame features into an LSTM. The last hidden state of the LSTM is used as the video encoding.
- 3D-CNN+Stack: C3D with 1024 units in the fully-connected layers and a clip size of 9 frames. Features are extracted from non-overlapping 9-frame clips and stacked to obtain a 4096-dimensional representation.
- Pre-3D-CNN+Avg: Sports-1M-Pretrained C3D with fine-tuning on this dataset.
- 2D+3D-CNN: A combination of the best performing 2D-CNN and 3D-CNN trained models, obtained by concatenating the two resulting video-encodings.
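The 2D-CNN+Avg idea can be sketched in PyTorch as below, assuming frames are already sampled and resized to 224×224. This is a minimal re-creation of the baseline’s structure (frame features, temporal averaging, linear classifier), not the authors’ exact training code.

```python
import torch
import torch.nn as nn
from torchvision import models

class FrameAvgClassifier(nn.Module):
    """2D-CNN+Avg baseline sketch: per-frame VGG-16 features averaged over time."""

    def __init__(self, num_classes: int = 174, pretrained: bool = True):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT if pretrained else None)
        # Reuse VGG-16 up to its penultimate layer to get a 4096-d frame feature.
        self.backbone = nn.Sequential(vgg.features, vgg.avgpool, nn.Flatten(),
                                      *list(vgg.classifier.children())[:-1])
        self.head = nn.Linear(4096, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, frames, 3, 224, 224)
        b, t = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1))   # (b*t, 4096) per-frame features
        feats = feats.view(b, t, -1).mean(dim=1)    # average over frames
        return self.head(feats)

model = FrameAvgClassifier()
logits = model(torch.randn(2, 8, 3, 224, 224))      # 2 videos, 8 frames each
print(logits.shape)                                  # torch.Size([2, 174])
```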
The difficulty of the task grows significantly as the number of classes is increased.
- On all 174 classes, using a 3D-CNN model pre-trained on the 40 selected classes, the obtained error rates are top-1: 88.5% and top-5: 70.3% (a top-k error sketch is given below).
- An informal human evaluation on the complete dataset (174 classes) is also performed with 10 individuals and 700 test samples in total, reaching about 60% accuracy.
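For reference, the top-1 / top-5 error rates quoted above can be computed from model logits with a small generic helper like the following (a sketch, not the paper’s evaluation code):

```python
import torch

def topk_error(logits: torch.Tensor, targets: torch.Tensor, k: int = 5) -> float:
    """Fraction of samples whose true label is NOT among the top-k predictions."""
    topk = logits.topk(k, dim=1).indices                 # (batch, k)
    correct = (topk == targets.unsqueeze(1)).any(dim=1)  # (batch,)
    return 1.0 - correct.float().mean().item()

logits = torch.randn(700, 174)              # e.g. 700 test samples, 174 classes
targets = torch.randint(0, 174, (700,))
print(topk_error(logits, targets, k=1), topk_error(logits, targets, k=5))
```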
3. Something-Something v2
News Article: https://seekingalpha.com/news/3716397-qualcomm-acquires-team-and-assets-from-ai-company-twenty-billion-neurons
- Later, in July 2021, the assets and the team of TwentyBN are acquired by Qualcomm.
- The larger “Something-Something v2”, with 220,847 videos, is now hosted on the Qualcomm website.
References
[2017 ICCV] [Something-Something]
The “Something Something” Video Database for Learning and Evaluating Visual Common Sense
[Dataset] [Moving Objects Dataset: Something-Something v. 2]
https://developer.qualcomm.com/software/ai-datasets/something-something
[News Article] [Acquired by Qualcomm]
https://seekingalpha.com/news/3716397-qualcomm-acquires-team-and-assets-from-ai-company-twenty-billion-neurons
1.12. Video Classification / Action Recognition
2014 [Deep Video] [Two-Stream ConvNet] 2015 [DevNet] [C3D] [LRCN] 2016 [TSN] 2017 [Temporal Modeling Approaches] [4 Temporal Modeling Approaches] [P3D] [I3D] [Something Something] 2018 [NL: Non-Local Neural Networks] [S3D, S3D-G] 2019 [VideoBERT]
5.4. Video Captioning
2015 [LRCN] 2017 [Something Something] 2019 [VideoBERT]