Brief Review — Moments in Time Dataset: One Million Videos for Event Understanding
Moments in Time, One-Million 3-Second Videos
Moments in Time Dataset: One Million Videos for Event Understanding,
Moments in Time, by Massachusetts Institute of Technology (MIT), International Business Machines (IBM), Chinese University of Hong Kong (CUHK), Columbia University, and Boston University
2019 TPAMI, Over 300 Citations (Sik-Ho Tsang @ Medium)
Video Dataset, Video Classification, Action Recognition
- The Moments in Time video dataset is proposed, which consists of 1M 3-second videos.
Outline
- Moments in Time Dataset
- Results
1. Moments in Time Dataset
1.1. Construction & Annotation
- The Moments in Time Dataset consists of over one million 3-second videos corresponding to 339 different verbs.
- The vocabulary is built by first forming a list of the 4,500 most commonly used verbs from VerbNet. After clustering and processing, a set of 339 frequently used and semantically diverse verbs is selected, giving the proposed dataset large coverage and diversity of labels.
- Each video is downloaded and randomly cut into a 3-second section paired with the corresponding verb. These verb-video tuples are then sent to Amazon Mechanical Turk (AMT) for annotation.
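As an illustration of the trimming step, here is a minimal sketch that cuts a random 3-second section from a downloaded video, assuming ffmpeg is installed; the authors' actual trimming tooling is not described at this level of detail, and the file names are made up:

```python
import random
import subprocess

def cut_random_3s_clip(video_path: str, duration_s: float, out_path: str) -> None:
    """Cut a random 3-second section out of a downloaded video with ffmpeg.
    Hypothetical helper, not the authors' actual pipeline."""
    start = random.uniform(0.0, max(duration_s - 3.0, 0.0))  # random start time
    subprocess.run(
        ["ffmpeg", "-y",
         "-ss", f"{start:.2f}",   # seek to the random start
         "-i", video_path,
         "-t", "3",               # keep exactly 3 seconds (re-encoded, frame-accurate)
         out_path],
        check=True,
    )

# e.g. a 42-second source video for the verb "raining"
cut_random_3s_clip("raining_full.mp4", duration_s=42.0, out_path="raining_3s.mp4")
```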
1.2. Statistics
- Left: The full distribution across all classes, where the average number of labeled videos per class is 1,757 with a median of 2,775 (a small per-class counting sketch follows this list).
- Middle: On the far left (larger human proportion), there are classes such as “typing”, “sketching”, and “repairing”, while on the far right (smaller human proportion) there are events such as “storming”, “roaring”, and “erupting”.
- Right: Some classes are sound-dependent; this panel shows the distribution of videos according to whether or not the event in the video can be seen.
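To make the per-class numbers above concrete, here is a minimal sketch, assuming the annotations are available as simple (video, label) pairs (the released dataset's actual CSV layout may differ), that computes the videos-per-class distribution together with its mean and median:

```python
from collections import Counter
from statistics import mean, median

# Hypothetical (video_id, verb_class) pairs; in practice these would be read
# from the dataset's annotation files.
labels = [
    ("vid_0001", "running"),
    ("vid_0002", "raining"),
    ("vid_0003", "running"),
    ("vid_0004", "typing"),
]

per_class = Counter(verb for _, verb in labels)   # videos per class
counts = list(per_class.values())

print("classes:", len(per_class))
print("mean videos/class:", mean(counts))
print("median videos/class:", median(counts))
```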
- Left: The total number of action labels used for training.
- Middle: The average number of videos that belong to each class in the training set. This increase in scale for action recognition is beneficial for training large generalizable systems for machine learning.
- Right: 100% of the scene categories in Places and 99.9% of the object categories in ImageNet were recognized in the proposed dataset.
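The scene/object coverage above comes from running scene and object recognition models over the videos. A rough sketch of such a coverage check, using an off-the-shelf ImageNet classifier from torchvision purely as a stand-in (the paper's exact models, sampled frames, and counting protocol are not reproduced here), might look like:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Record which of the 1,000 ImageNet categories ever appear as the top
# prediction over sampled frames. Frame paths and the top-1 criterion are
# assumptions for illustration.
model = models.resnet50(weights="IMAGENET1K_V1").eval()
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

frame_paths = ["frame_000.jpg", "frame_001.jpg"]   # hypothetical sampled frames
seen_categories = set()
with torch.no_grad():
    for path in frame_paths:
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        seen_categories.add(model(x).argmax(dim=1).item())

print(f"{len(seen_categories) / 1000:.1%} of ImageNet categories observed")
```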
2. Results
- The training set contains 802,264 videos, with between 500 and 5,000 videos per class for 339 different classes. Performance is evaluated on a validation set of 33,900 videos with 100 videos per class.
- Additionally, there is a withheld test set of 67,800 videos consisting of 200 videos per class.
- Models from three different modalities (spatial, temporal, and auditory) are evaluated.
- For the temporal modality, optical flow maps are generated and stored as images (a flow-extraction sketch follows this group of results).
- The best single model is I3D, with a Top-1 accuracy of 29.51% and a Top-5 accuracy of 56.06% while the Ensemble model (SVM) achieves a 57.67% Top-5 accuracy.
- The relatively low performance on Moments in Time suggests that there is still room to capitalize on temporal and auditory dynamics to better recognize actions.
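As an illustration of the temporal-stream input mentioned above, below is a rough sketch that converts optical flow into images with OpenCV's Farneback flow; the flow algorithm and the HSV color encoding are assumptions, not necessarily the paper's exact recipe:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("clip.mp4")            # hypothetical 3-second clip
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Map the 2-channel (dx, dy) field to an 8-bit image so it can be stored
    # and fed to a 2D CNN like an ordinary picture.
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros_like(frame)
    hsv[..., 0] = ang * 180 / np.pi / 2                          # direction -> hue
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)  # magnitude -> value
    cv2.imwrite(f"flow_{idx:03d}.jpg", cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR))
    prev_gray = gray
    idx += 1
```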
- The models can recognize moments well when the action is well-framed and close up.
- However, the model frequently misfires when the category is fine-grained or there is background clutter.
- Class Activation Mapping (CAM) highlights the most informative image regions relevant to the prediction.
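For reference, a minimal CAM sketch on a ResNet-style spatial network, using an ImageNet-pretrained torchvision model as a stand-in for the paper's trained video models:

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V1").eval()

feature_maps = {}
def hook(_, __, output):                       # grab the last conv feature maps
    feature_maps["layer4"] = output
model.layer4.register_forward_hook(hook)

x = torch.randn(1, 3, 224, 224)                # stand-in for a preprocessed frame
with torch.no_grad():
    logits = model(x)
cls = logits.argmax(dim=1).item()              # predicted class

fmap = feature_maps["layer4"][0]               # (2048, 7, 7) feature maps
weights = model.fc.weight[cls]                 # (2048,) classifier weights for that class
cam = torch.einsum("c,chw->hw", weights, fmap) # weighted sum over channels
cam = F.relu(cam)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
# Upsample to input size so it can be overlaid on the frame
cam = F.interpolate(cam[None, None], size=(224, 224),
                    mode="bilinear", align_corners=False)[0, 0]
```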
- Pretraining on Moments in Time results in better performance when transferring to HMDB51, while pretraining on Kinetics gives stronger results when transferring to UCF101. This makes sense, as UCF101 and Kinetics share many classes.
- On Something-Something, pretraining on Moments in Time also improves performance. The 3-second length of the videos in the Moments in Time dataset does not hinder performance when the models are transferred to datasets with much longer videos.
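Such transfer experiments typically keep the pretrained backbone and replace only the classification head before fine-tuning on the target dataset. A tiny sketch, using torchvision's r3d_18 with Kinetics weights purely as a stand-in for the paper's own pretrained networks:

```python
import torch.nn as nn
from torchvision.models.video import r3d_18

# Load a clip model pretrained on a large video dataset, then swap the head
# for the target dataset (e.g. 51 classes for HMDB51) and fine-tune.
model = r3d_18(weights="KINETICS400_V1")
model.fc = nn.Linear(model.fc.in_features, 51)
# ...then train as usual, optionally with a lower learning rate for the backbone.
```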
References
[2019 TPAMI] [Moments in Time]
Moments in Time Dataset: One Million Videos for Event Understanding
[Dataset] [Moments in Time]
http://moments.csail.mit.edu/
1.13. Video Classification / Action Recognition
2014 [Deep Video] [Two-Stream ConvNet] 2015 [DevNet] [C3D] [LRCN] 2016 [TSN] 2017 [Temporal Modeling Approaches] [4 Temporal Modeling Approaches] [P3D] [I3D] [Something Something] 2018 [NL: Non-Local Neural Networks] [S3D, S3D-G] 2019 [VideoBERT] [Moments in Time]