Brief Review — Moments in Time Dataset: One Million Videos for Event Understanding

Moments in Time, One-Million 3-Second Videos

Sik-Ho Tsang
4 min read · Nov 1, 2022


Moments in Time Dataset: One Million Videos for Event Understanding, Moments in Time, by Massachusetts Institute of Technology (MIT), International Business Machines (IBM), Chinese University of Hong Kong (CUHK), Columbia University, and Boston University
2019 TPAMI, Over 300 Citations (Sik-Ho Tsang @ Medium)
Video Dataset, Video Classification, Action Recognition

  • The Moments in Time video dataset is proposed, consisting of 1M 3-second videos.


  1. Moments in Time Dataset
  2. Results

1. Moments in Time Dataset

1.1. Construction & Annotation

Sample Videos. Day-to-day events can happen to many types of actors, in different environments, and at different scales. The Moments in Time dataset has significant intra-class variation among the categories.
  • The Moments in Time Dataset consists of over one million 3-second videos corresponding to 339 different verbs.
  • The vocabulary is built by first forming a list of the 4,500 most commonly used verbs from VerbNet. After clustering and further processing, a set of 339 frequently used and semantically diverse verbs is selected, giving the dataset large coverage and diversity of labels.
User interface. An example of the binary annotation task for the action cooking.
  • Each video is downloaded, and a random 3-second segment is cut from it and paired with the corresponding verb. These verb-video tuples are then sent to Amazon Mechanical Turk (AMT) for annotation.
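
As a rough illustration of the random-cut step, the sketch below picks a 3-second window inside a longer video. The helper name `random_clip_window` and the uniformly drawn start time are assumptions made for illustration; the authors' actual cutting procedure may differ.

```python
import random


def random_clip_window(duration_s, clip_len_s=3.0, rng=None):
    """Return (start, end) in seconds of a random clip inside a video.

    Hypothetical helper: assumes the start time is drawn uniformly
    from all positions where a full clip still fits in the video.
    """
    if duration_s < clip_len_s:
        raise ValueError("video is shorter than the requested clip")
    if rng is None:
        rng = random.Random()
    start = rng.uniform(0.0, duration_s - clip_len_s)
    return start, start + clip_len_s


# Example: cut a 3-second window from a 42-second video (seeded for repeatability).
start, end = random_clip_window(42.0, rng=random.Random(0))
```

The returned window can then be paired with the verb used to find the video, forming the verb-video tuple that is sent for annotation.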

1.2. Statistics

Dataset Statistics
  • Left: The full distribution across all classes. The average number of labeled videos per class is 1,757, with a median of 2,775.
  • Middle: On the far left (larger human proportion), there are classes such as “typing”, “sketching”, and “repairing”, while on the far right (smaller human proportion) there are events such as “storming”, “roaring”, and “erupting”.
  • Right: There are sound-dependent classes. This figure shows the distribution of videos according to whether or not the event in the video can be seen.
Comparison to Datasets
  • Left: The total number of action labels used for training.
  • Middle: The average number of videos that belong to each class in the training set. This increase in scale for action recognition is beneficial for training large generalizable systems for machine learning.
  • Right: 100% of the scene categories in Places and 99.9% of the object categories in ImageNet were recognized in the proposed dataset.

2. Results

  • Models are trained on a training set of 802,264 videos, with between 500 and 5,000 videos per class for 339 different classes, and evaluated on a validation set of 33,900 videos, with 100 videos for each class.
  • Additionally, there is a withheld test set of 67,800 videos consisting of 200 videos per class.
  • Optical flow maps are generated as images.
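
The split sizes quoted above follow directly from the per-class counts. A quick sanity check, using only the numbers stated in this section (the variable names are illustrative):

```python
NUM_CLASSES = 339
VAL_PER_CLASS = 100
TEST_PER_CLASS = 200
TRAIN_TOTAL = 802_264

# Validation and test sets hold a fixed number of videos per class.
val_total = NUM_CLASSES * VAL_PER_CLASS
test_total = NUM_CLASSES * TEST_PER_CLASS

# The training total must lie within the stated 500 to 5,000 per-class range.
train_min = NUM_CLASSES * 500
train_max = NUM_CLASSES * 5_000
train_in_range = train_min <= TRAIN_TOTAL <= train_max

print(val_total, test_total, train_in_range)
```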
Classification Accuracy on Validation Set
  • Models from three different modalities (spatial, temporal, and auditory) are evaluated.
  • The best single model is I3D, with a Top-1 accuracy of 29.51% and a Top-5 accuracy of 56.06% while the Ensemble model (SVM) achieves a 57.67% Top-5 accuracy.
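
For readers unfamiliar with the metric, Top-k accuracy counts a prediction as correct when the true class is among the k highest-scoring classes. A minimal pure-Python sketch follows; the function name `top_k_accuracy` and the toy scores are made up for illustration and are not taken from the paper.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the top k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy scores for 3 clips over 4 classes (illustrative numbers only).
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1 has the highest score
    [0.5, 0.1, 0.3, 0.1],  # true class 2 has the second-highest score
    [0.4, 0.3, 0.2, 0.1],  # true class 3 has the lowest score
]
labels = [1, 2, 3]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

Here only the first clip counts toward Top-1, while the first two count toward Top-2, which is why Top-5 figures are always at least as high as Top-1 figures.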

The relatively low performance on Moments in Time suggests that there is still room to capitalize on temporal and auditory dynamics to better recognize actions.

Overview of top detections for several single stream models
  • The models can recognize moments well when the action is well-framed and close up.
Examples of missed detections
  • However, the models frequently misfire when the category is fine-grained or there is background clutter.
Predictions and Attention
  • CAM (Class Activation Mapping) highlights the most informative image regions relevant to the prediction.
Dataset transfer performance using ResNet50 I3D models pretrained on both Kinetics and Moments in Time
  • Pretraining on Moments in Time results in better performance when transferring to HMDB51, while pretraining on Kinetics gives stronger results when transferring to UCF101. This makes sense, as UCF101 and Kinetics share many classes.

On Something-Something, pretraining on Moments in Time improves performance. The 3-second length of the videos in the Moments in Time dataset does not hinder performance when applied to datasets with much longer videos.


[2019 TPAMI] [Moments in Time]
Moments in Time Dataset: One Million Videos for Event Understanding
