Review — Motion Masks: Learning Features by Watching Objects Move

Outperforms Context Prediction, Split-Brain Auto, Jigsaw, Context Encoders, BiGAN, Wang ICCV’15, etc.

Sik-Ho Tsang
5 min read · Jan 8, 2022
Motion helps us to correctly group pixels that move together (bottom left) and identify this group as a single object (bottom right).

Learning Features by Watching Objects Move
Motion Masks, by Facebook AI Research (FAIR) and University of California, Berkeley
2017 CVPR, Over 400 Citations (Sik-Ho Tsang @ Medium)
Unsupervised Learning, Representation Learning, Object Detection, Image Classification, Semantic Segmentation, Video Classification

  • Motion plays a key role in the development of the human visual system.
  • In this paper, unsupervised motion-based segmentation is applied to videos to obtain segments, which are treated as ‘pseudo ground truth’ to train a convolutional network to segment objects from a single frame.
  • When used for transfer learning on object detection, the proposed representation significantly outperforms previous unsupervised approaches.


  1. Approach Overview
  2. Unsupervised Motion Segmentation
  3. Training a ConvNet to Segment Objects
  4. Experimental Results

1. Approach Overview

Approach Overview
  1. The Yahoo Flickr Creative Commons 100 million (YFCC100m) [43] dataset is used.
  2. Motion cues are used to segment objects in videos without any supervision.
  3. Then, a ConvNet is trained to predict these segmentations from static frames, i.e. without any motion cues.
  4. The learned representation is transferred to other recognition tasks.

2. Unsupervised Motion Segmentation

Left: a video frame, Right: the output of uNLC
  • Although the NLC approach from Faktor and Irani [12] is unsupervised with respect to video segmentation, it relies on an edge detector that was trained on labeled edge images.
  • uNLC is proposed as a fully unsupervised variant: a nearest neighbor graph is computed over the superpixels in the video, using location and appearance (color histograms and HOG [6]) as features.
  • A frame is discarded if: (1) too many (>80%) or too few (<10%) of its pixels are marked as foreground; or (2) too many (>10%) of the pixels within 5% of the frame border are marked as foreground.
  • uNLC is run on videos from YFCC100m [43], which contains about 700,000 videos. It ends up with 205,000 videos after pruning.
  • 5–10 frames per shot are sampled from each video to create a dataset of 1.6M images, slightly more than the number of images in ImageNet.
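
The frame-pruning rule above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `keep_frame` and its parameter names are hypothetical, and it assumes that the ">10%" border rule is measured as the fraction of border-band pixels that are foreground (the text does not say which denominator is used).

```python
import numpy as np

def keep_frame(mask, min_fg=0.10, max_fg=0.80, border=0.05, max_border_fg=0.10):
    """Return True if a binary foreground mask passes both pruning rules.

    Thresholds follow the text; the border-band denominator is an assumption.
    """
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape

    # Rule 1: reject frames with too little or too much foreground.
    fg_frac = mask.mean()
    if fg_frac < min_fg or fg_frac > max_fg:
        return False

    # Rule 2: build a band covering `border` of each side of the frame.
    bh = max(1, int(round(border * h)))
    bw = max(1, int(round(border * w)))
    band = np.ones((h, w), dtype=bool)
    band[bh:h - bh, bw:w - bw] = False

    # Reject frames where too much of the band is foreground.
    border_fg_frac = mask[band].mean()
    return bool(border_fg_frac <= max_border_fg)
```

A centered object covering a moderate share of the frame passes, while an empty mask, a full mask, or an object hugging the border is rejected.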

3. Training a ConvNet to Segment Objects

Left: A video frame, Middle: the output of uNLC that we use to train the ConvNet, and Right: the output of the ConvNet
  • The ConvNet is trained to segment the object, i.e., assign each pixel a label of 1 if it lies on the object and 0 otherwise.
  • An object is sampled from an image and a box is cropped around the ground truth segment to make sure only one object exists. The box is jittered in position and scale.
  • AlexNet is used. It takes as input a w×w image and outputs an s×s mask. w=227, s=56.
  • The proposed network ends in a fully connected layer with s² outputs followed by an element-wise sigmoid. The resulting s² dimensional vector is reshaped into an s×s mask.
  • The ground truth mask is downsampled to s×s, and the cross-entropy losses are summed over the s² locations to train the network.
Examples of segmentations produced by our ConvNet on held out images
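
The training objective described in this section can be sketched with NumPy. This is only an illustration under stated assumptions: the helper names are hypothetical, nearest-neighbour sampling is an assumed downsampling method (the text does not specify one), and the `logits` vector stands in for the s² outputs of the final fully connected layer.

```python
import numpy as np

W, S = 227, 56  # input size w and mask size s, as given in the text

def downsample_mask(mask, s=S):
    """Nearest-neighbour downsample of a binary mask to s x s (assumed method)."""
    mask = np.asarray(mask, dtype=np.float64)
    rows = (np.arange(s) * mask.shape[0]) // s
    cols = (np.arange(s) * mask.shape[1]) // s
    return mask[np.ix_(rows, cols)]

def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def mask_loss(logits, target, eps=1e-7):
    """Binary cross entropy summed over the s*s mask locations.

    `logits` is the flat s^2 vector before the sigmoid; `target` is the
    s x s downsampled ground truth mask with values in {0, 1}.
    """
    s = target.shape[0]
    probs = sigmoid(np.asarray(logits, dtype=np.float64)).reshape(s, s)
    probs = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    per_pixel = -(target * np.log(probs) + (1.0 - target) * np.log(1.0 - probs))
    return float(per_pixel.sum())
```

With all-zero logits every location predicts 0.5, so the loss equals s² · ln 2 regardless of the target, which is a convenient sanity check for an untrained head.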

4. Experimental Results

4.1. Does training for segmentation yield good features?

The proposed representation (Supervised Masks) trained on manually-annotated segments from COCO (without class labels) compared to ImageNet pretraining and Context Prediction (unsupervised)
  • Object detection on PASCAL VOC 2007 using Fast R-CNN is evaluated.
  • The proposed supervised representation outperforms the unsupervised Context Prediction model across all scenarios by a large margin, which is to be expected.
  • Notably though, the proposed model maintains a fairly small gap with ImageNet pretraining.

Thus, given high-quality segments, the proposed method can learn a strong representation, which validates the authors’ hypothesis.

  • The model trained on Context Prediction degrades rapidly as more layers are frozen. This drop indicates that the higher layers of the model have become overly specific to the pretext task.

The proposed approach retains good performance even when most of the ConvNet is frozen, indicating that it has indeed learned high-level semantics in the higher layers.

4.2. Can the ConvNet learn from noisy masks?

From left to right, the original mask, dilated and eroded masks (boundary errors), and a truncated mask (truncation can be on any side).
  • The ground truth masks are degraded to measure the impact of segmentation quality on the learned representation.
VOC object detection accuracy using our supervised ConvNet as noise is introduced in mask boundaries, the masks are truncated, or the amount of data is reduced.
  • Surprisingly, the representation maintains quality even with large degradation.
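
The three kinds of mask degradation can be sketched as follows. This is a hedged illustration rather than the authors' procedure: the function names, the square structuring element, the radius `r`, and the truncation fraction `frac` are all assumptions made for the example.

```python
import numpy as np

def dilate(mask, r=1):
    """Binary dilation with a (2r+1) x (2r+1) square structuring element."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    padded = np.pad(mask, r, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    # OR together every shifted copy of the mask inside the window.
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def erode(mask, r=1):
    """Binary erosion, computed as the complement of the dilated background.

    Pixels outside the image are treated as foreground, so an object is
    not eroded where it touches the frame edge.
    """
    return ~dilate(~np.asarray(mask, dtype=bool), r)

def truncate(mask, frac=0.3, side="left"):
    """Remove a fraction of the object's bounding box from one side."""
    mask = np.asarray(mask, dtype=bool).copy()
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return mask  # nothing to truncate
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    cut_w = int(round(frac * (x1 - x0)))
    cut_h = int(round(frac * (y1 - y0)))
    if side == "left":
        mask[:, x0:x0 + cut_w] = False
    elif side == "right":
        mask[:, x1 - cut_w:x1] = False
    elif side == "top":
        mask[y0:y0 + cut_h, :] = False
    elif side == "bottom":
        mask[y1 - cut_h:y1, :] = False
    else:
        raise ValueError("side must be left, right, top, or bottom")
    return mask
```

Dilation and erosion simulate boundary errors in either direction, while truncation simulates a segmenter that misses part of the object.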

4.3. How much data do we need?

  • As shown at the right of the above figure, building a good representation requires significant amounts of training data.

4.4. SOTA Comparison

Object detection AP (%) on PASCAL VOC 2012 using Fast R-CNN with various pretrained ConvNets.
Results on object detection using Fast R-CNN
  • Two supervised baselines are included: the first is trained on ImageNet classification, and the second on manually-annotated segments (without class labels) from COCO.
  • The proposed representation learned from unsupervised motion segmentation performs on par with or better than prior work on unsupervised learning across all scenarios.
  • The proposed method is trained on YFCC100m video frames, whereas prior methods are trained on ImageNet images. To control for this difference in domain, Context Prediction is also retrained on the YFCC frames. The two Context Prediction variants perform similarly: 33.4% mean AP when trained on YFCC with conv5 and below frozen, compared to 33.2% for the ImageNet version. This suggests that the different image sources do not explain the gains of the proposed method.

4.5. Low-Shot Transfer

4.6. Other Downstream Tasks

Results on image (object) classification on VOC 2007, single-image action classification on Stanford 40 Actions, and semantic segmentation on VOC 2011

The proposed approach remains a strong performer as the ConvNet is progressively frozen.

  • When all layers up to conv5 are frozen, the proposed representation is better than other approaches on action classification, and second only to Colorization [51] on image classification on VOC 2007 and semantic segmentation on VOC 2011.


