Review — Motion Masks: Learning Features by Watching Objects Move
Outperforms Context Prediction, Split-Brain Auto, Jigsaw, Context Encoders, BiGAN, Wang ICCV’15, etc.
Learning Features by Watching Objects Move
Motion Masks, by Facebook AI Research (FAIR) and the University of California, Berkeley
2017 CVPR, Over 400 Citations (Sik-Ho Tsang @ Medium)
Unsupervised Learning, Representation Learning, Object Detection, Image Classification, Semantic Segmentation, Video Classification
- Motion plays a key role in the development of the human visual system.
- In this paper, an unsupervised motion-based segmentation method is applied to videos to obtain segments, which are treated as ‘pseudo ground truth’ to train a convolutional network to segment objects from a single frame.
- When used for transfer learning on object detection, the proposed representation significantly outperforms previous unsupervised approaches.
Outline
- Approach Overview
- Unsupervised Motion Segmentation
- Training a ConvNet to Segment Objects
- Experimental Results
1. Approach Overview
- Yahoo Flickr Creative Commons 100 million (YFCC100m) [43] dataset is used.
- Motion cues are used to segment objects in videos without any supervision.
- Then, a ConvNet is trained to predict these segmentations from static frames, i.e. without any motion cues.
- The learned representation is transferred to other recognition tasks.
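Putting the three stages together, the overall pipeline can be sketched as below (all function names are hypothetical placeholders, not the authors' code):
```python
# Sketch of the three-stage pipeline; all names are hypothetical placeholders.

def stage1_pseudo_labels(video):
    """Unsupervised motion segmentation (uNLC) -> per-frame binary 'pseudo ground truth' masks."""
    raise NotImplementedError

def stage2_train_segmenter(frames, pseudo_masks):
    """Train a ConvNet to predict each mask from a single static frame (no motion cues)."""
    raise NotImplementedError

def stage3_transfer(pretrained_trunk, downstream_task):
    """Reuse the learned trunk for detection, classification, or semantic segmentation."""
    raise NotImplementedError
```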
2. Unsupervised Motion Segmentation
- While the NLC approach from Faktor and Irani [12] is unsupervised with respect to video segmentation, it utilizes an edge detector that was trained on labeled edge images.
- uNLC is proposed where a nearest neighbor graph is computed over the superpixels in the video using location and appearance (color histograms and HOG [6]) as features, so that it is totally unsupervised.
- Frames are discarded (see the sketch after this list): (1) frames with too many (>80%) or too few (<10%) pixels marked as foreground; (2) frames with too many pixels (>10%) within 5% of the frame border marked as foreground.
- uNLC is run on videos from YFCC100m [43], which contains about 700,000 videos. It ends up with 205,000 videos after pruning.
- 5–10 frames per shot from each video are sampled to create the dataset of 1.6M images, so we have slightly more frames than images in ImageNet.
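The frame-pruning heuristics above translate directly into code. A minimal sketch, assuming binary uNLC masks given as NumPy arrays (the border criterion here is one plausible reading of the quoted thresholds):
```python
import numpy as np

def keep_frame(mask, border_frac=0.05, fg_hi=0.80, fg_lo=0.10, border_fg_max=0.10):
    """Frame-pruning heuristic sketched from the thresholds quoted above.

    `mask` is a binary (H, W) array: 1 = foreground from uNLC, 0 = background.
    The border criterion is interpreted as: the fraction of all pixels that are
    foreground AND lie within 5% of the frame border (one plausible reading).
    """
    h, w = mask.shape
    total = mask.size

    fg_frac = mask.sum() / total
    if fg_frac > fg_hi or fg_frac < fg_lo:      # too much or too little foreground
        return False

    bh, bw = int(round(border_frac * h)), int(round(border_frac * w))
    border = np.ones((h, w), dtype=bool)
    border[bh:h - bh, bw:w - bw] = False        # 5%-wide band around the frame
    border_fg_frac = (mask.astype(bool) & border).sum() / total
    return border_fg_frac <= border_fg_max      # too much foreground touching the border
```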
3. Training a ConvNet to Segment Objects
- The ConvNet is trained to segment the object, i.e., assign each pixel a label of 1 if it lies on the object and 0 otherwise.
- An object is sampled from an image and a box is cropped around its ground truth segment so that the crop contains a single object; the box is jittered in position and scale.
- AlexNet is used. It takes as input a w×w image and outputs an s×s mask. w=227, s=56.
- The proposed network ends in a fully connected layer with s² outputs followed by an element-wise sigmoid. The resulting s² dimensional vector is reshaped into an s×s mask.
- The ground truth mask is downsampled to s×s and the cross-entropy loss is summed over the s² locations to train the network.
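A minimal PyTorch sketch of this mask-prediction setup follows. The intermediate fully connected layer before the s²-way output is an assumption (the text above only specifies the final s²-output layer), and BCEWithLogitsLoss simply folds the element-wise sigmoid into the summed cross-entropy:
```python
import torch
import torch.nn as nn
import torchvision

class MaskPredictor(nn.Module):
    """AlexNet trunk + fully connected mask head, as described above.
    Takes a w x w crop (w = 227) and predicts an s x s mask (s = 56).
    Sketch of the described architecture, not the authors' code."""

    def __init__(self, s=56):
        super().__init__()
        self.s = s
        alexnet = torchvision.models.alexnet(weights=None)     # trained from scratch
        self.features = alexnet.features                       # conv1 .. conv5
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),  # assumed intermediate fc
            nn.Linear(4096, s * s),                                # s^2 = 3136 outputs
        )

    def forward(self, x):                                      # x: (N, 3, 227, 227)
        logits = self.head(self.features(x))
        return logits.view(-1, self.s, self.s)                 # reshape to s x s mask

# Per-location cross entropy, summed over the s^2 locations (and over the batch here);
# BCEWithLogitsLoss applies the element-wise sigmoid internally.
criterion = nn.BCEWithLogitsLoss(reduction="sum")

def loss_fn(model, images, gt_masks):
    """gt_masks: (N, s, s) binary masks, already downsampled to s x s."""
    return criterion(model(images), gt_masks.float())
```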
4. Experimental Results
4.1. Does training for segmentation yield good features?
- Object detection on PASCAL VOC 2007 using Fast R-CNN is evaluated.
- The representation trained with supervised (ground-truth) segments outperforms the unsupervised Context Prediction model across all scenarios by a large margin, which is to be expected.
- Notably though, the proposed model maintains a fairly small gap with ImageNet pretraining.
Thus, given high-quality segments, the proposed method can learn a strong representation, which validates the authors’ hypothesis.
- The model trained on Context Prediction degrades rapidly as more layers are frozen. This drop indicates that the higher layers of the model have become overly specific to the pretext task.
The proposed approach retains good performance even when most of the ConvNet is frozen, indicating that it has indeed learned high-level semantics in the higher layers.
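These "layers frozen" transfer settings amount to disabling gradient updates for part of the trunk. A generic PyTorch sketch of the "conv5 and below frozen" configuration (not the authors' exact Fast R-CNN setup):
```python
import torchvision

def freeze_through_conv5(alexnet):
    """Freeze conv1..conv5 (the whole `features` trunk of torchvision's AlexNet),
    so only the layers above conv5 are updated on the target task.
    Generic sketch of the 'conv5 and below frozen' setting, not the authors' code."""
    for p in alexnet.features.parameters():
        p.requires_grad = False
    return alexnet

model = freeze_through_conv5(torchvision.models.alexnet(weights=None))
trainable_params = [p for p in model.parameters() if p.requires_grad]  # classifier layers only
```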
4.2. Can the ConvNet learn from noisy masks?
- The ground truth masks are degraded to measure the impact of segmentation quality on the learned representation.
- Surprisingly, the representation maintains quality even with large degradation.
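The exact degradation procedure is not spelled out in this summary. Purely as an illustration, boundary noise could be injected into a binary mask with random morphological operations (a hypothetical corruption, not necessarily the one used in the paper):
```python
import numpy as np
from scipy import ndimage

def degrade_mask(mask, max_iter=8, seed=0):
    """Hypothetical illustrative corruption: randomly dilate or erode a binary mask
    to distort its boundary. Not necessarily the degradation used in the paper."""
    rng = np.random.default_rng(seed)
    iters = int(rng.integers(1, max_iter + 1))
    if rng.random() < 0.5:
        return ndimage.binary_dilation(mask, iterations=iters)
    return ndimage.binary_erosion(mask, iterations=iters)
```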
4.3. How much data do we need?
- As shown at the right of the above figure, building a good representation requires significant amounts of training data.
4.4. SOTA Comparison
- Two supervised baselines are considered: the first is trained on ImageNet classification; the second is trained on manually-annotated segments (without class labels) from COCO.
- The proposed representation learned from unsupervised motion segmentation performs on par or better than prior work on unsupervised learning across all scenarios.
- Since the training data comes from different domains (YFCC100m video frames vs. ImageNet images), a control experiment trains the same model on both sources. The two variants perform similarly: 33.4% mean AP when trained on YFCC with conv5 and below frozen, compared to 33.2% for the ImageNet version. This confirms that the different image sources do not explain the proposed gains.
4.5. Low-Shot Transfer
- In this scenario it actually hurts to fine-tune the entire network, and the best setup is to leave some layers frozen.
- However, the proposed network obtains the best performance, outperforming Context Prediction, Split-Brain Auto, Jigsaw, Inpainting (Context Encoders), BiGAN, Tracking Video (Wang ICCV’15), etc.
4.6. Other Downstream Tasks
- When the ConvNet is progressively frozen, the proposed approach remains a strong performer.
- When all layers until conv5 are frozen, the proposed representation is better than other approaches on action classification and second only to Colorization [51] on image classification on VOC 2007 and semantic segmentation on VOC 2011.
Reference
[2017 CVPR] [Motion Masks]
Learning Features by Watching Objects Move
Self-Supervised Learning
2008–2010 [Stacked Denoising Autoencoders] 2014 [Exemplar-CNN] 2015 [Context Prediction] [Wang ICCV’15] 2016 [Context Encoders] [Colorization] [Jigsaw Puzzles] 2017 [L³-Net] [Split-Brain Auto] [Mean Teacher] [Motion Masks] 2018 [RotNet/Image Rotations] [DeepCluster] [CPC/CPCv1]