Brief Review — Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

MobileSAM, Distilling Segment Anything Model (SAM)

Sik-Ho Tsang
4 min read · Aug 4, 2024

Faster Segment Anything: Towards Lightweight SAM for Mobile Applications
MobileSAM, by Kyung Hee University
2023 arXiv v2, Over 150 Citations (Sik-Ho Tsang @ Medium)

Semantic Segmentation / Scene Parsing / Instance Segmentation / Panoptic Segmentation
2014 … 2021 [PVT, PVTv1] [SETR] [Trans10K-v2, Trans2Seg] [Copy-Paste] [HRNetV2, HRNetV2p] [Lite-HRNet] 2022 [PVTv2] [YOLACT++] 2023 [Segment Anything Model (SAM)]
==== My Other Paper Readings Are Also Over Here ====

  • The original Segment Anything Model (SAM) image encoder is heavy.
  • In this paper, the knowledge from the heavy image encoder (ViT-H in the original SAM) is distilled into a lightweight image encoder, forming MobileSAM, which is automatically compatible with the mask decoder in the original SAM.

Outline

  1. Background & Coupled Distillation
  2. MobileSAM
  3. Results

1. Background & Coupled Distillation

1.1. SAM

Original SAM
  • SAM consists of a ViT-based image encoder and a prompt-guided mask decoder. The image encoder takes the image as input and generates an embedding, which is then fed to the mask decoder.
  • The mask decoder generates a mask to cut out any object from the background based on a prompt like a point (or box).
  • As seen above, the image encoder is heavy.
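To make the encoder/decoder split concrete, below is a minimal usage sketch assuming the official segment-anything package and the publicly released ViT-H checkpoint; the placeholder image and click coordinates are illustrative.

```python
# Minimal usage sketch of the original SAM (assumes the official
# segment-anything package and the downloaded ViT-H checkpoint file).
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.zeros((768, 1024, 3), dtype=np.uint8)  # placeholder HxWx3 RGB image
predictor.set_image(image)  # runs the heavy ViT-H image encoder once per image

# The lightweight mask decoder can then be queried cheaply per prompt.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[512, 384]]),  # one (x, y) click, illustrative
    point_labels=np.array([1]),           # 1 = foreground point
)
```

Note the asymmetry: the expensive encoder runs once, while the decoder can answer many prompts on the same embedding. This is exactly why only the encoder needs to be replaced for a mobile-friendly model.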

The goal of this project is to generate a mobile-friendly SAM (MobileSAM).

1.2. Coupled Distillation

  • Fully-coupled distillation (Left): the most straightforward retraining approach, which transfers the knowledge from a ViT-H-based SAM to a SAM with a smaller image encoder.
  • Semi-coupled distillation (Right): Since the mask decoder in the original SAM is already lightweight, another straightforward way is to optimize the image encoder against a copied and frozen mask decoder.
  • Empirically, semi-coupled distillation is still found to be challenging to optimize, because the choice of prompt is random, which makes the mask decoder's output variable and thus increases the optimization difficulty.

Therefore, the authors propose to distill the small image encoder directly from the ViT-H in the original SAM without resorting to the combined decoder, which is termed decoupled distillation, as below.

2. MobileSAM

When the image encoding generated by the student image encoder is sufficiently close to that of the original teacher encoder, finetuning on the combined decoder in the second stage becomes optional.

That is, distillation is only performed on the image encoder.
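A minimal sketch of what this decoupled objective can look like in PyTorch is shown below, assuming a simple MSE loss between teacher and student embeddings; the training-loop details are illustrative rather than the authors' exact code.

```python
# Decoupled distillation sketch: the student encoder is trained to reproduce
# the frozen teacher's image embeddings; no mask decoder is involved.
import torch
import torch.nn.functional as F

def distill_step(teacher_encoder, student_encoder, images, optimizer):
    """One optimization step matching student embeddings to the teacher's."""
    with torch.no_grad():
        target = teacher_encoder(images)  # ViT-H embedding, e.g. (B, 256, 64, 64)
    pred = student_encoder(images)        # ViT-Tiny embedding, same shape
    loss = F.mse_loss(pred, target)       # assumed embedding-matching loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```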

  • mIoU is calculated between the two masks generated by the teacher SAM and student SAM on the same prompt point.
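As a reference, a small helper like the following could compute the per-mask IoU that is averaged into mIoU (an illustrative sketch, not the paper's evaluation code):

```python
# IoU between the binary masks that teacher SAM and student MobileSAM
# predict for the same prompt point; mIoU averages this over the eval set.
import numpy as np

def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """IoU of two boolean masks of the same shape."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter) / float(union) if union > 0 else 1.0
```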

Decoupled distillation takes less than 1% of the computational resources of coupled distillation, while achieving superior performance: an mIoU of 0.75 vs. 0.72 for the coupled one.

3. Results

3.1. SAM vs MobileSAM

  • ViT-Tiny is used as the target image encoder, with some modifications to adapt its output to the subsequent mask decoder.

MobileSAM has far fewer parameters than the original SAM and runs much faster.

  • Mask prediction is tried with two types of prompts: point and box.

MobileSAM makes a satisfactory mask prediction similar to that of the original SAM.
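Because MobileSAM reuses the original mask decoder, its interface mirrors SAM's. Below is a sketch of point- and box-prompted prediction, assuming the mobile_sam package from the MobileSAM repository and its ViT-Tiny checkpoint (file name illustrative):

```python
# Point- and box-prompted prediction with MobileSAM (sketch; assumes the
# mobile_sam package from the MobileSAM repo and its checkpoint file).
import numpy as np
from mobile_sam import sam_model_registry, SamPredictor

mobile_sam = sam_model_registry["vit_t"](checkpoint="mobile_sam.pt")
predictor = SamPredictor(mobile_sam)
predictor.set_image(image)  # `image`: HxWx3 uint8 RGB array, as in the earlier sketch

# Point prompt: one foreground click.
masks_pt, _, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),  # illustrative (x, y)
    point_labels=np.array([1]),
)

# Box prompt: (x0, y0, x1, y1).
masks_box, _, _ = predictor.predict(box=np.array([100, 100, 600, 500]))
```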

3.2. Ablation Study

Ablation Study

Increasing the batch size improves model performance.

Moreover, for a given batch size, performance also benefits from more update iterations, obtained by increasing the number of training epochs.

3.3. Segment Everything

Segment Anything vs Segment Everything
  • “Segment anything” segments an object based on a prompt.
  • By contrast, “segment everything” is in essence object proposal generation, for which the prompt is not necessary.
  • FastSAM consists of a YOLOv8-based detection branch and a YOLACT-based segmentation branch to perform a prompt-free mask proposal generation.
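For SAM-style models, segment everything is realized by prompting the model with a regular grid of points over the image. Below is a usage sketch of the automatic mask generator interface, assuming the MobileSAM repo mirrors segment-anything's SamAutomaticMaskGenerator:

```python
# "Segment everything" with MobileSAM: prompt-free mask proposals generated
# by internally prompting the model with a grid of points (interface assumed
# to mirror segment-anything's SamAutomaticMaskGenerator).
from mobile_sam import SamAutomaticMaskGenerator

mask_generator = SamAutomaticMaskGenerator(mobile_sam)  # model loaded as above
proposals = mask_generator.generate(image)              # list of dicts, one per mask
# Each dict carries fields such as "segmentation" (binary mask) and "area".
```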

For inference speed on a single GPU, FastSAM takes 40 ms to process an image while MobileSAM takes only 10 ms, which is 4 times faster.

mIoU for FastSAM is much smaller than that for MobileSAM, suggesting that the mask prediction of FastSAM is very different from that of the original SAM.

The segment-everything results of MobileSAM align surprisingly well with those of the original SAM. By contrast, the results of FastSAM are often less satisfactory.

  • For example, FastSAM often fails to predict some objects, like the roof in the first image.
