Review — Segment Anything Model (SAM)
A Promptable Foundation Model for Segmentation
Segment Anything,
Segment Anything Model (SAM), by Meta AI Research, FAIR
2023 arXiv v1, Over 200 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation / Scene Parsing / Instance Segmentation / Panoptic Segmentation
2021 [PVT, PVTv1] [SETR] [Trans10K-v2, Trans2Seg] [Copy-Paste] [HRNetV2, HRNetV2p] [Lite-HRNet] 2022 [PVTv2] [YOLACT++]
==== My Other Paper Readings Are Also Over Here ====
- Segment Anything Model (SAM) was recently introduced by Meta AI and has quickly become very popular, already gathering over 200 citations to date.
- SA-1B is constructed, which is the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy-respecting images.
- SAM is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks.
- If interested, please visit their website: https://segment-anything.com/
Outline
- Segment Anything Task
- Segment Anything Model (SAM)
- Segment Anything Data Engine
- Segment Anything Dataset: SA-1B
- Zero-Shot Transfer Results
1. Segment Anything Task
- In NLP, large language models (LLMs) are pretrained and can then be transferred zero-shot to new NLP tasks.
- In computer vision, e.g. in CLIP and ALIGN, engineered text prompts enable zero-shot generalization to novel visual concepts.
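For intuition, here is a minimal sketch of CLIP-style zero-shot classification with engineered text prompts, using the public clip package; the prompt template, class names, and image path are illustrative:

```python
import torch
import clip
from PIL import Image

# Load a public CLIP model (weights and API from the openai/CLIP repo).
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Engineered text prompts turn class names into natural-language queries.
class_names = ["dog", "cat", "bicycle"]  # illustrative labels
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    # Cosine similarity between the image and each prompt, softmaxed into class scores.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```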
In this paper, the authors seek to develop a promptable segmentation model and pre-train it on a broad dataset so that the model can segment unseen or arbitrary objects.
2. Segment Anything Model (SAM)
- SAM has three components, as illustrated in Fig. 4: an image encoder, a flexible prompt encoder, and a fast mask decoder.
2.1. Image Encoder
An MAE pre-trained Vision Transformer (ViT) is used to obtain the image embedding.
2.2. Prompt Encoder
- Two sets of prompts are considered: sparse (points, boxes, text) and dense (masks).
Points and boxes are represented by positional encodings summed with learned embeddings for each prompt type. Free-form text is used with an off-the-shelf text encoder from CLIP.
Dense prompts (i.e., masks) are embedded using convolutions and summed element-wise with the image embedding.
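A minimal PyTorch sketch of this idea, assuming a random-Fourier positional encoding and illustrative module names and layer sizes (the released code follows the same scheme but differs in detail):

```python
import torch
import torch.nn as nn

class ToyPromptEncoder(nn.Module):
    """Hypothetical sketch: sparse prompts -> tokens, dense masks -> image-sized embedding."""
    def __init__(self, embed_dim=256):
        super().__init__()
        # Random Fourier features for (x, y) positions (SAM-style positional encoding).
        self.register_buffer("pe_matrix", torch.randn(2, embed_dim // 2))
        # One learned embedding per prompt type
        # (e.g., foreground point, background point, box corner A, box corner B).
        self.type_embed = nn.Embedding(4, embed_dim)
        # Dense mask prompt: downsample with convs to the image-embedding resolution.
        self.mask_downscale = nn.Sequential(
            nn.Conv2d(1, 4, kernel_size=2, stride=2), nn.GELU(),
            nn.Conv2d(4, 16, kernel_size=2, stride=2), nn.GELU(),
            nn.Conv2d(16, embed_dim, kernel_size=1),
        )

    def positional_encoding(self, coords):          # coords in [0, 1], shape (N, 2)
        proj = 2 * torch.pi * coords @ self.pe_matrix
        return torch.cat([proj.sin(), proj.cos()], dim=-1)   # (N, embed_dim)

    def forward(self, point_coords, point_types, mask_prompt):
        # Sparse tokens: positional encoding summed with a learned per-type embedding.
        sparse = self.positional_encoding(point_coords) + self.type_embed(point_types)
        # Dense prompt: embedded with convs, later summed element-wise with the image embedding.
        dense = self.mask_downscale(mask_prompt)     # (B, embed_dim, H/4, W/4)
        return sparse, dense
```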
2.3. Mask Decoder
- The mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask.
This design, inspired by DETR and [20], employs a modification of a Transformer decoder block (Left) followed by a dynamic mask prediction head (Right).
- Transformer Decoder Block (Left): The modified decoder block uses prompt self-attention and cross-attention in two directions (prompt-to-image embedding and vice-versa) to update all embeddings.
- Dynamic Mask Prediction Head (Right): After running two blocks, the image embedding is upsampled and an MLP is used to map the output token to a dynamic linear classifier, which then computes the mask foreground probability at each image location.
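A conceptual sketch of the dynamic mask prediction head alone, with assumed names and layer sizes: the decoder-updated output token is mapped by an MLP to the weights of a per-pixel linear classifier applied to the upsampled image embedding.

```python
import torch
import torch.nn as nn

class ToyDynamicMaskHead(nn.Module):
    """Sketch of dynamic prediction: output token -> linear classifier over pixels."""
    def __init__(self, embed_dim=256):
        super().__init__()
        # Upsample the image embedding (e.g., 64x64 -> 256x256) with transposed convs.
        self.upscale = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, embed_dim // 4, kernel_size=2, stride=2), nn.GELU(),
            nn.ConvTranspose2d(embed_dim // 4, embed_dim // 8, kernel_size=2, stride=2),
        )
        # Small MLP that turns the decoder-updated output token into classifier weights.
        self.token_to_weights = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim // 8),
        )

    def forward(self, image_embedding, output_token):
        # image_embedding: (B, C, H, W) after the two decoder blocks
        # output_token:    (B, C) mask token updated by self-/cross-attention
        feats = self.upscale(image_embedding)             # (B, C/8, 4H, 4W)
        weights = self.token_to_weights(output_token)     # (B, C/8)
        # Dynamic linear classifier: per-pixel dot product -> mask logits.
        mask_logits = torch.einsum("bchw,bc->bhw", feats, weights)
        return mask_logits   # foreground probability after sigmoid
```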
2.4. Resolving Ambiguity
Given a single ambiguous prompt (green dot), the model predicts multiple output masks, as shown above.
- 3 mask outputs are sufficient to address most common cases (nested masks are often at most three deep: whole, part, and subpart).
During training, only the minimum loss over masks is backpropagated.
To rank masks, the model predicts a confidence score (i.e., estimated IoU) for each mask.
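A minimal sketch of this training rule, assuming 3 predicted masks per prompt, a placeholder per-mask segmentation loss, and an IoU head regressed with MSE toward the actual IoU of each prediction:

```python
import torch
import torch.nn.functional as F

def ambiguity_aware_loss(pred_masks, pred_ious, gt_mask, mask_loss_fn):
    """pred_masks: (B, 3, H, W) logits, pred_ious: (B, 3), gt_mask: (B, H, W) in {0, 1}.
    mask_loss_fn is a placeholder segmentation loss returning one value per candidate mask."""
    B, K, H, W = pred_masks.shape
    gt = gt_mask.unsqueeze(1).expand(-1, K, -1, -1)

    # Per-mask segmentation loss (e.g., focal + dice), shape (B, 3).
    per_mask_loss = mask_loss_fn(pred_masks, gt)

    # Only the best (minimum-loss) candidate contributes to the segmentation loss.
    min_loss, _ = per_mask_loss.min(dim=1)

    # The IoU head is trained to predict the actual IoU of every candidate mask.
    with torch.no_grad():
        bin_pred = (pred_masks.sigmoid() > 0.5).float()
        inter = (bin_pred * gt).flatten(2).sum(-1)
        union = (bin_pred + gt - bin_pred * gt).flatten(2).sum(-1).clamp(min=1)
        actual_iou = inter / union                        # (B, 3)
    iou_loss = F.mse_loss(pred_ious, actual_iou)

    return min_loss.mean() + iou_loss
```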
2.5. Efficiency
- Given a precomputed image embedding, the prompt encoder and mask decoder run in a web browser, on CPU, in ∼50 ms, which enables seamless, real-time interactive prompting.
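The released segment_anything package reflects this split: the heavy image embedding is computed once, after which point or box prompts re-run only the light prompt encoder and mask decoder. A short usage sketch (the checkpoint path and image file are placeholders):

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (path is a placeholder) and build the predictor.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)           # runs the heavy image encoder once

# Afterwards, each prompt only touches the light prompt encoder + mask decoder,
# so interactive clicks are cheap.
point = np.array([[500, 375]])       # (x, y) click, illustrative
label = np.array([1])                # 1 = foreground, 0 = background
masks, scores, logits = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,           # return 3 masks to handle ambiguity
)
best = masks[np.argmax(scores)]      # pick the highest predicted-IoU mask
```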
2.6. Loss
- A linear combination of focal loss and dice loss is used to supervise mask prediction (a sketch follows below).
- For the promptable segmentation task, training uses a mixture of geometric prompts.
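A minimal sketch of the focal + dice combination on mask logits; the 20:1 focal-to-dice weighting follows the paper's training recipe, while the implementation details below are assumptions:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss on mask logits; targets in {0, 1}, same shape as logits."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                       # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

def dice_loss(logits, targets, eps=1.0):
    probs = logits.sigmoid().flatten(1)
    targets = targets.flatten(1)
    inter = (probs * targets).sum(-1)
    return (1 - (2 * inter + eps) / (probs.sum(-1) + targets.sum(-1) + eps)).mean()

def mask_loss(logits, targets, w_focal=20.0, w_dice=1.0):
    # Linear combination; the paper reports a 20:1 focal-to-dice weighting.
    return w_focal * focal_loss(logits, targets) + w_dice * dice_loss(logits, targets)
```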
3. Segment Anything Data Engine
- A data engine is used to enable the collection of the 1.1B-mask dataset, SA-1B.
The data engine has three stages: (1) a model-assisted manual annotation stage, (2) a semi-automatic stage with a mix of automatically predicted masks and model-assisted annotation, and (3) a fully automatic stage.
3.1. Assisted-Manual Stage (First Stage)
A team of professional annotators labeled masks by clicking foreground/background object points using a browser-based interactive segmentation tool powered by SAM. Annotators labeled objects they could name or describe.
- At the start of this stage, SAM was trained using common public segmentation datasets.
- As more masks were collected, the image encoder was scaled from ViT-B to ViT-H, along with other architectural and training changes.
In total, the model was retrained 6 times. Overall, 4.3M masks from 120k images were collected in this stage.
3.2. Semi-Automatic Stage (Second Stage)
- To focus annotators on less prominent objects for diversity, SAM first automatically detected confident masks, then presented annotators with images prefilled with these masks and asked them to annotate any additional unannotated objects.
- During this stage, an additional 5.9M masks in 180k images (for a total of 10.2M masks) are collected.
3.3. Fully Automatic Stage (Third Stage)
The model is prompted with a 32×32 regular grid of points and for each point a set of masks is predicted that may correspond to valid objects.
With the ambiguity-aware model, if a point lies on a part or subpart, SAM will return the subpart, part, and whole object.
- The IoU prediction module of the model is used to select confident masks; in addition, only stable masks (masks that persist when the probability threshold is perturbed) are kept.
- After selecting the confident and stable masks, non-maximal suppression (NMS) is applied to filter duplicates. To further improve the quality of smaller masks, multiple overlapping zoomed-in image crops are also processed.
Finally, SA-1B is constructed, with 11M images and a total of 1.1B high-quality masks obtained.
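The released segment_anything package exposes an automatic mask generator whose knobs mirror this stage (grid density, predicted-IoU and stability thresholds, NMS, zoomed-in crops). A usage sketch, with threshold values that are indicative rather than the exact data-engine settings:

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder path

generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,           # 32x32 regular grid of point prompts
    pred_iou_thresh=0.88,         # keep only masks the IoU head is confident about
    stability_score_thresh=0.95,  # keep only masks stable under threshold perturbation
    box_nms_thresh=0.7,           # NMS to remove duplicate masks
    crop_n_layers=1,              # also process zoomed-in crops for small masks
)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = generator.generate(image)  # list of dicts with 'segmentation', 'predicted_iou', ...
```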
4. Segment Anything Dataset: SA-1B
- The above shows some example images.
Mask center statistics are computed. SA-1B has greater coverage of image corners compared to LVIS v1 [44] and ADE20K.
Left: SA-1B has more masks per image.
Middle: SA-1B tends to include a greater percentage of small and medium relative-size masks.
Right: The concavity distribution (Shape complexity) of SA-1B masks is broadly similar to that of other datasets.
In SA-1B, all regions, including Africa, have at least 28 million masks, 10× more than the total number of masks of any previous dataset.
Finally, it is observed that the average number of masks per image (not shown) is fairly consistent across region and income.
SAM shows little performance discrepancy across perceived gender presentation, age group, and skin tone.
5. Zero-Shot Transfer Results
5.1. 23 Downstream Tasks
- (a) Compared with RITM, SAM yields higher results on 16 of the 23 datasets.
- “oracle” result is also presented, in which the most relevant of SAM’s 3 masks is selected by comparing them to the ground truth, rather than selecting the most confident mask. This reveals the impact of ambiguity on automatic evaluation.
In particular, with the oracle to perform ambiguity resolution, SAM outperforms RITM on all datasets.
- (b) Although SAM is worse on automatic metrics, it receives consistently higher ratings in the human study.
5.2. Zero-Shot Edge Prediction
- SAM is prompted with a 16×16 regular grid of foreground points, resulting in 768 predicted masks (3 per point). Redundant masks are removed by NMS. Then, edge maps are computed using Sobel filtering of the mask probability maps (a simplified sketch follows below).
Though SAM was not trained for edge detection, it produces reasonable edge maps, with high performance.
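A simplified sketch of this pipeline, approximating it with the binary masks returned by the public automatic mask generator (the paper applies Sobel filtering to unthresholded probability maps and uses standard edge-NMS postprocessing):

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder path
generator = SamAutomaticMaskGenerator(sam, points_per_side=16)        # 16x16 point grid

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = generator.generate(image)

edge_map = np.zeros(image.shape[:2], dtype=np.float32)
for m in masks:
    seg = m["segmentation"].astype(np.float32)
    # Sobel magnitude of each mask highlights its boundary.
    gx = cv2.Sobel(seg, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(seg, cv2.CV_32F, 0, 1, ksize=3)
    edge_map = np.maximum(edge_map, np.hypot(gx, gy))

edge_map = np.clip(edge_map / (edge_map.max() + 1e-6), 0, 1)
```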
5.3. Zero-Shot Object Proposal Generation
Unsurprisingly, ViTDet-H performs best. Still, SAM does remarkably well on several metrics in the zero-shot setting.
5.4. Zero-Shot Instance Segmentation
SAM is reasonably close, though certainly behind ViTDet.
By visualizing outputs, it is observed that SAM masks are often qualitatively better than those of ViTDet. SAM consistently outperforms ViTDet in the human study.
5.5. Zero-Shot Text-to-Mask
- CLIP image embeddings are extracted first. Then, during training, SAM is prompted with these extracted CLIP image embeddings.
- Because CLIP's image embeddings are trained to align with its text embeddings, text embeddings can be used at inference time, i.e., the resulting text embedding is used as a prompt to SAM.
SAM can segment objects based on simple text prompts like “a wheel” as well as phrases like “beaver tooth grille”.
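A conceptual sketch of this embedding swap; the text-prompt capability is a proof of concept in the paper and is not part of the released checkpoints, so the final prediction call below is hypothetical:

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-L/14", device=device)

# Training time (conceptual): the CLIP *image* embedding of the masked object
# is fed to SAM as an extra prompt token.
# Inference time: because CLIP aligns image and text embeddings, a *text*
# embedding can be dropped into the same slot.
with torch.no_grad():
    tokens = clip.tokenize(["a wheel"]).to(device)
    text_embed = clip_model.encode_text(tokens)
    text_embed = text_embed / text_embed.norm(dim=-1, keepdim=True)

# Hypothetical call: the released SAM checkpoints do not expose a text prompt,
# so `predict_with_embedding` stands in for a text-capable variant of the model.
# masks = text_capable_sam.predict_with_embedding(image, prompt_embedding=text_embed)
```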
- Left: SAM’s performance when trained on cumulative data from the data engine stages. Each stage increases mIoU. When training with all three stages, the automatic masks vastly outnumber the manual and semi-automatic masks.
- Middle: When subsampling to 1M images, about 10% of the full dataset, the results are comparable to using the full dataset. This data regime, which still includes approximately 100M masks, may be a practical setting for many use cases.
- Right: ViT-B, ViT-L, and ViT-H are used as image encoders. ViT-H improves substantially over ViT-B, but has only marginal gains over ViT-L. Image encoder scaling does not appear fruitful at this time.
The goal is not to obtain the best performance; instead, the authors propose a promptable segmentation model that can segment unseen objects and generalize to new tasks.