Brief Review — FastSAM: Fast Segment Anything
Fast Segment Anything
FastSAM, by Chinese Academy of Sciences, University of Chinese Academy of Sciences, Objecteye Inc., and Wuhan AI Research
2023 arXiv v1, Over 80 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation / Scene Parsing / Instance Segmentation / Panoptic Segmentation
2014 … 2021 [PVT, PVTv1] [SETR] [Trans10K-v2, Trans2Seg] [Copy-Paste] [HRNetV2, HRNetV2p] [Lite-HRNet] 2022 [PVTv2] [YOLACT++] 2023 [Segment Anything Model (SAM)]
- SAM has huge computation costs, which prevent it from wider application in industry scenarios.
- In this paper, FastSAM is proposed. The task is reformulated as segments generation followed by prompting, and it is found that a regular CNN detector with an instance segmentation branch can also accomplish this task well.
Outline
- FastSAM
- Results
1. FastSAM
1.1. All-Instance Segmentation Stage
- The YOLOv8 detector is used, with YOLACT principles applied for instance segmentation. The resulting model is YOLOv8-seg.
- It begins with feature extraction from an image via a backbone network and the Feature Pyramid Network (FPN).
- The detection branch outputs category and bounding box, while the segmentation branch outputs k prototypes (defaulted to 32 in FastSAM) along with k mask coefficients.
- The segmentation and detection tasks are computed in parallel.
- The segmentation branch takes a high-resolution feature map as input. This map is processed through a convolution layer, upscaled, and then passed through two more convolution layers to output the masks.
- The mask coefficients, similar to the detection head’s classification branch, range between -1 and 1. The instance segmentation result is obtained by multiplying the mask coefficients with the prototypes and then summing them up.
The prototypes and mask coefficients provide a lot of extensibility for prompt guidance. This YOLOv8-seg method is used for the all-instance segmentation stage.
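The "multiply and sum" step above can be sketched in a few lines. This is a minimal illustration in plain Python, not the YOLOv8-seg implementation: the function name `assemble_mask`, the toy prototypes, and the sigmoid-plus-threshold step (borrowed from the YOLACT formulation) are assumptions made for this example. A real model would do the same computation as one tensor operation over k = 32 prototypes.

```python
import math

def assemble_mask(coeffs, prototypes, threshold=0.5):
    """Combine k prototype maps into one binary instance mask.

    coeffs:     list of k floats, one mask coefficient per prototype
    prototypes: list of k maps, each an H x W list of lists of floats
    """
    if len(coeffs) != len(prototypes):
        raise ValueError("need exactly one coefficient per prototype")
    height, width = len(prototypes[0]), len(prototypes[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            # Weighted sum of the k prototype values at this pixel.
            logit = sum(c * p[y][x] for c, p in zip(coeffs, prototypes))
            # Assumption: sigmoid followed by a fixed threshold, as in YOLACT.
            prob = 1.0 / (1.0 + math.exp(-logit))
            row.append(1 if prob > threshold else 0)
        mask.append(row)
    return mask

# Toy example with k = 2 prototypes on a 2 x 2 grid (hypothetical values).
prototypes = [
    [[4.0, 4.0], [-4.0, -4.0]],  # responds to the top row
    [[4.0, -4.0], [4.0, -4.0]],  # responds to the left column
]
top_mask = assemble_mask([1.0, 0.0], prototypes)   # selects the top row
left_mask = assemble_mask([0.0, 1.0], prototypes)  # selects the left column
```

Because every instance shares the same prototypes and differs only in its k coefficients, adding an instance costs one short coefficient vector rather than a full-resolution mask head.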
1.2. Prompt-guided Selection Stage
- Point prompt: Similar to SAM, foreground/background points can be used as the prompt.
- Box prompt: The aim is to identify the mask with the highest IoU score with the selected box and thus select the object of interest.
- Text prompt: The text embeddings of the prompt are extracted using the CLIP model. The image embeddings are then computed and matched to the intrinsic features of each mask using a similarity metric. The mask with the highest similarity score to the embeddings of the text prompt is then selected.
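The three selection rules can be sketched as simple post-processing over the masks from the all-instance stage. The code below is a hedged illustration, not FastSAM's code: all function names are hypothetical, masks are represented as sets of (x, y) pixels for brevity, the box rule compares the prompt box with each mask's bounding box (an assumption about how the IoU is taken), the point rule keeps masks that contain a foreground point and no background point (also an assumption), and the text rule takes precomputed embeddings as plain lists instead of calling CLIP.

```python
import math

def mask_bbox(mask):
    """Tight (x1, y1, x2, y2) box around a non-empty set of (x, y) pixels."""
    xs = [x for x, _ in mask]
    ys = [y for _, y in mask]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)

def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(iw, 0) * max(ih, 0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def select_by_box(masks, box):
    """Index of the mask whose bounding box has the highest IoU with the prompt box."""
    return max(range(len(masks)), key=lambda i: box_iou(mask_bbox(masks[i]), box))

def select_by_points(masks, foreground, background=()):
    """Union of all masks that contain a foreground point and no background point."""
    merged = set()
    for mask in masks:
        hit = any(p in mask for p in foreground)
        excluded = any(p in mask for p in background)
        if hit and not excluded:
            merged |= mask
    return merged

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def select_by_text(mask_embeddings, text_embedding):
    """Index of the mask whose embedding is most similar to the text embedding."""
    return max(range(len(mask_embeddings)),
               key=lambda i: cosine(mask_embeddings[i], text_embedding))

# Toy masks: a 4x4 square, a 2x2 square nested inside it, and a separate 3x3 square.
masks = [
    {(x, y) for x in range(0, 4) for y in range(0, 4)},
    {(x, y) for x in range(2, 4) for y in range(2, 4)},
    {(x, y) for x in range(6, 9) for y in range(0, 3)},
]
box_choice = select_by_box(masks, (2, 2, 4, 4))
point_choice = select_by_points(masks, foreground=[(3, 3)], background=[(0, 0)])
text_choice = select_by_text([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], [0.0, 1.0])
```

In the toy data, the foreground point (3, 3) lies in both the large and the nested square, and the background point (0, 0) removes the large one. None of the rules re-runs the network, which is consistent with the run time being independent of the number of prompts.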
1.3. Data
- Only 1/50 of the SA-1B dataset, which is used by SAM, is used to train the FastSAM model.
2. Results
2.1. Run-Time
- FastSAM generates relatively satisfying results, as shown in Figure 3, while surpassing SAM in speed at all prompt numbers. Moreover, the running speed of FastSAM does not change with the number of prompts.
2.2. Zero-Shot Edge Detection
Despite having significantly fewer parameters (only 68M), FastSAM produces a generally good edge map.
FastSAM has similar performance to SAM, specifically a higher R50 and a lower AP.
2.3. Zero-Shot Object Proposal Generation
- While others are supervised approaches, FastSAM and SAM implement a fully zero-shot transfer.
- FastSAM and SAM do not perform as well as the supervised approaches in AR@10. However, in AR@1000, FastSAM significantly outperforms OLN [17].
FastSAM substantially surpasses the most computationally intensive model of SAM, SAM-H E64, by over 5%.
- However, it falls short compared to ViTDet-H, which was trained on the LVIS dataset.
Again, for mask proposal generation, FastSAM is relatively lower on recall.
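For readers unfamiliar with the AR@k numbers quoted in this section, the sketch below shows one common way to compute average recall over the top-k proposals. It is a simplified illustration, not the evaluation code used in the paper: the function names are hypothetical, the IoU thresholds 0.50 to 0.95 in steps of 0.05 follow the COCO-style convention (an assumption here), and each ground-truth box is matched independently to its best proposal.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(iw, 0) * max(ih, 0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_recall(proposals, ground_truth, k, thresholds=None):
    """Mean recall over IoU thresholds, using only the k highest-scoring proposals.

    proposals:    list of (score, box) pairs
    ground_truth: list of boxes
    """
    if not ground_truth:
        raise ValueError("average recall is undefined without ground-truth boxes")
    if thresholds is None:
        # Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95.
        thresholds = [(50 + 5 * i) / 100 for i in range(10)]
    ranked = sorted(proposals, key=lambda p: p[0], reverse=True)
    top_k = [box for _, box in ranked[:k]]
    # Best IoU that any kept proposal achieves for each ground-truth box.
    best = [max((box_iou(gt, b) for b in top_k), default=0.0) for gt in ground_truth]
    # Recall at one threshold = fraction of ground-truth boxes covered at that IoU.
    recalls = [sum(iou >= t for iou in best) / len(best) for t in thresholds]
    return sum(recalls) / len(recalls)

# Toy example: two objects, three scored proposals (hypothetical values).
ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
proposals = [
    (0.9, (0, 0, 10, 10)),    # exact hit on the first object
    (0.8, (50, 50, 60, 60)),  # background
    (0.7, (20, 20, 29, 28)),  # loose hit on the second object, IoU = 0.72
]
ar_at_1 = average_recall(proposals, ground_truth, k=1)
ar_at_3 = average_recall(proposals, ground_truth, k=3)
```

In the toy data, raising k from 1 to 3 admits the loose third proposal and lifts recall. This illustrates how a method that proposes many regions with less reliable ranking can trail at AR@10 and still do well at AR@1000.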
2.4. Zero-Shot Instance Segmentation
On this task, FastSAM fails to achieve a high AP.
But qualitatively, FastSAM can still segment objects well based on the text prompts.
2.5. Real-world Applications
- Fig. 7: With foreground/background points (yellow and magenta points in FastSAM-point, respectively) or box-guided selection, FastSAM can segment the exact defective regions.
- Fig. 8: FastSAM exhibits only a minor difference from SAM under everything mode, as it segments fewer background objects that are irrelevant to the task.
- Fig. 9: FastSAM performs well in segmenting regularly shaped objects, but segments fewer regions related to shadows compared to SAM.
- Fig. 10: On some images, FastSAM even generates better masks for large objects.
2.6. Failure Modes
- Low-quality, small-sized segmentation masks have large confidence scores.
- The masks of some tiny objects tend to be nearly square.