Brief Review — You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection
YOLOS: Object Detection Using the Vanilla Vision Transformer (ViT)
You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection
YOLOS, by Huazhong University of Science & Technology, and Horizon Robotics
2021 NeurIPS, Over 240 Citations (Sik-Ho Tsang @ Medium)
Object Detection
2014 … 2022 [Pix2Seq] [MViTv2] [SF-YOLOv5] [GLIP] [TPH-YOLOv5++] [YOLOv6] 2023 [YOLOv7] [YOLOv8] 2024 [YOLOv9]
==== My Other Paper Readings Are Also Over Here ====
- You Only Look at One Sequence (YOLOS) is proposed, where a series of object detection models are developed based on the vanilla Vision Transformer with the fewest possible modifications, region priors, as well as inductive biases of the target task.
Outline
- You Only Look at One Sequence (YOLOS)
- Results
1. You Only Look at One Sequence (YOLOS)
- YOLOS drops the [CLS] token for image classification and appends one hundred randomly initialized learnable detection tokens ([DET] tokens) to the input patch embeddings ([PATCH] tokens) for object detection.
- During training, YOLOS replaces the image classification loss in ViT with the bipartite matching loss to perform object detection in a set prediction manner following DETR.
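As a minimal PyTorch-style sketch of this token setup (tensor names and dimensions here are illustrative assumptions, not the official implementation):

```python
import torch
import torch.nn as nn

class YOLOSTokens(nn.Module):
    """Sketch: drop ViT's [CLS] token and append 100 learnable [DET] tokens."""
    def __init__(self, embed_dim=384, num_det_tokens=100):
        super().__init__()
        # 100 randomly initialized, learnable detection tokens
        self.det_tokens = nn.Parameter(torch.zeros(1, num_det_tokens, embed_dim))
        nn.init.trunc_normal_(self.det_tokens, std=0.02)

    def forward(self, patch_embeddings):
        # patch_embeddings: (B, N, D) [PATCH] tokens from the linear projection
        B = patch_embeddings.shape[0]
        det = self.det_tokens.expand(B, -1, -1)           # (B, 100, D)
        return torch.cat([patch_embeddings, det], dim=1)  # (B, N + 100, D)
```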
1.1. Stem
- The image x is reshaped into a sequence of flattened 2D image patches xPATCH.
- Then, xPATCH is mapped to D dimensions with a trainable linear projection E.
- [DET] tokens are appended.
- Finally, position embeddings P are added to all the input tokens to retain positional information:
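Following the paper's notation, with N [PATCH] tokens and 100 [DET] tokens, the input sequence to the encoder is:

$z_0 = [x^1_{\mathrm{PATCH}} E;\ \cdots;\ x^N_{\mathrm{PATCH}} E;\ x^1_{\mathrm{DET}};\ \cdots;\ x^{100}_{\mathrm{DET}}] + P$, where $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ and $P \in \mathbb{R}^{(N+100) \times D}$.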
1.2. Body
- The body of YOLOS is basically the same as ViT, which consists of a stack of Transformer encoder layers only. Formally, for the l-th YOLOS Transformer encoder layer:
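Writing LN for LayerNorm, MSA for multi-head self-attention, and MLP for the feed-forward block, this is the standard pre-norm ViT update:

$z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}$

$z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell$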
1.3. Detector Heads
- Both the classification and the bounding box regression heads are implemented by one MLP with separate parameters containing two hidden layers with intermediate ReLU.
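A sketch of such a head, with illustrative hidden widths (YOLOS uses separate parameters for the classification and box regression heads):

```python
import torch.nn as nn

class DetectorHead(nn.Module):
    """Sketch of a YOLOS-style head: an MLP with two hidden layers and ReLU."""
    def __init__(self, embed_dim=384, hidden_dim=384, out_dim=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, det_tokens):
        # det_tokens: (B, 100, D) [DET] outputs of the final encoder layer
        return self.mlp(det_tokens)

# Separate parameters for the two tasks, e.g.:
# cls_head = DetectorHead(out_dim=num_classes + 1)  # "+1" for the no-object class
# box_head = DetectorHead(out_dim=4)                # box outputs squashed with sigmoid
```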
1.4. Detection Token
- When fine-tuning on COCO, for each forward pass, an optimal bipartite matching between predictions generated by [DET] tokens and ground truth objects is established. This procedure plays the same role as label assignment.
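A toy sketch of this matching step for one image, using SciPy's Hungarian solver; the cost terms are simplified stand-ins for DETR's class, L1 box, and GIoU costs:

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """Match 100 [DET] predictions to M ground-truth objects for one image."""
    # pred_logits: (100, num_classes + 1), pred_boxes: (100, 4) in [0, 1]
    # gt_labels: (M,), gt_boxes: (M, 4)
    prob = pred_logits.softmax(-1)                       # (100, C + 1)
    cost_class = -prob[:, gt_labels]                     # (100, M): higher prob -> lower cost
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)   # (100, M): L1 box distance
    cost = cost_class + cost_bbox                        # simplified; DETR also adds a GIoU term
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx  # matched pairs act as the label assignment for the losses
```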
1.5. Fine-tuning at Higher Resolution
- The position embeddings need to adapt to longer input sequences of varying lengths, so 2D interpolation of the pre-trained position embeddings is performed on the fly.
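A sketch of this on-the-fly resizing, applied to the [PATCH] position embeddings only (the [DET] token embeddings keep their pre-trained values); bicubic interpolation is assumed here, as is common for ViT position embeddings:

```python
import torch
import torch.nn.functional as F

def resize_patch_pos_embed(pos_embed, old_hw, new_hw):
    """Interpolate pre-trained [PATCH] position embeddings to a new patch grid.

    pos_embed: (1, old_h * old_w, D) patch position embeddings from pre-training.
    """
    old_h, old_w = old_hw
    new_h, new_w = new_hw
    d = pos_embed.shape[-1]
    grid = pos_embed.reshape(1, old_h, old_w, d).permute(0, 3, 1, 2)   # (1, D, H, W)
    grid = F.interpolate(grid, size=(new_h, new_w), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_h * new_w, d)
```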
1.6. YOLOS Variants
- YOLOS-Ti, YOLOS-S, and YOLOS-B follow the corresponding DeiT-Ti, DeiT-S, and DeiT-B configurations; they are pre-trained on ImageNet-1k and then fine-tuned on COCO.
2. Results
- Table 5: YOLOS-Ti is strong in AP and competitive in FLOPs & FPS, and can serve as a promising starting point for model scaling.
- Table 6: YOLOS-Ti still performs better than its DETR counterpart, while larger YOLOS models with width scaling become less competitive: YOLOS-S with more computation is 0.8 AP lower than a similar-sized DETR model. Even worse, YOLOS-B cannot beat DETR despite having over 2× the parameters and FLOPs.