Brief Review — You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection

YOLOS, YOLO Using Vision Transformer

Sik-Ho Tsang
3 min read · Apr 28, 2024

You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection
YOLOS, by Huazhong University of Science & Technology, and Horizon Robotics
2021 NeurIPS, Over 240 Citations (Sik-Ho Tsang @ Medium)

Object Detection
2014 … 2022 [Pix2Seq] [MViTv2] [SF-YOLOv5] [GLIP] [TPH-YOLOv5++] [YOLOv6] 2023 [YOLOv7] [YOLOv8] 2024 [YOLOv9]
==== My Other Paper Readings Are Also Over Here ====

  • You Only Look at One Sequence (YOLOS) is proposed: a series of object detection models built on the vanilla Vision Transformer with the fewest possible modifications, region priors, and target-task-specific inductive biases injected.

Outline

  1. You Only Look at One Sequence (YOLOS)
  2. Results

1. You Only Look at One Sequence (YOLOS)

You Only Look at One Sequence (YOLOS)
  • YOLOS drops the [CLS] token for image classification and appends one hundred randomly initialized learnable detection tokens ([DET] tokens) to the input patch embeddings ([PATCH] tokens) for object detection.
  • During training, YOLOS replaces the image classification loss in ViT with the bipartite matching loss to perform object detection in a set prediction manner following DETR.
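
To make the overall flow concrete, here is a minimal PyTorch-style sketch of the YOLOS forward pass. All names, shapes, and hyper-parameters (e.g. YOLOSSketch, dim=192, 16×16 patches) are illustrative assumptions rather than the official implementation; position embeddings, the MLP heads, and the matching loss are detailed in the subsections below.

import torch
import torch.nn as nn

class YOLOSSketch(nn.Module):
    def __init__(self, dim=192, num_det_tokens=100, num_classes=91, depth=12, heads=3):
        super().__init__()
        # Patch embedding: a conv with kernel = stride = patch size flattens 16x16 patches.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # 100 randomly initialized, learnable [DET] tokens replace ViT's single [CLS] token.
        self.det_tokens = nn.Parameter(torch.randn(1, num_det_tokens, dim) * 0.02)
        # Plain ViT body: a stack of Transformer encoder layers only.
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Separate heads for classification and box regression (single Linear layers here;
        # the paper uses small MLPs, see Section 1.3).
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for the "no object" class
        self.box_head = nn.Linear(dim, 4)

    def forward(self, x):
        patches = self.patch_embed(x).flatten(2).transpose(1, 2)  # [B, N, D] [PATCH] tokens
        det = self.det_tokens.expand(x.size(0), -1, -1)           # [B, 100, D] [DET] tokens
        tokens = torch.cat([patches, det], dim=1)                 # position embeddings omitted here
        out = self.encoder(tokens)
        det_out = out[:, -det.size(1):]                           # only [DET] outputs predict objects
        return self.class_head(det_out), self.box_head(det_out).sigmoid()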

1.1. Stem

  • The image x is reshaped into a sequence of flattened 2D image patches x_PATCH.
  • Then, x_PATCH is mapped to D dimensions with a trainable linear projection E.
  • [DET] tokens are appended.
  • Finally, position embeddings P are added to all the input tokens to retain positional information:
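
Written out (reconstructed here from the description above, following the ViT-style notation of the paper):

z_0 = [x^1_PATCH E; x^2_PATCH E; …; x^N_PATCH E; x^1_DET; x^2_DET; …; x^100_DET] + P

where N is the number of image patches and P ∈ R^((N+100)×D) holds the position embeddings.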

1.2. Body

  • The body of YOLOS is basically the same as ViT, which consists of a stack of Transformer encoder layers only. Formally, for the l-th YOLOS Transformer encoder layer:
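
Written out (the standard pre-norm ViT update, which YOLOS keeps unchanged):

z'_l = MSA(LN(z_{l-1})) + z_{l-1}
z_l = MLP(LN(z'_l)) + z'_l

where MSA is multi-head self-attention, LN is layer normalization, and the MLP has one hidden layer.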

1.3. Detector Heads

  • Both the classification head and the bounding box regression head are implemented by a single MLP with two hidden layers and intermediate ReLU activations; the two heads have separate parameters.
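
A minimal sketch of such a head (a hypothetical PyTorch helper; setting the hidden width equal to the embedding dimension is an assumption):

import torch.nn as nn

def make_head(dim, out_dim, hidden=None):
    # One MLP with two hidden layers and intermediate ReLU activations.
    # The classification head and the box head share this structure but not their parameters.
    hidden = hidden or dim
    return nn.Sequential(
        nn.Linear(dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class_head = make_head(dim=192, out_dim=92)  # e.g. 91 COCO classes + "no object"
box_head = make_head(dim=192, out_dim=4)     # normalized box coordinates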

1.4. Detection Token

  • When fine-tuning on COCO, for each forward pass, an optimal bipartite matching between predictions generated by [DET] tokens and ground truth objects is established. This procedure plays the same role as label assignment.
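
A hedged sketch of this assignment step for one image, using the Hungarian algorithm via scipy; the cost below is simplified (DETR's full matching cost also includes a GIoU term), and all names are illustrative:

import torch
from scipy.optimize import linear_sum_assignment

def match(pred_logits, pred_boxes, gt_labels, gt_boxes, l1_weight=5.0):
    # pred_logits: [100, C+1] and pred_boxes: [100, 4], one prediction per [DET] token.
    # gt_labels: [M] and gt_boxes: [M, 4] are the ground-truth objects of the image.
    prob = pred_logits.softmax(-1)
    cost_class = -prob[:, gt_labels]                     # [100, M] negative prob. of the true class
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)   # [100, M] L1 distance between boxes
    cost = cost_class + l1_weight * cost_bbox
    det_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return det_idx, gt_idx  # each ground-truth object is assigned to exactly one [DET] token

The matched pairs are then supervised with classification and box regression losses, while unmatched [DET] tokens are trained to predict "no object", as in DETR.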

1.5. Fine-tuning at Higher Resolution

  • When fine-tuning at higher resolutions, the position embeddings need to adapt to longer input sequences of varying lengths; 2D interpolation of the pre-trained position embeddings is performed on the fly.
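
A sketch of how this on-the-fly resizing might look for the [PATCH] part of the position embeddings (the [DET] part keeps its fixed length of 100); the function name, arguments, and the bicubic mode are assumptions:

import torch.nn.functional as F

def resize_patch_pos_embed(pos_embed, old_hw, new_hw):
    # pos_embed: [1, old_h*old_w, D] pre-trained [PATCH] position embeddings.
    # Reshape to a 2D grid, interpolate to the new grid size, then flatten back.
    old_h, old_w = old_hw
    new_h, new_w = new_hw
    d = pos_embed.shape[-1]
    grid = pos_embed.reshape(1, old_h, old_w, d).permute(0, 3, 1, 2)  # [1, D, old_h, old_w]
    grid = F.interpolate(grid, size=new_hw, mode='bicubic', align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_h * new_w, d)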

1.6. YOLOS Variants

YOLOS Variants
  • All YOLOS / ViT models are pretrained on ImageNet-1k.
  • YOLOS-Ti (Tiny), -S (Small), and -B (Base) directly correspond to DeiT-Ti, -S, and -B.
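
For reference, the backbone configurations inherited from DeiT (embedding dimension, depth, and number of attention heads); detection-time input resolutions differ per variant and are omitted here:

# Backbone configurations inherited from DeiT.
yolos_variants = {
    "YOLOS-Ti": dict(dim=192, depth=12, heads=3),   # DeiT-Ti
    "YOLOS-S":  dict(dim=384, depth=12, heads=6),   # DeiT-S
    "YOLOS-B":  dict(dim=768, depth=12, heads=12),  # DeiT-B
}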

2. Results

SOTA Comparisons
  • Table 5: YOLOS-Ti is strong in AP and competitive in FLOPs & FPS, so it can serve as a promising starting point for model scaling.
  • Table 6: YOLOS-Ti still performs better than its DETR counterpart, while larger YOLOS models obtained by width scaling become less competitive: YOLOS-S, despite using more computation, is 0.8 AP lower than a similar-sized DETR model. Even worse, YOLOS-B cannot beat DETR despite having over 2× the parameters and FLOPs.
