Brief Review — You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection

YOLOS, YOLO Using Vision Transformer

Sik-Ho Tsang
3 min read · Apr 28, 2024

You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection
YOLOS, by Huazhong University of Science & Technology, and Horizon Robotics
2021 NeurIPS, Over 240 Citations (Sik-Ho Tsang @ Medium)

Object Detection
2014 … 2022 [Pix2Seq] [MViTv2] [SF-YOLOv5] [GLIP] [TPH-YOLOv5++] [YOLOv6] 2023 [YOLOv7] [YOLOv8] 2024 [YOLOv9]
==== My Other Paper Readings Are Also Over Here ====

  • You Only Look at One Sequence (YOLOS) is proposed: a series of object detection models built on the vanilla Vision Transformer with the fewest possible modifications, region priors, and target-task-specific inductive biases injected.

Outline

  1. You Only Look at One Sequence (YOLOS)
  2. Results

1. You Only Look at One Sequence (YOLOS)

You Only Look at One Sequence (YOLOS)
  • YOLOS drops the [CLS] token for image classification and appends one hundred randomly initialized learnable detection tokens ([DET] tokens) to the input patch embeddings ([PATCH] tokens) for object detection.
  • During training, YOLOS replaces the image classification loss in ViT with the bipartite matching loss to perform object detection in a set prediction manner following DETR.
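
To make the overall flow concrete, here is a minimal PyTorch-style sketch of the YOLOS forward pass. All names, shapes, and hyper-parameters (e.g. YOLOSSketch, dim=192, 16×16 patches) are illustrative assumptions rather than the official implementation; position embeddings, the MLP heads, and the matching loss are detailed in the subsections below.

import torch
import torch.nn as nn

class YOLOSSketch(nn.Module):
    def __init__(self, dim=192, num_det_tokens=100, num_classes=91, depth=12, heads=3):
        super().__init__()
        # Patch embedding: a conv with kernel = stride = patch size flattens 16x16 patches.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # 100 randomly initialized, learnable [DET] tokens replace ViT's single [CLS] token.
        self.det_tokens = nn.Parameter(torch.randn(1, num_det_tokens, dim) * 0.02)
        # Plain ViT body: a stack of Transformer encoder layers only.
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Separate heads for classification and box regression (single Linear layers here;
        # the paper uses small MLPs, see Section 1.3).
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for the "no object" class
        self.box_head = nn.Linear(dim, 4)

    def forward(self, x):
        patches = self.patch_embed(x).flatten(2).transpose(1, 2)  # [B, N, D] [PATCH] tokens
        det = self.det_tokens.expand(x.size(0), -1, -1)           # [B, 100, D] [DET] tokens
        tokens = torch.cat([patches, det], dim=1)                 # position embeddings omitted here
        out = self.encoder(tokens)
        det_out = out[:, -det.size(1):]                           # only [DET] outputs predict objects
        return self.class_head(det_out), self.box_head(det_out).sigmoid()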

1.1. Stem

  • The image x is reshaped into a sequence of flattened 2D image patches x_PATCH.
  • Then, x_PATCH is mapped to D dimensions with a trainable linear projection E.
  • [DET] tokens are appended.
  • Finally, position embeddings P are added to all the input tokens to retain positional information:
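
Written out (reconstructed here from the description above, following the ViT-style notation of the paper):

z_0 = [x^1_PATCH E; x^2_PATCH E; …; x^N_PATCH E; x^1_DET; x^2_DET; …; x^100_DET] + P

where N is the number of image patches and P ∈ R^((N+100)×D) holds the position embeddings.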

1.2. Body

  • The body of YOLOS is basically the same as ViT, which consists of a stack of Transformer encoder layers only. Formally, for the l-th YOLOS Transformer encoder layer:
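
Written out (the standard pre-norm ViT update, which YOLOS keeps unchanged):

z'_l = MSA(LN(z_{l-1})) + z_{l-1}
z_l = MLP(LN(z'_l)) + z'_l

where MSA is multi-head self-attention, LN is layer normalization, and the MLP has one hidden layer.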

1.3. Detector Heads

  • Both the classification head and the bounding box regression head are implemented by a single MLP with two hidden layers and intermediate ReLU activations; the two heads have separate parameters.
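
A minimal sketch of such a head (a hypothetical PyTorch helper; setting the hidden width equal to the embedding dimension is an assumption):

import torch.nn as nn

def make_head(dim, out_dim, hidden=None):
    # One MLP with two hidden layers and intermediate ReLU activations.
    # The classification head and the box head share this structure but not their parameters.
    hidden = hidden or dim
    return nn.Sequential(
        nn.Linear(dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class_head = make_head(dim=192, out_dim=92)  # e.g. 91 COCO classes + "no object"
box_head = make_head(dim=192, out_dim=4)     # normalized box coordinates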

1.4. Detection Token

  • When fine-tuning on COCO, for each forward pass, an optimal bipartite matching between predictions generated by [DET] tokens and ground truth objects is established. This procedure plays the same role as label assignment.
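
A hedged sketch of this assignment step for one image, using the Hungarian algorithm via scipy; the cost below is simplified (DETR's full matching cost also includes a GIoU term), and all names are illustrative:

import torch
from scipy.optimize import linear_sum_assignment

def match(pred_logits, pred_boxes, gt_labels, gt_boxes, l1_weight=5.0):
    # pred_logits: [100, C+1] and pred_boxes: [100, 4], one prediction per [DET] token.
    # gt_labels: [M] and gt_boxes: [M, 4] are the ground-truth objects of the image.
    prob = pred_logits.softmax(-1)
    cost_class = -prob[:, gt_labels]                     # [100, M] negative prob. of the true class
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)   # [100, M] L1 distance between boxes
    cost = cost_class + l1_weight * cost_bbox
    det_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return det_idx, gt_idx  # each ground-truth object is assigned to exactly one [DET] token

The matched pairs are then supervised with classification and box regression losses, while unmatched [DET] tokens are trained to predict "no object", as in DETR.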

1.5. Fine-tuning at Higher Resolution

  • When fine-tuning at higher resolutions, the position embeddings need to adapt to longer input sequences of varying lengths; 2D interpolation of the pre-trained position embeddings is performed on the fly.
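
A sketch of how this on-the-fly resizing might look for the [PATCH] part of the position embeddings (the [DET] part keeps its fixed length of 100); the function name, arguments, and the bicubic mode are assumptions:

import torch.nn.functional as F

def resize_patch_pos_embed(pos_embed, old_hw, new_hw):
    # pos_embed: [1, old_h*old_w, D] pre-trained [PATCH] position embeddings.
    # Reshape to a 2D grid, interpolate to the new grid size, then flatten back.
    old_h, old_w = old_hw
    new_h, new_w = new_hw
    d = pos_embed.shape[-1]
    grid = pos_embed.reshape(1, old_h, old_w, d).permute(0, 3, 1, 2)  # [1, D, old_h, old_w]
    grid = F.interpolate(grid, size=new_hw, mode='bicubic', align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_h * new_w, d)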

1.6. YOLOS Variants

YOLOS Variants
  • All YOLOS / ViT models are pretrained on ImageNet-1k.
  • YOLOS-Ti (Tiny), -S (Small), and -B (Base) directly correspond to DeiT-Ti, -S, and -B.
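
For reference, the backbone configurations inherited from DeiT (embedding dimension, depth, and number of attention heads); detection-time input resolutions differ per variant and are omitted here:

# Backbone configurations inherited from DeiT.
yolos_variants = {
    "YOLOS-Ti": dict(dim=192, depth=12, heads=3),   # DeiT-Ti
    "YOLOS-S":  dict(dim=384, depth=12, heads=6),   # DeiT-S
    "YOLOS-B":  dict(dim=768, depth=12, heads=12),  # DeiT-B
}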

2. Results

SOTA Comparisons
  • Table 5: YOLOS-Ti is strong in AP and competitive in FLOPs & FPS, so it can serve as a promising starting point for model scaling.
  • Table 6: YOLOS-Ti still performs better than its DETR counterpart, while larger YOLOS models obtained by width scaling become less competitive: YOLOS-S, despite using more computation, is 0.8 AP lower than a similar-sized DETR model. Even worse, YOLOS-B cannot beat DETR despite having over 2× the parameters and FLOPs.
