Review — Pix2Seq: A Language Modeling Framework for Object Detection

Pix2Seq: Using a Language Model for Object Detection, From Prof. Hinton's Group, Outperforms Faster R-CNN and DETR

Sik-Ho Tsang
7 min read · Nov 29, 2022


Pix2Seq framework for object detection. The neural network perceives an image, and generates a sequence of tokens for each object, which correspond to bounding boxes and class labels.

Pix2Seq: A Language Modeling Framework for Object Detection,
Pix2Seq, by Google Research, Brain Team,
2022 ICLR, Over 50 Citations (Sik-Ho Tsang @ Medium)
Object Detection, Language Model, LM, BERT, Transformer

  • Unlike existing approaches that explicitly integrate prior knowledge about the task, Pix2Seq casts object detection as a language modeling task conditioned on the observed pixel inputs.
  • Object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and a neural network is trained to perceive the image and generate the desired sequence.
  • This is a paper from Prof. Hinton’s research group.


  1. Pix2Seq Framework
  2. Bounding Box Discretization Using Quantization Bins
  3. Random Ordering Strategy
  4. Architecture, Objective, and Inference
  5. Sequence Augmentation
  6. Experimental Results

1. Pix2Seq Framework

Major components of the Pix2Seq learning framework.

In the proposed Pix2Seq framework, object detection is cast as a language modeling task, conditioned on pixel inputs.

  • The system consists of four main components:
  1. Image augmentation: Augmentations are used to enrich a fixed set of training examples (e.g., with random scaling and crops).
  2. Sequence construction & augmentation: A set of bounding boxes and class labels is converted into a sequence of discrete tokens.
  3. Architecture: An encoder-decoder model is used, where the encoder perceives pixel inputs and the decoder generates the target sequence, one token at a time.
  4. Objective/loss function: The model is trained to maximize the log likelihood of tokens conditioned on the image and the preceding tokens (with a softmax cross-entropy loss).

2. Bounding Box Discretization Using Quantization Bins

Applying the proposed discretization of bounding box on an image of 480×640.
  • Specifically, an object is represented as a sequence of five discrete tokens, i.e. [ymin, xmin, ymax, xmax, c], where each of the continuous corner coordinates is uniformly discretized into an integer between [1, nbins], and c is the class index.
  • A single vocabulary is shared by all tokens, so the vocabulary size equals the number of bins plus the number of classes. This quantization scheme for the bounding boxes allows a small vocabulary while achieving high precision.
  • For example, a 600×600 image requires only 600 bins to achieve zero quantization error. This is much smaller than modern language models with vocabulary sizes of 32K or higher (as in BERT or GPT).
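The quantization and de-quantization described above can be sketched in a few lines. This is a minimal illustration, not the official implementation; the function names and the exact clamping/rounding conventions are assumptions.

```python
# Sketch of Pix2Seq-style coordinate discretization (illustrative, not the
# official code). Continuous coordinates in [0, image_size] are mapped to
# integer tokens in [1, n_bins], and mapped back to bin centers.

def quantize(coord, image_size, n_bins):
    """Map a continuous coordinate to a discrete bin token in [1, n_bins]."""
    bin_idx = int(coord / image_size * n_bins)
    return min(max(bin_idx, 0), n_bins - 1) + 1  # clamp, then shift to 1-based

def dequantize(token, image_size, n_bins):
    """Map a bin token back to an approximate pixel coordinate (bin center)."""
    return (token - 0.5) / n_bins * image_size
```

With `n_bins = 1000` on a 480-pixel-tall image, the round-trip error is at most half a bin (0.24 pixels), which is why a vocabulary of a few hundred to a thousand coordinate tokens already achieves sub-pixel precision.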

The effect of different levels of quantization is shown above. With a sufficient number of bins, such as 500 (roughly 1 pixel per bin), high precision is achieved even for small objects.

3. Random Ordering Strategy

Examples of sequence construction with nbins = 1000, and 0 is EOS token.
  • Multiple object descriptions are serialized to form a single sequence for a given image.
  • A random ordering strategy (randomizing the order of objects each time an image is shown) is used.
  • Finally, because different images often have different numbers of objects, the generated sequences will have different lengths. An EOS token is therefore incorporated to indicate the end of a sequence.
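The serialization above can be sketched as follows, using 0 as the EOS token as in the paper's example. The helper name and the object representation (a list of already-quantized box tokens plus a class token) are illustrative assumptions.

```python
import random

EOS = 0  # end-of-sequence token, matching the paper's example (0 is EOS)

def build_sequence(objects, rng=random):
    """Serialize multiple objects into one token sequence.

    objects: list of ([ymin, xmin, ymax, xmax] tokens, class_token) pairs.
    Objects are shuffled each time (random ordering strategy), each object
    contributes 5 tokens, and EOS marks the end of the sequence.
    """
    objects = list(objects)
    rng.shuffle(objects)
    seq = []
    for box_tokens, class_token in objects:
        seq.extend(box_tokens)
        seq.append(class_token)
    seq.append(EOS)
    return seq
```

For two objects, this yields an 11-token sequence (2 × 5 object tokens plus EOS), with the object order varying between epochs.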

4. Architecture, Objective, and Inference

4.1. Architecture

  • An encoder-decoder architecture is used.
  • The encoder can be a general image encoder that perceives pixels and encodes them into hidden representations, which can be ConvNet, Transformer, or hybrid.
  • For generation, a Transformer decoder is used to generate one token at a time, conditioned on the preceding tokens and the encoded image representation.
  • Thus, Pix2Seq can use a ResNet backbone followed by 6 layers of Transformer encoder and 6 layers of (causal) Transformer decoder.

4.2. Objective

  • Similar to language modeling, Pix2Seq is trained to predict tokens, given an image and preceding tokens, with a maximum likelihood loss, i.e. to maximize the sum over j = 1, …, L of w_j log P(ỹ_j | x, y_{1:j-1}),
  • where x is a given image, y and ỹ are the input and target sequences associated with x, and L is the target sequence length.
  • w_j is a pre-assigned weight for the j-th token in the sequence; in practice, all weights are set to 1.
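The weighted log-likelihood objective can be written as a small sketch. This assumes the model has already produced, for each target token, the probability it assigned to that token; the function name is illustrative.

```python
import math

def weighted_nll(token_probs, weights):
    """Weighted negative log-likelihood of a target sequence.

    token_probs[j]: model probability P(y~_j | x, y_{1:j-1}) of the j-th
    target token; weights[j]: pre-assigned per-token weight w_j (all 1 in
    the standard setup). Minimizing this maximizes the weighted likelihood.
    """
    return -sum(w * math.log(p) for p, w in zip(token_probs, weights))
```

Setting a token's weight to 0 removes it from the loss entirely, which is exactly how the "n/a" coordinate tokens of noise objects are handled in the sequence-augmentation section below.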

4.3. Inference

  • At inference time, tokens are sampled from the model likelihood.
  • This can be done by either taking the token with the largest likelihood (arg max sampling), or using other stochastic sampling techniques. It is found that using nucleus sampling (Holtzman et al., 2019) leads to higher recall than arg max sampling.
  • Once the sequence is generated, it is straightforward to extract and de-quantize the object descriptions (i.e., the bounding boxes and class labels).
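The two sampling strategies can be contrasted with a simplified sketch over a single next-token distribution. The cutoff value of p here is arbitrary, and this omits practical details such as temperature; function names are illustrative.

```python
import random

def argmax_sample(probs):
    """Greedy decoding: pick the most likely token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def nucleus_sample(probs, p=0.4, rng=random):
    """Nucleus (top-p) sampling (Holtzman et al., 2019): sample from the
    smallest set of tokens whose cumulative probability reaches p,
    renormalized over that set."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    r = rng.random() * total
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

Greedy decoding always commits to the mode, whereas nucleus sampling keeps some diversity among high-probability tokens, which the paper found improves recall.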

5. Sequence Augmentation

  • The EOS token allows the model to decide when to terminate generation. But in practice, it is found that the model tends to finish without predicting all objects.
  • This is likely due to:
  1. Annotation noise (e.g., where annotators did not identify all the objects),
  2. Uncertainty in recognizing or localizing some objects.

5.1. Sequence Augmentation

Illustration of language modeling with / without sequence augmentation.
  • To encourage higher recall rates, one trick is to delay the sampling of the EOS token by artificially decreasing its likelihood.
  • With sequence augmentation, input tokens are constructed to include both real objects (blue) and synthetic noise objects (orange). For the noise objects, the model is trained to identify them as the “noise” class, and the loss weight of “n/a” tokens (corresponding to coordinates of noise objects) is set to zero since we do not want the model to mimic them.

5.2. Altered Sequence Construction

Illustrations of randomly sampled noise objects (in white), vs. ground-truth objects (in red).
  • Synthetic noise objects are first synthesized to augment input sequences in the following two ways:
  1. Adding noise to existing ground-truth objects (e.g., random scaling or shifting their bounding boxes), and
  2. Generating completely random boxes (with randomly associated class labels). These objects are appended at the end.
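The two ways of synthesizing noise objects can be sketched as follows. The jitter magnitude and helper names are assumptions for illustration; the paper's exact noise parameters may differ.

```python
import random

def jitter_box(box, image_size, max_shift=0.1, rng=random):
    """Noise object of type 1: perturb an existing ground-truth box by
    randomly shifting each corner by up to max_shift * image_size."""
    def shift(v):
        return min(max(v + rng.uniform(-max_shift, max_shift) * image_size, 0),
                   image_size)
    ymin, xmin, ymax, xmax = box
    return [shift(ymin), shift(xmin), shift(ymax), shift(xmax)]

def random_box(image_size, rng=random):
    """Noise object of type 2: a completely random box (a random class
    label would be attached when building the sequence)."""
    ys = sorted(rng.uniform(0, image_size) for _ in range(2))
    xs = sorted(rng.uniform(0, image_size) for _ in range(2))
    return [ys[0], xs[0], ys[1], xs[1]]
```

These synthetic boxes are appended after the real objects in the input sequence; their class target is the "noise" class, and their coordinate tokens get zero loss weight.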

6. Experimental Results

6.1. Training from Scratch on COCO

Pix2Seq achieves competitive AP results compared to existing systems that require specialization during model design, while being significantly simpler. The best performing Pix2Seq model achieved an AP score of 45.
Comparison of average precision, over multiple thresholds and object sizes, on COCO validation set.

Overall, Pix2Seq achieves competitive results to both baselines. The proposed model performs comparably to Faster R-CNN on small and medium objects, but better on larger objects.

Compared with DETR, the proposed model performs comparably or slightly worse on large and medium objects, but substantially better (4–5 AP) on small objects.

6.2. Pretrain on Objects365 and Finetune on COCO

Average precision of finetuned Pix2seq models on COCO with different backbone architectures and image sizes.

The performances of Objects365-pretrained Pix2Seq models are strong across various model sizes and image sizes. The best performance (with 1333 image size) is 50 AP, which is 5 AP higher than the best model trained from scratch.

  • The pretrain+finetune process is faster than training from scratch, and also generalizes better.

6.3. Ablation Studies

Ablations on sequence construction. (a) Quantization bins vs. performance. (b) and (c) show AP and AR@100 for different object ordering strategies.
  • (a): The plot indicates that quantization to 500 bins or more is sufficient; with 500 bins there are approximately 1.3 pixels per bin, which does not introduce significant approximation error.
  • (b) and (c): Both in terms of precision and recall, the random ordering yields the best performance. It is conjectured that with deterministic ordering, it may be difficult for the model to recover from mistakes of missing objects made earlier on, while with random ordering it would still be possible to retrieve them later.
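The "1.3 pixels per bin" figure can be sanity-checked with quick arithmetic, assuming a longer image side of 640 pixels (the image size is an assumption here, inferred from the 480×640 example earlier in the post):

```python
# Quick check: bin width in pixels for a given image side and bin count.
def pixels_per_bin(image_size, n_bins):
    return image_size / n_bins

width = pixels_per_bin(640, 500)  # 640-pixel side, 500 quantization bins
```

640 / 500 = 1.28, i.e. roughly the 1.3 pixels per bin quoted in the ablation.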
Left: Impact of sequence augmentation on when training from scratch on COCO. Right: Impact of sequence augmentation when pretraining on Objects365 and finetuning on COCO.
  • Left: Without sequence augmentation, the AP is marginally worse if the sampling of the EOS token is delayed at inference time (via likelihood offsetting), but the recall at the optimal AP is significantly worse.
  • Right: AP is not significantly affected while recall is significantly worse without sequence augmentation. It is also worth noting that sequence augmentation is mainly effective during the fine-tuning.

6.4. Visualization

Decoder’s cross-attention to the visual feature map when predicting the first 5 objects. In (b), a prediction sequence of 25 tokens is reshaped into a 5×5 grid, so each row represents the 5 tokens of one prediction: [ymin, xmin, ymax, xmax, c].
  • The cross attention (averaged over layers and heads) is visualized as the model predicts a new token.

One can see that the attention is very diverse when predicting the first coordinate token (i.e., ymin), but then quickly concentrates and fixates on the object.

Visualization of Transformer decoder’s cross attention (when predicting class tokens) conditioned on the given bounding boxes.
  • The decoder pays the most attention to the object when predicting the class token.
Examples of the model’s predictions (at a score threshold of 0.5).
  • Detection results of a Pix2Seq model (with 46 AP) are visualized on a subset of images from the COCO validation set that contain crowded sets of objects.

A simple and generic framework for object detection is proposed; the augmentations are the only component that still integrates prior knowledge about the task.


