Review — Pix2Seq: A Language Modeling Framework for Object Detection

Pix2Seq, Using a Language Model for Object Detection, From Prof. Hinton's Group, Outperforms Faster R-CNN and DETR

Pix2Seq: A Language Modeling Framework for Object Detection,
Pix2Seq, by Google Research, Brain Team,
2022 ICLR, Over 50 Citations (Sik-Ho Tsang @ Medium)
Object Detection, Language Model, LM, BERT, Transformer

  • Unlike existing approaches that explicitly integrate prior knowledge about the task, Pix2Seq casts object detection as a language modeling task conditioned on the observed pixel inputs.
  • Object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and a neural network is trained to perceive the image and generate the desired sequence.
  • This is a paper from Prof. Hinton’s research group.


  1. Pix2Seq Framework
  2. Bounding Box Discretization Using Quantization Bins
  3. Random Ordering Strategy
  4. Architecture, Objective, and Inference
  5. Sequence Augmentation
  6. Experimental Results

1. Pix2Seq Framework

In the proposed Pix2Seq framework, object detection is cast as a language modeling task, conditioned on pixel inputs.

  • The system consists of four main components:
  1. Image augmentation: Image augmentations (e.g., random scaling and crops) are used to enrich a fixed set of training examples.
  2. Sequence construction & augmentation: A set of bounding boxes and class labels is converted into a sequence of discrete tokens.
  3. Architecture: An encoder-decoder model is used, where the encoder perceives pixel inputs and the decoder generates the target sequence, one token at a time.
  4. Objective/loss function: The model is trained to maximize the log-likelihood of tokens conditioned on the image and the preceding tokens (with a softmax cross-entropy loss).

2. Bounding Box Discretization Using Quantization Bins

  • Specifically, an object is represented as a sequence of five discrete tokens, i.e. [ymin, xmin, ymax, xmax, c], where each of the continuous corner coordinates is uniformly discretized into an integer between [1, nbins], and c is the class index.
  • A shared vocabulary is used for all tokens, so the vocabulary size equals the number of bins plus the number of classes. This quantization scheme for the bounding boxes allows a small vocabulary while achieving high precision.
  • For example, a 600×600 image requires only 600 bins to achieve zero quantization error. This is much smaller than modern language models with vocabulary sizes of 32K or higher (as in BERT or GPT).

The effect of different levels of quantization is shown above. With a relatively small number of bins, such as 500 (on the order of one pixel per bin), high precision is achieved even for small objects.
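The coordinate quantization described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code; `nbins`, the image size, and the class index below are illustrative values:

```python
def quantize(coord, img_size, nbins=500):
    """Map a continuous pixel coordinate to a discrete bin index in [0, nbins-1]."""
    return min(int(coord / img_size * nbins), nbins - 1)

def dequantize(token, img_size, nbins=500):
    """Map a bin index back to a pixel coordinate (the bin center)."""
    return (token + 0.5) / nbins * img_size

# A bounding box [ymin, xmin, ymax, xmax] plus a class index c becomes 5 tokens.
box = [123.4, 56.7, 410.0, 389.2]
tokens = [quantize(v, 640) for v in box] + [17]  # 17 = hypothetical class index
```

The round-trip error is bounded by the bin width (`img_size / nbins`), which is why a few hundred bins already suffice for pixel-level precision.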

3. Random Ordering Strategy

  • Multiple object descriptions are serialized to form a single sequence for a given image.
  • A random ordering strategy is used: the order of objects is randomized each time an image is shown.
  • Finally, because different images often have different numbers of objects, the generated sequences will have different lengths. To indicate the end of a sequence, we therefore incorporate an EOS token.
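The serialization above (random object order, then an EOS token) can be sketched as follows; the function name and `eos_token` value are illustrative assumptions, not from the paper:

```python
import random

def build_sequence(objects, eos_token=0, seed=None):
    """Serialize quantized [ymin, xmin, ymax, xmax, class] token groups in a
    random order, then append an EOS token to mark the end of the sequence."""
    rng = random.Random(seed)
    objs = list(objects)
    rng.shuffle(objs)  # random ordering: reshuffled each time the image is seen
    seq = [tok for obj in objs for tok in obj]  # each object contributes 5 tokens
    seq.append(eos_token)
    return seq
```

An image with N objects thus yields a sequence of 5N + 1 tokens, so sequence length varies across images, which is exactly why the EOS token is needed.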

4. Architecture, Objective, and Inference

4.1. Architecture

  • An encoder-decoder architecture is used.
  • The encoder can be a general image encoder that perceives pixels and encodes them into hidden representations; it can be a ConvNet, a Transformer, or a hybrid of the two.
  • For generation, a Transformer decoder is used to generate one token at a time, conditioned on the preceding tokens and the encoded image representation.
  • Thus, Pix2Seq can use a ResNet backbone followed by a 6-layer Transformer encoder and a 6-layer (causal) Transformer decoder.

4.2. Objective

  • Similar to language modeling, Pix2Seq is trained to predict tokens, given an image and preceding tokens, with a maximum likelihood loss:

maximize Σ_{j=1}^{L} w_j log P(ỹ_j | x, y_{1:j−1})

  • where x is a given image, y and ỹ are the input and target sequences associated with x, and L is the target sequence length.
  • w_j is a pre-assigned weight for the j-th token in the sequence; all weights are set to 1 here.
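The objective can be written out numerically. A pure-Python sketch, assuming the per-token model probabilities are already given (the function name is illustrative):

```python
import math

def weighted_nll(token_probs, weights):
    """Negative of the weighted log-likelihood: -sum_j w_j * log P(y_j | x, y_<j).
    token_probs[j] is the model's probability of the j-th target token."""
    return -sum(w * math.log(p) for p, w in zip(token_probs, weights))

# A weight of 0 masks a token out of the loss (used later for noise-object coordinates).
loss = weighted_nll([0.9, 0.8, 0.95], [1.0, 1.0, 0.0])
```

In practice this is a softmax cross-entropy over the token vocabulary; the per-token weights become relevant in the sequence augmentation section below.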

4.3. Inference

  • At inference time, tokens are sampled from the model likelihood.
  • This can be done either by taking the token with the largest likelihood (arg max sampling) or by using other stochastic sampling techniques. It is found that nucleus sampling (Holtzman et al., 2019) leads to higher recall than arg max sampling.
  • Once the sequence is generated, it is straightforward to extract and de-quantize the object descriptions.
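Nucleus (top-p) sampling restricts sampling to the smallest set of highest-probability tokens whose cumulative mass exceeds p, then renormalizes. A minimal sketch over a plain probability list (not the paper's implementation):

```python
import random

def nucleus_sample(probs, p=0.9, rng=random):
    """Top-p / nucleus sampling (Holtzman et al., 2019): sample from the
    smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, total = [], 0.0
    for i in order:
        nucleus.append(i)
        total += probs[i]
        if total >= p:
            break
    # sample proportionally within the (renormalized) nucleus
    r = rng.random() * sum(probs[i] for i in nucleus)
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

With p → 0 this degenerates to arg max sampling; larger p admits more of the tail, which is what buys the extra recall over arg max.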

5. Sequence Augmentation

  • The EOS token allows the model to decide when to terminate generation. But in practice, it is found that the model tends to finish without predicting all objects.
  • This is likely due to:
  1. Annotation noise (e.g., where annotators did not identify all the objects),
  2. Uncertainty in recognizing or localizing some objects.

5.1. Sequence Augmentation

  • To encourage higher recall, one trick is to delay the sampling of the EOS token by artificially decreasing its likelihood; however, this tends to produce noisy and duplicated predictions, so sequence augmentation is introduced instead.
  • With sequence augmentation, input tokens are constructed to include both real objects (blue) and synthetic noise objects (orange). For the noise objects, the model is trained to identify them as the “noise” class, and the loss weight of “n/a” tokens (corresponding to coordinates of noise objects) is set to zero since we do not want the model to mimic them.

5.2. Altered Sequence Construction

  • Synthetic noise objects are generated to augment input sequences in the following two ways:
  1. Adding noise to existing ground-truth objects (e.g., random scaling or shifting their bounding boxes), and
  2. Generating completely random boxes (with randomly associated class labels). These objects are appended at the end.
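The two ways of synthesizing noise boxes can be sketched in pure Python. This is an illustrative reading of the scheme: the 50/50 mixing ratio, the jitter scale, and the function name are assumptions, and class-label handling (assigning the special "noise" class as the training target) is omitted for brevity:

```python
import random

def synthesize_noise_objects(gt_boxes, n_noise, img_size, rng=random):
    """Create noise boxes by (1) jittering ground-truth boxes and
    (2) generating completely random boxes. Boxes are [ymin, xmin, ymax, xmax]."""
    noise = []
    for _ in range(n_noise):
        if gt_boxes and rng.random() < 0.5:  # assumed mixing ratio
            # (1) add noise to an existing ground-truth box (shift/scale)
            ymin, xmin, ymax, xmax = rng.choice(gt_boxes)
            s = 0.1 * img_size  # assumed jitter scale
            y0, y1 = sorted(min(max(v + rng.uniform(-s, s), 0.0), img_size)
                            for v in (ymin, ymax))
            x0, x1 = sorted(min(max(v + rng.uniform(-s, s), 0.0), img_size)
                            for v in (xmin, xmax))
            noise.append([y0, x0, y1, x1])
        else:
            # (2) a completely random box
            ys = sorted(rng.uniform(0, img_size) for _ in range(2))
            xs = sorted(rng.uniform(0, img_size) for _ in range(2))
            noise.append([ys[0], xs[0], ys[1], xs[1]])
    return noise
```

These synthetic boxes are appended after the real objects in the input sequence, with loss weight zero on their coordinate tokens as described above.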

6. Experimental Results

6.1. Training from Scratch on COCO

Overall, Pix2Seq achieves results competitive with both baselines. The proposed model performs comparably to Faster R-CNN on small and medium objects, but better on larger objects.

Compared with DETR, the proposed model performs comparably or slightly worse on large and medium objects, but substantially better (4–5 AP) on small objects.

6.2. Pretrain on Objects365 and Finetune on COCO

The performance of Objects365-pretrained Pix2Seq models is strong across various model sizes and image sizes. The best result (with a 1333 image size) is 50 AP, which is 5 AP higher than the best model trained from scratch.

  • The pretrain+finetune process is faster than training from scratch, and also generalizes better.

6.3. Ablation Studies

  • (a): The plot indicates that quantization to 500 bins or more is sufficient; with 500 bins there are approximately 1.3 pixels per bin, which does not introduce significant approximation error.
  • (b) and (c): Both in terms of precision and recall, the random ordering yields the best performance. It is conjectured that with deterministic ordering, it may be difficult for the model to recover from mistakes of missing objects made earlier on, while with random ordering it would still be possible to retrieve them later.
  • Left: Without sequence augmentation, the AP is only marginally worse if the sampling of the EOS token is delayed during inference (via likelihood offsetting), but the recall at the optimal AP is significantly worse.
  • Right: Without sequence augmentation, AP is not significantly affected, while recall is significantly worse. It is also worth noting that sequence augmentation is mainly effective during fine-tuning.

6.4. Visualization

  • The cross attention (averaged over layers and heads) is visualized as the model predicts a new token.

One can see that the attention is very diverse when predicting the first coordinate token (i.e., ymin), but then quickly concentrates and fixates on the object.

  • The decoder pays the most attention to the object when predicting the class token.
  • Detection results of one Pix2Seq model (with 46 AP) are visualized on a subset of COCO validation images containing crowded sets of objects.

A simple and generic framework for object detection is proposed, though the augmentation steps still integrate some prior knowledge about the task.


