Review — DiT: Self-supervised Pre-training for Document Image Transformer

DiT Uses ViT, BEiT and DALL·E Ideas

Sik-Ho Tsang
5 min read · Feb 21


DiT: Self-supervised Pre-training for Document Image Transformer,
DiT, by Shanghai Jiao Tong University, Microsoft Research Asia, and Microsoft Azure AI
2022 ACM MM, Over 15 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Image Classification, Object Detection
==== My Other Paper Readings Are Also Over Here ====

  • DiT is a self-supervised pre-trained Document Image Transformer model using large-scale unlabeled text images for Document AI tasks, including document image classification, document layout analysis, table detection, and text detection for OCR.


  1. Document Image Transformer (DiT)
  2. Results

1. Document Image Transformer (DiT)

The model architecture of DiT with MIM pre-training.

1.1. Model Architecture

  • Following ViT, the vanilla Transformer architecture is used as the backbone of DiT.
  • A document image is divided into non-overlapping patches to obtain a sequence of patch embeddings.
  • After adding the 1D position embedding, these image patches are passed into a stack of Transformer blocks with multi-head attention.
  • Finally, the output of the Transformer encoder is taken as the representation of image patches.
  • DiT-B model has the same architecture as the ViT base: a 12-layer Transformer with 768 hidden sizes, and 12 attention heads. The intermediate size of feed-forward networks is 3,072.
  • A larger version, DiT-L, is also trained with 24 layers, 1,024 hidden sizes, and 16 attention heads. The intermediate size of feed-forward networks is 4,096.
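To make the shapes concrete, here is a minimal NumPy sketch of the non-overlapping patch split together with the two configurations listed above (the `patchify` helper and the config dict are illustrative, not from the paper's released code):

```python
import numpy as np

# DiT-B and DiT-L follow standard ViT configurations (numbers from the paper).
DIT_CONFIGS = {
    "DiT-B": dict(layers=12, hidden=768, heads=12, ffn=3072),
    "DiT-L": dict(layers=24, hidden=1024, heads=16, ffn=4096),
}

def patchify(img, patch=16):
    """Split an HxWxC image into a sequence of non-overlapping patches."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    x = img.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    # (num_patches, patch*patch*c); each row is then linearly projected
    # to the `hidden` dimension before entering the Transformer blocks
    return x
```

With 16×16 patches at 224×224 resolution, this yields the familiar 14×14 = 196-token sequence that the Transformer encoder consumes.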

1.2. Pretraining

  • Inspired by BEiT, Masked Image Modeling (MIM) is used for pre-training.
  • During pre-training, DiT accepts the image patches as input and predicts the visual tokens with the output representation.
  • BEiT uses the discrete variational auto-encoder (dVAE) from DALL·E as the image tokenizer, which is trained on a large data collection including 400 million images. However, there exists a domain mismatch between natural images and document images, which makes the DALL·E tokenizer not appropriate for the document images.
  • Therefore, to get better discrete visual tokens for the document image domain, the dVAE is trained on the IIT-CDIP dataset that includes 42 million document images.
  • The new dVAE tokenizer is trained with a combination of an MSE loss to reconstruct the input image and a perplexity loss to increase the use of the quantized codebook representations.
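A hedged sketch of such a combined objective (the weight `lam` and the exact perplexity formulation are assumptions here; the paper only names the two loss terms):

```python
import numpy as np

def dvae_loss(x, x_recon, code_probs, lam=1.0):
    """Sketch of the tokenizer objective: MSE reconstruction plus a term
    encouraging uniform codebook usage. `lam` and the entropy-based
    perplexity term are illustrative assumptions, not from the paper.
    code_probs: (num_tokens, codebook_size) soft codebook assignments."""
    mse = np.mean((x - x_recon) ** 2)
    avg_use = code_probs.mean(axis=0)  # average usage of each codebook entry
    # negative entropy of average usage: minimized when usage is uniform,
    # i.e. when codebook perplexity is maximal
    neg_entropy = np.sum(avg_use * np.log(avg_use + 1e-9))
    return mse + lam * neg_entropy
```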
Document image reconstruction with different tokenizers. From left to right: the original document image, image reconstruction using the self-trained dVAE tokenizer, image reconstruction using the DALL·E tokenizer.
  • A better tokenizer produces more accurate tokens that better describe the original images, as shown above.
  • Given a sequence of image patches, a subset of the inputs is randomly masked with a special [MASK] token.
  • The model is required to predict the index of visual tokens with the output from masked positions. Instead of predicting the raw pixels, the masked image modeling task requires the model to predict the discrete visual tokens obtained by the image tokenizer.
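The masking step can be sketched as follows (the 40% mask ratio and uniform sampling are illustrative assumptions; what matters is that the model then predicts the dVAE token index at each masked position, a classification over the codebook rather than pixel regression):

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(num_patches=196, mask_ratio=0.4):
    """Randomly select patch positions to replace with the [MASK] token.
    The ratio and uniform sampling here are assumptions for illustration."""
    n_mask = int(num_patches * mask_ratio)
    masked = rng.choice(num_patches, size=n_mask, replace=False)
    is_masked = np.zeros(num_patches, dtype=bool)
    is_masked[masked] = True
    return is_masked

# At masked positions, the MIM head predicts the discrete visual-token index
# produced by the dVAE tokenizer (cross-entropy over the codebook),
# rather than regressing the raw pixels.
```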

1.3. Fine-Tuning

Illustration of applying DiT as the backbone network in different detection frameworks.
  • There are four Document AI benchmarks, including the RVL-CDIP dataset for document image classification, the PubLayNet dataset for document layout analysis, the ICDAR 2019 cTDaR dataset for table detection, and the FUNSD dataset for text detection.
  • These benchmark datasets can be formalized as two common tasks: image classification and object detection.
  • For image classification, average pooling is used to aggregate the representation of image patches. Next, the global representation is passed into a simple linear classifier.
  • For object detection, Mask R-CNN and Cascade R-CNN are used as detection frameworks, with ViT-based models as the backbone. Resolution-modifying modules at four different Transformer blocks are used to adapt the single-scale ViT to the multi-scale FPN, as shown above.
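The classification head described above fits in a few lines (shapes and names here are ours, not from the released code):

```python
import numpy as np

def classify(patch_reps, W, b):
    """Average-pool the patch representations into a global vector,
    then apply a simple linear classifier, as described in the paper.
    patch_reps: (num_patches, hidden); W: (hidden, num_classes); b: (num_classes,)."""
    pooled = patch_reps.mean(axis=0)  # global image representation
    return pooled @ W + b             # class logits
```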
An example of pre-processing with adaptive image binarization on the ICDAR 2019 cTDaR archival subset.
  • On ICDAR 2019 cTDaR archival subset, binarization using OpenCV is used.
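OpenCV's `cv2.adaptiveThreshold` provides this kind of pre-processing; a pure-NumPy sketch of mean adaptive binarization (the block size and offset are illustrative defaults, not the paper's settings) looks like:

```python
import numpy as np

def adaptive_binarize(img, block=15, c=10):
    """Threshold each pixel against the mean of its local block x block
    neighborhood, computed quickly with an integral image."""
    assert block % 2 == 1
    h, w = img.shape
    r = block // 2
    padded = np.pad(img.astype(np.float64), r, mode="edge")
    # integral image with a zero row/column prepended
    ii = np.pad(padded, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    # local window sums for every pixel via four corner lookups
    s = (ii[block:, block:] - ii[:-block, block:]
         - ii[block:, :-block] + ii[:-block, :-block])
    mean = s / (block * block)
    # pixels brighter than (local mean - offset) become background (255)
    return np.where(img > mean - c, 255, 0).astype(np.uint8)
```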

2. Results


2.1. RVL-CDIP

Document Image Classification accuracy (%) on RVL-CDIP, where all the models use pure image information (w/o text information) at 224×224 resolution.

DiT-L achieves a score comparable to the previous SOTA ensemble model under the single-model setting, which further highlights its modeling capability on document images.

2.2. PubLayNet

Document Layout Analysis mAP @ IOU [0.50:0.95] on PubLayNet validation set. ResNeXt-101-32×8d is shortened as ResNeXt and Cascade as C.
  • DeiT-B, BEiT-B, and MAE-B are clearly better than ResNeXt-101, and DiT-B is even stronger than these powerful image Transformer baselines.
  • On the basis of DiT-B, DiT-L achieves a much higher mAP score.

Cascade R-CNN is applied on the ResNeXt-101–32×8d baseline, and DiT surpasses it by 1% and 1.4% absolute score for the base and large settings respectively, indicating the superiority of DiT on a different detection framework.

2.3. ICDAR 2019 cTDaR

Table detection accuracy (F1) on ICDAR 2019 cTDaR.
  • (a): DiT-L achieves the highest wF1 score among all Mask R-CNN methods.
  • (b): DiT surpasses all the baselines except BEiT for the archival subset.
  • Under all the three settings, the SOTA results have been pushed to a new level by more than 2% (94.23→96.55) absolute wF1 score.

2.4. FUNSD

Text detection accuracy (IoU@0.5) on FUNSD Task #1, where Mask R-CNN is used with different backbones (ResNeXt, DeiT, BEiT, MAE and DiT). “+syn” denotes that DiT is trained with a synthetic dataset including 1M document images, then fine-tuned with the FUNSD training data.
  • DiT models achieve new SOTA results compared with other models.
  • Finally, DiT models are further trained with a synthetic dataset that contains 1 million document images, leading to an F1 of 0.9429 being achieved by the DiT-L model.

DiT uses ViT, BEiT and DALL·E ideas for self-supervised pretraining.


[2022 ACM MM] [DiT]
DiT: Self-supervised Pre-training for Document Image Transformer



