Review — DiT: Self-supervised Pre-training for Document Image Transformer
DiT: Self-supervised Pre-training for Document Image Transformer
DiT, by Shanghai Jiao Tong University, Microsoft Research Asia, and Microsoft Azure AI
2022 ACM MM, Over 15 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Image Classification, Object Detection
- DiT is a self-supervised pre-trained Document Image Transformer model that uses large-scale unlabeled text images for Document AI tasks, including document image classification, document layout analysis, table detection, and text detection for OCR.
Outline
- Document Image Transformer (DiT)
- Results
1. Document Image Transformer (DiT)
1.1. Model Architecture
- Following ViT, the vanilla Transformer architecture is used as the backbone of DiT.
- A document image is divided into non-overlapping patches, which are linearly projected to obtain a sequence of patch embeddings.
- After adding the 1D position embeddings, these patch embeddings are passed into a stack of Transformer blocks with multi-head attention.
- Finally, the output of the Transformer encoder is taken as the representation of image patches.
- The DiT-B model has the same architecture as ViT-Base: a 12-layer Transformer with a hidden size of 768 and 12 attention heads. The intermediate size of the feed-forward networks is 3,072.
- A larger version, DiT-L, is also trained, with 24 layers, a hidden size of 1,024, and 16 attention heads. The intermediate size of the feed-forward networks is 4,096.
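As a quick illustration of the two configurations above (not the authors' released code), the backbones can be instantiated with timm's generic VisionTransformer; the 224×224 input resolution and 16×16 patch size are assumptions here, following the usual ViT/BEiT setup.

```python
# Minimal sketch of the DiT-B / DiT-L backbone hyper-parameters listed above,
# using timm's VisionTransformer. Input resolution and patch size are assumed.
from timm.models.vision_transformer import VisionTransformer

dit_b = VisionTransformer(
    img_size=224, patch_size=16,            # assumed input resolution / patch size
    embed_dim=768, depth=12, num_heads=12,  # hidden size 768, 12 layers, 12 heads
    mlp_ratio=4.0,                          # FFN intermediate size = 4 * 768 = 3,072
    num_classes=0,                          # no classification head during pre-training
)

dit_l = VisionTransformer(
    img_size=224, patch_size=16,
    embed_dim=1024, depth=24, num_heads=16,  # hidden size 1,024, 24 layers, 16 heads
    mlp_ratio=4.0,                           # FFN intermediate size = 4 * 1,024 = 4,096
    num_classes=0,
)
```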
1.2. Pretraining
- Inspired by BEiT, Masked Image Modeling (MIM) is used as the pre-training objective.
- During pre-training, DiT takes image patches as input and predicts the corresponding visual tokens from the output representations.
- BEiT uses the discrete variational auto-encoder (dVAE) from DALL·E as the image tokenizer, which is trained on a large data collection including 400 million images. However, there exists a domain mismatch between natural images and document images, which makes the DALL·E tokenizer not appropriate for the document images.
- Therefore, to get better discrete visual tokens for the document image domain, the dVAE is trained on the IIT-CDIP dataset that includes 42 million document images.
- The new dVAE tokenizer is trained with a combination of an MSE loss to reconstruct the input image and a perplexity loss to increase the use of the quantized codebook representations.
- A better tokenizer produces more accurate tokens that better describe the original images, as shown above.
- Given a sequence of image patches, a subset of the inputs is randomly masked with a special [MASK] token.
- The model is required to predict the indices of the visual tokens at the masked positions using the output representations. Instead of predicting raw pixels, the masked image modeling task requires the model to predict the discrete visual tokens obtained from the image tokenizer.
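The bullet points above can be condensed into a short loss sketch. This is only an illustration under stated assumptions, not the released pre-training code: `tokenizer`, `encoder`, and `mim_head` are hypothetical names, the tokenizer is assumed to return a grid of discrete token indices per image, and BEiT-style blockwise masking is simplified to uniform random masking.

```python
# Illustrative MIM objective (not the authors' code). `tokenizer` stands for the
# dVAE trained on IIT-CDIP and returns one discrete token index per patch;
# `encoder` is the DiT backbone, assumed to replace masked patches with a
# learnable [MASK] embedding; `mim_head` maps hidden states to codebook logits.
import torch
import torch.nn.functional as F

def mim_loss(encoder, tokenizer, mim_head, images, mask_ratio=0.4):
    with torch.no_grad():
        target_ids = tokenizer(images)                    # (B, N) visual-token indices

    B, N = target_ids.shape
    # Uniform random masking here; BEiT/DiT use blockwise masking in practice.
    mask = torch.rand(B, N, device=images.device) < mask_ratio

    patch_repr = encoder(images, bool_masked_pos=mask)    # (B, N, hidden)
    logits = mim_head(patch_repr)                         # (B, N, codebook_size)

    # Cross-entropy only over the masked positions.
    return F.cross_entropy(logits[mask], target_ids[mask])
```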
1.3. Fine-Tuning
- There are four Document AI benchmarks, including the RVL-CDIP dataset for document image classification, the PubLayNet dataset for document layout analysis, the ICDAR 2019 cTDaR dataset for table detection, and the FUNSD dataset for text detection.
- These benchmark datasets can be formalized as two common tasks: image classification and object detection.
- For image classification, average pooling is used to aggregate the representations of the image patches. Next, the global representation is passed into a simple linear classifier (a minimal sketch is given after this list).
- For object detection, Mask R-CNN and Cascade R-CNN are used as detection frameworks, and ViT-based models are used as the backbone. Resolution-modifying modules at four different Transformer blocks are used to adapt the single-scale ViT to the multi-scale FPN, as shown above (a sketch of these modules also follows this list).
- On the ICDAR 2019 cTDaR archival subset, image binarization using OpenCV is applied as a pre-processing step.
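As referenced above, here is a minimal sketch of the image-classification head, assuming a backbone that returns one representation per patch; the class name and the 16-way output (RVL-CDIP has 16 categories) are illustrative, not taken from the released fine-tuning code.

```python
# Minimal sketch of the classification fine-tuning head: average-pool the patch
# representations, then apply a simple linear classifier.
import torch.nn as nn

class DiTForImageClassification(nn.Module):
    def __init__(self, backbone, hidden_size=768, num_classes=16):
        super().__init__()
        self.backbone = backbone                  # DiT encoder returning (B, N, hidden)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, images):
        patch_repr = self.backbone(images)        # (B, N, hidden) patch representations
        pooled = patch_repr.mean(dim=1)           # average pooling over patches
        return self.classifier(pooled)            # (B, num_classes) logits
```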
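Also referenced above, a rough sketch of the resolution-modifying idea for detection: features taken at four Transformer depths are rescaled to strides {4, 8, 16, 32} before entering the FPN. The exact block indices and layer choices here are assumptions, not the paper's implementation.

```python
# Rough sketch (assumptions, not the released code) of adapting the single-scale
# ViT feature map (patch stride 16) to the four scales an FPN expects.
import torch.nn as nn

hidden = 768  # DiT-B hidden size

rescale_ops = nn.ModuleList([
    nn.Sequential(                                                   # 4x upsample -> stride 4
        nn.ConvTranspose2d(hidden, hidden, kernel_size=2, stride=2),
        nn.GELU(),
        nn.ConvTranspose2d(hidden, hidden, kernel_size=2, stride=2),
    ),
    nn.ConvTranspose2d(hidden, hidden, kernel_size=2, stride=2),     # 2x upsample -> stride 8
    nn.Identity(),                                                   # keep patch stride -> stride 16
    nn.MaxPool2d(kernel_size=2, stride=2),                           # 2x downsample -> stride 32
])

def to_fpn_inputs(block_features):
    """block_features: four (B, hidden, H/16, W/16) maps taken from four
    evenly spaced Transformer blocks; returns the multi-scale maps fed to
    the FPN of Mask R-CNN / Cascade R-CNN."""
    return [op(feat) for op, feat in zip(rescale_ops, block_features)]
```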
2. Results
2.1. RVL-CDIP
- DiT-L gets a score comparable to the previous SOTA ensemble model under the single-model setting, which further highlights its modeling capability on document images.
2.2. PubLayNet
- DeiT-B, BEiT-B, and MAE-B are obviously better than ResNeXt-101, and DiT-B is even stronger than these powerful image Transformer baselines.
- Building on DiT-B, DiT-L achieves a much higher mAP score.
- When Cascade R-CNN is applied to the ResNeXt-101-32×8d baseline, DiT surpasses it by 1% and 1.4% absolute score for the base and large settings respectively, indicating the superiority of DiT under a different detection framework.
2.3. ICDAR 2019 cTDaR
- (a): DiT-L achieves the highest wF1 score among all Mask R-CNN methods.
- (b): DiT surpasses all the baselines except BEiT for the archival subset.
- Under all the three settings, the SOTA results have been pushed to a new level by more than 2% (94.23→96.55) absolute wF1 score.
2.4. FUNSD
- DiT models achieve new SOTA results compared with other models.
- Finally, DiT models are further trained with a synthetic dataset that contains 1 million document images, with the DiT-L model achieving an F1 of 0.9429.
Reference
[2022 ACM MM] [DiT]
DiT: Self-supervised Pre-training for Document Image Transformer