Brief Review — Florence: A New Foundation Model for Computer Vision
Florence, Pretrained Using Image Captioning Dataset
Florence: A New Foundation Model for Computer Vision
Florence, by Microsoft Cloud and AI; Microsoft Research Redmond
2021 arXiv v1, Over 560 Citations (Sik-Ho Tsang @ Medium)
- Florence, a computer vision foundation model, is proposed, which learns universal visual-language representations from Web-scale image-text data.
- The Florence model can be easily adapted to various computer vision tasks, such as classification, retrieval, object detection, VQA, image captioning, video retrieval, and action recognition.
- Later, Florence-2 was proposed as well.
- Florence has been used in Azure Cognitive Service for Vision since March 2023.
Outline
- Florence Pretraining
- Adaptation to Other Tasks
- Results
1. Florence
1.1. Pretraining Dataset
- A 900 million image-text-pair dataset, called FLD-900M (FLD stands for FLorence Dataset), is constructed, with 900M free-form texts (ranging from one word or phrase to sentences), 9.7M unique queries, and 7.5B tokens in total.
1.2. Pretraining
- A unified image-text contrastive learning (UniCL) (Yang et al., 2022) is utilized for contrastive pretraining in an image-label-description space.
- Given an image-text pair, a triplet (x, t, y) is generated via a text hash table, where x is the image, t is the language description (i.e., the hash value), and y is the language label (i.e., the hash key).
- Thus, all image-text pairs mapped to the same label y are regarded as positive; all other pairs are negative.
- fθ and fφ are the image encoder and text encoder, respectively.
- u and v are the normalized visual feature vector and language feature vector, respectively.
Given a mini-batch B, a bi-directional supervised contrastive learning objective between images and language descriptions is used to train the model; the formulas and a code sketch are given after this list.
- This objective contains two contrastive terms: a supervised image-to-language contrastive loss and a supervised language-to-image contrastive loss.
- To mitigate the negative effect of the augmented prompts, training is separated into two stages: in the first stage, all data, including augmented texts, are used for training; in the second stage, the augmented data are excluded for continued training.
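Following the UniCL formulation (Yang et al., 2022), the two contrastive terms can be written as below, where τ is a learnable temperature-scaling parameter and P(i) = {k ∈ B | y_k = y_i} is the set of positive samples for the i-th element of the batch:

```latex
\mathcal{L}_{i2t} = -\sum_{i \in B} \frac{1}{|P(i)|} \sum_{k \in P(i)}
  \log \frac{\exp(\tau\, u_i^{\top} v_k)}{\sum_{j \in B} \exp(\tau\, u_i^{\top} v_j)}

\mathcal{L}_{t2i} = -\sum_{j \in B} \frac{1}{|P(j)|} \sum_{k \in P(j)}
  \log \frac{\exp(\tau\, u_k^{\top} v_j)}{\sum_{i \in B} \exp(\tau\, u_i^{\top} v_j)}

\mathcal{L} = \mathcal{L}_{i2t} + \mathcal{L}_{t2i}
```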
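To make the objective concrete, below is a minimal PyTorch sketch of this label-based bidirectional contrastive loss. It assumes u and v are already L2-normalized image and text features and y holds the hash-table labels; the function name, scale value, and batch reduction are illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def unified_contrastive_loss(u, v, y, scale=100.0):
    # u, v: (B, d) normalized image / text features; y: (B,) hash-table labels.
    # scale plays the role of the learnable temperature parameter tau.
    logits = scale * (u @ v.t())                          # (B, B) similarity matrix
    # Pair (i, j) is positive whenever the two samples share the same label.
    pos = (y.unsqueeze(1) == y.unsqueeze(0)).float()

    log_p_i2t = F.log_softmax(logits, dim=1)              # image -> text direction
    log_p_t2i = F.log_softmax(logits.t(), dim=1)          # text -> image direction

    # Average the log-likelihood over all positives of each anchor, then combine.
    loss_i2t = -(pos * log_p_i2t).sum(1) / pos.sum(1)
    loss_t2i = -(pos.t() * log_p_t2i).sum(1) / pos.t().sum(1)
    return (loss_i2t + loss_t2i).mean()

# Toy usage with random features and labels.
B, d = 8, 32
u = F.normalize(torch.randn(B, d), dim=-1)
v = F.normalize(torch.randn(B, d), dim=-1)
y = torch.randint(0, 5, (B,))
print(unified_contrastive_loss(u, v, y))
```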
1.3. Model
- The Florence pretrained model uses a two-tower architecture: a 12-layer Transformer as the language encoder, similar to CLIP, and a hierarchical Vision Transformer, specifically the CoSwin Transformer, as the image encoder, which uses the convolutional embedding layers described in CvT.
- Two linear projection layers are added on top of the image encoder and language encoder to match the dimensions of image and language features.
- The total model size is 893M parameters, including the language Transformer with 256M parameters and the CoSwin-H Transformer with 637M parameters. The model takes 10 days to train on 512 NVIDIA A100 GPUs with 40GB memory each.
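As a toy illustration of this two-tower layout with projection heads, the sketch below uses stand-in MLP towers in place of the 12-layer Transformer and CoSwin-H encoders; all dimensions here are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Encoder followed by a linear projection into the shared embedding space."""
    def __init__(self, in_dim, enc_dim, proj_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, enc_dim), nn.GELU())  # stand-in encoder
        self.proj = nn.Linear(enc_dim, proj_dim)                             # projection layer

    def forward(self, x):
        # L2-normalize so image and text features live on the same unit sphere.
        return F.normalize(self.proj(self.encoder(x)), dim=-1)

image_tower = Tower(in_dim=2048, enc_dim=1024, proj_dim=512)   # plays the role of f_theta
text_tower = Tower(in_dim=768, enc_dim=512, proj_dim=512)      # plays the role of f_phi

u = image_tower(torch.randn(8, 2048))   # normalized visual features
v = text_tower(torch.randn(8, 768))     # normalized language features
print(u.shape, v.shape)                 # both (8, 512): dimensions match
```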
1.4. Scalable Training Infrastructure
- As the model is large, Zero Redundancy Optimizer (ZeRO), Activation Checkpointing, Mixed-precision Training, and Gradient Cache are used.
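As a rough PyTorch illustration of two of these techniques, the sketch below combines activation checkpointing with mixed-precision training on a toy module; ZeRO and gradient caching need a distributed setup and are omitted. The module, shapes, and hyperparameters are assumptions for illustration, not the actual training code.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
blocks = nn.ModuleList(Block() for _ in range(4)).to(device)
opt = torch.optim.AdamW(blocks.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # loss scaling for fp16

x = torch.randn(16, 256, device=device)
with torch.autocast(device_type=device,
                    dtype=torch.float16 if device == "cuda" else torch.bfloat16):
    h = x
    for blk in blocks:
        # Activation checkpointing: recompute activations in the backward pass
        # instead of storing them, trading compute for memory.
        h = checkpoint(blk, h, use_reentrant=False)
    loss = h.pow(2).mean()

scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
```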
2. Adaptation to Other Tasks
2.1. Object Detection
- An adapter, Dynamic Head (Dai et al., 2021a) (or Dynamic DETR (Dai et al., 2021b)), is added to extend Florence to learn fine-grained (i.e., object-level) representations. Dynamic Head is trained with the one-stage ATSS framework and losses.
- A large-scale object detection dataset, called FLOD-9M (for FLorence Object detection Dataset), is curated for object detection pre-training.
2.2. Fine-Grained V+L Representation Learning
- The METER (Dou et al., 2021) adapter is used to extend Florence to fine-grained vision-language representation learning.
- The two modalities are fused together to learn the contextual representation.
- The model is first trained with the image-text matching loss and the masked language modeling loss, then fine-tuned on downstream tasks such as VQA, as sketched below.
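A minimal sketch of these two pretraining objectives on top of fused multimodal features is shown below; the heads, dimensions, and masking ratio are hypothetical and only illustrate how the two losses are combined.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, vocab = 64, 1000
fused = torch.randn(8, 16, hidden)        # fused multimodal token features (batch, seq, dim)
itm_head = nn.Linear(hidden, 2)           # image-text matching: matched vs. not matched
mlm_head = nn.Linear(hidden, vocab)       # masked language modeling: predict masked token ids

itm_labels = torch.randint(0, 2, (8,))            # 1 = the image and the text match
mlm_labels = torch.randint(0, vocab, (8, 16))     # ground-truth token ids
mlm_mask = torch.rand(8, 16) < 0.15               # positions whose tokens were masked out

itm_loss = F.cross_entropy(itm_head(fused[:, 0]), itm_labels)               # [CLS]-like token
mlm_loss = F.cross_entropy(mlm_head(fused[mlm_mask]), mlm_labels[mlm_mask]) # masked positions only
loss = itm_loss + mlm_loss   # joint pretraining objective before fine-tuning on, e.g., VQA
print(loss)
```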
2.3. Video Recognition
- Video CoSwin replaces the tokenization layer of CoSwin (Section 1.3), changing the 2D convolutional layers to 3D convolutional layers, which convert each 3D tube into one token (see the sketch after this list).
- A 3D convolution-based patch merging operator is also used.
- The 2D shifted-window design is replaced with 3D shifted local windows in the self-attention layers, and so on.
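The 3D tube tokenization can be illustrated with a single Conv3d layer, as sketched below; the tube size (2×4×4 in time × height × width) and embedding dimension (96) are assumptions for illustration and may differ from the actual Video CoSwin settings.

```python
import torch
import torch.nn as nn

video = torch.randn(1, 3, 16, 224, 224)   # (batch, channels, frames, height, width)
# Kernel size equals stride, so each non-overlapping 2x4x4 tube becomes one token.
tokenizer = nn.Conv3d(in_channels=3, out_channels=96,
                      kernel_size=(2, 4, 4), stride=(2, 4, 4))

tokens = tokenizer(video)                   # (1, 96, 8, 56, 56)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 8*56*56, 96) sequence of tube tokens
print(tokens.shape)
```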
3. Results
3.1. Image Classification
Florence outperforms state-of-the-art methods on 9 of 12 classification tasks. It achieves a remarkable improvement in zero-shot transfer on ImageNet-1K, with a top-1 accuracy of 83.74% (+5.6% over the SOTA result) and a top-5 accuracy of 97.18%.
3.2. Image-Text Retrieval
Zero-shot Florence matches or outperforms all prior zero-shot results on the two retrieval datasets, Flickr30k and MSCOCO.
3.3. Object Detection
Florence outperforms in 7 of 11 tasks under 5-shot fine-tuning, and outperforms full-set fine-tuning on the “Packages” dataset, which has only 26 training images.
3.4. V+L Representation Learning
Compared with SimVLM, which uses 1.8B image-text pairs, Florence uses only 900M data to pre-train the image encoder and 20M for VLP, yet achieves better results.
Florence outperforms all the state-of-the-art methods by a large margin in terms of the R@1 metric.
3.5. Video Action Recognition
Florence results are better than the state-of-the-art by 1.1% and 1.5% on Kinetics-400 and Kinetics-600, respectively.