Brief Review — Florence: A New Foundation Model for Computer Vision

Florence, Pretrained Using Image Captioning Dataset

Feb 6, 2024

**Florence for Multimodal Computer Vision Tasks**

Florence: A New Foundation Model for Computer Vision
Florence, by Microsoft Cloud and AI; Microsoft Research Redmond
2021 arXiv v1, Over 560 Citations (Sik-Ho Tsang @ Medium)
Visual/Vision/Video Language Model (VLM)
2017 … 2023 [GPT-4] [GPT-4V(ision)] [MultiModal-CoT]

Florence, a computer vision foundation model, is proposed by incorporating universal visual-language representations from Web-scale image-text data.
The Florence model can be easily adapted to various computer vision tasks, such as classification, retrieval, object detection, VQA, image captioning, video retrieval, and action recognition.
Florence-2 was proposed later as a follow-up.
Florence has been used in Azure Cognitive Service for Vision since March 2023.

Outline

1. Florence Pretraining
2. Adaptation to Other Tasks
3. Results

1. Florence

1.1. Pretraining Dataset

A dataset of 900 million image-text pairs, called FLD-900M (FLD stands for FLorenceDataset), is constructed. It contains 900M free-form texts (ranging from a single word or phrase to full sentences), 9.7M unique queries, and 7.5B tokens in total.

1.2. Pretraining

Unified image-text contrastive learning (UniCL) (Yang et al., 2022) is used for contrastive pretraining in an image-label-description space.
Given an image-text pair, a triplet (x, t, y) is generated via a text hash table, where x is the image, t is the language description (i.e., the hash value), and y is the language label (i.e., the hash key).
Thus, all image-text pairs mapped to the same label y are regarded as positive. Others are negative.
fθ and fφ are the image encoder and text encoder, respectively.
u and v are the normalized visual feature vector and language feature vector, respectively.
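The triplet construction can be illustrated with a minimal sketch. The function name `build_triplets`, the lowercase/strip normalization, and the file names are hypothetical; the source does not say how texts are matched or stored.

```python
from typing import Dict, List, Tuple


def build_triplets(
    pairs: List[Tuple[str, str]],
) -> Tuple[List[Tuple[str, str, int]], Dict[int, str]]:
    """Turn (image, text) pairs into (x, t, y) triplets via a text hash table."""
    label_to_text: Dict[int, str] = {}  # hash table: key y -> value t
    text_to_label: Dict[str, int] = {}  # reverse lookup used while building
    triplets = []
    for image, text in pairs:
        # Assumed normalization; the source does not specify one.
        key = text.strip().lower()
        if key not in text_to_label:
            y = len(text_to_label)
            text_to_label[key] = y
            label_to_text[y] = key
        triplets.append((image, text, text_to_label[key]))
    return triplets, label_to_text


pairs = [("img0.jpg", "a dog"), ("img1.jpg", "a cat"), ("img2.jpg", "A dog")]
triplets, table = build_triplets(pairs)
labels = [y for _, _, y in triplets]
# Images 0 and 2 share a label, so each is a positive for the other's text.
```

Under this scheme, two images whose descriptions map to the same key end up with the same label y and are therefore treated as positives for each other.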

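Given a mini-batch B, a bi-directional supervised contrastive learning objective between images and language descriptions is used to train the model:

$$\mathcal{L} = \mathcal{L}_{i2t} + \mathcal{L}_{t2i}$$

This objective contains two contrastive terms: the supervised image-to-language contrastive loss:

$$\mathcal{L}_{i2t} = -\sum_{i \in B} \frac{1}{|P(i)|} \sum_{k \in P(i)} \log \frac{\exp(\tau u_i^{\top} v_k)}{\sum_{j \in B} \exp(\tau u_i^{\top} v_j)}, \quad P(i) = \{k \mid k \in B,\ y_k = y_i\}$$

and the supervised language-to-image contrastive loss:

$$\mathcal{L}_{t2i} = -\sum_{j \in B} \frac{1}{|Q(j)|} \sum_{k \in Q(j)} \log \frac{\exp(\tau u_k^{\top} v_j)}{\sum_{i \in B} \exp(\tau u_i^{\top} v_j)}, \quad Q(j) = \{k \mid k \in B,\ y_k = y_j\}$$

Here τ is a temperature parameter that scales the similarities, and P(i) and Q(j) are the sets of positives that share the label of image i and text j, respectively.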
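A minimal pure-Python sketch of this bi-directional supervised contrastive objective is given here. It assumes that u and v are already L2-normalized, that positives are the in-batch samples sharing a label y, and that `tau` multiplies the similarities as a logit scale. The function names are hypothetical and this is not the authors' implementation.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def log_softmax(scores: List[float], index: int) -> float:
    """Numerically stable log(exp(scores[index]) / sum(exp(scores)))."""
    m = max(scores)
    return scores[index] - m - math.log(sum(math.exp(s - m) for s in scores))


def contrastive_loss(u: List[Vector], v: List[Vector], y: List[int],
                     tau: float = 1.0) -> float:
    """Sum of image-to-language and language-to-image supervised losses."""
    n = len(y)
    # logits[i][j] = tau * <u_i, v_j>
    logits = [[tau * dot(u[i], v[j]) for j in range(n)] for i in range(n)]
    loss_i2t = 0.0
    loss_t2i = 0.0
    for a in range(n):
        # Positives: every batch index carrying the same label as sample a.
        positives = [k for k in range(n) if y[k] == y[a]]
        row = logits[a]                         # image a against all texts
        col = [logits[i][a] for i in range(n)]  # text a against all images
        loss_i2t -= sum(log_softmax(row, k) for k in positives) / len(positives)
        loss_t2i -= sum(log_softmax(col, k) for k in positives) / len(positives)
    return loss_i2t + loss_t2i


u = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(u, v, [0, 1], tau=10.0)  # matched pairs, sharp logits
uniform = contrastive_loss(u, v, [0, 1], tau=0.0)   # all logits equal
```

With perfectly aligned pairs and a large scale, the loss approaches zero. With `tau=0`, every candidate is equally likely, so each of the four terms equals log 2.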

To mitigate the negative effect of augmented prompts, training is separated into two stages. In the first stage, all data, including augmented texts, are used for training; in the second stage, all augmented data are excluded and training continues on the original texts.

1.3. Model

The pretrained Florence model uses a two-tower architecture: a 12-layer Transformer as the language encoder, similar to CLIP, and a hierarchical Vision Transformer, specifically the CoSwin Transformer, as the image encoder. CoSwin uses the convolutional embedding layers described in CvT.
Two linear projection layers are added on top of the image encoder and language encoder to match the dimensions of image and language features.
The model has 893M parameters in total: 256M in the language Transformer and 637M in the CoSwin-H Transformer. Training takes 10 days on 512 NVIDIA A100 GPUs with 40GB of memory per GPU.
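The role of the two linear projection layers can be sketched as follows. The toy dimensions (3-d image feature, 2-d text feature, 2-d shared space) and the weight matrices `W_img` and `W_txt` are hypothetical values chosen only to make the example easy to check by hand.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(W: Matrix, x: Vector) -> Vector:
    """Bias-free linear projection W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def l2_normalize(x: Vector) -> Vector:
    norm = math.sqrt(sum(xi * xi for xi in x))
    return [xi / norm for xi in x]


# Hypothetical projection heads mapping both towers into a shared 2-d space.
W_img: Matrix = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 3-d image feature -> 2-d
W_txt: Matrix = [[0.0, 1.0], [1.0, 0.0]]            # 2-d text feature -> 2-d

image_feat = [3.0, 4.0, 5.0]  # stand-in for the image encoder output
text_feat = [4.0, 3.0]        # stand-in for the language encoder output

u = l2_normalize(linear(W_img, image_feat))
v = l2_normalize(linear(W_txt, text_feat))
similarity = sum(a * b for a, b in zip(u, v))  # cosine similarity in shared space
```

Once both features live in the same space and are normalized, their dot product is the cosine similarity that contrastive training operates on.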

1.4. Scalable Training Infrastructure

As the model is large, Zero Redundancy Optimizer (ZeRO), Activation Checkpointing, Mixed-precision Training, and Gradient Cache are used.
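To give a rough sense of why ZeRO helps at this scale, the sketch below applies the model-state accounting from the ZeRO paper: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for the fp32 Adam states. It assumes mixed-precision training with an Adam-style optimizer and ignores activations and buffers. The source does not say which ZeRO stage Florence uses, so all stages are shown; the function name is hypothetical.

```python
def zero_bytes_per_gpu(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate model-state bytes per GPU under ZeRO stages 0 to 3.

    Stage 0 is plain data parallelism (everything replicated).
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2.0 * num_params  # fp16 parameters
    grads = 2.0 * num_params   # fp16 gradients
    optim = 12.0 * num_params  # fp32 parameters, momentum, variance
    if stage >= 1:
        optim /= num_gpus      # partition optimizer states
    if stage >= 2:
        grads /= num_gpus      # also partition gradients
    if stage >= 3:
        params /= num_gpus     # also partition parameters
    return params + grads + optim


NUM_PARAMS = 893e6  # total model size reported above
NUM_GPUS = 512      # GPU count reported above

per_stage_gb = {
    stage: zero_bytes_per_gpu(NUM_PARAMS, NUM_GPUS, stage) / 1e9
    for stage in range(4)
}
```

Under these assumptions, replicated model states alone take roughly 14.3 GB of a 40GB GPU before any activations are stored, which is also why activation checkpointing and mixed precision matter here.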

2. Adaptation to Other Tasks

2.1. Object Detection

A Dynamic Head adapter (Dai et al., 2021a) (or Dynamic DETR (Dai et al., 2021b)) is added so that Florence learns fine-grained (i.e., object-level) representations. Dynamic Head is trained with the one-stage ATSS framework and losses.
A large-scale object detection dataset, called FLOD-9M (for FLorence Object detection Dataset), is curated for object detection pre-training.

2.2. Fine-Grained V+L Representation Learning

METER (Dou et al., 2021) adapter is used to expand to fine-grained vision-language representation.
The two modalities are fused together to learn the contextual representation.
The model is first trained with the image-text matching loss and the masked-language modeling loss, and then fine-tuned on a downstream task, such as VQA.

2.3. Adaptation to Video Recognition

The video CoSwin adapter replaces the 2D convolutional layers in the tokenization layer of CoSwin (Section 1.3) with 3D convolutional layers, which convert each 3D tube into one token.
A 3D convolution-based patch merging operator is also used.
The 2D shifted window design is replaced with 3D shifted local windows in the self-attention layers.
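The effect of tube tokenization on sequence length can be checked with a small calculation. The tube size (2, 4, 4) and the 16-frame 224×224 clip are hypothetical values, not taken from the source, and the sketch assumes non-overlapping tubes (stride equal to kernel size). With overlapping CvT-style embeddings, the stride determines the count instead.

```python
from typing import Tuple


def num_tokens(frames: int, height: int, width: int,
               tube: Tuple[int, int, int]) -> int:
    """Number of tokens when each non-overlapping t x h x w tube becomes one token."""
    t, h, w = tube
    if frames % t or height % h or width % w:
        raise ValueError("clip size must be divisible by the tube size")
    return (frames // t) * (height // h) * (width // w)


# Hypothetical clip: 16 frames at 224 x 224.
video_tokens = num_tokens(16, 224, 224, tube=(2, 4, 4))
# Same clip tokenized frame by frame with 2D 4 x 4 patches (tube depth 1).
framewise_tokens = num_tokens(16, 224, 224, tube=(1, 4, 4))
```

In this example, grouping two frames per tube halves the token count relative to frame-by-frame 2D tokenization.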

3. Results

3.1. Image Classification

Florence outperforms state-of-the-art methods on 9 of 12 tasks. It achieves a remarkable improvement in zero-shot transfer on ImageNet-1K: a top-1 accuracy of 83.74% (+5.6% over the SOTA result) and a top-5 accuracy of 97.18%.
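For readers unfamiliar with how zero-shot top-1 and top-5 numbers are obtained from a two-tower model, here is a minimal sketch. The class embeddings stand in for text-encoder outputs of class-name prompts and the image embeddings for image-encoder outputs. All vectors, class names, and function names are hypothetical toy values.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def top_k_labels(image_emb: Vector, class_embs: Dict[str, Vector], k: int) -> List[str]:
    """Rank class names by dot-product similarity to the image embedding."""
    scores = {
        name: sum(a * b for a, b in zip(image_emb, emb))
        for name, emb in class_embs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


def top_k_accuracy(samples: List[Tuple[Vector, str]],
                   class_embs: Dict[str, Vector], k: int) -> float:
    """Fraction of samples whose true label is among the k best-scoring classes."""
    hits = sum(label in top_k_labels(emb, class_embs, k) for emb, label in samples)
    return hits / len(samples)


# Toy unit-norm embeddings; no training is involved at classification time.
class_embs = {"dog": [1.0, 0.0], "cat": [0.0, 1.0], "fox": [0.6, 0.8]}
samples = [([0.8, 0.6], "dog"), ([0.0, 1.0], "cat")]

top1 = top_k_accuracy(samples, class_embs, k=1)
top2 = top_k_accuracy(samples, class_embs, k=2)
```

The first toy image is closest to "fox", so it is a top-1 miss but a top-2 hit. This is why top-5 accuracy is always at least as high as top-1.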

3.2. Image-Text Retrieval

Zero-shot Florence matches or outperforms all prior zero-shot results on the two evaluated datasets, Flickr30K and MSCOCO.

3.3. Object Detection

Florence outperforms 5-shot fine-tuning on 7 of 11 tasks, and outperforms full-set fine-tuning on the “Packages” dataset, which has only 26 training images.

3.4. V+L Representation Learning

Compared with SimVLM, which uses 1.8B image-text pairs, Florence uses only 900M pairs to pre-train the image encoder and 20M for VLP, yet achieves better results.