Brief Review — Florence: A New Foundation Model for Computer Vision

Florence, Pretrained Using Image Captioning Dataset

Sik-Ho Tsang
Feb 6, 2024
Florence for Multimodal Computer Vision Tasks

Florence: A New Foundation Model for Computer Vision, by Microsoft Cloud and AI; Microsoft Research Redmond
2021 arXiv v1, Over 560 Citations (Sik-Ho Tsang @ Medium)

Visual/Vision/Video Language Model (VLM)
2017 … 2023 [GPT-4] [GPT-4V(ision)] [MultiModal-CoT]

  • Florence, a computer vision foundation model, is proposed by incorporating universal visual-language representations from Web-scale image-text data.
  • The Florence model can be easily adapted to various computer vision tasks, such as classification, retrieval, object detection, VQA, image captioning, video retrieval and action recognition.
  • Florence-2 was proposed later as a follow-up.
  • Florence has been used in Azure Cognitive Service for Vision since March 2023.


  1. Florence Pretraining
  2. Adaptation to Other Tasks
  3. Results

1. Florence Pretraining


1.1. Pretraining Dataset

  • A dataset of 900 million image-text pairs, called FLD-900M (FLD stands for FLorenceDataset), is constructed, with 900M free-form texts (ranging from a single word or phrase to full sentences), 9.7M unique queries, and 7.5B tokens in total.

1.2. Pretraining

  • A unified image-text contrastive learning (UniCL) (Yang et al., 2022) is utilized for contrastive pretraining in an image-label-description space.
  • Given an image-text pair, we generate a triplet (x, t, y) via a text hash-table, where x is the image, t is the language description (i.e. hash value), and y is the language label (i.e. hash key).
  • Thus, all image-text pairs mapped to the same label y are regarded as positive. Others are negative.
  • f_θ and f_φ are the image encoder and text encoder, respectively.
  • u and v are the normalized visual feature vector and language feature vector, respectively, i.e. u = f_θ(x)/‖f_θ(x)‖ and v = f_φ(t)/‖f_φ(t)‖.
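The triplet construction can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function name `build_triplets`, the lowercase/strip normalization, and the toy file names are all assumptions made for the example.

```python
def build_triplets(pairs):
    """Turn (image, text) pairs into (x, t, y) triplets.

    label_to_text acts as the text hash-table: the label y is the key and
    the description t is the value. text_to_label is a reverse index used
    only to look up whether a description has been seen before.
    """
    label_to_text = {}
    text_to_label = {}
    triplets = []
    for image, text in pairs:
        # Assumed normalization so trivially different strings share a label.
        canonical = text.strip().lower()
        if canonical not in text_to_label:
            label = len(label_to_text)
            text_to_label[canonical] = label
            label_to_text[label] = canonical
        triplets.append((image, text, text_to_label[canonical]))
    return triplets, label_to_text


# Toy data: two images share the description "dog", so they share a label.
pairs = [("img0.jpg", "dog"), ("img1.jpg", "a cat on a sofa"), ("img2.jpg", "Dog")]
triplets, label_to_text = build_triplets(pairs)
labels = [y for _, _, y in triplets]
```

Here the first and third pairs end up with the same label, so they would be treated as positives for each other during contrastive training.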

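Given a mini-batch B, a bi-directional supervised contrastive learning objective between images and language descriptions is used to train the model:

$$\mathcal{L} = \mathcal{L}_{i2t} + \mathcal{L}_{t2i}$$

  • This objective contains two contrastive terms: the supervised image-to-language contrastive loss:

$$\mathcal{L}_{i2t} = -\sum_{i \in B} \frac{1}{|P(i)|} \sum_{k \in P(i)} \log \frac{\exp(\tau\, u_i^\top v_k)}{\sum_{j \in B} \exp(\tau\, u_i^\top v_j)}, \qquad P(i) = \{k \mid k \in B,\ y_k = y_i\}$$

  • and the supervised language-to-image contrastive loss:

$$\mathcal{L}_{t2i} = -\sum_{j \in B} \frac{1}{|Q(j)|} \sum_{k \in Q(j)} \log \frac{\exp(\tau\, u_k^\top v_j)}{\sum_{i \in B} \exp(\tau\, u_i^\top v_j)}, \qquad Q(j) = \{k \mid k \in B,\ y_k = y_j\}$$

  • Here τ is a learnable temperature, and P(i) and Q(j) are the sets of in-batch positives that share the same label.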
  • To mitigate the negative effect from augmented prompts, the training is separated into two stages. In the first stage, all data, including augmented texts, are used for training. In the second stage, all augmented data are excluded and training continues on the remaining data.
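To make the bi-directional objective concrete, here is a minimal pure-Python sketch of a supervised contrastive loss over a mini-batch. It assumes the usual form: a log-softmax over the batch, averaged over all in-batch positives that share a label, summed over both directions. The function name `unicl_loss`, the fixed `tau`, and the toy inputs are assumptions for illustration; a real implementation would use a tensor library and a learnable temperature.

```python
import math


def _log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    m = max(values)
    lse = m + math.log(sum(math.exp(z - m) for z in values))
    return [z - lse for z in values]


def unicl_loss(u, v, y, tau=1.0):
    """Bi-directional supervised contrastive loss (illustrative sketch).

    u, v: lists of L2-normalized image / text feature vectors.
    y:    list of integer labels; equal labels mark positives.
    tau:  temperature multiplier (fixed here, learnable in practice).
    """
    n = len(y)
    logits = [[tau * sum(a * b for a, b in zip(u[i], v[j])) for j in range(n)]
              for i in range(n)]

    # Image-to-language term: softmax over texts for each image.
    i2t = 0.0
    for i in range(n):
        log_p = _log_softmax(logits[i])
        positives = [k for k in range(n) if y[k] == y[i]]
        i2t -= sum(log_p[k] for k in positives) / len(positives)

    # Language-to-image term: softmax over images for each text.
    t2i = 0.0
    for j in range(n):
        log_p = _log_softmax([logits[i][j] for i in range(n)])
        positives = [k for k in range(n) if y[k] == y[j]]
        t2i -= sum(log_p[k] for k in positives) / len(positives)

    return i2t + t2i


# Toy batch: three perfectly aligned, mutually orthogonal pairs.
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uniform_loss = unicl_loss(features, features, [0, 1, 2], tau=0.0)
aligned_loss = unicl_loss(features, features, [0, 1, 2], tau=10.0)
```

With `tau=0.0` every logit is equal, so the loss reduces to 2·n·log(n); with a large `tau` and well-aligned pairs it approaches zero.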

1.3. Model

  • The Florence pretrained model uses a two-tower architecture: a 12-layer Transformer as the language encoder, similar to CLIP, and a hierarchical Vision Transformer as the image encoder. The image encoder is a CoSwin Transformer, i.e. a Swin Transformer that uses the convolutional embedding layers described in CvT.
  • Two linear projection layers are added on top of the image encoder and language encoder to match the dimensions of image and language features.
  • The model has 893M parameters in total: 256M in the language Transformer and 637M in the CoSwin-H Transformer. Training takes 10 days on 512 NVIDIA A100 GPUs with 40GB memory per GPU.
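As a quick sanity check on the figures above, the snippet below adds up the two towers and converts the training time into GPU-days. The variable names are made up for this example, and multiplying days by GPU count is only a rough measure of compute.

```python
# Parameter counts in millions, as quoted in the review.
language_params_m = 256
image_params_m = 637
total_params_m = language_params_m + image_params_m

# Rough training budget: wall-clock days times number of GPUs.
training_days = 10
num_gpus = 512
gpu_days = training_days * num_gpus

print(f"Total parameters: {total_params_m}M")
print(f"Approximate compute: {gpu_days} A100 GPU-days")
```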

1.4. Scalable Training Infrastructure

  • As the model is large, Zero Redundancy Optimizer (ZeRO), Activation Checkpointing, Mixed-precision Training, and Gradient Cache are used.

2. Adaptation to Other Tasks

2.1. Object Detection

  • The Dynamic Head adaptor (Dai et al., 2021a), or alternatively Dynamic DETR (Dai et al., 2021b), is added to extend Florence to learn fine-grained (i.e., object-level) representations. Dynamic Head is trained with the one-stage ATSS framework and losses.
  • A large-scale object detection dataset, called FLOD-9M (for FLorence Object detection Dataset), is curated for object detection pre-training.

2.2. Fine-Grained V+L Representation Learning

  • The METER (Dou et al., 2021) adapter is used to extend Florence to fine-grained vision-language representation.
  • The two modalities are fused together to learn a contextual representation.
  • The model is first trained with the image-text matching loss and the masked-language modeling loss, and is then fine-tuned on the downstream task, such as VQA.

2.3. Adaptation to Video Recognition

  • Video CoSwin replaces the tokenization layer of CoSwin (described in Section 1.3) from 2D convolutional layers to 3D convolutional layers, which convert each 3D tube into one token.
  • A 3D convolution-based patch merging operator is also used.
  • The 2D shifted window design is replaced with 3D shifted local windows in the self-attention layers, and so on.
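To give a feel for what "one token per 3D tube" means, the sketch below counts the tokens produced when a video clip is split into non-overlapping tubes. The tube size of 2×4×4 (frames × height × width) and the clip shape are illustrative assumptions, not values taken from this review, and `count_tube_tokens` is a hypothetical helper.

```python
def count_tube_tokens(frames, height, width, tube=(2, 4, 4)):
    """Number of tokens when a clip is split into non-overlapping 3D tubes.

    tube is (temporal, vertical, horizontal) size; each tube becomes one token.
    """
    t, h, w = tube
    if frames % t or height % h or width % w:
        raise ValueError("clip dimensions must be divisible by the tube size")
    return (frames // t) * (height // h) * (width // w)


# A 16-frame 224x224 clip with the assumed 2x4x4 tubes.
video_tokens = count_tube_tokens(16, 224, 224)
# The same spatial size as a single image with 4x4 patches, for comparison.
image_tokens = count_tube_tokens(1, 224, 224, tube=(1, 4, 4))
```

Under these assumptions, the 16-frame clip yields 8 times as many tokens as a single image, rather than 16 times, because each tube spans two frames.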

3. Results

3.1. Image Classification

Florence outperforms state-of-the-art methods on 9 of 12 tasks. In zero-shot transfer on ImageNet-1K, it achieves a top-1 accuracy of 83.74% (+5.6% over the previous SOTA result) and a top-5 accuracy of 97.18%.
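For readers unfamiliar with zero-shot classification in a two-tower model, the sketch below shows the general idea: class names are embedded by the text encoder, the image is embedded by the image encoder, and classes are ranked by cosine similarity. The embeddings here are made-up 2D vectors, and the helper names (`zero_shot_rank`, `top_k_accuracy`) are hypothetical; this is not Florence's evaluation code.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_rank(image_emb, class_embs):
    """Return class indices sorted by cosine similarity to the image."""
    u = normalize(image_emb)
    scored = []
    for idx, emb in enumerate(class_embs):
        v = normalize(emb)
        scored.append((sum(a * b for a, b in zip(u, v)), idx))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored]


def top_k_accuracy(rankings, targets, k):
    """Fraction of samples whose target is among the top-k ranked classes."""
    hits = sum(1 for ranking, target in zip(rankings, targets) if target in ranking[:k])
    return hits / len(targets)


# Made-up text embeddings for three classes and three image embeddings.
class_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.5]]
targets = [0, 1, 0]

rankings = [zero_shot_rank(img, class_embs) for img in image_embs]
top1 = top_k_accuracy(rankings, targets, k=1)
top2 = top_k_accuracy(rankings, targets, k=2)
```

In this toy case the third image is ranked closest to the wrong class, so top-1 accuracy is 2/3 while top-2 accuracy is 1.0.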

3.2. Image-Text Retrieval

Zero-shot Florence matches or outperforms all prior zero-shot results on the two evaluated datasets, Flickr30K and MSCOCO.

3.3. Object Detection

Zero-shot Florence outperforms 5-shot fine-tuning on 7 of 11 tasks, and outperforms full-set fine-tuning on the “Packages” dataset, which has only 26 training images.

3.4. V+L Representation Learning


Compared with SimVLM, which uses 1.8B image-text pairs, Florence uses only 900M pairs to pre-train the image encoder and 20M for VLP, yet achieves better results.

Text-to-Video Retrieval

Florence outperforms all the state-of-the-art methods by a large margin in terms of the R@1 metric.
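R@1 is recall at rank 1: the fraction of text queries whose ground-truth video is the top-ranked result. The sketch below computes R@K from a query-by-candidate similarity matrix, assuming that the correct candidate for query i is candidate i. The function name `recall_at_k` and the similarity scores are made up for illustration.

```python
def recall_at_k(similarity, k):
    """Recall@K for retrieval.

    similarity[i][j] is the score between query i and candidate j.
    Assumes the ground-truth match for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Made-up scores for three text queries against three videos.
similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
r_at_1 = recall_at_k(similarity, k=1)
r_at_2 = recall_at_k(similarity, k=2)
```

Here the second query ranks the wrong video first, so R@1 is 2/3, while R@2 is 1.0 because the correct video is still in its top two.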

3.5. Video Action Recognition

Florence's results are better than the state-of-the-art by 1.1% and 1.5% on Kinetics-400 and Kinetics-600, respectively.


