Brief Review — Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Florence Is Improved as Florence-2, and the FLD-5B Dataset Is Constructed

Sik-Ho Tsang
5 min read · Mar 7, 2024
Florence-2 Capability

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Florence-2 & FLD-5B, by Azure AI, Microsoft
2023 arXiv v1 (Sik-Ho Tsang @ Medium)


  • Florence-2 is designed to take text prompts as task instructions and to generate the desired results in text form, whether the task is captioning, object detection, grounding, or segmentation.
  • This multi-task learning setup demands large-scale, high-quality annotated data. To this end, the FLD-5B dataset is also developed, consisting of 5.4 billion comprehensive visual annotations on 126 million images.
  • Florence has been used in Azure Cognitive Service for Vision since March 2023.


  1. Florence-2 Framework
  2. FLD-5B Dataset
  3. Results

1. Florence-2 Framework

Florence-2 Framework

1.1. Task Formulation

Florence-2 uses a vision encoder to convert images into visual token embeddings, which are then concatenated with text embeddings and processed by a Transformer-based multi-modal encoder-decoder to generate the response.

Each task is formulated as a translation problem: given an input image and a task-specific prompt, the model generates the corresponding output response.

  • For region-specific tasks, location tokens are added to the tokenizer’s vocabulary list, representing quantized coordinates. 1000 bins are used.
  • Different region formats are used: Box representation (x0, y0, x1, y1), Quad box representation (x0, y0, …, x3, y3), and Polygon Representation (x0, y0, …, xn, yn).
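To make the location-token idea concrete, here is a minimal sketch of how pixel coordinates could be quantized into 1000 bins and rendered as tokens. The token spelling `<loc_N>`, the function names, and the clamping rule are assumptions for illustration; the review does not specify them. Because the input is a flat coordinate sequence, the same helper covers the box, quad box, and polygon formats.

```python
from typing import List, Sequence

NUM_BINS = 1000  # from the text: coordinates are quantized into 1000 bins


def quantize(value: float, size: float, bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to a bin index in [0, bins - 1]."""
    index = int(value * bins / size)
    return min(bins - 1, max(0, index))  # clamp so value == size lands in the last bin


def to_location_tokens(coords: Sequence[float], width: float, height: float) -> List[str]:
    """Convert a flat (x0, y0, x1, y1, ...) sequence into location tokens.

    The "<loc_N>" spelling is a hypothetical token format.
    """
    tokens = []
    for i, value in enumerate(coords):
        size = width if i % 2 == 0 else height  # even positions are x, odd are y
        tokens.append(f"<loc_{quantize(value, size)}>")
    return tokens


# A box (x0, y0, x1, y1) on a 640x480 image.
box_tokens = to_location_tokens((0, 0, 320, 480), width=640, height=480)
print(box_tokens)  # ['<loc_0>', '<loc_0>', '<loc_500>', '<loc_999>']
```

A polygon with n points would simply pass 2n values and receive 2n tokens.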

1.2. Model

Vision encoder: DaViT [20] is used. It takes the image as input and converts it into flattened visual token embeddings V.

Multi-Modality Encoder Decoder: The prompt text embeddings Tprompt are obtained first. The vision token embeddings are then concatenated with the prompt embeddings to form the multi-modality encoder input, X = [V′, Tprompt], where V′ is obtained by applying a linear projection and layer normalization to V.
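The construction of the encoder input can be sketched in a few lines of plain Python. This is only a toy illustration: the sizes, the random weights, and the helper names are assumptions, and the layer normalization here omits the learned scale and shift that a real implementation would likely have.

```python
import math
import random
from typing import List

Matrix = List[List[float]]  # rows are tokens, columns are feature dimensions


def linear(v: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """Project each token (row) of v from D_in to D_out dimensions."""
    return [
        [sum(x * weight[k][j] for k, x in enumerate(row)) + bias[j]
         for j in range(len(bias))]
        for row in v
    ]


def layer_norm(v: Matrix, eps: float = 1e-5) -> Matrix:
    """Normalize each token to zero mean and unit variance (no learned scale/shift)."""
    out = []
    for row in v:
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        out.append([(x - mean) / math.sqrt(var + eps) for x in row])
    return out


def build_encoder_input(v: Matrix, t_prompt: Matrix,
                        weight: Matrix, bias: List[float]) -> Matrix:
    """Return X = [V', T_prompt], concatenated along the token axis."""
    v_prime = layer_norm(linear(v, weight, bias))
    return v_prime + t_prompt  # list concatenation stacks the two token sequences


random.seed(0)
n_v, d_v, n_t, d = 4, 6, 2, 3  # toy sizes, chosen only for illustration
V = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(n_v)]
T_prompt = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_t)]
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d_v)]
b = [0.0] * d

X = build_encoder_input(V, T_prompt, W, b)
print(len(X), len(X[0]))  # 6 3
```

The projection exists so that the visual tokens share the text embedding dimension before the two sequences are joined.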

1.3. Objective

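  • Given the input x, which combines the image and the prompt, and the target y, standard language modeling with a cross-entropy loss is used:

$$\mathcal{L} = -\sum_{i=1}^{|y|} \log P_\theta(y_i \mid y_{<i}, x)$$

where θ denotes the network parameters and |y| is the number of target tokens.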
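As a rough illustration of this objective, the sketch below sums the negative log-probability of each target token. It assumes the decoder has already produced one logit vector per target position, conditioned on x and the preceding target tokens (teacher forcing). The function names are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]


def sequence_nll(step_logits: Sequence[Sequence[float]],
                 target_ids: Sequence[int]) -> float:
    """Sum of -log P(y_i | y_<i, x) over all target positions."""
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total


# With uniform logits over a 4-token vocabulary, every step costs log(4).
uniform_loss = sequence_nll([[0.0] * 4] * 3, [0, 1, 2])
print(abs(uniform_loss - 3 * math.log(4)) < 1e-9)  # True
```

In practice the sum is often averaged over tokens and batches, but that only rescales the loss.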

2. FLD-5B Dataset

Florence-2 Data Engine

Given the scarcity of such data, a new multitask image dataset is developed. This dataset, FLD-5B, includes 126M images, 500M text annotations, 1.3B text-region annotations, and 3.6B text-phrase-region annotations across different tasks.
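A quick arithmetic check shows how these figures relate to the 5.4 billion total quoted earlier. The counts are the rounded values from the text, so the per-image average is only approximate.

```python
images = 126_000_000
text_annotations = 500_000_000
text_region_annotations = 1_300_000_000
text_phrase_region_annotations = 3_600_000_000

total_annotations = (text_annotations
                     + text_region_annotations
                     + text_phrase_region_annotations)
annotations_per_image = total_annotations / images

print(total_annotations)                # 5400000000
print(round(annotations_per_image, 1))  # 42.9
```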

2.1. Initial Annotation With Specialist Models

  • To initiate the annotation process for each annotation type, synthetic labels obtained from specialist models are employed.

2.2. Data Filtering and Enhancement

  • First, for textual annotations, a parsing tool based on SpaCy [28] is developed, inspired by DiHT [63], to extract objects, attributes, and actions. Texts containing excessive objects are filtered out.
  • The complexity of the actions and objects is assessed by measuring the degree of their nodes in the dependency parse tree. Texts with a certain minimum action and object complexity are retained.
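The filtering rule can be sketched on a hand-built dependency parse, without calling a real parser. Everything specific here is an assumption: the `Token` structure, the use of NOUN and VERB tags to identify objects and actions, the single shared complexity threshold, and the threshold values. The convention that the root token is its own head mirrors how spaCy represents parse trees.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    text: str
    pos: str   # coarse part-of-speech tag, e.g. "NOUN" or "VERB"
    head: int  # index of the syntactic head; the root points to itself


def node_degree(tokens: List[Token], index: int) -> int:
    """Degree of a node in the dependency tree: its children plus its parent edge."""
    children = sum(1 for i, t in enumerate(tokens) if t.head == index and i != index)
    has_parent = tokens[index].head != index
    return children + int(has_parent)


def keep_text(tokens: List[Token], max_objects: int = 5, min_complexity: int = 2) -> bool:
    """Drop texts with too many objects; keep texts whose objects/actions are complex enough."""
    objects = [i for i, t in enumerate(tokens) if t.pos == "NOUN"]
    actions = [i for i, t in enumerate(tokens) if t.pos == "VERB"]
    if len(objects) > max_objects:  # hypothetical limit for "excessive objects"
        return False
    degrees = [node_degree(tokens, i) for i in objects + actions]
    return bool(degrees) and max(degrees) >= min_complexity


# "A dog chases a red ball", parsed by hand.
rich = [
    Token("A", "DET", 1), Token("dog", "NOUN", 2), Token("chases", "VERB", 2),
    Token("a", "DET", 5), Token("red", "ADJ", 5), Token("ball", "NOUN", 2),
]
bare = [Token("dog", "NOUN", 0)]

print(keep_text(rich), keep_text(bare))  # True False
```

With a real parser, the same degree computation would run over the parser's own head and child links.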

2.3. Iterative Data Refinement

  • Using the filtered initial annotations, a multitask model that processes sequences of data is trained. Its predictions are then used to update the annotations, and the model is retrained on the refined data. The iteratively trained model is leveraged for pre-training.

2.4. Annotation

Example in FLD-5B
  • Text: Text annotations categorize images using three levels of granularity: brief, detailed, and more detailed.
  • Region-Text Pairs: Each region can be annotated with varying degrees of granularity, including phrases and sentences. Text regions are labeled using Azure AI Services’ OCR API [1], while visual objects are initially annotated with a DINO object detector.
  • Text-Phrase-Region Triplets: consist of a descriptive text of the image, noun phrases in this text related to image objects, and region annotations for these objects.
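One way to picture the three annotation types is as a per-image record. The schema below is purely hypothetical, including all class and field names and the example values; it is not the actual FLD-5B storage format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class RegionText:
    """Region-text pair: one region described by a phrase or a sentence."""
    box: Box
    text: str


@dataclass
class PhraseRegions:
    """A noun phrase from the descriptive text and the regions it refers to."""
    phrase: str
    boxes: List[Box]


@dataclass
class ImageAnnotations:
    # Text annotations at three levels of granularity.
    brief: str
    detailed: str
    more_detailed: str
    # Region-text pairs.
    region_texts: List[RegionText] = field(default_factory=list)
    # Text-phrase-region triplets: one descriptive text plus grounded phrases.
    grounded_text: str = ""
    phrase_regions: List[PhraseRegions] = field(default_factory=list)


example = ImageAnnotations(
    brief="A dog on grass.",
    detailed="A brown dog runs across a grassy field.",
    more_detailed="A brown dog with a red collar runs across a sunlit grassy field.",
    region_texts=[RegionText((10, 20, 200, 180), "brown dog")],
    grounded_text="A brown dog runs across a grassy field.",
    phrase_regions=[PhraseRegions("a brown dog", [(10, 20, 200, 180)])],
)
print(len(example.region_texts), len(example.phrase_regions))  # 1 1
```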
Dataset Comparisons

FLD-5B has several advantages over previous datasets, such as more annotations in total and per image. Moreover, the annotations in FLD-5B span multiple levels of spatial and semantic granularity, which allows for more diverse and comprehensive visual understanding tasks.

3. Results

3.1. Zero-Shot

Zero-Shot Performance
  • For image-level tasks, Florence-2-L achieves a 135.6 CIDEr score on the COCO caption benchmark.
  • For region-level grounding and referring expression comprehension tasks, Florence-2-L establishes a new record in zero-shot performance.
  • Additionally, the pretrained model attains 35.8% mIoU on the RefCOCO referring expression segmentation (RES) task, a capability not supported by prior foundation models.

3.2. Generalist Model with Public Supervised Data

Florence-2 models are fine-tuned on a collection of public datasets that cover image-level, region-level, and pixel-level tasks, yielding one generalist model for various vision tasks.

Florence-2 demonstrates strong performance with standard multi-modality Transformer encoder-decoder without special designs, particularly for region-level and pixel-level tasks.

  • Florence-2-L achieves competitive performance without relying on LLMs. It stays compact while outperforming models with significantly more parameters, such as Flamingo (80B).
  • Florence-2-L sets a new state-of-the-art performance with an accuracy of 81.5 without any external OCR token input.

3.3. Downstream Tasks Fine-tuning

  • Florence-2 is fine-tuned for downstream tasks by adding a detection head (Mask R-CNN) for object detection and a segmentation head (UPerNet) for segmentation.

Impressive performance of Florence-2 is shown above.


