Brief Review — Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Florence Is Improved as Florence-2, and the FLD-5B Dataset Is Constructed
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Florence-2 & FLD-5B, by Azure AI, Microsoft
2023 arXiv v1 (Sik-Ho Tsang @ Medium)
Visual/Vision/Video Language Model (VLM)
2017 … 2023 [GPT-4] [GPT-4V(ision)] [MultiModal-CoT] [CoCa]
==== My Other Paper Readings Are Also Over Here ====
- Florence-2 is designed to take text prompts as task instructions and generate the desired results in text form, whether the task is captioning, object detection, grounding, or segmentation.
- This multitask learning setup demands large-scale, high-quality annotated data. To this end, the FLD-5B dataset is also developed, which consists of 5.4 billion comprehensive visual annotations on 126 million images.
- Florence has been used in Azure Cognitive Service for Vision since March 2023.
Outline
- Florence-2 Framework
- FLD-5B Dataset
- Results
1. Florence-2 Framework
1.1. Task Formulation
Florence-2 uses a vision encoder to convert images into visual token embeddings, which are then concatenated with text embeddings and processed by a Transformer-based multi-modal encoder-decoder to generate the response.
Each task is treated as a translation problem: given an input image and a task-specific prompt, the corresponding output response is generated.
- For region-specific tasks, location tokens representing quantized coordinates are added to the tokenizer’s vocabulary; 1000 bins are used.
- Different region formats are used: Box representation (x0, y0, x1, y1), Quad box representation (x0, y0, …, x3, y3), and Polygon representation (x0, y0, …, xn, yn). (A small quantization sketch follows this list.)
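A minimal sketch of this quantization, assuming a simple bin mapping: the 1000-bin count comes from the paper, while the `<loc_k>` token format and the function names below are illustrative assumptions.

```python
# Illustrative sketch of coordinate quantization into location tokens.
# The 1000-bin quantization follows the paper; the "<loc_k>" token string
# and the function names are assumptions made here for illustration.
NUM_BINS = 1000

def quantize(coord: float, size: int) -> int:
    """Map an absolute pixel coordinate to one of NUM_BINS bins."""
    binned = int(coord / size * NUM_BINS)
    return min(max(binned, 0), NUM_BINS - 1)

def box_to_loc_tokens(box, img_w, img_h):
    """Box representation (x0, y0, x1, y1) -> four location tokens."""
    x0, y0, x1, y1 = box
    bins = [quantize(x0, img_w), quantize(y0, img_h),
            quantize(x1, img_w), quantize(y1, img_h)]
    return [f"<loc_{b}>" for b in bins]

# Example: a box on a 640x480 image.
print(box_to_loc_tokens((32, 48, 320, 240), 640, 480))
# ['<loc_50>', '<loc_100>', '<loc_500>', '<loc_500>']
```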
1.2. Model
Vision encoder: DaViT [20] is used. The input image is converted into flattened visual token embeddings V.
Multi-Modality Encoder-Decoder: The prompt text embeddings Tprompt are first obtained. Then, the vision token embeddings are concatenated with the prompt embeddings to form the multi-modality encoder input, X = [V′, Tprompt], where V′ is obtained by applying a linear projection and layer normalization to V.
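A minimal PyTorch sketch of how X = [V′, Tprompt] might be assembled; the hidden sizes, sequence lengths, and module names are assumptions, not values from the paper.

```python
import torch
from torch import nn

# Sketch of forming X = [V', Tprompt]. Hidden sizes, batch/sequence lengths,
# and module names are illustrative placeholders, not values from the paper.
class MultiModalInput(nn.Module):
    def __init__(self, vision_dim=1024, model_dim=768):
        super().__init__()
        self.proj = nn.Linear(vision_dim, model_dim)  # linear projection of V
        self.norm = nn.LayerNorm(model_dim)           # layer normalization

    def forward(self, V, T_prompt):
        # V:        (batch, num_visual_tokens, vision_dim) flattened visual tokens
        # T_prompt: (batch, num_prompt_tokens, model_dim) prompt text embeddings
        V_prime = self.norm(self.proj(V))
        return torch.cat([V_prime, T_prompt], dim=1)  # X = [V', Tprompt]

X = MultiModalInput()(torch.randn(2, 196, 1024), torch.randn(2, 12, 768))
print(X.shape)  # torch.Size([2, 208, 768])
```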
1.3. Objective
- Given the input x, formed by combining the image and the prompt, and the target y, the standard language modeling objective with cross-entropy loss is used: L = −∑_{i=1}^{|y|} log P(y_i | y_{<i}, x).
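A short PyTorch sketch of this objective (teacher-forced cross-entropy over the target tokens); the vocabulary size and tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

# Sketch of the language-modeling objective: cross-entropy over the target
# token sequence y given the multimodal input x (teacher forcing).
# The vocabulary size and tensor shapes are illustrative.
batch, seq_len, vocab_size = 2, 10, 51289

logits = torch.randn(batch, seq_len, vocab_size)           # decoder scores for y_i | y_<i, x
targets = torch.randint(0, vocab_size, (batch, seq_len))   # ground-truth token ids y

loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```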
2. FLD-5B Dataset
Given the scarcity of such data, a new multitask image dataset is developed. This dataset, FLD-5B, includes 126M images, 500M text annotations, 1.3B text-region annotations, and 3.6B text-phrase-region annotations across different tasks.
2.1. Initial Annotation With Specialist Models
- To initiate the annotation process for each annotation type, synthetic labels obtained from specialist models are employed.
2.2. Data Filtering and Enhancement
- First, for textual annotations, inspired by DiHT [63], a parsing tool based on SpaCy [28] is developed to extract objects, attributes, and actions. Texts containing excessive objects are filtered out.
- The complexity of the actions and objects is assessed by measuring their node degree in the dependency parse tree. Texts with a certain minimum action and object complexity are retained (a rough sketch of this heuristic follows this list).
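A rough sketch of such a filtering heuristic, assuming SpaCy's standard English pipeline; the thresholds are illustrative, not the ones used to build FLD-5B.

```python
import spacy

# Rough sketch of the filtering heuristic described above: parse a caption,
# count candidate objects (nouns) and actions (verbs), and measure each
# token's degree in the dependency parse tree. The thresholds are
# illustrative, not the ones used to build FLD-5B.
nlp = spacy.load("en_core_web_sm")

def complexity_stats(text):
    doc = nlp(text)
    objects = [t for t in doc if t.pos_ in ("NOUN", "PROPN")]
    actions = [t for t in doc if t.pos_ == "VERB"]
    # Degree of a node = number of children (+1 for the edge to its head).
    degree = {t.i: len(list(t.children)) + (0 if t.dep_ == "ROOT" else 1) for t in doc}
    return len(objects), len(actions), degree

def keep_caption(text, max_objects=10, min_actions=1):
    n_objects, n_actions, _ = complexity_stats(text)
    return n_objects <= max_objects and n_actions >= min_actions

print(keep_caption("A man riding a bicycle down a busy street."))  # True
```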
2.3. Iterative Data Refinement
- Using the filtered initial annotations, a multitask model is trained that processes sequences of data. The iteratively trained model is leveraged for pre-training purposes.
2.4. Annotation
- Text: Text annotations categorize images using three types of granularities: brief, detailed, and more detailed.
- Region-Text Pairs: Each region can be annotated with varying degrees of granularity, including phrases and sentences. Text regions are labeled using Azure AI Services’ OCR API [1], while visual objects are initially annotated with a DINO object detector.
- Text-Phrase-Region Triplets: These consist of a descriptive text of the image, noun phrases in this text related to image objects, and region annotations for these objects (an illustrative record layout is sketched after this list).
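Putting the three annotation types together, one FLD-5B image record can be pictured roughly as below; the field names and exact schema are assumptions made for clarity, not taken from the released dataset files.

```python
# Illustrative layout of one FLD-5B image record covering the three
# annotation types above. Field names and values are hypothetical.
example_record = {
    "image_id": "000000123456",
    "text": {
        "brief": "A dog on a beach.",
        "detailed": "A brown dog runs along a sandy beach near the water.",
        "more_detailed": "A brown dog with a red collar runs along a sandy "
                         "beach at sunset, kicking up sand near the waterline.",
    },
    "region_text_pairs": [
        {"box": [120, 80, 360, 300], "text": "brown dog with a red collar"},
        {"box": [0, 250, 640, 480], "text": "sandy beach"},
    ],
    "text_phrase_region_triplets": [
        {
            "text": "A brown dog runs along a sandy beach.",
            "phrases": [
                {"phrase": "A brown dog", "box": [120, 80, 360, 300]},
                {"phrase": "a sandy beach", "box": [0, 250, 640, 480]},
            ],
        },
    ],
}
```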
FLD-5B has several advantages over the previous ones, such as having more annotations in total and per image. Moreover, the annotations in FLD-5B span multiple levels of spatial and semantic granularity, which allows for more diverse and comprehensive visual understanding tasks.
3. Results
3.1. Zero-Shot
- For image-level tasks, Florence-2-L achieves a 135.6 CIDEr score on the COCO caption benchmark.
- For region-level grounding and referring expression comprehension tasks, Florence-2-L establishes a new record in zero-shot performance.
- Additionally, the pretrained model attains 35.8% mIoU on the RefCOCO referring expression segmentation (RES) task, a capability not supported by prior foundation models.
3.2. Generalist Model with Public Supervised Data
Florence-2 models are fine-tuned by adding a collection of public datasets that cover image-level, region-level, and pixel-level tasks, yielding one generalist model for various vision tasks.
Florence-2 demonstrates strong performance with standard multi-modality Transformer encoder-decoder without special designs, particularly for region-level and pixel-level tasks.
- Florence-2-L achieves competitive performance without the need for LLMs, maintaining a compact size while outperforming models with significantly more parameters, such as Flamingo (80B).
- Florence-2-L sets a new state-of-the-art with an accuracy of 81.5 on TextVQA without any external OCR token input.
3.3. Downstream Tasks Fine-tuning
- By adding an object detection head (Mask R-CNN) and a segmentation head (UPerNet) for object detection and semantic segmentation tasks respectively, Florence-2 is fine-tuned for downstream tasks (a rough sketch of this setup is given at the end of this post).
Impressive performance of Florence-2 is shown above.
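As a rough illustration of the downstream fine-tuning setup, the sketch below attaches a torchvision Mask R-CNN head to a stand-in vision backbone; the real recipe uses the pre-trained DaViT encoder from Florence-2 (and UPerNet for semantic segmentation), and the backbone, anchor sizes, and dimensions here are placeholders.

```python
import torch
from torch import nn
from torchvision.models.detection import MaskRCNN
from torchvision.models.detection.anchor_utils import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

# Hypothetical stand-in for the pre-trained DaViT vision encoder; any module
# that maps images to a single feature map and exposes `out_channels` works
# with torchvision's MaskRCNN wrapper.
class VisionBackbone(nn.Module):
    def __init__(self, out_channels=256):
        super().__init__()
        self.out_channels = out_channels
        self.body = nn.Sequential(
            nn.Conv2d(3, out_channels, kernel_size=7, stride=16, padding=3),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.body(x)

backbone = VisionBackbone()
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
box_roi_pool = MultiScaleRoIAlign(featmap_names=["0"], output_size=7, sampling_ratio=2)
mask_roi_pool = MultiScaleRoIAlign(featmap_names=["0"], output_size=14, sampling_ratio=2)

# Attach the detection/instance-segmentation head on top of the backbone.
model = MaskRCNN(backbone, num_classes=91,
                 rpn_anchor_generator=anchor_generator,
                 box_roi_pool=box_roi_pool,
                 mask_roi_pool=mask_roi_pool)

model.eval()
with torch.no_grad():
    predictions = model([torch.rand(3, 384, 384)])  # list of dicts: boxes, labels, scores, masks
print(predictions[0].keys())
```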