Brief Review — Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Florence Is Improved as Florence-2, and the FLD-5B Dataset Is Constructed
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Florence-2 & FLD-5B, by Azure AI, Microsoft
2023 arXiv v1 (Sik-Ho Tsang @ Medium)
Visual/Vision/Video Language Model (VLM)
2017 … 2023 [GPT-4] [GPT-4V(ision)] [MultiModal-CoT] [CoCa]
==== My Other Paper Readings Are Also Over Here ====
- Florence-2 is designed to take text prompts as task instructions and generate the desired results in text form, whether the task is captioning, object detection, grounding, or segmentation.
- This multitask learning setup demands large-scale, high-quality annotated data. To this end, the FLD-5B dataset is also developed, which consists of 5.4 billion comprehensive visual annotations on 126 million images.
- Florence has been used in Azure Cognitive Service for Vision since March 2023.
Outline
- Florence-2 Framework
- FLD-5B Dataset
- Results
1. Florence-2 Framework
1.1. Task Formulation
Florence-2 uses a vision encoder to convert images into visual token embeddings, which are then concatenated with text embeddings and processed by a Transformer-based multi-modal encoder-decoder to generate the response.
Each task is treated as a translation problem: given an input image and a task-specific prompt, the corresponding output response is generated.
- For region-specific tasks, location tokens representing quantized coordinates are added to the tokenizer’s vocabulary; 1000 bins are used.
- Different region formats are used: Box representation (x0, y0, x1, y1), Quad box representation (x0, y0, …, x3, y3), and Polygon representation (x0, y0, …, xn, yn). (A small quantization sketch follows this list.)
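A minimal sketch of this quantization, assuming a simple bin mapping: the 1000-bin count comes from the paper, while the `<loc_k>` token format and the function names below are illustrative assumptions.

```python
# Illustrative sketch of coordinate quantization into location tokens.
# The 1000-bin quantization follows the paper; the "<loc_k>" token string
# and the function names are assumptions made here for illustration.
NUM_BINS = 1000

def quantize(coord: float, size: int) -> int:
    """Map an absolute pixel coordinate to one of NUM_BINS bins."""
    binned = int(coord / size * NUM_BINS)
    return min(max(binned, 0), NUM_BINS - 1)

def box_to_loc_tokens(box, img_w, img_h):
    """Box representation (x0, y0, x1, y1) -> four location tokens."""
    x0, y0, x1, y1 = box
    bins = [quantize(x0, img_w), quantize(y0, img_h),
            quantize(x1, img_w), quantize(y1, img_h)]
    return [f"<loc_{b}>" for b in bins]

# Example: a box on a 640x480 image.
print(box_to_loc_tokens((32, 48, 320, 240), 640, 480))
# ['<loc_50>', '<loc_100>', '<loc_500>', '<loc_500>']
```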
1.2. Model
Vision encoder: DaViT [20] is used. The input image is converted into flattened visual token embeddings V.
Multi-Modality Encoder-Decoder: The prompt text embeddings Tprompt are first obtained. Then, the vision token embeddings are concatenated with the prompt embeddings to form the multi-modality encoder input, X = [V′, Tprompt], where V′ is obtained by applying a linear projection and layer normalization to V.
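A minimal PyTorch sketch of how X = [V′, Tprompt] might be assembled; the hidden sizes, sequence lengths, and module names are assumptions, not values from the paper.

```python
import torch
from torch import nn

# Sketch of forming X = [V', Tprompt]. Hidden sizes, batch/sequence lengths,
# and module names are illustrative placeholders, not values from the paper.
class MultiModalInput(nn.Module):
    def __init__(self, vision_dim=1024, model_dim=768):
        super().__init__()
        self.proj = nn.Linear(vision_dim, model_dim)  # linear projection of V
        self.norm = nn.LayerNorm(model_dim)           # layer normalization

    def forward(self, V, T_prompt):
        # V:        (batch, num_visual_tokens, vision_dim) flattened visual tokens
        # T_prompt: (batch, num_prompt_tokens, model_dim) prompt text embeddings
        V_prime = self.norm(self.proj(V))
        return torch.cat([V_prime, T_prompt], dim=1)  # X = [V', Tprompt]

X = MultiModalInput()(torch.randn(2, 196, 1024), torch.randn(2, 12, 768))
print(X.shape)  # torch.Size([2, 208, 768])
```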
1.3. Objective
- Given the input x, formed by combining the image and the prompt, and the target y, the standard language modeling objective with cross-entropy loss is used: L = −∑_{i=1}^{|y|} log P(y_i | y_{<i}, x).
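A short PyTorch sketch of this objective (teacher-forced cross-entropy over the target tokens); the vocabulary size and tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

# Sketch of the language-modeling objective: cross-entropy over the target
# token sequence y given the multimodal input x (teacher forcing).
# The vocabulary size and tensor shapes are illustrative.
batch, seq_len, vocab_size = 2, 10, 51289

logits = torch.randn(batch, seq_len, vocab_size)           # decoder scores for y_i | y_<i, x
targets = torch.randint(0, vocab_size, (batch, seq_len))   # ground-truth token ids y

loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```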
2. FLD-5B Dataset
Given the scarcity of such data, a new multitask image dataset is developed. This dataset, FLD-5B, includes 126M images, 500M text annotations, 1.3B text-region annotations, and 3.6B text-phrase-region annotations across different tasks.
2.1. Initial Annotation With Specialist Models
- To initiate the annotation process for each annotation type, synthetic labels obtained from specialist models are employed.
2.2. Data Filtering and Enhancement
- First, for textual annotations, inspired by DiHT [63], a parsing tool based on SpaCy [28] is developed to extract objects, attributes, and actions. Texts containing excessive objects are filtered out.
- The complexity of the actions and objects is assessed by measuring their node degree in the dependency parse tree. Texts with a certain minimum action and object complexity are retained (a rough sketch of this heuristic follows this list).
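A rough sketch of such a filtering heuristic, assuming SpaCy's standard English pipeline; the thresholds are illustrative, not the ones used to build FLD-5B.

```python
import spacy

# Rough sketch of the filtering heuristic described above: parse a caption,
# count candidate objects (nouns) and actions (verbs), and measure each
# token's degree in the dependency parse tree. The thresholds are
# illustrative, not the ones used to build FLD-5B.
nlp = spacy.load("en_core_web_sm")

def complexity_stats(text):
    doc = nlp(text)
    objects = [t for t in doc if t.pos_ in ("NOUN", "PROPN")]
    actions = [t for t in doc if t.pos_ == "VERB"]
    # Degree of a node = number of children (+1 for the edge to its head).
    degree = {t.i: len(list(t.children)) + (0 if t.dep_ == "ROOT" else 1) for t in doc}
    return len(objects), len(actions), degree

def keep_caption(text, max_objects=10, min_actions=1):
    n_objects, n_actions, _ = complexity_stats(text)
    return n_objects <= max_objects and n_actions >= min_actions

print(keep_caption("A man riding a bicycle down a busy street."))  # True
```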
2.3. Iterative Data Refinement
- Using the filtered initial annotations, a multitask model is trained that processes sequences of data. The iteratively trained model is leveraged for pre-training purposes.
2.4. Annotation
- Text: Text annotations categorize images using three types of granularities: brief, detailed, and more detailed.
- Region-Text Pairs: Each region can be annotated with varying degrees of granularity, including phrases and sentences. Text regions are labeled using Azure AI Services’ OCR API [1], while visual objects are initially annotated with a DINO object detector.
- Text-Phrase-Region Triplets: These consist of a descriptive text of the image, noun phrases in this text related to image objects, and region annotations for these objects (an illustrative record layout is sketched after this list).
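Putting the three annotation types together, one FLD-5B image record can be pictured roughly as below; the field names and exact schema are assumptions made for clarity, not taken from the released dataset files.

```python
# Illustrative layout of one FLD-5B image record covering the three
# annotation types above. Field names and values are hypothetical.
example_record = {
    "image_id": "000000123456",
    "text": {
        "brief": "A dog on a beach.",
        "detailed": "A brown dog runs along a sandy beach near the water.",
        "more_detailed": "A brown dog with a red collar runs along a sandy "
                         "beach at sunset, kicking up sand near the waterline.",
    },
    "region_text_pairs": [
        {"box": [120, 80, 360, 300], "text": "brown dog with a red collar"},
        {"box": [0, 250, 640, 480], "text": "sandy beach"},
    ],
    "text_phrase_region_triplets": [
        {
            "text": "A brown dog runs along a sandy beach.",
            "phrases": [
                {"phrase": "A brown dog", "box": [120, 80, 360, 300]},
                {"phrase": "a sandy beach", "box": [0, 250, 640, 480]},
            ],
        },
    ],
}
```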
FLD-5B has several advantages over the previous ones, such as having more annotations in total and per image. Moreover, the annotations in FLD-5B span multiple levels of spatial and semantic granularity, which allows for more diverse and comprehensive visual understanding tasks.
3. Results
3.1. Zero-Shot
- For image-level tasks, Florence-2-L achieves a 135.6 CIDEr score on the COCO caption benchmark.
- For region-level grounding and referring expression comprehension tasks, Florence-2-L establishes a new record in zero-shot performance.
- Additionally, the pretrained model attains 35.8% mIoU on the RefCOCO referring expression segmentation (RES) task, a capability not supported by prior foundation models.
3.2. Generalist Model with Public Supervised Data
Florence-2 models are fine-tuned by adding a collection of public datasets that cover image-level, region-level, and pixel-level tasks, yielding one generalist model for various vision tasks.
Florence-2 demonstrates strong performance with standard multi-modality Transformer encoder-decoder without special designs, particularly for region-level and pixel-level tasks.
- Florence-2-L achieves competitive performance without the need for LLMs, maintaining a compact size while outperforming models with significantly more parameters, such as Flamingo (80B).
- Florence-2-L sets a new state-of-the-art with an accuracy of 81.5 on TextVQA without any external OCR token input.
3.3. Downstream Tasks Fine-tuning
- By adding an object detection head (Mask R-CNN) and a segmentation head (UPerNet) for object detection and semantic segmentation tasks respectively, Florence-2 is fine-tuned for downstream tasks (a rough sketch of this setup is given at the end of this post).
Impressive performance of Florence-2 is shown above.
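As a rough illustration of the downstream fine-tuning setup, the sketch below attaches a torchvision Mask R-CNN head to a stand-in vision backbone; the real recipe uses the pre-trained DaViT encoder from Florence-2 (and UPerNet for semantic segmentation), and the backbone, anchor sizes, and dimensions here are placeholders.

```python
import torch
from torch import nn
from torchvision.models.detection import MaskRCNN
from torchvision.models.detection.anchor_utils import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

# Hypothetical stand-in for the pre-trained DaViT vision encoder; any module
# that maps images to a single feature map and exposes `out_channels` works
# with torchvision's MaskRCNN wrapper.
class VisionBackbone(nn.Module):
    def __init__(self, out_channels=256):
        super().__init__()
        self.out_channels = out_channels
        self.body = nn.Sequential(
            nn.Conv2d(3, out_channels, kernel_size=7, stride=16, padding=3),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.body(x)

backbone = VisionBackbone()
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
box_roi_pool = MultiScaleRoIAlign(featmap_names=["0"], output_size=7, sampling_ratio=2)
mask_roi_pool = MultiScaleRoIAlign(featmap_names=["0"], output_size=14, sampling_ratio=2)

# Attach the detection/instance-segmentation head on top of the backbone.
model = MaskRCNN(backbone, num_classes=91,
                 rpn_anchor_generator=anchor_generator,
                 box_roi_pool=box_roi_pool,
                 mask_roi_pool=mask_roi_pool)

model.eval()
with torch.no_grad():
    predictions = model([torch.rand(3, 384, 384)])  # list of dicts: boxes, labels, scores, masks
print(predictions[0].keys())
```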