Brief Review — MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
MiniGPT-4, Using Open-Sourced LLM and Vision Models
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
MiniGPT-4, by King Abdullah University of Science and Technology
2024 ICLR, Over 2000 Citations (Sik-Ho Tsang @ Medium)
Visual/Vision/Video Language Model (VLM)
2017 … 2023 [GPT-4] [GPT-4V(ision)] [MultiModal-CoT] [CoCa] [Florence-2] [PaLI] [PaLI-X] [OpenCLIP]
==== My Other Paper Readings Are Also Over Here ====
- MiniGPT-4 aims to align visual information from a pretrained vision encoder with an advanced large language model (LLM): a frozen visual encoder is aligned with a frozen advanced LLM, Vicuna (a fine-tuned LLaMA), using a single projection layer.
- The work, for the first time, uncovers that properly aligning the visual features with an advanced large language model can yield numerous advanced multi-modal abilities demonstrated by GPT-4.
Outline
- MiniGPT-4
- Results
1. MiniGPT-4
1.1. Model Architecture
- Vicuna, which is built upon LLaMA, is used as the language decoder.
- For visual perception, the same visual encoder as in BLIP-2 (Li et al., 2023c) is used: a ViT backbone coupled with its pre-trained Q-Former.
- Both language and vision models are open-sourced.
The target is to bridge the gap between the visual encoder and the LLM using a single linear projection layer.
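As a minimal sketch (not the authors' code), this bridge can be viewed as a single linear layer that maps the Q-Former's query tokens into the LLM's embedding space; the dimensions below are illustrative assumptions (32 query tokens of width 768, a 5120-dimensional Vicuna-13B embedding space).

```python
import torch
import torch.nn as nn

class VisionToLLMProjection(nn.Module):
    """Minimal sketch: map frozen Q-Former outputs into the LLM embedding space."""
    def __init__(self, qformer_dim: int = 768, llm_dim: int = 5120):
        super().__init__()
        self.proj = nn.Linear(qformer_dim, llm_dim)  # the only trainable bridge

    def forward(self, qformer_tokens: torch.Tensor) -> torch.Tensor:
        # qformer_tokens: (batch, num_query_tokens, qformer_dim)
        # returns soft-prompt embeddings: (batch, num_query_tokens, llm_dim)
        return self.proj(qformer_tokens)

# Example: 32 query tokens per image, projected into the LLM embedding space.
tokens = torch.randn(2, 32, 768)
soft_prompt = VisionToLLMProjection()(tokens)   # (2, 32, 5120)
```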
1.2. First Pretraining Stage
The model uses a large collection of aligned image-text pairs to gain vision-language knowledge.
- The output from the projection layer serves as a soft prompt for the LLM.
- Throughout pretraining, the pretrained vision encoder and LLM remain frozen, with only the linear projection layer undergoing training.
Conceptual Captions, SBU, and LAION are used as the datasets. Training runs for 20,000 steps with a batch size of 256, covering about 5 million image-text pairs, and completes in around 10 hours on 4 A100 (80GB) GPUs.
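A rough sketch of this stage-1 setup, using toy stand-in modules rather than the real ViT+Q-Former and Vicuna: everything is frozen except the linear projection, whose output is prepended to the caption as a soft prompt for next-token prediction.

```python
import torch
import torch.nn as nn

class ToyFrozenLLM(nn.Module):
    """Toy stand-in for the frozen LLM: scores a caption given soft-prompt embeddings."""
    def __init__(self, vocab_size: int = 1000, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, inputs_embeds, labels):
        # Prepend the soft prompt to the caption embeddings, predict the caption tokens.
        seq = torch.cat([inputs_embeds, self.embed(labels)], dim=1)
        logits = self.head(seq)[:, inputs_embeds.size(1) - 1:-1]   # next-token shift
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), labels.reshape(-1))

llm = ToyFrozenLLM()
projection = nn.Linear(768, 512)            # the only trainable component
for p in llm.parameters():                  # LLM frozen (ViT / Q-Former frozen likewise)
    p.requires_grad = False

optimizer = torch.optim.AdamW(projection.parameters(), lr=1e-4)   # lr is an assumption

# One training step: Q-Former tokens -> soft prompt -> caption loss.
qformer_tokens = torch.randn(4, 32, 768)    # would come from the frozen ViT + Q-Former
caption_ids = torch.randint(0, 1000, (4, 16))
loss = llm(inputs_embeds=projection(qformer_tokens), labels=caption_ids)
loss.backward()                             # gradients reach only the projection layer
optimizer.step()
```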
Yet, the model from this first stage sometimes generates incoherent outputs, such as repetitive words or sentences, fragmented phrases, or irrelevant content, which impairs its capacity for fluent visual conversation with humans.
1.3. Curating a High-Quality Alignment Dataset
The model derived from the first pretraining stage is used to generate comprehensive descriptions of input images with the prompt below, in which <ImageFeature> represents the visual features produced by the linear projection layer.
###Human:
<Img><ImageFeature></Img>
Describe this image in detail.
Give as many details as possible. Say everything you see.
###Assistant:
- To identify incomplete sentences, the authors examine whether the generated sentence exceeds 80 tokens. If it does not, an additional prompt, ###Human: Continue ###Assistant: , is incorporated, prompting MiniGPT-4 to extend the generation process.
- By concatenating the outputs from both steps, a more comprehensive image description can be created.
- 5,000 images are randomly sampled from the Conceptual Captions dataset and fed into the pretrained model to generate a corresponding language description for each image.
- ChatGPT is then utilized with a specific prompt to polish the descriptions, e.g., removing repetitive words or sentences, fragmented sentences, and irrelevant content.
- Upon completing the post-processing stage, the correctness of each image description is manually verified to guarantee its high quality.
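Putting these curation steps together, a hedged sketch of the loop (the helper functions below are hypothetical stubs, not the authors' code): generate a description, extend it if it looks too short, then post-process it with ChatGPT.

```python
# Hypothetical stubs: in the real pipeline, `generate` wraps the stage-1 MiniGPT-4
# model conditioned on <ImageFeature>, `count_tokens` uses the LLM tokenizer, and
# `refine_with_chatgpt` calls ChatGPT with the paper's fixing prompt.
def generate(prompt: str) -> str:
    return "a placeholder description"      # stand-in output

def count_tokens(text: str) -> int:
    return len(text.split())                # crude whitespace proxy for tokens

def refine_with_chatgpt(text: str) -> str:
    return text                             # no-op stand-in

PROMPT = ("###Human: <Img><ImageFeature></Img> Describe this image in detail. "
          "Give as many details as possible. Say everything you see. ###Assistant:")

def curate_description() -> str:
    description = generate(PROMPT)
    # Outputs of at most 80 tokens are treated as possibly incomplete and extended;
    # the two parts are then concatenated into one comprehensive description.
    if count_tokens(description) <= 80:
        continuation = generate(PROMPT + description + " ###Human: Continue ###Assistant:")
        description = description + " " + continuation
    # ChatGPT then removes repetition, fragments, and irrelevant content;
    # the result is finally verified manually.
    return refine_with_chatgpt(description)
```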
1.4. Second-Stage Finetuning
Predefined prompts are used for finetuning the model with the curated high-quality image-text pairs:
###Human:
<Img><ImageFeature></Img>
<Instruction>
###Assistant:
- In this prompt, <Instruction> represents a randomly sampled instruction from the predefined instruction set containing variant forms of instructions such as “Describe this image in detail” or “Could you describe the contents of this image for me”.
It is observed that this finetuning process is remarkably efficient, requiring a mere 400 training steps with a batch size of 12, which takes around 7 minutes on a single A100 GPU.
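A small sketch of how such a finetuning example could be assembled (the instruction list is abbreviated to the two variants quoted above; the loss-masking comment describes common practice rather than a detail stated in this review):

```python
import random

# Two instruction variants quoted in the paper; the full predefined set is larger.
INSTRUCTIONS = [
    "Describe this image in detail.",
    "Could you describe the contents of this image for me?",
]

def build_finetune_example(curated_description: str) -> dict:
    """Sketch: pair a randomly sampled instruction with a curated description.

    <ImageFeature> marks where the projected visual tokens are spliced in; the
    training loss would typically be applied only to the assistant response.
    """
    instruction = random.choice(INSTRUCTIONS)
    prompt = f"###Human: <Img><ImageFeature></Img> {instruction} ###Assistant:"
    return {"prompt": prompt, "target": curated_description}

example = build_finetune_example("A busy street lined with shops below a clock tower.")
print(example["prompt"])
```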
2. Results
2.1. Qualitative Results
Fig.2 shows MiniGPT-4’s ability to identify multiple elements in an image, like busy streets, clock towers, shops, streetlights, and restaurants, whereas BLIP-2 only notes streets, people, and motorcycles.
- MiniGPT-4 has many other capabilities, including creating ads from images (Fig.3), extracting facts from movie photos (Fig.8), generating recipes from food images (Fig.11), diagnosing and suggesting treatments for plant diseases (Fig.12), designing websites from hand-written drafts (Fig.4b), and writing poems inspired by images (Fig.10).
- (Please see the appendix of the paper.)
2.2. Quantitative Results
- Human evaluators assessed the model’s responses.
- MiniGPT-4 outperformed BLIP-2 (Li et al., 2023c), especially in recipe, advertisement, and poem tasks, successfully handling 80% of these.
- It also interpreted humor in memes correctly in 8 out of 25 cases, which is a challenging aspect for BLIP-2.
- Table 1, Image Captioning: MiniGPT-4's captions cover on average the information of 2.22 ground-truth captions, versus 1.96 for BLIP-2, indicating that its captions are more informative.
- Table 2, Video Understanding: MiniGPT-4 is finetuned on 1.2k videos from the VideoInstruct100K (Maaz et al., 2023), using 50 frames and subtitles per video. MiniGPT-4 outperformed the strongest baseline Video-ChatGPT (Maaz et al., 2023) in correctness, detail, context, and time comprehension.
- Table 3 shows a significant drop in failures after finetuning, with fewer than 2 failures per 100 images for each task, indicating a notable improvement in output quality.
- Table 5 indicates that a single projection layer is sufficient to align the vision encoder and the large language model in the proposed limited-training-data setting.
- Table 6: MiniGPT-4 (long) is prompted with “Please describe this image as detailed as possible”, while MiniGPT-4 (short) is prompted with “Please describe the image shortly and precisely, in less than 20 words”. Longer captions tend to have higher hallucination rates.
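Object hallucination is commonly quantified with a CHAIR-style rate, i.e., the fraction of mentioned objects that are absent from the ground-truth annotations. A minimal sketch, with the object-extraction step assumed to happen upstream:

```python
def hallucination_rate(caption_objects: set, ground_truth_objects: set) -> float:
    """CHAIR-style rate: fraction of mentioned objects absent from the ground truth.

    Both arguments are assumed to be pre-extracted object word sets (hypothetical
    upstream step, e.g., matching caption nouns against COCO object classes).
    """
    if not caption_objects:
        return 0.0
    hallucinated = caption_objects - ground_truth_objects
    return len(hallucinated) / len(caption_objects)

# Illustrative only: a longer caption mentioning an object not in the image.
print(hallucination_rate({"street", "clock tower", "umbrella"},
                         {"street", "clock tower", "person"}))   # 0.333...
```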