Review — LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Light-Weight, Plug-and-Play, Fine-Tune LLaMA-7B

Sik-Ho Tsang
5 min read · Sep 16, 2023

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter, by Shanghai Artificial Intelligence Laboratory; CUHK MMLab; University of California, Los Angeles
2024 ICLR, Over 320 Citations (Sik-Ho Tsang @ Medium)

LM Tuning / Prompting
2022 [GPT-3.5, InstructGPT] [LoRA] [Chain-of-Thought Prompting] [T0] [FLAN] [UL2R, U-PaLM] [Flan-PaLM] [Tk-INSTRUCT] 2023 [LIMA] [SELF-INSTRUCT]

  • LLaMA-Adapter is proposed, a lightweight adaption method that efficiently fine-tunes LLaMA into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter introduces only 1.2M learnable parameters on top of the frozen LLaMA 7B model, and fine-tuning takes less than one hour on 8 A100 GPUs.
  • A set of learnable adaption prompts is adopted, and prepended to the input text tokens at higher Transformer layers.
  • Then, a zero-init attention mechanism with zero gating is proposed, which adaptively injects the new instructional cues into LLaMA.
  • CLIP can be added as a visual encoder to support multi-modal inputs.


  1. Single-Modal LLaMA-Adapter
  2. Multi-Modal LLaMA-Adapter
  3. Results

1. Single-Modal LLaMA-Adapter


1.1. Learnable Adaption Prompts

  • Given 52K instruction-to-output data and a pretrained LLaMA with an N-layer Transformer, a set of learnable adaption prompts is adopted for instruction-following fine-tuning.
  • The prompts for L Transformer layers are denoted as {P_l}, l = 1, …, L.
  • Each P_l has the size K×C, with K denoting the prompt length for each layer and C equaling the feature dimension of LLaMA’s Transformer.
  • The prompts are only inserted into the topmost L layers.

The M-length word tokens at the l-th inserted layer are denoted as T_l. The adaption prompt P_l is then concatenated with T_l along the token dimension as a prefix:

    [P_l; T_l] ∈ R^{(K+M)×C}

In this way, the instruction knowledge learned within P_l can effectively guide T_l to generate contextual responses.
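
The prefix concatenation can be sketched in a few lines of NumPy. This is only an illustration with toy sizes: the helper name `prepend_prompt` and the random features are assumptions made for the example, not the authors' code.

```python
import numpy as np

def prepend_prompt(prompt, tokens):
    """Concatenate a (K, C) prompt and (M, C) word tokens along the token axis."""
    if prompt.shape[1] != tokens.shape[1]:
        raise ValueError("prompt and tokens must share the feature dimension C")
    return np.concatenate([prompt, tokens], axis=0)  # shape (K + M, C)

# Toy sizes for illustration only (LLaMA's real feature dimension is far larger).
K, M, C = 10, 6, 8
rng = np.random.default_rng(0)
prompt = rng.normal(size=(K, C))   # stands in for the learnable prompt of one layer
tokens = rng.normal(size=(M, C))   # stands in for the M word tokens of that layer

prefixed = prepend_prompt(prompt, tokens)
print(prefixed.shape)  # (16, 8)
```

The first K rows are the prompt and the remaining M rows are the untouched word tokens, so the prompt acts purely as a prefix.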

1.2. Zero-init Attention

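  • In the attention mechanism at the l-th layer, linear projection layers first transform the input tokens into queries, keys, and values. When generating the (M+1)-th word t_l, the query comes from t_l alone, while the keys and values cover the prompts and all word tokens:
      Q_l = Linear_q(t_l),  K_l = Linear_k([P_l; T_l; t_l]),  V_l = Linear_v([P_l; T_l; t_l])
  • Then, the attention scores before the softmax function are:
      S_l = Q_l K_l^T / √C ∈ R^{1×(K+M+1)}
  • Meanwhile, S_l can be split into two components:
      S_l = [S_l^K; S_l^{M+1}]^T

where the left term, S_l^K, holds the attention scores of the K adaption prompts and the right term, S_l^{M+1}, holds those of the M + 1 word tokens. However, directly inserting randomly initialized prompts is likely to disturb the word tokens in the early training stage.

To address this, a learnable gating factor g_l is introduced, and the softmax is applied to the two components independently:

    S_l^g = [softmax(S_l^K) · g_l; softmax(S_l^{M+1})]^T

Because g_l is initialized to zero, it first eliminates the influence of the under-fitted prompts, and it can then grow during training to inject more instruction semantics.

  • Finally, the output of the attention layer is computed with a linear projection layer:
      t_l^o = Linear_o(S_l^g V_l) ∈ R^{1×C}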
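
The gating idea can be checked numerically with a small NumPy sketch. It is a simplified, single-head illustration under stated assumptions: plain weight matrices stand in for the linear projection layers, positional encoding is omitted, the gate is applied directly as a scalar, and all names (`zero_init_attention`, `plain_attention`) are hypothetical rather than taken from the official implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def zero_init_attention(t_new, tokens, prompt, W_q, W_k, W_v, W_o, gate):
    """Gated attention for the newest token (single head, no positional encoding).

    t_new:  (C,)   feature of the token being generated
    tokens: (M, C) features of the M previous word tokens
    prompt: (K, C) adaption prompt of this layer
    gate:   scalar gating factor, zero at initialization
    """
    K, C = prompt.shape
    context = np.concatenate([prompt, tokens, t_new[None, :]], axis=0)  # (K+M+1, C)
    q = t_new @ W_q                      # query from the newest token only
    keys = context @ W_k                 # keys over prompts and word tokens
    values = context @ W_v               # values over prompts and word tokens
    scores = keys @ q / np.sqrt(C)       # (K+M+1,) pre-softmax scores
    # Softmax is applied to the prompt part and the word part independently,
    # and only the prompt part is scaled by the gate.
    weights = np.concatenate([softmax(scores[:K]) * gate, softmax(scores[K:])])
    return (weights @ values) @ W_o      # (C,) output after the output projection

def plain_attention(t_new, tokens, W_q, W_k, W_v, W_o):
    """Reference attention over the word tokens only (no prompt)."""
    C = t_new.shape[0]
    context = np.concatenate([tokens, t_new[None, :]], axis=0)
    scores = (context @ W_k) @ (t_new @ W_q) / np.sqrt(C)
    return (softmax(scores) @ (context @ W_v)) @ W_o

K, M, C = 4, 5, 8  # toy sizes
rng = np.random.default_rng(0)
prompt, tokens, t_new = rng.normal(size=(K, C)), rng.normal(size=(M, C)), rng.normal(size=C)
W_q, W_k, W_v, W_o = (rng.normal(size=(C, C)) for _ in range(4))

out_zero = zero_init_attention(t_new, tokens, prompt, W_q, W_k, W_v, W_o, gate=0.0)
out_open = zero_init_attention(t_new, tokens, prompt, W_q, W_k, W_v, W_o, gate=1.0)
baseline = plain_attention(t_new, tokens, W_q, W_k, W_v, W_o)
print(np.allclose(out_zero, baseline))  # True
```

With the gate at zero, the layer reproduces the prompt-free attention exactly, which is why training starts from the pretrained model's behaviour; once the gate moves away from zero (`out_open`), the prompt values start to contribute.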

2. Multi-Modal LLaMA-Adapter


A pre-trained visual encoder, CLIP, is used to extract multi-scale global features of the input image, denoted as {I_m}, m = 1, …, M, where M is the number of scales.

  • Then, the M-scale features are concatenated along the channel dimension and a learnable projection network is applied on top:
      I_p = Projection(Concat({I_m}, m = 1, …, M))
    where I_p ∈ R^{1×C} is the overall image token, with the same feature dimension as the adaption prompts.
  • After this, I_p is repeated K times and added element-wise onto the K-length adaption prompts at all L inserted Transformer layers:
      P_l^v = P_l + Repeat(I_p) ∈ R^{K×C}
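
The three steps above (concatenate, project, repeat-and-add) can be sketched as follows. This is a toy NumPy illustration under assumptions: a single linear layer stands in for the learnable projection network, random vectors stand in for CLIP features, and the function name `image_conditioned_prompt` is hypothetical.

```python
import numpy as np

def image_conditioned_prompt(scale_features, W_proj, b_proj, prompt):
    """Add one projected image token to every row of a (K, C) adaption prompt.

    scale_features: list of global feature vectors, one per scale
    W_proj, b_proj: a single linear layer standing in for the projection network
    """
    concat = np.concatenate(scale_features)                 # channel-wise concatenation
    image_token = concat @ W_proj + b_proj                  # overall image token, shape (C,)
    repeated = np.tile(image_token, (prompt.shape[0], 1))   # repeat K times -> (K, C)
    return prompt + repeated                                # element-wise addition

num_scales, D, K, C = 3, 16, 10, 8  # toy sizes
rng = np.random.default_rng(0)
scale_features = [rng.normal(size=D) for _ in range(num_scales)]
W_proj = rng.normal(size=(num_scales * D, C))
b_proj = np.zeros(C)
prompt = rng.normal(size=(K, C))

visual_prompt = image_conditioned_prompt(scale_features, W_proj, b_proj, prompt)
print(visual_prompt.shape)  # (10, 8)
```

Every prompt row receives the same image offset, so the prompt keeps its K×C shape and the rest of the attention computation is unchanged.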

In this way, LLaMA is fine-tuned to generate responses conditioned on vision-language inputs, and can tackle more challenging generative tasks with multi-modal understanding, such as the ScienceQA benchmark.

3. Results

3.1. Qualitative Results

Single-Modal LLaMA-Adapter Examples
Multi-Modal LLaMA-Adapter Examples
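  • For the multi-modal case, each ScienceQA example normally contains a visual context, a textual context, a question, multiple options, and an answer.
  • Settings: LLaMA 7B with N=32 Transformer layers; prompts are inserted into the topmost L=30 layers, each with prompt length K=10.
  • LLaMA-Adapter is trained on 8 A100 GPUs for 5 epochs.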
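
As a rough sanity check, the 1.2M figure follows from these settings once the feature dimension is known. The sketch below assumes C = 4096, the hidden size of LLaMA 7B (not stated in this review), and counts only the prompt parameters, ignoring the small number of gating factors.

```python
def adapter_prompt_params(num_layers, prompt_len, feature_dim):
    """Number of learnable prompt parameters: L prompts, each of size K x C."""
    return num_layers * prompt_len * feature_dim

C = 4096  # assumed hidden size of LLaMA 7B
total = adapter_prompt_params(num_layers=30, prompt_len=10, feature_dim=C)
print(total)                   # 1228800
print(f"{total / 1e6:.2f}M")   # 1.23M
```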

By only fine-tuning 1.2M parameters, the proposed approach generates reasonable responses comparable to the fully fine-tuned Alpaca and the large-scale GPT-3. This fully demonstrates the effectiveness of the proposed adapters with zero-init attention.

3.2. SOTA Comparisons

On ScienceQA, the proposed single-modal variant (‘LLaMA-Adapter_T’) attains 78.31% accuracy with 1.2M parameters. By further injecting visual conditions with a 0.6M projection network, the proposed multi-modal variant (‘LLaMA-Adapter’) gains a further +6.88% in answer accuracy.

GPT-3 contains 175B total parameters, far more than the proposed 7B LLaMA with 1.2M adapter parameters. Also, as a pure language model, GPT-3 cannot leverage any additional visual information. In contrast, LLaMA-Adapter can easily be switched to a multi-modal variant and achieves +10% higher accuracy.

Parameters, Storage, Training Time

As a lightweight plug-and-play module, LLaMA-Adapter enjoys superior training efficiency with only 1.2M parameters, 4.9M storage, and one-hour training.
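
The storage figure is roughly what one would expect from the parameter count. The short estimate below assumes the adapter weights are saved as 32-bit floats (4 bytes each), which is an assumption rather than something stated in the source; the helper name `storage_mb` is illustrative.

```python
def storage_mb(num_params, bytes_per_param=4):
    """Storage in megabytes (10**6 bytes) for parameters saved at the given width."""
    return num_params * bytes_per_param / 1e6

adapter_params = 1.2e6   # learnable parameters reported for LLaMA-Adapter
base_params = 7e9        # nominal size of the frozen LLaMA 7B model

adapter_mb = storage_mb(adapter_params)
trainable_fraction = adapter_params / base_params
print(f"{adapter_mb:.1f} MB")        # 4.8 MB
print(f"{trainable_fraction:.4%}")   # 0.0171%
```

Under this assumption the adapter occupies a few megabytes and amounts to well under 0.1% of the frozen model's parameters, which is what makes it practical to ship as a plug-in.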

3.3. Ablation Studies

  • Table 4: Increasing the number of inserted layers introduces more learnable parameters, but leads to a significant improvement in accuracy on the validation set.
  • Table 5: Zero-init attention contributes to a significant +43.27% performance gain on ScienceQA’s validation set.
  • Figure 5: The ‘zero-init attention’ converges faster and reaches lower loss bounds than ‘rand-init attention’.
  • Table 6: LLaMA-Adapter is relatively robust to the over-fitting issue.


