Brief Review — QLoRA: Efficient Finetuning of Quantized LLMs

QLoRA, LoRA With Quantization. A Model Family, Guanaco, Is Proposed.

Sik-Ho Tsang
5 min read · Apr 23, 2024
Guanaco, a camelid native to South America, closely related to the llama. (Free Image from Unsplash: Chris Stenger)

QLoRA: Efficient Finetuning of Quantized LLMs
QLoRA, Guanaco, by University of Washington
2023 NeurIPS, Over 650 Citations (Sik-Ho Tsang @ Medium)

LM Tuning / Prompting
2020 …
2023 [LIMA] [Self-Instruct] [Self-Consistency] [Med-PaLM 2] 2024 [LLaMA-Adapter]
==== My Other Paper Readings Are Also Over Here ====

  • QLoRA introduces (1) 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights; (2) Double Quantization, which reduces the average memory footprint by quantizing the quantization constants; and (3) Paged Optimizers, which manage memory spikes during training.
  • By using QLoRA, a model family, namely Guanaco, is constructed.

Outline

  1. Background
  2. QLoRA Finetuning
  3. QLoRA & Guanaco Results

1. Background

1.1. Block-wise k-bit Quantization

  • The input data type is commonly rescaled into the target data type's range through normalization by the absolute maximum of the input elements, which are usually structured as a tensor.
  • For example, quantizing a 32-bit Floating Point (FP32) tensor into an Int8 tensor with range [−127, 127]:
    X^Int8 = round(127 / absmax(X^FP32) · X^FP32) = round(c^FP32 · X^FP32),
  • where c is the quantization constant or quantization scale.
  • Dequantization is the inverse:
    dequant(c^FP32, X^Int8) = X^Int8 / c^FP32 = X^FP32.
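To make the block-wise scheme concrete, here is a minimal sketch (not the paper's implementation) of absmax Int8 quantization and dequantization in PyTorch; the blocksize of 64 is only an illustrative choice:

```python
import torch

def blockwise_absmax_quantize(x_fp32: torch.Tensor, blocksize: int = 64):
    """Quantize an FP32 tensor to Int8, block by block, with absmax scaling."""
    q_blocks, consts = [], []
    for block in x_fp32.flatten().split(blocksize):
        c = 127.0 / block.abs().max().clamp(min=1e-12)          # quantization constant c
        q_blocks.append(torch.round(c * block).to(torch.int8))
        consts.append(c)
    return torch.cat(q_blocks), torch.stack(consts)

def blockwise_dequantize(x_int8: torch.Tensor, consts: torch.Tensor, blocksize: int = 64):
    """Dequantization is the inverse: divide each block by its constant c."""
    blocks = x_int8.split(blocksize)
    return torch.cat([b.float() / c for b, c in zip(blocks, consts)])

w = torch.randn(256)
q, c = blockwise_absmax_quantize(w)
print((w - blockwise_dequantize(q, c)).abs().max())             # small quantization error
```

Quantizing block by block, rather than the whole tensor at once, limits the damage an outlier can do: a single large value only shrinks the effective range of its own block.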

1.2. Low-rank Adapter (LoRA) Finetuning

LoRA: Only train A and B. (Figure from LoRA)
  • With the full model parameters W kept fixed, LoRA augments a linear projection with an additional factorized projection. Given a projection XW = Y with X ∈ R^(b×h) and W ∈ R^(h×o), LoRA computes:
    Y = XW + s · XL1L2,
    where L1 ∈ R^(h×r) and L2 ∈ R^(r×o) are the trainable low-rank factors and s is a scalar.
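A minimal sketch of such a LoRA layer in PyTorch (hypothetical code, not the official implementation; the LoRA paper names the factors A and B, and alpha/r is the usual scaling convention):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x W^T + (alpha / r) * x A^T B^T, where W is frozen and only A, B are trained."""
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02,
                                   requires_grad=False)           # frozen pretrained W
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)  # low-rank factor A
        self.B = nn.Parameter(torch.zeros(out_features, r))        # B starts at zero, so the adapter is a no-op at init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight.t() + self.scale * ((x @ self.A.t()) @ self.B.t())

layer = LoRALinear(512, 512, r=8)
y = layer(torch.randn(4, 512))      # only layer.A and layer.B receive gradients
```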

2. QLoRA Finetuning

Full Finetuning, LoRA & QLoRA
  • QLoRA further reduces the memory requirement and achieves high-fidelity 4-bit finetuning via two techniques: 4-bit NormalFloat (NF4) quantization and Double Quantization. Additionally, Paged Optimizers are introduced to prevent memory spikes.

2.1. 4-bit NormalFloat Quantization

  • The NormalFloat (NF) data type builds on Quantile Quantization. However, exact quantile estimation is expensive; this cost can be avoided when the input tensors come from a distribution that is fixed up to a quantization constant.
  • Since pretrained neural network weights usually have a zero-centered normal distribution with standard deviation σ, all weights can be transformed to a single fixed distribution by scaling σ such that the distribution fits exactly into the range of the data type, here set to [−1, 1]. The data type is then constructed as follows:
  1. Estimate the 2^k + 1 quantiles of a theoretical N(0, 1) distribution to obtain a k-bit quantile quantization data type for normal distributions.
  2. Take this data type and normalize its values into the [−1, 1] range.
  3. Quantize an input weight tensor by normalizing it into the [−1, 1] range through absolute maximum rescaling.
  • Formally, the 2^k values q_i of the data type are estimated as follows:
    q_i = ½ (Q_X(i / (2^k + 1)) + Q_X((i + 1) / (2^k + 1))),
  • where Q_X(·) is the quantile function of the standard normal distribution N(0, 1).
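A rough sketch of this quantile-estimation idea is below (it is not the exact asymmetric NF4 construction in bitsandbytes, which additionally guarantees an exact zero; the small offset that keeps the outer quantiles finite is an assumption for illustration):

```python
import numpy as np
from scipy.stats import norm

def approx_nf_levels(k: int = 4, offset: float = 1e-2):
    """Rough sketch of k-bit NormalFloat-style levels:
    1) estimate 2^k + 1 quantile boundaries of N(0, 1) (the offset keeps them finite),
    2) take midpoints of neighbouring boundaries -> 2^k levels,
    3) normalize the levels into [-1, 1] by the absolute maximum."""
    n = 2 ** k
    probs = np.linspace(offset, 1.0 - offset, n + 1)    # 2^k + 1 positions in (0, 1)
    boundaries = norm.ppf(probs)                        # quantiles of N(0, 1)
    levels = 0.5 * (boundaries[:-1] + boundaries[1:])   # midpoint estimator, as in the paper
    return levels / np.abs(levels).max()

print(approx_nf_levels(4))   # 16 values in [-1, 1], dense near zero, sparse in the tails
```

The resulting levels are spaced tightly around zero and widely in the tails, which is exactly what a normally distributed weight tensor needs.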

2.2. Double Quantization (DQ)

  • Double Quantization treats the quantization constants c2^FP32 of the first quantization as inputs to a second quantization.
  • This second step yields the quantized quantization constants c2^FP8 and the second level of quantization constants c1^FP32.
  • 8-bit Floats are used with a blocksize of 256 for the second quantization.
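A quick back-of-the-envelope check, using the blocksizes of 64 and 256 reported in the paper, shows the saving per parameter:

```python
# Memory overhead of the quantization constants, in bits per model parameter.
blocksize_w, blocksize_c = 64, 256

without_dq = 32 / blocksize_w                                   # one FP32 constant per 64 weights -> 0.5 bits/param
with_dq = 8 / blocksize_w + 32 / (blocksize_w * blocksize_c)    # FP8 constants + FP32 second-level constants
print(without_dq, with_dq, without_dq - with_dq)                # 0.5, ~0.127, ~0.373 bits/param saved
```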

2.3. Paged Optimizers

  • Paged Optimizers use the NVIDIA unified memory feature, which performs automatic page-to-page transfers between the CPU and GPU. The optimizer states are automatically evicted to CPU RAM when the GPU runs out of memory and paged back into GPU memory when they are needed for the optimizer update step.
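For reference, paged optimizers are exposed by the bitsandbytes library; a minimal sketch, assuming a recent bitsandbytes version and an already-prepared model whose only trainable parameters are the LoRA weights:

```python
import bitsandbytes as bnb

# Optimizer states are allocated in unified memory, so they can be paged out to
# CPU RAM during memory spikes (e.g. long sequences) and paged back in for the update.
optimizer = bnb.optim.PagedAdamW32bit(
    [p for p in model.parameters() if p.requires_grad],   # e.g. only the LoRA parameters
    lr=2e-4,
)
```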

2.4. QLoRA

  • Using the components described above, QLoRA for a single linear layer in the quantized base model with a single LoRA adapter is defined as:
    Y^BF16 = X^BF16 · doubleDequant(c1^FP32, c2^FP8, W^NF4) + X^BF16 · L1^BF16 · L2^BF16,
    where doubleDequant(c1^FP32, c2^FP8, W^NF4) = dequant(dequant(c1^FP32, c2^FP8), W^NF4) = W^BF16.
  • NF4 is used for W and FP8 is used for c2. A blocksize of 64 is used for W for higher quantization precision, and a blocksize of 256 is used for c2 to conserve memory.

To summarize, QLoRA has one storage data type (usually 4-bit NormalFloat) and a computation data type (16-bit BrainFloat). QLoRA dequantizes the storage data type to the computation data type to perform the forward and backward pass, but it only computes weight gradients for the LoRA parameters, which use 16-bit BrainFloat.
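In practice, this setup corresponds roughly to the following Hugging Face configuration (a sketch, assuming recent transformers, peft, and bitsandbytes versions; the base model name and LoRA hyperparameters are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Storage data type: 4-bit NF4 with double quantization; computation data type: bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",            # illustrative base model
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters in 16-bit on top of the frozen 4-bit base weights.
lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()     # only the LoRA parameters are trainable
```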

3. QLoRA & Guanaco Results

3.1. QLoRA Performance

Mean Zero-Shot Accuracy
Mean Perplexity

4-bit NormalFloat yields better performance than 4-bit Floating Point.

NF4 improves performance significantly over FP4 and Int4, and double quantization reduces the memory footprint without degrading performance.

Mean 5-Shot MMLU Accuracy

4-bit QLoRA with NF4 data type matches 16-bit full finetuning and 16-bit LoRA finetuning performance on academic benchmarks with well-established evaluation setups.

  • NF4 is more effective than FP4, and double quantization does not degrade performance.

3.2. Guanaco: QLoRA trained on OASST1 is a State-of-the-art Chatbot

Elo Ratings
Zero-Shot Vicuna Benchmark Scores
  • Based on automated and human evaluations, the top QLoRA-tuned model, Guanaco 65B, which is finetuned on a variant of the OASST1 dataset, emerges as the strongest system.

Guanaco 65B is the best-performing open-source chatbot model and offers performance competitive to ChatGPT.

  • When compared to GPT-4, Guanaco 65B and 33B have an expected win probability of 30%, based on Elo ratings from human annotators' system-level pairwise comparisons on the Vicuna benchmark, the highest reported to date.
  • 4-bit QLoRA is effective and can produce state-of-the-art chatbots that rival ChatGPT. Furthermore, the 33B Guanaco can be trained on a 24 GB consumer GPU in less than 12 hours.


Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.