# Brief Review — QLoRA: Efficient Finetuning of Quantized LLMs

## QLoRA, LoRA With Quantization. A Model Family, **Guanaco**, Is Proposed.

QLoRA: Efficient Finetuning of Quantized LLMs, by University of Washington

QLoRA, Guanaco, 2023 NeurIPS, Over 650 Citations (Sik-Ho Tsang @ Medium)


- **QLoRA** invents (1) **4-bit NormalFloat (NF4)**, a new data type that is information-theoretically optimal for normally distributed weights; (2) **Double Quantization**, which reduces the average memory footprint by quantizing the quantization constants; and (3) **Paged Optimizers**, which manage memory spikes.
- By using QLoRA, **a model family**, namely **Guanaco**, is constructed.

# Outline

1. **Background**
2. **QLoRA Finetuning**
3. **QLoRA & Guanaco Results**

# 1. Background

## 1.1. Block-wise k-bit Quantization

- The input data type is commonly rescaled into the target data type range through normalization by the absolute maximum of the input elements, which are usually structured as a tensor.
- For example, **quantizing a 32-bit Floating Point (FP32) tensor into an Int8 tensor with range [−127, 127]**:

$$\mathbf{X}^{\text{Int8}} = \text{round}\left(\frac{127}{\text{absmax}(\mathbf{X}^{\text{FP32}})}\,\mathbf{X}^{\text{FP32}}\right) = \text{round}(c^{\text{FP32}} \cdot \mathbf{X}^{\text{FP32}})$$

- where *c* is the quantization constant or quantization scale. **Dequantization** is the inverse:

$$\text{dequant}(c^{\text{FP32}}, \mathbf{X}^{\text{Int8}}) = \frac{\mathbf{X}^{\text{Int8}}}{c^{\text{FP32}}} = \mathbf{X}^{\text{FP32}}$$
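The block-wise absmax round trip can be sketched in a few lines of plain Python (a minimal illustration with made-up helper names, not the paper's CUDA implementation):

```python
def quantize_absmax_int8(x, block_size=64):
    """Block-wise absmax quantization of a float list into Int8.

    Each block of `block_size` values is scaled independently so its
    absolute maximum maps to 127; the per-block scale c is the
    quantization constant kept around for dequantization.
    """
    blocks, constants = [], []
    for i in range(0, len(x), block_size):
        block = x[i:i + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # guard all-zero blocks
        c = 127.0 / absmax                          # quantization constant
        blocks.append([round(c * v) for v in block])
        constants.append(c)
    return blocks, constants


def dequantize(blocks, constants):
    """Inverse mapping: divide each Int8 value by its block's constant."""
    out = []
    for block, c in zip(blocks, constants):
        out.extend(v / c for v in block)
    return out
```

Smaller blocks give each outlier its own scale, which is exactly why QLoRA later uses a blocksize of 64 for the weights.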

## 1.2. Low-rank Adapter (LoRA) Finetuning

- With the full model parameters *W* **remaining fixed**, **LoRA** **augments a linear projection through an additional factorized projection.** Given a **projection** *XW* = *Y* with *X* ∈ ℝ^(b×h) and *W* ∈ ℝ^(h×o), LoRA computes:

$$\mathbf{Y} = \mathbf{X}\mathbf{W} + s\,\mathbf{X}\mathbf{L}_1\mathbf{L}_2$$

- where *L*₁ ∈ ℝ^(h×r), *L*₂ ∈ ℝ^(r×o), and *s* is a scalar.
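The factorized update can be sketched as follows (a toy pure-Python version; real implementations use framework tensor ops, and the function names here are illustrative):

```python
def matmul(A, B):
    """Naive matrix multiply over nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def lora_forward(X, W, L1, L2, s=1.0):
    """Y = XW + s * X L1 L2.

    W stays frozen; only the low-rank factors L1 (h x r) and
    L2 (r x o) receive gradients, scaled by the scalar s.
    """
    base = matmul(X, W)                   # frozen projection
    delta = matmul(matmul(X, L1), L2)     # low-rank adapter path
    return [[b + s * d for b, d in zip(br, dr)]
            for br, dr in zip(base, delta)]
```

Because r is much smaller than h and o, the adapter adds only h·r + r·o trainable parameters on top of the frozen h·o weight.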

# 2. QLoRA Finetuning

- QLoRA further reduces the memory requirement and achieves high-fidelity 4-bit finetuning via **two techniques**: **4-bit NormalFloat (NF4) quantization** and **Double Quantization**. Additionally, **Paged Optimizers** are introduced to **prevent memory spikes**.

## 2.1. 4-bit NormalFloat Quantization

- The NormalFloat (NF) data type builds on **Quantile Quantization**. However, the process of quantile estimation is **expensive**.
- Since **pretrained neural network weights** usually have **a zero-centered normal distribution with standard deviation *σ***, we can transform all weights to a single fixed distribution by scaling *σ* such that the distribution fits exactly into the range of our data type. The arbitrary **data range** can be set to **[−1, 1]**.

1. **Estimate the 2^*k* + 1 quantiles** of a theoretical *N*(0, 1) distribution to obtain a *k*-bit quantile quantization data type for normal distributions.
2. Take this data type and **normalize its values into the [−1, 1] range.**
3. **Quantize an input weight tensor** by normalizing it into the [−1, 1] range through absolute maximum rescaling.

- **Formally, the 2^*k* values *q_i* of the data type are estimated as follows:**

$$q_i = \frac{1}{2}\left(Q_X\!\left(\frac{i}{2^k+1}\right) + Q_X\!\left(\frac{i+1}{2^k+1}\right)\right)$$

- where *Q_X*(·) is the quantile function of the standard normal distribution *N*(0, 1).
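The three steps above can be sketched with the standard library's normal quantile function. This is only the symmetric quantile-averaging idea; the paper's actual NF4 additionally guarantees an exact zero code and uses an asymmetric split, and the `offset` below is an assumed stand-in for the paper's offset trick (the endpoint quantiles *Q*(0) and *Q*(1) are infinite):

```python
from statistics import NormalDist


def nf_levels(k=4, offset=0.02):
    """Sketch of a k-bit NormalFloat codebook: average adjacent
    N(0, 1) quantiles, then normalize into [-1, 1]."""
    n = 2 ** k
    nd = NormalDist()  # standard normal N(0, 1)
    # 2^k + 1 quantile probabilities, endpoints pulled in by `offset`
    p = [offset + i * (1 - 2 * offset) / n for i in range(n + 1)]
    q = [nd.inv_cdf(pi) for pi in p]   # Q_X: quantile function of N(0, 1)
    # q_i = (Q(i/(2^k+1)) + Q((i+1)/(2^k+1))) / 2, step 1 of the recipe
    levels = [(a + b) / 2 for a, b in zip(q, q[1:])]
    m = max(abs(v) for v in levels)
    return [v / m for v in levels]     # step 2: normalize into [-1, 1]
```

Quantizing a weight then amounts to rescaling it into [−1, 1] by its block's absmax and snapping to the nearest level (step 3).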

## 2.2. Double Quantization (DQ)

- Double Quantization treats the **quantization constants *c*₂^FP32 of the first quantization** as **inputs to a second quantization.**
- This **second** step yields the **quantized quantization constants *c*₂^FP8** and the **second level of quantization constants *c*₁^FP32**. **8-bit Floats** with a blocksize of 256 are used for the second quantization.
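The memory saving is easy to check by hand: with a blocksize of 64, one FP32 constant per block costs 32/64 = 0.5 bits per parameter, while double quantization replaces it with an FP8 constant plus one FP32 second-level constant per 256 blocks:

```python
# Average per-parameter cost of quantization constants, in bits
# (blocksize 64 for the first level, 256 for the second, as in the paper).
no_dq = 32 / 64                        # one FP32 constant per 64 weights
with_dq = 8 / 64 + 32 / (64 * 256)     # FP8 constant + second-level FP32
print(f"{no_dq:.3f} vs {with_dq:.3f} bits/param, saves {no_dq - with_dq:.3f}")
```

This matches the paper's reported saving of roughly 0.373 bits per parameter.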

## 2.3. Paged Optimizers

- Paged Optimizers use the **NVIDIA unified memory feature**, which performs **automatic page-to-page transfers between the CPU and GPU**: optimizer states are **automatically evicted to CPU RAM when the GPU runs out of memory** and paged back into GPU memory when they are needed in the optimizer update step.

## 2.4. QLoRA

- Using the components described above, **QLoRA for a single linear layer in the quantized base model** with a single LoRA adapter is defined as follows:

$$\mathbf{Y}^{\text{BF16}} = \mathbf{X}^{\text{BF16}}\,\text{doubleDequant}(c_1^{\text{FP32}}, c_2^{\text{k-bit}}, \mathbf{W}^{\text{NF4}}) + \mathbf{X}^{\text{BF16}}\mathbf{L}_1^{\text{BF16}}\mathbf{L}_2^{\text{BF16}}$$

- **NF4** is used for *W* and **FP8** is used for *c*₂. **A blocksize of 64** is used for *W* for higher quantization precision, and **a blocksize of 256** is used for *c*₂ to conserve memory.

To summarize,

QLoRA has one storage data type (usually 4-bit NormalFloat) and a computation data type (16-bit BrainFloat). QLoRA dequantizes the storage data type to the computation data type to perform the forward and backward pass, but only computes weight gradients for the LoRA parameters, which use 16-bit BrainFloat.
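The summary above can be sketched as a toy single-block forward pass. Everything here is illustrative: Python floats stand in for BF16, a tiny hypothetical codebook stands in for NF4, and `double_dequant`/`qlora_linear` are assumed names, not the paper's kernels:

```python
def matmul(A, B):
    """Naive matrix multiply over nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def double_dequant(c1, c2_q, W_codes, codebook):
    """Recover W from storage: dequantize the quantized constant c2 with
    c1, then map each code through the codebook and rescale.
    (Flat single-block toy; real kernels do this block-wise.)"""
    c2 = c2_q / c1                       # second-level dequantization
    return [[codebook[code] * c2 for code in row] for row in W_codes]


def qlora_linear(X, W_codes, c1, c2_q, codebook, L1, L2, s=1.0):
    """Forward pass: frozen dequantized base path + trainable LoRA path."""
    W = double_dequant(c1, c2_q, W_codes, codebook)  # storage -> compute dtype
    base = matmul(X, W)                              # frozen 4-bit base weight
    delta = matmul(matmul(X, L1), L2)                # 16-bit LoRA adapters
    return [[b + s * d for b, d in zip(br, dr)]
            for br, dr in zip(base, delta)]
```

Only `L1` and `L2` would receive gradients during training; the dequantized `W` is used for the forward/backward pass and then discarded.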

# 3. QLoRA & Guanaco Results

## 3.1. QLoRA Performance

**4-bit NormalFloat yields better performance than 4-bit Floating Point.**

**NF4 improves performance significantly over FP4 and Int4**, and **double quantization reduces the memory footprint without degrading performance.**

**4-bit QLoRA with the NF4 data type matches 16-bit full finetuning and 16-bit LoRA finetuning performance** on academic benchmarks with well-established evaluation setups.

## 3.2. Guanaco: QLoRA Trained on OASST1 is a State-of-the-art Chatbot

- Based on the automated and human evaluations, **the top QLoRA-tuned model, Guanaco 65B, is constructed by finetuning on a variant of OASST1.**

**Guanaco 65B** is the **best-performing open-source chatbot model** and offers performance competitive with ChatGPT.

- **When compared to GPT-4, Guanaco 65B and 33B have an expected win probability of 30%, based on Elo ratings** from human annotators' system-level pairwise comparisons on the Vicuna benchmark, the highest reported to date.
- 4-bit QLoRA is effective and can produce state-of-the-art chatbots that rival ChatGPT. Furthermore, **33B Guanaco can be trained on 24 GB consumer GPUs in less than 12 hours.**