# Brief Review — Block-wise Dynamic Quantization

## Block-wise Dynamic Quantization

8-bit Optimizers via Block-wise Quantization, by Facebook AI Research and University of Washington

Block-wise Dynamic Quantization, 2022 ICLR, Over 120 Citations (Sik-Ho Tsang @ Medium)

NLP, Language Model, Image Classification, Self-Supervised Learning, Machine Translation

- (Recently, DeepLearning.AI released a new short course, “Quantization Fundamentals with Hugging Face”, which prompted me to read a paper about quantization in deep learning.)
- In this paper, **block-wise quantization** is proposed, which **divides input tensors into smaller blocks that are independently quantized.** Each block is processed in parallel across cores, yielding **faster** optimization and **high-precision** quantization.
- To maintain stability and performance, block-wise quantization is combined with two additional changes: (1) **dynamic quantization**, a form of non-linear quantization that is precise for both large- and small-magnitude values, and (2) **a stable embedding layer** to reduce gradient variance in language model training.

# Outline

1. **Background**
2. **Block-wise Dynamic Quantization**
3. **Results**

# 1. Background

## 1.1. Motivation

**SGD with momentum** and **Adam optimizers** are formulated as below:
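- Written out in standard notation (with learning rate *α*, gradient *g_t*, and smoothing constants *β1*, *β2*, *ε*; bias correction omitted), the updates are roughly:

```latex
% SGD with momentum: one state m_t per parameter (4 bytes in 32-bit)
m_t = \beta_1 m_{t-1} + g_t, \qquad w_t = w_{t-1} - \alpha\, m_t

% Adam: two states m_t and r_t per parameter (8 bytes in 32-bit)
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad
r_t = \beta_2 r_{t-1} + (1 - \beta_2)\, g_t^2, \qquad
w_t = w_{t-1} - \alpha\, \frac{m_t}{\sqrt{r_t} + \epsilon}
```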

For 32-bit states, Momentum and Adam consume 4 and 8 bytes per parameter, i.e. 4 GB and 8 GB of optimizer memory for a 1B-parameter model. The proposed 8-bit non-linear quantization reduces these costs to 1 GB and 2 GB.

## 1.2. Non-Linear Quantization

**Quantization:** To perform general quantization from one data type into another we require **3 steps**:

1. **Compute a normalization constant** *N* that **transforms the input tensor** *T* into the range of the domain *D* of the target quantization data type *Q^map*;
2. **For each element of** *T*/*N*, find the closest corresponding value *q_i* in the domain *D*;
3. **Store the index** *i* corresponding to *q_i* in the quantized output tensor *T^Q*.

**Dynamic Quantization**: To perform this procedure for dynamic quantization we **first normalize into the range [-1, 1] through division by the absolute maximum value:**
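- In symbols, the normalization constant is the absolute maximum of the tensor:

```latex
N = \max_i\left( |T_i| \right)
```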

- The **closest values** are found via a binary search:
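- That is, each element stores the index of the nearest value in the quantization map:

```latex
T^{Q}_i = \arg\min_{j} \left| Q^{\mathrm{map}}_j - \frac{T_i}{N} \right|
```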

**Dequantization**: To receive the **dequantized** tensor *T^D* we look up the index and denormalize:
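- In symbols:

```latex
T^{D}_i = Q^{\mathrm{map}}\left( T^{Q}_i \right) \cdot N
```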

## 1.3. Dynamic Tree Quantization

- Dynamic tree quantization (Dettmers, 2016) is a method that **yields low quantization error for both small- and large-magnitude values.**
- It is made up of **4 parts**, as seen in Figure 2 above:

- The **first bit** of the data type is reserved for a **sign**.
- The **number of subsequent zero bits** indicates **the magnitude of the exponent.**
- The **first bit** that is set to **one** indicates that all following values are **reserved for (4) linear quantization.**
- By moving the indicator bit, **numbers can have a large exponent of 10^(-7) or precision as high as 1/63.**

- Dynamic tree quantization is strictly defined to quantize numbers in the range [-1.0, 1.0].

# 2. Block-wise Dynamic Quantization

- With the above components, performing an optimizer update with 8-bit states is straightforward: **the 8-bit optimizer states are dequantized to 32-bit to perform the update, and then the states are quantized back to 8-bit for storage**, as sketched below.
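A minimal sketch of this pattern in PyTorch is below. For brevity it uses plain per-tensor absmax linear quantization in place of the paper's block-wise dynamic tree codebook, and the helper names are illustrative rather than the bitsandbytes API.

```python
import torch

def absmax_quantize(x):
    # Scale by the absolute maximum so values fall in [-1, 1], then round to int8.
    n = x.abs().max().clamp(min=1e-12)
    q = torch.round(x / n * 127).to(torch.int8)
    return q, n

def absmax_dequantize(q, n):
    # Inverse of absmax_quantize: rescale the int8 codes back to float32.
    return q.to(torch.float32) / 127 * n

def adam_step_8bit(p, grad, m_q, m_n, r_q, r_n,
                   lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # 1) Dequantize the stored 8-bit optimizer states to 32-bit.
    m = absmax_dequantize(m_q, m_n)
    r = absmax_dequantize(r_q, r_n)
    # 2) Perform the ordinary 32-bit Adam update (bias correction omitted).
    m = b1 * m + (1 - b1) * grad
    r = b2 * r + (1 - b2) * grad.pow(2)
    p = p - lr * m / (r.sqrt() + eps)
    # 3) Quantize the updated states back to 8-bit for storage.
    m_q, m_n = absmax_quantize(m)
    r_q, r_n = absmax_quantize(r)
    return p, m_q, m_n, r_q, r_n
```

The only change from a standard Adam step is the quantize/dequantize codec wrapped around reading and writing the two optimizer states.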

## 2.1. Block-wise Quantization

Block-wise dynamic quantization reduces the cost of computing the normalization constant by chunking the input tensor into small blocks of size *B* = 2048 and performing normalization independently in each core across its block (a PyTorch sketch follows the list of advantages below).

- This means for an input tensor *T* with *n* elements we have *n*/*B* blocks. We proceed to compute **a normalization constant for each block**:
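- In symbols, for block *b* of size *B*:

```latex
N_b = \max_i\left( |T_{b,i}| \right), \qquad 0 \le b < n/B
```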

- With this block-wise normalization constant, **each block can be quantized independently:**
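- That is:

```latex
T^{Q}_{b,i} = \arg\min_{j} \left| Q^{\mathrm{map}}_j - \frac{T_{b,i}}{N_b} \right|, \qquad 0 \le i < B
```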

- This approach has several **advantages**, both for **stability** and **efficiency**:
- First, each block normalization can be computed independently. Thus **no synchronization between cores is required**, and **throughput is enhanced.**
- Secondly, it is also much **more robust to outliers** in the input tensor.
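A minimal PyTorch sketch of block-wise quantization and dequantization is shown below, assuming a generic sorted 256-entry codebook `qmap` in [-1, 1]; the real implementation uses the dynamic tree codebook and fused GPU kernels rather than this per-element argmin.

```python
import torch

B = 2048  # block size used in the paper

def quantize_blockwise(t, qmap, block_size=B):
    # Split the flattened tensor into blocks and normalize each block by its own absmax.
    flat = t.flatten()
    pad = (-flat.numel()) % block_size
    flat = torch.cat([flat, flat.new_zeros(pad)])            # pad the last block
    blocks = flat.view(-1, block_size)

    absmax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)  # N_b per block
    normed = blocks / absmax                                           # now in [-1, 1]

    # Index of the nearest codebook entry for every element.
    idx = torch.argmin((normed.unsqueeze(-1) - qmap).abs(), dim=-1).to(torch.uint8)
    return idx, absmax.squeeze(1)

def dequantize_blockwise(idx, absmax, qmap):
    # Look up codebook values and undo the per-block normalization.
    return (qmap[idx.long()] * absmax.unsqueeze(1)).flatten()

# Example with a linear stand-in codebook (the paper uses dynamic tree quantization).
qmap = torch.linspace(-1.0, 1.0, 256)
x = torch.randn(5000)
q, n = quantize_blockwise(x, qmap)
x_hat = dequantize_blockwise(q, n, qmap)[: x.numel()]
print((x - x_hat).abs().max())   # small per-block quantization error
```

Because each block's absmax is computed locally, a single outlier only degrades precision within its own 2048-element block.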

## 2.2. Dynamic Quantization

- Since
**the second Adam state is strictly positive**, in this work, dynamic tree quantization is extended for non-signed input tensors by**re-purposing the sign bit.** - The dynamic tree quantization is
**extended with a fixed bit for the fraction.**This extension is motivated by the observation that**the second Adam state varies around 3–5 orders of magnitude during the training of a language model.**In comparison, dynamic tree quantization already has a range of 7 orders of magnitude.

## 2.3. Stable Embedding Layer

- The stable embedding layer is a variation of the standard word embedding layer (Devlin et al., 2019) designed to **ensure stable training for NLP tasks.**

The Stable Embedding Layer is initialized with Xavier uniform initialization (Glorot and Bengio, 2010), and layer normalization is applied before adding position embeddings, as sketched below.

- This method maintains a variance of roughly one both at initialization and during training. Additionally, the uniform initialization has fewer extreme values than a normal distribution, reducing the maximum gradient size.
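A minimal sketch of such a layer in PyTorch, assuming a (batch, seq) input of token ids; the module and argument names here are illustrative, not the bitsandbytes `StableEmbedding` API.

```python
import torch
import torch.nn as nn

class StableEmbedding(nn.Module):
    def __init__(self, vocab_size, max_len, dim):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        # Xavier uniform init: fewer extreme values than a normal init.
        nn.init.xavier_uniform_(self.word_emb.weight)
        nn.init.xavier_uniform_(self.pos_emb.weight)
        self.norm = nn.LayerNorm(dim)

    def forward(self, token_ids):
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        # Layer normalization is applied before the position embeddings are added.
        return self.norm(self.word_emb(token_ids)) + self.pos_emb(pos)
```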

# 3. Results

- **8-bit optimizers match replicated 32-bit performance for all tasks.**
- The proposed 8-bit optimizers **save up to 8.5 GB of GPU memory** for the largest 1.5B-parameter language model and 2.0 GB for RoBERTa.
- The models are now **accessible with smaller GPUs.**

- The ablations show that **dynamic quantization, block-wise quantization, and the stable embedding layer are critical** for either performance or stability. In addition, block-wise quantization is critical for large-scale language model stability.