Brief Review — Block-wise Dynamic Quantization
Block-wise Dynamic Quantization
8-bit Optimizers via Block-wise Quantization
Block-wise Dynamic Quantization, by Facebook AI Research and University of Washington
2022 ICLR, Over 120 Citations (Sik-Ho Tsang @ Medium)
NLP, Language Model, Image Classification, Self-Supervised Learning, Machine Translation
==== My Other Paper Readings Are Also Over Here ====
- (Recently, DeepLearning.AI released a new short course, “Quantization Fundamentals with Hugging Face”, which prompted me to read a paper about quantization in deep learning.)
- In this paper, block-wise quantization is proposed, which divides input tensors into smaller blocks that are independently quantized. Each block is processed in parallel across cores, yielding faster optimization and high precision quantization.
- To maintain stability and performance, block-wise quantization is combined with two additional changes: (1) dynamic quantization, a form of non-linear quantization that is precise for both large and small magnitude values, and (2) a stable embedding layer to reduce gradient variance in language model training.
Outline
- Background
- Block-wise Dynamic Quantization
- Results
1. Background
1.1. Motivation
- SGD with momentum and Adam optimizers are formulated as below:
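For reference, the standard forms of these update rules (with weights w_t, gradient g_t, learning rate α, decay parameters β1 and β2, and small constant ε; bias correction omitted for brevity) are:

```latex
% SGD with momentum: one 32-bit state m_t per parameter
m_t = \beta_1 m_{t-1} + g_t, \qquad w_{t+1} = w_t - \alpha\, m_t

% Adam: two 32-bit states m_t and r_t per parameter
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad
r_t = \beta_2 r_{t-1} + (1 - \beta_2)\, g_t^2, \qquad
w_{t+1} = w_t - \alpha \frac{m_t}{\sqrt{r_t} + \epsilon}
```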
For 32-bit states, Momentum and Adam consume 4 and 8 bytes per parameter, i.e. 4 GB and 8 GB for a model with 1B parameters. The proposed 8-bit non-linear quantization reduces these costs to 1 GB and 2 GB.
1.2. Non-Linear Quantization
- Quantization: To perform general quantization from one data type into another we require 3 steps:
- Compute a normalization constant N that transforms the input tensor T into the range of the domain D of the target quantization data type Qmap;
- For each element of T/N, find the closest corresponding value qi in the domain D.
- Store the index i corresponding to qi in the quantized output tensor TQ.
- Dynamic Quantization: To perform this procedure for dynamic quantization, we first normalize into the range [-1, 1] through division by the absolute maximum value, N = max(|T|).
- The closest values qi are then found via a binary search over the quantization map.
- Dequantization: To recover the dequantized tensor TD, we look up the stored index and denormalize: TD_i = Qmap(TQ_i) · N.
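A minimal NumPy sketch of these steps: normalize by the absolute maximum, find the closest value in the quantization map via binary search, store its index, then look up and denormalize. The linear code map here is only a placeholder for the paper's dynamic tree map:

```python
import numpy as np

# Placeholder quantization map Qmap: a linear grid over [-1, 1].
# (The paper uses dynamic tree quantization values instead.)
code = np.linspace(-1.0, 1.0, 256)

def quantize(T, code):
    N = np.max(np.abs(T))                      # normalization constant (absmax)
    normed = T / N                             # values now lie in [-1, 1]
    idx = np.searchsorted(code, normed)        # binary search for insertion point
    idx = np.clip(idx, 1, len(code) - 1)
    left, right = code[idx - 1], code[idx]
    idx -= (normed - left) < (right - normed)  # snap to the nearer neighbour
    return idx.astype(np.uint8), N             # 8-bit indices plus the constant N

def dequantize(TQ, N, code):
    return code[TQ] * N                        # look up q_i and denormalize

T = np.random.randn(1000).astype(np.float32)
TQ, N = quantize(T, code)
TD = dequantize(TQ, N, code)                   # ≈ T, up to quantization error
```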
1.3. Dynamic Tree Quantization
- Dynamic Tree quantization (Dettmers, 2016) is a method that yields low quantization error for both small and large magnitude values.
- It is made up of 4 parts, as seen in Figure 2 above:
- The first bit of the data type is reserved for a sign.
- The number of subsequent zero bits indicates the magnitude of the exponent.
- The first bit that is set to one is an indicator bit; all the following bits are used for linear quantization.
- By moving the indicator bit, numbers can have either a large exponent, down to 10^(-7), or precision as high as 1/63.
- Dynamic tree quantization is strictly defined to quantize numbers in the range [-1.0, 1.0].
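A hedged sketch of how a single 8-bit dynamic tree code could be decoded according to the description above (sign bit, leading zeros as exponent, indicator bit, linear fraction); the exact mapping used in the paper may differ in details such as how the fraction bits are spaced:

```python
def decode_dynamic_tree(code: int) -> float:
    """Decode one 8-bit dynamic tree code into a value in [-1, 1] (sketch)."""
    sign = -1.0 if (code & 0x80) else 1.0     # (1) sign bit
    bits = code & 0x7F                        # remaining 7 bits

    exponent = 0                              # (2) count the leading zero bits
    mask = 0x40
    while mask and not (bits & mask):
        exponent += 1
        mask >>= 1

    if mask == 0:                             # no indicator bit set:
        return sign * 10.0 ** -7              # smallest representable magnitude

    # (3) `mask` is the indicator bit; (4) the bits below it are the linear fraction.
    n_frac = mask.bit_length() - 1            # number of fraction bits left
    frac_bits = bits & (mask - 1)
    frac = frac_bits / (2 ** n_frac - 1) if n_frac else 1.0
    return sign * 10.0 ** -exponent * frac

# With no leading zeros the step size is 1/63; with seven leading zeros the
# magnitude reaches 10^-7, matching the range quoted above.
print(decode_dynamic_tree(0b01111111))        # 1.0
print(decode_dynamic_tree(0b10000001))        # -1e-06
```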
2. Block-wise Dynamic Quantization
- With the above components, performing an optimizer update with 8-bit states is straightforward: the 8-bit optimizer states are dequantized to 32-bit to perform the update, and then quantized back to 8-bit for storage.
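In practice, the released bitsandbytes library packages these 8-bit optimizers as drop-in replacements. Below is a minimal usage sketch (not from the post itself; it assumes a CUDA device and the bnb.optim.Adam8bit API of recent bitsandbytes releases):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()

# Drop-in replacement for torch.optim.Adam: optimizer states are stored in
# 8 bits with block-wise dynamic quantization and are dequantized to 32 bits
# only for the duration of each update step.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3, betas=(0.9, 0.995))

x = torch.randn(16, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```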
2.1. Block-wise Quantization
In standard dynamic quantization, computing the normalization constant requires a reduction, and thus synchronization, across the whole tensor. Block-wise dynamic quantization reduces this cost by chunking an input tensor into small blocks of size B = 2048 and performing normalization independently within each block on each core.
- This means for an input tensor T with n elements we have n/B blocks. We proceed to compute a normalization constant for each block, N_b = max(|T_b|), the absolute maximum within block b.
- With this block-wise normalization constant, each block can be quantized independently (a code sketch follows the list of advantages below).
- This approach has several advantages, both for stability and efficiency:
- First, each block normalization can be computed independently. Thus no synchronization between cores is required, and throughput is enhanced.
- Second, block-wise quantization is much more robust to outliers in the input tensor, since an outlier affects the normalization of only its own block.
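A minimal NumPy sketch of this block-wise variant (again with a placeholder linear code map rather than the paper's dynamic tree map); each block of B values gets its own absmax constant, so an outlier distorts only its own block:

```python
import numpy as np

code = np.linspace(-1.0, 1.0, 256)                # placeholder code map, as before

def blockwise_quantize(T, code, B=2048):
    flat = T.reshape(-1).astype(np.float32)
    n = flat.size
    flat = np.pad(flat, (0, (-n) % B))            # pad up to a multiple of B
    blocks = flat.reshape(-1, B)                  # n/B blocks of size B
    N = np.abs(blocks).max(axis=1, keepdims=True) # one absmax constant per block
    N[N == 0] = 1.0                               # guard against all-zero blocks
    normed = blocks / N                           # each block now lies in [-1, 1]
    idx = np.abs(normed[..., None] - code).argmin(axis=-1)  # closest code value
    return idx.astype(np.uint8), N.ravel(), n

def blockwise_dequantize(TQ, N, n, code):
    return (code[TQ] * N[:, None]).reshape(-1)[:n]

T = np.random.randn(5000).astype(np.float32)
T[123] = 100.0                                    # outlier: only its block is affected
TQ, N, n = blockwise_quantize(T, code)
TD = blockwise_dequantize(TQ, N, n, code)
```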
2.2. Dynamic Quantization
- Since the second Adam state is strictly positive, dynamic tree quantization is extended in this work to non-signed input tensors by re-purposing the sign bit.
- Dynamic tree quantization is also extended with a fixed bit for the fraction. This extension is motivated by the observation that the second Adam state varies over about 3–5 orders of magnitude during language model training, whereas dynamic tree quantization already covers a range of 7 orders of magnitude.
2.3. Stable Embedding Layer
- The stable embedding layer is a variation of the standard word embedding layer (Devlin et al., 2019), designed to ensure stable training for NLP tasks.
The Stable Embedding Layer is initialized with Xavier uniform initialization (Glorot and Bengio, 2010) and layer normalization is applied before adding position embeddings.
- This method maintains a variance of roughly one both at initialization and during training. Additionally, the uniform distribution initialization has less extreme values than a normal distribution, reducing maximum gradient size.
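A PyTorch sketch of such a layer, following the description above (Xavier-uniform initialization plus layer normalization applied before position embeddings are added); the class below is illustrative, not the paper's exact code:

```python
import torch
import torch.nn as nn

class StableEmbedding(nn.Module):
    """Word embedding with Xavier-uniform init and LayerNorm (sketch)."""

    def __init__(self, num_embeddings: int, embedding_dim: int):
        super().__init__()
        self.embed = nn.Embedding(num_embeddings, embedding_dim)
        # Uniform initialization has fewer extreme values than a normal one,
        # which keeps the maximum gradient size smaller.
        nn.init.xavier_uniform_(self.embed.weight)
        # Layer normalization keeps the output variance close to one,
        # both at initialization and during training.
        self.norm = nn.LayerNorm(embedding_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Position embeddings would be added after this layer.
        return self.norm(self.embed(token_ids))

emb = StableEmbedding(num_embeddings=50000, embedding_dim=512)
tokens = torch.randint(0, 50000, (8, 128))
out = emb(tokens)          # shape (8, 128, 512), variance ≈ 1
```

(The released bitsandbytes library also ships a ready-made version of this layer as bnb.nn.StableEmbedding.)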
3. Results
8-bit optimizers match replicated 32-bit performance for all tasks.
The proposed 8-bit optimizers save up to 8.5 GB of GPU memory for the largest 1.5B parameter language model and 2.0 GB for RoBERTa.
The models are now accessible with smaller GPUs.
The ablations show that dynamic quantization, block-wise quantization, and the stable embedding layer are each critical for either performance or stability. In particular, block-wise quantization is critical for large-scale language model stability.