Review — PaLM 2 Technical Report
PaLM 2 Technical Report
PaLM 2, by Google
2023 arXiv v1 (Sik-Ho Tsang @ Medium)
Language Model
- PaLM 2 is a Transformer-based model trained using a mixture of objectives similar to UL2.
- PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction.
Outline
- Scaling Law Experiments & Model Variants
- Training Dataset
- Evaluations
1. Scaling Law Experiments & Model Variants
- Several differently sized models are trained with 4 different compute budgets: 1×10¹⁹, 1×10²⁰, 1×10²¹, and 1×10²² FLOPs.
- Similar to the findings of Chinchilla, the training data size (D) and model size (N) should grow in equal proportions as the FLOPs budget increases (a minimal worked sketch follows this list). Based on these scaling laws, several models from 400M to 15B parameters are trained.
- Three variants of PaLM 2 are trained: Small (S), Medium (M), and Large (L). Unless otherwise indicated, PaLM 2 refers to the Large variant.
- (Recently, there is a tendency for authors/corporations to keep their own secret sauce and not disclose details such as the model architecture, training objectives and strategies, and the dataset, citing AI safety/security concerns, and probably also for commercial reasons.)
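The compute-optimal relationship can be made concrete with a small numerical sketch. The snippet below assumes the common approximation C ≈ 6·N·D and equal-proportion scaling (N ∝ √C, D ∝ √C); the calibration anchor is made up for illustration and is not a value taken from the report.

```python
import math

# Minimal numerical sketch of Chinchilla-style compute-optimal scaling.
# Assumptions (for illustration only, not the fit reported in the paper):
#   - training compute C ≈ 6 * N * D  (N = parameters, D = training tokens)
#   - the optimal N grows as sqrt(C), so D = C / (6 * N) also grows as sqrt(C)
#   - calibration anchor: N_opt = 10e9 parameters at C = 1e22 FLOPs (made up)

ANCHOR_C = 1e22   # FLOPs at the (hypothetical) anchor point
ANCHOR_N = 10e9   # compute-optimal parameter count at the anchor (hypothetical)

def optimal_n_and_d(c_flops: float) -> tuple[float, float]:
    """Return (N_opt, D_opt) for a given compute budget under the assumptions above."""
    n_opt = ANCHOR_N * math.sqrt(c_flops / ANCHOR_C)  # N ∝ sqrt(C)
    d_opt = c_flops / (6.0 * n_opt)                   # from C ≈ 6 * N * D
    return n_opt, d_opt

for c in (1e19, 1e20, 1e21, 1e22):  # the four compute budgets used in the report
    n, d = optimal_n_and_d(c)
    print(f"C = {c:.0e} FLOPs -> N ≈ {n / 1e9:.2f}B params, D ≈ {d / 1e9:.0f}B tokens")
```

Each 10× increase in compute multiplies both N and D by about √10 ≈ 3.2, which is the equal-proportion behavior described above.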
2. Training Dataset
The PaLM 2 pre-training corpus is composed of a diverse set of sources: web documents, books, code, mathematics, and conversational data.
- PaLM 2 is trained on a dataset that includes a higher percentage of non-English data than previous large language models, which is beneficial for multilingual tasks.
- PaLM 2 is also trained on parallel data covering hundreds of languages, in the form of source-and-target text pairs where one side is English.
- Several data cleaning and quality filtering methods are employed (an illustrative sketch follows this list).
- Even though PaLM 2 has a smaller proportion of English data than PaLM, significant improvements are still observed on English evaluation datasets, which is attributed to the higher quality of the data.
- PaLM 2 is additionally trained with a longer context length to support tasks such as long dialog, long-range reasoning and comprehension, and summarization.
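The report does not describe its cleaning pipeline, so the snippet below is only an illustrative sketch of the kind of generic heuristics (exact de-duplication, length and symbol-ratio checks) commonly used when filtering web text for pre-training; the function name and all thresholds are hypothetical.

```python
import re

# Illustrative sketch only: the report does not disclose its cleaning pipeline.
# These are generic heuristics (exact de-duplication, length and symbol-ratio
# checks) commonly used to filter web text; all thresholds here are made up.

def looks_clean(doc: str, seen_hashes: set[int],
                min_words: int = 50, max_symbol_ratio: float = 0.3) -> bool:
    """Return True if the document passes the simple quality heuristics."""
    words = doc.split()
    if len(words) < min_words:                          # too short to be useful
        return False
    symbols = len(re.findall(r"[^\w\s]", doc))
    if symbols / max(len(doc), 1) > max_symbol_ratio:   # mostly markup or noise
        return False
    h = hash(doc.strip().lower())
    if h in seen_hashes:                                # exact duplicate
        return False
    seen_hashes.add(h)
    return True

seen: set[int] = set()
corpus = ["a long enough web document ...", "another candidate document ..."]
filtered = [d for d in corpus if looks_clean(d, seen, min_words=3)]
print(len(filtered))  # 2: both toy documents pass the relaxed thresholds
```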
3. Evaluations
3.1. Language Proficiency Exams
PaLM 2 outperforms PaLM on all exams and achieves a passing grade in every evaluated language, demonstrating proficiency across all of them.
3.2. English QA and Classification
Even the smallest PaLM 2 variant achieves performance competitive with the much larger PaLM 540B model while PaLM 2-M already outperforms PaLM consistently.
3.3. Reasoning
PaLM 2 outperforms PaLM across all datasets and achieves results competitive with GPT-4.
3.4. Coding
PaLM 2-S* outperforms PaLM on all but two languages, with surprisingly little degradation on low-resource languages like Julia and Haskell; for instance, PaLM 2-S* improves upon the much larger PaLM-Coder-540B by 6.3× on Haskell and by 4.7× on Julia.
3.5. Translation
PaLM 2 improves translation quality over both PaLM and Google Translate.
- (There are also natural language generation (NLG), memorization, and toxicity & bias experiments. Please feel free to read the paper directly if interested.)