Review — LLaMA: Open and Efficient Foundation Language Models

LLaMA-13B Outperforms GPT-3 (175B); LLaMA-65B Is Competitive With Chinchilla-70B and PaLM-540B

Sik-Ho Tsang
6 min read · May 20, 2023



LLaMA, by Meta AI,
2023 arXiv v1, Over 20 Citations (Sik-Ho Tsang @ Medium)


  • LLaMA, a collection of foundation language models ranging from 7B to 65B parameters, is proposed.
  • The models are trained on trillions of tokens, using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets.
  • In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B.

Outline

  1. LLaMA
  2. Results

1. LLaMA

1.1. Pretraining Data

Pre-training data.
  • The training dataset is a mixture of several sources, with the restriction of only using data that is publicly available, and compatible with open sourcing.
  1. English CommonCrawl [67%]: Five CommonCrawl dumps, ranging from 2017 to 2020, preprocessed with the CCNet pipeline.
  2. C4 [15%]: The publicly available C4 dataset.
  3. Github [4.5%]: Public GitHub dataset available on Google BigQuery.
  4. Wikipedia [4.5%]: Wikipedia dumps from the June-August 2022 period, covering 20 languages, which use either the Latin or Cyrillic scripts: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk.
  5. Gutenberg and Books3 [4.5%]: Two book corpora including the Gutenberg Project and the Books3 section of ThePile.
  6. ArXiv [2.5%]: arXiv LaTeX files are added to include scientific data.
  7. Stack Exchange [2%]: A dump of Stack Exchange, a website of high quality questions and answers that covers a diverse set of domains, ranging from computer science to chemistry.
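The sampling proportions above can be summarized as a weighted sampler. A minimal sketch, assuming the listed percentages act as per-document sampling weights; the real pipeline also applies per-source epoch counts, deduplication, and quality filtering, and the names below are illustrative.

```python
import random

# Sampling proportions listed above (fraction of training data per source).
MIXTURE = {
    "CommonCrawl": 0.67,
    "C4": 0.15,
    "GitHub": 0.045,
    "Wikipedia": 0.045,
    "Books": 0.045,          # Gutenberg + Books3
    "ArXiv": 0.025,
    "StackExchange": 0.02,
}

def sample_source(rng):
    """Pick the source of the next training document by mixture weight."""
    sources = list(MIXTURE)
    weights = [MIXTURE[s] for s in sources]
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {s: 0 for s in MIXTURE}
for _ in range(20_000):
    counts[sample_source(rng)] += 1
```

With these weights, roughly two out of three training documents come from CommonCrawl, which is why its preprocessing quality dominates the corpus.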

1.2. Model Architecture

Model variants.
  • The Transformer architecture, with various improvements borrowed from later models, is used.
  • Pre-normalization [from GPT-3]: To improve training stability, the input of each sub-layer is normalized instead of the output, using the RMSNorm normalizing function.
  • SwiGLU activation function [from PaLM]: The ReLU non-linearity is replaced by SwiGLU, with a hidden dimension of (2/3)·4d instead of 4d as in PaLM.
  • Rotary Embeddings [from GPTNeo]: The absolute positional embeddings are removed. Instead, rotary positional embeddings (RoPE) are added at each layer of the network.
  • The details of the different model sizes are shown in the above table.
  • When training a 65B-parameter model, the code processes around 380 tokens/sec/GPU on 2048 A100 GPUs with 80GB of RAM. This means that training over the proposed dataset containing 1.4T tokens takes approximately 21 days (380 tokens/s × 2048 GPUs × 86,400 s/day × 21 days ≈ 1.4T tokens).
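The three architectural changes above can be sketched in a few lines of NumPy. This is an illustrative sketch, not Meta's implementation: the half-split rotation below is the GPT-NeoX-style variant of RoPE, and all function and array names here are made up for the example.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """Pre-normalization: scale each sub-layer *input* by its root mean
    square (no mean subtraction and no bias term, unlike LayerNorm)."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def silu(x):
    """SiLU (swish) non-linearity used inside SwiGLU."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_val, w_out):
    """Feed-forward block where silu(x @ w_gate) gates (x @ w_val);
    this replaces the usual relu(x @ w1) @ w2."""
    return (silu(x @ w_gate) * (x @ w_val)) @ w_out

def rotary_embed(x, base=10000.0):
    """Rotary positional embedding: rotate channel pairs by a
    position-dependent angle instead of adding absolute positions."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)          # per-pair frequencies
    angles = np.arange(seq_len)[:, None] * freqs[None, :]
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate(
        [x1 * np.cos(angles) - x2 * np.sin(angles),
         x1 * np.sin(angles) + x2 * np.cos(angles)], axis=-1)

# SwiGLU uses a hidden width of (2/3)*4d so its three weight matrices hold
# about the same parameter count as a standard two-matrix 4d FFN:
# 3 * d * (8d/3) = 8d^2, versus 2 * d * 4d = 8d^2.
d = 512
hidden = int(2 * 4 * d / 3)   # 1365, rather than 4d = 2048
```

Note that at position 0 the rotary angles are all zero, so RoPE leaves the first token's vector unchanged, and since each per-pair rotation is orthogonal it preserves vector norms at every position.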

1.3. Evaluation

  • A total of 20 benchmarks are evaluated in two settings:
  1. Zero-shot: A textual description of the task and a test example are provided. The model either provides an answer using open-ended generation, or ranks the proposed answers.
  2. Few-shot: A few examples of the task (between 1 and 64) and a test example are provided.
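The two settings can be illustrated with a small prompt builder, where k = 0 corresponds to zero-shot. The template strings below are made up for the example and are not the paper's exact formats.

```python
def build_prompt(task_description, examples, test_input, k=5):
    """Concatenate a task description, k solved examples, and the test input.

    examples: list of (question, answer) pairs; only the first k are used.
    The model is expected to complete the text after the final "A:".
    """
    parts = [task_description]
    for question, answer in examples[:k]:
        parts.append(f"Q: {question}\nA: {answer}")
    parts.append(f"Q: {test_input}\nA:")
    return "\n\n".join(parts)

# Zero-shot: description + test example only.
zero_shot = build_prompt("Answer the question.", [], "What is 2 + 2?", k=0)

# Few-shot: two worked examples precede the test example.
few_shot = build_prompt(
    "Answer the question.",
    [("What is 1 + 1?", "2"), ("What is 3 + 4?", "7")],
    "What is 2 + 2?",
    k=2,
)
```

For ranking-style tasks the same prompt is scored once per candidate answer and the highest-likelihood completion is chosen, rather than generating text freely.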

2. Results

2.1. Common Sense Reasoning

Zero-shot performance on Common Sense Reasoning tasks.

LLaMA-65B outperforms Chinchilla-70B on all reported benchmarks but BoolQ. Similarly, it surpasses PaLM-540B everywhere but on BoolQ and WinoGrande. The LLaMA-13B model also outperforms GPT-3 on most benchmarks despite being 10× smaller.

2.2. Closed-book Question Answering

Left: NaturalQuestions. Right: TriviaQA.

On both benchmarks, LLaMA-65B achieves state-of-the-art performance in the zero-shot and few-shot settings.

More importantly, LLaMA-13B is also competitive on these benchmarks with GPT-3 and Chinchilla, despite being 5–10× smaller. This model runs on a single V100 GPU during inference.

2.3. Reading Comprehension

Reading Comprehension.

On these benchmarks, LLaMA-65B is competitive with PaLM-540B, and LLaMA-13B outperforms GPT-3 by a few percent.

2.4. Mathematical Reasoning

Quantitative reasoning.

On GSM8k, LLaMA-65B outperforms Minerva-62B, although it has not been fine-tuned on mathematical data.

2.5. Code Generation

Code generation.

For a similar number of parameters, LLaMA outperforms other general models such as LaMDA and PaLM, which are not trained or finetuned specifically for code.

LLaMA with 13B parameters and more outperforms LaMDA-137B on both HumanEval and MBPP.

LLaMA-65B also outperforms PaLM-62B, even when the latter is trained longer.

2.6. Massive Multitask Language Understanding (MMLU)

Massive Multitask Language Understanding (MMLU).

LLaMA-65B is behind both Chinchilla-70B and PaLM-540B by a few percent on average, and across most domains. A potential explanation is that a limited amount of books and academic papers is used in the pretraining data, i.e., ArXiv, Gutenberg, and Books3, which sum up to only 177GB, while these models were trained on up to 2TB of books.

2.7. Evolution of Performance During Training

Training loss over train tokens for the 7B, 13B, 33B, and 65B models.
Evolution of performance on question answering and common sense reasoning during training.

On most benchmarks, the performance improves steadily, and correlates with the training perplexity of the model.

The exceptions are SIQA and WinoGrande.

2.8. Instruction Finetuning

Instruction finetuning — MMLU (5-shot).
  • Briefly finetuning on instruction data rapidly leads to improvements on MMLU.
  • Since this is not the focus of this paper, only a single experiment following the same protocol as Chung et al. (2022) is conducted to train an instruct model, LLaMA-I.

Despite the simplicity of the instruction finetuning approach used here, 68.9% is reached on MMLU. LLaMA-I (65B) outperforms existing instruction-finetuned models of moderate size on MMLU, but is still far from the state-of-the-art, which is 77.4 for GPT code-davinci-002 on MMLU (numbers taken from Iyer et al. (2022)).

2.9. Bias, Toxicity and Misinformation

RealToxicityPrompts. (Toxicity)

Toxicity increases with the size of the model, especially for Respectful prompts. The larger model, Gopher, however, has worse performance than Chinchilla, suggesting that the relation between toxicity and model size may only apply within a model family.

CrowS-Pairs. (Bias)

LLaMA compares slightly favorably to both GPT-3 and OPT-175B on average.

WinoGender. (Bias)

LLaMA-65B makes more errors on the gotcha examples, clearly showing that it captures societal biases related to gender and occupation. The drop in performance exists for both the "her/her/she" and "his/him/he" pronouns, which is indicative of biases regardless of gender.

TruthfulQA.

Compared to GPT-3, the model scores higher in both categories, but the rate of correct answers is still low, showing that the proposed model is likely to hallucinate incorrect answers.

2.10. Carbon Footprint

Carbon footprint.
  • Developing these models cost around 2,638 MWh under the stated assumptions, with total emissions of 1,015 tCO2eq.

The authors hope that releasing these models will help reduce future carbon emissions, since the training is already done.
