Review — LIMA: Less Is More for Alignment

LIMA, Quality is More Important Than Quantity

Sik-Ho Tsang
7 min read · May 28, 2023
Yann LeCun tweeted about LIMA on 22nd May 2023

LIMA: Less Is More for Alignment,
LIMA, by Meta AI, Carnegie Mellon University, University of Southern California, and Tel Aviv University,
2023 arXiv v1 (Sik-Ho Tsang @ Medium)

Language Model
1991 … 2022
[GPT-NeoX-20B] [GPT-3.5, InstructGPT] [GLM] [MT-NLG 530B] [Chinchilla] [PaLM] [AlexaTM] [BLOOM] [AlexaTM 20B] [OPT] [Switch Transformers] [LaMDA] [LoRA] [Galactica] 2023 [GPT-4] [LLaMA]
==== My Other Paper Readings Are Also Over Here ====

  • LIMA is proposed, where a 65B parameter LLaMa language model is fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling.
  • LIMA demonstrates remarkably strong performance and generalizes well to unseen tasks.


  1. Superficial Alignment Hypothesis
  2. Alignment Data
  3. LIMA Training
  4. Results

1. Superficial Alignment Hypothesis

A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which sub-distribution of formats should be used when interacting with users.

If this hypothesis is correct, and alignment is largely about learning style, then the pretrained model can be sufficiently tuned with a rather small set of examples.

2. Alignment Data

2.1. Prompts & Responses

Training prompts (inputs) and responses (outputs), and test prompts for AI assistant.

750 top questions and answers are selected from community forums, such as Stack Exchange and wikiHow, sampled for quality and diversity.

In addition, authors manually write 250 examples of prompts and responses, while optimizing for task diversity and emphasizing a uniform response style in the spirit of an AI assistant.

  • Answers from Stack Exchange and wikiHow are well-aligned with the behavior of a helpful AI agent.
  • Reddit answers tend to be humorous or trolling, requiring a more manual approach to curate responses that follow the appropriate style.
  • (For quick read, please skip the details from 2.2 to 2.6.)

2.2. Stack Exchange

Stack Exchange contains 179 online communities (exchanges), each one dedicated to a specific topic, with the most popular one being programming (Stack Overflow). Stack Exchange has successfully maintained a high bar for content quality. Since Stack Exchange questions contain both a title and a description, authors randomly select the title as the prompt for some examples, and the description for others.

  • First, the exchanges are divided into 75 STEM exchanges (including programming, math, physics, etc.) and 99 other (English, cooking, travel, and more); 5 niche exchanges are discarded. Then 200 questions and answers are sampled from each set using a temperature of τ=3 to get a more uniform sample of the different domains.
  • Within each exchange, questions with high scores, and top answers are selected.
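The temperature-based sampling above can be sketched in a few lines. A common way to realize "sampling with a temperature of τ" over categories is to weight each exchange by its example count raised to the power 1/τ, so that τ=1 reproduces the raw size distribution and larger τ flattens it toward uniform. The exchange names and sizes below are hypothetical; this is a minimal sketch, not the paper's actual pipeline.

```python
import random

def temperature_sample(counts, n_samples, tau=3.0, seed=0):
    """Sample categories with probability proportional to count**(1/tau).

    tau=1 follows the raw size distribution; larger tau flattens it,
    giving smaller exchanges a fairer share of the sampled questions.
    """
    rng = random.Random(seed)
    names = list(counts)
    weights = [counts[n] ** (1.0 / tau) for n in names]
    return rng.choices(names, weights=weights, k=n_samples)

# Hypothetical exchange sizes: Stack Overflow dwarfs the niche exchanges.
sizes = {"stackoverflow": 1_000_000, "math": 100_000, "cooking": 10_000}
picks = temperature_sample(sizes, n_samples=200, tau=3.0)
```

With τ=3, the weight of the largest exchange shrinks from 1,000,000 to its cube root (100), so small communities are sampled far more often than their raw share would allow.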

2.3. wikiHow

wikiHow is an online wiki-style publication featuring over 240,000 how-to articles on a variety of topics.

  • 200 articles are sampled from wikiHow, as above, sampling a category first (out of 19) and then an article within it to ensure diversity.

2.4. The Pushshift Reddit Dataset

The samples are restricted to two subsets, r/AskReddit and r/WritingPrompts, and examples are manually selected from within the most upvoted posts in each community.

  • From r/AskReddit, 70 self-contained prompts (title only, no body) are found, which are used for the test set, since the top answers are not necessarily reliable.
  • r/WritingPrompts contains premises of fictional stories, which other users are then encouraged to complete creatively. 150 prompts and high-quality responses are found, encompassing topics such as love poems and short science fiction stories, which are added to the training set.

2.5. Manually Authored Examples

To further diversify the data beyond questions asked by users in online communities, prompts are collected by the authors themselves. Two groups of authors, Group A and Group B, each create 250 prompts, inspired by their own interests or those of their friends.

  • 200 prompts are selected from Group A for training and 50 prompts as a held-out development set. After filtering some problematic prompts, the remaining 230 prompts from Group B are used for test.
  • The 200 training prompts are supplemented with high-quality answers, which the authors write themselves with a uniform tone and a consistent format. It is hypothesized that this assists the model in forming a chain of thought.
  • 13 training prompts with some degree of toxicity or malevolence are also included. Responses are carefully written to partially or fully reject the command and explain why the assistant will not comply. There are also 30 prompts with similar issues in the test set.

2.6. Super-Natural Instructions

50 training examples are sampled from Super-Natural Instructions [Wang et al., 2022b], which relate to 50 natural language generation tasks such as summarization, paraphrasing, and style transfer. Each sample corresponds to a different task.

  • The intuition is that this small sample adds diversity to the overall mix of training examples, and can potentially increase model robustness.

After assembling the dataset as in the table above, it is used for training LIMA.

3. LIMA Training

  • LIMA (Less Is More for Alignment) is trained using the following protocol starting from LLaMA 65B.
  • LLaMA 65B is fine-tuned on the proposed 1,000-example alignment training set. To differentiate between each speaker (user and assistant), a special end-of-turn token (EOT) is introduced at the end of each utterance; this token plays the same role as EOS of halting generation, but avoids conflation with any other meaning.
  • It is fine-tuned for 15 epochs using AdamW. The batch size is set to 32 examples (64 for smaller models), and texts longer than 2048 tokens are trimmed. Residual dropout is used.
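The EOT token described above can be illustrated with a small serialization sketch. The token string and formatting here are illustrative assumptions, not the paper's exact implementation; the point is that each utterance ends with a dedicated end-of-turn marker that halts generation like EOS, without overloading EOS's pretrained meaning.

```python
EOT = "<EOT>"  # hypothetical spelling of the special end-of-turn token

def format_example(prompt: str, response: str) -> str:
    """Serialize one single-turn training example.

    Each utterance ends with EOT; at inference time, generation halts
    when the model emits EOT, just as it would for EOS.
    """
    return f"{prompt}{EOT}{response}{EOT}"

def format_dialogue(turns: list[str]) -> str:
    """Multi-turn variant: alternating user/assistant turns, EOT after each."""
    return "".join(turn + EOT for turn in turns)

text = format_example("What is alignment?", "Teaching a model the style...")
```

In practice the EOT string would be registered as a single special token in the tokenizer vocabulary rather than tokenized as plain text.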

It is found that perplexity does not correlate with generation quality, so checkpoints between the 5th and the 10th epochs are manually selected using the held-out 50-example development set.

4. Results

4.1. Human & GPT-4 Preference Evaluation

Preference Evaluation by Human (Left) and GPT-4 (Right)
  • 5 baseline models, Alpaca, DaVinci003, BARD, Claude, and GPT-4, are used for comparisons.
  • For each prompt, a single response is generated from each baseline model using nucleus sampling [Holtzman et al., 2019] with p=0.9 and a temperature of τ=0.7.
  • For each prompt, two responses are given, the annotators are asked to label which response was better, or whether neither response was significantly better than the other.
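Nucleus (top-p) sampling keeps only the smallest set of tokens whose cumulative probability reaches p, then samples within it. A minimal NumPy sketch with the settings used above (p=0.9, τ=0.7); the example logits are made up:

```python
import numpy as np

def nucleus_sample(logits, p=0.9, temperature=0.7, rng=None):
    """Top-p (nucleus) sampling [Holtzman et al., 2019].

    Temperature sharpens or flattens the distribution, then only the
    smallest set of tokens whose cumulative mass reaches p is kept.
    """
    rng = rng or np.random.default_rng(0)
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]       # token ids by descending prob
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1  # smallest nucleus covering p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))

token = nucleus_sample([2.0, 1.5, 0.2, -1.0])  # picks from the top-2 nucleus
```

Lowering τ below 1 concentrates mass on the highest-probability tokens, so the p=0.9 nucleus typically contains only a handful of candidates.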

Both Human & GPT-4 Preferences:
Despite training on 52 times more data, Alpaca 65B tends to produce less preferable outputs than LIMA. The same is true for DaVinci003, though to a lesser extent.

Human Preference:
BARD shows the opposite trend to DaVinci003, producing better responses than LIMA 42% of the time; however, this also means that 58% of the time the LIMA response was at least as good as BARD.

Both Human & GPT-4 Preferences:
Finally, we see that while Claude and GPT-4 generally perform better than LIMA, there is a non-trivial amount of cases where LIMA does actually produce better responses.

GPT-4 Preference:
Perhaps ironically, even GPT-4 prefers LIMA outputs over its own 19% of the time.

4.2. Meet the Requirement of Prompt

50% of LIMA answers are considered excellent, and LIMA is able to follow all but 6 of the 50 analyzed prompts.

4.3. Some Examples

Model outputs from test prompts. (Many more examples in the paper.)

4.4. Why is Less More? Ablations on Data Diversity, Quality, and Quantity

Performance of 7B models
  • A 7B parameter LLaMA model is fine-tuned.
  • 5 responses are sampled for each test set prompt, and response quality is evaluated by asking ChatGPT (GPT-3.5 Turbo) to grade the helpfulness of each response on a 1–6 Likert scale.

Diversity (Figure 5): The more diverse Stack Exchange data yields significantly higher performance.

Quality (Figure 5): There is a significant 0.5 point difference between models trained on the filtered and unfiltered data sources.

Quantity (Figure 6): Surprisingly, doubling the training set does not improve response quality.

4.5. Multi-Turn Dialogue

Dialogue turns
  • Can a model fine-tuned on only 1,000 single-turn interactions engage in multi-turn dialogue?
  • 30 multi-turn dialogue chains are gathered. Among these, 10 dialogues are composed by the authors, while the remaining 20 are based on comment chains from Stack Exchange, which are edited to fit the assistant’s style.

Adding conversations substantially improves generation quality, raising the proportion of excellent responses from 45.2% to 76.1%. Moreover, the failure rate drops from 15 fails per 42 turns (zero-shot) to 1 fail per 46 (fine-tuned).

  • (There are multi-turn dialogue examples in the paper. Please feel free to read the paper directly.)


