Koala: A Dialogue Model for Academic Research
Koala 13B, Collecting a Small High-Quality Dataset, Much Smaller Model Size
Jun 3, 2023
Koala: A Dialogue Model for Academic Research,
Koala, by Berkeley Artificial Intelligence Research (BAIR),
2023 BAIR Blog, 3rd Apr 2023 (Sik-Ho Tsang @ Medium)
- BAIR proposes Koala, which is a chatbot trained by fine-tuning Meta’s LLaMA on dialogue data gathered from the web.
- By curating high-quality datasets, a smaller model can be trained.
- (Indeed, there is no paper for the model; they currently only have a blog post for it: https://bair.berkeley.edu/blog/2023/04/03/koala/. Here, I just summarize it to give a quick read. For more details, please read their website.)
- Demo provided by authors: https://chat.lmsys.org/?model=koala-13b
Outline
- Koala
- Results
1. Koala
1.1. Datasets
- A primary obstacle in building dialogue models is curating training data. Rather than maximizing quantity by scraping as much web data as possible, the authors focus on collecting a small high-quality dataset.
1.2. ChatGPT Distillation Data
- Public User-Shared Dialogues with ChatGPT (ShareGPT): Around 60K dialogues (a toy formatting sketch follows this list).
- Human ChatGPT Comparison Corpus (HC3): 60K human answers and 27K ChatGPT answers for around 24K questions.
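The exact preprocessing is not described in the blog post; below is a minimal sketch of how a ShareGPT-style multi-turn dialogue might be flattened into a single fine-tuning string. The role tags ("USER:", "GPT:") and the dialogue schema are assumptions for illustration only.

```python
# A minimal sketch (not Koala's actual preprocessing) of turning a
# ShareGPT-style multi-turn dialogue into one fine-tuning string.
# The role tags and the {"from": ..., "value": ...} schema are assumptions.

def dialogue_to_text(dialogue):
    """Flatten a list of {"from": ..., "value": ...} turns into one string."""
    role_tags = {"human": "USER:", "gpt": "GPT:"}
    lines = []
    for turn in dialogue:
        tag = role_tags.get(turn["from"], turn["from"].upper() + ":")
        lines.append(f"{tag} {turn['value']}")
    return "\n".join(lines)

example = [
    {"from": "human", "value": "What is Koala?"},
    {"from": "gpt", "value": "Koala is a dialogue model fine-tuned from LLaMA."},
]
print(dialogue_to_text(example))
```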
1.3. Open Source Data
- Open Instruction Generalist (OIG): A manually-selected subset of components from the Open Instruction Generalist dataset curated by LAION.
- Stanford Alpaca: 52K examples generated by OpenAI’s text-davinci-003 following the self-instruct process.
- Anthropic HH: ~160K human-rated examples, where each example in this dataset consists of a pair of responses from a chatbot, one of which is preferred by humans (a toy conversion sketch follows this list).
- OpenAI WebGPT: Around 20K comparisons, where each example comprises a question, a pair of model answers, and metadata.
- OpenAI Summarization: ~93K examples, where each example consists of feedback from humans regarding the summarizations by models.
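The post does not spell out here how this comparison-style data is fed to the model; the sketch below shows one simple, assumed way to reduce a pairwise preference example (e.g., from Anthropic HH or WebGPT) to a supervised (prompt, response) pair by keeping the human-preferred answer. The field names are hypothetical, not the datasets' actual schemas.

```python
# A minimal sketch (an assumption, not the documented Koala pipeline) of
# reducing a pairwise preference example to a supervised (prompt, response)
# pair by keeping the human-preferred answer. Field names are hypothetical.

def preferred_example(comparison):
    """comparison: {"question": str, "answers": [str, str], "preferred": 0 or 1}"""
    return {
        "prompt": comparison["question"],
        "response": comparison["answers"][comparison["preferred"]],
    }

comparison = {
    "question": "Summarize the article in one sentence.",
    "answers": ["A short, faithful summary.", "A rambling, off-topic answer."],
    "preferred": 0,
}
print(preferred_example(comparison))
```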
1.4. Training
- The Koala model is implemented with JAX/Flax in EasyLM, the authors’ open-source framework that makes it easy to pre-train, fine-tune, serve, and evaluate various large language models (a toy training-step sketch follows this list).
- The Koala model is trained on a single Nvidia DGX server with 8 A100 GPUs; training for 2 epochs takes 6 hours.
- On public cloud computing platforms, such a training run typically costs less than $100 with preemptible instances.
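For a concrete picture of what one supervised fine-tuning step looks like in JAX/Flax (the stack EasyLM builds on), here is a minimal sketch with a tiny stand-in network instead of LLaMA. The architecture, optimizer, batch format, and hyperparameters are assumptions for illustration, not Koala's actual configuration.

```python
# A minimal JAX/Flax sketch of a next-token fine-tuning step.
# TinyLM is a stand-in for the real LLaMA backbone; optimizer and
# hyperparameters are assumptions, not Koala's actual configuration.
import jax
import flax.linen as nn
import optax

VOCAB = 1000  # toy vocabulary size (assumption)

class TinyLM(nn.Module):
    """Stand-in model: embed tokens, project to per-position logits."""
    @nn.compact
    def __call__(self, tokens):
        h = nn.Embed(num_embeddings=VOCAB, features=64)(tokens)
        h = nn.relu(nn.Dense(128)(h))
        return nn.Dense(VOCAB)(h)  # (batch, seq, vocab) logits

model = TinyLM()
rng = jax.random.PRNGKey(0)
tokens = jax.random.randint(rng, (4, 16), 0, VOCAB)  # toy token batch
params = model.init(rng, tokens)["params"]
tx = optax.adamw(learning_rate=2e-5)                  # assumed optimizer
opt_state = tx.init(params)

def loss_fn(params, tokens):
    # Next-token prediction: logits at position t predict token t+1.
    logits = model.apply({"params": params}, tokens)[:, :-1]
    targets = tokens[:, 1:]
    return optax.softmax_cross_entropy_with_integer_labels(logits, targets).mean()

@jax.jit
def train_step(params, opt_state, tokens):
    loss, grads = jax.value_and_grad(loss_fn)(params, tokens)
    updates, opt_state = tx.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    return params, opt_state, loss

params, opt_state, loss = train_step(params, opt_state, tokens)
print(float(loss))
```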
2. Results
- Two models are evaluated:
- Koala-Distill, which solely employs distillation data, and;
- Koala-All, which employs all of the data, including both distillation and open-source data.
- Two different test sets are evaluated:
- Alpaca Test Set: 180 test queries used by Stanford’s Alpaca, and
- Koala Test Set: Authors’ own test set, which contains 180 real user queries that were posted online.
- A blind pairwise comparison is performed by asking approximately 100 evaluators on the Amazon Mechanical Turk platform (a toy rating-aggregation sketch is shown below).
Koala-All was rated as better than Alpaca in nearly half the cases, and either exceeded or tied Alpaca in 70% of the cases.
This suggests that Koala would be expected to perform better in assistant-like applications, and that LLM interaction data sourced from examples posted by users on the web is an effective training strategy.
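As a rough illustration of how such win/tie percentages could be derived, the sketch below aggregates hypothetical blind pairwise judgments from raters; it is not the authors' evaluation code, and the vote labels are assumptions.

```python
# A minimal sketch (illustrative only; not the authors' evaluation code) of
# turning blind pairwise judgments into win/tie percentages.
from collections import Counter

# Each judgment is "A", "B", or "tie" for an anonymized pair
# (here, hypothetically, A = Koala-All and B = Alpaca).
judgments = ["A", "tie", "B", "A", "A", "tie", "B", "A"]

counts = Counter(judgments)
total = len(judgments)
win = counts["A"] / total
tie = counts["tie"] / total
print(f"wins: {win:.0%}, ties: {tie:.0%}, wins or ties: {win + tie:.0%}")
```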
- The key to building strong dialogue models may lie more in curating high-quality dialogue data that is diverse in user queries, rather than simply reformatting existing datasets as questions and answers.
- (Hope I can have time to read about Alpaca in the coming future.)