Koala: A Dialogue Model for Academic Research

Koala 13B, Collecting a Small High-Quality Dataset, Much Smaller Model Size

Sik-Ho Tsang
3 min read · Jun 3, 2023
Koala (Free Image from Pexels: Pixabay)

Koala: A Dialogue Model for Academic Research,
Koala, by Berkeley Artificial Intelligence Research (BAIR),
2023 BAIR Blog, 3rd Apr 2023 (Sik-Ho Tsang @ Medium)

  • BAIR proposes Koala, which is a chatbot trained by fine-tuning Meta’s LLaMA on dialogue data gathered from the web.
  • By curating high-quality datasets, a much smaller model can be trained.
  • (Indeed, there is no paper for the model; the authors currently have only a blog post: https://bair.berkeley.edu/blog/2023/04/03/koala/. Here, I just summarize it for a quick read. For more details, please read their website.)
  • Demo provided by authors: https://chat.lmsys.org/?model=koala-13b
I also tried the demo; Koala knows about LLaMA but doesn’t know about itself, lol.

Outline

  1. Koala
  2. Results

1. Koala

Koala 13B Training Pipeline
Comparison with Alpaca and ChatGPT.

1.1. Datasets

  • A primary obstacle in building dialogue models is curating training data. Rather than maximizing quantity by scraping as much web data as possible, the authors focus on collecting a small, high-quality dataset.

1.2. ChatGPT Distillation Data

  1. Public User-Shared Dialogues with ChatGPT (ShareGPT): Around 60K dialogues (a formatting sketch follows this list).
  2. Human ChatGPT Comparison Corpus (HC3): 60K human answers and 27K ChatGPT answers for around 24K questions.
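
The blog post does not spell out how these dialogues are serialized for fine-tuning. Below is a purely illustrative Python sketch of how a ShareGPT-style conversation might be flattened into a single training string; the role markers and the format_dialogue helper are assumptions, not EasyLM’s actual prompt format.

# Hypothetical sketch: flattening a ShareGPT-style dialogue into one
# fine-tuning string. The role markers are illustrative only.
from typing import Dict, List

def format_dialogue(turns: List[Dict[str, str]]) -> str:
    """Concatenate {"role", "text"} turns into a single training example."""
    parts = ["BEGINNING OF CONVERSATION:"]
    for turn in turns:
        speaker = "USER" if turn["role"] == "user" else "GPT"
        parts.append(f"{speaker}: {turn['text']}")
    return " ".join(parts)

example = format_dialogue([
    {"role": "user", "text": "What is Koala?"},
    {"role": "assistant", "text": "Koala is a chatbot fine-tuned from LLaMA on web dialogue data."},
])
print(example)

During fine-tuning, the loss would typically be computed only on the assistant-side tokens of such strings, though the blog post does not specify this detail.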

1.3. Open Source Data

  1. Open Instruction Generalist (OIG): A manually-selected subset of components from the Open Instruction Generalist dataset curated by LAION.
  2. Stanford Alpaca: 52K examples generated by OpenAI’s text-davinci-003 following the self-instruct process.
  3. Anthropic HH: ~160K human-rated examples, where each example consists of a pair of responses from a chatbot, one of which is preferred by humans.
  4. OpenAI WebGPT: Around 20K comparisons, where each example comprises a question, a pair of model answers, and metadata.
  5. OpenAI Summarization: ~93K examples, each consisting of feedback from humans regarding summarizations produced by models.

1.4. Training

  • The Koala model is implemented with JAX/Flax in EasyLM, the authors’ open-source framework that makes it easy to pre-train, fine-tune, serve, and evaluate various large language models (a minimal fine-tuning sketch follows this list).
  • The Koala model is trained on a single Nvidia DGX server with 8 A100 GPUs; training for 2 epochs takes 6 hours.
  • On public cloud computing platforms, such a training run typically costs less than $100 with preemptible instances.
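
EasyLM is open source, but the blog post itself does not include training code. Purely as an illustration of the general recipe, here is a minimal JAX/Flax/Optax sketch of one supervised fine-tuning step (next-token cross-entropy, masked to response tokens); ToyLM is a tiny stand-in for LLaMA 13B, and every name and hyperparameter below is an assumption rather than EasyLM’s actual implementation.

# Minimal sketch of a supervised fine-tuning step in JAX/Flax/Optax.
# ToyLM is a tiny stand-in for LLaMA 13B; everything here is illustrative.
import jax
import jax.numpy as jnp
import flax.linen as nn
import optax

VOCAB = 32000   # LLaMA tokenizer vocabulary size
HIDDEN = 128    # toy width, just for the sketch

class ToyLM(nn.Module):
    @nn.compact
    def __call__(self, tokens):
        x = nn.Embed(VOCAB, HIDDEN)(tokens)   # token embeddings
        x = nn.Dense(HIDDEN)(nn.relu(x))      # stand-in for the transformer stack
        return nn.Dense(VOCAB)(x)             # logits over the vocabulary

model = ToyLM()

def sft_loss(params, tokens, loss_mask):
    """Shifted next-token cross-entropy, averaged over masked positions."""
    logits = model.apply(params, tokens[:, :-1])
    targets = tokens[:, 1:]
    mask = loss_mask[:, 1:]                   # 1 where the target is a response token
    nll = optax.softmax_cross_entropy_with_integer_labels(logits, targets)
    return (nll * mask).sum() / jnp.maximum(mask.sum(), 1.0)

rng = jax.random.PRNGKey(0)
tokens = jax.random.randint(rng, (2, 16), 0, VOCAB)  # fake tokenized dialogues
loss_mask = jnp.ones((2, 16))                        # would be 0 on user-turn tokens
params = model.init(rng, tokens[:, :-1])

tx = optax.adamw(learning_rate=2e-5)                 # assumed hyperparameter
opt_state = tx.init(params)

@jax.jit
def train_step(params, opt_state, tokens, loss_mask):
    loss, grads = jax.value_and_grad(sft_loss)(params, tokens, loss_mask)
    updates, opt_state = tx.update(grads, opt_state, params)
    return optax.apply_updates(params, updates), opt_state, loss

params, opt_state, loss = train_step(params, opt_state, tokens, loss_mask)
print(float(loss))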

2. Results

  • Two models are evaluated:
  1. Koala-Distill, which solely employs distillation data, and;
  2. Koala-All, which employs all of the data, including both distillation and open-source data.
  • Two different test sets are evaluated,
  1. Alpaca Test Set: 180 test queries used by Stanford’s Alpaca, and
  2. Koala Test Set: Authors’ own test set, which contains 180 real user queries that were posted online.
  • A blind pairwise comparison is performed by approximately 100 evaluators on the Amazon Mechanical Turk platform (a toy aggregation sketch follows the figure).
Blind Pairwise Comparison
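
As a rough illustration only (not the authors’ actual analysis), such blind pairwise judgments are typically aggregated into win/tie percentages like the ones quoted below; the helper and the ratings are made up.

# Hypothetical sketch: aggregating blind pairwise ratings into percentages.
from collections import Counter

def pairwise_summary(judgments):
    """judgments: one of "A", "B", or "tie" per (query, rater) pair."""
    counts = Counter(judgments)
    total = len(judgments)
    return {k: 100.0 * counts[k] / total for k in ("A", "B", "tie")}

# e.g. A = Koala-All, B = Alpaca on the Koala test set (made-up ratings)
print(pairwise_summary(["A", "A", "tie", "B", "A", "tie"]))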

Koala-All was rated as better than Alpaca in nearly half the cases, and either exceeded or tied Alpaca in 70% of the cases.

This suggests that Koala would be expected to perform better in assistant-like applications, and that fine-tuning on LLM interaction data sourced from examples posted by users on the web is an effective strategy.

  • The key to building strong dialogue models may lie more in curating high-quality dialogue data that is diverse in user queries, rather than simply reformatting existing datasets as questions and answers.
  • (Hope I can find time to read about Alpaca in the near future.)
