Koala: A Dialogue Model for Academic Research

Koala 13B, Collecting a Small High-Quality Dataset, Much Smaller Model Size

Sik-Ho Tsang
3 min read · Jun 3


Koala (Free Image from Pexels: Pixabay)

Koala: A Dialogue Model for Academic Research,
Koala, by Berkeley Artificial Intelligence Research (BAIR),
2023 BAIR Blog, 3rd Apr 2023 (Sik-Ho Tsang @ Medium)

  • BAIR proposes Koala, which is a chatbot trained by fine-tuning Meta’s LLaMA on dialogue data gathered from the web.
  • By curating high-quality datasets, a much smaller model can be trained.
  • (Indeed, there is no paper for the model; currently there is only a blog post: https://bair.berkeley.edu/blog/2023/04/03/koala/. Here, I just summarize it for a quick read. For more details, please see their website.)
  • Demo provided by authors: https://chat.lmsys.org/?model=koala-13b
I also tried the demo; Koala knows about LLaMA but doesn’t know about itself, lol.


  1. Koala
  2. Results

1. Koala

Koala 13B Training Pipeline
Comparison with Alpaca and ChatGPT.

1.1. Datasets

  • A primary obstacle in building dialogue models is curating training data. Rather than maximizing quantity by scraping as much web data as possible, the authors focus on collecting a small high-quality dataset.

1.2. ChatGPT Distillation Data

  1. Public User-Shared Dialogues with ChatGPT (ShareGPT): Around 60K dialogues.
  2. Human ChatGPT Comparison Corpus (HC3): 60K human answers and 27K ChatGPT answers for around 24K questions.

1.3. Open Source Data

  1. Open Instruction Generalist (OIG): A manually-selected subset of components from the Open Instruction Generalist dataset curated by LAION.
  2. Stanford Alpaca: 52K examples generated by OpenAI’s text-davinci-003 following the self-instruct process.
  3. Anthropic HH: ~160K human-rated examples, where each example in this dataset consists of a pair of responses from a chatbot, one of which is preferred by humans.
  4. OpenAI WebGPT: Around 20K comparisons, where each example comprises a question, a pair of model answers, and metadata.
  5. OpenAI Summarization: ~93K examples, where each example consists of feedback from humans regarding the summarizations by models.
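Since these sources arrive in different shapes (multi-turn dialogues, instruction/output records, comparison pairs), fine-tuning on all of them requires normalizing everything into one prompt/response format. A minimal sketch of that idea is below; the function names and field names are illustrative assumptions, not the authors' actual preprocessing code.

```python
# Hypothetical sketch: normalizing heterogeneous sources into a single
# (prompt, response) fine-tuning format. Schemas are illustrative.

def normalize_sharegpt(dialogue):
    """Flatten a ShareGPT-style alternating user/assistant dialogue
    into prompt/response pairs."""
    pairs = []
    for i in range(0, len(dialogue) - 1, 2):
        pairs.append({"prompt": dialogue[i], "response": dialogue[i + 1]})
    return pairs

def normalize_alpaca(example):
    """Alpaca-style records already come as instruction/output pairs;
    fold the optional input field into the prompt."""
    prompt = example["instruction"]
    if example.get("input"):
        prompt += "\n" + example["input"]
    return [{"prompt": prompt, "response": example["output"]}]

mixed = (
    normalize_sharegpt(["What is LLaMA?", "LLaMA is a family of LLMs."])
    + normalize_alpaca({"instruction": "Add 2 and 3.", "input": "", "output": "5"})
)
print(len(mixed))  # 2
```

The point is simply that, after normalization, the distillation data and the open-source data can be mixed freely in one training set.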

1.4. Training

  • The Koala model is implemented with JAX/Flax in EasyLM, authors’ open source framework that makes it easy to pre-train, fine-tune, serve, and evaluate various large language models.
  • The Koala model is trained on a single Nvidia DGX server with 8 A100 GPUs. Training for 2 epochs takes 6 hours.
  • On public cloud computing platforms, such a training run typically costs less than $100 with preemptible instances.
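The training itself is standard supervised fine-tuning: next-token prediction on the dialogue data. As a rough illustration (not the authors' EasyLM code), the objective can be sketched as a cross-entropy loss over tokens, with a mask as one common way to restrict the loss to response tokens rather than the user's prompt:

```python
import numpy as np

# Illustrative supervised fine-tuning loss for next-token prediction.
# A mask restricts the loss to assistant-response tokens; whether Koala
# masks prompt tokens this way is an assumption for illustration.

def sft_loss(logits, targets, response_mask):
    """logits: (T, V); targets: (T,) token ids; response_mask: (T,) in {0, 1}."""
    logits = logits - logits.max(axis=-1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]    # per-token NLL
    return (token_nll * response_mask).sum() / response_mask.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 10))            # 6 tokens, vocabulary of 10
targets = rng.integers(0, 10, size=6)
mask = np.array([0, 0, 0, 1, 1, 1])          # loss only on the last 3 tokens
print(float(sft_loss(logits, targets, mask)))
```

In the real pipeline this loss would be computed by the LLaMA model in JAX/Flax and minimized with a gradient-based optimizer.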

2. Results

  • Two models are evaluated:
  1. Koala-Distill, which solely employs distillation data; and
  2. Koala-All, which employs all of the data, including both distillation and open-source data.
  • Two different test sets are evaluated,
  1. Alpaca Test Set: 180 test queries used by Stanford’s Alpaca, and
  2. Koala Test Set: Authors’ own test set, which contains 180 real user queries that were posted online.
  • A blind pairwise comparison is performed by asking approximately 100 evaluators on the Amazon Mechanical Turk platform.
Blind Pairwise Comparison

Koala-All was rated as better than Alpaca in nearly half the cases, and either exceeded or tied Alpaca in 70% of the cases.
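The tally behind such a comparison is simple: each rater sees two anonymized outputs for the same query and votes for one model or a tie, and the votes are aggregated into win and win-or-tie rates. A small sketch with made-up votes (not the actual study data):

```python
# Hypothetical tally of a blind pairwise evaluation. The vote list is
# invented for illustration; it is not the Mechanical Turk data.

from collections import Counter

votes = ["koala", "koala", "tie", "alpaca", "koala", "tie",
         "koala", "alpaca", "tie", "koala"]      # 10 illustrative raters

counts = Counter(votes)
n = len(votes)
win_rate = counts["koala"] / n                    # rated strictly better
win_or_tie = (counts["koala"] + counts["tie"]) / n
print(win_rate, win_or_tie)  # 0.5 0.8
```

With these toy numbers, Koala wins half the comparisons and wins or ties 80% of them, mirroring the shape of the reported result.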

This suggests that Koala would perform well in assistant-like applications, and that sourcing LLM interaction data from examples posted by users on the web is an effective strategy.

  • The key to building strong dialogue models may lie more in curating high-quality dialogue data that is diverse in user queries, rather than simply reformatting existing datasets as questions and answers.
  • (Hope I can find time to read about Alpaca soon.)
