Brief Review — Alpaca: A Strong, Replicable Instruction-Following Model

Stanford Alpaca 7B & 13B

Sik-Ho Tsang
3 min read · May 21, 2024
Stanford Alpaca

Alpaca: A Strong, Replicable Instruction-Following Model
Alpaca, by Stanford University
2023 Stanford Web Site (Sik-Ho Tsang @ Medium)

Large Language Model (LLM)
2020 … 2023
[GPT-4] [LLaMA] [Koala] [BloombergGPT] [GLM-130B] [UL2] [PaLM 2] [Llama 2] [MultiMedQA, HealthSearchQA, Med-PaLM] [Med-PaLM 2] [Flan 2022, Flan-T5] [AlphaCode 2] [Mistral 7B]

  • Alpaca 7B, an instruction-following model, is proposed by fine-tuning LLaMA.
  • In their GitHub repository, Alpaca 13B is also constructed. The authors state that they tried LoRA for fine-tuning as well.
  • Later, Alpaca is further fine-tuned as MedAlpaca using medical data.
  • (Alpaca is one of the well-known LLMs, yet it was released on a web site rather than as a paper or arXiv tech report.)

Outline

  1. Alpaca 7B
  2. Preliminary Results

1. Alpaca 7B

Alpaca 7B Training Recipes

1.1. Data

For the data, instruction-following demonstrations are generated by building upon the self-instruct method. The authors started with the 175 human-written instruction-output pairs from the self-instruct seed set.

Then text-davinci-003 is prompted to generate more instructions, using the seed set as in-context examples.

  • The self-instruct method is improved by simplifying the generation pipeline, which significantly reduced the cost.

This data generation process results in 52K unique instructions and the corresponding outputs, which cost less than $500 using the OpenAI API.
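The data generation idea above can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' pipeline: the seed tasks, the prompt wording, and the helper names `build_generation_prompt` and `deduplicate` are all hypothetical, and no API call is made. It shows how a few seed pairs can be sampled as in-context examples, and how generated tasks might be reduced to unique instructions (here by simple exact matching, which is an assumption).

```python
import random

# Hypothetical stand-ins for the 175 human-written seed tasks.
SEED_TASKS = [
    {"instruction": "Give three tips for staying healthy.", "output": "Eat well, sleep enough, exercise."},
    {"instruction": "Translate 'good morning' into French.", "output": "Bonjour."},
    {"instruction": "Name the capital of Japan.", "output": "Tokyo."},
    {"instruction": "List two primary colors.", "output": "Red and blue."},
]


def build_generation_prompt(seed_tasks, num_examples, num_new, rng):
    """Sample seed tasks as in-context examples and ask for new tasks."""
    examples = rng.sample(seed_tasks, num_examples)
    lines = [f"Come up with {num_new} new, diverse task instructions with answers.", ""]
    for i, task in enumerate(examples, start=1):
        lines.append(f"{i}. Instruction: {task['instruction']}")
        lines.append(f"{i}. Output: {task['output']}")
    # Leave the next numbered slot open so the model continues the list.
    lines.append(f"{num_examples + 1}. Instruction:")
    return "\n".join(lines)


def deduplicate(tasks):
    """Keep the first occurrence of each instruction (case/whitespace-insensitive)."""
    seen, unique = set(), []
    for task in tasks:
        key = " ".join(task["instruction"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(task)
    return unique


prompt = build_generation_prompt(SEED_TASKS, num_examples=3, num_new=20, rng=random.Random(0))
```

In a real run, `prompt` would be sent to the generating model and its completions parsed back into instruction-output pairs before deduplication.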

1.2. Model

With the data, LLaMA models are fine-tuned using Hugging Face’s training framework, taking advantage of techniques like Fully Sharded Data Parallel and mixed precision training.
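As a rough sketch of what such a setup might look like, the snippet below assembles a `torchrun` launch command that enables FSDP and bf16 mixed precision through Hugging Face `Trainer` arguments. The script name `train.py`, its `--model_name_or_path` and `--data_path` arguments, and all paths are placeholders assumed for illustration; the exact command used by the authors is not reproduced here.

```python
import shlex


def build_launch_command(num_gpus, model_path, data_path, output_dir):
    """Assemble a torchrun command line; script name and paths are placeholders."""
    args = [
        "torchrun", f"--nproc_per_node={num_gpus}", "train.py",
        "--model_name_or_path", model_path,  # assumed script argument
        "--data_path", data_path,            # assumed script argument
        "--output_dir", output_dir,
        "--bf16", "True",                    # bf16 mixed precision training
        "--fsdp", "full_shard auto_wrap",    # Fully Sharded Data Parallel
    ]
    return shlex.join(args)


command = build_launch_command(8, "path/to/llama-7b", "path/to/alpaca_data.json", "out/alpaca-7b")
print(command)
```

Sharding parameters, gradients, and optimizer state across GPUs is what lets a 7B model be fully fine-tuned on a single multi-GPU node, while bf16 reduces memory use and speeds up training.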

For the initial run, fine-tuning a 7B LLaMA model took 3 hours on eight 80GB A100s, which costs less than $100 on most cloud compute providers.
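The cost figures quoted in this review can be checked with simple arithmetic. The sketch below only uses the numbers stated above; the implied hourly rate is derived from them and is not a quoted cloud price.

```python
num_gpus = 8
hours = 3
training_budget_usd = 100.0  # stated upper bound for fine-tuning
data_budget_usd = 500.0      # stated upper bound for data generation

gpu_hours = num_gpus * hours                          # 24 GPU-hours
implied_max_rate = training_budget_usd / gpu_hours    # USD per GPU-hour
total_budget_usd = training_budget_usd + data_budget_usd

print(f"{gpu_hours} GPU-hours, under ${implied_max_rate:.2f} per GPU-hour")
print(f"Total under ${total_budget_usd:.0f}")
```

In other words, the "less than $100" claim holds whenever an 80GB A100 rents for under roughly $4.17 per hour, and the whole recipe (data plus training) stays under $600.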

2. Preliminary Results

  • Human evaluation (by the 5 student authors) is conducted on the inputs from the self-instruct evaluation set, which covers a diverse list of user-oriented instructions including email writing, social media, and productivity tools.

Alpaca wins 90 comparisons versus 89 for text-davinci-003, i.e. the two models perform very similarly on this evaluation set.
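To put the 90 versus 89 result in perspective, the sketch below computes Alpaca's win rate and an exact two-sided sign test against the hypothesis that both models are equally likely to win a comparison. The sign test is an illustrative choice made here, not an analysis reported by the authors, and it assumes the 179 comparisons are independent with no ties.

```python
from math import comb


def win_rate(wins, losses):
    """Fraction of decided comparisons won."""
    return wins / (wins + losses)


def two_sided_sign_test(wins, losses):
    """Exact two-sided binomial p-value under a 50/50 null hypothesis."""
    n = wins + losses
    k = max(wins, losses)
    # Number of outcomes at least as extreme as k wins, on one side.
    tail = sum(comb(n, i) for i in range(k, n + 1))
    return min(1.0, 2 * tail / 2**n)


print(round(win_rate(90, 89), 3))    # 0.503
print(two_sided_sign_test(90, 89))   # 1.0
```

A p-value of 1.0 means the outcome is as close to a tie as 179 comparisons allow, so the result supports "comparable to" rather than "better than" text-davinci-003.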

  • The authors also tested the Alpaca model interactively and found that it often behaves similarly to text-davinci-003 on a diverse set of inputs. However, they note that the evaluation may be limited in scale and diversity.
Qualitative Results
  • However, similar to other language models, Alpaca also exhibits hallucination and misinformation.

Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.