Brief Review — Jurassic-1: Technical Details and Evaluation
Jurassic-1, Comparable or Even Better Than GPT-3
Jurassic-1: Technical Details and Evaluation,
Jurassic-1, by AI21 Labs,
2021 White Paper, Over 70 Citations (Sik-Ho Tsang @ Medium)
Language Model
2022 [GPT-NeoX-20B] [GPT-3.5, InstructGPT] [GLM] [MT-NLG 530B] 2023 [GPT-4]
==== My Other Paper Readings Are Also Over Here ====
- Jurassic-1 is a pair of auto-regressive language models recently released by AI21 Labs, consisting of J1-Jumbo, a 178B-parameter model, and J1-Large, a 7B-parameter model.
- Their performance is compared with that of GPT-3.
Outline
- Jurassic-1
- Results
1. Jurassic-1
1.1. Model
- The models are based on the decoder module of the Transformer architecture with the modifications proposed by GPT-2.
- Input tokens are first converted to vector representations with an n_vocab-by-d_model embedding matrix.
- The architecture is composed of n_layers Transformer layers with hidden dimension d_model, each equipped with a self-attention module with n_heads attention heads of size d_head, plus a feed-forward module.
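Below is a minimal, hedged sketch (written in PyTorch, not AI21's actual code) of one such pre-norm decoder layer with GPT-2-style LayerNorm placement; the toy dimensions at the end are purely illustrative and are not J1's real sizes.

```python
# Minimal sketch of a pre-norm (GPT-2-style) decoder layer:
# causal self-attention (n_heads heads of size d_head) + feed-forward module.
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # d_head = d_model // n_heads inside MultiheadAttention
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):  # x: (batch, seq, d_model)
        seq = x.size(1)
        # boolean causal mask: True = position may not be attended to
        causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out                 # residual around attention
        x = x + self.ffn(self.ln2(x))    # residual around feed-forward
        return x

# Tokens are first embedded with an n_vocab-by-d_model matrix, then passed
# through n_layers such blocks (toy sizes here, not J1's actual dimensions).
emb = nn.Embedding(256_000, 512)                       # n_vocab x d_model
layers = nn.ModuleList(DecoderLayer(512, 8) for _ in range(4))
```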
Two model sizes are designed, called J1-Large (7.5B parameters) and J1-Jumbo (178B parameters) for short. They roughly correspond to the GPT-3 6.7B and GPT-3 175B models, respectively.
- The design is based on a recently proposed theory (Levine et al., 2020) for the depth-to-width expressivity tradeoff.
Specifically, for a parameter budget of 175B (not including embedding matrix), the optimal depth should be around 80 layers, far from the 96 layers used by GPT-3 175B.
- However, 76 layers are used rather than 80 layers because of various hardware considerations during both training and inference.
In benchmarks comparing the proposed architecture against GPT-3 175B on the same hardware configuration, J1 shows modest benefits in training time (a 1.5% speedup per iteration), but significant runtime gains in batch inference (7%) and text generation (up to 23%).
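As a rough back-of-the-envelope check on this depth-to-width tradeoff (not a reproduction of the paper's tables): the non-embedding parameter count of a GPT-like decoder is roughly 12 · n_layers · d_model². Plugging in GPT-3 175B's published shape (96 layers, d_model = 12288) and a shallower 76-layer shape at an assumed wider d_model of 13824 shows both landing near the same budget.

```python
# Back-of-the-envelope only: a GPT-like decoder layer has ~12 * d_model^2
# parameters (~4*d_model^2 for the attention projections, ~8*d_model^2 for the
# feed-forward), so the non-embedding budget is ~12 * n_layers * d_model^2.
def non_embedding_params(n_layers: int, d_model: int) -> int:
    return 12 * n_layers * d_model ** 2

gpt3 = non_embedding_params(96, 12_288)   # GPT-3 175B: 96 layers, d_model 12288
j1   = non_embedding_params(76, 13_824)   # assumed wider/shallower J1-Jumbo shape

print(f"GPT-3-like: {gpt3/1e9:.0f}B, J1-like: {j1/1e9:.0f}B")
# Both come out near ~174B: the same parameter budget spent on a
# shallower-but-wider network, as the depth-to-width theory suggests.
```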
1.2. Large Vocabulary for Tokenization Efficiency
- A SentencePiece tokenizer is trained with a larger budget of 256K vocabulary items and without restricting tokens to word boundaries.
The vocabulary embedding for J1-Jumbo, for example, requires 3.6B parameters, which is just 2% of all parameters.
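A hedged sketch of how such a tokenizer could be trained with the SentencePiece library (the corpus path, output name, and model type below are placeholders, not AI21's actual settings), together with the embedding-size arithmetic:

```python
# Sketch of training a 256K-item SentencePiece tokenizer that is not
# restricted to word boundaries (corpus path and model prefix are placeholders).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="training_corpus.txt",      # placeholder corpus
    model_prefix="j1_tokenizer",      # placeholder output name
    vocab_size=256_000,               # 256K vocabulary items
    model_type="unigram",             # assumed model type
    split_by_whitespace=False,        # allow pieces that cross word boundaries
)

# The resulting vocabulary embedding costs n_vocab * d_model parameters.
# With an assumed d_model of 13824 for J1-Jumbo:
print(256_000 * 13_824 / 1e9, "B parameters")  # ~3.5B, roughly 2% of 178B
```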
2. Results
On almost all corpora, Jurassic-1 models are well ahead of their GPT-3 counterparts.
On some tasks, the Jurassic-1 models come out ahead, and on others GPT-3 does. On average, both models attain the same performance.
In few-shot settings, J1-Large is able to attain better results because its tokenizer allows more training examples to fit into the prompt for the same number of tokens.
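To make the token-budget argument concrete, here is a purely illustrative calculation (the context length and the per-example token counts are assumptions for illustration, not figures from the paper):

```python
# Illustrative only: if J1's 256K-item vocabulary encodes the same example in
# fewer tokens than a 50K BPE vocabulary, more in-context examples fit into
# the same context window (all numbers below are made-up assumptions).
CONTEXT_TOKENS = 2048           # assumed context length
tokens_per_example_gpt3 = 60    # hypothetical
tokens_per_example_j1 = 45      # hypothetical: fewer tokens for the same text

print("GPT-3-style prompt fits", CONTEXT_TOKENS // tokens_per_example_gpt3, "examples")
print("J1-style prompt fits   ", CONTEXT_TOKENS // tokens_per_example_j1, "examples")
# With the same token budget, the more efficient tokenizer packs more
# few-shot examples into the prompt, which is the effect described above.
```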
Jurassic-1 models are marginally less biased than GPT-3, but bear in mind that this is merely one benchmark.
Summary
Jurassic-1 models can predict text from a broader set of domains (web, academic, legal, source code, and more) than GPT-3, achieve comparable performance in zero-shot settings, and can be superior to GPT-3 in few-shot due to their ability to fit more examples into a prompt.
In all cases, J1 models perform either on par or better than their GPT-3 counterparts.