Brief Review — Jurassic-1: Technical Details and Evaluation
Jurassic-1: Technical Details and Evaluation,
Jurassic-1, by AI21 Labs,
2021 White Paper, Over 70 Citations (Sik-Ho Tsang @ Medium)
- Jurassic-1 is a pair of auto-regressive language models recently released by AI21 Labs, consisting of J1-Jumbo, a 178B-parameter model, and J1-Large, a 7B-parameter model.
- Their performance is compared with GPT-3.
- The models are based on the decoder module of the Transformer architecture with the modifications proposed by GPT-2.
- Input tokens are first converted to a vector representation with an n_vocab-by-d_model embedding matrix.
- The architecture is composed of n_layers Transformer layers using a hidden dimension d_model, each equipped with a self-attention module with n_heads attention heads of size d_head and a feed-forward module.
- The design is based on a recently proposed theory (Levine et al., 2020) for the depth-to-width expressivity tradeoff.
Specifically, for a parameter budget of 175B (not including embedding matrix), the optimal depth should be around 80 layers, far from the 96 layers used by GPT-3 175B.
- However, 76 layers are used rather than 80 layers because of various hardware considerations during both training and inference.
In the benchmarks, comparing the proposed architecture against GPT-3 175B on the same hardware configuration, J1 has modest benefits in training time (1.5% speedup per iteration), but significant runtime gains in batch inference (7%) and text generation (up to 23%).
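The depth-versus-width choice above can be sanity-checked with a rough parameter count. The sketch below uses the common approximation of 12·d_model² weights per Transformer layer (4·d_model² for attention, 8·d_model² for a feed-forward module with 4x expansion), ignoring biases, layer norms, and embeddings. The hidden sizes (13,824 for J1-Jumbo, 12,288 for GPT-3 175B) are not stated in this review and are assumed here; the function name is hypothetical.

```python
def approx_layer_params(n_layers: int, d_model: int) -> int:
    """Approximate non-embedding weight count of a decoder-only Transformer."""
    attention = 4 * d_model * d_model      # Q, K, V and output projections
    feed_forward = 8 * d_model * d_model   # two matrices, assuming 4x expansion
    return n_layers * (attention + feed_forward)


# Assumed hidden sizes (not given in this review).
j1_jumbo = approx_layer_params(n_layers=76, d_model=13824)
gpt3_175b = approx_layer_params(n_layers=96, d_model=12288)

print(f"J1-Jumbo   (76 layers): {j1_jumbo / 1e9:.1f}B")
print(f"GPT-3 175B (96 layers): {gpt3_175b / 1e9:.1f}B")
```

Under these assumptions, both configurations land at a similar non-embedding budget, so J1-Jumbo can be read as spending roughly the same parameters on fewer, wider layers.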
1.2. Large Vocabulary for Tokenization Efficiency
- A SentencePiece tokenizer is trained with a larger budget of 256K vocabulary items and without restricting vocabulary items to word boundaries.
The vocabulary embedding for J1-Jumbo, for example, requires 3.6B parameters, which are just 2% of all parameters.
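The embedding cost quoted above follows from the n_vocab-by-d_model shape of the embedding matrix. A minimal check, assuming "256K" means 2^18 = 262,144 entries and assuming a J1-Jumbo hidden size of 13,824 (neither value is given explicitly in this review):

```python
N_VOCAB = 2 ** 18          # "256K" read as 262,144 entries (assumption)
D_MODEL = 13824            # assumed J1-Jumbo hidden size (not in this review)
TOTAL_PARAMS = 178e9       # J1-Jumbo size as stated in this review

# One d_model-sized vector per vocabulary item.
embedding_params = N_VOCAB * D_MODEL
share = embedding_params / TOTAL_PARAMS

print(f"embedding parameters: {embedding_params / 1e9:.2f}B")
print(f"share of all parameters: {share:.1%}")
```

With these assumed values the result is consistent with the roughly 3.6B parameters and 2% share reported above.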
On almost all corpora, Jurassic-1 models are well ahead of their GPT-3 counterparts.
On some tasks the Jurassic-1 models come out ahead, and on others GPT-3 does. On average, both models attain the same performance.
J1-Large is able to attain better results because its more efficient tokenization allows more training examples to fit in the prompt for the same number of tokens.
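The effect of tokenization efficiency on few-shot prompts can be illustrated with simple arithmetic. The sketch below is hypothetical: the context length, the example length, and the characters-per-token ratios are illustrative placeholders, not figures from the paper, and the function name is made up for this example.

```python
import math


def shots_that_fit(context_tokens: int, reserved_tokens: int,
                   example_chars: int, chars_per_token: float) -> int:
    """Number of few-shot examples that fit in the prompt budget."""
    # Tokens needed for one example, rounded up to a whole token.
    tokens_per_example = math.ceil(example_chars / chars_per_token)
    # Budget left after reserving room for the query and the completion.
    return (context_tokens - reserved_tokens) // tokens_per_example


# Illustrative values only (assumptions, not measurements).
CONTEXT = 2048        # assumed context window in tokens
RESERVED = 248        # assumed tokens kept for the query and answer
EXAMPLE_CHARS = 600   # assumed length of one training example

smaller_vocab = shots_that_fit(CONTEXT, RESERVED, EXAMPLE_CHARS, chars_per_token=3.0)
larger_vocab = shots_that_fit(CONTEXT, RESERVED, EXAMPLE_CHARS, chars_per_token=4.0)

print(f"examples with coarser tokens needed: {smaller_vocab}")
print(f"examples with more text per token:   {larger_vocab}")
```

The point of the sketch is only the direction of the effect: a tokenizer that packs more text into each token leaves room for more examples in the same token budget.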
Jurassic-1 models are marginally less biased than GPT-3, but bear in mind that this is merely one benchmark.
Jurassic-1 models can predict text from a broader set of domains (web, academic, legal, source code, and more) than GPT-3, achieve comparable performance in zero-shot settings, and can be superior to GPT-3 in few-shot due to their ability to fit more examples into a prompt.
In all cases, J1 models perform either on par with or better than their GPT-3 counterparts.