Brief Review — Jurassic-1: Technical Details and Evaluation

Jurassic-1, Comparable or Even Better Than GPT-3

4 min read · Mar 25, 2023


Jurassic-1 Technical details and evaluation,
Jurassic-1, by AI21 Labs,
2021 White Paper, Over 70 Citations (Sik-Ho Tsang @ Medium)
Language Model
2022 [GPT-NeoX-20B] [GPT-3.5, InstructGPT] [GLM] [MT-NLG 530B] 2023 [GPT-4]

Jurassic-1 is a pair of auto-regressive language models recently released by AI21 Labs, consisting of J1-Jumbo, a 178B-parameter model, and J1-Large, a 7B-parameter model.
Their performance is compared with GPT-3.

Outline

Jurassic-1
Results

1. Jurassic-1

1.1. Model

**Comparing the architecture of our Jurassic-1 models to their GPT-3 counterparts.**

The models are based on the decoder module of the Transformer architecture with the modifications proposed by GPT-2.
Input tokens are first converted to vector representation with an n_vocab-by-d_model embedding matrix.
The architecture is composed of n_layers Transformer layers using a hidden dimension d_model, each equipped with a self-attention module with n_heads attention heads of size d_head and a feed-forward module.

Two model sizes are designed: J1-Large (7.5B parameters) and J1-Jumbo (178B parameters). They roughly correspond to the GPT-3 6.7B and GPT-3 175B models, respectively.

The design is based on a recently proposed theory (Levine et al., 2020) for the depth-to-width expressivity tradeoff.

Specifically, for a parameter budget of 175B (not including embedding matrix), the optimal depth should be around 80 layers, far from the 96 layers used by GPT-3 175B.

However, 76 layers are used rather than 80 layers because of various hardware considerations during both training and inference.
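The depth-versus-width tradeoff above can be made concrete with a rough parameter count. The sketch below is not taken from the white paper: it assumes the common approximation of 12 × n_layers × d_model² non-embedding parameters (4 × d_model² for the attention projections plus 8 × d_model² for a feed-forward block with 4× expansion, ignoring biases and layer norms). The function names are hypothetical.

```python
import math

def transformer_params(n_layers: int, d_model: int) -> int:
    """Approximate non-embedding parameter count (assumed 12 * L * d^2 rule).

    Per layer: 4 * d_model**2 for the Q, K, V and output projections,
    plus 8 * d_model**2 for a feed-forward block with 4x expansion.
    Biases and layer-norm parameters are ignored.
    """
    return 12 * n_layers * d_model ** 2

def width_for_budget(budget: float, n_layers: int) -> float:
    """Hidden dimension that spends `budget` parameters at the given depth."""
    return math.sqrt(budget / (12 * n_layers))

budget = 175e9  # non-embedding budget quoted in the text
for depth in (96, 80, 76):
    # Fewer layers at a fixed budget means each layer must be wider.
    print(depth, round(width_for_budget(budget, depth)))
```

Under this approximation, moving from 96 to 76 layers at the same budget pushes the hidden dimension from roughly 12.3K to roughly 13.9K, which is the "shallower but wider" shape the review describes.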

In the benchmarks, comparing the proposed architecture against GPT-3 175B on the same hardware configuration, J1 has modest benefits in training time (1.5% speedup per iteration), but significant runtime gains in batch inference (7%) and text generation (up to 23%).

1.2. Large Vocabulary for Tokenization Efficiency

**Examples of items from J1’s vocabulary, including word-pieces, whole words, and multi-word expressions.**

**An example showing how multi-word tokens and an overall larger vocabulary can better articulate the various options the model considers at a given point in the text.**

A SentencePiece tokenizer is trained with a larger budget of 256K vocabulary items and without restricting tokens to word boundaries, so a single token can span several words.

The vocabulary embedding for J1-Jumbo, for example, requires 3.6B parameters, which is just 2% of all parameters.
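As a quick sanity check on these figures, the arithmetic can be reproduced in a few lines. This is only a sketch: it assumes "256K" means 2^18 = 262,144 items, and the implied hidden dimension is derived from the quoted numbers rather than read from the paper.

```python
vocab_size = 2 ** 18       # assumption: "256K" means 262,144 vocabulary items
embedding_params = 3.6e9   # embedding size quoted in the text
total_params = 178e9       # J1-Jumbo total quoted in the text

# Fraction of the model spent on the vocabulary embedding.
share = embedding_params / total_params

# An n_vocab-by-d_model matrix has n_vocab * d_model entries,
# so the quoted size implies a hidden dimension of about this value.
implied_d_model = embedding_params / vocab_size

print(f"embedding share: {share:.1%}")
print(f"implied d_model: {implied_d_model:.0f}")
```

The share comes out at about 2%, consistent with the claim that a much larger vocabulary costs little in total parameter count at this scale.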

2. Results

**Comparing the efficiency of different tokenizers on various corpora, as measured by the average tokens-per-byte (TPB) ratio, i.e., number of tokens divided by number of bytes in a sample from the corpus.**
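The TPB ratio is simple to compute for any tokenizer. The sketch below uses two toy tokenizers (character-level and whitespace-level) as stand-ins; they are illustrative assumptions, not the GPT-3 or J1 tokenizers, and the sample sentence is made up.

```python
from typing import Callable, List

def tokens_per_byte(tokenize: Callable[[str], List[str]], text: str) -> float:
    """Number of tokens divided by the number of UTF-8 bytes in `text`."""
    n_bytes = len(text.encode("utf-8"))
    return len(tokenize(text)) / n_bytes

def char_tokenizer(text: str) -> List[str]:
    # Toy tokenizer: one token per character (least efficient).
    return list(text)

def word_tokenizer(text: str) -> List[str]:
    # Toy tokenizer: one token per whitespace-separated word.
    return text.split()

sample = "New York City is in the United States of America"
print("char TPB:", tokens_per_byte(char_tokenizer, sample))
print("word TPB:", tokens_per_byte(word_tokenizer, sample))
```

A lower TPB means the same text is covered with fewer tokens, which is the efficiency gain a larger vocabulary with multi-word tokens is aiming for.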

**Average log-probabilities per byte on a variety of corpora (Raffel et al., 2020; Gao et al., 2020).**
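Normalizing by bytes rather than tokens is what makes models with different tokenizers comparable: a model with fewer, longer tokens would otherwise look worse per token. A minimal sketch, with made-up log-probabilities for two hypothetical models scoring the same text:

```python
from typing import Sequence

def logprob_per_byte(token_logprobs: Sequence[float], text: str) -> float:
    """Total log-probability of the text divided by its UTF-8 byte length."""
    return sum(token_logprobs) / len(text.encode("utf-8"))

text = "hello world"  # 11 bytes

# Hypothetical per-token log-probabilities (illustrative values only).
model_a = [-2.0, -3.5]              # 2 tokens, large-vocabulary tokenizer
model_b = [-1.5, -1.5, -1.5, -1.5]  # 4 tokens, smaller-vocabulary tokenizer

a = logprob_per_byte(model_a, text)
b = logprob_per_byte(model_b, text)
print(a, b)
```

Here model B has the better average per token (-1.5 versus -2.75), yet model A assigns the higher probability to the text as a whole, and the per-byte measure reflects that.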

On almost all corpora, Jurassic-1 models are well ahead of their GPT-3 counterparts.

**Question-answering tasks (using J1-Jumbo)**

**Zero-shot results on a select set of tasks from Brown et al. (2020).**

On some tasks the Jurassic-1 models come out ahead, and on others GPT-3 does. On average, the two attain comparable performance.

**Results for few-shot learning on the DBPedia-14 and TREC-6 text-classification tasks.**

J1-Large is able to attain better results by allowing for more training examples to fit in the prompt for the same number of tokens.
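The link between tokenizer efficiency and few-shot capacity is simple arithmetic. The sketch below is illustrative only: the token budget, example size, and TPB values are hypothetical numbers chosen for the example, not measurements from the paper.

```python
def examples_that_fit(token_budget: int,
                      bytes_per_example: float,
                      tokens_per_byte: float) -> int:
    """How many few-shot examples fit in a prompt of `token_budget` tokens."""
    tokens_per_example = bytes_per_example * tokens_per_byte
    return int(token_budget // tokens_per_example)

budget = 2048          # hypothetical prompt budget, in tokens
example_bytes = 400    # hypothetical average size of one labeled example

# Hypothetical tokenizer efficiencies: lower TPB = fewer tokens per byte.
less_efficient = examples_that_fit(budget, example_bytes, tokens_per_byte=0.25)
more_efficient = examples_that_fit(budget, example_bytes, tokens_per_byte=0.20)
print(less_efficient, more_efficient)
```

With the same token budget, the more efficient tokenizer leaves room for more labeled examples in the prompt, which is the mechanism the review credits for the few-shot gains.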

Jurassic-1 models are marginally less biased than GPT-3, but bear in mind that this is merely one benchmark.

Summary

Jurassic-1 models can predict text from a broader set of domains (web, academic, legal, source code, and more) than GPT-3, achieve comparable performance in zero-shot settings, and can be superior to GPT-3 in few-shot due to their ability to fit more examples into a prompt.
In all cases, J1 models perform either on par with or better than their GPT-3 counterparts.


Written by Sik-Ho Tsang