Brief Review — Jurassic-1: Technical Details and Evaluation

Jurassic-1, Comparable or Even Better Than GPT-3

Sik-Ho Tsang
4 min read · Mar 25



Jurassic-1 Technical details and evaluation,
Jurassic-1, by AI21 Labs,
2021 White Paper, Over 70 Citations (Sik-Ho Tsang @ Medium)

Language Model
[GPT-NeoX-20B] [GPT-3.5, InstructGPT] [GLM] [MT-NLG 530B] 2023 [GPT-4]

  • Jurassic-1 is a pair of auto-regressive language models released by AI21 Labs, consisting of J1-Jumbo, a 178B-parameter model, and J1-Large, a 7.5B-parameter model.
  • Their performance is compared with GPT-3.


  1. Jurassic-1
  2. Results

1. Jurassic-1

1.1. Model

Comparing the architecture of the Jurassic-1 models to their GPT-3 counterparts.
  • The models are based on the decoder module of the Transformer architecture with the modifications proposed by GPT-2.
  • Input tokens are first converted to vector representations with an n_vocab-by-d_model embedding matrix.
  • The architecture is composed of n_layers Transformer layers using a hidden dimension d_model, each equipped with a self-attention module with n_heads attention heads of size d_head and a feed-forward module.
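To make the roles of n_vocab, d_model, n_layers, n_heads and d_head concrete, here is a minimal sketch of how they determine a rough parameter count. The `DecoderConfig` class and the toy sizes are hypothetical, and the count assumes a feed-forward hidden size of 4 × d_model and ignores biases and layer-norm weights, so it is an approximation rather than the paper's exact accounting.

```python
from dataclasses import dataclass


@dataclass
class DecoderConfig:
    """Hypothetical container for the hyperparameters named in the text."""
    n_vocab: int
    d_model: int
    n_layers: int
    n_heads: int
    d_head: int

    def embedding_params(self) -> int:
        # The n_vocab-by-d_model embedding matrix.
        return self.n_vocab * self.d_model

    def attention_params(self) -> int:
        # Q, K and V projections (d_model -> n_heads * d_head)
        # plus the output projection back to d_model.
        return 4 * self.d_model * self.n_heads * self.d_head

    def feed_forward_params(self) -> int:
        # Assumption: two linear maps with a hidden size of 4 * d_model.
        return 2 * self.d_model * (4 * self.d_model)

    def total_params(self) -> int:
        per_layer = self.attention_params() + self.feed_forward_params()
        return self.embedding_params() + self.n_layers * per_layer


# Toy sizes, chosen only for illustration (not taken from the paper).
toy = DecoderConfig(n_vocab=1000, d_model=64, n_layers=2, n_heads=4, d_head=16)
print("embedding:", toy.embedding_params())
print("total:", toy.total_params())
```

When n_heads × d_head equals d_model, each layer holds about 12 × d_model² weights under these assumptions, which is the usual rule of thumb for decoder-only Transformers.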

Two model sizes are designed: J1-Large (7.5B parameters) and J1-Jumbo (178B parameters). They roughly correspond to the GPT-3 6.7B and GPT-3 175B models, respectively.

  • The design is based on a recently proposed theory (Levine et al., 2020) for the depth-to-width expressivity tradeoff.

Specifically, for a parameter budget of 175B (not including the embedding matrix), the optimal depth should be around 80 layers, far fewer than the 96 layers used by GPT-3 175B.

  • However, 76 layers are used rather than 80 layers because of various hardware considerations during both training and inference.
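The depth-versus-width tradeoff can be illustrated with a little arithmetic. The sketch below assumes the common approximation of about 12 × n_layers × d_model² non-embedding parameters, which is not stated in this review, and `width_for_budget` is a hypothetical helper. It only shows how much wider a shallower model must be to spend the same 175B budget; it does not reproduce the theory of Levine et al. (2020).

```python
import math


def width_for_budget(param_budget: float, n_layers: int) -> float:
    """Hidden size d_model that spends `param_budget` non-embedding
    parameters across `n_layers` layers, assuming 12 * d_model**2 per layer."""
    return math.sqrt(param_budget / (12 * n_layers))


budget = 175e9  # non-embedding parameter budget quoted in the text
for depth in (76, 80, 96):
    print(f"{depth} layers -> d_model of roughly {width_for_budget(budget, depth):,.0f}")
```

Under this approximation, going from 96 layers down to 76 raises the required width by roughly 12%, which is the "shallower but wider" shape the text describes.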

In the benchmarks, comparing the proposed architecture against GPT-3 175B on the same hardware configuration, J1 has modest benefits in training time (1.5% speedup per iteration), but significant runtime gains in batch inference (7%) and text generation (up to 23%).

1.2. Large Vocabulary for Tokenization Efficiency

Examples of items from J1’s vocabulary, including word-pieces, whole words, and multi-word expressions.
An example showing how multi-word tokens and an overall larger vocabulary can better articulate the various options the model considers at a given point in the text.
  • A SentencePiece tokenizer is trained with a larger budget of 256K vocabulary items and without restricting tokens to word boundaries, so a single token can span several words.

The vocabulary embedding for J1-Jumbo, for example, requires 3.6B parameters, which is just 2% of all parameters.
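The 2% figure is easy to sanity-check from the numbers quoted above. In this small sketch, reading "256K" as 256,000 items is an assumption (it could also mean 2**18), so the implied hidden size is only a ballpark value and not a figure from the paper.

```python
embedding_params = 3.6e9   # from the text
total_params = 178e9       # J1-Jumbo size, from the text
n_vocab = 256_000          # assumption: "256K" read as 256,000 items

# Share of all parameters that sit in the embedding matrix.
share = embedding_params / total_params

# The embedding matrix is n_vocab-by-d_model, so its size implies a width.
implied_d_model = embedding_params / n_vocab

print(f"embedding share: {share:.1%}")
print(f"implied d_model: about {implied_d_model:,.0f}")
```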

2. Results

Comparing the efficiency of different tokenizers on various corpora, as measured by the average tokens-per-byte (TPB) ratio, i.e., the number of tokens divided by the number of bytes in a sample from the corpus.
Average log-probabilities per byte on a variety of corpora (Raffel et al., 2020; Gao et al., 2020).

On almost all corpora, the Jurassic-1 models are well ahead of their GPT-3 counterparts.
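Both metrics above are normalized by bytes rather than tokens, which makes models with different tokenizers comparable. Here is a minimal sketch of the two quantities. The token splits and log-probabilities are made up for illustration, and the function names are hypothetical.

```python
def tokens_per_byte(tokens: list[str], text: str) -> float:
    """TPB: number of tokens divided by number of UTF-8 bytes."""
    return len(tokens) / len(text.encode("utf-8"))


def log_prob_per_byte(token_log_probs: list[float], text: str) -> float:
    """Total log-probability of the text, normalized by its byte length."""
    return sum(token_log_probs) / len(text.encode("utf-8"))


text = "New York City is large"

# Hypothetical tokenizations of the same text.
word_level = ["New", " York", " City", " is", " large"]
multi_word = ["New York City", " is", " large"]

print("TPB, word-level:", tokens_per_byte(word_level, text))
print("TPB, multi-word:", tokens_per_byte(multi_word, text))

# Made-up per-token log-probabilities for the multi-word tokenization.
print("log-prob per byte:", log_prob_per_byte([-2.0, -1.0, -0.5], text))
```

A lower TPB means the tokenizer needs fewer tokens for the same text, and a higher (less negative) log-probability per byte means the model predicts the text better.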

Question-answering tasks (using J1-Jumbo)
Zero-shot results on a select set of tasks from Brown et al. (2020).

On some tasks, the Jurassic-1 models come out ahead, and on others GPT-3 does. On average, both models attain the same performance.

Results for few-shot learning on the DBPedia-14 and TREC-6 text-classification tasks.

J1-Large attains better results because its more efficient tokenization lets more training examples fit in the prompt for the same number of tokens.
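The effect of tokenization efficiency on few-shot prompts comes down to simple counting. In the sketch below, the 2048-token budget, the query length and the per-example token counts are all illustrative assumptions, and `examples_that_fit` is a hypothetical helper.

```python
def examples_that_fit(token_budget: int, tokens_per_example: int, query_tokens: int) -> int:
    """How many few-shot examples fit in the prompt after reserving room for the query."""
    return max(0, (token_budget - query_tokens) // tokens_per_example)


budget = 2048   # assumed context length in tokens
query = 60      # assumed tokens reserved for the test query

# The same example text costs fewer tokens under a more efficient tokenizer.
less_efficient = examples_that_fit(budget, tokens_per_example=75, query_tokens=query)
more_efficient = examples_that_fit(budget, tokens_per_example=60, query_tokens=query)

print("examples with 75 tokens each:", less_efficient)
print("examples with 60 tokens each:", more_efficient)
```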

Jurassic-1 models are marginally less biased than GPT-3, but bear in mind that this is merely one benchmark.


Jurassic-1 models can predict text from a broader set of domains (web, academic, legal, source code, and more) than GPT-3, achieve comparable performance in zero-shot settings, and can be superior to GPT-3 in few-shot due to their ability to fit more examples into a prompt.

In all cases, J1 models perform on par with or better than their GPT-3 counterparts.


