Brief Review — Codex: Evaluating Large Language Models Trained on Code

Codex is Proposed to Solve Programming Tasks. HumanEval Evaluation Dataset is Also Proposed.

Sik-Ho Tsang
4 min read · Apr 1


Codex (Image from OpenAI Codex)

Evaluating Large Language Models Trained on Code,
Codex & HumanEval, by OpenAI,
2021 arXiv v2, Over 470 Citations (Sik-Ho Tsang @ Medium)


  • Codex is proposed, which is a GPT language model fine-tuned on publicly available code from GitHub to study its Python code-writing capabilities.
  • On HumanEval, a new evaluation set, functional correctness is measured for synthesizing programs from docstrings.
Pass rates of Codex on the HumanEval dataset as a function of model size.
  • The proposed Codex solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%.


  1. Codex Showcase & HumanEval Evaluation Dataset
  2. Codex Training Dataset & Model
  3. Results

1. Codex Showcase & HumanEval Evaluation Dataset

1.1. Showcase

Three example problems from the HumanEval dataset
  • The prompt provided to the model is shown with a white background, and a successful model-generated completion is shown in a yellow background.
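
To make the prompt/completion split concrete, here is a minimal sketch of how one such problem could be checked. The problem text is modeled on the style of the paper's showcase figure, and the helper passes_tests and the test string are my own hypothetical names, not the official harness. A real harness would also run untrusted completions in a sandbox, which this sketch does not do.

```python
# The prompt: function signature plus docstring (what the model sees).
PROMPT = '''def incr_list(l: list):
    """Return list with elements incremented by 1.
    >>> incr_list([1, 2, 3])
    [2, 3, 4]
    """
'''

# A candidate completion: the function body the model would generate.
COMPLETION = "    return [i + 1 for i in l]\n"

# Hypothetical unit tests for this problem.
TESTS = "assert incr_list([1, 2, 3]) == [2, 3, 4]\nassert incr_list([]) == []\n"


def passes_tests(prompt: str, completion: str, tests: str) -> bool:
    """Execute prompt + completion, then the tests; True if nothing raises."""
    namespace = {}
    try:
        exec(prompt + completion, namespace)  # define the function
        exec(tests, namespace)                # run the assertions against it
    except Exception:
        return False
    return True


print(passes_tests(PROMPT, COMPLETION, TESTS))        # correct body
print(passes_tests(PROMPT, "    return l\n", TESTS))  # wrong body
```

The first call reports a pass and the second a failure, since returning the list unchanged violates the first assertion.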

1.2. HumanEval

  • HumanEval is proposed to evaluate the functional correctness on a set of 164 handwritten programming problems with unit tests.
  • Each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem.
  • These problems assess language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.
  • Real-world programming tasks often involve iterations of approaches and bug fixes, which is approximated by generating many samples from the models and selecting one that passes all unit tests.
  • Because code is quite different from natural language, match-based metrics such as BLEU are not accurate. Instead, the pass@k metric is used: k code samples are generated per problem, and the problem counts as solved if any sample passes the unit tests.
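
In practice, pass@k is usually estimated by drawing n ≥ k samples per problem, counting the c samples that pass, and computing 1 − C(n−c, k) / C(n, k), which is the chance that a random subset of k samples contains at least one correct one. Below is a minimal sketch under that assumption; the function name pass_at_k and the example numbers are mine.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed the unit tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every subset of size k has a pass.
        return 1.0
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 10 samples for one problem, 3 of them correct.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

The per-problem estimates are then averaged over all problems in the benchmark.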

2. Codex Training Dataset & Model

2.1. Training Dataset

  • The training dataset was collected in May 2020 from 54 million public software repositories hosted on GitHub, containing 179 GB of unique Python files under 1 MB.
  • Authors filtered out files which were likely auto-generated, had average line length greater than 100, had maximum line length greater than 1000, or contained a small percentage of alphanumeric characters.
  • After filtering, the final dataset totaled 159 GB.
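
The filtering rules above can be sketched as a simple predicate. The two line-length limits come from the description above. The alphanumeric threshold (0.25) and the marker strings used to guess at auto-generated files are my own assumptions for illustration, since the exact values are not given here.

```python
# Hypothetical markers; the actual auto-generation heuristic is not specified.
AUTOGEN_MARKERS = ("auto-generated", "autogenerated", "do not edit")


def keep_file(source: str, min_alnum_fraction: float = 0.25) -> bool:
    """Return True if a Python source file survives the filtering heuristics."""
    lines = source.splitlines()
    if not lines:
        return False
    head = source[:200].lower()
    if any(marker in head for marker in AUTOGEN_MARKERS):
        return False  # likely auto-generated (assumed heuristic)
    lengths = [len(line) for line in lines]
    if sum(lengths) / len(lengths) > 100:
        return False  # average line length greater than 100
    if max(lengths) > 1000:
        return False  # maximum line length greater than 1000
    alnum = sum(ch.isalnum() for ch in source)
    if alnum / len(source) < min_alnum_fraction:
        return False  # too few alphanumeric characters (assumed threshold)
    return True


print(keep_file("def f(x):\n    return x + 1\n"))
print(keep_file("data = '" + "a" * 2000 + "'\n"))
```

The first file is ordinary code and is kept. The second has a single 2000+ character line and is dropped.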

2.2. Model

  • GPT models containing up to 12B parameters are fine-tuned on code to produce Codex. Surprisingly, starting from pre-trained GPT-3 weights does not improve final performance compared with training from scratch, although it does converge faster.
  • Because the distribution of words in GitHub code differs from that of natural text, the original tokenizer is not very effective for representing code. An additional set of tokens is added to represent whitespace runs of different lengths, which allows code to be represented using approximately 30% fewer tokens.
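
To see why dedicated whitespace-run tokens help, here is a toy comparison. This is not the GPT or Codex tokenizer. It is a hypothetical regex tokenizer that either emits one token per space or one token per run of spaces, just to show how indentation inflates the token count.

```python
import re


def count_tokens(code: str, merge_whitespace: bool) -> int:
    """Toy tokenizer: words, single punctuation marks, spaces, and newlines."""
    if merge_whitespace:
        pattern = r"\w+|[^\w\s]| +|\n"  # a run of spaces is one token
    else:
        pattern = r"\w+|[^\w\s]| |\n"   # every space is its own token
    return len(re.findall(pattern, code))


snippet = "def f(x):\n    if x:\n        return 1\n    return 0\n"

plain = count_tokens(snippet, merge_whitespace=False)
merged = count_tokens(snippet, merge_whitespace=True)
print(plain, merged)
```

On this small snippet the merged variant needs noticeably fewer tokens. The exact saving depends heavily on the code and on the real tokenizer, so the ratio here should not be read as the paper's 30% figure.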

3. Results

3.1. Test Loss

Model cross-entropy test loss measured on a held-out split of the Python GitHub code corpus.
  • Test loss on a held-out validation set is plotted against Codex model size.

The model test loss follows a power law in model size.
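
A power law of the form loss = (N_c / N)^α is a straight line in log-log space, so it can be fitted by ordinary least squares on the logs. Below is a minimal sketch on synthetic data. The constants (α = 0.1, N_c = 1e8) and the helper fit_power_law are assumptions chosen for illustration, not values read off the paper's figure.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = (n_c / N) ** alpha by least squares in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    alpha = -slope                      # log loss = alpha*log(n_c) - alpha*log(N)
    n_c = math.exp(intercept / alpha)
    return alpha, n_c


# Synthetic losses generated from an assumed power law.
sizes = [1e7, 1e8, 1e9, 1e10]
losses = [(1e8 / n) ** 0.1 for n in sizes]

alpha, n_c = fit_power_law(sizes, losses)
print(alpha, n_c)
```

Because the synthetic data is noise-free, the fit recovers the assumed constants up to floating-point error. Real loss curves would only follow the line approximately.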

3.2. SOTA Comparisons on HumanEval

Codex, GPT-Neo, & TabNine evaluations for HumanEval.

GPT-Neo-2.7B achieves 6.4% pass@1 and 21.3% pass@100, which is roughly equivalent to Codex-85M, a model with about 30× fewer parameters.

Similarly, GPT-J-6B achieves 11.6% pass@1 and 27.7% pass@100, which is roughly equivalent to Codex-300M, a model with 20× fewer parameters.

3.3. SOTA Comparisons on APPS

Finetuned GPT-Neo numbers from the APPS paper referenced above.
  • The APPS dataset consists of 5000 training and 5000 test examples of coding problems. Most of the APPS test problems are not formulated as single-function synthesis tasks, but rather as full-program synthesis.

Codex-12B evaluated 1-shot achieves comparable performance to a GPT-Neo model fine-tuned on APPS.

3.4. Codex-S

  • There is also Codex-S, which is Codex further fine-tuned with supervision on correctly implemented standalone functions.

Codex-S outperforms the corresponding Codex by an average margin of 6.5 percentage points on pass@1 and by a larger average margin of 15.1 percentage points on pass@100 across model size.

3.5. Codex-D

Pass rates for docstring generating model Codex-D
  • There is also Codex-D, which is trained for the reverse task of generating a docstring from code.

It is found that Codex-D obtains similar but lower pass-rates compared to Codex-S.

  • The paper/report has 35 pages in total, and I present only part of it here. Please feel free to read the paper directly for more details, thanks.


