Brief Review — Codex: Evaluating Large Language Models Trained on Code
Codex is Proposed to Solve Programming Tasks. HumanEval Evaluation Dataset is Also Proposed.
- Codex is proposed, which is a GPT language model fine-tuned on publicly available code from GitHub to study its Python code-writing capabilities.
- On HumanEval, a new evaluation set, functional correctness is measured for synthesizing programs from docstrings.
- The proposed Codex solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%.
- Codex Showcase & HumanEval Evaluation Dataset
- Codex Training Dataset & Model
1. Codex Showcase & HumanEval Evaluation Dataset
- The prompt provided to the model is shown on a white background, and a successful model-generated completion is shown on a yellow background.
- HumanEval is proposed to evaluate functional correctness on a set of 164 handwritten programming problems with unit tests.
- Each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem.
- These problems assess language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.
- Real-world programming tasks often involve iterations of approaches and bug fixes, which is approximated by generating many samples from the models and selecting one that passes all unit tests.
- Since generated code is judged by functional correctness rather than textual similarity to a reference, match-based metrics such as BLEU are not accurate. Instead, the pass@k metric is used: k code samples are generated per problem, and a problem is counted as solved if any sample passes the unit tests.
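To make the benchmark format concrete, here is a problem in the HumanEval style. This is a hypothetical illustration written for this review, not an actual task from the benchmark: the model receives the signature and docstring as a prompt and must synthesize the body, which the hidden unit tests then check.

```python
# Hypothetical HumanEval-style problem (illustration only).
# The prompt given to the model is the signature + docstring:
PROMPT = '''def running_max(numbers):
    """Return a list where element i is the maximum of numbers[:i+1].

    >>> running_max([1, 3, 2, 5])
    [1, 3, 3, 5]
    """
'''

# A reference completion that the unit tests would accept:
def running_max(numbers):
    result, best = [], float("-inf")
    for x in numbers:
        best = max(best, x)
        result.append(best)
    return result

# Unit tests of the kind attached to each problem
# (the benchmark averages 7.7 tests per problem):
assert running_max([1, 3, 2, 5]) == [1, 3, 3, 5]
assert running_max([4]) == [4]
assert running_max([2, 2, 1]) == [2, 2, 2]
```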
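Naively sampling exactly k completions makes the pass@k estimate high-variance, so the paper instead draws n ≥ k samples per problem, counts the number c that pass the unit tests, and uses the unbiased estimator pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper.

    n: total samples drawn per problem
    c: number of samples that passed all unit tests
    k: budget being evaluated
    """
    if n - c < k:
        # Fewer failures than k draws: some sample must pass.
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)

# 20 of 200 samples pass -> pass@1 is simply the pass fraction, 0.1:
print(pass_at_k(200, 20, 1))
```

Averaging `pass_at_k` over all 164 problems gives the headline numbers such as Codex's 28.8% pass@1.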
2. Codex Training Dataset & Model
2.1. Training Dataset
- The training dataset was collected in May 2020 from 54 million public software repositories hosted on GitHub, containing 179 GB of unique Python files under 1 MB.
- The authors filtered out files that were likely auto-generated, had an average line length greater than 100, had a maximum line length greater than 1000, or contained a small percentage of alphanumeric characters.
- After filtering, the final dataset totaled 159 GB.
- GPT models containing up to 12B parameters are fine-tuned on code to produce Codex. Surprisingly, fine-tuning from a pretrained GPT-3 does not show improvements over training on code from scratch, although fine-tuned models converge more quickly.
- Because the distribution of words in GitHub code differs from that of natural text, the original GPT-3 tokenizer is not very effective for representing code. An additional set of tokens is added for representing whitespace runs of different lengths, which allows code to be represented using approximately 30% fewer tokens.
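The file-filtering heuristics above can be sketched as a simple keep/drop predicate. The alphanumeric-fraction threshold below is an assumption (the paper says only "a small percentage"), and the auto-generated-file detection is omitted entirely:

```python
def keep_file(text: str,
              max_avg_line: int = 100,
              max_line: int = 1000,
              min_alnum_frac: float = 0.25) -> bool:
    """Heuristic filter mirroring the paper's stated criteria.

    min_alnum_frac is an assumed threshold, not from the paper;
    the auto-generated-code check is not modeled here.
    """
    lines = text.splitlines()
    if not lines:
        return False
    if max(len(line) for line in lines) > max_line:
        return False
    if sum(len(line) for line in lines) / len(lines) > max_avg_line:
        return False
    alnum = sum(ch.isalnum() for ch in text)
    return alnum / len(text) >= min_alnum_frac
```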
3.1. Test Loss
- Test loss on a held-out validation set is plotted against Codex model size; the test loss follows a power law in model size.
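A power law L = a·N^b appears as a straight line in log-log space, so the exponent can be recovered with a linear fit on the logarithms. A sketch using synthetic data (the sizes, losses, and exponent below are made up for illustration, not the paper's measurements):

```python
import numpy as np

# Synthetic (model_size, test_loss) points lying on L = 2.0 * N^-0.1.
# These are illustrative values, not data from the paper.
sizes = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
losses = 2.0 * sizes ** -0.1

# log L = log a + b * log N, so a degree-1 polyfit on the logs
# recovers the exponent b and the coefficient a.
b, log_a = np.polyfit(np.log(sizes), np.log(losses), 1)
print(f"exponent b = {b:.3f}, coefficient a = {np.exp(log_a):.3f}")
```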
3.2. SOTA Comparisons on HumanEval
- GPT-Neo-2.7B achieves 6.4% pass@1 and 21.3% pass@100, roughly equivalent to Codex-85M (30× fewer parameters).
- Similarly, GPT-J-6B achieves 11.6% pass@1 and 27.7% pass@100, roughly equivalent to Codex-300M (20× fewer parameters).
3.3. SOTA Comparisons on APPS
- The APPS dataset consists of 5000 training and 5000 test examples of coding problems. Most APPS test problems are formulated not as single-function synthesis tasks but as full-program synthesis.
- Codex-12B evaluated 1-shot achieves performance comparable to a GPT-Neo model fine-tuned on APPS.
- There is also Codex-S, obtained by supervised fine-tuning. Codex-S outperforms the corresponding Codex by an average margin of 6.5 percentage points on pass@1, and by a larger average margin of 15.1 percentage points on pass@100, across model sizes.
- There is also Codex-D for docstring generation. It is found that Codex-D obtains similar but lower pass rates compared to Codex-S.
- The paper has 35 pages in total; only some highlights are presented here. Please feel free to read the paper directly for more details.