Brief Review — Competition-level code generation with AlphaCode

AlphaCode, Transformer for Competition-Level Code Generation

Sik-Ho Tsang
4 min read · Feb 23, 2024

Competition-level code generation with AlphaCode, by DeepMind
2022 Science, Over 540 Citations (Sik-Ho Tsang @ Medium)

Language Model (LM)
2022 [GLM] [Switch Transformers] [WideNet] [MoEBERT] [X-MoE] [sMLP] [LinkBERT, BioLinkBERT] 2023 [ERNIE-Code]
==== My Other Paper Readings Are Also Over Here ====

  • Recent Transformer-based neural network models show impressive code generation abilities yet still perform poorly on more complex tasks requiring problem-solving skills.
  • AlphaCode is proposed, a system for code generation that creates millions of diverse programs using specially trained Transformer-based networks, then filters and clusters those programs down to a maximum of just 10 submissions.
  • This result marks the first time an AI system has performed competitively in programming competitions.
  • Later on, AlphaCode 2, powered by Gemini, was also proposed. (I hope to read it in the near future.)


  1. AlphaCode
  2. Results

1. AlphaCode

1.1. Problem & Motivation

  • Creating solutions to problems can be seen as a sequence-to-sequence (Seq2Seq) prediction task:

Given a problem description X in natural language (e.g., Fig. 1A), produce a corresponding solution Y in a programming language (e.g., Fig. 1B).

  • This task naturally motivated an encoder-decoder Transformer architecture for AlphaCode, which models p(Y|X).

1.2. AlphaCode Framework


The models are pretrained on a 715-GB snapshot of human code from GitHub, with cross-entropy next-token prediction loss.

Then, models are fine-tuned and evaluated on a 2.6-GB dataset of competitive programming problems that the authors created and released under the name CodeContests.

  • CodeContests includes problems, correct and incorrect human submissions, and test cases. The training set contains 13,328 problems, with an average of 922.4 submissions per problem. The validation and test sets contain 117 and 165 problems, respectively.
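To make the dataset contents concrete, here is a hypothetical sketch of what one CodeContests record might look like. The class and field names are my own illustration, not the released schema:

```python
from dataclasses import dataclass, field

@dataclass
class ContestProblem:
    """Hypothetical record layout for a CodeContests-style problem."""
    description: str                                # natural-language statement
    correct_submissions: list[str] = field(default_factory=list)    # human solutions
    incorrect_submissions: list[str] = field(default_factory=list)  # human failures
    tests: list[tuple[str, str]] = field(default_factory=list)      # (input, expected output)

# Example: a toy problem with one test case
problem = ContestProblem(description="Sum two integers.")
problem.tests.append(("1 2", "3"))
```

Keeping both correct and incorrect submissions per problem is what makes the value-conditioning signal described below possible.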

1.3. Enhancements

  • The models included the following enhancements on top of a standard Transformer encoder-decoder architecture.
  • 1) Multi-Query Attention: A full set of query heads is used, but the key and value heads are shared within each attention block, which substantially reduces memory usage and cache-update costs. This change increased AlphaCode’s sampling speed more than 10-fold.
  • 2) Masked language modeling (MLM): An MLM loss, as in BERT, is added on the encoder, which empirically improved model solve rates.
  • 3) Tempering: Tempering (25) artificially made the training distribution sharper than the sampling distribution, which acts as a regularizer to prevent overfitting.
  • 4) Value conditioning and prediction: It is used to discriminate between correct and incorrect problem submissions contained in CodeContests, providing an additional training signal.
  • 5) Generation by off-policy learning from demonstrations (GOLD): A variation of GOLD (26) is adopted, an offline reinforcement learning (RL) algorithm that focuses training on the most likely solutions for each problem instead of all possible solutions.
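The key idea of multi-query attention, the first enhancement above, can be sketched in a few lines of NumPy. This is a minimal single-block illustration under my own assumptions (names, shapes, and the absence of masking are all illustrative), not AlphaCode's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, Wq, Wk, Wv, n_heads):
    """Multi-query attention: many query heads, one shared key/value head.

    x:  (seq, d_model)
    Wq: (d_model, n_heads * d_head)   -- one projection per query head
    Wk, Wv: (d_model, d_head)         -- single projection shared by all heads
    """
    seq, _ = x.shape
    d_head = Wk.shape[1]
    q = (x @ Wq).reshape(seq, n_heads, d_head)  # per-head queries
    k = x @ Wk                                  # one shared key head
    v = x @ Wv                                  # one shared value head
    # Attention scores per head: (n_heads, seq, seq)
    scores = np.einsum('shd,td->hst', q, k) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)
    out = np.einsum('hst,td->shd', attn, v)     # (seq, n_heads, d_head)
    return out.reshape(seq, n_heads * d_head)
```

Because only one key and one value head are cached per block instead of `n_heads` of each, the KV cache shrinks by roughly a factor of `n_heads`, which is where the large sampling speedup comes from.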

1.4. Sampling, Filtering, and Clustering

  • At sampling time, sample diversity was important so that the millions of samples drawn for each problem explored the space of possible programs. A high sampling temperature is used, and samples are conditioned on random metadata, such as problem difficulty ratings and problem tags.
  • To select the 10 best samples for submission, filtering and clustering are applied.
  • Filtering executed samples using example tests included in the problem statement and removed samples that failed those tests. This filtering removed ~99% of model samples.
  • The possibly tens of thousands of candidate samples that remained were then clustered by executing them on inputs generated by a separate Transformer model trained to do test input generation and by grouping together programs that produced the same outputs on the generated inputs.
  • One sample is then picked from each of the 10 largest clusters for submission, covering approximately the most likely program behaviors under the model.
  • Intuitively, correct programs would behave the same and form large clusters, but incorrect programs could fail in many different ways.
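The filter-then-cluster pipeline above can be sketched as follows. This is a simplified illustration under my own assumptions: `run`, the program representation, and all names are hypothetical stand-ins, not AlphaCode's actual interfaces:

```python
from collections import defaultdict

def filter_and_cluster(candidates, example_tests, gen_inputs, run, n_submit=10):
    """Sketch of AlphaCode-style sample selection.

    candidates:    list of candidate programs
    example_tests: [(input, expected_output), ...] from the problem statement
    gen_inputs:    extra inputs (in AlphaCode, produced by a learned
                   test-input-generation model)
    run:           run(program, input) -> output string
    """
    # 1) Filtering: keep only programs that pass the statement's example tests.
    passing = [p for p in candidates
               if all(run(p, i) == o for i, o in example_tests)]
    # 2) Clustering: group programs by their outputs on the generated inputs;
    #    programs with identical observed behavior land in the same cluster.
    clusters = defaultdict(list)
    for p in passing:
        signature = tuple(run(p, i) for i in gen_inputs)
        clusters[signature].append(p)
    # 3) Submit one representative from each of the largest clusters.
    largest = sorted(clusters.values(), key=len, reverse=True)[:n_submit]
    return [cluster[0] for cluster in largest]
```

In a toy run with two behaviorally identical correct programs and one wrong one, the wrong program is removed by filtering and the two correct ones collapse into a single cluster, so only one submission is spent on that behavior.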

1.5. Ensembling

  • 41 billion (41B) and 9 billion (9B) parameter models are ensembled by pooling their samples.

2. Results

Overall, the proposed system achieved an average ranking in the top 54.3% when limited to 10 submissions per problem, and 66% of solved problems were solved with the first submission.

  • This performance in competitions approximately corresponds to a novice programmer with a few months to a year of training.
  • This is the first time that a computer system has been competitive with human participants in programming competitions.
  • Prior work, such as Codex evaluated on the APPS dataset, achieved only low single-digit solve rates on comparable problems.

Ablation Study on Scaling
  • Fig. 3A: Solve rates scale log-linearly with more samples. Better models have higher slopes in the scaling curve.
  • Fig. 3B: Solve rates scale log-linearly with more compute.
  • Fig. 3C: Larger models required more compute per sample but eventually outperformed smaller models.
  • Fig. 3D: Sample selection is critical to solve rate scaling.

Ablation Study on Enhancements

A buildup ablation of AlphaCode’s enhancements shows that they significantly improved the solve rate for models at the 1B scale.


