Brief Review — AlphaCode 2 Technical Report

AlphaCode 2, Code Generation Using Gemini, Performed Better than 85% of Competition Participants

Sik-Ho Tsang
3 min read · Mar 28, 2024

AlphaCode 2 Technical Report
AlphaCode 2, by AlphaCode Team, Google DeepMind
2023 DeepMind (Sik-Ho Tsang @ Medium)

Large Language Model (LLM)
2020 … 2023 [GPT-4] [LLaMA] [Koala] [BloombergGPT] [GLM-130B] [UL2] [PaLM 2] [Llama 2] [MultiMedQA, HealthSearchQA, Med-PaLM] [Med-PaLM 2] [Flan 2022, Flan-T5]

  • AlphaCode 2 is proposed: a new and enhanced system, powered by Gemini, with massively improved performance.
  • It relies on the combination of powerful language models and a bespoke search and reranking mechanism.

Outline

  1. AlphaCode 2
  2. Results

1. AlphaCode 2

AlphaCode 2
  • The major difference from AlphaCode is that AlphaCode 2 makes use of Gemini Pro as a foundation model.

1.1. Policy and Fine-Tuning

The starting point is the Gemini Pro model, on which two consecutive rounds of fine-tuning are applied using GOLD.

  • First, the model is fine-tuned on an updated version of the CodeContests dataset. This dataset contains approximately 15 thousand problems and 30 million human code samples.
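To make the fine-tuning objective a little more concrete, here is a minimal sketch of a GOLD-style loss next to plain maximum likelihood. It is only an illustration of the general idea (each token's log-likelihood is re-weighted by the probability the model already assigns to it, so tokens far outside the model's distribution contribute less). The function names, the `min_weight` floor and the example probabilities are assumptions for this sketch, not details from the report.

```python
import math

def mle_loss(token_probs):
    """Standard negative log-likelihood over the target tokens."""
    return -sum(math.log(p) for p in token_probs)

def gold_style_loss(token_probs, min_weight=0.1):
    """GOLD-style loss: weight each token's NLL by the model's own probability.

    In a real training framework the weight would be treated as a constant
    (no gradient flows through it). `min_weight` is an assumed lower bound
    that keeps very unlikely tokens from being ignored completely.
    """
    loss = 0.0
    for p in token_probs:
        weight = max(p, min_weight)
        loss += -weight * math.log(p)
    return loss

# Hypothetical probabilities the model assigns to three target tokens.
token_probs = [0.9, 0.5, 0.01]
mle = mle_loss(token_probs)
gold = gold_style_loss(token_probs)
```

In this toy example the very unlikely third token dominates the maximum-likelihood loss, while its contribution to the GOLD-style loss is scaled down by its small weight.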

1.2. Sampling

Sampling is used to generate up to a million code samples per problem, using a randomized temperature parameter for each sample to encourage diversity.

  • Randomized targeted metadata, such as the problem difficulty rating and its categorical tags, is included in the prompt.
  • While both Python and C++ were sampled for AlphaCode, only C++ samples are used for AlphaCode 2, as they are of higher quality.
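The sampling setup described above can be sketched roughly as follows. The prompt layout, the temperature range, and the lists of ratings and tags are all hypothetical placeholders chosen for illustration; the report does not specify them.

```python
import random

def build_sampling_requests(problem_statement, n_samples, seed=0):
    """Build one (prompt, temperature) request per sample.

    Every value below (rating list, tag list, temperature range, prompt
    layout) is an assumption for illustration only.
    """
    rng = random.Random(seed)
    ratings = [800, 1200, 1600, 2000, 2400]
    tags = ["dp", "greedy", "graphs", "math", "strings"]
    requests = []
    for _ in range(n_samples):
        # A fresh temperature per sample encourages diverse outputs.
        temperature = rng.uniform(0.2, 1.0)
        # Randomized metadata: a difficulty rating and a few tags.
        rating = rng.choice(ratings)
        chosen_tags = rng.sample(tags, k=rng.randint(1, 3))
        prompt = (
            f"RATING: {rating}\n"
            f"TAGS: {', '.join(chosen_tags)}\n"
            "LANGUAGE: C++\n\n"
            f"{problem_statement}"
        )
        requests.append({"prompt": prompt, "temperature": temperature})
    return requests

requests = build_sampling_requests(
    "Given n integers, print their sum.", n_samples=5, seed=42
)
```

Each request would then be sent to the fine-tuned policy model; that call is omitted here because the model is not publicly available.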

1.3. Filtering

Each code sample is executed on the corresponding test input. Samples that do not produce the expected output, and therefore could not have been correct, are filtered out, as are the fewer than 5% of samples that do not compile.

  • On average, this filtering removes approximately 95% of the samples.
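A toy version of this filtering step might look like the sketch below. To keep it runnable, each "program" is modeled as a Python callable instead of compiled C++, and `None` stands in for a sample that fails to compile; both are assumptions of this sketch.

```python
def filter_samples(samples, test_input, expected_output):
    """Keep only samples that run and reproduce the expected output."""
    survivors = []
    for sample in samples:
        program = sample["program"]
        if program is None:
            # Stand-in for a sample that does not compile.
            continue
        try:
            output = program(test_input)
        except Exception:
            # Runtime errors count as failures.
            continue
        if output == expected_output:
            survivors.append(sample)
    return survivors

# Hypothetical samples for the problem "print the sum of the integers".
samples = [
    {"name": "correct", "program": lambda s: str(sum(map(int, s.split())))},
    {"name": "wrong", "program": lambda s: str(max(map(int, s.split())))},
    {"name": "crashes", "program": lambda s: str(1 // 0)},
    {"name": "no_compile", "program": None},
]
kept = filter_samples(samples, test_input="1 2 3", expected_output="6")
kept_names = [s["name"] for s in kept]
```

Only the sample named `correct` survives here; the other three are dropped for a wrong answer, a runtime error, and a simulated compile failure respectively.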

1.4. Clustering

  • After filtering, an average of 50 thousand candidates per problem still remain.

A separate model is trained to generate new test inputs for each problem, and the remaining samples are executed on these new inputs. The outputs produced form a signature that helps group similar code samples together into clusters. The clusters are then ordered by their cardinality, and only the 10 largest are kept.

  • A single sample per cluster is submitted to the online judge to obtain the best result.
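The clustering idea can be illustrated with a short sketch: samples that produce identical outputs on the generated inputs share a signature and fall into the same cluster. As before, programs are modeled as Python callables, and the names and inputs are hypothetical.

```python
from collections import defaultdict

def cluster_by_behaviour(samples, generated_inputs, keep=10):
    """Group samples by their outputs on generated inputs, largest first."""
    clusters = defaultdict(list)
    for name, program in samples.items():
        # The tuple of outputs acts as the behavioural signature.
        signature = tuple(program(x) for x in generated_inputs)
        clusters[signature].append(name)
    # Order clusters by cardinality and keep only the largest ones.
    ordered = sorted(clusters.values(), key=len, reverse=True)
    return ordered[:keep]

# Hypothetical surviving samples: three behave identically, one differs.
samples = {
    "a": lambda x: x * 2,
    "b": lambda x: x + x,
    "c": lambda x: x ** 2,
    "d": lambda x: 2 * x,
}
top_clusters = cluster_by_behaviour(samples, generated_inputs=[1, 2, 3], keep=10)
```

Here `a`, `b` and `d` all double their input and end up in one cluster, while `c` forms a cluster of its own. Since the members of a cluster are interchangeable on the generated inputs, submitting one representative per cluster avoids spending several of the 10 submissions on near-duplicate behaviour.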

1.5. Scoring Model

  • A second Gemini Pro model is fine-tuned to attribute an estimated correctness score between 0 and 1 to code samples.

Using this scoring model, a score is computed for each code sample in the remaining clusters; the best candidate sample is then selected out of each cluster based on this predicted score to form the final list of 10 submissions.
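The final selection step can be sketched as picking the highest-scoring sample from each remaining cluster. The scores below are made-up numbers standing in for the output of the fine-tuned scoring model, which is not publicly available.

```python
def pick_submissions(clusters, score_fn):
    """Return the best-scoring sample from each cluster, in cluster order."""
    return [max(cluster, key=score_fn) for cluster in clusters]

# Hypothetical correctness scores in [0, 1] for each sample.
scores = {"a": 0.31, "b": 0.87, "d": 0.55, "c": 0.12}
clusters = [["a", "b", "d"], ["c"]]
submissions = pick_submissions(clusters, scores.get)
```

With these example scores, `b` is chosen from the first cluster and `c` from the second.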

2. Results

Ranking
  • AlphaCode 2 is evaluated on Codeforces, the same platform as AlphaCode.
  • 12 recent contests, each with more than 8000 participants, are selected from either division 2 or the harder division “1+2”. This makes for a total of 77 problems.

AlphaCode 2 solved 43% of these competition problems, a close to 2× improvement over the prior record-setting AlphaCode system, which solved 25%.

Mapping this to competition rankings, as above, it is estimated that AlphaCode 2 sits at the 85th percentile on average — i.e. it performs better than 85% of entrants, ranking just between the ‘Expert’ and ‘Candidate Master’ categories on Codeforces.

  • In the two contests where it performs best, AlphaCode 2 outperforms more than 99.5% of competition participants!

Solve Rate

AlphaCode 2 requires about 100 samples to reach the level of performance of AlphaCode with a million samples, making it over 10000× more sample efficient.
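As a quick sanity check, the headline ratios in this section can be reproduced from the quoted figures alone. The variable names are mine; the numbers are the ones stated above.

```python
# Solve rates quoted in the report.
alphacode2_solve_rate = 0.43
alphacode_solve_rate = 0.25
solve_rate_improvement = alphacode2_solve_rate / alphacode_solve_rate

# Samples needed to reach AlphaCode's level of performance.
alphacode_samples = 1_000_000
alphacode2_samples = 100  # "about 100" in the report
sample_efficiency_gain = alphacode_samples / alphacode2_samples

print(f"Solve-rate improvement: {solve_rate_improvement:.2f}x")
print(f"Sample-efficiency gain: {sample_efficiency_gain:,.0f}x")
```

The solve-rate ratio comes out at roughly 1.7×, consistent with "close to 2×", and the sample ratio is 10,000× when taking "about 100" as exactly 100.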

