Brief Review — ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages

Text-to-Code, Code-to-Text

Sik-Ho Tsang
5 min read · Nov 3, 2023
(a) Multilingual code pretraining; (b) Multilingual text pre-training; (c) Proposed Universal multilingual text-code pre-training.

ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages
ERNIE-Code, by Baidu,
2023 ACL (Sik-Ho Tsang @ Medium)

Language Model (LM)
2007 … 2022 [GLM] [Switch Transformers] [WideNet] [MoEBERT] [X-MoE]
==== My Other Paper Readings Are Also Over Here ====

  • Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa.
  • ERNIE-Code is proposed, which is a unified pre-trained language model for 116 NLs and 6 PLs. Two methods are employed for universal cross-lingual pre-training: (1) Span-corruption language modeling that learns patterns from monolingual NL or PL; and (2) pivot-based translation language modeling that relies on parallel data of many NLs and PLs.

Outline

  1. ERNIE-Code: Pretraining Objectives
  2. ERNIE-Code: Models & Corpus
  3. Results

1. ERNIE-Code: Pretraining Objectives

  • There are 2 pretraining tasks: One uses monolingual PL/NL data (unsupervised), while the other requires parallel NL-PL and NL-NL pairs (supervised).

The former lets the model learn intra-modal patterns from PL or NL alone, while the latter endows the model with cross-lingual/cross-modal alignment and zero-shot capabilities.

1.1. Task#1: Span-Corruption Language Modeling (SCLM)

Span-Corruption Language Modeling (SCLM)
  • The denoising pretraining objective first corrupts input sequences by masking or adding noise; and then recovers the original inputs by forcing the model to predict corrupted spans, sentences, or documents.

The span-corruption denoising pre-training is extended to both PL and NL, which is referred to as span-corruption language modeling (SCLM).

  • It corrupts 15% of the original NL/PL input tokens, with a mean span length of 3, by replacing contiguous, randomly-spaced spans of tokens with single mask placeholders, and then predicts the corrupted spans on the target side.
  • Suppose we have a total of M monolingual NL and PL corpora {C_m} where m = 1, …, M. The SCLM pre-training objective is applied to both NL and PL data in a multi-tasking fashion:

$$\mathcal{L}_{\mathrm{SCLM}}(\theta) = \sum_{m=1}^{M} \mathbb{E}_{x \sim \mathcal{C}_m}\Big[ -\sum_{t} \log P_\theta\big( x_{\mathrm{mask},\,t}^{(m)} \,\big|\, x_{\backslash\mathrm{mask}}^{(m)},\; x_{\mathrm{mask},\,<t}^{(m)} \big) \Big]$$

  • where θ denotes the trainable parameters; x^(m)_{\mask} and x^(m)_{mask} are the span-corrupted inputs and the corresponding target spans from monolingual corpus C_m, respectively; and x^(m)_{mask, <t} denotes the target tokens generated before the t-th time step.
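The masking scheme above can be sketched in a few lines. This is a minimal illustration of T5-style span corruption under the stated hyperparameters (15% corruption rate, mean span length 3), using the `<extra_id_n>` sentinel convention; it is not the paper's actual preprocessing code.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """Replace random contiguous spans with sentinel placeholders; the
    target lists each sentinel followed by the tokens it hides."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * corruption_rate))
    masked = set()
    # Pick random spans (exponential lengths, mean ~mean_span_len)
    # until roughly corruption_rate of the tokens are covered.
    while len(masked) < n_to_mask:
        span_len = max(1, round(rng.expovariate(1.0 / mean_span_len)))
        start = rng.randrange(len(tokens))
        masked.update(range(start, min(start + span_len, len(tokens))))
    inputs, target, sentinel, i = [], [], 0, 0
    while i < len(tokens):
        if i in masked:
            # One sentinel per contiguous masked span, on both sides.
            inputs.append(f"<extra_id_{sentinel}>")
            target.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and i in masked:
                target.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, target
```

Every input token ends up either kept in the corrupted input or hidden in the target, so the pair losslessly encodes the original sequence.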

1.2. Task#2: Pivot-based Translation Language Modeling (PTLM)

Pivot-based Translation Language Modeling (PTLM)
  • This work aims to narrow the cross-modal and cross-lingual gap between multiple NLs and PLs.

With bilingual PL-NL and NL-NL corpora, parallelism is jointly learned in dual directions with English as the pivot: for instance, Python↔English and English↔Russian.

Parallel source-target sentences are concatenated, and the model learns to predict the corrupted target language. Instead of masking random tokens, the whole target sentence is corrupted.

  • Suppose we have N bilingual NL-NL and NL-PL parallel corpora {D_n} where n = 1, …, N. The PTLM training objective is formulated as:

$$\mathcal{L}_{\mathrm{PTLM}}(\theta) = \sum_{n=1}^{N} \mathbb{E}_{(x_{\mathrm{source}},\, x_{\mathrm{target}}) \sim \mathcal{D}_n}\Big[ -\sum_{t} \log P_\theta\big( x_{\mathrm{target},\,t}^{(n)} \,\big|\, x_{\mathrm{source}}^{(n)},\; x_{\mathrm{target},\,<t}^{(n)} \big) \Big]$$

  • where x^(n)_{source} and x^(n)_{target} denote the source and target sentences from bilingual corpus D_n, respectively.
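The difference from SCLM is purely in how the example is formatted: instead of many small sentinels, one sentinel hides the entire target sentence, so recovering it amounts to translation rather than local denoising. A minimal sketch of this assumed layout (the sentinel convention mirrors the SCLM one; it is not the paper's exact code):

```python
def ptlm_pair(source_tokens, target_tokens):
    """Concatenate a parallel pair with the whole target sentence corrupted:
    the encoder sees the source plus one sentinel, and the decoder must emit
    the full target sentence behind that sentinel."""
    encoder_input = source_tokens + ["<extra_id_0>"]
    decoder_target = ["<extra_id_0>"] + target_tokens
    return encoder_input, decoder_target

# An English->Python pair; both directions are trained in practice.
enc, dec = ptlm_pair(["add", "two", "numbers"], ["def", "add(a,", "b):", "return", "a+b"])
```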

1.3. Zero-Shot Prompting

PTLM is reformatted by prompting with a task prefix (see Figure 3), in which a task instruction “translate A to B: \n” is prepended to the input sentences, where A and B denote the source and target languages, respectively.

  • This prompt instruction tells the model which target language to translate into, resulting in decent zero-shot abilities.
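Constructing the prompt is a one-liner; the prefix string follows the paper's description quoted above:

```python
def build_prompt(text, src_lang, tgt_lang):
    """Prepend the zero-shot task instruction 'translate A to B: \\n'
    to the input, telling the model the source and target languages."""
    return f"translate {src_lang} to {tgt_lang}: \n{text}"
```

For example, `build_prompt("def add(a, b): return a + b", "Python", "Russian")` asks for a Russian summary of a Python snippet without any task-specific fine-tuning.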

2. ERNIE-Code: Model & Corpus

2.1. Model

  • ERNIE-Code is built on the “T5.1.1” version, which improves upon T5 by using gated GeLU nonlinearities.
  • A set of tokens representing whitespace indentation of different lengths in PL is added to the vocabulary.
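The indentation tokens preserve code block structure that ordinary subword tokenization tends to destroy. A sketch of the idea, where the token name `<|tab|>` and the 4-space indent width are illustrative assumptions, not the model's actual vocabulary:

```python
def encode_indentation(code, indent_token="<|tab|>", indent_width=4):
    """Replace each line's leading whitespace with dedicated indentation
    tokens, one per indent level, so PL structure survives tokenization."""
    out = []
    for line in code.splitlines():
        stripped = line.lstrip(" ")
        levels = (len(line) - len(stripped)) // indent_width
        out.append(indent_token * levels + stripped)
    return "\n".join(out)
```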

2.2. Code Corpus

  • It covers monolingual code in 6 PLs (Go, Java, JavaScript, PHP, Python, and Ruby), plus NL-PL parallel data for the same 6 PLs, i.e., PL-NL query pairs.

2.3. Text Corpus

  • Monolingual data from CC-100 containing 116 different NLs.
  • Parallel data from OPUS website covering 15 languages. The collected NL translation pairs include MultiUN, IIT Bombay, OPUS, WikiMatrix, etc.
  • To alleviate the bias towards high-resource languages, the authors follow XLM and rebalance the sampling distribution over languages.

3. Results

3.1. Multilingual Code Summarization (Code-to-Text)

Multilingual Code Summarization (Code-to-Text)

ERNIE-Code outperforms all baseline LLMs pre-trained on either NL (mBART, mT5) or PL (PLBART, CodeT5). In particular, the ERNIE-Code variant with an input length of 1024 exceeds its 512-length counterpart (1.12 vs. 0.88).

  • Zero-Shot Prompting (Last Row): The proposed model demonstrates excellent zero-shot capability on Japanese and Russian summary generation, outperforming even the “translate-train” setting by 0.43 / 9.05 on BLEU / ROUGE-L on average.
Examples
  • Some examples are shown above.

3.2. Multilingual Text-to-Code Generation (Text-to-Code)

Multilingual Text-to-Code Generation (Text-to-Code)

ERNIE-Code outperforms all baselines on BLEU-4, ROUGE-L, and CodeBLEU scores, showing that the multilingual PL-NL pre-training can capture code syntax and semantics.

  • Zero-Shot Prompting (Last Row): The proposed model can produce code fragments in a zero-shot manner with higher CodeBLEU scores than the “translate-train” setting.

3.3. Documentation Translation (Text-to-Text)

Documentation Translation (Text-to-Text)

ERNIE-Code surpasses mT5 and XLM-R in all 8 translation directions.

3.4. Program Repair (Code-to-Code)

Program Repair (Code-to-Code)

On the “small” and “medium” tasks, ERNIE-Code achieves 80.10 and 91.20 BLEU, respectively, outperforming or matching previous SOTA results.

3.5. Ablation Study

Ablation Study

Removing either the monolingual (−SCLM) or the bilingual (−PTLM) pre-training task degrades overall performance across all tasks.

Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.