Brief Review — ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages
Text-to-Code, Code-to-Text
ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages
ERNIE-Code, by Baidu,
2023 ACL (Sik-Ho Tsang @ Medium)
- Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa.
- ERNIE-Code is proposed, which is a unified pre-trained language model for 116 NLs and 6 PLs. Two methods are employed for universal cross-lingual pre-training: (1) Span-corruption language modeling that learns patterns from monolingual NL or PL; and (2) pivot-based translation language modeling that relies on parallel data of many NLs and PLs.
Outline
- ERNIE-Code: Pretraining Objectives
- ERNIE-Code: Model & Corpus
- Results
1. ERNIE-Code
- There are 2 pretraining tasks: One uses monolingual PL/NL data (unsupervised), while the other requires parallel NL-PL and NL-NL pairs (supervised).
The former helps the model learn intra-modal patterns from PL or NL alone, while the latter endows the model with cross-lingual/cross-modal alignment and zero-shot capabilities.
1.1. Task#1: Span-Corruption Language Modeling (SCLM)
- The denoising pretraining objective first corrupts input sequences by masking or adding noise; and then recovers the original inputs by forcing the model to predict corrupted spans, sentences, or documents.
The span-corruption denoising pre-training is extended to both PL and NL, which is referred to as span-corruption language modeling (SCLM).
- It corrupts 15% of the original NL/PL input tokens with a mean span length of 3 by replacing contiguous, randomly spaced spans of tokens with a single mask placeholder each, and then predicting the corrupted spans on the target side, as sketched in the code example below.
- Suppose we have a total of M monolingual NL and PL corpora {C_m}, m = 1, …, M. The SCLM pre-training objective is applied on both NL and PL data in a multi-tasking fashion:

$\mathcal{L}_{\mathrm{SCLM}}(\theta) = \sum_{m=1}^{M} \mathbb{E}_{x^{(m)} \sim \mathcal{C}_m}\Big[-\sum_{t}\log P_\theta\big(x^{(m)}_{\mathrm{mask},\,t} \,\big|\, x^{(m)}_{\backslash\mathrm{mask}},\, x^{(m)}_{\mathrm{mask},\,<t}\big)\Big]$

- where θ denotes the trainable parameters, $x^{(m)}_{\backslash\mathrm{mask}}$ and $x^{(m)}_{\mathrm{mask}}$ are the span-corrupted input and the corresponding target spans from monolingual corpus $\mathcal{C}_m$, respectively, and $x^{(m)}_{\mathrm{mask},\,<t}$ denotes the target tokens generated before the t-th time step.
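As a rough illustration of the SCLM data construction (not the authors' code: the sentinel-token names follow the T5 convention and the span-sampling procedure is simplified), the sketch below corrupts roughly 15% of the tokens in spans of mean length 3 and builds the encoder input and decoder target:

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3):
    """Simplified T5-style span corruption: replace random contiguous spans
    (~15% of tokens, mean span length 3) with one sentinel placeholder each,
    and collect the masked spans as the decoder target."""
    n_mask = max(1, round(len(tokens) * corruption_rate))
    n_spans = max(1, round(n_mask / mean_span_len))
    mask = [False] * len(tokens)
    placed, attempts = 0, 0
    while placed < n_spans and attempts < 100:
        attempts += 1
        start = random.randrange(len(tokens))
        end = min(start + mean_span_len, len(tokens))
        if any(mask[start:end]):
            continue  # keep spans non-overlapping
        mask[start:end] = [True] * (end - start)
        placed += 1

    corrupted, target, sid, i = [], [], 0, 0
    while i < len(tokens):
        if mask[i]:
            sentinel = f"<extra_id_{sid}>"
            corrupted.append(sentinel)              # single placeholder per span
            target.append(sentinel)
            while i < len(tokens) and mask[i]:
                target.append(tokens[i])            # corrupted span predicted on target side
                i += 1
            sid += 1
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, target

# One possible outcome on a tokenized Python snippet:
# span_corrupt("def add ( a , b ) : return a + b".split())
# -> (['def', 'add', '(', '<extra_id_0>', ')', ':', 'return', 'a', '+', 'b'],
#     ['<extra_id_0>', 'a', ',', 'b'])
```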
1.2. Task#2: Pivot-based Translation Language Modeling (PTLM)
- This work aims at narrowing the cross-modal, cross-lingual gap between multiple NLs and PLs.
With bilingual PL-NL and NL-NL corpora, the model jointly learns parallelism in dual directions with English as the pivot: for instance, Python↔English and English↔Russian.
Parallel source-target sentences are concatenated, and the model learns to reconstruct the corrupted target-language sentence. Instead of masking random tokens, the whole target sentence is corrupted (see the sketch below).
- Suppose we have N bilingual NL-NL and NL-PL parallel corpora {D_n}, n = 1, …, N. The PTLM training objective is formulated as:

$\mathcal{L}_{\mathrm{PTLM}}(\theta) = \sum_{n=1}^{N} \mathbb{E}_{(x^{(n)}_{\mathrm{source}},\, x^{(n)}_{\mathrm{target}}) \sim \mathcal{D}_n}\Big[-\sum_{t}\log P_\theta\big(x^{(n)}_{\mathrm{target},\,t} \,\big|\, x^{(n)}_{\mathrm{source}},\, x^{(n)}_{\mathrm{target},\,<t}\big)\Big]$

- where $x^{(n)}_{\mathrm{source}}$ and $x^{(n)}_{\mathrm{target}}$ denote the source and target sentences from bilingual corpus $\mathcal{D}_n$.
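As a minimal sketch of how one PTLM training instance could look (the sentinel token and the exact concatenation format are assumptions; the paper only states that parallel sentences are concatenated and the whole target sentence is corrupted):

```python
def make_ptlm_example(source_tokens, target_tokens, sentinel="<extra_id_0>"):
    """Concatenate the source sentence with a fully-masked target sentence on
    the encoder side; the decoder must reconstruct the entire target sentence."""
    encoder_input = source_tokens + [sentinel]    # whole target sentence corrupted
    decoder_target = [sentinel] + target_tokens
    return encoder_input, decoder_target

# Illustrative NL-PL pair (English description -> Python); training also uses
# NL-NL pairs with English as the pivot, e.g. English <-> Russian.
enc, dec = make_ptlm_example(
    "return the larger of two numbers".split(),
    "def larger ( a , b ) : return a if a > b else b".split(),
)
```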
1.3. Zero-Shot Prompting
PTLM is reformulated by prompting with a task prefix (see Figure 3), in which a task instruction "translate A to B: \n" is prepended to the input sentence, where A and B denote the source and target language, respectively.
- This prompt instruction tells the model which target language to translate into, resulting in decent zero-shot abilities.
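The prefix itself is straightforward to reproduce; a minimal sketch (the function name is illustrative, while the prefix string follows the text above):

```python
def add_task_prefix(source_text: str, src_lang: str, tgt_lang: str) -> str:
    """Prepend the 'translate A to B: \\n' instruction to the input sentence."""
    return f"translate {src_lang} to {tgt_lang}: \n{source_text}"

# e.g. zero-shot code summarization into Japanese:
prompt = add_task_prefix("def add(a, b): return a + b", "Python", "Japanese")
# -> "translate Python to Japanese: \ndef add(a, b): return a + b"
```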
2. ERNIE-Code: Model & Corpus
2.1. Model
- ERNIE-Code is built on the "T5.1.1" architecture, which improves upon T5 with gated nonlinearities.
- A set of special tokens representing whitespace indentation of different lengths in PL is added to the vocabulary.
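As a sketch of what such indentation tokens might look like with the Hugging Face tokenizer API (the token names, tab width, and checkpoint are assumptions, not the paper's exact vocabulary):

```python
from transformers import AutoTokenizer

# ERNIE-Code starts from an mT5/T5.1.1-style tokenizer; the checkpoint name
# here is only for illustration.
tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")

# Hypothetical special tokens for 1-4 levels of indentation.
indent_tokens = [f"<indent_{n}>" for n in range(1, 5)]
tokenizer.add_special_tokens({"additional_special_tokens": indent_tokens})

def encode_code_line(line, tab_width=4):
    """Replace leading whitespace with an indentation token, then tokenize."""
    stripped = line.lstrip(" ")
    level = (len(line) - len(stripped)) // tab_width
    prefix = f"<indent_{level}> " if level > 0 else ""
    return tokenizer.tokenize(prefix + stripped)

# encode_code_line("        return a + b")  # 8 leading spaces -> ['<indent_2>', ...]
```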
2.2. Code Corpus
- It covers monolingual data in 6 PLs (Go, Java, JavaScript, PHP, Python, and Ruby), together with NL-PL parallel data for the same 6 PLs, i.e., PL-NL query pairs.
2.3. Text Corpus
- Monolingual data from CC-100 containing 116 different NLs.
- Parallel data come from the OPUS website, covering 15 languages. The collected NL translation pairs include MultiUN, IIT Bombay, OPUS, WikiMatrix, etc.
- To alleviate the bias towards high-resource languages, the authors follow XLM to rebalance the data distribution (a sketch of this resampling follows below).
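The XLM-style rebalancing amounts to exponentiated smoothing of the per-language sampling distribution; a minimal sketch is given below (α = 0.5 follows the XLM paper; the exact value used by ERNIE-Code is an assumption here):

```python
def rebalance(corpus_sizes, alpha=0.5):
    """Multinomial resampling a la XLM: raise each language's share of the data
    to the power alpha and renormalize, which up-samples low-resource languages
    and down-samples high-resource ones.
    (alpha=0.5 follows XLM; the value used by ERNIE-Code is assumed here.)"""
    total = sum(corpus_sizes.values())
    p = {lang: n / total for lang, n in corpus_sizes.items()}
    z = sum(pi ** alpha for pi in p.values())
    return {lang: (pi ** alpha) / z for lang, pi in p.items()}

# e.g. rebalance({"en": 300e9, "sw": 2e9, "ur": 0.7e9})
# -> English's sampling share shrinks, Swahili's and Urdu's grow.
```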
3. Results
3.1. Multilingual Code Summarization (Code-to-Text)
ERNIE-Code outperforms all baselines pre-trained on either NL (mBART, mT5) or PL (PLBART, CodeT5). In particular, ERNIE-Code with an input length of 1024 exceeds its 512-length counterpart (1.12 vs. 0.88).
- Zero-Shot Prompting (Last Row): The proposed model demonstrates excellent zero-shot capability on Japanese and Russian summary generation, even outperforming the "translate-train" setting by 0.43 BLEU / 9.05 ROUGE-L overall.
- Some examples are shown above.
3.2. Multilingual Text-to-Code Generation (Text-to-Code)
ERNIE-Code outperforms all baselines on BLEU-4, ROUGE-L, and CodeBLEU scores, showing that the multilingual PL-NL pre-training can capture code syntax and semantics.
- Zero-Shot Prompting (Last Row): The proposed model can produce code fragments in a zero-shot manner, with higher CodeBLEU scores than the "translate-train" setting.
3.3. Documentation Translation (Text-to-Text)
ERNIE-Code surpasses mT5 and XLM-R in all 8 translation directions.
3.4. Program Repair (Code-to-Code)
On the "small" and "medium" tasks, ERNIE-Code achieves 80.10 and 91.20 BLEU, respectively, outperforming or matching previous SOTA results.
3.5. Ablation Study
Removing either the monolingual (SCLM) or the bilingual (PTLM) pre-training task degrades overall performance on all tasks.