Brief Review — ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages
Text-to-Code, Code-to-Text
ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages
ERNIE-Code, by Baidu,
2023 ACL (Sik-Ho Tsang @ Medium)
- Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa.
- ERNIE-Code is proposed, which is a unified pre-trained language model for 116 NLs and 6 PLs. Two methods are employed for universal cross-lingual pre-training: (1) Span-corruption language modeling that learns patterns from monolingual NL or PL; and (2) pivot-based translation language modeling that relies on parallel data of many NLs and PLs.
Outline
- ERNIE-Code: Pretraining Objectives
- ERNIE-Code: Model & Corpus
- Results
1. ERNIE-Code
- There are 2 pretraining tasks: One uses monolingual PL/NL data (unsupervised), while the other requires parallel NL-PL and NL-NL pairs (supervised).
The former helps the model learn intra-modal patterns from PL or NL alone, while the latter endows the model with cross-lingual/cross-modal alignment and zero-shot capabilities.
1.1. Task#1: Span-Corruption Language Modeling (SCLM)
- The denoising pretraining objective first corrupts input sequences by masking or adding noise; and then recovers the original inputs by forcing the model to predict corrupted spans, sentences, or documents.
The span-corruption denoising pre-training is extended to both PL and NL, which is referred to as span-corruption language modeling (SCLM).
- It corrupts 15% of the original NL/PL input tokens with a mean span length of 3 by replacing contiguous, randomly spaced spans of tokens with a single mask placeholder each, and then predicting the corrupted spans on the target side, as sketched in the code example below.
- Suppose we have a total of M monolingual NL and PL corpora {C_m}, m = 1, …, M. The SCLM pre-training objective is applied on both NL and PL data in a multi-tasking fashion:

$\mathcal{L}_{\mathrm{SCLM}}(\theta) = \sum_{m=1}^{M} \mathbb{E}_{x^{(m)} \sim \mathcal{C}_m}\Big[-\sum_{t}\log P_\theta\big(x^{(m)}_{\mathrm{mask},\,t} \,\big|\, x^{(m)}_{\backslash\mathrm{mask}},\, x^{(m)}_{\mathrm{mask},\,<t}\big)\Big]$

- where θ denotes the trainable parameters, $x^{(m)}_{\backslash\mathrm{mask}}$ and $x^{(m)}_{\mathrm{mask}}$ are the span-corrupted input and the corresponding target spans from monolingual corpus $\mathcal{C}_m$, respectively, and $x^{(m)}_{\mathrm{mask},\,<t}$ denotes the target tokens generated before the t-th time step.
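As a rough illustration of the SCLM data construction (not the authors' code: the sentinel-token names follow the T5 convention and the span-sampling procedure is simplified), the sketch below corrupts roughly 15% of the tokens in spans of mean length 3 and builds the encoder input and decoder target:

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3):
    """Simplified T5-style span corruption: replace random contiguous spans
    (~15% of tokens, mean span length 3) with one sentinel placeholder each,
    and collect the masked spans as the decoder target."""
    n_mask = max(1, round(len(tokens) * corruption_rate))
    n_spans = max(1, round(n_mask / mean_span_len))
    mask = [False] * len(tokens)
    placed, attempts = 0, 0
    while placed < n_spans and attempts < 100:
        attempts += 1
        start = random.randrange(len(tokens))
        end = min(start + mean_span_len, len(tokens))
        if any(mask[start:end]):
            continue  # keep spans non-overlapping
        mask[start:end] = [True] * (end - start)
        placed += 1

    corrupted, target, sid, i = [], [], 0, 0
    while i < len(tokens):
        if mask[i]:
            sentinel = f"<extra_id_{sid}>"
            corrupted.append(sentinel)              # single placeholder per span
            target.append(sentinel)
            while i < len(tokens) and mask[i]:
                target.append(tokens[i])            # corrupted span predicted on target side
                i += 1
            sid += 1
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, target

# One possible outcome on a tokenized Python snippet:
# span_corrupt("def add ( a , b ) : return a + b".split())
# -> (['def', 'add', '(', '<extra_id_0>', ')', ':', 'return', 'a', '+', 'b'],
#     ['<extra_id_0>', 'a', ',', 'b'])
```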
1.2. Task#2: Pivot-based Translation Language Modeling (PTLM)
- This work aims at narrowing the cross-modal, cross-lingual gap between multiple NLs and PLs.
With bilingual PL-NL and NL-NL corpora, the model jointly learns parallelism in dual directions with English as the pivot: for instance, Python↔English and English↔Russian.
Parallel source-target sentences are concatenated, and the model learns to reconstruct the corrupted target-language sentence. Instead of masking random tokens, the whole target sentence is corrupted (see the sketch below).
- Suppose we have N bilingual NL-NL and NL-PL parallel corpora {D_n}, n = 1, …, N. The PTLM training objective is formulated as:

$\mathcal{L}_{\mathrm{PTLM}}(\theta) = \sum_{n=1}^{N} \mathbb{E}_{(x^{(n)}_{\mathrm{source}},\, x^{(n)}_{\mathrm{target}}) \sim \mathcal{D}_n}\Big[-\sum_{t}\log P_\theta\big(x^{(n)}_{\mathrm{target},\,t} \,\big|\, x^{(n)}_{\mathrm{source}},\, x^{(n)}_{\mathrm{target},\,<t}\big)\Big]$

- where $x^{(n)}_{\mathrm{source}}$ and $x^{(n)}_{\mathrm{target}}$ denote the source and target sentences from bilingual corpus $\mathcal{D}_n$.
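As a minimal sketch of how one PTLM training instance could look (the sentinel token and the exact concatenation format are assumptions; the paper only states that parallel sentences are concatenated and the whole target sentence is corrupted):

```python
def make_ptlm_example(source_tokens, target_tokens, sentinel="<extra_id_0>"):
    """Concatenate the source sentence with a fully-masked target sentence on
    the encoder side; the decoder must reconstruct the entire target sentence."""
    encoder_input = source_tokens + [sentinel]    # whole target sentence corrupted
    decoder_target = [sentinel] + target_tokens
    return encoder_input, decoder_target

# Illustrative NL-PL pair (English description -> Python); training also uses
# NL-NL pairs with English as the pivot, e.g. English <-> Russian.
enc, dec = make_ptlm_example(
    "return the larger of two numbers".split(),
    "def larger ( a , b ) : return a if a > b else b".split(),
)
```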
1.3. Zero-Shot Prompting
PTLM is reformulated by prompting with a task prefix (see Figure 3), in which a task instruction "translate A to B: \n" is prepended to the input sentence, where A and B denote the source and target language, respectively.
- This prompt instruction tells the model which target language to translate into, resulting in decent zero-shot abilities.
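The prefix itself is straightforward to reproduce; a minimal sketch (the function name is illustrative, while the prefix string follows the text above):

```python
def add_task_prefix(source_text: str, src_lang: str, tgt_lang: str) -> str:
    """Prepend the 'translate A to B: \\n' instruction to the input sentence."""
    return f"translate {src_lang} to {tgt_lang}: \n{source_text}"

# e.g. zero-shot code summarization into Japanese:
prompt = add_task_prefix("def add(a, b): return a + b", "Python", "Japanese")
# -> "translate Python to Japanese: \ndef add(a, b): return a + b"
```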
2. ERNIE-Code: Model & Corpus
2.1. Model
- ERNIE-Code is built on the "T5.1.1" architecture, which improves upon T5 with gated nonlinearities.
- A set of special tokens representing whitespace indentation of different lengths in PL is added to the vocabulary.
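As a sketch of what such indentation tokens might look like with the Hugging Face tokenizer API (the token names, tab width, and checkpoint are assumptions, not the paper's exact vocabulary):

```python
from transformers import AutoTokenizer

# ERNIE-Code starts from an mT5/T5.1.1-style tokenizer; the checkpoint name
# here is only for illustration.
tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")

# Hypothetical special tokens for 1-4 levels of indentation.
indent_tokens = [f"<indent_{n}>" for n in range(1, 5)]
tokenizer.add_special_tokens({"additional_special_tokens": indent_tokens})

def encode_code_line(line, tab_width=4):
    """Replace leading whitespace with an indentation token, then tokenize."""
    stripped = line.lstrip(" ")
    level = (len(line) - len(stripped)) // tab_width
    prefix = f"<indent_{level}> " if level > 0 else ""
    return tokenizer.tokenize(prefix + stripped)

# encode_code_line("        return a + b")  # 8 leading spaces -> ['<indent_2>', ...]
```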
2.2. Code Corpus
- It covers monolingual data in 6 PLs (Go, Java, JavaScript, PHP, Python, and Ruby), together with NL-PL parallel data for the same 6 PLs, i.e., PL-NL query pairs.
2.3. Text Corpus
- Monolingual data from CC-100 containing 116 different NLs.
- Parallel data come from the OPUS website, covering 15 languages. The collected NL translation pairs include MultiUN, IIT Bombay, OPUS, WikiMatrix, etc.
- To alleviate the bias towards high-resource languages, the authors follow XLM to rebalance the data distribution (a sketch of this resampling follows below).
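The XLM-style rebalancing amounts to exponentiated smoothing of the per-language sampling distribution; a minimal sketch is given below (α = 0.5 follows the XLM paper; the exact value used by ERNIE-Code is an assumption here):

```python
def rebalance(corpus_sizes, alpha=0.5):
    """Multinomial resampling a la XLM: raise each language's share of the data
    to the power alpha and renormalize, which up-samples low-resource languages
    and down-samples high-resource ones.
    (alpha=0.5 follows XLM; the value used by ERNIE-Code is assumed here.)"""
    total = sum(corpus_sizes.values())
    p = {lang: n / total for lang, n in corpus_sizes.items()}
    z = sum(pi ** alpha for pi in p.values())
    return {lang: (pi ** alpha) / z for lang, pi in p.items()}

# e.g. rebalance({"en": 300e9, "sw": 2e9, "ur": 0.7e9})
# -> English's sampling share shrinks, Swahili's and Urdu's grow.
```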
3. Results
3.1. Multilingual Code Summarization (Code-to-Text)
ERNIE-Code outperforms all baselines pre-trained on either NL (mBART, mT5) or PL (PLBART, CodeT5). In particular, ERNIE-Code with an input length of 1024 exceeds its 512-length counterpart (1.12 vs. 0.88).
- Zero-Shot Prompting (Last Row): The proposed model demonstrates excellent zero-shot capability on Japanese and Russian summary generation, even outperforming the "translate-train" setting by 0.43 BLEU / 9.05 ROUGE-L overall.
- Some examples are shown above.
3.2. Multilingual Text-to-Code Generation (Text-to-Code)
ERNIE-Code outperforms all baselines on BLEU-4, ROUGE-L, and CodeBLEU scores, showing that the multilingual PL-NL pre-training can capture code syntax and semantics.
- Zero-Shot Prompting (Last Row): The proposed model can produce code fragments in a zero-shot manner, with higher CodeBLEU scores than the "translate-train" setting.
3.3. Documentation Translation (Text-to-Text)
ERNIE-Code surpasses mT5 and XLM-R in all 8 translation directions.
3.4. Program Repair (Code-to-Code)
On the "small" and "medium" tasks, ERNIE-Code achieves 80.10 and 91.20 BLEU, respectively, outperforming or matching previous SOTA results.
3.5. Ablation Study
Removing either the monolingual (SCLM) or the bilingual (PTLM) pre-training task degrades overall performance on all tasks.