Brief Review — BERTScore: Evaluating Text Generation with BERT

BERTScore, a Similarity-Based Evaluation Metric via BERT

Sik-Ho Tsang
4 min readMar 5, 2024

BERTScore: Evaluating Text Generation with BERT
by Cornell University and ASAPP Inc.
2020 ICLR, Over 3400 Citations (Sik-Ho Tsang @ Medium)


  • BERTScore is proposed, which computes a similarity score between each token in the candidate sentence and each token in the reference sentence using contextual embeddings (e.g., BERT).


  1. BERTScore
  2. Results

1. BERTScore


1.1. Precision, Recall, F1

  • Given a reference sentence x = ⟨x1, …, xk⟩ and a candidate sentence x̂ = ⟨x̂1, …, x̂l⟩, contextual embeddings are used to represent the tokens, and matching is computed using cosine similarity between token embeddings xi and x̂j.
  • Pre-normalized vectors are used, which reduces this calculation to the inner product xi⊤x̂j.
  • BERTScore can be optionally weighted with inverse document frequency (idf) scores.

The complete score matches each token in x to a token in x̂ to compute recall, RBERT, and each token in x̂ to a token in x to compute precision, PBERT. Greedy matching is used to maximize the matching similarity score, where each token is matched to the most similar token in the other sentence. Precision and recall are combined to compute an F1 measure, FBERT:

RBERT = (1/|x|) Σ_{xi∈x} max_{x̂j∈x̂} xi⊤x̂j,  PBERT = (1/|x̂|) Σ_{x̂j∈x̂} max_{xi∈x} xi⊤x̂j,  FBERT = 2 · PBERT · RBERT / (PBERT + RBERT).
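The greedy matching above can be sketched in a few lines of NumPy. The toy 2-D embeddings below are stand-ins for real BERT token embeddings (which would be high-dimensional); rows are assumed pre-normalized, so cosine similarity reduces to an inner product as in the paper:

```python
import numpy as np

def bertscore(ref_emb, cand_emb):
    """Toy BERTScore from pre-normalized token embeddings.

    ref_emb:  (k, d) array, one row per reference token
    cand_emb: (l, d) array, one row per candidate token
    Rows are assumed L2-normalized, so cosine similarity
    reduces to an inner product.
    """
    sim = ref_emb @ cand_emb.T          # (k, l) pairwise cosine similarities
    recall = sim.max(axis=1).mean()     # each reference token -> best candidate match
    precision = sim.max(axis=0).mean()  # each candidate token -> best reference match
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical unit-norm 2-D "embeddings" standing in for BERT outputs.
ref = np.array([[1.0, 0.0], [0.0, 1.0]])
cand = np.array([[1.0, 0.0], [0.0, 1.0], [0.70710678, 0.70710678]])
P, R, F = bertscore(ref, cand)
```

Here every reference token has a perfect match (R = 1), while the extra candidate token only partially matches, so precision and F1 drop below 1.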

  • Previous studies demonstrated that rare words can be more indicative of sentence similarity than common words.

1.2. Weighted With Inverse Document Frequency (idf)

Inverse document frequency (idf) scores are computed from the test corpus. Given M reference sentences {x^(i)}, i = 1, …, M, the idf score of a word-piece token w is idf(w) = −log (1/M) Σ_{i=1}^{M} I[w ∈ x^(i)], where I[·] is an indicator function.

  • For example, recall with idf weighting is: RBERT = Σ_{xi∈x} idf(xi) max_{x̂j∈x̂} xi⊤x̂j / Σ_{xi∈x} idf(xi).
  • Plus-one smoothing is applied to handle unknown word pieces.
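The idf computation with plus-one smoothing can be sketched as follows. The exact smoothing form is an assumption here (document counts smoothed by +1); the corpus is a made-up toy example:

```python
import math
from collections import Counter

def idf_scores(reference_sents):
    """idf over a test corpus of tokenized reference sentences.

    A sketch assuming idf(w) = -log(df(w) / M) with plus-one
    smoothing on document counts, so unseen word pieces still
    get a finite (and large) score.
    """
    M = len(reference_sents)
    df = Counter()
    for sent in reference_sents:
        df.update(set(sent))            # count each token once per sentence
    # Counter returns 0 for unseen tokens, so smoothing handles them too.
    return lambda w: -math.log((df[w] + 1) / (M + 1))

refs = [["the", "cat", "sat"], ["the", "dog", "ran"]]
idf = idf_scores(refs)
```

A token appearing in every reference ("the") gets idf 0, a rarer token ("cat") gets a positive weight, and an unseen word piece ("zebra") gets the largest weight, matching the intuition that rare words are more indicative of similarity.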

1.3. Baseline Rescaling

  • Although the range of cosine similarity is between −1 and 1, in practice scores are observed in a much more limited range.

This is addressed by rescaling BERTScore with respect to its empirical lower bound b as a baseline.

  • b is computed using Common Crawl monolingual datasets.
  • For each language and contextual embedding model, 1M candidate-reference pairs are created by grouping two random sentences. Because of the random pairing and the corpus diversity, each pair has very low lexical and semantic overlap.

b is computed by averaging BERTScore computed on these sentence pairs. Equipped with baseline b, BERTScore is rescaled linearly.

  • For example, the rescaled value R̂BERT of RBERT is: R̂BERT = (RBERT − b) / (1 − b).
  • This method does not affect the ranking ability or human correlation of BERTScore, and is intended solely to increase score readability.
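The rescaling is a simple linear map from [b, 1] to [0, 1]; a minimal sketch, where the baseline value 0.85 is hypothetical (the real b is estimated per language and model from the 1M random Common Crawl pairs):

```python
def rescale(score, baseline):
    """Linearly map [b, 1] to [0, 1]: a score at the empirical
    lower bound b becomes 0, a perfect match stays 1."""
    return (score - baseline) / (1 - baseline)

# Hypothetical baseline b = 0.85 for illustration only.
rescaled = rescale(0.925, 0.85)   # -> 0.5
```

Since the map is monotone, rankings between systems are unchanged; only the spread of scores becomes more interpretable.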

1.4. Model

  • The 24-layer RoBERTa (large) model is used for English tasks, the 12-layer BERT (Chinese) model is used for Chinese tasks, and the 12-layer cased multilingual BERT model is used for other languages.

2. Results

2.1. Machine Translation

  • Tables 1–3 show system-level correlation to human judgements, correlations on hybrid systems, and model selection performance.

BERTScore is consistently a top performer.

  • Table 4 shows segment-level correlations.

BERTScore exhibits significantly higher performance compared to the other metrics.

2.2. Image Captioning & Adversarial Paraphrase Classification

In Table 5, for the COCO Captioning Challenge, BERTScore outperforms all task-agnostic baselines by large margins.

In Table 6, the performance of BERTScore drops only slightly, showing more robustness than the other metrics.


