Review — Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

C4 English Dataset for NLP

Sik-Ho Tsang
Nov 5, 2023

Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
C4, by University of Washington, Hugging Face, Allen Institute for Artificial Intelligence, Queer in AI
2021 EMNLP, Over 170 Citations (Sik-Ho Tsang @ Medium)

  • The Colossal Clean Crawled Corpus (C4), used to train T5, is a dataset created by applying a set of filters to a single snapshot of Common Crawl.
  • This paper documents and analyzes that dataset.

Outline

  1. C4: Dataset Filtering
  2. C4: Dataset Analysis
  3. Discussion & Recommendations

1. C4: Dataset Filtering

Statistics of 3 Corpora C4.EN, C4.EN.NOCLEAN, and C4.EN.NOBLOCKLIST

The English Colossal Clean Crawled Corpus (C4) is created by taking the April 2019 snapshot of Common Crawl and applying a number of filters intended to remove text that is not natural English:

  1. Discard lines that don’t end in a terminal punctuation mark;
  2. Discard lines that have fewer than 3 words;
  3. Discard documents with fewer than 5 sentences;
  4. Discard documents that contain Lorem ipsum placeholder text;
  5. Remove documents that contain any word on the “List of Dirty, Naughty, Obscene, or Otherwise Bad Words”;
  6. Use langdetect to remove documents that weren’t classified as English.
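The filtering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual C4 pipeline: the function name `clean_document`, the one-word `BLOCKLIST`, the regex sentence splitter, and the order of the checks are all hypothetical, and language detection is passed in as a callable (`is_english`) rather than calling langdetect directly.

```python
import re
from typing import Callable, Optional

# Hypothetical stand-ins: the real pipeline uses the public "List of Dirty,
# Naughty, Obscene, or Otherwise Bad Words" and the langdetect library.
BLOCKLIST = {"badword"}
TERMINAL_PUNCT = (".", "!", "?", '"')


def clean_document(
    text: str,
    is_english: Callable[[str], bool] = lambda t: True,
    use_blocklist: bool = True,
) -> Optional[str]:
    """Return the cleaned text, or None if the whole document is dropped."""
    # Line-level filters: keep only lines that end in terminal punctuation
    # and contain at least 3 whitespace-separated words.
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if line.endswith(TERMINAL_PUNCT) and len(line.split()) >= 3:
            kept.append(line)
    cleaned = "\n".join(kept)

    # Document-level filters (the order here is an assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", cleaned) if s]
    if len(sentences) < 5:
        return None
    if "lorem ipsum" in cleaned.lower():
        return None
    if use_blocklist:
        words = set(re.findall(r"[a-z']+", cleaned.lower()))
        if words & BLOCKLIST:
            return None
    if not is_english(cleaned):
        return None
    return cleaned


page = "\n".join([
    "This is sentence one.",
    "This is sentence two.",
    "Here is number three.",
    "Here is number four.",
    "And this is five.",
    "Menu",  # dropped: no terminal punctuation, fewer than 3 words
])
result = clean_document(page)
```

Calling `clean_document(page, use_blocklist=False)` skips the blocklist check, which roughly mirrors how a no-blocklist variant of the corpus would differ from the cleaned one.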

C4.EN: The “cleaned” version of C4.

C4.EN.NOCLEAN: The “uncleaned” version, i.e. the snapshot of Common Crawl identified as English, with no other filters applied.

C4.EN.NOBLOCKLIST: The same as C4.EN, but without filtering out documents that contain tokens from the blocklist of words.

2. C4: Dataset Analysis

2.1. Corpus-Level Statistics

Domains and Websites

Left: Unsurprisingly, popular top-level domains such as .com, .org, and .net are well represented.

Right: Surprisingly, the cleaned corpus contains substantial amounts of patent text documents.
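A domain and website breakdown like the one described above could be computed along these lines. This is a minimal sketch: the `records` list is made-up illustrative data, the helper names are hypothetical, and whitespace-separated words stand in for whatever tokenizer the paper used.

```python
from collections import Counter
from urllib.parse import urlparse


def top_level_domain(url: str) -> str:
    """Return the last label of the hostname, e.g. 'org'."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""


def website(url: str) -> str:
    """Return the hostname without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


# Illustrative (url, text) pairs, not real C4 records.
records = [
    ("https://patents.google.com/patent/example", "A method for filtering text."),
    ("https://en.wikipedia.org/wiki/Common_Crawl", "Common Crawl is a nonprofit."),
    ("https://www.example.net/post", "Hello world."),
]

tokens_by_tld = Counter()
tokens_by_site = Counter()
for url, text in records:
    n_tokens = len(text.split())  # whitespace tokens as a simple proxy
    tokens_by_tld[top_level_domain(url)] += n_tokens
    tokens_by_site[website(url)] += n_tokens

print(tokens_by_tld.most_common(3))
print(tokens_by_site.most_common(3))
```

Weighting by token count rather than by document count keeps a few very long pages from being hidden behind many short ones.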

Date URLs

An estimated 92% of pages were written in the last decade (2011–2019), and the distribution of dates is long-tailed.

  • Also, 51.3% of pages are hosted in the United States. The countries with the next four largest estimated English-speaking populations (India, Pakistan, Nigeria, and the Philippines) have only 3.4%, 0.06%, 0.03%, and 0.1% as many URLs as the United States, respectively.

2.2. What is in the Text?

  • Web-crawled data will increasingly contain data that was not written by humans.
  • Many documents come from OCR, which is imperfect and therefore yields text whose distribution differs from natural English.
  • Benchmark data contamination measures the extent to which training or test data from downstream NLP tasks appear in the pretraining corpus.
  • The percentage of inputs found in C4.EN varies widely, from less than 2% to over 50%. Interestingly, both the smallest and largest contamination proportions come from QNLI (built from Wikipedia).

Although train set contamination is generally not problematic for classification tasks if it does not include labels, it could be misleading in few-shot and zero-shot learning.
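The contamination check described above can be approximated with a simple normalized-match test. This is a rough sketch, not the paper's exact procedure: the function names are hypothetical, normalization here means lowercasing and stripping punctuation, and a benchmark input counts as "found" if it occurs as a substring of any corpus document.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def contamination_rate(benchmark_inputs: List[str], corpus_docs: List[str]) -> float:
    """Fraction of benchmark inputs found verbatim (after normalization)."""
    if not benchmark_inputs:
        return 0.0
    normalized_docs = [normalize(d) for d in corpus_docs]
    hits = 0
    for example in benchmark_inputs:
        key = normalize(example)
        # Skip empty keys, which would trivially match every document.
        if key and any(key in doc for doc in normalized_docs):
            hits += 1
    return hits / len(benchmark_inputs)


# Toy data for illustration only.
corpus = ["The Eiffel Tower is in Paris, France. It opened in 1889."]
inputs = ["the eiffel tower is in paris france", "Where is the Louvre?"]
rate = contamination_rate(inputs, corpus)
```

On a corpus the size of C4, the linear scan over documents would be far too slow; an index over hashed n-grams would be a more realistic design, at the cost of more code.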

  • Bias: for example, “Jewish” has a significantly higher percentage of positive-sentiment tokens (73.2% of 3.4M tokens) than “Arab” does (65.7% of 1.2M tokens).
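To make the percentages above concrete, a measurement of this kind can be sketched as the share of positive tokens among sentiment-bearing tokens that co-occur with a term. Everything below is an assumption for illustration: the tiny `POSITIVE` and `NEGATIVE` lexicons, the placeholder terms "alpha" and "beta", and the choice to define co-occurrence at document level.

```python
import re
from typing import List, Optional

# Toy sentiment lexicon; a real analysis would use a published lexicon.
POSITIVE = {"good", "great", "kind"}
NEGATIVE = {"bad", "violent", "poor"}


def positive_share(docs: List[str], term: str) -> Optional[float]:
    """Share of positive tokens among sentiment tokens in docs mentioning term."""
    pos = neg = 0
    for doc in docs:
        tokens = re.findall(r"[a-z]+", doc.lower())
        if term.lower() not in tokens:
            continue  # only documents that mention the term are counted
        pos += sum(tok in POSITIVE for tok in tokens)
        neg += sum(tok in NEGATIVE for tok in tokens)
    total = pos + neg
    return pos / total if total else None


docs = [
    "Alpha people are kind and great.",
    "The alpha team had a bad day.",
    "Beta is good.",
]
alpha_share = positive_share(docs, "alpha")  # 2 positive, 1 negative
beta_share = positive_share(docs, "beta")    # 1 positive, 0 negative
```

A gap between two such shares only shows that the surrounding text differs in tone; it does not by itself say why.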

2.3. What is Excluded from the Corpus?

Documents that contain any word from a blocklist of “bad” words are excluded, with the intent of removing “offensive language”, i.e., hateful, toxic, obscene, sexual, or lewd content.

  • African American English and Hispanic-aligned English are disproportionately affected by the blocklist filtering: they are removed at substantially higher rates (42% and 32%, respectively) than White-aligned English (WAE) and other English (6.2% and 7.2%, respectively).
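Removal rates like those above come down to counting, per group, how many documents trip the blocklist. The sketch below assumes each document already carries a group label (how those labels are obtained is out of scope here) and uses a hypothetical one-word blocklist and made-up documents.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical blocklist; the real list is much longer.
BLOCKLIST = {"badword"}


def removal_rates(
    labeled_docs: Iterable[Tuple[str, str]],
    blocklist: Set[str] = BLOCKLIST,
) -> Dict[str, float]:
    """Map each group label to the fraction of its documents removed."""
    total = defaultdict(int)
    removed = defaultdict(int)
    for group, text in labeled_docs:
        total[group] += 1
        words = set(re.findall(r"[a-z']+", text.lower()))
        if words & blocklist:  # any blocklisted word removes the document
            removed[group] += 1
    return {group: removed[group] / total[group] for group in total}


# Illustrative (group, text) pairs, not real data.
labeled = [
    ("group_a", "This one contains badword somewhere."),
    ("group_a", "This one is clean."),
    ("group_b", "Nothing to see here."),
    ("group_b", "Still nothing to see."),
]
rates = removal_rates(labeled)
```

Because a single matching word removes the whole document, groups whose everyday vocabulary overlaps with the blocklist lose far more text than the match count alone would suggest.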

3. Discussion & Recommendations

  • The need to report website metadata.
  • The need to examine benchmark contamination.
  • The issue of social biases and representational harms.
  • The issue of excluded voices and identities.
  • Personally identifiable information and copyrighted text also exist within C4.EN, but the authors leave quantifying or removing such text to future work.
  • (Recently, there has been much discussion of copyright issues around ChatGPT and other LLMs, since they are trained on copyrighted text.)

Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.