Brief Review — MMLU: Measuring Massive Multitask Language Understanding

MMLU Dataset With 57 Tasks

Sik-Ho Tsang
4 min read · Nov 30, 2023
MMLU Leaderboard (few-shot GPT-4 is the best so far, as of 30/11/2023)

Measuring Massive Multitask Language Understanding
MMLU, by UC Berkeley, Columbia University, UChicago, UIUC
2021 ICLR, Over 400 Citations (Sik-Ho Tsang @ Medium)

Language Model (LM)
2007 … 2022 [GLM] [Switch Transformers] [WideNet] [MoEBERT] [X-MoE]
==== My Other Paper Readings Are Also Over Here ====

  • The Massive Multitask Language Understanding (MMLU) test covers 57 tasks, including elementary mathematics, US history, computer science, law, and more. To attain high accuracy, models must possess extensive world knowledge and problem-solving ability.
  • While performance on GLUE and SuperGLUE is saturated, the best models still need substantial improvements before they can reach expert-level accuracy on MMLU.
  • Some models still have near-random accuracy on some socially important subjects such as morality and law.
  • (This dataset is now used for evaluating many LLMs.)

Outline

  1. Massive Multitask Language Understanding (MMLU)
  2. Results

1. Massive Multitask Language Understanding (MMLU)

1.1. MMLU Statistics

  • There are 57 tasks in total, with 15,908 questions collected and split into a few-shot development set, a validation set, and a test set.
  • The few-shot development set has 5 questions per subject, the validation set (1,540 questions) may be used for selecting hyperparameters, and the test set has 14,079 questions (a loading sketch follows this list).
  • Each subject contains at least 100 test examples, which is longer than most exams designed to assess people.
  • Expert-level accuracy is estimated to be approximately 89.8%.
  • There are four main categories: Humanities, Social Science, STEM, and Others.
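
To make the split sizes above concrete, here is a minimal loading sketch using the Hugging Face `datasets` library. The `cais/mmlu` Hub name, the `all` config, and the field names are assumptions about the commonly used community mirror, not something specified in the paper itself.

```python
# Minimal sketch: inspect the MMLU splits via the Hugging Face Hub.
# Assumption: the "cais/mmlu" mirror with the "all" config.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all")

# Expected sizes per the paper: dev = 5 x 57 subjects = 285,
# validation = 1,540, test = 14,079.
for split in ("dev", "validation", "test"):
    print(split, len(mmlu[split]))

# Each example is a 4-way multiple-choice question.
example = mmlu["test"][0]
print(example["question"])
print(example["choices"])  # list of 4 answer options
print(example["answer"])   # index (0-3) of the correct option
```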

1.2. Humanities

  • Branches of the humanities include law, philosophy, history, and so on.
  • Legal understanding is necessary for understanding and following rules and regulations.
  • For philosophy, the questions cover concepts like logical fallacies, formal logic, and famous philosophical arguments. It also covers moral scenarios.
  • Finally, the history questions cover a wide range of time periods and geographical locations, including prehistory and other advanced subjects.

1.3. Social Science

  • Social science includes branches of knowledge that examine human behavior and society.
  • The economics questions include microeconomics, macroeconomics, and econometrics, and cover different types of problems, including questions that require a mixture of world knowledge, qualitative reasoning, or quantitative reasoning.
  • Social science also includes psychology, a field that may be especially important for attaining a nuanced understanding of humans.

1.4. Science, Technology, Engineering, and Mathematics (STEM)

  • STEM subjects include physics, computer science, mathematics, and more.
  • Conceptual physics tests understanding of physics principles.
  • College mathematics questions, like those found on the GRE mathematics subject test, often require chains of reasoning and abstract knowledge.

1.5. Others

  • This section includes the Professional Medicine task, which has difficult questions that take humans many years of study to master.
  • This section also contains business topics like finance, accounting, and marketing, as well as knowledge of global facts.

2. Results

Figure 1. GPT-3 performance.
  • Figure 1a: An example of GPT-3 few-shot prompting and inference (a minimal prompt-construction sketch follows this list).
  • Figure 1b & Table 1: Average weighted accuracy for each model on all four broad disciplines. The 3 smaller GPT-3 models have near-random accuracy (around 25%).
  • In contrast, the X-Large 175-billion-parameter GPT-3 model performs substantially better than random, with an accuracy of 43.9%.
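
To make Figure 1a concrete, below is a minimal sketch of how a 5-shot MMLU prompt can be assembled: 5 answered dev-set examples for the subject, followed by the unanswered test question. The header wording matches the paper's released evaluation code, but treat the exact formatting details here as assumptions.

```python
# Minimal sketch: build a 5-shot MMLU prompt (paper-style formatting).
LETTERS = ["A", "B", "C", "D"]

def format_example(question, choices, answer=None):
    """Render one question; append the answer letter for few-shot
    examples, or leave "Answer:" open for the model to complete."""
    lines = [question]
    for letter, choice in zip(LETTERS, choices):
        lines.append(f"{letter}. {choice}")
    lines.append(f"Answer: {LETTERS[answer]}" if answer is not None
                 else "Answer:")
    return "\n".join(lines)

def build_prompt(subject, dev_examples, test_example):
    header = (f"The following are multiple choice questions "
              f"(with answers) about {subject}.\n\n")
    shots = "\n\n".join(
        format_example(ex["question"], ex["choices"], ex["answer"])
        for ex in dev_examples[:5]
    )
    query = format_example(test_example["question"], test_example["choices"])
    return header + shots + "\n\n" + query
```

The model's predicted letter (A-D) for the final "Answer:" is then compared against the gold answer to score accuracy.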

This shows that MMLU is a challenging evaluation set.
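
One natural reading of the "average weighted accuracy" in Figure 1b is per-subject accuracies combined with weights proportional to each subject's number of test questions, i.e. total correct answers divided by total questions. A minimal sketch of that reading, with made-up numbers for illustration:

```python
# Minimal sketch: question-count-weighted average accuracy across subjects.
def weighted_accuracy(per_subject):
    """per_subject maps subject -> (num_correct, num_questions)."""
    total_correct = sum(c for c, _ in per_subject.values())
    total_questions = sum(n for _, n in per_subject.values())
    return total_correct / total_questions

# Hypothetical per-subject results (not real paper numbers):
results = {
    "us_history": (70, 100),
    "college_mathematics": (30, 100),
    "professional_law": (500, 1534),
}
print(f"{weighted_accuracy(results):.1%}")  # -> 34.6%
```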
