DataPerf: Benchmarks for Data-Centric AI Development

DataPerf Improves Data Rather Than Models

Sik-Ho Tsang
7 min read · Jun 6, 2023
ML-benchmark saturation relative to human performance (black line)

DataPerf: Benchmarks for Data-Centric AI Development,
DataPerf, by Harvard University, ETH Zurich, Coactive.AI, Landing AI, DeepLearning.AI, Hugging Face, MLCommons, Meta, Google, Stanford University, San Diego Supercomputer Center, UC San Diego, Carnegie Mellon University, Cleanlab, TU Eindhoven, Institute for Human and Machine Cognition,
2022 ICML Workshop (Sik-Ho Tsang @ Medium)

Data-Centric AI (DCAI)
2021 [SimSiam+AL] [BYOL+LP] 2022 [Small is the New Big]

  • Models are attaining perfect or “human-level” performance, as shown above. This saturation raises two questions: First, is ML research making real progress on the underlying capabilities, or is it just overfitting to the benchmark? Second, how should benchmarks evolve to push the frontier of ML research?
  • DataPerf, a benchmark package, is presented for evaluating ML datasets and the algorithms that work on datasets.


  1. Data Centric Operations & DataPerf’s Goals
  2. DataPerf
  3. Competitions, Challenges and Leaderboards

1. Data Centric Operations & DataPerf’s Goals

Data Centric Operations

1.1. Data Centric Operations

  • Today’s complex data-centric development pipelines, as shown above, can be the main bottleneck in ML development.
  • A user often relies on a collection of data-centric operations to improve data quality and repeated data-centric iterations to refine these operations strategically, given the errors a model makes.

1.2. DataPerf’s Goals

DataPerf’s goal is to capture the primary stages of such a data-centric pipeline to improve ML data quality. Benchmark examples include data debugging, data valuation, training- and test-set creation, and selection algorithms covering a range of ML applications.

DataPerf is a scientific instrument to systematically measure the quality of training and test datasets.

This paper defines the DataPerf benchmark suite, which is a collection of tasks, metrics and rules.

2. DataPerf

DataPerf Design Overview

While most benchmarks focus on model development, DataPerf benchmark tasks cover the data side: (1) Training set creation, (2) Test set creation, (3) Data selection, (4) Data debugging, (5) Data valuation, and (6) Slice discovery.

DataPerf Benchmark Types
  • The table above summarizes the benchmark types. Each type is described briefly below.

2.1. Training Set Creation

Generating data, augmenting data and other data-centric development techniques can transform small datasets into valuable training sets, but finding the right combination of methods can be painstaking and error prone.

  • The challenge is to create a pipeline that expands a limited dataset into one that represents the real world. This type of benchmark aims to measure a novel training dataset by training various models and measuring the resulting accuracy.
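The evaluation idea above (train several models on the submitted training set, then measure accuracy) can be sketched as follows. This is a minimal illustration, not DataPerf's harness: the two toy learners, the 1-D data, and the names `score_training_set`, `sparse_submission` and `richer_submission` are all assumptions made for the example.

```python
def nearest_centroid(train):
    # Fit one centroid per label; predict the label of the closest centroid.
    sums = {}
    for x, y in train:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    centroids = {y: total / count for y, (total, count) in sums.items()}
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))


def one_nearest_neighbor(train):
    # Predict the label of the single closest training point.
    return lambda x: min(train, key=lambda point: abs(x - point[0]))[1]


def score_training_set(train, test, learners=(nearest_centroid, one_nearest_neighbor)):
    # Train every learner on the submitted set, then average their test accuracy.
    accuracies = []
    for fit in learners:
        predict = fit(train)
        correct = sum(predict(x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


# Toy 1-D data (hypothetical): label "a" below 0.5, label "b" above.
held_out_test = [(0.1, "a"), (0.4, "a"), (0.6, "b"), (0.9, "b")]
sparse_submission = [(0.0, "a"), (0.45, "b")]
richer_submission = [(0.0, "a"), (0.3, "a"), (0.7, "b"), (1.0, "b")]

sparse_score = score_training_set(sparse_submission, held_out_test)
richer_score = score_training_set(richer_submission, held_out_test)
print(sparse_score, richer_score)
```

Keeping the learners and the test set fixed means that any change in the score comes from the training data alone, which is the point of this benchmark type.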

2.2. Test Set Creation

  • Conceptually, a Test Dataset benchmark measures a novel test dataset, or adversarial test data, by evaluating if it is (1) labeled incorrectly by a variety of models, (2) labeled correctly by humans, and (3) novel relative to other existing test data.
  • The purpose of this type of benchmark is to foster innovation in the way we sample data for test sets and to discover how data properties influence ML performance with respect to accuracy, reliability, fairness, diversity and reproducibility.
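The three acceptance criteria above can be sketched as a simple filter. This is only an illustration under stated assumptions: the models are toy keyword rules, novelty is reduced to an exact-match check against already known inputs, and `accept_adversarial`, `min_fooled` and the example data are hypothetical.

```python
def accept_adversarial(candidates, models, seen_inputs, min_fooled=0.5):
    # Keep candidates that (1) fool enough models, (2) have a human majority
    # behind their label, and (3) do not already appear in existing test data.
    accepted = []
    for example in candidates:
        label = example["human_label"]
        wrong = sum(model(example["input"]) != label for model in models)
        fooled_fraction = wrong / len(models)
        votes = example["human_votes"]
        humans_agree = votes.count(label) > len(votes) / 2
        novel = example["input"] not in seen_inputs
        if fooled_fraction >= min_fooled and humans_agree and novel:
            accepted.append(example["input"])
    return accepted


# Two toy sentiment "models" based on keyword rules (assumptions for the demo).
models = [
    lambda text: "pos" if "good" in text else "neg",
    lambda text: "neg" if "not" in text else "pos",
]
candidates = [
    {"input": "not good at all", "human_label": "neg", "human_votes": ["neg", "neg", "pos"]},
    {"input": "good movie", "human_label": "pos", "human_votes": ["pos", "pos", "pos"]},
    {"input": "not bad", "human_label": "pos", "human_votes": ["pos", "pos", "neg"]},
]
seen_inputs = {"not bad"}  # already part of an existing test set

accepted = accept_adversarial(candidates, models, seen_inputs)
print(accepted)
```

A real benchmark would need a softer notion of novelty than exact matching, for example a similarity threshold; that choice is left open here.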

2.3. Selection Algorithm

  • Collecting large amounts of data has become straightforward, but creating valuable training sets from that data can be cumbersome and resource intensive. Naively processing the data wastes valuable computational and labeling resources because the data is often redundant and heavily skewed.
  • This challenge tasks participants with algorithmically identifying and selecting the most informative examples from a dataset to use for training.
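One way to avoid spending the budget on redundant examples is greedy k-center (farthest-point) selection. The sketch below uses this technique purely as an illustration; it is not the method prescribed by DataPerf, and the function name and the 2-D feature vectors are assumptions.

```python
import math


def k_center_greedy(points, budget):
    # Greedy farthest-point selection: every new pick is the point that is
    # farthest from everything chosen so far, which discourages near-duplicates.
    # Assumes the points are distinct.
    budget = min(budget, len(points))
    selected = [0]  # arbitrary starting point
    nearest = [math.dist(p, points[0]) for p in points]
    while len(selected) < budget:
        farthest = max(range(len(points)), key=nearest.__getitem__)
        selected.append(farthest)
        nearest = [min(d, math.dist(p, points[farthest]))
                   for d, p in zip(nearest, points)]
    return selected


# Hypothetical feature vectors: two tight clusters plus one outlier.
pool = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (10, 0)]
chosen = k_center_greedy(pool, budget=3)
print(chosen)
```

With a budget of three, the selection takes one point from each region of the pool instead of three near-identical points from the first cluster.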

2.4. Debugging Algorithm

  • Training datasets can contain data errors, such as missing or corrupted values and incorrect labels. Repairing these errors is costly and often involves human labor.
  • Given a fixed budget of data examples that can be repaired, the challenge is to select a subset of training examples that, after repair, yield the biggest performance improvement.
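A simple baseline for choosing which examples to repair is to flag the ones whose labels disagree most with their nearest neighbors. The sketch below is a hypothetical heuristic, not the benchmark's reference solution; `rank_suspects`, the 1-D features and the choice of `k` are assumptions.

```python
def rank_suspects(train, budget, k=3):
    # Score each example by the fraction of its k nearest neighbors that carry
    # a different label, then return the `budget` most suspicious indices.
    def disagreement(i):
        x, y = train[i]
        others = [j for j in range(len(train)) if j != i]
        neighbors = sorted(others, key=lambda j: abs(train[j][0] - x))[:k]
        return sum(train[j][1] != y for j in neighbors) / k

    ranked = sorted(range(len(train)), key=disagreement, reverse=True)
    return ranked[:budget]


# Toy data: indices 2 and 6 carry labels that contradict their surroundings.
noisy_train = [
    (0.0, "a"), (0.1, "a"), (0.2, "b"), (0.3, "a"),
    (0.8, "b"), (0.9, "b"), (1.0, "a"), (1.1, "b"),
]
to_repair = rank_suspects(noisy_train, budget=2)
print(to_repair)
```

In the benchmark setting, the selected indices would then be repaired and a model retrained, so the score reflects how much the repair budget actually helped.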

2.5. Valuation Algorithm

  • Conceptually, there is a data market in which data acquirers and data providers buy and sell datasets. Assessing the data quality is crucial.
  • This benchmark will measure the quality of an algorithm that estimates the relative value of a new dataset. The metric is the difference between the accuracy the algorithm estimates and the true accuracy of a model trained on the union of the two datasets (the data already held and the new dataset).
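The metric described above compares an estimated accuracy with the accuracy actually reached after training on the combined data. A minimal sketch, assuming the error is averaged as an absolute difference over several candidate datasets (the averaging, the function name and the numbers are illustrative assumptions):

```python
def mean_valuation_error(estimated, actual):
    # Mean absolute gap between the accuracy a valuation algorithm predicted
    # and the accuracy measured after really training on the combined data.
    if len(estimated) != len(actual):
        raise ValueError("need one estimate per measured accuracy")
    gaps = [abs(e - a) for e, a in zip(estimated, actual)]
    return sum(gaps) / len(gaps)


# Hypothetical numbers for two candidate datasets offered by providers.
estimated_accuracy = [0.80, 0.70]
measured_accuracy = [0.85, 0.60]
error = mean_valuation_error(estimated_accuracy, measured_accuracy)
print(round(error, 3))
```

A lower value means the valuation algorithm predicts more faithfully what the data is worth before anyone pays for it.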

2.6. Slice Discovery Algorithm

  • ML models that achieve high overall accuracy often make systematic errors on important data subgroups (or slices). For instance, models trained to detect collapsed lungs in chest X-rays usually make predictions based on the presence of chest drains. As a result, these models frequently make prediction errors in cases without chest drains.
  • The benchmark measures how closely the top-k examples in each slice match the top-k examples in a ground truth slice, and it adds newly discovered useful slices into the ground truth.
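Matching the top-k examples of a discovered slice against a ground-truth slice can be expressed as precision-at-k. The sketch below assumes that reading of the metric; the example IDs and the function name are hypothetical.

```python
def precision_at_k(predicted_ranking, true_slice, k):
    # Fraction of the k highest-ranked examples that belong to the true slice.
    top_k = predicted_ranking[:k]
    return sum(example in true_slice for example in top_k) / k


# Hypothetical example IDs, ranked by how strongly the discovery method
# believes each one belongs to the slice.
predicted_ranking = ["x3", "x7", "x1", "x9", "x2"]
true_slice = {"x3", "x1", "x2", "x5"}
score = precision_at_k(predicted_ranking, true_slice, k=4)
print(score)
```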

2.7. Benchmark Types × Tasks

Benchmark Matrix
  • A benchmark matrix is proposed for three main reasons.
  1. First, each column embodies a data ratchet for a specific problem in the form of a training set benchmark and a test set benchmark.
  2. Second, for the algorithmic benchmarks, the same data-centric algorithm can be submitted to every benchmark in its row, as long as it is model-agnostic, to demonstrate generality.
  3. Third, pragmatically, rules and infrastructure developed to support one benchmark may be leveraged for other challenges.

3. Competitions, Challenges and Leaderboards

  • DataPerf will use leaderboards and challenges to encourage constructive competition, identify the best ideas, and inspire the next generation of concepts for building and optimizing datasets.
  • A leaderboard is a public summary of benchmark results.

3.1. Selection for Speech

Speech Benchmark Example
  • DataPerf v0.5 includes a dataset-selection-algorithm challenge with a speech-centric focus.
  • The objective of the speech-selection task is to develop a selection algorithm that chooses the most effective training samples from a vast (and noisy) corpus of spoken words.
  • e.g.: One such dataset is the Multilingual Spoken Words Corpus (MSWC), a large and growing audio dataset of over 340,000 spoken words in 50 languages. Collectively, these languages represent more than five billion people. However, owing to errors in the generation process and in the source data, some of the samples are incorrect.

The algorithm will select the fewest possible data samples to train a suite of five-target keyword-spotting models.

  • In brief, participants submit a containerized version of their selection algorithm to the Dynabench server, where the benchmark infrastructure will first rerun the open test set on the selection container to check whether the submission is valid and then, if so, run the selection algorithm on the hidden test set to produce an official score. That score will appear on a live leaderboard.
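The two-stage flow above (validity check on open data, then an official run on hidden data) can be sketched as below. This is not the Dynabench API: `evaluate_submission`, `is_valid_selection`, the validity rules and the toy scoring function are all assumptions made for illustration.

```python
def is_valid_selection(selection, pool, max_size):
    # Valid here means: within budget, no duplicates, every ID from the pool.
    return (len(selection) <= max_size
            and len(set(selection)) == len(selection)
            and set(selection) <= set(pool))


def evaluate_submission(select, open_pool, hidden_pool, max_size, score):
    # Stage 1: dry run on the open data; invalid submissions get no score.
    open_choice = select(open_pool, max_size)
    if not is_valid_selection(open_choice, open_pool, max_size):
        return None
    # Stage 2: official run on the hidden data.
    hidden_choice = select(hidden_pool, max_size)
    if not is_valid_selection(hidden_choice, hidden_pool, max_size):
        return None
    return score(hidden_choice)


# Toy stand-ins: sample IDs are integers and "quality" is the share of even IDs.
def toy_score(chosen):
    return sum(i % 2 == 0 for i in chosen) / len(chosen)


take_first = lambda pool, n: pool[:n]   # respects the budget
take_all = lambda pool, n: list(pool)   # ignores the budget

official = evaluate_submission(take_first, [1, 2, 3, 4], [10, 11, 12, 13], 2, toy_score)
rejected = evaluate_submission(take_all, [1, 2, 3, 4], [10, 11, 12, 13], 2, toy_score)
print(official, rejected)
```

Running the open-set check first lets a broken submission fail fast, before any hidden data is touched.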

3.2. Selection for Vision

  • Large datasets have been critical to many ML achievements, but they create problems. Massive datasets are cumbersome and expensive to work with, especially when they contain unstructured data such as images, videos and speech.
  • Careful data selection can mitigate some of the difficulties by focusing on the most valuable examples. The task is to design a data-selection strategy that chooses the best training examples from a large pool of training images.
  • e.g.: Creating a subset of the Open Images Dataset V6 training set that maximizes the mean average precision (mAP) for a set of concepts (“cupcake,” “hawk” and “sushi”). Because the images are unlabeled, we provide a set of positive examples for each classification task that participants can use to search for images that contain the target concepts.
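Using positive examples to search an unlabeled pool can be sketched as a similarity ranking in an embedding space. The sketch assumes each image already has an embedding vector; the 2-D vectors, image names and `select_by_similarity` are hypothetical, and a real submission could use any other search strategy.

```python
import math


def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def select_by_similarity(pool, positives, budget):
    # Rank unlabeled images by similarity to the centroid of the positives
    # and keep the `budget` closest ones.
    dim = len(positives[0])
    centroid = [sum(p[i] for p in positives) / len(positives) for i in range(dim)]
    ranked = sorted(pool, key=lambda name: cosine(pool[name], centroid), reverse=True)
    return ranked[:budget]


# Hypothetical 2-D embeddings for four unlabeled images.
pool = {
    "img_a": (0.9, 0.1),
    "img_b": (0.1, 0.9),
    "img_c": (0.8, 0.3),
    "img_d": (0.2, 0.7),
}
cupcake_positives = [(1.0, 0.0), (0.9, 0.2)]  # provided positive examples
selected = select_by_similarity(pool, cupcake_positives, budget=2)
print(selected)
```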
VisionPerf in DataPerf
  • VisionPerf aims to train a model on a fraction of the data instead of the full dataset.
  • There are multiple related data-centric benchmarks:
  • DCBench is a benchmark for algorithms used to construct and analyze machine learning datasets.
  • Crowdsourcing Adverse Test Sets for Machine Learning (CATS4ML) Data Challenge aims to raise the bar for ML evaluation sets and to find as many examples as possible that are confusing or otherwise problematic for algorithms to process, starting with image classification.
  • Dynabench is the platform DataPerf uses for dynamic data collection and benchmarking. It challenges existing ML benchmarking dogma by embracing dynamic dataset generation.
  • MLPerf benchmark suites closely parallel what DataPerf is trying to do. MLPerf defines clear rules for measuring speed and power consumption across various systems, spanning from datacenter-scale ML systems that consume megawatts of power to tiny embedded ML systems that consume only microwatts of power.


