Brief Review — DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature

DetectGPT, LLM (e.g., GPT-3) Detection, Detect If a Passage Is Generated by a Given LLM

Sik-Ho Tsang
6 min read · Mar 4, 2023
DetectGPT aims to determine whether a piece of text was generated by a particular LLM p, such as GPT-3. To classify a candidate passage x, DetectGPT first generates minor perturbations x̃_i of the passage using a generic pre-trained model such as T5. Then DetectGPT compares the log probability under p of the original sample x with that of each perturbed sample x̃_i. If the average log ratio is high, the sample is likely from the source model.

DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature,
DetectGPT, by Stanford University,
2023 arXiv v1 (Sik-Ho Tsang @ Medium)
NLP, LM, LLM, Transformer, T5, GPT-3, InstructGPT, ChatGPT

  • ChatGPT is a hot topic. People are discussing whether we can detect that a passage was generated by a large language model (LLM).
  • A new curvature-based criterion, DetectGPT, is defined for judging if a passage is generated from a given LLM.
  • DetectGPT does not require training a separate classifier, collecting a dataset of real or generated passages, or explicitly watermarking generated text.
  • It uses only log probabilities computed by the model of interest and random perturbations of the passage from another generic pre-trained language model, e.g., T5.
  • (For a quick read, see Sections 1, 2, and 4.1.)

Outline

  1. DetectGPT: Random Perturbations & Hypothesis
  2. DetectGPT: Automated Testing
  3. Interpretation of the Perturbation Discrepancy as Curvature
  4. Results

1. DetectGPT: Random Perturbations & Hypothesis

We identify and exploit the tendency of machine-generated passages x ∼ p_θ(.) (left) to lie in negative curvature regions of log p(x), where nearby samples have lower model log probability on average. In contrast, human-written text x ∼ p_real(.) (right) tends not to occupy regions with clear negative log probability curvature.

DetectGPT is based on the hypothesis that samples from a source model p_θ typically lie in areas of negative curvature of the log probability function of p_θ, unlike human text.

  • If we apply small perturbations to a passage x ∼ p_θ, producing x̃, the quantity log p_θ(x) − log p_θ(x̃) should be relatively large on average for machine-generated samples compared to human-written text.
  • To leverage this hypothesis, first consider a perturbation function q(.|x) that gives a distribution over x̃, slightly modified versions of x with similar meaning (x is generally a roughly paragraph-length text).
  • As an example, q(.|x) might be the result of simply asking a human to rewrite one of the sentences of x, while preserving the meaning of x.
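  • Using the notion of a perturbation function, we can define the perturbation discrepancy d(x, p_θ, q):

$$d(x, p_\theta, q) \triangleq \log p_\theta(x) - \mathbb{E}_{\tilde{x} \sim q(\cdot \mid x)} \log p_\theta(\tilde{x}) \tag{1}$$
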
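  • Thus, Hypothesis 4.1 is formed:
Hypothesis 4.1. If q produces samples on the data manifold, d(x, p_θ, q) is positive with high probability for samples x ∼ p_θ. For human-written text, d(x, p_θ, q) tends toward zero for all x.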

If q(.|x) are samples from a mask-filling model such as T5, rather than human rewrites, Hypothesis 4.1 can be empirically tested in an automated, scalable manner.
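As a rough illustration of how the perturbation discrepancy could be estimated by sampling, here is a minimal Python sketch. The scoring function `toy_log_prob` and the perturbation function `toy_perturb` are toy stand-ins invented for this example (they are not an LLM or T5); in practice they would be replaced by the source model's log probability and a mask-filling model.

```python
import random
from statistics import mean

def perturbation_discrepancy(x, log_prob, perturb, n_samples=100):
    """Monte Carlo estimate of log p(x) minus the mean log p of perturbed copies."""
    perturbed_scores = [log_prob(perturb(x)) for _ in range(n_samples)]
    return log_prob(x) - mean(perturbed_scores)

# --- Toy stand-ins (assumptions for illustration only) ---
PREFERRED = {"the", "cat", "sat", "on", "mat"}   # words the toy "model" likes
FILLERS = ["dog", "rug", "ran", "a", "under"]    # words the toy "perturber" inserts
_rng = random.Random(0)

def toy_log_prob(text):
    # Each preferred word costs -1; every other word costs -3.
    return sum(-1.0 if w in PREFERRED else -3.0 for w in text.split())

def toy_perturb(text):
    # Replace one randomly chosen word with a filler word.
    words = text.split()
    words[_rng.randrange(len(words))] = _rng.choice(FILLERS)
    return " ".join(words)

# A passage the toy model "likes" sits at a local peak: every perturbation hurts.
d_model = perturbation_discrepancy("the cat sat on the mat", toy_log_prob, toy_perturb)
# A passage the toy model does not favor: most perturbations change little.
d_human = perturbation_discrepancy("a dog ran under the rug", toy_log_prob, toy_perturb)
print(d_model, d_human)
```

In this toy setup the first passage gets a discrepancy of exactly 2.0, while the second stays much closer to zero, mirroring the gap the hypothesis predicts.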

2. DetectGPT: Automated Testing

The average drop in log probability (perturbation discrepancy) after rephrasing a passage is consistently higher for model-generated passages than for human-written passages.
  • For real data, 500 news articles from the XSum dataset are used.
  • For model samples, the outputs of four different LLMs are used when prompted with the first 30 tokens of each article in XSum.
  • T5-3B is used to apply perturbations, masking out randomly sampled 2-word spans until 15% of the words in the article are masked.
  • The expectation in Eq. (1) is approximated with 100 samples from T5.
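The masking step can be sketched as follows. This is a simplified illustration rather than the authors' code: the function name `mask_spans` is hypothetical, spans are allowed to overlap (overlapping spans are merged), and each masked run is written as a T5-style sentinel token `<extra_id_N>`. Filling the sentinels with T5 is not shown.

```python
import random

def mask_spans(text, span_len=2, mask_frac=0.15, seed=0):
    """Replace random `span_len`-word spans with sentinel tokens
    until at least `mask_frac` of the words are masked."""
    rng = random.Random(seed)
    words = text.split()
    if len(words) < span_len:
        return text  # too short to mask a full span
    masked = [False] * len(words)
    while sum(masked) < mask_frac * len(words):
        start = rng.randrange(len(words) - span_len + 1)
        for i in range(start, start + span_len):
            masked[i] = True
    # Collapse each contiguous masked run into one numbered sentinel.
    out, sentinel, i = [], 0, 0
    while i < len(words):
        if masked[i]:
            out.append(f"<extra_id_{sentinel}>")
            sentinel += 1
            while i < len(words) and masked[i]:
                i += 1
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

example = "the quick brown fox jumps over the lazy dog near the old river bank today"
print(mask_spans(example))
```

Because whole spans are masked at a time, the final masked fraction can slightly exceed the 15% target.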

This experiment shows that the distribution of perturbation discrepancies differs significantly between human-written articles and model samples; model samples tend to have a larger perturbation discrepancy.

Given these results, we can detect if a piece of text was generated by a model p by simply thresholding the perturbation discrepancy.

  • In practice, normalizing the perturbation discrepancy by the standard deviation of the observed values used to estimate E_{x̃∼q(.|x)} log p_θ(x̃) provides a slightly better signal for detection, typically increasing AUROC by around 0.020, so the normalized version of the perturbation discrepancy is used in the experiments.
Algorithm: DetectGPT model-generated text detection
  • The above algorithm summarizes the normalized DetectGPT.
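The normalization and thresholding steps can be sketched in a few lines of Python, assuming the log probabilities of the original passage and of its perturbed copies have already been computed by the source model. The function names, the example numbers, and the threshold value are hypothetical; a real threshold would be chosen on validation data.

```python
from statistics import mean, stdev

def detectgpt_score(log_p_x, perturbed_log_ps):
    """Perturbation discrepancy divided by the sample standard deviation
    of the perturbed log probabilities."""
    mu = mean(perturbed_log_ps)
    sigma = stdev(perturbed_log_ps)  # sample standard deviation (n - 1)
    if sigma == 0:
        raise ValueError("perturbed log probabilities have zero variance")
    return (log_p_x - mu) / sigma

def is_machine_generated(log_p_x, perturbed_log_ps, threshold):
    """Flag the passage when the normalized discrepancy exceeds the threshold."""
    return detectgpt_score(log_p_x, perturbed_log_ps) > threshold

# Made-up numbers: perturbing lowers the log probability by about 5 on average.
perturbed = [-54.0, -56.0, -55.0, -57.0, -53.0]
print(detectgpt_score(-50.0, perturbed))                     # large positive score
print(is_machine_generated(-54.5, perturbed, threshold=1.0)) # small discrepancy
```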

Although the perturbation discrepancy is useful for detection, it is not immediately obvious what it measures. The authors interpret it as a measure of curvature, as described in the next section.

3. Interpretation of the Perturbation Discrepancy as Curvature

The perturbation discrepancy approximates a measure of the local curvature of the log probability function near the candidate passage. More specifically, it is proportional to the negative trace of the Hessian of the log probability function.

  • First, Hutchinson’s trace estimator (Hutchinson, 1990) is invoked, giving an unbiased estimate of the trace of a matrix A:

$$\operatorname{tr}(A) = \mathbb{E}_z\, z^\top A z \tag{2}$$

  • provided that the elements of z ∼ q_z are IID with E[z_i] = 0 and Var(z_i) = 1.
  • To use Equation 2 to estimate the trace of the Hessian H_f(x) of a function f, the expectation of the directional second derivative z^T H_f(x) z is computed. This expression is approximated with finite differences:

$$z^\top H_f(x)\, z \approx \frac{f(x + hz) + f(x - hz) - 2 f(x)}{h^2} \tag{3}$$

  • Combining Equations 2 and 3 and simplifying with h = 1, an estimate of the negative Hessian trace is:

$$-\operatorname{tr}(H_f(x)) \approx 2 f(x) - \mathbb{E}_z \left[ f(x + z) + f(x - z) \right] \tag{4}$$

  • If the noise distribution is symmetric, that is, p(z) = p(−z) for all z, then we can simplify Equation 4 to:

$$-\frac{\operatorname{tr}(H_f(x))}{2} \approx f(x) - \mathbb{E}_z\, f(x + z) \tag{5}$$

  • The RHS of Equation 5 corresponds to the perturbation discrepancy (1) with f = log p_θ, where the perturbation function q(x̃|x) is replaced by the distribution q_z(z) used in Hutchinson’s trace estimator (2).
  • Here x̃ is a high-dimensional sequence of tokens, while z ∼ q_z is a vector in a compact semantic space.

Sampling in semantic space ensures that all samples stay near the data manifold, which is useful because we would expect the log probability to always drop if we randomly perturb tokens. We can therefore interpret our objective as approximating the curvature restricted to the data manifold.
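The link between Equation 5 and the Hessian trace can be checked numerically on a toy function. The sketch below is an illustration under stated assumptions, not part of the paper: `f` is a hand-picked concave quadratic standing in for a log probability (so its Hessian is −A), and the noise z is Rademacher (each entry ±1 with equal probability), which is symmetric with zero mean and unit variance.

```python
import random

# Symmetric positive-definite matrix; f below has Hessian -A (negative curvature).
A = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.3],
     [0.0, 0.3, 3.0]]

def f(x):
    """Toy 'log probability': f(x) = -0.5 * x^T A x."""
    n = len(x)
    return -0.5 * sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

def discrepancy(x, n_samples=20000, seed=0):
    """Monte Carlo estimate of f(x) - E_z f(x + z) with Rademacher noise z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in x]
        total += f([xi + zi for xi, zi in zip(x, z)])
    return f(x) - total / n_samples

x0 = [0.1, -0.2, 0.3]
estimate = discrepancy(x0)
exact = 0.5 * sum(A[i][i] for i in range(len(A)))  # -tr(H)/2 = tr(A)/2
print(estimate, exact)
```

For a quadratic the finite-difference step is exact, so the estimate differs from tr(A)/2 only by Monte Carlo noise. For a real language model, the relation holds only approximately.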

4. Results

4.1. Zero-Shot Machine-Generated Text Detection

AUROC for detecting samples from the given model on the given dataset for DetectGPT and four previously proposed criteria (500 samples used for evaluation).
  • Each experiment uses between 150 and 500 examples for evaluation.
  • Again, for each experiment, the machine-generated text is generated by prompting with the first 30 tokens of the real text.
  • The performance is evaluated using the area under the receiver operating characteristic curve (AUROC), which can be interpreted as the probability that a classifier correctly ranks a randomly-selected positive (machine-generated) example higher than a randomly-selected negative (human-written) example.
  • A mask rate of 15% and a masked span length of 2 are used; these values were chosen using a held-out set of XSum data.
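The ranking interpretation of AUROC described above can be computed directly by comparing every positive score with every negative score. The sketch below uses made-up detector scores; ties are counted as half a win, which is the usual convention.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive example is scored above a random
    negative one (ties count as half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up detector scores for illustration.
machine_scores = [2.1, 1.4, 0.9, 3.0]   # positives (machine-generated)
human_scores = [0.2, 1.0, -0.5, 1.6]    # negatives (human-written)
print(auroc(machine_scores, human_scores))  # 13 of 16 pairs ranked correctly -> 0.8125
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic in the number of examples, which is fine for evaluation sets of a few hundred passages.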

DetectGPT most improves average detection accuracy for XSum stories (0.1 AUROC improvement) and SQuAD Wikipedia contexts (0.05 AUROC improvement).

For 14 of the 15 combinations of dataset and model, DetectGPT provides the most accurate detection performance, with a 0.06 AUROC improvement on average.

4.2. Comparison with Supervised Detectors

Supervised machine-generated text detection models trained on large datasets of real and generated texts perform as well as or better than DetectGPT on in-distribution (top row) text. Zero-shot methods work out-of-the-box for new domains (bottom row) such as PubMed medical texts and German news data from WMT16.

Using 200 samples from each dataset for evaluation, the supervised detectors can provide similar detection performance to DetectGPT on in-distribution data like English news, but perform significantly worse than zero-shot methods in the case of English scientific writing and fail altogether for German writing.

DetectGPT detects GPT-3 generations with average AUROC on par with supervised models trained specifically for machine-generated text detection.
  • 150 examples are sampled from the PubMedQA, XSum, and WritingPrompts datasets. Two pre-trained RoBERTa-based detector models are compared with DetectGPT and the probability thresholding baseline.

DetectGPT can provide detection competitive with the stronger supervised model, and it again outperforms probability thresholding on average.

4.3. Variants of Machine-Generated Text Detection

The authors simulate human edits to machine-generated text by replacing varying fractions of model samples with T5-3B generated text.
  • This experiment tests whether detectors can still detect human-edited machine-generated text. Human revision is simulated by replacing 5-word spans of the text with samples from T5-3B until r% of the text has been replaced.

DetectGPT maintains detection AUROC above 0.8 even when nearly a quarter of the text in model samples has been replaced. DetectGPT shows the strongest detection performance for all revision levels.

  • (There are other experiments, please check out the paper if interested.)
