Brief Review — DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature
DetectGPT, LLM (e.g., GPT-3) Detection, Detecting Whether a Passage Is Generated by a Given LLM
DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature,
DetectGPT, by Stanford University,
2023 arXiv v1 (Sik-Ho Tsang @ Medium)
NLP, LM, LLM, Transformer, T5, GPT-3, InstructGPT, ChatGPT
2018 [T-DMCA] 2020 [Human Feedback Model] 2022 [InstructGPT]
==== My Other Paper Readings Also Over Here ====
- ChatGPT is a hot topic. People are discussing whether we can detect that a passage was generated by a large language model (LLM).
- A new curvature-based criterion, DetectGPT, is defined for judging whether a passage was generated by a given LLM.
- DetectGPT does not require training a separate classifier, collecting a dataset of real or generated passages, or explicitly watermarking generated text.
- It uses only log probabilities computed by the model of interest, together with random perturbations of the passage produced by another generic pre-trained language model, e.g., T5.
- (For a quick read, please see Sections 1, 2, and 4.1.)
Outline
- DetectGPT: Random Perturbations & Hypothesis
- DetectGPT: Automated Testing
- Interpretation of the Perturbation Discrepancy as Curvature
- Results
1. DetectGPT: Random Perturbations & Hypothesis
DetectGPT is based on the hypothesis that samples from a source model pθ typically lie in areas of negative curvature of the log probability function of pθ, unlike human text.
- If we apply small perturbations to a passage x~pθ, producing ~x, the quantity log pθ(x) - log pθ(~x) should be relatively large on average for machine-generated samples compared to human-written text.
- To leverage this hypothesis, first consider a perturbation function q(.|x) that gives a distribution over ~x, slightly modified versions of x with similar meaning (roughly paragraph-length texts x are generally considered).
- As an example, q(.|x) might be the result of simply asking a human to rewrite one of the sentences of x, while preserving the meaning of x.
- Using the notion of a perturbation function, we can define the perturbation discrepancy d(x; pθ, q):
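d(x; pθ, q) = log pθ(x) - E_{~x~q(.|x)} log pθ(~x)   (1)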
- Thus, Hypothesis 4.1 below is formed:
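Hypothesis 4.1: If q(.|x) produces samples on the data manifold, d(x; pθ, q) is positive with high probability for samples x ~ pθ. For human-written text, d(x; pθ, q) tends toward zero for all x.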
If the samples q(.|x) come from a mask-filling model such as T5, rather than from human rewrites, Hypothesis 4.1 can be tested empirically in an automated, scalable manner.
2. DetectGPT: Automated Testing
- For real data, 500 news articles from the XSum dataset are used.
- For model samples, the outputs of four different LLMs are used when prompted with the first 30 tokens of each article in XSum.
- T5-3B is used to apply perturbations, masking out randomly-sampled 2-word spans until 15% of the words in the article are masked.
- The expectation in Eq. (1) is approximated with 100 samples from T5. (A rough code sketch of this perturbation step follows.)
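To make the perturbation step concrete, here is a minimal sketch assuming the Hugging Face transformers library; "t5-small" is used as a stand-in for the paper's T5-3B, and the sentinel parsing is a simplification for illustration, not the authors' exact implementation.

```python
# Minimal sketch of the T5 span-masking perturbation step.
# Assumes Hugging Face `transformers`; "t5-small" stands in for T5-3B.
import random
import re

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
t5 = T5ForConditionalGeneration.from_pretrained("t5-small")

def perturb(text: str, mask_frac: float = 0.15, span: int = 2) -> str:
    """Mask random 2-word spans covering ~15% of the words, then fill with T5."""
    words = text.split()
    n_spans = max(1, int(len(words) * mask_frac / span))
    starts = sorted(random.sample(range(len(words) - span), n_spans))

    # Replace each selected span with a T5 sentinel token.
    out, prev, k = [], 0, 0
    for s in starts:
        if s < prev:                       # skip overlapping spans
            continue
        out += words[prev:s] + [f"<extra_id_{k}>"]
        prev, k = s + span, k + 1
    masked = " ".join(out + words[prev:])

    # Ask T5 to fill the sentinels, then splice the fills back in.
    ids = tokenizer(masked, return_tensors="pt").input_ids
    gen = t5.generate(ids, do_sample=True, top_p=0.95, max_new_tokens=4 * k)
    decoded = tokenizer.decode(gen[0], skip_special_tokens=False)
    fills = re.split(r"<extra_id_\d+>", decoded)[1:]   # text after each sentinel
    for i in range(k):
        fill = fills[i].replace("</s>", "").strip() if i < len(fills) else ""
        masked = masked.replace(f"<extra_id_{i}>", fill, 1)
    return masked
```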
The result of this experiment shows that the distribution of perturbation discrepancies is significantly different for human-written articles and model samples; model samples tend to have a larger perturbation discrepancy.
Given these results, we can detect if a piece of text was generated by a model pθ by simply thresholding the perturbation discrepancy.
- In practice, normalizing the perturbation discrepancy by the standard deviation of the observed values used to estimate E_{~x~q(.|x)} log pθ(~x) provides a slightly better signal for detection, typically increasing AUROC by around 0.020, so the normalized version of the perturbation discrepancy is used in the experiments.
- Algorithm 1 in the paper summarizes the normalized DetectGPT procedure; a rough code sketch is given below.
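Below is a rough sketch of the normalized scoring loop, not the authors' code, reusing the perturb() helper sketched above; "gpt2" is an illustrative source model, and the threshold epsilon is an arbitrary placeholder rather than a value from the paper.

```python
# Rough sketch of normalized DetectGPT scoring, reusing perturb() above.
# `gpt2` is illustrative; `epsilon` is a placeholder decision threshold.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

lm_tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def log_prob(text: str) -> float:
    """Average per-token log probability of `text` under the source model."""
    ids = lm_tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss    # mean negative log-likelihood
    return -loss.item()

def detectgpt_score(text: str, n_perturbations: int = 100) -> float:
    """Normalized perturbation discrepancy: larger => more likely machine-generated."""
    perturbed = [log_prob(perturb(text)) for _ in range(n_perturbations)]
    mu = sum(perturbed) / len(perturbed)
    var = sum((p - mu) ** 2 for p in perturbed) / (len(perturbed) - 1)
    return (log_prob(text) - mu) / (var ** 0.5 + 1e-8)

epsilon = 1.0                                    # placeholder threshold
passage = "Paste a roughly paragraph-length candidate passage here."
if detectgpt_score(passage) > epsilon:
    print("likely generated by the source model")
else:
    print("likely human-written")
```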
While the perturbation discrepancy may be useful, it is not immediately obvious what it measures. The authors suggest interpreting it as curvature, as described in the next section.
3. Interpretation of the Perturbation Discrepancy as Curvature
The perturbation discrepancy approximates a measure of the local curvature of the log probability function near the candidate passage; more specifically, it is proportional to the negative trace of the Hessian of the log probability function.
- First, Hutchinson’s trace estimator (Hutchinson, 1990) is invoked, giving an unbiased estimate of the trace of matrix A:
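tr(A) = E_{z~qz}[zᵀAz]   (2)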
- provided that the elements of z~qz are IID with E[zi] = 0 and Var(zi)=1.
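As a quick numerical sanity check of Equation 2 (my own illustration, not from the paper), Rademacher noise satisfies these mean and variance conditions:

```python
# Sanity check of Hutchinson's estimator (Eq. 2): with Rademacher z
# (E[zi] = 0, Var(zi) = 1), the mean of z^T A z estimates tr(A).
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
zs = rng.choice([-1.0, 1.0], size=(20000, 50))   # z ~ Rademacher
estimate = np.mean([z @ A @ z for z in zs])
print(estimate, np.trace(A))                     # the two values should be close
```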
- To use Equation 2 to estimate the trace of the Hessian, the expectation of the directional second derivative zᵀHf(x)z is computed. This expression is approximated with finite differences:
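zᵀHf(x)z ≈ [f(x + hz) - 2f(x) + f(x - hz)] / h²   (3)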
- Combining Equations 2 and 3 and simplifying with h=1, an estimate of the negative Hessian trace is:
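-tr(Hf(x)) ≈ 2f(x) - E_z[f(x + z)] - E_z[f(x - z)]   (4)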
- If the noise distribution is symmetric, that is, p(z)=p(-z) for all z, then we can simplify Equation 4 to:
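-(1/2) tr(Hf(x)) ≈ f(x) - E_z[f(x + z)]   (5)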
- The RHS of Equation 5 corresponds to the perturbation discrepancy (1) where the perturbation function q(~x|x) is replaced by the distribution qz(z) used in Hutchinson’s trace estimator (2).
- Note that ~x is a high-dimensional sequence of tokens, while z ~ qz is a vector in a compact semantic space.
Sampling in semantic space ensures that all samples stay near the data manifold, which is useful because we would expect the log probability to always drop if we randomly perturb tokens. We can therefore interpret our objective as approximating the curvature restricted to the data manifold.
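As a toy check of Equation 5 (my own illustration, not from the paper): for f(x) = -||x||²/2 the Hessian is -I with trace -d, so the gap f(x) - E_z[f(x + z)] should approach d/2 = -(1/2)tr(H):

```python
# Toy check of Eq. (5): for f(x) = -||x||^2 / 2 the Hessian is -I
# (trace -d), so f(x) - E_z[f(x + z)] should approach d / 2.
import numpy as np

rng = np.random.default_rng(0)
d = 20
f = lambda x: -0.5 * (x @ x)
x = rng.standard_normal(d)
zs = rng.standard_normal((100_000, d))         # zi IID with mean 0, variance 1
gap = f(x) - np.mean([f(x + z) for z in zs])   # perturbation-discrepancy analogue
print(gap, d / 2)                              # both should be close to 10
```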
4. Results
4.1. Zero-Shot Machine-Generated Text Detection
- Each experiment uses between 150 and 500 examples for evaluation.
- Again, for each experiment, the machine-generated text is generated by prompting with the first 30 tokens of the real text.
- The performance is evaluated using the area under the receiver operating characteristic curve (AUROC), which can be interpreted as the probability that a classifier correctly ranks a randomly-selected positive (machine-generated) example higher than a randomly-selected negative (human-written) example. (A toy computation follows this list.)
- The mask rate is 15% and the masked span length is 2, selected using a held-out set of XSum data.
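For illustration, here is a toy AUROC computation (hypothetical scores, not data from the paper) using scikit-learn:

```python
# Toy AUROC illustration with hypothetical perturbation-discrepancy scores.
from sklearn.metrics import roc_auc_score

labels = [1, 1, 1, 0, 0, 0]                  # 1 = machine-generated, 0 = human
scores = [2.1, 1.4, 0.9, 0.7, 0.2, 1.1]      # hypothetical discrepancy scores
print(roc_auc_score(labels, scores))         # 8 of 9 pos/neg pairs ranked correctly ≈ 0.889
```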
DetectGPT most improves average detection accuracy for XSum stories (0.1 AUROC improvement) and SQuAD Wikipedia contexts (0.05 AUROC improvement).
For 14 of the 15 combinations of dataset and model, DetectGPT provides the most accurate detection performance, with a 0.06 AUROC improvement on average.
4.2. Comparison with Supervised Detectors
Using 200 samples from each dataset for evaluation, the supervised detectors provide detection performance similar to DetectGPT on in-distribution data such as English news, but they perform significantly worse than zero-shot methods on English scientific writing and fail altogether on German writing.
- 150 examples are sampled from the PubMedQA, XSum, and WritingPrompts datasets. Two pre-trained RoBERTa-based detector models are compared with DetectGPT and the probability thresholding baseline.
DetectGPT can provide detection competitive with the stronger supervised model, and it again outperforms probability thresholding on average.
4.3. Variants of Machine-Generated Text Detection
- This part examines whether detectors can identify machine-generated text that has been edited by humans. Human revision is simulated by replacing 5-word spans of the text with samples from T5-3B until r% of the text has been replaced.
DetectGPT maintains detection AUROC above 0.8 even when nearly a quarter of the text in model samples has been replaced. DetectGPT shows the strongest detection performance for all revision levels.
- (There are other experiments, please check out the paper if interested.)