# Brief Review — DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature

## DetectGPT, LLM (e.g.: GPT-3) Detection, **Detect If a Passage is Generated From a Given LLM**

DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature, DetectGPT, by Stanford University, 2023 arXiv v1 (Sik-Ho Tsang @ Medium)

NLP, LM, LLM, Transformer, T5, GPT-3, InstructGPT, ChatGPT


- ChatGPT is a hot topic. People are discussing whether we can detect if a passage was generated by a large language model (LLM).
- **A new curvature-based criterion, DetectGPT,** is defined for **judging if a passage is generated from a given LLM.**
- **DetectGPT does not require training a separate classifier**, **collecting a dataset** of real or generated passages, **or** explicitly **watermarking** generated text.
- It uses **only log probabilities computed by the model of interest** and **random perturbations** of the passage from another generic pre-trained language model, e.g., T5.
- (For a quick read, please read Sections 1, 2, and 4.1.)

# Outline

1. **DetectGPT: Random Perturbations & Hypothesis**
2. **DetectGPT: Automated Testing**
3. **Interpretation of the Perturbation Discrepancy as Curvature**
4. **Results**

# 1. DetectGPT: Random Perturbations & Hypothesis

DetectGPT is based on the hypothesis that samples from a source model $p_\theta$ typically lie in areas of negative curvature of the log probability function of $p_\theta$, unlike human text.

- **If we apply small perturbations to a passage** $x \sim p_\theta$, producing $\tilde{x}$, the **quantity $\log p_\theta(x) - \log p_\theta(\tilde{x})$ should be relatively large on average for machine-generated samples** compared to human-written text.
- To leverage this hypothesis, **first consider a perturbation function** $q(\cdot \mid x)$ that gives a **distribution over $\tilde{x}$**, **slightly modified versions of $x$ with similar meaning** (generally consider roughly paragraph-length texts $x$).
- As an example, $q(\cdot \mid x)$ might be the result of **simply asking a human to rewrite one of the sentences of $x$**, while **preserving the meaning of $x$**.
- Using the notion of a perturbation function, we can define the **perturbation discrepancy $d(x, p_\theta, q)$**:

$$d(x, p_\theta, q) \triangleq \log p_\theta(x) - \mathbb{E}_{\tilde{x} \sim q(\cdot \mid x)} \log p_\theta(\tilde{x}) \tag{1}$$

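- Thus, **Hypothesis 4.1** is formed: if $q$ produces samples on the data manifold, $d(x, p_\theta, q)$ is positive with high probability for samples $x \sim p_\theta$, while for human-written text it tends toward zero for all $x$.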

If $q(\cdot \mid x)$ are samples from a mask-filling model such as T5, rather than human rewrites, Hypothesis 4.1 can be empirically tested in an automated, scalable manner.
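To make the perturbation discrepancy concrete, here is a minimal Python sketch of its Monte Carlo estimate. It is not the authors' code: `TOY_PROBS`, `toy_log_prob`, and `toy_perturb` are hypothetical stand-ins (a tiny unigram model and a random word swap) for the real LLM log probability and the T5 mask-filling perturbation.

```python
import math
import random
from typing import Callable

# Hypothetical toy unigram "language model"; a real setup would query the LLM under test.
TOY_PROBS = {"the": 0.5, "cat": 0.3, "sat": 0.2}


def toy_log_prob(text: str) -> float:
    """Sum of unigram log probabilities; unknown words get a tiny probability."""
    return sum(math.log(TOY_PROBS.get(w, 1e-6)) for w in text.split())


def toy_perturb(text: str, rng: random.Random) -> str:
    """Stand-in for q(.|x): swap one random word for a random vocabulary word."""
    words = text.split()
    words[rng.randrange(len(words))] = rng.choice(sorted(TOY_PROBS))
    return " ".join(words)


def perturbation_discrepancy(
    text: str,
    log_prob: Callable[[str], float],
    perturb: Callable[[str], str],
    n_samples: int = 100,
) -> float:
    """Estimate d(x, p, q) = log p(x) - E[log p(x~)] with n_samples perturbations."""
    mean_perturbed = sum(log_prob(perturb(text)) for _ in range(n_samples)) / n_samples
    return log_prob(text) - mean_perturbed


rng = random.Random(0)
# "the" is the most likely word of the toy model, so any swap can only lower log p.
d = perturbation_discrepancy("the the the the", toy_log_prob, lambda t: toy_perturb(t, rng))
print(f"estimated discrepancy: {d:.3f}")
```

The passage here sits at a local maximum of the toy model's log probability, so the estimate cannot be negative, which mirrors the behavior the hypothesis expects for model samples.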

# 2. DetectGPT: Automated Testing

- For **real data**, **500 news articles** from the **XSum dataset** are used.
- For **model samples**, **the outputs of four different LLMs are used when prompted with the first 30 tokens** of each article in XSum.
- **T5-3B** is used to **apply perturbations**, **masking out randomly-sampled 2-word spans** until 15% of the words in the article are masked.
- The **expectation in Eq. (1)** is **approximated with 100 samples from T5**.
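The masking step can be sketched roughly as follows. This is an illustrative sketch only: `mask_spans` is a hypothetical helper, the non-overlap rule and the `max_tries` guard are my assumptions, and the `<extra_id_N>` placeholders follow the T5 sentinel-token naming. Filling the masks with T5 itself is left out.

```python
import math
import random
from typing import List, Optional, Tuple


def mask_spans(
    words: List[str],
    span_len: int = 2,
    mask_rate: float = 0.15,
    rng: Optional[random.Random] = None,
    max_tries: int = 10_000,
) -> Tuple[str, int]:
    """Mask random non-overlapping spans until at least mask_rate of the words are masked.

    Returns the masked text (one sentinel per span) and the number of masked words.
    """
    rng = rng or random.Random(0)
    n = len(words)
    if n < span_len:
        return " ".join(words), 0

    masked = [False] * n
    target = math.ceil(mask_rate * n)
    n_masked, tries = 0, 0
    while n_masked < target and tries < max_tries:
        tries += 1
        start = rng.randrange(0, n - span_len + 1)
        if any(masked[start:start + span_len]):
            continue  # assumption: spans are not allowed to overlap
        for i in range(start, start + span_len):
            masked[i] = True
        n_masked += span_len

    # Replace each masked span with a single T5-style sentinel token.
    out, sentinel, i = [], 0, 0
    while i < n:
        if masked[i]:
            out.append(f"<extra_id_{sentinel}>")
            sentinel += 1
            i += span_len
        else:
            out.append(words[i])
            i += 1
    return " ".join(out), n_masked


article = "the quick brown fox jumps over the lazy dog near the quiet river bank today".split()
masked_text, count = mask_spans(article)
print(masked_text, count)
```

A mask-filling model would then generate text for each sentinel, giving one perturbed passage per call.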

The result of this experiment shows that the distribution of perturbation discrepancies is significantly different for human-written articles and model samples; model samples tend to have a larger perturbation discrepancy.

Given these results, we can detect if a piece of text was generated by a model $p_\theta$ by simply thresholding the perturbation discrepancy.

- In practice, **normalizing the perturbation discrepancy** by the standard deviation of the observed values used to estimate $\mathbb{E}_{\tilde{x} \sim q(\cdot \mid x)} \log p_\theta(\tilde{x})$ provides a **slightly better signal** for detection, typically increasing AUROC by around 0.020, so the normalized version of the perturbation discrepancy is used in the experiments.
- Algorithm 1 in the paper summarizes the normalized DetectGPT.
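The normalization and thresholding described here can be sketched as below, assuming the log probabilities of the original passage and of its perturbations have already been computed. The function names and the threshold value are hypothetical, and the use of the sample standard deviation (dividing by k - 1) is my assumption.

```python
import statistics
from typing import Sequence


def normalized_discrepancy(orig_log_prob: float, perturbed_log_probs: Sequence[float]) -> float:
    """(log p(x) - mean of perturbed log probs) / std of perturbed log probs."""
    mu = statistics.fmean(perturbed_log_probs)
    sigma = statistics.stdev(perturbed_log_probs)  # sample standard deviation (k - 1)
    if sigma == 0.0:
        raise ValueError("perturbed log probabilities have zero variance")
    return (orig_log_prob - mu) / sigma


def is_machine_generated(
    orig_log_prob: float,
    perturbed_log_probs: Sequence[float],
    threshold: float,
) -> bool:
    """Flag the passage when the normalized discrepancy exceeds the threshold."""
    return normalized_discrepancy(orig_log_prob, perturbed_log_probs) > threshold


# Made-up numbers: the original passage is clearly more likely than its perturbations.
print(is_machine_generated(-10.0, [-12.0, -14.0, -16.0], threshold=1.0))
```

The threshold trades false positives against false negatives, which is why the results section reports AUROC rather than accuracy at one fixed threshold.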

While the perturbation discrepancy may be useful, it is not immediately obvious what it measures. The authors suggest interpreting it as curvature, as described in the next section.

# 3. Interpretation of the Perturbation Discrepancy as Curvature

The perturbation discrepancy approximates a measure of the local curvature of the log probability function near the candidate passage; more specifically, it is proportional to the negative trace of the Hessian of the log probability function.

- First, **Hutchinson's trace estimator** (Hutchinson, 1990) is invoked, giving an **unbiased estimate of the trace of matrix $A$**:

$$\operatorname{tr}(A) = \mathbb{E}_{z}\left[z^\top A z\right] \tag{2}$$

- provided that the **elements of $z \sim q_z$** are IID with **$\mathbb{E}[z_i] = 0$** and **$\operatorname{Var}(z_i) = 1$**.
- To use Equation 2 to estimate the trace of the Hessian, the **expectation of the directional second derivative $z^\top H_f(x) z$** is computed. This expression is **approximated with finite differences**:

$$z^\top H_f(x) z \approx \frac{f(x + hz) + f(x - hz) - 2f(x)}{h^2} \tag{3}$$

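- **Combining Equations 2 and 3** and simplifying with $h = 1$, an estimate of **the negative Hessian trace** is:

$$-\operatorname{tr}(H)_f(x) \approx 2f(x) - \mathbb{E}_{z}\left[f(x + z) + f(x - z)\right] \tag{4}$$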

- If the noise distribution is symmetric, that is, $p(z) = p(-z)$ for all $z$, then we can simplify Equation 4 to:

$$\frac{-\operatorname{tr}(H)_f(x)}{2} \approx f(x) - \mathbb{E}_{z} f(x + z) \tag{5}$$

- The **RHS of Equation 5 corresponds to the perturbation discrepancy (1)** where the **perturbation function** $q(\tilde{x} \mid x)$ is replaced by the distribution $q_z(z)$ used in Hutchinson's trace estimator (2).
- **$\tilde{x}$** is a high-dimensional **sequence of tokens** while $q_z$ is a vector in a **compact semantic space**.

Sampling in semantic space ensures that all samples stay near the data manifold, which is useful because we would expect the log probability to always drop if we randomly perturb tokens. We can therefore interpret our objective as approximating the curvature restricted to the data manifold.
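The link between the discrepancy and the Hessian trace can be checked numerically on a toy function. In the sketch below, everything is an assumption chosen for illustration: `f` is a quadratic standing in for the log probability (so its Hessian is the constant matrix `-A`), and `z` is drawn from the Rademacher distribution (entries of +1 or -1), which is symmetric with zero mean and unit variance. Because the dimension is tiny, the expectation is computed exactly by enumerating every `z`.

```python
import itertools
from typing import Callable, List, Sequence

# Hypothetical symmetric matrix; the toy "log probability" has Hessian -A.
A = [
    [2.0, 0.5, 0.0],
    [0.5, 1.0, 0.3],
    [0.0, 0.3, 3.0],
]


def f(x: Sequence[float]) -> float:
    """Toy concave function f(x) = -0.5 * x^T A x."""
    n = len(x)
    return -0.5 * sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


def discrepancy_rademacher(func: Callable[[List[float]], float], x: Sequence[float]) -> float:
    """f(x) - E_z[f(x + z)], with the expectation taken over all z in {-1, +1}^n."""
    zs = list(itertools.product((-1.0, 1.0), repeat=len(x)))
    mean_shifted = sum(func([xi + zi for xi, zi in zip(x, z)]) for z in zs) / len(zs)
    return func(x) - mean_shifted


def negative_half_hessian_trace() -> float:
    """-tr(H) / 2 for H = -A, which equals tr(A) / 2."""
    return 0.5 * sum(A[i][i] for i in range(len(A)))


x0 = [0.2, -0.1, 0.4]
print(discrepancy_rademacher(f, x0), negative_half_hessian_trace())
```

For a quadratic the finite-difference step is exact, so the two printed values agree up to floating-point error; for a real log probability function the relation is only approximate.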

# 4. Results

## 4.1. Zero-Shot Machine-Generated Text Detection

- Each experiment uses between 150 and 500 examples for evaluation.
- Again, for each experiment, **the machine-generated text is generated by prompting with the first 30 tokens of the real text**.
- The performance is evaluated using the **area under the receiver operating characteristic curve (AUROC)**, which can be interpreted as the **probability that a classifier correctly ranks a randomly-selected positive (machine-generated) example higher than a randomly-selected negative (human-written) example**.
- The **mask rate** is **15%** and the **masked span length is 2**, tuned on a held-out set of XSum data.
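The ranking interpretation of AUROC mentioned above can be written directly as a pairwise comparison. This is a small illustrative sketch with made-up scores, where ties are counted as half a win; it is quadratic in the number of examples, so a library routine would be preferable for large evaluations.

```python
from typing import Sequence


def auroc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs where the positive is scored higher."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up detector scores: machine-generated (positive) vs. human-written (negative).
machine = [0.9, 0.8, 0.4]
human = [0.5, 0.3]
print(auroc(machine, human))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the improvements below are reported.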

DetectGPT most improves average detection accuracy for XSum stories (0.1 AUROC improvement) and SQuAD Wikipedia contexts (0.05 AUROC improvement).

For 14 of the 15 combinations of dataset and model, DetectGPT provides the most accurate detection performance, with a 0.06 AUROC improvement on average.

## 4.2. Comparison with Supervised Detectors

Using 200 samples from each dataset for evaluation, the supervised detectors can provide similar detection performance to DetectGPT on in-distribution data like English news, but perform significantly worse than zero-shot methods in the case of English scientific writing and fail altogether for German writing.

**150 examples** are sampled from the PubMedQA, XSum, and WritingPrompts datasets. **Two pre-trained RoBERTa-based detector models** are **compared with DetectGPT and the probability thresholding baseline.**

DetectGPT can provide detection competitive with the stronger supervised model, and it again outperforms probability thresholding on average.

## 4.3. Variants of Machine-Generated Text Detection

- This part tests whether detectors can detect **human-edited machine-generated text.** Human revision is **simulated by replacing 5-word spans of the text with samples from T5-3B** until $r$% of the text has been replaced.
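The simulated revision procedure can be sketched along these lines. The helper names are hypothetical, `r` is given as a fraction rather than a percentage, and the `fill` callback stands in for sampling from T5-3B. As a simplifying assumption, the callback must return exactly as many words as it replaces, which a real mask-filling model would not guarantee.

```python
import random
from typing import Callable, List, Optional

# fill(left_context, right_context, span_len) -> replacement words (hypothetical interface)
FillFn = Callable[[List[str], List[str], int], List[str]]


def simulate_revision(
    words: List[str],
    r: float,
    fill: FillFn,
    span_len: int = 5,
    rng: Optional[random.Random] = None,
    max_tries: int = 10_000,
) -> List[str]:
    """Replace random non-overlapping spans until a fraction r of the words is replaced."""
    rng = rng or random.Random(0)
    out = list(words)
    n = len(out)
    replaced = [False] * n
    n_replaced, tries = 0, 0
    while n >= span_len and n_replaced < r * n and tries < max_tries:
        tries += 1
        start = rng.randrange(0, n - span_len + 1)
        end = start + span_len
        if any(replaced[start:end]):
            continue  # assumption: do not rewrite already-replaced words
        new_span = fill(out[:start], out[end:], span_len)
        if len(new_span) != span_len:
            raise ValueError("fill must return exactly span_len words in this sketch")
        out[start:end] = new_span
        for i in range(start, end):
            replaced[i] = True
        n_replaced += span_len
    return out


# Dummy filler that marks replaced words, so the effect is visible.
sample = [f"w{i}" for i in range(20)]
print(" ".join(simulate_revision(sample, r=0.25, fill=lambda left, right, k: ["X"] * k)))
```

Sweeping `r` from 0 upward and re-scoring the revised text is how a robustness curve like the one summarized below could be produced.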

DetectGPT maintains detection AUROC above 0.8 even when nearly a quarter of the text in model samples has been replaced. DetectGPT shows the strongest detection performance for all revision levels.

- (There are other experiments; please check out the paper if interested.)