Are AI text detectors reliable?

Reliable enough to flag obvious cases, unreliable enough to be unsafe as a sole adjudicator. Published false-positive rates on human writing range from 1% to 9% depending on the detector and the writing style. Non-native English writers are flagged disproportionately.

What's the most defensible signal in text detection?

Burstiness and perplexity together. Single-signal detectors are easy to game; ensembles that combine perplexity, burstiness, n-gram statistics, and sentence-length variance are harder.

Can paraphrasing tools defeat detectors?

Naive paraphrasers often make text *more* detectable by smoothing it. Real text humanizers introduce controlled burstiness, sentence-length variance, and idiosyncratic word choices.

AI Text Detectors: Why GPTZero, Originality & Turnitin Disagree

title: "AI Text Detectors: Why GPTZero, Originality & Turnitin Disagree"
description: "An honest technical look at how AI text detectors work, why their verdicts conflict so often, and what humanization pipelines actually need to defeat."
slug: "ai-text-detectors-disagree"
publishedAt: "2026-04-04"
updatedAt: "2026-04-21"
author: "SynthGuard Team"
category: "ai-detection"
tags: ["text-detection", "gptzero", "originality", "turnitin", "perplexity"]
readingTime: 12
coverImage: "/blog/covers/ai-text-detectors-disagree.webp"
faq:
  - q: "Are AI text detectors reliable?"
    a: "Reliable enough to flag obvious cases, unreliable enough to be unsafe as a sole adjudicator. Published false-positive rates on human writing range from 1% to 9% depending on the detector and the writing style. Non-native English writers are flagged disproportionately."
  - q: "What's the most defensible signal in text detection?"
    a: "Burstiness and perplexity together. Single-signal detectors are easy to game; ensembles that combine perplexity, burstiness, n-gram statistics, and sentence-length variance are harder."
  - q: "Can paraphrasing tools defeat detectors?"
    a: "Naive paraphrasers often make text more detectable by smoothing it. Real text humanizers introduce controlled burstiness, sentence-length variance, and idiosyncratic word choices."
related: ["how-ai-image-detectors-work", "privacy-first-browser-tools", "can-turnitin-detect-chatgpt", "can-copyleaks-grammarly-detect-chatgpt"]

AI text detectors are everywhere — in classrooms, in publishing workflows, in HR screening. They are also, frequently, wrong. The same paragraph submitted to GPTZero, Originality.ai, and Turnitin will routinely produce three different verdicts. This is not a quirk; it's a structural consequence of how these systems actually work.

What a text detector is doing

Almost every public detector reduces to a small set of statistical signals computed over the text. The signals differ; the structure is the same:

1. Tokenize the text
2. Compute N statistical features per chunk
3. Aggregate into a single score
4. Threshold the score → "AI" or "Human"

Step 4 is where most of the disagreement comes from. The same internal score can produce opposite verdicts depending on where the operator drew the line — and that line is rarely published.
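The four steps can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's pipeline: the word-level tokenizer, the two toy features, the aggregation rule, and both thresholds are assumptions chosen only to show how one internal score can yield two verdicts.

```python
import re
from statistics import mean

def tokenize(text: str) -> list[str]:
    """Step 1: lowercase word tokens (a stand-in for a real subword tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def chunk_features(tokens: list[str], size: int = 50) -> list[dict[str, float]]:
    """Step 2: compute toy features for each fixed-size chunk of tokens."""
    if not tokens:
        raise ValueError("text contains no tokens")
    chunks = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    return [
        {
            "type_token_ratio": len(set(chunk)) / len(chunk),
            "mean_word_length": mean(len(word) for word in chunk),
        }
        for chunk in chunks
    ]

def aggregate(features: list[dict[str, float]]) -> float:
    """Step 3: hypothetical rule, less lexical variety gives a higher score."""
    return 1.0 - mean(f["type_token_ratio"] for f in features)

def verdict(score: float, threshold: float) -> str:
    """Step 4: the operator's threshold turns the score into a label."""
    return "AI" if score >= threshold else "Human"

text = "the cat sat on the mat and the dog sat on the rug"
score = aggregate(chunk_features(tokenize(text)))
strict = verdict(score, threshold=0.3)   # hypothetical operator A
lenient = verdict(score, threshold=0.5)  # hypothetical operator B
```

Here `score` is identical for both operators, yet `strict` and `lenient` come out as different labels purely because of where each line was drawn.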

Signal 1 — Perplexity

Perplexity measures how predictable a text is to a language model. Run the text through a reference LLM and ask: at each token, how likely was this exact token given the preceding context? Formally, perplexity is the exponential of the average negative log-likelihood per token. High likelihood across the board → low perplexity → looks generated. Lower, more erratic likelihood → high perplexity → looks human.

The intuition is sound. AI text is generated by sampling mostly high-probability tokens from a model's output distribution; human text frequently lurches to lower-probability words. But there are problems:

Choice of reference model changes everything. Perplexity measured with GPT-2 says different things than perplexity measured with Llama-3.
Style matters more than authorship. Academic writing has lower perplexity than fiction regardless of who wrote it.
Non-native English writers produce systematically lower-perplexity text (smaller vocabulary, simpler constructions) and get flagged as AI.

GPTZero's original release relied heavily on perplexity. They've since added other signals because perplexity alone produced unacceptable false-positive rates on student writing.
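As a rough sketch of the computation, perplexity is the exponential of the negative mean log-probability per token. The per-token probabilities below are invented for illustration; in practice they would come from whichever reference model the detector uses, which is exactly why two detectors can score the same sentence differently.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical probabilities that two reference models assign to the same four tokens.
probs_model_a = [0.5, 0.5, 0.5, 0.5]
probs_model_b = [0.9, 0.1, 0.9, 0.1]

ppl_a = perplexity([math.log(p) for p in probs_model_a])
ppl_b = perplexity([math.log(p) for p in probs_model_b])
```

Under these made-up numbers the same text has a perplexity of 2.0 for model A and about 3.33 for model B, so a fixed cutoff between those values would give opposite answers.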

Signal 2 — Burstiness

Burstiness measures the variance of perplexity (or sentence length) across the document. Human writing is bursty: a long, complex sentence followed by a short punchy one, an unusual word among common ones. AI writing — especially un-prompted, default-temperature output — is less bursty: more uniform sentence lengths, more uniform vocabulary register, more uniform syntactic complexity.

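One common formalization is the Goh-Barabási burstiness parameter:

B = (σ - μ) / (σ + μ)

where σ is the standard deviation and μ is the mean of sentence length (or of per-sentence perplexity). B ranges from -1 (perfectly uniform, σ = 0) toward +1 (highly bursty). For sentence lengths, σ is usually smaller than μ, so B tends to be negative for human and AI text alike; what matters is the comparison. Values closer to -1 look more uniform and AI-like, higher values look more human-like.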

This is the second main signal in GPTZero and the primary signal in several academic detectors. It's harder to game than perplexity because it requires structural variation, not just word choice.
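A minimal sketch of the burstiness formula applied to sentence lengths follows. The punctuation-based sentence splitter and the use of the population standard deviation are assumptions made for brevity; a real detector may segment and normalize differently.

```python
import re
from statistics import mean, pstdev

def sentence_lengths(text: str) -> list[int]:
    """Split on ., ! and ? and count whitespace-separated words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def burstiness(values: list[int]) -> float:
    """B = (sigma - mu) / (sigma + mu), using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma + mu == 0:
        return 0.0
    return (sigma - mu) / (sigma + mu)

uniform = "One two three four. Five six seven eight. Nine ten eleven twelve."
varied = "Stop. This sentence is quite a lot longer than the first one was. Why?"

b_uniform = burstiness(sentence_lengths(uniform))
b_varied = burstiness(sentence_lengths(varied))
```

Three sentences of identical length give B = -1, the most uniform value possible, while the mix of one-word and twelve-word sentences scores noticeably higher.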

Signal 3 — Stylometric features

Originality.ai and several enterprise detectors layer in classical stylometry: function-word frequencies, type-token ratios, average word length, punctuation density, sentence-final patterns. These features were originally developed for authorship attribution (Federalist Papers analysis, etc.) and turn out to discriminate AI from human at modest rates.

The catch: stylometry works well across many documents from the same author and poorly on isolated short texts. A 200-word blog comment doesn't carry enough signal for stylometry to be reliable. A 5,000-word essay does.
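To make the feature list concrete, here is a small sketch that computes a few of these stylometric measures. The eight-word function-word list is an illustrative subset, and the feature names are hypothetical; production systems use much longer lists and more features.

```python
import re
import string

# Small illustrative subset; real stylometry uses hundreds of function words.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is")

def stylometric_features(text: str) -> dict[str, float]:
    """Return a handful of classical stylometric features for one text."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not words:
        raise ValueError("text contains no words")
    n = len(words)
    features = {
        "type_token_ratio": len(set(words)) / n,
        "avg_word_length": sum(len(w) for w in words) / n,
        "punctuation_density": sum(ch in string.punctuation for ch in text) / len(text),
    }
    # Relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features[f"fw_{fw}"] = words.count(fw) / n
    return features

features = stylometric_features("The cat sat on the mat.")
```

With only six words, a single extra "the" would move its frequency from 2/6 to 3/7, which illustrates why these estimates are too noisy to trust on short texts.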

Signal 4 — Watermarking

The same story as image detection. Text watermarking biases the sampler toward a keyed pseudo-random pattern that a verifier holding the key can detect. Google DeepMind has published its scheme (SynthID-Text), OpenAI has said it has built one internally, and several academic designs exist. Properly implemented, a watermark detector has a near-zero false-positive rate on unwatermarked text.

Practical reality:

OpenAI has watermarking but does not deploy it on consumer ChatGPT (likely because of UX concerns).
Google's SynthID-Text ships in some Gemini products.
Most public detectors do not check watermarks (they don't have access to the model-side keys).
Watermark resilience to paraphrasing is limited — moderate rewording defeats most current schemes.
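The verifier side can be illustrated with a simplified version of the academic "green list" idea (Kirchenbauer et al., 2023). This is a sketch under stated assumptions, not the scheme used by any vendor: the key, the hash construction, the 50% green fraction, and the toy sampler that always picks a green token are all hypothetical.

```python
import hashlib
import math

GAMMA = 0.5               # expected fraction of "green" tokens in unwatermarked text
SECRET_KEY = b"demo-key"  # hypothetical key shared by generator and verifier

def is_green(prev_token: str, token: str) -> bool:
    """Keyed pseudo-random test: is `token` on the green list that follows `prev_token`?"""
    data = SECRET_KEY + b"|" + prev_token.encode() + b"|" + token.encode()
    return hashlib.sha256(data).digest()[0] < 256 * GAMMA

def watermark_z_score(tokens: list[str]) -> float:
    """Standard deviations by which the green count exceeds its chance expectation."""
    n = len(tokens) - 1  # number of (previous, current) pairs scored
    if n < 1:
        raise ValueError("need at least two tokens")
    green = sum(is_green(prev, cur) for prev, cur in zip(tokens, tokens[1:]))
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

def toy_watermarked_sequence(vocab: list[str], length: int) -> list[str]:
    """Toy sampler that always emits the first green candidate (a 'hard' watermark)."""
    seq = ["<s>"]
    for _ in range(length):
        seq.append(next(w for w in vocab if is_green(seq[-1], w)))
    return seq

vocab = [f"w{i}" for i in range(50)]
marked = toy_watermarked_sequence(vocab, 20)
z_marked = watermark_z_score(marked)
```

A verifier would flag text whose z-score exceeds a chosen cutoff. Because each token's colour depends on its predecessor, rewording even a fraction of the tokens breaks the scored pairs and pulls the z-score back toward zero, which is the paraphrasing weakness noted above. Without `SECRET_KEY`, a third-party detector cannot run the check at all.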

Signal 5 — Learned classifiers

The "throw a transformer at the problem" approach: train a binary classifier on a labeled corpus of human and AI text. Originality and Turnitin both ship learned components. The training data largely determines the verdict, and failures outside it can be severe:

A classifier trained on GPT-3.5 outputs is unreliable on GPT-4 outputs.
A classifier trained on essays mislabels code documentation.
A classifier trained on English mislabels translated text.

Worse, learned classifiers are vulnerable to distribution shift: the moment the underlying generator changes (new model release, new system prompt, new fine-tuning), accuracy drops sharply until the detector's training data catches up.

Why the detectors disagree

Three structural reasons:

Different signal weights

Vendors do not publish their exact feature weights, but the public picture is roughly this: GPTZero leans on perplexity + burstiness. Turnitin leans on stylometry + a learned classifier. Originality leans on a learned classifier + n-gram statistics. The same paragraph hits each weighting differently.
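A toy weighted ensemble shows the effect. Every number here is hypothetical: the signal values and the two weightings are invented, and neither corresponds to a real product.

```python
def ensemble_score(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-signal AI-likelihood scores, each in [0, 1]."""
    total = sum(weights.values())
    return sum(signals[name] * w for name, w in weights.items()) / total

# Hypothetical per-signal scores for one paragraph.
signals = {"perplexity": 0.9, "burstiness": 0.7, "stylometry": 0.3, "classifier": 0.4}

# Two invented weightings: one favours statistical signals, one favours learned ones.
weights_a = {"perplexity": 0.5, "burstiness": 0.5}
weights_b = {"stylometry": 0.5, "classifier": 0.5}

score_a = ensemble_score(signals, weights_a)
score_b = ensemble_score(signals, weights_b)
```

The same paragraph scores 0.8 under the first weighting and 0.35 under the second, before any threshold is applied.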

Different thresholds

A 60% AI-probability score becomes "AI" in one product and "uncertain" in another. The thresholds are tuned for the customer base. Turnitin (academic) says it tunes for a low false-positive rate and accepts more missed AI text as the price; a product sold to publishers, such as Originality, may make the opposite trade.

Different pre-processing

Some detectors strip markdown. Some include it. Some chunk by sentence; some by 256-token windows. These choices materially change every signal computation.
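The sketch below shows how two chunking strategies slice the same text differently. The helper names are hypothetical, tokens are approximated by whitespace-separated words rather than model tokens, and the markdown stripper handles only a few marker characters.

```python
import re

def strip_markdown(text: str) -> str:
    """Remove a few common markdown markers (emphasis, headings, inline code ticks)."""
    return re.sub(r"[*_`#]+", "", text)

def chunk_by_sentence(text: str) -> list[list[str]]:
    """One chunk per sentence, split after ., ! or ? followed by whitespace."""
    return [s.split() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def chunk_by_window(text: str, size: int = 256) -> list[list[str]]:
    """Fixed-size windows of whitespace tokens; the last window may be shorter."""
    tokens = text.split()
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

raw = "This is **bold**. Short one. And a third sentence here."
clean = strip_markdown(raw)

sentence_sizes = [len(chunk) for chunk in chunk_by_sentence(clean)]
window_sizes = [len(chunk) for chunk in chunk_by_window(clean, size=4)]
```

Sentence chunking yields chunks of 3, 2 and 5 words, while a 4-word window yields 4, 4 and 2. Any per-chunk statistic, such as the variance behind burstiness, is therefore computed over different units.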

The 2023 Liang et al. study ("GPT detectors are biased against non-native English writers") ran seven widely used detectors over TOEFL essays written by non-native speakers and essays written by US eighth-graders. On average, the detectors flagged more than half of the TOEFL essays as AI-generated while classifying the native-speaker essays almost perfectly. Every essay was human-written, so the detectors were responding to writing-style features, not authorship.

What humanization actually has to do

A good text humanizer doesn't just paraphrase. It restructures along the dimensions detectors measure:

Increase burstiness by mixing sentence lengths intentionally (a 24-word sentence next to a 6-word one)
Increase perplexity by occasionally choosing the second- or third-most-likely word at semantically safe positions
Add idiosyncratic punctuation — em-dashes, semicolons, parenthetical asides
Vary register — switch between formal and conversational at clause boundaries
Inject deliberate, mild redundancy — humans repeat for emphasis; LLMs avoid it

Naive paraphrasers (synonym substitution) make text more uniform and easier to detect. Real humanization makes text more variable across exactly the dimensions detectors measure. Our text humanizer is built around this principle: it operates on burstiness and perplexity directly, not on word substitution.

The non-native English problem

The most damning finding in the field: detectors disproportionately flag non-native English writing as AI. The mechanism is straightforward — non-native writers tend to have:

Lower perplexity (smaller working vocabulary)
Lower burstiness (more uniform sentence structures)
More predictable function-word patterns

These are exactly the features that make AI-generated text detectable. The detectors are not "biased" in any moral sense — they are correctly responding to the statistical features they were designed to find. The problem is that those features are not unique to AI authorship. They appear in any writing produced under linguistic constraint.

This is the strongest single argument against using text detectors for high-stakes adjudication.

Where the field is going

Three directions in 2026:

Provenance metadata: text watermarking with cryptographic verification. Unfortunately, incentives are misaligned, since model providers gain little from watermarking and lose user satisfaction.
Stylometric baselines per author: comparing a submission against an author's prior known writing. Effective in classrooms with longitudinal data, useless for one-shot screening.
Active probes: embedding invisible challenge tokens in prompts to detect downstream LLM use. Easy for humans to defeat, harder for LLM agents.

None of these solves the fundamental problem: distinguishing "AI-assisted human writing" from "lightly edited AI writing" is statistically intractable when the editing is competent.

Bottom line

AI text detection in 2026 is useful for triage and dangerous as adjudication. Every detector is a stack of signals with documented failure modes, especially on non-native English, formal writing, and short texts. The verdict is a probability — treat it as one.

If you ship text into a detection environment and need it to read as human, a real text humanizer operates on the right signals — burstiness, perplexity, idiosyncratic structure. If you build detectors, expose the contributing signals, never just the verdict. And if you make decisions based on a detector's output, require multiple detectors and human review for any consequential action.

Detection is a tool. Adjudication is a human job.

