The Most Trustworthy AI Output Is the Least Reliable
By Lovro Lucic
The Fabrication Problem · 4 of 4
The signals you use to judge whether AI output is trustworthy are the same signals fabrication produces.
More citations. More confidence. More specific numbers. More professional structure. Longer output with more detail. These are what make an AI response feel reliable. They're also what the model generates when it's fabricating, because fabrication has no constraint on specificity. Real data has limits. Fabricated data doesn't. The model can cite as many sources, produce as many numbers, and assert as confidently as the output requires. Sourced output has to work with what's available, which often means acknowledging gaps.
I tested this with blinded evaluation. One domain expert, six documents. Three AI outputs with real data sourced from named studies. Three with fabricated citations and invented numbers. I rated the fabricated versions as equivalent or more trustworthy. The fabricated versions cited more sources, used more specific numbers, and asserted more confidently. The sourced versions acknowledged limitations. The acknowledgment of limitations, which is the signal of honesty, cost the output credibility in my evaluation.
I was the evaluator. 90+ experiments in AI evaluation. I still couldn't distinguish sourced from fabricated based on the output alone.
The mechanism: RLHF trains models to produce output humans rate highly. Humans rate confident, well-cited, specific output highly. Fabrication produces all three without constraint because there's no external anchor. Sourced output is constrained by what the source actually says, which is often more limited, more qualified, and less impressive than what unconstrained generation produces. The training that makes AI output sound trustworthy is the same training that makes fabrication sound more trustworthy than truth.
This is measurable in the output itself. Programmatic measurement across 60 documents and 10 topics confirmed it: unsourced output produces 55 percent more citations, 57 percent more named entities, and a higher confidence-to-hedge ratio than sourced output. The one exception: sourced output has more precise decimal numbers, because real data has real decimals.
These are objective counts. No evaluator, no judgment, no bias. The fabricated output literally contains more of every trust signal except decimal precision.
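A minimal sketch of what counts like these can look like in Python. The regex patterns and word lists are illustrative assumptions, not the patterns in `trust_signals.py`, and named-entity counting is left out because a regex handles it poorly. The two sample sentences are invented for the demonstration.

```python
import re

# Illustrative patterns only (assumptions, not the article's actual regexes).
CITATION = re.compile(r"\([A-Z][A-Za-z-]+(?: et al\.)?,? \d{4}\)|\[\d+\]")
DECIMAL = re.compile(r"\b\d+\.\d+\b")
CONFIDENT = re.compile(
    r"\b(?:clearly|definitely|proves?|demonstrates?|confirms?|always|never)\b",
    re.IGNORECASE,
)
HEDGE = re.compile(
    r"\b(?:may|might|could|suggests?|appears?|possibly|likely|unclear|limited)\b",
    re.IGNORECASE,
)


def count_signals(text: str) -> dict:
    """Count surface trust signals in a text. No judgment, only pattern matches."""
    confident = len(CONFIDENT.findall(text))
    hedges = len(HEDGE.findall(text))
    return {
        "citations": len(CITATION.findall(text)),
        "decimals": len(DECIMAL.findall(text)),
        "confident": confident,
        "hedges": hedges,
        # +1 in the denominator so a text with zero hedges does not divide by zero.
        "confidence_to_hedge": confident / (hedges + 1),
    }


# Invented sample sentences, one hedged and one assertive.
hedged = "The study (Smith, 2021) suggests a 4.2 percent effect, but the sample may be limited."
assertive = "Research clearly proves this (Jones et al., 2019) [3]. It always works and definitely confirms the trend."

for label, text in [("hedged", hedged), ("assertive", assertive)]:
    print(label, count_signals(text))
```

The point of the sketch is that every value is a raw match count, so two people running it on the same text get the same numbers. Which patterns you choose still shapes what gets counted.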
An LLM evaluator, rating the same 60 documents blind, scored unsourced output higher in 7 of 10 topics. But that finding is circular: LLMs share the RLHF training that produces the trust signals being measured. An LLM rating confident, well-cited output as more trustworthy is the bias confirming itself, not an independent validation. The programmatic measurement is the real evidence. The LLM evaluation illustrates the mechanism.
The practical implication: the feeling that AI output is trustworthy is not evidence that it's correct. Especially when the output is detailed, well-cited, and confident. Those properties correlate with fabrication, not with accuracy. The outputs that deserve the most scrutiny are the ones that feel the most trustworthy. The ones that acknowledge limits and gaps are more likely to be honest, even though they feel less reliable.
The output you trusted most this week was probably the most fabricated. The one with the most citations, the most specific numbers, the most confident tone. Your trust signals are calibrated backward, and the calibration feels like judgment.
The test: Take the AI output you trust most from this week. Check every specific claim against actual sources. Count how many hold up. The gap between your trust and the verification is the calibration error.
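One way to put a number on that gap, as a sketch. The function name, the 0-to-1 trust scale, and the example figures are assumptions for illustration, not part of the experiments described here.

```python
def calibration_gap(trust: float, claims_checked: int, claims_held: int) -> float:
    """Return trust (0 to 1) minus the fraction of checked claims that held up.

    Positive means you trusted the output more than verification supports.
    """
    if not 0.0 <= trust <= 1.0:
        raise ValueError("trust must be between 0 and 1")
    if claims_checked <= 0:
        raise ValueError("check at least one claim")
    if not 0 <= claims_held <= claims_checked:
        raise ValueError("claims_held must be between 0 and claims_checked")
    return trust - claims_held / claims_checked


# Hypothetical example: you trusted the output at 0.9, checked 10 claims, 6 held up.
gap = calibration_gap(trust=0.9, claims_checked=10, claims_held=6)
print(round(gap, 2))
```

A gap near zero means your sense of trust tracked what the sources supported. A large positive gap is the inversion described above.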
Test this yourself
Take one AI output you trusted. Count the citations, specific numbers, and confident assertions. Now check three of them against the cited sources.
What survived testing
- Fabricated output was rated as trustworthy as sourced output, or more so, in blinded evaluation
- Citation count 55% higher, named entities 57% higher, confidence-to-hedge ratio higher in fabricated output (programmatic measurement, 60 documents)
- Sourced output penalized for acknowledging limitations
- Domain expertise did not protect against trust inversion
- One exception: sourced output has more precise decimal numbers (real data has real decimals)
What didn't survive
- "Trust signals are always inverted" is too strong. For content the reader produced themselves, verification is possible. The inversion applies to content the reader hasn't independently verified.
Honest limits
- Human evaluation used a single domain expert. The LLM evaluation pointed in the same direction, but LLMs share the same RLHF bias as the mechanism being tested. Human replication with multiple evaluators is the remaining gap.
- Programmatic measurement captures signal counts, not whether humans actually weight those signals as described. The correlation between signal presence and trust rating is not yet human-confirmed at scale.
- March 2026 models.
Audit the data yourself
The replication kit at [/receipts/trust-signals-are-inverted](/receipts/trust-signals-are-inverted) has the experiment script, the 60 documents with per-document signal counts and blinded trust scores, and the regex patterns that count citations, named entities, and confidence-to-hedge ratios (`trust_signals.py` lines 154-200). The 55 percent and 57 percent numbers can be re-derived from `trust_signals_results.json` without API access; the patterns and prompt templates are reusable verbatim.
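Re-deriving a "percent more" figure from per-document counts is a comparison of group means. The sketch below shows the arithmetic on made-up records. The record layout (`condition`, `citations`) is a hypothetical stand-in, since the schema of `trust_signals_results.json` is not described here, and the toy numbers are not the experiment's data.

```python
from statistics import mean


def percent_more(records, signal, base="sourced", other="unsourced"):
    """Percent by which the mean of `signal` in `other` exceeds the mean in `base`."""
    base_mean = mean(r[signal] for r in records if r["condition"] == base)
    other_mean = mean(r[signal] for r in records if r["condition"] == other)
    if base_mean == 0:
        raise ValueError("base mean is zero; percent difference is undefined")
    return 100.0 * (other_mean - base_mean) / base_mean


# Toy records with a hypothetical layout; adapt the keys to the real file.
toy = [
    {"condition": "sourced", "citations": 4},
    {"condition": "sourced", "citations": 6},
    {"condition": "unsourced", "citations": 7},
    {"condition": "unsourced", "citations": 8},
]

print(percent_more(toy, "citations"))  # 50.0 for these toy numbers
```

To run it against the real results, load the file with `json.load` and map its fields onto whatever keys the function expects.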