Experiment · Tested March 2026 · 2 min

The Most Trustworthy AI Output Is the Least Reliable

The Fabrication Problem · 4 of 4

The signals you use to judge whether AI output is trustworthy are the same signals fabrication produces.

More citations. More confidence. More specific numbers. More professional structure. Longer output with more detail. These are what make an AI response feel reliable. They're also what the model generates when it's fabricating, because fabrication has no constraint on specificity. Real data has limits. Fabricated data doesn't. The model can cite as many sources, produce as many numbers, and assert as confidently as the output requires. Sourced output has to work with what's available, which often means acknowledging gaps.

Tested this with blinded evaluation. Three AI outputs with real data sourced from named studies, three with fabricated citations and invented numbers. The reader rated the fabricated versions as trustworthy as the sourced ones, or more so. The fabricated versions cited more sources, used more specific numbers, and asserted more confidently. The sourced versions acknowledged limitations, and that acknowledgment, the signal of honesty, cost them credibility in the reader's evaluation.
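
A minimal sketch of the blinding step, in Python, with hypothetical document names. The point is procedural: the rater sees only neutral IDs, and conditions are revealed only after every rating is recorded.

import random

# Hypothetical document set: three sourced, three fabricated.
labels = {
    "doc_1": "sourced", "doc_2": "sourced", "doc_3": "sourced",
    "doc_4": "fabricated", "doc_5": "fabricated", "doc_6": "fabricated",
}

order = list(labels)
random.shuffle(order)  # randomize presentation order

# Neutral IDs hide the condition from the rater.
blind_id = {doc: f"item_{i}" for i, doc in enumerate(order, 1)}

ratings = {}
for doc in order:
    ratings[doc] = float(input(f"Trustworthiness of {blind_id[doc]} (1-7): "))

# Unblind only after all six ratings are in.
for doc in order:
    print(blind_id[doc], labels[doc], ratings[doc])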

The reader was a domain expert with 87 experiments in AI evaluation. They still couldn't distinguish sourced from fabricated based on the output alone.

The mechanism: RLHF trains models to produce output humans rate highly. Humans rate confident, well-cited, specific output highly. Fabrication produces all three without constraint because there's no external anchor. Sourced output is constrained by what the source actually says, which is often more limited, more qualified, and less impressive than what unconstrained generation produces. The training that makes AI output sound trustworthy is the same training that makes fabrication sound more trustworthy than truth.
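
A toy illustration of that asymmetry (not the actual reward model, and the weights are invented): a preference score that rewards citation count and specificity and penalizes hedging. Sourced output is capped on every term by what the source supports; fabricated output is not.

def preference_score(citations: int, exact_numbers: int, hedges: int) -> float:
    # Hypothetical hand-set weights; real reward models learn these.
    return 1.0 * citations + 0.5 * exact_numbers - 0.8 * hedges

# Sourced output: limited citations, qualified numbers, honest hedges.
print(preference_score(citations=2, exact_numbers=3, hedges=4))   # 0.3
# Fabricated output: every term grows to whatever the argument needs.
print(preference_score(citations=8, exact_numbers=12, hedges=0))  # 14.0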

This is measurable in the output itself. AI-generated citations are more numerous (the model cites as many as the argument needs). AI-generated numbers are more precise (the model generates exact values, not ranges). AI-generated conclusions are more confident (no hedging, no "this is unclear"). Each of these properties is what trust evaluation rewards. Each is more prevalent in fabricated output than in sourced output.
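
Counting these properties is straightforward to automate. A rough sketch follows; the regexes are crude approximations of "citation" and "exact value", and the hedge list is illustrative, not exhaustive.

import re

HEDGES = ("may", "might", "unclear", "limited", "approximately", "uncertain")

def trust_signals(text: str) -> dict:
    # Parenthetical author-year citations, e.g. "(Smith et al., 2020)".
    citations = len(re.findall(r"\([A-Z][A-Za-z]+(?: et al\.)?,? \d{4}\)", text))
    # Exact decimal values (fabrication favors precise points over ranges).
    exact_numbers = len(re.findall(r"\b\d+\.\d+%?", text))
    # Hedging vocabulary, the marker sourced output gets penalized for.
    hedges = sum(text.lower().count(h) for h in HEDGES)
    return {"citations": citations, "exact_numbers": exact_numbers, "hedges": hedges}

print(trust_signals("Accuracy rose 12.4% (Smith et al., 2020), though data may be limited."))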

The practical implication: the feeling that AI output is trustworthy is not evidence that it's correct. Especially when the output is detailed, well-cited, and confident. Those properties correlate with fabrication, not with accuracy. The outputs that deserve the most scrutiny are the ones that feel the most trustworthy. The ones that acknowledge limits and gaps are more likely to be honest, even though they feel less reliable.

Test this yourself

Take one AI output you trusted. Count the citations, specific numbers, and confident assertions. Now check three of them against the cited sources.
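
One way to keep the spot check honest, sketched below with hypothetical claims: pick the three at random, so you don't drift toward the claims that are easiest to confirm.

import random

# Hypothetical claims extracted from the output you trusted.
claims = [
    "Citation: named study, 2019",
    "Number: 42% improvement",
    "Assertion: 'this effect is well established'",
    "Number: effect size 0.31",
]

# Random selection removes the temptation to verify only the easy ones.
for claim in random.sample(claims, k=min(3, len(claims))):
    print("Check against the cited source:", claim)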

Evidence: MEDIUM-HIGH. Trust inversion replicated across two separate tests (usefulness bridge + ground-truth check). The mechanism (RLHF trains for human preference, fabrication maximizes human preference signals) is structurally grounded. N=1 evaluator.

What survived testing

  • Fabricated output rated as trustworthy as, or more trustworthy than, sourced output (blinded)
  • Citation count, confidence level, and specificity all higher in fabricated output
  • Sourced output penalized for acknowledging limitations
  • Domain expertise did not protect against trust inversion

What didn't survive

  • "Trust signals are always inverted" too strong. For content the reader produced themselves, verification is possible. The inversion applies to content the reader hasn't independently verified.

Honest limits

  • N=1 evaluator (domain expert, not naive reader). The inversion may differ for different expertise levels.
  • 6 documents in usefulness bridge (small sample).
  • The RLHF mechanism is inferred from training design, not experimentally isolated.
Full record
Source
Usefulness bridge (6 documents, 3 source-present / 3 source-absent, blinded evaluation, N=1 domain expert). Ground-truth check (23 claims verified, 2 correct, both widely-cited common knowledge). Trust heuristic analysis. RLHF confirmation bias (structural to training process).
Context
Testing whether domain expertise protects against trust signals. It doesn't. The trust heuristic rewards what fabrication produces.
AI Dimension
STRUCTURAL
Status
EVIDENCE TRANSMISSION
