Research
80+ experiments on how AI actually works. What survived testing. What didn't. Honest limits.
The Fabrication Problem
85% of AI numbers are fabricated → the fix → why self-checking fails → trust signals are backwards
Answered: Source material collapses fabrication from 85% to under 2%. Prohibition outperforms monitoring 5×. Self-checking fails because the same process generates and evaluates. Trust signals (citations, confidence, specificity) run higher in fabricated output than in sourced output.
Open: Does the fix work beyond reformulation tasks? Reasoning shows improvement (75% vs 38%), but strategy and creative tasks remain untested.
The Fabrication Architecture
Most AI-generated numbers are fabricated (77-100% temporally unstable; the measurement is sketched below). Better prompts reduce claim volume but don't fix the fabrication rate. Source material does: it drops unsourced numbers to single digits.
What broke: The most constrained prompt produced the highest fabrication rate (90.7% vs 85.8%).
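The instability figure is operational, so it helps to see the measurement. A minimal sketch, assuming a hypothetical `ask(prompt)` callable that returns one model response as text; the percentages above come from the author's experiments, not from this code.

```python
import re
from collections import Counter

NUM = re.compile(r"\d+(?:\.\d+)?%?")  # plain numbers and percentages

def numbers_in(text: str) -> set[str]:
    """Extract the numeric claims made in one response."""
    return set(NUM.findall(text))

def temporal_instability(ask, prompt: str, runs: int = 5) -> float:
    """Fraction of numeric claims that fail to recur across repeated runs.

    `ask` is any callable that sends `prompt` to a model and returns its text.
    A number appearing in only one of `runs` responses counts as unstable:
    the model will not reproduce it, which is the signature of fabrication.
    """
    counts = Counter()
    for _ in range(runs):
        counts.update(numbers_in(ask(prompt)))
    total = len(counts)
    return (sum(1 for c in counts.values() if c == 1) / total) if total else 0.0
```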
Source Conditioning
Most AI-generated numbers are unsourced. Three steps drop unsourced numbers to single digits; one plausible version is sketched below. No prompt tricks. A workflow change. (Commensurable: 46% unsourced → 8% with source material, same measurement construct.)
What broke: Source material fixes the data layer only. Vocabulary, conclusions, reasoning stay at baseline.
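The three steps aren't enumerated on this page, so the sketch below is one plausible shape of the workflow rather than the author's exact procedure: attach the source, restrict numbers to it, verify the output against it. `conditioned_prompt` and `unsourced_numbers` are illustrative names.

```python
import re

NUM = re.compile(r"\d+(?:\.\d+)?%?")

def conditioned_prompt(question: str, source: str) -> str:
    # Steps 1-2: put the source in front of the model and restrict numbers to it.
    return (
        "Use ONLY numbers that appear in the source below. "
        "If the source has no figure for something, say so instead of estimating.\n\n"
        f"SOURCE:\n{source}\n\nTASK:\n{question}"
    )

def unsourced_numbers(answer: str, source: str) -> set[str]:
    # Step 3: flag every number in the answer that the source does not contain.
    return set(NUM.findall(answer)) - set(NUM.findall(source))
```

Note that the verification step is deterministic string comparison, not another model call; that distinction is the subject of the next finding.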
The Self-Check Illusion
Built quality gates for AI agents. They passed their own checks. The output was wrong. The same process that made the error reviews the error.
What broke: “Ask AI to check its own work” fails because the same process generates and evaluates.
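A sketch of the two gate designs, assuming the same hypothetical `ask` callable as above and reusing `unsourced_numbers` from the previous sketch. The difference is structural: the first gate shares the generator's blind spots, the second is anchored outside it.

```python
def self_check_gate(ask, draft: str) -> bool:
    """The pattern that failed: the generator grades its own output."""
    verdict = ask(f"Review this for errors. Reply PASS or FAIL.\n\n{draft}")
    return "PASS" in verdict  # same process generates and evaluates

def external_gate(draft: str, source: str) -> bool:
    """A gate grounded outside the generator: deterministic, source-anchored."""
    return not unsourced_numbers(draft, source)  # from the sketch above
```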
Trust Signals Are Inverted
The signals you use to judge AI output (citations, confidence, specificity) are the same signals fabrication produces. Sourced output acknowledges limits. Fabrication doesn't. The trust heuristic is backwards.
What broke: A domain expert with 80+ experiments in AI evaluation couldn’t distinguish sourced from fabricated output.
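"Trust signals" can be made concrete by counting surface markers of citation, confidence, and specificity. The regexes below are an assumed operationalization, not the study's instrument; per the finding above, a high count with few hedges is evidence of nothing.

```python
import re

TRUST_SIGNALS = {
    # Surface markers readers commonly treat as evidence of reliability.
    "citations":   re.compile(r"\(\w+(?: et al\.)?,? \d{4}\)|\[\d+\]"),
    "confidence":  re.compile(r"\b(clearly|definitively|precisely|exactly)\b", re.I),
    "specificity": re.compile(r"\d+(?:\.\d+)?%"),
}
HEDGES = re.compile(r"\b(may|might|approximately|unclear|limited|uncertain)\b", re.I)

def signal_profile(text: str) -> dict[str, int]:
    """Count trust signals and hedges in a response. Per the finding above,
    the high-signal, low-hedge profile is what fabrication looks like."""
    profile = {name: len(rx.findall(text)) for name, rx in TRUST_SIGNALS.items()}
    profile["hedges"] = len(HEDGES.findall(text))
    return profile
```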
The "It Depends" Problem
Same instruction, opposite results → specificity is the lever → the measurement itself was wrong
Answered: Constraints produce opposite effects depending on task type (d=2.34 in convergent tasks, harmful in exploratory ones). Negation alone is null; specificity provides the destination. The strongest specificity effect (d=2.34) was three confounds stacked; the honest magnitude is d=1.37.
Open: Where exactly is the boundary between convergent and exploratory tasks? Domain experts can’t distinguish specific from generic output on quality; specificity changes form, not substance.
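For readers outside statistics: the d values quoted above are Cohen's d, the distance between two group means in units of their pooled standard deviation (0.2 is conventionally small, 0.5 medium, 0.8 large, so both 2.34 and 1.37 are very large). A standard computation, not tied to this study's data:

```python
from statistics import mean, stdev

def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d with pooled standard deviation: (mean(a) - mean(b)) / s_pooled."""
    na, nb = len(a), len(b)
    s_pooled = (((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                / (na + nb - 2)) ** 0.5
    return (mean(a) - mean(b)) / s_pooled
```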