What didn't hold up
These findings were tested and didn't hold as stated. Some were killed outright. Some were narrowed in scope. Some were corrected in magnitude. Each links to the full evidence.
The Most Trustworthy AI Output Is the Least Reliable
- "Trust signals are always inverted" too strong. For content the reader produced themselves, verification is possible. The inversion applies to content the reader hasn't independently verified.
The Strongest Finding Was Wrong
- d=2.34 as the headline effect size overstated (specificity and length stacked; honest range d=1.37-1.65)
- "Strongest effect in 87 experiments" overstated (still large, but not as large as claimed)
- Clean separation at the density level: quality demands show d=1.19 on the density measure vs d=0.11 on raw counts; density partially conflates specificity with shorter output length (see the sketch below)
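To make the raw-vs-density gap concrete, here is a minimal sketch of Cohen's d with a pooled standard deviation, run on hypothetical per-response measurements (illustrative numbers, not the study data). It shows how a per-word density measure can report a much larger d than raw counts when one condition also shortens the output.

```python
import statistics

def cohens_d(a, b):
    """Cohen's d using a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

# Hypothetical responses as (count of specific claims, word count).
# Illustrative only -- not data from the experiments.
baseline = [(4, 220), (5, 240), (4, 200), (6, 260)]
quality  = [(4, 120), (5, 130), (5, 110), (6, 140)]  # "quality demand" condition

raw_b, raw_q = [s for s, _ in baseline], [s for s, _ in quality]
den_b = [s / w for s, w in baseline]
den_q = [s / w for s, w in quality]

# Raw specificity barely moves, but specifics-per-word jumps because the
# quality condition also produces shorter output: length and specificity
# are conflated in the density measure.
print(f"raw d     = {cohens_d(raw_q, raw_b):.2f}")
print(f"density d = {cohens_d(den_q, den_b):.2f}")
```

On these toy numbers the raw d is small and the density d is large purely because word count drops, which is the conflation flagged above.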
Why AI Can't Check Its Own Work
- "Self-check is useless" too strong. Catches formatting and surface errors.
- "Multiple passes always help" killed. Iteration without independence circles.
How to Stop AI from Making Up Numbers
- "Source material fixes everything" killed. Vocabulary, conclusions, reasoning stay at baseline.
- "Source format matters" killed. Structured and narrative produce equivalent results (0.7% vs 0.9%).
Most AI Numbers Are Fabricated
- "100% fabrication is universal" killed. xAI shows 77% with topic-dependent retrieval
- "PROTOCOL fixes fabrication" killed. Highest fabrication rate of all conditions
- "Source grounding fixes everything above data" partially killed. Vocabulary and causal framing stay at baseline for same-topic regeneration. BUT on reasoning tasks with ground truth (5 domains, 20 docs): source-present gt_hit_rate 75% vs source-absent 38% (d=1.38, CI [0.61, 2.81]). Source grounding improves correctness on reasoning tasks, not just reformulation. Wrong conclusions: present 0.1 vs absent 0.5 (d=-0.92).