Research

80+ experiments on how AI actually works. What survived testing. What didn't. Honest limits.

The Fabrication Problem

85% of AI numbers are fabricated → the fix → why self-checking fails → trust signals are backwards

Answered: Source material collapses fabrication from 85% to under 2%. Prohibition outperforms monitoring 5×. Self-checking fails because the same process both generates and evaluates. Trust signals (citations, confidence, specificity) are higher in fabricated output than in sourced output. (A minimal sketch of the fix follows the open question below.)

Open: Does the fix work beyond reformulation tasks? Reasoning shows improvement (75% vs 38%), but strategy and creative tasks are untested.
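The pattern behind the fix, sketched as a hypothetical prompt builder. The function names and prompt wording are illustrative assumptions, not the exact protocol from these experiments; the point is that the prohibition is stated before generation rather than checked afterwards by the same process.

```python
# Illustrative sketch of grounding plus prohibition. Names and wording
# are assumptions, not the study's exact protocol.

def ungrounded_prompt(question: str) -> str:
    # Baseline condition: nothing anchors the numbers the model produces.
    return f"Answer the following question:\n{question}"

def grounded_prompt(question: str, source: str) -> str:
    # The fix: supply source material and prohibit unsourced figures up
    # front, instead of asking the model to monitor or self-check after
    # generation (self-checking fails: the same process generates and
    # evaluates).
    return (
        "Use ONLY the source material below. Do not state any number, "
        "statistic, or citation that does not appear in it. If the source "
        "does not contain the answer, say so explicitly.\n\n"
        f"SOURCE:\n{source}\n\n"
        f"QUESTION:\n{question}"
    )
```

The design choice worth noting: the constraint lives inside the generation context, so it shapes output directly instead of relying on a second pass by the process that produced the fabrication in the first place.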

The "It Depends" Problem

Same instruction, opposite results → specificity is the lever → the measurement itself was wrong

Answered: Constraints produce opposite effects depending on task type: helpful on convergent tasks (d=2.34), harmful on exploratory ones. Negation alone is null; specificity provides the destination. The strongest specificity effect (d=2.34) was three confounds stacked; the honest magnitude is d=1.37. (On how to read d, see the note at the end of this section.)

Open: Where exactly is the boundary between convergent and exploratory tasks? Domain experts can't distinguish specific from generic output on quality; specificity changes form, not substance.
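A note on reading the effect sizes above, assuming d denotes Cohen's d (the section doesn't name the statistic): the standardized difference between two group means, in units of pooled standard deviation.

```latex
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}
```

Under Cohen's conventional thresholds (0.2 small, 0.5 medium, 0.8 large), d=1.37 is still a large effect: unstacking the confounds shrinks the headline number without reversing the direction of the finding.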