What holds up when you test AI.
100+ experiments on AI and human judgment.
A working record of experiments. Each finding ships with what survived testing, what didn't, and how to verify it yourself.
The record
Every finding lists what survived testing, what testing killed, and honest limits on scope. See the record →
The Fabrication Problem
Most AI numbers are fabricated. Source material fixes it. Self-checking fails. Trust signals are backwards.
Answered Source material collapses unsourced numbers to single digits. Prohibition outperforms monitoring 5x. Self-checking fails because the same process generates and evaluates. Trust signals (citations, confidence, specificity) are higher in fabricated output than sourced output.
Open Does the fix work beyond reformulation tasks? Reasoning shows improvement (75% vs 38%), but strategy and creative untested.
Most AI Numbers Are Fabricated
77 to 100 percent of AI-generated numbers are temporally unstable. Source material fixes it. Prompts don't.
What broke: The most constrained prompt produced the highest fabrication rate (90.7% vs 85.8%).
How to Stop AI from Making Up Numbers
Source material drops AI fabrication from 85% to single digits. Three steps.
What broke: Source material fixes the data layer only. Vocabulary, conclusions, reasoning stay at baseline.
Why AI Can't Check Its Own Work
The agent reported clean. The output was wrong. Same process generating and evaluating.
What broke: "Ask AI to check its own work" fails because the same process generates and evaluates.
The Most Trustworthy AI Output Is the Least Reliable
The signals you use to judge AI trustworthiness are the same signals fabrication produces.
What broke: A domain expert with 80+ experiments in AI evaluation couldn't distinguish sourced from fabricated output.
The Evaluation Problem
Judgment goes quiet. You can't see the gaps. Satisfaction is the trap. Stronger evaluators discriminate less.
Answered Evaluation capacity erodes invisibly under sustained AI delegation. Satisfaction suppresses evaluation (d=1.20). The strongest evaluators discriminate least (d=0.00 vs d=4.11).
Open Structural countermeasures: does monthly independent practice actually prevent degradation? Longitudinal evidence is inconclusive.
Why Experts Miss What Beginners Catch
Generation builds the mental model that makes evaluation possible. Without it, evaluation collapses to surface features.
What broke: Domain expertise alone does not enable evaluation. Production expertise does. Cold evaluation collapses to surface features.
The Decision That Was Never Made
AI resolved the uncertainty before your own thinking had time to finish. Resolution and decision are different things.
What broke: "AI advice is always worse than human advice" too strong. The content may be equivalent. The speed and confidence change the cognitive process.
The "It Depends" Problem
Same instruction, opposite results. Specificity is the lever. Context redirects, not informs. The measurement itself was wrong.
Answered Constraints produce opposite effects depending on task type (d=2.34 convergent, harmful in exploratory). Negation alone is null; specificity provides the destination. The strongest specificity effect (d=2.34) was three confounds stacked; honest magnitude is d=1.37.
Open Where exactly is the boundary between convergent and exploratory tasks? Domain experts can't distinguish specific from generic output on quality. Specificity changes form, not substance.
Same Technique, Opposite Results
The structured approach that produced precision on convergent problems actively harmed exploratory ones.
What broke: The same structural instruction that helped convergent tasks harmed exploratory tasks.
More Context Barely Helps
Adding information to an already-thorough prompt produced near zero improvement. Three constraint sentences changed everything.
What broke: Adding information to an already-thorough prompt produced near zero improvement (d=0.19). Three constraint sentences changed everything.
The Most-Cited Finding Was Wrong
The most-cited effect across 80+ experiments was three effects stacked. Honest magnitude: 40% smaller.
What broke: The strongest effect in 80+ experiments (d=2.34) was three effects stacked. Honest magnitude: d=1.37.
The "What You Think Works" Problem
Temporal decay is a myth. Self-critique circles. Constraints narrow. Quality ceiling per mode.
The Model Mechanics
Context shapes the output more than which model produces it. The layers behind the model are mostly invisible. Different errors need different solutions.
Answered Context determines whether behaviors occur. The model adjusts how they express. Six or more distinct failure types need different responses.
Open The ordering (context > model choice) is likely structural, but specific ratios will shift. Tested on 2 model families. Broader replication needed.
The Model Is Rarely the Variable
The context determined whether behaviors existed at all. The model adjusted the volume.
What broke: The specific ratio is experiment-specific. The ordering (prompt > model) is robust, but the magnitude varies.
Four Layers Produce Every AI Output
Four layers produce every AI output. The company's system. Your system. Your prompt. The model. The model is the only one with a name.
What broke: Behaviors attributed to "the model" turned out to be software sitting between the user and the model.
Stop Calling It Hallucination
Hallucination is six or more distinct failure modes. Different mechanisms. Different solutions. Name the type first.
How AI Makes You More Wrong With More Analysis
Five hours of analysis, increasingly sharp, increasingly wrong. The frame amplifies. The reframe is the fix, not the analysis.
Frame Check
Drop any document in. See which analytical perspectives it covers, which it skips, the voice, what evidence backs each numerical claim. Free, open source, useful from the first paste.
Mirror Practices
The AI is the instrument. The person is the subject. Short self-experiments you can run on yourself, grounded in specific experimental findings.
Your Body Reads AI Output Before You Do
180 trials. The same circuits fire on AI disagreement as on human. The first read happens in your body before your conscious evaluation gets a chance. Speed is what hides it.
What Your Body Does When AI Disagrees
Ask AI to oppose you. Four observable signals reveal whether you are evaluating or defending. The same defensive circuits fire on AI opposition as on human.
What broke: Exposure to opposing views does not improve decisions. Without metabolic capacity to hold opposition, exposure produces rebuttal, not update.
Satisfaction Turns Our Evaluator Off
You notice when AI argues with you. You do not notice when AI confirms you. Confirmation has no signature, and the evaluator that would have caught the issue turns off. Satisfaction is the trap.
You Can Only Evaluate What You Could Produce
You defend AI-shaped conclusions you cannot rebuild. The ten-minute test reveals which parts of your work are yours. The discipline that compounds: choose what to own, delegate what AI can verify.