Research
80+ experiments on how AI actually works. What survived testing. What didn't. Honest limits.
The Fabrication Problem
85% of AI numbers are fabricated → the fix → why self-checking fails → trust signals are backwards
Answered: Source material collapses fabrication from 85% to under 2%. Prohibition outperforms monitoring 5×. Self-checking fails because the same process generates and evaluates. Trust signals (citations, confidence, specificity) run higher in fabricated output than in sourced output.
Open: Does the fix work beyond reformulation tasks? Reasoning shows improvement (75% vs 38%), but strategy and creative tasks remain untested.
The Fabrication Architecture
Most AI-generated numbers are fabricated (77-100% temporally unstable; the measurement is sketched below). Better prompts reduce claim volume but don't fix the fabrication rate. Source material does: it drops unsourced numbers to single digits.
What broke: The most constrained prompt produced the highest fabrication rate (90.7% vs 85.8%).
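The instability figure is operational, so it helps to see the measurement. A minimal sketch, assuming a hypothetical `ask(prompt)` callable that returns one model response as text; the percentages above come from the author's experiments, not from this code.

```python
import re
from collections import Counter

NUM = re.compile(r"\d+(?:\.\d+)?%?")  # plain numbers and percentages

def numbers_in(text: str) -> set[str]:
    """Extract the numeric claims made in one response."""
    return set(NUM.findall(text))

def temporal_instability(ask, prompt: str, runs: int = 5) -> float:
    """Fraction of numeric claims that fail to recur across repeated runs.

    `ask` is any callable that sends `prompt` to a model and returns its text.
    A number appearing in only one of `runs` responses counts as unstable:
    the model will not reproduce it, which is the signature of fabrication.
    """
    counts = Counter()
    for _ in range(runs):
        counts.update(numbers_in(ask(prompt)))
    total = len(counts)
    return (sum(1 for c in counts.values() if c == 1) / total) if total else 0.0
```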
Source Conditioning
Most AI-generated numbers are unsourced. Three steps drop unsourced numbers to single digits; one plausible version is sketched below. No prompt tricks. A workflow change. (Commensurable: 46% unsourced → 8% with source material, same measurement construct.)
What broke: Source material fixes the data layer only. Vocabulary, conclusions, reasoning stay at baseline.
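The three steps aren't enumerated on this page, so the sketch below is one plausible shape of the workflow rather than the author's exact procedure: attach the source, restrict numbers to it, verify the output against it. `conditioned_prompt` and `unsourced_numbers` are illustrative names.

```python
import re

NUM = re.compile(r"\d+(?:\.\d+)?%?")

def conditioned_prompt(question: str, source: str) -> str:
    # Steps 1-2: put the source in front of the model and restrict numbers to it.
    return (
        "Use ONLY numbers that appear in the source below. "
        "If the source has no figure for something, say so instead of estimating.\n\n"
        f"SOURCE:\n{source}\n\nTASK:\n{question}"
    )

def unsourced_numbers(answer: str, source: str) -> set[str]:
    # Step 3: flag every number in the answer that the source does not contain.
    return set(NUM.findall(answer)) - set(NUM.findall(source))
```

Note that the verification step is deterministic string comparison, not another model call; that distinction is the subject of the next finding.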
The Self-Check Illusion
Built quality gates for AI agents. They passed their own checks. The output was wrong. The same process that made the error reviews the error.
What broke: “Ask AI to check its own work” fails because the same process generates and evaluates.
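A sketch of the two gate designs, assuming the same hypothetical `ask` callable as above and reusing `unsourced_numbers` from the previous sketch. The difference is structural: the first gate shares the generator's blind spots, the second is anchored outside it.

```python
def self_check_gate(ask, draft: str) -> bool:
    """The pattern that failed: the generator grades its own output."""
    verdict = ask(f"Review this for errors. Reply PASS or FAIL.\n\n{draft}")
    return "PASS" in verdict  # same process generates and evaluates

def external_gate(draft: str, source: str) -> bool:
    """A gate grounded outside the generator: deterministic, source-anchored."""
    return not unsourced_numbers(draft, source)  # from the sketch above
```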
Trust Signals Are Inverted
The signals you use to judge AI output (citations, confidence, specificity) are the same signals fabrication produces. Sourced output acknowledges limits. Fabrication doesn't. The trust heuristic is backwards.
What broke: A domain expert with 80+ experiments in AI evaluation couldn’t distinguish sourced from fabricated output.
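"Trust signals" can be made concrete by counting surface markers of citation, confidence, and specificity. The regexes below are an assumed operationalization, not the study's instrument; per the finding above, a high count with few hedges is evidence of nothing.

```python
import re

TRUST_SIGNALS = {
    # Surface markers readers commonly treat as evidence of reliability.
    "citations":   re.compile(r"\(\w+(?: et al\.)?,? \d{4}\)|\[\d+\]"),
    "confidence":  re.compile(r"\b(clearly|definitively|precisely|exactly)\b", re.I),
    "specificity": re.compile(r"\d+(?:\.\d+)?%"),
}
HEDGES = re.compile(r"\b(may|might|approximately|unclear|limited|uncertain)\b", re.I)

def signal_profile(text: str) -> dict[str, int]:
    """Count trust signals and hedges in a response. Per the finding above,
    the high-signal, low-hedge profile is what fabrication looks like."""
    profile = {name: len(rx.findall(text)) for name, rx in TRUST_SIGNALS.items()}
    profile["hedges"] = len(HEDGES.findall(text))
    return profile
```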
The "It Depends" Problem
Same instruction, opposite results → specificity is the lever → the measurement itself was wrong
Answered: Constraints produce opposite effects depending on task type (d=2.34 in convergent tasks, harmful in exploratory ones). Negation alone is null; specificity provides the destination. The strongest specificity effect (d=2.34) was three confounds stacked; the honest magnitude is d=1.37.
Open: Where exactly is the boundary between convergent and exploratory tasks? Domain experts can’t distinguish specific from generic output on quality; specificity changes form, not substance.
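For readers outside statistics: the d values quoted above are Cohen's d, the distance between two group means in units of their pooled standard deviation (0.2 is conventionally small, 0.5 medium, 0.8 large, so both 2.34 and 1.37 are very large). A standard computation, not tied to this study's data:

```python
from statistics import mean, stdev

def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d with pooled standard deviation: (mean(a) - mean(b)) / s_pooled."""
    na, nb = len(a), len(b)
    s_pooled = (((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                / (na + nb - 2)) ** 0.5
    return (mean(a) - mean(b)) / s_pooled
```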