Skip to content

What holds up when you test AI.

100+ experiments on AI and human judgment.

A working record of experiments. Each finding ships with what survived testing, what didn't, and how to verify it yourself.

The record

Every finding lists what survived testing, what testing killed, and honest limits on scope. See the record →

The Fabrication Problem

Most AI numbers are fabricated. Source material fixes it. Self-checking fails. Trust signals are backwards.

Answered Source material collapses unsourced numbers to single digits. Prohibition outperforms monitoring 5x. Self-checking fails because the same process generates and evaluates. Trust signals (citations, confidence, specificity) are higher in fabricated output than sourced output.

Open Does the fix work beyond reformulation tasks? Reasoning shows improvement (75% vs 38%), but strategy and creative untested.

The Evaluation Problem

Judgment goes quiet. You can't see the gaps. Satisfaction is the trap. Stronger evaluators discriminate less.

Answered Evaluation capacity erodes invisibly under sustained AI delegation. Satisfaction suppresses evaluation (d=1.20). The strongest evaluators discriminate least (d=0.00 vs d=4.11).

Open Structural countermeasures: does monthly independent practice actually prevent degradation? Longitudinal evidence is inconclusive.

The "It Depends" Problem

Same instruction, opposite results. Specificity is the lever. Context redirects, not informs. The measurement itself was wrong.

Answered Constraints produce opposite effects depending on task type (d=2.34 convergent, harmful in exploratory). Negation alone is null; specificity provides the destination. The strongest specificity effect (d=2.34) was three confounds stacked; honest magnitude is d=1.37.

Open Where exactly is the boundary between convergent and exploratory tasks? Domain experts can't distinguish specific from generic output on quality. Specificity changes form, not substance.

The "What You Think Works" Problem

Temporal decay is a myth. Self-critique circles. Constraints narrow. Quality ceiling per mode.

The Model Mechanics

Context shapes the output more than which model produces it. The layers behind the model are mostly invisible. Different errors need different solutions.

Answered Context determines whether behaviors occur. The model adjusts how they express. Six or more distinct failure types need different responses.

Open The ordering (context > model choice) is likely structural, but specific ratios will shift. Tested on 2 model families. Broader replication needed.

Mirror Practices

The AI is the instrument. The person is the subject. Short self-experiments you can run on yourself, grounded in specific experimental findings.