What didn't hold up
These are hypotheses I held going into each experiment. The data killed them before the finding was published. They're part of each finding, not against it.
If a claim is ever corrected or withdrawn afterpublication, that's a retraction and shows up separately on the record.
You Can Only Evaluate What You Could Produce
- "Always generate first" as universal prescription, and its mirror, "delegate everything to AI." Both miss the discrimination move. The practice is generating what compounds for you and delegating what does not.
- "Borrowed is bad." Borrowed is fine when labeled. The failure is borrowed-mistaken-for-yours.
- Anchoring risk on generate-first (Tversky and Kahneman 1974) is real and unresolved at the controlled-test level. The mitigation: use the construction trace for structural evaluation (framing, completeness, what's missing), not content comparison.
Satisfaction Turns Our Evaluator Off
- "Be more critical" as the move. Telling yourself to be more critical does not work because the attachment to the frame is already running by the time the instruction fires.
- The strong claim that the data stance enhances detection beyond neutral framing. EXP-075 enhancement direction was killed (d=-0.26). The data stance works through suppression of the communication frame, not enhancement above baseline.
- The claim that satisfaction lets fabricated items slip past detection. EXP-091 planted-fabrication test found 60 of 60 detections regardless of whether the evaluator was satisfied. The trap operates at the holistic preference and selection level, not at per-item acceptance.
Frame Check
- The named-pattern detector across three pre-registered iterations against the pre-registered useful floor of 0.4. v1 macro-F1 0.157 (n=12). v2 0.274 (n=12, same-day rule audit). v3 0.360 (n=28, signal-level additions). All three below the floor. The labelers were two coders (curator and LLM-judge); the LLM-judge is permissive by construction (78 percent of slots flagged versus the curator's 30 percent). Track B with independent human annotators is pending. The named-pattern layer ships as hypothesis-with-evidence; the rest of the stack survives the gap.
- "Detection equals truth" claim, at the publish layer. Surfaces now read "low structural coverage of X" or "no markers detected" rather than "does not address X."
- Three v1 detection rules (FVS-001 Frame Amplification, FVS-008 Growth, FVS-015 Efficiency) retired in the same-day v2 audit, because they fired on cases they should not flag. The v3 follow-on study reintroduced two of them with revised signal substrate (S-3 growth vocabulary, S-4 efficiency vocabulary). FVS-001 remains permanently retired. The frame concepts stand as library entries; signal-level rebuild replaced the v1 rules for the two recovered.
Why Experts Miss What Beginners Catch
- "Domain expertise enables evaluation" too broad. Production expertise and memorized statistics enable evaluation. Domain familiarity alone does not.
- "Generate first always helps" untested. Anchoring risk is real.
- "FRAME improves analytical depth" killed. Zero effect on reasoning tasks. Partial effect on reformulation tasks only.
- "Evaluation degrades over time with AI delegation" not supported. The construction trace only covers produced content. There's nothing to degrade: evaluation of non-produced statistics was never strong.
More Context Barely Helps
- "More context always helps" killed. Information has near-zero effect at higher baseline. Information actively suppresses contrarian thinking on one model.
- "Information produces richer analysis" partially killed. Baseline-dependent, not robust.
- Large information effect from original did not replicate. Baseline-length-dependent.
Four Layers Produce Every AI Output
- "The model doesn't matter" too strong. Model determines format preferences and behavioral intensity independently of system configuration.
- "System effects are small" too dismissive. A single system configuration change (12 skills loaded vs removed) changed model behavior from functional to non-functional.
The Most-Cited Finding Was Wrong
- The original inflated effect size (specificity + length stacked; honest range roughly 40% smaller)
- "Strongest effect in 90+ experiments" (large, but not as large as claimed)
- Clean separation at density level: quality demands show a large density effect vs near-zero raw effect (density partially conflates specificity with shorter output length)
Most AI Numbers Are Fabricated
- "100% fabrication is universal" killed. One generator shows 77% with topic-dependent retrieval
- "PROTOCOL fixes fabrication" killed. Highest fabrication rate of all conditions
- "Source grounding fixes everything above data" partially killed. Vocabulary and causal framing stay at baseline for same-topic regeneration. But on reasoning tasks with ground truth, source-present output finds the correct answer roughly twice as often as source-absent. Source grounding improves correctness on reasoning tasks, not just reformulation.