ArcTested March 2026 · xAI / Gemini· 4 min

The Most-Cited Finding Was Wrong

By Lovro Lucic · Mar 23, 2026 · Updated Jun 16, 2026

The most-cited effect across 90+ experiments was wrong.

Not the direction. The direction was right. Specific instructions produce more specific output than vague instructions. That replicated across evaluators, across models, across tasks. The direction survived everything.

The magnitude was wrong. The number was 2.34. Cited in seven pieces of writing. Referenced in thirty framework documents. Built into the theory of how specificity constrains AI output. The foundation number for the most-cited claim across all the work.

It was actually three effects stacked on top of each other, reported as one.

The original experiment compared a 22-word specific instruction against a 2-word vague instruction. "Do not produce a recommendation that could apply to any B2B SaaS company. Every point must be grounded in Northvane's specific situation" versus "Be specific." Twenty runs. Large effect. Replicated.

The obvious interpretation: specificity is the mechanism. But the comparison confounded specificity content with instruction length. 22 words versus 2. Any effect could be the extra instruction text, not the specificity within it.

A length-controlled replication added two conditions: a short specific instruction (8 words) and a long vague instruction (22 words with quality demands like "detailed, thorough, comprehensive"). Raw scores showed the long vague instruction beating the short specific one. First interpretation: length is the mechanism. The specificity claim collapses.

But that interpretation was also wrong. The long vague instruction produced 48 percent more words. More words produce more specificity markers mechanically. At density (markers per thousand words) the specific instruction was MORE specific per word. Length inflated the raw score. Specificity was real underneath. And the "long vague" instruction contained quality demands that could independently drive the effect.

Three confounded variables across the two replications. A clean decomposition required separating all of them.

A 2×2 factorial. Specificity: present or absent. Quality demands: present or absent. All instructions matched at 19 words. Output length constrained. Forty outputs. One generator.

Specificity: g=1.34 raw score (shorter output, more markers). Quality demands: g=0.58 at raw score, confidence interval down to zero at the lower bound, near zero on their own. (The effect sizes here are Hedges' g, the small-sample-corrected form of Cohen's d.) The two are not additive: quality demands do almost nothing alone but lift the score sharply on top of specificity (the both-present cell runs well above what the separate effects predict). Isolated at density, specificity is the mechanism, not quality demands and not length. The magnitude is g=1.34, not the 2.34 originally claimed.

The tool that caught the confound came from the measurement infrastructure I'd already built. Density analysis (dividing total markers by word count) was developed for fabrication measurement. Applied to the specificity experiment, it separated what raw scores had conflated. The system caught its own error using its own tools.

The measurement tool built for one problem (fabrication detection) revealed a confound in a different problem (specificity measurement). Density analysis isn't novel methodology. It's basic normalization that any researcher would apply. What made it possible here was having the measurement infrastructure already built and the habit of applying it reflexively. The confound led to a cleaner experiment. The cleaner experiment confirmed the mechanism at a smaller, more honest magnitude. The overclaim was replaced by a better-supported claim.

Then a domain expert evaluated the outputs blind. Couldn't tell which were produced with specific instructions and which with generic ones. Picked specific 3 out of 5 times, chance level. Both conditions produce the same analytical conclusions. The specificity instruction changes what the output LOOKS LIKE (more data references, more grounded language), not what it SAYS.

The direction held. The magnitude shrank. And the mechanism turned out to be about verifiability, not quality. Specific outputs can be checked, generic ones can't, but the analysis is the same either way. Three corrections deep. Each one more interesting than the finding it replaced.

Test this yourself

Take your strongest measured finding. Control for one variable you haven't isolated. See if the magnitude holds.

Run in ChatGPT

What survived testing

Specificity as mechanism (confounds controlled, confidence interval excludes zero)Copy link
The decomposition methodology (density analysis catches confounds raw scores miss)Copy link
The self-correction trajectory (own tools caught own overclaim)Copy link

What didn't survive

The original inflated effect size (specificity + length stacked; honest range roughly 40% smaller)Copy link
"Strongest effect in 90+ experiments" (large, but not as large as claimed)Copy link
Clean separation at density level: quality demands show a large density effect vs near-zero raw effect (density partially conflates specificity with shorter output length)Copy link

Honest limits

Clean 2x2 was single-generator (xAI). Cross-generator generalization is a separate trial not included in this receipt and is not established here.Copy link
Specificity heuristic validated against domain expert at chance level. Expert couldn't distinguish specific from generic on quality, only on style. Specificity changes form (verifiable references), not substance (same conclusions).Copy link
10 outputs per condition. Effect sizes directional with confidence intervals that exclude zero.Copy link
"Write exactly 500 words" did not fully control output length (368-443 words).Copy link
## Audit the data yourselfCopy link
The replication kit at [/receipts/catching-your-own-overclaim](/receipts/catching-your-own-overclaim) has the four exact instruction strings from the clean 2x2 factorial, the experiment script (418 lines), all 40 raw outputs with per-run 6-marker scores and density per 1,000 words, and the validation sub-sample. The g = 1.34 effect size and 95% CI [0.74, 2.32] can be re-derived from `data.json` using the shipped Hedges' g implementation, no API access required.Copy link

Receipts

Cited by

ExperimentSame Technique, Opposite Results

The "It Depends" Problem

Explore other threads

New findings when they land.

No spam. Just what held up.

The Most-Cited Finding Was Wrong

What survived testing

What didn't survive

Honest limits

The Fabrication Problem

The Evaluation Problem

The "What You Think Works" Problem

New findings when they land.