Receipts: The Most-Cited Finding Was Wrong
Raw artifacts behind the published finding. The prompts, the outputs, the scoring, and the analysis.
These files are the raw artifacts behind the finding published at https://blog.clarethium.com/catching-your-own-overclaim.
The published claim is that specificity as a mechanism for more-verifiable AI output produces d = 1.37 (95% CI [0.74, 2.35]), down from a prior overclaim of d = 2.34 that stacked three confounded effects. This folder contains the data and code that made the correction possible.
What's here
| File | What it is |
|---|---|
| `conditions.md` | The four exact instruction strings used in the clean 2×2 factorial, in human-readable form. Length-matched at ~20 words. |
| `script.py` | The experiment script that produced the data, verbatim. 418 lines. Runs a 2×2 × 10-replications design against a single generator family. |
| `data.json` | All 40 raw outputs with per-run 6-marker scores, word counts, specificity density per 1,000 words, and the final effect-size analysis. |
| `validation.json` | A 10-output sub-sample with condition labels and density scores used to validate the scoring rubric. |
| `analysis.md` | Summary of what the numbers show, how to re-run the statistics from `data.json`, and where the key results appear in the JSON. |
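The per-run specificity density listed for `data.json` can be illustrated with a short sketch. It assumes density is the total number of marker hits divided by the output's word count, scaled to 1,000 words; the function name and the example counts are hypothetical and are not taken from `script.py`.

```python
def density_per_1000_words(marker_count: int, word_count: int) -> float:
    """Marker hits per 1,000 words (assumed definition, not copied from script.py)."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return 1000.0 * marker_count / word_count


# Two hypothetical outputs with the same raw marker count but different lengths.
short_output = density_per_1000_words(marker_count=12, word_count=300)
long_output = density_per_1000_words(marker_count=12, word_count=600)
print(short_output, long_output)  # 40.0 20.0
```

Normalizing by length is what keeps a longer output from looking more specific just because it has more room for markers.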
How to read this
- If you want to check the claim: open `analysis.md` first. It pulls the relevant numbers out of `data.json` and explains how they were computed.
- If you want to replicate: `conditions.md` has the four prompts, and `script.py` has the full procedure.
- If you want to audit: `data.json` has every run's output text, every marker's count, and every score. Nothing is summarized away.
What the receipts prove (and don't)
These receipts prove:
- The four conditions were length-matched (see `conditions.md`), which was the point of the rerun.
- The density-per-1000-words calculation that separated specificity from raw-score inflation is present in every run's score object.
- The reported effect size d = 1.37 can be re-derived from `data.json` using any standard Hedges' g implementation. The raw number lives at `analysis.main_effects.specificity_raw.d` = 1.3715.
- The 95% CI [0.74, 2.35] reported in the post excludes zero.
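A quick way to spot-check the last two points is to read the stored value by its key path and test the interval. The sketch below is illustrative: the `lookup` and `ci_excludes_zero` helpers are hypothetical, and an inline JSON string stands in for the real file, with only the one key path named above filled in.

```python
import json
from functools import reduce


def lookup(doc: dict, dotted_path: str):
    """Walk a nested dict using a dotted key path such as 'a.b.c'."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), doc)


def ci_excludes_zero(low: float, high: float) -> bool:
    """True when the whole interval lies on one side of zero."""
    return low > 0 or high < 0


# Stand-in for: with open("data.json") as fh: doc = json.load(fh)
doc = json.loads('{"analysis": {"main_effects": {"specificity_raw": {"d": 1.3715}}}}')

d = lookup(doc, "analysis.main_effects.specificity_raw.d")
print(round(d, 2), ci_excludes_zero(0.74, 2.35))  # 1.37 True
```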
These receipts do NOT prove:
- That the finding generalizes beyond the XAI generator family. The post's "Honest limits" section states this directly; cross-generator confirmation is a separate experiment (EXP-RERUN-025 cross-generator trials, not included here).
- That the scoring rubric's six markers are the "right" operational definition of specificity. See `analysis.md` for the rubric and its validation step.
- That a human reader could tell conditions apart from the outputs alone. They couldn't; see the "blind expert evaluation" note in the published post. That evaluation is summarized there, not re-published here.
Re-running the script
`_config.py` is a documented stub for the four helpers the script imports:
- `get_xai_client()` and `call_generator(...)` raise `NotImplementedError`. Replace each body with a call to your own provider's SDK to reproduce generation.
- `hedges_g(scores_a, scores_b)` and `bootstrap_ci(scores_a, scores_b, n=10000, ci=0.95)` are pure-stdlib implementations and run as-is. Use them on `data.json` to re-derive the d = 1.37 effect size and 95% CI [0.74, 2.35] without touching any API.
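For readers who want to see what such helpers typically do before opening `_config.py`, here is a minimal stdlib sketch. It assumes the textbook form of Hedges' g (pooled standard deviation with the small-sample correction) and a percentile bootstrap; the `_sketch` functions, the `seed` parameter, and the score lists are hypothetical and may differ from the shipped implementations.

```python
import math
import random
import statistics


def hedges_g_sketch(a, b):
    """Standardized mean difference with the usual small-sample correction."""
    n1, n2 = len(a), len(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    d = (statistics.mean(a) - statistics.mean(b)) / pooled_sd
    correction = 1 - 3 / (4 * (n1 + n2) - 9)  # J = 1 - 3 / (4 * df - 1)
    return d * correction


def bootstrap_ci_sketch(a, b, n=10000, ci=0.95, seed=0):
    """Percentile bootstrap interval for the effect size (illustrative only)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n):
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        try:
            estimates.append(hedges_g_sketch(resampled_a, resampled_b))
        except ZeroDivisionError:
            continue  # both resamples had zero variance; skip this draw
    if not estimates:
        raise ValueError("no usable bootstrap resamples")
    estimates.sort()
    alpha = (1 - ci) / 2
    low = estimates[int(alpha * len(estimates))]
    high = estimates[min(len(estimates) - 1, int((1 - alpha) * len(estimates)))]
    return low, high


# Hypothetical scores, not values from data.json.
group_a = [5, 6, 7, 8, 9]
group_b = [3, 4, 5, 6, 7]
print(round(hedges_g_sketch(group_a, group_b), 2))  # 1.14
print(bootstrap_ci_sketch(group_a, group_b, n=2000))
```

To reproduce the published numbers, use the helpers in `_config.py` on the per-run scores from `data.json`, since the interval depends on the exact resampling scheme.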
Errata
Found a problem with the data, the method, or the analysis? Send it via LinkedIn DM (linked from /about). Corrections get published on the record at /record, with attribution.
Related receipts
The receipts for The Fabrication Architecture
(../fabrication-architecture/) cover
the foundational temporal-instability claim that this correction was
about. Reading the two together gives the original strong claim, the
data that was stacked underneath it, and the corrected magnitude in
context.
The receipts for Stop Calling It Hallucination
(../stop-calling-it-hallucination/)
demonstrate three of six AI-output failure modes live, including the
confabulation mechanism this overclaim was about. The methodology-narrative
format there exposes the design history of each demo alongside the
replication kit.
Files in this folder
- `README.md` (4.5 KB): Overview and how to read these artifacts.
- `analysis.md` (4.3 KB): What the numbers show and where to find them.
- `conditions.md` (3.2 KB): The experiment conditions in human-readable form.
- `data.json` (148.3 KB): Raw outputs + per-run scoring + final analysis.
- `script.py` (16.9 KB): The experiment script, verbatim.
- `validation.json` (1.1 KB): Human-scored calibration subset.