Receipts: Three Questions Before You Prompt AI

These files are the raw artifacts behind the finding published at https://blog.clarethium.com/catching-your-own-overclaim: the prompts, the outputs, the scoring, and the analysis.

The published claim is that specificity, as a mechanism for producing more-verifiable AI output, has an effect size of Hedges' g = 1.34 (95% CI [0.74, 2.32]; Cohen's d = 1.37). That is down from a prior overclaim of 2.34, which stacked three confounded effects. This folder contains the data and code that made the correction possible.

What's here

| File | What it is |
| --- | --- |
| `conditions.md` | The four exact instruction strings used in the clean 2×2 factorial, in human-readable form. Length-matched at ~20 words. |
| `script.py` | The experiment script that produced the data, verbatim. 418 lines. Runs a 2×2 × 10-replications design against a single generator family. |
| `data.json` | All 40 raw outputs with per-run 6-marker scores, word counts, specificity density per 1,000 words, and the final effect-size analysis. |
| `validation.json` | A 10-output sub-sample with condition labels and density scores used to validate the scoring rubric. |
| `analysis.md` | Summary of what the numbers show, how to re-run the statistics from `data.json`, and where the key results appear in the JSON. |

How to read this

If you want to check the claim: open analysis.md first. It pulls the relevant numbers out of data.json and explains how they were computed.
If you want to replicate: conditions.md has the four prompts, and script.py has the full procedure.
If you want to audit: data.json has every run's output text, every marker's count, and every score. Nothing is summarized away.
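
When auditing individual runs, it may help to recompute the density figure by hand. The sketch below assumes density is simply the total count across the six markers, divided by the word count and scaled to 1,000 words; the shipped script may weight or combine markers differently, and the marker names here are placeholders rather than the rubric's real labels.

```python
def density_per_1000_words(marker_counts, word_count):
    """Total marker hits per 1,000 words of output (assumed definition)."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return 1000 * sum(marker_counts.values()) / word_count


# Hypothetical run: six placeholder markers, 250 words of output.
example_counts = {
    "marker_1": 3, "marker_2": 0, "marker_3": 2,
    "marker_4": 1, "marker_5": 0, "marker_6": 4,
}
example_density = density_per_1000_words(example_counts, 250)
print(example_density)  # 40.0

# An output twice as long with twice the marker hits has a higher raw
# score but the same density, which is why density controls for length.
doubled_counts = {name: 2 * count for name, count in example_counts.items()}
doubled_density = density_per_1000_words(doubled_counts, 500)
print(doubled_density)  # 40.0
```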

What the receipts prove (and don't)

These receipts prove:

The four conditions were length-matched (see conditions.md), which was the point of the rerun.
The density-per-1000-words calculation that separated specificity from raw-score inflation is present in every run's score object.
The reported Hedges' g = 1.34 can be re-derived from data.json using the shipped Hedges' g implementation. The numbers live at analysis.main_effects.specificity_raw (Cohen's d = 1.3715, Hedges' g = 1.3442).
The 95% CI [0.74, 2.32] on Hedges' g excludes zero.
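
As a quick sanity check, the reported Cohen's d and Hedges' g can be related through the usual small-sample correction factor. This sketch assumes the common approximation J = 1 - 3/(4·df - 1) and assumes the specificity main effect compares 20 runs against 20 (the 40 outputs collapsed over the other factor); neither assumption is confirmed by these receipts, so check `script.py` for the actual computation.

```python
def hedges_correction(n_a, n_b):
    """Common approximation to the small-sample correction factor J."""
    df = n_a + n_b - 2
    return 1 - 3 / (4 * df - 1)


reported_cohens_d = 1.3715  # from analysis.main_effects.specificity_raw

# Assumption: 20 runs per level of the specificity factor.
derived_g = reported_cohens_d * hedges_correction(20, 20)
print(round(derived_g, 2))  # 1.34
```

Under those assumptions the derived value agrees with the reported Hedges' g = 1.3442 to within rounding of the inputs.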

These receipts do NOT prove:

That the finding generalizes beyond the XAI generator family. The post's "Honest limits" section states this directly; cross-generator confirmation is a separate experiment (EXP-RERUN-025 cross-generator trials, not included here).
That the scoring rubric's six markers are the "right" operational definition of specificity. See analysis.md for the rubric and its validation step.
That a human reader could tell conditions apart from the outputs alone. They couldn't; see the "blind expert evaluation" note in the published post. That evaluation is summarized there, not re-published here.

Re-running the script

_config.py documents the four helpers the script imports. Two are stubs and two are working implementations:

get_xai_client() and call_generator(...) raise NotImplementedError. Replace each body with a call to your own provider's SDK to reproduce generation.
hedges_g(scores_a, scores_b) and bootstrap_ci(scores_a, scores_b, n=10000, ci=0.95) are pure-stdlib implementations and run as-is. Use them on data.json to re-derive the Hedges' g = 1.34 effect size and 95% CI [0.74, 2.32] without touching any API.
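
For readers who want to see what the two statistical helpers do before opening _config.py, here is an independent pure-stdlib sketch with the same call shapes. It is not the shipped code: it assumes a pooled-standard-deviation Cohen's d, the approximate correction factor 1 - 3/(4·df - 1), and a percentile bootstrap that resamples each group with replacement. The `seed` parameter is a hypothetical addition for reproducibility.

```python
import random
import statistics


def hedges_g(scores_a, scores_b):
    """Standardized mean difference (a minus b) with small-sample correction."""
    n_a, n_b = len(scores_a), len(scores_b)
    df = n_a + n_b - 2
    # Pooled variance from the two unbiased sample variances.
    pooled_var = (
        (n_a - 1) * statistics.variance(scores_a)
        + (n_b - 1) * statistics.variance(scores_b)
    ) / df
    cohens_d = (statistics.mean(scores_a) - statistics.mean(scores_b)) / pooled_var ** 0.5
    # Approximate correction factor J (an assumption, see lead-in).
    return cohens_d * (1 - 3 / (4 * df - 1))


def bootstrap_ci(scores_a, scores_b, n=10000, ci=0.95, seed=None):
    """Percentile bootstrap interval for Hedges' g."""
    rng = random.Random(seed)  # `seed` is hypothetical, not in the documented signature
    estimates = []
    for _ in range(n):
        resample_a = rng.choices(scores_a, k=len(scores_a))
        resample_b = rng.choices(scores_b, k=len(scores_b))
        try:
            estimates.append(hedges_g(resample_a, resample_b))
        except ZeroDivisionError:
            continue  # skip degenerate resamples with zero pooled variance
    estimates.sort()
    tail = (1 - ci) / 2
    lower = estimates[int(tail * len(estimates))]
    upper = estimates[min(len(estimates) - 1, int((1 - tail) * len(estimates)))]
    return lower, upper
```

To apply either function to data.json you would first pull out the two lists of per-run scores for the groups being compared; the exact key layout is described in analysis.md and is not reproduced here. A bootstrap interval depends on the random draws, so a re-run should land near the reported [0.74, 2.32] rather than match it digit for digit.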

Errata

Found a problem with the data, the method, or the analysis? Send it via LinkedIn DM (linked from /about). Corrections get published on the record at /record, with attribution.

Related receipts

The receipts for The Fabrication Architecture (../fabrication-architecture/) cover the foundational temporal-instability claim that this correction was about. Reading the two together gives the original strong claim, the data that was stacked underneath it, and the corrected magnitude in context.