# Receipts: Catching Your Own Overclaim

These files are the raw artifacts behind the finding published at
<https://blog.clarethium.com/catching-your-own-overclaim>.

The published claim is that **specificity as a mechanism for more-verifiable
AI output produces d = 1.37 (95% CI [0.74, 2.35])**, down from a prior
overclaim of d = 2.34 that stacked three confounded effects. This folder
contains the data and code that made the correction possible.

## What's here

| File | What it is |
|---|---|
| [`conditions.md`](./conditions.md) | The four exact instruction strings used in the clean 2×2 factorial, in human-readable form. Length-matched at ~20 words. |
| [`script.py`](./script.py) | The experiment script that produced the data, verbatim. 418 lines. Runs a 2×2 design with 10 replications per cell (40 runs) against a single generator family. |
| [`data.json`](./data.json) | All 40 raw outputs with per-run 6-marker scores, word counts, specificity density per 1,000 words, and the final effect-size analysis. |
| [`validation.json`](./validation.json) | A 10-output sub-sample with condition labels and density scores used to validate the scoring rubric. |
| [`analysis.md`](./analysis.md) | Summary of what the numbers show, how to re-run the statistics from `data.json`, and where the key results appear in the JSON. |

## How to read this

- **If you want to check the claim:** open [`analysis.md`](./analysis.md) first.
  It pulls the relevant numbers out of `data.json` and explains how they
  were computed.
- **If you want to replicate:** [`conditions.md`](./conditions.md) has the
  four prompts, and [`script.py`](./script.py) has the full procedure.
- **If you want to audit:** [`data.json`](./data.json) has every run's
  output text, every marker's count, and every score. Nothing is
  summarized away.

## What the receipts prove (and don't)

These receipts prove:

- The four conditions were length-matched (see `conditions.md`), which was
  the point of the rerun.
- The per-1,000-words density calculation that separated specificity from
  raw-score inflation is present in every run's score object.
- The reported effect size d = 1.37 can be re-derived from `data.json`
  using any standard Hedges' g implementation. The unrounded value is
  stored at `analysis.main_effects.specificity_raw.d` (1.3715).
- The 95% CI [0.74, 2.35] reported in the post excludes zero.
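
The density normalization mentioned above can be sketched as follows. This is a minimal illustration, assuming density is the total marker count divided by the word count, scaled to 1,000 words; the function name `density_per_1000` and the example numbers are hypothetical, and the exact formula used for the published data is the one in `script.py`.

```python
def density_per_1000(marker_count: int, word_count: int) -> float:
    """Markers per 1,000 words (assumed definition, for illustration only)."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return 1000 * marker_count / word_count


# Hypothetical outputs: the longer one has the higher raw marker count,
# but the lower density once length is taken into account.
short_output = density_per_1000(marker_count=12, word_count=400)
long_output = density_per_1000(marker_count=18, word_count=900)
print(short_output, long_output)
```

Normalizing by length is what keeps a longer output from scoring as "more specific" simply because it contains more words.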

These receipts do NOT prove:

- That the finding generalizes beyond the XAI generator family. The
  post's "Honest limits" section states this directly; cross-generator
  confirmation is a separate experiment (EXP-RERUN-025 cross-generator
  trials, not included here).
- That the scoring rubric's six markers are the "right" operational
  definition of specificity. See [`analysis.md`](./analysis.md) for the
  rubric and its validation step.
- That a human reader could tell conditions apart from the outputs alone.
  They couldn't; see the "blind expert evaluation" note in the published
  post. That evaluation is summarized there, not re-published here.

## Re-running the script

[`_config.py`](./_config.py) is a documented stub for the four helpers
the script imports:

- `get_xai_client()` and `call_generator(...)` raise
  `NotImplementedError`. Replace each body with a call to your own
  provider's SDK to reproduce generation.
- `hedges_g(scores_a, scores_b)` and
  `bootstrap_ci(scores_a, scores_b, n=10000, ci=0.95)` are
  pure-stdlib implementations and run as-is. Use them on
  [`data.json`](./data.json) to re-derive the d = 1.37 effect size
  and 95% CI [0.74, 2.35] without touching any API.
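
For readers who want to see the shape of that calculation, here is a rough stdlib-only sketch of Hedges' g with a percentile bootstrap interval. It is not the code in `_config.py`: the names `hedges_g_sketch` and `bootstrap_ci_sketch`, the `seed` parameter, the percentile method, and the synthetic scores are all assumptions made for illustration, and the shipped helpers may differ in detail. Loading the per-condition scores from `data.json` is left out; see `analysis.md` for where they live.

```python
import math
import random
import statistics


def hedges_g_sketch(scores_a, scores_b):
    """Standardized mean difference with the usual small-sample correction."""
    na, nb = len(scores_a), len(scores_b)
    var_a = statistics.variance(scores_a)  # sample variance (n - 1)
    var_b = statistics.variance(scores_b)
    pooled_sd = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    d = (statistics.mean(scores_a) - statistics.mean(scores_b)) / pooled_sd
    correction = 1 - 3 / (4 * (na + nb) - 9)  # approximate Hedges correction
    return d * correction


def bootstrap_ci_sketch(scores_a, scores_b, n=10000, ci=0.95, seed=0):
    """Percentile bootstrap CI, resampling each group with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n):
        resample_a = rng.choices(scores_a, k=len(scores_a))
        resample_b = rng.choices(scores_b, k=len(scores_b))
        try:
            estimates.append(hedges_g_sketch(resample_a, resample_b))
        except ZeroDivisionError:
            continue  # both resamples were constant; skip this draw
    estimates.sort()
    alpha = 1 - ci
    lo_idx = int(alpha / 2 * len(estimates))
    hi_idx = min(len(estimates) - 1, int((1 - alpha / 2) * len(estimates)))
    return estimates[lo_idx], estimates[hi_idx]


# Synthetic scores, NOT taken from data.json.
group_a = [5.0, 6.0, 7.0, 8.0, 9.0]
group_b = [1.0, 2.0, 3.0, 4.0, 5.0]
print(hedges_g_sketch(group_a, group_b))
print(bootstrap_ci_sketch(group_a, group_b, n=2000))
```

To check the published numbers, swap the synthetic lists for the two groups of scores from `data.json` and compare the result against the stored value.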

## Errata

Found a problem with the data, the method, or the analysis? Send it
via LinkedIn DM (linked from
[/about](https://blog.clarethium.com/about)). Corrections get
published on the record at [/record](https://blog.clarethium.com/record),
with attribution.

## Related receipts

The receipts for [The Fabrication Architecture](/fabrication-architecture)
([`../fabrication-architecture/`](../fabrication-architecture/)) cover
the foundational temporal-instability claim that this correction was
about. Reading the two together gives the original strong claim, the
data that was stacked underneath it, and the corrected magnitude in
context.

The receipts for [Stop Calling It Hallucination](/stop-calling-it-hallucination)
([`../stop-calling-it-hallucination/`](../stop-calling-it-hallucination/))
demonstrate three of six AI-output failure modes live, including the
confabulation mechanism this overclaim was about. The methodology-narrative
format there exposes the design history of each demo alongside the
replication kit.
