Receipts: Most AI Numbers Are Fabricated

These files are the raw artifacts behind the finding published at https://blog.clarethium.com/fabrication-architecture: the prompts, the outputs, the scoring, and the analysis.

The published claim at its core is that 77 to 100 percent of AI-generated numbers are temporally unstable across regenerations of the same prompt, across three model families. Temporal instability is used as a proxy for fabrication: a number that changes between independent generations is consistent with generation, not with reliable knowledge retrieval. The proxy is directional, not identical to ground-truth fabrication (see the caveats below).

This folder contains the two experiments that anchor that range.

The Fabrication Architecture is a dense post that makes several related claims beyond the temporal-instability range: source grounding effects, prohibition vs monitoring, a 23-citation ground-truth check, and reasoning-task performance. Those are not covered by this receipts folder. Only the foundational temporal-instability claim is receipts-backed here. If the pilot systematizes, those sub-claims get their own receipts.

What's here

| File | What it is |
| --- | --- |
| `method.md` | The experimental design for both experiments, readable form. |
| `analysis.md` | What the numbers show, where to find them in the data, and how to recompute. |
| `temporal_consistency.py` | Script for the single-generator, 20-topic, 3-versions-per-topic experiment. Verbatim. |
| `temporal_consistency_results.json` | Per-topic results: stable and variable percentages/dollar amounts, fabrication rate, heading overlap. Aggregates at bottom. |
| `cross_generator.py` | Script for the 3-generator × 10-topic × 3-versions replication. Verbatim. |
| `cross_generator_results.json` | Per-generator aggregates. This is where the 77% to 100% range comes from. |
| `generated_documents.json` | The 70 raw documents from the single-generator run (20 topics × 3 STANDARD + 5 × 2 BASIC). Full text of every output. ~500KB. |

How to read this

If you want to check the claim: open analysis.md. It pulls the aggregate numbers and maps them to the phrasing used in the published post.
If you want to replicate: method.md describes the design, and temporal_consistency.py has the full procedure. Generator prompts are in generated_documents.json alongside their outputs.
If you want to audit: open generated_documents.json and pick any topic. It has three independent STANDARD generations. Compare the numerical claims across them yourself. You should see few percentages (if any) recurring across all three.

What the receipts prove (and don't)

These receipts prove:

The 20 topics were real tasks in real domains (API design, clinical trial design, business strategy, etc.), not cherry-picked fabrication traps. Topic list is enumerable in temporal_consistency_results.json under standard_results.
Each topic was generated three times independently with the same prompt. Raw outputs for all 70 documents are in generated_documents.json.
The scoring pipeline (what counts as a "stable" vs "variable" number) is deterministic and auditable: the code is in temporal_consistency.py and delegates to a claim_extraction module that extracts percentages, dollar amounts, and integer counts from each version and takes their intersection.
The 93.92 percent all-numbers fabrication rate on the primary generator, and the 76.8 to 100 percent range across three generators reported in the post, can both be re-derived from cross_generator_results.json.
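
The stable-versus-variable split described above can be illustrated with a minimal sketch. This is not the kit's claim_extraction.py: the regex patterns and the function names (`extract_claims`, `split_stable_variable`, `instability_rate`) are hypothetical stand-ins, integer counts are left out for brevity, and the rate formula (variable claims divided by all distinct claims) is an assumption to check against analysis.md.

```python
import re

# Hypothetical patterns for illustration; the kit's claim_extraction.py may differ.
PERCENT = re.compile(r"\d+(?:\.\d+)?%")
DOLLAR = re.compile(r"\$\d[\d,]*(?:\.\d+)?")


def extract_claims(text: str) -> set[str]:
    """Return the distinct percentage and dollar strings found in text."""
    return set(PERCENT.findall(text)) | set(DOLLAR.findall(text))


def split_stable_variable(versions: list[str]) -> tuple[set[str], set[str]]:
    """Stable claims appear in every version; variable claims appear in only some."""
    if not versions:
        return set(), set()
    claim_sets = [extract_claims(v) for v in versions]
    stable = set.intersection(*claim_sets)
    variable = set.union(*claim_sets) - stable
    return stable, variable


def instability_rate(versions: list[str]):
    """Assumed formula: variable / (stable + variable); None when no claims exist."""
    stable, variable = split_stable_variable(versions)
    total = len(stable) + len(variable)
    return None if total == 0 else len(variable) / total


# Three made-up generations of the same prompt (not taken from the dataset).
versions = [
    "Adoption rose 40% and cost $1,200 per seat.",
    "Adoption rose 40% while churn fell 12%.",
    "Adoption rose 40%; budgets hit $3,000.",
]
print(instability_rate(versions))  # 0.75: only "40%" recurs in all three
```

Because the split is a plain set intersection, the same inputs always give the same split, which is what makes the scoring auditable by hand.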

These receipts do NOT prove:

That the same rates would hold on generators outside this three-model sample (the post states this directly).
That the temporal instability proxy is identical to ground-truth fabrication. A number could be unstable AND correct (sampled differently each time from genuine knowledge), or stable AND wrong. The published piece's own text calls this out.
That fabrication rates are independent of topic. Claim density per topic varies substantially; the mean is only about 3 numerical claims per document. Topics with zero numerical claims have fab_rate: null in the results and are excluded from the aggregate.
Any of the published piece's other sub-claims:
- The source-grounding results are separate experiments, not included here. Two distinct claims: the condition table (BASIC 85.8% temporal instability to 1.7% unsourced, which pairs two different measurement constructs, as the post states), and the 46-percentage-point figure, which is the attribution shift (45.9% to 91.9% source-match), re-derivable from the source-conditioning kit's bridge experiment.
- The "prohibition outperforms monitoring 5x" result (1.6% vs 7.7%) comes from a separate experiment.
- The "2 of 23 named-source citations verified correct" result comes from a separate ground-truth check against published sources.
- The "source-present output 75% correct vs 38% on reasoning tasks" result comes from a separate experiment across 5 reasoning domains.

Each of those has its own data in the vault. If the receipts pilot systematizes, they'll get their own publications.
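
The exclusion of topics with fab_rate: null from the aggregate, noted in the list above, can be sketched as follows. The JSON shape, topic names, and values here are hypothetical, and averaging per-topic rates is an assumption: the kit may pool claim counts instead, so check analysis.md for the actual aggregation.

```python
import json
from statistics import mean
from typing import Optional

# Hypothetical per-topic results; the real file's layout may differ.
raw = """
{
  "topic_a": {"fab_rate": 1.0},
  "topic_b": {"fab_rate": 0.5},
  "topic_c": {"fab_rate": null}
}
"""


def aggregate_fab_rate(results: dict) -> Optional[float]:
    """Mean of per-topic rates, skipping topics whose fab_rate is null (None)."""
    rates = [r["fab_rate"] for r in results.values() if r["fab_rate"] is not None]
    return mean(rates) if rates else None


results = json.loads(raw)  # JSON null becomes Python None
print(aggregate_fab_rate(results))  # 0.75: topic_c has no numerical claims and is skipped
```

Treating a topic with no numerical claims as null rather than 0 keeps it from pulling the aggregate toward zero.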

Re-running

Both scripts depend on two helper modules. Both ship with this kit:

claim_extraction.py provides extract_all(text) (returns percentages, dollar amounts, and integer counts from text) and compare_numerical_across_versions(versions) (returns which claims recur across all versions vs. only some). Pure regex over re and collections. No external dependencies. Ships verbatim from the vault and runs as-is.
_config.py is a documented stub for the API helpers (get_xai_client, get_gemini_client, call_generator). Each stub function raises NotImplementedError with a one-line description of what the original did. Replace each body with a call to your own provider's SDK to reproduce generation.

The temporal-stability numbers in the post can be re-derived from generated_documents.json using claim_extraction.py alone, no API access required.

Errata

Found a problem with the data, the method, or the analysis? Send it via LinkedIn DM (linked from /about). Corrections get published on the record at /record, with attribution.

Related receipts

Catching Your Own Overclaim (../catching-your-own-overclaim/) carries the receipts for the corrected version of the most-cited effect in this dataset. The two kits together give the original strong claim, the data underneath it, and the corrected magnitude in context.

The receipts for Stop Calling It Hallucination (../stop-calling-it-hallucination/) demonstrate three failure modes live (role framing, template contamination, confabulation with and without prohibition). The Beat 5 demo is a live cross-model replication of the source-grounding fix this fabrication architecture is built on. The methodology-narrative format there exposes design history alongside the replication kit.

Source Conditioning (../source-conditioning/) carries the receipts for the operational fix referenced in this post (source grounding + prohibition). Five sub-experiments, two model families, 90 documents.