Receipts: Most AI Numbers Are Fabricated
Raw artifacts behind the published finding. The prompts, the outputs, the scoring, and the analysis.
These files are the raw artifacts behind the finding published at https://blog.clarethium.com/fabrication-architecture.
The core published claim is that 77 to 100 percent of AI-generated numbers are temporally unstable across regenerations of the same prompt, across three model families. Temporal instability serves as a proxy for fabrication: a number that changes between independent generations is consistent with generation, not with reliable knowledge retrieval. The proxy is directional, not identical to ground-truth fabrication (see the caveats below).
This folder contains the two experiments that anchor that range.
The Fabrication Architecture is a dense post that makes several related claims beyond the temporal-instability range: source grounding effects, prohibition vs monitoring, a 23-citation ground-truth check, and reasoning-task performance. Those are not covered by this receipts folder. Only the foundational temporal-instability claim is receipts-backed here. If the pilot systematizes, those sub-claims get their own receipts.
What's here
| File | What it is |
|---|---|
| `method.md` | The experimental design for both experiments, in readable form. |
| `analysis.md` | What the numbers show, where to find them in the data, and how to recompute. |
| `temporal_consistency.py` | Script for the single-generator, 20-topic, 3-versions-per-topic experiment. Verbatim. |
| `temporal_consistency_results.json` | Per-topic results: stable and variable percentages/dollar amounts, fabrication rate, heading overlap. Aggregates at bottom. |
| `cross_generator.py` | Script for the 3-generator × 10-topic × 3-versions replication. Verbatim. |
| `cross_generator_results.json` | Per-generator aggregates. This is where the 77% to 100% range comes from. |
| `generated_documents.json` | The 70 raw documents from the single-generator run (20 topics × 3 STANDARD + 5 × 2 BASIC). Full text of every output. ~500KB. |
How to read this
- If you want to check the claim: open `analysis.md`. It pulls the aggregate numbers and maps them to the phrasing used in the published post.
- If you want to replicate: `method.md` describes the design, and `temporal_consistency.py` has the full procedure. Generator prompts are in `generated_documents.json` alongside their outputs.
- If you want to audit: open `generated_documents.json` and pick any topic. It has three independent STANDARD generations. Compare the numerical claims across them yourself. You should see few percentages (if any) recurring across all three.
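As a starting point for that audit, here is a minimal sketch that counts how many versions each percentage appears in. It is not the kit's extractor: `PERCENT_RE`, `percentages`, and `recurrence` are hypothetical names, the regex is a simplification, and the three sample strings are invented for illustration rather than taken from `generated_documents.json`. Paste three generations of one topic in place of the samples.

```python
import re
from collections import Counter

# Hypothetical pattern (an assumption, not the kit's regex):
# matches "42%", "3.5 %", and "12 percent".
PERCENT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:%|percent\b)")


def percentages(text):
    """Return the set of distinct percentage values found in one document."""
    return {float(m.group(1)) for m in PERCENT_RE.finditer(text)}


def recurrence(versions):
    """Map each percentage to the number of versions it appears in."""
    counts = Counter()
    for text in versions:
        # Each version contributes at most 1 per distinct value.
        counts.update(percentages(text))
    return dict(counts)


# Invented sample texts standing in for three generations of one topic.
versions = [
    "Adoption rose 42% while churn fell 8%.",
    "Adoption rose 37% while churn fell 8%.",
    "Adoption rose 45 percent; churn fell 8 percent.",
]

counts = recurrence(versions)
recurring_in_all = sorted(v for v, n in counts.items() if n == len(versions))
print(counts)
print("In all versions:", recurring_in_all)
```

In this invented sample only the 8 percent figure appears in all three versions; the adoption figure differs each time.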
What the receipts prove (and don't)
These receipts prove:
- The 20 topics were real tasks in real domains (API design, clinical trial design, business strategy, etc.), not cherry-picked fabrication traps. The topic list is enumerable in `temporal_consistency_results.json` under `standard_results`.
- Each topic was generated three times independently with the same prompt. Raw outputs for all 70 documents are in `generated_documents.json`.
- The scoring pipeline (what counts as a "stable" vs "variable" number) is deterministic and auditable: the code is in `temporal_consistency.py` and delegates to a `claim_extraction` module that extracts percentages, dollar amounts, and integer counts from each version and takes their intersection.
- The 93.92 percent all-numbers fabrication rate on the primary generator, and the 76.8 to 100 percent range across three generators reported in the post, can both be re-derived from `cross_generator_results.json`.
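The intersection-based scoring in the third point can be sketched roughly as follows. This is an illustration under stated assumptions, not the kit's `claim_extraction` code: the regexes are simplified, integer counts are left out for brevity, the function names `extract_claims` and `score_topic` are hypothetical, and the rate is assumed to be variable claims divided by all distinct claims. The sample texts are invented.

```python
import re

# Hypothetical, simplified patterns; the kit's claim_extraction module
# may normalize and match differently.
PATTERNS = {
    "percent": re.compile(r"\d+(?:\.\d+)?\s?%"),
    "dollar": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
}


def extract_claims(text):
    """Return the set of (kind, value) numerical claims in one version."""
    claims = set()
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            claims.add((kind, m.group(0).replace(" ", "")))
    return claims


def score_topic(versions):
    """Split claims into stable (in every version) and variable (the rest)."""
    per_version = [extract_claims(v) for v in versions]
    all_claims = set().union(*per_version)
    stable = set.intersection(*per_version) if per_version else set()
    variable = all_claims - stable
    # Assumed definition: share of distinct claims that are not stable.
    # None mirrors the null rate for topics with no numerical claims.
    fab_rate = len(variable) / len(all_claims) if all_claims else None
    return {"stable": stable, "variable": variable, "fab_rate": fab_rate}


# Invented sample texts, not taken from the dataset.
result = score_topic([
    "Latency dropped 30% and cost fell to $1,200 per month.",
    "Latency dropped 25% and cost fell to $1,200 per month.",
    "Latency dropped 30% and cost fell to $1,200 per month.",
])
print(result)
```

Under this sketch, a value that appears in two of the three versions still counts as variable, because only the intersection across all versions is treated as stable.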
These receipts do NOT prove:
- That the same rates would hold on generators outside this three-model sample (the post states this directly).
- That the temporal instability proxy is identical to ground-truth fabrication. A number could be unstable AND correct (sampled differently each time from genuine knowledge), or stable AND wrong. The published piece's own text calls this out.
- That fabrication rates are independent of topic. Claim density per topic varies substantially; the mean is only about 3 numerical claims per document. Topics with zero numerical claims have `fab_rate: null` in the results and are excluded from the aggregate.
- Any of the published piece's other sub-claims:
  - The 46-percentage-point drop with source grounding (BASIC 85.8% to 1.7% unsourced) comes from a separate experiment, not included here.
  - The finding that prohibition outperforms monitoring 5x (1.6% vs 7.7%) comes from a separate experiment.
  - The finding that 2 of 23 named-source citations verified correct comes from a separate ground-truth check against published sources.
  - The finding that source-present output was 75% correct vs 38% on reasoning tasks comes from a separate experiment across 5 reasoning domains.

Each of those has its own data in the vault. If the receipts pilot systematizes, they'll get their own publications.
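To make the null-exclusion rule concrete, here is a small sketch of an aggregate that skips topics whose `fab_rate` is null. Only the `fab_rate` key comes from the results file; the record shape, the topic names, the function name `aggregate_fab_rate`, and the choice of an unweighted mean are assumptions made for illustration. The kit may instead pool claim counts across topics; `analysis.md` is the authority on how the published aggregate is computed.

```python
def aggregate_fab_rate(per_topic):
    """Mean of per-topic rates, skipping topics with no numerical claims.

    Returns (mean_rate, topics_scored). An unweighted mean is an assumption
    of this sketch, not necessarily what the kit's scripts do.
    """
    rates = [t["fab_rate"] for t in per_topic if t.get("fab_rate") is not None]
    if not rates:
        return None, 0
    return sum(rates) / len(rates), len(rates)


# Invented records; JSON null loads as Python None.
per_topic = [
    {"topic": "hypothetical-topic-a", "fab_rate": 1.0},
    {"topic": "hypothetical-topic-b", "fab_rate": 0.5},
    {"topic": "hypothetical-topic-c", "fab_rate": None},
]

mean_rate, n_scored = aggregate_fab_rate(per_topic)
print(mean_rate, n_scored)
```

Reporting the number of scored topics next to the rate shows how many topics actually contributed, which matters when claim density is as low as it is here.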
Re-running
Both scripts depend on two helper modules. Both ship with this kit:
- `claim_extraction.py` provides `extract_all(text)` (returns percentages, dollar amounts, and integer counts from text) and `compare_numerical_across_versions(versions)` (returns which claims recur across all versions vs. only some). Pure regex over `re` and `collections`. No external dependencies. Ships verbatim from the vault and runs as-is.
- `_config.py` is a documented stub for the API helpers (`get_xai_client`, `get_gemini_client`, `call_generator`). Each stub function raises `NotImplementedError` with a one-line description of what the original did. Replace each body with a call to your own provider's SDK to reproduce generation.
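The stub pattern looks roughly like the sketch below. The signature of `call_generator` shown here is an assumption (check `_config.py` for the real one), and `generate_versions` and `fake_generate` are hypothetical helpers added only so the example runs offline without any provider SDK.

```python
def call_generator(prompt, model="your-model-id"):
    """Stub with an assumed signature: the original sent `prompt` to a model
    API and returned the generated text."""
    raise NotImplementedError("call_generator: wire this to your provider's SDK")


def generate_versions(prompt, generate, n=3):
    """Call `generate` n times with the same prompt and collect the outputs.

    Hypothetical helper mirroring the three-versions-per-topic design.
    """
    return [generate(prompt) for _ in range(n)]


def fake_generate(prompt):
    """Offline stand-in for a real provider call; returns fixed text."""
    return f"Draft for: {prompt}"


# The unmodified stub fails loudly instead of silently returning nothing.
try:
    call_generator("Design a REST API for invoices")
    stub_raised = False
except NotImplementedError as exc:
    stub_raised = True
    print("Stub not yet implemented:", exc)

versions = generate_versions("Design a REST API for invoices", fake_generate)
print(len(versions), "versions generated")
```

Passing the generator in as an argument keeps the experiment loop independent of any one provider, so swapping SDKs means replacing a single function body.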
The temporal-stability numbers in the post can be re-derived from `generated_documents.json` using `claim_extraction.py` alone, no API access required.
Errata
Found a problem with the data, the method, or the analysis? Send it via LinkedIn DM (linked from /about). Corrections get published on the record at /record, with attribution.
Related receipts
[Catching Your Own Overclaim](../catching-your-own-overclaim/) carries the receipts for the corrected version of the most-cited effect in this dataset. The two kits together give the original strong claim, the data underneath it, and the corrected magnitude in context.

The receipts for [Stop Calling It Hallucination](../stop-calling-it-hallucination/) demonstrate three failure modes live (role framing, template contamination, confabulation with and without prohibition). The Beat 5 demo is a live cross-model replication of the source-grounding fix this fabrication architecture is built on. The methodology-narrative format there exposes design history alongside the replication kit.

[Source Conditioning](../source-conditioning/) carries the receipts for the operational fix referenced in this post (source grounding + prohibition). Five sub-experiments, two model families, 90 documents.

[Trust Signals Are Inverted](../trust-signals-are-inverted/) covers the reading-side consequence of the fabrication mechanism: the surface signals readers use to assess trustworthiness are higher in fabricated output than in sourced output.
Files in this folder
- `README.md` (7.7 KB): Overview and how to read these artifacts.
- `analysis.md` (5.8 KB): What the numbers show and where to find them.
- `claim_extraction.py` (14.4 KB): Pure-regex claim extractor. Ships verbatim from the vault; runs without API access. Used to re-derive the temporal-stability numbers from the bundled documents.
- `cross_generator.py` (27.2 KB): Cross-generator replication script (3 models × 10 topics).
- `cross_generator_results.json` (162.1 KB): Per-generator aggregates for the 77% to 100% range.
- `generated_documents.json` (484.6 KB): All 70 raw documents from the single-generator run. Full text.
- `method.md` (4.0 KB): Experimental design and measurement methodology.
- `temporal_consistency.py` (16.7 KB): Single-generator, 20-topic temporal-stability experiment.
- `temporal_consistency_results.json` (41.5 KB): Per-topic scoring + aggregate for the single-generator run.