Receipts: Most AI Numbers Are Fabricated

These files are the raw artifacts behind the finding published at https://blog.clarethium.com/fabrication-architecture: the prompts, the outputs, the scoring, and the analysis.

The core published claim is that 77 to 100 percent of AI-generated numbers are temporally unstable across regenerations of the same prompt, in each of three model families. Temporal instability is used as a proxy for fabrication: a number that changes between independent generations is consistent with generation, not with reliable knowledge retrieval. The proxy is directional, not identical to ground-truth fabrication (see the caveats below).

This folder contains the two experiments that anchor that range.

The Fabrication Architecture is a dense post that makes several related claims beyond the temporal-instability range: source grounding effects, prohibition vs monitoring, a 23-citation ground-truth check, and reasoning-task performance. Those are not covered by this receipts folder. Only the foundational temporal-instability claim is receipts-backed here. If the pilot systematizes, those sub-claims get their own receipts.

What's here

| File | What it is |
| --- | --- |
| method.md | The experimental design for both experiments, in readable form. |
| analysis.md | What the numbers show, where to find them in the data, and how to recompute. |
| temporal_consistency.py | Script for the single-generator, 20-topic, 3-versions-per-topic experiment. Verbatim. |
| temporal_consistency_results.json | Per-topic results: stable and variable percentages/dollar amounts, fabrication rate, heading overlap. Aggregates at bottom. |
| cross_generator.py | Script for the 3-generator × 10-topic × 3-versions replication. Verbatim. |
| cross_generator_results.json | Per-generator aggregates. This is where the 77% to 100% range comes from. |
| generated_documents.json | The 70 raw documents from the single-generator run (20 topics × 3 STANDARD + 5 × 2 BASIC). Full text of every output. ~500KB. |

How to read this

  • If you want to check the claim: open analysis.md. It pulls the aggregate numbers and maps them to the phrasing used in the published post.
  • If you want to replicate: method.md describes the design, and temporal_consistency.py has the full procedure. Generator prompts are in generated_documents.json alongside their outputs.
  • If you want to audit: open generated_documents.json and pick any topic. It has three independent STANDARD generations. Compare the numerical claims across them yourself. You should see few percentages (if any) recurring across all three.

What the receipts prove (and don't)

These receipts prove:

  • The 20 topics were real tasks in real domains (API design, clinical trial design, business strategy, etc.), not cherry-picked fabrication traps. Topic list is enumerable in temporal_consistency_results.json under standard_results.
  • Each topic was generated three times independently with the same prompt. Raw outputs for all 70 documents are in generated_documents.json.
  • The scoring pipeline (what counts as a "stable" vs "variable" number) is deterministic and auditable: the code is in temporal_consistency.py and delegates to a claim_extraction module that extracts percentages, dollar amounts, and integer counts from each version and takes their intersection.
  • The 93.92 percent all-numbers fabrication rate on the primary generator, and the 76.8 percent to 100 percent range across three generators reported in the post, can both be re-derived from cross_generator_results.json.
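
The stable-versus-variable scoring described above can be sketched roughly as follows. This is a minimal illustration, not the kit's claim_extraction.py: the regular expressions, the function names extract_numbers and split_stable_variable, and the sample texts are all assumptions made for the example. It covers only percentages and dollar amounts, whereas the real module also extracts integer counts.

```python
import re

# Hypothetical patterns for illustration; the kit's own regexes may differ.
PERCENT_RE = re.compile(r"\d+(?:\.\d+)?%")
DOLLAR_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")


def extract_numbers(text):
    """Return the set of percentage and dollar-amount strings found in text."""
    return set(PERCENT_RE.findall(text)) | set(DOLLAR_RE.findall(text))


def split_stable_variable(versions):
    """Split claims into those in every version and those missing from at least one.

    Expects a non-empty list of document texts for a single topic.
    """
    claim_sets = [extract_numbers(v) for v in versions]
    stable = set.intersection(*claim_sets)
    variable = set.union(*claim_sets) - stable
    return stable, variable


# Made-up texts standing in for three generations of one topic.
versions = [
    "Adoption rose 40% and the pilot cost $1,200.",
    "Adoption rose 35% and the pilot cost $1,200.",
    "Adoption rose 52% while the pilot cost $1,200.",
]
stable, variable = split_stable_variable(versions)
print(sorted(stable))    # ['$1,200']
print(sorted(variable))  # ['35%', '40%', '52%']
```

Matching on exact strings is a deliberately strict choice in this sketch: "40%" and "40.0%" would count as different claims. Whether the kit normalizes such cases is something to check in claim_extraction.py itself.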

These receipts do NOT prove:

  • That the same rates would hold on generators outside this three-model sample (the post states this directly).
  • That the temporal instability proxy is identical to ground-truth fabrication. A number could be unstable AND correct (sampled differently each time from genuine knowledge), or stable AND wrong. The published piece's own text calls this out.
  • That fabrication rates are independent of topic. Claim density per topic varies substantially; the mean is only about 3 numerical claims per document. Topics with zero numerical claims have fab_rate: null in the results and are excluded from the aggregate.
  • Any of the published piece's other sub-claims:
    • The 46-percentage-point drop with source grounding (BASIC 85.8% to 1.7% unsourced) is a separate experiment, not included here.
    • Prohibition outperforms monitoring 5x (1.6% vs 7.7%) is a separate experiment.
    • 2 of 23 named-source citations verified correct is a separate ground-truth check against published sources.
    • Source-present output 75% correct vs 38% on reasoning tasks is a separate experiment across 5 reasoning domains.

Each of those has its own data in the vault. If the receipts pilot systematizes, they'll get their own publications.
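
The exclusion of zero-claim topics mentioned in the caveats above can be sketched as follows. The formula (variable claims divided by all distinct claims), the use of a simple mean over topics, and the counts are assumptions for illustration; analysis.md and the scripts are the authority on how the published aggregate is actually computed.

```python
def fab_rate(stable_count, variable_count):
    """Share of distinct numerical claims that did not recur in every version.

    Returns None when a topic has no numerical claims, mirroring fab_rate: null.
    """
    total = stable_count + variable_count
    if total == 0:
        return None
    return variable_count / total


def mean_fab_rate(rates):
    """Average per-topic rates, skipping topics whose rate is None."""
    scored = [r for r in rates if r is not None]
    if not scored:
        return None
    return sum(scored) / len(scored)


# Made-up (stable, variable) counts for three hypothetical topics.
counts = [(1, 3), (0, 0), (0, 5)]
rates = [fab_rate(s, v) for s, v in counts]
print(rates)                 # [0.75, None, 1.0]
print(mean_fab_rate(rates))  # 0.875
```

Treating a zero-claim topic as None rather than 0.0 matters: counting it as 0.0 would pull the aggregate down even though the topic offers no evidence either way.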

Re-running

Both scripts depend on two helper modules. Both ship with this kit:

  • claim_extraction.py provides extract_all(text), which returns the percentages, dollar amounts, and integer counts found in a text, and compare_numerical_across_versions(versions), which returns which claims recur across all versions and which appear in only some. It is pure regex, using only the standard-library re and collections modules, with no external dependencies. It ships verbatim from the vault and runs as-is.
  • _config.py is a documented stub for the API helpers (get_xai_client, get_gemini_client, call_generator). Each stub function raises NotImplementedError with a one-line description of what the original did. Replace each body with a call to your own provider's SDK to reproduce generation.
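
The stub-and-replace pattern described above might look like the following sketch. The signature of call_generator, the stub's message text, and the make_call_generator helper are hypothetical; the kit's _config.py defines the actual names and descriptions. The replacement wraps any function you supply that maps a prompt string to generated text, so no particular provider SDK is assumed.

```python
from typing import Callable


def call_generator(prompt: str) -> str:
    """Stub in the style described above (signature assumed for illustration)."""
    raise NotImplementedError(
        "call_generator: sent a prompt to a generator model and returned its text "
        "(description assumed)"
    )


def make_call_generator(send: Callable[[str], str]) -> Callable[[str], str]:
    """Build a replacement from a user-supplied prompt-to-text function."""
    def _call(prompt: str) -> str:
        text = send(prompt)
        if not isinstance(text, str):
            raise TypeError("generator must return text")
        return text
    return _call


# Usage with a stand-in that echoes the prompt instead of calling a model.
my_call_generator = make_call_generator(lambda prompt: f"[generated for] {prompt}")
print(my_call_generator("Design a REST API"))
```

In practice the lambda would be replaced by a function that calls your own provider's client and returns the response text.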

The temporal-stability numbers in the post can be re-derived from generated_documents.json using claim_extraction.py alone, no API access required.

Errata

Found a problem with the data, the method, or the analysis? Send it via LinkedIn DM (linked from /about). Corrections get published on the record at /record, with attribution.

Related receipts

Catching Your Own Overclaim (../catching-your-own-overclaim/) carries the receipts for the corrected version of the most-cited effect in this dataset. The two kits together give the original strong claim, the data underneath it, and the corrected magnitude in context.

The receipts for Stop Calling It Hallucination (../stop-calling-it-hallucination/) demonstrate three failure modes live (role framing, template contamination, confabulation with and without prohibition). The Beat 5 demo is a live cross-model replication of the source-grounding fix this fabrication architecture is built on. The methodology-narrative format there exposes design history alongside the replication kit.

Source Conditioning (../source-conditioning/) carries the receipts for the operational fix referenced in this post (source grounding + prohibition). Five sub-experiments, two model families, 90 documents.

Trust Signals Are Inverted (../trust-signals-are-inverted/) covers the reading-side consequence of the fabrication mechanism: the surface signals readers use to assess trustworthiness are higher in fabricated output than in sourced output.
