Receipts: How to Stop AI from Making Up Numbers
Raw artifacts behind the published finding. The prompts, the outputs, the scoring, and the analysis.
These files are the raw artifacts behind the finding published at https://blog.clarethium.com/source-conditioning.
The published claim is that putting real data in the prompt is the single largest variable affecting how often AI fabricates numbers, and that prohibition outperforms monitoring by an order of magnitude. This folder contains the five sub-experiments that establish those numbers across two model families and 90 documents.
What's here
| File | What it is |
|---|---|
| method.md | The five sub-experiment designs in human-readable form. Topics, conditions, prompts, what each measurement counts. |
| analysis.md | The aggregate numbers and how each maps to a claim in the published post. |
| number_match.py | Sub-experiment 1: 3 topics × 4 source-architecture tiers × 2 versions = 24 documents. Establishes the source-vs-no-source magnitude. |
| number_match_results.json | All 24 outputs with per-document number counts and unsourced details. |
| prompt_arch.py | Sub-experiment 2: 3 topics × 3 prompt architectures (CURRENT, PROHIBITION, SEPARATED) × 2 versions = 18 documents (xAI). Establishes prohibition > monitoring. |
| prompt_arch_results.json | All 18 outputs with full text and per-document number counts. |
| cross_gen.py | Sub-experiment 3: same prohibition test on Gemini. 3 topics × 2 architectures × 2 versions = 12 documents. |
| cross_gen_results.json | All 12 outputs with full text and per-document number counts. |
| source_degradation.py | Sub-experiment 4: prohibition recipe stress-tested against partial and sparse sources. 3 topics × 3 conditions × 2 versions = 18 documents. |
| source_degradation_results.json | All 18 outputs with grounded / parametric / fabricated counts. |
| commensurable_bridge.py | Sub-experiment 5: source-match and temporal-stability measured on the same documents under both source-present and source-absent conditions. 3 topics × 2 conditions × 3 versions = 18 documents. |
| bridge_results.json | All 18 outputs with cross-tabulated source match × temporal stability. |
How to read this
- If you want to check the claim: open `analysis.md` first. Each row in the claims table cites the file and the path inside the JSON that the number came from.
- If you want to replicate: `method.md` describes each sub-experiment's design, the prompt architectures, and what each number-matching layer counts. The five script files hold the procedures exactly as run.
- If you want to audit: open the result JSONs. Each contains the full output text for every document, the regex-extracted number list, and the in-source / not-in-source classification. Nothing summarized away.
What the receipts prove (and don't)
These receipts prove:
- Source presence dominates. Across the bridge experiment (18 documents, 3 topics), source-present outputs match the source on 91% of numbers, source-absent outputs match on 45%. A 46 percentage-point gap with no other variable changing.
- Prohibition outperforms monitoring 5x on xAI. T3-CURRENT (monitoring with EXTENDS labels) produced 7.71% unsourced numbers. T3-PROHIBITION (no unsourced numbers allowed) produced 1.59%. Same topics, same model, same temperature.
- The effect replicates on Gemini. Gemini CURRENT 6.07%, Gemini PROHIBITION 1.66%. Different model family, same direction, comparable magnitude.
- Prohibition is costless. Word count goes up under prohibition (927 vs 771 on xAI, +20%). Sourced number count goes up (620 vs 419, +48%). The model compensates by extracting more, not by saying less.
- Partial sources still work. Removing ~50% of source sections (PARTIAL) produced 0.40% fabrication, lower than full source (1.03%). Reducing source to ~600 chars (SPARSE) produced 3.71% unsourced. The recipe degrades gracefully.
- Stable numbers are sourced numbers. In the bridge experiment, source-present documents have 62% temporal stability across regenerations vs 25% for source-absent. Source presence stabilizes the data layer.
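The ratios and percentage changes quoted in the list above follow from simple arithmetic on the stated figures. The sketch below only re-derives them from the numbers printed in this README; the helper names `ratio` and `pct_change` are illustrative and not part of the kit.

```python
def ratio(baseline, treated):
    """How many times larger the baseline rate is than the treated rate."""
    return baseline / treated


def pct_change(new, old):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


# Unsourced-number rates (percent) as quoted above.
xai_ratio = ratio(7.71, 1.59)       # CURRENT vs PROHIBITION on xAI
gemini_ratio = ratio(6.07, 1.66)    # CURRENT vs PROHIBITION on Gemini

# Cost check on xAI: word count and sourced-number count under prohibition.
word_change = pct_change(927, 771)
sourced_change = pct_change(620, 419)

# Bridge experiment: source-present vs source-absent match rate, in points.
match_gap = 91 - 45

print(f"xAI ratio:      {xai_ratio:.2f}x")      # 4.85x
print(f"Gemini ratio:   {gemini_ratio:.2f}x")   # 3.66x
print(f"Word count:     {word_change:+.1f}%")   # +20.2%
print(f"Sourced count:  {sourced_change:+.1f}%")  # +48.0%
print(f"Match-rate gap: {match_gap} points")    # 46 points
```

The xAI ratio rounds to the 5x in the headline. The Gemini ratio is smaller, at roughly 3.7x, in the same direction.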
These receipts do NOT prove:
- That the gap holds beyond the three tested topics. Topics span remote work, internal communication, and AI-assisted development. All three have bullet-pointed source documents in the `EXP-081-data/sources/` folder of the vault. Generalization to other domains is untested here.
- That the recipe applies to non-reformulation tasks. Every test here has source material available and asks for analytical reformulation. Reasoning, strategy, and creative tasks are not measured by this kit.
- That layers above the data layer are fixed. Source grounding stabilizes numbers (62% stability) but vocabulary, conclusions, and causal reasoning stay near same-topic baseline (~34%). The kit measures the data layer only.
- That fabrication rates hold across model versions over time. Tested on `grok-4-1-fast` (xAI, March 2026) and `gemini-3-flash-preview` (Gemini, March 2026). As inference compute and retrieval improve, these numbers will move.
Operational scope
Each script imports five names from a _config module:
GENERATOR_MODEL, get_generator_client, call_generator,
format_results, and (in two scripts) get_gemini_client /
gemini_evaluate. _config.py is a documented stub
that defines those names with the right signatures and raises
NotImplementedError from each function body. Replace each body with
a call to your own provider's SDK to reproduce end-to-end.
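As a rough illustration of the stub pattern described above, the sketch below shows two of the named functions raising `NotImplementedError`, plus one way to swap in an offline stand-in for dry runs. The names `GENERATOR_MODEL`, `get_generator_client`, and `call_generator` come from this README. The parameter lists, the placeholder model id, and the `OfflineClient` and `call_generator_offline` helpers are assumptions made for illustration; check the shipped `_config.py` for the actual signatures.

```python
GENERATOR_MODEL = "your-model-id"  # placeholder, not a value from the kit


def get_generator_client():
    """Stub: replace the body with your provider's client setup."""
    raise NotImplementedError("wire up your provider's SDK here")


def call_generator(client, prompt):
    """Stub: send `prompt` through `client` and return the generated text.

    The (client, prompt) parameter list is a guess for illustration only.
    """
    raise NotImplementedError("call your provider's SDK here")


class OfflineClient:
    """Hypothetical stand-in that returns canned text, useful for dry runs."""

    def __init__(self, canned_text):
        self.canned_text = canned_text

    def generate(self, prompt):
        # Ignores the prompt on purpose: no network, no model.
        return self.canned_text


def call_generator_offline(client, prompt):
    """One possible replacement body: delegate to whatever client you built."""
    return client.generate(prompt)
```

With a stand-in like this, the number-matching code can be exercised on canned text before any provider credentials are configured.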
The reusable parts of each script run without any API access at all:
- The topic definitions and source-file paths (top of each script).
- The prompt-architecture functions (`t3_current`, `t3_prohibition`, `t3_separated_pass1`, `t3_separated_pass2` in `prompt_arch.py`).
- The `extract_numbers` and `number_in_source` helpers: the regex layer that classifies output numbers as in-source or not-in-source. Pure Python, no external dependencies.
- The `analyze_numbers` aggregation that produces the per-document counts in the result JSONs.
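To give a feel for what a regex number-matching layer of this kind can look like, here is a minimal sketch. It reuses the helper names `extract_numbers` and `number_in_source` and the count fields `total_numbers` and `not_in_source` mentioned in this README, but the regex, the normalization rules, the argument lists, and the `classify` helper are assumptions, not the kit's actual implementation. The sample strings are made up for illustration.

```python
import re

# Integers, decimals, and comma-grouped numbers, with an optional percent sign.
NUMBER_RE = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?%?|\d+(?:\.\d+)?%?")


def normalize(token):
    """Drop thousands separators and a trailing percent sign."""
    return token.replace(",", "").rstrip("%")


def extract_numbers(text):
    """Return every number found in `text`, normalized, in order of appearance."""
    return [normalize(m.group()) for m in NUMBER_RE.finditer(text)]


def number_in_source(number, source_text):
    """True if the normalized number also appears somewhere in the source text."""
    return number in set(extract_numbers(source_text))


def classify(output_text, source_text):
    """Count output numbers and flag the ones the source does not contain."""
    source_numbers = set(extract_numbers(source_text))
    found = extract_numbers(output_text)
    missing = [n for n in found if n not in source_numbers]
    return {
        "total_numbers": len(found),
        "not_in_source": len(missing),
        "unsourced": missing,
    }


# Made-up strings for illustration, not taken from the source documents.
source = "Survey of 1,200 workers: 45% hybrid."
output = "Of 1200 workers surveyed, 45% were hybrid and 30% fully remote."
print(classify(output, source))
# {'total_numbers': 3, 'not_in_source': 1, 'unsourced': ['30']}
```

Exact string matching after normalization is deliberately strict: in this sketch `45` and `45.0` count as different numbers, which errs toward flagging rather than excusing a mismatch.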
Source documents (~2-4KB each, bullet-pointed verifiable statistics)
live alongside the original scripts in the vault at
EXP-081-data/sources/. They are not bundled in this kit because they
summarize named third-party reports (Gallup, Stanford Bloom et al.,
Microsoft Work Trend Index, GitHub Octoverse, Pew Research, etc.).
sources/README.md lists the underlying reports
per topic so you can assemble equivalents.
What this kit is for
The point of receipts is verification. If you want to check whether
the 7.7% / 1.6% / 5x numbers in the post are real, open
prompt_arch_results.json, group by arch, sum numbers.total_numbers
and numbers.not_in_source, and recompute. If you want to know
whether the bridge experiment's 91% / 45% gap is real, open
bridge_results.json and read [topic][condition].aggregate.source_match_rate
for each of the three topics.
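The recomputation described above can be sketched in a few lines. The field names `arch`, `numbers.total_numbers`, `numbers.not_in_source`, and `aggregate.source_match_rate` come from this README. The assumption that `prompt_arch_results.json` parses to a flat list of per-document records is a guess; adjust the traversal if the file nests its records differently. The function names are illustrative.

```python
from collections import defaultdict


def unsourced_rate_by_arch(records):
    """Group per-document records by `arch` and return percent unsourced per group.

    Assumes each record looks like:
    {"arch": ..., "numbers": {"total_numbers": int, "not_in_source": int}}
    """
    totals = defaultdict(lambda: {"total_numbers": 0, "not_in_source": 0})
    for rec in records:
        bucket = totals[rec["arch"]]
        bucket["total_numbers"] += rec["numbers"]["total_numbers"]
        bucket["not_in_source"] += rec["numbers"]["not_in_source"]
    return {
        arch: (100.0 * t["not_in_source"] / t["total_numbers"]
               if t["total_numbers"] else 0.0)
        for arch, t in totals.items()
    }


def bridge_match_rates(bridge):
    """Read [topic][condition].aggregate.source_match_rate for every topic."""
    return {
        topic: {cond: data["aggregate"]["source_match_rate"]
                for cond, data in conditions.items()}
        for topic, conditions in bridge.items()
    }


# Typical use, once the files are on disk:
#   import json
#   with open("prompt_arch_results.json") as f:
#       print(unsourced_rate_by_arch(json.load(f)))
#   with open("bridge_results.json") as f:
#       print(bridge_match_rates(json.load(f)))
```

Summing counts across documents before dividing gives a pooled rate, matching the "sum, then recompute" instruction above; averaging per-document rates instead would weight short documents more heavily and can give a different figure.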
The patterns in these scripts work without the LLM evaluator the result JSONs were originally formatted with. The number-matching layer is regex against source text, runs in milliseconds, and is the load-bearing measurement for every claim above.
Related receipts
[The Fabrication Architecture](../fabrication-architecture/)
carries receipts for the foundational temporal-instability claim
(77-100 percent unsourced) that this recipe addresses. Reading the
two together gives the problem and the fix on the same dataset
shape.
[Trust Signals Are Inverted](../trust-signals-are-inverted/)
covers what source grounding does NOT change: the surface signals
fluent readers use to assess trustworthiness. Source grounding makes
output checkable; the trust-inversion finding shows that without
checking, fabricated and sourced output read as equally or more
trustworthy.
Errata
Found a problem with the data, the method, or the analysis? Send it via LinkedIn DM (linked from /about). Corrections get published on the record at /record, with attribution.
Files in this folder
- README.md (8.5 KB): Overview and how to read these artifacts.
- analysis.md (10.1 KB): What the numbers show and where to find them.
- bridge_results.json (42.3 KB): All 18 outputs with cross-tabulated source match × temporal stability.
- commensurable_bridge.py (23.9 KB): Sub-experiment 5: source-match and temporal-stability measured on the same documents under both source-present and source-absent conditions.
- cross_gen.py (17.0 KB): Sub-experiment 3: same prohibition test on Gemini. 12 documents. Cross-generator replication.
- cross_gen_results.json (100.9 KB): All 12 Gemini outputs with full text and per-document number counts.
- method.md (7.6 KB): Experimental design and measurement methodology.
- number_match.py (11.7 KB): Sub-experiment 1: 4 source-architecture tiers × 3 topics × 2 versions = 24 documents. Establishes the source-vs-no-source magnitude.
- number_match_results.json (30.9 KB): All 24 outputs with per-document number counts and unsourced details.
- prompt_arch.py (22.4 KB): Sub-experiment 2: prohibition vs monitoring vs two-pass on xAI. 18 documents. Establishes prohibition outperforms monitoring 5x.
- prompt_arch_results.json (184.0 KB): All 18 outputs with full text and per-document number counts.
- source_degradation.py (23.8 KB): Sub-experiment 4: prohibition stress-tested against partial and sparse sources. 18 documents. Kill test for the recipe.
- source_degradation_results.json (133.1 KB): All 18 outputs with grounded / parametric / fabricated number classifications.