Receipts: How to Stop AI from Making Up Numbers

These files are the raw artifacts behind the finding published at https://blog.clarethium.com/source-conditioning: the prompts, the outputs, the scoring, and the analysis.

The published claim is that putting real data in the prompt is the single largest variable affecting how often AI fabricates numbers, and that prohibition outperforms monitoring by an order of magnitude. This folder contains the five sub-experiments that establish those numbers across two model families and 90 documents.

What's here

| File | What it is |
| --- | --- |
| method.md | The five sub-experiment designs in human-readable form. Topics, conditions, prompts, what each measurement counts. |
| analysis.md | The aggregate numbers and how each maps to a claim in the published post. |
| number_match.py | Sub-experiment 1: 3 topics × 4 source-architecture tiers × 2 versions = 24 documents. Establishes the source-vs-no-source magnitude. |
| number_match_results.json | All 24 outputs with per-document number counts and unsourced details. |
| prompt_arch.py | Sub-experiment 2: 3 topics × 3 prompt architectures (CURRENT, PROHIBITION, SEPARATED) × 2 versions = 18 documents (xAI). Establishes prohibition > monitoring. |
| prompt_arch_results.json | All 18 outputs with full text and per-document number counts. |
| cross_gen.py | Sub-experiment 3: same prohibition test on Gemini. 3 topics × 2 architectures × 2 versions = 12 documents. |
| cross_gen_results.json | All 12 outputs with full text and per-document number counts. |
| source_degradation.py | Sub-experiment 4: prohibition recipe stress-tested against partial and sparse sources. 3 topics × 3 conditions × 2 versions = 18 documents. |
| source_degradation_results.json | All 18 outputs with grounded / parametric / fabricated counts. |
| commensurable_bridge.py | Sub-experiment 5: source-match and temporal-stability measured on the same documents under both source-present and source-absent conditions. 3 topics × 2 conditions × 3 versions = 18 documents. |
| bridge_results.json | All 18 outputs with cross-tabulated source match × temporal stability. |
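
The per-script document counts in the file list can be checked against the 90-document total with a few lines of arithmetic. This is only a sanity check on the design sizes as written in this README; the dictionary keys are script names without the .py suffix, and nothing here reads the result files.

```python
# Design sizes as stated in the file list: (topics, conditions, versions).
# "conditions" stands for tiers / architectures / conditions, depending on the script.
designs = {
    "number_match": (3, 4, 2),
    "prompt_arch": (3, 3, 2),
    "cross_gen": (3, 2, 2),
    "source_degradation": (3, 3, 2),
    "commensurable_bridge": (3, 2, 3),
}

# Documents per sub-experiment = topics * conditions * versions.
counts = {name: t * c * v for name, (t, c, v) in designs.items()}
total_documents = sum(counts.values())

print(counts)
print(total_documents)
```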

How to read this

  • If you want to check the claim: open analysis.md first. Each row in the claims table cites the file and the path inside the JSON that the number came from.
  • If you want to replicate: method.md describes each sub-experiment's design, the prompt architectures, and what each number-matching layer counts. The five script files hold the procedures exactly as run.
  • If you want to audit: open the result JSONs. Each contains the full output text for every document, the regex-extracted number list, and the in-source / not-in-source classification. Nothing summarized away.

What the receipts prove (and don't)

These receipts prove:

  • Source presence dominates. Across the bridge experiment (18 documents, 3 topics), source-present outputs match the source on 91% of numbers; source-absent outputs match on 45%. That is a 46 percentage-point gap with no other variable changing.
  • Prohibition outperforms monitoring 5x on xAI. T3-CURRENT (monitoring with EXTENDS labels) produced 7.71% unsourced numbers. T3-PROHIBITION (no unsourced numbers allowed) produced 1.59%. Same topics, same model, same temperature.
  • The effect replicates on Gemini. Gemini CURRENT 6.07%, Gemini PROHIBITION 1.66%. Different model family, same direction, comparable magnitude.
  • Prohibition is costless. Word count goes up under prohibition (927 vs 771 on xAI, +20%). Sourced number count goes up (620 vs 419, +48%). The model compensates by extracting more, not by saying less.
  • Partial sources still work. Removing ~50% of source sections (PARTIAL) produced 0.40% fabrication, lower than full source (1.03%). Reducing source to ~600 chars (SPARSE) produced 3.71% unsourced. The recipe degrades gracefully.
  • Stable numbers are sourced numbers. In the bridge experiment, source-present documents have 62% temporal stability across regenerations vs 25% for source-absent. Source presence stabilizes the data layer.
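
The ratios and percentage changes quoted in the list above can be re-derived from the reported aggregates. The sketch below only re-does that arithmetic on the figures as printed in this README; it does not read the result JSONs, and the variable names are illustrative.

```python
def pct_change(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Unsourced-number rates (percent) as reported above.
xai_ratio = 7.71 / 1.59        # CURRENT vs PROHIBITION on xAI
gemini_ratio = 6.07 / 1.66     # CURRENT vs PROHIBITION on Gemini

# Cost check on xAI: word count and sourced-number count under prohibition.
word_change = pct_change(927, 771)
sourced_change = pct_change(620, 419)

# Bridge experiment gaps, in percentage points.
match_gap_pp = 91 - 45
stability_gap_pp = 62 - 25

print(f"xAI ratio: {xai_ratio:.2f}x, Gemini ratio: {gemini_ratio:.2f}x")
print(f"words: {word_change:+.1f}%, sourced numbers: {sourced_change:+.1f}%")
print(f"match gap: {match_gap_pp} pp, stability gap: {stability_gap_pp} pp")
```

The xAI ratio comes out just under 5 and the Gemini ratio closer to 3.7, which is what "same direction, comparable magnitude" summarizes.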

These receipts do NOT prove:

  • That the gap holds beyond the three tested topics. Topics span remote work, internal communication, and AI-assisted development. All three have bullet-pointed source documents in the EXP-081-data/sources/ folder of the vault. Generalization to other domains is untested here.
  • That the recipe applies to non-reformulation tasks. Every test here has source material available and asks for analytical reformulation. Reasoning, strategy, and creative tasks are not measured by this kit.
  • That layers above the data layer are fixed. Source grounding stabilizes numbers (62% stability) but vocabulary, conclusions, and causal reasoning stay near same-topic baseline (~34%). The kit measures the data layer only.
  • That fabrication rates hold across model versions over time. Tested on grok-4-1-fast (xAI, March 2026) and gemini-3-flash-preview (Gemini, March 2026). As inference compute and retrieval improve, these numbers will move.

Operational scope

Each script imports its shared names from a _config module: GENERATOR_MODEL, get_generator_client, call_generator, and format_results, plus get_gemini_client and gemini_evaluate in two scripts. _config.py is a documented stub that defines those names with the right signatures and raises NotImplementedError from each function body. Replace each body with a call to your own provider's SDK to reproduce end-to-end.
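
For orientation, a stub of that shape might look roughly like the sketch below. Only the six names come from this README; every parameter list, the placeholder model string, and the error messages are assumptions, so check the real _config.py for the actual signatures before filling anything in.

```python
"""Illustrative _config-style stub. Signatures here are guesses, not the kit's."""

# Placeholder value; the kit's actual model identifier may differ.
GENERATOR_MODEL = "replace-with-your-model-id"


def get_generator_client():
    """Return a client object for your generation provider (assumed no arguments)."""
    raise NotImplementedError("Create and return your provider's SDK client here.")


def call_generator(client, prompt, **kwargs):
    """Send a prompt and return the generated text (parameters are assumed)."""
    raise NotImplementedError("Call your provider's completion endpoint here.")


def format_results(results):
    """Shape raw results for the output JSON (parameter is assumed)."""
    raise NotImplementedError("Format results for writing to disk here.")


def get_gemini_client():
    """Return a client for the second model family (assumed no arguments)."""
    raise NotImplementedError("Create and return your Gemini client here.")


def gemini_evaluate(client, text, **kwargs):
    """Run the evaluator on one output (parameters are assumed)."""
    raise NotImplementedError("Call your evaluator model here.")
```

Until a body is replaced, any script path that reaches it fails fast with NotImplementedError rather than silently skipping the API call.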

The reusable parts of each script run without any API access at all:

  • The topic definitions and source-file paths (top of each script).
  • The prompt-architecture functions (t3_current, t3_prohibition, t3_separated_pass1, t3_separated_pass2 in prompt_arch.py).
  • The extract_numbers and number_in_source helpers: the regex layer that classifies output numbers as in-source or not-in-source. Pure Python, no external dependencies.
  • The analyze_numbers aggregation that produces the per-document counts in the result JSONs.
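
To make the number-matching layer concrete, here is a minimal sketch of how such helpers could work. The function names mirror the ones listed above, but the regex, the normalization rule, and the return shape are assumptions written for illustration; the scripts in this kit may match numbers differently. The field names total_numbers and not_in_source follow the JSON paths named elsewhere in this README.

```python
import re

# Assumed pattern: digits with optional thousands separators,
# an optional decimal part, and an optional trailing percent sign.
NUMBER_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?%?")


def extract_numbers(text):
    """Return every number-like token in text, in order of appearance."""
    return NUMBER_RE.findall(text)


def normalize(token):
    """Drop thousands separators and a trailing % so '16,000' equals '16000'."""
    return token.replace(",", "").rstrip("%")


def number_in_source(number, source):
    """True if the normalized number also appears somewhere in the source text."""
    source_values = {normalize(tok) for tok in extract_numbers(source)}
    return normalize(number) in source_values


def analyze_numbers(output, source):
    """Count output numbers and those with no match in the source."""
    numbers = extract_numbers(output)
    unsourced = [n for n in numbers if not number_in_source(n, source)]
    return {
        "total_numbers": len(numbers),
        "not_in_source": len(unsourced),
        "unsourced": unsourced,
    }


# Toy strings for illustration only; not taken from the kit's sources or outputs.
toy_source = "- 13% productivity gain\n- 16,000 employees in the trial"
toy_output = "A 13% gain across 16000 employees, with 42% reporting less stress."
toy_counts = analyze_numbers(toy_output, toy_source)
print(toy_counts)
```

One design choice worth noting in this sketch: stripping the percent sign means a bare 13 in the output matches 13% in the source. That is lenient; a stricter matcher would keep the unit.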

Source documents (~2-4KB each, bullet-pointed verifiable statistics) live alongside the original scripts in the vault at EXP-081-data/sources/. They are not bundled in this kit because they summarize named third-party reports (Gallup, Stanford Bloom et al., Microsoft Work Trend Index, GitHub Octoverse, Pew Research, etc.). sources/README.md lists the underlying reports per topic so you can assemble equivalents.

What this kit is for

The point of receipts is verification. If you want to check whether the 7.7% / 1.6% / 5x numbers in the post are real, open prompt_arch_results.json, group by arch, sum numbers.total_numbers and numbers.not_in_source, and recompute. If you want to know whether the bridge experiment's 91% / 45% gap is real, open bridge_results.json and read [topic][condition].aggregate.source_match_rate for each of the three topics.
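
A recomputation along those lines could be sketched as follows. The layout is assumed: the sketch treats prompt_arch_results.json as a flat list of records with an "arch" key and a "numbers" object, and bridge_results.json as a nested topic-then-condition mapping, based only on the field paths named above. If the real files nest differently, adjust the lookups. The toy records are made up to exercise the functions and are not results from the kit.

```python
import json
from collections import defaultdict


def load_json(path):
    """Read one of the result files from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def unsourced_rate_by_arch(records):
    """Sum counts per architecture, then return percent of numbers not in source."""
    totals = defaultdict(lambda: {"total_numbers": 0, "not_in_source": 0})
    for record in records:
        bucket = totals[record["arch"]]
        bucket["total_numbers"] += record["numbers"]["total_numbers"]
        bucket["not_in_source"] += record["numbers"]["not_in_source"]
    return {
        arch: 100.0 * b["not_in_source"] / b["total_numbers"]
        for arch, b in totals.items()
        if b["total_numbers"]  # skip architectures with no numbers at all
    }


def bridge_match_rates(bridge):
    """Collect [topic][condition].aggregate.source_match_rate into a nested dict."""
    return {
        topic: {
            cond: cell["aggregate"]["source_match_rate"]
            for cond, cell in conditions.items()
        }
        for topic, conditions in bridge.items()
    }


# Toy input (hypothetical values and condition names), for illustration only.
toy_records = [
    {"arch": "CURRENT", "numbers": {"total_numbers": 10, "not_in_source": 2}},
    {"arch": "CURRENT", "numbers": {"total_numbers": 10, "not_in_source": 0}},
    {"arch": "PROHIBITION", "numbers": {"total_numbers": 20, "not_in_source": 1}},
]
toy_bridge = {
    "topic_a": {
        "source_present": {"aggregate": {"source_match_rate": 0.9}},
        "source_absent": {"aggregate": {"source_match_rate": 0.5}},
    }
}

toy_rates = unsourced_rate_by_arch(toy_records)
toy_match = bridge_match_rates(toy_bridge)
print(toy_rates)
print(toy_match)
```

Against the real files, the calls would be unsourced_rate_by_arch(load_json("prompt_arch_results.json")) and bridge_match_rates(load_json("bridge_results.json")). Note that the rate is computed from summed counts per architecture, not by averaging per-document percentages, which matches the "sum, then recompute" instruction above.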

The patterns in these scripts work without the LLM evaluator that originally formatted the result JSONs. The number-matching layer is regex against source text, runs in milliseconds, and is the load-bearing measurement for every claim above.

Related receipts

The Fabrication Architecture (../fabrication-architecture/) carries receipts for the foundational temporal-instability claim (77-100 percent unsourced) that this recipe addresses. Reading the two together gives the problem and the fix on the same dataset shape.

Trust Signals Are Inverted (../trust-signals-are-inverted/) covers what source grounding does NOT change: the surface signals fluent readers use to assess trustworthiness. Source grounding makes output checkable; the trust-inversion finding shows that without checking, fabricated and sourced output read as equally or more trustworthy.

Errata

Found a problem with the data, the method, or the analysis? Send it via LinkedIn DM (linked from /about). Corrections get published on the record at /record, with attribution.
