# Receipts: Source Conditioning

These files are the raw artifacts behind the finding published at
<https://blog.clarethium.com/source-conditioning>.

The published claim is that **putting real data in the prompt is the
single largest variable affecting how often AI fabricates numbers**, and
that **prohibition outperforms monitoring** by roughly 5x.
This folder contains the five sub-experiments that establish those
numbers across two model families and 90 documents.

## What's here

| File | What it is |
|---|---|
| [`method.md`](./method.md) | The five sub-experiment designs in human-readable form. Topics, conditions, prompts, what each measurement counts. |
| [`analysis.md`](./analysis.md) | The aggregate numbers and how each maps to a claim in the published post. |
| [`number_match.py`](./number_match.py) | Sub-experiment 1: 3 topics × 4 source-architecture tiers × 2 versions = 24 documents. Establishes the source-vs-no-source magnitude. |
| [`number_match_results.json`](./number_match_results.json) | All 24 outputs with per-document number counts and unsourced details. |
| [`prompt_arch.py`](./prompt_arch.py) | Sub-experiment 2: 3 topics × 3 prompt architectures (CURRENT, PROHIBITION, SEPARATED) × 2 versions = 18 documents (xAI). Establishes prohibition > monitoring. |
| [`prompt_arch_results.json`](./prompt_arch_results.json) | All 18 outputs with full text and per-document number counts. |
| [`cross_gen.py`](./cross_gen.py) | Sub-experiment 3: same prohibition test on Gemini. 3 topics × 2 architectures × 2 versions = 12 documents. |
| [`cross_gen_results.json`](./cross_gen_results.json) | All 12 outputs with full text and per-document number counts. |
| [`source_degradation.py`](./source_degradation.py) | Sub-experiment 4: prohibition recipe stress-tested against partial and sparse sources. 3 topics × 3 conditions × 2 versions = 18 documents. |
| [`source_degradation_results.json`](./source_degradation_results.json) | All 18 outputs with grounded / parametric / fabricated counts. |
| [`commensurable_bridge.py`](./commensurable_bridge.py) | Sub-experiment 5: source-match and temporal-stability measured on the same documents under both source-present and source-absent conditions. 3 topics × 2 conditions × 3 versions = 18 documents. |
| [`bridge_results.json`](./bridge_results.json) | All 18 outputs with cross-tabulated source match × temporal stability. |

## How to read this

- **If you want to check the claim:** open [`analysis.md`](./analysis.md)
  first. Each row in the claims table cites the file and the path
  inside the JSON that the number came from.
- **If you want to replicate:** [`method.md`](./method.md) describes
  each sub-experiment's design, the prompt architectures, and what each
  number-matching layer counts. The five script files hold the
  procedures exactly as run.
- **If you want to audit:** open the result JSONs. Each contains the
  full output text for every document, the regex-extracted number list,
  and the in-source / not-in-source classification. Nothing summarized
  away.

## What the receipts prove (and don't)

These receipts prove:

- **Source presence dominates.** Across the bridge experiment (18
  documents, 3 topics), source-present outputs match the source on
  91% of numbers; source-absent outputs match on 45%. That is a 46
  percentage-point gap with no other variable changing.
- **Prohibition outperforms monitoring about 5x on xAI.** T3-CURRENT
  (monitoring with EXTENDS labels) produced 7.71% unsourced numbers.
  T3-PROHIBITION (no unsourced numbers allowed) produced 1.59%, a
  4.8x reduction. Same topics, same model, same temperature.
- **The effect replicates on Gemini.** Gemini CURRENT 6.07%,
  Gemini PROHIBITION 1.66% (about 3.7x). Different model family, same
  direction, somewhat smaller magnitude.
- **Prohibition is costless.** Word count goes up under prohibition
  (927 vs 771 on xAI, +20%). Sourced number count goes up (620 vs 419,
  +48%). The model compensates by extracting more, not by saying less.
- **Partial sources still work.** Removing ~50% of source sections
  (PARTIAL) produced 0.40% fabrication, lower than full source (1.03%).
  Reducing source to ~600 chars (SPARSE) produced 3.71% unsourced. The
  recipe degrades gracefully.
- **Stable numbers are sourced numbers.** In the bridge experiment,
  source-present documents have 62% temporal stability across
  regenerations vs 25% for source-absent. Source presence stabilizes
  the data layer.
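
The derived figures in the bullets above (ratios, percentage changes, the percentage-point gap) can be rechecked with plain arithmetic. The sketch below only re-derives them from the rates and counts quoted in this README; it does not read the result JSONs, and the helper names `ratio` and `pct_change` are illustrative.

```python
# Inputs are copied from the bullets above; nothing here touches the JSON files.

def ratio(monitoring: float, prohibition: float) -> float:
    """How many times larger the monitoring rate is than the prohibition rate."""
    return monitoring / prohibition


def pct_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


xai_ratio = ratio(7.71, 1.59)          # T3-CURRENT vs T3-PROHIBITION on xAI
gemini_ratio = ratio(6.07, 1.66)       # CURRENT vs PROHIBITION on Gemini
word_change = pct_change(927, 771)     # word count under prohibition vs monitoring
number_change = pct_change(620, 419)   # sourced-number count, same comparison
bridge_gap = 91 - 45                   # percentage points, source-present vs absent

print(f"xAI ratio:            {xai_ratio:.2f}x")
print(f"Gemini ratio:         {gemini_ratio:.2f}x")
print(f"Word count change:    {word_change:+.0f}%")
print(f"Sourced number change:{number_change:+.0f}%")
print(f"Bridge gap:           {bridge_gap} points")
```

The xAI ratio comes out a little under 5 and the Gemini ratio a little under 4, so "5x" is a rounded figure for xAI specifically.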

These receipts do NOT prove:

- **That the gap holds beyond the three tested topics.** Topics span
  remote work, internal communication, and AI-assisted development.
  All three have bullet-pointed source documents in the
  `EXP-081-data/sources/` folder of the vault. Generalization to other
  domains is untested here.
- **That the recipe applies to non-reformulation tasks.** Every test
  here has source material available and asks for analytical
  reformulation. Reasoning, strategy, and creative tasks are not
  measured by this kit.
- **That layers above the data layer are fixed.** Source grounding
  stabilizes numbers (62% stability) but vocabulary, conclusions, and
  causal reasoning stay near same-topic baseline (~34%). The kit
  measures the data layer only.
- **That fabrication rates hold across model versions over time.**
  Tested on `grok-4-1-fast` (xAI, March 2026) and `gemini-3-flash-preview`
  (Gemini, March 2026). As inference compute and retrieval improve,
  expect these numbers to move.

## Operational scope

Each script imports four names from a `_config` module:
`GENERATOR_MODEL`, `get_generator_client`, `call_generator`, and
`format_results`. Two scripts also import `get_gemini_client` and
`gemini_evaluate`. [`_config.py`](./_config.py) is a documented stub
that defines those names with the right signatures and raises
`NotImplementedError` from each function body. Replace each body with
a call to your own provider's SDK to reproduce end-to-end.
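
As a rough illustration of the stub shape described above, here is a minimal sketch of two of the names plus an offline stand-in for dry runs. The parameter lists (`client`, `prompt`), the placeholder model id, and the `make_canned_generator` helper are all assumptions made for this example; copy the real signatures from `_config.py` before wiring anything up.

```python
from typing import Any, Callable

# Hypothetical placeholder; the real value lives in _config.py.
GENERATOR_MODEL = "placeholder-model-id"


def get_generator_client() -> Any:
    """Stub: return whatever client object your provider's SDK uses."""
    raise NotImplementedError("construct your provider's client here")


def call_generator(client: Any, prompt: str) -> str:
    """Stub with an ASSUMED signature: send the prompt, return the output text."""
    raise NotImplementedError("call your provider's SDK here")


def make_canned_generator(reply: str) -> Callable[[Any, str], str]:
    """Hypothetical helper: an offline stand-in with the same assumed shape.

    Handy for exercising the regex layer without any API access.
    """
    def canned(client: Any, prompt: str) -> str:
        return reply
    return canned


# Made-up sample text, not an output from any experiment in this kit.
CANNED = "Sample output mentioning 12% and 3,400 as placeholder numbers."
offline = make_canned_generator(CANNED)

try:
    call_generator(None, "hello")
except NotImplementedError as exc:
    print(f"stub not filled in yet: {exc}")

print(offline(None, "hello"))
```

Swapping `call_generator` for a canned stand-in like this keeps the number-matching code testable before any provider credentials exist.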

The reusable parts of each script run without any API access at all:

- The topic definitions and source-file paths (top of each script).
- The prompt-architecture functions (`t3_current`, `t3_prohibition`,
  `t3_separated_pass1`, `t3_separated_pass2` in `prompt_arch.py`).
- The `extract_numbers` and `number_in_source` helpers: the regex
  layer that classifies output numbers as in-source or not-in-source.
  Pure Python, no external dependencies.
- The `analyze_numbers` aggregation that produces the per-document
  counts in the result JSONs.
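
To make the idea of a regex number-matching layer concrete, here is a minimal sketch of the technique. It is not the kit's implementation: the pattern, the normalization rule, and the `sketch_*` function names are assumptions for illustration, and the real `extract_numbers`, `number_in_source`, and `analyze_numbers` in the scripts may treat units, ranges, and formatting differently. The sample source and output strings are made up.

```python
import re
from typing import Dict, List, Set

# Illustrative pattern: comma-grouped thousands or plain integers/decimals,
# each with an optional trailing percent sign.
NUMBER_RE = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?%?|\d+(?:\.\d+)?%?")


def normalize(token: str) -> str:
    """Drop thousands separators and a trailing % so '1,200' equals '1200'."""
    return token.replace(",", "").rstrip("%")


def sketch_extract_numbers(text: str) -> List[str]:
    """Return every number in the text, normalized, in order of appearance."""
    return [normalize(m.group()) for m in NUMBER_RE.finditer(text)]


def sketch_number_in_source(number: str, source_numbers: Set[str]) -> bool:
    """A number counts as sourced if the same normalized value is in the source."""
    return number in source_numbers


def sketch_analyze(output: str, source: str) -> Dict[str, object]:
    """Count output numbers and how many of them are absent from the source."""
    source_numbers = set(sketch_extract_numbers(source))
    found = sketch_extract_numbers(output)
    missing = [n for n in found if not sketch_number_in_source(n, source_numbers)]
    return {
        "total_numbers": len(found),
        "not_in_source": len(missing),
        "unsourced": missing,
    }


# Made-up strings for demonstration only.
source = "- 58% of respondents reported hybrid schedules\n- Sample size: 1,200"
output = "Of 1,200 respondents, 58% were hybrid and 73% wanted more."
print(sketch_analyze(output, source))
```

Exact string matching after normalization is the simplest possible rule; it treats `58` and `58.0` as different numbers, which is one of the places a production matcher would need a more careful policy.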

Source documents (~2-4KB each, bullet-pointed verifiable statistics)
live alongside the original scripts in the vault at
`EXP-081-data/sources/`. They are not bundled in this kit because they
summarize named third-party reports (Gallup, Stanford Bloom et al.,
Microsoft Work Trend Index, GitHub Octoverse, Pew Research, etc.).
[`sources/README.md`](./sources/README.md) lists the underlying reports
per topic so you can assemble equivalents.

## What this kit is for

The point of receipts is verification. If you want to check whether
the 7.7% / 1.6% / 5x numbers in the post are real, open
`prompt_arch_results.json`, group by `arch`, sum `numbers.total_numbers`
and `numbers.not_in_source`, and recompute. If you want to know
whether the bridge experiment's 91% / 45% gap is real, open
`bridge_results.json` and read `[topic][condition].aggregate.source_match_rate`
for each of the three topics.
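
The two recomputations described above can be sketched as follows. The field names `arch`, `numbers.total_numbers`, `numbers.not_in_source`, and `aggregate.source_match_rate` come from this README; everything else is an assumption, in particular that the per-document records in `prompt_arch_results.json` can be iterated as a flat list of dicts and that `bridge_results.json` is a topic-to-condition mapping at the top level. Adjust the loading step to the actual file layout. The demo records are made up.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def load_json(path: str):
    """Read one result file; the caller decides how to reach the records."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def unsourced_rate_by_arch(records: Iterable[dict]) -> Dict[str, float]:
    """Pool counts per architecture, then divide (percent of unsourced numbers)."""
    totals: Dict[str, int] = defaultdict(int)
    unsourced: Dict[str, int] = defaultdict(int)
    for rec in records:
        arch = rec["arch"]
        totals[arch] += rec["numbers"]["total_numbers"]
        unsourced[arch] += rec["numbers"]["not_in_source"]
    return {a: 100 * unsourced[a] / totals[a] for a in totals if totals[a]}


def bridge_match_rates(bridge: dict) -> Dict[str, Dict[str, float]]:
    """Collect [topic][condition].aggregate.source_match_rate into a nested dict."""
    return {
        topic: {
            cond: data["aggregate"]["source_match_rate"]
            for cond, data in conditions.items()
        }
        for topic, conditions in bridge.items()
    }


# Made-up records that only demonstrate the arithmetic.
demo_records = [
    {"arch": "CURRENT", "numbers": {"total_numbers": 50, "not_in_source": 4}},
    {"arch": "CURRENT", "numbers": {"total_numbers": 50, "not_in_source": 2}},
    {"arch": "PROHIBITION", "numbers": {"total_numbers": 80, "not_in_source": 1}},
]
rates = unsourced_rate_by_arch(demo_records)
print(rates)

# Against the real files (layout assumed, see above):
# rates = unsourced_rate_by_arch(load_json("prompt_arch_results.json"))
# matches = bridge_match_rates(load_json("bridge_results.json"))
```

Pooling the counts before dividing follows the "sum, then recompute" instruction above; averaging per-document rates instead would weight short documents more heavily and can give a different figure.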

The patterns in these scripts work without the LLM evaluator that
originally formatted the result JSONs. The number-matching layer is a
regex match against source text, runs in milliseconds, and is the
load-bearing measurement for every claim above.

## Related receipts

[The Fabrication Architecture](/receipts/fabrication-architecture)
([`../fabrication-architecture/`](../fabrication-architecture/))
carries receipts for the foundational temporal-instability claim
(77-100 percent unsourced) that this recipe addresses. Reading the
two together gives the problem and the fix on the same dataset
shape.

[Trust Signals Are Inverted](/receipts/trust-signals-are-inverted)
([`../trust-signals-are-inverted/`](../trust-signals-are-inverted/))
covers what source grounding does NOT change: the surface signals
fluent readers use to assess trustworthiness. Source grounding makes
output checkable; the trust-inversion finding shows that, without
checking, fabricated output reads as at least as trustworthy as
sourced output.

## Errata

Found a problem with the data, the method, or the analysis? Send it
via LinkedIn DM (linked from
[/about](https://blog.clarethium.com/about)). Corrections get
published on the record at [/record](https://blog.clarethium.com/record),
with attribution.
