# Method

The published finding rests on five sub-experiments. Each one tests
a different facet of the source-conditioning claim. Together they
cover 90 documents, two model families (xAI and Gemini), three topics,
and four prompt architectures.

## Common design elements

All five sub-experiments share:

- **Topics.** Three: remote and hybrid work effectiveness, internal
  communication as organizational infrastructure, AI-assisted
  development workflows. Each has a 2-4KB source document with
  bullet-pointed verifiable statistics drawn from named third-party
  reports.
- **Generation parameters.** `grok-4-1-fast` (xAI) for
  sub-experiments 1, 2, 4, 5. `gemini-3-flash-preview` for
  sub-experiment 3. Default temperature in each provider's SDK.
- **Output target.** Roughly 1000 words per document, requested as
  an analytical brief on the topic with the source material in
  context.
- **Measurement.** Programmatic regex extraction of every number
  in the output (percentages, dollar amounts, decimals, integers),
  then exact-string matching against the source document. Year
  numbers (2019-2030) and explicit word-count statements are
  excluded. Zero LLM judgment.

The number-matching code lives in `number_match.py` lines 34-200
and is reproduced verbatim inside each of the other four scripts so
each can run standalone.
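
As a rough illustration of the measurement step, the sketch below
extracts numeric tokens with a regex, drops bare years in 2019-2030
and numbers immediately followed by "words", and checks each remaining
token against the set of tokens extracted from the source. It is not
the code in `number_match.py`: the pattern, the function names
(`extract_numbers`, `unsourced_rate`), and the choice to compare
extracted token sets rather than raw substrings are all assumptions.

```python
import re

# Assumed pattern: optional "$", digits with optional thousands
# separators, optional decimal part, optional "%". The lookbehind keeps
# it from matching digits glued to a word (e.g. the "3" in "T3").
NUMBER_RE = re.compile(r"(?<![\w.])\$?\d+(?:,\d{3})*(?:\.\d+)?%?")

# Assumed heuristic for "explicit word-count statements": a number
# directly followed by "word" or "words".
WORD_COUNT_RE = re.compile(r"\s*-?\s*words?\b", re.IGNORECASE)


def extract_numbers(text):
    """Return numeric tokens in order, minus years and word counts."""
    tokens = []
    for match in NUMBER_RE.finditer(text):
        token = match.group()
        # Exclude bare year numbers in the 2019-2030 range.
        if token.isdigit() and 2019 <= int(token) <= 2030:
            continue
        # Exclude numbers that state a word count ("1000 words").
        if WORD_COUNT_RE.match(text, match.end()):
            continue
        tokens.append(token)
    return tokens


def unsourced_rate(output_text, source_text):
    """Return (rate, unsourced_tokens) using exact-string token matching."""
    source_tokens = set(extract_numbers(source_text))
    found = extract_numbers(output_text)
    if not found:
        return 0.0, []
    unsourced = [t for t in found if t not in source_tokens]
    return len(unsourced) / len(found), unsourced
```

Because matching is on exact strings, `23%` in the output does not
match `23 percent` or `0.23` in the source; any normalization beyond
this would be a departure from the method as described.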

## Sub-experiment 1: Source-architecture tiers

Script: `number_match.py`. Result: `number_match_results.json`.

Tests how unsourced-number rate varies across four prompt structures:

- **T1-BASIC.** Source pasted, simple analytical task.
- **T2-STANDARD.** Source pasted, plus structural failure modes
  ("structure survives topic swap = generic", "claims hedge without
  conditions = uncommitted") that the model is asked to avoid.
- **T3-REFINED.** T2 plus an explicit `EXTENDS` labeling
  instruction: any factual claim must trace to source OR be labeled
  `EXTENDS` (beyond source) with confidence H/M/L.
- **T4-AGENTIC.** T3 plus a self-verification pass at the end of
  the document.

3 topics × 4 tiers × 2 versions = 24 documents.
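
The run grid can be enumerated as a Cartesian product. This is a
minimal sketch: the topic slugs and the dictionary layout are
hypothetical, and only the tier names and counts come from the text
above.

```python
from itertools import product

# Hypothetical slugs for the three topics named under "Common design elements".
TOPICS = ["remote_hybrid_work", "internal_communication", "ai_assisted_dev"]
TIERS = ["T1-BASIC", "T2-STANDARD", "T3-REFINED", "T4-AGENTIC"]
VERSIONS = [1, 2]

# One entry per document to generate: every topic under every tier, twice.
runs = [
    {"topic": topic, "tier": tier, "version": version}
    for topic, tier, version in product(TOPICS, TIERS, VERSIONS)
]

print(len(runs))  # 3 * 4 * 2 documents
```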

The point of this design is not to crown a winner but to establish
that all four tiers fall within a narrow band of 1-10% unsourced
numbers when the source is in context. The variable that matters is
whether the source is present, not which tier of prompt architecture
the operator picks.

## Sub-experiment 2: Prohibition vs monitoring (xAI)

Script: `prompt_arch.py`. Result: `prompt_arch_results.json`.

Tests whether the residual unsourced-number rate at T3 (the most
constrained of the four tiers in sub-experiment 1) is fixable by
changing the instruction architecture rather than adding more
verification.

Three architectures, all with the same source material in context:

- **T3-CURRENT.** Baseline. Inline EXTENDS labeling: "every claim
  traces to source OR is labeled EXTENDS with confidence H/M/L."
  This is monitoring: the model is asked to flag its own unsourced
  claims as it generates them.
- **T3-PROHIBITION.** "You may ONLY use specific numbers,
  percentages, and dollar amounts that appear in the SOURCE
  MATERIAL above. For any quantitative point where the source
  does not provide a number, you MUST use qualitative language
  instead." This is prohibition: the model is told not to generate
  unsourced numbers in the first place.
- **T3-SEPARATED.** Two-pass. Pass 1 generates the analytical
  skeleton with `[SOURCE: ...]` and `[QUAL: ...]` placeholders.
  Pass 2 fills in source-traceable numbers from the source text and
  converts unsourced placeholders to qualitative language.

3 topics × 3 architectures × 2 versions = 18 documents.

The full prompt strings are in `prompt_arch.py` lines 120-229.
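
To make the difference between the two single-pass architectures
concrete, here is a hedged sketch of how such prompts might be
assembled. The two instruction strings are the excerpts quoted above;
the `build_prompt` helper, the `SOURCE MATERIAL` header placement, and
the task sentence are assumptions, not the strings in
`prompt_arch.py`. T3-SEPARATED is left out because it needs two model
calls.

```python
# Instruction excerpts as quoted in this section.
MONITORING = (
    "Every claim traces to source OR is labeled EXTENDS "
    "with confidence H/M/L."
)
PROHIBITION = (
    "You may ONLY use specific numbers, percentages, and dollar amounts "
    "that appear in the SOURCE MATERIAL above. For any quantitative point "
    "where the source does not provide a number, you MUST use qualitative "
    "language instead."
)

INSTRUCTIONS = {"T3-CURRENT": MONITORING, "T3-PROHIBITION": PROHIBITION}


def build_prompt(source_text, topic, architecture):
    """Assemble a single-pass prompt: source first, then task, then rule."""
    if architecture not in INSTRUCTIONS:
        raise ValueError(f"unknown single-pass architecture: {architecture}")
    return "\n\n".join(
        [
            "SOURCE MATERIAL\n" + source_text,
            # Hypothetical task wording; the real task text is not shown here.
            f"Write an analytical brief of roughly 1000 words on: {topic}.",
            INSTRUCTIONS[architecture],
        ]
    )
```

Holding the source and task text fixed and swapping only the final
instruction keeps the comparison to one variable, which is the design
the three architectures rely on.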

## Sub-experiment 3: Cross-generator replication

Script: `cross_gen.py`. Result: `cross_gen_results.json`.

Tests whether the finding from sub-experiment 2 (prohibition
outperforms monitoring) replicates across model families. Same
topics, same prompt architectures (T3-CURRENT and T3-PROHIBITION
only, omitting SEPARATED to halve the cost), different model
(`gemini-3-flash-preview` instead of `grok-4-1-fast`).

3 topics × 2 architectures × 2 versions = 12 documents.

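The replication target was the xAI gap from sub-experiment 2: the
unsourced-number rate fell from 7.71% under T3-CURRENT to 1.59% under
T3-PROHIBITION (4.86x). What the Gemini data shows is reported in
`analysis.md`.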

## Sub-experiment 4: Source-quality degradation (kill test)

Script: `source_degradation.py`. Result: `source_degradation_results.json`.

Tests whether the prohibition recipe degrades gracefully when source
material is partial or sparse. All other sub-experiments use
"ideal" sources: 2-4KB structured bullet-pointed summaries with
explicit numbers. Real analytical work uses degraded sources.

Three source conditions, all under T3-PROHIBITION (the operational
recipe):

- **FULL.** Complete source. Serves as the baseline here and
  replicates the T3-PROHIBITION condition of sub-experiment 2.
- **PARTIAL.** Approximately 50% of source sections removed,
  specifically the sections most relevant to the topic prompt.
  For example, the model is asked to write about productivity when
  the productivity data is missing.
- **SPARSE.** Source condensed to 5-6 key bullet points (~600 chars).
  Covers the topic but with minimal detail.

3 topics × 3 conditions × 2 versions = 18 documents.

Each number in the output is checked against the FULL source as well
as the degraded version actually provided, then placed in one of
three buckets:

- **GROUNDED.** Number appears in the provided (possibly degraded)
  source AND in the full source.
- **PARAMETRIC.** Number appears in the full source but NOT in the
  provided degraded source. The model could only have produced it
  from training data.
- **FABRICATED.** Number appears in neither the provided source nor
  the full source.
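
The three buckets reduce to two set-membership checks. The sketch
below assumes the numbers have already been extracted as exact-string
tokens and that every number in a degraded source also appears in the
full source; the function names `classify` and `tally` are
hypothetical.

```python
from collections import Counter


def classify(token, provided_numbers, full_numbers):
    """Assign one output number to GROUNDED, PARAMETRIC, or FABRICATED."""
    if token in provided_numbers and token in full_numbers:
        return "GROUNDED"
    if token in full_numbers:
        # In the full source but withheld from the prompt.
        return "PARAMETRIC"
    # Assumes provided_numbers is a subset of full_numbers, so anything
    # reaching this point is in neither source.
    return "FABRICATED"


def tally(output_numbers, provided_numbers, full_numbers):
    """Count how many output numbers land in each bucket."""
    return Counter(
        classify(token, provided_numbers, full_numbers)
        for token in output_numbers
    )
```

Under the FULL condition the provided and full sets are identical, so
PARAMETRIC is empty by construction and only GROUNDED and FABRICATED
can occur.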

The kill test was: does PARTIAL or SPARSE break the recipe? See
`analysis.md` for the answer.

## Sub-experiment 5: Commensurable bridge

Script: `commensurable_bridge.py`. Result: `bridge_results.json`.

Tests source presence as a single variable, holding the prompt
architecture (T3-CURRENT) fixed and toggling only whether source
material appears in the prompt. Closes the measurement gap between
prior temporal-consistency experiments and the source-matching
experiments by applying BOTH measurements to the same documents.

Two conditions:

- **Source-present.** The prompt includes the topic source document.
- **Source-absent.** Same prompt, source material removed. The
  model must generate from parametric memory.

3 topics × 2 conditions × 3 versions = 18 documents.

Two measurements per document:

- **Source-match rate.** The fraction of the document's numbers that
  appear in the topic source document. (For source-absent documents,
  any match is coincidental: the model's parametric memory happens
  to overlap with the source.)
- **Temporal stability.** Across the three versions of each
  (topic × condition) cell, the fraction of unique numbers that
  appear in more than one version.
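
Both measurements are simple ratios over extracted number tokens. A
minimal sketch, assuming tokens are compared as exact strings and
that the source-match rate counts every occurrence of a number rather
than unique values (the text does not say which); the function names
are hypothetical.

```python
from collections import Counter


def source_match_rate(doc_numbers, source_numbers):
    """Fraction of a document's number tokens found in the source."""
    if not doc_numbers:
        return 0.0
    matched = sum(1 for token in doc_numbers if token in source_numbers)
    return matched / len(doc_numbers)


def temporal_stability(versions):
    """Fraction of unique numbers that appear in more than one version.

    `versions` is a list of token lists, one per regenerated document
    for the same topic and condition.
    """
    appearances = Counter()
    for tokens in versions:
        # Count each number once per version, however often it repeats.
        appearances.update(set(tokens))
    if not appearances:
        return 0.0
    stable = sum(1 for count in appearances.values() if count > 1)
    return stable / len(appearances)
```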

The cross-tabulation of source-matched × temporally-stable is the
convergent-validity check: numbers that come from the source
should also be the numbers that stabilize across regenerations.
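
One way to build that cross-tabulation is a 2×2 count over the unique
numbers in a (topic × condition) cell. This is a sketch under the
same exact-string assumption as the measurement code; `cross_tab` and
its key layout are hypothetical.

```python
from collections import Counter


def cross_tab(versions, source_numbers):
    """Count unique numbers by (source_matched, temporally_stable).

    `versions` is a list of token lists, one per regenerated document.
    Returns a Counter keyed by a pair of booleans.
    """
    appearances = Counter()
    for tokens in versions:
        appearances.update(set(tokens))  # one count per version

    table = Counter()
    for token, count in appearances.items():
        matched = token in source_numbers
        stable = count > 1
        table[(matched, stable)] += 1
    return table
```

If the convergent-validity expectation holds, most of the mass should
sit in the `(True, True)` and `(False, False)` cells, with few
numbers that are source-matched but unstable or stable but unsourced.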

## What the kit doesn't include

The published post mentions eight sub-experiments; this kit packages
five. The three not included here are:

- **Pilot run** (`exp081_pilot.py` in the vault). Initial
  feasibility test on a single topic. Superseded by sub-experiment 1.
- **Full multi-topic run** (`exp081_full.py`). Earlier version of
  sub-experiment 1 with different topic selection. Superseded.
- **Stability gradient** (`exp081h_stability_gradient.py`).
  Four-layer temporal stability decomposition on prohibition outputs
  (data points, headings, vocabulary, conclusions). Backs the
  "data layer only" caveat in the post but does not produce a
  number used in the published claims.

The five included here are the load-bearing sub-experiments for
the published numbers.
