# Analysis: what the numbers show

This document walks through the aggregate numbers across the five
sub-experiments and maps them to claims in the published post.

Every aggregate below was recomputed from the raw result JSONs at
full precision. Rounding to the values quoted in the post is noted
where it differs from the underlying number by more than 0.05.

## Sub-experiment 1: Source-architecture tiers

Source: `number_match_results.json → results[*]`, grouped by `tier`.

| Tier         | Documents | Total numbers | In source | Not in source | Unsourced rate |
|:-------------|----------:|--------------:|----------:|--------------:|---------------:|
| T1-BASIC     |         6 |           473 |       465 |             8 |          1.69% |
| T2-STANDARD  |         6 |           457 |       445 |            12 |          2.63% |
| T3-REFINED   |         6 |           517 |       464 |            53 |         10.25% |
| T4-AGENTIC   |         6 |           496 |       453 |            43 |          8.67% |

The 1.7% to 10.3% range across tiers is what the post means by "from
roughly half the output to under 10 percent with source material."
Three of the four tiers fall under 10%; T3-REFINED (10.25%) sits just
above it. The "roughly half" end of that statement is established by
sub-experiment 5 (the source-absent condition, where about 54% of
numbers are not in the source).
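As a quick check on the tier table, the unsourced rates can be re-derived from the counts alone. This is a minimal sketch that hardcodes the (total, not-in-source) pairs copied from the table rather than reading `number_match_results.json`; the names `tier_counts` and `unsourced_rate` are illustrative and not part of the kit.

```python
# Counts copied from the tier table: (total numbers, not in source).
tier_counts = {
    "T1-BASIC": (473, 8),
    "T2-STANDARD": (457, 12),
    "T3-REFINED": (517, 53),
    "T4-AGENTIC": (496, 43),
}


def unsourced_rate(total, not_in_source):
    """Percentage of extracted numbers that do not appear in the source."""
    return 100.0 * not_in_source / total


tier_rates = {tier: unsourced_rate(*counts) for tier, counts in tier_counts.items()}

# Tiers whose pooled rate is at or above the 10 percent mark quoted in the post.
tiers_at_or_over_10 = [tier for tier, rate in tier_rates.items() if rate >= 10.0]

for tier, rate in tier_rates.items():
    print(f"{tier:<12} {rate:5.2f}%")
print("at or above 10%:", tiers_at_or_over_10)
```

Running it reproduces the four rates in the table and shows that only T3-REFINED reaches the 10% line.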

## Sub-experiment 2: Prohibition vs monitoring (xAI)

Source: `prompt_arch_results.json → findings.results[*]`, grouped by `arch`.

| Architecture     | Documents | Total numbers | Not in source | Unsourced rate | Avg words |
|:-----------------|----------:|--------------:|--------------:|---------------:|----------:|
| T3-CURRENT       |         6 |           454 |            35 |          7.71% |       771 |
| T3-PROHIBITION   |         6 |           630 |            10 |          1.59% |       927 |
| T3-SEPARATED     |         6 |           278 |             7 |          2.52% |     1,003 |

CURRENT to PROHIBITION ratio: 4.86x. The post rounds to "5x." The
direction and order of magnitude are unambiguous.

PROHIBITION produces 156 more words (+20%) and 201 more sourced
numbers (+48%) than CURRENT. This is the "compensates by extracting
more, not by saying less" claim in the post.
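The ratio and the compensation figures follow directly from the table counts. The sketch below hardcodes the CURRENT and PROHIBITION rows; the dictionary layout and helper names are illustrative only.

```python
# Rows copied from the sub-experiment 2 table.
current = {"total": 454, "not_in_source": 35, "avg_words": 771}
prohibition = {"total": 630, "not_in_source": 10, "avg_words": 927}


def rate(row):
    """Unsourced rate as a fraction of all extracted numbers."""
    return row["not_in_source"] / row["total"]


def sourced(row):
    """Count of numbers that were found in the source."""
    return row["total"] - row["not_in_source"]


ratio = rate(current) / rate(prohibition)
gap_pp = 100 * (rate(current) - rate(prohibition))
extra_words = prohibition["avg_words"] - current["avg_words"]
extra_words_pct = 100 * extra_words / current["avg_words"]
extra_sourced = sourced(prohibition) - sourced(current)
extra_sourced_pct = 100 * extra_sourced / sourced(current)

print(f"ratio {ratio:.2f}x, gap {gap_pp:.2f} pp")
print(f"words +{extra_words} ({extra_words_pct:.0f}%)")
print(f"sourced numbers +{extra_sourced} ({extra_sourced_pct:.0f}%)")
```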

## Sub-experiment 3: Cross-generator replication (Gemini)

Source: `cross_gen_results.json → findings.results[*]`, grouped by `arch`.

| Architecture     | Documents | Total numbers | Not in source | Unsourced rate | Avg words |
|:-----------------|----------:|--------------:|--------------:|---------------:|----------:|
| T3-CURRENT       |         6 |           247 |            15 |          6.07% |     1,099 |
| T3-PROHIBITION   |         6 |           302 |             5 |          1.66% |     1,145 |

Gemini ratio: 3.67x. xAI ratio was 4.86x. Both generators land between
1.6% and 1.7% under prohibition.

The compensation pattern replicates: PROHIBITION on Gemini also
produces more numbers (302 vs 247, +22%) and slightly more words
(1,145 vs 1,099, +4%). Same direction as xAI, with a smaller word-count
effect.
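To compare the two generators side by side, the same arithmetic can be run over both sets of counts. This sketch copies the (total, not-in-source) pairs from the sub-experiment 2 and 3 tables; the `counts` and `summary` structures are illustrative, not the layout of the result JSONs.

```python
# (total numbers, not in source) per architecture, copied from the two tables.
counts = {
    "xAI": {"CURRENT": (454, 35), "PROHIBITION": (630, 10)},
    "Gemini": {"CURRENT": (247, 15), "PROHIBITION": (302, 5)},
}

summary = {}
for generator, archs in counts.items():
    cur_total, cur_unsourced = archs["CURRENT"]
    pro_total, pro_unsourced = archs["PROHIBITION"]
    cur_rate = cur_unsourced / cur_total
    pro_rate = pro_unsourced / pro_total
    summary[generator] = {
        # How many times higher the unsourced rate is under CURRENT.
        "ratio": cur_rate / pro_rate,
        # PROHIBITION unsourced rate, in percent.
        "prohibition_pct": 100 * pro_rate,
        # Growth in total extracted numbers under PROHIBITION, in percent.
        "number_growth_pct": 100 * (pro_total - cur_total) / cur_total,
    }

for generator, row in summary.items():
    print(
        f"{generator:<7} ratio {row['ratio']:.2f}x  "
        f"prohibition {row['prohibition_pct']:.2f}%  "
        f"numbers +{row['number_growth_pct']:.0f}%"
    )
```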

## Sub-experiment 4: Source-quality degradation

Source: `source_degradation_results.json → results[*]`, grouped by `condition`.

| Condition | Documents | Total | Grounded | Parametric | Fabricated | Fab rate | Unsourced rate |
|:----------|----------:|------:|---------:|-----------:|-----------:|---------:|---------------:|
| FULL      |         6 |   583 |      577 |          0 |          6 |    1.03% |          1.03% |
| PARTIAL   |         6 |   506 |      504 |          0 |          2 |    0.40% |          0.40% |
| SPARSE    |         6 |   350 |      337 |          8 |          5 |    1.43% |          3.71% |

PARTIAL fabrication rate is *lower* than FULL (0.40% vs 1.03%),
though on counts of 2 and 6 the difference is small. The model appears
to adapt to missing sections by shifting to qualitative language
rather than retrieving from training data: PARTIAL had zero parametric
numbers, so when a number was absent from the provided source, the
model did not pull one from training data.

SPARSE (~600 char source, ~5x reduction) starts to leak: 8
parametric numbers appear (the model recovers some from training)
and 5 fabrications appear. Total unsourced is 3.71%, still well below
the no-source baseline of ~54%.
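The two rate columns in the table differ only in whether parametric numbers are counted. The sketch below assumes, as the table values imply, that the unsourced rate is (parametric + fabricated) / total and the fabrication rate is fabricated / total; the counts are copied from the table and the variable names are illustrative.

```python
# Rows copied from the source-degradation table.
conditions = {
    "FULL": {"total": 583, "grounded": 577, "parametric": 0, "fabricated": 6},
    "PARTIAL": {"total": 506, "grounded": 504, "parametric": 0, "fabricated": 2},
    "SPARSE": {"total": 350, "grounded": 337, "parametric": 8, "fabricated": 5},
}

degradation_rates = {}
for name, c in conditions.items():
    # The three categories should partition the total.
    assert c["grounded"] + c["parametric"] + c["fabricated"] == c["total"]
    fab_pct = 100 * c["fabricated"] / c["total"]
    # Assumption: "unsourced" means parametric plus fabricated.
    unsourced_pct = 100 * (c["parametric"] + c["fabricated"]) / c["total"]
    degradation_rates[name] = (fab_pct, unsourced_pct)
    print(f"{name:<8} fab {fab_pct:.2f}%  unsourced {unsourced_pct:.2f}%")
```

Only SPARSE has a nonzero parametric count, which is why it is the only row where the two rates diverge.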

## Sub-experiment 5: Commensurable bridge

Source: `bridge_results.json → findings.results[topic][condition]`.

### Per-topic source-match rate

| Topic           | Source present | Source absent | Gap (pp) |
|:----------------|---------------:|--------------:|---------:|
| remote_work     |         87.60% |        43.60% |     44.0 |
| communication   |         96.60% |        65.00% |     31.6 |
| ai_workflows    |         89.00% |        26.10% |     62.9 |

Mean across topics: source-present 91.07%, source-absent 44.90%,
gap 46.2 pp. Pooled across all numbers: 91.88% vs 45.88%, gap
46.0 pp. The post's "46 percentage points" matches both.

### Per-topic temporal stability

| Topic           | Source present | Source absent |
|:----------------|---------------:|--------------:|
| remote_work     |         53.10% |        23.10% |
| communication   |         75.00% |        25.00% |
| ai_workflows    |         58.70% |        27.30% |

Mean: source-present 62.27%, source-absent 25.13%, gap 37.1 pp.
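The across-topic means for both bridge tables are unweighted averages of the per-topic percentages. A minimal sketch, with the percentages copied from the two tables and a hypothetical helper `topic_means`:

```python
from statistics import mean

# (source present %, source absent %) per topic, copied from the tables.
source_match = {
    "remote_work": (87.60, 43.60),
    "communication": (96.60, 65.00),
    "ai_workflows": (89.00, 26.10),
}
temporal_stability = {
    "remote_work": (53.10, 23.10),
    "communication": (75.00, 25.00),
    "ai_workflows": (58.70, 27.30),
}


def topic_means(table):
    """Return (mean present, mean absent, gap in pp), each topic weighted equally."""
    present = mean(p for p, _ in table.values())
    absent = mean(a for _, a in table.values())
    return present, absent, present - absent


match_means = topic_means(source_match)
stability_means = topic_means(temporal_stability)

for label, (present, absent, gap) in [
    ("source match", match_means),
    ("temporal stability", stability_means),
]:
    print(f"{label:<18} present {present:.2f}%  absent {absent:.2f}%  gap {gap:.1f} pp")
```

The pooled figures cannot be reproduced this way, because pooling weights each topic by its number count and those counts are only in `bridge_results.json`.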

The convergent validity claim: numbers that match the source are also
the numbers that stabilize across regenerations. The cross-tabulation
in `bridge_results.json` (`[topic][condition].cross_tab`) confirms
this for each (topic, condition) pair: stable-and-in-source is the
dominant cell under source-present; stable-and-not-in-source is
nearly empty.

## What the post claims, where it comes from

### Claim 1: "From roughly half the output to under 10 percent with source material, single digits with prohibition."

- "Roughly half" → bridge experiment, source-absent: 54.12%
  not-in-source pooled (100% − 45.88%).
- "Under 10 percent with source material" → sub-experiment 1 tiers
  range 1.69% to 10.25% (T3-REFINED sits just above 10%); bridge
  source-present 8.12% pooled (100% − 91.88%).
- "Single digits with prohibition" → sub-experiment 2 PROHIBITION
  1.59%; sub-experiment 3 Gemini PROHIBITION 1.66%; sub-experiment
  4 FULL 1.03%, PARTIAL 0.40%, SPARSE 3.71%.

### Claim 2: "Source material moved source-attribution rate 46 percentage points."

Bridge experiment, mean across three topics: 91.07% − 44.90% = 46.17 pp.
Pooled: 91.88% − 45.88% = 46.00 pp. Direct match.

### Claim 3: "Prompt architecture moved the unsourced rate 6 percentage points."

Sub-experiment 2: T3-CURRENT 7.71% − T3-PROHIBITION 1.59% = 6.12 pp.
Direct match.

### Claim 4: "Five times better. 1.6 percent versus 7.7."

Sub-experiment 2: 7.71% / 1.59% = 4.86x. Rounds to 5x. The percentages
quoted in the post are the underlying values rounded to one decimal
(7.71 → 7.7, 1.59 → 1.6).

### Claim 5: "The output doesn't get shorter or less detailed. It gets differently detailed."

Sub-experiment 2 word counts: T3-CURRENT 771, T3-PROHIBITION 927
(+20%). Sourced number counts: T3-CURRENT 419, T3-PROHIBITION 620
(+48%). Sub-experiment 3 (Gemini) shows the same direction.

### Claim 6: "Cross-generator confirmed: xAI 1.6%, Gemini Flash 1.7%, Gemini Pro 0% under prohibition."

Three generators tested, each under both CURRENT and PROHIBITION:

| Generator               | CURRENT | PROHIBITION | Drop    |
|:------------------------|--------:|------------:|--------:|
| xAI grok-4-1-fast       |   7.71% |       1.59% | 6.12 pp |
| Gemini Flash 3          |   6.07% |       1.66% | 4.42 pp |
| Gemini Pro 3.1          |   2.75% |       0.00% | 2.75 pp |

The xAI and Gemini Flash data are in `prompt_arch_results.json` and
`cross_gen_results.json` (this kit). The Gemini Pro replication is in
the vault at `EXP-081-data/exp081_gemini_pro_results.json`; it ran
the same prompt-architecture design on `gemini-3.1-pro-preview` and
is not bundled here. Per-architecture re-derivation is straightforward
from any of the three result files.

### Claim 7: "Even partial sources work. Tested what happens when key sections are removed... 0.4 percent with partial source versus 3.7 percent with very sparse source."

Sub-experiment 4: PARTIAL 0.40%, SPARSE 3.71%. Direct match.

### Claim 8: "When data was incomplete, the model adapted by writing qualitative analysis instead of fabricating numbers."

Sub-experiment 4 PARTIAL row: 0 parametric numbers across 6 documents.
The model didn't pull from training data when the provided source
omitted relevant sections. SPARSE shows 8 parametric numbers,
indicating the model starts retrieving from training data once the
provided source shrinks to ~600 characters. Where between PARTIAL and
SPARSE that shift begins was not tested.

## How to recompute

```python
import json

# Load the sub-experiment 2 results (xAI prompt architectures).
with open("prompt_arch_results.json") as f:
    d = json.load(f)

results = d["findings"]["results"]
for arch in ["T3_CURRENT", "T3_PROHIBITION", "T3_SEPARATED"]:
    # Pool the counts across all documents for this architecture.
    grp = [r for r in results if r["arch"] == arch]
    total = sum(r["numbers"]["total_numbers"] for r in grp)
    unsrc = sum(r["numbers"]["not_in_source"] for r in grp)
    print(arch, total, unsrc, unsrc / total)
# T3_CURRENT 454 35 0.07709...
# T3_PROHIBITION 630 10 0.01587...
# T3_SEPARATED 278 7 0.02518...
```

```python
import json

# Per-topic source-match rates for the bridge experiment (sub-experiment 5).
with open("bridge_results.json") as f:
    d = json.load(f)
for topic in ["remote_work", "communication", "ai_workflows"]:
    for cond in ["source_present", "source_absent"]:
        agg = d["findings"]["results"][topic][cond]["aggregate"]
        print(topic, cond, agg["source_match_rate"])
```
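The grouping step can also be wrapped in a small helper so that it applies to any list of per-document records. This is a sketch under one assumption: that each record carries `numbers.total_numbers` and `numbers.not_in_source`, as the `prompt_arch_results.json` records do in the snippet above. Whether the other result files use the same per-record layout is not confirmed here. The helper name `pooled_unsourced_rates` and the demo records are made up for illustration.

```python
from collections import defaultdict


def pooled_unsourced_rates(records, group_key):
    """Pool counts per group and return {group: (total, unsourced, rate)}.

    Assumes each record has record["numbers"]["total_numbers"] and
    record["numbers"]["not_in_source"].
    """
    pooled = defaultdict(lambda: [0, 0])
    for record in records:
        bucket = pooled[record[group_key]]
        bucket[0] += record["numbers"]["total_numbers"]
        bucket[1] += record["numbers"]["not_in_source"]
    return {
        group: (total, unsourced, unsourced / total if total else float("nan"))
        for group, (total, unsourced) in pooled.items()
    }


# Made-up records, only to show the call shape. Not data from the kit.
demo_records = [
    {"arch": "A", "numbers": {"total_numbers": 10, "not_in_source": 1}},
    {"arch": "A", "numbers": {"total_numbers": 30, "not_in_source": 3}},
    {"arch": "B", "numbers": {"total_numbers": 20, "not_in_source": 0}},
]
demo_rates = pooled_unsourced_rates(demo_records, "arch")
print(demo_rates)
```

Pooling sums the counts before dividing, so documents with more numbers carry more weight than they would in a mean of per-document rates.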

## What this analysis does not address

- **Uncertainty on the unsourced-rate deltas.** Sample size is 6 per
  cell (3 topics × 2 versions). Differences are large in relative
  terms, but no confidence intervals or significance tests are
  computed here. A reader running bootstrap CIs on the per-document
  not-in-source counts in the result JSONs is welcome to publish
  formal effect estimates.

- **Topic dependence of the bridge gap.** The 46 pp gap is the mean
  across three topics. Per-topic gaps range 31.6 pp (communication)
  to 62.9 pp (ai_workflows). The claim is "source material moved the
  rate 46 pp on average across the three topics tested." It is not
  "source material moves the rate 46 pp in any domain."

- **Whether human readers would trust source-present vs source-absent
  outputs differently.** The kit measures number provenance and
  temporal stability programmatically. The post mentions a separate
  N=1 domain-expert observation about epistemic stance; that is
  not measured here.
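For the first item above, one possible starting point is a percentile bootstrap over documents. The sketch below is not part of the kit: `bootstrap_rate_ci` is a hypothetical helper, and the per-document counts in `example_docs` are made-up placeholders, not values from the result JSONs. With only 6 documents per cell, any interval it produces will be coarse.

```python
import random


def bootstrap_rate_ci(docs, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a pooled unsourced rate.

    docs: list of (total_numbers, not_in_source) pairs, one per document.
    Documents are resampled with replacement and the rate is pooled
    within each resample.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_boot):
        sample = rng.choices(docs, k=len(docs))
        total = sum(t for t, _ in sample)
        unsourced = sum(u for _, u in sample)
        rates.append(unsourced / total)
    rates.sort()
    lower = rates[int((alpha / 2) * n_boot)]
    upper = rates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Placeholder counts for six documents. Hypothetical, for illustration only.
example_docs = [(100, 8), (90, 5), (110, 9), (95, 7), (105, 10), (100, 6)]
pooled = sum(u for _, u in example_docs) / sum(t for t, _ in example_docs)
ci_low, ci_high = bootstrap_rate_ci(example_docs)
print(f"pooled {pooled:.4f}, 95% CI [{ci_low:.4f}, {ci_high:.4f}]")
```

Resampling whole documents rather than individual numbers keeps the within-document dependence intact, which matters because numbers in the same output are not independent draws.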