# Analysis: what the numbers show

This document walks through the numbers in the results files and maps
them to the claims in the published post.

## Cross-generator summary (the 77% to 100% range)

From [`cross_generator_results.json`](./cross_generator_results.json) →
`cross_generator_comparison`:

| Generator | Model | Topics | All-numbers fab rate | Percentage fab rate |
|:---|:---|:---|:---:|:---:|
| gemini (original) | gemini-3.1-flash-lite-preview | 20 | **93.92%** | **100.0%** |
| gemini3f | gemini-3-flash-preview | 10 | 95.19% | 100.0% |
| xai | grok-4-1-fast | 10 | **85.79%** | **76.8%** |

The range "77 to 100 percent" reported in the published post refers to
the percentage-claim fab rate: **76.8%** on xAI (the "77" in the post
rounds this up) to **100%** on both Gemini generators.

No single percentage appeared in all three versions of any topic on the
two Gemini generators. On xAI, roughly a quarter did, and the post
notes most of those were "round numbers reused across different
claims" (e.g., a "70%" that appears in three different unrelated
contexts, not the same assertion recurring).
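
As a quick check of the arithmetic, the sketch below recomputes the reported range and the stable share from the percentage fab rates in the table above. The values are hard-coded from that table rather than read from the JSON file, and the dictionary layout is an illustrative assumption, not the file's schema.

```python
# Percentage fab rates (in percent), copied by hand from the table above.
# The dict layout is illustrative; it is not the schema of the results file.
pct_fab_rate = {
    "gemini (original)": 100.0,
    "gemini3f": 100.0,
    "xai": 76.8,
}

low = min(pct_fab_rate.values())
high = max(pct_fab_rate.values())

# The post reports the range rounded to whole percents.
reported_range = (round(low), round(high))

# Stable share is the complement of the fab rate.
stable_share = {name: round(100.0 - rate, 1) for name, rate in pct_fab_rate.items()}

print(reported_range)       # (77, 100)
print(stable_share["xai"])  # 23.2
```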

## Single-generator baseline (primary experiment)

From [`temporal_consistency_results.json`](./temporal_consistency_results.json)
→ `aggregate`:

| Metric | Value | What it means |
|:---|:---:|:---|
| `standard_pct_fab_rate_mean` | 1.0 | Mean per-topic percentage fab rate is 100%: every percentage was temporally unstable |
| `standard_pct_fab_rate_median` | 1.0 | Same finding at the median |
| `standard_all_fab_rate_mean` | 0.9392 | 93.92% of all numerical claims (percentages + dollars + counts) were unstable |
| `basic_pct_fab_rate_mean` | 1.0 | BASIC condition is also 100% unstable; the condition had no effect on the rate |
| `mean_all_nums_per_topic` | 3 | Average number of numerical claims per document |
| `mean_heading_jaccard` | 0.0 | Structural headings also don't survive regeneration |

Key observation: `standard_all_fab_rate_mean = 0.9392` is the same
value listed for gemini in the cross-generator table. That's because
the cross-generator comparison re-uses the same 20-topic single-gen
run for the gemini baseline (documented in `cross_generator.py`).

## Per-topic detail

Each topic in `standard_results` (there are 20 of them) has a detailed
breakdown including:

- `all_numbers_fab_rate`, `all_numbers_stable`, `all_numbers_variable`,
  `all_numbers_total`
- `n_pct_stable`, `n_pct_variable`, `n_pct_total`
- `n_dollar_stable`, `n_dollar_variable`
- `examples.stable` and `examples.variable`, actual number strings that
  were tracked
- `heading_jaccard`, overlap of section headings across versions
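
For readers unfamiliar with the Jaccard measure, here is a minimal sketch of one way to score heading overlap across three versions. It assumes the score is the mean of pairwise Jaccard similarities between heading sets; the scoring code may aggregate differently, so treat `mean_pairwise_jaccard` as a hypothetical helper and check `temporal_consistency.py` for the actual definition.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_jaccard(heading_sets):
    """Average Jaccard similarity over every pair of versions (assumed aggregation)."""
    pairs = list(combinations(heading_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented heading sets for three versions of one topic.
versions = [
    {"Intro", "Costs"},
    {"Intro", "Risks"},
    {"Overview", "Risks"},
]
score = mean_pairwise_jaccard(versions)
```

Under this definition, a score of 0.0 means that no heading is shared by any two versions.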

Topics where the model produced no percentage claims in its output
have `fabrication_rate_pct: null` and are excluded from the aggregate.
One example is `api_design_STANDARD`, where `n_pct_total` is 0.
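
The exclusion rule matters for the aggregate: a topic with no percentages contributes nothing, rather than counting as 0% or 100%. A minimal sketch of that averaging, using invented per-topic rates and a hypothetical helper name (`mean_fab_rate`):

```python
from statistics import mean


def mean_fab_rate(per_topic_rates):
    """Average the per-topic rates, skipping topics whose rate is None (JSON null)."""
    scored = [rate for rate in per_topic_rates if rate is not None]
    return mean(scored) if scored else None


# Invented example: the second topic produced no percentages, so its rate is None.
rates = [1.0, None, 0.5, 1.0]
aggregate = mean_fab_rate(rates)
```

Here the mean is taken over three topics, not four.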

## Claim-by-claim mapping to the published post

### "77 to 100 percent of model-generated numbers changed between runs"

Source: `cross_generator_results.json` → `cross_generator_comparison`.
Lower bound 76.8% (xai, `pct_fab_mean`); upper bound 100% (both Gemini
generators). The "changed between runs" phrasing is the complement of
the stable share: a number counts as changed if it did not appear in
all three versions.

### "On two generators, none survived. On the third, roughly a quarter did, mostly round numbers reused across different claims"

Source: `cross_generator_comparison`. Both Gemini generators show 100%
percentage fab rate (zero percentages survived). xAI shows 76.8% fab
rate, so ~23% stable, "roughly a quarter." The "mostly round numbers
reused" characterization is qualitative and isn't encoded in the data;
it can be verified by opening `generated_documents.json`, filtering to
the xAI subset, and inspecting the recurring percentages.
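
One way to do that inspection is to print the text around each recurring percentage and judge by eye whether the versions make the same claim. The sketch below works on a plain list of three version texts. How you select the xAI entries from `generated_documents.json` depends on the file's fields, which are not assumed here, and the sample texts are invented.

```python
import re
from collections import defaultdict

PCT_RE = re.compile(r"\d+(?:\.\d+)?\s*%")


def stable_with_contexts(texts, window=30):
    """Return percentages found in every version, with a snippet around each hit."""
    contexts = defaultdict(list)
    for version, text in enumerate(texts):
        for match in PCT_RE.finditer(text):
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            contexts[match.group()].append((version, text[start:end].strip()))
    all_versions = set(range(len(texts)))
    return {
        pct: snippets
        for pct, snippets in contexts.items()
        if {version for version, _ in snippets} == all_versions
    }


# Invented texts: "70%" recurs in all three, but attached to unrelated claims.
texts = [
    "Roughly 70% of teams adopt code review. Latency dropped 15%.",
    "Test coverage above 70% is typical. Costs rose 22%.",
    "About 70% of outages trace to config changes.",
]
recurring = stable_with_contexts(texts)
for pct, snippets in recurring.items():
    for version, snippet in snippets:
        print(pct, version, snippet)
```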

### "20 topics on one generator. Replicated across three generators with 10 topics"

Source: experiment designs are visible in `temporal_consistency.py`
(20 topics × 3 versions, STANDARD condition) and `cross_generator.py`
(10 topics × 3 generators × 3 versions).

### Aggregate claim density "3 numerical claims per document"

Source: `temporal_consistency_results.json` → `aggregate.mean_all_nums_per_topic = 3`.
The post doesn't quote this directly; it's relevant because temporal
instability is only informative for topics that produced numbers at
all. See "Honest limits" in the post.

## How to recompute fabrication rate yourself

Pick any topic in `generated_documents.json`. Each entry has a
`topic_id`, `condition`, `version` index, and full `text`. For a given
topic × condition, there are 3 independent version texts.

For each text:
1. Extract percentages (e.g., regex `\d+(?:\.\d+)?\s*%`).
2. Collect them as a set.

Take the intersection of the three sets: these are the "stable"
percentages. The "variable" ones are the percentages that appear in at
least one version but not in all three. The total is the size of the
union of the three sets.

```
fab_rate = n_variable / n_total
```
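
A minimal Python sketch of the steps above, for one topic × condition. It uses the regex quoted in step 1; stripping whitespace so that "70 %" and "70%" compare equal is an assumption of this sketch, and `pct_fab_rate` is a hypothetical helper name, not the scoring code's function.

```python
import re

PCT_RE = re.compile(r"\d+(?:\.\d+)?\s*%")


def extract_percentages(text):
    """Return the set of percentage strings in a text, with whitespace removed."""
    # Assumption of this sketch: "70 %" and "70%" are treated as the same string.
    return {re.sub(r"\s+", "", match) for match in PCT_RE.findall(text)}


def pct_fab_rate(texts):
    """Fraction of percentages not present in every version, or None if there are none."""
    sets = [extract_percentages(text) for text in texts]
    total = set().union(*sets)
    if not total:
        return None  # no percentages: the topic is left out of the mean
    stable = set.intersection(*sets)
    variable = total - stable
    return len(variable) / len(total)


# Invented version texts for one topic.
texts = [
    "Adoption rose 70% and churn fell 12.5%.",
    "Adoption rose 70% while churn fell 9%.",
    "About 70% adopted; 40% churned.",
]
print(pct_fab_rate(texts))  # 0.75
```

In this example only "70%" appears in all three versions, so three of the four distinct percentages are variable.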

Repeat for all 20 topics. The mean across topics (for topics with at
least one percentage) is `standard_pct_fab_rate_mean`.

If you implement this, you should match the scoring code's output to
within rounding. The scoring code's percentage extractor is a regex
and shouldn't diverge from a reasonable reimplementation.

## Caveats worth reading before drawing conclusions

- **N=1 practitioner.** Zero external replication (as the post states).
  These receipts establish that the experiment ran and produced these
  numbers. They don't establish that the experiment generalizes outside
  the three tested generators or the 20 tested domains.
- **Claim density is low.** Mean 3 numerical claims per document. The
  fabrication rate describes *of the numbers the model produced, what
  fraction was unstable*. It doesn't describe how many numbers the
  model produces per answer, which varies by topic and condition.
- **Temporal instability is a proxy, not ground truth.** See the
  caveats in the README. The post's "Honest limits" section discusses
  this directly.
