# Method

Two experiments, both testing whether AI-generated numerical claims
survive independent regenerations of the same prompt.

## Experiment 1: Single-generator baseline (EXP-078)

**Script:** [`temporal_consistency.py`](./temporal_consistency.py)
**Data:** [`temporal_consistency_results.json`](./temporal_consistency_results.json) + [`generated_documents.json`](./generated_documents.json)

### Design

- **Generator:** `gemini-3.1-flash-lite-preview`
- **Topics:** 20 real tasks across domains (API design, business
  strategy, clinical trial design, data governance, hiring, incident
  response, ML deployment, open source sustainability, peer review, and
  others; the full list is in the results file).
- **Conditions:** Two prompt architectures.
  - STANDARD: Structural constraints requiring shaped, specific output.
    Tested on all 20 topics.
  - BASIC: Minimal instruction ("analyze this topic"). Tested on 5
    topics as a condition-discrimination probe.
- **Versions:** Three independent generations per topic for STANDARD,
  two for BASIC.
- **Total outputs:** 20 × 3 STANDARD + 5 × 2 BASIC = 70 documents.
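
The document count can be checked by enumerating the design grid. This is a minimal sketch, not code from `temporal_consistency.py`: the topic names are placeholders, and which five topics received the BASIC condition is an assumption made only so the example runs.

```python
from itertools import product

# Placeholder topic names; the real list is in the results file.
topics = [f"topic_{i:02d}" for i in range(1, 21)]

# Assumption: the choice of the 5 BASIC topics is arbitrary here.
basic_topics = topics[:5]

# (topic, condition, version) tuples for each condition.
standard_runs = list(product(topics, ["STANDARD"], range(1, 4)))   # 3 versions
basic_runs = list(product(basic_topics, ["BASIC"], range(1, 3)))   # 2 versions

runs = standard_runs + basic_runs
print(len(standard_runs), len(basic_runs), len(runs))  # 60 10 70
```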

### Measurement

For each topic and condition, the versions (three for STANDARD, two for
BASIC) are compared. A numerical claim (a percentage, dollar amount, or
integer count) counts as **stable** if it appears in every version, and
**variable** otherwise.

`fabrication_rate_pct = n_variable / n_total` is computed over
percentage claims only. Percentages give the cleanest signal because
independent generations very rarely produce the same percentage by
coincidence.

`all_numbers_fab_rate` extends the same logic to percentages, dollar
amounts, and counts combined.

Extraction and scoring are handled by the `claim_extraction` helper
module. It is not published here; see the README's "Re-running" section
for what it provides.
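
Since `claim_extraction` is not published, the following is only a rough sketch of the stable/variable logic described above, limited to percentages. The regex, the function names, and the choice to count each distinct percentage string once across versions are all assumptions, not the module's actual behavior. The rate is returned as a fraction, following the formula as written.

```python
import re

# Assumption: a percentage claim is a number immediately followed by "%".
PCT_RE = re.compile(r"\d+(?:\.\d+)?%")


def extract_percentages(text: str) -> set:
    """Return the distinct percentage strings found in one document."""
    return set(PCT_RE.findall(text))


def score_topic(versions: list) -> dict:
    """Classify percentage claims as stable (in every version) or variable."""
    per_version = [extract_percentages(v) for v in versions]
    all_claims = set().union(*per_version)
    stable = set.intersection(*per_version) if per_version else set()
    variable = all_claims - stable
    n_total = len(all_claims)
    return {
        "stable": stable,
        "variable": variable,
        "n_total": n_total,
        # Hypothetical stand-in for fabrication_rate_pct (as a fraction).
        "fabrication_rate": len(variable) / n_total if n_total else 0.0,
    }


# Made-up example texts, not real generator output.
versions = [
    "Adoption rose 40% while churn fell 12%.",
    "Adoption rose 40% and churn fell 15%.",
    "Adoption rose 40%; churn dropped 9%.",
]
result = score_topic(versions)
print(result["stable"], result["fabrication_rate"])  # {'40%'} 0.75
```

Exact string matching is the strictest reading of "appears in every version"; a real extractor may normalize formats (for example `40 %` or `40 percent`) before comparing.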

## Experiment 2: Cross-generator replication (EXP-078B)

**Script:** [`cross_generator.py`](./cross_generator.py)
**Data:** [`cross_generator_results.json`](./cross_generator_results.json)

### Design

- **Generators:** Three models (two Gemini variants and one xAI model),
  tested independently.
  - `gemini-3.1-flash-lite-preview` (the primary generator from
    Experiment 1, re-used for consistency)
  - `gemini-3-flash-preview` (newer Gemini variant)
  - `grok-4-1-fast` (xAI)
- **Topics:** 10 topics, a subset of Experiment 1's 20.
- **Versions:** 3 independent STANDARD generations per topic per
  generator.
- **Total outputs:** 10 × 3 × 3 = 90 documents (30 per generator).

### Measurement

Same scoring pipeline as Experiment 1, applied per generator. The
per-generator aggregates (`all_fab_mean`, `pct_fab_mean`) are compared
to establish the 77% to 100% range reported in the published post.
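
One plausible reading of the per-generator aggregates is a simple mean of per-topic rates, grouped by generator. The sketch below assumes that reading; the record layout, the generator labels, and the values are invented for illustration and do not come from `cross_generator_results.json`.

```python
from collections import defaultdict
from statistics import mean

# Invented per-topic records; values are illustrative only.
records = [
    {"generator": "model_a", "topic": "t1", "all_numbers_fab_rate": 0.5},
    {"generator": "model_a", "topic": "t2", "all_numbers_fab_rate": 1.0},
    {"generator": "model_b", "topic": "t1", "all_numbers_fab_rate": 1.0},
    {"generator": "model_b", "topic": "t2", "all_numbers_fab_rate": 1.0},
]


def aggregate_by_generator(rows: list, field: str) -> dict:
    """Average one per-topic field within each generator."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["generator"]].append(row[field])
    return {gen: mean(vals) for gen, vals in grouped.items()}


# Assumption: all_fab_mean is the unweighted mean over topics.
all_fab_mean = aggregate_by_generator(records, "all_numbers_fab_rate")
print(all_fab_mean)  # {'model_a': 0.75, 'model_b': 1.0}
```

An unweighted mean treats every topic equally regardless of how many numerical claims it contains; pooling claims across topics before dividing would give a different figure.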

## Why this measures what it claims to measure

The core claim is about **temporal instability as a proxy for
fabrication**. The argument:

- If the model is retrieving knowledge it has, the retrieval should be
  approximately deterministic at temperature 0. Different samples can
  re-phrase but should converge on the same underlying facts.
- If the model is generating plausible-looking numbers without a
  retrieval anchor, the sampled values will differ between runs.
- Therefore: a number that appears in all three runs of the same
  prompt is consistent with retrieval; a number that varies is
  consistent with fabrication.

The proxy is not identical to ground-truth fabrication. A number could
be stable and wrong (a memorized falsehood), or unstable and real (a
true figure that surfaces in only some of the sampled runs). The
published post discusses both directions. The ground-truth check (23
cited sources audited against their actual publications) is a separate
experiment and is not in this receipts folder.

## What's fixed and what's varied

| Factor | Fixed | Varied |
|---|---|---|
| Prompt text (per topic × condition) | ✓ | |
| Temperature | ✓ (from script: 0.0 for most runs; see `temporal_consistency.py`) | |
| Generator model per experiment | ✓ (Exp 1) | Varied across 3 models (Exp 2) |
| Topic content | | Varied across 20 topics (Exp 1); 10-topic subset (Exp 2) |
| Condition (STANDARD vs BASIC) | | Varied (Exp 1 only) |

The topic list is deliberately broad so the result isn't an artifact of
one content area.