# Analysis: what the numbers show

This document walks through the numbers in [`data.json`](./data.json) and
maps them to the claims in the published post.

## Condition-level summary

Each row is derived by grouping the runs under
`data.json → raw_outputs[condition]` and taking the mean (and sd) of each
run's score. The per-run values are stored directly in `data.json` for
anyone who wants to recompute.

| Condition   | n  | Total score (mean ± sd) | Density / 1kw | Word count |
|:------------|:---|:-----------------------|:-------------:|:----------:|
| SPEC_QUAL   | 10 | 37.10 ± 2.02           | 100.79        | 368        |
| SPEC_ONLY   | 10 | 32.80 ± 3.01           |  81.82        | 402        |
| QUAL_ONLY   | 10 | 31.00 ± 1.33           |  78.67        | 396        |
| BARE        | 10 | 31.60 ± 2.12           |  71.54        | 443        |
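
The total-score column can be recomputed with a short script. This is a minimal sketch that assumes each run under `raw_outputs[condition]` exposes its total at `run['score']['total']` (the same path the reconstruction later in this document uses) and that the reported sd is the sample standard deviation. The helper name `summarize_totals` is illustrative, not part of the repository.

```python
import json
from pathlib import Path
from statistics import mean, stdev

CONDITIONS = ["SPEC_QUAL", "SPEC_ONLY", "QUAL_ONLY", "BARE"]

def summarize_totals(data, conditions=CONDITIONS):
    """Return {condition: (n, mean, sample sd)} for the total score."""
    summary = {}
    for cond in conditions:
        totals = [run["score"]["total"] for run in data["raw_outputs"][cond]]
        summary[cond] = (len(totals), mean(totals), stdev(totals))
    return summary

if __name__ == "__main__":
    path = Path("data.json")
    if path.exists():  # skip quietly when run outside the repository
        with path.open() as f:
            data = json.load(f)
        for cond, (n, m, sd) in summarize_totals(data).items():
            print(f"{cond:<10} n={n:<3} {m:.2f} ± {sd:.2f}")
```

The density and word-count columns are left out because their field names in `data.json` are not shown in this document.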

## Main effects

Pulled from `data.json → analysis → main_effects`.

| Factor          | Scale       | Hedges' g (d) | 95% CI          |
|:----------------|:------------|:--------------|:----------------|
| Specificity     | raw total   | **1.3715**    | [0.741, 2.348]  |
| Specificity     | density/1kw | 1.6510        | [1.011, 2.720]  |
| Quality demands | raw total   | 0.5944        | [-0.017, 1.277] |
| Quality demands | density/1kw | 1.1887        | [0.595, 2.024]  |
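
To read the intervals mechanically, the rows above can be checked for whether each 95% CI excludes zero. The values are copied by hand from the table rather than loaded from `data.json`, and the helper name `excludes_zero` is illustrative.

```python
# Rows copied from the main-effects table: (factor, scale, d, ci_low, ci_high).
MAIN_EFFECTS = [
    ("Specificity", "raw total", 1.3715, 0.741, 2.348),
    ("Specificity", "density/1kw", 1.6510, 1.011, 2.720),
    ("Quality demands", "raw total", 0.5944, -0.017, 1.277),
    ("Quality demands", "density/1kw", 1.1887, 0.595, 2.024),
]

def excludes_zero(ci_low, ci_high):
    """True when the interval lies entirely on one side of zero."""
    return ci_low > 0 or ci_high < 0

for factor, scale, d, lo, hi in MAIN_EFFECTS:
    verdict = "excludes zero" if excludes_zero(lo, hi) else "includes zero"
    print(f"{factor:<16} {scale:<12} d={d:.4f}  [{lo}, {hi}]  {verdict}")
```

Only the quality-demands effect on the raw total has an interval that includes zero.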

## What the post claims, where it comes from

Three claims in the published post anchor to this data.

### Claim 1: "Specificity d = 1.37, 95% CI excludes zero."

Lives at `data.json → analysis.main_effects.specificity_raw`.
The main effect of specificity on the raw-total score across all 40
runs is d = 1.3715, 95% CI [0.741, 2.348]. The lower bound is well above
zero, so the direction is supported at the 95% confidence level.

### Claim 2: "Quality demands main effect is near zero at raw score but inflates at density."

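`analysis.main_effects.quality_raw.d` = 0.5944, with a 95% CI of
[-0.017, 1.277] whose lower bound dips just below zero. At density, the
quality-demands main effect jumps to d = 1.1887 and its CI excludes zero.
The post's "What didn't survive" section calls this out because it signals
that density partially conflates specificity with output brevity: shorter
outputs earn higher density for the same raw score, and the two
quality-demand cells have the lowest word counts in the summary table
(368 and 396, against 402 and 443).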
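
The brevity effect can be seen from the condition-level summary table alone. The sketch below averages the table columns over the cells with and without quality demands; it works from rounded cell values with equal n per cell, so treat the ratios as approximate. The names `CELLS` and `marginal` are illustrative.

```python
# Cell values copied from the condition-level summary table:
# condition -> (mean total score, density per 1kw, word count).
CELLS = {
    "SPEC_QUAL": (37.10, 100.79, 368),
    "SPEC_ONLY": (32.80, 81.82, 402),
    "QUAL_ONLY": (31.00, 78.67, 396),
    "BARE":      (31.60, 71.54, 443),
}

def marginal(conditions):
    """Average the three table columns over the given cells (equal n per cell)."""
    rows = [CELLS[c] for c in conditions]
    return tuple(sum(col) / len(rows) for col in zip(*rows))

qual_on = marginal(["SPEC_QUAL", "QUAL_ONLY"])
qual_off = marginal(["SPEC_ONLY", "BARE"])

for name, on, off in zip(("score", "density", "words"), qual_on, qual_off):
    print(f"{name:<8} with={on:7.2f}  without={off:7.2f}  ratio={on / off:.3f}")
```

On these values the quality-demand cells score about 6% higher on the raw total but run about 10% shorter, which together account for a density gap of roughly 17%.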

### Claim 3: "Three confounded variables across the two prior replications."

This isn't a number in `data.json`; it's a design claim. You can verify
it by reading [`conditions.md`](./conditions.md) and comparing to the
original EXP-025 design described in the published post. The 2×2
factorial separates specificity content, quality demands, and output
length; each factor can be isolated in the data by comparing cells.

## How to recompute Hedges' g

The analysis uses Hedges' g, the small-sample-corrected form of Cohen's
d.

Cohen's d for two groups is the mean difference divided by the pooled
standard deviation:

```
d = (mean_A - mean_B) / s_pooled
```

where `s_pooled` is the pooled standard deviation of the two groups (the
usual formula with n-1 weights).

Hedges' g applies a correction factor to Cohen's d that matters for
small samples:

```
g = d * (1 - 3 / (4 * (n_A + n_B) - 9))
```

For this experiment, "specificity effect" pools the two
specificity-present cells (SPEC_QUAL + SPEC_ONLY, n=20) against the two
specificity-absent cells (QUAL_ONLY + BARE, n=20). Bootstrap 95% CIs
used 10,000 resamples; see `bootstrap_ci` in the script import.
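
As a rough cross-check that does not need the raw runs, the same pooling can be approximated from the condition-level summary table alone. This sketch assumes the reported sd values are sample standard deviations and works from rounded inputs, so it will only match the stored values to about two decimals. The helper names `pool` and `effect_sizes` are illustrative.

```python
from math import sqrt

# (mean, sd, n) per cell, copied from the condition-level summary table.
# Assumption: the reported sd is the sample sd (n-1 denominator).
CELLS = {
    "SPEC_QUAL": (37.10, 2.02, 10),
    "SPEC_ONLY": (32.80, 3.01, 10),
    "QUAL_ONLY": (31.00, 1.33, 10),
    "BARE":      (31.60, 2.12, 10),
}

def pool(conditions):
    """Combine cells into one group: returns (mean, sample variance, n)."""
    cells = [CELLS[c] for c in conditions]
    n = sum(k for _, _, k in cells)
    grand = sum(m * k for m, _, k in cells) / n
    # Total sum of squares = within-cell part + between-cell part.
    ss = sum((k - 1) * s**2 + k * (m - grand)**2 for m, s, k in cells)
    return grand, ss / (n - 1), n

def effect_sizes(group_a, group_b):
    """Return (Cohen's d, Hedges' g) for two pooled groups of cells."""
    m_a, v_a, n_a = pool(group_a)
    m_b, v_b, n_b = pool(group_b)
    s_pooled = sqrt(((n_a - 1) * v_a + (n_b - 1) * v_b) / (n_a + n_b - 2))
    d = (m_a - m_b) / s_pooled
    g = d * (1 - 3 / (4 * (n_a + n_b) - 9))
    return d, g

d, g = effect_sizes(["SPEC_QUAL", "SPEC_ONLY"], ["QUAL_ONLY", "BARE"])
print(f"specificity, raw total: d = {d:.3f}, g = {g:.3f}")
```

On these rounded inputs the uncorrected d comes out near 1.37 and the corrected g near 1.34, so it is worth confirming from `data.json` which of the two the stored 1.3715 corresponds to.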

A Python reconstruction from the raw runs, using only the standard library:

```python
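import json
from math import sqrt
from statistics import mean, stdev

# Load the experiment data; named `data` to avoid confusion with Cohen's d.
with open('data.json') as f:
    data = json.load(f)

def scores(conds):
    """Collect every run's total score across the given conditions."""
    return [run['score']['total']
            for cond in conds
            for run in data['raw_outputs'][cond]]

# Pool the specificity-present cells against the specificity-absent cells.
spec_on = scores(['SPEC_QUAL', 'SPEC_ONLY'])
spec_off = scores(['QUAL_ONLY', 'BARE'])
n_on, n_off = len(spec_on), len(spec_off)

m_diff = mean(spec_on) - mean(spec_off)

# Pooled standard deviation with n-1 weights (stdev is the sample sd).
sp = sqrt(((n_on - 1) * stdev(spec_on)**2 +
           (n_off - 1) * stdev(spec_off)**2) /
          (n_on + n_off - 2))
cohen_d = m_diff / sp

# Small-sample correction, as in the formula above.
hedges_g = cohen_d * (1 - 3 / (4 * (n_on + n_off) - 9))

# Compare both against the stored value (1.3715 in the main-effects table).
print(f'Cohen d  = {cohen_d:.4f}')
print(f'Hedges g = {hedges_g:.4f}')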
```
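
The bootstrap intervals mentioned above could be approximated along these lines. This is a percentile-bootstrap sketch, not the `bootstrap_ci` from the original script: whether that function uses the percentile method, BCa, or another variant is not stated here, and the resampling scheme (independent resampling within each group), the seed, and the helper names are all assumptions.

```python
import random
from math import sqrt
from statistics import mean, stdev

def hedges_g(a, b):
    """Hedges' g for two independent samples (pooled sd, n-1 weights)."""
    n_a, n_b = len(a), len(b)
    s_pooled = sqrt(((n_a - 1) * stdev(a)**2 + (n_b - 1) * stdev(b)**2)
                    / (n_a + n_b - 2))
    d = (mean(a) - mean(b)) / s_pooled
    return d * (1 - 3 / (4 * (n_a + n_b) - 9))

def percentile_bootstrap_ci(a, b, n_resamples=10_000, level=0.95, seed=0):
    """Percentile CI for Hedges' g, resampling each group with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        ra = rng.choices(a, k=len(a))
        rb = rng.choices(b, k=len(b))
        try:
            stats.append(hedges_g(ra, rb))
        except ZeroDivisionError:
            continue  # degenerate resample with zero pooled variance
    stats.sort()
    tail = (1 - level) / 2
    lo = stats[int(tail * len(stats))]
    hi = stats[int((1 - tail) * len(stats)) - 1]
    return lo, hi
```

With `spec_on` and `spec_off` from the reconstruction above, `percentile_bootstrap_ci(spec_on, spec_off)` returns a `(low, high)` pair to compare against the stored interval. Exact agreement is not expected unless the method and seed match the original run.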

## Validation

The scoring rubric's reliability was checked against a 10-output
human-labeled subset, stored in [`validation.json`](./validation.json).
The key maps each output to its condition, total score, density, and a
HIGH/LOW label. Readers who want to audit the rubric can score the same
outputs independently and compare.
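
One way to run that comparison is to tally agreement between your own HIGH/LOW labels and the key. The sketch below is hypothetical throughout: the layout of `validation.json` is not documented here, so it takes two plain dicts mapping an output ID to a label and leaves loading the file to the reader. The function name and the example IDs are made up for illustration.

```python
def label_agreement(key_labels, my_labels):
    """Return (agreement rate, sorted list of disputed IDs) over shared output IDs."""
    shared = sorted(set(key_labels) & set(my_labels))
    if not shared:
        raise ValueError("no output IDs in common")
    disputed = [oid for oid in shared if key_labels[oid] != my_labels[oid]]
    return 1 - len(disputed) / len(shared), disputed

# Made-up IDs and labels, purely to show the call shape.
key = {"out-1": "HIGH", "out-2": "LOW", "out-3": "HIGH", "out-4": "LOW"}
mine = {"out-1": "HIGH", "out-2": "HIGH", "out-3": "HIGH", "out-4": "LOW"}

rate, disputed = label_agreement(key, mine)
print(f"agreement = {rate:.0%}, disputed = {disputed}")
```

The disputed IDs are the ones worth rereading against the rubric, since with only 10 outputs a single disagreement moves the rate by ten points.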