# Frame Check: named-pattern detector validation

The post body claims macro-F1 0.157 (v1), 0.274 (v2), 0.360 (v3) against a pre-registered useful threshold of 0.4. This directory has the corpus, the labels, the scripts that produced those numbers, the result JSONs, the pre-registration documents, and the three reports written before the next iteration was tested.

A reader who wants to recompute the numbers can. The pipeline runs from `scripts/01_assemble_corpus.py` through `scripts/11_compute_v3_metrics.py` against the corpus and labels in this directory.

## What you will find here

### Reports (the analysis)

[REPORT.md](./REPORT.md) is the v1 report (n=12, 2026-04-18). The first scaled test of the named-pattern detector against pre-registered thresholds. Macro-F1 0.157. The bar was set in writing before the test ran. The detector did not clear it.

[REPORT_V2.md](./REPORT_V2.md) is the v2 same-day rule audit (n=12, 2026-04-18). Three v1 detection rules were retired (FVS-001 Frame Amplification, FVS-008 Growth, FVS-015 Efficiency) for firing on cases they should not flag. Macro-F1 moved to 0.274. Still below the floor.

[REPORT_V3_TRACK_A.md](./REPORT_V3_TRACK_A.md) is the v3 follow-on study (n=28, 2026-04-19). It adds signal-level vocabularies (S-1 hedge, S-2 citation, S-3 growth, S-4 efficiency). Macro-F1 0.360. Two of the three retired frames (FVS-008, FVS-015) were reintroduced on a revised signal substrate; FVS-001 stayed retired. Two of the four per-frame "passes" are statistically underpowered at low n.

### Pre-registration (the bar)

[DESIGN.md](./DESIGN.md) is the v1 pre-registration. Hypotheses, thresholds, and corpus design fixed before any data was collected.

[DESIGN_v2.md](./DESIGN_v2.md) is the v2 pre-registration that locked the H-A1 through H-A7 hypotheses and thresholds for the v3 expanded study. Authored before v3 corpus assembly began.

[DESIGN_v3.md](./DESIGN_v3.md) is the v3 corpus design. Stratified sampling, label-source rules, and the rationale for n=28.

### Audit (the v1 to v2 transition)

[RULE_AUDIT.md](./RULE_AUDIT.md) is the same-day audit that retired three v1 rules (FVS-001, FVS-008, FVS-015) and re-ran the metrics. Documents which corpus cases triggered the audit and what each rule was firing on.

### Results (the numbers)

[results.json](./results.json) is the v1 metrics output (macro-F1 0.157, per-frame F1 distribution, confusion matrices).

[results_v3.json](./results_v3.json) is the v3 metrics output (macro-F1 0.360, per-frame results, number of positives per frame).

[v1_vs_v2_comparison.json](./v1_vs_v2_comparison.json) is the v1 to v2 delta, generated by the same-day rule audit.

### Pipeline (how to recompute)

The 12-step pipeline plus three frame-library implementations live in [`scripts/`](./scripts/). Steps run in order:

- `scripts/01_assemble_corpus.py` builds the v1 n=12 stratified corpus.
- `scripts/02_llm_judge_labels.py` runs the LLM-judge labeler over that corpus.
- `scripts/03_detector_labels.py` runs the v1 deterministic detector over the corpus.
- `scripts/04_compute_metrics.py` reads `labels/labels_curator.json`, `labels/labels_llm.json`, `labels/labels_detector.json` and produces `results.json`.
- `scripts/05_rerun_v2_detector.py` retires the three audited rules and reruns the detector; `scripts/04_compute_metrics.py` then produces `v1_vs_v2_comparison.json`.
- `scripts/06_expand_corpus.py` and `scripts/07_assemble_stratum_a_v3.py` build the v3 n=28 corpus.
- `scripts/08_llm_judge_labels_v3.py` and `scripts/09_detector_v1_baseline_v3corpus.py` produce v3 baseline labels for comparison.
- `scripts/10_run_v3_detector.py` runs the v3 detector with S-1 through S-4 signal additions.
- `scripts/11_compute_v3_metrics.py` reads the v3 labels and produces `results_v3.json`.
- `scripts/12_assemble_heldout.py` assembles the held-out corpus for Track B (the next study, not yet run).

The frame definitions and signal regexes live in `scripts/frame_library_v2.py`, `scripts/frame_library_v3.py`, and `scripts/framing_v2.py`.
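The ordering above can be sketched as a small runner. This is a minimal sketch, not part of the repository: it assumes each script runs with no arguments from this directory, which the listing does not confirm, and `build_commands` and `run_pipeline` are hypothetical helper names. Steps 02 and 08 call an LLM-judge, so their output may not reproduce exactly. The second `04_compute_metrics.py` run that follows step 05 is left out because its invocation is not described here.

```python
import subprocess
import sys
from pathlib import Path

# Steps 01-11 in the order listed above. Step 12 (held-out assembly for
# Track B) is excluded because it does not feed the reported numbers.
STEPS = [
    "01_assemble_corpus.py",
    "02_llm_judge_labels.py",
    "03_detector_labels.py",
    "04_compute_metrics.py",
    "05_rerun_v2_detector.py",
    "06_expand_corpus.py",
    "07_assemble_stratum_a_v3.py",
    "08_llm_judge_labels_v3.py",
    "09_detector_v1_baseline_v3corpus.py",
    "10_run_v3_detector.py",
    "11_compute_v3_metrics.py",
]


def build_commands(scripts_dir="scripts", steps=STEPS):
    """Return one argv list per step, using the current Python interpreter.

    ASSUMPTION: every script takes no command-line arguments.
    """
    return [[sys.executable, str(Path(scripts_dir) / name)] for name in steps]


def run_pipeline(scripts_dir="scripts", steps=STEPS, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(scripts_dir, steps):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the run at the first failing step.
            subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` only prints the commands. `run_pipeline(dry_run=False)` executes them, and `run_pipeline(steps=["11_compute_v3_metrics.py"], dry_run=False)` would recompute only the v3 metrics from the committed labels, if that script has no other prerequisites.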

### Corpus (the inputs)

The 28 corpus documents live in [`corpus/`](./corpus/). Filenames are stratum-prefixed: `a*` are essays / opinion / methodological documents (`a01_altman_intelligence_age.txt`, `a02_fomc_march_2026.txt`, etc.), `b*` are press / report / news (`b01_nvidia_investment.txt`, etc.), `c*` are encyclopedia / reference (`c01_wikipedia_semaglutide.txt`, etc.). Stratum boundaries and selection criteria are in `DESIGN_v3.md`.
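The prefix convention makes it easy to check stratum balance. The following is a minimal sketch, assuming the first character of each filename is the stratum letter and that corpus files use a `.txt` extension, as the examples above suggest. `group_by_stratum` and `list_corpus` are hypothetical helpers, not functions from `scripts/`.

```python
from collections import defaultdict
from pathlib import Path

# Stratum descriptions as given in this README.
STRATA = {
    "a": "essays / opinion / methodological",
    "b": "press / report / news",
    "c": "encyclopedia / reference",
}


def group_by_stratum(filenames):
    """Group filenames by their one-letter stratum prefix."""
    groups = defaultdict(list)
    for name in filenames:
        prefix = Path(name).name[:1].lower()
        # Anything outside a/b/c is collected under "unknown".
        key = prefix if prefix in STRATA else "unknown"
        groups[key].append(name)
    return dict(groups)


def list_corpus(corpus_dir="corpus"):
    """List corpus filenames (ASSUMPTION: all corpus files end in .txt)."""
    return sorted(p.name for p in Path(corpus_dir).glob("*.txt"))
```

For example, `{k: len(v) for k, v in group_by_stratum(list_corpus()).items()}` gives the per-stratum document counts, which can be compared against the design in `DESIGN_v3.md`.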

### Labels (the ground truth)

The eight label files live in [`labels/`](./labels/). They cover three label sources (curator, LLM-judge, detector) across two corpus versions (v1 n=12, v3 n=28):

- `labels/labels_curator.json`: curator labels on v1 corpus.
- `labels/labels_curator_v3.json`: curator labels on v3 corpus.
- `labels/labels_llm.json`: LLM-judge labels on v1 corpus.
- `labels/labels_llm_v3.json`: LLM-judge labels on v3 corpus.
- `labels/labels_detector.json`: v1 detector labels on v1 corpus.
- `labels/labels_detector_v2.json`: v2 detector labels on v1 corpus (post-audit).
- `labels/labels_detector_v1_v3corpus.json`: v1 detector run against the v3 corpus (baseline for v3 comparison).
- `labels/labels_detector_v3.json`: v3 detector labels on v3 corpus.
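Before recomputing anything, it may help to confirm that all eight files listed above are present. This is a minimal sketch: the file names come from the list above, while `missing_labels` is a hypothetical helper. It checks existence only and assumes nothing about the JSON schema inside each file.

```python
from pathlib import Path

# (label source, corpus version) for each file named in this README.
LABEL_FILES = {
    "labels_curator.json": ("curator", "v1"),
    "labels_curator_v3.json": ("curator", "v3"),
    "labels_llm.json": ("llm-judge", "v1"),
    "labels_llm_v3.json": ("llm-judge", "v3"),
    "labels_detector.json": ("detector v1", "v1"),
    "labels_detector_v2.json": ("detector v2", "v1"),
    "labels_detector_v1_v3corpus.json": ("detector v1", "v3"),
    "labels_detector_v3.json": ("detector v3", "v3"),
}


def missing_labels(labels_dir="labels"):
    """Return the expected label files that are absent from labels_dir."""
    base = Path(labels_dir)
    return [name for name in LABEL_FILES if not (base / name).is_file()]
```

An empty list from `missing_labels()` means every input that the metric scripts read is in place.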

## How to read these

The labelers across all three iterations were two coders: the curator (the author) and an LLM-judge. The LLM-judge is permissive by construction (it flagged 78 percent of available slots positive against the curator's 30 percent), so the majority-union ground truth rewards detectors that over-fire. The reports compute against both majority-union and a stricter intersection.
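The effect of the ground-truth choice can be illustrated with a toy example. This is a sketch under stated assumptions, not the code in `scripts/04_compute_metrics.py`: it treats the two-coder "majority-union" as a plain union of the curator and LLM-judge labels, represents labels as a mapping from document to a set of frame IDs (a hypothetical schema), and scores a frame with no true positives as F1 = 0. The frame names and documents are invented.

```python
def ground_truth(curator, llm, mode="union"):
    """Combine two coders' labels per document by union or intersection."""
    docs = curator.keys() | llm.keys()
    if mode == "union":
        return {d: curator.get(d, set()) | llm.get(d, set()) for d in docs}
    if mode == "intersection":
        return {d: curator.get(d, set()) & llm.get(d, set()) for d in docs}
    raise ValueError(f"unknown mode: {mode}")


def f1_for_frame(truth, pred, frame):
    """Binary F1 for one frame across all documents in truth."""
    tp = fp = fn = 0
    for doc, true_frames in truth.items():
        t = frame in true_frames
        p = frame in pred.get(doc, set())
        tp += int(t and p)
        fp += int(p and not t)
        fn += int(t and not p)
    denom = 2 * tp + fp + fn
    # ASSUMPTION: a frame with no positives and no predictions scores 0.0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(truth, pred, frames):
    """Unweighted mean of per-frame F1."""
    return sum(f1_for_frame(truth, pred, f) for f in frames) / len(frames)


# Toy data (hypothetical): a strict curator, a permissive LLM-judge,
# and a detector that over-fires on FRAME-Y.
frames = ["FRAME-X", "FRAME-Y"]
curator = {"doc1": {"FRAME-X"}, "doc2": set()}
llm = {"doc1": {"FRAME-X", "FRAME-Y"}, "doc2": {"FRAME-Y"}}
detector = {"doc1": {"FRAME-X", "FRAME-Y"}, "doc2": {"FRAME-Y"}}

union_score = macro_f1(ground_truth(curator, llm, "union"), detector, frames)
strict_score = macro_f1(ground_truth(curator, llm, "intersection"), detector, frames)
print(union_score, strict_score)  # prints 1.0 0.5
```

In this toy case the over-firing detector matches the union truth perfectly but loses half its macro-F1 under intersection, which is why both numbers are worth reading together.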

Track B, validation against independent human annotators, has not yet been run. All current numbers are tuning-set or two-coder numbers.

## What this evidence supports

That the named-pattern layer ships as hypothesis-with-evidence, not verdict, and that the rest of the Frame Check stack (deterministic structural floor, structured-API verification, multi-stage claim cascade, interpretation isolated from verification) does not depend on the named-pattern layer to function. The construct-honesty surfaces in the tool ("low structural coverage of X" rather than "does not address X") were added in the same-day v2 audit.

## What this evidence does not support

That the named-pattern detector is reliable on novel documents. Held-out generalization is the open empirical question. Track B with independent human annotators is the next study.

The MCP server source, the FVS library entries, the corpus, and the rule definitions live at [github.com/Clarethium/frame-check-mcp](https://github.com/Clarethium/frame-check-mcp). The web app is at [frame.clarethium.com](https://frame.clarethium.com).
