Receipts: Frame Check

Raw artifacts behind the published finding. The prompts, the outputs, the scoring, and the analysis.

The post body claims macro-F1 0.157 (v1), 0.274 (v2), 0.360 (v3) against a pre-registered useful threshold of 0.4. This directory has the corpus, the labels, the scripts that produced those numbers, the result JSONs, the pre-registration documents, and the three reports written before the next iteration was tested.

A reader who wants to recompute the numbers can. The pipeline runs from scripts/01_assemble_corpus.py through scripts/11_compute_v3_metrics.py against the corpus and labels in this directory.

What you will find here

Reports (the analysis)

REPORT.md is the v1 report (n=12, 2026-04-18). The first scaled test of the named-pattern detector against pre-registered thresholds. Macro-F1 0.157. The bar was set in writing before the test ran. The detector did not clear it.

REPORT_V2.md is the v2 same-day rule audit (n=12, 2026-04-18). Three v1 detection rules were retired (FVS-001 Frame Amplification, FVS-008 Growth, FVS-015 Efficiency) for firing on cases they should not flag. Macro-F1 moved to 0.274. Still below the floor.

REPORT_V3_TRACK_A.md is the v3 follow-on study (n=28, 2026-04-19). Signal-level additions (S-1 hedge, S-2 citation, S-3 growth, S-4 efficiency vocabularies). Macro-F1 0.360. Two retired frames reintroduced with revised signal substrate; FVS-001 stayed retired. Two of the four per-frame "passes" are statistically underpowered at low n.
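
To make the low-n caveat concrete, here is a minimal sketch of a Wilson score interval for a proportion such as per-frame recall. The counts are hypothetical and do not come from results_v3.json, the 0.4 comparison is only a reference point, and the report may assess power differently.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical counts: the same observed rate (0.75) at two sample sizes.
for hits, total in [(3, 4), (30, 40)]:
    low, high = wilson_interval(hits, total)
    print(f"{hits}/{total}: [{low:.2f}, {high:.2f}]")
```

At n=4 the interval spans values on both sides of 0.4, so a "pass" at that sample size says little. At n=40 the same observed rate sits clearly above it.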

Pre-registration (the bar)

DESIGN.md is the v1 pre-registration. Hypotheses, thresholds, and corpus design fixed before any data was collected.

DESIGN_v2.md is the v2 pre-registration that locked the H-A1 through H-A7 hypotheses and thresholds for the v3 expanded study. Authored before v3 corpus assembly began.

DESIGN_v3.md is the v3 corpus design. Stratified sampling, label-source rules, and the rationale for n=28.

Audit (the v1 to v2 transition)

RULE_AUDIT.md is the same-day audit that retired three v1 rules (FVS-001, FVS-008, FVS-015) and re-ran the metrics. Documents which corpus cases triggered the audit and what each rule was firing on.

Results (the numbers)

results.json — v1 metrics output (macro-F1 0.157, per-frame F1 distribution, confusion matrices).

results_v3.json — v3 metrics output (macro-F1 0.360, per-frame results, n-of-positives per frame).

v1_vs_v2_comparison.json — the v1 to v2 delta, generated by the same-day rule audit.
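
For readers checking the headline figures, a minimal sketch of how a macro-F1 is built from per-frame counts. It uses the standard definition (unweighted mean of per-frame F1). The frame names and counts are made up, and scoring an empty frame as 0.0 is an assumption; scripts/04_compute_metrics.py may handle that case differently.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); 0.0 when there is nothing to score (assumed)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_frame: dict[str, tuple[int, int, int]]) -> float:
    """Unweighted mean of per-frame F1, so rare frames count as much as common ones."""
    if not per_frame:
        raise ValueError("need at least one frame")
    return sum(f1(*counts) for counts in per_frame.values()) / len(per_frame)


# Hypothetical (tp, fp, fn) counts for three made-up frames.
counts = {"frame_x": (3, 1, 1), "frame_y": (1, 4, 2), "frame_z": (0, 2, 3)}
USEFUL_THRESHOLD = 0.4  # the pre-registered floor quoted in the post body

score = macro_f1(counts)
print(f"macro-F1 = {score:.3f}, clears floor: {score >= USEFUL_THRESHOLD}")
```

Because the mean is unweighted, one frame that never fires correctly pulls the whole score down, whatever the other frames do.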

Pipeline (how to recompute)

The 12-step pipeline plus three frame-library implementations live in scripts/. Steps run in order:

  • scripts/01_assemble_corpus.py builds the v1 n=12 stratified corpus.
  • scripts/02_llm_judge_labels.py runs the LLM-judge labeler over that corpus.
  • scripts/03_detector_labels.py runs the v1 deterministic detector over the corpus.
  • scripts/04_compute_metrics.py reads labels/labels_curator.json, labels/labels_llm.json, labels/labels_detector.json and produces results.json.
  • scripts/05_rerun_v2_detector.py retires the three audited rules and reruns; scripts/04_compute_metrics.py then produces v1_vs_v2_comparison.json.
  • scripts/06_expand_corpus.py and scripts/07_assemble_stratum_a_v3.py build the v3 n=28 corpus.
  • scripts/08_llm_judge_labels_v3.py produces the LLM-judge labels on the v3 corpus; scripts/09_detector_v1_baseline_v3corpus.py runs the v1 detector against the same corpus as the baseline for v3 comparison.
  • scripts/10_run_v3_detector.py runs the v3 detector with S-1 through S-4 signal additions.
  • scripts/11_compute_v3_metrics.py reads the v3 labels and produces results_v3.json.
  • scripts/12_assemble_heldout.py assembles the held-out corpus for Track B (the next study, not yet run).

The frame definitions and signal regexes live in scripts/frame_library_v2.py, scripts/frame_library_v3.py, and scripts/framing_v2.py.
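
The step order above can be sketched as a small runner. This is a sketch under assumptions, not the repository's own tooling: it assumes each script runs without arguments from the directory root, and that scripts/04_compute_metrics.py is simply invoked a second time after step 05. Step 12 is left out because it prepares Track B and is not part of the published numbers.

```python
import subprocess
import sys
from pathlib import Path

# Order as listed above; 04 appears twice (once for v1, once after the v2 rerun).
STEPS = [
    "01_assemble_corpus.py",
    "02_llm_judge_labels.py",
    "03_detector_labels.py",
    "04_compute_metrics.py",
    "05_rerun_v2_detector.py",
    "04_compute_metrics.py",
    "06_expand_corpus.py",
    "07_assemble_stratum_a_v3.py",
    "08_llm_judge_labels_v3.py",
    "09_detector_v1_baseline_v3corpus.py",
    "10_run_v3_detector.py",
    "11_compute_v3_metrics.py",
]


def build_commands(root: Path) -> list[list[str]]:
    """One command per step, using the current Python interpreter."""
    return [[sys.executable, str(root / "scripts" / step)] for step in STEPS]


def run_pipeline(root: Path, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(root):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(cmd, cwd=root, check=True)
```

Calling run_pipeline(Path("."), dry_run=True) only prints the commands. The label files already ship in labels/, so rerunning just the metrics steps may be enough to recompute the numbers. Whether the LLM-judge steps are deterministic or need credentials is not stated here.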

Corpus (the inputs)

The 28 corpus documents live in corpus/. Filenames are stratum-prefixed: a* are essays / opinion / methodological documents (a01_altman_intelligence_age.txt, a02_fomc_march_2026.txt, etc.), b* are press / report / news (b01_nvidia_investment.txt, etc.), c* are encyclopedia / reference (c01_wikipedia_semaglutide.txt, etc.). Stratum boundaries and selection criteria are in DESIGN_v3.md.
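
The filename convention can be read mechanically. A minimal sketch, assuming the stratum is always the first character of the filename and that corpus files end in .txt, as in the examples above; the helper names are hypothetical.

```python
from collections import Counter
from pathlib import Path

# Stratum descriptions as given above.
STRATA = {
    "a": "essays / opinion / methodological",
    "b": "press / report / news",
    "c": "encyclopedia / reference",
}


def stratum_of(filename: str) -> str:
    """Map a corpus filename to its stratum via the one-letter prefix."""
    prefix = Path(filename).name[:1].lower()
    if prefix not in STRATA:
        raise ValueError(f"unrecognized stratum prefix in {filename!r}")
    return STRATA[prefix]


def count_strata(corpus_dir: Path) -> Counter:
    """Count documents per stratum in a corpus directory."""
    return Counter(stratum_of(p.name) for p in sorted(corpus_dir.glob("*.txt")))


print(stratum_of("a01_altman_intelligence_age.txt"))
```

Running count_strata(Path("corpus")) gives a quick check of the per-stratum counts against DESIGN_v3.md.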

Labels (the ground truth)

The eight label files live in labels/. They come from three label sources (curator, LLM-judge, detector) across two corpus versions (v1 n=12, v3 n=28):

  • labels/labels_curator.json — curator labels on v1 corpus.
  • labels/labels_curator_v3.json — curator labels on v3 corpus.
  • labels/labels_llm.json — LLM-judge labels on v1 corpus.
  • labels/labels_llm_v3.json — LLM-judge labels on v3 corpus.
  • labels/labels_detector.json — v1 detector labels on v1 corpus.
  • labels/labels_detector_v2.json — v2 detector labels on v1 corpus (post-audit).
  • labels/labels_detector_v1_v3corpus.json — v1 detector run against the v3 corpus (baseline for v3 comparison).
  • labels/labels_detector_v3.json — v3 detector labels on v3 corpus.
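
Before recomputing anything, it can help to confirm that all eight files are present and parse. A minimal sketch: it assumes only that each file is valid JSON and makes no assumption about the schema inside, which this page does not describe.

```python
import json
from pathlib import Path

# The eight label files listed above.
LABEL_FILES = [
    "labels_curator.json",
    "labels_curator_v3.json",
    "labels_llm.json",
    "labels_llm_v3.json",
    "labels_detector.json",
    "labels_detector_v2.json",
    "labels_detector_v1_v3corpus.json",
    "labels_detector_v3.json",
]


def check_label_files(root: Path) -> dict[str, str]:
    """Return a status per expected file: 'ok', 'missing', or 'invalid json'."""
    status = {}
    for name in LABEL_FILES:
        path = root / "labels" / name
        if not path.is_file():
            status[name] = "missing"
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
            status[name] = "ok"
        except json.JSONDecodeError:
            status[name] = "invalid json"
    return status
```

Calling check_label_files(Path(".")) from the directory root should report "ok" for all eight entries on a complete checkout.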

How to read these

Ground truth across all three iterations comes from two coders: the curator (the author) and an LLM-judge. The LLM-judge is permissive by construction: it flagged 78 percent of available slots positive against the curator's 30 percent. The majority-union ground truth therefore rewards detectors that over-fire. The reports compute metrics against both majority-union and a stricter intersection.
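
A minimal sketch of why the choice of ground truth matters. It assumes that "majority-union" with two coders reduces to "positive if either coder flags the slot", and that a slot is one (document, frame) pair; both are assumptions, and the reports may define them differently. The labels are hypothetical and only loosely echo the rates quoted above.

```python
def combine(curator: dict[str, bool], judge: dict[str, bool], mode: str) -> dict[str, bool]:
    """Merge two coders' labels slot by slot."""
    if mode == "union":  # positive if either coder flags the slot (assumed)
        return {s: curator[s] or judge[s] for s in curator}
    if mode == "intersection":  # positive only if both coders flag the slot
        return {s: curator[s] and judge[s] for s in curator}
    raise ValueError(f"unknown mode: {mode}")


def f1_against(truth: dict[str, bool], predicted: dict[str, bool]) -> float:
    """F1 of predicted labels against a ground-truth labeling."""
    tp = sum(truth[s] and predicted[s] for s in truth)
    fp = sum((not truth[s]) and predicted[s] for s in truth)
    fn = sum(truth[s] and (not predicted[s]) for s in truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical labels over 10 slots: strict curator, permissive judge.
slots = [f"slot_{i}" for i in range(10)]
curator = {s: i < 3 for i, s in enumerate(slots)}  # 3 of 10 positive
judge = {s: i < 8 for i, s in enumerate(slots)}    # 8 of 10 positive
fires_everywhere = {s: True for s in slots}        # a detector that always fires

for mode in ("union", "intersection"):
    truth = combine(curator, judge, mode)
    print(mode, round(f1_against(truth, fires_everywhere), 3))
```

In this toy setup the always-firing detector scores far higher against the union than against the intersection. That is the over-firing reward described above.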

Track B, validation against independent human annotators, has not yet run. All current numbers are tuning-set or two-coder numbers.

What this evidence supports

That the named-pattern layer ships as hypothesis-with-evidence, not verdict, and that the rest of the Frame Check stack (deterministic structural floor, structured-API verification, multi-stage claim cascade, interpretation isolated from verification) does not depend on the named-pattern layer to function. The construct-honesty surfaces in the tool ("low structural coverage of X" rather than "does not address X") were added in the same-day v2 audit.

What this evidence does not support

That the named-pattern detector is reliable on novel documents. Held-out generalization is the open empirical question. Track B with independent human annotators is the next study.

The MCP server source, the FVS library entries, the corpus, and the rule definitions live at github.com/Clarethium/frame-check-mcp. The web app is at frame.clarethium.com.
