Receipts: Frame Check
Raw artifacts behind the published finding. The prompts, the outputs, the scoring, and the analysis.
The post body claims macro-F1 0.157 (v1), 0.274 (v2), 0.360 (v3) against a pre-registered useful threshold of 0.4. This directory has the corpus, the labels, the scripts that produced those numbers, the result JSONs, the pre-registration documents, and the three reports written before the next iteration was tested.
A reader who wants to recompute the numbers can. The pipeline runs from scripts/01_assemble_corpus.py through scripts/11_compute_v3_metrics.py against the corpus and labels in this directory.
What you will find here
Reports (the analysis)
REPORT.md is the v1 report (n=12, 2026-04-18). The first scaled test of the named-pattern detector against pre-registered thresholds. Macro-F1 0.157. The bar was set in writing before the test ran. The detector did not clear it.
REPORT_V2.md is the v2 same-day rule audit (n=12, 2026-04-18). Three v1 detection rules were retired (FVS-001 Frame Amplification, FVS-008 Growth, FVS-015 Efficiency) for firing on cases they should not flag. Macro-F1 moved to 0.274. Still below the floor.
REPORT_V3_TRACK_A.md is the v3 follow-on study (n=28, 2026-04-19). Signal-level additions (S-1 hedge, S-2 citation, S-3 growth, S-4 efficiency vocabularies). Macro-F1 0.360. Two retired frames reintroduced with revised signal substrate; FVS-001 stayed retired. Two of the four per-frame "passes" are statistically underpowered at low n.
Pre-registration (the bar)
DESIGN.md is the v1 pre-registration. Hypotheses, thresholds, and corpus design fixed before any data was collected.
DESIGN_v2.md is the v2 pre-registration that locked the H-A1 through H-A7 hypotheses and thresholds for the v3 expanded study. Authored before v3 corpus assembly began.
DESIGN_v3.md is the v3 corpus design. Stratified sampling, label-source rules, and the rationale for n=28.
Audit (the v1 to v2 transition)
RULE_AUDIT.md is the same-day audit that retired three v1 rules (FVS-001, FVS-008, FVS-015) and re-ran the metrics. Documents which corpus cases triggered the audit and what each rule was firing on.
Results (the numbers)
results.json — v1 metrics output (macro-F1 0.157, per-frame F1 distribution, confusion matrices).
results_v3.json — v3 metrics output (macro-F1 0.360, per-frame results, n-of-positives per frame).
v1_vs_v2_comparison.json — the v1 to v2 delta, generated by the same-day rule audit.
Pipeline (how to recompute)
The 12-step pipeline plus three frame-library implementations live in scripts/. Steps run in order:
- scripts/01_assemble_corpus.py builds the v1 n=12 stratified corpus.
- scripts/02_llm_judge_labels.py runs the LLM-judge labeler over that corpus.
- scripts/03_detector_labels.py runs the v1 deterministic detector over the corpus.
- scripts/04_compute_metrics.py reads labels/labels_curator.json, labels/labels_llm.json, labels/labels_detector.json and produces results.json.
- scripts/05_rerun_v2_detector.py retires the three audited rules and reruns; scripts/04_compute_metrics.py then produces v1_vs_v2_comparison.json.
- scripts/06_expand_corpus.py and scripts/07_assemble_stratum_a_v3.py build the v3 n=28 corpus.
- scripts/08_llm_judge_labels_v3.py and scripts/09_detector_v1_baseline_v3corpus.py produce v3 baseline labels for comparison.
- scripts/10_run_v3_detector.py runs the v3 detector with S-1 through S-4 signal additions.
- scripts/11_compute_v3_metrics.py reads the v3 labels and produces results_v3.json.
- scripts/12_assemble_heldout.py assembles the held-out corpus for Track B (the next study, not yet run).
The frame definitions and signal regexes live in scripts/frame_library_v2.py, scripts/frame_library_v3.py, and scripts/framing_v2.py.
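The step order described in this section can be sketched as a small driver. This is a sketch under assumptions, not the project's own runner: it assumes each script runs without command-line arguments from the directory that contains scripts/, and that the second call to 04_compute_metrics.py needs no extra flags. The names STEPS, build_commands, and run_pipeline are hypothetical. Step 12 is left out because Track B has not run, and the LLM-judge steps presumably need model access that this sketch does not configure.

```python
import subprocess
import sys
from pathlib import Path

# Step order as listed in this README. 04_compute_metrics.py appears twice
# because the README says it runs again after the v2 detector rerun.
STEPS = [
    "01_assemble_corpus.py",
    "02_llm_judge_labels.py",
    "03_detector_labels.py",
    "04_compute_metrics.py",
    "05_rerun_v2_detector.py",
    "04_compute_metrics.py",
    "06_expand_corpus.py",
    "07_assemble_stratum_a_v3.py",
    "08_llm_judge_labels_v3.py",
    "09_detector_v1_baseline_v3corpus.py",
    "10_run_v3_detector.py",
    "11_compute_v3_metrics.py",
]


def build_commands(scripts_dir="scripts", python=sys.executable):
    """Return one argv list per step, in order (assumes no CLI arguments)."""
    return [[python, str(Path(scripts_dir) / name)] for name in STEPS]


def run_pipeline(scripts_dir="scripts", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in build_commands(scripts_dir):
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the run at the first failing step.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

The dry run only prints the commands, which is a cheap way to confirm the order before spending time or API calls on a real run.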
Corpus (the inputs)
The 28 corpus documents live in corpus/. Filenames are stratum-prefixed: a* are essays / opinion / methodological documents (a01_altman_intelligence_age.txt, a02_fomc_march_2026.txt, etc.), b* are press / report / news (b01_nvidia_investment.txt, etc.), c* are encyclopedia / reference (c01_wikipedia_semaglutide.txt, etc.). Stratum boundaries and selection criteria are in DESIGN_v3.md.
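To check the stratum split locally, the filename convention can be used to group the corpus. The following is a minimal sketch: it assumes every corpus file is a .txt file whose first character is the stratum letter, as in the examples above. The helper name group_by_stratum is hypothetical and not part of the published scripts.

```python
from collections import defaultdict
from pathlib import Path

# Stratum letters and descriptions as given in this README.
STRATA = {
    "a": "essays / opinion / methodological documents",
    "b": "press / report / news",
    "c": "encyclopedia / reference",
}


def group_by_stratum(filenames):
    """Map each stratum letter to the sorted filenames carrying that prefix."""
    groups = defaultdict(list)
    for name in filenames:
        base = Path(name).name
        prefix = base[:1].lower()
        if prefix not in STRATA:
            raise ValueError(f"unexpected stratum prefix in {name!r}")
        groups[prefix].append(base)
    return {key: sorted(files) for key, files in sorted(groups.items())}


if __name__ == "__main__":
    corpus = Path("corpus")
    # Assumption: corpus documents use the .txt extension.
    names = [p.name for p in corpus.glob("*.txt")] if corpus.is_dir() else []
    for stratum, files in group_by_stratum(names).items():
        print(stratum, STRATA[stratum], len(files))
```

Raising on an unknown prefix is deliberate: a file outside the three strata would silently change the per-stratum counts otherwise.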
Labels (the ground truth)
The eight label files live in labels/. Three labelers (curator, LLM-judge, detector) across two corpus versions (v1 n=12, v3 n=28):
- labels/labels_curator.json: curator labels on v1 corpus.
- labels/labels_curator_v3.json: curator labels on v3 corpus.
- labels/labels_llm.json: LLM-judge labels on v1 corpus.
- labels/labels_llm_v3.json: LLM-judge labels on v3 corpus.
- labels/labels_detector.json: v1 detector labels on v1 corpus.
- labels/labels_detector_v2.json: v2 detector labels on v1 corpus (post-audit).
- labels/labels_detector_v1_v3corpus.json: v1 detector run against the v3 corpus (baseline for v3 comparison).
- labels/labels_detector_v3.json: v3 detector labels on v3 corpus.
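Before recomputing anything, it may help to confirm that all eight label files are present. The sketch below only checks for the files by name and makes no assumption about their JSON structure. The mapping keys and the helper missing_label_files are hypothetical names introduced for illustration.

```python
from pathlib import Path

# (labeler, corpus version) -> filename, as listed in this README.
LABEL_FILES = {
    ("curator", "v1"): "labels_curator.json",
    ("curator", "v3"): "labels_curator_v3.json",
    ("llm-judge", "v1"): "labels_llm.json",
    ("llm-judge", "v3"): "labels_llm_v3.json",
    ("detector v1", "v1"): "labels_detector.json",
    ("detector v2", "v1"): "labels_detector_v2.json",
    ("detector v1", "v3"): "labels_detector_v1_v3corpus.json",
    ("detector v3", "v3"): "labels_detector_v3.json",
}


def missing_label_files(labels_dir="labels"):
    """Return the sorted names of expected label files not found on disk."""
    root = Path(labels_dir)
    return sorted(
        name for name in LABEL_FILES.values() if not (root / name).is_file()
    )


if __name__ == "__main__":
    missing = missing_label_files()
    print("all label files present" if not missing else f"missing: {missing}")
```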
How to read these
Ground truth across all three iterations came from two coders: the curator (the author) and an LLM-judge. The LLM-judge is permissive by construction (it flagged 78 percent of available slots positive against the curator's 30 percent), so the majority-union ground truth rewards detectors that over-fire. The reports compute metrics against both the majority-union ground truth and a stricter intersection.
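The effect of the ground-truth choice can be illustrated with a small sketch. Several things here are assumptions: labels are represented as a dict from document ID to a set of frame IDs (the actual JSON schema is not described in this README), "majority-union" is read as "positive if either of the two coders flagged it", and a frame with no true positives, false positives, or false negatives scores 0.0. The toy labels and the frame names F-A and F-B are invented for illustration and are not drawn from the corpus.

```python
def combine(curator, llm_judge, mode):
    """Merge two coders' labels per document by union or intersection."""
    docs = curator.keys() | llm_judge.keys()
    if mode == "union":
        return {d: curator.get(d, set()) | llm_judge.get(d, set()) for d in docs}
    if mode == "intersection":
        return {d: curator.get(d, set()) & llm_judge.get(d, set()) for d in docs}
    raise ValueError(f"unknown mode: {mode}")


def f1_for_frame(frame, truth, predicted):
    """F1 for one frame over all documents; 0.0 when nothing is countable."""
    tp = fp = fn = 0
    for doc in truth.keys() | predicted.keys():
        in_truth = frame in truth.get(doc, set())
        in_pred = frame in predicted.get(doc, set())
        tp += in_truth and in_pred
        fp += in_pred and not in_truth
        fn += in_truth and not in_pred
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(frames, truth, predicted):
    """Unweighted mean of per-frame F1 scores."""
    return sum(f1_for_frame(f, truth, predicted) for f in frames) / len(frames)


# Toy data, invented for illustration only.
FRAMES = ["F-A", "F-B"]
curator = {"d1": {"F-A"}, "d2": set(), "d3": {"F-B"}}
llm_judge = {"d1": {"F-A", "F-B"}, "d2": {"F-A"}, "d3": {"F-B"}}
detector = {"d1": {"F-A", "F-B"}, "d2": {"F-A"}, "d3": set()}

if __name__ == "__main__":
    for mode in ("union", "intersection"):
        truth = combine(curator, llm_judge, mode)
        print(mode, f"{macro_f1(FRAMES, truth, detector):.3f}")
    # union 0.833
    # intersection 0.333
```

In this toy case the detector fires wherever the permissive coder fired, so it scores well against the union and poorly against the intersection, which is the over-firing reward the paragraph above describes.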
Track B, validation against independent human annotators, has not yet run. All current numbers are tuning-set or two-coder numbers.
What this evidence supports
That the named-pattern layer ships as hypothesis-with-evidence, not verdict, and that the rest of the Frame Check stack (deterministic structural floor, structured-API verification, multi-stage claim cascade, interpretation isolated from verification) does not depend on the named-pattern layer to function. The construct-honesty surfaces in the tool ("low structural coverage of X" rather than "does not address X") were added in the same-day v2 audit.
What this evidence does not support
That the named-pattern detector is reliable on novel documents. Held-out generalization is the open empirical question. Track B with independent human annotators is the next study.
The MCP server source, the FVS library entries, the corpus, and the rule definitions live at github.com/Clarethium/frame-check-mcp. The web app is at frame.clarethium.com.
Files in this folder
- DESIGN.md (10.1 KB)
Pre-registration. Hypotheses, thresholds, methods, and falsification conditions, all fixed before any data was collected.
- DESIGN_v2.md (17.4 KB)
v2 pre-registration. H-A1 through H-A7 hypotheses and thresholds for the v3 expanded study, authored before v3 corpus assembly.
- DESIGN_v3.md (9.8 KB)
v3 corpus design. Stratified sampling, label-source rules, rationale for n=28.
- README.md (6.5 KB)
Overview and how to read these artifacts.
- REPORT.md (21.1 KB)
FVS-EVAL Track A v1 report (n=12, 2026-04-18). First scaled test of the named-pattern detector. Macro-F1 0.157 against a pre-registered useful threshold of 0.4.
- REPORT_V2.md (14.4 KB)
FVS-EVAL Track A v2 same-day rule audit (n=12, 2026-04-18). Three v1 rules retired (FVS-001, FVS-008, FVS-015). Macro-F1 0.274.
- REPORT_V3_TRACK_A.md (22.2 KB)
FVS-EVAL Track A v3 follow-on study (n=28, 2026-04-19). Signal-level additions (S-1 through S-4). Macro-F1 0.360. Two retired frames reintroduced with revised substrate; FVS-001 stayed retired.
- RULE_AUDIT.md (22.3 KB)
Same-day v1 to v2 audit. Retired three v1 rules (FVS-001 Frame Amplification, FVS-008 Growth, FVS-015 Efficiency) for firing on cases they should not flag. Documents which corpus cases triggered the audit.
- results.json (13.6 KB)
v1 metrics output. Macro-F1 0.157, per-frame F1 distribution, confusion matrices.
- results_v3.json (16.5 KB)
v3 metrics output. Macro-F1 0.360, per-frame results, n-of-positives per frame, statistical-power assessment.
- v1_vs_v2_comparison.json (10.4 KB)
v1 to v2 delta from the same-day rule audit. Per-frame change after retiring FVS-001, FVS-008, FVS-015.