# External validation study: Frame Check detector vs. multi-labeler agreement

**Status:** pre-registration. Written 2026-04-18 before any labeling or detector runs.
**Curator:** Lovro Lucic.
**Scope:** v1, single-session, exploratory. Not a publication; not a substitute for an M1-engaged inter-rater study.

---

## 1. Purpose

Frame Check has seven load-bearing claims (see `../SPEC.md §3`). The claim this study tests:

> Frame Check's detector (`frame_library.suggest_frames`, using `framing.py` signal substrate) produces framing labels that agree with careful human or LLM analytical reading at levels useful for the intended audience.

No prior empirical test of this claim exists in the project. The project's own `test_framing_validation.py` tests internal detector consistency on curator-constructed cases; that is structural validity but not external validity. Worked examples in `data/worked_examples/` are curator-authored ground truth that risks circularity.

This study produces the first measurement of detector-vs-labeler agreement on documents labeled by at least two independent sources (curator + LLM-fresh-context).

---

## 2. Hypotheses (pre-declared)

**H1 (primary).** The detector's per-frame labels agree with the majority label of the human and LLM labelers at macro-F1 of at least 0.6 (the H1 boundary used in §6) on a mixed-genre corpus of 12 documents (§3).

**H2 (secondary).** Detector agreement with labelers is approximately stable across document genres (AI-generated analysis, news article, policy document).

**H3 (null for falsification).** If macro-F1 is below 0.4, the detector does not reliably measure what multi-source labeling identifies; the core claim is disproved at preliminary-study scale and substantial rework is required.

---

## 3. Corpus (pre-declared, frozen before detector runs)

12 documents, three strata of 4 each:

**Stratum A: Existing worked examples (in-distribution, real AI outputs).** Four of the five files in `data/worked_examples/`. Selecting:
- `the-intelligence-age-altman-2024.md` (AI-company manifesto)
- `fomc-statement-march-2026.md` (policy document)
- `grok-on-nvidia-earnings-2026.md` (AI-on-finance)
- `ai-on-life-decisions-startup-2026.md` (AI-on-life-decisions)

One omitted (`four-llms-on-bitcoin-retirement-2026.md`) because it is a comparison of four model outputs rather than one analytical document.

**Stratum B: Freshly generated AI analyses (controlled-prompt).** Four documents generated via Claude API with documented prompts on topics chosen to stress different parts of the frame taxonomy. Prompts are pre-declared; the model version is recorded; the system prompt is the canonical minimal one per `SPEC.md §6.2`.

**Stratum C: Human-written documents (out-of-distribution for Frame Check).** Four documents from publicly available sources (Wikipedia excerpts, government reports, or published essays) on analytical topics. Fetched via curl; URLs recorded; fetched-text preserved in corpus.

For each document: record source, license, word count, and stratum.

Corpus is frozen once assembled. No swaps based on subsequent labels.
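A minimal sketch of how the per-document record and the freeze check could be implemented. The function names, the manifest field names, and the choice of SHA-256 are assumptions for illustration; the design only requires that source, license, word count, stratum, and a hash be recorded.

```python
import hashlib


def manifest_entry(name: str, raw: bytes, stratum: str, source: str, license_: str) -> dict:
    """Build one manifest record for a corpus document (field names are hypothetical)."""
    text = raw.decode("utf-8")
    return {
        "file": name,
        "stratum": stratum,                # "A", "B", or "C"
        "source": source,                  # path, prompt id, or URL
        "license": license_,
        "word_count": len(text.split()),   # whitespace-delimited tokens
        "sha256": hashlib.sha256(raw).hexdigest(),
    }


def is_unchanged(raw: bytes, entry: dict) -> bool:
    """Return True if the document bytes still match the hash recorded at freeze time."""
    return hashlib.sha256(raw).hexdigest() == entry["sha256"]
```

Re-running `is_unchanged` on every document before each later step gives a cheap check that the corpus was not altered after the freeze.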

---

## 4. Labelers

**Labeler A (Curator, in the loop).** Labels are produced by the AI assistant in this session, acting as curator on Lovro Lucic's behalf, and are documented in the session record. This is a known construction bias, named and accepted because M1's independent curator has not engaged. Curator labels are produced BEFORE detector output is read, via the following protocol:

- Read document.
- For each of the 20 FVS entries (including meta-side), record binary judgment: does the document exhibit this frame, yes/no?
- Multi-label allowed and expected; several frames may co-occur.
- No consultation with existing worked-example labels in `data/worked_examples/` before labeling. For worked-example documents, the labeling is done by re-reading the document text only, not the existing analysis.

**Labeler B (LLM-judge, fresh context).** Claude via Anthropic API (Claude Opus 4.7 per available key). Provided with:
- The 20 FVS library entry definitions (pulled from `../../data/frame_library/` markdown files, identification sections only).
- The document text.
- A canonical labeling prompt asking for per-frame binary judgments with brief reasoning.

Output format: JSON list of {fvs_id, exhibits: bool, reasoning: one sentence}. No numerical confidence (keep binary for first study).

Labeler B is run after all Labeler A labels are frozen. Labeler B is NOT shown the curator's labels or the detector's output.
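The output format above can be checked mechanically before Labeler B's labels are frozen. The following is a sketch under assumptions: the function name is hypothetical, the example ids such as `FVS-01` are placeholders, and rejecting missing, duplicate, or unknown ids is a design choice not stated in this document.

```python
import json


def parse_llm_labels(raw_json: str, expected_ids: set[str]) -> dict[str, bool]:
    """Validate a Labeler B response and reduce it to {fvs_id: exhibits}."""
    items = json.loads(raw_json)
    if not isinstance(items, list):
        raise ValueError("expected a JSON list of per-frame judgments")
    labels: dict[str, bool] = {}
    for item in items:
        if not isinstance(item, dict) or set(item) != {"fvs_id", "exhibits", "reasoning"}:
            raise ValueError(f"unexpected item shape: {item!r}")
        if not isinstance(item["exhibits"], bool):
            raise ValueError("exhibits must be a JSON boolean (binary labels only)")
        if item["fvs_id"] in labels:
            raise ValueError(f"duplicate judgment for {item['fvs_id']}")
        labels[item["fvs_id"]] = item["exhibits"]
    # Every frame must be judged exactly once; nothing outside the library is accepted.
    if set(labels) != expected_ids:
        raise ValueError("judged ids do not match the expected frame ids")
    return labels
```

A call with a two-frame placeholder library would look like `parse_llm_labels(response_text, {"FVS-01", "FVS-02"})`; in the study the expected set would hold all 20 FVS ids.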

**Labeler C (Detector).** `frame_library.suggest_frames` output on each document. Binary per-frame: frame is listed in output = exhibits, frame is not listed = does not exhibit.

Labeler C is run LAST (after A and B are frozen) to prevent contamination of the study.
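The listed/not-listed rule for Labeler C can be sketched as follows. This assumes the frame ids have already been extracted from the `frame_library.suggest_frames` output into a set; the return shape of `suggest_frames` is not specified here, so that extraction step is deliberately left out, and `binarize` is a hypothetical helper name.

```python
def binarize(listed_ids: set[str], all_ids: list[str]) -> dict[str, bool]:
    """Map detector output to one binary label per frame: listed means exhibits."""
    unknown = listed_ids - set(all_ids)
    if unknown:
        # An id outside the library would silently vanish from the metrics otherwise.
        raise ValueError(f"detector listed ids outside the library: {sorted(unknown)}")
    return {frame_id: frame_id in listed_ids for frame_id in all_ids}
```

Producing an explicit `False` for every unlisted frame keeps Labeler C's output the same shape as the labels from A and B.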

---

## 5. Agreement metrics (pre-declared)

Computed per-frame and aggregated:

- **Cohen's kappa** pairwise: (A vs B), (A vs C), (B vs C). Macro average across frames.
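- **Per-frame F1** of C (detector) against the majority label of A and B. With two labelers, a majority exists only where A and B agree; when they disagree on a frame, that frame is "contested" and a separate breakdown counts contested cases.
- **Macro-F1** of C against the same A/B majority reference, averaged across frames.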
- **Confusion matrix** per frame: true-positive, false-positive, false-negative, true-negative counts.
- **Per-stratum breakdown** (A, B, C strata of the corpus, not labelers): does detector perform differently on in-distribution AI outputs vs out-of-distribution human text?
- **Inter-labeler disagreement patterns**: on which frames do A and B disagree most? These are taxonomy-ambiguity signals.
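The pairwise kappa, the per-frame confusion counts, and the F1 aggregation can be sketched in plain Python. Two handling rules in this sketch are assumptions that this design does not fix: contested cases are kept out of the confusion counts and tallied separately, and frames whose kappa or F1 is undefined (for example, a frame no labeler ever assigns) are returned as NaN and skipped in the macro average. All function names are hypothetical.

```python
import math
from collections import Counter


def cohen_kappa(a: list[bool], b: list[bool]) -> float:
    """Cohen's kappa for two binary label vectors over the same documents."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("label vectors must be non-empty and of equal length")
    p_o = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    pa, pb = sum(a) / n, sum(b) / n                  # positive rates per labeler
    p_e = pa * pb + (1 - pa) * (1 - pb)              # agreement expected by chance
    if p_e == 1.0:
        return float("nan")                          # both labelers constant: undefined
    return (p_o - p_e) / (1 - p_e)


def confusion(a: list[bool], b: list[bool], c: list[bool]) -> Counter:
    """Count detector outcomes (c) against the A/B reference for one frame."""
    counts: Counter = Counter()
    for x, y, z in zip(a, b, c):
        if x != y:
            counts["contested"] += 1                 # assumption: excluded from F1
        elif x:
            counts["tp" if z else "fn"] += 1
        else:
            counts["fp" if z else "tn"] += 1
    return counts


def f1(counts: Counter) -> float:
    """F1 from confusion counts; NaN when the frame has no positives at all."""
    denom = 2 * counts["tp"] + counts["fp"] + counts["fn"]
    return 2 * counts["tp"] / denom if denom else float("nan")


def macro_f1(per_frame: list[float]) -> float:
    """Unweighted mean of the defined per-frame F1 values."""
    defined = [v for v in per_frame if not math.isnan(v)]
    return sum(defined) / len(defined) if defined else float("nan")
```

With 12 documents, rare frames will often produce constant label vectors, so the number of frames that actually contribute to each macro average is worth reporting alongside the average itself.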

---

## 6. Acceptance thresholds (pre-declared)

Assessed on detector-vs-majority macro-F1:

| Range | Interpretation |
|-------|---------------|
| F1 ≥ 0.7 | Provisional evidence of strong detector-labeler agreement at this scale (no external benchmark comparison is made; see §10). H1 strongly supported. |
| 0.6 ≤ F1 < 0.7 | Detector aligned with labelers at usefully high levels. H1 supported. |
| 0.4 ≤ F1 < 0.6 | Detector partially aligned. Product positioning and canon claims need scope revision. |
| F1 < 0.4 | Detector does not reliably measure what multi-source labeling identifies. H3 fires; project pivot required. |

These thresholds are chosen in advance to prevent motivated interpretation.
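Encoding the table as a function before any results exist is one way to keep the boundaries fixed. This is a sketch; the function name and the shortened return strings are illustrative, while the cut-offs are taken directly from the table above.

```python
def interpret(macro_f1: float) -> str:
    """Map detector-vs-majority macro-F1 to the pre-declared interpretation band."""
    if macro_f1 >= 0.7:
        return "H1 strongly supported"
    if macro_f1 >= 0.6:
        return "H1 supported"
    if macro_f1 >= 0.4:
        return "partially aligned; scope revision needed"
    return "H3 fires; project pivot required"
```

For example, `interpret(0.35)` falls in the H3 band and `interpret(0.6)` falls in the "H1 supported" band, since the lower bound of each band is inclusive.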

---

## 7. Honest limits (pre-declared)

1. **n=12 is small.** Results are preliminary. Confidence intervals at this n are wide. The study is a signal-or-no-signal check, not a precision measurement.
2. **Labeler A is the curator.** The curator built the library and is trained on its distinctions. Agreement between curator-as-labeler-A and detector-as-labeler-C is partially self-confirming. This is explicitly named; the comparisons of MOST interest are A-vs-B (two independent labelers with different biases) and C-vs-(A/B majority) (does the detector agree with BOTH independent labelers?).
3. **Labeler B is an LLM.** Known biases: position, length, self-preference, prompt-sensitivity. The prompt is canonical and pre-declared, but B's labels are not human-expert judgment.
4. **No independent human annotator.** The definitive test requires M1 co-curator or external reviewers from REVIEWERS.md. This study is a preliminary signal that either motivates the full study or flags reasons to restructure before investing in it.
5. **FVS taxonomy mapping to any external framework (Media Frames Corpus, Entman) is not attempted.** Corpus is drawn from the project's own intended domain (AI outputs + adjacent analytical text); comparison to MFC or other external taxonomies is deferred.
6. **Single session, single executor.** The curator generates documents, constructs labels, runs the harness, and writes the report in one session. No external replication.
7. **Stratum B (fresh AI generation) uses one model.** A fuller study would vary the model; v1 fixes on one for executability.

---

## 8. Decision rules for what to publish from this study

Regardless of outcome:

- **All labels are preserved** in `labels_curator.json`, `labels_llm.json`, `labels_detector.json` with the raw reasoning fields.
- **All documents are preserved** in `corpus/` with provenance recorded.
- **Results JSON is published** with per-frame F1, kappa, confusion matrices.
- **REPORT.md is written honestly**, including any F1 results in the H3 falsification zone (< 0.4). The report is NOT a marketing document; if results are bad, the report says so plainly and names the implications.

Negative or mixed results are NOT retroactively re-scoped. If F1 is 0.35, the report says F1 is 0.35 and does not redesign the study post-hoc to find a better number.

---

## 9. Execution order (strict)

1. Write this DESIGN.md (done).
2. Assemble corpus. Stratum A: copy from `data/worked_examples/`. Stratum B: generate via Anthropic API with pre-declared prompts. Stratum C: fetch via curl from pre-declared URLs.
3. Freeze corpus (tag each document with stratum, source, word count, hash).
4. Curator labels (A). Produce `labels_curator.json`. Do not run detector. Do not query LLM judge.
5. LLM labels (B). Single batch call to Anthropic API. Produce `labels_llm.json`. Do not read curator labels.
6. Detector labels (C). Run `frame_library.suggest_frames` on each document. Produce `labels_detector.json`.
7. Compute metrics. Produce `results.json`.
8. Write REPORT.md with findings.
9. Verify no prohibited punctuation; final quality gate.

Order is strict because backward contamination (e.g., tweaking curator labels after seeing detector output) invalidates the study's claim to independence.
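One way to enforce the ordering mechanically is a guard that inspects which output files already exist. The sketch below is an assumption about how such a guard could look, not part of the pre-registered harness; `may_produce` is a hypothetical name, and treating an existing file as frozen (never regenerated) is a design choice made here.

```python
# Output artifacts in the order steps 4 to 8 produce them.
ARTIFACT_ORDER = [
    "labels_curator.json",
    "labels_llm.json",
    "labels_detector.json",
    "results.json",
    "REPORT.md",
]


def may_produce(target: str, present: set[str]) -> bool:
    """Allow a step only if all earlier artifacts exist and no same-or-later one does."""
    idx = ARTIFACT_ORDER.index(target)
    earlier_done = all(name in present for name in ARTIFACT_ORDER[:idx])
    nothing_later = not any(name in present for name in ARTIFACT_ORDER[idx:])
    return earlier_done and nothing_later
```

Calling `may_produce("labels_curator.json", present)` once detector labels exist returns `False`, which blocks the backward edit the ordering is meant to prevent.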

---

## 10. What this study does NOT do

- Does not establish inter-rater reliability for the FVS taxonomy at publishable scale (needs independent humans, larger n).
- Does not measure user experience or reading-practice change.
- Does not validate the observatory or calibration datasets.
- Does not substitute for a lab-engagement-driven FVS-Eval baseline run (SPEC §11).
- Does not claim state-of-the-art on any external benchmark.

What it DOES: produce the first empirical data point on whether the core detection claim holds at preliminary scale under the project's own construct-honesty discipline.

---

*v1. 2026-04-18. Authored by Lovro Lucic. Pre-registered before corpus assembly or labeling. Any deviations from this design during execution will be named in REPORT.md with rationale.*