# FVS External Validation Study v1: Report

**Date:** 2026-04-18.
**Executed by:** curator + AI assistant (single session).
**Status:** Preliminary. n=12. Results are load-bearing for Frame Check's core detection claim; further replication with independent human annotators remains necessary.

---

## 1. Headline

**Primary metric: Macro-F1 of detector vs majority-union-of-labelers = 0.157.**

Per pre-registered DESIGN.md §6 thresholds:

| Zone | F1 Range | Pre-declared interpretation |
|------|----------|---------|
| State-of-art | ≥ 0.7 | H1 strongly supported |
| Useful | 0.6-0.7 | H1 supported |
| Partial | 0.4-0.6 | Scope revision required |
| **Falsification** | **< 0.4** | **H3 fires; project pivot required** |

**The pre-registered falsification threshold (H3) has fired.** The preliminary evidence is that the Frame Check detector, as currently implemented, does not reliably produce framing labels that agree with careful multi-source labeling on a mixed-genre corpus.

This finding is load-bearing and uncomfortable. It is reported per the pre-registered DESIGN.md §8 rule: "Negative or mixed results are NOT retroactively re-scoped."

---

## 2. Study summary

DESIGN.md §3 specified 12 documents across 3 strata of 4 each; the executed corpus keeps 12 documents but rebalances the strata to 2/6/4 (see below). Per DESIGN.md §4: three labelers. Per DESIGN.md §9 execution order: curator labels (A) were produced first and frozen, then LLM-judge labels (B) via the Anthropic API at temperature 0.0 with the FVS library reference in the prompt, then detector labels (C) via `frame_library.suggest_frames`.

**Corpus (12 documents):**

- Stratum A (n=2, reduced from 4 after v1 corpus-assembly failures, documented in `01_assemble_corpus.py` v2 docstring): human-authored public documents (Altman essay, FOMC March 2026 statement).
- Stratum B (n=6, expanded from 4 to compensate for Stratum A reduction): freshly generated AI analyses via Anthropic API, pre-declared prompts, Claude Sonnet 4.6 served.
- Stratum C (n=4): Wikipedia articles (semaglutide, EU AI Act, quantum supremacy, universal basic income).

**Labelers:**

- **A (Curator):** labels against all 20 FVS frames, binary, with reasoning per positive.
- **B (LLM-judge):** Claude Sonnet 4.6, temperature 0.0, provided with FVS identification-section reference for all 20 frames plus document text.
- **C (Detector):** `frame_library.suggest_frames` output; 11 text-side frames only (the frames with detection rules).

---

## 3. Primary results

### 3.1 Macro metrics (detector vs labelers)

| Metric | Value |
|--------|-------|
| Macro-F1 C vs A (curator) | **0.128** |
| Macro-F1 C vs B (LLM-judge) | **0.223** |
| Macro-F1 C vs majority-union (A∪B) | **0.157** (primary) |
| Macro-F1 C vs majority-intersection (A∩B) | **0.226** |
| Macro-kappa A vs B (11 detectable frames) | +0.288 |
| Macro-kappa A vs C | +0.031 |
| Macro-kappa B vs C | -0.008 |
| Mean kappa A vs B (all 20 frames, per-doc) | +0.279 |

The detector's kappa with LLM-judge is approximately zero (-0.008). The detector's kappa with curator is barely positive (+0.031). These are essentially chance-level agreements, not evidence of measurement alignment.
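For readers who want to check the kappa values, the following is a minimal sketch of Cohen's kappa for two binary labelers. It is not the code in `04_compute_metrics.py`; the function name `cohen_kappa` and the convention of returning 0.0 when expected agreement equals 1 are assumptions made for illustration. The label sets are the FVS-008 entries from the per-document table in §3.4.

```python
def cohen_kappa(pos_a, pos_b, all_docs):
    """Cohen's kappa for two binary labelers over the same documents.

    pos_a / pos_b are the sets of document IDs each labeler flagged.
    """
    n = len(all_docs)
    both_pos = len(pos_a & pos_b)
    both_neg = n - len(pos_a | pos_b)
    p_observed = (both_pos + both_neg) / n
    rate_a, rate_b = len(pos_a) / n, len(pos_b) / n
    p_expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_expected == 1.0:  # assumed convention: no variation at all, report 0
        return 0.0
    return (p_observed - p_expected) / (1 - p_expected)


DOCS = {"a01", "a02", "b01", "b02", "b03", "b04",
        "b05", "b06", "c01", "c02", "c03", "c04"}

# FVS-008 Growth Frame: documents flagged by each labeler (from §3.4)
curator_008 = {"a01", "b01"}
judge_008 = {"a01", "b01", "b04"}
detector_008 = {"b05", "c03", "c04"}

kappa_ab = cohen_kappa(curator_008, judge_008, DOCS)
kappa_ac = cohen_kappa(curator_008, detector_008, DOCS)
print(f"{kappa_ab:+.3f} {kappa_ac:+.3f}")  # +0.750 -0.250
```

The two printed values match the κ_AB and κ_AC cells reported for FVS-008 in §3.2. The macro figures in the table above would then be averages of such per-frame values, under whatever averaging rule the metrics script applies.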

### 3.2 Per-frame agreement (detector-side-only frames, 11 text-side)

| Frame | n_A | n_B | n_C | κ_AB | κ_AC | κ_BC | F1_C vs union | F1_C vs intersection |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| FVS-001 Frame Amplification | 1 | 0 | 4 | +0.000 | -0.154 | +0.000 | 0.000 | 0.000 |
| FVS-002 Fluency-Quality Illusion | 1 | 7 | 1 | +0.122 | +1.000 | +0.122 | 0.250 | 1.000 |
| FVS-007 Failure Framing | 1 | 0 | 3 | +0.000 | -0.143 | +0.000 | 0.000 | 0.000 |
| FVS-008 Growth Frame | 2 | 3 | 3 | +0.750 | -0.250 | -0.333 | 0.000 | 0.000 |
| FVS-009 Risk Frame | 3 | 9 | 2 | +0.200 | +0.250 | +0.125 | 0.364 | 0.400 |
| FVS-010 Completeness Illusion | 2 | 8 | 2 | +0.182 | +0.400 | -0.091 | 0.200 | 0.500 |
| FVS-011 Stakeholder Frame | 3 | 2 | 7 | +0.250 | +0.385 | +0.250 | **0.727** | 0.250 |
| FVS-012 Uncertainty Frame | 4 | 9 | 2 | +0.286 | +0.143 | -0.125 | 0.182 | 0.333 |
| FVS-014 Temporal Anchoring | 4 | 7 | 0 | +0.211 | +0.000 | +0.000 | **0.000** | 0.000 |
| FVS-015 Efficiency Frame | 1 | 1 | 2 | +1.000 | -0.125 | -0.125 | 0.000 | 0.000 |
| FVS-016 Authority by Citation | 6 | 3 | 0 | +0.167 | +0.000 | +0.000 | **0.000** | 0.000 |

Key observations:

- **Only FVS-011 Stakeholder Frame achieved F1 > 0.5** against majority-union. This is the single frame where the detector's pattern-matching appears to capture what readers see.
- **FVS-014 Temporal Anchoring and FVS-016 Authority by Citation had ZERO detector fires across all 12 documents**, despite curator flagging FVS-014 in 4 documents and FVS-016 in 6 documents (the most-flagged frame by curator). This indicates a capability gap: the detector never fires on these frames in the observed corpus, suggesting that either the rule thresholds are too tight for typical analytical text or the rules cannot capture these frames as the library defines them.
- **FVS-001 Frame Amplification and FVS-007 Failure Framing were detector over-fires**: detector flagged FVS-001 in 4 documents and FVS-007 in 3, while curator flagged each exactly once (the Altman essay). This points to rule thresholds that are too permissive for the signal they are meant to detect.
- **FVS-002 Fluency-Quality Illusion shows a curious pattern**: κ_AC = +1.00 (curator and detector perfectly agree, both flag exactly 1 doc, the Altman essay), while κ_BC = +0.122 (LLM-judge flagged 7 docs; detector flagged 1). Here the detector is CONSERVATIVE and the LLM-judge is permissive.
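The F1 columns can be reproduced from the label sets. The sketch below is illustrative only (the helper name `f1` is an assumption, not taken from the study scripts). It computes F1 for FVS-011 against the union and the intersection of curator and LLM-judge labels, using the document IDs from §3.4, and then averages the per-frame union column of the table above.

```python
def f1(predicted, reference):
    """F1 between two sets of document IDs; 0.0 when there is no overlap."""
    true_pos = len(predicted & reference)
    if true_pos == 0:
        return 0.0
    # Equivalent to 2PR / (P + R) with P = tp/|predicted|, R = tp/|reference|
    return 2 * true_pos / (len(predicted) + len(reference))


# FVS-011 Stakeholder Frame: documents flagged by each labeler (from §3.4)
curator = {"b02", "b04", "b05"}
judge = {"b02", "c02"}
detector = {"b01", "b02", "b03", "b04", "b05", "c02", "c04"}

f1_union = f1(detector, curator | judge)
f1_intersection = f1(detector, curator & judge)

# "F1_C vs union" column copied from the table above, in row order
per_frame_f1_union = [0.000, 0.250, 0.000, 0.000, 0.364, 0.200,
                      0.727, 0.182, 0.000, 0.000, 0.000]
macro_f1_union = sum(per_frame_f1_union) / len(per_frame_f1_union)

print(round(f1_union, 3), round(f1_intersection, 3), round(macro_f1_union, 3))
# 0.727 0.25 0.157
```

The unweighted mean of the per-frame column reproduces the headline 0.157, which is consistent with the primary metric being a simple average over the 11 detectable frames. Seven of those frames contribute 0.000, so the single strong frame (FVS-011) cannot lift the average far.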

### 3.3 Per-stratum macro-F1 (C vs majority-union)

| Stratum | Macro-F1 | n |
|---------|---------:|--:|
| A (human-authored published) | 0.525 | 2 |
| B (fresh AI-generated) | 0.194 | 6 |
| C (Wikipedia encyclopedic) | 0.100 | 4 |

**Detector performance is substantially better on human-authored public documents (A) than on AI-generated analytical text (B) and Wikipedia encyclopedic text (C).** With n=2, however, the stratum A figure is carried almost entirely by the FOMC statement: the detector flags exactly FVS-009 and FVS-012, the two frames both labelers identified, and misses only the FVS-014 label added by the LLM-judge (F1 = 0.800). The Altman essay fills the other A slot and produces F1 = 0.250: the detector catches 1 of the 7 detectable frames flagged by either labeler.

### 3.4 Per-document label tables

Side-by-side labels for the 11 detectable frames per document (the curator's FVS-017 entry for c04_ubi is not among the 11 frames in §3.2):

| Doc | Stratum | Curator (A) | LLM-judge (B) | Detector (C) | F1_u |
|-----|---------|-------------|---------------|--------------|------|
| a01_altman | A | 001,002,007,008,010,014 | 002,008,010,012,014 | 002 | 0.250 |
| a02_fomc | A | 009,012 | 009,012,014 | 009,012 | 0.800 |
| b01_nvidia | B | 008,009,012 | 002,008,009,010,012,014,016 | 001,011 | 0.000 |
| b02_automation | B | 010,011 | 002,009,010,011,014 | 009,010,011,012 | 0.667 |
| b03_social_media | B | 012,016 | 002,009,010,012,014,016 | 011 | 0.000 |
| b04_llm_support | B | 011,015 | 002,008,009,010,015 | 001,011 | 0.250 |
| b05_remote_work | B | 011 | 002,010,012 | 001,007,008,011 | 0.250 |
| b06_quantum_outlook | B | 012,014,016 | 002,009,010,012,014,016 | 001,015 | 0.000 |
| c01_semaglutide | C | 016 | 009,012 | 015 | 0.000 |
| c02_eu_ai_act | C | 009,016 | 009,011 | 010,011 | 0.400 |
| c03_quantum_wiki | C | 014,016 | 009,012,014 | 007,008 | 0.000 |
| c04_ubi | C | 014,016,017 | 010,012 | 007,008,011 | 0.000 |

Nine of twelve documents have F1 = 0.00 or 0.25 between detector and majority-union (six at 0.000, three at 0.250). One document (FOMC, 0.800) shows high agreement. Two documents (automation at 0.667, EU AI Act at 0.400) show moderate agreement.
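The F1_u column and the per-stratum means in §3.3 can be recomputed directly from the label cells in the table above. The following is a sketch under stated assumptions: the names `LABELS`, `parse`, and `f1` are illustrative, and labels outside the 11 detectable frames are dropped before scoring. It is not the logic of `04_compute_metrics.py`.

```python
from statistics import mean

DETECTABLE = {"001", "002", "007", "008", "009", "010",
              "011", "012", "014", "015", "016"}

# doc_id: (stratum, curator, LLM-judge, detector), transcribed from the table
LABELS = {
    "a01_altman": ("A", "001,002,007,008,010,014", "002,008,010,012,014", "002"),
    "a02_fomc": ("A", "009,012", "009,012,014", "009,012"),
    "b01_nvidia": ("B", "008,009,012", "002,008,009,010,012,014,016", "001,011"),
    "b02_automation": ("B", "010,011", "002,009,010,011,014", "009,010,011,012"),
    "b03_social_media": ("B", "012,016", "002,009,010,012,014,016", "011"),
    "b04_llm_support": ("B", "011,015", "002,008,009,010,015", "001,011"),
    "b05_remote_work": ("B", "011", "002,010,012", "001,007,008,011"),
    "b06_quantum_outlook": ("B", "012,014,016", "002,009,010,012,014,016", "001,015"),
    "c01_semaglutide": ("C", "016", "009,012", "015"),
    "c02_eu_ai_act": ("C", "009,016", "009,011", "010,011"),
    "c03_quantum_wiki": ("C", "014,016", "009,012,014", "007,008"),
    "c04_ubi": ("C", "014,016,017", "010,012", "007,008,011"),
}


def parse(cell):
    """Turn a comma-separated cell into a set, keeping detectable frames only."""
    return {frame for frame in cell.split(",") if frame in DETECTABLE}


def f1(predicted, reference):
    true_pos = len(predicted & reference)
    return 2 * true_pos / (len(predicted) + len(reference)) if true_pos else 0.0


# Detector (C) scored against the union of curator (A) and LLM-judge (B)
per_doc = {doc: f1(parse(c), parse(a) | parse(b))
           for doc, (_, a, b, c) in LABELS.items()}

per_stratum = {
    stratum: mean(score for doc, score in per_doc.items()
                  if LABELS[doc][0] == stratum)
    for stratum in ("A", "B", "C")
}
for stratum, value in per_stratum.items():
    print(stratum, round(value, 3))
# A 0.525
# B 0.194
# C 0.1
```

The stratum means match §3.3, which suggests the per-stratum figures are averages of per-document F1 rather than per-frame F1.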

---

## 4. What this reveals

### 4.1 The detector does not reliably measure what readers see

Both curator and LLM-judge, labeling independently with the FVS library reference in hand, produce labels that substantially disagree with the detector's output on 10 of 12 documents. The disagreement is not random noise. It follows patterns.

### 4.2 The disagreement is not just "curator construction bias"

A common defense against curator-vs-detector disagreement is that the curator built the library and knows it best; disagreement is a training-distribution mismatch. This defense is testable because we have the LLM-judge as a second labeler. **The LLM-judge agrees with the detector (κ_BC = -0.008) even LESS than the curator does (κ_AC = +0.031).** The detector disagrees with BOTH independent labelers roughly equally.

### 4.3 The FVS taxonomy has real inter-labeler ambiguity

Curator-vs-LLM-judge kappa (mean κ_AB over all 20 frames = +0.279) is in the "fair" range under the Landis and Koch (1977) convention (0.21-0.40), not "substantial" (0.61-0.80) or "almost perfect" (0.81-1.00). This means:

- The detector is the weakest element (κ with either independent labeler near zero).
- But the taxonomy itself has real ambiguity that even two careful labelers disagree on.
- The LLM-judge is notably more permissive than the curator: it flagged 78 of 240 possible (doc × frame) slots (32%) vs curator's 34 of 240 (14%) on the 20-frame universe. Permissiveness is a known LLM-as-judge bias and partially explains the κ gap.
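As a small aid for reading the kappa values, the sketch below maps a kappa to its Landis and Koch (1977) descriptive band and applies it to the three pairwise figures from §3.1. The function name is an assumption; the band boundaries are the published ones.

```python
def landis_koch_band(kappa):
    """Map a kappa value to the Landis and Koch (1977) descriptive label."""
    if kappa < 0.0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


pairwise = [("A vs B", 0.279), ("A vs C", 0.031), ("B vs C", -0.008)]
for name, kappa in pairwise:
    print(name, landis_koch_band(kappa))
# A vs B fair
# A vs C slight
# B vs C poor
```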

### 4.4 Specific capability gaps in the detector

- **FVS-014 Temporal Anchoring and FVS-016 Authority by Citation: zero detector fires across 12 documents.** Curator flagged these 4 and 6 times respectively; LLM-judge flagged these 7 and 3 times respectively. The rules either have thresholds that are too tight for typical analytical text, or effective rule coverage is weaker than the README implies. Either way, on this corpus the detector fails to surface these two frames at all.
- **FVS-001 Frame Amplification and FVS-007 Failure Framing over-fire.** Detector flagged FVS-001 in 4 documents, FVS-007 in 3. Curator flagged each in exactly 1. The rule's density threshold (8 markers per 1,000 words + 3 missing categories) appears to match documents that careful reading does not identify as amplification.
- **FVS-008 Growth Frame and FVS-015 Efficiency Frame: substantial within-frame disagreement.** κ_AC is negative for both. The detector fires on documents where labelers do not see the frame, and vice versa.

### 4.5 The one success case: FOMC

The FOMC March 2026 press release is 315 words, technical register, explicitly names "risks to both sides of its dual mandate" and "uncertainty about the economic outlook remains elevated." All three labelers flag FVS-009 and FVS-012. The LLM-judge adds FVS-014 (temporal anchoring), which detector and curator do not. F1 = 0.800 on this document.

**The detector works when the text is short, structurally explicit, and uses framing-relevant vocabulary directly in the sentences the frame names.** When the text requires inference about framing from narrative arc, selection, or emphasis, the detector does not keep up with labelers.

---

## 5. Hypothesis tests vs pre-registered

From DESIGN.md §2:

**H1 (primary).** Detector macro-F1 ≥ 0.6 on mixed-genre corpus of ~12 documents. **Outcome: NOT SUPPORTED.** Observed F1 = 0.157.

**H2 (secondary).** Detector agreement is approximately stable across document genres. **Outcome: NOT SUPPORTED.** Per-stratum F1 ranges from 0.525 (A, n=2) to 0.100 (C, n=4). Roughly five-fold variation across strata indicates genre-sensitivity the project has not previously characterized.

**H3 (falsification).** If F1 < 0.4, project pivot required. **Outcome: FIRED.** Observed F1 = 0.157 is well below threshold.
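The hypothesis checks reduce to comparisons against the pre-declared thresholds from §1. A minimal sketch follows; the function and variable names are illustrative assumptions, and the zone boundaries are copied from the DESIGN.md §6 table as reproduced in this report.

```python
def zone(macro_f1):
    """Return the pre-declared zone for a macro-F1 value (thresholds from §1)."""
    if macro_f1 >= 0.7:
        return "State-of-art"
    if macro_f1 >= 0.6:
        return "Useful"
    if macro_f1 >= 0.4:
        return "Partial"
    return "Falsification"


observed = 0.157                 # primary metric, §3.1
h1_supported = observed >= 0.6   # H1: macro-F1 at or above the "Useful" floor
h3_fired = observed < 0.4        # H3: falsification threshold

# H2: spread of per-stratum macro-F1 (values from §3.3)
stratum_f1 = {"A": 0.525, "B": 0.194, "C": 0.100}
spread = max(stratum_f1.values()) / min(stratum_f1.values())

print(zone(observed), h1_supported, h3_fired, round(spread, 2))
# Falsification False True 5.25
```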

---

## 6. What this does NOT establish

Honest limits are part of the study, not post-hoc softening:

1. **n=12 is small.** The confidence interval on macro-F1 at n=12 is wide. The effect observed is large enough that it is unlikely to be noise (F1 = 0.157 vs threshold 0.4 is a gap of about 0.24), but a larger study might shift the number.
2. **Curator is the library builder.** Curator's labels may reflect the taxonomy's ideal application rather than how independent readers would see the frames. Curator-vs-detector agreement is partially self-confirming; the key signal is LLM-judge-vs-detector agreement, which is near zero.
3. **LLM-judge is an LLM, not a human expert.** LLM-as-judge has known biases: position, length, self-preference, and in this case permissiveness. However, the LLM-judge disagreeing with the detector is still informative: two automated systems with different construction principles do not converge on the same labels.
4. **The study does not establish what the RIGHT labels are.** It establishes that the detector disagrees with both available labelers. A world in which the detector is right and both labelers are wrong is logically possible but requires argument the study does not provide.
5. **The study does not test reader experience.** Whether users of Frame Check find the detector's output helpful is a separate question; detection-vs-labeler-agreement is one dimension of usefulness, not the only one.
6. **No external corpus mapping.** The study does not compare Frame Check's labels to Media Frames Corpus, FrameAxis, or other external framing frameworks. Such comparison was deemed too taxonomy-bridge-dependent for v1 but remains valuable.
7. **The LLM-judge labels were produced with FVS library definitions in the prompt.** This means B is not entirely independent of the taxonomy the curator authored. The study's A-vs-B kappa is an upper bound on what a fully-independent human annotator trained from scratch might produce; independent human kappa could be lower, meaning the "taxonomy ambiguity" finding is conservative.
8. **Single session, single executor.** Replication by an independent party would strengthen these findings.
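To give a rough sense of how wide the interval in point 1 is, the sketch below runs a percentile bootstrap over the per-document F1_u values from §3.4. This is an illustration under assumptions: it resamples the document-level mean, which is not the frame-level macro-F1 used as the primary metric, and the resample count, seed, and function name are arbitrary choices. A proper interval for the primary metric would need the full label matrix from `labels/`.

```python
import random
from statistics import mean

# Per-document F1_u values, transcribed from the table in §3.4
per_doc_f1 = [0.250, 0.800, 0.000, 0.667, 0.000, 0.250,
              0.250, 0.000, 0.000, 0.400, 0.000, 0.000]


def bootstrap_ci(values, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean, resampling with replacement."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


point = mean(per_doc_f1)
low, high = bootstrap_ci(per_doc_f1)
print(f"mean per-document F1 = {point:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

With only 12 documents, a single high-scoring document such as the FOMC statement moves the resampled mean noticeably, which is the practical meaning of "wide" here.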

---

## 7. Implications

The results force the following implications:

### 7.1 The detector's role in Frame Check must be re-scoped

The current product framing ("structural framing analysis") implies that the detector's output is a meaningful read of a document's framing. On this preliminary evidence, that implication is an overclaim. Options:

- **Re-scope as vocabulary-providing tool.** Frame Check surfaces SIGNALS (coverage density, voice, temporal, epistemic) and the library provides NAMES for framing patterns; the user does the interpretive labeling. The detector's output moves from "this document exhibits FVS-008" to "this document has growth-vocabulary density above threshold; readers may want to check whether this is Growth Frame."
- **Narrow to cases where detection is reliable.** FVS-011 Stakeholder Frame F1 = 0.727 in this study; it may be reliable enough to keep as a primary suggestion. Other frames with F1 < 0.4 in this study should be de-emphasized or removed from default suggestion output pending detector revision.
- **Explicitly document the fragility.** METHODOLOGY.md §2.3 "Limitations" should be expanded to name this study's findings. The current "honest limits" on individual FVS entries do not reach the operational level of "the detection rule for this frame fires at XX% precision against multi-source labels."

### 7.2 The FVS-Eval specification is built on an uncertain foundation

SPEC.md §6.1 describes the scoring coupling honestly ("eval precision is capped by detector precision on model outputs specifically, which is not yet measured"). This study MEASURES that ceiling for model-output framing specifically. The ceiling is lower than SPEC.md §13 acceptable-threshold language implies. v0.2 inter-annotator validation (SPEC.md §8) moves from hypothetical to essential: without it, any FVS-Eval baseline result is noise on top of a detector whose validity is not yet established.

### 7.3 Canon promotion of FVS-001 is now more uncertain

The FVS-001_v1 dossier named W5 (detection rule edge cases) as one of seven weaknesses. This study empirically confirms W5 is a substantial weakness, not a hypothetical one. A reviewer reading FVS-001 after this study has legitimate grounds to push back on the detection-rule coverage and the N=1 mechanism evidence combined. The dossier's "Promote contingent on specific fixes" verdict becomes the most likely real-world outcome, and the specific fixes must include detector revision or detection-rule retirement for the frame.

### 7.4 The project's core strategic claim needs reassessment

STRATEGY.md §9 positions Frame Check as "Track B: free credibility layer" on the premise that the detection engine produces credible framing labels. On this preliminary evidence, the engine's labels are not credible in the sense of multi-source agreement. The credibility story for the next year has to shift from "our detector agrees with expert readers" to one of:

- "Our VOCABULARY is the asset; the detector is one implementation of the vocabulary, among several that could be built."
- "Our DISCIPLINE (construct honesty, worked examples, transparent reporting) is the asset; the detector is a measurement tool that produces interpretable signals."
- "Our COMBINED SYSTEM (detector + library + worked examples + MCP) is the asset even if any individual component is weaker than claimed."

The canon-play (library becomes canonical reference) is compatible with any of these. But the detector-as-measurement-tool has to either be empirically rescued (higher F1 through rule revision) or re-scoped (from "labels framing" to "surfaces signals").

---

## 8. What to do next

Concrete, ordered by leverage:

### 8.1 Highest-leverage: rule-level post-hoc audit (executable now)

For each of the 11 detectable frames, look at the specific documents where detector disagrees with both labelers. Identify whether the rule:

- Fires on vocabulary that is present but not framing-indicative (false positive).
- Has thresholds that miss real framing instances (false negative).
- Is structurally incapable of detecting the frame as the library defines it (capability gap, like FVS-014 and FVS-016 which did not fire at all).

Output: a concrete rule-revision plan per frame. Some frames may be deprecated (removed from suggest_frames), others may be tightened or loosened, and some may need new signal dimensions (beyond coverage/voice/temporal/epistemic).
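A first pass of this audit can be mechanised from the label files. The sketch below is illustrative rather than part of the study harness: it uses three rows from the §3.4 table for brevity, and the variable names are assumptions. It counts, per frame, documents where the detector fired but neither labeler flagged the frame, and documents where both labelers flagged it but the detector stayed silent.

```python
from collections import Counter

# Three rows from the §3.4 table: doc -> (curator, LLM-judge, detector)
SAMPLE = {
    "b01_nvidia": ({"008", "009", "012"},
                   {"002", "008", "009", "010", "012", "014", "016"},
                   {"001", "011"}),
    "b05_remote_work": ({"011"},
                        {"002", "010", "012"},
                        {"001", "007", "008", "011"}),
    "c03_quantum_wiki": ({"014", "016"},
                         {"009", "012", "014"},
                         {"007", "008"}),
}

false_positives = Counter()  # detector fired, neither labeler flagged
false_negatives = Counter()  # both labelers flagged, detector silent
for curator, judge, detector in SAMPLE.values():
    false_positives.update(detector - (curator | judge))
    false_negatives.update((curator & judge) - detector)

for frame in sorted(set(false_positives) | set(false_negatives)):
    print(f"FVS-{frame}: over-fires={false_positives[frame]}, "
          f"misses={false_negatives[frame]}")
```

Run over all 12 documents, a frame with misses but no detector fires at all would fall into the capability-gap category, while a frame with only over-fires points at a threshold that is too permissive.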

### 8.2 Second: expand the validation study

Independently: recruit 2-3 human annotators (leverages the REVIEWERS.md channel) to label a bigger corpus (30-50 documents) against the FVS taxonomy. Compute real inter-annotator reliability. Re-measure detector against the gold standard.

The current study is a signal; the expanded study produces a publishable reliability claim. This is the v0.2 protocol SPEC.md §8 specifies; this study shows it is not optional.

### 8.3 Third: update strategic documents honestly

- METHODOLOGY.md: add a §2.4 "Limits of detection precision" section with this study's numbers.
- STRATEGY.md §9: name the detector-vs-labeler gap and what the product's claim of value actually rests on.
- SPEC.md: update §6.1 coupling statement to reflect empirical F1 ceiling.
- FVS-001_v1 dossier: add a §W5-evidence note pointing to this study's specific findings.

### 8.4 Fourth: decide whether to ship the detector as-is or revise

The user-facing web app currently presents frame suggestions with the level of confidence the spec implies. Either:

- Keep shipping the detector output but add UI that makes the rule-level uncertainty visible.
- Pause detector output pending rule revision.
- Ship only the frames with F1 > 0.5 (FVS-011 in this study; others pending re-measurement).
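The third option amounts to a simple gate on measured per-frame F1. A minimal sketch, assuming the per-frame F1 vs majority-union values from §3.2 and a hypothetical `SHIP_THRESHOLD` constant (neither name exists in the current codebase as far as this report shows):

```python
# Per-frame F1 vs majority-union, copied from the table in §3.2
F1_UNION = {
    "FVS-001": 0.000, "FVS-002": 0.250, "FVS-007": 0.000, "FVS-008": 0.000,
    "FVS-009": 0.364, "FVS-010": 0.200, "FVS-011": 0.727, "FVS-012": 0.182,
    "FVS-014": 0.000, "FVS-015": 0.000, "FVS-016": 0.000,
}
SHIP_THRESHOLD = 0.5  # hypothetical cut-off, taken from the option above

shippable = sorted(f for f, score in F1_UNION.items() if score > SHIP_THRESHOLD)
held_back = sorted(f for f in F1_UNION if f not in shippable)

print("ship:", shippable)
print("hold:", len(held_back), "frames pending re-measurement")
# ship: ['FVS-011']
# hold: 10 frames pending re-measurement
```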

---

## 9. Honest meta-finding: the value of falsifiable pre-registration

The study's outcome is uncomfortable. The detector did not perform as expected. The pre-registered DESIGN.md made this finding possible: the thresholds were named before running, the corpus was frozen before labeling, the execution order prevented backwards contamination.

If the study had been run without pre-registration, the natural response would have been to adjust the threshold, select a different corpus subset, or find a methodological reason the detector "actually did fine." The pre-registration prevents that. The finding stands as documented.

This is the value of the construct-honesty discipline on the project itself. It is painful. It is also the reason Frame Check can continue to claim it applies its own standards to its own work.

---

## 10. Data availability

All study artifacts are in `fvs_eval/validation_study/`:

- `DESIGN.md`: pre-registration (v1).
- `corpus/`: 12 documents with provenance (`manifest.json`).
- `labels/labels_curator.json`: curator labels with reasoning per positive.
- `labels/labels_llm.json`: LLM-judge labels with reasoning.
- `labels/labels_detector.json`: detector labels with signal strings.
- `results/results.json`: computed metrics.
- `01_assemble_corpus.py`, `02_llm_judge_labels.py`, `03_detector_labels.py`, `04_compute_metrics.py`: executable scripts.

Anyone may re-run the harness with a modified corpus or modified detector and report updated numbers. The study is intended to be replicable.

---

## 11. One-sentence summary

The Frame Check detector, as implemented in `frame_library.suggest_frames`, produces framing labels on a 12-document mixed-genre corpus at macro-F1 = 0.157 against the majority-union of two independent labelers, well below the pre-registered falsification threshold of 0.4; the pre-registered H3 falsification fires and triggers the DESIGN.md-specified project-pivot response; the pivot shape is named in §7 and §8 but the choice among pivots is the curator's.

---

*v1. 2026-04-18. Authored by Lovro Lucic with AI assistance. Pre-registered study; results reported without post-hoc re-scoping per DESIGN.md §8. First external empirical test of Frame Check's detection-vs-reader-judgment claim; preliminary scale; replication encouraged.*