# Rule-Level Post-Hoc Audit of frame_library.suggest_frames

**Date:** 2026-04-18.
**Status:** v1 audit. Diagnosis phase complete; proposed rule revisions named; empirical v2 test pending.
**Scope:** the 11 text-side FVS frames implemented in `frame_library.py::suggest_frames`.
**Inputs:** the validation study's per-document detector signals (`labels/labels_detector.json::signals`) and per-labeler labels (`labels/labels_curator.json`, `labels/labels_llm.json`).
**Decision discipline:** each proposed revision is diagnosable back to a specific document in the corpus. No proposal is made without a named root cause.

---

## 1. Method

For each detectable frame, four things are established:

1. **The rule as currently written** (file:line reference in `frame_library.py`).
2. **Observed detector behavior on the study's 12 documents** (TP / FP / FN count relative to curator-labels and LLM-judge-labels).
3. **Root cause of disagreements** (threshold too tight/loose, missing signal dimension, voice miscategorization, regex gap).
4. **Proposed action**: one of KEEP / TIGHTEN / LOOSEN / NARROW / REWRITE / RETIRE, with specific parameter changes.

Evidence summary is copied from the validation study's `results/results.json`. Actions are evidence-based, not speculative.
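
The TP / FP / FN counts in step 2 reduce to set comparisons between the documents a rule fires on and the documents a labeler flagged. A minimal sketch, assuming each side is available as a set of document IDs (the function name `prf1` and the return layout are illustrative, not taken from the study's harness):

```python
def prf1(predicted: set[str], gold: set[str]) -> dict[str, float]:
    """Count TP/FP/FN and derive precision, recall, and F1 for one frame."""
    tp = len(predicted & gold)   # rule fired and labeler flagged
    fp = len(predicted - gold)   # rule fired, labeler did not flag
    fn = len(gold - predicted)   # labeler flagged, rule did not fire
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Convention assumed here: F1 is 0 when there is no true positive.
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# FVS-002 vs curator (section 2.2): detector and curator both flag only a01.
fvs002 = prf1({"a01"}, {"a01"})
# FVS-014 vs curator (section 2.9): detector fires on no document.
fvs014 = prf1(set(), {"a01", "b06", "c03", "c04"})
print(fvs002["f1"], fvs014["f1"], fvs014["fn"])
```

The same function applies to any reference set (curator, LLM-judge, or majority-union); only the `gold` argument changes.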

---

## 2. Per-frame audit

### 2.1 FVS-001 Frame Amplification

**Rule** (`frame_library.py:187-199`): fires when `len(missing) >= 3` AND some covered category has `density_per_1kw > 8`.

**Observed:**
- Detector fires on 4 docs: `b01_nvidia_investment`, `b04_llm_customer_support`, `b05_remote_work_productivity`, `b06_quantum_computing_outlook`.
- Curator flags on 1 doc: `a01_altman_intelligence_age`.
- LLM-judge flags on 0 docs.
- F1 vs majority-union: 0.000. Kappa with any labeler: near zero or negative.

**Root cause:** The canonical amplification case (`a01_altman`) has all covered densities at 2.7-3.5/kw (no category crosses 8/kw). Rule does not fire. Meanwhile the rule fires on balanced analytical texts where one category happens to be dense (stakeholders 26/kw in `b04` because "customer support" vocabulary lists many stakeholder nouns; trends 8.8/kw in `b06` because quantum-computing discussion uses trend-vocabulary). These are NOT amplification; they are balanced analyses with one vocabulary-heavy dimension.

**The structural error:** "Frame Amplification" as defined in FVS-001 means **sophisticated output within an unexamined frame, producing increasing confidence**. The rule's signals (high density + missing categories) do not track sophistication, frame lock, or confidence. They track vocabulary distribution. A category-mistake: the rule labels vocabulary-distribution patterns as amplification.

**Proposed action: RETIRE from suggest_frames.** FVS-001 cannot be detected from the current signal set. The frame remains valuable as a library entry for educational reading, but the automated suggestion is noise (F1 = 0). A future signal that could support FVS-001 detection would need to measure within-session sophistication growth or frame-repetition, neither of which is available from per-document coverage/voice/temporal/epistemic measurements.

### 2.2 FVS-002 Fluency-Quality Illusion

**Rule** (`frame_library.py:162-168`): fires when `voice_type == "promotional"` AND `sourced_pct < 30`.

**Observed:**
- Detector fires on 1 doc: `a01_altman_intelligence_age`.
- Curator flags on 1 doc: `a01`.
- LLM-judge flags on 7 docs (everything AI-generated + a01).
- F1 vs curator: 1.000 (perfect on 1/1). F1 vs LLM-judge: 0.250 (1 TP against the 7 LLM positives listed above; the LLM is much more permissive).

**Root cause:** No disagreement with curator. LLM-judge over-labels this frame, possibly because "fluent analytical prose with limited sources" is a default reading for any AI-generated analytical text. The detector's conservative rule matches careful human reading, not LLM permissiveness.

**Proposed action: KEEP UNCHANGED.** This is the single rule in the study that agrees perfectly with the human labeler at the cost of disagreeing with a permissive LLM judge. The permissiveness gap is an LLM-as-judge property, not a rule defect.

### 2.3 FVS-007 Failure Framing

**Rule** (`frame_library.py:202-209`): fires when `"risks" in missing` AND `"uncertainty" in missing` AND `unhedged_pct > 60`.

**Observed:**
- Detector fires on 3 docs: `b05`, `c03_quantum_supremacy`, `c04_ubi`.
- Curator flags on 1 doc: `a01` (Altman, where risks is actually covered via risk-vocabulary, so rule doesn't fire).
- LLM-judge flags on 0 docs.
- F1 vs majority-union: 0.000. All 3 detector fires are FPs (neither labeler sees FVS-007 on those docs).

**Root cause (two-sided):**
- *False negative on a01:* Altman's essay has risk-vocabulary present ("challenges," "disruptions"), so the coverage detector marks risks as covered. The rule's "risks in missing" condition is not met. Dismissive mention is not detected.
- *False positives on c03, c04, b05:* Wikipedia encyclopedic articles and even-handed advisory text lack risk/uncertainty coverage but are not engaging in "failure framing" in the FVS-007 sense (advocacy-without-falsification). They are descriptive genres, not advocacy.

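**Proposed action: NARROW with voice filter.** Add requirement: `voice_type in ("promotional", "advisory")`. This excludes Wikipedia encyclopedic (typically analytical) and balanced analytical text. It retains the promotional+no-risk pattern that actually matches FVS-007. Expected effect: removes up to 3 FPs without sacrificing any TPs (Altman is already an FN, not a TP, so voice-filtering a non-firing doc is neutral). The `c03` and `c04` removals follow from their encyclopedic voice. The `b05` removal holds only if the detector classifies `b05` as analytical: the root-cause note above describes it as advisory text, and an `advisory` classification would pass the new filter.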

If the narrowed rule still over-fires in v2 testing, FALL BACK TO RETIRE.
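
The narrowed rule can be sketched as a single predicate. This is a sketch under stated assumptions: the signal names (`missing`, `unhedged_pct`, `voice_type`) follow the rule text above, while the function name and its flat argument list are hypothetical and may not match how `suggest_frames` receives its signals.

```python
def failure_framing_v2(missing: set[str], unhedged_pct: float, voice_type: str) -> bool:
    """v1 conditions plus the proposed voice filter (hypothetical signature)."""
    return (
        "risks" in missing
        and "uncertainty" in missing
        and unhedged_pct > 60
        # New in v2: only advocacy-capable voices can trigger the frame.
        and voice_type in ("promotional", "advisory")
    )


# Illustrative inputs, not measured values from the corpus.
gaps = {"risks", "uncertainty"}
print(failure_framing_v2(gaps, 75.0, "analytical"))   # encyclopedic-style text
print(failure_framing_v2(gaps, 75.0, "promotional"))  # advocacy text
print(failure_framing_v2(gaps, 75.0, "advisory"))     # advisory text still passes
```

Note that a document classified as `advisory` still satisfies the filter, so the filter only removes fires on documents whose detected voice is outside the two listed values.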

### 2.4 FVS-008 Growth Frame

**Rule** (`frame_library.py:149-160`): fires when `("trends" in covered OR "causes" in covered)` AND `"risks" in missing` AND `voice != descriptive/insufficient`.

**Observed:**
- Detector fires on 3 docs: `b05_remote_work`, `c03_quantum_supremacy`, `c04_ubi`.
- Curator flags on 2 docs: `a01_altman`, `b01_nvidia_investment`.
- LLM-judge flags on 3 docs.
- F1 vs majority-union: 0.000. Detector fires on docs neither labeler flagged and misses the two docs the labelers agree on.

**Root cause (two-sided):**
- *False negatives on a01, b01:* Altman has risks covered (via risk-vocabulary-as-dismissal). b01 has risks density 13/kw (strongly covered). Rule requires risks in missing, so it fires on neither.
- *False positives on b05, c03, c04:* These have risks missing from coverage (low or zero risk markers) but are not growth-framed. Quantum supremacy Wikipedia is historical description; UBI Wikipedia is neutral encyclopedia; b05 is balanced advisory.

**Proposed action: RETIRE from suggest_frames.** Growth Frame is about advocacy organized around expansion vocabulary (prosperity, growth, scaling, winning). The current rule tracks "some trends/causes coverage without risks coverage," which is satisfied by many genres that are not growth-framed. A proper Growth Frame detector would need:
- A dedicated growth-vocabulary regex (prosperity, expand, scale, outperform, grow, accelerate, dominate).
- Voice filter to promotional or advisory.
- Density threshold above a meaningful level.

This is a new signal that does not exist in `framing.py` today. Retiring the current rule is honest; a v3 rule against a new signal is a future option.
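
As a rough illustration of what such a signal could look like, the sketch below computes a growth-vocabulary density per 1,000 words and combines it with the voice filter. Everything here is hypothetical: the names `GROWTH_RE`, `density_per_1kw`, and `growth_frame_candidate` do not exist in `framing.py`, the term list is just the seven words named above, and the density threshold is left as a required parameter because no value has been chosen.

```python
import re

# The seven terms named in the audit; prefix matching is deliberately crude
# ("grow" also matches "growth", but "scale" does not match "scaling").
GROWTH_TERMS = ("prosperity", "expand", "scale", "outperform",
                "grow", "accelerate", "dominate")
GROWTH_RE = re.compile(r"\b(?:" + "|".join(GROWTH_TERMS) + r")\w*", re.IGNORECASE)
WORD_RE = re.compile(r"\b\w+\b")


def density_per_1kw(text: str, pattern: re.Pattern) -> float:
    """Pattern matches per 1,000 words; 0.0 for empty text."""
    n_words = len(WORD_RE.findall(text))
    if n_words == 0:
        return 0.0
    return 1000 * len(pattern.findall(text)) / n_words


def growth_frame_candidate(text: str, voice_type: str, min_density: float) -> bool:
    """Voice filter plus growth-vocabulary density above a caller-chosen level."""
    return (voice_type in ("promotional", "advisory")
            and density_per_1kw(text, GROWTH_RE) > min_density)


sample = "We will grow and scale. Growth accelerates prosperity."
print(density_per_1kw(sample, GROWTH_RE))
```

A production version would likely need stemming or an explicit inflection list rather than prefix matching, and the threshold would have to be calibrated on labeled documents.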

### 2.5 FVS-009 Risk Frame (active)

**Rule** (`frame_library.py:225-242`): fires when `"risks" in covered` AND `risks_density > 5` AND `"uncertainty" in covered` AND `voice == "analytical"`.

**Observed:**
- Detector fires on 2 docs: `a02_fomc` (risks 12.2, uncertainty 6.1, analytical), `b02_automation` (risks 16.9, uncertainty 4.2, analytical).
- Curator flags on 2 docs: `a02`, `b01_nvidia` (per `labels/labels_curator.json`; `b04` and `b05` are not flagged).
- LLM-judge flags on 9 docs (permissive).
- F1 vs majority-union: 0.364.

**Root cause:** Rule's "uncertainty in covered" requirement is the bottleneck. Many documents carry Risk Frame analytically without explicit uncertainty vocabulary. b01 nvidia: risks 13/kw, uncertainty 2.2/kw (not enough for covered). Rule doesn't fire. Curator sees Risk Frame because risks is dominant and substantive.

**Proposed action: LOOSEN by dropping the uncertainty-coverage requirement.** New rule: fire when `"risks" in covered` AND `risks_density > 5` AND `voice == "analytical"`. Expected effect: adds `b01_nvidia` as a TP (+1). Does not add FPs unless documents with risks > 5/kw and analytical voice exist that are not Risk Frame (unlikely given these are high bars already).
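
The effect of dropping the uncertainty condition can be checked against the three documents whose signals are quoted above. In this sketch the function names are hypothetical, the `covered` sets list only the two categories relevant to the rule, and `b01`'s analytical voice is an assumption implied by the expected effect rather than a value stated in this audit.

```python
def risk_frame_v1(covered: set[str], risks_density: float, voice: str) -> bool:
    return ("risks" in covered and risks_density > 5
            and "uncertainty" in covered and voice == "analytical")


def risk_frame_v2(covered: set[str], risks_density: float, voice: str) -> bool:
    # Same as v1 without the uncertainty-coverage requirement.
    return "risks" in covered and risks_density > 5 and voice == "analytical"


# (covered categories, risks density per 1kw, voice)
DOCS = {
    "a02": ({"risks", "uncertainty"}, 12.2, "analytical"),
    "b02": ({"risks", "uncertainty"}, 16.9, "analytical"),
    "b01": ({"risks"}, 13.0, "analytical"),  # voice assumed, see lead-in
}

v1_fires = sorted(doc for doc, sig in DOCS.items() if risk_frame_v1(*sig))
v2_fires = sorted(doc for doc, sig in DOCS.items() if risk_frame_v2(*sig))
print(v1_fires, v2_fires)
```

The sketch covers only the three documents discussed here; whether v2 adds fires elsewhere in the corpus depends on signals not quoted in this section.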

### 2.6 FVS-010 Completeness Illusion

**Rule** (`frame_library.py:170-185`): fires when `coverage_count >= 4` AND `max_density / min_density > 3`.

**Observed:**
- Detector fires on 2 docs: `b02_automation` (5 covered, 5.5x skew), `c02_eu_ai_act` (4 covered, 9.1x skew).
- Curator flags on 2 docs: `a01_altman` (4 covered but 1.3x skew; rule doesn't fire), `b02`.
- LLM-judge flags on 8 docs.
- F1 vs majority-union: 0.200.

**Root cause:** Rule's skew threshold of 3x is roughly right on the structural pattern but misses `a01` (Altman covers 4 categories with near-uniform low density). The actual Completeness Illusion in Altman is not captured by density skew; it is captured by surface-coverage without depth. Rule's approach works when density-skew is extreme (b02, c02); misses when all densities are uniformly low.

**Proposed action: KEEP current rule for now**, add a secondary detection path (low-density uniform coverage) in a future iteration. The current rule has moderate precision and catches one genuine FVS-010 pattern; the other pattern (uniform-low-density) requires more work than a single-parameter adjustment.

### 2.7 FVS-011 Stakeholder Frame (active)

**Rule** (`frame_library.py:244-261`): fires when `"stakeholders" in covered` AND `stakeholder_density > 5` AND `voice in ("analytical", "advisory")`.

**Observed:**
- Detector fires on 7 docs.
- Curator flags on 3 docs: `b02`, `b04`, `b05`.
- LLM-judge flags on 2 docs: `b02`, `c02`.
- F1 vs majority-union: 0.727 (highest in study).

**Root cause:** Rule's signal is stakeholder-vocabulary density, which correlates with (but does not equal) actual Stakeholder Frame analysis. The rule fires on documents that happen to list many stakeholder nouns without performing stakeholder-frame analysis (b03 social media: adolescents as topic, mentioned repeatedly; c02 EU AI Act: providers/deployers/users as regulatory categories; c04 UBI: citizens as recipients).

The rule's own code comment already acknowledges this limit: "the detector cannot distinguish specific impact analysis from vacuous 'stakeholders include everyone' phrasing." The study confirms the limit empirically.

**Proposed action: KEEP with TIGHTENED density threshold.** Raise threshold from 5 to 10/kw. Expected effect: removes the FP on `b01` (stakeholder density 6.5); retains the TPs on `b02` (23.2) and `b04` (26.0); loses the TP on `b05` (8.8, the closest case to the new threshold but still below it). Alternative: keep the 5/kw threshold AND add a "multiple stakeholder groups in same sentence" secondary check to distinguish analysis from listing.

Recommend: TIGHTEN to 10/kw density as v2 candidate. Measure F1 change; if TP rate drops meaningfully, revert to 5/kw but add secondary disambiguator.
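
The trade-off can be made concrete with the four stakeholder densities quoted in this section. A minimal sketch, assuming the rule compares with a strict `>` as in the rule text; the remaining three firing documents are omitted because their densities are not reported here, and the helper name `fires_at` is illustrative.

```python
# Stakeholder density per 1kw, as quoted in section 2.7.
STAKEHOLDER_DENSITY = {"b01": 6.5, "b02": 23.2, "b04": 26.0, "b05": 8.8}
CURATOR_POSITIVE = {"b02", "b04", "b05"}


def fires_at(threshold: float) -> set[str]:
    """Documents whose stakeholder density exceeds the threshold."""
    return {doc for doc, d in STAKEHOLDER_DENSITY.items() if d > threshold}


for threshold in (5, 10):
    fired = fires_at(threshold)
    print(threshold,
          "fires:", sorted(fired),
          "FPs vs curator:", sorted(fired - CURATOR_POSITIVE),
          "lost TPs:", sorted(CURATOR_POSITIVE - fired))
```

Any threshold between 6.5 and 8.8 would drop `b01` while keeping `b05` on these four documents, which may be worth testing alongside 10/kw.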

### 2.8 FVS-012 Uncertainty Frame (active)

**Rule** (`frame_library.py:263-275`): fires when `"uncertainty" in covered` AND `uncertainty_density > 3`.

**Observed:**
- Detector fires on 2 docs: `a02_fomc` (uncertainty 6.1), `b02_automation` (4.2).
- Curator flags on 4 docs: `a02`, `b01_nvidia`, `b03_social_media`, `b06_quantum`.
- LLM-judge flags on 9 docs.
- F1 vs majority-union: 0.182.

**Root cause:** Two issues.
- Threshold 3/kw is too high. `b01` (2.2/kw), `b03` (2.2/kw) are reasonable Uncertainty Frame documents but the threshold excludes them.
- Uncertainty regex misses hedge constructions. `b06_quantum`: uncertainty density 0/kw but the document is full of uncertainty expressions ("years or decades away," "theoretically promising but practically unproven," "undemonstrated"). The regex for uncertainty (per `framing.py:ANALYTICAL_CATEGORIES["uncertainty"]`) catches specific words like "unclear," "uncertain," "debated." It misses phrasings that convey uncertainty without those specific lexemes.

**Proposed action: DUAL FIX.**
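- LOOSEN density threshold from 3 to 2/kw. The coverage cutoff has to move with it: §2.5 reports `b01`'s uncertainty (2.2/kw) as not enough for covered, so the `"uncertainty" in covered` condition would still block `b01` and `b03` unless it is relaxed alongside the density threshold.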
- EXPAND uncertainty regex to include: "undemonstrated," "unproven," "theoretically promising," "years or decades," "no credible signs," "not yet clear," "remains open," "far from settled."

v2 testing: re-run detector with both changes; measure improvement.
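
The regex expansion could be prototyped separately before it is merged into `ANALYTICAL_CATEGORIES["uncertainty"]`. The sketch below builds a pattern from the eight phrases listed above; the names `HEDGE_PHRASES`, `HEDGE_RE`, and `hedge_hits` are hypothetical, and the sample sentence is composed from the `b06` expressions quoted in the root cause rather than copied from the document.

```python
import re

HEDGE_PHRASES = [
    "undemonstrated", "unproven", "theoretically promising",
    "years or decades", "no credible signs", "not yet clear",
    "remains open", "far from settled",
]


def _phrase_to_regex(phrase: str) -> str:
    # Allow any run of whitespace between the words of a phrase.
    return r"\s+".join(re.escape(word) for word in phrase.split())


HEDGE_RE = re.compile(
    r"\b(?:" + "|".join(_phrase_to_regex(p) for p in HEDGE_PHRASES) + r")\b",
    re.IGNORECASE,
)


def hedge_hits(text: str) -> list[str]:
    """Return every hedge-construction match, in order of appearance."""
    return HEDGE_RE.findall(text)


sample = ("Fault-tolerant machines are years or decades away; the approach is "
          "theoretically promising but practically unproven.")
print(hedge_hits(sample))
```

A fixed phrase list like this is brittle against paraphrase, so the measured gain on `b06` should be read as a lower bound on what hedge detection could recover.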

### 2.9 FVS-014 Temporal Anchoring

**Rule** (`frame_library.py:277-293`): fires when `past_pct >= 70` OR `future_pct >= 60`.

**Observed:**
- Detector fires on 0 docs.
- Curator flags on 4 docs: `a01` (future 40%), `b06` (past 18%), `c03` (past 38%), `c04` (past 30%).
- LLM-judge flags on 7 docs.
- F1 vs majority-union: 0.000.

**Root cause:** Thresholds 70% past and 60% future are nearly unreachable in realistic prose. The highest past_pct in the corpus is 38% (c03 quantum supremacy history). The highest future_pct is 40% (a01 Altman). Rule is structurally inert on this corpus.

**Proposed action: DRAMATICALLY LOWER thresholds.** Candidate settings, applied to both `past_pct` and `future_pct`: at 40%, only `a01` fires (future 40), while `c03` (past 38) and `c04` (past 30) stay below; at 35%, `a01` and `c03` fire; at 30%, `a01`, `c03`, and `c04` fire. `b06` (past 18) is not caught on its past share at any of these settings.

Recommend v2: 35% past OR 35% future. Test against corpus; if macro-F1 improves without obvious FPs, keep. If over-fires, raise back to 40%.
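
The candidate thresholds can be compared directly on the four curator-flagged documents. This sketch uses only the one percentage reported per document in this section (the other direction's share is not quoted here and is not modeled), and the names `REPORTED`, `fires`, and `sweep` are illustrative.

```python
# Reported temporal share per curator-flagged doc: (direction, percent).
REPORTED = {
    "a01": ("future", 40),
    "b06": ("past", 18),
    "c03": ("past", 38),
    "c04": ("past", 30),
}


def fires(direction: str, pct: float, past_min: float, future_min: float) -> bool:
    """Apply the threshold that matches the reported direction."""
    return pct >= (past_min if direction == "past" else future_min)


def sweep(past_min: float, future_min: float) -> list[str]:
    return sorted(doc for doc, (direction, pct) in REPORTED.items()
                  if fires(direction, pct, past_min, future_min))


print("v1 70/60:", sweep(70, 60))
for t in (40, 35, 30):
    print(f"{t}/{t}:", sweep(t, t))
```

The sweep says nothing about false positives, since the temporal shares of the eight documents the curator did not flag are not listed in this section.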

### 2.10 FVS-015 Efficiency Frame

**Rule** (`frame_library.py:211-223`): fires when `("trends" OR "causes") in covered` AND `"stakeholders" in missing` AND `"uncertainty" in missing` AND `voice != descriptive/insufficient` AND `NOT has_growth_signal`.

**Observed:**
- Detector fires on 2 docs: `b06_quantum_computing_outlook`, `c01_semaglutide`.
- Curator flags on 1 doc: `b04_llm_customer_support` (the canonical efficiency-frame case).
- LLM-judge flags on 1 doc: `b04`.
- F1 vs majority-union: 0.000. All detector fires are FPs; the only labeler-consensus case is missed.

**Root cause:** Rule's signal (trends/causes covered, stakeholders & uncertainty missing) does not track efficiency-frame semantics. `b04_llm_customer_support` has stakeholders density 26/kw (not missing), so rule doesn't fire on the canonical case. Meanwhile `b06` (quantum research) and `c01` (drug description) happen to have stakeholders and uncertainty missing for genre reasons unrelated to efficiency framing.

**Proposed action: RETIRE from suggest_frames.** Like FVS-008 Growth Frame, FVS-015 Efficiency Frame requires a dedicated vocabulary signal (efficient, optimize, cost, reduce, streamline, automate, scale). That signal does not exist in `framing.py`. The current rule's coverage-based heuristic does not detect the frame; it detects genre properties that correlate with missing-stakeholders. Retiring is honest; a proper efficiency-vocabulary detector is a future addition.

### 2.11 FVS-016 Authority by Citation

**Rule** (`frame_library.py:304-309`): fires when `sourced_pct >= 50`.

**Observed:**
- Detector fires on 0 docs (max sourced_pct in corpus is 26%).
- Curator flags on 6 docs: `b03_social_media` (researchers cited), `b06_quantum` (companies/institutions), `c01_semaglutide` (FDA/EU/WHO), `c02_eu_ai_act` (European Commission/Parliament), `c03_quantum_wiki` (Turing, Feynman et al.), `c04_ubi` (Thomas More, Vives, Caesar).
- LLM-judge flags on 3 docs: `b03`, `b06`, and one of the Wikipedia documents (not `c03` or `c04`; the exact ID is to be confirmed against `labels/labels_llm.json`).
- F1 vs majority-union: 0.000.

**Root cause:** Rule threshold `sourced_pct >= 50` is nearly unreachable. The `sourced_pct` metric (per `framing.py:detect_epistemic_basis`) measures the fraction of numeric-claim sentences that carry source attribution. Most analytical text has few numeric claims (b04, b05, b02 have 0%); most narrative text is at 0-26%. The 50% threshold would only fire on dense-citation academic papers.

Meanwhile "Authority by Citation" is about **citation FORM creating evidence impression**. This is fundamentally different from "sourced numeric claims." A document citing Jonathan Haidt's work is NOT making a numeric claim with source attribution; it is invoking an author's authority. The signal the rule uses does not match the frame's definition.

**Proposed action: REWRITE.** Two paths:
- Short-term: LOWER threshold to 20% as an interim proxy. Captures some cases (b03, c03 at 24-26% sourced_pct).
- Proper fix: ADD a named-author-citation signal that detects "Name (Year)" patterns, "according to Name," or proper-noun-author patterns in analytical text. This is a new signal in `framing.py`. Without it, the rule can't detect what FVS-016 is really about.

Recommend: INTERIM lower threshold to 20% as v2 candidate; proper fix in a future iteration with a new signal.
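
For the proper fix, a first approximation of the named-author signal might look like the sketch below. It is a hypothetical prototype, not an existing part of `framing.py`: the pattern names and `count_named_citations` are invented for illustration, the name pattern only handles simple capitalized Latin-script names, and "Smith (2020)" in the sample is a placeholder rather than a real reference.

```python
import re

# One or more capitalized words, e.g. "Haidt" or "Jonathan Haidt".
NAME = r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*"

# "Name (Year)" and "Name et al. (Year)".
AUTHOR_YEAR_RE = re.compile(rf"\b{NAME}(?:\s+et\s+al\.)?\s+\((?:19|20)\d{{2}}\)")
# "according to Name" / "According to Name".
ACCORDING_TO_RE = re.compile(rf"\b[Aa]ccording\s+to\s+{NAME}")


def count_named_citations(text: str) -> int:
    """Count author-year and 'according to' attributions in the text."""
    return len(AUTHOR_YEAR_RE.findall(text)) + len(ACCORDING_TO_RE.findall(text))


sample = "According to Jonathan Haidt, the trend is clear. Smith (2020) disagrees."
print(count_named_citations(sample))
```

Institutional authorities such as agencies or parliaments would need a separate list or a named-entity step, since they do not follow the author-year form.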

---

## 3. Summary table

| Frame | Current F1 | Proposed action | Expected v2 effect |
|-------|-----------:|-----------------|--------------------|
| FVS-001 Frame Amplification | 0.000 | **RETIRE** | removes 4 FPs; does not cost any TPs (current TP count = 0) |
| FVS-002 Fluency-Quality Illusion | 0.250 | **KEEP** | no change |
| FVS-007 Failure Framing | 0.000 | **NARROW** (voice: promotional or advisory) | removes up to 3 FPs (b05 only if its detected voice is analytical); does not cost TPs |
| FVS-008 Growth Frame | 0.000 | **RETIRE** | removes 3 FPs; does not cost TPs |
| FVS-009 Risk Frame | 0.364 | **LOOSEN** (drop uncertainty-covered requirement) | +1 TP (b01); no expected FPs |
| FVS-010 Completeness Illusion | 0.200 | **KEEP** | no change |
| FVS-011 Stakeholder Frame | 0.727 | **TIGHTEN** (density threshold 5 -> 10) | likely -1 FP on b01; may cost b05 TP at boundary |
| FVS-012 Uncertainty Frame | 0.182 | **LOOSEN + EXPAND REGEX** (threshold 3 -> 2; add hedge constructions) | +2 TPs likely (b01, b03); possible +1 TP (b06) if regex catches quantum-hedges |
| FVS-014 Temporal Anchoring | 0.000 | **LOWER THRESHOLD** (70/60 -> 35/35) | +1 to +3 TPs; risk of over-firing on narrative-past text |
| FVS-015 Efficiency Frame | 0.000 | **RETIRE** | removes 2 FPs; does not cost TPs |
| FVS-016 Authority by Citation | 0.000 | **LOWER THRESHOLD** (50 -> 20, interim) | possible TPs on b03, c03; does not cost existing TPs (there are none) |

Three rules RETIRE (FVS-001, FVS-008, FVS-015). Three KEEP (FVS-002, FVS-010 w/ future expansion, FVS-011 tightened). Five adjusted (FVS-007, FVS-009, FVS-012, FVS-014, FVS-016).

**Projected aggregate effect if all v2 changes are applied:**
- FPs decrease by ~12 across FVS-001, FVS-007, FVS-008, FVS-015.
- TPs increase by ~5-7 across FVS-009, FVS-012, FVS-014, FVS-016.
- Net: should move macro-F1 meaningfully above 0.157. Whether it crosses 0.4 (the H3 threshold from DESIGN) depends on how the expansion regex and threshold changes interact. v2 empirical test will settle this.
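
The 0.157 baseline can be reproduced from the "Current F1" column of the table, assuming macro-F1 is the unweighted mean over the 11 frames (an assumption about the study's definition that the arithmetic is consistent with):

```python
from statistics import mean

# "Current F1" column of the section 3 summary table.
CURRENT_F1 = {
    "FVS-001": 0.000, "FVS-002": 0.250, "FVS-007": 0.000, "FVS-008": 0.000,
    "FVS-009": 0.364, "FVS-010": 0.200, "FVS-011": 0.727, "FVS-012": 0.182,
    "FVS-014": 0.000, "FVS-015": 0.000, "FVS-016": 0.000,
}

macro_f1 = mean(CURRENT_F1.values())
print(round(macro_f1, 3))
```

Because six of the eleven frames sit at 0.000, retiring a frame changes macro-F1 only if retired frames are also dropped from the average; the v2 report should state which convention it uses.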

---

## 4. What this audit does and does NOT claim

Does:
- Provide frame-by-frame root cause for the detector's low F1 in the validation study.
- Propose a specific, parameter-level revision for each frame (not hand-waving).
- Name the frames where the current rule is structurally unable to detect what the FVS entry describes (FVS-001, FVS-008, FVS-015, FVS-014 at its current thresholds, FVS-016 with the wrong signal).
- Distinguish "rule needs tuning" (FVS-007, FVS-009, FVS-011, FVS-012, FVS-014, FVS-016 lower-threshold path) from "signal inadequate for frame" (FVS-001, FVS-008, FVS-015).

Does NOT:
- Claim v2 will reach state-of-the-art F1 (the signal substrate has hard limits).
- Commit to retirement as irreversible (frames can be re-promoted to the rule set when new signals are added).
- Replace the need for an expanded multi-annotator study on larger n. v2's test on the same 12-doc corpus is a preliminary check, not a publishable reliability claim.
- Address the FVS taxonomy's own inter-rater ambiguity (curator-vs-LLM-judge kappa = +0.279). That ambiguity is separate from rule precision and requires reviewer engagement to address.

---

## 5. Implementation plan

The audit produces three subsequent deliverables (each scoped narrowly):

**Deliverable A: `frame_library_v2.suggest_frames`.** A candidate revised rule set implementing the actions in §3. Lives in a new file or as a named variant so v1 behavior is preserved for comparison. Roughly 150 lines of Python.

**Deliverable B: `05_rerun_v2_detector.py`.** Harness script that runs v2 detector on the study corpus, produces `labels/labels_detector_v2.json`, and computes v1-vs-v2 metrics comparison. Roughly 100 lines.
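
The comparison step of such a harness might be as small as the sketch below. It assumes, hypothetically, that each label file can be reduced to a mapping from frame ID to a list of document IDs; the actual JSON schema of the `labels/` files is not specified in this audit, and the function names are illustrative. The example input is an illustration built from the §2.9 curator labels and a 35/35 threshold, not a measured v2 result.

```python
def f1(predicted: set[str], gold: set[str]) -> float:
    """F1 = 2TP / (|predicted| + |gold|); 0.0 when both sets are empty."""
    denom = len(predicted) + len(gold)
    return 2 * len(predicted & gold) / denom if denom else 0.0


def compare(v1: dict[str, list[str]],
            v2: dict[str, list[str]],
            gold: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Per-frame F1 for v1 and v2 against the same gold labels, plus the delta."""
    rows = {}
    for frame in sorted(set(v1) | set(v2) | set(gold)):
        g = set(gold.get(frame, ()))
        before = f1(set(v1.get(frame, ())), g)
        after = f1(set(v2.get(frame, ())), g)
        rows[frame] = {"v1": before, "v2": after, "delta": after - before}
    return rows


# Illustrative only: curator labels as gold, hypothetical v2 fires.
rows = compare(
    v1={"FVS-014": []},
    v2={"FVS-014": ["a01", "c03"]},
    gold={"FVS-014": ["a01", "b06", "c03", "c04"]},
)
print(rows["FVS-014"])
```

Keeping v1 and v2 outputs in the same shape makes the per-frame delta table for Deliverable C a direct by-product of this step.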

**Deliverable C: update to REPORT.md or a new `REPORT_V2.md`.** Empirical comparison of v1 and v2 macro-F1 + per-frame changes. Honest about whether v2 crosses the H3 threshold and what it tells us about the detector's achievable ceiling given current signals.

Deliverables A, B, C ship after this audit document is finalized. No additional strategic commitments pending until empirical v2 numbers are in.

---

## 6. Honest limits of this audit

1. **Audit is curator-only.** The diagnoses are my interpretation of why the detector disagrees with labelers. An independent reviewer might disagree with the root causes or the proposed actions. No external sign-off yet.
2. **n=12 is small for rule-level precision estimates.** A rule might appear to need retirement on this corpus but work well on a larger corpus. The expanded study (REPORT §8.2) should precede any permanent rule removals.
3. **Some proposed changes are interacting.** Lowering FVS-012 uncertainty threshold AND expanding its regex may have compound effects not predictable from either change alone. v2 test measures the combined effect, not isolated contributions.
4. **Retirement recommendations assume the current signal set is fixed.** If a new vocabulary-density signal were added to `framing.py`, FVS-001/008/015 retirement could be revisited. This audit operates within the current signal substrate.
5. **The rule changes do not address the underlying FVS taxonomy ambiguity.** Even perfect detector-vs-label alignment would still inherit the +0.279 inter-labeler kappa. Rule tuning has a ceiling; taxonomy sharpening and reviewer engagement are separate moves.

---

*v1. 2026-04-18. Authored by Lovro Lucic with AI assistance. Diagnosis phase of rule-level post-hoc audit following validation study REPORT.md. Empirical v2 test to follow in Deliverables A-C.*
