# Validation Study REPORT v2: Post-Audit Detector Rule Revisions

**Date:** 2026-04-18.
**Relationship to REPORT.md:** REPORT.md documents the v1 validation study (canonical, pre-registered). This document reports the v2 follow-on: `frame_library_v2.suggest_frames_v2` applied to the same 12-document corpus, with the same curator + LLM-judge labels, testing whether the audit's proposed rule revisions (RULE_AUDIT.md) actually improve detector F1.
**v1 study is NOT overturned by v2.** v1 remains the canonical pre-registered result. v2 is a post-hoc engineering check to see whether the diagnosed gaps are fixable at the rule level.

---

## 1. Headline

| Metric | v1 (canonical) | v2 (post-audit) | Delta |
|--------|---------------:|----------------:|------:|
| **Macro-F1 vs majority-union** | **0.157** | **0.274** | **+0.118** |
| Stratum A (human-authored) | 0.525 | 0.622 | +0.097 |
| Stratum B (fresh AI-generated) | 0.195 | 0.310 | +0.116 |
| Stratum C (Wikipedia encyclopedic) | 0.100 | 0.434 | +0.335 |

**v2 delivered a 75% relative improvement in macro-F1 from rule-level revisions alone, but it remains in the pre-registered H3 falsification zone (< 0.4).** The proposed rule revisions work as predicted on specific frames, but the gap to "useful" (0.6) or state-of-the-art (0.7) is too wide to close with rule-level tuning on the current signal substrate. Closing it requires new signals in `framing.py` (growth-vocabulary, efficiency-vocabulary, named-author-citation, expanded uncertainty regex), not further rule adjustment.
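The headline arithmetic can be checked with a short script. This is a minimal sketch that uses only the rounded figures printed in the table; the constant and function names are illustrative. Recomputed from rounded inputs, some deltas come out 0.001 lower than printed, presumably because the table's deltas were taken from unrounded scores (an assumption).

```python
# Headline figures as printed in the table (rounded to three decimals).
HEADLINE = {
    "macro-F1": (0.157, 0.274),
    "stratum A": (0.525, 0.622),
    "stratum B": (0.195, 0.310),
    "stratum C": (0.100, 0.434),
}
H3_THRESHOLD = 0.4      # pre-registered falsification threshold
USEFUL_THRESHOLD = 0.6  # "useful" threshold named in the text


def relative_improvement(v1: float, v2: float) -> float:
    """Change from v1 to v2 expressed as a fraction of the v1 value."""
    return (v2 - v1) / v1


for name, (v1, v2) in HEADLINE.items():
    print(f"{name}: delta={v2 - v1:+.3f}, relative={relative_improvement(v1, v2):+.1%}")

macro_v2 = HEADLINE["macro-F1"][1]
print(f"gap to H3 threshold: {H3_THRESHOLD - macro_v2:.3f}")
print(f"gap to 'useful':     {USEFUL_THRESHOLD - macro_v2:.3f}")
```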

---

## 2. Per-frame results

| Frame | Action (RULE_AUDIT §3) | nC1→nC2 | F1 v1→v2 | ΔF1 | Notes |
|-------|-----------------------|--------:|---------:|----:|-------|
| FVS-001 Frame Amplification | RETIRED | 4→0 | 0.000→0.000 | 0.000 | Removes 4 FPs; F1 unchanged because curator TP count was 0 on detectable FVS-001 anyway. |
| FVS-002 Fluency-Quality Illusion | KEEP | 1→1 | 0.250→0.250 | 0.000 | Unchanged. |
| FVS-007 Failure Framing | NARROW voice filter | 3→1 | 0.000→0.000 | 0.000 | Removes 2 FPs; one FP remains on `c03_quantum_supremacy` where voice was classified "advisory" (unexpectedly). |
| FVS-008 Growth Frame | RETIRED | 3→0 | 0.000→0.000 | 0.000 | Removes 3 FPs; F1 unchanged because TP count was 0. |
| FVS-009 Risk Frame | LOOSEN (drop uncertainty req) | 2→7 | 0.364→**0.875** | **+0.511** | Largest single gain. Adds true positives on b01, b03, b04, c01, c02. |
| FVS-010 Completeness Illusion | KEEP | 2→2 | 0.200→0.200 | 0.000 | Unchanged. |
| FVS-011 Stakeholder Frame | TIGHTEN density 5→10 | 7→5 | 0.727→0.667 | -0.061 | Lost b05 TP (density 8.8 below 10); removed b01 FP. Trade lost slightly more than it gained at this n. |
| FVS-012 Uncertainty Frame | LOOSEN threshold 3→2 | 2→2 | 0.182→0.182 | 0.000 | **No change.** The binding constraint was the upstream "uncertainty in covered" gate (requires minimum marker count), not the density threshold. Rule-level loosening did not propagate. Regex expansion (deferred) is required. |
| FVS-014 Temporal Anchoring | LOWER 70/60→35/35 | 0→2 | 0.000→**0.400** | **+0.400** | Fires on a01 (future 40) and c03 (past 38). Both are curator TPs. |
| FVS-015 Efficiency Frame | RETIRED | 2→0 | 0.000→0.000 | 0.000 | Removes 2 FPs; F1 unchanged because TP count was 0. |
| FVS-016 Authority by Citation | LOWER threshold 50→20 | 0→2 | 0.000→**0.444** | **+0.444** | Fires on b03 (26%) and c03 (24%). Both are curator TPs. |

**Wins:** FVS-009 (+0.511), FVS-016 (+0.444), FVS-014 (+0.400). Three rules moved from broken to usable.
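Why the FVS-014 and FVS-016 threshold changes flipped these documents from silent to firing can be illustrated with the signal values quoted in the table. This is a sketch under two assumptions: a rule fires when its signal meets or exceeds the threshold, and for FVS-014 the more lenient of the two v1 thresholds (60) is used, so the result does not depend on which tense "70/60" assigns to which number.

```python
# (frame, document, reported signal value in %, lenient v1 threshold, v2 threshold)
OBSERVATIONS = [
    ("FVS-014", "a01", 40.0, 60.0, 35.0),  # future share
    ("FVS-014", "c03", 38.0, 60.0, 35.0),  # past share
    ("FVS-016", "b03", 26.0, 50.0, 20.0),  # sourced_pct
    ("FVS-016", "c03", 24.0, 50.0, 20.0),  # sourced_pct
]


def fires(value: float, threshold: float) -> bool:
    """Assumed comparison: the rule fires when the signal reaches the threshold."""
    return value >= threshold


# Documents that were silent under v1 and fire under v2.
flipped = [
    (frame, doc)
    for frame, doc, value, v1_threshold, v2_threshold in OBSERVATIONS
    if not fires(value, v1_threshold) and fires(value, v2_threshold)
]

for frame, doc in flipped:
    print(f"{frame} newly fires on {doc}")
```

All four values sit between the old and new thresholds, which is why each rule went from zero firings to two.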

**Non-events:** The FVS-001, FVS-008, and FVS-015 retirements cleaned up 9 FPs without costing any TPs (curator TP counts were 0 on detectable cases). FVS-002 and FVS-010 were kept unchanged; the FVS-007 narrowing removed 2 of 3 FPs.

**Non-delivering change:** FVS-012 loosening did nothing. Diagnosed: the rule requires `"uncertainty" in covered`, where `covered` is set upstream by a minimum-marker count in `detect_coverage`. Changing the density threshold from 3 to 2 has no effect if the document was already excluded at the coverage-classification step. Proper fix requires expanding the uncertainty regex in `framing.py` to catch hedge constructions ("undemonstrated," "theoretically promising but practically unproven," "years or decades away"), deferred in the audit.
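The gating behavior described above can be sketched as two checks in sequence. The functions below are stand-ins, not the code in `frame_library_v2.py` or `framing.py`; the minimum marker count, the per-1,000-word density unit, and the example document are all hypothetical.

```python
from __future__ import annotations

MIN_MARKERS = 2  # hypothetical; the real minimum is not stated in this report


def detect_coverage_sketch(marker_counts: dict[str, int], min_markers: int) -> set[str]:
    """Stand-in for detect_coverage: a category is covered once it has enough markers."""
    return {cat for cat, count in marker_counts.items() if count >= min_markers}


def density_per_kw(count: int, n_words: int) -> float:
    """Markers per 1,000 words (assumed unit)."""
    return 1000 * count / n_words


def fvs012_fires(covered: set[str], density: float, density_threshold: float) -> bool:
    """Stand-in for the FVS-012 rule: coverage gate first, density threshold second."""
    if "uncertainty" not in covered:
        return False  # excluded upstream; the density threshold is never consulted
    return density >= density_threshold


N_WORDS = 400  # hypothetical document length

# The current regex finds one uncertainty marker: below the coverage minimum.
covered_now = detect_coverage_sketch({"uncertainty": 1, "risk": 4}, MIN_MARKERS)
density_now = density_per_kw(1, N_WORDS)
before = fvs012_fires(covered_now, density_now, density_threshold=3)  # v1 rule
after = fvs012_fires(covered_now, density_now, density_threshold=2)   # v2 rule

# Same document if an expanded regex caught three markers.
covered_expanded = detect_coverage_sketch({"uncertainty": 3, "risk": 4}, MIN_MARKERS)
with_regex_fix = fvs012_fires(covered_expanded, density_per_kw(3, N_WORDS), density_threshold=2)

print(before, after, with_regex_fix)
```

In this toy case the document's density (2.5) would clear the loosened threshold, but the rule never gets that far because the coverage gate has already excluded it.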

**Minor cost:** FVS-011 tightening lost one TP (b05, stakeholder density 8.8) while removing one FP (b01, density 6.5). Net -0.061 F1. At n=12 the loss-of-1-TP is noisy; at larger n the trade-off might net positive. Could be reverted.
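Averaging the eleven per-frame F1 values from the table reproduces the headline macro-F1 figures, which suggests (an inference, not something stated in this report) that macro-F1 is the unweighted mean over these frames. The same calculation gives a rough estimate of what reverting the FVS-011 tightening would do, assuming FVS-011 simply returns to its v1 score and no other frame is affected.

```python
from statistics import mean

# Per-frame F1 values copied from the table above: frame -> (v1, v2).
PER_FRAME_F1 = {
    "FVS-001": (0.000, 0.000),
    "FVS-002": (0.250, 0.250),
    "FVS-007": (0.000, 0.000),
    "FVS-008": (0.000, 0.000),
    "FVS-009": (0.364, 0.875),
    "FVS-010": (0.200, 0.200),
    "FVS-011": (0.727, 0.667),
    "FVS-012": (0.182, 0.182),
    "FVS-014": (0.000, 0.400),
    "FVS-015": (0.000, 0.000),
    "FVS-016": (0.000, 0.444),
}

macro_v1 = mean(v1 for v1, _ in PER_FRAME_F1.values())
macro_v2 = mean(v2 for _, v2 in PER_FRAME_F1.values())

# Hypothetical variant: keep every v2 change except the FVS-011 tightening,
# so FVS-011 falls back to its v1 score.
macro_v2_reverted = mean(
    v1 if frame == "FVS-011" else v2
    for frame, (v1, v2) in PER_FRAME_F1.items()
)

print(f"macro-F1 v1:                     {macro_v1:.3f}")
print(f"macro-F1 v2:                     {macro_v2:.3f}")
print(f"macro-F1 v2 with FVS-011 at v1:  {macro_v2_reverted:.3f}")
```

Because each frame carries a 1/11 weight, the 0.061 loss on FVS-011 moves macro-F1 by only about 0.005.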

---

## 3. Per-document changes

Changed (10 of 12 documents show different v1 and v2 labels; 2 unchanged):

| Doc | Stratum | v1 labels | v2 labels | F1 v1→v2 |
|-----|---------|-----------|-----------|---------:|
| a01_altman | A | FVS-002 | FVS-002, FVS-014 | 0.250→0.444 |
| b01_nvidia | B | FVS-001, FVS-011 | FVS-009 | 0.000→0.250 |
| b03_social_media | B | FVS-011 | FVS-009, FVS-011, FVS-016 | 0.000→0.444 |
| b04_llm_support | B | FVS-001, FVS-011 | FVS-009, FVS-011 | 0.250→0.500 |
| b05_remote_work | B | FVS-001, FVS-007, FVS-008, FVS-011 | (empty) | 0.250→0.000 |
| b06_quantum | B | FVS-001, FVS-015 | (empty) | 0.000→0.000 |
| c01_semaglutide | C | FVS-015 | FVS-009 | 0.000→0.500 |
| c02_eu_ai_act | C | FVS-010, FVS-011 | FVS-009, FVS-010, FVS-011 | 0.400→0.667 |
| c03_quantum_wiki | C | FVS-007, FVS-008 | FVS-007, FVS-014, FVS-016 | 0.000→0.571 |
| c04_ubi | C | FVS-007, FVS-008, FVS-011 | FVS-011 | 0.000→0.000 |

Unchanged (2 of 12): a02 FOMC (F1 = 0.800 in both v1 and v2; it was already at near-perfect agreement) and b02 automation (F1 = 0.667 in both versions).

**b05_remote_work is the v2 regression case.** v1 fired 4 frames on b05: FVS-001 (retired in v2), FVS-007 (removed by the narrowed voice filter), FVS-008 (retired), and FVS-011 (lost to the density tightening). v2 fires none. The curator labeled b05 with FVS-011 only. v2 removed the three false positives (good) but also the one true positive (bad), so F1 on this document dropped 0.250→0.000. In a larger corpus this document might not be representative; at n=12 it is a 1/12 penalty.
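The zero score follows directly from how a set-based F1 behaves on an empty prediction. The function below is a generic sketch, not the scoring code in the study harness, and it uses the curator's single label as the reference; the report's per-document figures are scored against the majority-union labels, which may contain additional labels, so the v1 value of 0.250 is not reproduced here.

```python
from __future__ import annotations


def set_f1(predicted: set[str], reference: set[str]) -> float:
    """F1 between a predicted and a reference label set.

    Returns 0.0 when there are no true positives, including the case where
    both sets are empty (a convention chosen for this sketch).
    """
    tp = len(predicted & reference)
    fp = len(predicted - reference)
    fn = len(reference - predicted)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


curator_b05 = {"FVS-011"}   # curator label stated in the text
v2_b05: set[str] = set()    # v2 fires nothing on b05

print(set_f1(v2_b05, curator_b05))
```

Any empty prediction scores zero against a non-empty reference, so removing false positives cannot help a document if the last true positive goes with them.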

---

## 4. What this tells us about the detector's ceiling

The rule-level revisions took macro-F1 from 0.157 to 0.274. The revisions were the ones RULE_AUDIT.md said should be applied. They worked approximately as predicted.

The remaining gap (0.126 to cross the H3 threshold at 0.4, 0.326 to reach "useful" at 0.6) cannot be closed by rule-level tuning on the current signal substrate. The specific blockers:

- **FVS-001 Frame Amplification, FVS-008 Growth Frame, and FVS-015 Efficiency Frame are retired, not fixed.** Bringing them back requires new signals in `framing.py` (growth-vocabulary regex, efficiency-vocabulary regex, within-session-repetition detection). Each of these is real engineering work.
- **FVS-012 Uncertainty Frame cannot be fixed at the rule layer.** The uncertainty regex in `framing.py:ANALYTICAL_CATEGORIES["uncertainty"]` misses hedge constructions that curator and LLM-judge both identify as uncertainty. Expanding the regex is the path; it is a framing.py change not attempted in v2.
- **FVS-016 Authority by Citation is partially addressed.** Lowering the threshold helped, but the `sourced_pct` metric itself is about numeric-claim-source-attribution, not about named-author citation in analytical prose. A proper fix needs a named-author-citation detector (proper-noun + year patterns, "according to X," etc.) in framing.py.
- **FVS-014 Temporal Anchoring's v2 threshold (35%) works on this corpus** but may over-fire on historical-narrative text not present in this sample. Larger study would refine.
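As a rough illustration of the named-author-citation signal mentioned in the FVS-016 item, the sketch below counts three surface patterns: "according to X", "Name (year)", and "(Name, year)". The pattern list and function name are hypothetical and nothing here exists in `framing.py`; a real detector would need tuning against the framing test cases.

```python
import re

# Hypothetical patterns for named-author citations in analytical prose.
NAMED_AUTHOR_PATTERNS = [
    # "according to Smith", "According to Jane Smith"
    re.compile(r"\b[Aa]ccording to [A-Z][a-z]+(?: [A-Z][a-z]+)*"),
    # "Smith (2019)", "Smith et al. (2019)", "Jones and Lee (2019)"
    re.compile(r"\b[A-Z][a-z]+(?: (?:and|&) [A-Z][a-z]+| et al\.)? \((?:19|20)\d{2}\)"),
    # "(Garcia, 2021)", "(Garcia et al., 2021)"
    re.compile(r"\([A-Z][a-z]+(?: et al\.)?,? (?:19|20)\d{2}\)"),
]


def count_named_author_citations(text: str) -> int:
    """Total number of pattern matches; every group is non-capturing, so
    findall returns one entry per full match."""
    return sum(len(pattern.findall(text)) for pattern in NAMED_AUTHOR_PATTERNS)


# Placeholder names and years, for illustration only.
sample = (
    "According to Smith, framing shapes choice. Jones and Lee (2019) showed "
    "this experimentally, and later work (Garcia, 2021) extended it. "
    "Revenue grew 12% in 2020."
)
print(count_named_author_citations(sample))
```

The last sentence of the sample contains a number and a year but no author, which is the kind of text `sourced_pct` targets and this signal should ignore.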

If all three deferred framing.py changes were implemented (growth-vocabulary signal, expanded uncertainty regex, named-author-citation signal), the detector's ceiling might move to macro-F1 ~ 0.4-0.5 on this corpus. Crossing 0.6 on multi-source labels probably requires either a different detection substrate (neural classifier) or taxonomy-sharpening work that closes the curator-LLM-judge kappa gap.

---

## 5. What v2 does and does NOT establish

Does:
- Shows the audit's diagnosis was broadly correct. Proposed actions moved F1 in the predicted direction.
- Identifies which rule changes deliver most (FVS-009 loosening, FVS-016 lowering, FVS-014 lowering) and which targeted the wrong constraint (the FVS-012 rule-level change was blocked by the upstream coverage gate).
- Demonstrates the detector has some rule-level headroom (+0.118 macro-F1 from rule changes alone) but not enough to reach "useful" thresholds without signal-level work.
- Produces a concrete v2 candidate that could be merged if curator approves (code in `fvs_eval/validation_study/frame_library_v2.py`; would require adaptation to replace `frame_library.suggest_frames` in production).

Does NOT:
- Cross the H3 falsification threshold (0.4). v2 is at 0.274. The REPORT.md conclusion that the project's core detection claim did not survive preliminary empirical test remains.
- Replace the need for an expanded multi-annotator study on larger n. v2 is still only the 12-doc corpus with curator + LLM-judge.
- Address the FVS taxonomy's own inter-rater ambiguity (curator-vs-LLM-judge kappa = +0.279).
- Commit to any specific rule-level change in the live codebase. v2 is a candidate; whether to adopt it is the curator's decision based on this data and on impact on the user-facing web surface.
- Validate the retired frames as permanently gone. FVS-001/008/015 retirements hold pending new signals in framing.py, not indefinitely.

---

## 6. Options for the curator

The audit + v2 test produces four actionable paths. Named without committing; choice is curator's.

**Option P1. Adopt v2 rules as-is.** Replace `frame_library.suggest_frames` with `suggest_frames_v2`. Macro-F1 goes from v1's 0.157 to v2's 0.274. Update METHODOLOGY.md and product UI to reflect the retirements. Ship.

Pro: simple change; measurable improvement; honest about retirements. Con: v2 still below "useful" threshold; b05 regression means some real-case worsening.

**Option P2. Adopt v2 rules minus FVS-011 tightening.** Revert the FVS-011 density threshold to 5/kw, which returns FVS-011 to its v1 F1 of 0.727. Expected macro-F1 ~0.28. Other changes as in P1.

Pro: preserves the strongest rule (FVS-011 was the only F1 > 0.5 in v1). Con: retains the FVS-011 false positives we tried to clean up.

**Option P3. Adopt v2 + expand uncertainty regex (framing.py change).** Would likely move FVS-012 from its stuck 0.182 to something meaningfully higher. Requires testing against existing framing test cases to confirm no regression. Roughly 30 minutes of engineering.

Pro: one more meaningful F1 gain. Con: changes affect the broader detector behavior beyond the validation study corpus.
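A sketch of what the regex expansion in P3 might look like, covering the three hedge constructions named in the FVS-012 diagnosis. The existing pattern in `framing.py` is not shown in this report, so the pattern below is a hypothetical addition rather than a patch, and its effect on FVS-012 is untested.

```python
import re

# Hypothetical extension covering hedge constructions the audit found missing.
HEDGE_EXTENSIONS = re.compile(
    r"\b(?:"
    r"undemonstrated|unproven|"
    r"theoretically \w+ but practically \w+|"
    r"(?:years|decades)(?: or (?:years|decades))? away"
    r")\b",
    re.IGNORECASE,
)

sample = (
    "Fault tolerance remains undemonstrated; the design is theoretically "
    "promising but practically unproven, and useful machines are years or "
    "decades away."
)
matches = HEDGE_EXTENSIONS.findall(sample)
print(matches)

# A plain factual sentence should produce no matches.
control = HEDGE_EXTENSIONS.findall("Revenue rose 4% last year.")
print(control)
```

Running a pattern like this over the existing framing test cases before merging would be the regression check the option calls for.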

**Option P4. Defer all adoption until the expanded multi-annotator study runs.** v2 is a preliminary signal on n=12. An expanded study on 30-50 docs with 2-3 independent human annotators would establish whether v2 is durably better.

Pro: fewer premature commitments; better evidence. Con: blocks any product-surface update pending external annotation.

**Recommended path: P2 (v2 minus FVS-011 tightening) immediately, then P3 (uncertainty regex) after a short framing.py test pass, then P4 (expanded study) on the resulting v2.x.** Curator decides on the specific sequence.

---

## 7. Implications updated (from REPORT.md §7)

REPORT.md §7 named three pivot options:
- Re-scope detector as signal-surfacing tool, not label-producing tool.
- Narrow to frames with F1 > 0.5.
- Document the fragility.

v2 changes those options slightly:

- **Re-scope is still valid.** v2 at 0.274 is still below "useful," so the detector's output should still be presented as "signals + possible frame," not "the document exhibits FVS-N."
- **"Frames with F1 > 0.5" now includes FVS-009 (0.875) and FVS-011 (0.667) in v2.** If the user-facing UI surfaces only high-F1 frames, v2 doubles the set from one (FVS-011 alone at v1) to two (FVS-009 + FVS-011 at v2).
- **Documenting fragility becomes operationally specific.** METHODOLOGY.md can say: "Rule F1 at 12-document preliminary scale: FVS-009 F1 0.875, FVS-011 F1 0.667, FVS-016 F1 0.444, FVS-014 F1 0.400, FVS-002 F1 0.250, FVS-010 F1 0.200, FVS-012 F1 0.182; other frames below 0.15 or retired pending new signals." This is an honest per-frame F1 manifest the product can display per frame.
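Both the "F1 > 0.5" filter and the manifest text can be derived from the per-frame scores in §2. The helper names below are illustrative, and the 0.5 cut-off and 0.15 floor are taken from the wording above.

```python
from __future__ import annotations

# Per-frame F1 from the per-frame results table.
V1_F1 = {
    "FVS-001": 0.000, "FVS-002": 0.250, "FVS-007": 0.000, "FVS-008": 0.000,
    "FVS-009": 0.364, "FVS-010": 0.200, "FVS-011": 0.727, "FVS-012": 0.182,
    "FVS-014": 0.000, "FVS-015": 0.000, "FVS-016": 0.000,
}
V2_F1 = {
    "FVS-001": 0.000, "FVS-002": 0.250, "FVS-007": 0.000, "FVS-008": 0.000,
    "FVS-009": 0.875, "FVS-010": 0.200, "FVS-011": 0.667, "FVS-012": 0.182,
    "FVS-014": 0.400, "FVS-015": 0.000, "FVS-016": 0.444,
}


def surfaced_frames(f1_by_frame: dict[str, float], min_f1: float = 0.5) -> list[str]:
    """Frames whose F1 strictly exceeds the display cut-off."""
    return sorted(frame for frame, f1 in f1_by_frame.items() if f1 > min_f1)


def manifest_line(f1_by_frame: dict[str, float], floor: float = 0.15) -> str:
    """Frames at or above the floor, highest F1 first, formatted for display."""
    listed = sorted(
        ((frame, f1) for frame, f1 in f1_by_frame.items() if f1 >= floor),
        key=lambda item: item[1],
        reverse=True,
    )
    return ", ".join(f"{frame} F1 {f1:.3f}" for frame, f1 in listed)


print(surfaced_frames(V1_F1))
print(surfaced_frames(V2_F1))
print(manifest_line(V2_F1))
```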

---

## 8. Honest limits of v2 testing

1. **Same 12-doc corpus as v1.** v2 is evaluated on the same documents the audit diagnosed against, so the reported gains are in-sample. Holdout evaluation on a new corpus would be stronger; it was not performed.
2. **Single executor.** Same person authored the audit and ran v2. Replication by an independent party would strengthen conclusions.
3. **n=12 is small.** The +0.118 delta has wide confidence intervals. The per-frame changes (FVS-009 +0.511 especially) have even wider ones; a one-doc shift in the corpus could materially change per-frame numbers.
4. **v2 touches only the rule layer (`frame_library.py`); `framing.py` signals are unchanged.** Three diagnosed signal-level issues (FVS-012 uncertainty regex, FVS-008 growth vocabulary, FVS-016 named-author citation) remain unfixed.
5. **Non-rule-level changes not evaluated.** Options like ensemble scoring, LLM-as-second-scorer, trained classifier are not part of v2. v2 is a rule-tuning study, not a substrate change.

---

## 9. Data availability

All v2 artifacts are in `fvs_eval/validation_study/`:

- `RULE_AUDIT.md`: per-frame diagnosis of v1 disagreements and proposed revisions.
- `frame_library_v2.py`: revised `suggest_frames_v2` implementing the audit's proposed actions.
- `05_rerun_v2_detector.py`: harness that runs v2 detector, computes v1-vs-v2 metrics.
- `labels/labels_detector_v2.json`: v2 detector labels on the 12-doc corpus.
- `results/v1_vs_v2_comparison.json`: computed comparison metrics.

---

## 10. One-sentence summary

Rule-level revisions diagnosed in RULE_AUDIT.md moved macro-F1 from 0.157 to 0.274 on the same 12-document validation corpus (75% relative improvement), confirming the diagnosis was broadly correct, but the v2 detector still does not cross the pre-registered H3 falsification threshold (0.4), and rule-level tuning alone cannot reach the "useful" threshold (0.6); that requires signal-level additions to framing.py (growth-vocabulary, expanded uncertainty regex, named-author-citation) that were deliberately out of v2 scope.

---

*v1. 2026-04-18. Authored by Lovro Lucic with AI assistance. Post-audit empirical test of rule revisions proposed in RULE_AUDIT.md. Does not modify v1 canonical result; provides candidate revision numbers for curator decision.*