# Track A Report: Expanded Corpus + Signal-Level Validation

**Date:** 2026-04-19.
**Pre-registration:** DESIGN_v2.md (hypotheses, thresholds) and DESIGN_v3.md (corpus re-design).
**Status:** Preliminary. n=28. **Primary hypothesis H-A5 FAILED** at 0.360 vs 0.400 threshold. Secondary per-frame hypotheses passed with varying statistical power (one is stat-underpowered at n=2 and a second is thin at n=7; see §2.1).

---

## 1. Headline (honest framing, revised)

**Primary pre-registered hypothesis H-A5 (macro-F1 >= 0.4): FAILED at 0.360.** The H3 falsification zone fires, by a 0.04 margin rather than the 0.24 margin of the original v1 study (0.157 at n=12). The detector is meaningfully improved over the v1 rules run on this corpus (0.167) but does not cross the pre-registered useful threshold.

**Macro-recall H-A7 (>= 0.5): FAILED at 0.271.**

Secondary per-frame hypotheses (H-A1 through H-A4): four of four crossed their 0.35 targets numerically, but **two are statistically underpowered at the n of positives in the corpus**. FVS-015 passed at n=2 majority-positives (95% CI essentially [0, 1]); FVS-008 at n=7; FVS-016 at n=16; FVS-012 at n=18 is the only robust pass. Numerical passes are not all equivalent.

| Metric | v1 baseline | v3 post-signal | Delta |
|--------|------------:|---------------:|------:|
| Macro-F1 vs majority-union (PRIMARY) | 0.167 | 0.360 | +0.193 |
| Macro-F1 vs curator only | (not computed) | 0.391 | (closer) |
| Macro-recall | (not computed) | 0.271 | (fails threshold) |

**Critical caveats readers should internalize before interpreting the numbers below.** These are not hedging; they are load-bearing context.

1. **All v3 signals were designed by examining v1 corpus misses.** Specific regex patterns (S-1 hedge constructions, S-2 citation patterns, S-3 growth vocabulary, S-4 efficiency vocabulary) came from observing which v1 cases the detector missed or over-fired. Tuning-set F1 is an **upper bound**; held-out F1 would almost certainly be lower. No held-out validation has been performed (§7.4, §8).
2. **LLM-judge over-labels on several frames.** The LLM flagged roughly 71% of the 308 available slots positive (§4.2); the curator flagged 30%. Majority-union is thus permissive by construction. F1 against permissive ground truth rewards detectors that over-fire. An intersection-based strict reading is more honest and is computed in §4.2.
3. **Low-n per-frame CIs are wide.** At n=2 majority-positives the 95% CI on F1 is essentially [0, 1]. "PASS" at low n is a function of noise as much as detector quality. FVS-015's pass is within the noise band.
4. **The detector measures vocabulary densities as proxies for frames.** High F1 against labels is a proxy for construct validity, not the thing itself. The detector may be fitting lexical patterns that correlate with but do not constitute the framing constructs the FVS taxonomy describes (§7.3).

The signal-level additions (S-1 through S-4) moved the detector meaningfully. They did NOT move it across the primary pre-registered threshold. Two frames at F1 = 0 (FVS-001 retired, FVS-007 narrowed) remain in the 11-frame macro aggregation; that is a legitimate component of the pre-registered measure, not a calculation artifact to be explained away.

---

## 2. Pre-registered hypothesis results

| Hypothesis | Threshold | Observed | Verdict |
|-----------|---------:|---------:|:-------:|
| H-A1: FVS-012 Uncertainty F1 ≥ 0.35 (S-1 expanded regex) | 0.35 | 0.538 | **PASS** |
| H-A2: FVS-016 Authority F1 ≥ 0.35 (S-2 named-author signal) | 0.35 | 0.400 | **PASS** |
| H-A3: FVS-008 Growth F1 ≥ 0.35 (S-3 growth vocabulary signal) | 0.35 | 0.400 | **PASS** |
| H-A4: FVS-015 Efficiency F1 ≥ 0.35 (S-4 efficiency vocabulary) | 0.35 | 0.500 | **PASS** |
| H-A5: Study macro-F1 ≥ 0.4 | 0.400 | 0.360 | **FAIL** |
| H-A7: Study macro-recall ≥ 0.5 | 0.500 | 0.271 | **FAIL** |

Six hypotheses. Four per-frame pass. Two study-level fail (including the primary H-A5). Per pre-registration, these verdicts are final and do not retroactively relax thresholds.

### 2.1 Not all "PASS" verdicts are equal: n-of-positives analysis

The per-frame F1 thresholds were set at 0.35 assuming n=30 corpus with per-frame majority-positives distributed roughly uniformly. In practice some frames have very few majority-positives, which makes their F1 estimates statistically unreliable:

| Frame | Majority-union n | 95% CI width at observed F1 | Stat-power assessment |
|-------|----:|---|---|
| FVS-015 Efficiency | **2** | F1 CI approximately [0.00, 0.95] | **"Pass" is within noise band**. One document changing label swings F1 from 0.000 to 0.800. Not a meaningful test of the pre-registered threshold. |
| FVS-008 Growth | 7 | F1 CI approximately [0.10, 0.75] | Thin. Pass is plausible but CI straddles threshold. |
| FVS-016 Authority | 16 | F1 CI approximately [0.15, 0.65] | Moderate. Pass is credible. |
| FVS-012 Uncertainty | 18 | F1 CI approximately [0.25, 0.75] | Meaningful. Pass is robust. |

Only H-A1 (FVS-012) is a robust pass. H-A2 (FVS-016) is credible. H-A3 (FVS-008) and H-A4 (FVS-015) should be treated as directional rather than conclusive until replicated at larger n.
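The n=2 sensitivity claimed for FVS-015 can be checked by hand. The sketch below is illustrative only: the confusion counts (TP=1, FP=1, FN=1) are back-derived from the §3 row (2 majority-positives, 2 fires, precision and recall both 0.500) and are not read from the results file.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# FVS-015 as reported: 2 majority-positives, 2 detector fires, P = R = 0.5.
# Back-derived counts (an assumption of this sketch): TP=1, FP=1, FN=1.
observed = f1(tp=1, fp=1, fn=1)

# One label flips in the detector's favor: the FP document is relabeled positive.
fp_relabeled_positive = f1(tp=2, fp=0, fn=1)

# One label flips against the detector: the TP document is relabeled negative.
tp_relabeled_negative = f1(tp=0, fp=2, fn=1)

print(f"observed={observed:.3f}, "
      f"best single flip={fp_relabeled_positive:.3f}, "
      f"worst single flip={tp_relabeled_negative:.3f}")
```

A single relabeled document moves the estimate across the whole 0.000 to 0.800 range, which is why the 0.35 threshold is not a meaningful test at this n.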

---

## 3. Per-frame comparison

| Frame | n_A | n_B | n_maj | n_v1 | n_v3 | F1_v1 | F1_v3 | ΔF1 | P_v3 | R_v3 | Status |
|-------|----:|----:|------:|-----:|-----:|------:|------:|----:|-----:|-----:|--------|
| FVS-001 Frame Amplification | 1 | 0 | 1 | 8 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | retired |
| FVS-002 Fluency-Quality | 1 | 20 | 20 | 4 | 4 | 0.250 | 0.250 | 0.000 | 0.750 | 0.150 | unchanged |
| FVS-007 Failure Framing | 2 | 0 | 2 | 10 | 3 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | narrowed |
| FVS-008 Growth Frame | 3 | 7 | 7 | 9 | 3 | 0.250 | **0.400** | **+0.150** | 0.667 | 0.286 | rewired S-3 |
| FVS-009 Risk Frame | 6 | 16 | 16 | 3 | 10 | 0.211 | **0.615** | **+0.405** | 0.800 | 0.500 | loosened v2 |
| FVS-010 Completeness | 2 | 22 | 22 | 5 | 5 | 0.296 | 0.296 | 0.000 | 0.800 | 0.182 | unchanged |
| FVS-011 Stakeholder | 6 | 7 | 10 | 9 | 7 | 0.632 | 0.588 | -0.043 | 0.714 | 0.500 | tightened v2 |
| FVS-012 Uncertainty | 6 | 17 | 18 | 2 | 8 | 0.200 | **0.538** | **+0.338** | 0.875 | 0.389 | rewired S-1 |
| FVS-014 Temporal Anchoring | 7 | 22 | 22 | 0 | 5 | 0.000 | **0.370** | **+0.370** | 1.000 | 0.227 | threshold v2 |
| FVS-015 Efficiency | 2 | 1 | 2 | 3 | 2 | 0.000 | **0.500** | **+0.500** | 0.500 | 0.500 | rewired S-4 |
| FVS-016 Authority | 15 | 6 | 16 | 0 | 4 | 0.000 | **0.400** | **+0.400** | 1.000 | 0.250 | rewired S-2 |

Seven frames meet F1 ≥ 0.35 in v3 (FVS-008, 009, 011, 012, 014, 015, 016); FVS-011 already met it in v1. Four frames remain below (FVS-001, 002, 007, 010).
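As a cross-check, the per-frame and macro figures can be recomputed from the counts in the table. This is a minimal sketch, not the project's `11_compute_v3_metrics.py`: the `n_maj` and `n_v3` values are copied from the table, while the true-positive counts are back-derived as `P_v3 × n_v3` and should be treated as an assumption.

```python
from statistics import mean

# frame -> (n_maj, n_v3 fires, true positives).
# n_maj and n_v3 come from the table; the TP column is back-derived from
# P_v3 * n_v3 and is an assumption of this sketch.
FRAMES = {
    "FVS-001": (1, 0, 0),
    "FVS-002": (20, 4, 3),
    "FVS-007": (2, 3, 0),
    "FVS-008": (7, 3, 2),
    "FVS-009": (16, 10, 8),
    "FVS-010": (22, 5, 4),
    "FVS-011": (10, 7, 5),
    "FVS-012": (18, 8, 7),
    "FVS-014": (22, 5, 5),
    "FVS-015": (2, 2, 1),
    "FVS-016": (16, 4, 4),
}

def prf(n_pos: int, n_fires: int, tp: int) -> tuple[float, float, float]:
    """Precision, recall, and F1; each is 0.0 when its denominator is zero."""
    precision = tp / n_fires if n_fires else 0.0
    recall = tp / n_pos if n_pos else 0.0
    denom = n_pos + n_fires  # equals 2*TP + FP + FN
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1

scores = {frame: prf(*counts) for frame, counts in FRAMES.items()}
macro_f1 = mean(s[2] for s in scores.values())
macro_recall = mean(s[1] for s in scores.values())

for frame, (p, r, f) in scores.items():
    print(f"{frame}: P={p:.3f} R={r:.3f} F1={f:.3f}")
print(f"macro-F1={macro_f1:.3f} macro-recall={macro_recall:.3f}")
```

Under these counts the macro values reproduce the headline 0.360 and 0.271, with the two zero-F1 frames included in the 11-frame average.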

**Notable wins:**
- **FVS-009 Risk Frame +0.405.** Dropping the uncertainty-covered requirement (v2 loosening) let the rule catch documents where risk density is substantive but uncertainty markers are sparse.
- **FVS-014 Temporal Anchoring +0.370.** Lowering threshold from 70%/60% to 35%/35% enabled the rule to fire at all (v1 had zero fires on this corpus).
- **FVS-015 Efficiency +0.500.** The dedicated efficiency-vocabulary regex (S-4) successfully identified efficiency-framed analytical text that the v1 coverage-based rule missed entirely.
- **FVS-016 Authority +0.400.** The named-author-citation signal (S-2) identifies academic/scholarly citation patterns in Wikipedia and scholarly text that v1's sourced_pct-based rule could not reach.
- **FVS-012 Uncertainty +0.338.** The expanded hedge-construction regex (S-1) caught uncertainty markers in hedged analytical prose that the v1 regex missed.

**Notable non-movements:**
- **FVS-001 Frame Amplification: 0.000 (retired).** v1 fired 8 times (mostly FP). v3 fires 0 times. No TPs lost (n_maj_union = 1). Retirement removed FPs cleanly.
- **FVS-007 Failure Framing: 0.000.** v2 voice filter narrowed from 10 fires to 3. All 3 are still FPs. Retirement recommendation hardens.
- **FVS-002 Fluency-Quality: unchanged.** LLM-judge over-flags this frame (20/28 docs). Even with the v3 detector firing 4 times (3 TPs, per the 0.750 precision in §3), recall against the permissive labeler stays low.
- **FVS-010 Completeness Illusion: unchanged.** LLM-judge flags 22/28; detector fires 5. Permissiveness gap dominates.

---

## 4. Per-stratum results

| Stratum | n | F1_v1 | F1_v3 | Delta |
|---------|--:|------:|------:|------:|
| A (human-authored) | 8 | 0.317 | 0.330 | +0.013 |
| B (AI-generated) | 10 | 0.171 | 0.326 | +0.155 |
| C (Wikipedia) | 10 | 0.090 | **0.487** | **+0.397** |

**Stratum C experienced the largest improvement.** The named-author-citation signal (S-2) and the temporal-anchoring threshold lowering (v2) together capture the characteristic pattern of Wikipedia encyclopedic articles: historical dates with dense author/institutional citations. The combination lifts Stratum C F1 from near-random to approaching "useful" thresholds.

Stratum A barely moved. Reason: the two original A documents (a01 Altman, a02 FOMC) already had high v1 F1 (the FOMC case hit F1 = 0.800 in v1). The 6 newly added A documents balance the stratum, but most of them have sparse labels to match against. The v3 improvements targeted classes of documents that are less common in Stratum A.

Stratum B moved meaningfully as the growth-vocabulary signal (S-3) caught AI-generated analytical text that deploys growth vocabulary.

### 4.1 Caveat: Stratum A is author-concentrated

3 of 8 Stratum A documents are Paul Graham essays (a03, a04, a05). This gives the stratum an author-concentration: findings about "human-authored analytical text" are partly findings about Paul Graham's specific voice. This was named in DESIGN_v3 §6.2 but the consequence for Stratum A results was under-emphasized in §4. Take Stratum A numbers as illustrative, not representative.

### 4.2 The LLM-judge permissiveness effect on F1 ground truth

LLM-judge flagged ~220 positive labels across 308 slots (28 docs × 11 detectable frames) = 71% positive rate. Curator flagged ~94 labels = 30%. Majority-union labels therefore skew toward "positive" wherever the LLM is permissive.

For several frames, the judge's permissiveness dominates:
- **FVS-002 Fluency-Quality:** LLM 20/28, curator 1/28. Majority-union = 20 (the LLM's 20 absorb the curator's 1). Detector fires 4 times. F1 = 0.250 against union. Against strict intersection (both labelers agree), the positive set is 1 document (just a01); the detector's 4 fires include a01 (TP) plus 3 documents outside the intersection. Against intersection: 1 TP, 3 FPs, 0 FNs, F1 = 0.40.
- **FVS-010 Completeness:** LLM 22/28, curator 2/28. Same dynamic.

For these two frames, the numbers reported in §3 UNDERSTATE detector F1 against careful reading (curator) because the LLM's noise inflates the set of positives that recall is measured against. **A strict-intersection-based evaluation would show v3 performing better on these frames than the union-based numbers suggest.** The pre-registered primary is union, so the reported numbers stand; but readers should understand the measurement direction.

This also means **macro-F1 vs curator-only (0.391)** is a more honest single number than macro-F1 vs majority-union (0.360) for frames where the LLM is noisy. Neither crosses the 0.4 threshold.
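The FVS-002 comparison above can be reproduced with plain set arithmetic. The sketch below is illustrative: only `a01` is a real document ID from the report, and every other ID is a hypothetical placeholder chosen so that the counts match (LLM 20, curator 1, 4 fires of which 3 fall inside the union).

```python
def f1_against(fires: set[str], positives: set[str]) -> float:
    """F1 of detector fires against a ground-truth positive set."""
    tp = len(fires & positives)
    denom = len(fires) + len(positives)  # equals 2*TP + FP + FN
    return 2 * tp / denom if denom else 0.0

# Only a01 is named in the report; all other IDs are hypothetical placeholders.
curator = {"a01"}
llm = {"a01"} | {f"doc{i:02d}" for i in range(1, 20)}  # 20 documents in total
fires = {"a01", "doc01", "doc02", "unflagged01"}        # 4 fires, 3 inside the union

union = curator | llm
intersection = curator & llm

f1_union = f1_against(fires, union)
f1_intersection = f1_against(fires, intersection)
print(f"union: n={len(union)} F1={f1_union:.3f}")
print(f"intersection: n={len(intersection)} F1={f1_intersection:.3f}")
```

The same four fires score 0.250 against the union and 0.400 against the intersection, so the difference comes entirely from which ground-truth set is used.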

---

## 5. Why macro-F1 did not cross 0.400

The macro-F1 threshold fails by a 0.040 margin (0.360 vs 0.400). Two frames drag the macro average:

- **FVS-001 retired: F1 = 0.000 contributes 0 to the macro average.** If we excluded retired frames from the macro, v3 would be averaged over 10 frames; (0.360 × 11) / 10 = 0.396: still below 0.4 but barely.
- **FVS-007 voice-narrowed: F1 = 0.000.** The narrowed rule correctly excluded most FPs but retained 3 FPs on documents where Frame Check classifies voice as "advisory" but labelers did not flag FVS-007. Either the voice classifier is mis-classifying some documents as advisory, or "advisory" voice does not map to FVS-007's pattern cleanly.
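The effect of the retired frame on the aggregate can be sketched directly from the rounded F1_v3 column in §3. This is an illustration of the arithmetic only; the pre-registered measure remains the 11-frame average.

```python
# Rounded F1_v3 values copied from the per-frame table in §3.
F1_V3 = {
    "FVS-001": 0.000, "FVS-002": 0.250, "FVS-007": 0.000, "FVS-008": 0.400,
    "FVS-009": 0.615, "FVS-010": 0.296, "FVS-011": 0.588, "FVS-012": 0.538,
    "FVS-014": 0.370, "FVS-015": 0.500, "FVS-016": 0.400,
}
THRESHOLD = 0.400  # pre-registered H-A5 threshold

def macro(values: list[float]) -> float:
    """Unweighted mean over frames."""
    return sum(values) / len(values)

all_frames = macro(list(F1_V3.values()))
without_retired = macro([f for frame, f in F1_V3.items() if frame != "FVS-001"])

print(f"11 frames (pre-registered): {all_frames:.3f}")
print(f"10 frames (FVS-001 dropped): {without_retired:.3f}")
print(f"crosses threshold: {all_frames >= THRESHOLD}, {without_retired >= THRESHOLD}")
```

Dropping only the retired frame is not enough to cross 0.400, which matches the 0.396 figure above.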

Alternative computations:
- **Macro-F1 over only the 9 frames in v3 that are NOT retired or voice-narrowed (FVS-002, 008, 009, 010, 011, 012, 014, 015, 016):** averages to 0.440 using the per-frame values in §3. Above threshold.
- **Macro-F1 vs curator-labels-only (not LLM-judge union):** 0.391. Also narrowly below threshold, but much closer.

**Pre-registration discipline:** the threshold was set at 0.400 for macro-F1 vs majority-union over 11 detectable frames. Observed 0.360 fails this threshold. H-A5 falsification fires. No post-hoc threshold relaxation is allowed.

The falsification is narrower in consequence than the original v1 failure (macro-F1 0.157): v1 said "the detector cannot reliably measure what labelers see." v3 says "the detector can reliably measure what labelers see on 7 of 11 frames; two retired/narrowed frames and two LLM-over-flagged frames prevent the macro average from crossing useful-threshold."

---

## 6. What v3 established and what it did not

Established:
- **All four targeted signal-level additions work as predicted.** S-1 through S-4 each moved their target frame's F1 above 0.35.
- **The detector can reach useful-zone F1 on most frames the project cares about.** FVS-009 Risk Frame (0.615), FVS-011 Stakeholder (0.588), FVS-012 Uncertainty (0.538), FVS-015 Efficiency (0.500), FVS-014 Temporal (0.370), FVS-008 Growth (0.400), FVS-016 Authority (0.400).
- **Retirement recommendations hold up empirically.** FVS-001 retired cleanly (removed 8 v1 FPs, lost 0 TPs). FVS-015 retired-then-rewired works better than either pure retirement or pure v1 rule.
- **Per-stratum improvement is consistent.** v3 lifts all three strata; Stratum C (Wikipedia) shows the largest improvement, which is consistent with the signal additions carrying across text genres within the tuning corpus. Generalization to unseen documents remains untested (§7.4).

Not established:
- **v3 does not cross pre-registered H3 macro-F1 threshold (0.4).** The narrow failure is informative, not decisive; under alternative aggregations it would pass, but the pre-registered aggregation is the binding one.
- **Macro-recall is low (0.271).** The detector still misses many frames labelers identify. This is structurally related to the LLM-judge permissiveness (it flags roughly 71% of the 28×11 = 308 possible slots; §4.2) and the curator's more selective labeling.
- **Inter-rater reliability with independent human annotators remains untested.** Track A uses LLM-judge as the second labeler, which has known permissiveness bias. The true test remains Track B.

---

## 7. What this changes for Frame Check

### 7.1 Detector framing for users (product surface) [CONDITIONAL]

**This is a CONDITIONAL recommendation pending held-out validation (see §7.4).** If v3 F1 numbers hold up on held-out documents, the per-frame F1 manifest enables an honest per-frame confidence statement the product can display:

- **High-confidence frames (F1 ≥ 0.5, n-robust):** FVS-009 Risk Frame (0.615, n=16), FVS-011 Stakeholder (0.588, n=10), FVS-012 Uncertainty (0.538, n=18).
- **Moderate-confidence frames (F1 0.35-0.5, some stat-thin):** FVS-015 Efficiency (0.500 at n=2; stat-underpowered, not robust), FVS-008 Growth (0.400 at n=7; thin), FVS-014 Temporal (0.370), FVS-016 Authority (0.400 at n=16).
- **Low-confidence frames (F1 < 0.35):** FVS-002 Fluency-Quality, FVS-010 Completeness.
- **Retired from suggest_frames:** FVS-001 Frame Amplification, FVS-007 Failure Framing (pending different detection approach).

**Not safe to ship without held-out validation.** Tuning-set F1 is an upper bound on what v3 produces on unseen documents. Before any product-surface update, a held-out corpus test (see §7.4) must confirm or disconfirm that the pattern generalizes.

### 7.2 Signal-level additions are production-candidate [CONDITIONAL]

S-1 through S-4 moved the detector meaningfully on the tuning corpus. They deserve evaluation for adoption into the live `framing.py` **if** held-out validation shows the effect generalizes. Absent held-out validation, adopting v3 into live code risks regressions on production text distributions (user-pasted content differs from the 28-doc analytical corpus).

Pre-adoption testing must include:
- Held-out corpus (fresh documents not used in Track A tuning; sizing per §7.4) with re-measured per-frame F1.
- Regression test against existing `test_framing_validation.py`; no existing unit-level behavior should degrade.
- Production-distribution spot check: run v3 on a random sample of recent organic user traffic (Tier A events) to check for obvious false-positive patterns on non-analytical text.

### 7.3 The stress-test revealed limits not emphasized in v3's first draft

A pre-adoption review of this report surfaced four caveats that were under-emphasized in the initial draft:

1. **Low-n "PASS" verdicts** (§2.1): FVS-015 at n=2 is not a meaningful test of the threshold.
2. **Tuning-set bias**: all v3 signals came from v1 miss observations; tuning-set F1 is an upper bound.
3. **LLM-judge permissiveness** (§4.2): some F1 numbers are more about judge noise than detector quality.
4. **Construct validity gap**: vocabulary density approximates but does not equal frame-organization; high F1 does not establish that the detector measures the construct FVS entries describe.

### 7.4 Held-out validation is a prerequisite for production adoption

**Before N1 (v3 adoption into live code) is authorized, the curator should run a held-out validation study:**

- Fresh corpus: 10-15 documents not used in v1 or Track A tuning. Sources: random recent Hacker News submissions, Wikipedia Random Article, SSRN abstracts, gov.uk speeches. Content-validation gates from DESIGN_v3 apply.
- Labeling: curator + LLM-judge, same protocol as Track A.
- Metric: v3 detector macro-F1 on held-out set. If |held-out F1 - tuning F1| < 0.05 (i.e., v3 retains approximately 0.31-0.41 on held-out), signal additions generalize and N1 is safe. If held-out F1 drops materially (by more than 0.10), v3 overfit to tuning corpus and signal additions need redesign before production.
- Effort: curator-paced. Realistic timeline 2-5 hours curator time + 1 hour engineering.
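The decision rule in the Metric bullet can be written down as a small gate. This is a sketch under stated assumptions: the function name `heldout_gate` and the `"inconclusive"` outcome are hypothetical, since the rule above does not say what happens when the held-out drop falls between 0.05 and 0.10 or when held-out F1 exceeds tuning F1 by 0.05 or more.

```python
def heldout_gate(tuning_f1: float, heldout_f1: float) -> str:
    """Classify a held-out result against the two thresholds stated above.

    Returns "generalizes" when |held-out - tuning| < 0.05, "overfit" when
    held-out drops by more than 0.10, and "inconclusive" otherwise. The
    "inconclusive" branch is an assumption of this sketch: the report does
    not define an outcome for that region.
    """
    drop = tuning_f1 - heldout_f1
    if abs(drop) < 0.05:
        return "generalizes"
    if drop > 0.10:
        return "overfit"
    return "inconclusive"

TUNING_MACRO_F1 = 0.360  # Track A tuning-set macro-F1

# Hypothetical held-out values, for illustration only.
for heldout in (0.34, 0.28, 0.20):
    print(f"held-out {heldout:.2f}: {heldout_gate(TUNING_MACRO_F1, heldout)}")
```

Fixing the treatment of the middle region before the held-out labels are produced would keep the gate in line with the report's pre-registration discipline.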

This held-out validation is the correct gate for N1. Track A alone does not justify production adoption; the tuning-set F1 is evidence for a hypothesis, not a measurement of generalization.

### 7.5 Path to H-A5 success (if crossing the threshold becomes a goal)

If the curator wants to push v3 across the pre-registered 0.4 threshold in a follow-on study (conditional on held-out validation passing first per §7.4), there are two viable paths:

- **Path A: Re-include FVS-001 and FVS-007 with rules that can produce F1 > 0.** Requires new signals (not addressed by S-1 through S-4). Amplification is the harder one; failure-framing might be rescued with a targeted predictive-claim + unhedged-confidence signal.
- **Path B: Expand corpus to n=60 with independent human annotators.** Larger n tightens CIs; independent annotators remove LLM permissiveness effect; pre-registered threshold might pass at true inter-annotator reliability.

Path B is the correct-class work per the session's active decision (external evaluators + actual users). Path A is engineering-available but addresses a narrower question (per-frame rule additions) without addressing the underlying construct-validity or generalization questions.

---

## 8. Honest limits

1. **n=28 is still small.** Per-frame F1 CIs are wide (the approximate intervals in §2.1 span roughly +/- 0.25 or more). A single document changing label shifts per-frame F1 by a few points on the better-populated frames and by far more at n=2 (§2.1). Directional findings are supported; precise rank-orderings are not.
2. **LLM-judge permissiveness persists.** The LLM flagged 220 positive labels (vs the curator's 94). The union is permissive by construction. Curator-only would be stricter; majority-intersection (both labelers agree) would be stricter still, since it can only contain labels the curator also assigned. The pre-registered primary is majority-union, and that is what the headline numbers use.
3. **Stratum A imbalance.** 8 documents from a narrow set of sources (Paul Graham essays, Stephen Wolfram blogs, arxiv abstracts, plus the original Altman and FOMC documents). Not broadly representative of "human-authored public documents" as originally scoped. Findings for Stratum A are anecdotal.
4. **Single executor.** Curator designed corpus, produced labels, implemented signals, ran detector, wrote report. Same-person cascade. Replication is the remedy and remains pending.
5. **v3 has not been validated on held-out data.** All tuning used the study corpus. A follow-on study with independent held-out corpus would test whether v3 generalizes.
6. **Retirement decisions are tied to current signals.** FVS-001 and FVS-007 retirement is honest for the current signal substrate; new signals could restore them. Retirement is reversible.

---

## 9. Relationship to prior reports

- **REPORT.md (v1, n=12):** canonical preliminary result; macro-F1 = 0.157. H3 fired. Published unchanged.
- **REPORT_V2.md (v2 on n=12):** rule-level revisions per RULE_AUDIT; macro-F1 = 0.274. H3 still fired.
- **REPORT_V3_TRACK_A.md (this; v3 on n=28):** signal-level additions per DESIGN_v2 §4. All 4 signal hypotheses passed; macro-F1 = 0.360. H3 narrowly fires but macro is now within one-frame-improvement of crossing.

The progression is monotone upward. The detector is getting better as predicted. The pre-registered threshold is a stringent bar designed to falsify a specific claim; narrowly failing it is a different finding than v1's cleanly-below-threshold result.

---

## 10. One-sentence summary (honest framing, revised)

Track A's primary pre-registered hypothesis H-A5 (macro-F1 >= 0.4) FAILED at 0.360 vs 0.400, narrowly rather than decisively; four secondary per-frame hypotheses met their 0.35 targets numerically but one (FVS-015 at n=2) is stat-underpowered and all four are tuning-set numbers that have not been validated on held-out data; the detector is meaningfully improved over v1 (0.167) but the improvement is demonstrated on documents the signals were designed against, so adoption into live code requires a held-out validation gate (§7.4) before it can be trusted to generalize.

---

## 11. Data availability

All Track A artifacts:
- DESIGN_v2.md, DESIGN_v3.md (pre-registrations)
- HALT_NOTICE_v2.md (honest halt record)
- corpus/ (28 documents + manifest v5)
- framing_v2.py (S-1 through S-4 implementations)
- frame_library_v2.py, frame_library_v3.py (rule revisions)
- labels/labels_curator_v3.json, labels_llm_v3.json, labels_detector_v1_v3corpus.json, labels_detector_v3.json
- results/results_v3.json
- 01_assemble_corpus.py through 11_compute_v3_metrics.py (scripts)

Replicable: any of these can be re-run. Signal-level additions have no randomness; detector output is deterministic on given text.

---

*v1. 2026-04-19. Authored by Lovro Lucic with AI assistance. Pre-registered in DESIGN_v2 and DESIGN_v3; results reported without post-hoc re-scoping per DESIGN.md §8 rule. First empirical confirmation that signal-level additions to framing.py can meaningfully move detector F1; first empirical failure of pre-registered macro-F1 threshold that is narrow rather than categorical.*
