# Track A: expanded corpus + signal-level validation (pre-registration)

**Status:** pre-registration. Written 2026-04-18 before corpus expansion, before re-labeling, before signal-level implementation.
**Relationship to DESIGN.md:** DESIGN.md pre-registered the v1 study (n=12, rule-level detector). This document pre-registers Track A: n=30 expanded corpus + signal-level framing.py changes + pre-declared per-frame effect sizes for the signal-level changes. DESIGN.md's H3 macro-F1 threshold (0.4) is RETAINED, not replaced.
**Scope:** Track A only. Track B (reader-aid external study) is deferred pending Track A outcomes and will be pre-registered separately when it begins.

---

## 1. What this document pre-registers

Three things, locked before any data is touched:

1. **The expanded corpus composition.** Exactly 30 documents, 10 per stratum, with document selection criteria declared. Assembly fails if any selection cannot be made without post-hoc sampling.
2. **The signal-level changes to framing.py.** Four changes, each with a specific purpose and predicted effect size on a specific frame.
3. **The acceptance thresholds per change and for the study overall.** If predicted effect sizes are not met, retirement recommendations harden. The overall macro-F1 = 0.4 threshold from DESIGN.md H3 remains the project-level falsification line.

---

## 2. Hypotheses (pre-declared)

### Per-signal-change hypotheses

**H-A1 (uncertainty regex expansion).** Adding 15 or more hedge constructions to the uncertainty category in framing.py will move FVS-012 Uncertainty Frame F1 (vs majority-union of curator + LLM-judge) from its v1 baseline of 0.182 to **≥ 0.35 on the expanded corpus (n=30)**. Null: if F1 < 0.35, the expanded regex is insufficient to capture how careful readers identify uncertainty in text, and FVS-012 rule-level adjustment is recommended for retirement pending different detection methodology.

**H-A2 (named-author-citation signal).** Adding a new `framing.py` signal that detects `Name (Year)`, `according to Name`, and proper-noun-author-citation patterns, and rewriting FVS-016 Authority by Citation to use this new signal, will move FVS-016 F1 from its v1 baseline of 0.000 to **≥ 0.35 on the expanded corpus**. Null: if F1 < 0.35, FVS-016 is retired.

**H-A3 (growth-vocabulary signal).** Adding a new framing.py regex for growth/expansion vocabulary (prosperity, expand, scale, outperform, grow, accelerate, boom, surge, dominate, breakthrough), and rewriting FVS-008 Growth Frame to use this new signal in combination with voice gate (promotional or advisory), will move FVS-008 F1 from its v1 baseline of 0.000 to **≥ 0.35 on the expanded corpus**. Null: if F1 < 0.35, FVS-008 is retired.

**H-A4 (efficiency-vocabulary signal).** Adding a new framing.py regex for efficiency vocabulary (efficient, optimize, streamline, cost reduction, automate, productive, throughput, lean), and rewriting FVS-015 Efficiency Frame to use this new signal, will move FVS-015 F1 from its v1 baseline of 0.000 to **≥ 0.35 on the expanded corpus**. Null: if F1 < 0.35, FVS-015 is retired.

### Study-level hypotheses

**H-A5 (primary).** The combined Track A changes (v2 rules from RULE_AUDIT + four signal-level additions) will move macro-F1 on the expanded corpus from the expected v1 baseline of ~0.15-0.20 (assumed to fall in the same range on n=30 as on n=12) to **≥ 0.4 (the H3 threshold from DESIGN.md)**. Null: if macro-F1 < 0.4 on the expanded corpus, the project's classifier claim remains in the H3 falsification zone even after signal-level work. This falsification is MORE load-bearing than v1's because the signal substrate has been meaningfully extended, not just rule-tuned.

**H-A6 (secondary).** Effect sizes for Stratum C (Wikipedia encyclopedic) will be larger than for Stratum B (fresh AI-generated). Rationale: the signal-level additions target citation patterns and temporal anchoring that are denser in encyclopedic text. If this does not hold, the signal additions are not doing what they were designed to do.

### Asymmetric reasoning on false positives vs false negatives

Per the stress-test refinement: false negatives (frames actually present but missed) pollute downstream use (teaching questions do not fire, MCP output misses the signal) more than false positives do. The signal-level additions target predominantly-FN frames (FVS-014, FVS-016, FVS-008, FVS-015 in v1; FVS-012 constrained by coverage gate). Therefore the primary metric is **recall** on frames the curator flagged, not just F1.

**H-A7 (recall-specific).** Macro-recall on frames the curator flagged will move from the v1 baseline to ≥ 0.5 on the expanded corpus. If recall does not cross 0.5, the detector is systematically blind to what careful readers see, regardless of FP cleanup, and the signal-substrate hypothesis is wrong.
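
As a minimal sketch of how macro-recall and macro-F1 could be scored, the block below counts per-frame hits document by document. It assumes labels are stored as a mapping from document ID to a set of frame IDs, and that the macro average runs over frames the gold labels flag at least once; both are illustrative assumptions, and the names `frame_counts` and `macro_scores` are hypothetical.

```python
from typing import Dict, Set, Tuple

Labels = Dict[str, Set[str]]  # doc_id -> set of frame IDs (assumed layout)


def frame_counts(gold: Labels, pred: Labels, frame: str) -> Tuple[int, int, int]:
    """Return (tp, fp, fn) for one frame, counted per document."""
    tp = fp = fn = 0
    for doc_id, gold_frames in gold.items():
        in_gold = frame in gold_frames
        in_pred = frame in pred.get(doc_id, set())
        if in_gold and in_pred:
            tp += 1
        elif in_pred:
            fp += 1
        elif in_gold:
            fn += 1
    return tp, fp, fn


def macro_scores(gold: Labels, pred: Labels) -> Dict[str, float]:
    """Macro-average recall and F1 over frames flagged at least once in gold."""
    frames = sorted(set().union(*gold.values()))
    if not frames:
        return {"macro_recall": 0.0, "macro_f1": 0.0}
    recalls, f1s = [], []
    for frame in frames:
        tp, fp, fn = frame_counts(gold, pred, frame)
        recalls.append(tp / (tp + fn))  # tp + fn >= 1: gold flagged this frame
        f1s.append(2 * tp / (2 * tp + fp + fn))
    return {
        "macro_recall": sum(recalls) / len(frames),
        "macro_f1": sum(f1s) / len(frames),
    }
```

Averaging only over gold-flagged frames keeps the recall denominator non-zero; a frame the detector fires on but the curator never flagged would need separate handling.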

---

## 3. Corpus expansion plan (pre-declared)

Expand from n=12 to n=30: 10 per stratum. The original 12 documents are retained unchanged; 18 new documents are added.

### Stratum A: human-authored public documents (n=10 target; currently 2)

Add 8. Selection criteria (pre-declared):
- Public, URL-addressable.
- Author-attributed (known human author or institutional byline).
- English.
- Mix of: essay/op-ed (2), speech or testimony transcript (2), policy document (2), business press release or earnings-call excerpt (1), public academic abstract (1).
- At least 300 and fewer than 4000 words each.

Specific candidate set (frozen before assembly):
- A03: Paul Graham essay (one of "How to Do Great Work", "Cities and Ambition", "How You Know"); pick first accessible.
- A04: a public Senate testimony transcript (Banking Committee, Judiciary, or similar; 2024-2026 window).
- A05: a UN Secretary-General speech transcript (public address, 2024-2026 window).
- A06: a White House or equivalent press statement (any administration, public domain).
- A07: a central bank speech (ECB or Bank of England governor; public).
- A08: a business earnings-call opening statement (SEC-filed 8-K; public).
- A09: a public journal abstract from PubMed (recent, English, not subscription-gated).
- A10: a NYT or equivalent op-ed (2024-2026, accessible).

If any URL returns 404 at fetch time, the fallback is a pre-specified alternate URL from the same genre. The fallback list is documented before fetch. No post-hoc substitution based on content.

### Stratum B: fresh AI-generated analyses (n=10 target; currently 6)

Add 4. Pre-declared prompts (in same voice as v1's B01-B06):
- B07: creative/cultural: "Discuss the state and outlook of traditional publishing in 2026."
- B08: legal/ethical: "Analyze the ethical considerations of AI systems in medical diagnosis."
- B09: educational: "Describe how universities should evaluate online-only degree programs."
- B10: environmental: "Discuss the economic case for corporate carbon-accounting disclosure."

Generation via Anthropic API, same canonical system prompt as v1, temperature 0.7, max tokens 1200. Model: whichever is served first from the candidate list (opus-4-7, sonnet-4-6, haiku-4-5); the served model is recorded.

### Stratum C: Wikipedia encyclopedic articles (n=10 target; currently 4)

Add 6. Selection criteria:
- English Wikipedia article.
- Topically distinct from the existing c01-c04.
- Extract via MediaWiki action API (prop=extracts, explaintext=1), max ~6000 chars on paragraph boundary.

Specific titles (frozen):
- C05: "Nuclear_fusion" (technical-scientific, different from C03 quantum-supremacy)
- C06: "Inflation" (economic, policy-adjacent)
- C07: "Climate_change_mitigation" (policy, contested)
- C08: "Generative_artificial_intelligence" (technical, AI-adjacent)
- C09: "Gerrymandering" (political, contested)
- C10: "Microplastic" (health, scientific)

If any title returns 404, fallback is alternate from same genre (fallback map declared before fetch).
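
The extraction step can be sketched roughly as follows. The block builds the query URL for the MediaWiki action API with `prop=extracts` and `explaintext=1`, and truncates the returned plain text at a paragraph boundary. Treating a newline as the paragraph boundary, and the helper names `extract_url` and `truncate_on_paragraph`, are assumptions for illustration; the actual fetch and JSON parsing are left out.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def extract_url(title: str) -> str:
    """Build the action-API URL requesting a plain-text extract for one title."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "titles": title,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


def truncate_on_paragraph(text: str, max_chars: int = 6000) -> str:
    """Cut text to at most max_chars, backing up to the last newline before the limit."""
    if len(text) <= max_chars:
        return text
    cut = text.rfind("\n", 0, max_chars)
    if cut == -1:
        # No paragraph break before the limit: fall back to a hard cut.
        return text[:max_chars]
    return text[:cut].rstrip()
```

For example, `extract_url("Inflation")` yields a URL for the C06 title, and the fetched extract would then be passed through `truncate_on_paragraph` before being written to `corpus/`.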

### Frozen before proceeding

Once the 30 documents are assembled and written to `corpus/`, the corpus is frozen. Any subsequent edit would constitute a corpus swap and invalidate the pre-registration. The manifest records provenance, sha256, and fetch timestamp per document.
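
One way the per-document manifest record and the freeze check might look is sketched below. The field names and the helpers `manifest_entry` and `is_unchanged` are hypothetical; only the three recorded items (provenance, sha256, fetch timestamp) come from this plan.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


def _sha256(text: str) -> str:
    """Hex digest of the UTF-8 encoded document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def manifest_entry(doc_id: str, text: str, provenance: str,
                   fetched_at: Optional[datetime] = None) -> dict:
    """Build one manifest record (field names are illustrative assumptions)."""
    ts = fetched_at or datetime.now(timezone.utc)
    return {
        "doc_id": doc_id,
        "provenance": provenance,
        "sha256": _sha256(text),
        "fetched_at": ts.isoformat(),
    }


def is_unchanged(entry: dict, text: str) -> bool:
    """True if the text still matches the hash recorded at freeze time."""
    return _sha256(text) == entry["sha256"]
```

Re-running `is_unchanged` over every corpus file before scoring would surface any edit made after the freeze.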

---

## 4. Signal-level changes to framing.py (pre-declared)

All changes are implemented in a new file `framing_v2.py` (not modifying the live `framing.py`); the v2 study runs against v2 signals + v2 rules. Live framing.py is unchanged until the curator decides to adopt.

### S-1. Uncertainty regex expansion

Extend `ANALYTICAL_CATEGORIES["uncertainty"]` regex to include at least 15 additional hedge constructions:
- "undemonstrated", "unproven", "theoretically promising but practically unproven"
- "years or decades", "years away", "decades away"
- "no credible signs", "not yet clear", "remains open"
- "far from settled", "difficult to predict", "subject to revision"
- "hard to say", "too early to", "cannot yet"
- "remains to be seen", "is unclear whether", "debate continues"
- "no consensus", "disputed among"

Existing regex patterns are preserved; new patterns are added. If adding these causes the uncertainty coverage to be marked "covered" (meeting the minimum-marker threshold) on documents where it was previously absent, the coverage-gate issue from v1 is resolved.

**Test against v1 corpus evidence:** `b06_quantum_computing_outlook` and `b01_nvidia_investment` should newly register uncertainty coverage with these expansions. If they do not, the regex expansion is insufficient and H-A1 fails.
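
A minimal sketch of the added hedge patterns, compiled as a standalone regex, is shown below. How the patterns are merged into `ANALYTICAL_CATEGORIES["uncertainty"]` depends on that structure's actual layout, which is not reproduced here; the names `NEW_HEDGES`, `HEDGE_RE`, and `count_new_hedges`, the word-boundary anchoring, and the case-insensitive matching are assumptions.

```python
import re

# The hedge constructions listed above, as literal phrases.
NEW_HEDGES = [
    "undemonstrated", "unproven", "theoretically promising but practically unproven",
    "years or decades", "years away", "decades away",
    "no credible signs", "not yet clear", "remains open",
    "far from settled", "difficult to predict", "subject to revision",
    "hard to say", "too early to", "cannot yet",
    "remains to be seen", "is unclear whether", "debate continues",
    "no consensus", "disputed among",
]


def _phrase_to_regex(phrase: str) -> str:
    """Escape each word and allow any whitespace run between words."""
    return r"\s+".join(re.escape(word) for word in phrase.split())


# Longest phrases first, so a long hedge is not split into its substrings.
HEDGE_RE = re.compile(
    r"\b(?:"
    + "|".join(_phrase_to_regex(p) for p in sorted(NEW_HEDGES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def count_new_hedges(text: str) -> int:
    """Number of non-overlapping matches of the added hedge constructions."""
    return len(HEDGE_RE.findall(text))
```

Running `count_new_hedges` over `b06_quantum_computing_outlook` and `b01_nvidia_investment` before wiring the patterns into the coverage gate would give an early read on whether the expansion registers at all.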

### S-2. Named-author-citation signal

New function in framing_v2.py: `detect_named_author_citation(text) -> dict`. Returns `{matches: int, density_per_1kw: float, examples: list[str]}`.

Detects:
- `Name (Year)` patterns: `[A-Z][a-z]+\s+(\([12]\d{3}\))`
- `according to Name` patterns: case-insensitive "according to" + proper noun
- `Name's Name research` or `Name's Name work` patterns
- Proper-noun-author sequences like `Jean Twenge` or `Jonathan Haidt` (two+ capitalized names in proximity to research-verbs like "found", "argued", "published", "identified", "showed")
- Institutional author patterns: "FDA", "WHO", "CDC", "European Commission", "SEC", "IMF", "World Bank" near claim-verbs

FVS-016 Authority by Citation rule in v2 uses this signal: fires when `named_author_citation_density > 5 per 1kw` (tunable at implementation; pre-declared as 5/kw for this study; if mis-tuned, noted honestly).
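
The signal's shape can be sketched as below. This is a partial illustration under stated assumptions: it covers only three of the listed patterns (`Name (Year)`, `according to Name`, and capitalized name sequences followed by a research verb), omits the possessive and institutional patterns, and counts words by whitespace splitting. The combined alternation is a design choice to avoid double-counting overlapping matches, not something specified above.

```python
import re

NAME_YEAR = r"\b[A-Z][a-z]+\s+\([12]\d{3}\)"
ACCORDING_TO = r"\b(?i:according\s+to)\s+[A-Z][a-z]+"
AUTHOR_VERB = (
    r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\s+"
    r"(?:found|argued|published|identified|showed)\b"
)

# One combined pattern, so each span of text is counted at most once.
CITATION_RE = re.compile("|".join([NAME_YEAR, ACCORDING_TO, AUTHOR_VERB]))


def detect_named_author_citation(text: str) -> dict:
    """Return match count, density per 1,000 words, and up to five examples."""
    examples = [m.group(0) for m in CITATION_RE.finditer(text)]
    words = len(text.split())
    density = 1000.0 * len(examples) / words if words else 0.0
    return {
        "matches": len(examples),
        "density_per_1kw": density,
        "examples": examples[:5],
    }


def authority_by_citation_fires(signal: dict, threshold: float = 5.0) -> bool:
    """FVS-016 gate as pre-declared: density strictly above 5 per 1kw."""
    return signal["density_per_1kw"] > threshold
```

On very short texts the density is dominated by the small word count, so the gate is only meaningful on documents of corpus length.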

### S-3. Growth-vocabulary signal

New function: `detect_growth_vocabulary(text) -> dict`. Regex for expansion/prosperity vocabulary:
- "prosperity", "expand(?:ing|ed|s)?", "grow(?:ing|n|s|th)?", "scaling", "scales"
- "outperform", "surge(?:d|s)?", "boom(?:ing|s)?", "boost(?:ed|ing|s)?"
- "dominate(?:s|d)?", "accelerat(?:e|ed|es|ing)"
- "breakthrough", "record\s+(?:revenue|earnings|growth)"
- "compounds?", "compound(?:ed|ing)", "exponential"

Returns density per 1kw.

FVS-008 Growth Frame rule in v2 uses this signal: fires when `growth_vocabulary_density > 6 per 1kw` AND `voice_type in (promotional, advisory, analytical)` AND `"risks" density < 5 per 1kw`.

The voice gate allows analytical voice because some growth-framed analysis reads as analytical (e.g., b01 nvidia). The risks-density-low gate replaces the v1 "risks in missing" binary, which false-negatives on dismissive-mention cases.
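
A rough sketch of the growth signal and the FVS-008 gate follows. The word-boundary anchoring, case-insensitive matching, and whitespace word count are assumptions; the "risks" density is taken as an input because its computation lives in the existing coverage logic, which is not reproduced here. The accelerate pattern is written in stem form so that "accelerating" matches.

```python
import re

GROWTH_RE = re.compile(
    r"\b(?:prosperity|expand(?:ing|ed|s)?|grow(?:ing|n|s|th)?|scaling|scales"
    r"|outperform|surge(?:d|s)?|boom(?:ing|s)?|boost(?:ed|ing|s)?"
    r"|dominate(?:s|d)?|accelerat(?:e|ed|es|ing)"  # stem form: matches "accelerating"
    r"|breakthrough|record\s+(?:revenue|earnings|growth)"
    r"|compound(?:ed|ing|s)?|exponential)\b",
    re.IGNORECASE,
)

GROWTH_VOICES = {"promotional", "advisory", "analytical"}


def detect_growth_vocabulary(text: str) -> dict:
    """Count growth-vocabulary matches and express them per 1,000 words."""
    matches = GROWTH_RE.findall(text)
    words = len(text.split())
    density = 1000.0 * len(matches) / words if words else 0.0
    return {"matches": len(matches), "density_per_1kw": density}


def growth_frame_fires(growth_density: float, voice_type: str,
                       risks_density: float) -> bool:
    """FVS-008 gate: dense growth vocabulary, permitted voice, sparse risk mentions."""
    return (
        growth_density > 6.0
        and voice_type in GROWTH_VOICES
        and risks_density < 5.0
    )
```

Because "record growth" is consumed as one match, a phrase like that counts once rather than twice.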

### S-4. Efficiency-vocabulary signal

New function: `detect_efficiency_vocabulary(text) -> dict`. Regex for efficiency/optimization vocabulary:
- "efficien(?:t|cy|tly)", "optim(?:al|ize|ization|um)"
- "streamline(?:d|s)?", "productive", "productivity"
- "cost\s+(?:reduction|saving|cutting|control|efficien\w*)"
- "automate(?:d|s)?", "automation", "lean", "throughput"
- "scale", "scales" (overlap with growth; that's acceptable since efficiency and growth can co-occur)
- "reduc(?:e|es|ed|ing)\s+(?:cost|time|overhead|friction|latency)"
- "outsource", "offshore"

Returns density per 1kw.

FVS-015 Efficiency Frame rule in v2 uses this signal: fires when `efficiency_vocabulary_density > 5 per 1kw` AND `voice_type in (promotional, advisory, analytical)`.
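
The efficiency signal could be sketched along the same lines. The patterns below use stem forms (`efficien...`, `reduc...`) so that "efficiency" and "reducing" are matched; word-boundary anchoring and whitespace word counting are assumptions, and a trailing boundary means plurals such as "costs" after "reduce" are not matched in this sketch.

```python
import re

EFFICIENCY_RE = re.compile(
    r"\b(?:efficien(?:t|cy|tly)"  # stem form: matches "efficiency" as well
    r"|optim(?:al|ize|ization|um)"
    r"|streamline(?:d|s)?|productive|productivity"
    r"|cost\s+(?:reduction|saving|cutting|control|efficien\w*)"
    r"|automate(?:d|s)?|automation|lean|throughput|scales?"
    r"|reduc(?:e|es|ed|ing)\s+(?:cost|time|overhead|friction|latency)"
    r"|outsource|offshore)\b",
    re.IGNORECASE,
)

EFFICIENCY_VOICES = {"promotional", "advisory", "analytical"}


def detect_efficiency_vocabulary(text: str) -> dict:
    """Count efficiency-vocabulary matches and express them per 1,000 words."""
    matches = EFFICIENCY_RE.findall(text)
    words = len(text.split())
    density = 1000.0 * len(matches) / words if words else 0.0
    return {"matches": len(matches), "density_per_1kw": density}


def efficiency_frame_fires(efficiency_density: float, voice_type: str) -> bool:
    """FVS-015 gate: density strictly above 5 per 1kw and a permitted voice."""
    return efficiency_density > 5.0 and voice_type in EFFICIENCY_VOICES
```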

### Integration with rule set

The new signals are wired in through a revised module, `frame_library_v3.py`, which extends `frame_library_v2.suggest_frames_v2`: v3 is a superset of v2 that additionally uses the new signals from framing_v2.py. The other v2 changes (retirements, thresholds) are preserved.

---

## 5. Acceptance thresholds (pre-declared; identical for all tests)

| Metric | Acceptance threshold | Decision if threshold is not met (null fires) |
|--------|---------------:|------------------------|
| FVS-012 F1 (H-A1) | ≥ 0.35 | Retire FVS-012 from suggest_frames |
| FVS-016 F1 (H-A2) | ≥ 0.35 | Retire FVS-016 from suggest_frames |
| FVS-008 F1 (H-A3) | ≥ 0.35 | Retire FVS-008 from suggest_frames |
| FVS-015 F1 (H-A4) | ≥ 0.35 | Retire FVS-015 from suggest_frames |
| Study macro-F1 (H-A5) | ≥ 0.4 | Classifier sub-component falsification hardens; retirement set expands to include all frames below per-frame 0.35 |
| Macro-recall on curator-flagged (H-A7) | ≥ 0.5 | Detector is systematically blind to careful reading; signal-substrate hypothesis is wrong; Track A signal additions do not solve the detection problem |

These thresholds are fixed before data collection. Post-hoc adjustment is not allowed per DESIGN.md §8 rule.
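
To make the comparison mechanical, the thresholds can be held as constants and checked in one pass, as in the sketch below. The key names and the `evaluate` helper are hypothetical; the numeric values are the ones pre-declared in the table.

```python
# Pre-declared acceptance thresholds, keyed by hypothesis (key names are illustrative).
THRESHOLDS = {
    "H-A1": 0.35,  # FVS-012 F1
    "H-A2": 0.35,  # FVS-016 F1
    "H-A3": 0.35,  # FVS-008 F1
    "H-A4": 0.35,  # FVS-015 F1
    "H-A5": 0.40,  # study macro-F1
    "H-A7": 0.50,  # macro-recall on curator-flagged frames
}


def evaluate(observed: dict) -> dict:
    """Map each hypothesis to True (threshold met) or False (null fires)."""
    missing = sorted(set(THRESHOLDS) - set(observed))
    if missing:
        raise KeyError(f"no observed value for: {', '.join(missing)}")
    return {name: observed[name] >= limit for name, limit in THRESHOLDS.items()}
```

Raising on a missing metric, rather than skipping it, keeps a partially computed run from being read as a pass.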

---

## 6. Honest limits (pre-declared)

1. **n=30 is better than n=12 but still not large for per-frame CIs.** Expected 95% CI half-width on per-frame F1 at n=30 is ~0.12-0.18. At the wide end of that range, an observed F1 of 0.40 has a CI of roughly [0.22, 0.58]. Point estimates will be taken at face value, but Track A's results should be treated as directional unless replicated on n=60+.
2. **LLM-judge permissiveness persists.** Expanding the corpus does not fix the LLM's over-labeling tendency. Majority-union will still be a permissive ground truth. Reporting will include majority-intersection as the strict alternative.
3. **Single executor.** Same caveat as v1: curator designs, curator labels, curator implements, curator scores. Independent human annotators remain the gold-standard test (Track B). Track A is a pre-Track-B verification that the signal substrate can in-principle support the classifier claim.
4. **Signal additions inherit regex limits.** Named-author-citation detection via regex catches `Name (Year)` and a few other patterns but misses narrative citation (e.g., "a 2023 study from Johns Hopkins suggested..."). The regex is a first approximation.
5. **New signals may interact unexpectedly.** Growth-vocabulary and efficiency-vocabulary regexes have overlap (e.g., "scale"). Uncertainty-regex expansion may push coverage classification in unexpected ways. Side-effects on frames we are not changing (FVS-002, FVS-010 notably) must be checked; if v2 changes regress those frames, the trade-off is noted honestly.
6. **The "retire" outcome on a frame is not permanent.** A future signal, a better rule, or a different classifier substrate (neural) could restore a frame. Retirement under this study is scoped to "does not meet the study's rule+signal ceiling."
7. **Track A does not test system-level value.** It tests the classifier sub-claim at improved precision. Whether the detector's labels, at whatever F1, translate into reader-aid or agent-aid value is Track B's question.
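
For the permissive and strict ground truths mentioned in limit 2, a minimal sketch is given below. It assumes two label sources (curator and LLM-judge), with the union taken as "flagged by either" and the intersection as "flagged by both"; that reading of majority-union and majority-intersection is an assumption here, since the authoritative definition sits in DESIGN.md. The function names are hypothetical.

```python
from typing import Dict, Set

Labels = Dict[str, Set[str]]  # doc_id -> set of frame IDs (assumed layout)


def union_truth(curator: Labels, judge: Labels) -> Labels:
    """Permissive ground truth: a frame counts if either labeler flagged it."""
    docs = sorted(set(curator) | set(judge))
    return {d: curator.get(d, set()) | judge.get(d, set()) for d in docs}


def intersection_truth(curator: Labels, judge: Labels) -> Labels:
    """Strict ground truth: a frame counts only if both labelers flagged it."""
    docs = sorted(set(curator) | set(judge))
    return {d: curator.get(d, set()) & judge.get(d, set()) for d in docs}
```

Scoring the detector against both versions brackets the effect of LLM-judge over-labeling on the reported F1.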

---

## 7. Decision rules regardless of outcome (per DESIGN.md §8)

- All corpus documents, all labels, all detector outputs, all computed metrics are preserved under `validation_study/` with version tags distinguishing v1 and v2.
- REPORT_V2_TRACK_A.md is written after execution, reporting the outcome honestly. If hypotheses fail, the report says they failed. Post-hoc re-interpretation is not allowed.
- If any corpus document fails to assemble (fetch failure for a URL with both primary and fallback returning 404), the document is excluded from n=30 and the study proceeds at the actual n. n < 27 triggers expansion halt and re-design.
- If macro-F1 is stuck below 0.4 on expanded corpus, that is the finding. Track B may still proceed (testing system value of a weak classifier) but the strategic implications documented in REPORT.md §7 harden.

---

## 8. Execution order (strict)

1. Write this DESIGN_v2.md (done when published).
2. Assemble 18 new documents. Use `06_expand_corpus.py` (to be written). Freeze corpus on completion.
3. Re-label all 30 documents (curator labels frozen first, then LLM-judge, same as v1 discipline).
4. Run v1 detector on expanded corpus for baseline (no v2 rule/signal changes yet). Compute v1 baseline F1.
5. Implement v2 rules + v2 signals (framing_v2.py + frame_library_v3.py).
6. Run v2 detector on expanded corpus.
7. Compute metrics, compare to pre-registered thresholds.
8. Write REPORT_V2_TRACK_A.md.
9. Decision on Track B pre-registration follows from Track A outcome.

Deviation from this order violates pre-registration.

---

## 9. What Track A does NOT test

- Reader-aid utility (Track B).
- Agent-self-audit differential (Track B, different use case).
- Vocabulary propagation (long-horizon, external).
- Inter-labeler reliability with independent human annotators (Track B or separate).
- The system-level value claim that makes Frame Check "what the world has never seen" (Track B).

Track A tests whether the classifier sub-component can reach a meaningful F1 threshold on a larger corpus with signal-level additions. That is a necessary-but-insufficient condition for the system-level claim. If the classifier fails at Track A, system-level claims built on detection output (teaching questions, MCP self-audit) rest on a broken foundation. If the classifier succeeds at Track A, Track B becomes the next necessary test.

---

*v1. 2026-04-18. Authored by Lovro Lucic with AI assistance. Pre-registered before corpus expansion, re-labeling, signal-level implementation, or any v2 data. Any deviations noted in REPORT_V2_TRACK_A.md with rationale.*