# Track A Re-designed Expansion (pre-registration v3)

**Status:** pre-registration. Written 2026-04-18 after DESIGN_v2 halt (HALT_NOTICE_v2.md).
**Relationship to prior designs:**
- DESIGN.md (v1): pre-registered the n=12 initial study.
- DESIGN_v2.md: pre-registered the Track A expansion to n=30 via generic HTML fetching; halted when 6 of 8 Stratum A URLs returned navigation or 404 content instead of documents.
- DESIGN_v3.md (this document): re-designs Stratum A expansion around source-specific extractors and content-validation gates; preserves Stratum B and Stratum C from the v2 assembly (both completed successfully).

**Preserved verbatim from DESIGN_v2:**
- Hypotheses H-A1 through H-A7 (signal-level + macro-F1 + recall predictions).
- Signal-level changes S-1 through S-4 (uncertainty regex expansion, named-author-citation, growth vocabulary, efficiency vocabulary).
- Acceptance thresholds (per-frame F1 >= 0.35, macro-F1 >= 0.4, macro-recall >= 0.5).
- Honest limits section.
- H3 falsification threshold (macro-F1 < 0.4) from the original DESIGN.md.

**Changed from DESIGN_v2:**
- Stratum A source list (§3).
- Fetching architecture (source-specific extractors instead of generic HTML stripper).
- Content-validation gates pre-declared (§4).
- Halt threshold: not changed, listed here for emphasis (n < 27 still triggers halt).

---

## 1. What this document pre-registers

Locked before any Stratum A fetching:

1. **Stratum A source list and extractor strategy.** 7 new documents from three stable source types (Paul Graham essays, Stephen Wolfram blog posts, arxiv abstracts), each with a source-specific extractor specified in §3. Total Stratum A target after expansion: 10 (existing a01, a02, a03 plus 7 new).
2. **Content-validation gates.** A fetched document passes only if it meets all three gates in §4. Non-compliant fetches are excluded, not substituted. Halt threshold unchanged.
3. **Hypotheses and acceptance thresholds from DESIGN_v2.** These carry forward unchanged.

---

## 2. What does NOT change

The halt discipline does not change. If the total after combining the v3 Stratum A expansion with the v2 B/C corpus is still below 27, the halt in DESIGN_v2 §7 applies unchanged. This document re-designs the corpus source; it does not lower the threshold.

The signal-level work (S-1 through S-4) was designed based on v1's per-frame evidence and does not change based on corpus assembly outcome.

---

## 3. Stratum A new sources (frozen, 7 documents)

| Doc ID | Source type | Primary URL | Extractor | Expected class |
|--------|-------------|-------------|-----------|----------------|
| a04 | Paul Graham essay | https://www.paulgraham.com/cities.html | PG extractor | Reflective personal essay |
| a05 | Paul Graham essay | https://www.paulgraham.com/know.html | PG extractor | Reflective personal essay |
| a06 | Stephen Wolfram blog | https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/ | Wolfram extractor | Technical reflective essay |
| a07 | Stephen Wolfram blog | https://writings.stephenwolfram.com/2020/04/finally-we-may-have-a-path-to-the-fundamental-theory-of-physics-and-its-beautiful/ | Wolfram extractor | Scientific reflective essay |
| a08 | arxiv abstract | https://export.arxiv.org/api/query?id_list=2005.14165 (GPT-3 paper abstract) | arxiv extractor | Academic research abstract |
| a09 | arxiv abstract | https://export.arxiv.org/api/query?id_list=2108.07258 (Foundation Models) | arxiv extractor | Academic research abstract |
| a10 | arxiv abstract | https://export.arxiv.org/api/query?id_list=2001.08361 (Neural Scaling Laws) | arxiv extractor | Academic research abstract |

**No fallback URLs.** If any URL returns broken content, the document is excluded from n=30. Fallback-chaining was the mechanism that permitted v2's brittle fetches; v3 commits to primary-URL-or-exclude.

Stratum diversity within A: 2 Paul Graham essays + 2 Stephen Wolfram essays + 3 arxiv abstracts. This is acknowledged to be narrower than v2's intended mix (which included speeches, testimony, press releases). The trade-off is operational reliability; narrower genre coverage is named as a limit in §6.

### 3.1 Extractor specifications

**Paul Graham extractor.** paulgraham.com serves essays as static HTML with minimal markup. Extract by: (a) fetch the page, (b) find `<body>...</body>`, (c) strip `<script>`, `<style>`, `<head>`, (d) replace block-element close tags (`</p>`, `</div>`, etc.) with newlines, (e) strip remaining tags, (f) collapse whitespace. Simple; matches what already works on paulgraham.com/greatwork.html in v2 (which gave 1078 valid words).
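
Steps (b) through (f) could be sketched roughly as follows. This is a minimal illustration, not the contents of the fetching script: the function name `extract_pg`, the exact set of block tags, the `<br>` handling, the entity decoding, and the choice to keep paragraph breaks when collapsing whitespace are all assumptions added here. Fetching (step a) is left to the caller so the function operates on a string.

```python
import html
import re

# Assumed set of block-level close tags; the spec only says "</p>, </div>, etc."
_BLOCK_CLOSE = re.compile(r"</(?:p|div|h[1-6]|li|tr|table|blockquote)\s*>", re.I)
_BR = re.compile(r"<br\s*/?>", re.I)  # assumption: <br> counts as a line break


def extract_pg(raw_html: str) -> str:
    """Sketch of steps (b)-(f) for a static HTML essay page."""
    # (b) keep only what sits between <body> and </body>
    m = re.search(r"<body\b[^>]*>(.*?)</body\s*>", raw_html, re.I | re.S)
    if m is None:
        raise LookupError("no <body> element found; treat as fetch failure")
    text = m.group(1)
    # (c) drop script/style/head elements together with their contents
    text = re.sub(r"<(script|style|head)\b[^>]*>.*?</\1\s*>", "", text,
                  flags=re.I | re.S)
    # (d) block-close tags become paragraph breaks, <br> a single newline
    text = _BLOCK_CLOSE.sub("\n\n", text)
    text = _BR.sub("\n", text)
    # (e) strip every remaining tag, then decode entities (assumed extra step)
    text = re.sub(r"<[^>]+>", "", text)
    text = html.unescape(text)
    # (f) collapse whitespace, keeping at most one blank line between paragraphs
    text = re.sub(r"[ \t\r\f\v]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Newlines already present in the HTML source survive as single line breaks in this sketch; whether the real extractor should fold them is not specified above.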

**Wolfram extractor.** writings.stephenwolfram.com uses a WordPress theme with an identifiable content container. Extract by: (a) fetch page, (b) find `<article>...</article>` or failing that `<div class="entry-content">...</div>`, (c) strip `<script>`, `<style>`, `<nav>`, (d) strip images, (e) replace block-close tags with newlines, (f) strip remaining tags, (g) collapse whitespace. If the article container is not found, treat as fetch failure.
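
One way to sketch steps (b) through (g) is with the standard-library `html.parser`, which copes with nested `<div>` elements better than a regular expression would. The class and function names, the set of block tags, and the depth-tracking approach are assumptions for illustration; the sketch also assumes the container's own tag is properly balanced in the page.

```python
import re
from html.parser import HTMLParser

SKIP = {"script", "style", "nav"}          # (c) elements dropped with contents
BLOCK = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
         "li", "blockquote", "section"}   # assumed block-level set for (e)


class _ContainerText(HTMLParser):
    """Collects text inside the first element accepted by `match`."""

    def __init__(self, match):
        super().__init__(convert_charrefs=True)
        self.match = match    # predicate(tag, attrs_dict) -> bool
        self.tag = None       # tag name of the container once found
        self.depth = 0        # nesting depth of same-named tags inside it
        self.skip = 0         # > 0 while inside script/style/nav
        self.found = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            if not self.found and self.match(tag, dict(attrs)):
                self.found, self.tag, self.depth = True, tag, 1
            return
        if tag == self.tag:
            self.depth += 1
        if tag in SKIP:
            self.skip += 1

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        if tag in SKIP and self.skip:
            self.skip -= 1
        if tag in BLOCK:
            self.parts.append("\n\n")     # (e) block close becomes a break
        if tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        # (d), (f): images and other tags carry no text, so only data is kept
        if self.depth and not self.skip:
            self.parts.append(data)


def extract_wolfram(raw_html: str) -> str:
    matchers = (
        lambda tag, a: tag == "article",
        lambda tag, a: tag == "div"
        and "entry-content" in (a.get("class") or "").split(),
    )
    for match in matchers:               # (b) <article> first, then fallback
        parser = _ContainerText(match)
        parser.feed(raw_html)
        parser.close()
        if parser.found:
            text = "".join(parser.parts)
            text = re.sub(r"[ \t\r\f\v]+", " ", text)   # (g)
            text = re.sub(r" ?\n ?", "\n", text)
            text = re.sub(r"\n{3,}", "\n\n", text)
            return text.strip()
    raise LookupError("article container not found; treat as fetch failure")
```

Raising on a missing container, rather than returning an empty string, keeps the "treat as fetch failure" rule explicit at the call site.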

**arxiv extractor.** Use the arxiv export API: `https://export.arxiv.org/api/query?id_list=XXXX.XXXXX` returns Atom XML. Extract: the `<summary>` element contains the abstract in plain text. No HTML stripping needed. Keep only the summary; discard title/author/metadata.
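
A minimal sketch of the summary extraction, assuming the response is an Atom feed whose first `<entry>` holds the requested paper. The function name and the decision to join wrapped lines into a single paragraph are assumptions; the Atom namespace URI is the standard one. The HTTP request itself is left to the caller.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # standard Atom namespace


def extract_arxiv_abstract(atom_xml: str) -> str:
    """Return the abstract text from an arxiv export API response."""
    root = ET.fromstring(atom_xml)
    entry = root.find(f"{ATOM}entry")
    if entry is None:
        raise LookupError("no <entry> in feed; treat as fetch failure")
    summary = entry.find(f"{ATOM}summary")
    if summary is None or not (summary.text or "").strip():
        raise LookupError("no <summary> in entry; treat as fetch failure")
    # Keep only the summary; title, authors, and other metadata are ignored.
    # Assumption: wrapped lines are joined into one whitespace-normalized string.
    return " ".join(summary.text.split())
```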

Each extractor is implemented as a distinct function in the fetching script (§8); the source-specific approach replaces v2's generic `_strip_html` that was the root cause of navigation-as-content failures.

---

## 4. Content-validation gates (pre-declared)

Every fetched document is validated against three gates before inclusion. Gates are checked in order; first failure excludes the document.

**Gate G1: word count.** Document must have >= 300 words (simple whitespace split), except arxiv abstracts, for which the pre-declared minimum is 150 words (§6, item 3). This gate was present in v2 but was insufficient alone.

**Gate G2: paragraph count.** Document must contain >= 3 "paragraphs" where a paragraph is defined as: text separated from adjacent text by 2+ consecutive newlines, AND containing >= 40 words. The 40-word minimum rules out short navigation fragments (menu items, captions, section headers) that otherwise pass as "paragraphs".

**Gate G3: navigation-ratio.** Count the non-empty lines in the document, then count those with 5 or fewer whitespace-separated tokens (these are typical of menus, headers, and nav items). If the short-line count exceeds 30% of the non-empty line count, exclude. This gate specifically targets the v2 failure mode where listing pages had mostly short items (a06 WhiteHouse, a10 NBER).

A document passing all three gates is valid. A document failing any gate is excluded. Gate values are locked before fetching; no post-hoc relaxation.
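
The three gates, checked in order, might look like the sketch below. Function names and the return convention (name of the first failed gate, or `None`) are illustrative assumptions; the numeric values are the ones declared above. The `g1_min_words` parameter exists only so a caller can pass the arxiv-specific minimum named in §6.

```python
import re


def gate_g1(text: str, min_words: int = 300) -> bool:
    # G1: simple whitespace split
    return len(text.split()) >= min_words


def gate_g2(text: str, min_paragraphs: int = 3, min_words: int = 40) -> bool:
    # G2: paragraphs are separated by 2+ consecutive newlines and
    # only count if they contain at least `min_words` words
    paragraphs = re.split(r"\n{2,}", text)
    qualifying = [p for p in paragraphs if len(p.split()) >= min_words]
    return len(qualifying) >= min_paragraphs


def gate_g3(text: str, max_tokens: int = 5, max_ratio: float = 0.30) -> bool:
    # G3: share of short lines among non-empty lines must not exceed 30%
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    short = sum(1 for ln in lines if len(ln.split()) <= max_tokens)
    return short / len(lines) <= max_ratio


def first_failed_gate(text: str, g1_min_words: int = 300):
    """Return 'G1', 'G2', or 'G3' for the first failing gate, else None."""
    checks = (
        ("G1", lambda: gate_g1(text, g1_min_words)),
        ("G2", lambda: gate_g2(text)),
        ("G3", lambda: gate_g3(text)),
    )
    for name, check in checks:
        if not check():
            return name   # first failure excludes the document
    return None
```

A document is included only when `first_failed_gate(...)` returns `None`; recording the returned gate name for excluded documents makes the exclusion reason auditable.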

---

## 5. Hypotheses, acceptance thresholds, signal-level changes

All carried forward verbatim from DESIGN_v2 §2, §4, §5. For brevity not repeated here; those sections of DESIGN_v2 remain the authoritative reference for what the v3 test is measuring and what the thresholds are.

**One clarification:** acceptance thresholds were set assuming n=30. At n=29, 28, or 27 (if 1, 2, or 3 Stratum A docs fail v3 gates), the per-frame 95% CI half-widths will be modestly wider. This is named as an honest limit in §6 and does not cause threshold relaxation.
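
To give a rough sense of "modestly wider": under the simplifying assumption that a CI half-width scales as 1/sqrt(n), the widening relative to n=30 can be computed directly. This scaling is an assumption for illustration only; the actual per-frame intervals depend on the CI method and per-frame counts defined in DESIGN_v2, which are not restated here.

```python
import math


def relative_half_width(n: int, n_ref: int = 30) -> float:
    """Half-width at sample size n relative to n_ref, assuming 1/sqrt(n) scaling."""
    return math.sqrt(n_ref / n)


for n in (30, 29, 28, 27):
    print(n, round(relative_half_width(n), 3))
```

Under that assumption the half-width at n=27 is sqrt(30/27), about 5% wider than at n=30.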

---

## 6. Honest limits of v3 (pre-declared)

1. **Stratum A is narrower in genre than DESIGN_v2 intended.** 2 PG + 2 Wolfram + 3 arxiv covers "reflective essay by technologist" and "academic abstract." It does not cover "government speech," "press release," or "congressional testimony." The test's finding is conditioned on this narrower sample.
2. **Paul Graham is one author.** 3 documents from the same author (a03, a04, a05) means stratum A has author concentration. Not ideal; acknowledged.
3. **arxiv abstracts are short (typically 150-300 words).** The default G1 word-count gate (300 words) would exclude abstracts in this range. Pre-declared: for arxiv sources, G1 is lowered to 150 words (enough for a real abstract to pass, strict enough to exclude navigation).
4. **Extractors are not exhaustively tested.** The Wolfram extractor uses `<article>` or `entry-content` class: correct for the two specific URLs declared, but may fail on other Wolfram essays if the theme differs.
5. **Halt remains possible.** If 4 or more of the 7 new Stratum A URLs fail validation, total n drops below 27 and halt fires again (3 failures leave n = 27, which still proceeds). The v3 design is more reliable than v2 but still not guaranteed.

---

## 7. What success and failure look like after this expansion

**Success:** 7 Stratum A documents pass validation. Total n = 30 (A=10, B=10, C=10). Proceed to re-labeling, signal-level implementation, and v1/v2 detector comparison against pre-registered thresholds.

**Partial success (27 <= n < 30):** some Stratum A documents failed validation. Proceed with explicit imbalance noted (A<10); findings are directional with wider CIs.

**Failure (n < 27):** halt per DESIGN_v2 §7; re-design or accept smaller study per curator decision.
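
The three outcomes follow mechanically from how many of the 7 new Stratum A documents pass validation. A small sketch, using the counts stated in this document (3 existing Stratum A documents, 10 each in B and C); the function name and return format are illustrative assumptions:

```python
def classify_outcome(passed_new_a: int, existing_a: int = 3,
                     b: int = 10, c: int = 10,
                     target: int = 30, halt_below: int = 27):
    """Map the number of new Stratum A documents that passed to (n, outcome)."""
    if not 0 <= passed_new_a <= 7:
        raise ValueError("passed_new_a must be between 0 and 7")
    n = existing_a + passed_new_a + b + c
    if n < halt_below:
        return n, "failure: halt"
    if n < target:
        return n, "partial success"
    return n, "success"
```

With these counts, at least 4 of the 7 new documents must pass for the study to proceed.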

---

## 8. Execution order

1. Write this DESIGN_v3.md (completion = publication).
2. Implement `07_assemble_stratum_a_v3.py` with source-specific extractors + content-validation gates.
3. Run the assembly; validate each document against G1-G3.
4. Update manifest to v5.
5. If n >= 27: proceed to Step 6. If n < 27: halt.
6. Label all 27-30 documents (curator + LLM-judge). Previously labeled documents retain their labels; only new ones need fresh labels.
7. Implement signal-level changes in `framing_v2.py` (S-1 through S-4 per DESIGN_v2 §4).
8. Implement v3 detector wiring in `frame_library_v3.py` (extends v2 rules with new signals).
9. Run v1 baseline on full corpus; run v3 detector on full corpus.
10. Compute metrics against pre-registered thresholds.
11. Write REPORT_V2_TRACK_A.md honestly per DESIGN_v2 §7 rules.

Deviation from this order or post-hoc rule relaxation constitutes pre-registration violation.
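
Step 10 could be sketched as below, assuming metrics are derived from per-frame true-positive, false-positive, and false-negative counts and that every frame must clear the per-frame F1 threshold. The input format, function names, and frame names in the usage are hypothetical; DESIGN_v2 remains the authoritative definition of the metrics.

```python
def prf(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def check_thresholds(counts: dict, min_frame_f1: float = 0.35,
                     min_macro_f1: float = 0.4,
                     min_macro_recall: float = 0.5) -> dict:
    """counts maps frame name -> (tp, fp, fn); macro scores are unweighted means."""
    per_frame = {frame: prf(*c) for frame, c in counts.items()}
    k = len(per_frame)
    macro_f1 = sum(v[2] for v in per_frame.values()) / k
    macro_recall = sum(v[1] for v in per_frame.values()) / k
    return {
        "per_frame_f1": {f: v[2] for f, v in per_frame.items()},
        "per_frame_f1_ok": all(v[2] >= min_frame_f1 for v in per_frame.values()),
        "macro_f1": macro_f1,
        "macro_f1_ok": macro_f1 >= min_macro_f1,
        "macro_recall": macro_recall,
        "macro_recall_ok": macro_recall >= min_macro_recall,
    }


# Hypothetical counts for two hypothetical frames
result = check_thresholds({"growth": (8, 2, 2), "uncertainty": (3, 3, 3)})
```

Threshold values are passed as defaults rather than hard-coded in comparisons so that the locked numbers appear in exactly one place.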

---

*v3. 2026-04-18. Authored by Lovro Lucic with AI assistance. Pre-registered before Stratum A v3 fetching, before signal-level implementation, before any new labels. Second pre-registration after DESIGN_v2 halt.*
