Frame Check
By Lovro Lucic ·
The Model Mechanics · 5 of 5
Drop any document in. You see what reading does not show: which analytical perspectives the document covers and which it skips. The voice it speaks in. Which numerical claims have sources behind them and which do not. Named framing patterns the text fires on. A structural reading you cannot get by reading the document yourself.
Works on your own drafts. On AI output. On anything you are about to act on.
Live at frame.clarethium.com. Free to use. The MCP server is open source. The methodology, the validation corpus, and the worked examples are all published openly.
The tool exists because reading uses the frame that wrote the document. So does re-reading. So does asking the model to check its own work, which produces fluent agreement with the frame that produced the answer. The audit shares the substrate of the thing being audited. To see the frame, you need evidence the writing did not use.
Frames determine what conclusions can be drawn. They determine what looks obvious, what looks impossible, what shows up as data and what passes as background. Humans inherit them from language, training, the questions they were taught to ask. Models inherit them from training data and the specific shape of the prompt that activated them. From inside, the frame is invisible. The conclusion feels inevitable. The conclusion the reader trusts is the frame the reader is inside.
The skill that compounds is not "find the right frame." That is still capture. The skill is holding multiple frames lightly. Seeing what each one makes visible. Choosing deliberately rather than inheriting.
This is upstream of the verification problem, the fabrication problem, the AI output quality problem. They are all symptoms of frame-capture. Confident-sounding errors come from frames that did not invite hedging. Wrong numbers come from frames that did not invite checking. Defaults come from frames the user did not know they were inside.
Frame Check reads documents through four independent layers, and tags every finding with which kind of evidence produced it. The first layer is structural and deterministic. Regex-and-pattern detection of which analytical perspectives a document covers, the voice it speaks in, what share of claims carry attribution, the unhedged-vs-hedged ratio. Identical inputs return identical measurements. No model is involved in this layer.
The second layer is verification through structured APIs. Numerical claims get checked against authoritative data sources where coverage exists. SEC EDGAR for US public-company filings. FRED for macroeconomic series. World Bank for country statistics. CoinGecko, Wolfram, others. Per-provider precision and recall surfaced. No verdicts. The verifier publishes its own confusion matrix.
The third layer is a multi-stage claim cascade. Claims that pass the structural floor and the structured-API check escalate, if needed, to web-grounded search. Claims that web grounding cannot resolve get a cross-model check: two independent models, asked the same factual question, no search tools. Agreement points to the answer being in training data. Divergence flags at least one of them as unreliable on that claim. Each stage costs more and resolves more. Each claim exits the moment any stage produces a definitive verdict, with the evidence chain attached.
The fourth layer is interpretation. A different model from the verifier produces what computation cannot reach: what the document is about, what perspective it takes, what it assumes without stating, what decisions it affects. Different model. Different job. Interpretation kept structurally separate from verification, because asking one model to do both produces circular evaluation.
Every finding the tool emits carries a tag for which kind of evidence produced it. A regex match. A classifier output. A composed pattern. An agent reading. The reader always knows which kind of evidence is doing the work.
What frame_check looks like in practice. An LLM was asked to summarize NVIDIA's Q4 fiscal-2024 earnings press release in a "neutral business-news register." Frame Check, run on the summary with the press release as source_text, returned: voice classifier flagged "promotional" anyway, because the summary inherited the source's vocabulary ("record," "surging"); analytical coverage 2 of 5 (stakeholders and trends covered, causes and risks and uncertainty absent, genre-appropriate for an earnings release); source-fidelity 92 percent (23 of 25 numbers in the summary appear as literal digit substrings in the press release; the two non-matches are paraphrases of "a year ago"); four named-frame matches (FVS-001 Frame Amplification, FVS-002 Fluency-Quality Illusion, FVS-007 Failure Framing flagged for absence, FVS-008 Growth Frame). The reader sees that the prompt's neutrality ask did not survive, that the numerical fidelity is high but not perfect, and which frames the summary fell into. Captured bytes, hashes, and full payload are reproducible from the worked-examples corpus.
What frame_compare looks like in practice. The same prompt ("Is Bitcoin a good investment for a 35-year-old saving for retirement?") was run against four frontier LLMs on the same afternoon. The four responses, run through Frame Check's deterministic layer, returned four materially different structural signatures. Three of four models treated the reader as an abstract allocator; only one named who is affected differently (risk tolerance, dependents, horizon). Three of four addressed no uncertainty; only one named what the answer depends on being true. Sourcing was zero for three of four. Frame matches differed: Claude reasoned inside a growth frame (FVS-008), GPT-5 and Grok inside an efficiency frame (FVS-015), Gemini opened to uncertainty (FVS-012). Same question, four reader-experiences. The reader, seeing the frames named, can choose rather than inherit. Full per-model table and captured responses.
The named patterns themselves come from the Frame Vocabulary Standard. It is a 20-entry library of identification signals, generation affordances, worked examples, and honest limits, curated from the vault's experiment series (EXP-002 through EXP-094 at curation time) and a public falsification record. The library ships under CC-BY-4.0 so it can be cited, studied, forked, and extended as a research artifact in its own right.
A tool that flags frames without honest evidence chains becomes its own frame-capture. The user trusts the verdict instead of widening their seeing. So the discipline runs through the architecture. When the structural detector finds no markers for stakeholders, the tool says "no markers detected for stakeholders," not "the document does not address stakeholders." When voice falls into the residual analytical bucket because no other rule fired, the tool says so. Named-pattern matches ship with teaching questions, not labels. The reader keeps the judgment.
The receipts go further. The part of the tool that detects named patterns failed its first scaled test against pre-registered thresholds. The bar for what would count as useful was set, in writing, before the test ran. Three iterations followed: v1 (n=12, macro-F1 0.157), a same-day rule audit producing v2 (n=12, 0.274), and a follow-on study with signal-level additions producing v3 (n=28, 0.360). The pre-registered useful threshold was 0.4. The detector improved meaningfully across iterations and did not cross. The labelers were two coders (curator and LLM-judge); validation against independent human annotators is the next study. The tool shipped anyway, with the gap surfaced in every output, because the rest of the stack survives a single weak detector. The deterministic structural floor stands. The verification layers stand. The named-pattern layer ships as hypothesis-with-evidence, not as verdict.
There are three doors into the practice.
Paste any AI agent's last response into the tool and watch which frame the context was amplifying. The agent that runs frame_check on its own response is not auditing itself in the self-check sense. It is sending the response through an external instrument that uses different evidence than the model that produced it.
Run frame_compare on two analyses of the same subject. Two consultants. Two AI models. Two drafts of your own. The structural diff shows what each frame is making invisible. Not which is right. What each costs the reader.
Use the reframe path on the web app. Rewrite a document from a counter-frame using the affordances published in the Frame Vocabulary Standard library, then run the analysis on both. The structural delta is the teaching. The reframed version is a hypothesis about what different framing looks like, not a claim about what the document should say. The MCP currently exposes frame_check and frame_compare; reframe lives at frame.clarethium.com and is on the roadmap for the agent surface.
Personal practice and AI work, same instrument. The cognitive process that generates frames in humans organizes attention around evidence. The generative process that produces frames in models organizes the next token around context. Both produce conclusions that feel inevitable from inside. Both lose what the frame did not invite. The instrument exists because the skill compounds when the lens becomes visible.
Frame-mobility is not only about AI. It is a way of perceiving that holds lenses lightly and notices when one of them is doing the work. The AI use case is concrete because the frames are detectable in text and the patterns are countable. The personal use case is the same skill. The work in front of you, the recommendation about to be followed, the question that stopped being asked. Each of them sits inside a frame.
Live at frame.clarethium.com. Paste a document, see the structural reading, follow the suggested next moves. For agent use: pip install frame-check-mcp, point any MCP-compatible client at it, and frame_check becomes a tool the agent can call inside any conversation.
It is yours to use. If you run it and find something it missed or got wrong, the GitHub issues are open.
What survived testing
- The architecture in pipeline form: deterministic structural floor, structured-API verification, multi-stage claim cascade with cross-model consensus, interpretation isolated from verification. (Two layers in canonical taxonomy, four layers in pipeline form; see METHODOLOGY.md §1.1.)Copy link
- Cross-model consensus as a verification signal: two independent models without search tools, agreement points to training data, divergence flags at least one as unreliable.Copy link
- Controlled frame-transformation: identical data rewritten from different analytical frames produces measurably different structural profiles.Copy link
- Construct-honesty at the rendering layer: under-detection markers, "no markers detected" rather than "does not address," teaching questions instead of verdicts on named-pattern matches.Copy link
- Per-claim evidence chains: every finding carries a tag for which kind of evidence produced it, so the reader knows which layer is doing the work.Copy link
What didn't survive
- The named-pattern detector across three pre-registered iterations against the pre-registered useful floor of 0.4. v1 macro-F1 0.157 (n=12). v2 0.274 (n=12, same-day rule audit). v3 0.360 (n=28, signal-level additions). All three below the floor. The labelers were two coders (curator and LLM-judge); the LLM-judge is permissive by construction (78 percent of slots flagged versus the curator's 30 percent). Track B with independent human annotators is pending. The named-pattern layer ships as hypothesis-with-evidence; the rest of the stack survives the gap.Copy link
- "Detection equals truth" claim, at the publish layer. Surfaces now read "low structural coverage of X" or "no markers detected" rather than "does not address X."Copy link
- Three v1 detection rules (FVS-001 Frame Amplification, FVS-008 Growth, FVS-015 Efficiency) retired in the same-day v2 audit, because they fired on cases they should not flag. The v3 follow-on study reintroduced two of them with revised signal substrate (S-3 growth vocabulary, S-4 efficiency vocabulary). FVS-001 remains permanently retired. The frame concepts stand as library entries; signal-level rebuild replaced the v1 rules for the two recovered.Copy link
Honest limits
- Structural detection is a necessary precondition for frame analysis, not a sufficient one. A document can pass all structural tests while being subtly manipulative at the semantic level.Copy link
- Source Network coverage is strong for US public-company financials and macroeconomic indicators, sparse to absent for medical, legal, niche-industry, and forecast-laden domains. In the calibration corpus, 86 percent of claims were unverifiable through this layer.Copy link
- The named-pattern detector's measured agreement with the two coders (curator and LLM-judge) remains below the pre-registered useful threshold across three iterations. The architecture survives this gap; the named-pattern layer alone does not.Copy link
- The validation against independent human annotators (Track B) has not yet run. All current numbers are tuning-set or two-coder numbers; held-out generalization is the open empirical question.Copy link
- The tool surfaces what its instruments can see. It does not claim to surface everything that is there.Copy link
Explore other threads
The Fabrication Problem
4 findingsMost AI numbers are fabricated. Source material fixes it. Self-checking fails. Trust signals are backwards.
The Evaluation Problem
2 findingsJudgment goes quiet. You can't see the gaps. Satisfaction is the trap. Stronger evaluators discriminate less.
The "It Depends" Problem
3 findingsSame instruction, opposite results. Specificity is the lever. Context redirects, not informs. The measurement itself was wrong.
New findings when they land.
No spam. Just what held up.