Receipts: Stop Calling It Hallucination
Raw artifacts behind the published finding. The prompts, the outputs, the scoring, and the analysis.
The companion video runs three demos of three different failure modes people lump together as "AI hallucination." Each demo was iterated before it landed; one of them was a complete rebuild from an earlier concept. What follows is the kit to replicate all three yourself, the design history behind why each one is shaped the way it is, and actual outputs from a fresh run so you know what to expect before you run it.
Three demos. Three mechanisms. Three different solutions.
- Beat 2, role framing. Same quarterly data, same task, two roles. Opposite trajectory narratives from the same numbers.
- Beat 4, template contamination. Post-mortem prompt produces ~900 words of generic corporate boilerplate. The labeling follow-up gets the model to mark its own output sentence by sentence.
- Beat 5, confabulation with and without prohibition. Fictional company, five revenue numbers. Three model families invent margins and growth rates. One line added to the prompt redirects them.
The full finding is published at https://blog.clarethium.com/stop-calling-it-hallucination. That piece covers all six types and their mechanisms. These receipts expose the demos that make three of them visible live.
Quick start: run all three in under thirty minutes
All three demos run in any AI chat (ChatGPT, Claude, Gemini, or any other). No setup, no API keys, no Python required.
- Beat 2. Open two chats. Paste Nexacore data plus the CEO prompt in one; paste the same data plus the short-seller prompt in the other. Compare the two trajectory narratives.
- Beat 4. Open one chat. Paste the post-mortem trigger prompt. When the response finishes, paste the labeling follow-up as the next message in the same chat. Watch the model mark most of its own output as generic.
- Beat 5. Open one chat. Paste the Stellex data with the analytical ask. Note any margins or growth rates the output invents. Open a second chat. Paste the same data with the prohibition line added. Compare.
Exact text for all prompts is in prompts.md. The data blocks are in data.md. Keep reading for the design rationale, what was killed along the way, and what the receipts prove.
The design problem
Every demo had to meet one constraint: make a failure mode visible in roughly thirty seconds of screen time. The failure had to happen on camera, not be reported on after the fact. The viewer had to catch the pattern without pausing to interpret.
That constraint forced rebuilds. The demos that shipped are the versions that passed the visibility test. Earlier concepts that didn't pass show up in the beat-by-beat notes below.
Beat 2: role framing
The design problem. Make a frame effect visible without the model doing anything wrong. Both outputs needed to be defensible from the same numbers. The viewer's eye had to catch the flip in seconds, not paragraphs.
What was tried and killed. An earlier version of the prompt asked the model to "build a thesis against [a stated conclusion]." The output flipped, but the demo was unclean: two variables changed between the two runs (the role AND the action). Viewers couldn't tell whether role or action was driving the difference. The prompt was restructured so both runs perform the same action ("prepare the board presentation, write the trajectory narrative") and only the role word changes.
The final shape. Both prompts end with the same task instruction. Only the role word changes: "You are the CEO of Nexacore" versus "You are the short-seller." One variable, maximum flip visible.
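The one-variable claim can be checked mechanically before running anything. The following is a minimal sketch, not the script used in filming: the placeholder strings, the ordering of role, data, and task, and the helper names `build_prompt` and `changed_lines` are all assumptions made for illustration. Only the two role sentences are quoted from the text above; the exact wording is in prompts.md.

```python
import difflib

# Hypothetical placeholders: paste the real text from data.md and prompts.md.
DATA_BLOCK = "<Nexacore data block from data.md>"
TASK = "<shared task instruction from prompts.md>"

# Role sentences as quoted in this document (trailing periods are an assumption).
ROLES = {
    "ceo": "You are the CEO of Nexacore.",
    "short_seller": "You are the short-seller.",
}


def build_prompt(role_sentence: str) -> str:
    """Assemble role, data, and shared task. The ordering is an assumption."""
    return "\n\n".join([role_sentence, DATA_BLOCK, TASK])


def changed_lines(a: str, b: str) -> list[str]:
    """Return only the lines that differ between two prompts."""
    diff = difflib.ndiff(a.splitlines(), b.splitlines())
    # Keep removed ("- ") and added ("+ ") lines; drop unchanged and hint lines.
    return [line for line in diff if line.startswith(("- ", "+ "))]


ceo = build_prompt(ROLES["ceo"])
short = build_prompt(ROLES["short_seller"])
delta = changed_lines(ceo, short)
for line in delta:
    print(line)
```

If the diff shows anything beyond the two role sentences, a second variable has crept into the comparison, which is the failure the earlier version of this demo had.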
Why the Nexacore data. Constructed for defensibility from either frame. Real top-line growth (+31% YoY) for the CEO to anchor on. Real margin compression (71.2% to 68.4%) and net income flip (-$2.1M) for the short-seller to anchor on. Both narratives are honestly defensible from the same numbers. Without that balance, one role would produce an obviously wrong narrative and the demo would fail.
How to replicate. Open two new chats in any AI interface. Paste the Nexacore data block plus the CEO prompt in one (exact text in prompts.md). Paste the same data block plus the short-seller prompt in the other. Compare the trajectory narratives. For the cross-model version (Beat 3 in the video), run the CEO and short-seller prompts through xAI, GPT, and Gemini; the flip holds across all three (full outputs in samples.md).
What you should see.
- CEO output: treats the +31% YoY revenue growth as the headline. Frames the $2.1M net loss as investment, the margin compression as deliberate scaling, the renewal rate drop as normal expansion friction.
- Short-seller output: treats the margin compression and net loss as the headline. Frames the revenue growth as masking quality-of-growth decline, flags the renewal rate drop as structural, surfaces the cash burn.
- The flip: both outputs reference the same numbers. What reverses is which numbers are foreground and which are footnote.
Sample from a fresh run. On the 94-to-91 renewal rate drop: CEO
framed it as "installed base still resilient, renewal quality
softened, not yet structural." Short-seller framed the same number
as "installed base becoming less sticky while management asks
investors to believe in long-term durability." Full outputs and the
cross-model versions are in samples.md.
What this demo does NOT prove. That frame effects are binary. The demo uses dramatic polarity (CEO vs short-seller) to make the effect visible. Real-world frame effects are continuous and often invisible because only one frame is ever requested. The demo shows that the effect exists; it doesn't measure its magnitude across conditions.
Beat 4: template contamination
The design problem. Make training-data leak visible without the viewer needing prior context. The failure is that the model produces structurally identical output for any input that fits a familiar pattern. The demo needed to make the structure visible as structure, not as content.
What was tried and killed. An earlier template demo used a task where the model had enough input to actually specify the content. The output was too coherent on that task; contamination and valid response looked identical from the outside. The template layer couldn't be isolated. The task was rebuilt as a thin-data, format-heavy one: the post-mortem with four facts of input that have to fill ~900 words of structure. With the task that thin, the structure dominates the output by construction.
A "strip the template" instruction was also tested as the intervention. The model stripped corporate language but the structural template persisted. The intervention partially failed. The replacement was to ask the model to LABEL its own output sentence by sentence, making the model a witness against itself instead of trying to change what it produced.
The final shape. A thin-data, format-heavy task. The post-mortem prompt gives the model four facts (migration, no backup, three hours, two customers) and asks for a full post-mortem, so ~900 words of structure have to be filled from four facts of input. The structure dominates the output by construction. Then the critical move: a follow-up prompt asks the model to label its own output sentence by sentence, marking each as specific or generic. No viewer interpretation required.
How to replicate. Open one new chat. Paste the post-mortem trigger prompt. When the response finishes, paste the labeling follow-up as the next message in the same chat. Watch the model mark most of its own output as generic.
What you should see.
- Post-mortem output: ~900 words structured as executive summary, incident timeline, root cause, impact, corrective actions, lessons learned. Most sentences would fit any production outage post-mortem at any company.
- Labeled output: the model marks most sentences as [GENERIC]. The [SPECIFIC] sentences are the four facts you provided (migration, no backup, three hours, two customers) and the immediate scaffolding around them. The corrective actions section tends to be labeled entirely generic.
- The flip: the model becomes a witness against its own output without you needing to argue the point.
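To turn "most of its own output" into a number, the labels can be tallied. This is a minimal sketch that assumes the model writes the literal tags [GENERIC] and [SPECIFIC] once per sentence; actual label formatting may vary by model and run. The function name `label_share` and the example text are hypothetical, written for illustration and not taken from samples.md.

```python
import re

# Matches the two tags named in this section; any other format is ignored.
LABEL = re.compile(r"\[(GENERIC|SPECIFIC)\]")


def label_share(labeled_text: str) -> dict:
    """Count [GENERIC] and [SPECIFIC] tags and compute the generic share."""
    tags = LABEL.findall(labeled_text)
    generic = tags.count("GENERIC")
    specific = tags.count("SPECIFIC")
    total = generic + specific
    return {
        "generic": generic,
        "specific": specific,
        # Share is 0.0 when no tags are found, to avoid dividing by zero.
        "generic_share": generic / total if total else 0.0,
    }


# Hypothetical labeled excerpt for illustration only (not model output).
example = """
[SPECIFIC] The outage began during a database migration.
[GENERIC] We take reliability extremely seriously.
[GENERIC] We will strengthen our change-management process.
[SPECIFIC] Two customers lost data.
[GENERIC] We are committed to continuous improvement.
"""

result = label_share(example)
print(result)
```

Paste the model's labeled reply in place of `example` to get the share for your own run. A single run gives one data point, not a rate.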
Sample from a fresh run. The model's own closing assessment: "Big picture: most of the response was generic boilerplate. The truly incident-specific parts were mainly the concrete facts: database migration, no backup, three hours, and two customers losing data." The full labeled output is in samples.md.
What this demo does NOT prove. That self-labeling works on tasks where the model has reason to defend its output. The post-mortem labeling works because nothing in the model's training tells it to defend generic post-mortem language. On tasks where the model has picked a side or committed to a framing, self-labeling may produce different results. The Self-Check Illusion covers related failures of model self-evaluation.
Beat 5: confabulation with and without prohibition
The design problem. Show confabulation happening AND show the one-line change working, in one continuous demo. The demo needed to make both the problem and the change visible in under ninety seconds, and it needed to hold up across multiple models so single-model behavior couldn't be argued away.
The data choice. The Stellex block provides five revenue numbers and nothing else. The analytical ask (revenue mix, margins, growth rates, risks) is standard for quarterly analysis, but only the first is answerable from this dataset. Whatever margins or growth rates the model produces are invented by construction, not retrieved from the source. That's what makes the confabulation observable: nothing in the data can defend the numbers the model writes.
The intervention choice. Prohibition was chosen over monitoring based on prior controlled testing documented in Source Conditioning: three generators, ~100 documents, prohibition outperformed monitoring by roughly 5x on unsourced-claim rates. The Beat 5 demo is a live replication of that finding on a single task, not a new claim. The specific line used in the demo is "use only data from the source above, for any metric not provided, explicitly state that it is not available rather than estimating."
The cross-model check. The demo runs the same test through three API families (xAI, GPT, Gemini) so single-model behavior can't be argued away.
The final shape. One data block, one analytical ask, two variants of the prompt (with and without the prohibition line), three model families.
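The shape above is a small grid: two prompt variants across three model families. A sketch of how that grid could be laid out follows, under stated assumptions: the placeholders stand in for the real text in data.md and prompts.md, the prohibition line is appended after the ask (its exact position in the real prompt is not confirmed here), and the family names are labels only, with no API calls made.

```python
from itertools import product

# Hypothetical placeholders: paste the real text from data.md and prompts.md.
DATA_BLOCK = "<Stellex data block from data.md>"
ASK = "<analytical ask from prompts.md>"

# Prohibition line as quoted in this section; verify against prompts.md.
PROHIBITION = (
    "use only data from the source above, for any metric not provided, "
    "explicitly state that it is not available rather than estimating."
)

# Family labels as named in the text. These are not API identifiers.
MODELS = ["xAI", "GPT", "Gemini"]


def build_variant(with_prohibition: bool) -> str:
    """Build the prompt; placing the prohibition last is an assumption."""
    parts = [DATA_BLOCK, ASK]
    if with_prohibition:
        parts.append(PROHIBITION)
    return "\n\n".join(parts)


# 3 families x 2 variants = 6 runs to paste and compare.
grid = [
    {"model": model, "prohibition": flag, "prompt": build_variant(flag)}
    for model, flag in product(MODELS, [False, True])
]

for run in grid:
    print(run["model"], "with prohibition" if run["prohibition"] else "baseline")
```

Keeping the variants in one builder makes it harder to change the data or the ask by accident between the two conditions.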
How to replicate. Open one new chat. Paste the Stellex data with
the analytical ask (text in prompts.md). Note any
margins or growth rates the output invents. Open a second chat. Paste
the same data with the prohibition line added. Compare. If you want
the cross-model version, run each in a different AI interface (any
three of ChatGPT, Claude, Gemini, Grok).
What you should see.
- Without prohibition: the output reports operating margins by segment (numbers like "estimated 25-30% for Cloud Platform") and year-over-year growth rates. None of those numbers are in the source data. They were invented during generation.
- With prohibition: the output states explicitly that margins and growth rates are not available in the source. The analytical ask gets reframed as qualitative (risk analysis, segment concentration, revenue mix interpretation) without invented numbers.
- The flip: one line added to the prompt shifts the generation pathway from pattern completion to source grounding. The model compensates by writing more qualitative analysis, not by writing less.
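Spotting invented figures by eye works for one run, and a rough first-pass check can help across several. The sketch below flags numeric tokens in an output that never appear in the source. It is an illustration, not the scoring used for the finding: the function name `unsourced_numbers` is hypothetical, the source block uses placeholder figures that are NOT the actual Stellex data, and the two outputs are invented examples loosely modeled on the pattern described above.

```python
import re

# Integers and decimals. Thousands separators and currency symbols are not handled.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsourced_numbers(source: str, output: str) -> list[str]:
    """Return numeric tokens in the output that never appear in the source."""
    known = set(NUMBER.findall(source))
    flagged = []
    for token in NUMBER.findall(output):
        # Report each unsourced token once, in order of first appearance.
        if token not in known and token not in flagged:
            flagged.append(token)
    return flagged


# Placeholder source with made-up figures (NOT the Stellex data block).
source = "Cloud Platform: 120.0\nSegment B: 80.0"

# Invented example outputs for illustration only.
baseline = "Cloud Platform revenue was 120.0. Estimated operating margin 25-30%."
prohibited = "Cloud Platform revenue was 120.0. Operating margin is not available."

print(unsourced_numbers(source, baseline))
print(unsourced_numbers(source, prohibited))
```

A flagged number is a prompt for review, not proof of confabulation. Revenue-mix percentages legitimately computed from the source will also be flagged, because they are derived and not quoted.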
Sample from a fresh run. Gemini without prohibition wrote "Cloud Platform 28%-34% estimated operating margin. High scalability; high initial Capex but low marginal cost per user." Gemini with prohibition wrote "The following financial metrics are not available: Gross margin, operating margin, and net income figures are not available." Same model, same data, one added line. Full cross-model outputs are in samples.md.
The stronger test. A harder version of this demo keeps the full 4-item ask (which explicitly demands estimated margins and growth rates) AND adds the prohibition line. That is a direct contradiction: items 2 and 3 demand fabrication, and the prohibition sentence forbids it. All three models refuse cleanly: xAI returns "not available" for items 2, 3, and 4; GPT reframes the absence as a risk-analysis signal; Gemini produces a structured refusal with named missing inputs. One prohibition sentence beats the enumerated estimation requests it contradicts. Details are in samples.md.
What this demo does NOT prove. That the prohibition line generalizes to every task. Source Conditioning has the fuller evidence base for reformulation tasks. Reasoning, strategy, and creative tasks are less well characterized. The Beat 5 demo is a live replication on a single task.
What's here
| File | What it is |
|---|---|
| prompts.md | The five prompts verbatim. Copy-paste ready. The primary replication kit. |
| data.md | The two fictional datasets verbatim. |
| samples.md | Actual outputs from a fresh run on 2026-04-24. Full CEO vs short-seller outputs, cross-model excerpts, the labeled post-mortem, before-and-after fabrication pairs, and the Beat 5 stress test result. |
What the receipts prove (and don't)
These receipts prove:
- The prompts are verbatim. The role framing on Beat 2 changes only the role word. Both prompts end with the same task instruction. The demo is clean because only one variable changed.
- The datasets are fictional and sparse enough to make confabulation observable. Stellex provides five revenue numbers and nothing else. Any margins or growth rates the model reports were invented by the model.
- The scripts are the same scripts that ran during filming. Running them against the same model versions should produce outputs that match the beats shown in the video directionally. Exact wording will vary across runs because generation is stochastic.
- Prohibition ("use only data from the source, state when a metric is not available rather than estimating") redirects the output. Confabulation drops. The model compensates by writing qualitative analysis or naming the gaps explicitly.
These receipts do NOT prove:
- That these three failure modes cover every way AI output fails. The published finding names six. The video demonstrates three. Frame effects (Beat 2) are flagged in the finding's honest-limits section as a seventh mechanism tracked for future inclusion.
- That the role dominates the model across every comparison. The three models tested (xAI grok-4-1-fast, GPT gpt-5-mini, Gemini gemini-3-flash-preview) are roughly comparable in capability. For much-less-capable models, the model choice may dominate the role choice. The demo shows role-over-model in this band of models, not as a universal rule.
- That Beat 5's prohibition line generalizes to every generator and every task. Source Conditioning has the fuller evidence base. The Beat 5 demo is a live replication on a single task.
- That self-labeling (Beat 4) works on any output where the model has reason to defend its framing. Tested on a template-dominant task with nothing to defend.
- That the effects have per-run flip rates and variance data. The demos are reported as directional with one full sample run per beat (see samples.md). The Beat 5 stress test adds a second condition showing the prohibition holds against explicit estimation demands across three models. Source Conditioning has wider quantitative evidence for the confabulation mechanism specifically. Running each demo N times across varied datasets to produce rate statistics is a separate experiment. If future work quantifies Beat 2 (frame effects) or Beat 4 (template labeling) at scale, it will earn its own transmission.
Related findings in the canon
The demos map to transmissions on the blog that cover each mechanism in depth, where one exists.
- Beat 2 (role framing): this mechanism is not yet its own transmission on the blog. Tracked for future inclusion. The closest adjacent finding is The Attribution Error.
- Beat 3 (cross-model): The Attribution Error reframes the question "which model should I use" as an attribution error. The context determines whether behaviors happen. The model adjusts how they are expressed.
- Beat 4 (template contamination): Trust Signals Are Inverted covers why corporate-sounding, citation-heavy, confident output is often the least reliable signal of actual grounding.
- Beat 5 (confabulation): Source Conditioning is the full recipe. Three steps. Source material reduces confabulation from 85% to single digits. Prohibition outperforms monitoring by roughly 5x. The Constraint Paradox covers why this kind of constraint works on convergent tasks and can harm exploratory ones.
What the iteration cost
The video took roughly 20 script iterations across 8 days. The template-email demo was rebuilt into the post-mortem demo. The role-framing prompt was restructured after the first version changed two variables at once. The intervention for Beat 4 shifted from "strip the template" to "label the output."
The visible video is three minutes. The work behind it is closer to a week of constrained design. This matters if you try to build similar demos yourself. The first version of a demo is almost never the one that works. The constraint is what makes the demo honest. Iteration until the constraint is satisfied is the job.
Note on model versions
The samples in samples.md were generated against
April 2026 model versions (grok-4-1-fast, gpt-5-mini,
gemini-3-flash-preview). Newer versions may produce different
outputs. The directional findings (role flips the trajectory,
prohibition redirects the confabulation, template contamination is
label-visible) are expected to hold on successor models. The exact
wording will not.
What this opens up next
The three mechanisms shown here are three of the six named in the finding. Drift, interference, and wrong task each have their own visibility constraints and would need their own demo shapes. If future videos cover them, their receipts will follow this same shape: design problem, what was killed, final shape, replication, what it doesn't prove.
This is the first receipts publication to pair the methodology narrative (the iterations behind the final kit) with actual output samples. The earlier receipts for The Fabrication Architecture and Catching Your Own Overclaim are verification-focused. Whether that shape holds for V2 and onward depends on what external readers do with it.
Errata
Found a problem with the demos, the prompts, or the methodology? Send it via LinkedIn DM (linked from /about). Corrections get published on the record at /record, with attribution.
Related receipts
This kit pairs with the video for the same finding and is the first receipts publication to expose the methodology story (killed approaches, design rationale) alongside the replication kit. Companion receipts for related findings are at Catching Your Own Overclaim and The Fabrication Architecture.