# Receipts: Stop Calling It Hallucination

The companion video runs three demos of three different failure modes
people lump together as "AI hallucination." Each demo was iterated
before it landed; one of them was a complete rebuild from an earlier
concept. What follows is the kit to replicate all three yourself, the
design history behind each one's final shape, and actual outputs from
a fresh run so you know what to expect before you run them.

Three demos. Three mechanisms. Three different solutions.

- **Beat 2, role framing.** Same quarterly data, same task, two roles.
  Opposite trajectory narratives from the same numbers.
- **Beat 4, template contamination.** Post-mortem prompt produces ~900
  words of generic corporate boilerplate. The labeling follow-up gets
  the model to mark its own output sentence by sentence.
- **Beat 5, confabulation with and without prohibition.** Fictional
  company, five revenue numbers. Three model families invent margins
  and growth rates. One line added to the prompt redirects them.

The full finding is published at
<https://blog.clarethium.com/stop-calling-it-hallucination>. That piece
covers all six types and their mechanisms. These receipts expose the
demos that make three of them visible live.

## Quick start: run all three in under thirty minutes

All three demos run in any AI chat (ChatGPT, Claude, Gemini, or any
other). No setup, no API keys, no Python required.

- **Beat 2.** Open two chats. Paste Nexacore data plus the CEO prompt
  in one; paste the same data plus the short-seller prompt in the
  other. Compare the two trajectory narratives.
- **Beat 4.** Open one chat. Paste the post-mortem trigger prompt.
  When the response finishes, paste the labeling follow-up as the next
  message in the same chat. Watch the model mark most of its own
  output as generic.
- **Beat 5.** Open one chat. Paste the Stellex data with the
  analytical ask. Note any margins or growth rates the output invents.
  Open a second chat. Paste the same data with the prohibition line
  added. Compare.

Exact text for all prompts is in [`prompts.md`](./prompts.md). Data
blocks are in [`data.md`](./data.md). Keep reading for design
rationale, what was killed along the way, and what the receipts prove.

## The design problem

Every demo had to meet one constraint: make a failure mode visible in
roughly thirty seconds of screen time. The failure had to happen on
camera, not be reported on after the fact. The viewer had to catch the
pattern without pausing to interpret.

That constraint forced rebuilds. The demos that shipped are the
versions that passed the visibility test. Earlier concepts that
didn't pass show up in the beat-by-beat notes below.

## Beat 2: role framing

**The design problem.** Make a frame effect visible without the model
doing anything wrong. Both outputs needed to be defensible from the
same numbers. The viewer's eye had to catch the flip in seconds, not
paragraphs.

**What was tried and killed.** An earlier version of the prompt asked
the model to "build a thesis against [a stated conclusion]." The
output flipped, but the demo was unclean: two variables changed
between the two runs (the role AND the action). Viewers couldn't tell
whether role or action was driving the difference. The prompt was
restructured so both runs perform the same action ("prepare the board
presentation, write the trajectory narrative") and only the role word
changes.

**The final shape.** Both prompts end with the same task instruction.
Only the role word changes: "You are the CEO of Nexacore" versus "You
are the short-seller." With one variable changed, the flip can only be
attributed to the role.
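
The single-variable property can be checked mechanically before a run. This is optional (the demo itself needs no code) and is only a minimal sketch: `DATA_BLOCK`, `TASK`, and the `build_prompt` layout are placeholders and assumptions, not the kit's actual text or structure. The real wording lives in `prompts.md` and `data.md`; only the two role strings are taken from the paragraph above.

```python
import difflib

# Placeholders: paste the real text from data.md and prompts.md.
DATA_BLOCK = "<paste the Nexacore data block here>"
TASK = "<paste the shared task instruction here>"

# Role strings as quoted in this README; check prompts.md for exact wording.
ROLE_LINES = {
    "ceo": "You are the CEO of Nexacore",
    "short_seller": "You are the short-seller",
}


def build_prompt(role_line: str) -> str:
    """Assemble data block, role line, shared task (hypothetical layout)."""
    return "\n\n".join([DATA_BLOCK, role_line, TASK])


def changed_lines(a: str, b: str) -> list[str]:
    """Return only the lines that differ, prefixed with '- ' or '+ '."""
    diff = difflib.ndiff(a.splitlines(), b.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


prompts = {name: build_prompt(line) for name, line in ROLE_LINES.items()}
delta = changed_lines(prompts["ceo"], prompts["short_seller"])

# A clean single-variable pair differs by one removed and one added line.
for line in delta:
    print(line)
```

If the diff shows more than one removed and one added line, something other than the role changed between the two prompts, which is the problem the first version of this demo had.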

**Why the Nexacore data.** Constructed for defensibility from either
frame. Real top-line growth (+31% YoY) for the CEO to anchor on. Real
margin compression (71.2% to 68.4%) and net income flip (-$2.1M) for
the short-seller to anchor on. Both narratives are honestly defensible
from the same numbers. Without that balance, one role would produce an
obviously wrong narrative and the demo would fail.

**How to replicate.** Open two new chats in any AI interface. Paste
the Nexacore data block plus the CEO prompt in one (exact text in
[`prompts.md`](./prompts.md)). Paste the same data block plus the
short-seller prompt in the other. Compare the trajectory narratives.
For the cross-model version (Beat 3 in the video), run the CEO and
short-seller prompts through xAI, GPT, and Gemini; the flip holds
across all three (full outputs in
[`samples.md`](./samples.md#beat-3-cross-model-role-flip-cli-three-api-families)).

**What you should see.**

- CEO output: treats the +31% YoY revenue growth as the headline.
  Frames the $2.1M net loss as investment, the margin compression as
  deliberate scaling, the renewal rate drop as normal expansion
  friction.
- Short-seller output: treats the margin compression and net loss as
  the headline. Frames the revenue growth as masking quality-of-growth
  decline, flags the renewal rate drop as structural, surfaces the
  cash burn.
- The flip: both outputs reference the same numbers. What reverses is
  which numbers are foreground and which are footnote.

**Sample from a fresh run.** On the 94-to-91 renewal rate drop: CEO
framed it as *"installed base still resilient, renewal quality
softened, not yet structural."* Short-seller framed the same number
as *"installed base becoming less sticky while management asks
investors to believe in long-term durability."* Full outputs and the
cross-model versions are in [`samples.md`](./samples.md#beat-2-role-framing-chatgpt-web).

**What this demo does NOT prove.** That frame effects are binary. The
demo uses dramatic polarity (CEO vs short-seller) to make the effect
visible. Real-world frame effects are continuous and often invisible
because only one frame is ever requested. The demo shows that the
effect exists; it doesn't measure its magnitude across conditions.

## Beat 4: template contamination

**The design problem.** Make training-data leak visible without the
viewer needing prior context. The failure is that the model produces
structurally identical output for any input that fits a familiar
pattern. The demo needed to make the structure visible as structure,
not as content.

**What was tried and killed.** An earlier template demo used a task
where the model had enough input to actually specify the content. The
output was too coherent on that task; contamination and valid response
looked identical from the outside. The template layer couldn't be
isolated. The task was rebuilt as a thin-data, format-heavy one: the
post-mortem with four facts of input that have to fill ~900 words of
structure. With the task that thin, the structure dominates the
output by construction.

A "strip the template" instruction was also tested as the
intervention. The model stripped corporate language but the
structural template persisted. The intervention partially failed. The
replacement was to ask the model to LABEL its own output sentence by
sentence, making the model a witness against itself instead of trying
to change what it produced.

**The final shape.** A thin-data, format-heavy task. The post-mortem
prompt gives the model four facts (migration, no backup, three hours,
two customers) and asks for a full post-mortem, so roughly 900 words
of structure have to be filled from four facts of input. Then the
critical move: a follow-up prompt asks the model to label its own
output sentence by sentence, marking each as specific or generic. No
viewer interpretation is required.

**How to replicate.** Open one new chat. Paste the post-mortem trigger
prompt. When the response finishes, paste the labeling follow-up as
the next message in the same chat. Watch the model mark most of its
own output as generic.

**What you should see.**

- Post-mortem output: ~900 words structured as executive summary,
  incident timeline, root cause, impact, corrective actions, lessons
  learned. Most sentences would fit any production outage post-mortem
  at any company.
- Labeled output: the model marks most sentences as [GENERIC]. The
  [SPECIFIC] sentences are the four facts you provided (migration, no
  backup, three hours, two customers) and the immediate scaffolding
  around them. The corrective actions section tends to be labeled
  entirely generic.
- The flip: the model becomes a witness against its own output
  without you needing to argue the point.
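
If you want a count rather than an impression, the labels can be tallied. A minimal sketch, assuming the model writes its tags literally as `[GENERIC]` and `[SPECIFIC]`; models may format labels differently, in which case the pattern needs adjusting. The `example` string is an illustrative stand-in, not a real model output.

```python
import re
from collections import Counter

# Assumes tags appear literally as [GENERIC] / [SPECIFIC].
LABEL_PATTERN = re.compile(r"\[(GENERIC|SPECIFIC)\]")


def tally_labels(labeled_text: str) -> Counter:
    """Count [GENERIC] and [SPECIFIC] tags in the labeled output."""
    return Counter(LABEL_PATTERN.findall(labeled_text))


def generic_share(counts: Counter) -> float:
    """Fraction of tagged sentences marked generic; 0.0 if none were tagged."""
    total = counts["GENERIC"] + counts["SPECIFIC"]
    return counts["GENERIC"] / total if total else 0.0


# Illustrative stand-in, NOT a real model output.
example = (
    "[SPECIFIC] The outage occurred during a database migration.\n"
    "[GENERIC] We take reliability seriously.\n"
    "[GENERIC] We will strengthen our change-management process.\n"
    "[SPECIFIC] Two customers lost data.\n"
    "[GENERIC] Lessons learned will be shared across teams.\n"
)

counts = tally_labels(example)
print(counts["GENERIC"], counts["SPECIFIC"], round(generic_share(counts), 2))
```

Paste the model's labeled response in place of `example`. The share is a description of one run, not a rate statistic.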

**Sample from a fresh run.** The model's own closing assessment:
*"Big picture: most of the response was generic boilerplate. The
truly incident-specific parts were mainly the concrete facts:
database migration, no backup, three hours, and two customers losing
data."* Full labeled output in
[`samples.md`](./samples.md#beat-4-template-contamination-chatgpt-web).

**What this demo does NOT prove.** That self-labeling works on tasks
where the model has reason to defend its output. The post-mortem
labeling works because nothing in the model's training tells it to
defend generic post-mortem language. On tasks where the model has
picked a side or committed to a framing, self-labeling may produce
different results. [The Self-Check Illusion](/self-check-illusion)
covers related failures of model self-evaluation.

## Beat 5: confabulation with and without prohibition

**The design problem.** Show confabulation happening AND show the
one-line change working, in one continuous demo. The demo needed to
make both the problem and the change visible in under ninety seconds,
and it needed to hold up across multiple models so single-model
behavior couldn't be argued away.

**The data choice.** The Stellex block provides five revenue numbers
and nothing else. The analytical ask (revenue mix, margins, growth
rates, risks) is standard for quarterly analysis, but only the first
is answerable from this dataset. Whatever margins or growth rates the
model produces are invented by construction, not retrieved from the
source. That's what makes the confabulation observable: nothing in
the data can defend the numbers the model writes.

**The intervention choice.** Prohibition was chosen over monitoring
based on prior controlled testing documented in
[Source Conditioning](/source-conditioning): three generators, ~100
documents, prohibition outperformed monitoring by roughly 5x on
unsourced-claim rates. The Beat 5 demo is a live replication of that
finding on a single task, not a new claim. The specific line used in
the demo is "use only data from the source above, for any metric not
provided, explicitly state that it is not available rather than
estimating."

**The cross-model check.** The demo runs the same test through three
API families (xAI, GPT, Gemini) so single-model behavior can't be
argued away.

**The final shape.** One data block, one analytical ask, two variants
of the prompt (with and without the prohibition line), three model
families.
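
That shape is a 2 x 3 grid of runs. A minimal sketch of laying it out, assuming placeholder text for the data block and the ask, and assuming the prohibition line is appended after the ask (check `prompts.md` for the actual text and placement). `build_prompt` is a hypothetical helper. No API calls are made here; sending each prompt to a model is left to whatever chat window or client you use.

```python
from itertools import product

# Prohibition line as quoted in this README; verify against prompts.md.
PROHIBITION = (
    "use only data from the source above, for any metric not provided, "
    "explicitly state that it is not available rather than estimating."
)

# Placeholders: paste the real text from data.md and prompts.md.
DATA_BLOCK = "<paste the Stellex data block here>"
ASK = "<paste the analytical ask here>"

MODEL_FAMILIES = ["xAI", "GPT", "Gemini"]


def build_prompt(with_prohibition: bool) -> str:
    """Assemble the prompt; placement of the prohibition is an assumption."""
    parts = [DATA_BLOCK, ASK]
    if with_prohibition:
        parts.append(PROHIBITION)
    return "\n\n".join(parts)


# One entry per (model family, variant) pair: six runs in total.
runs = [
    {"family": family, "prohibition": flag, "prompt": build_prompt(flag)}
    for family, flag in product(MODEL_FAMILIES, [False, True])
]

for run in runs:
    variant = "with" if run["prohibition"] else "without"
    print(run["family"], variant, "prohibition")
```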

**How to replicate.** Open one new chat. Paste the Stellex data with
the analytical ask (text in [`prompts.md`](./prompts.md)). Note any
margins or growth rates the output invents. Open a second chat. Paste
the same data with the prohibition line added. Compare. If you want
the cross-model version, run each in a different AI interface (any
three of ChatGPT, Claude, Gemini, Grok).

**What you should see.**

- Without prohibition: the output reports operating margins by segment
  (numbers like "estimated 25-30% for Cloud Platform") and
  year-over-year growth rates. None of those numbers are in the
  source data. They were invented during generation.
- With prohibition: the output states explicitly that margins and
  growth rates are not available in the source. The analytical ask
  gets reframed as qualitative (risk analysis, segment concentration,
  revenue mix interpretation) without invented numbers.
- The flip: one line added to the prompt shifts the generation
  pathway from pattern completion to source grounding. The model
  compensates by writing more qualitative analysis, not by writing
  less.
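
Spotting invented figures by eye works for a demo; a rough mechanical check can flag candidates for a closer look. A minimal sketch, assuming a simple numeric-token comparison: the `source` string uses placeholder values, not the actual Stellex figures, and the first sentence of `output` is the Gemini excerpt quoted in this README.

```python
import re

# Matches integers and decimals, with optional thousands separators.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def numbers_in(text: str) -> set[str]:
    """Extract numeric tokens, dropping thousands separators."""
    return {m.group().replace(",", "") for m in NUMBER.finditer(text)}


def unsupported_numbers(source: str, output: str) -> set[str]:
    """Numbers present in the output but nowhere in the source block."""
    return numbers_in(output) - numbers_in(source)


# Placeholder values, NOT the actual Stellex data block.
source = "Cloud Platform revenue: 120\nSegment B revenue: 80"

# First sentence is the Gemini excerpt quoted in this README.
output = (
    "Cloud Platform 28%-34% estimated operating margin. "
    "Cloud Platform revenue was 120."
)

print(sorted(unsupported_numbers(source, output)))
```

The check is deliberately crude. It also flags numbers legitimately derived from the source, such as revenue-mix percentages, so treat its output as a list of figures to verify by hand, not as a confabulation count.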

**Sample from a fresh run.** Gemini without prohibition wrote *"Cloud
Platform 28%-34% estimated operating margin. High scalability; high
initial Capex but low marginal cost per user."* Gemini with
prohibition wrote *"The following financial metrics are not
available: Gross margin, operating margin, and net income figures are
not available."* Same model, same data, one added line. Full
cross-model outputs in
[`samples.md`](./samples.md#beat-5-confabulation-with-and-without-prohibition-cli).

**The stronger test.** A harder version of this demo keeps the full
4-item ask (which explicitly demands estimated margins and growth
rates) AND adds the prohibition line. That is a direct contradiction:
items 2 and 3 demand fabrication, and the prohibition forbids it. All
three models refuse cleanly: xAI returns "not available" for items 2,
3, and 4; GPT reframes the absence as risk-analysis signal; Gemini
produces a structured refusal with named missing inputs. One
prohibition sentence beats the explicit estimation requests it
contradicts.
Details in
[`samples.md`](./samples.md#beat-5-stress-test-prohibition-beats-explicit-estimation-requests).

**What this demo does NOT prove.** That the prohibition line
generalizes to every task. [Source Conditioning](/source-conditioning)
has the fuller evidence base for reformulation tasks. Reasoning,
strategy, and creative tasks are less well characterized. The Beat 5
demo is a live replication on a single task.

## What's here

| File | What it is |
|---|---|
| [`prompts.md`](./prompts.md) | The five prompts verbatim. Copy-paste ready. The primary replication kit. |
| [`data.md`](./data.md) | The two fictional datasets verbatim. |
| [`samples.md`](./samples.md) | Actual outputs from a fresh run on 2026-04-24. Full CEO vs short-seller outputs, cross-model excerpts, the labeled post-mortem, before-and-after fabrication pairs, and the Beat 5 stress test result. |

## What the receipts prove (and don't)

These receipts prove:

- The prompts are verbatim. The role framing on Beat 2 changes only
  the role word. Both prompts end with the same task instruction. The
  demo is clean because only one variable changed.
- The datasets are fictional and sparse enough to make confabulation
  observable. Stellex provides five revenue numbers and nothing else.
  Any margins or growth rates the model reports were invented by the
  model.
- The scripts are the same scripts that ran during filming. Running
  them against the same model versions should produce outputs that
  match the beats shown in the video directionally. Exact wording
  will vary across runs because generation is stochastic.
- Prohibition ("use only data from the source, state when a metric is
  not available rather than estimating") redirects the output.
  Confabulation drops. The model compensates by writing qualitative
  analysis or naming the gaps explicitly.

These receipts do NOT prove:

- That these three failure modes cover every way AI output fails. The
  published finding names six. The video demonstrates three. Frame
  effects (Beat 2) are flagged in the finding's honest-limits section
  as a seventh mechanism tracked for future inclusion.
- That the role dominates the model across every comparison. The
  three models tested (xAI grok-4-1-fast, GPT gpt-5-mini, Gemini
  gemini-3-flash-preview) are roughly comparable in capability. For
  much-less-capable models, the model choice may dominate the role
  choice. The demo shows role-over-model in this band of models, not
  as a universal rule.
- That Beat 5's prohibition line generalizes to every generator and
  every task. [Source Conditioning](/source-conditioning) has the
  fuller evidence base. The Beat 5 demo is a live replication on a
  single task.
- That self-labeling (Beat 4) works on any output where the model
  has reason to defend its framing. Tested on a template-dominant
  task with nothing to defend.
- That the effects occur at any particular rate. No per-run flip rates
  or variance data were collected. The demos are reported as
  directional, with one full sample run per beat (see
  [`samples.md`](./samples.md)). The Beat 5 stress test
  adds a second condition showing the prohibition holds against
  explicit estimation demands across three models.
  [Source Conditioning](/source-conditioning) has wider quantitative
  evidence for the confabulation mechanism specifically.
  Running each demo N times across varied datasets to produce rate
  statistics is a separate experiment. If future work quantifies
  Beat 2 (frame effects) or Beat 4 (template labeling) at scale, it
  will earn its own transmission.

## Related findings in the canon

The demos map to existing transmissions on the blog that cover each
mechanism in depth, where one exists.

- Beat 2 (role framing): this mechanism is not yet its own
  transmission on the blog. Tracked for future inclusion. The closest
  adjacent finding is [The Attribution Error](/attribution-error).
- Beat 3 (cross-model): [The Attribution Error](/attribution-error)
  reframes "which model should I use" as an attribution error. The
  context determines whether behaviors happen. The model adjusts how
  they are expressed.
- Beat 4 (template contamination): [Trust Signals Are Inverted](/trust-signals-are-inverted)
  covers why corporate-sounding, citation-heavy, confident output is
  often the least reliable signal of actual grounding.
- Beat 5 (confabulation): [Source Conditioning](/source-conditioning)
  is the full recipe. Three steps. Source material reduces
  confabulation from 85% to single digits. Prohibition outperforms
  monitoring by roughly 5x.
  [The Constraint Paradox](/constraint-paradox) covers why this kind
  of constraint works on convergent tasks and can harm exploratory
  ones.

## What the iteration cost

The video took roughly 20 script iterations across 8 days. The
template-email demo was rebuilt into the post-mortem demo. The
role-framing prompt was restructured after the first version changed
two variables at once. The intervention for Beat 4 shifted from
"strip the template" to "label the output."

The visible video is three minutes. The work behind it is closer to a
week of constrained design. This matters if you try to build similar
demos yourself. The first version of a demo is almost never the one
that works. The constraint is what makes the demo honest. Iteration
until the constraint is satisfied is the job.

## Note on model versions

The samples in [`samples.md`](./samples.md) were generated against
April 2026 model versions (`grok-4-1-fast`, `gpt-5-mini`,
`gemini-3-flash-preview`). Newer versions may produce different
outputs. The directional findings (role flips the trajectory,
prohibition redirects the confabulation, template contamination is
label-visible) are expected to hold on successor models. The exact
wording will not.

## What this opens up next

The three mechanisms shown here are three of the six named in
[the finding](/stop-calling-it-hallucination). Drift, interference,
and wrong task each have their own visibility constraints and would
need their own demo shapes. If future videos cover them, their
receipts will follow this same shape: design problem, what was
killed, final, replication, what it doesn't prove.

The methodology narrative here is new for a receipts publication: it
exposes the iterations alongside the final kit and actual output
samples. The earlier receipts for
[The Fabrication Architecture](/fabrication-architecture) and
[Catching Your Own Overclaim](/catching-your-own-overclaim) are
verification-focused. Whether this shape holds for V2 and onward
depends on what external readers do with it.

## Errata

Found a problem with the demos, the prompts, or the methodology? Send
it via LinkedIn DM (linked from
[/about](https://blog.clarethium.com/about)). Corrections get
published on the record at [/record](https://blog.clarethium.com/record),
with attribution.

## Related receipts

This kit pairs with the video for the same finding and is the first
receipts publication to expose the methodology story (killed
approaches, design rationale) alongside the replication kit. Companion
receipts for related findings are at
[Catching Your Own Overclaim](/receipts/catching-your-own-overclaim) and
[The Fabrication Architecture](/receipts/fabrication-architecture).
