Receipts: The Most Trustworthy AI Output Is the Least Reliable
Raw artifacts behind the published finding. The prompts, the outputs, the scoring, and the analysis.
These files are the raw artifacts behind the finding published at https://blog.clarethium.com/trust-signals-are-inverted.
The published claim is that the signals readers use to judge AI output as trustworthy are the same signals fabrication produces: more citations, more named entities, more confident assertions, less hedging. This folder contains the experiment that measured each of those signals programmatically across 60 documents and the blinded LLM evaluation that confirmed the inversion.
What's here
| File | What it is |
|---|---|
| method.md | The experiment design in human-readable form: 10 topics × 2 conditions × 3 versions, the two prompt templates, how trust signals were extracted, how the blinded evaluation was run. |
| trust_signals.py | The experiment script that produced the data, verbatim. 348 lines. Three phases: generation (xAI), programmatic measurement (zero LLM), blinded LLM trust evaluation (Gemini). |
| trust_signals_results.json | All 60 outputs with per-document signal counts, the blinded trust scores, and full output texts. Nothing summarized away. |
| analysis.md | The aggregate numbers and how each maps to a claim in the published post. |
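As a quick sanity check on the design arithmetic, the 10 × 2 × 3 grid can be enumerated directly. This is a minimal sketch, not code from trust_signals.py: the topic labels are placeholders (the real topics are listed in method.md), and the helper name `design_cells` is hypothetical.

```python
from itertools import product

# Placeholder labels (assumption): the real topic names live in method.md.
TOPICS = [f"topic_{i:02d}" for i in range(1, 11)]
# Condition names as used in this README.
CONDITIONS = ["sourced", "unsourced"]
VERSIONS = [1, 2, 3]


def design_cells():
    """Return one (topic, condition, version) tuple per generated document."""
    return list(product(TOPICS, CONDITIONS, VERSIONS))


cells = design_cells()
per_condition = sum(1 for _, cond, _ in cells if cond == "sourced")
print(len(cells), per_condition)  # 60 documents, 30 per condition
```

Each condition ends up with 30 documents, which is why a per-document mean and a per-condition total give the same relative difference between conditions.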
How to read this
- If you want to check the claim: open analysis.md first. Each row in the claims table cites the path inside trust_signals_results.json that the number came from.
- If you want to replicate: method.md describes the design, the verbatim prompt templates, and what each signal measures. trust_signals.py holds the procedure exactly as run, including the regex patterns (lines 154-200). _config.py is a documented stub for the two provider clients the script imports (get_xai_client, get_gemini_client); replace each NotImplementedError body with a call to your own SDK to reproduce end-to-end. The patterns, prompt strings, and Phase 2 signal extraction are reusable verbatim with no API access at all.
- If you want to audit: trust_signals_results.json contains every output text, every signal count, and every trust score with the topic and condition labels.
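For orientation, a stub of the kind described for _config.py might look roughly like the sketch below. Only the two function names come from this README; the docstrings, error messages, and the `configured_providers` helper are assumptions added for illustration, and the real file may differ.

```python
# Hypothetical shape of _config.py. Only the names get_xai_client and
# get_gemini_client come from the README; everything else is illustrative.


def get_xai_client():
    """Return a client object for the generation phase (xAI)."""
    raise NotImplementedError("Replace with a call that builds your own xAI client.")


def get_gemini_client():
    """Return a client object for the blinded trust evaluation (Gemini)."""
    raise NotImplementedError("Replace with a call that builds your own Gemini client.")


def configured_providers():
    """Report which stubs have been filled in (hypothetical helper)."""
    status = {}
    for name, factory in (("xai", get_xai_client), ("gemini", get_gemini_client)):
        try:
            factory()
            status[name] = True
        except NotImplementedError:
            status[name] = False
    return status


print(configured_providers())  # both False until the stubs are replaced
```

With both stubs left as they are, the generation and evaluation phases cannot run, but the programmatic signal extraction needs neither client.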
What the receipts prove (and don't)
These receipts prove:
- The 60 documents were generated under two conditions that differ only in whether real source material was provided in the prompt. Both conditions used the same model (grok-4-1-fast), same temperature, same word-count target.
- Programmatic signal extraction (zero LLM judgment) shows unsourced output produces 54.5% more citation references, 56.6% more named entities, and a 35.2% higher confidence-to-hedge ratio than sourced output. These are objective regex counts; the patterns are in trust_signals.py lines 154-200.
- The blinded LLM trust evaluation (Gemini, no condition labels visible) rated unsourced output higher in 7 of 10 topics, sourced higher in 2, tied in 1. Mean trust score: sourced 4.57, unsourced 4.77.
- The one signal that goes the other way is precise decimal numbers: sourced output has 2.8x more of these (11.33 vs 4.03 per document). This is the exception named in the post: real data has real decimals.
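To make "objective regex counts" concrete, here is a minimal sketch of what regex-based signal counting can look like. The patterns and word lists below are illustrative assumptions, not the ones in trust_signals.py (those are at lines 154-200 of the script), and named-entity counting is left out because a short regex for it would be misleading.

```python
import re

# Illustrative patterns only (assumption): NOT the patterns used in trust_signals.py.
CITATION = re.compile(r"\[\d+\]|\([^()]*\b(?:19|20)\d{2}\)")  # [3] or (Author, 2021)
DECIMAL = re.compile(r"\b\d+\.\d+\b")                          # 12.5, 0.03
HEDGE = re.compile(r"\b(?:may|might|could|possibly|perhaps|suggests?)\b", re.I)
CONFIDENT = re.compile(r"\b(?:clearly|definitely|certainly|proves?|demonstrates?)\b", re.I)


def count_signals(text):
    """Count each signal in one document with plain regex matching, no LLM."""
    return {
        "citations": len(CITATION.findall(text)),
        "decimals": len(DECIMAL.findall(text)),
        "hedges": len(HEDGE.findall(text)),
        "confident": len(CONFIDENT.findall(text)),
    }


sample = (
    "Smith et al. (2021) clearly demonstrates a 12.5% gain [3]. "
    "This might suggest growth."
)
counts = count_signals(sample)
print(counts)

# The decimal ratio quoted above follows from the two per-document means.
decimal_ratio = round(11.33 / 4.03, 1)
print(decimal_ratio)  # 2.8
```

Because the counts depend entirely on the patterns, checking the actual patterns in the script matters more than checking the arithmetic.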
These receipts do NOT prove:
- That a human evaluator would replicate the blinded LLM rating. The post's domain-expert N=1 result is a separate observation; it informs the finding but is not in this receipts kit. LLM-as-judge has known shared bias with the same RLHF-trained mechanism producing the trust signals, which is acknowledged in the post and in analysis.md.
- That the inversion holds across model families. Generation here is single-generator (xAI). Cross-generator replication is a separate experiment.
- That readers in real reading conditions weight signals the way the programmatic count assumes. The signal-presence / trust-rating correlation at scale is human-untested.
What this kit is for
The point of receipts is verification. If you want to check whether the
55% / 57% numbers in the post are real, open trust_signals_results.json,
group by condition, and recompute. If you want to know whether the
regex patterns are reasonable measures of what the post calls "trust
signals," they are in trust_signals.py and you can challenge them.
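The "group by condition and recompute" step could be sketched as follows. The record layout (`condition`, `signals`, `citations`) is an assumption made for illustration; the real key paths inside trust_signals_results.json are the ones cited in analysis.md. The records here are synthetic so the sketch runs without the file.

```python
from collections import defaultdict
from statistics import mean


def mean_by_condition(records, signal):
    """Average one signal count per condition across all documents."""
    groups = defaultdict(list)
    for rec in records:
        # Key names are assumptions; adapt them to the real JSON layout.
        groups[rec["condition"]].append(rec["signals"][signal])
    return {cond: mean(vals) for cond, vals in groups.items()}


def percent_more(unsourced, sourced):
    """Relative difference of unsourced over sourced, in percent."""
    return (unsourced - sourced) / sourced * 100.0


# Synthetic records, not real data. With the real file you would load it via
# json.load and pass in the list of per-document entries instead.
records = [
    {"condition": "sourced", "signals": {"citations": 4}},
    {"condition": "sourced", "signals": {"citations": 6}},
    {"condition": "unsourced", "signals": {"citations": 7}},
    {"condition": "unsourced", "signals": {"citations": 8}},
]

means = mean_by_condition(records, "citations")
diff = percent_more(means["unsourced"], means["sourced"])
print(means, diff)  # synthetic result: 50.0 percent more
```

Whether the post's percentages were computed from per-document means or from per-condition totals is not stated here; with equal group sizes the two give the same relative difference.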
Related receipts
The Fabrication Architecture
(../fabrication-architecture/)
covers the foundational temporal-instability claim that explains why
fabricated output exists in the first place. The trust-inversion
finding here is the reading-side consequence of that mechanism.
Source Conditioning
(../source-conditioning/) carries the
receipts for the operational fix. Together, the three kits cover the
loop: fabrication exists, fabricated output reads as more trustworthy
than sourced output, and source grounding plus prohibition is what
makes the output checkable.
Errata
Found a problem with the data, the method, or the analysis? Send it via LinkedIn DM (linked from /about). Corrections get published on the record at /record, with attribution.
Files in this folder
- README.md (5.3 KB): Overview and how to read these artifacts.
- analysis.md (4.6 KB): What the numbers show and where to find them.
- method.md (4.9 KB): Experimental design and measurement methodology.
- trust_signals.py (20.5 KB): Three-phase script: generation, programmatic signal extraction, blinded LLM trust evaluation.
- trust_signals_results.json (285.4 KB): All 60 outputs with per-document signal counts, blinded trust scores, and full text.