

Receipts: The Most Trustworthy AI Output Is the Least Reliable

Raw artifacts behind the published finding. The prompts, the outputs, the scoring, and the analysis.

These files are the raw artifacts behind the finding published at https://blog.clarethium.com/trust-signals-are-inverted.

The published claim is that the signals readers use to judge AI output as trustworthy are the same signals fabrication produces: more citations, more named entities, more confident assertions, less hedging. This folder contains the experiment that measured each of those signals programmatically across 60 documents and the blinded LLM evaluation that confirmed the inversion.

What's here

| File | What it is |
| --- | --- |
| method.md | The experiment design in human-readable form: 10 topics × 2 conditions × 3 versions, the two prompt templates, how trust signals were extracted, how the blinded evaluation was run. |
| trust_signals.py | The experiment script that produced the data, verbatim. 348 lines. Three phases: generation (xAI), programmatic measurement (zero LLM), blinded LLM trust evaluation (Gemini). |
| trust_signals_results.json | All 60 outputs with per-document signal counts, the blinded trust scores, and full output texts. Nothing summarized away. |
| analysis.md | The aggregate numbers and how each maps to a claim in the published post. |

How to read this

  • If you want to check the claim: open analysis.md first. Each row in the claims table cites the path inside trust_signals_results.json that the number came from.
  • If you want to replicate: method.md describes the design, the verbatim prompt templates, and what each signal measures. trust_signals.py holds the procedure exactly as run, including the regex patterns (lines 154-200). _config.py is a documented stub for the two provider clients the script imports (get_xai_client, get_gemini_client); replace each NotImplementedError body with a call to your own SDK to reproduce end-to-end. The patterns, prompt strings, and Phase 2 signal extraction are reusable verbatim with no API access at all.
  • If you want to audit: trust_signals_results.json contains every output text, every signal count, and every trust score with the topic and condition labels.
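As a rough illustration of what regex-only signal extraction looks like, here is a minimal sketch. The patterns, the `count_signals` name, and the output keys are hypothetical stand-ins written for this example; they are not the patterns from trust_signals.py, which remain the authoritative definitions. The zero-hedge guard in the ratio is also an assumption of this sketch.

```python
import re

# Hypothetical patterns for illustration only.
# The patterns actually used in the experiment live in trust_signals.py.
HEDGE_RE = re.compile(r"\b(?:may|might|could|possibly|perhaps|suggests?)\b", re.IGNORECASE)
CONFIDENT_RE = re.compile(r"\b(?:clearly|definitely|proves?|demonstrates?|undoubtedly)\b", re.IGNORECASE)
CITATION_RE = re.compile(r"\(\s*[A-Z][A-Za-z]+(?: et al\.)?,? \d{4}\s*\)")  # e.g. (Smith et al., 2021)
DECIMAL_RE = re.compile(r"\b\d+\.\d+\b")  # e.g. 11.33


def count_signals(text):
    """Count surface-level trust signals in one document, with no LLM involved."""
    hedges = len(HEDGE_RE.findall(text))
    confident = len(CONFIDENT_RE.findall(text))
    return {
        "citations": len(CITATION_RE.findall(text)),
        "decimals": len(DECIMAL_RE.findall(text)),
        "confident": confident,
        "hedges": hedges,
        # Assumption of this sketch: treat zero hedges as one to avoid dividing by zero.
        "confidence_to_hedge": confident / max(hedges, 1),
    }
```

Because the counts come from fixed patterns, running the same function over the same text always yields the same numbers, which is what makes this phase checkable without API access.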

What the receipts prove (and don't)

These receipts prove:

  • The 60 documents were generated under two conditions that differ only in whether real source material was provided in the prompt. Both conditions used the same model (grok-4-1-fast), same temperature, same word-count target.
  • Programmatic signal extraction (zero LLM judgment) shows unsourced output produces 54.5% more citation references, 56.6% more named entities, and a 35.2% higher confidence-to-hedge ratio than sourced output. These are objective regex counts; the patterns are in trust_signals.py lines 154-200.
  • The blinded LLM trust evaluation (Gemini, no condition labels visible) rated unsourced output higher in 7 of 10 topics, sourced higher in 2, tied in 1. Mean trust score: sourced 4.57, unsourced 4.77.
  • The one signal that goes the other way is precise decimal numbers: sourced output has about 2.8 times as many (11.33 vs 4.03 per document). This is the exception named in the post: real data has real decimals.
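The arithmetic behind two of the points above can be sketched as follows. `fold_ratio` and `tally_topics` are hypothetical helper names, and the per-topic input shape (topic mapped to a pair of mean trust scores) is an assumption about how you might arrange the data, not the layout of trust_signals_results.json.

```python
import math
from collections import Counter


def fold_ratio(a_mean, b_mean):
    """How many times larger a_mean is than b_mean."""
    return a_mean / b_mean


def tally_topics(topic_means):
    """Count topics where each condition scored higher.

    topic_means maps topic -> (sourced_mean, unsourced_mean).
    This shape is an assumption made for this sketch.
    """
    tally = Counter()
    for sourced, unsourced in topic_means.values():
        if math.isclose(sourced, unsourced):
            tally["tied"] += 1
        elif unsourced > sourced:
            tally["unsourced_higher"] += 1
        else:
            tally["sourced_higher"] += 1
    return tally


# Published per-document decimal means: 11.33 (sourced) vs 4.03 (unsourced).
print(round(fold_ratio(11.33, 4.03), 1))
```

Fed the ten per-topic means from the results file, `tally_topics` should reproduce the 7 / 2 / 1 split if the recomputation matches the published analysis.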

These receipts do NOT prove:

  • That a human evaluator would replicate the blinded LLM rating. The post's domain-expert N=1 result is a separate observation; it informs the finding but is not in this receipts kit. An LLM judge may share bias with the RLHF-trained mechanism that produces the trust signals in the first place; this limitation is acknowledged in the post and in analysis.md.
  • That the inversion holds across model families. Generation here is single-generator (xAI). Cross-generator replication is a separate experiment.
  • That readers in real reading conditions weight signals the way the programmatic count assumes. The correlation between signal presence and trust ratings has not been tested with human readers at scale.

What this kit is for

The point of receipts is verification. To check whether the 55% and 57% figures in the post are real (54.5% more citation references and 56.6% more named entities in these receipts), open trust_signals_results.json, group by condition, and recompute. To judge whether the regex patterns are reasonable measures of what the post calls "trust signals," read them in trust_signals.py and challenge them.
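A minimal sketch of that recomputation, assuming each record is a dict with a condition label and a nested dict of signal counts. The key names `condition`, `signals`, and `citations` are hypothetical; check the actual structure of trust_signals_results.json (analysis.md cites the paths) and adjust before relying on the output.

```python
from collections import defaultdict
from statistics import mean


def mean_by_condition(docs, signal):
    """Average one signal count per condition.

    Assumes each doc looks like {"condition": ..., "signals": {signal: count}}.
    These key names are hypothetical, not the file's confirmed schema.
    """
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc["condition"]].append(doc["signals"][signal])
    return {condition: mean(values) for condition, values in buckets.items()}


def percent_more(unsourced_mean, sourced_mean):
    """Percent by which the unsourced mean exceeds the sourced mean."""
    return (unsourced_mean - sourced_mean) / sourced_mean * 100.0


# Toy records for illustration only; these are not the experiment's data.
toy_docs = [
    {"condition": "sourced", "signals": {"citations": 2}},
    {"condition": "sourced", "signals": {"citations": 4}},
    {"condition": "unsourced", "signals": {"citations": 4}},
    {"condition": "unsourced", "signals": {"citations": 5}},
]
toy_means = mean_by_condition(toy_docs, "citations")
print(percent_more(toy_means["unsourced"], toy_means["sourced"]))
```

To run it against the real file, load the records with `json.load` and pass them in place of `toy_docs`, mapping the key names to whatever the file actually uses.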

Related receipts

The Fabrication Architecture (../fabrication-architecture/) covers the foundational temporal-instability claim that explains why fabricated output exists in the first place. The trust-inversion finding here is the reading-side consequence of that mechanism.

Source Conditioning (../source-conditioning/) carries the receipts for the operational fix. Together, the three kits cover the loop: fabrication exists, fabricated output reads as more trustworthy than sourced output, and source grounding plus prohibition is what makes the output checkable.

Errata

Found a problem with the data, the method, or the analysis? Send it via LinkedIn DM (linked from /about). Corrections get published on the record at /record, with attribution.
