# Source documents

The five scripts in this kit reference three source documents, one per
topic. These live alongside the original scripts in the vault under
`EXP-081-data/sources/`: `source_1_remote_work.md`,
`source_3_communication.md`, and `source_4_ai_workflows.md`. They are
not bundled here because they are condensed summaries of named
third-party reports, and republishing the condensed versions raises
unresolved redistribution questions.

To reproduce the experiments, assemble equivalent bullet-pointed
source documents of roughly 2-4 KB each from the underlying reports
listed below. The shape that worked: report name and publisher in the
header, then bullet lists of statistics grouped by sub-topic. Each
bullet should contain at least one specific number (percentage,
dollar amount, multiplier, count) so the in-source / not-in-source
classifier has something to match against.
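
The shape described above can be checked mechanically before running any experiment. The following is a minimal sketch, not part of the kit: `lint_source`, its byte limits, and its definition of "a number" (any digit run) are assumptions chosen for illustration, and the kit's real classifier may be stricter.

```python
import re

# Assumed definition of "a specific number": a digit, optionally followed
# by more digits, separators, or a decimal part.
NUMBER = re.compile(r"\d[\d,.]*")
# Markdown bullets starting with "-" or "*".
BULLET = re.compile(r"^\s*[-*]\s+(.*)$")


def lint_source(text: str, min_bytes: int = 2048, max_bytes: int = 4096) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    size = len(text.encode("utf-8"))
    if not min_bytes <= size <= max_bytes:
        problems.append(f"size {size} bytes is outside {min_bytes}-{max_bytes}")
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = BULLET.match(line)
        # Flag bullets that carry no number at all.
        if match and not NUMBER.search(match.group(1)):
            problems.append(f"line {lineno}: bullet has no number")
    return problems
```

Calling `lint_source(open(path, encoding="utf-8").read())` on each assembled file gives a quick list of bullets to revisit. The size limits are soft guidance taken from the 2-4 KB range above, so a size warning is a prompt to look, not a failure.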

## Topic 1: `remote_work`

**Source file referenced:** `source_1_remote_work.md`

Underlying report: **Owl Labs, State of Hybrid Work 2025**
(research partner Vitreous World, N=2,000 full-time US workers,
collected July 2025). Available at
<https://owllabs.com/state-of-hybrid-work/2025>.

Sub-topics covered in the original summary: workforce composition,
productivity by location and task type, manager productivity
assessment, flexibility and retention, hybrid commute behavior,
benefit valuations, AI adoption among remote workers.

## Topic 2: `communication`

**Source file referenced:** `source_3_communication.md`

Underlying report: **Staffbase/YouGov, 2025 International Employee
Communication Impact Study**
(N=3,574 employees across six countries, surveyed February 2025,
online interviews via YouGov Panel).

Sub-topics covered: communication satisfaction relative to other
workplace factors, channel-specific trust data, manager-vs-leadership
trust, frontline (non-desk) worker gap, retention and engagement
correlations.

## Topic 3: `ai_workflows`

**Source file referenced:** `source_4_ai_workflows.md`

Underlying report: **Stack Overflow 2025 Developer Survey — AI
Section.** Available at <https://survey.stackoverflow.co/2025/ai/>.

Sub-topics covered: developer adoption rates, trust and accuracy
ratings, debugging overhead vs generation speed, productivity claims
relative to verified outcomes, code quality observations, junior vs
senior developer differences.

## Reproducing the source format

The scripts use simple in-source string matching (regex against the
source text). Two structural details matter for replication:

1. **Specific numbers must appear as written.** The matcher looks for
   tokens in the style of `13%`, `$4.88 million`, and `2x`. If your
   assembled source uses different formatting (`13 percent`, `4.88M`,
   `two times`), the matcher will under-count source-grounded numbers
   in the output.
2. **Bullet structure matters less than density.** The matcher does
   not care whether bullets, paragraphs, or tables are used. It cares
   that the numbers are present in the text.
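
To make the formatting point concrete, here is a minimal sketch of a token-level matcher. It is not the kit's `number_match.py`: the `TOKEN` pattern and the `classify_numbers` helper are hypothetical, written only to show why `13 percent` in a source fails to ground `13%` in an output.

```python
import re

# Hypothetical token pattern: optional "$", digits with optional thousands
# separators and decimals, then an optional "%", "x", or magnitude word.
TOKEN = re.compile(
    r"\$?\d+(?:,\d{3})*(?:\.\d+)?(?:%|x\b|\s+(?:million|billion|thousand)\b)?"
)


def classify_numbers(source: str, analysis: str) -> dict[str, bool]:
    """Map each number token in `analysis` to True if the same token
    appears, character for character, among the tokens of `source`."""
    source_tokens = set(TOKEN.findall(source))
    return {tok: tok in source_tokens for tok in TOKEN.findall(analysis)}


# Placeholder text, not figures from any of the reports listed above.
good = "- 13% of respondents\n- average cost of $4.88 million\n- 2x as likely\n"
bad = "- 13 percent of respondents\n- average cost of 4.88M\n- two times as likely\n"
analysis = "Roughly 13% agreed, the cost reached $4.88 million, and odds were 2x higher."

print(classify_numbers(good, analysis))  # every token found in the source
print(classify_numbers(bad, analysis))   # every token missed
```

Under this sketch the `bad` source yields the tokens `13` and `4.88`, neither of which equals `13%` or `$4.88 million`, so all three numbers in the analysis are classified as not-in-source even though the underlying facts are present. Comparing whole tokens instead of substrings is a design choice made here for clarity; the bundled script may match differently.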

A useful sanity check after assembly: run the included
`number_match.py` on a hand-written analytical paragraph that quotes
your source. If the matcher classifies your paragraph's numbers as
in-source, the source format is good. If it classifies them as
not-in-source, the source likely uses different number formatting
than the analysis does.
