Experiment · Tested March 2026 · xAI / Gemini · 4 min

Same Technique, Opposite Results

By Lovro Lucic · Mar 24, 2026

Two tasks. Same AI model. Same approach. One task needs a precise answer. The other needs creative exploration.

The approach helped the first task immediately. The output got more specific, more grounded, more precise. Measurably so, and the direction held across different AI models. The same approach on the second task made the output worse.

Not "helped less." Damaged on the measures used: narrower range, less discovery. The structured approach that produced precision on convergent problems harmed exploratory ones. Identical technique. Opposite results. The direction replicates on xAI and Gemini Flash.

There is no universally good prompt. No best practice that works everywhere. The task type determines whether a technique helps or hurts, and most people don't distinguish task types before choosing their approach.

The mechanism: when a task has a known answer and needs precision, structure concentrates the model's output distribution. It narrows toward the right region. Focused, specific, hitting the target.

When a task requires exploration, the evaluative kind of structure (score the options, select the best) collapses the search space. The model needs to spread across possibilities, consider non-obvious angles, resist premature convergence. An evaluative frame forces it to organize before it has explored. The model complies with the instruction fully, and the output gets tidier and shallower. Narrower range. Less discovery.

Follow-up testing refined the mechanism: not all structure narrows. Organizational structure (map the whole space) expanded exploratory output instead. The harm tracks the type of structure, not its mere presence, and an explicit statement of task intent outweighs the structure signal.

Two types of AI users fall on either side of this split.

The first reads every output, approves what they understand, rejects what they don't. Slow. Scales linearly with human attention. But safe when you're the domain expert.

The second builds systems (quality gates, tests, standards) and audits selectively. Faster. Scales with system quality. But only works when the system matches the task type. A quality system built for convergent tasks applied to exploratory work produces compliant mediocrity.

The practical move: before choosing any technique, ask one question. Does this task have a known right answer that needs precision? Or does this task need range and exploration? A technique that's optimal for one can be harmful for the other.
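A minimal sketch of that one-question check, written as code so the choice is made before any prompt is sent. The `TaskType` enum, the `build_prompt` helper, and the frame wording are all illustrative assumptions, not the prompts used in the experiment.

```python
from enum import Enum


class TaskType(Enum):
    CONVERGENT = "convergent"    # known right answer, needs precision
    EXPLORATORY = "exploratory"  # needs range and discovery


# Hypothetical frame wording, chosen only to illustrate the two kinds of structure.
FRAMES = {
    # Evaluative structure: score and select.
    TaskType.CONVERGENT: (
        "List the candidate answers, score each against the stated criteria, "
        "and select the best one."
    ),
    # Organizational structure: map first, no ranking.
    TaskType.EXPLORATORY: (
        "Map the whole space of possibilities first, including non-obvious angles. "
        "Do not rank or select yet."
    ),
}


def build_prompt(task: str, task_type: TaskType) -> str:
    """Attach the frame that matches the task type and state the intent explicitly."""
    intent = f"Task intent: {task_type.value}."
    return f"{intent}\n{FRAMES[task_type]}\n\nTask: {task}"
```

Calling `build_prompt("Name directions for a new product line", TaskType.EXPLORATORY)` returns a prompt with the mapping frame and no scoring step. Forcing the caller to pass a `TaskType` is the point of the design: the task-type decision cannot be skipped.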

Evaluative structure helps convergence. Exploration needs range, and the scoring frame is what takes it away. The costly mismatch runs one way: convergent tasks survive extra structure, open questions don't survive the scoring frame. The same logic applies to context itself: redirecting attention works differently from adding background.

One caveat worth stating here, not just in the evidence section: the effect sizes measure programmatic specificity markers, not quality as a domain expert would judge it. In a blind test (one domain expert, 5 pairs), the expert couldn't distinguish specific from generic outputs on quality. Both conditions produced the same analytical substance. The direction (structure helps convergence, harms exploration) is robust. But what specificity changes is output form (more verifiable references), not substance (same conclusions to a domain expert). The honest effect size is roughly 40 percent smaller than originally claimed, once confounded length and quality-demand variables were removed.
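To make "programmatic specificity markers" concrete, here is a rough sketch of what such a counter could look like. The article does not list the markers that were actually counted; the two patterns below (numbers and quoted terms) and the function name are assumptions for illustration only.

```python
import re

# Hypothetical markers: the article does not say which ones were measured.
MARKER_PATTERNS = {
    "number": re.compile(r"\b\d+(?:\.\d+)?%?"),  # 12, 3.5, 40%
    "quoted_term": re.compile(r'"[^"]+"'),        # "exact phrase"
}


def specificity_per_100_words(text: str) -> float:
    """Count marker hits, normalized by length so longer outputs don't win by default."""
    words = text.split()
    if not words:
        return 0.0
    hits = sum(len(pattern.findall(text)) for pattern in MARKER_PATTERNS.values())
    return 100.0 * hits / len(words)
```

A counter like this rewards verifiable-looking references and nothing else. It cannot tell whether the conclusions are any better, which is the gap the blind test exposed.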

Most people pick one approach and run it on everything. Same level of structure, same specificity, same constraint density regardless of what the task actually needs. The mismatch is invisible because the output always looks competent. You only see it when you run both versions side by side on the same task.

Try this: take one task you do regularly with AI. Run it twice, once with tight constraints and once with loose constraints. Compare the outputs. Which version did you assume would win before you tested? The gap between your assumption and the result is the data.
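One way to run that comparison, sketched under assumptions: `generate` is a stand-in for whatever model client you already use (any function that takes a prompt string and returns text), and the tight-constraint wording is only an example. The prediction is recorded before either output exists, so it can't be revised after the fact.

```python
from typing import Callable, Dict


def run_side_by_side(
    task: str,
    generate: Callable[[str], str],
    predicted_winner: str,
) -> Dict[str, str]:
    """Run one task under tight and loose constraints; record the prediction first."""
    if predicted_winner not in ("tight", "loose"):
        raise ValueError("predicted_winner must be 'tight' or 'loose'")
    prompts = {
        # Example constraint wording; replace with the constraints you normally use.
        "tight": (
            f"{task}\n\nConstraints: answer in exactly 3 bullet points, "
            "each citing one concrete source."
        ),
        "loose": task,
    }
    outputs = {name: generate(prompt) for name, prompt in prompts.items()}
    outputs["predicted_winner"] = predicted_winner
    return outputs
```

The function deliberately does not pick a winner. Read the two outputs next to each other, decide, then compare your decision with `predicted_winner`.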

Test this yourself

Take one prompt technique you use. Run it on a precise analytical task and an open creative task. Compare whether it helps or hurts each.

What survived testing

Structure helps convergence (large effect; the direction replicates cross-generator on xAI and Gemini, though magnitude across models is not established).
Same structure hurts exploration (consistent direction across multiple replications).
Cross-generator: effect replicates on xAI and Gemini.
Compliance with structure on exploratory tasks sat at ceiling. The model follows the instruction, and the output still narrows: the frame, not disobedience, is the problem.

What didn't survive

"Structure always helps" killed. Task-type dependent.
"The harm is about constraint density" partially killed. It's about concentration vs range, not about how many constraints.
"Any structure narrows exploration" killed in follow-up testing: organizational structure expanded exploratory output several-fold while evaluative structure compressed it. The compression claim is scoped to evaluative-type structure, and an explicit task intent outweighs the structure signal.
Quality magnitude claims are LLM-calibrated. Human evaluation shows no holistic agreement with LLM scores.

Honest limits

Quality scores measure LLM-valued properties. Direction claims hold; magnitude claims are LLM-calibrated.
"Exploratory" operationalized as open-ended creative/strategic tasks. Other definitions may produce different boundaries.
March 2026 models. The task-type dependency is likely structural to how attention works. The specific effect sizes will shift.
