One of our more interesting experiments: we built an agent to evaluate proposals, and then, rather than letting it replace human evaluation, we made it argue with our humans.
The obvious design would have been substitution. The agent reads submissions, scores them against criteria, produces a ranked list, humans skim and sign. Faster, cheaper, and it would have worked, in the narrow sense. It also would have put a rubber stamp with a pulse at the end of the process, and we’d already learned where that leads.
So we designed for friction instead. The agent evaluates independently. The humans evaluate independently. Then they compare, and the interesting work happens exactly where they disagree. The agent has to show its reasoning; the humans have to show theirs. Sometimes the agent caught things tired humans skimmed past, an unsupported claim deep in an appendix, an inconsistency between sections written by different hands. Sometimes the humans knew things no document contained, this supplier’s record on a previous job, why that elegant approach fails in our specific operating context. Neither party was reliably right. That was the point.
What we got wasn’t just better evaluations, though they were better. We got visibility of how each party fails. The agent’s blind spots turned out to be systematic and therefore fixable. The humans’ blind spots, mostly fatigue and anchoring, were exactly the ones the agent doesn’t share. And our people sharpened noticeably: knowing a machine would independently assess the same material, and that divergences would be discussed, does wonders for rigour. Nobody wants to be the reviewer who missed what the agent caught, and nobody forgets the day their contextual knowledge overturned the machine’s tidy conclusion.
Disagreement, it turns out, is a control. Two independent evaluators with different failure modes, forced to reconcile, catch what either alone would miss. Aviation figured this out generations ago.
As you deploy agents into judgement-heavy work, resist the instinct to design the human out or bolt them on at the end as approval. Design the argument in. It’s where the assurance lives.