No automated way to verify an agent completed a task correctly

After an AI agent finishes a task, the only way to know if it did the right thing is for a human to manually review the output. The code might compile but have subtle logic errors. The research summary might sound authoritative but contain hallucinated facts. The workflow might complete 9 of 10 steps and silently skip the hardest one.

So what? If every agent output requires human review, agents do not save time; they shift the work from doing to reviewing, which is often harder because you must verify work you did not produce. You need to understand the entire output deeply enough to catch errors, which requires nearly as much expertise and effort as doing it yourself.

Why does this matter in the first place? The core promise of AI agents is autonomous execution: "do this task for me while I do something else." Without automated verification, this promise is a lie. You cannot run agents overnight. You cannot scale agents to 100 parallel tasks. You cannot build products that guarantee quality agent output. Every agent deployment has a human bottleneck that limits throughput to however fast humans can review.

The structural reason: automated evaluation is a solved problem for narrow tasks with clear ground truth (unit tests, math problems), but most real agent tasks are open-ended ("refactor this module," "research competitors"). Defining machine-checkable success criteria for open-ended tasks is itself an unsolved AI problem.
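A minimal sketch of that gap, assuming Python and hypothetical function names: a grader for a narrow task with known ground truth is a few lines, while the open-ended case has nothing meaningful to put in the body.

```python
def grade_math_answer(agent_answer: str, ground_truth: str) -> bool:
    # Narrow task with clear ground truth: correctness reduces to a
    # normalized comparison against a known reference answer.
    return agent_answer.strip() == ground_truth.strip()


def grade_refactor(agent_diff: str) -> bool:
    # Open-ended task: there is no reference answer to compare against.
    # Any rule written here either over-constrains acceptable solutions or
    # misses subtle logic errors, so it is not a real success criterion.
    raise NotImplementedError("no machine-checkable success criterion")
```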

Evidence

SWE-bench evaluates coding agents but only on pre-defined bug fixes with existing test suites — not open-ended tasks. No equivalent benchmark exists for general agent work. Companies using agents in production report 30-60% task failure rates caught only by human review: https://www.latent.space/p/ai-engineer-benchmark
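A rough sketch of the test-suite-based evaluation pattern SWE-bench relies on, not its actual harness; the function name, the `git apply` step, and the use of `pytest` are illustrative assumptions. It works only because the fail-to-pass tests were curated ahead of time, which is exactly the ground truth that open-ended agent tasks lack.

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, fail_to_pass_tests: list[str]) -> bool:
    # Apply the agent's patch; if it does not apply cleanly, the task failed.
    applied = subprocess.run(
        ["git", "apply", patch_file],
        cwd=repo_dir, capture_output=True, text=True,
    )
    if applied.returncode != 0:
        return False
    # Rerun the curated tests that the known fix is expected to flip from
    # failing to passing; the exit code is the machine-checkable verdict.
    tests = subprocess.run(
        ["pytest", "-q", *fail_to_pass_tests],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return tests.returncode == 0
```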

Comments