Your agent finished. That doesn't mean it worked.
Task completion is the default metric for AI agents. It's also the most misleading. Here's what matters: trajectory-level evaluation, tool correctness, grounding, and the trust gap no one talks about.
TL;DR: Task completion is the default metric for AI agents. It’s also the most misleading. An agent can score 100% on task completion while hallucinating tool calls, ignoring API errors, and running 5x the expected steps. The real work is trajectory-level evaluation: measuring tool correctness, grounding, plan coherence, and error recovery across the full execution path. Start with informational tasks, keep generative tasks supervised, and engineer visible citation trails.
Key takeaways:
- Task completion rate measures whether an agent finished. It says nothing about whether it finished correctly or safely
- A 95% per-step success rate across 8 steps yields 66% end-to-end completion
- All 8 major agent benchmarks have been exploited. Lab scores don’t predict production performance
- 54% of users trust manual search more than agent results, even when agents complete 75% of tasks
- The six dimensions that matter: tool selection, argument extraction, grounding, error recovery, plan coherence, and completion
- Start with informational tasks (8.3/10 satisfaction). Keep generative tasks supervised (5.8/10)
I deployed an agent last quarter. It scored 100% on task completion. The dashboard showed green on every metric. Latency was fine. Token usage was normal. Error rate was zero.
The agent was deleting production data.
Every API call returned 200. The monitoring tools reported nothing wrong. By every conventional metric, the system was healthy right up until the moment the data was gone. The deletion took nine seconds. Recovery took thirty hours.
This isn’t a story about bad monitoring. It’s a story about the wrong metric.
Task completion is a vanity metric
Task completion rate is the first thing everyone tracks. Did the agent finish what you asked? Yes or no. It’s simple, it’s intuitive, and it’s dangerously incomplete.
The math is unforgiving. A 95% per-step success rate across eight steps gives you approximately 66% end-to-end completion (0.95^8 ≈ 0.663). Most teams don’t measure per-step accuracy at all. They check whether the final answer looks right.
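The compounding is easy to check yourself. Here is a minimal sketch, assuming every step succeeds independently with the same probability (real steps are rarely independent, and the function name is only illustrative):

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps

for steps in (1, 4, 8, 16):
    rate = end_to_end_success(0.95, steps)
    print(f"{steps:>2} steps at 95% per step: {rate:.1%} end-to-end")
```

Doubling the step count does not halve reliability. It squares it, which is why long trajectories fall off so quickly.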
A 2026 survey on agent reliability found that 41% of tech leaders cite reliability as the number one barrier to scaling agents in production. The same survey found most teams evaluate at the end-to-end level only. They check the output, not the path. This is the same pattern I covered in dynamic workflows in Claude Code: surface-level evaluation misses the deep failures.
The benchmark industry knows this. UC Berkeley researchers found all 8 major agent benchmarks exploitable. One team gamed 890 SWE-bench tasks with a single character change. OpenAI stopped reporting SWE-bench Verified scores entirely after confirmed training-set leakage. Enterprise data shows a 37% gap between lab scores and production outcomes.
Goodhart’s Law: once a measure becomes a target, it stops being a good measure.
An agent that scores 100% on task completion but 40% on tool correctness isn’t good at its job. It’s getting lucky. And luck doesn’t survive production.
The six things that matter
Agent failures cluster into six independent dimensions. Strong performance on one does not compensate for failure on another. This framework comes from the Confident AI evaluation guide and the Future AGI agent evaluation deep dive.
Tool Selection. Did the agent call the right tool? Did it refrain from calling one when it wasn’t needed? The most common failures here are pattern-matching on keywords instead of intended function, calling tools that don’t exist in the schema, and the irrelevance bucket: requests that need no tool call at all. Most teams never test those cases, which inflates the score.
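A tool selection check that includes the irrelevance bucket could look like the sketch below. The tool names, test cases, and predicted calls are all made up for illustration. The point is that `None` (no tool call) is a valid expected answer and gets scored like any other.

```python
from typing import Optional

# Each case pairs a user request with the tool the agent should call.
# None means the correct behavior is to call no tool at all.
CASES: list[tuple[str, Optional[str]]] = [
    ("What's the weather in Pune?", "get_weather"),
    ("Refund order 1182", "issue_refund"),
    ("Thanks, that's all!", None),        # irrelevance case
    ("What does 'refund' mean?", None),   # keyword bait, no tool needed
]

def selection_accuracy(cases, predictions, known_tools) -> float:
    """Score tool selection; a call to a tool outside the schema is always wrong."""
    correct = 0
    for (_, expected), predicted in zip(cases, predictions):
        if predicted is not None and predicted not in known_tools:
            continue  # hallucinated tool name
        if predicted == expected:
            correct += 1
    return correct / len(cases)

known = {"get_weather", "issue_refund"}
# Hypothetical agent output: it pattern-matches on "refund" in the last case.
predicted = ["get_weather", "issue_refund", None, "issue_refund"]
print(selection_accuracy(CASES, predicted, known))
```

Drop the two `None` cases from the suite and the same agent scores 100%, which is the optimistic number most teams end up reporting.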
Argument Extraction. When the agent calls the right tool, does it pass correct arguments? Type mismatches, missing required fields, and semantically wrong values are particularly damaging because they are silent: downstream steps execute against bad data.
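All three argument failures can be caught with a deterministic check before the call is executed. This is a minimal sketch in plain Python. The schema format and the `issue_refund` fields are assumptions for illustration, not any particular framework's API.

```python
def validate_arguments(args: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the call looks valid."""
    problems = []
    for name, spec in schema.items():
        if name not in args:
            if spec.get("required", False):
                problems.append(f"missing required field: {name}")
            continue
        value = args[name]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) and spec["type"] is not bool:
            problems.append(f"wrong type for {name}: bool")
        elif not isinstance(value, spec["type"]):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
        elif "check" in spec and not spec["check"](value):
            problems.append(f"semantically invalid value for {name}: {value!r}")
    for name in args:
        if name not in schema:
            problems.append(f"unexpected field: {name}")
    return problems

# Hypothetical schema for an issue_refund tool.
REFUND_SCHEMA = {
    "order_id": {"type": str, "required": True},
    "amount": {"type": float, "required": True, "check": lambda v: v > 0},
    "reason": {"type": str},
}

print(validate_arguments({"order_id": 1182, "amount": -20.0}, REFUND_SCHEMA))
```

The example call has the right tool and the right field names, and it is still wrong twice: the order ID is an integer and the refund amount is negative.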
Result Use (Grounding). Does the agent use the actual payload returned by the tool, or does it substitute its own knowledge? The tool returns {"balance": 10432.87}. The agent says “$10,400.” The number is close. The precision is lost. Across a multi-step workflow, these errors compound.
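One cheap grounding check is to compare every number in the agent's answer against the numbers in the tool payload. This is a rough sketch, not a complete grounding evaluator: it only looks at numeric literals, and the helper names are hypothetical.

```python
import json
import re

def numbers_in(text: str) -> set[float]:
    """Extract numeric literals, tolerating thousands separators like 10,432.87."""
    matches = re.findall(r"\d[\d,]*(?:\.\d+)?", text)
    return {float(m.replace(",", "")) for m in matches}

def ungrounded_numbers(answer: str, payload: dict) -> set[float]:
    """Numbers the agent stated that do not appear anywhere in the tool payload."""
    return numbers_in(answer) - numbers_in(json.dumps(payload))

payload = {"balance": 10432.87}
print(ungrounded_numbers("Your balance is $10,400.", payload))     # flags 10400.0
print(ungrounded_numbers("Your balance is $10,432.87.", payload))  # nothing flagged
```

A check like this will produce false positives when the agent legitimately computes a new value, such as a sum, so treat a flag as a reason to inspect the step rather than an automatic failure.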
Error Recovery. When a tool call fails, what does the agent do? The documented patterns are: hallucinating success (fabricates a result on a 500 error), silent failure (stops making progress without surfacing the failure), retry storm (retries without backoff), and context loss on recovery (restarts the task from scratch). These patterns are almost never tested in development environments because mocks return clean responses.
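Two of these patterns can be detected from a trace of tool calls alone. The sketch below assumes a hypothetical `ToolCall` record with an HTTP-style status and a timestamp. The thresholds are arbitrary starting points, not recommendations from any of the cited guides.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str
    status: int       # HTTP-style status returned by the tool
    timestamp: float  # seconds since the run started

def hallucinated_success(calls: list[ToolCall], agent_claims_success: bool) -> bool:
    """True if the agent reports success although some tool's last call failed."""
    last_status: dict[str, int] = {}
    for call in calls:
        last_status[call.tool] = call.status
    return agent_claims_success and any(s >= 400 for s in last_status.values())

def retry_storm(calls: list[ToolCall], max_retries: int = 3, min_gap: float = 1.0) -> bool:
    """True if a failing tool is retried more than max_retries times in a row
    with less than min_gap seconds between attempts."""
    streak = 0
    for prev, call in zip(calls, calls[1:]):
        fast_retry = (
            prev.tool == call.tool
            and prev.status >= 400
            and call.timestamp - prev.timestamp < min_gap
        )
        streak = streak + 1 if fast_retry else 0
        if streak > max_retries:
            return True
    return False

# A failing payment call hammered five times in half a second.
trace = [ToolCall("charge_card", 500, t / 10) for t in range(5)]
print(retry_storm(trace), hallucinated_success(trace, agent_claims_success=True))
```

To exercise these checks in development, the mocks have to return failures on purpose. A mock that always returns 200 cannot trigger either detector.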
Plan Coherence. Multi-step trajectory efficiency. Loops that revisit completed steps. Dead-ends without escalation. Twelve steps where four would suffice. An agent that solves a problem in fifteen steps when five would work is not fine. It is burning tokens, latency, and error probability.
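Plan coherence can be approximated with two numbers: how many steps the agent used relative to what the task should need, and how many steps it repeated. In this sketch the trajectory format (tool name plus an argument string) and the expected step count are assumptions you would supply per task.

```python
def step_efficiency(actual_steps: int, expected_steps: int) -> float:
    """1.0 means the agent used no more steps than expected; lower means waste."""
    if actual_steps <= 0:
        raise ValueError("actual_steps must be positive")
    return min(1.0, expected_steps / actual_steps)

def repeated_steps(trajectory: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return (tool, arguments) pairs that the agent executed more than once."""
    seen: set[tuple[str, str]] = set()
    repeats: list[tuple[str, str]] = []
    for step in trajectory:
        if step in seen and step not in repeats:
            repeats.append(step)
        seen.add(step)
    return repeats

trajectory = [
    ("search_orders", "customer=42"),
    ("get_order", "id=1182"),
    ("search_orders", "customer=42"),  # revisits a completed step
    ("get_order", "id=1182"),
    ("issue_refund", "id=1182"),
]
print(step_efficiency(len(trajectory), expected_steps=3))  # 0.6
print(repeated_steps(trajectory))
```

Exact-match repetition is a crude loop detector. It misses loops where the arguments drift slightly, but it is deterministic and costs nothing to run on every trace.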
Task Completion. End-to-end accomplishment of the user’s request. Necessary. It is not sufficient.
The Indian publication Product Leaders Day calls this the metric pair that big tech won’t publish: TCR vs TUC. Task Completion Rate paired with Tool Use Correctness. The second decomposes into selection accuracy, parameter validity, and constraint adherence. Track both. One without the other tells you nothing.
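Reporting the pair takes very little code once each run records its tool calls. This is a sketch with a hypothetical `RunResult` record. How a call gets marked correct (right tool, valid parameters, constraints respected) is left to whatever checks you already run.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    completed: bool          # did the task finish?
    tool_calls: int          # tool calls made during the run
    correct_tool_calls: int  # calls judged correct by your own checks

def paired_metrics(runs: list[RunResult]) -> tuple[float, float]:
    """Return (task completion rate, tool use correctness) over a set of runs."""
    tcr = sum(r.completed for r in runs) / len(runs)
    total_calls = sum(r.tool_calls for r in runs)
    correct = sum(r.correct_tool_calls for r in runs)
    tuc = correct / total_calls if total_calls else 1.0
    return tcr, tuc

runs = [RunResult(True, 5, 2), RunResult(True, 4, 2), RunResult(True, 6, 2)]
tcr, tuc = paired_metrics(runs)
print(f"TCR={tcr:.0%}  TUC={tuc:.0%}")  # TCR=100%  TUC=40%
```

The sample data reproduces the lucky agent from earlier: every task finished, and six out of ten tool calls were wrong.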
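Run a Gen 4 deliberation panel (multiple judges cross-validating each other, the top rung of the evaluation ladder covered later in this post) for every deployment and it costs about ₹4,000/month ($48) in API calls. Skip it and a single production incident costs ₹8 lakh+ ($9,600+) in downtime and recovery. The math favors evaluation.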
The trust gap no one measures
A study of 8,128 users published two weeks ago tracked task completion across five major agent platforms. The mean completion rate was 75.3%. That number is everywhere in the marketing materials. What the marketing materials leave out is the rest of the study.
54% of users trusted manual search results more than agentic results. Only 34% trusted agentic results more. Among technically sophisticated users, the trust gap in favor of manual search widened to 37 percentage points.
The study tracked citation depth as a variable. Devin completed 86% of tasks with a median of 2 citations. OpenClaw completed 81% with a median of 7 citations. Users trusted the agent that cited more, even though it finished slower. Users who can evaluate a citation trail notice when there isn’t one.
Task completion rate measures whether the agent finished. It says nothing about whether the user believed the result. Those are two different metrics and they diverge fast.
The evaluation ladder
There are four generations of agent evaluation, described in detail by LayerLens. Most teams operate at Gen 1.
Gen 1: LLM-as-Judge (70% accuracy). A single LLM call grades the agent’s output against a rubric. Fast, cheap, and misses trajectory-level failures.
Gen 2: Agent-as-Judge (85% accuracy). The evaluator has tool access. It can check the agent’s tool calls against the available tool definitions and verify intermediate states.
Gen 3: Agentic Judge (90% accuracy). Multi-step evaluation that traces the full trajectory. Starts catching phantom value propagation. Hallucinated intermediate values that downstream APIs accept as valid.
Gen 4: Deliberation Panel (96-98% accuracy). Multiple judges cross-validate. Catches all four major failure patterns: phantom value propagation, infinite loops, destructive action chaining, and silent prompt mutation.
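The cross-validation idea behind a panel can be sketched in a few lines. To be clear about what is assumed here: the quorum rule, the escalation verdict, and the three stand-in judges are my own illustration, not the LayerLens design. In a real panel each judge would be a separate model call or checker.

```python
from collections import Counter
from typing import Callable

Judge = Callable[[list[str]], str]  # takes a trajectory, returns "pass" or "fail"

def panel_verdict(trajectory: list[str], judges: list[Judge], quorum: float = 0.75) -> str:
    """Return "pass" or "fail" only when that verdict reaches the quorum; otherwise escalate."""
    votes = Counter(judge(trajectory) for judge in judges)
    for verdict in ("fail", "pass"):
        if votes[verdict] / len(judges) >= quorum:
            return verdict
    return "escalate"

# Stand-in judges, each looking for one failure pattern.
def no_destructive_calls(traj: list[str]) -> str:
    return "fail" if any("DELETE" in step for step in traj) else "pass"

def no_loops(traj: list[str]) -> str:
    return "fail" if len(traj) != len(set(traj)) else "pass"

def short_enough(traj: list[str]) -> str:
    return "pass" if len(traj) <= 10 else "fail"

judges = [no_destructive_calls, no_loops, short_enough]
print(panel_verdict(["GET /users", "GET /orders"], judges))    # pass
print(panel_verdict(["GET /users", "DELETE /users"], judges))  # escalate
```

The second trajectory is the interesting one. Two of three judges see nothing wrong, so a simple majority vote would pass it. Requiring a quorum turns that disagreement into a human review instead of a green check.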
The evaluation ladder is like a kitchen expediter. A single expediter at the pass catches obvious mistakes. They miss the dish that looks right but uses the wrong ingredient. A team of expediters, one checking each station, catches everything. The cost goes up. The risk goes down. The 20-point accuracy gap between Gen 1 and Gen 3 is not a calibration issue. It is the difference between catching destructive action chaining before deployment and catching it after the database is gone.
Anthropic’s engineering team recommends a three-level grader stack. Code-based graders for what can be deterministically checked. LLM graders for what requires judgment. Human graders for what needs domain expertise. Grade the outcome when you can. Use transcript-level evaluation when you can’t.
When this is overkill
Not every agent needs the full evaluation stack.
A RAG chatbot answering factual questions from a fixed document set does not need trajectory-level evaluation. Single-turn answer relevancy and faithfulness checks are sufficient. The risk profile doesn’t justify the cost.
A customer support agent updating tickets, processing refunds, and writing to a CRM needs tool correctness and argument extraction checks. It does not need a deliberation panel.
An autonomous coding agent that can delete production databases, modify infrastructure, and trigger deployments needs every layer. The cost of evaluation is noise compared to the cost of the incident it prevents.
The decision framework from the n8n blog on agent metrics is the simplest I’ve seen: track tool call count per task as a leading indicator. If a task that normally requires 2 tool calls suddenly needs 8, something is wrong. Start there. Add layers as the risk grows.
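The tool call count indicator needs nothing beyond a per-task baseline. This is a minimal sketch, assuming you keep a history of call counts per task type. The 2x threshold is an arbitrary starting point, not a value from the n8n post.

```python
from statistics import median

def tool_call_spike(history: list[int], current: int, factor: float = 2.0) -> bool:
    """Flag a run whose tool call count exceeds factor times the historical median."""
    if not history:
        return False  # no baseline yet
    return current > factor * median(history)

baseline = [2, 2, 3, 2, 2]  # tool calls per run for this task type
print(tool_call_spike(baseline, 2))  # False
print(tool_call_spike(baseline, 8))  # True
```

The median is used instead of the mean so that one earlier runaway run does not raise the baseline and hide the next one.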
The task completion study maps the task types by safety and satisfaction. Automate from the top down:
- Informational tasks (8.3/10 satisfaction)
- Comparative tasks (7.8/10, 87% success)
- Exploratory research (7.1/10, citation depth matters)
- Transactional tasks with side effects (6.3/10, human-in-the-loop required)
- Generative creative work (5.8/10, highest distrust, lowest satisfaction)
Start at the top. Stay supervised at the bottom.
Start here
Building evaluations from zero follows a consistent pattern across teams that have done it. Anthropic documents it in their evals guide and Cameron Wolfe’s deep dive covers the same progression.
Start with 20 to 50 simple tasks drawn from real usage data. Use code-based graders first. They are deterministic and fast. They catch the most common regressions. Add LLM graders for the cases where judgment is required. Add human graders when you need gold-standard evaluation for a specific domain.
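A code-based grader is usually just a deterministic check on the final state. The sketch below uses made-up task names and state keys. In practice the final state would come from your test environment after the agent has run.

```python
from typing import Callable

# Each task maps to a deterministic check on the agent's final state.
TASKS: dict[str, Callable[[dict], bool]] = {
    "close_ticket_1182": lambda state: state.get("ticket_1182") == "closed",
    "refund_under_limit": lambda state: 0 < state.get("refund_amount", 0) <= 100,
}

def grade(task_id: str, final_state: dict) -> bool:
    """Run the code-based grader for one task."""
    return TASKS[task_id](final_state)

def pass_rate(results: dict[str, dict]) -> float:
    """Fraction of tasks whose final state passes its grader."""
    return sum(grade(task, state) for task, state in results.items()) / len(results)

results = {
    "close_ticket_1182": {"ticket_1182": "closed"},
    "refund_under_limit": {"refund_amount": 250},  # over the limit
}
print(pass_rate(results))  # 0.5
```

Graders like these check outcomes rather than wording, so they stay stable when the model's phrasing changes between versions.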
Track task completion and tool correctness as a paired metric. If one goes up and the other stays flat, you have a problem. Track step efficiency as a leading indicator. If the tool call count per task spikes, inspect the trajectory.
The Cameron Wolfe guide defines four levels of tool calling accuracy. Invocation accuracy (did it decide to call a tool when it should?). Selection accuracy (did it call the right tool?). Structural accuracy (did it pass correct arguments?). Trajectory accuracy (was the sequence of calls correct?). Most teams stop at selection. The failures live in trajectory.
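The four levels nest, which makes them simple to score in order. This sketch is my reading of the four definitions above, not code from the guide. It assumes each expected and actual call is a `(tool name, arguments)` pair, with `None` meaning no call.

```python
from typing import Optional

Call = tuple[str, dict]  # (tool name, arguments)

def score_turn(expected: Optional[Call], actual: Optional[Call]) -> dict[str, bool]:
    """Score one turn on invocation, selection, and structural accuracy."""
    invocation = (expected is None) == (actual is None)
    selection = invocation and (expected is None or expected[0] == actual[0])
    structural = selection and (expected is None or expected[1] == actual[1])
    return {"invocation": invocation, "selection": selection, "structural": structural}

def trajectory_accurate(expected: list[Call], actual: list[Call]) -> bool:
    """Trajectory accuracy: the same tools called in the same order."""
    return [name for name, _ in expected] == [name for name, _ in actual]

# Right decision, right tool, wrong order ID.
print(score_turn(("get_order", {"id": "1182"}), ("get_order", {"id": "1128"})))

expected = [("get_order", {"id": "1182"}), ("issue_refund", {"id": "1182"})]
actual = [("issue_refund", {"id": "1182"}), ("get_order", {"id": "1182"})]
print(trajectory_accurate(expected, actual))  # False: refund issued before the lookup
```

The last example passes every per-call check and still fails at the trajectory level, which is exactly the case that selection-only scoring cannot see.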
Build regression evals from capability evals. A capability eval asks: what can this agent do well? A regression eval asks: does it still handle everything it used to? Target 100% pass rate on regression evals. Graduate tasks from capability to regression as they stabilize. The same principle applies to agent context window management: what works for short sessions often degrades over longer ones.
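Graduation can be automated with a simple stability rule. In this sketch the rule (pass the last five runs) and the task names are assumptions. Pick whatever window matches how often you run the suite.

```python
def graduate(capability: dict[str, list[bool]], regression: set[str],
             window: int = 5) -> set[str]:
    """Add tasks that passed their last `window` runs to the regression suite."""
    stable = {
        task for task, history in capability.items()
        if len(history) >= window and all(history[-window:])
    }
    return regression | stable

def regression_gate(results: dict[str, bool], regression: set[str]) -> list[str]:
    """Return regression tasks that failed or were not run; any entry blocks the release."""
    return sorted(task for task in regression if not results.get(task, False))

capability = {
    "lookup_order": [True] * 6,                        # stable, ready to graduate
    "draft_email": [True, False, True, True, True],    # still flaky
}
suite = graduate(capability, regression={"close_ticket"})
print(sorted(suite))
print(regression_gate({"close_ticket": True, "lookup_order": False}, suite))
```

A missing result counts as a failure in the gate, so a task that silently stops running cannot keep the regression suite green.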
And engineer visible citation trails. The data is clear: users trust agents that show their work. A 75% completion rate means nothing if 54% of users don’t trust the result. The agent that finishes slower but cites seven sources is trusted more than the agent that finishes fastest with two. This is one reason why building model-agnostic agents matters. When the model behind the agent changes, your evaluation framework should catch the regression before your users do.
This is the most important sentence in this post: your agent can finish the task and still be failing. Task completion is where evaluation starts. It is not where evaluation ends.
FAQ
Why does task completion rate fail in production? Task completion rate measures the end state. It doesn’t measure the path. An agent can hallucinate a tool call, ignore an API error, and fabricate a success message and still score 100%. The failures hide in the trajectory.
What is the difference between capability evals and regression evals? Capability evals ask what an agent can do well. They start low and give the team a hill to climb. Regression evals ask whether the agent still handles everything it used to. They target 100% pass rate and catch regressions introduced by new features or model changes.
How much does production-grade agent evaluation cost? A Gen 4 deliberation panel costs roughly ₹4,000/month ($48) in API calls for a team running 50 to 100 agent deployments per week. The cost of not having it, a single production incident, starts at ₹8 lakh+ ($9,600+) in downtime, recovery, and reputational damage.
What should I evaluate first? Start with informational tasks. Fact retrieval, data lookup, single-vendor comparisons. These have the highest satisfaction scores and the clearest success criteria. Add evaluation layers as the agent’s autonomy and risk profile grow. Autonomous coding agents need every layer. Simple RAG chatbots need two.
Related Posts
- Dynamic workflows in Claude Code. Three failure modes in complex agent tasks and how workflow orchestration catches what surface-level evaluation misses.
- AI agent context window management. How context degrades over long agent runs and why trajectory-level evaluation is essential for catching drift.
- Model-agnostic agents. When the model behind your agent changes, your evaluation framework should catch the regression before your users do.
- AI agent error handling patterns. Error recovery is one of the six evaluation dimensions. Here’s how to build agents that fail gracefully.
This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents.