Your AI Agent Passed the Demo. It Will Fail in Production. Here Is How QA Teams Evaluate Agents Before That Happens.

AI agents fail in ways traditional testing cannot catch. They call the right tool with wrong parameters. They hallucinate success after a failed operation. They create brilliant plans and ignore them during execution. They get stuck in infinite reasoning loops. And your pipeline stays green the entire time. This is the QA evaluation guide for the agentic era.


I need to tell you about the worst kind of bug I have ever seen.

It is not a crash. A crash is visible. Your monitoring catches it, your alerting fires, your team fixes it within the hour.

The worst kind of bug is a confident wrong answer.

An AI agent that calls a booking API, receives a timeout error, and then tells the customer “Your reservation is confirmed for March 15.” No error in the logs. No failed assertion. The agent hallucinated a success message because it treated the absence of an explicit failure as confirmation.

Research from AWS on autonomous agents documents this pattern: when business rules live only in natural-language prompts, agents hallucinate around them. Parameter errors — the agent calls book_hotel(guests=15) despite “Maximum 10 guests” in the documentation. Completeness errors — the agent executes bookings without the required payment verification. Tool-bypass behavior — the agent confirms success without calling validation tools at all.

Prompt engineering does not fix this. Prompts are suggestions, not constraints. The agent sees “Maximum 10 guests” as context, not a hard boundary.
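One common remedy is to move business rules out of the prompt and into code the agent cannot talk its way around. A minimal sketch, assuming a hypothetical book_hotel tool and the 10-guest limit from the example above (the names and rule are illustrative, not a real API):

```python
# A minimal sketch of enforcing a business rule as a hard constraint in the
# tool itself, rather than as prompt text. Tool name and rule are hypothetical.

MAX_GUESTS = 10  # the "Maximum 10 guests" rule, now a boundary the agent cannot cross

def book_hotel(guests: int, hotel_id: str) -> dict:
    """Hypothetical booking tool with validation the agent cannot bypass."""
    if guests > MAX_GUESTS:
        # Reject before any side effect: the agent receives an explicit error
        # instead of silently executing with bad parameters.
        raise ValueError(f"guests={guests} exceeds maximum of {MAX_GUESTS}")
    return {"status": "booked", "hotel_id": hotel_id, "guests": guests}
```

The difference in kind: a prompt instruction can be ignored; a raised exception cannot, and it gives the agent (and your evaluation traces) an explicit failure to react to.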

And here is the uncomfortable part for QA teams: none of these failures show up in your existing test suite. Your API tests verify the endpoints work. Your UI tests verify the interface renders. But nobody is testing the agent’s reasoning, tool selection, parameter accuracy, or task completion.

That is what AI agent evaluation is. And in 2026, it is the most important QA skill nobody has.



Why Agents Fail Differently Than Software

Traditional software is deterministic. Given input X, you always get output Y. Testing is straightforward: define expected outputs, assert against actual outputs, done.

AI agents are probabilistic. The same input can produce different reasoning paths, different tool selections, and different outputs every time. The “correct” answer is not a single string — it is a range of acceptable behaviors across multiple dimensions.

A DEV Community article on agent QA testing identified five structural differences that make agent testing fundamentally harder:

Prompt sensitivity. A change to three words in your system prompt can shift behavior across thousands of scenarios. No compiler warning. No stack trace. The behavior just drifts.

Context window dynamics. Agents that work perfectly with short conversations silently degrade as context grows. The model starts forgetting instructions, misattributing tool outputs, or losing track of its own state.

Tool call cascades. When a tool returns unexpected data — a null, a timeout, a schema mismatch — agents often do not fail loudly. They hallucinate a plausible response and keep going. A crash is visible. A confident wrong answer is invisible until it causes downstream damage.

Non-deterministic execution. Run the same agent with the same input five times and you might get three different tool call sequences, two different final answers, and one infinite reasoning loop.
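A cheap way to surface this non-determinism is to replay the same input several times and count the distinct tool-call sequences that come back. A sketch, with run_agent standing in for your real agent harness (here a stub for illustration):

```python
# Sketch: quantify non-determinism by replaying one input and comparing
# tool-call sequences across runs. `run_agent` is a placeholder stub.
from collections import Counter

def run_agent(user_input: str) -> list[str]:
    # A real harness would invoke the agent and return the ordered list
    # of tool names it called; this stub returns a fixed sequence.
    return ["OrderLookup"]

def sequence_variance(user_input: str, runs: int = 5) -> Counter:
    """Count how often each distinct tool-call sequence occurs across runs."""
    return Counter(tuple(run_agent(user_input)) for _ in range(runs))
```

One dominant sequence across five runs is reassuring; three different sequences from the same input is exactly the red flag this section describes.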

Evaluation complexity. Evaluation itself is the hard part: you cannot just compare output strings. You need to evaluate reasoning quality, tool selection accuracy, parameter correctness, plan adherence, task completion, and efficiency — often using another LLM as the judge.


The Seven Ways AI Agents Fail

Confident AI’s research on agent evaluation (the team behind DeepEval) identified a taxonomy of agent failure modes. As a QA engineer, these are your new defect categories:

1. Failed task completion. The agent did not accomplish what the user asked — because it called the wrong tools, hit API errors, or simply gave up.

2. Infinite reasoning loops. The agent gets stuck in circular thinking — more common with newer reasoning models. It reasons about reasoning about reasoning, burning tokens without progress.

3. Wrong tool selection. The agent picks a plausible-looking tool but the wrong one for the task. The call succeeds, returns data, and the agent builds its response on irrelevant information. Nothing in your logs flags it.

4. Wrong parameters. The agent calls the correct tool but passes incorrect arguments. Destination and origin swapped. Date format wrong. Quantity exceeds maximum. The tool executes successfully with bad data.

5. Tool bypass hallucination. The agent claims it called a tool when it did not. It generates a response that looks like a tool result but is entirely fabricated. This is the most dangerous failure mode because the output appears structurally correct.

6. Faulty agent handoffs. In multi-agent systems, the wrong specialist agent receives the task. A billing question gets routed to the shipping agent. The shipping agent answers confidently about billing — incorrectly.

7. Plan abandonment. The agent creates a high-quality plan, then ignores it during execution. It gets distracted by intermediate results, changes strategy mid-task, or simply forgets the plan existed after context window pressure pushes it out of attention.

Every one of these failures produces an output that looks normal. Your API returns 200. Your UI renders the response. Your pipeline stays green.
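Some of these modes can be caught deterministically. Failure mode #5 (tool bypass hallucination), for instance, is detectable by cross-checking the tools the agent claims to have used against the execution trace. A sketch, with illustrative trace field names (not a specific framework's schema):

```python
# Sketch: detect tool-bypass hallucination by cross-checking what the agent
# *claims* against what the trace *records*. Trace fields are illustrative.

def claims_without_calls(claimed_tools: list[str], trace: list[dict]) -> list[str]:
    """Return tools the agent referenced in its answer but never actually called."""
    called = {event["tool"] for event in trace if event.get("type") == "tool_call"}
    return [t for t in claimed_tools if t not in called]
```

Any non-empty result means the agent fabricated a tool result — the failure that looks structurally correct in every log your pipeline currently reads.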


The Evaluation Framework: Two Layers, Six Metrics

The framework that makes agent evaluation concrete comes from DeepEval, the open-source LLM evaluation framework (v3.8.9, 13K+ GitHub stars, 3M monthly downloads, used by OpenAI, Google, and Microsoft).

Agent evaluation operates across two layers — mirroring how you would evaluate a human employee.

Layer 1: Reasoning — Did the Agent Think Correctly?

PlanQualityMetric — Was the plan logical, complete, and efficient? Did it account for dependencies between sub-tasks?

PlanAdherenceMetric — Did the agent follow its own plan? A brilliant plan that gets ignored is worse than no plan, because it creates false confidence.

These two metrics together catch failure mode #7 (plan abandonment). An agent can score 0.95 on PlanQuality and 0.3 on PlanAdherence — meaning it knows what to do but does not do it.

Layer 2: Action — Did the Agent Do the Right Things?

ToolCorrectnessMetric — Did the agent select the right tools? Compares actual tools called against expected tools. Catches failure mode #3.

ArgumentCorrectnessMetric — Were the parameters correct? Uses LLM-as-a-judge since correct arguments are not always predetermined. Catches failure mode #4.

TaskCompletionMetric — The ultimate metric. Traces the entire execution — every reasoning step, every tool call, every intermediate decision — and scores whether the task was accomplished. Catches failure modes #1, #2, and #5.

AgentEfficiencyMetric — Did the agent complete the task without unnecessary detours? Rewards direct paths, penalizes redundant tool calls and circular reasoning. Two agents can both score 1.0 on task completion, but the one that did it in 3 steps instead of 12 is the better agent.
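To make the scoring intuition concrete, here are crude deterministic proxies for two of these metrics. The real DeepEval metrics are richer and partly LLM-assisted; these simplified versions are mine, for illustration only:

```python
# Simplified stand-ins for tool-correctness and efficiency scoring,
# to show what the numbers mean. Not DeepEval's actual implementations.

def tool_correctness(called: list[str], expected: list[str]) -> float:
    """Fraction of expected tools that were actually called."""
    if not expected:
        return 1.0
    return len(set(called) & set(expected)) / len(set(expected))

def efficiency(steps_taken: int, minimal_steps: int) -> float:
    """1.0 for a direct path; lower for detours and redundant calls."""
    if steps_taken <= 0:
        return 0.0
    return min(1.0, minimal_steps / steps_taken)
```

This is how two agents can both complete a task yet differ sharply in quality: the 3-step agent scores efficiency(3, 3) = 1.0, the 12-step agent efficiency(12, 3) = 0.25.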


What This Looks Like in Practice

Here is a concrete test for a customer support agent that should look up order status using an OrderLookup tool:

from deepeval import evaluate
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import ToolCorrectnessMetric, TaskCompletionMetric

test_case = LLMTestCase(
    input="Where is my order #12345?",
    actual_output="Your order #12345 shipped on March 8 and arrives March 12.",
    tools_called=[ToolCall(name="OrderLookup")],
    expected_tools=[ToolCall(name="OrderLookup")]
)

evaluate(
    test_cases=[test_case],
    metrics=[
        ToolCorrectnessMetric(threshold=0.8),
        TaskCompletionMetric(threshold=0.9)
    ]
)

Run it: deepeval test run test_agent.py

If the agent called ProductSearch instead of OrderLookup, ToolCorrectness fails. If the agent made up the shipping date without calling any tool, TaskCompletion fails. And if the agent called the right tool but swapped the order ID, an ArgumentCorrectnessMetric added to the same suite would catch it.

Each failure points to a specific defect category. Each category has a specific fix. That is QA — not for code, but for AI reasoning.


Component-Level Evaluation: Finding Where It Broke

End-to-end evaluation tells you the agent failed. Component-level evaluation tells you where.

DeepEval’s @observe decorator traces individual components inside your agent pipeline:

from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase

@observe(metrics=[faithfulness_metric])
def retriever(query):
    docs = vector_db.search(query)
    update_current_span(test_case=LLMTestCase(
        input=query, actual_output=str(docs),
        retrieval_context=docs
    ))
    return docs

@observe(metrics=[tool_correctness_metric])
def tool_router(intent):
    tool = select_tool(intent)
    # evaluation happens at this component level
    return tool

When the final output is correct but the retriever returned irrelevant documents (the LLM compensated), component-level evaluation still catches the weak link. When the tool router selected the wrong tool but the agent recovered by calling a second tool, you see both the failure and the recovery.

This is the QA instinct applied to AI: do not just check if it works — check if every part works.


Continuous Evaluation: Agents Drift

This is the part most teams miss.

Your agent works perfectly on launch day. Two weeks later, it starts failing intermittently. Nobody changed anything.

What happened? The model provider pushed an update. Or your data changed. Or edge cases accumulated. Or the context window filled up differently because users started asking longer questions.

A DEV Community article on agent QA put it bluntly: treating evaluation as a one-time task is the number one mistake. Agent behavior drifts. Evaluation must be continuous.

The practical implementation: run your DeepEval test suite weekly on production traffic samples, even when nothing changed. Compare scores week over week. A consistent downward trend is a signal to act before users notice.
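The week-over-week comparison itself takes only a few lines. A sketch, with an illustrative trend window and tolerance (tune both to your own score variance):

```python
# Sketch: flag drift when evaluation scores decline for several consecutive
# weeks. Window size and tolerance are illustrative defaults, not prescriptions.

def detect_drift(weekly_scores: list[float], window: int = 3,
                 tolerance: float = 0.02) -> bool:
    """True when scores dropped by more than `tolerance` in each of the
    last `window` week-over-week comparisons."""
    if len(weekly_scores) < window + 1:
        return False  # not enough history to judge a trend
    recent = weekly_scores[-(window + 1):]
    drops = [recent[i] - recent[i + 1] > tolerance for i in range(window)]
    return all(drops)
```

The tolerance matters: probabilistic agents produce noisy scores, so you want to alert on a consistent downward trend, not on a single bad week.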

In CI/CD, trigger evaluation on every PR that touches prompts, agent definitions, tool configurations, or RAG pipeline changes. Metric drops below threshold — PR fails. Same pattern as Playwright tests for UI changes.


The Real-World Failure Catalogue

Vectara maintains an open-source repository of documented AI agent failures. Some highlights that should keep QA engineers up at night:

A government agency deployed an unvetted chatbot for public nutrition advice. It gave inappropriate responses that contradicted official dietary guidelines.

Google’s AI coding agent, asked to clear a cache, wiped an entire drive. “Turbo mode” allowed execution without confirmation.

Replit’s AI agent deleted a production database during a code freeze, then attempted to hide its actions.

An AI chatbot service faced multiple lawsuits alleging it promoted self-harm and delivered inappropriate content to minors.

These are not hypothetical scenarios. These are documented incidents from production systems. Each one would have been caught by systematic agent evaluation — task completion scoring, tool correctness validation, safety metrics, and output faithfulness checks.


The QA Engineer’s Agent Evaluation Checklist

If your team is shipping AI agents, here is the minimum evaluation stack:

Pre-deployment:

  • Golden dataset of 50+ test cases covering happy paths, edge cases, and adversarial inputs
  • ToolCorrectnessMetric on every agent that uses tools
  • TaskCompletionMetric on every agent workflow
  • FaithfulnessMetric on every RAG-powered agent
  • Safety scan via DeepTeam for customer-facing agents

In CI/CD:

  • deepeval test run on every PR that touches prompts or agent config
  • Threshold-based gates (metric drops = PR fails)
  • Version-locked baselines for comparison

In production:

  • Weekly automated evaluation on sampled traffic
  • Week-over-week score tracking for drift detection
  • Alerting on task completion drops, hallucination spikes, or tool error rates
  • Trace logging for every agent execution (inputs, tool calls, reasoning steps, outputs)

The Bigger Picture

Traditional QA answers: does the code work?

AI agent evaluation answers: does the agent reason correctly, select the right tools, pass the right parameters, follow its own plan, complete the task, and do it efficiently — without hallucinating, drifting, or failing silently?

These are harder questions. They require new metrics, new tools, and new thinking. But the QA engineer who masters agent evaluation becomes the most important person on any team shipping AI.

Because a crash is fixable. A confident wrong answer erodes trust.

And trust, once lost, takes far longer than an hour to rebuild.


Agent evaluation with DeepEval is a core module in my AI-Powered Testing Mastery course. I cover the complete evaluation stack — from your first test case to continuous production monitoring — alongside the Playwright agent trio, multi-agent QA swarms, and the full 2026 AI QA toolchain.
