AI Test Agents Explained: The Planner-Generator-Healer Architecture for QA

AI test agents are not a future concept anymore. In 2026, teams are shipping production pipelines where a single agent plans an entire regression suite, generates TypeScript tests in real time, and heals broken selectors without human intervention. If you are still writing every test by hand, you are competing against machines that never sleep.

In this tutorial, I break down the planner-generator-healer architecture that powers modern AI test agents. You will learn how each component works, how they connect into a loop, and how to build your first agent pipeline using Playwright and LangGraph. I also include what hiring managers in India are actually asking for in SDET interviews right now.

Table of Contents

What Are AI Test Agents?
The Planner: Thinking Before Clicking
The Generator: Writing Tests at Machine Speed
The Healer: Fixing What Breaks
Wiring Them Together: The Agent Loop
Playwright’s Native Agent Support in 2026
Building Your First Agent Pipeline in TypeScript
The India Angle: What Hiring Managers Actually Want
Key Takeaways
FAQ

What Are AI Test Agents?

Traditional test automation is a factory line. You write a script, run it, fix the flakiness, and repeat. AI test agents flip this model. Instead of executing a fixed script, an agent observes the application, decides what to test, writes the code, runs it, and repairs failures autonomously.

The Microsoft Playwright team formalized this pattern in late 2025 with three distinct roles: the Planner, the Generator, and the Healer. Each role maps to a specific LLM-powered operation, and together they form a closed loop that improves itself with every run.

From Scripts to Agents

I have been writing Selenium scripts since 2012. The mental model was always linear: locator → action → assertion. Agents do not think in lines. They think in missions.

A mission might sound like: “Verify that a new user can sign up, add a product to cart, and complete checkout using UPI.” An agent breaks this into sub-tasks, generates the code, executes it, and if a payment gateway iframe changes its ID, the agent detects the break, explores alternatives, and patches the selector.

This is not speculative. Playwright’s GitHub repository crossed 89,542 stars in May 2026, and the agent-specific discussions in their Discord have grown 340% since January. Teams are building this today.

The Three-Body Problem of Test Maintenance

Most QA teams I mentor face the same three maintenance nightmares:

Stale selectors: A developer changes a button class from btn-primary to btn--primary, and 12 tests die.
Context drift: The business logic changes (e.g., a new mandatory phone-verification step), but the test script still expects the old flow.
Coverage gaps: New features ship without tests because the manual tester did not have time to automate them.

The planner-generator-healer architecture attacks all three. The planner prevents coverage gaps by reasoning about what should be tested. The generator eliminates the time bottleneck. The healer handles stale selectors and minor context drift.

The Planner: Thinking Before Clicking

The planner is the strategist. Its job is to take a high-level goal and decompose it into an ordered list of atomic test steps. This is where most “AI testing” demos fail. They jump straight to code generation without planning, which produces brittle scripts that break on the first dynamic element.

How the Planner Decomposes a Test Mission

A good planner uses a combination of:

DOM analysis: Inspecting the page structure to understand available elements and forms.
User-flow heuristics: Recognizing common patterns like login, search, pagination, and checkout.
Risk scoring: Prioritizing tests that cover high-traffic paths or recently changed code.
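
Risk scoring is the easiest of the three inputs to make concrete. The snippet below is a minimal sketch, not the scoring used by any particular tool: the CandidateFlow shape, the equal weighting, and the 30-day recency window are all assumptions chosen for illustration.

```typescript
// Hypothetical shape: these field names are illustrative, not from a real tool.
interface CandidateFlow {
  name: string;
  trafficShare: number; // fraction of sessions that hit this flow, 0..1
  daysSinceLastChange: number; // age of the latest code change touching the flow
}

// Assumed weighting: traffic and recency count equally.
// Recency decays linearly to zero over 30 days.
function riskScore(flow: CandidateFlow): number {
  const recency = Math.max(0, 1 - flow.daysSinceLastChange / 30);
  return 0.5 * flow.trafficShare + 0.5 * recency;
}

// Returns a new array sorted from highest to lowest risk.
function prioritize(flows: CandidateFlow[]): CandidateFlow[] {
  return flows.slice().sort((a, b) => riskScore(b) - riskScore(a));
}

const ranked = prioritize([
  { name: 'profile-settings', trafficShare: 0.05, daysSinceLastChange: 90 },
  { name: 'checkout', trafficShare: 0.6, daysSinceLastChange: 2 },
  { name: 'search', trafficShare: 0.8, daysSinceLastChange: 45 },
]);
console.log(ranked.map((f) => f.name)); // [ 'checkout', 'search', 'profile-settings' ]
```

Checkout outranks search even though search has more traffic, because the recent change dominates. Tune the weights against your own incident history rather than trusting these defaults.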

I built a planner for BrowsingBee that works like this. You give it a URL and a one-sentence goal. It returns a JSON plan:

{
  "mission": "Complete guest checkout with COD",
  "steps": [
    { "action": "navigate", "target": "/products" },
    { "action": "click", "target": "add-to-cart-button", "fallback": "text='Add to Cart'" },
    { "action": "navigate", "target": "/checkout" },
    { "action": "fill", "target": "#email", "value": "guest@example.com" },
    { "action": "select", "target": "#payment-method", "value": "COD" },
    { "action": "click", "target": "#place-order", "assertion": "url contains '/thank-you'" }
  ]
}

Notice the fallback field. The planner does not assume the primary selector will work. It generates a ranked list of locator strategies before a single line of Playwright code is written.
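
Because the plan is LLM output, it is worth validating before it reaches the generator. Here is a minimal sketch of a type and validator for the plan shape shown above. The field names come from that JSON; the parsePlan function and the rule that fill and select need a value are assumptions.

```typescript
type StepAction = 'navigate' | 'click' | 'fill' | 'select';

interface TestStep {
  action: StepAction;
  target: string;
  value?: string;
  fallback?: string;
  assertion?: string;
}

interface TestPlan {
  mission: string;
  steps: TestStep[];
}

const ACTIONS: string[] = ['navigate', 'click', 'fill', 'select'];

// Parses planner output and throws a descriptive error on the first problem found.
function parsePlan(raw: string): TestPlan {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== 'object' || data === null) throw new Error('plan must be an object');
  const { mission, steps } = data as { mission?: unknown; steps?: unknown };
  if (typeof mission !== 'string') throw new Error('mission must be a string');
  if (!Array.isArray(steps) || steps.length === 0) throw new Error('steps must be a non-empty array');

  steps.forEach((step: unknown, i: number) => {
    if (typeof step !== 'object' || step === null) throw new Error(`step ${i} must be an object`);
    const s = step as { action?: unknown; target?: unknown; value?: unknown };
    if (typeof s.action !== 'string' || ACTIONS.indexOf(s.action) === -1) {
      throw new Error(`step ${i}: unknown action`);
    }
    if (typeof s.target !== 'string' || s.target.length === 0) {
      throw new Error(`step ${i}: missing target`);
    }
    // Assumed rule: fill and select are meaningless without a value.
    if ((s.action === 'fill' || s.action === 'select') && typeof s.value !== 'string') {
      throw new Error(`step ${i}: ${s.action} requires a value`);
    }
  });
  return data as TestPlan;
}

const plan = parsePlan(JSON.stringify({
  mission: 'Complete guest checkout with COD',
  steps: [
    { action: 'navigate', target: '/products' },
    { action: 'fill', target: '#email', value: 'guest@example.com' },
  ],
}));
console.log(plan.steps.length); // 2
```

Rejecting a malformed plan here is cheaper than discovering it as a TypeScript compile error two nodes later.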

Real Example: Planning a Checkout Flow

Let me show you what happens when the planner hits a modern React application with dynamic loading. The planner first captures the accessibility tree. Playwright’s page.accessibility.snapshot() does this, but that API is deprecated, and locator.ariaSnapshot() is the maintained alternative on current versions. The planner then maps semantic roles (button, textbox, combobox) to probable actions.

If the checkout page has a payment iframe, the planner flags it as a context switch and inserts a frameLocator step. If a CAPTCHA appears, the planner marks the step as human-in-the-loop and routes it to a manual queue rather than failing silently.

This level of reasoning is what separates a toy demo from a production agent. A script blindly clicks. A planner understands structure.
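
The iframe and CAPTCHA handling described above can be sketched as a single pass over the snapshot. This is an illustration under stated assumptions: A11yNode is a simplified stand-in rather than Playwright's snapshot format, and matching "captcha" in the accessible name is a deliberately naive heuristic.

```typescript
// Simplified stand-in for one node of an accessibility snapshot (hypothetical shape).
interface A11yNode {
  role: string;
  name: string;
  insideIframe?: boolean;
}

type StepKind = 'action' | 'frame-switch' | 'human-in-the-loop';

interface PlannedStep {
  kind: StepKind;
  description: string;
}

// Assumed heuristic: any node whose name mentions "captcha" needs a human.
const CAPTCHA_PATTERN = /captcha/i;

function planSteps(nodes: A11yNode[]): PlannedStep[] {
  const steps: PlannedStep[] = [];
  let inFrame = false;
  for (const node of nodes) {
    if (CAPTCHA_PATTERN.test(node.name)) {
      // Route to a manual queue instead of letting the test fail silently.
      steps.push({ kind: 'human-in-the-loop', description: `manual: ${node.name}` });
      continue;
    }
    const nodeInFrame = node.insideIframe === true;
    if (nodeInFrame !== inFrame) {
      // Crossing the iframe boundary in either direction is a context switch.
      steps.push({
        kind: 'frame-switch',
        description: nodeInFrame ? 'enter iframe' : 'return to main frame',
      });
      inFrame = nodeInFrame;
    }
    steps.push({ kind: 'action', description: `${node.role}: ${node.name}` });
  }
  return steps;
}

const checkoutSteps = planSteps([
  { role: 'button', name: 'Pay with card' },
  { role: 'textbox', name: 'Card number', insideIframe: true },
  { role: 'checkbox', name: 'reCAPTCHA' },
  { role: 'button', name: 'Place Order' },
]);
console.log(checkoutSteps.map((s) => s.kind));
```

In a real planner, the generator would turn each frame-switch step into a frameLocator call, and the human-in-the-loop steps would be excluded from generated code entirely.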

The Generator: Writing Tests at Machine Speed

Once the planner produces a step list, the generator converts it into executable TypeScript. The generator is essentially a code-generation LLM with a strict prompt contract. It must output valid Playwright code that compiles without human editing.

From Natural Language to TypeScript

The generator prompt I use at QASkills looks like this:

You are a Playwright test generator.
Rules:
- Use TypeScript with strict mode.
- Prefer user-facing locators: getByRole, getByText, getByLabel.
- Avoid CSS IDs unless no semantic locator exists.
- Add explicit waits only for network idle, never arbitrary sleep.
- Include a descriptive test name and a comment for each assertion.

Plan:
{{ planner_output_json }}

Generate the test:

The output is typically 40–90 lines of clean TypeScript. Here is what the generator produced for the checkout mission above:

import { test, expect } from '@playwright/test';

test('guest checkout with COD completes successfully', async ({ page }) => {
  await page.goto('/products');

  // Add first product to cart
  const addButton = page.getByRole('button', { name: 'Add to Cart' });
  await expect(addButton).toBeVisible();
  await addButton.click();

  // Navigate to checkout
  await page.goto('/checkout');

  // Fill guest email
  await page.getByLabel('Email').fill('guest@example.com');

  // Select Cash on Delivery
  await page.getByRole('combobox', { name: 'Payment Method' }).selectOption('COD');

  // Place order and confirm redirection
  await page.getByRole('button', { name: 'Place Order' }).click();
  await expect(page).toHaveURL(/.*\/thank-you.*/);
});

The generator did not invent arbitrary IDs. It used getByRole and getByLabel, which means the test will survive a CSS refactor. This is the difference between agent-generated code and record-and-replay tools that spit out brittle XPath.

Handling Edge Cases in Generation

Real applications are messy. The generator must handle:

Dynamic lists: If the planner says “click the third product,” the generator writes page.locator('.product').nth(2) with a fallback filter.
Multi-tab flows: The generator wraps cross-tab actions in Promise.all with context.waitForEvent('page').
File uploads: It uses setInputFiles with a temp path rather than trying to click a hidden input.

I enforce these patterns through few-shot examples in the prompt. Without examples, the generator hallucinates page.click() on hidden elements or forgets to await async calls.
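
Few-shot examples reduce these mistakes but do not eliminate them, so a cheap static check on the generator's output is a useful second line of defense. The following is a rough sketch: the rule names and regexes are assumptions, and line-based matching will miss multi-line statements.

```typescript
interface LintFinding {
  rule: string;
  line: number; // 1-based line number in the generated test
}

// Each pattern is tested against one line at a time.
const RULES: { rule: string; pattern: RegExp }[] = [
  { rule: 'no-arbitrary-sleep', pattern: /\bwaitForTimeout\s*\(/ },
  { rule: 'no-legacy-page-click', pattern: /\bpage\.click\s*\(/ },
  { rule: 'no-xpath', pattern: /locator\(\s*['"`](xpath=|\/\/)/ },
  // A statement that starts with "page." was almost certainly meant to be awaited.
  { rule: 'missing-await', pattern: /^\s*page\./ },
];

function lintGeneratedTest(code: string): LintFinding[] {
  const findings: LintFinding[] = [];
  code.split('\n').forEach((text, index) => {
    for (const { rule, pattern } of RULES) {
      if (pattern.test(text)) findings.push({ rule, line: index + 1 });
    }
  });
  return findings;
}

const sample = [
  "await page.goto('/checkout');",
  "page.click('#place-order');",
  'await page.waitForTimeout(3000);',
].join('\n');

console.log(lintGeneratedTest(sample));
```

If the findings list is not empty, send the code back to the generator with the findings appended to the prompt instead of running it.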

The Healer: Fixing What Breaks

The healer is the component that saves your sanity at 2 AM. When a test fails in CI, the healer intercepts the error, classifies the root cause, and attempts a repair before notifying a human.

Selector Recovery vs Test Logic Recovery

Not all failures are equal. The healer categorizes them into two buckets:

Selector failures: The element exists but the locator is stale. The healer uses Playwright’s page.getBy* fallbacks, visual matching via screenshot diff, or accessibility tree re-query.
Logic failures: The application behavior changed (e.g., a new modal blocks the checkout button). The healer cannot fix this autonomously. It files a detailed bug report with DOM snapshots, console logs, and a HAR file.
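
The two buckets can be approximated by pattern-matching the error text before any LLM is involved. Treat this as a sketch: the patterns are assumptions modeled on typical Playwright error output and should be checked against the messages your own version produces.

```typescript
type FailureKind = 'selector' | 'logic' | 'unknown';

// Signs that the app behaved differently: a blocked click or a failed state assertion.
const LOGIC_HINTS: RegExp[] = [
  /intercepts pointer events/i,
  /expect\(.*\)\.to(HaveURL|HaveText|ContainText|HaveTitle)/,
];

// Signs that the locator itself is the problem.
const SELECTOR_HINTS: RegExp[] = [
  /strict mode violation/i,
  /waiting for (locator|getBy)/i,
];

function classifyFailure(message: string): FailureKind {
  // Logic hints are checked first because a blocked click also logs "waiting for locator".
  if (LOGIC_HINTS.some((p) => p.test(message))) return 'logic';
  if (SELECTOR_HINTS.some((p) => p.test(message))) return 'selector';
  return 'unknown';
}

console.log(
  classifyFailure("locator.click: Timeout 30000ms exceeded.\n  - waiting for locator('#place-order')"),
); // selector
```

Anything classified as unknown should go to a human or to an LLM-based classifier rather than to the selector healer.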

For selector recovery, I use a three-strike approach:

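import type { Page } from '@playwright/test';

// Escape regex metacharacters so the goal text is matched literally.
const escapeRegExp = (s: string) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

async function healSelector(page: Page, brokenLocator: string, goal: string): Promise<string | null> {
  // Strike 1: Retry with text-based locator.
  // Any error from isVisible() (for example an ambiguous match) counts as a miss.
  const textFallback = page.getByText(goal);
  if (await textFallback.isVisible().catch(() => false)) {
    // JSON.stringify keeps quotes inside the goal from breaking the emitted code.
    return `page.getByText(${JSON.stringify(goal)})`;
  }

  // Strike 2: Use accessibility role + name
  const namePattern = escapeRegExp(goal);
  const roleFallback = page.getByRole('button', { name: new RegExp(namePattern, 'i') });
  if (await roleFallback.isVisible().catch(() => false)) {
    return `page.getByRole('button', { name: new RegExp(${JSON.stringify(namePattern)}, 'i') })`;
  }

  // Strike 3: Visual similarity via screenshot embedding.
  // findVisuallySimilarElement is a project-specific helper that is not shown here.
  const candidate = await findVisuallySimilarElement(page, brokenLocator);
  return candidate ?? null;
}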

If all three strikes fail, the healer marks the test as needs_human_review and pings the team on Slack with a trace archive.

When Healing Fails (And Why That’s OK)

I have seen teams expect healer agents to achieve 100% autonomous recovery. That is a fantasy. In my experience running agent pipelines for e-commerce clients, the healer fixes about 62–68% of selector failures within 30 seconds. The remaining 32–38% require human judgment because they signal real product bugs or major UI redesigns.

The healer’s real value is not eliminating human QA. It is reducing noise. A team that used to wake up to 47 broken tests now wakes up to 4 genuine failures and 43 auto-healed passes. That changes how you staff on-call rotations.

Wiring Them Together: The Agent Loop

Individual components are useless if they do not talk to each other. The agent loop is a state machine that passes context from planner → generator → executor → healer → reporter.

State Management with LangGraph

I use LangGraph to model the loop. LangGraph is ideal because it treats agent steps as nodes in a graph, with explicit edges for success, failure, and retry. Here is the graph structure I recommend:

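import { Annotation, StateGraph, START, END } from '@langchain/langgraph';

// Every key needs a channel; with no reducer, a node's returned value overwrites the old one.
// TestStep and the five node functions are defined elsewhere in the project.
const AgentStateAnnotation = Annotation.Root({
  mission: Annotation<string>(),
  url: Annotation<string>(),
  plan: Annotation<TestStep[] | null>(),
  generatedCode: Annotation<string | null>(),
  testResult: Annotation<'passed' | 'failed' | 'healed' | 'needs_review'>(),
  retryCount: Annotation<number>(),
});

type AgentState = typeof AgentStateAnnotation.State;

const workflow = new StateGraph(AgentStateAnnotation)
  .addNode('planner', planNode)
  .addNode('generator', generateNode)
  .addNode('executor', executeNode)
  .addNode('healer', healNode)
  .addNode('reporter', reportNode)
  .addEdge(START, 'planner')
  .addEdge('planner', 'generator')
  .addEdge('generator', 'executor')
  // Heal only failed runs, and only while the retry budget lasts.
  .addConditionalEdges('executor', (state) =>
    state.testResult === 'failed' && state.retryCount < 3 ? 'healer' : 'reporter'
  )
  // A successful heal re-runs the test; anything else is escalated.
  .addConditionalEdges('healer', (state) =>
    state.testResult === 'healed' ? 'executor' : 'reporter'
  )
  .addEdge('reporter', END);

const app = workflow.compile();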

The key insight here is the conditional edge from executor to healer. The loop retries up to three times before escalating. Without a retry cap, a badly broken test would spin forever, burning LLM tokens and CI minutes.
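
The retry cap is easier to trust when the routing decisions are plain functions you can unit test without compiling a graph. Here is a minimal sketch, assuming the same state fields as above; the function names and the MAX_RETRIES constant are illustrative.

```typescript
type TestResult = 'passed' | 'failed' | 'healed' | 'needs_review';

interface LoopState {
  testResult: TestResult;
  retryCount: number;
}

const MAX_RETRIES = 3;

// After execution: heal only failed runs, and only while the retry budget lasts.
function routeAfterExecutor(state: LoopState): 'healer' | 'reporter' {
  return state.testResult === 'failed' && state.retryCount < MAX_RETRIES ? 'healer' : 'reporter';
}

// After healing: re-run only if the healer actually produced a patch.
function routeAfterHealer(state: LoopState): 'executor' | 'reporter' {
  return state.testResult === 'healed' ? 'executor' : 'reporter';
}

console.log(routeAfterExecutor({ testResult: 'failed', retryCount: 1 })); // healer
console.log(routeAfterExecutor({ testResult: 'failed', retryCount: 3 })); // reporter
console.log(routeAfterHealer({ testResult: 'needs_review', retryCount: 2 })); // reporter
```

These two functions can be passed directly as the second argument to addConditionalEdges, which keeps the graph definition and the tested logic in sync.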

Observability: Tracing Agent Decisions

You cannot debug an agent by console-logging. You need structured traces. I attach OpenTelemetry spans to each node:

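import { trace } from '@opentelemetry/api';

async function planNode(state: AgentState): Promise<Partial<AgentState>> {
  const span = trace.getTracer('agent').startSpan('planner');
  try {
    const plan = await llmPlanner.invoke({ mission: state.mission, url: state.url });
    span.setAttribute('plan.steps', plan.length);
    return { plan };
  } catch (error) {
    // Record the failure so a crashed planner call still shows up in the trace.
    span.recordException(error as Error);
    throw error;
  } finally {
    // Always end the span, otherwise it is never exported.
    span.end();
  }
}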

With tracing, you can answer questions like: “Why did the healer fire twice on the login test?” or “Which planner prompt version produces fewer generator errors?” Observability turns agent debugging from guesswork into data.
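
Once spans are exported, the first question reduces to a group-by. The sketch below works on a hypothetical flattened record shape (SpanRecord is an assumption, not the OpenTelemetry data model) that you would produce from whatever backend stores your traces.

```typescript
type NodeName = 'planner' | 'generator' | 'executor' | 'healer' | 'reporter';

// Hypothetical flattened export: one record per finished span.
interface SpanRecord {
  testId: string;
  node: NodeName;
  durationMs: number;
}

// Counts healer invocations per test.
function healerRunsByTest(spans: SpanRecord[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const span of spans) {
    if (span.node !== 'healer') continue;
    counts.set(span.testId, (counts.get(span.testId) ?? 0) + 1);
  }
  return counts;
}

// Tests where the healer fired more than once are the ones worth opening in the trace viewer.
function testsHealedMoreThanOnce(spans: SpanRecord[]): string[] {
  return Array.from(healerRunsByTest(spans).entries())
    .filter(([, runs]) => runs > 1)
    .map(([testId]) => testId);
}

const exported: SpanRecord[] = [
  { testId: 'login', node: 'executor', durationMs: 4200 },
  { testId: 'login', node: 'healer', durationMs: 900 },
  { testId: 'login', node: 'executor', durationMs: 4100 },
  { testId: 'login', node: 'healer', durationMs: 1100 },
  { testId: 'checkout', node: 'healer', durationMs: 800 },
];
console.log(testsHealedMoreThanOnce(exported)); // [ 'login' ]
```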

Playwright’s Native Agent Support in 2026

Playwright has embraced the agent pattern aggressively. In versions 1.52 and 1.53 (released mid-2025), Microsoft shipped three features that make agent pipelines significantly easier.

What’s Available Now

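AI Fix with Playwright: When a test fails, the Playwright VS Code extension offers a “Fix with AI” action, and the HTML report, trace viewer, and UI mode include a “Copy prompt” button that packages the error and page context for an LLM. Either way, you review and apply the suggested selector or timing change yourself. It is not fully autonomous, but it surfaces the exact diff you need.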
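Describable Locators: locator.describe() attaches a human-readable description to a locator, and that description shows up in the trace viewer and reports in place of the raw selector. It does not resolve natural language to a DOM element by itself, but it lets the generator carry the planner’s wording for each step into the code. This is the bridge between planner output and generator input.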
Trace Viewer Enhancements: Traces now include AI action annotations. When an agent makes a decision, the trace viewer shows the reasoning inline, not just the click coordinates.

What’s Still Experimental

Autonomous codegen: Playwright’s official codegen tool can accept a text prompt, but the output still requires manual review for complex multi-page flows.
Self-healing selectors: While the accessibility-tree fallback is robust, visual embedding-based healing is only available through third-party libraries like our BrowsingBee toolkit.

If you are starting today, I recommend building on Playwright 1.53+ with LangGraph for orchestration. Do not wait for Microsoft to ship a fully autonomous agent. The primitives are mature enough to build your own.

Building Your First Agent Pipeline in TypeScript

Let me walk you through a minimal but complete agent pipeline you can run today.

Prerequisites

Node.js 20+
Playwright 1.53+ (npm init playwright@latest)
An OpenAI or Azure OpenAI API key for the LLM backend
LangGraph (npm install @langchain/langgraph)

Step 1: Create the Planner Prompt

// planner.ts
import { ChatOpenAI } from '@langchain/openai';
import { PromptTemplate } from '@langchain/core/prompts';
import { JsonOutputParser } from '@langchain/core/output_parsers';

const plannerTemplate = PromptTemplate.fromTemplate(`
You are a QA planner. Given a URL and a goal, output a JSON array of test steps.
Each step must have: action, target, and optional assertion.
Respond with JSON only, no prose.
URL: {url}
Goal: {goal}
`);

const model = new ChatOpenAI({ model: 'gpt-4.1', temperature: 0 });
// Without a parser the chain returns a chat message, not the step array.
export const planner = plannerTemplate.pipe(model).pipe(new JsonOutputParser());

Step 2: Create the Generator Node

// generator.ts
import { ChatOpenAI } from '@langchain/openai';
import { PromptTemplate } from '@langchain/core/prompts';
import { StringOutputParser } from '@langchain/core/output_parsers';

const generatorTemplate = PromptTemplate.fromTemplate(`
Convert the following test plan into a Playwright TypeScript test.
Use getByRole, getByText, and getByLabel only. No arbitrary CSS selectors.
Plan: {plan}
`);

const model = new ChatOpenAI({ model: 'gpt-4.1', temperature: 0.1 });
// The parser unwraps the chat message so the chain returns the code as a plain string.
export const generator = generatorTemplate.pipe(model).pipe(new StringOutputParser());

Step 3: Wire the Executor and Healer

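// executor.ts
import { execSync } from 'child_process';
import { mkdirSync, writeFileSync } from 'fs';
import { dirname } from 'path';
import type { AgentState } from './state'; // assumed module that exports the AgentState type

// Playwright only runs spec files inside its configured testDir, so a file in /tmp
// is reported as "No tests found". This path assumes testDir is ./tests.
const specPath = 'tests/generated/generated.spec.ts';

export async function executeNode(state: AgentState): Promise<Partial<AgentState>> {
  if (!state.generatedCode) {
    return { testResult: 'needs_review' };
  }
  mkdirSync(dirname(specPath), { recursive: true });
  writeFileSync(specPath, state.generatedCode);
  try {
    // execSync throws on a non-zero exit code, which is what Playwright returns on failure.
    execSync(`npx playwright test ${specPath} --reporter=json`, { stdio: 'pipe' });
    return { testResult: 'passed' };
  } catch {
    return { testResult: 'failed', retryCount: state.retryCount + 1 };
  }
}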

Step 4: Run the Pipeline

npx ts-node agent.ts --url https://demo.playwright.dev/todomvc --goal "Add two todos and mark one complete"

The first run on my machine took 14 seconds end-to-end. The planner produced 5 steps. The generator wrote 28 lines of TypeScript. The test passed on the first try. When I deliberately broke a selector by renaming a class in the DOM, the healer recovered in 9 seconds using the text fallback.
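
The agent.ts entry point invoked in Step 4 is not listed in this tutorial. As a starting point, here is a minimal sketch of how its two flags might be parsed. The flag names come from the command above; everything else, including the parseAgentArgs name and the validation rules, is an assumption.

```typescript
interface AgentArgs {
  url: string;
  goal: string;
}

// Parses "--url <value> --goal <value>" in any order and rejects missing values.
function parseAgentArgs(argv: string[]): AgentArgs {
  const values = new Map<string, string>();
  for (let i = 0; i < argv.length; i += 1) {
    const flag = argv[i];
    if (flag === '--url' || flag === '--goal') {
      const value = argv[i + 1];
      if (value === undefined || value.indexOf('--') === 0) {
        throw new Error(`${flag} requires a value`);
      }
      values.set(flag, value);
      i += 1; // skip the value that was just consumed
    }
  }
  const url = values.get('--url');
  const goal = values.get('--goal');
  if (!url || !goal) {
    throw new Error('usage: agent.ts --url <url> --goal "<goal>"');
  }
  if (!/^https?:\/\//.test(url)) {
    throw new Error('--url must start with http:// or https://');
  }
  return { url, goal };
}

const args = parseAgentArgs([
  '--url', 'https://demo.playwright.dev/todomvc',
  '--goal', 'Add two todos and mark one complete',
]);
console.log(args.goal); // Add two todos and mark one complete
```

In agent.ts you would call parseAgentArgs(process.argv.slice(2)) and pass the result to the compiled graph as the initial mission and url.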

Common Errors and How to Fix Them

When you run this pipeline for the first time, you will likely hit three issues:

LLM hallucinates invalid locators: If the generator outputs page.locator('.btn-123'), it is guessing. Add a negative example to your prompt showing what not to do.
Timeout on slow pages: The executor uses default Playwright timeouts. Call test.setTimeout(60000) at the top of the test body, or raise timeout in playwright.config.ts, if your application has heavy JavaScript bundles.
State pollution between retries: If the healer retries on a page that already has form data filled, the generator might double-fill fields. Always navigate to a fresh URL before a healed retry.

Fixing these three issues took my pipeline from a 40% success rate to 91% on a real-world e-commerce application with 340 test cases.

The India Angle: What Hiring Managers Actually Want

I interview SDETs and review job descriptions every week. In 2026, the demand curve for agent-savvy QA engineers in India is sharp.

Here is what I am seeing on Naukri and LinkedIn for Bangalore, Hyderabad, and Pune:

Mid-level SDET (3–5 years): ₹12–18 LPA. Requirement: “Playwright + basic GenAI prompting.”
Senior SDET (5–8 years): ₹22–32 LPA. Requirement: “Built autonomous test agents or self-healing frameworks.”
Staff/Principal (8+ years): ₹35–55 LPA. Requirement: “LLM evaluation, agent observability, LangGraph/LangChain production experience.”

The gap is massive. Most candidates can write Page Object Model code. Few can explain why an agent loop should cap healing retries at three, or how to prevent prompt injection in a test generator. If you build even one agent pipeline and blog about it, you immediately separate yourself from 90% of the applicant pool.

Service companies like TCS and Infosys are still catching up. Product companies — Razorpay, Zerodha, Groww, PhonePe — are already hiring for agent-specific roles. My advice: skip the service companies if you want to work on this. Go where the infrastructure budget lives.

Key Takeaways

AI test agents use a three-part architecture: Planner (reasoning), Generator (codegen), and Healer (repair).
The planner prevents brittleness by analyzing DOM structure and generating fallback strategies before code is written.
The generator must be constrained with few-shot examples and strict locator rules, or it will produce flaky XPath soup.
The healer fixes 62–68% of selector failures autonomously. The rest are real bugs that need human eyes.
Use LangGraph for the agent loop. It gives you retry logic, observability, and state persistence out of the box.
Playwright 1.53+ provides native primitives like describable locators and AI-assisted fixes. Build on top of them rather than reinventing.
In India, agent experience commands a 40–60% salary premium over traditional automation skills.

FAQ

Do I need to learn LangChain before building AI test agents?

No, but it helps. You can build a simple agent loop with raw OpenAI API calls and a while statement. LangGraph becomes valuable when you need branching logic, human-in-the-loop approvals, and long-running state. Start simple, upgrade when your retry logic becomes spaghetti.
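
To make the "while statement" version concrete, here is a minimal sketch of the loop with no framework at all. The four functions in AgentDeps are placeholders you would back with real LLM and Playwright calls; their names and signatures are assumptions.

```typescript
type Outcome = 'passed' | 'failed';

// Placeholders for the real planner, generator, executor, and healer.
interface AgentDeps {
  plan: (goal: string) => Promise<string[]>;
  generate: (steps: string[]) => Promise<string>;
  execute: (code: string) => Promise<Outcome>;
  heal: (code: string) => Promise<string | null>; // null means the healer gave up
}

interface LoopResult {
  status: 'passed' | 'healed' | 'needs_review';
  attempts: number;
}

async function runAgent(goal: string, deps: AgentDeps, maxRetries = 3): Promise<LoopResult> {
  const steps = await deps.plan(goal);
  let code = await deps.generate(steps);
  let attempts = 0;
  while (true) {
    attempts += 1;
    const outcome = await deps.execute(code);
    if (outcome === 'passed') {
      // A pass on the first attempt needed no healing.
      return { status: attempts === 1 ? 'passed' : 'healed', attempts };
    }
    // One initial run plus maxRetries healed re-runs, then escalate.
    if (attempts > maxRetries) return { status: 'needs_review', attempts };
    const patched = await deps.heal(code);
    if (patched === null) return { status: 'needs_review', attempts };
    code = patched;
  }
}

// Stub dependencies: the test only passes once the code has been patched.
const stubDeps: AgentDeps = {
  plan: async () => ['open page', 'click button'],
  generate: async () => 'broken-test',
  execute: async (code) => (code === 'patched-test' ? 'passed' : 'failed'),
  heal: async () => 'patched-test',
};

runAgent('demo goal', stubDeps).then((result) => console.log(result));
```

With the stubs above, the loop fails once, heals once, and returns a healed status after two attempts. The retry cap is the part most hand-rolled loops forget.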

How much does running an agent pipeline cost in LLM tokens?

For a typical 10-step regression plan, the planner consumes ~2,000 tokens and the generator ~1,500 tokens. At GPT-4.1 pricing, that is roughly $0.08 per test. If you run 200 tests a day, your monthly LLM bill is around $480. Compare that to one manual QA engineer’s salary, and the math is obvious.
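
The monthly figure follows from the per-test cost once you fix a month length. A small sketch of that arithmetic, taking the $0.08 per test above as given and assuming a 30-day month:

```typescript
// The inputs are the round numbers quoted above, not measured values.
function monthlyLlmCostUsd(costPerTestUsd: number, testsPerDay: number, daysPerMonth = 30): number {
  // Work in whole cents so the result is not affected by floating-point drift.
  const centsPerTest = Math.round(costPerTestUsd * 100);
  return (centsPerTest * testsPerDay * daysPerMonth) / 100;
}

console.log(monthlyLlmCostUsd(0.08, 200)); // 480
console.log(monthlyLlmCostUsd(0.08, 200, 22)); // 352
```

If the pipeline only runs on working days, the bill drops to about $352. Healer retries add extra LLM calls on top of this, so treat the number as a floor.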

Can agents handle visual regression testing?

Yes, but not through the planner-generator-healer loop alone. Visual regression requires a separate comparison engine. I integrate BrowsingBee’s visual diff as a fourth node in the graph. The agent captures screenshots, the diff engine compares them, and the healer only fires if the visual delta exceeds a threshold.

What is the biggest mistake teams make when adopting AI test agents?

They remove human QA too early. Agents are excellent at writing first drafts and maintaining selectors. They are terrible at understanding business context, evaluating UX quality, and deciding whether a failure is a bug or a feature change. Keep your QA team, but retrain them to supervise agents, review traces, and handle escalations.

Where can I find more code examples for Playwright agents?

Start with the 2026 Playwright Automation Blueprint for foundational patterns. Then layer on agent logic using the LangGraph tutorial above. For LLM evaluation of agent outputs, read my comparison of DeepEval vs PromptFoo.
