AI Test Agents Explained: The Planner-Generator-Healer Architecture for QA
AI test agents are not a future concept anymore. In 2026, teams are shipping production pipelines where a single agent plans an entire regression suite, generates TypeScript tests in real time, and heals broken selectors without human intervention. If you are still writing every test by hand, you are competing against machines that never sleep.
In this tutorial, I break down the planner-generator-healer architecture that powers modern AI test agents. You will learn how each component works, how they connect into a loop, and how to build your first agent pipeline using Playwright and LangGraph. I also include what hiring managers in India are actually asking for in SDET interviews right now.
Table of Contents
- What Are AI Test Agents?
- The Planner: Thinking Before Clicking
- The Generator: Writing Tests at Machine Speed
- The Healer: Fixing What Breaks
- Wiring Them Together: The Agent Loop
- Playwright’s Native Agent Support in 2026
- Building Your First Agent Pipeline in TypeScript
- The India Angle: What Hiring Managers Actually Want
- Key Takeaways
- FAQ
What Are AI Test Agents?
Traditional test automation is a factory line. You write a script, run it, fix the flakiness, and repeat. AI test agents flip this model. Instead of executing a fixed script, an agent observes the application, decides what to test, writes the code, runs it, and repairs failures autonomously.
The Microsoft Playwright team formalized this pattern in late 2025 with three distinct roles: the Planner, the Generator, and the Healer. Each role maps to a specific LLM-powered operation, and together they form a closed loop that improves itself with every run.
From Scripts to Agents
I have been writing Selenium scripts since 2012. The mental model was always linear: locator → action → assertion. Agents do not think in lines. They think in missions.
A mission might sound like: “Verify that a new user can sign up, add a product to cart, and complete checkout using UPI.” An agent breaks this into sub-tasks, generates the code, executes it, and if a payment gateway iframe changes its ID, the agent detects the break, explores alternatives, and patches the selector.
This is not speculative. Playwright’s GitHub repository crossed 89,542 stars in May 2026, and the agent-specific discussions in their Discord have grown 340% since January. Teams are building this today.
The Three-Body Problem of Test Maintenance
Most QA teams I mentor face the same three maintenance nightmares:
- Stale selectors: A developer changes a button class from btn-primary to btn--primary, and 12 tests die.
- Context drift: The business logic changes (e.g., a new mandatory phone-verification step), but the test script still expects the old flow.
- Coverage gaps: New features ship without tests because the manual tester did not have time to automate them.
The planner-generator-healer architecture attacks all three. The planner prevents coverage gaps by reasoning about what should be tested. The generator eliminates the time bottleneck. The healer handles stale selectors and minor context drift.
The Planner: Thinking Before Clicking
The planner is the strategist. Its job is to take a high-level goal and decompose it into an ordered list of atomic test steps. This is where most “AI testing” demos fail. They jump straight to code generation without planning, which produces brittle scripts that break on the first dynamic element.
How the Planner Decomposes a Test Mission
A good planner uses a combination of:
- DOM analysis: Inspecting the page structure to understand available elements and forms.
- User-flow heuristics: Recognizing common patterns like login, search, pagination, and checkout.
- Risk scoring: Prioritizing tests that cover high-traffic paths or recently changed code.
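To make the risk-scoring idea concrete, here is a minimal sketch of one way a planner could rank flows. The field names, the log scaling, and the recency weighting are all illustrative assumptions, not the scoring any particular tool uses.

```typescript
// Hypothetical inputs: neither these fields nor the weights come from a real planner.
interface FlowCandidate {
  name: string;
  dailySessions: number;       // traffic through this flow
  daysSinceLastChange: number; // how recently the code behind it changed
}

// Higher score = test first. Traffic is log-scaled so one huge flow does not
// drown out everything else; recently changed code gets a boost that decays.
function riskScore(flow: FlowCandidate): number {
  const traffic = Math.log10(1 + flow.dailySessions);
  const recency = 1 / (1 + flow.daysSinceLastChange);
  return traffic * (1 + recency);
}

// Returns a new array sorted from highest to lowest risk.
function prioritize(flows: FlowCandidate[]): FlowCandidate[] {
  return [...flows].sort((a, b) => riskScore(b) - riskScore(a));
}
```

A busy checkout flow that changed today would outrank an equally busy search page untouched for a month, which in turn would outrank a low-traffic settings page.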
I built a planner for BrowsingBee that works like this. You give it a URL and a one-sentence goal. It returns a JSON plan:
{
  "mission": "Complete guest checkout with COD",
  "steps": [
    { "action": "navigate", "target": "/products" },
    { "action": "click", "target": "add-to-cart-button", "fallback": "text='Add to Cart'" },
    { "action": "navigate", "target": "/checkout" },
    { "action": "fill", "target": "#email", "value": "guest@example.com" },
    { "action": "select", "target": "#payment-method", "value": "COD" },
    { "action": "click", "target": "#place-order", "assertion": "url contains '/thank-you'" }
  ]
}
Notice the fallback field. The planner does not assume the primary selector will work. It generates a ranked list of locator strategies before a single line of Playwright code is written.
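Because the plan is LLM output, it is worth checking its shape before handing it to the generator. The sketch below types the JSON above and validates it; the type names and the set of allowed actions are assumptions inferred from this one example, not a published schema.

```typescript
// Shape inferred from the example plan; treat it as an assumption.
type PlanAction = 'navigate' | 'click' | 'fill' | 'select';

interface PlanStep {
  action: PlanAction;
  target: string;
  value?: string;
  fallback?: string;
  assertion?: string;
}

const ACTIONS: readonly string[] = ['navigate', 'click', 'fill', 'select'];

// Returns a list of problems; an empty list means the plan is usable.
function validatePlan(raw: unknown): string[] {
  if (typeof raw !== 'object' || raw === null) return ['plan is not an object'];
  const plan = raw as Record<string, unknown>;
  const errors: string[] = [];
  if (typeof plan.mission !== 'string' || plan.mission.trim() === '') {
    errors.push('mission is missing');
  }
  const steps = plan.steps;
  if (!Array.isArray(steps) || steps.length === 0) {
    errors.push('steps must be a non-empty array');
    return errors;
  }
  steps.forEach((s: unknown, i: number) => {
    if (typeof s !== 'object' || s === null) {
      errors.push(`step ${i}: not an object`);
      return;
    }
    const step = s as Record<string, unknown>;
    if (typeof step.action !== 'string' || !ACTIONS.includes(step.action)) {
      errors.push(`step ${i}: unknown action`);
    }
    if (typeof step.target !== 'string' || step.target === '') {
      errors.push(`step ${i}: target is missing`);
    }
    // fill and select are meaningless without something to type or choose
    if ((step.action === 'fill' || step.action === 'select') && typeof step.value !== 'string') {
      errors.push(`step ${i}: "${step.action}" needs a value`);
    }
  });
  return errors;
}
```

Rejecting a malformed plan at this point is cheaper than discovering it after the generator has already spent tokens on it.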
Real Example: Planning a Checkout Flow
Let me show you what happens when the planner hits a modern React application with dynamic loading. The planner first captures the accessibility tree using Playwright’s page.accessibility.snapshot() (this API is deprecated; locator.ariaSnapshot() is the newer alternative and returns the tree as YAML text). It then maps semantic roles (button, textbox, combobox) to probable actions.
If the checkout page has a payment iframe, the planner flags it as a context switch and inserts a frameLocator step. If a CAPTCHA appears, the planner marks the step as human-in-the-loop and routes it to a manual queue rather than failing silently.
This level of reasoning is what separates a toy demo from a production agent. A script blindly clicks. A planner understands structure.
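The role-to-action mapping can be sketched as a small tree walk. The node shape below is a simplified stand-in for an accessibility snapshot, and the 'iframe' role and the name-based CAPTCHA check are assumptions made for illustration; a real planner would need sturdier detection.

```typescript
// Simplified stand-in for an accessibility snapshot node (assumption).
interface AxNode {
  role: string;
  name: string;
  children?: AxNode[];
}

type Candidate =
  | { kind: 'action'; action: 'click' | 'fill' | 'select'; role: string; name: string }
  | { kind: 'context-switch'; frameName: string }
  | { kind: 'human-in-the-loop'; reason: string };

const ROLE_TO_ACTION: Record<string, 'click' | 'fill' | 'select'> = {
  button: 'click',
  link: 'click',
  textbox: 'fill',
  combobox: 'select',
};

// Depth-first walk that turns interactive nodes into candidate plan entries.
function collectCandidates(node: AxNode, out: Candidate[] = []): Candidate[] {
  if (/captcha/i.test(node.name)) {
    // Route to a manual queue instead of planning clicks inside the widget
    out.push({ kind: 'human-in-the-loop', reason: `CAPTCHA detected: ${node.name}` });
    return out;
  }
  if (node.role === 'iframe') {
    // Everything below this node needs a frameLocator step first
    out.push({ kind: 'context-switch', frameName: node.name });
  } else {
    const action = ROLE_TO_ACTION[node.role];
    if (action) out.push({ kind: 'action', action, role: node.role, name: node.name });
  }
  for (const child of node.children ?? []) collectCandidates(child, out);
  return out;
}
```

The output is still only a list of possibilities; the LLM decides which candidates belong in the mission and in what order.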
The Generator: Writing Tests at Machine Speed
Once the planner produces a step list, the generator converts it into executable TypeScript. The generator is essentially a code-generation LLM with a strict prompt contract. It must output valid Playwright code that compiles without human editing.
From Natural Language to TypeScript
The generator prompt I use at QASkills looks like this:
You are a Playwright test generator.
Rules:
- Use TypeScript with strict mode.
- Prefer user-facing locators: getByRole, getByText, getByLabel.
- Avoid CSS IDs unless no semantic locator exists.
- Add explicit waits only for network idle, never arbitrary sleep.
- Include a descriptive test name and a comment for each assertion.
Plan:
{{ planner_output_json }}
Generate the test:
The output is typically 40–90 lines of clean TypeScript. Here is what the generator produced for the checkout mission above:
import { test, expect } from '@playwright/test';

test('guest checkout with COD completes successfully', async ({ page }) => {
  await page.goto('/products');

  // Add first product to cart
  const addButton = page.getByRole('button', { name: 'Add to Cart' });
  await expect(addButton).toBeVisible();
  await addButton.click();

  // Navigate to checkout
  await page.goto('/checkout');

  // Fill guest email
  await page.getByLabel('Email').fill('guest@example.com');

  // Select Cash on Delivery
  await page.getByRole('combobox', { name: 'Payment Method' }).selectOption('COD');

  // Place order and confirm redirection
  await page.getByRole('button', { name: 'Place Order' }).click();
  await expect(page).toHaveURL(/.*\/thank-you.*/);
});
The generator did not invent arbitrary IDs. It used getByRole and getByLabel, which means the test will survive a CSS refactor. This is the difference between agent-generated code and record-and-replay tools that spit out brittle XPath.
Handling Edge Cases in Generation
Real applications are messy. The generator must handle:
- Dynamic lists: If the planner says “click the third product,” the generator writes page.locator('.product').nth(2) with a fallback filter.
- Multi-tab flows: The generator wraps cross-tab actions in Promise.all with context.waitForEvent('page').
- File uploads: It uses setInputFiles with a temp path rather than trying to click a hidden input.
I enforce these patterns through few-shot examples in the prompt. Without examples, the generator hallucinates page.click() on hidden elements or forgets to await async calls.
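Prompt rules can be backed by a cheap static check on the generated code before it ever runs. The following is a rough line-based sketch, not a real linter: the three rules and their regexes are assumptions chosen to mirror the prompt contract above, and the missing-await rule is a heuristic that will misfire on legitimate un-awaited calls such as page.on(...).

```typescript
interface LintFinding {
  line: number; // 1-based line number in the generated test
  rule: string;
}

// Each rule is a regex applied to one line at a time (heuristic, not a parser).
const LINE_RULES: { rule: string; pattern: RegExp }[] = [
  // Arbitrary sleeps are banned by the prompt contract
  { rule: 'no-arbitrary-sleep', pattern: /\.waitForTimeout\(/ },
  // XPath passed to locator(), either as '//...' or 'xpath=...'
  { rule: 'no-xpath', pattern: /locator\(\s*['"`](xpath=|\/\/)/ },
  // A statement that begins with page. or expect( was probably not awaited
  { rule: 'missing-await', pattern: /^\s*(page\.|expect\()/ },
];

function lintGeneratedTest(code: string): LintFinding[] {
  const findings: LintFinding[] = [];
  code.split('\n').forEach((text, index) => {
    for (const { rule, pattern } of LINE_RULES) {
      if (pattern.test(text)) findings.push({ line: index + 1, rule });
    }
  });
  return findings;
}
```

Any finding can be fed back to the generator as a correction prompt, which is usually cheaper than a full browser run that fails for the same reason.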
The Healer: Fixing What Breaks
The healer is the component that saves your sanity at 2 AM. When a test fails in CI, the healer intercepts the error, classifies the root cause, and attempts a repair before notifying a human.
Selector Recovery vs Test Logic Recovery
Not all failures are equal. The healer categorizes them into two buckets:
- Selector failures: The element exists but the locator is stale. The healer uses Playwright’s page.getBy* fallbacks, visual matching via screenshot diff, or accessibility tree re-query.
- Logic failures: The application behavior changed (e.g., a new modal blocks the checkout button). The healer cannot fix this autonomously. It files a detailed bug report with DOM snapshots, console logs, and a HAR file.
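The two buckets above imply a triage step that runs before any repair is attempted. Here is a minimal sketch that classifies a failure from its error text. The matched substrings are assumptions about how Playwright words its errors and should be checked against the messages your own version produces.

```typescript
type FailureKind = 'selector' | 'logic' | 'unknown';

// Order matters: the more specific "logic" signals are checked first because
// their call logs often also contain a generic "waiting for ..." line.
function classifyFailure(message: string): FailureKind {
  // Something (for example a new modal) is covering the target element
  if (/intercepts pointer events/i.test(message)) return 'logic';
  // A page-level or content assertion failed: the flow itself changed
  if (/toHaveURL|toHaveText|toHaveTitle/.test(message)) return 'logic';
  // The locator now matches several elements
  if (/strict mode violation/i.test(message)) return 'selector';
  // The locator never resolved before the timeout
  if (/waiting for (locator|getBy)/i.test(message)) return 'selector';
  return 'unknown';
}
```

Only 'selector' results would be passed on to the repair step; 'logic' and 'unknown' go straight to a human with the trace attached.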
For selector recovery, I use a three-strike approach:
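import type { Page } from '@playwright/test';

// findVisuallySimilarElement is my screenshot-embedding helper; its implementation is not shown here.
declare function findVisuallySimilarElement(page: Page, brokenLocator: string): Promise<string | null>;

// Escape regex metacharacters so a goal like "Pay (UPI)" is matched literally
const escapeRegExp = (s: string) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

async function healSelector(page: Page, brokenLocator: string, goal: string): Promise<string | null> {
  // Strike 1: Retry with text-based locator
  const textFallback = page.getByText(goal);
  if (await textFallback.isVisible().catch(() => false)) {
    // JSON.stringify quotes and escapes the goal so the emitted code stays valid
    return `page.getByText(${JSON.stringify(goal)})`;
  }
  // Strike 2: Use accessibility role + name
  const namePattern = new RegExp(escapeRegExp(goal), 'i');
  const roleFallback = page.getByRole('button', { name: namePattern });
  if (await roleFallback.isVisible().catch(() => false)) {
    return `page.getByRole('button', { name: ${namePattern} })`;
  }
  // Strike 3: Visual similarity via screenshot embedding
  return findVisuallySimilarElement(page, brokenLocator);
}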
If all three strikes fail, the healer marks the test as needs_human_review and pings the team on Slack with a trace archive.
When Healing Fails (And Why That’s OK)
I have seen teams expect healer agents to achieve 100% autonomous recovery. That is a fantasy. In my experience running agent pipelines for e-commerce clients, the healer fixes about 62–68% of selector failures within 30 seconds. The remaining 32–38% require human judgment because they signal real product bugs or major UI redesigns.
The healer’s real value is not eliminating human QA. It is reducing noise. A team that used to wake up to 47 broken tests now wakes up to 4 genuine failures and 43 auto-healed passes. That changes how you staff on-call rotations.
Wiring Them Together: The Agent Loop
Individual components are useless if they do not talk to each other. The agent loop is a state machine that passes context from planner → generator → executor → healer → reporter.
State Management with LangGraph
I use LangGraph to model the loop. LangGraph is ideal because it treats agent steps as nodes in a graph, with explicit edges for success, failure, and retry. Here is the graph structure I recommend:
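import { Annotation, END, START, StateGraph } from '@langchain/langgraph';

// TestStep and the five node functions (planNode, generateNode, executeNode,
// healNode, reportNode) are defined elsewhere in the project.
// Every state key must be declared, otherwise node updates are not carried forward.
const AgentStateAnnotation = Annotation.Root({
  mission: Annotation<string>(),
  url: Annotation<string>(),
  plan: Annotation<TestStep[] | null>(),
  generatedCode: Annotation<string | null>(),
  testResult: Annotation<'passed' | 'failed' | 'healed' | 'needs_review'>(),
  retryCount: Annotation<number>(),
});
type AgentState = typeof AgentStateAnnotation.State;

const workflow = new StateGraph(AgentStateAnnotation)
  .addNode('planner', planNode)
  .addNode('generator', generateNode)
  .addNode('executor', executeNode)
  .addNode('healer', healNode)
  .addNode('reporter', reportNode)
  .addEdge(START, 'planner')
  .addEdge('planner', 'generator')
  .addEdge('generator', 'executor')
  // Failed runs go to the healer until the retry cap is reached
  .addConditionalEdges('executor', (state) =>
    state.testResult === 'failed' && state.retryCount < 3 ? 'healer' : 'reporter'
  )
  // A healed test is re-executed; anything else is reported
  .addConditionalEdges('healer', (state) =>
    state.testResult === 'healed' ? 'executor' : 'reporter'
  )
  .addEdge('reporter', END);

const app = workflow.compile();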
The key insight here is the conditional edge from executor to healer. The loop retries up to three times before escalating. Without a retry cap, a badly broken test would spin forever, burning LLM tokens and CI minutes.
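The two routing decisions are plain functions, so the retry cap can be checked without an LLM or a browser. The sketch below re-implements them in isolation and simulates a test that fails on every run while the healer always claims success; the simulation harness is hypothetical and exists only to count iterations.

```typescript
type TestResult = 'passed' | 'failed' | 'healed' | 'needs_review';
type NodeName = 'executor' | 'healer' | 'reporter';

interface LoopState {
  testResult: TestResult;
  retryCount: number;
}

const MAX_RETRIES = 3;

function routeAfterExecutor(state: LoopState): 'healer' | 'reporter' {
  return state.testResult === 'failed' && state.retryCount < MAX_RETRIES ? 'healer' : 'reporter';
}

function routeAfterHealer(state: LoopState): 'executor' | 'reporter' {
  return state.testResult === 'healed' ? 'executor' : 'reporter';
}

// Worst case: the test fails every time and the healer always reports a fix.
function simulateAlwaysFailing(): { executorRuns: number; healerRuns: number } {
  let state: LoopState = { testResult: 'failed', retryCount: 0 };
  let node = 'executor' as NodeName;
  let executorRuns = 0;
  let healerRuns = 0;
  while (node !== 'reporter') {
    if (node === 'executor') {
      executorRuns++;
      // The executor bumps retryCount on every failure
      state = { testResult: 'failed', retryCount: state.retryCount + 1 };
      node = routeAfterExecutor(state);
    } else {
      healerRuns++;
      state = { ...state, testResult: 'healed' };
      node = routeAfterHealer(state);
    }
  }
  return { executorRuns, healerRuns };
}
```

Under these assumptions the loop ends after three executions and two healing attempts, so the worst-case spend per test is bounded and easy to budget for.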
Observability: Tracing Agent Decisions
You cannot debug an agent by console-logging. You need structured traces. I attach OpenTelemetry spans to each node:
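import { trace } from '@opentelemetry/api';

// llmPlanner is the planner chain; AgentState comes from the graph definition.
async function planNode(state: AgentState): Promise<Partial<AgentState>> {
  const span = trace.getTracer('agent').startSpan('planner');
  try {
    const plan = await llmPlanner.invoke({ mission: state.mission, url: state.url });
    span.setAttribute('plan.steps', plan.length);
    return { plan };
  } finally {
    // End the span even when the LLM call throws, so failed runs still appear in traces
    span.end();
  }
}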
With tracing, you can answer questions like: “Why did the healer fire twice on the login test?” or “Which planner prompt version produces fewer generator errors?” Observability turns agent debugging from guesswork into data.
Playwright’s Native Agent Support in 2026
Playwright has embraced the agent pattern aggressively. In versions 1.52 and 1.53 (released mid-2025), Microsoft shipped features that make agent pipelines significantly easier.
What’s Available Now
- AI-assisted fixes: When a test fails, the HTML report, trace viewer, and UI mode offer a “Copy prompt” button that packages the error and its context for an LLM, and the Playwright VS Code extension adds a “Fix with AI” action. Neither is fully autonomous: you review the suggested selector or timing change before it lands.
- Describable Locators: locator.describe() attaches a natural language description to a locator, and that description is shown in traces and reports. It is a convenient bridge between planner output and generator input, because the planner’s step names can travel with the generated code.
- Trace Viewer Enhancements: Traces now include AI action annotations. When an agent makes a decision, the trace viewer shows the reasoning inline, not just the click coordinates.
What’s Still Experimental
- Autonomous codegen: Playwright’s official codegen tool can accept a text prompt, but the output still requires manual review for complex multi-page flows.
- Self-healing selectors: While the accessibility-tree fallback is robust, visual embedding-based healing is only available through third-party libraries like our BrowsingBee toolkit.
If you are starting today, I recommend building on Playwright 1.53+ with LangGraph for orchestration. Do not wait for Microsoft to ship a fully autonomous agent. The primitives are mature enough to build your own.
Building Your First Agent Pipeline in TypeScript
Let me walk you through a minimal but complete agent pipeline you can run today.
Prerequisites
- Node.js 20+
- Playwright 1.53+ (npm init playwright@latest)
- An OpenAI or Azure OpenAI API key for the LLM backend
- LangGraph plus the LangChain packages imported in the steps below (npm install @langchain/langgraph @langchain/openai @langchain/core)
Step 1: Create the Planner Prompt
// planner.ts
import { ChatOpenAI } from '@langchain/openai';
import { PromptTemplate } from '@langchain/core/prompts';
const plannerTemplate = PromptTemplate.fromTemplate(`
You are a QA planner. Given a URL and a goal, output a JSON array of test steps.
Each step must have: action, target, and optional assertion.
URL: {url}
Goal: {goal}
`);
const model = new ChatOpenAI({ modelName: 'gpt-4.1', temperature: 0 });
export const planner = plannerTemplate.pipe(model);
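The chain above returns a chat message, not parsed JSON, and models often wrap JSON in a Markdown fence or add a sentence before it. Here is a minimal sketch of a tolerant extractor; the function name is hypothetical, and it assumes the reply text has already been read from the message.

```typescript
// Three backticks, built at runtime so this snippet can itself sit in a fence.
const FENCE = '`'.repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + '(?:json)?\\s*([\\s\\S]*?)' + FENCE, 'i');

// Pulls the first JSON array or object out of an LLM reply and parses it.
// Throws if nothing JSON-like is found or if the JSON is malformed.
function extractJson(reply: string): unknown {
  // Prefer the body of a fenced block when the model used one
  const fenced = reply.match(FENCED_BLOCK);
  const candidate = (fenced ? fenced[1] : reply).trim();
  // Otherwise take the outermost [...] or {...} span to skip chatty preambles
  const start = candidate.search(/[\[{]/);
  const end = Math.max(candidate.lastIndexOf(']'), candidate.lastIndexOf('}'));
  if (start === -1 || end <= start) {
    throw new Error('no JSON found in planner reply');
  }
  return JSON.parse(candidate.slice(start, end + 1));
}
```

The parsed value is still untrusted, so it should be checked for the expected step fields before it reaches the generator.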
Step 2: Create the Generator Node
// generator.ts
import { ChatOpenAI } from '@langchain/openai';
import { PromptTemplate } from '@langchain/core/prompts';
const generatorTemplate = PromptTemplate.fromTemplate(`
Convert the following test plan into a Playwright TypeScript test.
Use getByRole and getByText only. No arbitrary CSS selectors.
Plan: {plan}
`);
const model = new ChatOpenAI({ modelName: 'gpt-4.1', temperature: 0.1 });
export const generator = generatorTemplate.pipe(model);
Step 3: Wire the Executor and Healer
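// executor.ts
import { execSync } from 'node:child_process';
import { mkdirSync, writeFileSync } from 'node:fs';

// AgentState is the state type from the graph definition.
// Playwright only discovers spec files inside its configured testDir, so the
// generated test is written there instead of /tmp. Assumes testDir is ./tests.
const GENERATED_DIR = 'tests/generated';
const GENERATED_SPEC = `${GENERATED_DIR}/generated.spec.ts`;

export async function executeNode(state: AgentState): Promise<Partial<AgentState>> {
  mkdirSync(GENERATED_DIR, { recursive: true });
  writeFileSync(GENERATED_SPEC, state.generatedCode!);
  try {
    // execSync throws when Playwright exits non-zero, i.e. when the test fails
    execSync(`npx playwright test ${GENERATED_SPEC} --reporter=json`, { stdio: 'pipe' });
    return { testResult: 'passed' };
  } catch {
    return { testResult: 'failed', retryCount: state.retryCount + 1 };
  }
}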
Step 4: Run the Pipeline
npx ts-node agent.ts --url https://demo.playwright.dev/todomvc --goal "Add two todos and mark one complete"
The first run on my machine took 14 seconds end-to-end. The planner produced 5 steps. The generator wrote 28 lines of TypeScript. The test passed on the first try. When I deliberately broke a selector by renaming a class in the DOM, the healer recovered in 9 seconds using the text fallback.
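The command above passes --url and --goal to agent.ts, which is not listed in this tutorial. As a sketch of the argument handling that entry point would need, here is a small dependency-free parser; the function and type names are hypothetical.

```typescript
interface AgentArgs {
  url: string;
  goal: string;
}

// Parses ["--url", "<url>", "--goal", "<goal>"] in any order.
// Throws with a usage hint when a flag or its value is missing.
function parseAgentArgs(argv: string[]): AgentArgs {
  const values: Record<string, string> = {};
  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    if (flag === '--url' || flag === '--goal') {
      const value = argv[i + 1];
      if (value === undefined || value.startsWith('--')) {
        throw new Error(`${flag} needs a value`);
      }
      values[flag.slice(2)] = value;
      i++; // skip the value that was just consumed
    }
  }
  if (!values.url || !values.goal) {
    throw new Error('usage: agent.ts --url <url> --goal "<goal>"');
  }
  if (!/^https?:\/\//.test(values.url)) {
    throw new Error('--url must start with http:// or https://');
  }
  return { url: values.url, goal: values.goal };
}
```

In agent.ts this would typically be called as parseAgentArgs(process.argv.slice(2)), with the result used to seed the graph's initial state (mission set to the goal, retryCount set to 0).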
Common Errors and How to Fix Them
When you run this pipeline for the first time, you will likely hit three issues:
- LLM hallucinates invalid locators: If the generator outputs page.locator('.btn-123'), it is guessing. Add a negative example to your prompt showing what not to do.
- Timeout on slow pages: The executor uses default Playwright timeouts. Call test.setTimeout(60000) inside the test if your application has heavy JavaScript bundles.
- State pollution between retries: If the healer retries on a page that already has form data filled, the generator might double-fill fields. Always navigate to a fresh URL before a healed retry.
Fixing these three issues took my pipeline from a 40% success rate to 91% on a real-world e-commerce application with 340 test cases.
The India Angle: What Hiring Managers Actually Want
I interview SDETs and review job descriptions every week. In 2026, the demand curve for agent-savvy QA engineers in India is sharp.
Here is what I am seeing on Naukri and LinkedIn for Bangalore, Hyderabad, and Pune:
- Mid-level SDET (3–5 years): ₹12–18 LPA. Requirement: “Playwright + basic GenAI prompting.”
- Senior SDET (5–8 years): ₹22–32 LPA. Requirement: “Built autonomous test agents or self-healing frameworks.”
- Staff/Principal (8+ years): ₹35–55 LPA. Requirement: “LLM evaluation, agent observability, LangGraph/LangChain production experience.”
The gap is massive. Most candidates can write Page Object Model code. Few can explain why an agent loop should cap its retries at three, or how to prevent prompt injection in a test generator. If you build even one agent pipeline and blog about it, you immediately separate yourself from 90% of the applicant pool.
Service companies like TCS and Infosys are still catching up. Product companies such as Razorpay, Zerodha, Groww, and PhonePe are already hiring for agent-specific roles. My advice: skip the service companies if you want to work on this. Go where the infrastructure budget lives.
Key Takeaways
- AI test agents use a three-part architecture: Planner (reasoning), Generator (codegen), and Healer (repair).
- The planner prevents brittleness by analyzing DOM structure and generating fallback strategies before code is written.
- The generator must be constrained with few-shot examples and strict locator rules, or it will produce flaky XPath soup.
- The healer fixes 62–68% of selector failures autonomously. The rest are real bugs that need human eyes.
- Use LangGraph for the agent loop. It gives you retry logic, observability, and state persistence out of the box.
- Playwright 1.53+ provides native primitives like describable locators and AI-assisted fixes. Build on top of them rather than reinventing.
- In India, agent experience commands a 40–60% salary premium over traditional automation skills.
FAQ
Do I need to learn LangChain before building AI test agents?
No, but it helps. You can build a simple agent loop with raw OpenAI API calls and a while statement. LangGraph becomes valuable when you need branching logic, human-in-the-loop approvals, and long-running state. Start simple, upgrade when your retry logic becomes spaghetti.
How much does running an agent pipeline cost in LLM tokens?
For a typical 10-step regression plan, the planner consumes ~2,000 tokens and the generator ~1,500 tokens. At GPT-4.1 pricing, that is roughly $0.08 per test. If you run 200 tests a day, your monthly LLM bill is around $480. Compare that to one manual QA engineer’s salary, and the math is obvious.
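The monthly figure is simple arithmetic: $0.08 per test × 200 tests per day × 30 days. Here is a small sketch for plugging in your own numbers; the 30-day month is an assumption, and the per-test cost is taken as an input rather than derived from any price list.

```typescript
// Estimates the monthly LLM bill in US dollars.
// Works in whole cents to avoid floating-point drift in the result.
function monthlyLlmCostUsd(
  costPerTestUsd: number,
  testsPerDay: number,
  daysPerMonth: number = 30,
): number {
  const cents = Math.round(costPerTestUsd * 100 * testsPerDay * daysPerMonth);
  return cents / 100;
}

// The figures quoted above: $0.08 per test, 200 tests a day
const estimate = monthlyLlmCostUsd(0.08, 200);
```

Healed retries add LLM calls on top of this baseline, so the real bill grows with your failure rate.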
Can agents handle visual regression testing?
Yes, but not through the planner-generator-healer loop alone. Visual regression requires a separate comparison engine. I integrate BrowsingBee’s visual diff as a fourth node in the graph. The agent captures screenshots, the diff engine compares them, and the healer only fires if the visual delta exceeds a threshold.
What is the biggest mistake teams make when adopting AI test agents?
They remove human QA too early. Agents are excellent at writing first drafts and maintaining selectors. They are terrible at understanding business context, evaluating UX quality, and deciding whether a failure is a bug or a feature change. Keep your QA team, but retrain them to supervise agents, review traces, and handle escalations.
Where can I find more code examples for Playwright agents?
Start with the 2026 Playwright Automation Blueprint for foundational patterns. Then layer on agent logic using the LangGraph tutorial above. For LLM evaluation of agent outputs, read my comparison of DeepEval vs PromptFoo.
