AI Test Agents Need a Planner, Generator, and Healer

What AI Test Agents Are

AI test agents are not magic buttons that read a Jira ticket and ship perfect automation. They are systems that combine an LLM, tools, browser control, test data, assertions, and evaluation rules. The useful version looks less like a chatbot and more like a junior SDET with a strict checklist.

I see teams make the same mistake: they ask one prompt to do planning, coding, debugging, selector repair, and reporting in a single pass. That looks impressive in a demo. It breaks when the login page changes, when the test data is stale, or when the generated assertion checks only that “something is visible”.

The market is clearly moving in this direction. Microsoft’s Playwright repository has more than 90,000 GitHub stars, and the Playwright MCP server has more than 33,000 stars, which is a serious signal on its own. When I checked the npm downloads API for the last-month window, it showed more than 14 million downloads for @playwright/mcp and more than 158 million for @playwright/test. That does not mean every team should blindly adopt agents. It means QA engineers need to understand how these systems actually work.

For this Day 4 article in the 100 Days of AI in QA and SDET series, I want to make one point very clear: good AI test agents need separation of responsibilities. If you do not separate planning, generation, healing, and evaluation, you get flaky automation with a shiny AI label.

Why the Magic Button Mental Model Fails

The phrase “AI agent” creates unrealistic expectations. A manager hears it and imagines one click: generate tests, run tests, fix tests, create bug reports. A tester hears it and fears replacement. Both reactions miss the real engineering problem.

One prompt cannot own the whole testing lifecycle

A browser test has many decisions hidden inside it:

  • Which user journey matters for business risk?
  • Which test data is safe to use repeatedly?
  • Which selector is stable enough for CI?
  • Which assertion proves the product behavior?
  • Which failure needs a product bug versus a test fix?

When one LLM prompt tries to answer all five, the output becomes overconfident. It may create a passing test that checks the wrong thing. That is worse than a failing test because it gives the team false confidence.

The agent needs constraints, not freedom

Strong QA automation is not built by giving the agent unlimited freedom. It is built by giving it a small set of tools and forcing it to explain every decision. I want the agent to say:

  1. Here is the journey I plan to cover.
  2. Here are the selectors I will use and why.
  3. Here are the assertions that prove the outcome.
  4. Here are the risks I cannot verify automatically.
  5. Here is the evidence from the run.

That structure turns an AI demo into an engineering workflow. It also gives senior SDETs a review surface. You cannot review “AI created this test”. You can review a plan, a generated spec, a trace, and a healing diff.

Flakiness becomes harder to detect

Traditional flaky tests are annoying but visible. AI-generated flaky tests are more dangerous because they often come with confident explanations. The agent may repair a selector by picking the first matching button. It may replace a precise assertion with a weak URL check. It may ignore a timing issue and add a random timeout.

This is why I dislike the “self-healing tests” pitch when it is not backed by audit logs. Healing is useful only when the system stores what changed, why it changed, and whether the new behavior still matches the original intent.

AI Test Agents: The Planner, Generator, Healer Pattern

The pattern I prefer for AI test agents has three main roles: planner, generator, and healer. You can build them as separate prompts, separate functions, or separate nodes in a LangGraph-style flow. The important thing is that each role has a narrow job.

1. Planner: convert product intent into test intent

The planner does not write code. It reads the feature, acceptance criteria, API contract, or exploratory notes and produces a test plan. A good planner output includes:

  • Primary user journey
  • Preconditions and test data needs
  • Assertions that prove the business outcome
  • Negative cases worth automating
  • Risks that need manual review

For example, if the feature is “apply coupon during checkout”, the planner should not jump into code. It should first state that the test needs a valid user, a product in cart, a coupon with known discount rules, and an assertion on final payable amount. That is QA thinking. The LLM can assist, but the structure must come from us.

2. Generator: convert approved intent into runnable code

The generator writes Playwright, Selenium, API, or contract tests only after the plan is approved or automatically validated. It should use the team’s framework conventions: fixtures, page objects if used, test tags, retry policy, trace settings, and reporting format.

The generator must not invent random helper methods. It should inspect existing files, reuse existing selectors, and keep tests small. In Playwright, I prefer role-based locators first, data-testid second, and CSS/XPath only when the application gives no better option.
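The locator preference above can be turned into a small policy helper. This is a minimal sketch, not a Playwright API: the tier names, `classifyLocator`, and `pickPreferredLocator` are hypothetical, and grouping `getByLabel` with role-based locators is my assumption.

```typescript
// Sketch of a locator preference policy. Tier names are hypothetical,
// and treating getByLabel like getByRole is an assumption.
type LocatorTier = "user-facing" | "test-id" | "css-xpath" | "unknown";

// Lower rank means more preferred.
const TIER_RANK: Record<LocatorTier, number> = {
  "user-facing": 1,
  "test-id": 2,
  "css-xpath": 3,
  "unknown": 4,
};

// Classify a locator expression by the most preferred Playwright method it contains.
function classifyLocator(expression: string): LocatorTier {
  if (/\.getBy(Role|Label)\(/.test(expression)) return "user-facing";
  if (/\.getByTestId\(/.test(expression)) return "test-id";
  if (/\.locator\(/.test(expression)) return "css-xpath";
  return "unknown";
}

// Pick the most preferred candidate; on a tie the earlier candidate wins.
function pickPreferredLocator(candidates: string[]): string | undefined {
  let best: string | undefined;
  for (const candidate of candidates) {
    if (
      best === undefined ||
      TIER_RANK[classifyLocator(candidate)] < TIER_RANK[classifyLocator(best)]
    ) {
      best = candidate;
    }
  }
  return best;
}
```

A generator could run a check like this over its own candidate locators before writing the spec, so that CSS or XPath only survives when nothing better was found.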

3. Healer: propose small repairs with evidence

The healer runs after a failure. It does not blindly edit the test. It reads the failure message, trace, screenshots, DOM snapshot, and last known passing intent. Then it proposes a minimal change.

A responsible healer output looks like this:

  • Failure category: selector changed, assertion changed, data issue, environment issue, product bug
  • Suggested change: exact diff
  • Confidence score: low, medium, high
  • Evidence: screenshot, DOM match, trace step
  • Human review required: yes or no

If the healer cannot prove that the new selector maps to the same business element, it should refuse to auto-merge. This one rule prevents many silent failures.
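The healer output and the auto-merge rule can be sketched as a typed record plus one guard function. The field names and the exact policy are my assumptions: here only a high-confidence selector change with evidence and a proven element match is eligible.

```typescript
// Hypothetical shape for a healer proposal; field names are illustrative.
type FailureCategory =
  | "selector_changed"
  | "assertion_changed"
  | "data_issue"
  | "environment_issue"
  | "product_bug";

type Confidence = "low" | "medium" | "high";

interface HealerProposal {
  failureCategory: FailureCategory;
  suggestedDiff: string; // exact diff the healer wants to apply
  confidence: Confidence;
  evidence: string[]; // e.g. screenshot path, DOM match, trace step
  sameBusinessElementProven: boolean;
  humanReviewRequired: boolean;
}

// Assumed policy: auto-merge only a proven, high-confidence selector repair.
function canAutoMerge(proposal: HealerProposal): boolean {
  return (
    proposal.failureCategory === "selector_changed" &&
    proposal.confidence === "high" &&
    proposal.evidence.length > 0 &&
    proposal.sameBusinessElementProven &&
    !proposal.humanReviewRequired
  );
}
```

Anything that fails this guard still produces a proposal. It just goes to a reviewer instead of straight into the branch.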

Why this pattern works

The planner-generator-healer pattern gives you checkpoints. It does not remove testers from the loop. It moves testers into better review positions. Instead of typing boilerplate for three hours, an SDET reviews intent, verifies generated assertions, and approves safe repairs.

This is the difference between AI-assisted testing and AI-chaotic testing.

A Concrete Playwright Example

Let us make this practical. Imagine we want an agent to create a checkout coupon test. The bad prompt is: “Write a Playwright test for coupon checkout.” The better workflow gives the agent a contract.

Planner prompt contract

You are the planner for a QA automation workflow.
Do not write code.
Return JSON only.
Feature: Coupon checkout
User goal: A logged-in user applies coupon QA10 and sees 10% discount.
Required output:
- journey_name
- preconditions
- test_steps
- assertions
- data_dependencies
- risks

The planner should return something like this:

{
  "journey_name": "logged_in_user_applies_valid_coupon",
  "preconditions": ["user exists", "cart has one eligible product", "coupon QA10 is active"],
  "test_steps": [
    "login as test user",
    "add eligible product to cart",
    "apply coupon QA10",
    "open order summary"
  ],
  "assertions": [
    "coupon row is visible",
    "discount equals 10 percent of item subtotal",
    "final payable amount equals subtotal minus discount plus shipping"
  ],
  "data_dependencies": ["stable test user", "known product SKU", "coupon configuration"],
  "risks": ["tax/shipping rules may vary by region"]
}
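Before the plan reaches the generator, it helps to validate the planner output mechanically. The following is a minimal sketch that checks the six fields from the prompt contract. The function name `validatePlan` and the rule that `assertions` must not be empty are my assumptions, not part of any tool.

```typescript
// Fields the planner contract requires as lists of strings.
const REQUIRED_LIST_FIELDS = [
  "preconditions",
  "test_steps",
  "assertions",
  "data_dependencies",
  "risks",
] as const;

interface PlanValidation {
  ok: boolean;
  errors: string[];
}

// Parse raw planner output and report every contract violation found.
function validatePlan(rawPlannerOutput: string): PlanValidation {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawPlannerOutput);
  } catch {
    return { ok: false, errors: ["planner output is not valid JSON"] };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, errors: ["planner output must be a JSON object"] };
  }
  const plan = parsed as Record<string, unknown>;
  const errors: string[] = [];

  const journeyName = plan["journey_name"];
  if (typeof journeyName !== "string" || journeyName.trim() === "") {
    errors.push("journey_name must be a non-empty string");
  }
  for (const field of REQUIRED_LIST_FIELDS) {
    const value = plan[field];
    if (!Array.isArray(value) || !value.every((item) => typeof item === "string")) {
      errors.push(`${field} must be an array of strings`);
    } else if (field === "assertions" && value.length === 0) {
      // Assumed rule: a plan without assertions is not worth generating code for.
      errors.push("assertions must not be empty");
    }
  }
  return { ok: errors.length === 0, errors };
}
```

A plan that fails validation goes back to the planner with the error list. It never reaches the generator.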

Generator output in Playwright TypeScript

import { test, expect } from '@playwright/test';

test('logged in user applies QA10 coupon', async ({ page }) => {
  await page.goto('/login');
  await page.getByLabel('Email').fill(process.env.QA_USER_EMAIL!);
  await page.getByLabel('Password').fill(process.env.QA_USER_PASSWORD!);
  await page.getByRole('button', { name: 'Sign in' }).click();

  await page.goto('/products/qa-stable-sku');
  await page.getByRole('button', { name: 'Add to cart' }).click();
  await page.getByRole('link', { name: 'Cart' }).click();

  await page.getByLabel('Coupon code').fill('QA10');
  await page.getByRole('button', { name: 'Apply coupon' }).click();

  const subtotal = await page.getByTestId('cart-subtotal').innerText();
  const discount = await page.getByTestId('coupon-discount').innerText();
  const total = await page.getByTestId('order-total').innerText();

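  // Web-first assertions return a promise; without await the test can finish before the check resolves.
  await expect(page.getByText('Coupon QA10 applied')).toBeVisible();
  expect(toMoney(discount)).toBeCloseTo(toMoney(subtotal) * 0.10, 2);
  // Weak placeholder: the plan requires total = subtotal - discount + shipping.
  // Tighten this once the page exposes a shipping value to read.
  expect(toMoney(total)).toBeGreaterThan(0);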
});

// Strips currency symbols and thousands separators; assumes '.' is the decimal mark.
function toMoney(value: string): number {
  return Number(value.replace(/[^0-9.]/g, ''));
}

This is not perfect production code, but it shows the point. The final total check, for example, is still weaker than the plan demands, and a reviewer should catch that. The generated test is tied to a plan. It uses accessible locators where possible and test IDs for values that are not user-facing labels. A senior SDET can now review the behavior instead of guessing what the model intended.

Healer rule for selector changes

If getByLabel('Coupon code') fails:
1. Inspect DOM snapshot and screenshot.
2. Search for an input near text containing Coupon, Promo, Voucher.
3. Prefer accessible name over CSS.
4. Return a diff only if the new selector maps to the same field.
5. Mark human_review_required=true if there are multiple candidates.

Notice the word “same”. Healing must preserve intent. If the coupon field changed to a gift card field, the healer should not patch the test. It should flag a product or UX change for review.
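The healer rule can be sketched as a pure function over the fields found in the DOM snapshot. This is a simplification under stated assumptions: `FieldCandidate` and `proposeCouponLocator` are hypothetical names, and the accessible name stands in for "an input near text containing Coupon, Promo, Voucher".

```typescript
// Hypothetical summary of an input found in the DOM snapshot.
interface FieldCandidate {
  accessibleName: string; // label text resolved for the input
  inputType: string;
}

interface HealDecision {
  newLocator: string | null;
  humanReviewRequired: boolean;
  reason: string;
}

const COUPON_KEYWORDS = ["coupon", "promo", "voucher"];

// Propose a replacement locator only when exactly one field plausibly maps
// to the original coupon input; otherwise hand the decision to a human.
function proposeCouponLocator(candidates: FieldCandidate[]): HealDecision {
  const matches = candidates.filter((candidate) =>
    COUPON_KEYWORDS.some((keyword) =>
      candidate.accessibleName.toLowerCase().includes(keyword),
    ),
  );
  if (matches.length !== 1) {
    return {
      newLocator: null,
      humanReviewRequired: true,
      reason:
        matches.length === 0
          ? "no field matches the coupon intent; possible product or UX change"
          : "multiple candidate fields; cannot prove which one is the same field",
    };
  }
  const match = matches[0]!;
  return {
    // Prefer the accessible name over CSS, as the rule requires.
    newLocator: `page.getByLabel(${JSON.stringify(match.accessibleName)})`,
    humanReviewRequired: false,
    reason: "single field matches the coupon intent",
  };
}
```

If the page now shows only a gift card field, the function returns no locator and asks for review, which is the behavior the rule is meant to guarantee.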

The Evaluation Layer Most Teams Skip

The missing layer in many AI test agents is evaluation. Teams evaluate the final Playwright run but not the agent’s reasoning. That is incomplete.

Promptfoo has more than 22,000 GitHub stars and describes itself as a way to test prompts, agents, and RAG systems with CI/CD integration. DeepEval has more than 16,000 GitHub stars and focuses on LLM evaluation. These tools matter because AI workflows need tests around the test generator itself.

What to evaluate

I would start with five checks:

  • Plan quality: Does the plan include business assertions?
  • Code safety: Does the generated test avoid hard-coded secrets?
  • Selector quality: Does it prefer stable locators?
  • Assertion strength: Does the test prove behavior, not just page load?
  • Healing safety: Does every repair include evidence and a small diff?

You can automate part of this with static checks. For example, reject tests that use waitForTimeout unless there is a written exception. Reject tests that contain passwords. Reject tests that have no expect. Reject healer diffs that modify more than a small number of lines.
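These static checks are cheap to sketch with regular expressions. The version below is deliberately crude and makes several assumptions: the password heuristic only catches a quoted literal typed into a field labelled "Password", the threshold of six changed lines is arbitrary, and the written-exception mechanism for `waitForTimeout` is left out.

```typescript
// Assumed threshold for what counts as a "small" healer diff.
const MAX_HEALER_CHANGED_LINES = 6;

// Return a list of policy violations found in generated test source.
function checkGeneratedTest(source: string): string[] {
  const violations: string[] = [];
  if (/\bwaitForTimeout\s*\(/.test(source)) {
    violations.push("waitForTimeout used");
  }
  if (!/\bexpect\s*\(/.test(source)) {
    violations.push("no expect() assertion found");
  }
  // Crude heuristic: a quoted literal filled straight into a password field.
  if (/getByLabel\(\s*['"]Password['"]\s*\)\s*\.fill\(\s*['"`]/.test(source)) {
    violations.push("password appears to be hard-coded");
  }
  return violations;
}

// Count added and removed lines in a unified diff, ignoring file headers.
function countChangedLines(unifiedDiff: string): number {
  return unifiedDiff.split("\n").filter(
    (line) =>
      (line.startsWith("+") && !line.startsWith("+++")) ||
      (line.startsWith("-") && !line.startsWith("---")),
  ).length;
}

function healerDiffIsSmall(unifiedDiff: string): boolean {
  return countChangedLines(unifiedDiff) <= MAX_HEALER_CHANGED_LINES;
}
```

A real gate would parse the TypeScript instead of matching text, but even this level of checking rejects the most common weak outputs before a human looks at them.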

A simple Promptfoo-style check

description: Evaluate Playwright test generator output
prompts:
  - file://prompts/generate-playwright-test.txt
providers:
  - openai:gpt-4.1-mini
tests:
  - vars:
      feature: "Apply coupon QA10 during checkout"
    assert:
      - type: contains
        value: "expect("
      - type: not-contains
        value: "waitForTimeout"
      - type: javascript
        value: "output.includes('getByRole') || output.includes('getByLabel')"

The exact syntax may change by tool and version, but the idea is simple: test the agent before trusting the tests created by the agent.

AI Test Agents in CI: A Workflow That Does Not Lie

I do not want an AI agent pushing random commits to the main branch. I want it to behave like a disciplined contributor.

My preferred CI flow

  1. Planner creates a JSON test plan from a ticket or spec.
  2. Reviewer or policy check approves the plan.
  3. Generator creates a Playwright spec in a branch.
  4. Static checks inspect selectors, secrets, waits, and assertions.
  5. Playwright runs with trace on failure.
  6. Healer proposes a patch only for safe categories.
  7. Human review approves merge for new tests and medium-risk repairs.

This flow is slower than a viral demo. It is also far more useful for teams that own revenue-critical flows.
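The ordering matters more than the tooling, so here is a minimal sketch of the flow as a fail-fast gate runner. The `Gate` interface and `runGates` are hypothetical. Real gates would be asynchronous CI jobs, and I keep them synchronous only to make the control flow obvious.

```typescript
// Hypothetical gate: a named step that either passes or blocks the pipeline.
interface Gate {
  name: string;
  check: () => boolean;
}

interface PipelineResult {
  passedGates: string[];
  failedAt: string | null;
}

// Run gates in order and stop at the first failure, so later stages
// never run on output that an earlier stage rejected.
function runGates(gates: Gate[]): PipelineResult {
  const passedGates: string[] = [];
  for (const gate of gates) {
    if (!gate.check()) {
      return { passedGates, failedAt: gate.name };
    }
    passedGates.push(gate.name);
  }
  return { passedGates, failedAt: null };
}

// Example with stubbed checks: the static checks fail, so the
// Playwright run and the healer are never reached.
const exampleResult = runGates([
  { name: "plan-approved", check: () => true },
  { name: "static-checks", check: () => false },
  { name: "playwright-run", check: () => true },
]);
```

In this example `exampleResult.failedAt` is `"static-checks"` and only `"plan-approved"` appears in `passedGates`.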

Where MCP fits

The Model Context Protocol helps agents connect to tools in a consistent way. Playwright MCP gives agents browser capabilities through a server instead of asking the model to imagine the page. That is a big step because the agent can inspect actual browser state.

Still, MCP is plumbing, not quality. Browser access does not guarantee good assertions. It only gives the agent better evidence. Your QA process still needs intent, constraints, review, and metrics.

Metrics I track

  • Percentage of generated tests accepted without major rewrite
  • Flake rate of agent-generated tests versus human-written tests
  • Number of healer patches auto-approved
  • Number of healer patches rejected by reviewers
  • Defects caught by generated tests in staging or production-like runs

If a team cannot measure these, it should not claim that agents improved quality. It can say the experiment is promising. That is honest.
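The rate-style metrics are simple ratios once the raw counts are collected. The sketch below assumes a hypothetical `AgentStats` record. How you count a "flaky run" or a "major rewrite" is a team decision the code does not make for you.

```typescript
// Hypothetical raw counts collected over one reporting period.
interface AgentStats {
  generatedTests: number;
  acceptedWithoutMajorRewrite: number;
  agentTestRuns: number;
  agentTestFlakyRuns: number;
  humanTestRuns: number;
  humanTestFlakyRuns: number;
}

interface AgentMetrics {
  acceptanceRate: number | null;
  agentFlakeRate: number | null;
  humanFlakeRate: number | null;
}

// Return null instead of dividing by zero when there is no data yet.
function rate(part: number, whole: number): number | null {
  return whole > 0 ? part / whole : null;
}

function summarizeAgentMetrics(stats: AgentStats): AgentMetrics {
  return {
    acceptanceRate: rate(stats.acceptedWithoutMajorRewrite, stats.generatedTests),
    agentFlakeRate: rate(stats.agentTestFlakyRuns, stats.agentTestRuns),
    humanFlakeRate: rate(stats.humanTestFlakyRuns, stats.humanTestRuns),
  };
}
```

Healer patch counts and defects caught are plain counters, so they can be reported as they are next to these ratios.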

India Career Context for SDETs

For QA engineers in India, this shift is important. The gap between service-company automation and product-company SDET work is already visible. AI increases that gap.

A manual tester who only prompts “write test cases for login page” will not stand out. An SDET who can design a planner-generator-healer workflow, add evaluation checks, and integrate it with Playwright CI will stand out in interviews.

What hiring managers will ask

I expect more SDET interviews to include questions like:

  • How do you validate LLM-generated test cases?
  • How do you prevent self-healing tests from hiding product bugs?
  • What is the difference between flaky test repair and product behavior change?
  • How would you evaluate a prompt that generates Playwright specs?
  • How do you keep secrets and test data safe in an AI workflow?

For mid-level engineers targeting ₹25-40 LPA product roles, these are better talking points than saying “I used ChatGPT to write tests”. Companies do not pay for prompt copy-paste. They pay for judgment.

What to learn first

Do not start by learning every AI framework. Start with fundamentals:

  1. Playwright TypeScript with fixtures and traces
  2. API testing and contract thinking
  3. Prompt evaluation basics with Promptfoo or DeepEval
  4. CI/CD gates in GitHub Actions or Jenkins
  5. Basic agent architecture: planner, tool use, memory, evaluator

If you want adjacent reading, I have already written about DeepEval vs PromptFoo for SDETs, Playwright AI test generation, and API testing with AI agents. Read those after this article if you want the broader map.

A 30-Day Implementation Roadmap

Here is a practical roadmap for a small QA team. Do not attempt to automate the entire regression suite in week one.

Week 1: choose one stable flow

Pick one flow that matters but does not depend on five unstable systems. Login is often too simple. End-to-end payment is often too risky. A good middle path is search, cart, profile update, or coupon validation in a test environment.

  • Write the human-approved test intent.
  • Collect stable test data.
  • Define selector rules.
  • Define what the agent must never do.

Week 2: build the planner and generator

Create two prompts or two functions. Store planner output as JSON. Store generator output as a branch or patch. Make the generated code pass lint and Playwright locally before it enters CI.

Week 3: add evaluation gates

Add cheap checks first. Does the test contain assertions? Does it avoid sleeps? Does it use allowed locator patterns? Does it keep secrets out of code? These checks catch many weak outputs before a human wastes time reviewing them.

Week 4: add limited healing

Start with selector repair only. Do not let the healer rewrite assertions or skip steps. Require evidence. Store every patch. Review rejected patches every Friday. That review becomes your training data for better policies.
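The "selector repair only" rule can be partly enforced on the healer diff itself. This is a heuristic sketch with hypothetical function names: it flags any changed line containing `expect(` and treats a patch that removes more lines than it adds as a possibly skipped step.

```typescript
// Split a unified diff into added and removed lines, ignoring file headers.
function changedLines(unifiedDiff: string): { added: string[]; removed: string[] } {
  const added: string[] = [];
  const removed: string[] = [];
  for (const line of unifiedDiff.split("\n")) {
    if (line.startsWith("+++") || line.startsWith("---")) continue;
    if (line.startsWith("+")) added.push(line.slice(1));
    else if (line.startsWith("-")) removed.push(line.slice(1));
  }
  return { added, removed };
}

// Return the reasons a healer patch goes beyond selector repair (empty = acceptable).
function reviewHealerPatch(unifiedDiff: string): string[] {
  const { added, removed } = changedLines(unifiedDiff);
  const problems: string[] = [];
  if ([...added, ...removed].some((line) => /\bexpect\s*\(/.test(line))) {
    problems.push("patch touches an assertion");
  }
  if (removed.length > added.length) {
    problems.push("patch removes more lines than it adds; a step may have been skipped");
  }
  return problems;
}
```

Every rejected patch, together with its list of problems, is worth storing. That record is what the Friday review works from.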

This 30-day roadmap is not flashy. It is realistic. It teaches your team how to trust AI slowly and safely.

Key Takeaways: Treat AI Test Agents Like Systems

AI test agents can help QA teams move faster, but only when we treat them as engineered systems. The magic button mindset creates fragile tests and false confidence.

  • Separate planner, generator, and healer roles.
  • Make the planner write test intent before code exists.
  • Force the generator to follow framework rules.
  • Allow healing only with evidence and small diffs.
  • Evaluate the agent, not just the Playwright run.

My opinion is simple: the future SDET is not replaced by an AI test agent. The future SDET designs the agent, tests the agent, and decides when the agent is wrong.

FAQ

Are AI test agents ready for production QA teams?

They are ready for controlled workflows, not blind autonomy. Use them for planning assistance, test generation drafts, failure triage, and limited selector healing. Keep human review for business-critical flows.

Should I use Playwright MCP for AI testing?

Use it if your agent needs real browser context. Playwright MCP helps the agent inspect pages and perform browser actions. It does not replace good test design or evaluation.

Can self-healing tests hide real bugs?

Yes. A healer can accidentally change the test to match broken product behavior. That is why every healing action needs evidence, a small diff, and review rules.

What should manual testers learn first?

Learn test design, Playwright basics, API testing, and prompt evaluation. Do not start with complex agent frameworks before you can judge whether the generated test is correct.

Which tools should I explore next?

Start with Playwright, Playwright MCP, Promptfoo, and DeepEval. Add LangGraph or another orchestration tool only after you understand the workflow boundaries.
