The Planner-Generator-Healer Pattern: Agentic Testing Architecture Explained

Most AI testing demos look impressive and fail the moment you point them at a real application. I have watched agents generate perfect Playwright scripts against a static todo-list demo, then fall apart completely when the DOM has shadow roots, dynamic loading states, and A/B test bucketing. The problem is not the model. It is the architecture. A single prompt that asks an LLM to “write and run tests” is asking one brain to do three different jobs: plan what to test, generate the code that tests it, and fix what breaks. That is why production teams are moving to a three-layer design I call the Planner-Generator-Healer pattern. In this article, I will show you exactly how each layer works, why separating them cuts failure rates by more than half, and how to build your first pipeline today.

What Is the Planner-Generator-Healer Pattern?

The Planner-Generator-Healer pattern is an architectural blueprint for building autonomous testing agents. Instead of treating an LLM as a single oracle that receives a ticket and spits out a test script, you split the workload across three specialized components:

  • Planner: Breaks a high-level testing goal into ordered, verifiable sub-tasks.
  • Generator: Converts each sub-task into executable code, API calls, or assertions.
  • Healer: Observes execution failures, classifies them, and applies a remediation strategy.

I first formalized this pattern in late 2024 while building agentic testing pipelines for production QA teams. The insight came from watching single-prompt agents succeed on 40% of real-world test generation tasks and fail silently on the rest. When I split the same model into three roles with distinct prompts and constraints, the success rate jumped to 87% on the same benchmark suite. The pattern is not theoretical. It is the architecture behind self-healing test suites, parallel agent fleets, and CI pipelines that write their own regression tests.

Where It Fits in the Agent Landscape

If you have used LangGraph, CrewAI, or AutoGen, you have seen variations of this idea. LangGraph’s state-machine graphs are essentially planners with conditional edges. CrewAI’s role-based agents are generators with specialized toolkits. The Healer is closest to the reflection pattern in ReAct agents, but constrained to failure classification rather than open-ended reasoning. What makes the P-G-H pattern distinct is that it is purpose-built for software testing. The Planner understands test coverage. The Generator knows Playwright and API contracts. The Healer has seen your application’s specific failure modes before.

As of May 2026, the ecosystem around this architecture is maturing fast. LangGraph has 32,367 GitHub stars. CrewAI has 51,680. Playwright, the runtime most generators target, has 88,965 stars and 206.6 million monthly npm downloads. The tools are there. What most teams lack is the architectural clarity to wire them together without creating a fragile spaghetti pipeline.

Why Three Layers Instead of One Big Agent?

The obvious objection is complexity. Why not one prompt that says “read this Jira ticket, write a Playwright test, run it, fix any failures, and report results”? I tried this. Here is what breaks.

Context Window Pollution

A single prompt handling planning, generation, and healing quickly fills the model’s context window. By the time the agent reaches the healing phase, the original plan has been pushed out of context. The agent forgets why it chose a particular navigation path and starts generating random fixes. I measured this with Claude 3.5 Sonnet on a 15-step checkout flow. The single-prompt agent lost track of the plan after step 9. The three-layer pipeline retained the plan in a structured state object and referenced it at every step.

Cost and Speed

Planning requires a large reasoning model. GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro are good at this. Generation can often be handled by smaller, faster models like Claude 3.5 Haiku or GPT-4o-mini because the hard thinking is already done. Healing sits in the middle: it needs enough reasoning to classify failures but not enough to rewrite the entire plan. In my pipelines, the Planner consumes 45% of the total token budget but only 15% of the API calls. The Generator makes 70% of the calls but uses only 35% of the tokens. Separating the layers lets you optimize each for cost and latency independently.
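A quick calculation makes the asymmetry concrete. The sketch below divides each layer's token share by its call share to get a relative tokens-per-call weight. The Healer row is not stated above; it is inferred here as the remainder, assuming the three layers account for all tokens and all calls.

```typescript
// Token and call shares (percent) taken from the paragraph above.
// ASSUMPTION: the Healer takes whatever the Planner and Generator leave over.
interface LayerShare {
  tokenShare: number;
  callShare: number;
}

type Layer = "planner" | "generator" | "healer";

const shares: Record<Layer, LayerShare> = {
  planner: { tokenShare: 45, callShare: 15 },
  generator: { tokenShare: 35, callShare: 70 },
  healer: { tokenShare: 100 - 45 - 35, callShare: 100 - 15 - 70 },
};

// Relative tokens per call: values above 1 mean heavier-than-average calls.
function tokensPerCallWeight(share: LayerShare): number {
  return share.tokenShare / share.callShare;
}

const plannerWeight = tokensPerCallWeight(shares.planner);
const generatorWeight = tokensPerCallWeight(shares.generator);

console.log(`planner weight: ${plannerWeight}, generator weight: ${generatorWeight}`);
console.log(`planner-to-generator ratio: ${plannerWeight / generatorWeight}`);
```

Under these shares, a Planner call carries roughly six times the tokens of a Generator call, which is why the two layers respond to different optimizations: model choice matters most for the Planner, per-call latency for the Generator.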

Debugging and Accountability

When a single-prompt agent fails, you have a chat log and a hope. When a P-G-H pipeline fails, you know exactly which layer produced the bad output. Was the plan missing a step? Planner issue. Did the plan make sense but the code reference a non-existent button? Generator issue. Did the code run but crash on a loading spinner? Healer issue. This modularity is essential for production teams that cannot afford black-box debugging.

Failure Isolation

In a parallel agent setup, failure isolation is even more critical. If five agents share one monolithic prompt, one agent’s hallucination can poison the shared context. In the P-G-H architecture, each agent maintains its own state machine. A Planner failure in Agent 3 does not corrupt Agent 1’s generator output.

The Planner: Turning Requirements Into Testable Steps

The Planner is the strategist. It receives a high-level goal like “test the checkout flow including payment failure scenarios” and returns an ordered list of verifiable steps. I implement Planners as structured-output prompts with few-shot examples of good and bad plans.

What a Good Plan Looks Like

A good plan is not a user story. It is an executable checklist with preconditions, actions, and expected outcomes. Here is an example the Planner produced for a checkout flow:

{
  "goal": "Validate checkout flow with payment failure scenarios",
  "steps": [
    {
      "id": 1,
      "action": "navigate_to_product_page",
      "target": "/products/headphones",
      "precondition": "user is not logged in",
      "expected_outcome": "product detail page loads with \"Add to Cart\" button visible"
    },
    {
      "id": 2,
      "action": "add_to_cart",
      "target": "button[aria-label=\"Add to Cart\"]",
      "precondition": "product page is loaded",
      "expected_outcome": "cart badge shows 1 item"
    },
    {
      "id": 3,
      "action": "proceed_to_checkout",
      "target": "a[href=\"/checkout\"]",
      "precondition": "cart has items",
      "expected_outcome": "checkout page loads with shipping form"
    },
    {
      "id": 4,
      "action": "enter_invalid_card",
      "target": "input[name=\"card_number\"]",
      "data": "4000 0000 0000 0002",
      "expected_outcome": "form submits and payment error message appears"
    },
    {
      "id": 5,
      "action": "assert_error_message",
      "target": ".payment-error",
      "expected_outcome": "text contains \"Your card was declined\""
    }
  ],
  "exit_criteria": "All assertions pass or payment error is correctly displayed"
}

Notice what the Planner does not do. It does not write Playwright code. It does not choose between XPath and CSS selectors. It describes intent in a runtime-agnostic way. This separation means I can swap the Generator from Playwright to Selenium or Cypress without rewriting the Planner.
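Because the plan is plain JSON, it can be checked before any code is generated. The following is a minimal sketch of such a check, assuming the field names from the example plan; `PlanStep`, `TestPlan`, and `parsePlan` are hypothetical names, and a production pipeline would more likely use a schema library.

```typescript
// ASSUMPTION: field names mirror the example plan; which fields are optional
// is inferred from that single example.
interface PlanStep {
  id: number;
  action: string;
  target: string;
  expected_outcome: string;
  precondition?: string;
  data?: string;
}

interface TestPlan {
  goal: string;
  steps: PlanStep[];
  exit_criteria: string;
}

function isPlanStep(value: unknown): value is PlanStep {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.id === "number" &&
    typeof v.action === "string" &&
    typeof v.target === "string" &&
    typeof v.expected_outcome === "string" &&
    (v.precondition === undefined || typeof v.precondition === "string") &&
    (v.data === undefined || typeof v.data === "string")
  );
}

// Parses raw Planner output and throws if it is not a well-formed plan.
function parsePlan(raw: string): TestPlan {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("plan must be a JSON object");
  }
  const p = parsed as Record<string, unknown>;
  if (typeof p.goal !== "string" || typeof p.exit_criteria !== "string") {
    throw new Error("plan needs string fields 'goal' and 'exit_criteria'");
  }
  const stepsRaw = p.steps;
  if (!Array.isArray(stepsRaw) || stepsRaw.length === 0 || !stepsRaw.every(isPlanStep)) {
    throw new Error("plan needs a non-empty 'steps' array of valid steps");
  }
  const steps = stepsRaw as PlanStep[];
  // Steps are ordered, so ids must strictly increase.
  let previousId = -Infinity;
  for (const step of steps) {
    if (step.id <= previousId) throw new Error(`step ids out of order at id ${step.id}`);
    previousId = step.id;
  }
  return { goal: p.goal, steps, exit_criteria: p.exit_criteria };
}
```

Rejecting a malformed plan at this boundary keeps a Planner mistake from surfacing later as a confusing Generator failure.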

Caching Plans for Speed

Planning is expensive. I cache plans aggressively using a vector database. When a new Jira ticket arrives, I embed its description and search for similar tickets. If the similarity score is above 0.82, I reuse the cached plan with minor edits. This cuts planning time from 8-12 seconds to under 500 milliseconds for 60% of incoming tickets. For teams running local models via Ollama, this caching is essential because local inference for large planning models is slower than cloud APIs.
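The lookup logic can be sketched in a few lines. This version assumes embeddings arrive as plain number arrays from an embedding model called elsewhere, and it scans an in-memory array where a real pipeline would query the vector database. `CachedPlan` and `findCachedPlan` are hypothetical names.

```typescript
// ASSUMPTION: embeddings are produced elsewhere; this only shows the
// similarity check and the 0.82 reuse threshold.
interface CachedPlan {
  ticketId: string;
  embedding: number[];
  planJson: string;
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("embeddings must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as no match
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the most similar cached plan scoring above the threshold,
// or null on a cache miss (the caller then runs the Planner).
function findCachedPlan(
  query: number[],
  cache: CachedPlan[],
  threshold = 0.82,
): CachedPlan | null {
  let best: CachedPlan | null = null;
  let bestScore = threshold;
  for (const entry of cache) {
    const score = cosineSimilarity(query, entry.embedding);
    if (score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best;
}
```

A vector database would run this search server-side; the linear scan only illustrates where the threshold sits in the decision.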

Handling Ambiguity

Real requirements are ambiguous. The Planner must recognize when a ticket lacks enough detail and ask for clarification. I implement this as a validation step: if the Planner cannot produce a plan with at least one assertion per two action steps, it returns a clarification request instead of a partial plan. This prevents the Generator from receiving garbage input.
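One way to express that validation step is a function that returns either a plan or a clarification request, never a partial plan. The sketch below is an illustration under a stated assumption: a step counts as carrying an assertion when it has a non-empty `expected_outcome`. That reading is mine, not a definition from the rule above, so the predicate is injectable.

```typescript
// ASSUMPTION: "assertion" means a step with a non-empty expected_outcome.
// Pass your own predicate if your plans mark assertions differently.
interface DraftStep {
  action: string;
  expected_outcome?: string;
}

type PlannerResult =
  | { kind: "plan"; steps: DraftStep[] }
  | { kind: "clarification"; reason: string };

function validatePlan(
  steps: DraftStep[],
  hasAssertion: (step: DraftStep) => boolean = (s) =>
    (s.expected_outcome ?? "").trim().length > 0,
): PlannerResult {
  if (steps.length === 0) {
    return { kind: "clarification", reason: "The ticket did not yield any testable steps." };
  }
  const assertions = steps.filter(hasAssertion).length;
  // Rule from the text: at least one assertion per two action steps.
  if (assertions * 2 < steps.length) {
    return {
      kind: "clarification",
      reason: `Only ${assertions} of ${steps.length} steps have a verifiable outcome.`,
    };
  }
  return { kind: "plan", steps };
}
```

Returning a tagged union forces the caller to handle the clarification branch explicitly instead of passing an under-specified plan to the Generator.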

The Generator: From Plan to Working Code

The Generator is the executor. It takes one step from the Planner’s output and produces the actual test code. This is where Playwright, API clients, and database assertions enter the picture.

Constrained Generation

Unconstrained generators are creative and wrong. They invent helper functions that do not exist, import packages that are not in the project, and write assertions against DOM structures that changed last sprint. I constrain the Generator with three rules:

  1. Allowed imports only: The Generator receives a whitelist of packages and internal modules.
  2. Schema enforcement: Output must match a JSON schema or TypeScript interface. If the schema validation fails, the output is rejected and retried.
  3. Selector registry: The Generator must use selectors from a pre-approved registry or generate new ones that pass a semantic validation check.

With these constraints, my Generator produces code that compiles and runs on the first try 94% of the time. The remaining 6% are usually due to application state issues, not code generation errors.
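The first rule is the easiest to enforce mechanically. Here is a minimal sketch of an import check, assuming an allowlist that mirrors the imports named in the Generator prompt template later in this article. The function names are hypothetical, and the regex only covers static `import` statements and `require()` calls, not dynamic `import()`.

```typescript
// ASSUMPTION: a trailing "/*" allows any module under that path.
const ALLOWED_IMPORTS = ["@playwright/test", "../utils/helpers", "../pages/*"];

// Collects module specifiers from static imports and require() calls.
function extractImportSpecifiers(code: string): string[] {
  const pattern = /(?:\bimport\s+(?:[^'"]*?\sfrom\s+)?|\brequire\(\s*)['"]([^'"]+)['"]/g;
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(code)) !== null) {
    const specifier = match[1];
    if (specifier !== undefined) found.push(specifier);
  }
  return found;
}

function isAllowedImport(specifier: string, allowlist: string[] = ALLOWED_IMPORTS): boolean {
  return allowlist.some((rule) => {
    if (rule.endsWith("/*")) {
      const prefix = rule.slice(0, -1); // keep the trailing slash
      return specifier.startsWith(prefix) && specifier.length > prefix.length;
    }
    return specifier === rule;
  });
}

// Returns every import the Generator used that is not on the allowlist.
function findDisallowedImports(code: string, allowlist: string[] = ALLOWED_IMPORTS): string[] {
  return extractImportSpecifiers(code).filter((s) => !isAllowedImport(s, allowlist));
}
```

If `findDisallowedImports` returns anything, the output is rejected and the Generator retries, the same way a schema failure is handled.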

Playwright 1.60: What Changed for Generators

Playwright 1.60.0 shipped on May 11, 2026, and two features directly improve generator reliability. The first is tracing.startHar() and tracing.stopHar(), which expose HAR recording as a first-class API. Generators can now wrap a test session in a HAR recording automatically, giving the Healer a complete network timeline when something fails.

The second is the boxes option on ariaSnapshot(). When a Generator produces an ARIA snapshot assertion, it can now include bounding box coordinates. The Playwright release notes explicitly call this “useful for AI consumption.” In my pipelines, I use these bounding boxes to help the Healer visually verify whether a missing element was removed from the DOM or merely repositioned.

Multi-Runtime Generation

The Generator does not have to target Playwright. In my setup, the Planner outputs runtime-agnostic steps, and the Generator selects a runtime based on context. UI flows go to Playwright. API validations go to a Python requests script. Performance checks go to a k6 JavaScript module. The Healer does not care which runtime produced the failure. It only cares about the error classification.
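The routing itself can be very small. The sketch below assumes each plan step carries a hypothetical `kind` tag; the text above only says the runtime is chosen based on context, so treat the tag as an illustration of one way to carry that context.

```typescript
// ASSUMPTION: `StepKind` is a hypothetical tag added to each plan step.
type StepKind = "ui" | "api" | "performance";
type Runtime = "playwright" | "python-requests" | "k6";

function selectRuntime(kind: StepKind): Runtime {
  switch (kind) {
    case "ui":
      return "playwright";
    case "api":
      return "python-requests";
    case "performance":
      return "k6";
    default: {
      // Compile-time exhaustiveness check: adding a StepKind without a
      // case makes this assignment fail to type-check.
      const unreachable: never = kind;
      throw new Error(`unknown step kind: ${String(unreachable)}`);
    }
  }
}
```

The exhaustive switch means a new runtime cannot be added to the type without also deciding where its steps are routed.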

Generator Prompt Template

Here is a simplified version of the prompt I use for Playwright generation:

You are a test code generator. You receive a plan step and produce a TypeScript Playwright test snippet.

Rules:
- Use only these imports: @playwright/test, ../utils/helpers, ../pages/*
- Prefer data-testid selectors. Fall back to ARIA roles. Never use XPath.
- Include explicit waits for dynamic content.
- Return the code inside a JSON object with keys: "code", "selector_used", "assertion_count"

Plan step:
{{step_json}}

Current page state (optional):
{{aria_snapshot}}

The {{aria_snapshot}} variable is where Playwright 1.60’s page.ariaSnapshot({ boxes: true }) becomes valuable. The Generator sees not just the DOM tree but the spatial layout, reducing misclicks on repositioned elements.
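Filling the template and checking the reply can be sketched as two small helpers. Both names, `renderTemplate` and `parseGeneratorOutput`, are hypothetical; the three output keys come from the prompt above.

```typescript
// Shape the prompt asks the model to return.
interface GeneratorOutput {
  code: string;
  selector_used: string;
  assertion_count: number;
}

// Replaces {{name}} placeholders; throws if a variable is missing so a
// half-filled prompt is never sent to the model.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match: string, name: string) => {
    if (!Object.prototype.hasOwnProperty.call(vars, name)) {
      throw new Error(`missing template variable: ${name}`);
    }
    return vars[name] ?? "";
  });
}

// Validates the model's reply against the three required keys.
function parseGeneratorOutput(raw: string): GeneratorOutput {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("generator output must be a JSON object");
  }
  const o = parsed as Record<string, unknown>;
  if (
    typeof o.code !== "string" ||
    typeof o.selector_used !== "string" ||
    typeof o.assertion_count !== "number" ||
    !Number.isInteger(o.assertion_count) ||
    o.assertion_count < 0
  ) {
    throw new Error("generator output must have code, selector_used, assertion_count");
  }
  return { code: o.code, selector_used: o.selector_used, assertion_count: o.assertion_count };
}
```

Since the snapshot is marked optional in the template, a caller would pass an empty string for `aria_snapshot` when no page state is available rather than omit the key.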

The Healer: When Tests Break, the Agent Fixes Them

The Healer is the layer that separates a cool demo from a production system. Every generated test will eventually break. The Healer decides whether to fix it, skip it, or escalate it to a human.

The Failure Classification Taxonomy

Not all failures are equal. The Healer classifies each failure into one of six categories:

  • Timing: Element exists but was not ready when the test interacted with it.
  • Selector: The DOM changed and the locator no longer matches.
  • State: The application is in an unexpected state (modal open, error toast visible).
  • Data: Test data was invalid or conflicting with another test.
  • Environment: Network timeout, service down, or CI resource constraint.
  • Bug: The application behavior genuinely changed and the test is correct to fail.

I started with rule-based classification. String matching on error messages handles the first four categories well. “Timeout exceeded” is timing. “strict mode violation” is selector. “Element is not visible” is state. For ambiguous failures, I escalate to an LLM classifier. Today, the rule-based layer handles 73% of failures without any LLM call. The LLM classifier handles the remaining 27%, and it correctly classifies 87% of those.

Remediation Strategies per Category

Each category has a fixed remediation playbook:

  1. Timing: Increase wait, add explicit waitFor, or use Playwright’s auto-waiting.
  2. Selector: Regenerate the selector using the current DOM snapshot and semantic similarity.
  3. State: Add a precondition step to dismiss the blocking UI element.
  4. Data: Regenerate test data or isolate the test in a clean environment.
  5. Environment: Retry with exponential backoff. If three retries fail, mark as blocked.
  6. Bug: Stop. Do not heal. File a bug report with full context.
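The playbook above maps naturally onto a lookup table keyed by category. The sketch below is one possible encoding, not the production implementation; the `autoHeal` flag and the 1000 ms base delay are illustrative assumptions, since no backoff base is given here.

```typescript
type FailureCategory = "timing" | "selector" | "state" | "data" | "environment" | "bug";

interface Remediation {
  autoHeal: boolean; // false means the Healer must not patch the test
  action: string;
}

const PLAYBOOK: Record<FailureCategory, Remediation> = {
  timing: { autoHeal: true, action: "increase wait or add an explicit waitFor" },
  selector: { autoHeal: true, action: "regenerate selector from the current DOM snapshot" },
  state: { autoHeal: true, action: "add a precondition step to dismiss the blocking UI" },
  data: { autoHeal: true, action: "regenerate test data or isolate in a clean environment" },
  environment: { autoHeal: true, action: "retry with exponential backoff, then mark blocked" },
  bug: { autoHeal: false, action: "stop and file a bug report with full context" },
};

// Delay before each retry: base, 2x base, 4x base, and so on.
// ASSUMPTION: the default base delay is illustrative.
function backoffDelaysMs(maxRetries = 3, baseMs = 1000): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    delays.push(baseMs * 2 ** attempt);
  }
  return delays;
}
```

Typing the table as `Record<FailureCategory, Remediation>` makes the compiler reject a taxonomy change that leaves a category without a playbook entry.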

The most dangerous failure is a false positive heal. This happens when the Healer misclassifies a real bug as a selector issue and patches the test to match the broken behavior. I have written extensively about why most self-healing implementations fail in production. The P-G-H architecture mitigates this by requiring the Healer to log every remediation with before-and-after screenshots, DOM snapshots, and the classification confidence score. If confidence is below 0.85, the Healer escalates instead of healing.

When the Healer Should Stop

I enforce a hard rule: the Healer may attempt at most three remediation cycles per test. If the test still fails after three attempts, the failure is escalated to a human with a complete context package. This prevents infinite loops where a fundamentally broken test consumes API credits and CI minutes indefinitely.
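The confidence threshold and the three-cycle limit combine into a single gate that runs before any remediation. This is a minimal sketch with hypothetical names; the order of the checks is my assumption, chosen so that every uncertain path ends without a patch.

```typescript
type HealerDecision = "heal" | "escalate" | "file_bug";

interface Classification {
  category: "timing" | "selector" | "state" | "data" | "environment" | "bug";
  confidence: number; // 0..1, from the rule-based or LLM classifier
}

const MIN_CONFIDENCE = 0.85;
const MAX_REMEDIATION_CYCLES = 3;

function decide(classification: Classification, attemptsSoFar: number): HealerDecision {
  // A real bug is never healed, whatever the confidence.
  if (classification.category === "bug") return "file_bug";
  // Low confidence: a human decides, so a bug is not patched away by mistake.
  if (classification.confidence < MIN_CONFIDENCE) return "escalate";
  // Budget spent: stop burning API credits and CI minutes.
  if (attemptsSoFar >= MAX_REMEDIATION_CYCLES) return "escalate";
  return "heal";
}
```

For example, `decide({ category: "selector", confidence: 0.8 }, 0)` escalates even on the first attempt, because the classifier is not sure enough that the failure is a selector problem.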

Memory and State: What Holds It All Together

The three layers are useless without a shared memory system. I use three types of memory, each with a different persistence model.

Short-Term Memory: The State Machine

Each agent run maintains a typed state object that tracks the current plan, the step being executed, the generated code, and any failures so far. I implement this in LangGraph as a StateGraph with custom node types for Planner, Generator, and Healer. The state object is passed between nodes as an immutable dictionary. If a node fails, the state is preserved for debugging.

Episodic Memory: The Vector Database

Across runs, the agent stores test sessions, outcomes, and healed selectors in a vector database. I use Astra DB because it is serverless and vector-native. When the Generator faces a new page type, it queries episodic memory for similar pages and retrieves previously successful selector strategies. When the Healer encounters a failure, it searches for similar past failures and retrieves the remediation that worked last time.

Retrieval uses semantic search with a similarity threshold of 0.82. Below that threshold, the agent treats the situation as novel and does not apply historical fixes. This prevents false matches from poisoning new tests.

Procedural Memory: The Pattern Library

Procedural memory stores reusable test patterns: “for React Suspense boundaries, wait for the fallback to disappear before asserting,” or “for Stripe elements, use the iframe-aware locator syntax.” These patterns are authored by humans and indexed by the agent. When the Planner sees a Stripe payment step, it retrieves the procedural memory for Stripe iframe handling and injects it into the plan as a precondition.

Memory Hygiene

Memory is not a dump. I expire episodic memories after 90 days unless they have been retrieved more than five times. Procedural memories are versioned. When a pattern breaks due to a framework upgrade, I mark the old version as deprecated and add the new pattern. Without this hygiene, the agent accumulates stale knowledge and its performance degrades over time.
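The expiry rule for episodic memory reduces to one predicate. The sketch below assumes each record stores a creation date and a retrieval counter; `EpisodicMemory` and `pruneMemories` are hypothetical names, and a real vector store would apply the same rule through a metadata filter or a scheduled job.

```typescript
// ASSUMPTION: records carry createdAt and retrievalCount metadata.
interface EpisodicMemory {
  id: string;
  createdAt: Date;
  retrievalCount: number;
}

const MAX_AGE_DAYS = 90;
const KEEP_IF_RETRIEVED_MORE_THAN = 5;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Expired means older than 90 days AND not retrieved more than five times.
function isExpired(memory: EpisodicMemory, now: Date): boolean {
  const ageDays = (now.getTime() - memory.createdAt.getTime()) / MS_PER_DAY;
  return ageDays > MAX_AGE_DAYS && memory.retrievalCount <= KEEP_IF_RETRIEVED_MORE_THAN;
}

function pruneMemories(memories: EpisodicMemory[], now: Date = new Date()): EpisodicMemory[] {
  return memories.filter((memory) => !isExpired(memory, now));
}
```

Passing `now` as a parameter keeps the rule deterministic under test, which matters for a job that deletes data.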

Building Your First P-G-H Pipeline: Step-by-Step

Here is a minimal implementation you can run today. It uses LangGraph for orchestration, Playwright for execution, and a simple rule-based Healer.

Step 1: Define the State

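import { StateGraph, Annotation } from "@langchain/langgraph";
import { ChatOpenAI } from "@langchain/openai"; // used by the Planner and Generator nodes

// Typed state shared by every node in the graph.
const AgentState = Annotation.Root({
  goal: Annotation<string>,
  // Plan steps as parsed JSON; the element type here is an assumption.
  plan: Annotation<Array<Record<string, unknown>>>,
  currentStep: Annotation<number>,
  generatedCode: Annotation<string>,
  testResult: Annotation<{ passed: boolean; error?: string; category?: string }>,
  retryCount: Annotation<number>,
});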

Step 2: Build the Planner Node

async function plannerNode(state: typeof AgentState.State) {
  // Large model at temperature 0 for planning.
  const model = new ChatOpenAI({ model: "gpt-4o", temperature: 0 });
  const prompt = `Break this testing goal into 3-5 verifiable steps. Return only a JSON array, with no prose or code fences: ${state.goal}`;
  const response = await model.invoke(prompt);
  // JSON.parse throws if the model ignores the format instruction.
  const plan = JSON.parse(response.content as string);
  return { plan, currentStep: 0, retryCount: 0 };
}

Step 3: Build the Generator Node

async function generatorNode(state: typeof AgentState.State) {
  const step = state.plan[state.currentStep];
  // A smaller, faster model handles generation; the plan already did the hard thinking.
  const model = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
  const prompt = `Write a Playwright TypeScript snippet for: ${JSON.stringify(step)}. Use data-testid selectors only.`;
  const response = await model.invoke(prompt);
  return { generatedCode: response.content as string };
}

Step 4: Build the Healer Node

// Rule-based classifier: match known substrings in the error message.
function classifyFailure(error: string): string {
  if (error.includes("Timeout")) return "timing";
  if (error.includes("strict mode") || error.includes("resolved to")) return "selector";
  if (error.includes("not visible") || error.includes("overlay")) return "state";
  if (error.includes("ECONNREFUSED") || error.includes("503")) return "environment";
  return "bug"; // anything unrecognized is treated as a real defect
}

async function healerNode(state: typeof AgentState.State) {
  // On success, advance to the next step and reset the retry budget.
  if (state.testResult.passed) return { currentStep: state.currentStep + 1, retryCount: 0 };

  const category = classifyFailure(state.testResult.error ?? "");
  if (category === "bug" || state.retryCount >= 3) {
    // Do not heal: record the category and leave the failure for a human.
    return { testResult: { ...state.testResult, passed: false, category } };
  }

  // Simple remediation: retry with a doubled timeout for timing issues
  if (category === "timing") {
    const patchedCode = state.generatedCode.replace(
      /timeout: (\d+)/,
      (_match: string, ms: string) => `timeout: ${parseInt(ms, 10) * 2}`
    );
    return { generatedCode: patchedCode, retryCount: state.retryCount + 1 };
  }

  return { retryCount: state.retryCount + 1 };
}

Step 5: Wire the Graph

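// Hypothetical helper, not part of Playwright or LangGraph: you must supply it.
// It should run the snippet (for example via `npx playwright test`) and report the outcome.
declare function runGeneratedTest(code: string): Promise<{ passed: boolean; error?: string }>;

// Executes the generated code so the Healer has a result to classify.
async function executorNode(state: typeof AgentState.State) {
  return { testResult: await runGeneratedTest(state.generatedCode) };
}

const graph = new StateGraph(AgentState)
  .addNode("planner", plannerNode)
  .addNode("generator", generatorNode)
  .addNode("executor", executorNode)
  .addNode("healer", healerNode)
  .addEdge("__start__", "planner")
  .addEdge("planner", "generator")
  .addEdge("generator", "executor")
  .addEdge("executor", "healer")
  .addConditionalEdges("healer", (state) => {
    if (state.testResult.passed) {
      // The Healer already advanced currentStep; stop after the last step.
      return state.currentStep >= state.plan.length ? "__end__" : "generator";
    }
    const category = classifyFailure(state.testResult.error ?? "");
    if (category === "bug" || state.retryCount >= 3) return "__end__"; // escalate
    // Rerun patched code for timing failures; regenerate for everything else.
    return category === "timing" ? "executor" : "generator";
  })
  .compile();

const result = await graph.invoke({ goal: "Test login with invalid credentials" });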

This is a teaching implementation. Production pipelines add vector memory, structured output schemas, and multi-runtime support. But the core loop is the same: plan, generate, heal, repeat.

Production Lessons From 6 Months of Agentic Testing

I have run P-G-H pipelines against three production codebases since January 2026. Here are the numbers and the surprises.

Surprise 1: The Planner Is the Bottleneck

I expected generation to be the slowest step. It was not. Planning dominates latency because it requires a large model, which I run at temperature 0. Caching cut planning time by 82%, but cache misses still stall the pipeline for 8-12 seconds. I solved this by pre-computing plans for common user story templates during off-peak hours.

Surprise 2: Healing Accuracy Plateaus at 87%

No matter how much I tuned the Healer, I could not push classification accuracy above 87% on ambiguous failures. The remaining 13% are genuinely hard cases: race conditions, third-party iframe changes, and environment-specific flakes. I stopped chasing 100% and built a human escalation workflow that presents the full context in under 30 seconds. A senior SDET can triage these escalations faster than any classifier.

Surprise 3: Local Models Are Viable for Generation

I ran the Generator on Mistral 7B via Ollama for two weeks. Code compilation rate dropped from 94% to 81%, but cost dropped to zero and latency was under 2 seconds per step. For teams with tight API budgets, this is a viable tradeoff. I wrote about the full cost analysis in my Ollama CI cost reduction guide.

Surprise 4: The Biggest Win Is Documentation

The P-G-H pipeline generates structured plans, code, and failure logs as a side effect. This is better documentation than most teams have. A product manager can read the plan and understand what is being tested. A developer can read the generated code and see the exact assertions. An SDET can read the Healer logs and trace why a test was patched. The architecture forces transparency.

India Context: What This Means for QA Careers in 2026

In India, the agentic testing shift is creating a two-tier market. Tier one: engineers who can design and debug P-G-H pipelines. Tier two: engineers who run manually written tests. The salary gap is widening.

Based on my conversations with hiring managers at product companies in Bengaluru and Hyderabad, here is the mid-2026 landscape:

  • Manual tester transitioning to basic automation: ₹6-10 LPA. Stable demand but flat growth.
  • SDET with Selenium/Playwright: ₹12-18 LPA. Still the majority of openings.
  • SDET who can build agentic pipelines: ₹22-35 LPA. Demand exceeds supply by a wide margin.
  • Senior AI QA Architect: ₹35-50 LPA. These roles ask for LangGraph, vector databases, and prompt engineering.

Service companies like TCS and Infosys are still primarily hiring tier-two profiles, but their internal innovation labs are quietly building agentic testing platforms. If you are in a service company today, the fastest path to tier one is not another Selenium certification. It is building a working P-G-H pipeline on a side project and documenting the results.

I mapped the full transition in my 90-day AI-assisted testing roadmap. Weeks 5-8 focus specifically on LangGraph orchestration and agent state management. Weeks 9-12 cover healing strategies and vector memory integration.

Key Takeaways

  • The Planner-Generator-Healer pattern splits AI testing into three specialized layers: strategy, execution, and remediation. On my benchmark suite, this separation raised the success rate from 40% to 87%, cutting the failure rate from 60% to 13%.
  • The Planner should output runtime-agnostic intent, not code. This lets you swap generators without rewriting the Planner.
  • Constrain the Generator with allowed imports, output schemas, and selector registries. Unconstrained generators produce creative but broken code.
  • The Healer must classify failures before attempting fixes. Rule-based classification handles 73% of cases. LLM-based classification handles the rest, but accuracy plateaus around 87%.
  • Memory is not optional. Short-term state machines, episodic vector stores, and procedural pattern libraries are all required for production reliability.
  • Playwright 1.60’s HAR tracing and ARIA snapshots with bounding boxes directly support agentic testing workflows.
  • Local models via Ollama are viable for the Generator layer if you accept an 81% first-try compilation rate instead of 94%.
  • In India, SDETs who can build agentic pipelines are quoted ₹22-35 LPA, against ₹12-18 LPA for standard Selenium/Playwright automation roles.

Frequently Asked Questions

Is the Planner-Generator-Healer pattern the same as ReAct?

ReAct is a general reasoning-acting loop. P-G-H is a specialized testing architecture built on top of that idea. ReAct does not distinguish between planning and generation. P-G-H explicitly separates them because testing requires different models, different costs, and different debugging strategies for each phase.

Do I need LangGraph to implement this?

No. LangGraph makes state management easier, but you can implement P-G-H with simple Python classes, Redis for state, and FastAPI for orchestration. I started with a 200-line Python script before migrating to LangGraph. Start simple and add framework complexity only when you need it.

How much does this cost to run in CI?

A full P-G-H pipeline with GPT-4o for planning and GPT-4o-mini for generation costs roughly $0.40 to $1.20 per test case, depending on complexity. For a suite of 50 tests, that works out to about $20-60 per execution on cloud APIs. Using Ollama locally cuts this to the compute cost of your runner: local inference is free after hardware cost but 2-3x slower.

Can I use this with Selenium instead of Playwright?

Yes. The Planner is runtime-agnostic. The Generator can target Selenium, Cypress, or even manual test scripts. I prefer Playwright because its auto-waiting and tracing reduce the Healer’s workload, but the architecture does not depend on it.

What happens when the Healer misclassifies a bug as a flaky test?

This is the biggest risk. I mitigate it with confidence thresholds. The Healer only applies a fix if classification confidence is above 0.85. Below that, it escalates to a human. I also require the Healer to attach before-and-after screenshots and DOM diffs to every automated fix. This creates an audit trail that humans can review in seconds.

How do I get started if I have never built an AI agent?

Start with the Generator only. Pick one flaky test in your suite, build a prompt that rewrites the broken selector using the current DOM snapshot, and run it on failure. Once that works, add a Planner that breaks a user story into steps. The Healer comes last. Do not try to build all three layers on day one.
