AI QA Agents: Runnable Checks

Day 8 of 100 Days of AI in QA & SDET.

🤖 Learning AI-powered testing? Go hands-on with LLM, RAG, and AI-agent testing in the AI-Powered Testing Mastery course at The Testing Academy.

AI QA agents are useful only when they move beyond neat test-case text and produce something a team can run, review, and trust with evidence. I see too many teams stop at “write test cases for checkout” and call it AI testing. That is not AI testing. That is a better notebook.

The real shift is simple: prompts should become runnable checks. A runnable check can fail, attach a trace, capture a screenshot, save an API response, and tell the team exactly which assumption broke. This article gives you a practical workflow for moving from prompts to QA agents that can generate Playwright checks, run them in CI, and use evals to stop weak output from reaching your regression suite.

Table of Contents

Why prompts are not enough
What AI QA agents should actually do
A practical AI QA agents workflow
Prompt to Playwright: a runnable example
Eval gates for AI QA agents
The evidence pack every agent run needs
How I would package this as a QASkills workflow
India career context for SDETs
Key takeaways
FAQ

Contents

Why prompts are not enough

Prompting is a good start, but it is a weak finish. A prompt can describe a test idea. It cannot prove that a login page works, that a discount code is applied only once, or that a payment retry flow records the correct audit event. The moment the output stays inside a chat window, the QA value is still theoretical.

Look at the current tooling signal. The Playwright GitHub repository shows more than 91,000 stars, and the npm downloads API reported more than 227 million Playwright package downloads in the last month checked during research. PromptFoo crossed 22,000 GitHub stars and the npm downloads API reported more than 1.17 million monthly downloads. The browser-use repository also crossed 99,000 GitHub stars, which shows how much attention browser agents are getting. These numbers do not prove quality by themselves, but they show where teams are putting energy: executable automation, browser agents, and repeatable evaluation.

Generated text has no accountability

A generated test case often looks impressive because it uses familiar QA words: boundary value, negative scenario, valid input, invalid input, expected result. The problem is that none of those words guarantee coverage. The output may miss user roles, data state, browser state, feature flags, localization, and backend side effects.

I covered this problem in Weak AI-Generated Test Cases: Day 7 QA Guide. The short version: generic prompts create generic coverage. If the AI does not know the risk model, it fills gaps with average internet QA language.

Runnable checks force clarity

When a prompt becomes a runnable check, vague thinking gets exposed. A selector may not exist. A role may not have permission. A fixture may need a seeded user. An API response may return a business error instead of a UI error. These failures are not bad. They are the point.

A QA agent should make ambiguity visible. If the agent cannot turn a requirement into a check, it should say what is missing instead of inventing a fake test. That behavior separates a useful assistant from a noisy generator.

What AI QA agents should actually do

AI QA agents should behave like a junior automation partner with strict rules, not like a magical replacement for testers. I expect the agent to read context, propose checks, generate code, run tests, collect evidence, and ask for missing information when the requirement is unclear.

The key is not autonomy for the sake of autonomy. The key is a closed loop. Input goes in, executable work comes out, evidence is attached, and evals judge whether the output met the contract.

The five jobs of a useful QA agent

Parse the requirement. Identify actors, rules, data states, and expected outcomes.
Create a risk map. List what can break and what matters most to users.
Generate runnable checks. Prefer Playwright, API checks, or contract checks over plain prose.
Run and collect evidence. Save trace, screenshot, network logs, console logs, and failure reason.
Score the output. Use evals so the agent does not silently degrade next week.

Where the human still owns the decision

The human owns risk priority. The agent can suggest that payment retry is high risk, but a tester or product owner should confirm whether that scenario blocks release. The agent can generate a selector, but the SDET should decide whether that selector is stable. The agent can report that a check passed once, but the QA lead decides whether it belongs in smoke, regression, or exploratory follow-up.

This is why I do not like the phrase “AI will replace testers.” It hides the real opportunity. AI will replace shallow test-case factories. It will not replace testers who can define risk, evaluate output, and build reliable feedback loops.

A practical AI QA agents workflow

Here is the workflow I would use for a product team that wants AI-assisted QA without turning CI into a random experiment. The workflow has seven steps. It is boring by design, because reliable testing is usually boring in the right places.

Step 1: Write the requirement in testable language

Bad input creates bad automation. Start with a short requirement that includes rule, role, data state, and expected system behavior.

Requirement:
When a logged-in buyer applies a valid coupon on the cart page,
the order total should reduce by the coupon value exactly once.
The coupon cannot be reused after successful payment.
The discount event must be visible in the order audit trail.

This gives the agent enough material to reason. It also gives the tester a checklist for reviewing the generated output.

Step 2: Ask for checks, not only cases

The prompt should request a runnable structure. Do not ask “write test cases.” Ask for scenarios, fixtures, selectors, API calls, assertions, and missing information.

You are a QA automation assistant.
Convert the requirement into Playwright checks.
Return:
1. Risk summary
2. Test data needed
3. API setup assumptions
4. Playwright test skeleton
5. Evidence to collect
6. Missing information, if any
Do not invent endpoints or selectors. Mark unknowns clearly.

That last line matters. A helpful agent says “unknown” instead of creating a fake endpoint like /api/apply-coupon because it sounds plausible.

Step 3: Run in a controlled sandbox

Never point a fresh agent at production. Run it against a staging environment, a local branch, or a mocked service. Give it a test user with limited permissions. Give it seeded test data. Keep write operations contained.

If the agent can create orders, payments, or support tickets, add cleanup logic. If it cannot clean up, do not let it run unattended.

Step 4: Require an evidence pack

The agent should return more than “passed.” A pass without evidence is weak. A fail without evidence is noise.

Playwright trace URL or artifact path
Screenshot before and after the key action
Console errors captured during the run
Network request and response for the important API call
Final assertion result
One-sentence human-readable failure reason

This evidence pack makes agent output reviewable in a stand-up, pull request, or release meeting.

Prompt to Playwright: a runnable example

Let us convert one requirement into a Playwright check. This is not a full framework, but it shows the shape I want from AI QA agents: deterministic setup, clear assertions, and trace-ready execution.

Requirement

A buyer can apply a valid coupon once during checkout. The cart total should update, the success message should appear, and a second attempt with the same coupon should not double-discount the order.

Generated Playwright check

import { test, expect } from '@playwright/test';

test.describe('checkout coupon rules', () => {
  test('valid coupon is applied once and cannot double discount', async ({ page }) => {
    await test.step('open cart with seeded buyer session', async () => {
      await page.goto('/cart?seed=buyer-with-two-items');
      await expect(page.getByRole('heading', { name: /cart/i })).toBeVisible();
    });

    const total = page.getByTestId('cart-total');
    const beforeTotal = await total.innerText();

    await test.step('apply valid coupon', async () => {
      await page.getByLabel('Coupon code').fill('QA10');
      await page.getByRole('button', { name: 'Apply coupon' }).click();
      await expect(page.getByText('Coupon applied')).toBeVisible();
    });

    const afterFirstApply = await total.innerText();
    expect(afterFirstApply).not.toEqual(beforeTotal);

    await test.step('try applying the same coupon again', async () => {
      await page.getByLabel('Coupon code').fill('QA10');
      await page.getByRole('button', { name: 'Apply coupon' }).click();
      await expect(page.getByText('Coupon already applied')).toBeVisible();
    });

    const afterSecondApply = await total.innerText();
    expect(afterSecondApply).toEqual(afterFirstApply);
  });
});

Notice what this test does not do. It does not depend on random text selectors. It does not accept “no error visible” as success. It checks that the total changed once and stayed stable after the second attempt.

Agent review checklist

Before I accept AI-generated automation, I check these items:

Does the test state the business rule clearly?
Are selectors stable enough for CI?
Is setup deterministic, or does it depend on old staging data?
Are assertions tied to user-visible behavior and backend state?
Does the test fail for the right reason?
Can another SDET maintain this after two months?

If the answer is no, I treat the agent output as a draft. That is still valuable. Drafts save time, but they do not remove review.

Eval gates for AI QA agents

The biggest risk with AI QA agents is not one bad output. It is silent drift. A prompt that produced good checks last week may start producing weaker output after a model change, prompt edit, or context update. This is where eval gates matter.

I wrote a dedicated guide on this in LLM Output Evaluation for QA Engineers: Day 5. For agent workflows, I like lightweight evals that judge structure, correctness, and refusal behavior.

What to evaluate

Start with a small eval suite. Ten requirements are enough to expose patterns. Include happy paths, negative paths, security-sensitive flows, missing data, and one deliberately ambiguous requirement.

Completeness: Did the agent cover role, data, action, expected result, and evidence?
Grounding: Did the agent avoid inventing selectors, endpoints, or product rules?
Runnable quality: Does the generated code follow the team’s Playwright pattern?
Failure behavior: Does it ask for missing context when needed?
Risk fit: Did it prioritize scenarios that matter for release?

PromptFoo example

description: AI QA agent output quality
providers:
  - openai:gpt-4.1-mini
prompts:
  - file://prompts/qa-agent.md
tests:
  - vars:
      requirement: "Coupon can be applied once during checkout"
    assert:
      - type: contains
        value: "Playwright"
      - type: contains
        value: "evidence"
      - type: not-contains
        value: "/api/apply-coupon"
      - type: llm-rubric
        value: "Output must identify missing selectors instead of inventing them."

This eval is intentionally small. It checks that the output has runnable intent, asks for evidence, and does not invent an endpoint. You can expand it later with team-specific rules.

DeepEval style scoring

DeepEval can help when you want Python-based scoring around expected criteria. I use this style when the team needs a repeatable check in CI.

from deepeval.test_case import LLMTestCase
from deepeval.metrics import GEval

case = LLMTestCase(
    input="Generate QA checks for one-time coupon application",
    actual_output=agent_output,
    expected_output="Must include setup, Playwright skeleton, assertions, and evidence."
)

quality = GEval(
    name="qa_agent_runnable_check_quality",
    criteria="Score whether the output is grounded, runnable, and clear about unknowns."
)

quality.measure(case)
print(quality.score, quality.reason)

Do not overbuild evals on day one. Start with the failures you already see: invented endpoints, missing assertions, generic test cases, and weak evidence.

🚀 Build Real AI Testing Skills

Stop testing AI by guesswork. Learn DeepEval, RAG evaluation, and agent testing with guided projects.

Explore the AI Testing Course →

The evidence pack every agent run needs

A good QA agent leaves breadcrumbs. When a test fails, the team should not spend 20 minutes asking “what did the agent do?” The run should explain itself.

Minimum evidence

For UI checks, I want these artifacts:

Trace: Playwright trace for step-by-step replay.
Screenshot: Before action, after action, and failure state.
Console log: Browser errors, warnings, and uncaught exceptions.
Network log: Key request and response for the feature under test.
Data snapshot: Test user, feature flag, tenant, and environment.
Final reason: One clear sentence explaining pass or fail.

This is where Playwright fits naturally. Tracing, screenshots, fixtures, and API testing are already part of the ecosystem. If your team is exploring Playwright MCP, read Playwright MCP vs Traditional Test Scripts and Playwright MCP Smoke Test: 5-Step QA Guide. Both are useful background for agent-driven browser work.

Evidence changes the review conversation

Without evidence, the review is emotional. One person says the agent is impressive. Another says it is risky. With evidence, the review becomes concrete. Did the selector work? Did the API return the expected response? Did the total change once? Did the trace show a real user path?

This is also how managers can judge AI testing work. They should not ask, “Did we use AI?” They should ask, “What evidence did AI produce, and how much review time did it save?”

How I would package this as a QASkills workflow

The popped topic for today came from a QASkills product update idea: move from prompting to runnable QA agents. That is exactly the right packaging. A tester should be able to add a skill, give it a requirement, and receive a structured output that fits their workflow.

A practical QASkills workflow can look like this:

npx @qaskills/cli add qa-agent-runnable-checks

The skill does not need to control the whole SDLC. It should do one job well: convert a requirement into a reviewed, evidence-ready testing plan and a first Playwright draft.

Skill contract

I would define the skill contract like this:

Input: Requirement, product area, user role, environment, known selectors, API docs.
Output: Risk summary, test data, Playwright draft, missing context, evidence checklist.
Safety rule: Never invent endpoints, selectors, or business rules.
Review rule: Mark generated code as draft until a human approves it.
Eval rule: Run prompt regression checks before changing the skill prompt.

Why small skills beat one giant agent

One giant QA agent sounds exciting, but it becomes hard to debug. Small skills are easier to trust. One skill reviews requirements. One skill generates Playwright drafts. One skill checks selectors. One skill builds PromptFoo evals. One skill summarizes the evidence pack.

This modular style also helps teams with different maturity levels. A manual testing team can start with requirement review. An automation team can add Playwright draft generation. A platform team can add CI and eval gates later.

India career context for SDETs

For SDETs in India, this shift is important. The market is already crowded with people who can say “I know Selenium” or “I used ChatGPT for test cases.” That is no longer enough. Product companies and strong services teams want people who can connect AI output to engineering systems.

If you are targeting stronger SDET roles, learn to speak in this language:

“I converted requirements into executable Playwright checks.”
“I added evals to prevent prompt regression.”
“I captured trace, screenshots, console logs, and API responses for agent runs.”
“I reviewed generated tests for selector stability and risk coverage.”
“I reduced manual draft time but kept human approval for release risk.”

That sounds very different from “I know prompt engineering.” Prompt engineering is a skill. Eval design and runnable automation are career advantage. For mid-level SDETs aiming at ₹25-40 LPA product-company roles, this distinction matters because hiring managers need proof that you can improve delivery, not just use a chatbot.

A 14-day practice plan

If you want to build this skill, use this two-week plan:

Pick one small feature you understand well.
Write five requirements in testable language.
Ask an AI assistant to generate risk maps and Playwright drafts.
Reject any output that invents selectors or endpoints.
Run at least two generated checks locally.
Add trace and screenshot capture.
Create five PromptFoo evals for your prompt.
Document what the agent missed and how you fixed it.

At the end, you have a portfolio story. Not “I used AI.” Instead: “I built a mini QA agent workflow with Playwright, evidence artifacts, and prompt regression checks.” That is much stronger in interviews.

Key takeaways

AI QA agents are not valuable because they sound smart. They are valuable when they produce runnable checks and reviewable evidence.

Prompts are a starting point. Runnable checks are the QA outcome.
Do not let agents invent selectors, endpoints, or business rules.
Use Playwright for executable checks and evidence collection.
Use PromptFoo or DeepEval to catch prompt and output regression.
Keep humans responsible for risk priority, review, and release decisions.

My simple rule for Day 8: if the AI output cannot run, fail, and explain itself with evidence, it is still a draft. Useful, but not trusted yet.

FAQ

Are AI QA agents ready to replace automation engineers?

No. They are ready to speed up parts of requirement analysis, test draft generation, and evidence summarization. They still need human review for risk, coverage, selectors, and release impact.

Should manual testers learn AI QA agents before automation?

Manual testers should learn both together. Start with requirements, risks, and prompts, but connect the output to Playwright or API checks as early as possible. AI without execution becomes shallow quickly.

What is the best first tool for this workflow?

Playwright is a strong first tool because it handles UI checks, API checks, traces, screenshots, and fixtures in one ecosystem. Add PromptFoo when you need to evaluate prompt quality over time.

How many evals do I need at the beginning?

Start with 5 to 10 eval cases. Cover happy path, negative path, missing context, security-sensitive flow, and one ambiguous requirement. Expand only after you see real failures.

What should I avoid when building AI QA agents?

Avoid production access, vague prompts, invented selectors, unreviewed generated code, and “passed once” confidence. A single pass is evidence, not a release guarantee.

🎓 Become an AI-Powered QA Engineer

Join hundreds of SDETs mastering LLM, RAG, and agent testing. Lifetime access, hands-on labs, and a job-ready portfolio.

Enroll in AI-Powered Testing Mastery →