Weak AI-Generated Test Cases

100 Days of AI in QA & SDET, Day 7

🤖 Learning AI-powered testing? Go hands-on with LLM, RAG, and AI-agent testing in the AI-Powered Testing Mastery course at The Testing Academy.

Weak AI-generated test cases are usually not an AI problem. They are a QA input problem. If you give a model one vague sentence, it gives you a polite checklist that looks useful until the first production bug escapes.

I see this pattern every week with testers moving from manual testing into AI-assisted QA. The model can help, but only when we force it to reason from requirements, risks, data, states, and evidence. This guide shows the exact framework I use to turn a lazy prompt into a coverage workflow that a real SDET can trust.

Table of Contents

Why Weak AI-Generated Test Cases Happen
The Bad Prompt Pattern
Use a Coverage-First Prompt
Add an Eval Loop with PromptFoo and DeepEval
Convert Output into Automation-Ready Tests
India Context: The New SDET Skill Gap
A Practical Checklist for QA Teams
Key Takeaways
FAQ

Contents

Why Weak AI-Generated Test Cases Happen

Weak AI-generated test cases happen when the model is asked to invent context. Most testers start with prompts like “write test cases for login” or “create scenarios for checkout.” The model does not know your product rules, your data model, your risk appetite, or your defect history.

So it fills the gap with generic QA memory. You get rows like “verify user can login with valid credentials” and “verify error message for invalid password.” These are not wrong, but they are not enough. A fresher can write them without AI.

The hidden issue is that the output looks complete. It has a table. It has test IDs. It has expected results. Managers see volume and think coverage improved. But coverage did not improve if the model skipped account lockout, rate limits, session expiry, role-based access, device trust, localization, or audit logs.

The model is not your product owner

An LLM predicts likely text. It does not automatically know which business rules matter. If your subscription system allows downgrade only after the current billing period, the model needs that rule. If your cart reserves inventory for 15 minutes, the model needs that rule. If your admin user can refund partial amounts but a support user cannot, the model needs that rule.

When I test AI output, I look for missing states first. A test case list without states is usually shallow. Good QA work asks, “What can change?” Then it covers the combinations that carry real risk.

Generic prompts reward generic testing

Bad prompts usually miss four things:

Scope: which feature, screen, API, role, and environment?
Rules: what must always be true?
Risks: where can money, security, data, or trust break?
Evidence: what should the automation capture when it passes or fails?

Without those four inputs, the model writes a template. With those inputs, the model starts behaving like a junior QA analyst who has been given a proper brief.

The Bad Prompt Pattern

Here is the prompt I still see in teams:

Create test cases for the checkout page.

This prompt feels productive because it returns a table in five seconds. But it has no product constraints. The model may produce 15 test cases, but most will be variations of add item, remove item, apply coupon, place order, and verify confirmation.

That is not a test strategy. It is a first draft.

What a weak answer usually misses

For checkout, a shallow AI-generated set often misses:

Guest checkout versus logged-in checkout
Saved address versus new address
Coupon stacking rules
Payment timeout and retry behavior
Inventory changes between cart and payment
Tax and shipping recalculation after address change
Partial payment, wallet, gift card, or store credit rules
Mobile viewport issues
Accessibility of error messages
Security checks like tampered totals and unauthorized coupon use

Notice the pattern. The missing tests are not fancy. They are the scenarios that come from product understanding.

The test count trap

More generated rows do not mean better coverage. A model can produce 100 test cases and still miss the one rule that causes a ₹20 lakh revenue issue during a sale day. Test count is a vanity metric unless every row maps to a risk, rule, state, or observable behavior.

I prefer a 25-row coverage matrix with clear risk mapping over a 120-row spreadsheet full of duplicates. The first one helps automation. The second one creates review fatigue.

Use a Coverage-First Prompt

The fix is simple: stop asking AI to write test cases first. Ask it to prove coverage first. This is the core move that reduces weak AI-generated test cases.

A coverage-first prompt asks the model to create a matrix before it writes executable scenarios. The matrix makes gaps visible. It also makes review easier for leads, product owners, and automation engineers.

A better prompt for requirements

Use this structure:

You are a senior SDET reviewing a requirement.

Feature: Checkout payment retry
Requirement:
- If payment fails due to network timeout, the user can retry up to 3 times.
- The cart total must not change during retry.
- Inventory is reserved for 15 minutes.
- A successful retry creates only one order.
- Failed attempts must be logged with payment gateway reason codes.

Create a risk-based coverage matrix.
Columns:
1. Requirement or rule
2. User state
3. Data state
4. Risk
5. Test type: positive, negative, edge, abuse, accessibility, security
6. Expected evidence for automation
7. Priority: P0, P1, P2

Do not write test steps yet. First prove coverage.

This prompt changes the output. The model now has to connect rules to risks. It cannot hide behind generic steps. It has to explain what evidence should be captured, such as order ID uniqueness, gateway code, cart total, inventory reservation timestamp, and retry count.

Then ask for scenarios

After the coverage matrix is reviewed, ask for test cases from selected rows:

Now convert only the P0 and P1 rows into test scenarios.
For each scenario include:
- preconditions
- test data
- API or UI path
- steps
- expected result
- automation evidence
- Playwright selector notes
- whether this belongs in smoke, regression, or nightly

This two-step flow is slower than one prompt, but it is far safer. It separates thinking from formatting. QA work is not just producing test cases. QA work is finding what can break before users find it.

Use a numbered review loop

Here is the review loop I recommend:

Paste the real requirement, not a summary from memory.
Ask for a coverage matrix, not test cases.
Review missing states, roles, and data boundaries.
Ask the model to challenge its own matrix.
Convert only high-risk rows to executable scenarios.
Map automation evidence before writing code.
Run an eval so the same prompt can be checked again next week.

This loop turns AI into a repeatable QA assistant instead of a random spreadsheet generator.

Add an Eval Loop with PromptFoo and DeepEval

Prompt quality should be tested like code. PromptFoo describes itself as an open-source CLI and library for evaluating and red-teaming LLM apps. Its GitHub repository showed 22,204 stars when I checked for this article, and the npm registry reported 1,196,660 downloads in the last month for the promptfoo package.

DeepEval is another useful framework for LLM evaluation. Its quickstart shows a local eval flow with a test case, a metric, and the command deepeval test run. Its GitHub repository showed 16,161 stars during my check. These tools matter because AI testing prompts need regression checks, not one-time excitement.

If you used a prompt today and got good output, you still need to know whether it gives good output after the model changes, after the requirement changes, or after a teammate edits the prompt. That is where eval design becomes an SDET skill.

A small PromptFoo config for QA prompts

Here is a simple pattern you can adapt:

description: checkout coverage prompt eval

providers:
  - openai:gpt-4.1-mini
  - anthropic:messages:claude-3-5-sonnet-latest

prompts:
  - file://prompts/checkout-coverage.txt

tests:
  - vars:
      requirement: |
        Payment retry allowed 3 times after network timeout.
        Cart total must not change. Successful retry creates one order.
    assert:
      - type: contains
        value: inventory
      - type: contains
        value: one order
      - type: contains
        value: gateway reason code
      - type: llm-rubric
        value: Output must include positive, negative, edge, and abuse cases.

This is not a full enterprise setup. It is a starting point. The key is that you define what a good answer must include. If the model drops security, inventory, or audit evidence, the eval fails.

What to evaluate

For QA test generation, I evaluate five things:

Requirement coverage: does every explicit rule appear?
State coverage: are user, data, system, and network states represented?
Risk coverage: are money, security, privacy, and reliability risks visible?
Automation usefulness: does the output mention evidence, selectors, APIs, logs, or traces?
Duplicate control: does the model avoid repeating the same scenario in different words?

This is also why I wrote about LLM output evaluation for QA engineers. The future SDET will not only write automation. The future SDET will test the AI systems that write and maintain parts of that automation.

Convert Output into Automation-Ready Tests

A good AI-generated scenario should be easy to automate. If the output cannot be turned into a Playwright test, an API check, or a traceable manual check, it is probably too vague.

I like asking the model for an automation contract. That contract describes selectors, APIs, assertions, test data, and evidence. It does not need to write perfect code, but it must give enough structure for a human SDET to move fast.

TypeScript example with Playwright

Here is a small Playwright-style example based on the payment retry requirement:

import { test, expect } from '@playwright/test';

test('payment retry keeps total stable and creates one order', async ({ page }) => {
  await page.goto('/checkout');

  const totalBefore = await page.getByTestId('cart-total').innerText();

  await page.route('**/payments/charge', async route => {
    const request = route.request();
    const retry = Number(request.headers()['x-retry-count'] ?? '0');

    if (retry < 1) {
      await route.fulfill({
        status: 504,
        contentType: 'application/json',
        body: JSON.stringify({ reasonCode: 'NETWORK_TIMEOUT' })
      });
      return;
    }

    await route.continue();
  });

  await page.getByRole('button', { name: 'Pay now' }).click();
  await expect(page.getByText('Payment timed out')).toBeVisible();

  await page.getByRole('button', { name: 'Retry payment' }).click();

  await expect(page.getByTestId('cart-total')).toHaveText(totalBefore);
  await expect(page.getByText(/Order confirmed/)).toBeVisible();

  const orderIds = await page.getByTestId('order-id').allInnerTexts();
  expect(new Set(orderIds).size).toBe(1);
});

The point is not that AI writes this perfectly. The point is that the coverage prompt should lead to tests with observable evidence. If a generated test case says “verify payment retry works,” it is weak. If it says “simulate gateway timeout, assert retry count, stable total, one order ID, and gateway reason code,” it is useful.

Ask for evidence, not only steps

Evidence makes AI-generated tests stronger. For UI tests, evidence can be a screenshot, trace, network log, or DOM assertion. For API tests, it can be response schema, database record, event log, or idempotency key. For LLM app tests, it can be input, output, metric score, retrieved context, and judge reason.

This is why tools like Playwright MCP smoke testing are interesting for QA. Browser automation is moving closer to agent workflows, but the same rule applies: if you cannot inspect evidence, you cannot trust the run.

🚀 Build Real AI Testing Skills

Stop testing AI by guesswork. Learn DeepEval, RAG evaluation, and agent testing with guided projects.

Explore the AI Testing Course →

India Context: The New SDET Skill Gap

In India, I see a clear gap forming between testers who use AI for speed and testers who use AI with evaluation discipline. The first group asks ChatGPT for test cases. The second group builds prompts, coverage matrices, evals, and automation contracts.

That second group will have more career upside in 2026. Product companies do not need someone who can paste a requirement into an AI tool. They need someone who can say, “This generated suite missed three P0 risks, here is the eval that caught it, and here is the Playwright test that proves the fix.” That is a different level of SDET maturity.

What hiring managers will notice

If you are aiming for ₹25 to ₹40 LPA SDET roles, AI tool usage alone is not enough. You need to show system thinking. A strong portfolio project can include:

A real requirement document
A coverage-first prompt
A PromptFoo or DeepEval regression check
Generated test scenarios with human review notes
Playwright or API automation for the P0 paths
A short README explaining what the AI missed and how you fixed it

This kind of artifact stands out more than a screenshot of 50 generated test cases. It shows judgment. It shows that you understand both QA and AI limitations.

Manual testers can use this immediately

You do not need to become an ML engineer to start. Manual testers can use the same coverage-first method during requirement review. Ask AI for a matrix, challenge missing states, then bring the improved matrix to grooming or sprint planning.

That is practical AI adoption. It helps the team this week, not someday after a big platform migration. If you are transitioning from manual testing, read AI testing skills for manual testers next because it maps the skill shift without making it sound magical.

A Practical Checklist for QA Teams

Use this checklist before accepting any AI-generated test case set. It takes 10 minutes and catches most weak outputs.

Coverage checks

Does every requirement rule appear at least once?
Are user roles covered?
Are data boundaries covered?
Are state transitions covered?
Are negative, edge, and abuse scenarios present?
Are localization, accessibility, and security considered where relevant?

Automation checks

Does each high-priority case have clear evidence?
Can the test be automated through UI, API, or contract checks?
Are selectors, API endpoints, logs, or database records mentioned?
Does the output separate smoke, regression, and nightly scope?
Are flaky or hard-to-control conditions marked clearly?

Eval checks

Can the prompt be saved and versioned?
Can expected coverage rules be asserted?
Can duplicate or shallow answers be rejected?
Can the same prompt run against two models for comparison?
Can failures be reviewed in CI?

If the answer is no to most of these, the generated test cases are not ready. They are brainstorming material.

Key Takeaways

Weak AI-generated test cases are fixable when QA teams stop treating the model like a magic test writer and start treating it like a system that needs inputs, constraints, and evals.

Bad prompts create generic test cases because the model has to invent context.
Coverage-first prompting makes gaps visible before test steps are written.
PromptFoo and DeepEval help teams regression-test QA prompts and LLM outputs.
Automation evidence is the bridge between AI-generated scenarios and real SDET work.
For Indian SDETs, eval design is becoming a career differentiator, not a side skill.

My simple rule: do not ask AI to write test cases. Ask AI to prove coverage, then ask it to help you automate the highest-risk paths.

FAQ

Are AI-generated test cases bad?

No. AI-generated test cases are useful as a draft. They become risky when teams accept them without requirement mapping, risk review, and evidence checks.

What is the best prompt for AI test case generation?

The best prompt includes the real requirement, product rules, user roles, data states, risk categories, and an output format that starts with a coverage matrix. Test steps should come after coverage review.

Should QA teams use PromptFoo or DeepEval?

Use them when your prompts matter enough to regression-test. PromptFoo is strong for prompt and model comparison workflows. DeepEval is strong for LLM test cases, metrics, and local eval runs. Both push QA teams toward repeatable evaluation.

Can manual testers use this workflow?

Yes. Manual testers can use coverage-first prompting during requirement review. You do not need advanced automation skills to find missing states, weak assumptions, and shallow AI output.

What should I read next?

Read AI testing skills for manual testers, LLM output evaluation for QA engineers, and AI test agents need a planner, generator, and healer.

Sources Checked

PromptFoo GitHub repository and PromptFoo 0.121.15 release
promptfoo npm package and npm downloads API
DeepEval GitHub repository and DeepEval quickstart docs
@playwright/test npm package

🎓 Become an AI-Powered QA Engineer

Join hundreds of SDETs mastering LLM, RAG, and agent testing. Lifetime access, hands-on labs, and a job-ready portfolio.

Enroll in AI-Powered Testing Mastery →