Flaky Tests in AI Testing: QA Problem 2026

Flaky tests in AI testing are becoming a bigger problem in 2026 because AI is increasing the speed at which teams write tests, change code, and ship pull requests. That sounds good until the test suite becomes noisy. A flaky suite turns every CI failure into a debate: is the product broken, did the AI write a weak test, did the selector move, or did the environment blink?

I see this pattern more often now. Teams add Copilot, Cursor, Claude Code, or an internal coding agent. Developers start generating unit tests and Playwright tests faster than before. QA teams also use AI to create missing coverage. The test count goes up. The review depth does not always keep up. Then one bad thing happens: the CI signal gets weak.

Once the CI signal gets weak, everything slows down. Engineers rerun jobs. QA loses trust in automation. Product managers stop believing the release dashboard. The team still has 2,000 tests, but nobody knows which failures deserve attention.

This is why flaky tests are not a small automation issue in 2026. They are a delivery risk. AI makes that risk larger because it can create more tests, more mocks, more edge-case assumptions, and more asynchronous code than most teams can review properly.

Table of Contents

What flaky tests mean in 2026
Why AI made flaky tests worse
Where flakiness hides in AI-generated tests
The real cost of noisy CI
How to fix flaky tests in AI testing
A Playwright example that removes common flakiness
The QA process I would use in 2026
FAQ

What flaky tests mean in 2026

A flaky test is a test that passes and fails against the same code. Nothing meaningful changed in the application, but the result changed. That is the problem. The test no longer tells the team the truth.

Google’s Testing Blog defined a flaky result as a test that shows both passing and failing outcomes with the same code. In the same post, Google said around 1.5% of all test runs reported a flaky result, almost 16% of tests had some level of flakiness, and about 84% of pass-to-fail transitions involved a flaky test. That post dates from 2016, but the lesson is still current: even strong engineering teams struggle when automated tests become probabilistic. Source: Google Testing Blog on flaky tests.
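That definition translates directly into a check you can run over CI history. Here is a minimal sketch, assuming you can export one record per test execution; the `TestRun` shape and its field names are hypothetical, not taken from any specific CI tool.

```typescript
// One recorded execution of a test. Field names are illustrative.
interface TestRun {
  testId: string;
  commitSha: string;
  passed: boolean;
}

// Flag a test as flaky when the same commit produced both a pass and a fail.
function findFlakyTests(runs: TestRun[]): string[] {
  // key "testId@commitSha" -> set of outcomes seen for that pair
  const outcomes = new Map<string, Set<boolean>>();
  for (const run of runs) {
    const key = `${run.testId}@${run.commitSha}`;
    const seen = outcomes.get(key) ?? new Set<boolean>();
    seen.add(run.passed);
    outcomes.set(key, seen);
  }

  const flaky = new Set<string>();
  outcomes.forEach((seen, key) => {
    if (seen.size === 2) {
      flaky.add(key.slice(0, key.lastIndexOf("@")));
    }
  });
  return Array.from(flaky).sort();
}

const history: TestRun[] = [
  { testId: "checkout.spec > pays with card", commitSha: "a1b2c3", passed: true },
  { testId: "checkout.spec > pays with card", commitSha: "a1b2c3", passed: false },
  { testId: "profile.spec > saves name", commitSha: "a1b2c3", passed: false },
  { testId: "profile.spec > saves name", commitSha: "d4e5f6", passed: true },
];

// Only the checkout test changed outcome on the same commit.
const flakyTests = findFlakyTests(history);
```

The profile test is not flagged here because its outcome changed between two different commits, which may be a real regression or a real fix.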

In 2026, the scale is different. Many teams now run tests on every pull request, every merge, every deployment candidate, and sometimes every AI-generated patch. A flaky test that fires once per week is annoying. A flaky test that fires inside an AI-assisted workflow can block dozens of small changes in one day.

Flakiness breaks trust before it breaks the pipeline

The first damage is not technical. It is psychological. Once engineers learn that a red build might be fake, they start treating every failure as optional.

You hear comments like:

“Rerun it once. This test fails sometimes.”
“That spec has been unstable since last sprint.”
“Ignore this one. It passes locally.”
“The AI generated that test. I don’t trust it yet.”

That is a dangerous habit. The day a real regression appears in the same area, the team may ignore it because the suite already trained them to ignore red builds.

AI changes the economics of test creation

Before AI coding tools became common, writing tests had friction. A developer had to understand the feature, write the test, debug it, and clean it up. That friction was painful, but it also filtered bad tests.

AI reduces the friction. GitHub’s Copilot documentation explicitly lists generating unit tests, creating mock objects, creating end-to-end tests, and updating tests as supported tasks. Source: GitHub Copilot features.

That is useful. I use AI for test generation too. The issue is that generated tests often look complete before they are reliable. They may assert the visible text but miss the real state. They may wait for a timeout instead of waiting for a condition. They may mock the happy path and quietly skip the hard behavior.

Why AI made flaky tests worse

The simple answer: AI increases output faster than it increases judgment.

An AI tool can write ten tests in a minute. It cannot fully understand your production timing, backend queue delays, feature flag combinations, real browser behavior, or CI resource limits unless you give it that context. Most teams do not.

So AI-generated test suites tend to inherit four problems:

They over-trust selectors that are easy to see but easy to break.
They use sleeps because sleeps are easy to write.
They assert shallow UI changes instead of stable business outcomes.
They mock too much, so the test passes while the integration path is broken.

That is how teams get more coverage on paper and less confidence in practice.

AI writes tests from visible patterns

Most coding agents work from the files, prompts, and examples they can see. If your repo has old Selenium tests full of brittle XPath, the AI may copy that pattern. If your Playwright tests use waitForTimeout(3000), the AI may repeat it. If your page objects hide poor waits behind helper methods, the AI may call those helpers again.

The agent is not being careless. It is following the local style. Bad examples become training material inside your own repo context.

AI-generated tests often pass once

A test that passes once is not the same as a test that is stable. This difference matters. A coding agent can run a test locally, see green, and assume the work is done. CI is different. CI runs on shared infrastructure, with different CPU pressure, browser timing, network behavior, seeded data, and parallel workers.

Many flaky tests pass during the first local run. They fail when the suite runs in parallel, when the database has older data, or when the browser is slower than expected.

AI increases the number of small changes

AI also changes developer behavior. Teams ship smaller patches because the tool helps them edit faster. That should be good. But if every small patch triggers a noisy suite, the cost of noise multiplies.

METR’s 2025 study on experienced open-source developers found that, in their experiment, developers took 19% longer when using early-2025 AI tools on their own repositories. The study is not about flaky tests, but it is a useful warning: AI does not automatically remove delivery friction. Sometimes it moves the friction into review, debugging, and verification. Source: METR 2025 AI developer productivity study.

Flaky tests are one of those friction points. They sit after code generation, after test generation, and before release. If that gate is weak, everything behind it gets delayed.

Where flakiness hides in AI-generated tests

AI-generated tests usually fail in boring ways. That is the frustrating part. The test looks reasonable at first glance. The bug hides in timing, isolation, data, or selector choice.

1. Timing assumptions

The most common issue is a hidden timing assumption. The test assumes a modal appears immediately, a toast stays visible long enough, an API response arrives in a fixed window, or a spinner disappears before the assertion runs.

Generated tests often include code like this:

await page.click('text=Save');
await page.waitForTimeout(2000);
await expect(page.locator('text=Saved')).toBeVisible();

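This test may pass on a developer laptop and still fail in CI. If the toast auto-dismisses while the test is sleeping, the assertion arrives too late. If the save takes longer than the sleep plus the assertion's own retry window (5 seconds by default in Playwright), the assertion times out. Increasing the sleep to 5 seconds does not fix the design. It only makes the suite slower and still flaky under pressure.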

2. Weak selectors

AI tends to choose selectors that are obvious in the DOM. Text selectors, long CSS chains, nth-child selectors, and generated class names appear often because they are visible. Stable selectors require product knowledge. The model needs to know which element represents the user action, not only which element is present.

If your team already uses data-testid, accessible roles, and stable labels, AI has better examples to follow. If the repo has weak selectors, the generated tests will copy them.

I wrote more about this problem in Self-Healing Selectors in 2026: Production Reality. Self-healing can help, but it cannot rescue a suite built on poor intent.

3. Shared state

AI-generated tests often reuse existing setup code without questioning it. That creates shared state problems. One test creates a customer. Another test expects a clean customer list. A third test deletes a record that another worker still needs.

Parallel execution exposes this quickly. A test that passes alone fails when eight workers run together. The application is fine. The test design is not.
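One way to avoid these collisions is to give every test its own data. The sketch below is a minimal example, not a full fixture: `TestScope`, `buildCustomer`, and the `runId` field are hypothetical names. In Playwright the worker index and title could come from `testInfo.workerIndex` and `testInfo.title`; here they are plain inputs so the idea stays visible.

```typescript
// Identifies one test execution. runId is a hypothetical per-CI-run id.
interface TestScope {
  runId: string;
  workerIndex: number;
  testTitle: string;
}

interface Customer {
  name: string;
  email: string;
}

// Turn a free-form title into a short slug safe for names and emails.
function slugify(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "")
    .slice(0, 30);
}

// Build a customer that belongs to exactly one test in one worker,
// so parallel workers never read or delete each other's records.
function buildCustomer(scope: TestScope): Customer {
  const tag = `${scope.runId}-w${scope.workerIndex}-${slugify(scope.testTitle)}`;
  return { name: `Customer ${tag}`, email: `qa+${tag}@example.test` };
}

const customerA = buildCustomer({ runId: "run42", workerIndex: 0, testTitle: "creates a customer" });
const customerB = buildCustomer({ runId: "run42", workerIndex: 1, testTitle: "creates a customer" });
```

The two workers get different emails even though they run the same test. If your suite retries tests inside the same worker, the retry number would need to go into the tag as well.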

4. Mocked confidence

Mocks are useful, but AI can overuse them. A generated unit test may mock the exact function that contains the risk. An API test may mock the response so tightly that contract changes never surface. A UI test may intercept the backend call and then claim the user flow works.

That creates false confidence. The test is green because the mock is green. Production does not care.

The real cost of noisy CI

Flaky tests waste time, but the time loss is only one part. The larger cost is decision quality.

A reliable CI pipeline answers one question: can we merge or release this change? A flaky pipeline answers with a shrug. That shrug is expensive.

Reruns become process debt

Every rerun looks small. Thirty seconds here. Five minutes there. One engineer waiting during lunch. Another engineer checking the same red job after a standup.

At team scale, this becomes process debt. Ten engineers each losing 20 minutes a day to reruns and investigation adds up to 200 minutes a day, or more than 16 hours over a five-day week. That is over two full engineering days spent asking whether the test suite is lying.

I am not using that number as an industry benchmark. It is a simple team math example. Many teams lose more.
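The same arithmetic as a small function, so you can plug in your own numbers. The five-day week and eight-hour day are assumptions for illustration.

```typescript
// Hours a team loses per week to reruns and investigation.
function weeklyHoursLost(
  engineers: number,
  minutesPerDay: number,
  workDaysPerWeek: number
): number {
  return (engineers * minutesPerDay * workDaysPerWeek) / 60;
}

// 10 engineers x 20 minutes x 5 days = 1000 minutes per week
const hoursLost = weeklyHoursLost(10, 20, 5);

// Assumes an 8-hour engineering day.
const engineeringDaysLost = hoursLost / 8;
```

With the example numbers, `hoursLost` comes to roughly 16.7 and `engineeringDaysLost` to a little over 2.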

Bug triage gets polluted

Flaky tests also pollute bug triage. A real defect enters the backlog beside five false alarms. The team spends energy sorting noise from signal. Over time, people start asking for manual confirmation before they believe automation.

That is a bad place for any QA team. Automation should reduce manual checking. A flaky suite brings manual checking back through the side door.

AI agents can amplify bad feedback

AI coding agents need feedback. They run tests, read failures, patch code, and try again. If the feedback is flaky, the agent may fix the wrong thing.

Imagine an agent changing product code because a test failed due to stale test data. Or changing a selector because an environment timeout caused the page to load slowly. Now the team has two problems: the original flaky test and a patch created to satisfy bad feedback.

This is why flaky tests matter more in agentic workflows than in old CI. The test result may guide the next code change automatically.
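A simple guard is to rerun a failed test on the same code before an agent is allowed to act on it. The sketch below assumes a hypothetical `rerun` callback that returns true when the test passes; how you actually trigger the rerun depends on your runner and your agent setup.

```typescript
type Verdict = "consistent-failure" | "flaky";

// Rerun an already-failed test a few times on unchanged code.
// Any pass means the outcome changed without a code change: flaky.
function classifyFailure(rerun: () => boolean, attempts: number): Verdict {
  for (let i = 0; i < attempts; i++) {
    if (rerun()) {
      return "flaky";
    }
  }
  return "consistent-failure";
}

// Only consistent failures are handed to the agent as patch-worthy feedback.
function agentMayPatch(verdict: Verdict): boolean {
  return verdict === "consistent-failure";
}

// Helper for the example: replays a fixed list of outcomes.
function scripted(outcomes: boolean[]): () => boolean {
  let index = 0;
  return () => outcomes[index++] ?? false;
}

const staleDataCase = classifyFailure(scripted([false, true, false]), 3);
const realBugCase = classifyFailure(scripted([false, false, false]), 3);
```

This filter is deliberately conservative. A consistent failure can still come from a broken environment, so it narrows what the agent sees without proving the product is at fault.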

If your team is experimenting with agents, read LangGraph for QA Engineers: Multi-Agent Pipelines and MCP Servers for Testers: Playwright TypeScript QA Guide. Both topics depend on clean test feedback.

How to fix flaky tests in AI testing

You cannot fix flaky tests with one more retry. Retries hide pain. They are useful as a temporary safety net, but they should produce data, not silence.

The better approach is to treat flakiness as a product quality signal. When a test flakes, the team should learn which part of the system is unstable: the test, the app, the environment, the data, or the automation pattern.

Start with quarantine, but do not stop there

Quarantine unstable tests so they do not block every merge. Then make the quarantine visible. A hidden quarantine folder becomes a graveyard. A visible flaky test dashboard becomes work the team can prioritize.

Track at least these fields:

Test name and file path
Failure rate across the last 20 runs
First flaky date
Suspected cause
Owner
Decision: fix, delete, rewrite, or move to lower frequency

If a test has no owner, it will stay flaky forever.
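The fields above fit in a small record type. This is a sketch of one possible shape, with hypothetical names, plus two helpers: the failure rate over the last 20 runs and a list of entries nobody owns.

```typescript
type Decision = "fix" | "delete" | "rewrite" | "lower-frequency";

interface QuarantineEntry {
  testName: string;
  filePath: string;
  lastRuns: boolean[];      // oldest first, true = pass
  firstFlakyDate: string;   // ISO date
  suspectedCause: string;
  owner: string | null;
  decision: Decision | null;
}

// Share of failures in the most recent `window` runs.
function failureRate(entry: QuarantineEntry, window = 20): number {
  const recent = entry.lastRuns.slice(-window);
  if (recent.length === 0) return 0;
  const failures = recent.filter((passed) => !passed).length;
  return failures / recent.length;
}

// Entries without an owner are the ones that stay flaky.
function unowned(entries: QuarantineEntry[]): QuarantineEntry[] {
  return entries.filter((entry) => entry.owner === null);
}

const saveProfile: QuarantineEntry = {
  testName: "user can save profile",
  filePath: "tests/profile.spec.ts",
  lastRuns: [
    true, true, true, false, true, true, true, true, true, true,
    true, true, false, true, true, true, true, false, true, true,
  ],
  firstFlakyDate: "2026-02-20",
  suspectedCause: "fixed sleep before toast assertion",
  owner: null,
  decision: "rewrite",
};

const rate = failureRate(saveProfile);      // 3 failures in 20 runs
const needsOwner = unowned([saveProfile]);
```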

Use AI to find patterns, not to blindly patch

AI is very useful for triage. Give it the last 20 failure logs, screenshots, traces, and commit ranges. Ask it to group failures by cause. Ask it to identify repeated selectors, repeated network errors, and repeated timeout points.

Do not ask it to “fix all flaky tests” in one broad prompt. That usually creates shallow changes. Use narrow prompts:

Analyze these 12 Playwright failures.
Group them by root cause.
Do not edit code yet.
Return:
1. suspected flaky pattern
2. evidence from logs
3. safest code change
4. tests that need product investigation

This prompt forces the agent to reason before editing.
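Before handing logs to a model, a cheap deterministic pass can collapse near-identical errors so the prompt stays small. This is a rough sketch: the `signature` normalization is an assumption about what varies between runs, and the sample messages are invented for illustration, not copied from real Playwright output.

```typescript
interface Failure {
  test: string;
  message: string;
}

// Replace volatile details (quoted values, numbers) so similar errors collapse.
function signature(message: string): string {
  return message
    .replace(/"[^"]*"|'[^']*'/g, "<str>")
    .replace(/\d+/g, "<n>")
    .trim();
}

// Map each signature to the tests that produced it.
function groupBySignature(failures: Failure[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const failure of failures) {
    const key = signature(failure.message);
    const tests = groups.get(key) ?? [];
    tests.push(failure.test);
    groups.set(key, tests);
  }
  return groups;
}

// Invented sample messages.
const failures: Failure[] = [
  { test: "saves profile", message: "Timed out after 5000 ms waiting for 'Saved'" },
  { test: "updates avatar", message: "Timed out after 30000 ms waiting for 'Profile updated'" },
  { test: "loads dashboard", message: "Connection refused on port 3000" },
];

const groups = groupBySignature(failures);
```

Here the two timeout failures land in one group and the connection failure in another, which gives the model two patterns to explain instead of three unrelated logs.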

Write test-generation rules for your AI tools

If your team uses Cursor, Claude Code, Copilot, or another coding agent, add test rules to your repo. Put them in a visible instruction file. The exact file depends on your tool, but the content matters more than the filename.

Use rules like:

Never use fixed sleeps in Playwright tests.
Prefer role-based locators and stable test IDs.
Each test must create its own data or use an isolated fixture.
Do not mock the behavior being tested.
Every generated test must explain what user risk it covers.

These rules reduce flakiness because they change what the AI copies next time.
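Rules in an instruction file work better when something checks them. Below is a crude sketch of a source scanner for two of the rules; the rule names and patterns are my own illustration, and a regex scan like this will miss cases that a proper lint rule would catch.

```typescript
interface Violation {
  line: number;
  rule: string;
}

// Patterns are intentionally simple and have no global flag,
// so RegExp.test stays stateless between lines.
const RULES: { rule: string; pattern: RegExp }[] = [
  { rule: "no fixed sleeps", pattern: /\bwaitForTimeout\s*\(/ },
  { rule: "no positional selectors", pattern: /:nth-child\(|\bxpath=/ },
];

// Scan test source line by line and report 1-based line numbers.
function scanTestSource(source: string): Violation[] {
  const violations: Violation[] = [];
  source.split("\n").forEach((text, index) => {
    for (const { rule, pattern } of RULES) {
      if (pattern.test(text)) {
        violations.push({ line: index + 1, rule });
      }
    }
  });
  return violations;
}

const sample = [
  "await page.click('text=Save');",
  "await page.waitForTimeout(2000);",
  "await page.locator('ul > li:nth-child(3)').click();",
].join("\n");

const found = scanTestSource(sample);
```

In this sample, line 2 is reported for the fixed sleep and line 3 for the positional selector. Running a check like this on generated tests before review keeps the bad pattern from becoming the next example the AI copies.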

A Playwright example that removes common flakiness

Playwright already has strong auto-waiting. Its docs explain that actions wait for actionability checks before running. Source: Playwright actionability docs. The problem is that many generated tests still add sleeps because old examples in the repo use sleeps.

Here is the weak version:

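import { test, expect } from '@playwright/test';

test('user can save profile', async ({ page }) => {
  await page.goto('/profile');
  // Text and CSS id selectors: break when copy or markup changes.
  await page.click('text=Edit');
  await page.fill('#name', 'Pramod');
  await page.click('text=Save');
  // Fixed sleep: adds 3 seconds to every run and still guesses at timing.
  await page.waitForTimeout(3000);
  // Only checks the toast, not that the save reached the backend.
  await expect(page.locator('text=Profile updated')).toBeVisible();
});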

The test has three problems. It uses text selectors for actions, it relies on a fixed wait, and it only checks a toast. A toast can appear while the backend save still fails later.

A better version waits for the real signal:

import { test, expect } from '@playwright/test';

test('user can save profile', async ({ page }) => {
  await page.goto('/profile');

  // Role and label locators follow what the user sees, not the markup.
  await page.getByRole('button', { name: 'Edit profile' }).click();
  await page.getByLabel('Display name').fill('Pramod');

  // Start listening before the click so the response cannot be missed.
  const saveResponse = page.waitForResponse(response =>
    response.url().includes('/api/profile') &&
    response.request().method() === 'PUT' &&
    response.status() === 200
  );

  await page.getByRole('button', { name: 'Save profile' }).click();
  await saveResponse;

  // Assert the confirmation and the field value after the save.
  await expect(page.getByText('Profile updated')).toBeVisible();
  await expect(page.getByLabel('Display name')).toHaveValue('Pramod');
});

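This is not perfect, but it is more honest. It waits for the backend save and checks the UI state after the save. One caveat: because the predicate only matches a 200 response, a failed save shows up as a timeout on the wait rather than as a clear status failure. Even so, if this test fails, the failure has more meaning.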

What I would ask an AI agent to do

I would not ask the agent to “make the test better.” That prompt is too vague. I would ask:

Rewrite this Playwright test to remove flakiness.
Rules:
- no waitForTimeout
- use getByRole, getByLabel, or stable data-testid
- wait for the API response that proves the save happened
- keep the test focused on one user outcome
- explain the changes in at most 5 bullets

The prompt gives the AI boundaries. Boundaries are how you get better test code from AI.

The QA process I would use in 2026

If I were setting up an AI-assisted QA process now, I would build around one principle: every generated test must earn trust before it joins the blocking suite.

Here is the process I would use.

1. Split generated tests by confidence

Do not put every AI-generated test into the main CI gate on day one. Split them into buckets:

Blocking: stable tests that protect critical user journeys.
Observation: new AI-generated tests that run but do not block merges yet.
Exploratory: tests used to discover coverage gaps, not to enforce release decisions.
Deleted: tests that duplicate existing coverage or assert weak behavior.

After 20 clean runs, promote a test from observation to blocking. If it flakes twice, investigate before promotion.
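The promotion rule is easy to encode. In this sketch I read "20 clean runs" as 20 consecutive passes at the end of the observation history, which is my interpretation; the bucket names and thresholds are parameters you would tune.

```typescript
type Bucket = "observation" | "blocking" | "investigate";

// Decide where an observed test goes next.
// runs: observation-bucket outcomes, oldest first, true = pass.
function nextBucket(
  runs: boolean[],
  requiredCleanRuns = 20,
  flakeLimit = 2
): Bucket {
  const failures = runs.filter((passed) => !passed).length;
  if (failures >= flakeLimit) {
    return "investigate";
  }

  // Count consecutive passes from the most recent run backwards.
  let cleanStreak = 0;
  for (let i = runs.length - 1; i >= 0 && runs[i]; i--) {
    cleanStreak++;
  }
  return cleanStreak >= requiredCleanRuns ? "blocking" : "observation";
}

// Helper for the examples: n passing runs.
function passes(count: number): boolean[] {
  return Array.from({ length: count }, () => true);
}

const promoted = nextBucket(passes(20));
const stillWatching = nextBucket(passes(12));
const flagged = nextBucket([true, false, true, true, false, true]);
```

A test with one early failure followed by 20 clean runs is still promoted under this rule. If you want a stricter gate, reset the history after any failure.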

2. Review tests like product code

AI-generated tests need code review. The reviewer should check selectors, waits, data setup, assertions, and business value. A test that only checks implementation detail should not block a release.

Ask one practical question: if this test fails at 6 PM on release day, will we trust it enough to stop the release? If the answer is no, the test is not ready for the blocking suite.

3. Measure flakiness as a first-class metric

Track flaky rate per test, per suite, per team, and per service. Do not only track pass rate. A suite with 96% pass rate may still waste hours if the same unstable tests fail again and again.

Useful metrics include:

Number of flaky tests introduced per week
Mean time to fix a flaky test
Reruns per pull request
Quarantined tests older than 14 days
Tests deleted because they had no useful signal

That last metric matters. Deleting a bad test is sometimes the best QA decision.
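Three of these metrics need nothing more than dates and counts. The sketch below uses hypothetical record shapes and made-up sample data; it assumes dates are stored as ISO date strings.

```typescript
interface PullRequestCi { id: number; reruns: number; }
interface FixedFlake { detected: string; fixed: string; }   // ISO dates
interface Quarantined { test: string; since: string; }      // ISO date

const DAY_MS = 24 * 60 * 60 * 1000;

function daysBetween(fromIso: string, toIso: string): number {
  return (Date.parse(toIso) - Date.parse(fromIso)) / DAY_MS;
}

// Average number of manual reruns per pull request.
function rerunsPerPullRequest(prs: PullRequestCi[]): number {
  if (prs.length === 0) return 0;
  return prs.reduce((sum, pr) => sum + pr.reruns, 0) / prs.length;
}

// Mean time to fix a flaky test, in days.
function meanDaysToFix(fixes: FixedFlake[]): number {
  if (fixes.length === 0) return 0;
  const total = fixes.reduce((sum, f) => sum + daysBetween(f.detected, f.fixed), 0);
  return total / fixes.length;
}

// Quarantined tests older than the allowed age.
function staleQuarantine(items: Quarantined[], todayIso: string, maxAgeDays = 14): string[] {
  return items
    .filter((item) => daysBetween(item.since, todayIso) > maxAgeDays)
    .map((item) => item.test);
}

const avgReruns = rerunsPerPullRequest([
  { id: 101, reruns: 0 }, { id: 102, reruns: 2 },
  { id: 103, reruns: 1 }, { id: 104, reruns: 3 },
]);
const avgFixDays = meanDaysToFix([
  { detected: "2026-03-01", fixed: "2026-03-04" },
  { detected: "2026-03-02", fixed: "2026-03-09" },
]);
const stale = staleQuarantine(
  [
    { test: "checkout.spec > applies coupon", since: "2026-02-20" },
    { test: "profile.spec > saves name", since: "2026-03-05" },
  ],
  "2026-03-10"
);
```

With this sample data the team averages 1.5 reruns per pull request, takes 5 days on average to fix a flaky test, and has one quarantined test past the 14-day limit.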

4. Feed failures back into your AI rules

Every flaky failure teaches you something about your repo. If five failures come from weak selectors, update your AI instructions. If three failures come from shared data, add fixture rules. If timeouts keep happening after a specific action, create a helper that waits for the real application state.

This turns flakiness into a feedback loop. The team gets better examples. The AI copies better examples. The next generated test has a higher chance of being stable.

Why this is the biggest QA problem of 2026

AI will keep writing more code and more tests. That part is not going away. The bottleneck moves to verification. The teams that win will not be the teams with the largest number of generated tests. They will be the teams with the cleanest signal.

Flaky tests attack that signal directly. They make good automation look unreliable. They make real defects look optional. They make AI agents chase false failures. They slow down developers who thought AI would speed them up.

For QA engineers, this is the opportunity. The future of QA is not only writing prompts that generate tests. The real work is building systems where AI-generated tests can be trusted. That means better fixtures, better selectors, better CI data, better review rules, and a clear policy for unstable tests.

In 2026, flaky tests in AI testing are the place where QA skill becomes visible. Anyone can generate a test. Not everyone can make a test suite worth believing.

FAQ

Are flaky tests really worse because of AI?

Yes, in many teams. AI increases the number of generated tests and code changes. If the review process and CI design do not improve at the same speed, flaky tests become more common and more expensive.

Should teams stop using AI to generate tests?

No. AI is useful for creating test drafts, edge-case ideas, mocks, and refactoring suggestions. The mistake is treating generated tests as production-ready without review and stability checks.

What is the fastest way to reduce flaky tests?

Start by removing fixed sleeps, weak selectors, and shared state. Then quarantine unstable tests with owners and deadlines. Do not allow old quarantined tests to sit untouched for months.

Can self-healing selectors solve flaky tests?

They can help with selector drift, but they do not solve timing issues, data isolation, poor assertions, or broken product behavior. Self-healing is a tool, not a replacement for test design.

What should QA engineers learn for AI testing in 2026?

Learn Playwright reliability patterns, CI debugging, prompt rules for coding agents, test data isolation, and failure analysis. The valuable QA engineer in 2026 is the person who can turn AI output into trustworthy release signal.

Final thoughts

Flaky tests were already painful before AI. In 2026, they are more dangerous because AI depends on test feedback to move fast. Bad feedback creates bad fixes. Weak tests create weak confidence.

If your team is adding AI to test automation, do not measure success only by test count. Measure the signal. Count reruns. Track quarantined tests. Review generated tests like production code. Delete tests that do not protect a real user outcome.

The teams that get this right will ship faster because their CI tells the truth. The teams that ignore it will have more tests, more dashboards, and more people clicking rerun.
