Cost of Flaky Tests: Engineering Time and CI Waste
The cost of flaky tests is not a testing problem. It is an engineering tax that shows up as wasted CI minutes, delayed releases, ignored alerts, and developers who stop trusting the pipeline.
I see teams spend months adding more tests while the real leak sits inside the build: tests that pass, fail, and pass again without a code change. This article gives you a practical 2026 model to estimate that cost, reduce it, and make flakiness visible before it quietly becomes normal.
Table of Contents
- What Are Flaky Tests?
- The Cost of Flaky Tests in 2026
- How Flaky Tests Waste CI/CD Capacity
- How Flaky Tests Burn Engineering Time
- Root Causes I See Most Often
- A Simple Flaky Test Cost Model
- Playwright Example: Track Retries and Failures
- A 30-Day Triage System for QA Teams
- India QA Team Context
- Key Takeaways
- FAQ
What Are Flaky Tests?
A flaky test is a test that gives different results for the same code. One run passes. The next run fails. A rerun passes again. No product code changed, but the pipeline changed its mind.
Google’s testing team defines a flaky result as a test that shows both passing and failing results with the same code. In its public write-up on flaky tests, Google said it saw about 1.5% of all test runs report a flaky result, and almost 16% of tests had some level of flakiness. That number should scare any QA leader because Google has stronger tooling than most teams.
Why the definition matters
Many teams call every intermittent failure flaky. That is too loose. A real product bug that appears only under a race condition is not noise. A test that fails because your application has an actual concurrency bug is doing its job.
I split failures into three buckets:
- Product failures: the user flow is broken.
- Test failures: the assertion, selector, fixture, or setup is weak.
- Environment failures: CI, network, test data, or third-party dependencies failed.
The expensive part is not the label. The expensive part is the investigation. If an engineer spends 25 minutes proving a failure is noise, the company still paid for those 25 minutes.
Flaky tests destroy trust before they destroy speed
The first cost is psychological. Developers stop treating red builds as meaningful. QA stops pushing back because every build has “known flakes.” Managers start asking for manual confirmation before a release.
Once that happens, automation is no longer a quality gate. It is background noise with screenshots.
The Cost of Flaky Tests in 2026
The cost of flaky tests has increased in 2026 because pipelines run more often, test suites are larger, and teams expect faster release cycles. A flaky test that ran once per day in 2016 may now run on every pull request, every merge, every preview environment, and every deployment.
That changes the math. Even a small flake rate becomes expensive when multiplied by 50 pull requests, 8 parallel shards, and multiple environments.
The visible cost: CI minutes
CI waste is the easy part to measure. If a failed run triggers a full rerun, you pay for the first run and the rerun. If your suite takes 18 minutes across 10 workers, one false red build can consume 180 worker-minutes before a human even opens the logs.
Most cloud CI bills hide this because they charge by credits, concurrency, or included minutes. But the waste is still real. You see it as longer queue time, slower feedback, more runners, and larger monthly bills.
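A rough way to put a number on this is sketched below. The helper name `workerMinutesConsumed` and its parameters are hypothetical, and the sketch assumes every false red build triggers full-suite reruns on all workers.

```typescript
// Hypothetical helper: worker-minutes consumed by builds that went red for no real reason.
// Assumes each false red build costs the original run plus `rerunsPerFalseRed` full reruns.
function workerMinutesConsumed(
  suiteMinutes: number,
  workers: number,
  falseRedBuilds: number,
  rerunsPerFalseRed: number = 1
): number {
  const minutesPerRun = suiteMinutes * workers;
  return falseRedBuilds * minutesPerRun * (1 + rerunsPerFalseRed);
}

// The 18-minute, 10-worker suite from the text: the first run alone.
const firstRunOnly = workerMinutesConsumed(18, 10, 1, 0); // 180 worker-minutes
// The same build once somebody triggers one full rerun.
const withOneRerun = workerMinutesConsumed(18, 10, 1); // 360 worker-minutes
```

Multiply the result by your per-minute runner price to turn it into a line item, if your CI provider bills that way.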
The hidden cost is bigger. Engineers lose context when they switch from feature work to build forensics. A flaky test does not only consume the time spent reading logs. It also breaks flow.
Martin Fowler’s article on eradicating non-determinism in tests makes the core point clearly: non-deterministic tests damage the value of an automated regression suite. If teams cannot trust the suite, they either ignore it or add manual checks around it.
The release cost: delayed confidence
Google also reported that about 84% of pass-to-fail transitions in post-submit testing involved a flaky test. The important lesson is not that every team has Google’s exact number. The lesson is that flaky transitions can dominate CI investigation work.
When red builds are mostly noise, every real failure takes longer to identify. That delay increases risk. The team may ship late, or worse, ship after assuming a real failure is “just another flake.”
How Flaky Tests Waste CI/CD Capacity
CI/CD waste from flaky tests shows up in four places: reruns, blocked queues, duplicated artifacts, and human-triggered rechecks. None of these require a dramatic failure. Small daily friction adds up.
1. Reruns multiply your pipeline cost
Many teams add retries as the first fix. Playwright, Cypress, pytest, and most CI systems support retries. Retries are useful for diagnosis, but they are not a cure.
Playwright’s own documentation on test retries describes tests as passed, flaky, or failed based on retry behavior. That classification is helpful because it separates a clean pass from a pass-after-retry. Your dashboard should make that difference visible.
A retry policy like this is common:
- Run the full test suite on pull request.
- If one test fails, retry the failed test once.
- If the retry passes, mark the build green.
- If the retry fails, rerun the job or ask QA to inspect.
The problem is step three. A green build after retry is not the same as a stable build. It is a warning with a green icon.
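One way to keep that warning visible is to give the build three states instead of two. This is a minimal sketch: the labels `clean`, `green-after-retry`, and `red` and the `TestAttempts` shape are my own assumptions, not something Playwright or your CI system emits.

```typescript
// Hypothetical build-level labels; they are this sketch's own, not a tool's output.
type BuildSignal = 'clean' | 'green-after-retry' | 'red';

interface TestAttempts {
  title: string;
  // One entry per attempt, in order: true = passed, false = failed.
  attempts: boolean[];
}

function classifyBuild(tests: TestAttempts[]): BuildSignal {
  let sawRetryPass = false;
  for (const t of tests) {
    // A test with no recorded attempts is treated as red rather than silently passing.
    const finalPassed = t.attempts.length > 0 && t.attempts[t.attempts.length - 1];
    if (!finalPassed) return 'red';
    if (t.attempts.length > 1) sawRetryPass = true;
  }
  return sawRetryPass ? 'green-after-retry' : 'clean';
}

// A login test that failed once and then passed turns the whole build into a warning.
const retriedBuild = classifyBuild([
  { title: 'checkout', attempts: [true] },
  { title: 'login', attempts: [false, true] },
]); // 'green-after-retry'
```

The merge can still be allowed on `green-after-retry`. The point is that the dashboard counts it separately from `clean`.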
2. Flaky tests block parallel workers
Modern test suites use sharding. That makes flakiness more expensive. One bad shard can hold the whole build hostage while nine other shards finish cleanly.
If your suite runs in 12 shards and one shard has to run twice, your wall-clock time may jump from 12 minutes to 24 minutes. The CI bill also grows because runners stay allocated longer.
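The arithmetic behind that jump can be sketched in a few lines. The `ShardRun` shape is hypothetical, and the sketch assumes a rerun repeats the whole shard and that each runner is released as soon as its own shard finishes.

```typescript
interface ShardRun {
  shard: number;
  minutesPerAttempt: number;
  attempts: number; // 1 = ran once, 2 = ran twice in total
}

// Wall-clock time is set by the slowest shard; billed time is the sum across shards.
function buildTimes(shards: ShardRun[]): { wallClock: number; billed: number } {
  let wallClock = 0;
  let billed = 0;
  for (const s of shards) {
    const total = s.minutesPerAttempt * s.attempts;
    wallClock = Math.max(wallClock, total);
    billed += total;
  }
  return { wallClock, billed };
}

// 12 shards of 12 minutes each; shard 7 runs twice in total.
const exampleShards: ShardRun[] = [];
for (let i = 1; i <= 12; i++) {
  exampleShards.push({ shard: i, minutesPerAttempt: 12, attempts: i === 7 ? 2 : 1 });
}
const exampleTimes = buildTimes(exampleShards); // wallClock: 24, billed: 156 (up from 144)
```

The billed minutes grow by about 8%, but the feedback time doubles. That gap is why flakiness in a sharded suite feels worse than the invoice suggests.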
3. Artifacts and traces become noise
Screenshots, videos, traces, HAR files, logs, and reports are useful when they point to a real problem. They become expensive when nobody reads them. A team that stores thousands of flaky-run artifacts each week pays for storage and still gets weak signal.
Tools like Playwright traces are excellent, but the process matters. If every flaky retry produces another artifact without ownership, the artifact folder becomes a junk drawer.
4. Manual rechecks return through the side door
The final CI/CD waste is manual testing disguised as caution. Someone says, “Can QA quickly check login because the test failed again?” Then the manual tester repeats a flow that automation was supposed to cover.
That is not a one-off. It is a system smell. If a test repeatedly needs human confirmation, either fix the test, remove it from the release gate, or replace it with a better check.
How Flaky Tests Burn Engineering Time
The cost of flaky tests is easiest to underestimate when you only count CI minutes. Developer time is the real bill.
The 25-minute failure tax
Here is a conservative example from a mid-sized product team:
- 40 pull requests per day.
- 20-minute end-to-end pipeline.
- 8% of PR builds hit at least one flaky failure.
- Each flaky failure takes 25 minutes of engineer or QA time.
- Average blended engineering cost: ₹2,000 per hour.
That gives you 3.2 flaky PR builds per day. At 25 minutes each, the team loses 80 minutes per day. At ₹2,000 per hour, that is about ₹2,667 per day, ₹58,674 over a 22-day working month, and more than ₹7 lakh per year.
That is only investigation time. It ignores release delay, context switching, extra CI spend, and customer risk.
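If you want to plug in your own numbers, the same arithmetic fits in a small function. This is a sketch under stated assumptions: the `FlakeTaxInput` field names are hypothetical, and 22 working days per month is my assumption for the monthly figure.

```typescript
interface FlakeTaxInput {
  prsPerDay: number;
  flakyBuildRate: number;      // fraction of PR builds, e.g. 0.08 for 8%
  triageMinutes: number;       // per flaky failure
  hourlyCostInr: number;       // blended engineering rate
  workingDaysPerMonth: number; // assumption: 22
}

function flakeTax(input: FlakeTaxInput) {
  const flakyBuildsPerDay = input.prsPerDay * input.flakyBuildRate;
  const minutesPerDay = flakyBuildsPerDay * input.triageMinutes;
  const costPerDay = (minutesPerDay / 60) * input.hourlyCostInr;
  const costPerMonth = costPerDay * input.workingDaysPerMonth;
  return { flakyBuildsPerDay, minutesPerDay, costPerDay, costPerMonth, costPerYear: costPerMonth * 12 };
}

// The mid-sized team from the example above.
const taxExample = flakeTax({
  prsPerDay: 40,
  flakyBuildRate: 0.08,
  triageMinutes: 25,
  hourlyCostInr: 2000,
  workingDaysPerMonth: 22,
});
// About 80 minutes and ₹2,667 per day. Unrounded, the month comes to about ₹58,667;
// the ₹58,674 in the text comes from rounding the daily figure first.
```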
Context switching is not free
A developer debugging payment logic has to switch into test infrastructure mode when the pipeline fails. They open the report, scan the trace, inspect the selector, compare the last run, maybe rerun locally, and then return to the feature.
That switch is expensive because it breaks mental state. The calendar shows 15 minutes. The brain pays more.
Flaky tests make code review slower
Code review depends on confidence. A reviewer wants to know whether the PR is safe. If the build is red, the reviewer waits. If the author says “known flaky,” the reviewer has to decide whether to trust that claim.
After enough false alarms, review culture changes. People merge with red builds. Teams add labels like ci-flake and move on. That label may be practical, but it can also become a trash bin for real failures.
Root Causes I See Most Often
Flaky tests do not come from one source. In 2026, I see five patterns again and again across Selenium, Playwright, Cypress, API tests, and mobile automation.
Weak waiting strategy
Hard waits are still everywhere. A test waits for 3 seconds because the app was slow one day. Then the app takes 4 seconds in CI. The test fails. Someone increases the wait to 5 seconds. The suite becomes slower and still fails under load.
Better strategy:
- Wait for user-visible state, not time.
- Use locator assertions instead of sleep.
- Wait for network completion only when the user experience depends on it.
- Remove arbitrary waits during code review.
Unstable selectors
Selectors based on CSS structure break when the UI changes. A test that clicks .card:nth-child(3) .btn is not a reliable user story. It is a bet against future frontend refactoring.
Prefer accessible roles, labels, and stable test IDs. This also makes the test easier for manual testers and developers to understand.
Dirty test data
Many flaky tests are data problems. The test expects a clean user, a fixed order status, or an empty cart. Another test changes the same data. The failure looks random because the suite order changes.
The fix is boring but powerful: isolate data, create it through APIs, clean it after the test, and avoid shared accounts for parallel runs.
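A small piece of that isolation is giving every parallel worker its own account. The naming scheme below is hypothetical, as are the `runId` and `workerIndex` inputs; in Playwright the worker number could come from `testInfo.workerIndex`, and the run ID from whatever your CI exposes.

```typescript
// Hypothetical naming scheme: one account per run, per worker, per test,
// so parallel tests never read or mutate the same user.
function uniqueTestEmail(runId: string, workerIndex: number, testSlug: string): string {
  const safeSlug = testSlug
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse anything that is not a letter or digit
    .replace(/^-+|-+$/g, '');    // trim leading and trailing dashes
  return `qa+${runId}-w${workerIndex}-${safeSlug}@example.test`;
}

const checkoutUser = uniqueTestEmail('run4821', 3, 'Checkout: empty cart');
// 'qa+run4821-w3-checkout-empty-cart@example.test'
```

Because the run ID is part of the address, leftover accounts are also easy to find and delete in a cleanup job.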
Third-party dependencies
Payment sandboxes, email providers, SMS gateways, maps, analytics scripts, and feature flag services can all introduce random failures. If your release gate depends on a slow external sandbox, your pipeline is only as stable as that sandbox.
Use contract tests, mocks, or sandbox health checks for third-party systems. Do not let external noise decide whether your login test is green.
Environment drift
CI and local environments drift over time. Browser versions, time zones, CPU limits, test data, feature flags, and seed scripts differ. The test passes locally and fails in CI because the environments are not equivalent.
Docker Compose, ephemeral test environments, and explicit seed data help here. I wrote about this setup in Docker Compose for QA with Playwright, Postgres, and Redis.
A Simple Flaky Test Cost Model
You do not need a perfect model to start. You need a model that is good enough to make waste visible. Use this formula:
Monthly flaky test cost =
(flaky failures per month × average triage minutes × hourly cost / 60)
+ extra CI cost
+ release delay cost
Step-by-step calculation
Start with four numbers. If you do not have exact data, use two weeks of pipeline history.
- Flaky failures per month: count failures that pass on retry without code change.
- Average triage minutes: sample 20 incidents and measure time from first red build to decision.
- Hourly cost: use a blended engineering rate, not only QA salary.
- Extra CI cost: compare retry minutes and rerun minutes against normal suite minutes.
Example:
flaky_failures = 96 per month
triage_minutes = 22
hourly_cost = ₹2,000
extra_ci_cost = ₹18,000
people_cost = 96 × 22 × 2000 / 60 = ₹70,400
monthly_cost = ₹70,400 + ₹18,000 = ₹88,400
That is ₹10.6 lakh per year before release delay. For a team with 20 engineers, this is realistic. For a large product company, it is small.
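The formula translates directly into code, which makes it easy to rerun each month with fresh numbers. This is a minimal sketch; the `FlakyCostInput` field names are my own, and release delay cost defaults to zero when you have no estimate for it.

```typescript
interface FlakyCostInput {
  flakyFailuresPerMonth: number;
  triageMinutes: number;     // average per failure
  hourlyCost: number;        // same currency as the CI and delay costs
  extraCiCost: number;       // per month
  releaseDelayCost?: number; // per month, treated as 0 when unknown
}

function monthlyFlakyCost(input: FlakyCostInput) {
  const peopleCost = (input.flakyFailuresPerMonth * input.triageMinutes * input.hourlyCost) / 60;
  const total = peopleCost + input.extraCiCost + (input.releaseDelayCost ?? 0);
  return { peopleCost, total, yearly: total * 12 };
}

// The example above: 96 failures, 22 minutes each, ₹2,000 per hour, ₹18,000 extra CI.
const costExample = monthlyFlakyCost({
  flakyFailuresPerMonth: 96,
  triageMinutes: 22,
  hourlyCost: 2000,
  extraCiCost: 18000,
});
// peopleCost: 70400, total: 88400, yearly: 1060800
```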
Use bands, not fake precision
Do not pretend the number is exact. Give leadership a range:
- Low estimate: only confirmed flaky failures.
- Expected estimate: confirmed failures plus retry-pass builds.
- High estimate: include manual rechecks and release delay.
This makes the conversation practical. You are not saying, “Flakiness is bad.” You are saying, “This suite is costing us ₹8-12 lakh per year in avoidable waste.”
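The three bands can be computed from the same inputs. The sketch below is illustrative only: the `BandInput` fields are hypothetical, it assumes a retry-pass build costs the same triage time as a confirmed flaky failure, and the numbers in the usage example are made up to show the shape of the output.

```typescript
interface BandInput {
  confirmedFlakyFailures: number; // per month
  retryPassBuilds: number;        // per month
  manualRechecks: number;         // per month
  triageMinutes: number;          // per failure or retry-pass build (assumed equal)
  recheckMinutes: number;         // per manual recheck
  hourlyCost: number;
  releaseDelayCost: number;       // per month
}

function costBands(b: BandInput) {
  const perMinute = b.hourlyCost / 60;
  const low = b.confirmedFlakyFailures * b.triageMinutes * perMinute;
  const expected = low + b.retryPassBuilds * b.triageMinutes * perMinute;
  const high = expected + b.manualRechecks * b.recheckMinutes * perMinute + b.releaseDelayCost;
  return { low, expected, high };
}

// Illustrative numbers only, not taken from a real team.
const bandExample = costBands({
  confirmedFlakyFailures: 60,
  retryPassBuilds: 36,
  manualRechecks: 12,
  triageMinutes: 22,
  recheckMinutes: 30,
  hourlyCost: 2000,
  releaseDelayCost: 10000,
});
// low: 44000, expected: 70400, high: 92400
```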
Playwright Example: Track Retries and Failures
Playwright already exposes retry information. The mistake is leaving it inside reports that nobody reviews. Push it into a small JSON summary and trend it.
Reporter example
Use a custom reporter to capture retry-pass tests. This example is intentionally small.
// flaky-reporter.ts
import type { Reporter, TestCase, TestResult } from '@playwright/test/reporter';
import fs from 'node:fs';

type FlakeRecord = {
  title: string;
  file: string;
  retry: number;
  status: string;
  durationMs: number;
};

class FlakyReporter implements Reporter {
  private records: FlakeRecord[] = [];

  // Runs once per attempt, so a retried test is reported more than once.
  onTestEnd(test: TestCase, result: TestResult) {
    // result.status is never 'flaky'; that label only comes from test.outcome().
    // Recording every attempt after the first keeps both retry-passes
    // (status 'passed') and retries that failed again.
    if (result.retry > 0) {
      this.records.push({
        title: test.title,
        file: test.location.file,
        retry: result.retry,
        status: result.status,
        durationMs: result.duration,
      });
    }
  }

  // Write the summary once the whole run has finished.
  onEnd() {
    fs.mkdirSync('test-results', { recursive: true });
    fs.writeFileSync(
      'test-results/flaky-summary.json',
      JSON.stringify(this.records, null, 2)
    );
  }
}

export default FlakyReporter;
Then add it to your Playwright config:
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: process.env.CI ? 1 : 0,
  reporter: [
    ['html'],
    ['./flaky-reporter.ts'],
  ],
});
What to track weekly
Do not track 20 metrics. Track five:
- Retry-pass count.
- Top 10 flaky test files.
- Average triage time.
- Flaky failures by root cause.
- Percentage of release-blocking failures later marked flaky.
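The first two metrics can be derived from the records in `flaky-summary.json`. This is a minimal sketch that takes the parsed records as input; `summarizeFlakes` and `WeeklySummary` are hypothetical names, and it treats any attempt after the first that ended in `passed` as a retry-pass.

```typescript
// Same shape the reporter above writes to flaky-summary.json.
interface FlakeRecord {
  title: string;
  file: string;
  retry: number;
  status: string;
  durationMs: number;
}

interface WeeklySummary {
  retryPassCount: number;
  topFiles: { file: string; retryPasses: number }[];
}

function summarizeFlakes(records: FlakeRecord[], topN: number = 10): WeeklySummary {
  // A retry-pass is an attempt after the first one that ended in 'passed'.
  const retryPasses = records.filter((r) => r.retry > 0 && r.status === 'passed');

  // Count retry-passes per spec file.
  const perFile: Record<string, number> = {};
  for (const r of retryPasses) {
    perFile[r.file] = (perFile[r.file] || 0) + 1;
  }

  // Highest count first; ties are broken by file name so the output is stable.
  const topFiles = Object.keys(perFile)
    .map((file) => ({ file, retryPasses: perFile[file] }))
    .sort((a, b) => b.retryPasses - a.retryPasses || a.file.localeCompare(b.file))
    .slice(0, topN);

  return { retryPassCount: retryPasses.length, topFiles };
}
```

In a real pipeline you would read each run's JSON, concatenate the records for the week, and pass them in. Triage time and root cause have to come from your ticketing or labeling process, since the reporter cannot know them.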
If you want a broader strategy for AI-era flakiness, read Flaky Tests in AI Testing: Biggest QA Problem of 2026. The same principle applies: classify before you automate.
A 30-Day Triage System for QA Teams
A flaky test program fails when it becomes a spreadsheet nobody owns. Keep it operational. Make it part of the release system.
Week 1: Measure and label
For the first week, do not try to fix everything. Label failures. Add a simple taxonomy:
- selector
- wait
- test-data
- environment
- third-party
- product-bug
- unknown
The goal is to stop arguing from memory. After one week, you should know whether your biggest problem is data, selectors, or infrastructure.
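Keeping the labels as a closed set stops the taxonomy from drifting into free text. The sketch below assumes seven labels (selector, wait, test-data, environment, third-party, product-bug, unknown); the `LabeledFailure` shape and function names are hypothetical.

```typescript
// The seven labels as a closed set, so a typo cannot create an eighth bucket.
const ROOT_CAUSES = [
  'selector',
  'wait',
  'test-data',
  'environment',
  'third-party',
  'product-bug',
  'unknown',
] as const;
type RootCause = typeof ROOT_CAUSES[number];

interface LabeledFailure {
  test: string;
  cause: RootCause;
}

// Count failures per label, including labels that never occurred.
function countByCause(failures: LabeledFailure[]): Record<RootCause, number> {
  const counts = {} as Record<RootCause, number>;
  ROOT_CAUSES.forEach((c) => { counts[c] = 0; });
  failures.forEach((f) => { counts[f.cause] += 1; });
  return counts;
}

// The label with the most failures; on a tie, the one listed first in ROOT_CAUSES.
function biggestCause(failures: LabeledFailure[]): RootCause {
  const counts = countByCause(failures);
  let best: RootCause = ROOT_CAUSES[0];
  ROOT_CAUSES.forEach((c) => { if (counts[c] > counts[best]) best = c; });
  return best;
}
```

If `unknown` ends up as the biggest bucket after a week, that is a finding too: the team needs better traces or logs before it can fix anything.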
Week 2: Quarantine with rules
Quarantine is useful when it has rules. It is dangerous when it becomes a graveyard.
My quarantine rule:
- A quarantined test must have an owner.
- It must have a ticket.
- It must have a deadline.
- It must still run outside the release gate.
- It must return to the gate only after 20 clean runs.
This keeps the release gate stable without deleting coverage. It also stops teams from hiding uncomfortable tests forever.
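These rules are simple enough to check automatically. The sketch below is one possible encoding: the `QuarantinedTest` shape is hypothetical, and it interprets "20 clean runs" as the 20 most recent runs all passing on the first attempt, which is my assumption.

```typescript
interface QuarantinedTest {
  title: string;
  owner?: string;
  ticket?: string;
  deadline?: string;     // ISO date such as '2026-03-31'
  recentRuns: boolean[]; // oldest first; true = passed on the first attempt
}

const CLEAN_RUNS_REQUIRED = 20;

// Rules 1 to 3: a quarantined test needs an owner, a ticket, and a deadline.
function quarantineViolations(t: QuarantinedTest): string[] {
  const problems: string[] = [];
  if (!t.owner) problems.push('missing owner');
  if (!t.ticket) problems.push('missing ticket');
  if (!t.deadline) problems.push('missing deadline');
  return problems;
}

// Rule 5: back into the gate only when the last 20 runs were all clean.
function canReturnToGate(t: QuarantinedTest): boolean {
  if (t.recentRuns.length < CLEAN_RUNS_REQUIRED) return false;
  return t.recentRuns.slice(-CLEAN_RUNS_REQUIRED).every((passed) => passed);
}
```

Running `quarantineViolations` over the quarantine list in a weekly job is a cheap way to catch tests that were parked without an owner.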
Week 3: Fix the top 10 offenders
Do not spread effort across 200 flaky tests. Sort by impact. Fix the top 10 tests by rerun count and triage minutes.
Most teams get a visible improvement from a small set of fixes: stable selectors, better API setup, isolated test data, and removed sleeps.
Week 4: Add prevention to code review
Prevention is cheaper than triage. Add a lightweight checklist to test code review:
- No hard waits unless approved with a comment.
- No shared mutable test data.
- No selectors tied to layout structure.
- No dependency on third-party sandbox availability for core release gates.
- Every new E2E test has a clear owner.
This is where test automation becomes engineering work, not just scripting.
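The first checklist item is easy to automate with a small scan over test source. This is a sketch, not a full lint rule: the patterns cover Playwright's `page.waitForTimeout`, Cypress's numeric `cy.wait`, and hand-rolled `setTimeout` sleeps, and the `hard-wait-approved` comment tag is a hypothetical convention that must sit on the same line as the wait.

```typescript
interface HardWaitFinding {
  line: number; // 1-based line number in the scanned source
  text: string;
}

// A starting point, not a complete list of ways to sleep.
const HARD_WAIT_PATTERNS: RegExp[] = [
  /\bwaitForTimeout\s*\(/, // Playwright: page.waitForTimeout(3000)
  /\bcy\.wait\s*\(\s*\d/,  // Cypress: cy.wait(3000), but not cy.wait('@alias')
  /\bsetTimeout\s*\(/,     // hand-rolled sleep helpers
];

// Hypothetical comment tag that marks a reviewed, intentional wait.
const APPROVAL_MARKER = 'hard-wait-approved';

function findHardWaits(source: string): HardWaitFinding[] {
  const findings: HardWaitFinding[] = [];
  source.split('\n').forEach((text, index) => {
    if (text.indexOf(APPROVAL_MARKER) !== -1) return; // approved with a comment
    if (HARD_WAIT_PATTERNS.some((p) => p.test(text))) {
      findings.push({ line: index + 1, text: text.trim() });
    }
  });
  return findings;
}
```

Run it over the files changed in a pull request and fail the check when the result is not empty. A regex scan like this will miss waits hidden behind helper functions, so it supports the human review rather than replacing it.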
India QA Team Context
For India-based QA teams, flaky tests create a career and hiring problem too. Product companies expect SDETs to understand CI, debugging, test architecture, and release risk. Service companies often still reward test count and execution volume.
That gap matters. A QA engineer who can say, “I reduced flaky retry-pass builds by 42% and saved 300 CI hours per quarter” sounds different from someone who says, “I automated 200 test cases.”
What hiring managers notice
For SDET roles in Bengaluru, Pune, Hyderabad, and remote product teams, I see interviewers ask more system-level questions:
- How do you debug a flaky Playwright test?
- When do you quarantine a test?
- How do you separate product bugs from test bugs?
- What metrics prove automation is helping?
- How do you design CI gates for a large suite?
These questions matter more for ₹25-40 LPA roles because companies are not paying for script execution. They are paying for judgment.
What managers should change
If you manage QA, stop rewarding only new automation count. Add stability metrics:
- Flaky rate by suite.
- Mean time to classify failure.
- Retry-pass trend.
- Release gate false-red rate.
- Top flaky root cause.
This gives senior testers the right incentive. They will improve reliability instead of adding more fragile tests.
Key Takeaways: Reduce the Cost of Flaky Tests
The cost of flaky tests is measurable. You do not need a six-month transformation to start reducing it. You need classification, ownership, and a cost model that leadership understands.
- Flaky tests are not harmless: every false red build consumes CI capacity and human attention.
- Google’s public data shows that even mature engineering organizations fight flakiness at scale.
- Retries are useful for diagnosis, but a pass-after-retry is still a sign of instability.
- Track retry-pass tests, top offenders, triage time, and root causes every week.
- Fix the top 10 flaky tests before trying to clean the whole suite.
- In India, SDETs who can quantify automation reliability stand out in product-company interviews.
My practical advice: start with one dashboard and one weekly review. If the team cannot see the flake, it will keep paying the tax.
FAQ
What is the biggest cost of flaky tests?
The biggest cost is engineering attention. CI minutes matter, but the larger loss comes from developers and QA engineers investigating false failures, losing context, and delaying reviews or releases.
Should we delete flaky tests?
Not immediately. First classify the root cause. If the test covers a critical user journey, quarantine it from the release gate, keep running it separately, assign an owner, and fix it. Delete only tests that provide weak coverage and high noise.
Are retries a good solution for flaky tests?
Retries are a diagnostic tool, not a fix. They help identify pass-after-retry behavior, but they can also hide instability if the team treats the final green build as clean.
How many flaky tests are acceptable?
Zero is the goal, but most real teams need thresholds. A better question is: how many flaky failures block releases, how long do they take to classify, and is the trend improving every month?
How do I convince leadership to invest time in fixing flakiness?
Show the monthly cost. Count flaky failures, multiply by average triage minutes and blended hourly cost, then add extra CI spend. A rupee estimate gets attention faster than a generic complaint about test quality.
