Flaky Tests Are a Trust Problem: How to Diagnose, Quarantine, and Permanently Fix Unreliable Tests

Clifford Austin Domingo, a Senior Quality Engineer, recently published something that should be pinned in every QA team’s Slack channel: “A failing test is useful. A flaky test is dangerous.” That distinction captures the entire problem. Flaky tests do not just waste CI minutes; they trigger a trust erosion cycle that eventually makes your entire automation investment worthless. Ignored failures become background noise, the noise hides real bugs, hidden bugs break pipelines, and broken pipelines produce teams that stop believing in automation altogether.

I have watched this cycle destroy automation programs at companies of every size. A team spends six months building a comprehensive Selenium or Playwright suite. It works beautifully for three months. Then flaky tests start appearing — one here, two there. The team adds retries. The retries mask the problem but slow the pipeline. More flaky tests appear. The failure notifications become noise. Developers stop waiting for green. A real bug slips through. Someone asks “what is the point of automation if it does not catch bugs?” and the program’s credibility is gone.

The good news is that this cycle is entirely preventable. Every root cause of flakiness has a permanent fix, and the diagnostic process is systematic, not mysterious.

The Trust Erosion Cycle

Understanding the cycle is the first step to breaking it. It starts with intermittent test failures that have no obvious cause. The team investigates a few, finds them unrelated to real bugs, and starts adding retries or marking tests as “known flaky.” This reduces the investigation burden but introduces two hidden costs: retried tests take longer to run, and the “known flaky” label becomes a parking lot where tests go to die.

As the flaky test count grows, the signal-to-noise ratio in CI results degrades. Developers stop investigating failures because most are false alarms. At this point, your automation suite has lost its primary value — the ability to block bad code from reaching production. Martin Poulose, a Gen AI Test Lead at TCS, frames this in terms of what modern QA requires: “Speed and parallelism define modern QA success.” Flaky tests undermine both by forcing serial investigation and slowing pipeline throughput.

Systematic Root Cause Diagnosis

The five root causes that account for over 90% of flakiness follow a predictable distribution:

- Timing and race conditions (approximately 40%): tests act before the application is ready, clicking before rendering completes, asserting before API responses arrive, navigating before page load finishes.
- Test data dependencies (approximately 25%): tests share data that one test modifies, breaking the expectations of another.
- Environment inconsistencies (approximately 15%): tests behave differently across local, CI, and staging environments.
- Asynchronous operation mishandling (approximately 10%): in SPAs, state updates, WebSocket messages, and re-renders create race conditions that deterministic test code does not anticipate.
- External service dependencies (approximately 10%): third-party APIs, CDNs, or authentication providers have variable response times or intermittent outages.

Diagnose by examining the failure pattern, not just the error message. A test that fails on the same assertion every time but only in CI is almost certainly an environment issue. A test that fails randomly on different assertions is likely a timing issue. A test that fails only when run after certain other tests is a data dependency. A test that fails more on Monday mornings than Friday afternoons might depend on an external service with scheduled maintenance.
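These pattern-to-cause heuristics can be captured in code. The sketch below is a toy triage helper of my own construction, not a tool from any framework; real diagnosis still requires reading logs and reproducing the failure.

```python
def likely_root_cause(same_assertion_each_time: bool,
                      fails_only_in_ci: bool,
                      order_dependent: bool) -> str:
    """Rough triage of a flaky test from its failure pattern.

    Mirrors the heuristics above: order-dependent failures point to
    shared data, stable-assertion CI-only failures to environment,
    and random assertion failures to timing.
    """
    if order_dependent:
        return "test data dependency"
    if same_assertion_each_time and fails_only_in_ci:
        return "environment inconsistency"
    if not same_assertion_each_time:
        return "timing / race condition"
    return "needs manual investigation"

print(likely_root_cause(same_assertion_each_time=True,
                        fails_only_in_ci=True,
                        order_dependent=False))  # environment inconsistency
```

A helper like this is most useful as a checklist encoded in your flakiness dashboard, so the first investigation step is automatic.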

The Quarantine Strategy

Quarantine is not a euphemism for “ignore.” It is a structured process: identify a test as flaky (flakiness score above 2% over the past 30 runs), move it to a separate quarantine suite that runs but does not gate deployments, assign an owner and a fix deadline (I recommend two sprints maximum), and track quarantine population over time as a team health metric.
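The flakiness score itself is simple to compute. Here is a minimal sketch, assuming you record a pass/fail boolean per run; the function names are my own, not from any particular tool.

```python
def flakiness_score(recent_results, window=30):
    """Fraction of failing runs over the last `window` runs.

    `recent_results` is an iterable of booleans (True = passed),
    ordered oldest to newest.
    """
    window_results = list(recent_results)[-window:]
    if not window_results:
        return 0.0
    failures = sum(1 for passed in window_results if not passed)
    return failures / len(window_results)

def should_quarantine(recent_results, threshold=0.02):
    """Apply the 2%-over-30-runs rule described above."""
    return flakiness_score(recent_results) > threshold

runs = [True] * 28 + [False, True]  # one failure in the last 30 runs
print(round(flakiness_score(runs), 3))  # 0.033 -> above the 2% threshold
```

Note that with a 30-run window, a single failure (about 3.3%) already crosses the 2% threshold, so this rule effectively quarantines any test that fails even once without a corresponding code change.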

The quarantine suite still runs on every build. You still see its results. But a quarantine failure does not block a merge or deployment. This protects your main suite’s signal integrity while giving you time to fix problems properly. The worst alternative — adding retries to mask flakiness — hides real failures, slows your pipeline, and removes the urgency to address root causes.
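One way to wire up a non-gating quarantine suite, assuming GitHub Actions and pytest with a hypothetical `quarantine` marker (any CI system with an equivalent of a non-blocking job works the same way):

```yaml
jobs:
  main-suite:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Gating suite: quarantined tests excluded, failures block the merge.
      - run: pytest -m "not quarantine"

  quarantine-suite:
    runs-on: ubuntu-latest
    continue-on-error: true  # still runs and reports, but never blocks
    steps:
      - uses: actions/checkout@v4
      - run: pytest -m quarantine
```

The key property is that quarantine results stay visible on every build while `continue-on-error` keeps them out of the merge gate.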

Permanent Fixes by Root Cause

For timing issues: use framework-native waiting. Playwright’s auto-waiting handles most scenarios. For Selenium, use explicit WebDriverWait with expected conditions targeted at the specific state you need — element visible, text present, URL changed. Never use Thread.sleep() or hardcoded delays. For data dependencies: isolate every test’s data. Factory functions that generate unique test users, API setup calls that create fresh records, teardown hooks that clean up. The upfront cost is real but the flakiness elimination is permanent.
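Both fixes follow patterns that are easy to show framework-agnostically. The sketch below is illustrative: `wait_for` mimics the shape of Selenium's `WebDriverWait(driver, timeout).until(...)` without depending on a browser, and `make_test_user` is a hypothetical data factory of my own naming.

```python
import itertools
import time

def wait_for(condition, timeout=10.0, poll=0.25):
    """Poll `condition` until it returns a truthy value or `timeout` elapses.

    Same idea as an explicit wait: target a specific state instead of
    sleeping for a fixed duration.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError(f"condition not met within {timeout}s")

_counter = itertools.count()

def make_test_user(prefix="qa"):
    """Factory producing unique per-test data, so tests never share state."""
    n = next(_counter)
    return {"username": f"{prefix}-{n}", "email": f"{prefix}-{n}@example.test"}

# Every call yields fresh data, so no test can break another's expectations:
u1, u2 = make_test_user(), make_test_user()
print(u1["username"] != u2["username"])  # True
```

In a real Selenium suite you would use `WebDriverWait` with `expected_conditions` directly; the point of the sketch is the pattern, not a replacement.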

For environment issues: containerize with Docker. Pinned browser versions, pinned runtime versions, identical containers locally and in CI. For async issues: wait for specific completion signals — network responses returning, loading indicators disappearing, specific text appearing. For external dependencies: mock them. Playwright’s page.route() and Selenium’s proxy-based interception return predetermined responses, making your tests deterministic regardless of external service behavior.
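A mocking setup can keep the stub catalogue as plain data, separate from the interception hook. In this sketch the endpoints and payloads are hypothetical; only `page.route()`, `route.fulfill()`, and `route.continue_()` are real Playwright Python APIs, shown here in comments since they require a live browser.

```python
# Hypothetical catalogue of stubbed endpoints and their canned responses.
STUB_RESPONSES = {
    "/api/rates": {"status": 200, "json": {"usd_eur": 0.92}},
    "/api/health": {"status": 200, "json": {"ok": True}},
}

def stub_for(url: str):
    """Return the canned response for a stubbed endpoint, or None to pass through."""
    for path, response in STUB_RESPONSES.items():
        if url.endswith(path):
            return response
    return None

# Wiring this into Playwright (sketch, not runnable without a browser):
#
#   def handle(route):
#       stub = stub_for(route.request.url)
#       if stub:
#           route.fulfill(status=stub["status"], json=stub["json"])
#       else:
#           route.continue_()
#
#   page.route("**/api/**", handle)

print(stub_for("https://third-party.example/api/rates"))
```

Keeping the catalogue as data makes it easy to review which external behaviors your suite assumes, and to update a stub when the real service's contract changes.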

Case Study: 30% to Under 2%

A fintech team I consulted with had a 450-test Selenium suite with a 30% flakiness rate. Their pipeline ran for 25 minutes and required an average of 1.8 retries per test. After a three-month remediation effort, they reduced flakiness to 1.7%. The breakdown: 45% of flaky tests were fixed by replacing Thread.sleep() with explicit waits, 30% by isolating test data (they had been sharing a single test account across 200 tests), 15% by containerizing the test environment, and 10% by mocking external API calls. Pipeline duration dropped from 25 minutes to 11 minutes because retries were nearly eliminated.

The Honest Caveats

Zero flakiness is not achievable in practice. Some non-determinism reflects genuine application behavior — eventual consistency, distributed system timing, real network variability. The goal is a flakiness rate low enough that every failure is investigated, which in my experience means under 2%. The case study I described involved a team with dedicated time allocation for the remediation — three months of focused effort alongside regular sprint work. Teams that try to fix flakiness as side work rarely make lasting progress.

Building flakiness dashboards, implementing quarantine systems, and permanently fixing the root causes of unreliable tests — with hands-on labs in both Playwright and Selenium — is covered in Module 6 of my AI-Powered Testing Mastery course.
