DeFlaky AI Root Cause Analysis for Flaky Tests
Day 17 of 100 Days of AI in QA & SDET: DeFlaky AI root cause analysis is the workflow I want every SDET to try before adding one more blind retry to CI. Flaky tests are not just annoying red builds. They are a trust problem, and once developers stop trusting automation, your test suite becomes background noise.
I like DeFlaky because it starts with a boring but powerful idea: run the same command multiple times, compare the results, and convert reliability into a number the team can track. Then the AI layer becomes useful because it has evidence to reason about instead of guessing from one failed screenshot.
Table of Contents
- Why Flaky Tests Are Trust Debt
- What Is DeFlaky AI Root Cause Analysis?
- How DeFlaky Fits Modern QA Stacks
- Setup DeFlaky With Playwright
- The Root Cause Playbook I Use
- CI Quality Gates That Do Not Punish Teams
- AI Prompts for Flaky Test Investigation
- India SDET Career Angle
- Key Takeaways
- FAQ
Why Flaky Tests Are Trust Debt
A flaky test is a test that passes and fails without a relevant product change. That sounds simple, but the damage is bigger than one failed pipeline. Flaky tests train engineers to ignore automation.
Google’s Testing Blog reported in 2016 that about 1.5% of all test runs in its corpus produced a flaky result, almost 16% of tests had some level of flakiness, and about 84% of pass-to-fail transitions involved a flaky test. I do not copy those numbers blindly into every team because your codebase is different. I use them as a warning: even strong engineering teams have to treat flakiness as a system problem.
Most teams talk about flaky tests as if the cost is only the extra five minutes in CI. That is the smallest cost. The real cost is the human loop after the failure:
- A developer sees a red build and assumes it is random.
- A QA engineer reruns the job instead of investigating.
- A release manager waits for confidence that should already exist.
- A real regression hides behind a noisy test history.
Once this pattern repeats, the team starts treating a red test as a suggestion. That is dangerous. Automation should make release decisions clearer, not more political.
Retries are not a strategy
Playwright supports test retries, and its documentation explains that a test which fails on the first attempt but passes on a retry is reported as flaky. Retries are useful for detection, quarantine, and short-term signal recovery. They are not a fix by themselves.
If a login test passes on retry, you still need to know whether the root cause is an unstable selector, a race condition, delayed API response, shared test data, animation timing, or an environment issue. DeFlaky helps because it moves the conversation from “rerun it” to “show me the pattern across runs.”
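To make the "retries detect, they do not fix" point concrete, here is a minimal TypeScript sketch of how a sequence of attempts can be classified. The type and function names are hypothetical and written for illustration only; this is not Playwright's or DeFlaky's code, just the same fail-then-pass idea expressed as a pure function.

```typescript
// Hypothetical types for illustration; not part of Playwright or DeFlaky.
type AttemptStatus = "passed" | "failed";
type Verdict = "stable-pass" | "stable-fail" | "flaky";

// attempts[0] is the first run; later entries are retries of the same test.
function classifyAttempts(attempts: AttemptStatus[]): Verdict {
  if (attempts.length === 0) {
    throw new Error("at least one attempt is required");
  }
  const sawPass = attempts.some((a) => a === "passed");
  const sawFail = attempts.some((a) => a === "failed");
  // Mixed outcomes for the same code mean the test is flaky, not fixed.
  if (sawPass && sawFail) return "flaky";
  return sawPass ? "stable-pass" : "stable-fail";
}

console.log(classifyAttempts(["failed", "passed"])); // flaky
console.log(classifyAttempts(["passed"]));           // stable-pass
console.log(classifyAttempts(["failed", "failed"])); // stable-fail
```

Notice that the "flaky" verdict says nothing about why the first attempt failed. That is the gap the rest of this workflow is meant to close.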
What Is DeFlaky AI Root Cause Analysis?
DeFlaky is an open-source CLI and dashboard for detecting and tracking flaky tests. The official DeFlaky site describes it as a tool that works with Playwright, Selenium, Cypress, Jest, Pytest, and other frameworks that can produce JUnit XML or JSON results.
The core idea is simple. DeFlaky executes your test command multiple times, compares outcomes, and calculates a FlakeScore:
FlakeScore = (stable_tests / total_tests) * 100
A score of 100 means every test behaved consistently across the configured runs. A lower score means some tests changed state across repeated executions. That is the signal the AI root cause analysis can use.
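Here is a small TypeScript sketch of the formula above. I am assuming that "stable" means a test produced the same outcome in every repeated run, which is my reading of the description and not something I have confirmed against DeFlaky's source. The `RunHistory` shape and the test names are made up for the example.

```typescript
type Outcome = "passed" | "failed";

// Hypothetical shape: test name -> outcome of each repeated run.
type RunHistory = Record<string, Outcome[]>;

// Assumption: a test is stable when every run produced the same outcome.
function isStable(outcomes: Outcome[]): boolean {
  return outcomes.every((o) => o === outcomes[0]);
}

// FlakeScore = (stable_tests / total_tests) * 100
function flakeScore(runs: RunHistory): number {
  const names = Object.keys(runs);
  if (names.length === 0) return 100; // assumption: nothing ran, nothing flaked
  const stable = names.filter((n) => isStable(runs[n])).length;
  return (stable / names.length) * 100;
}

const sample: RunHistory = {
  "login works": ["passed", "passed", "passed", "passed", "passed"],
  "checkout total": ["passed", "failed", "passed", "passed", "failed"],
  "search empty state": ["failed", "failed", "failed", "failed", "failed"],
  "profile update": ["passed", "passed", "passed", "passed", "passed"],
};

console.log(flakeScore(sample)); // 75
```

Under this reading, a test that fails every time still counts as stable. It is broken, not flaky, and it belongs in a different conversation.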
Why repeated runs matter
One failed run gives you a symptom. Five or ten runs give you a pattern. AI is much better when the input contains structured evidence:
- Which test failed intermittently?
- Which step failed most often?
- Was the failure tied to one browser, shard, or environment?
- Did the failure happen before or after a network call?
- Did the same selector or assertion appear in multiple failures?
That evidence is what separates useful AI assistance from generic advice like “increase timeout.”
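A rough sketch of what "structured evidence" can look like before it goes anywhere near an AI assistant: count failures by step and by browser so the pattern is visible at a glance. The `FailureRecord` shape and the sample data are hypothetical; DeFlaky's real output format may differ, so treat this as a shape to adapt.

```typescript
// Hypothetical record shape; adapt it to whatever your reporter emits.
interface FailureRecord {
  test: string;
  step: string;
  browser: string;
  message: string;
}

// Counts records by an arbitrary key and returns pairs sorted by frequency.
function countBy(
  records: FailureRecord[],
  key: (r: FailureRecord) => string
): Array<[string, number]> {
  const counts: Record<string, number> = {};
  for (const r of records) {
    const k = key(r);
    counts[k] = (counts[k] || 0) + 1;
  }
  return Object.keys(counts)
    .map((k): [string, number] => [k, counts[k]])
    .sort((a, b) => b[1] - a[1]);
}

// Made-up sample failures from repeated runs of one test.
const failures: FailureRecord[] = [
  { test: "checkout total", step: "click Pay", browser: "webkit", message: "Timeout 30000ms exceeded" },
  { test: "checkout total", step: "click Pay", browser: "webkit", message: "Timeout 30000ms exceeded" },
  { test: "checkout total", step: "click Pay", browser: "chromium", message: "Timeout 30000ms exceeded" },
  { test: "checkout total", step: "expect total", browser: "webkit", message: "expected 499, received 0" },
];

console.log(countBy(failures, (r) => r.step));    // [["click Pay", 3], ["expect total", 1]]
console.log(countBy(failures, (r) => r.browser)); // [["webkit", 3], ["chromium", 1]]
```

Two small tables like these already answer the first three questions in the list above, and they give the AI something specific to reason about.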
Current project facts
At the time of writing, the DeFlaky GitHub repository is public under PramodDutta/deflaky, the npm package is deflaky-cli, and the latest npm version visible in the registry is 1.1.2. The npm downloads API reported 60 downloads for the last month window ending 23 June 2026, so I treat this as an early-stage tool, not a mature enterprise platform. That is fine for this series because Day 17 is about building practical AI-assisted QA muscle, not chasing logos.
How DeFlaky Fits Modern QA Stacks
Most SDET teams already have the pieces DeFlaky needs. You have a test command, a CI runner, and a reporter. The missing piece is a reliability loop.
The minimum useful stack
For a Playwright team, I would start with this stack:
- Playwright Test for browser automation.
- JUnit reporter for CI-friendly result files.
- DeFlaky for repeated execution and FlakeScore.
- GitHub Actions, Jenkins, GitLab CI, or Azure Pipelines for scheduled checks.
- An AI assistant for summarizing failure patterns from traces, logs, and test output.
This is intentionally boring. Boring stacks survive real release pressure. If your flaky-test process needs three dashboards and a two-hour meeting, engineers will not use it.
Where AI helps and where it does not
AI helps with clustering, summarizing, and hypothesis generation. It can read five failure logs and suggest that all failures happen after a toast animation blocks a click. It can compare Playwright traces and point to a selector that sometimes resolves to two elements.
AI does not replace the SDET’s judgment. It should not auto-delete assertions or blindly raise timeouts. The job is to propose a root cause and a fix candidate, then make the team prove it with another DeFlaky run.
If you are building AI browser workflows, connect this thinking with the evidence-pack approach I wrote about in AI Testing Evidence Pack: Trace, Screenshot, Logs. Evidence beats vibes every time.
Setup DeFlaky With Playwright
Here is the setup I would use for a small Playwright suite. The same pattern works for Selenium, Cypress, Jest, and Pytest when the runner can emit JUnit XML or JSON.
1. Configure a JUnit reporter
In Playwright, add a JUnit reporter so CI and DeFlaky can reason about test results as data.
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  testDir: './tests',
  retries: process.env.CI ? 1 : 0,
  reporter: [
    ['list'],
    ['junit', { outputFile: 'test-results/junit.xml' }],
    ['html', { open: 'never' }]
  ],
  use: {
    trace: 'retain-on-failure',
    screenshot: 'only-on-failure',
    video: 'retain-on-failure'
  }
});
Playwright docs also cover retries and reporters. Use those features. Just do not confuse “the retry passed” with “the defect is solved.”
2. Run DeFlaky locally first
Start with five runs. Do not begin with twenty runs on the full suite unless you enjoy waiting.
npx deflaky-cli run \
-c "npx playwright test tests/checkout.spec.ts" \
-r 5 \
--format junit
If your suite is large, start with the top twenty tests that fail most often in CI. You want a short feedback loop while you prove the process.
3. Push only useful signals
The DeFlaky README shows a dashboard push flow with a token:
npx deflaky-cli run \
-c "npx playwright test" \
-r 10 \
--push \
--token "$DEFLAKY_TOKEN"
Do not push every experimental local run from every laptop. Keep the dashboard focused on CI, nightly reliability jobs, or investigation runs for known flaky tests.
The Root Cause Playbook I Use
When a test is flaky, I do not start by editing the test. I start by classifying the failure. DeFlaky gives me the list of suspects. Then I use this root cause playbook.
Step-by-step investigation
- Reproduce with repetition: Run the suspect test five to ten times with DeFlaky.
- Collect artifacts: Save JUnit XML, Playwright trace, screenshot, console logs, and network logs.
- Classify the failure: Put it into selector, timing, data, network, environment, test isolation, or product bug.
- Make one fix: Change one thing only. Multiple changes hide the real cause.
- Prove the fix: Run DeFlaky again and compare FlakeScore.
- Document the lesson: Add a short note to the test or team wiki.
Common root causes I see
Most flaky UI tests fall into a small set of buckets:
- Selector drift: The locator sometimes matches the wrong element or a hidden duplicate.
- Async race: The test clicks before the app is ready.
- Shared state: One test depends on data modified by another test.
- Backend instability: The UI test exposes a slow or inconsistent API.
- Environment noise: CPU, memory, browser version, test data, or service mocks differ across runs.
- Over-assertion: The test checks cosmetic details that are not part of the real requirement.
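As a first pass before any AI is involved, a few of these buckets can be approximated with plain pattern matching on the error message. The sketch below is a heuristic I am assuming for illustration. The patterns and bucket names are mine, they cover only some of the categories above, and every match should be treated as a hypothesis to verify, not a verdict.

```typescript
type Bucket = "selector" | "network" | "data" | "timing" | "unclassified";

// Assumed heuristics, checked in order. Tune the patterns to your own logs.
const RULES: Array<{ pattern: RegExp; bucket: Bucket }> = [
  { pattern: /strict mode violation|resolved to \d+ elements/i, bucket: "selector" },
  { pattern: /net::ERR_|ECONNREFUSED|ECONNRESET/i, bucket: "network" },
  { pattern: /duplicate key|already exists|unique constraint/i, bucket: "data" },
  { pattern: /timeout \d+ms exceeded|timed out/i, bucket: "timing" },
];

// Returns the first matching bucket, or "unclassified" when nothing matches.
function classify(message: string): Bucket {
  for (const rule of RULES) {
    if (rule.pattern.test(message)) return rule.bucket;
  }
  return "unclassified";
}

console.log(classify('strict mode violation: locator("button") resolved to 2 elements')); // selector
console.log(classify("Timeout 30000ms exceeded."));                                       // timing
console.log(classify("net::ERR_CONNECTION_REFUSED at http://localhost:3000"));            // network
console.log(classify("expected 499, received 0"));                                        // unclassified
```

Anything that lands in "unclassified" is where trace reading and human judgment earn their keep. Shared state, environment noise, and product bugs rarely announce themselves in a single error string.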
For Playwright-specific cleanup, I would pair this article with Playwright Flaky Tests: Retries and Fixes. That post covers the framework-level habits. This one adds the repeated-run and AI investigation loop.
CI Quality Gates That Do Not Punish Teams
A quality gate should improve behavior. A bad gate creates fear and workarounds. The DeFlaky README includes a threshold option, and that is useful when you apply it carefully.
npx deflaky-cli run \
-c "npx playwright test" \
-r 3 \
--fail-threshold 90
My recommended rollout
I would not block every pull request on Day 1. Use this rollout instead:
- Week 1: Run DeFlaky nightly on a small smoke suite. Publish the FlakeScore but do not fail builds.
- Week 2: Add a triage label for the top five flaky tests. Assign owners.
- Week 3: Fail only if the score drops below the current baseline by more than 5 points.
- Week 4: Introduce a hard threshold for critical smoke tests.
This avoids the classic mistake where leadership demands “zero flaky tests” overnight and the team responds by skipping the hardest tests.
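The Week 3 rule can be sketched as a tiny function that a CI script might call after reading the current score. The function name, the result shape, and the idea of storing a baseline somewhere are my assumptions for illustration; I am not describing a built-in DeFlaky feature.

```typescript
interface GateResult {
  pass: boolean;
  reason: string;
}

// Fails only when the score drops more than maxDrop points below the baseline.
function baselineGate(current: number, baseline: number, maxDrop: number = 5): GateResult {
  const drop = baseline - current;
  if (drop > maxDrop) {
    return {
      pass: false,
      reason: `FlakeScore ${current} is ${drop.toFixed(1)} points below baseline ${baseline}`,
    };
  }
  return {
    pass: true,
    reason: `FlakeScore ${current} is within ${maxDrop} points of baseline ${baseline}`,
  };
}

console.log(baselineGate(91, 95).pass); // true  (drop of 4)
console.log(baselineGate(88, 95).pass); // false (drop of 7)
console.log(baselineGate(97, 95).pass); // true  (score improved)
```

A relative gate like this rewards teams for not getting worse, which is a much easier habit to build than hitting an absolute number on the first try.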
Example GitHub Actions job
GitHub Actions supports scheduled workflows, so I like a nightly reliability job separate from normal PR validation.
name: nightly-flake-check
on:
  schedule:
    - cron: '30 18 * * *' # 18:30 UTC, which is midnight IST
  workflow_dispatch:
jobs:
  deflaky:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: |
          npx deflaky-cli run \
            -c "npx playwright test --grep @smoke" \
            -r 5 \
            --fail-threshold 90
If you already use CI/CD patterns from CI/CD Integration for QA, add DeFlaky as a reliability lane, not as a random extra step in every job.
AI Prompts for Flaky Test Investigation
The prompt matters less than the evidence you attach. Still, a structured prompt saves time.
Prompt template
Act as a senior SDET investigating a flaky automated test.
Context:
- Framework: Playwright Test
- Test name: [test name]
- DeFlaky runs: [5 or 10]
- Outcomes: [pass/fail pattern]
- FlakeScore change: [before/after]
Evidence:
- Error messages:
[paste top errors]
- Failing step:
[paste step]
- Relevant trace observations:
[paste observations]
- Recent code changes:
[paste diff summary]
Task:
1. Classify the likely root cause.
2. List the top 3 hypotheses with confidence.
3. Suggest the smallest safe fix.
4. Suggest a verification run using DeFlaky.
5. Tell me what not to change.
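If you fill this template often, it is worth generating it from structured data so nobody forgets a field. Below is a minimal TypeScript sketch. The `FlakeEvidence` shape, the function names, and the sample values are all hypothetical and only mirror the template above.

```typescript
// Hypothetical evidence shape; adapt it to your own artifacts.
interface FlakeEvidence {
  testName: string;
  outcomes: Array<"pass" | "fail">; // one entry per repeated run
  flakeScoreChange: string;
  errors: string[];
  failingStep: string;
  traceNotes: string[];
  diffSummary: string;
}

// Indents pasted items so they sit under their bullet in the prompt.
function indent(items: string[]): string {
  return items.map((item) => "  " + item).join("\n");
}

function buildPrompt(e: FlakeEvidence): string {
  return [
    "Act as a senior SDET investigating a flaky automated test.",
    "Context:",
    "- Framework: Playwright Test",
    "- Test name: " + e.testName,
    "- DeFlaky runs: " + e.outcomes.length,
    "- Outcomes: " + e.outcomes.join(", "),
    "- FlakeScore change: " + e.flakeScoreChange,
    "Evidence:",
    "- Error messages:",
    indent(e.errors),
    "- Failing step:",
    indent([e.failingStep]),
    "- Relevant trace observations:",
    indent(e.traceNotes),
    "- Recent code changes:",
    indent([e.diffSummary]),
    "Task:",
    "1. Classify the likely root cause.",
    "2. List the top 3 hypotheses with confidence.",
    "3. Suggest the smallest safe fix.",
    "4. Suggest a verification run using DeFlaky.",
    "5. Tell me what not to change.",
  ].join("\n");
}

// Made-up sample values, for illustration only.
const promptText = buildPrompt({
  testName: "checkout total",
  outcomes: ["pass", "fail", "pass", "pass", "fail"],
  flakeScoreChange: "not measured yet",
  errors: ["Timeout 30000ms exceeded."],
  failingStep: "click Pay",
  traceNotes: ["A toast overlays the Pay button in both failing runs"],
  diffSummary: "No changes to checkout in the last week",
});

console.log(promptText);
```

The design choice here is deliberate: the builder requires every field, so an empty evidence section shows up as a type error or an obviously blank line instead of a silently vague prompt.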
Bad AI output to reject
Reject any AI answer that says only one of these:
- Increase timeout.
- Add waitForTimeout.
- Retry more times.
- Disable the test.
- Use a more stable selector without showing which selector.
Those are not root cause analysis. They are patches without proof. A useful AI answer names the evidence, classifies the failure, proposes a minimal change, and tells you how to verify it.
India SDET Career Angle
For SDETs in India, flaky-test investigation is a serious career signal. Many manual testers moving into automation learn syntax first: locators, assertions, waits, and page objects. That is necessary, but it is not enough for senior roles.
I see a clear difference between two candidate profiles. The first candidate says, “I automated 200 test cases.” The second candidate says, “I reduced our top flaky smoke failures by separating product bugs, selector issues, backend instability, and test data collisions, then added a nightly reliability gate.” The second candidate sounds closer to a staff-level SDET mindset because the story connects automation to release confidence.
This matters in service companies too. If you work in TCS, Infosys, Wipro, Cognizant, or a large QA vendor, you may not control the whole engineering stack. You can still bring a repeatable flake investigation report to the client. Show the test name, run count, pass/fail pattern, suspected root cause, evidence links, and recommended fix. That kind of report moves the conversation from blame to engineering.
In product companies, the bar is higher. You are expected to debug across browser automation, APIs, CI runners, containers, test data, and observability. DeFlaky does not magically make you senior, but it gives you a clean operating model for the work senior SDETs already do.
Product companies want engineers who can protect release confidence. In interviews, I would rather hear one detailed story about reducing flaky CI noise than ten generic claims about writing automation frameworks. If you can explain FlakeScore, repeated execution, trace-based evidence, and a safe quality gate, you sound like someone who has lived with production pipelines.
Interview story format
Use this structure:
- Problem: Our smoke suite had random failures after login and checkout.
- Signal: DeFlaky showed a reliability score below our target after five repeated runs.
- Evidence: Traces showed a race after payment status polling.
- Fix: We replaced a brittle wait with a deterministic API-backed assertion.
- Proof: The next DeFlaky run stayed above the threshold for the smoke suite.
That story works better than “I know Playwright.” It shows judgment. For a ₹25 to ₹40 LPA SDET role, judgment matters.
Key Takeaways
DeFlaky AI root cause analysis gives QA teams a practical way to turn flaky-test pain into measurable reliability work. I would start small, prove the signal, and then add quality gates only where the team has enough evidence.
- Flaky tests are trust debt, not just CI noise.
- DeFlaky runs the same command multiple times and converts stability into FlakeScore.
- AI becomes useful when it receives repeated-run evidence, traces, screenshots, logs, and test output.
- Retries help detect flakiness, but they do not fix root causes.
- Start with a nightly reliability job before blocking every pull request.
If you are building AI-assisted QA workflows, also read QA Skills Directory for AI Agents. The next level is turning this investigation pattern into a reusable skill your whole team can run.
FAQ
Is DeFlaky only for Playwright?
No. The DeFlaky docs describe it as framework agnostic when the test runner can output JUnit XML or JSON. Playwright is just the easiest example for many modern SDET teams.
How many repeated runs should I use?
Start with five runs for investigation. Use ten runs for high-value smoke tests or a known flaky area. Avoid running the full suite ten times on every pull request unless the suite is small.
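A quick back-of-the-envelope calculation shows why five and ten are reasonable starting points. It rests on a simplifying assumption that is mine, not DeFlaky's: each run of a flaky test fails independently with the same probability p. Real flakiness is often correlated with load or ordering, so read the numbers as a rough guide.

```latex
% Probability of seeing at least one failure in n independent runs
P(\text{at least one failure in } n \text{ runs}) = 1 - (1 - p)^{n}

% Example: a test that fails 10% of the time (p = 0.1)
n = 5:\quad  1 - 0.9^{5}  \approx 0.41
n = 10:\quad 1 - 0.9^{10} \approx 0.65

% Example: a test that fails 20% of the time (p = 0.2)
n = 5:\quad  1 - 0.8^{5}  \approx 0.67
n = 10:\quad 1 - 0.8^{10} \approx 0.89
```

So five runs will often miss a test that fails one time in ten, which is why I move to ten runs for smoke tests where a missed flake is expensive.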
Should I fail CI when FlakeScore drops?
Yes, but not on the first day. Publish the score first, establish a baseline, assign owners for the worst tests, and then add thresholds for critical suites.
Can AI automatically fix flaky tests?
Sometimes it can suggest a good fix, but I would not auto-merge those changes. AI should propose a hypothesis and a patch. DeFlaky should prove whether the patch improved reliability.
What is the biggest mistake teams make?
They hide flaky tests with retries and skips. That makes dashboards green but release confidence weaker. Measure the flake, classify the root cause, fix one thing, and prove it with another run.
Sources checked: DeFlaky official site and GitHub README, npm registry for deflaky-cli, Google Testing Blog on flaky tests, Playwright documentation on retries and reporters, Pytest documentation for JUnit XML output, and GitHub Actions workflow syntax.
