PromptFoo Evaluation for AI Testing: Stop Saying AI Failed
PromptFoo evaluation is the missing habit in many AI testing teams. If your bug report says only “the AI failed,” you have not created evidence. You have created a complaint that nobody can repeat.
I see this pattern every week with QA engineers experimenting with AI agents, RAG chatbots, test generators, and browser automation copilots. The demo works once, fails once, and the team starts arguing about the model. A better move is simple: write one bad prompt, one assertion, and one repeatable PromptFoo check.
Table of Contents
- What Is PromptFoo Evaluation?
- Why “The AI Failed” Is Not a Test Result
- Quick Start: PromptFoo Evaluation in 15 Minutes
- A Real QA Example: Test Case Generator Eval
- Assertions That Matter for AI Testing
- Put PromptFoo Evaluation in CI/CD
- India Context: Why SDETs Should Learn Evals Now
- Common Mistakes I See Teams Make
- Key Takeaways
- FAQ
Contents
What Is PromptFoo Evaluation?
PromptFoo evaluation is a repeatable way to test prompts, model outputs, RAG answers, agents, and AI workflows against expected behavior. The official Promptfoo documentation describes it as an open-source CLI and library for evaluating and red-teaming LLM apps. The same intro page says it helps teams build reliable prompts, models, and RAGs with benchmarks specific to their use case.
That sentence matters for QA. We are not paid to say “looks good.” We are paid to convert risk into evidence. PromptFoo gives SDETs a YAML file, a list of test cases, a set of assertions, and a command that can run again tomorrow.
Why this belongs in a QA workflow
Traditional automation checks deterministic systems. Click this button. Assert this text. Validate this API status code. AI systems add a messier layer. The same input can produce different wording, different ordering, and sometimes a confident wrong answer. That does not mean we skip testing. It means we test behavior instead of exact sentences.
PromptFoo supports assertions such as contains, regex, is-json, contains-json, JavaScript checks, Python checks, similarity checks, latency checks, and cost checks. Its assertions documentation defines accuracy as the proportion of prompts that produce the expected or desired output. That language is close to how testers already think about pass rate, failure clusters, and release gates.
The numbers show serious adoption
PromptFoo is not a random weekend script. At the time I checked for this article, the GitHub API showed 22,606 stars, 2,007 forks, an MIT license, and recent repository activity on 26 June 2026. The npm downloads API reported 1,414,066 downloads for the package promptfoo in the last month ending 24 June 2026. The Promptfoo docs also show mature product areas around evaluations, red teaming, guardrails, model security, and MCP proxy workflows.
I do not treat those numbers as proof that the tool is perfect. I treat them as proof that AI evaluation is now a practical engineering skill, not a research-only topic.
Why “The AI Failed” Is Not a Test Result
The phrase “the AI failed” hides the real defect. Did the prompt miss a constraint? Did the model ignore the expected JSON schema? Did the RAG retriever return irrelevant context? Did the browser agent click the wrong element? Did the evaluator itself use a vague rubric? Each one needs a different fix.
A weak bug report
Here is the kind of AI bug report I do not want from a QA team:
Title: AI test generator failed
Steps: Asked it to create test cases for login
Actual: Bad output
Expected: Good output
This gives a developer almost nothing. There is no prompt, no model, no input data, no expected behavior, no assertion, no version, and no repeatable command. It is equivalent to saying “Selenium is flaky” without sharing the selector, trace, browser version, or failure screenshot.
A useful AI testing bug report
Now compare it with this:
Title: AI test generator omits negative login cases
Prompt version: prompts/login-test-generator-v3.txt
Model: openai:gpt-5-mini
Input: valid login story with lockout after 5 failed attempts
Expected: output contains at least 2 negative test cases and 1 lockout case
Actual: generated only happy path and forgot lockout
Repro: npx promptfoo@latest eval -c promptfooconfig.yaml --filter-first-n 1
Failing assertion: javascript check for negative scenario count
That is evidence. A developer can run the same eval, inspect the output, change the prompt, and confirm whether the pass rate improves. If you already collect traces for AI browser runs, read the ScrollTest article on AI Testing Evidence Pack: Trace, Screenshot, Logs. The same evidence mindset applies here.
The repeatability rule
My rule is direct: if another engineer cannot rerun your AI failure in under 5 minutes, your bug report is incomplete. You do not need a 40-page test strategy. You need:
- the exact prompt or prompt file,
- the model/provider used,
- the input variables,
- the expected behavior,
- the assertion that failed,
- the command to rerun it.
Quick Start: PromptFoo Evaluation in 15 Minutes
The fastest way to understand PromptFoo evaluation is to create one tiny eval. Do not start with a giant RAG system or a 200-test benchmark. Start with one prompt and one output contract.
1. Install and initialize
Promptfoo’s command-line docs list init, eval, view, validate, retry, export, and redteam as CLI command groups. For a QA team, the first four are enough for day one.
mkdir ai-login-eval
cd ai-login-eval
npm init -y
npm install --save-dev promptfoo
npx promptfoo init
If your company blocks global npm installs, use npx promptfoo@latest. If your test environment uses a private npm proxy, pin a version in devDependencies so the same build runs in CI and on your laptop.
2. Write a bad prompt on purpose
Save this as prompts/login-cases.txt:
Generate test cases for this user story:
{{story}}
Return concise test cases.
This prompt is intentionally weak. It does not say how many cases to return. It does not demand negative paths. It does not ask for structured JSON. That weakness is useful because it gives us something to catch.
3. Create a minimal PromptFoo config
Save this as promptfooconfig.yaml:
description: Login test case generator eval
prompts:
- file://prompts/login-cases.txt
providers:
- openai:gpt-5-mini
tests:
- description: lockout story must include negative tests
vars:
story: |
As a banking user, I should be locked out after 5 failed login attempts
so that my account is protected from brute force attacks.
assert:
- type: icontains
value: lockout
- type: javascript
value: |
const text = output.toLowerCase();
const negativeWords = ['invalid', 'failed', 'wrong password', 'negative'];
return negativeWords.some(word => text.includes(word));
The official configuration guide says YAML configs run each prompt through example inputs, also called test cases, and check whether they meet requirements, also called assertions. That is exactly what we are doing here.
4. Run the eval
OPENAI_API_KEY=sk-your-key npx promptfoo eval -c promptfooconfig.yaml
npx promptfoo view
The first command runs the eval. The second command opens a browser UI so your team can inspect outputs. Promptfoo’s docs also show matrix views for comparing prompts, inputs, and providers side by side. That is useful when one model passes a schema check but another model gives a better business answer.
A Real QA Example: Test Case Generator Eval
Let us make the example more realistic. Many SDETs now ask AI to create test ideas from Jira stories, release notes, or API contracts. That is fine, but only if the AI output is reviewed and tested. For release notes specifically, I recently covered similar QA risk thinking in Playwright 1.61 Release Notes: Passkeys, WebStorage API, and What QA Teams Must Know.
Define the output contract
Do not ask the model for “good test cases.” Ask for a contract. For example:
- Return valid JSON.
- Return exactly 6 test cases.
- Include at least 2 negative cases.
- Include
id,title,type,priority, andsteps. - Use only
positive,negative, orsecurityas the type.
That gives PromptFoo something to measure. A vague output gets a vague review. A structured contract gets a clean pass or fail.
Use TypeScript for a stronger assertion
Create assertions/testCaseContract.ts:
type TestCase = {
id: string;
title: string;
type: 'positive' | 'negative' | 'security';
priority: 'P0' | 'P1' | 'P2';
steps: string[];
};
export default function validateTestCases(output: string) {
let parsed: TestCase[];
try {
parsed = JSON.parse(output);
} catch {
return { pass: false, score: 0, reason: 'Output is not valid JSON' };
}
if (!Array.isArray(parsed)) {
return { pass: false, score: 0, reason: 'Output must be a JSON array' };
}
const hasSix = parsed.length === 6;
const negativeCount = parsed.filter(tc => tc.type === 'negative').length;
const securityCount = parsed.filter(tc => tc.type === 'security').length;
const everyCaseHasSteps = parsed.every(tc => Array.isArray(tc.steps) && tc.steps.length >= 3);
const pass = hasSix && negativeCount >= 2 && securityCount >= 1 && everyCaseHasSteps;
return {
pass,
score: pass ? 1 : 0,
reason: JSON.stringify({ hasSix, negativeCount, securityCount, everyCaseHasSteps })
};
}
Then reference it from your config:
assert:
- type: is-json
- type: javascript
value: file://assertions/testCaseContract.ts
This is where QA engineers can add real value. Product owners may know the story. Developers may know the implementation. SDETs know the missing edge cases, bad data paths, security cases, and regression risks.
Compare prompt versions
PromptFoo lets you test multiple prompts against the same cases. That is useful when you improve a prompt from version 1 to version 2.
prompts:
- file://prompts/login-cases-v1.txt
- file://prompts/login-cases-v2.txt
Now your prompt review is not a debate. You can run both versions against the same 20 stories and inspect pass rate, failure reasons, latency, and cost. If version 2 passes 18 out of 20 and version 1 passes 11 out of 20, the discussion becomes concrete.
Assertions That Matter for AI Testing
PromptFoo has many assertion types. Do not use all of them on day one. Pick the smallest set that catches real product risk.
Use deterministic assertions first
Start with checks that do not need another LLM judge:
is-jsonwhen downstream code parses the output.contains-jsonwhen the model may include text around a JSON block.containsoricontainsfor required words like “lockout,” “refund,” or “OTP.”regexfor IDs, URLs, Jira keys, or masked account numbers.javascriptorpythonfor domain-specific rules.latencywhen the AI response sits inside a user-facing flow.costwhen a prompt change makes the run too expensive.
Promptfoo’s assertions page lists equality, contains, regex, JSON, HTML, SQL, JavaScript, Python, similarity, moderation, and model-graded options. I prefer deterministic checks first because they are cheaper and easier to debug.
Use model-graded checks carefully
Model-graded checks are useful for quality rubrics. For example, you can ask an evaluator model whether a test case set covers boundary values. But that adds another AI system to your test. Treat it like a reviewer, not a compiler.
A good model-graded rubric is specific:
Score 1 only if the answer includes:
- at least one boundary value case,
- at least one negative case,
- one clear expected result per case,
- no unsupported claim about implementation internals.
A bad rubric says “check whether the answer is good.” That is not testing. That is asking for vibes with a YAML wrapper.
Use repeat counts for flaky AI outputs
If a prompt passes once, it has not earned trust. Run it multiple times or across multiple inputs. This is the same logic I use when explaining why one successful AI agent run is weak evidence in AI Agent Testing: Why One Pass Means Nothing.
For critical workflows, I want at least 10 to 20 representative examples before I trust a prompt change. For release-critical prompts, I also want a small regression pack with stories that failed before.
Put PromptFoo Evaluation in CI/CD
An eval that runs only on one laptop will be forgotten. Put it in CI like any other quality gate.
GitHub Actions example
Here is a minimal workflow:
name: ai-evals
on:
pull_request:
paths:
- 'prompts/**'
- 'promptfooconfig.yaml'
- 'assertions/**'
jobs:
promptfoo:
runs-on: ubuntu-latest
timeout-minutes: 15
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: 22
- run: npm ci
- name: Run PromptFoo evaluation
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: npx promptfoo eval -c promptfooconfig.yaml
This catches prompt regressions before they merge. If a developer edits a prompt and removes the “return JSON” instruction, the PR should fail before that change reaches staging.
What to store as build artifacts
For each eval run, store the config, prompt files, output export, and summary. If a failure appears after a model update, you need to know whether the prompt changed, the test data changed, or the provider behavior changed.
- Commit prompt files and assertion code.
- Version the eval dataset.
- Export failing outputs from CI.
- Add the failure to a regression pack after the fix.
- Review pass rate trends weekly, not only during production incidents.
Connect it with Playwright and API tests
PromptFoo is not a replacement for Playwright, Selenium, Cypress, Postman, or contract tests. It tests the AI decision layer. Your existing tools still test the product behavior.
A practical stack looks like this:
- PromptFoo checks whether the AI generates valid test cases or correct actions.
- Playwright checks whether the browser flow actually works.
- API tests check the service contract and data rules.
- Trace review checks what happened when an AI browser run failed.
If you are building agentic browser tests, the ScrollTest guide on DeepEval vs PromptFoo for SDET Teams is a good companion read because it compares where each evaluator fits.
India Context: Why SDETs Should Learn Evals Now
In India, many QA engineers are still being measured by automation scripts, regression coverage, and CI stability. Those skills still matter. But AI features are entering product roadmaps faster than most QA teams are updating their test strategy.
The career signal
If two SDETs both know Playwright, API testing, and CI/CD, the one who can also test LLM output has a stronger 2026 profile. Product companies are not looking only for “prompt users.” They need engineers who can turn AI uncertainty into measurable checks.
For mid-level SDETs in Bengaluru, Pune, Hyderabad, Noida, and Chennai, I see the stronger compensation bands usually go to people who can own systems, not only scripts. A tester who can build a PromptFoo eval pack, connect it to GitHub Actions, and explain failure categories in a release review has a better story for ₹25 to ₹40 LPA roles than someone who says “I tried ChatGPT for test cases.”
Where services teams can use this
For TCS, Infosys, Wipro, Cognizant, Accenture, and similar service teams, this can become a differentiator in proposals. Instead of saying “we use AI for testing,” show a demo where:
- a Jira story goes into a prompt,
- PromptFoo validates the generated test cases,
- Playwright executes the selected flows,
- CI blocks weak AI output before a QA lead reviews it.
That is a better sales story because it includes evidence, controls, and repeatability.
Common Mistakes I See Teams Make
PromptFoo evaluation is simple to start, but teams still make avoidable mistakes. Most of them come from treating evals as magic instead of test automation.
Mistake 1: Testing only happy paths
If your eval dataset has 5 clean stories and no messy inputs, it will give false confidence. Add real defects from your bug tracker. Add vague requirements. Add missing acceptance criteria. Add security-sensitive stories. Add one story that caused production pain last quarter.
Mistake 2: Changing prompts without versioning
A prompt is production logic. Store it in Git. Review it like code. When a prompt changes, run the eval pack. If pass rate drops from 90% to 65%, do not merge because the wording “looks cleaner.”
Mistake 3: Using only LLM judges
LLM judges can help with nuance, but they can also hide simple failures. If the output must be JSON, use is-json. If it must include 6 cases, count 6 cases. If it must not expose a password, write a regex. Use model grading for judgment, not for basic parsing.
Mistake 4: No owner for the eval pack
An eval pack without an owner becomes stale. Assign it like any regression suite. When production defects appear, add a test. When product behavior changes, update the expected result. When a provider changes model behavior, inspect the failures instead of deleting the assertion.
Key Takeaways
PromptFoo evaluation gives QA teams a repeatable way to test AI outputs instead of arguing about one demo run. The core idea is not complicated: inputs, prompts, providers, assertions, and a command your team can run again.
- Stop writing “the AI failed.” Share the prompt, input, model, assertion, and rerun command.
- Start with deterministic checks such as
is-json,contains,regex, JavaScript, and Python. - Use model-graded checks only when the rubric is specific enough to audit.
- Put evals in CI so prompt changes fail before they reach staging.
- For SDETs, AI evaluation is becoming a practical skill alongside Playwright, API testing, and CI/CD.
My recommendation is simple: create one PromptFoo evaluation this week for one AI-assisted QA workflow. Do not boil the ocean. Pick one prompt, one assertion, and one failure your team can repeat.
FAQ
Is PromptFoo only for developers?
No. QA engineers can use it directly because the workflow looks like test automation: define inputs, run a command, check assertions, and report failures. You need basic YAML and enough JavaScript or Python to write custom checks.
Can PromptFoo replace manual review of AI outputs?
No. It reduces repetitive checking and catches known failure patterns. Human review still matters for new risks, product judgment, and unclear requirements. Treat evals as a regression suite for AI behavior.
Should I use PromptFoo or DeepEval?
Use PromptFoo when you want practical prompt, provider, CLI, CI, and red-team workflows. Use DeepEval when you want Python-first LLM evaluation patterns and metrics inside a Python test stack. Many SDET teams can use both, depending on the system under test.
How many test cases should my first eval have?
Start with 5 to 10 examples. Include at least 2 examples that previously failed or represent high business risk. After the first working eval, grow it to 20 or more representative cases.
What is the best first assertion for QA teams?
For AI test generation, start with is-json or a JavaScript contract check. For RAG answers, start with required source checks and hallucination-sensitive rubrics. For browser agents, check the planned action, selector, and final page state separately.
Sources checked: Promptfoo intro, configuration, assertions, and command-line documentation; Promptfoo OpenAI acquisition update dated 9 March 2026; GitHub repository API for promptfoo/promptfoo; npm downloads API for promptfoo.
