
PromptFoo Regression Checklist for QA Teams

The PromptFoo regression checklist is becoming a practical QA artifact, not a research toy. If your team ships prompts, RAG answers, AI agents, support bots, or LLM-powered test generators, you need repeatable checks before every release.

I see the same mistake in many QA teams: somebody tries a prompt manually, gets one good answer, and calls it tested. That is not testing. That is a demo. PromptFoo gives testers a way to turn examples, expected behavior, thresholds, and model comparisons into a regression suite that can run from the command line and in CI.

This article is a hands-on checklist for QA engineers and SDETs who want to treat LLM behavior like production software. I will use PromptFoo 0.121.17 as the reference version because the npm registry lists it as the latest release at the time of writing, published on 16 June 2026. The npm downloads API also reports 1,267,037 downloads for the package in the last month, which is a strong signal that eval tooling is moving from side project to normal engineering workflow.

Why QA teams need a PromptFoo regression checklist

Prompt changes look small in code review. One sentence changes. One system instruction is reordered. One retrieval filter is adjusted. Then the support bot starts giving weaker answers, the test case generator misses negative cases, or the AI browser agent clicks the wrong button.

Classic regression testing exists because small changes break existing behavior. LLM products are no different. The difference is that the output is probabilistic, so QA cannot depend only on exact expected strings. A good PromptFoo regression checklist combines deterministic assertions, semantic checks, data coverage, cost limits, latency limits, and human review for the risky cases.

One passing answer is not evidence

When a manual tester checks a normal web form, one pass still has limited value. For LLM output, the value is even lower because the same prompt can produce different words, different formats, or different reasoning paths. I want at least three things before I trust an AI behavior:

  • A stable set of test cases that represents real user inputs.
  • Assertions that describe what good output means.
  • A repeatable command that runs before merge or release.

PromptFoo fits this gap. The project README describes PromptFoo as a CLI and library for LLM evals and red teaming. The official docs say assertions compare LLM output against expected values or conditions. That language should feel familiar to QA engineers. It is the same mental model as an API test: input, response, assertions, report.

Prompt regression is still regression

Do not let the AI vocabulary confuse the team. A prompt regression is a product regression. If a customer support assistant used to explain a refund policy correctly and now hallucinates a fake refund window, that is a bug. If a QA test generator used to create boundary-value cases and now returns only happy paths, that is a bug.

This is why I link LLM testing back to standard QA practice. If you have not read our guide on LLM regression testing with PromptFoo, read that next. This checklist builds on the same idea: treat AI behavior as a contract you can inspect, repeat, and improve.

What changed in PromptFoo 0.121.17

The main reason PromptFoo deserves attention from QA teams is not one single feature. It is the maturity of the workflow around it. The npm registry reports 0.121.17 as the latest package version. The GitHub API shows the PromptFoo repository with 22,361 stars, 1,995 forks, and recent activity on 19 June 2026. Those numbers do not prove quality by themselves, but they show the tool is active enough for teams to evaluate seriously.

Use version checks in your QA process

Many teams pin browser drivers, Playwright, Selenium, and test libraries. They forget to pin eval tools. That is risky. If PromptFoo changes output schemas, assertion behavior, or provider handling, your CI gate may fail for the wrong reason.

Add these checks to your upgrade habit:

  1. Record the current PromptFoo version in your test README.
  2. Run the same eval suite before and after upgrading.
  3. Compare pass rate, token cost, latency, and failed cases.
  4. Store the generated result file as a CI artifact.
  5. Roll back the tool version if the upgrade changes evaluation behavior without a product change.
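
As a rough illustration of the pinning and artifact steps, here is a minimal CI sketch. It assumes GitHub Actions, a promptfooconfig.yaml at the repository root, and a repository secret named OPENAI_API_KEY. The workflow path, job name, artifact name, and Node version are illustrative choices, not something the PromptFoo docs prescribe.

```yaml
# .github/workflows/promptfoo-eval.yml (hypothetical path and names)
name: promptfoo-eval

on:
  pull_request:
    paths:
      - "prompts/**"
      - "promptfooconfig.yaml"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22   # assumed; check the Node range your pinned version supports
      # Pin the eval tool to the version recorded in the test README
      # instead of floating on @latest.
      - name: Run pinned PromptFoo eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: npx promptfoo@0.121.17 eval -c promptfooconfig.yaml -o results.json
      # Keep the result file even when the eval step fails.
      - name: Upload eval results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: promptfoo-results
          path: results.json
```

With the version written into the workflow, an upgrade becomes a one-line diff that can be reviewed, compared against the previous run's artifact, and reverted.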

Why this matters for QA managers

QA managers do not need every tester to become a machine learning engineer. They do need at least a few people who can ask better questions:

  • What test data represents production usage?
  • Which checks must be deterministic?
  • Which checks need model grading?
  • What pass rate is acceptable for release?
  • What failure needs a human decision?

That is QA work. The tool changed, but the discipline did not.

The PromptFoo regression checklist

Here is the practical PromptFoo regression checklist I would use before shipping a prompt, RAG flow, or AI agent. Print it in your team wiki and attach it to pull requests that touch prompts.

Checklist item 1: Define the behavior in plain English

Start with a one-paragraph contract. Do not start with YAML. A good behavior contract says what the AI feature must do, what it must not do, and what evidence proves it works.

Example:

The refund assistant must answer refund policy questions using only the current policy document. It must ask for an order ID when the user requests account-specific action. It must not invent refund windows, promise manual exceptions, or expose internal policy notes.

This paragraph gives QA, product, and engineering a shared target. Without it, PromptFoo becomes a random collection of prompts.

Checklist item 2: Split examples by risk

Do not create one flat list of prompts. Group them by business risk. High-risk examples should run in every pull request. Lower-risk examples can run nightly.

  • P0: Legal, security, payment, privacy, medical, or financial claims.
  • P1: Core user journeys and high-volume support questions.
  • P2: Edge cases, rare languages, formatting variations, and tone checks.
  • P3: Exploratory prompts that help discover new failure modes.

This mirrors the way I think about browser automation. Not every test belongs in the PR gate. Some tests protect the release branch. Some tests belong in nightly runs. Some tests are for investigation only.
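
One possible way to carry the risk tier into the suite is a metadata label on each test case. The sketch below assumes a hypothetical priority key with the P0 to P3 values from the list above; the test descriptions and inputs are made up for illustration.

```yaml
tests:
  - description: refund window is never invented
    metadata:
      priority: P0   # hypothetical label: legal and payment claims
    vars:
      question: "Can I get a refund after 21 days?"
    assert:
      - type: not-icontains
        value: "guaranteed exception"

  - description: tone stays polite for an angry user
    metadata:
      priority: P2   # hypothetical label: tone checks
    vars:
      question: "Your refund process is useless. Fix it."
    assert:
      - type: llm-rubric
        value: "The answer stays polite and does not blame the user."
```

To my understanding, the eval command can filter on these labels with an option of the form --filter-metadata priority=P0, which would let the PR gate run only the P0 cases. Confirm the exact flag against the CLI help of the version you pin.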

Checklist item 3: Use at least one deterministic assertion

The PromptFoo assertions documentation lists deterministic checks such as equality and contains checks, plus structured options for metrics, thresholds, and transforms. Use deterministic assertions wherever you can. They are cheaper, faster, and easier to debug than model-graded checks.

Good deterministic checks include:

  • The answer contains a required policy phrase.
  • The answer does not contain banned claims.
  • The response is valid JSON.
  • The output includes a required field such as risk_level.
  • The total cost stays below a defined threshold.

Use semantic or model-graded checks only where deterministic checks are too rigid. QA teams often jump to judge-based evals too quickly. That makes failures harder to explain.
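
A small sketch of what two of these deterministic checks can look like in a PromptFoo assert block. It assumes the model is asked to return a JSON object, and the allowed risk_level values are an assumption made for the example.

```yaml
assert:
  # Valid JSON that also matches a minimal schema with a required field.
  - type: is-json
    value:
      type: object
      required: [risk_level]
      properties:
        risk_level:
          type: string
          enum: [low, medium, high]   # assumed value set
  # Banned claim, checked case-insensitively.
  - type: not-icontains
    value: "guaranteed exception"
```

Both checks run without calling a second model, so a failure points at a specific missing field or phrase rather than at a judge's opinion.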

Checklist item 4: Track cost and latency

Prompt quality is not the only release criterion. If the new prompt doubles token cost or makes the agent slow, users feel it. PromptFoo provides cost and latency assertions, so use them in CI for expensive workflows.

A simple rule works well: quality regressions block a merge, cost regressions require review, and latency regressions need a product decision. Do not bury these numbers in a dashboard nobody reads. Put them in the pull request comment or CI summary.
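
As a sketch, budget checks can sit in defaultTest so they apply to every test case in the suite. The two threshold values are illustrative budgets, not recommendations.

```yaml
defaultTest:
  assert:
    # Cost reported by the provider for a single test case.
    - type: cost
      threshold: 0.002   # illustrative budget
    # Response time in milliseconds.
    - type: latency
      threshold: 5000    # illustrative budget
```

Latency is only meaningful for live calls, so run the eval with --no-cache when this assertion is part of the gate.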

Test data: the part most teams skip

The fastest way to create a fake LLM regression suite is to write ten friendly prompts that all look the same. The suite passes, the dashboard looks green, and production still breaks. Test data design matters more than the tool.

Build a small but honest dataset

Start with 30 to 50 cases before you chase hundreds. Each case should have a reason to exist. I like this mix for a first suite:

  • 10 happy-path questions from real user behavior.
  • 10 edge cases that historically confused the product.
  • 5 adversarial or prompt-injection attempts.
  • 5 format checks, such as JSON, markdown, or table output.
  • 5 locale or language variants if your product serves India or global markets.
  • 5 negative cases where the assistant should refuse, ask for clarification, or say it does not know.

If you already have production support tickets, use them. Remove personal data first. If you have chatbot transcripts, sample them by topic and risk. If you have no production data, ask support, sales, QA, and product to contribute examples.
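
To keep that mix visible in the repository, one option is a separate test file per category. This sketch assumes PromptFoo's support for loading test cases from external files; the tests/ directory and file names are hypothetical.

```yaml
# promptfooconfig.yaml: each file holds a YAML list of test cases.
tests:
  - file://tests/happy-path.yaml
  - file://tests/edge-cases.yaml
  - file://tests/adversarial.yaml
  - file://tests/format-checks.yaml
  - file://tests/locale-variants.yaml
  - file://tests/negative-cases.yaml
```

A reviewer can then see at a glance which category a new case lands in, and an empty file is an obvious coverage gap.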

Use a coverage matrix

A coverage matrix prevents false confidence. It shows which user intents, risk types, locales, and output formats are represented. We have a related ScrollTest guide on AI test coverage matrices. The same structure works for PromptFoo.

Area | Example coverage question | Minimum cases
Intent | Which user task is being tested? | 10
Policy | Which product rule must be followed? | 5
Safety | What must the model refuse or avoid? | 5
Format | What output shape must be preserved? | 5
Locale | Does the answer work for Indian users and global users? | 5

The goal is not perfect coverage on day one. The goal is visible gaps. Once the gaps are visible, the team can decide what to add next.

Assertions that make LLM tests useful

A PromptFoo suite is only as good as its assertions. If every test simply asks a model judge whether the answer is good, debugging becomes slow. I prefer a layered assertion strategy.

Layer 1: Hard format checks

Hard format checks should fail fast. If your downstream system expects JSON, validate JSON. If the API contract requires answer, confidence, and sources, check those fields before anything else.

tests:
  - description: refund answer returns required JSON fields
    vars:
      question: "Can I get a refund after 21 days?"
    assert:
      - type: is-json
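      - type: javascript
        value: |
          // Require each key to be present; a confidence of 0 is still valid.
          const obj = JSON.parse(output);
          return obj !== null && typeof obj === 'object' &&
            ['answer', 'confidence', 'sources'].every((key) => key in obj);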

This is familiar territory for API testers. If the schema is broken, do not waste money asking another model to grade the text.

Layer 2: Required and banned content

Next, check content boundaries. For policy workflows, I want required phrases and banned phrases. Required content proves the answer touched the right rule. Banned content catches risky hallucinations.

assert:
  - type: icontains
    value: "refund policy"
  - type: not-icontains
    value: "guaranteed exception"
  - type: not-icontains
    value: "manager approval is automatic"

This may look basic, but it catches a lot of real issues. A model that invents a policy often uses confident phrases. Banned terms make those failures visible.

Layer 3: Semantic quality checks

Use semantic similarity or model-graded checks for answers that can be correct in many ways. Keep the rubric short. A 12-line rubric is usually a sign that the product requirement is not clear.

A useful rubric says:

  • Answer the user question directly.
  • Use only the provided policy context.
  • Ask for missing account details when needed.
  • Do not promise outcomes outside the policy.

That is enough for many product teams. If the rubric grows too complex, split the test into smaller checks.
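
Written as assertions, a short rubric like this could look roughly as follows. The reference sentence and the 0.8 similarity threshold are assumptions for illustration. Both checks call a model (a grader for llm-rubric, an embedding model for similar), so they cost more than the deterministic layers.

```yaml
assert:
  # Model-graded check against the four-point rubric.
  - type: llm-rubric
    value: |
      The answer directly addresses the user question.
      It uses only the provided policy context.
      It asks for missing account details when needed.
      It does not promise outcomes outside the policy.
  # Embedding similarity against a reference answer.
  - type: similar
    value: "Refunds are available within 14 days of purchase."
    threshold: 0.8   # assumed cutoff; tune it on real outputs
```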

CI gates for prompt regressions

The PromptFoo CI/CD documentation recommends running evals from CI with commands such as npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json. It also shows output formats like JSON, HTML, and JUnit XML. That matters because QA teams already know how to consume build artifacts and test reports. If you pin the tool version as recommended earlier, replace @latest with that exact version in your own pipeline.

Choose the right gate

Do not block every pull request on a huge eval. That will slow engineers and create resentment. Use three gates:

  1. PR gate: 20 to 50 critical cases. Fast, cheap, and strict.
  2. Nightly gate: broader coverage with more providers, locales, and edge cases.
  3. Release gate: full suite with HTML report, JSON artifact, and human review for failures.

This is the same test pyramid conversation, applied to AI. Keep the merge gate lean. Put expensive checks where they create release confidence.
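
A minimal sketch of the first two gates as one workflow, assuming GitHub Actions and two hypothetical config files, evals/pr-gate.yaml and evals/nightly.yaml. The schedule, job names, and secret name are illustrative.

```yaml
# .github/workflows/promptfoo-gates.yml (hypothetical path and names)
name: promptfoo-gates

on:
  pull_request:
  schedule:
    - cron: "30 20 * * *"   # once a day, UTC

jobs:
  pr-gate:
    # Small, strict suite on every pull request.
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22   # assumed
      - name: Run critical cases
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: npx promptfoo@0.121.17 eval -c evals/pr-gate.yaml -o results.json

  nightly-gate:
    # Broader suite on the schedule only.
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22   # assumed
      - name: Run broad coverage
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: npx promptfoo@0.121.17 eval -c evals/nightly.yaml -o results.json -o report.html
```

The release gate can follow the same pattern with a manual trigger and a human sign-off step on the generated report.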

Store evidence, not only pass or fail

Every failed eval should leave evidence. Store the prompt, model, input variables, output, assertions, cost, latency, and commit SHA. If the failure cannot be reproduced, it becomes a Slack argument.

npx promptfoo@latest eval \
  -c promptfooconfig.yaml \
  -o results.json \
  -o report.html \
  --tag git.sha="$GITHUB_SHA" \
  --tag ci.run-id="$GITHUB_RUN_ID"

The official CI/CD docs show repeatable --tag flags for CI context. Use them. A tagged eval is easier to connect back to the exact change that caused the regression.

Set thresholds carefully

A 100% pass threshold sounds strong, but it may be unrealistic for early AI products. A 60% pass threshold sounds practical, but it may hide serious failures. Decide by risk area.

  • Security and privacy checks: 100% or block release.
  • JSON schema and contract checks: 100% or block merge.
  • Core task quality: start at 90% and tighten as the suite improves.
  • Tone and style: warning first, block only for brand-critical flows.
  • Cost and latency: warning thresholds with owner approval.

Numbers force discussion. Without thresholds, every eval report becomes subjective.
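
To make per-area thresholds workable, each assertion can carry a metric name so the report groups scores by risk area. In this sketch the metric names, test input, and banned phrase are assumptions, and the block does not enforce the percentages above; the team, or a script reading the results file, still has to apply them.

```yaml
tests:
  - description: refund answer respects security, contract, and quality rules
    vars:
      question: "Can I get a refund after 21 days?"
    assert:
      - type: not-icontains
        value: "internal policy note"
        metric: security       # assumed metric name
      - type: is-json
        metric: contract       # assumed metric name
      - type: llm-rubric
        value: "The answer uses only the supplied policy."
        metric: task_quality   # assumed metric name
```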

Where Playwright, API tests, and PromptFoo meet

PromptFoo should not live in a separate AI island. The best QA teams connect it with their existing automation stack. If your AI feature is exposed through an API, test the API contract. If it appears in a browser workflow, use Playwright for the UI and PromptFoo for the LLM response quality.

API test plus PromptFoo eval

For an AI endpoint, I like this split:

  • API tests verify status code, auth, schema, retries, and rate limits.
  • PromptFoo verifies answer quality, refusal behavior, and regression risk.
  • Observability checks verify logs, traces, and cost metrics.

This prevents a common mistake: using PromptFoo to test everything. It is not a replacement for API automation. It is a quality layer for LLM behavior.

Browser agent test plus PromptFoo eval

AI browser agents create a different challenge. The agent may complete a flow, but still use weak reasoning or ignore a key instruction. Pair browser evidence with eval evidence.

A strong agent test artifact includes:

  • Playwright trace or browser-use run log.
  • Screenshots for the final state and key steps.
  • Console and network errors.
  • PromptFoo eval for the agent instruction and final response.
  • A human-readable failure reason.

If you are exploring AI QA agents, read AI QA agents from prompts to runnable checks. The mindset is the same: do not stop at generated text. Convert the output into checks.

India career angle for QA engineers

For QA engineers in India, PromptFoo is a good skill to learn because it connects existing testing discipline with AI product work. Many service company projects still measure QA through manual execution, Selenium maintenance, and release support. Product companies are adding AI features faster, and they need people who can test those features without pretending every QA engineer is now a data scientist.

What to show in your portfolio

If you want to stand out in interviews, do not only say, “I know AI testing.” Show a small repo with evidence:

  • A PromptFoo config with 30 test cases.
  • At least five deterministic assertions.
  • At least five adversarial or refusal tests.
  • A GitHub Actions workflow that runs the eval.
  • A sample HTML or JSON report from a failed run.
  • A short README explaining the release decision.

This is stronger than another certificate screenshot. It proves you can convert fuzzy AI behavior into a testable system. For many SDETs targeting ₹25-40 LPA roles in product companies, this kind of portfolio can separate them from candidates who only discuss Selenium locators and test cases.

What managers should train first

If you lead a QA team, train three skills first:

  1. Eval design: how to choose inputs, risks, and expected behavior.
  2. Assertion design: how to mix deterministic and semantic checks.
  3. Evidence reporting: how to explain a failed AI run without vague language.

The tool syntax is the easy part. The judgment is the hard part. That is where experienced QA engineers have an advantage.

A starter PromptFoo config QA teams can copy

Here is a compact starter config for a support assistant. Adjust the provider, prompt, and expected checks for your product. The important part is the structure: variables, assertions, thresholds, and clear descriptions.

description: Support assistant regression suite

providers:
  - id: openai:gpt-4.1-mini

prompts:
  - file://prompts/support-assistant.txt

tests:
  - description: refund policy answer stays inside policy
    vars:
      question: "Can I get a refund after 21 days?"
      policy: "Refunds are available within 14 days unless local law requires otherwise."
    assert:
      - type: icontains
        value: "14 days"
      - type: not-icontains
        value: "guaranteed exception"
      - type: llm-rubric
        value: "The answer must use only the supplied policy and must not invent a refund exception."

  - description: account-specific request asks for order ID
    vars:
      question: "Please refund my last order now."
      policy: "Refund requests require an order ID before account action."
    assert:
      - type: icontains
        value: "order ID"
      - type: not-icontains
        value: "processed your refund"

  - description: response cost stays under budget
    vars:
      question: "Explain the refund policy in two sentences."
      policy: "Refunds are available within 14 days unless local law requires otherwise."
    assert:
      - type: cost
        threshold: 0.002

Run it locally first:

npx promptfoo@latest eval -c promptfooconfig.yaml
npx promptfoo@latest view

Then add CI output:

npx promptfoo@latest eval \
  -c promptfooconfig.yaml \
  -o results.json \
  -o results.junit.xml \
  --fail-on-error

Review failures like bugs

When a test fails, write a bug report with the same discipline you use for any production issue. Include input, expected behavior, observed output, assertion failure, model, version, and reproduction command. Avoid vague labels like “AI failed.” That sentence helps nobody.

A better bug title is:

Refund assistant invents manager exception for 21-day refund request in PromptFoo release gate.

That title is searchable, actionable, and tied to a regression suite.

Key takeaways

The PromptFoo regression checklist gives QA teams a practical way to test LLM behavior before it reaches users. The workflow is not magic. It is disciplined test design applied to prompts, models, and AI product flows.

  • PromptFoo 0.121.17 is current on npm at the time of writing, and the package has over 1.26 million last-month npm downloads.
  • Start with a behavior contract before writing YAML.
  • Use deterministic assertions first, then add model-graded checks where needed.
  • Split evals into PR, nightly, and release gates.
  • Store evidence for every failed eval: prompt, output, assertion, cost, latency, and commit SHA.
  • For QA engineers in India, a working PromptFoo portfolio is stronger than saying “I know AI testing” in an interview.

My simple rule: if the AI feature can affect a user decision, it deserves a regression suite. PromptFoo is one of the most practical ways to build that suite today.

FAQ

Is PromptFoo only for developers?

No. Developers may wire it into CI, but QA engineers are often better at designing the test cases, risk categories, and failure reports. The YAML syntax is learnable. The testing judgment comes from practice.

Should QA teams use PromptFoo instead of Playwright?

No. Use Playwright for browser behavior and PromptFoo for LLM output quality. If your AI feature appears inside a UI, you may need both. Playwright proves the flow works. PromptFoo checks whether the generated answer or decision is acceptable.

How many PromptFoo tests should a team start with?

Start with 30 to 50 high-value cases. Cover happy paths, edge cases, refusal behavior, format checks, and adversarial inputs. A small honest suite is better than 300 copied prompts with weak assertions.

Can PromptFoo run in CI?

Yes. The official PromptFoo CI/CD docs show commands for running evals with npx promptfoo@latest eval, exporting JSON, HTML, and JUnit XML, and using tags for CI context. That makes it friendly for QA teams already using CI reports.

What is the biggest PromptFoo mistake QA teams make?

The biggest mistake is treating a model-graded score as the whole test strategy. Use hard checks for schema, required content, banned content, cost, and latency. Add model grading only where a deterministic assertion cannot express the requirement.

Sources: npm registry for PromptFoo latest version, npm downloads API, PromptFoo GitHub repository, PromptFoo assertions documentation, and PromptFoo CI/CD documentation.
