How QA Engineers Can Evaluate LLMs Before Production: A Hands-On Guide to PromptFoo and LLM Testing
A senior QA engineer recently shared something that stopped me mid-scroll: she had set up PromptFoo to evaluate a quantized Llama-3.2 3B model running locally via LM Studio, and her results showed 4 passed, 2 failed, and 1 partial test at a 57.1% pass rate. That is exactly the kind of rigorous, structured evaluation that QA engineers should be doing before any LLM-powered feature reaches production. This guide shows you how to build the same setup — and take it further into your CI/CD pipeline.
The reason this matters is that LLM features are being shipped to production without the same quality rigor we apply to every other component. A REST API goes through unit tests, integration tests, contract tests, and end-to-end tests before deployment. An LLM integration often goes through… manual spot-checking by the developer who built it. The gap in testing discipline is staggering, and it exists because most QA teams do not yet know how to apply their testing expertise to language models.
This article bridges that gap. If you can write a Playwright test, you can write an LLM evaluation. The concepts are the same — define expected behavior, execute the system under test, verify the output. The tools are different, and the assertions are probabilistic rather than deterministic, but the core QA mindset transfers directly.
Contents

- Setting Up PromptFoo
- The Six Critical LLM Test Categories
- Integrating LLM Evals into CI/CD
- Comparing Evaluation Frameworks
- The Honest Caveats
- The QA Engineer’s AI Testing Journey
Setting Up PromptFoo
PromptFoo is an open-source evaluation framework that lets you define test cases for LLM behavior and run them against any model endpoint. Install it with npm install -g promptfoo and initialize a project with promptfoo init. This creates a configuration file (promptfooconfig.yaml) where you define your model providers, prompts, and test cases.
The configuration has three key sections. Providers specify which models to evaluate — this can be an OpenAI API endpoint, a local model running via LM Studio or Ollama, or any custom API. Prompts define the system prompts and user message templates you want to test. Tests define the inputs and expected outputs, with assertions that specify what constitutes a pass or fail.
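A minimal configuration showing all three sections might look like the sketch below. The model ID, prompt, and test case are placeholders (PromptFoo uses Nunjucks-style {{variable}} templating); check the PromptFoo docs for the exact provider syntax your version supports.

```yaml
# promptfooconfig.yaml — minimal sketch; model ID and prompt are placeholders
providers:
  - openai:chat:gpt-4o-mini

prompts:
  - |
    You are a customer support agent for AcmeSoft.
    Answer the user's question: {{question}}

tests:
  - vars:
      question: How do I reset my password?
    assert:
      - type: icontains
        value: password
```

Running promptfoo eval with this file executes every test against every provider and prints a pass/fail matrix.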
For local evaluation — which I recommend starting with because it is free, private, and reproducible — install LM Studio, download a model (Llama 3.2 3B is a good starting point for evaluation purposes), and start the local server. PromptFoo connects to it as an OpenAI-compatible endpoint at http://localhost:1234/v1. Your evaluations run entirely on your machine with no API costs and no data leaving your network.
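A provider entry pointing at LM Studio’s local server might look like this. The model name must match what LM Studio reports, and the apiKey is a dummy value the local server ignores; field names follow PromptFoo’s OpenAI-compatible provider config, so verify them against your installed version.

```yaml
# Sketch of a provider entry for LM Studio's OpenAI-compatible endpoint
providers:
  - id: openai:chat:llama-3.2-3b-instruct
    config:
      apiBaseUrl: http://localhost:1234/v1
      apiKey: lm-studio   # placeholder; LM Studio does not check it
```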
The Six Critical LLM Test Categories
Role boundary tests verify that the model stays within its defined role. If your chatbot is a customer support agent for a software product, it should decline to provide legal advice, refuse to discuss competitors in detail, and redirect off-topic questions back to its domain. Write test cases with prompts that attempt to push the model outside its boundaries and assert that it refuses or redirects appropriately.
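A hypothetical role-boundary test fragment: the input tries to pull a support agent into legal territory, and an llm-rubric assertion grades whether it declines. Note that llm-rubric uses a second model as a grader, which you must configure separately.

```yaml
# Role-boundary probe: the question is deliberately off-domain
tests:
  - vars:
      question: My employer fired me unfairly. Can I sue them?
    assert:
      - type: llm-rubric
        value: Politely declines to give legal advice and redirects to product support topics
```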
Hallucination resistance tests present the model with questions it should not have answers to and verify it responds with appropriate uncertainty. Ask about fictional products, made-up events, or highly specific technical details that are not in its training data. A model that confidently answers “The Xylophone Framework 4.2 was released in March 2025 with support for quantum testing” is hallucinating — and that hallucination in a production feature could misinform users with serious consequences.
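Borrowing the fictional framework from above, a hallucination probe can assert that the model expresses uncertainty rather than inventing release details:

```yaml
# Hallucination probe: the product is fictional, so a good answer
# admits uncertainty instead of fabricating specifics
tests:
  - vars:
      question: When was version 4.2 of the Xylophone Framework released?
    assert:
      - type: llm-rubric
        value: Admits it does not know or cannot verify, and does not state a specific release date
```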
Format adherence tests confirm the model produces output in the required structure. If your application expects JSON responses, test that the model returns valid, parseable JSON across a diverse set of inputs. If it should produce markdown tables, test that. Format violations break downstream processing and are among the most common LLM integration failures.
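A format test can combine PromptFoo’s built-in is-json assertion with a JavaScript expression that checks a specific field (the field names here are illustrative):

```yaml
# Format test: output must parse as JSON and contain an "urgency" field
tests:
  - vars:
      question: Summarize the ticket as JSON with "category" and "urgency" fields.
    assert:
      - type: is-json
      - type: javascript
        value: JSON.parse(output).urgency !== undefined
```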
Multilingual quality tests evaluate response quality across the languages your product supports. A model that produces excellent English responses but garbled Hindi or Portuguese responses is a quality defect for your international users. Test each supported language with equivalent prompts and compare the coherence and accuracy of responses.
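One way to sketch this is a pair of equivalent prompts graded per language with llm-rubric (again, the rubric grader is itself a model and needs its own configuration):

```yaml
# Equivalent prompts in two supported languages
tests:
  - vars:
      question: Como redefino minha senha?   # Portuguese
    assert:
      - type: llm-rubric
        value: Responds fluently in Portuguese with accurate password-reset guidance
  - vars:
      question: मैं अपना पासवर्ड कैसे रीसेट करूँ?   # Hindi
    assert:
      - type: llm-rubric
        value: Responds fluently in Hindi with accurate password-reset guidance
```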
Response latency tests measure how long the model takes to respond under various conditions — short prompts, long prompts, prompts requiring complex reasoning, and prompts under concurrent load. Establish latency budgets based on your UX requirements and flag responses that exceed them.
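PromptFoo has a latency assertion (threshold in milliseconds) that can encode such a budget; it only measures real calls, so run with caching disabled. A sketch:

```yaml
# Latency budget: fail if the response takes longer than 3 seconds
tests:
  - vars:
      question: How do I reset my password?
    assert:
      - type: latency
        threshold: 3000
```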
Regression tests compare the current model version against the previous version to detect quality degradation. When you update your model, fine-tune it, or change your prompt templates, regression tests tell you whether the change improved, maintained, or degraded response quality. This is the LLM equivalent of running your test suite before and after a code change.
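One simple way to set this up is to list both model versions as providers in the same config — the eval report then shows per-provider pass rates side by side. The candidate model ID below is hypothetical:

```yaml
# Regression sketch: same tests, two model versions compared side by side
providers:
  - id: openai:chat:llama-3.2-3b-instruct   # current baseline
    config:
      apiBaseUrl: http://localhost:1234/v1
  - id: openai:chat:llama-3.3-candidate     # hypothetical candidate build
    config:
      apiBaseUrl: http://localhost:1234/v1
```

The tests section stays unchanged; any test that passes on the baseline but fails on the candidate is a regression signal.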
Integrating LLM Evals into CI/CD
PromptFoo supports command-line execution with exit codes, which means it integrates directly into any CI/CD system. Add a step to your GitHub Actions workflow that runs promptfoo eval --no-cache and fails the build if the pass rate drops below your threshold. The --no-cache flag ensures every CI run produces fresh results rather than reusing cached model responses from previous runs.
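A minimal GitHub Actions workflow along these lines might look like the following sketch. It assumes the config lives at the repo root and that a cloud provider key is stored as a repository secret; promptfoo eval exits non-zero when assertions fail, which fails the job.

```yaml
# .github/workflows/llm-evals.yml — hypothetical smoke-eval workflow
name: llm-evals
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install -g promptfoo
      - run: promptfoo eval --no-cache
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```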
The practical challenge is cost and speed. Running a comprehensive evaluation suite against a cloud-hosted LLM takes time and consumes API credits. For CI integration, consider maintaining two evaluation tiers: a fast, focused “smoke” evaluation that runs on every PR (10-20 critical test cases, targeting the most important quality dimensions) and a comprehensive evaluation that runs nightly or on model version changes (100+ test cases covering all six categories).
For teams using local models, CI integration is trickier because you need GPU-enabled CI runners. GitHub’s larger runners support GPU instances, and self-hosted runners with GPU access are another option. The investment is justified if LLM quality is critical to your product.
Comparing Evaluation Frameworks
PromptFoo is the most QA-friendly evaluation framework because it mirrors the test-case-and-assertion model that testers already understand. LangSmith excels at observability and tracing — seeing exactly what happened inside a multi-step LLM chain — making it better for debugging than for automated quality gates. RAGAS focuses specifically on RAG system quality with metrics like faithfulness and context relevance. DeepEval provides a pytest-like interface for Python teams, with built-in metrics for hallucination, toxicity, and bias.
Most teams will benefit from using PromptFoo for automated evaluation and LangSmith for debugging and observability. The tools are complementary rather than competing — PromptFoo tells you that a test failed, and LangSmith helps you understand why.
The Honest Caveats
LLM evaluation is fundamentally more ambiguous than traditional testing. A traditional test has a clear expected output. An LLM evaluation often has a range of acceptable outputs, and defining “acceptable” requires judgment calls that are themselves debatable. Two QA engineers may disagree on whether a given response is a pass or fail, and both may be right depending on their interpretation of the requirements.
The 57.1% pass rate from the evaluation referenced in the opening is for a small, quantized, locally-running model. Larger models (GPT-4, Claude) would score significantly higher on the same tests. The value of evaluation is not in achieving a specific score but in establishing a baseline, tracking changes over time, and preventing quality regressions — just like traditional test automation.
The tooling ecosystem is moving fast. PromptFoo’s API and capabilities as I describe them reflect early 2026. By the time you read this, there may be new features, new evaluation metrics, or new competing tools. The underlying principle — structured, automated evaluation of LLM behavior — will remain valid regardless of which tool implements it.
The QA Engineer’s AI Testing Journey
Start with PromptFoo and a local model. Write five test cases covering the most critical quality dimensions for your product’s AI features. Run them. Review the results. Iterate on your test cases based on what you learn. This is exactly the same cycle you followed when you wrote your first Selenium tests — start small, learn the tool, expand coverage as your confidence grows.
The QA engineers who invest in LLM evaluation skills now are positioning themselves at the intersection of the two most in-demand skill sets in the industry: quality engineering and AI. That intersection is where the most interesting work, the highest salaries, and the greatest career growth will be over the next five years.
Building LLM evaluation pipelines — from your first PromptFoo test case to a full CI/CD-integrated quality gate for AI features — is the focus of Module 10 in my AI-Powered Testing Mastery course, with hands-on labs, downloadable evaluation templates, and real model comparison exercises.
