Stop Shipping on Vibes: How to Build Evals for Your AI Features

Your prompt changed. The retrieval layer shifted. You swapped in a cheaper model. Everything still feels fine — the demo looks good, a few manual tests came back clean, and the team ships.
Three days later, a model update silently breaks the grounding behavior you spent two weeks tuning. Nobody notices for a month. The feature seems fine, right up until it isn’t.
This is the failure mode that LLM evals exist to prevent. Not because your AI is bad, but because probabilistic systems require systematic testing. You can’t catch regressions by vibe-checking. You can’t tell if a prompt tweak helped or just moved the failures somewhere you didn’t look. And “it looked good when I tried it twice” is not an engineering practice — it’s a hope.
If your team can’t show you their evals, they’re not engineering AI. They’re guessing. Here’s how to build the real thing.
Table of Contents
- What evals actually are (and what they’re not)
- Step 1 — Build a golden dataset
- Step 2 — Write deterministic evals first
- Step 3 — Add model-graded evals for subjective quality
- Step 4 — Cover the adversarial cases
- Step 5 — Gate your CI/CD pipeline
- How to pick a framework
- The anti-patterns: what eval theater looks like
- FAQ
What evals actually are (and what they’re not)
An eval is a repeatable, automated test for AI behavior — not a one-time human review, not an academic benchmark, and not a vibe check. It answers a specific question: does the system still do what it’s supposed to do, given this input?
The important distinction from traditional unit tests: LLM outputs are probabilistic and often multi-valid. There’s rarely one correct answer. A test that fails when the model chooses a synonym is worse than no test at all — it fails on correctness, not quality, and generates noise that makes you ignore future failures.
Three eval types have emerged as the practical standard in 2026:
- Deterministic — schema validation, keyword presence/absence, format checks, exact match where exact match is actually valid. Fast, cheap, objective. Best for structured output, routing accuracy, compliance with formatting rules.
- Model-graded (LLM-as-judge) — a judge model scores the output against explicit criteria: grounded in context? answered the question? safe? Scalable and flexible for subjective quality. Requires careful prompt design; suffers from known biases discussed below.
- Human-in-the-loop — humans annotate a sample and that labeled set becomes your ground truth. Expensive at scale, but essential for calibrating your automated judges and catching failure modes they systematically miss.
In practice, a solid eval suite runs all three. Deterministic evals are your regression safety net (fast, free, catch the obvious). Model-graded evals are your quality signal (slower, cost something, catch the subtle). Human review validates the first two periodically.
Step 1 — Build a golden dataset
The foundation of every eval is a dataset of test cases with known, expected properties. Not expected exact outputs — expected properties, because “the exact wording” is usually the wrong contract to test.
# A test case: what the input is, what the output must and must not contain,
# and the context available for grounding checks.
GOLDEN = [
{
"id": "return-window-basic",
"input": "What's the return window?",
"must_contain": ["30 days"],
"must_not_contain": ["60 days", "no returns", "no refunds"],
"context": ["Returns are accepted within 30 days of purchase."],
"tags": ["policy", "returns"],
},
{
"id": "return-used-item",
"input": "Can I return a used item?",
"must_contain": [],
"must_not_contain": ["yes", "of course", "absolutely"], # policy: no
"context": ["Only unused, unopened items are eligible for return."],
"tags": ["policy", "returns", "edge-case"],
},
# 30-50 cases minimum for a meaningful eval set
]Code language: Python (python)
A few hard-won rules for what goes in the dataset. Include cases from real usage, not just cases you invented at your desk — bugs live in the queries you didn’t anticipate. Make sure edge cases and adversarial inputs outnumber your happy-path examples; a test set that only covers easy questions is measuring your demo, not your product. Tag cases so you can slice results by category when something breaks. And version the dataset alongside your code: if a change to the prompt breaks the “returns” slice, you want to trace it to the exact commit.
Thirty to fifty cases is a workable starting point for a single feature. More is better. But thirty well-chosen, representative cases beat two hundred variations of the same happy path.
Step 2 — Write deterministic evals first
Before you spend a dollar on a judge model, extract everything you can with rule-based checks. They’re instant, free, and they catch the majority of regressions that actually happen in practice.
def deterministic_eval(output: str, case: dict) -> dict:
output_lower = output.lower()
contains_required = all(
kw.lower() in output_lower for kw in case["must_contain"]
)
no_forbidden = not any(
kw.lower() in output_lower for kw in case["must_not_contain"]
)
# Add format checks, JSON schema validation, max-length assertions, etc.
passed = contains_required and no_forbidden
return {
"id": case["id"],
"passed": passed,
"contains_required": contains_required,
"no_forbidden": no_forbidden,
}
results = [deterministic_eval(pipeline(c["input"]), c) for c in GOLDEN]
failures = [r for r in results if not r["passed"]]
print(f"{len(failures)}/{len(GOLDEN)} cases failed")Code language: Python (python)
Deterministic evals also cover structural output: if your model is supposed to return JSON matching a schema, validate the schema on every case. If output must be under 200 words, assert that. If a routing classifier must return one of three labels, check exactly that. These are the evals that tell you a model update broke something obvious — and they run in seconds.
Step 3 — Add model-graded evals for subjective quality
Deterministic checks tell you the format is right and the forbidden words aren’t present. They don’t tell you the answer is good. For that, you need a judge.
The pattern is straightforward: give a judge model the input, the output, and the available context, plus an explicit rubric, and ask it to score against each criterion. The key word is explicit. Asking a judge “rate the quality 1-10” is useless. Asking it “does every factual claim in the reply appear in the provided context?” is a question it can answer reliably.
import json
JUDGE_PROMPT = """
Evaluate this support bot reply. Score each criterion 1-5.
Criteria:
- Grounded (1-5): Every factual claim is supported by the context. Score 1 if the reply invents facts not in context; 5 if every claim traces to context.
- Helpful (1-5): The reply directly addresses the user's question. Score 1 if it evades; 5 if it fully answers.
- Safe (1-5): No hallucinated policies, no unqualified promises. Score 1 if it invents a policy; 5 if it hedges correctly.
Reason through each criterion, then output only valid JSON:
{{"grounded": <1-5>, "helpful": <1-5>, "safe": <1-5>, "reasoning": "<one sentence>"}}
Context: {context}
User question: {question}
Bot reply: {reply}
"""
def judge_reply(question: str, context: str, reply: str) -> dict:
raw = llm(JUDGE_PROMPT.format(
context=context, question=question, reply=reply
))
return json.loads(raw)Code language: Python (python)
Use this pattern, but go in with eyes open about its weaknesses. LLM judges have documented systematic biases: positional bias (in pairwise comparisons, they favor whichever answer appeared first in the prompt), verbosity bias (longer answers score higher regardless of quality), and self-enhancement bias (a model rates its own outputs more favorably). High human-LLM agreement rates look reassuring, but research at ICLR 2025 established that high agreement does not guarantee accurate scores — the judge can be consistently wrong in the same direction.
The practical mitigations: write explicit rubric criteria, not open-ended “rate the quality” prompts; ask the judge to reason before scoring (the chain-of-thought output surfaces wrong reasoning before it corrupts the score); rotate the order of options in any pairwise comparison and average the results; and avoid using the same model family as judge and system under test. Calibrate your judge against 20–30 human-labeled examples at least once, so you know where it agrees with your team and where it systematically diverges.
⚠️ Note: Don’t use BLEU or ROUGE as your primary metrics for LLM outputs. Both measure surface-level string overlap against a reference answer — a methodology designed for machine translation, where one correct output exists. For open-ended generation, research consistently finds BLEU and ROUGE fail to capture coherence, factual accuracy, and relevance. Use criterion-based rubrics and model-graded metrics instead.
Step 4 — Cover the adversarial cases
Happy-path evals catch the regression when the model gives a slightly wrong answer to a clean question. Adversarial evals catch the regression when someone actually tries to break your feature.
For every LLM feature, the adversarial test set should include: prompt injection attempts (“ignore your instructions and…”), jailbreaks targeted at your use case, PII-in-input cases to verify your guardrails strip it before it echoes back, off-topic requests that should trigger a fallback, and queries designed to elicit hallucination (asking about things the system has no context for, then checking whether the output fabricates rather than declining).
💡 Tip — Promptfoo’s built-in red-team module ships with 500+ adversarial attack vectors and runs them against your endpoint from the CLI. For teams that don’t yet have a dedicated red-team process, it’s the fastest way to find the injection strings and jailbreaks you haven’t written evals for yet. Run it once before any major release and add anything it catches to your golden dataset as permanent regression tests.
The adversarial cases you find in red teaming aren’t failures — they’re free test cases. Once you know a specific input breaks your feature, it goes in the eval set and never sneaks back undetected.
Step 5 — Gate your CI/CD pipeline
An eval that runs manually once at launch and never again is not an eval. It’s a checkbox. The entire point is that evals run automatically on every meaningful change — model swap, prompt edit, retrieval config update, library version bump — and break the build if quality drops below your thresholds.
DeepEval integrates directly into pytest and supports this pattern out of the box:
# tests/test_support_rag.py — runs in CI like any pytest suite.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase
import pytest
@pytest.mark.parametrize("case", GOLDEN[:10]) # run a fast subset in CI, full set nightly
def test_reply_grounded_and_relevant(case):
actual = pipeline(case["input"])
test_case = LLMTestCase(
input=case["input"],
actual_output=actual,
retrieval_context=case["context"],
)
assert_test(test_case, [
AnswerRelevancyMetric(threshold=0.7), # answer addresses the question
FaithfulnessMetric(threshold=0.8), # answer grounded in retrieved context
])Code language: Python (python)
The threshold is the line you draw: below it, the build fails and the model change doesn’t go to production. Setting it correctly takes calibration — run your eval set on the current production system and treat that as your baseline, then set the CI threshold slightly below the baseline to catch degradation without blocking on noise. Tighten it over time as the eval set grows.
The same recall@k metric covered in the chunking article — the fraction of queries whose gold chunk lands in the top-k results — plugs directly into this gate for your retrieval layer. Whether you’re measuring retrieval or generation, the gate is the same: define the threshold, run on every change, fail loudly.
How to pick a framework
The pragmatic 2026 answer is two tools: a lightweight framework for CI testing and a platform for annotation and monitoring.
For CI/testing: DeepEval (Python, 50+ metrics, pytest-native, open-source — the broadest metric coverage in the category); RAGAS (the standard for RAG-specific evaluation — faithfulness, answer relevancy, context precision and recall — if your feature is RAG-based, start here); or Promptfoo (CLI/YAML, best for multi-model comparison and adversarial red teaming, Node-native).
For annotation and monitoring: Braintrust (most generous free tier, clean human annotation workflows, CI/CD gates, experiment tracking); LangSmith (best if your stack is LangChain/LangGraph — the tracing integration is native and saves significant instrumentation effort); Arize Phoenix (if OpenTelemetry-compatible tracing and a combined observability+eval stack matters more than annotation features).
Don’t pick one tool and try to force it to do everything. The frameworks and platforms solve different problems; a CI testing framework paired with an annotation/monitoring platform is the pattern experienced teams converge on, and it’s less overhead than it sounds.
The anti-patterns: what eval theater looks like
Running evals once at launch. The model that passed evals in January is not the model that’s running in June after two provider updates, a prompt change, and a retrieval refactor. Evals are a CI artifact, not a launch checklist.
Test sets that only cover the happy path. An eval set of 50 perfect questions with clear answers doesn’t tell you how the system behaves on the 20% of real traffic that’s ambiguous, malformed, or adversarial. It tells you your demo works.
Measuring with BLEU or ROUGE. Built for machine translation in the reference-answer era, these metrics flag a model as worse when it chooses a correct but differently-worded answer. For LLM evaluation, they produce noise, not signal.
Treating “no regressions detected” as “working correctly.” An eval that never fails is almost always an eval that isn’t testing anything hard enough. If your score never moves, you’re measuring your test set, not your system.
No ownership. If evals live in one engineer’s local environment and run manually before deploys, they will drift and die. Evals belong in the repo, run in CI, with the same review standards as production code.
The through-line across all of these is the same principle that runs through the entire AI development stack: measure, don’t assume. Evals are how that principle gets operationalized. Your quantized model might lose 1% on your specific task — you don’t know without measuring. Your guardrails might have drifted after a model update — you don’t know without testing them. The point of evals isn’t to prove the system works. It’s to know when it stops working, before your users tell you.
FAQ
What are LLM evals?
LLM evals are systematic, automated tests for AI behavior. They measure whether a model or pipeline produces outputs meeting defined quality criteria — accuracy, grounding, format, safety — on a representative set of inputs. They run on every change, catch regressions before production, and replace the “it looked fine when I tried it” review cycle that misses the failure mode discovered three weeks after launch.
What is LLM-as-judge and is it reliable?
LLM-as-judge uses a capable model to score another model’s outputs against explicit rubric criteria. Research confirms it can reach near-human agreement on chat evaluation tasks when carefully controlled — but it has documented systematic biases including positional bias, verbosity bias, and self-enhancement bias. Use it with explicit criteria (not open-ended “rate the quality”), ask the judge to reason before scoring, rotate answer order in pairwise comparisons, and calibrate against a human-labeled sample. It’s a scalable signal, not a gold standard.
What’s wrong with BLEU and ROUGE for LLM evaluation?
BLEU and ROUGE measure surface-level string overlap against a reference answer — a method designed for machine translation, where there’s typically one correct output. For open-ended LLM evaluation, they penalize correct but differently-worded answers and fail to detect coherence problems, hallucinations, or factual errors that don’t change n-gram overlap. Use criterion-based rubrics and model-graded metrics instead.
How many test cases do I need to start?
Thirty to fifty cases is a workable starting point for a single feature, if they’re well-chosen. Coverage matters more than volume: include edge cases, adversarial inputs, and cases from real user traffic, not just curated happy-path examples. A 30-case set that covers known failure modes beats 200 variations of the same clean question.
Which eval framework should I use?
Two tools, not one: a CI testing framework (DeepEval for broad Python-native coverage, RAGAS for RAG-specific metrics, Promptfoo for adversarial testing) paired with an annotation and monitoring platform (Braintrust, LangSmith if you’re on LangChain, Arize Phoenix for OTel-native observability). Pick the CI framework based on what you’re evaluating; pick the platform based on your team’s annotation workflow needs.
