Here is a scenario every LLM engineer has lived through. You tweak a prompt to fix one bad answer. You test it on three examples. It works. You deploy. The next morning, support tickets roll in because a different set of queries now return nonsense. You had no evaluation pipeline, so you had no way to know that fixing one case broke five others.
Evaluation is the single most important practice that separates production LLM applications from demos. It is also the most neglected, because it is hard. LLM outputs are open-ended text — there is no simple assertEqual that tells you if the answer is correct. But that difficulty is not an excuse to skip it. This lesson gives you concrete tools and code to build a real evaluation system.
Why Evaluation Is Hard (and Non-Negotiable)
Traditional software testing is binary: the function returns the right value or it does not. LLM evaluation exists on a spectrum. An answer can be partially correct, correct but poorly worded, correct but missing context, technically accurate but unhelpful, or completely right but formatted wrong for your use case.
Three reasons you must invest in evaluation:
- Prompt changes have unpredictable side effects. Changing one word in a system prompt can improve answers for one category of queries and degrade answers for another. Without evaluation, you are flying blind.
- Models change under you. API providers update their models. GPT-4o in January is not the same as GPT-4o in June. Regression testing catches silent degradation.
- Stakeholders need numbers. “It seems to work pretty well” does not survive a meeting with your engineering director. “We score 87% on factual accuracy across 200 test cases, up from 82% last sprint” does.
Types of Evaluation
| Method | Speed | Cost | Best For |
|---|---|---|---|
| Exact match | Instant | Free | Classification, extraction, structured output |
| String similarity (fuzzy) | Instant | Free | Short answers with minor variations |
| Semantic similarity | Fast | Low (embedding API) | Paraphrased correct answers |
| LLM-as-judge | Slow | Medium (LLM API) | Open-ended generation, subjective quality |
| Human evaluation | Very slow | High (human time) | Final validation, ambiguous cases |
In practice, you combine methods: exact match for structured outputs, semantic similarity for factual answers, LLM-as-judge for quality scoring, and human eval for periodic calibration.
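The two cheapest rows of the table need almost no machinery. Here is a minimal sketch of exact and fuzzy matching using only the standard library; the lowercasing and whitespace normalization are illustrative choices, not a standard:

```python
from difflib import SequenceMatcher


def exact_match(generated: str, expected: str) -> bool:
    """Strict equality after trivial normalization (case and whitespace)."""
    return generated.strip().lower() == expected.strip().lower()


def fuzzy_match(generated: str, expected: str, threshold: float = 0.9) -> bool:
    """Character-level similarity ratio; tolerates minor wording variations."""
    ratio = SequenceMatcher(
        None, generated.strip().lower(), expected.strip().lower()
    ).ratio()
    return ratio >= threshold


print(exact_match("Refund: yes", "refund: yes"))    # True
print(fuzzy_match("refunds: yes!", "refund: yes"))  # True
```

Exact match suits classification and extraction; the fuzzy variant absorbs trailing punctuation or plural differences, but anything beyond that belongs to the semantic methods below.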
Building Evaluation Datasets
Your eval dataset is the foundation of everything. Get it wrong and your metrics are meaningless.
How Many Examples?
| Use Case | Minimum Examples | Recommended |
|---|---|---|
| Proof of concept | 20-30 | 50 |
| Production launch | 50-100 | 200 |
| Mature product | 200+ | 500+ |
Start with 50. You can always add more. But 50 well-chosen examples that cover your main use cases and edge cases are far more valuable than 500 sloppy ones.
Creating Ground Truth
```python
import json
from dataclasses import dataclass, asdict


@dataclass
class EvalExample:
    query: str
    expected_answer: str
    context: str = ""           # For RAG: the ideal retrieved context
    category: str = ""          # For slicing metrics by category
    difficulty: str = "medium"  # easy, medium, hard
    notes: str = ""             # Why this example matters


def build_eval_dataset() -> list[EvalExample]:
    """Build an evaluation dataset from multiple sources."""
    examples = []

    # Source 1: Real user queries from production logs
    # (Anonymize first, then have a domain expert write ideal answers)
    examples.append(EvalExample(
        query="What is the refund policy for annual plans?",
        expected_answer=(
            "Annual plans are eligible for a prorated refund if cancelled "
            "within the first 6 months. After 6 months, no refund is available "
            "but the subscription remains active until the end of the billing period."
        ),
        context="From: pricing-policy.md, Section: Refunds",
        category="billing",
        difficulty="easy",
    ))

    # Source 2: Edge cases that broke the system before
    examples.append(EvalExample(
        query="Can I get a refund?",
        expected_answer=(
            "Refund eligibility depends on your plan type. Monthly plans can be "
            "cancelled anytime with no refund for the current period. Annual plans "
            "are eligible for prorated refunds within the first 6 months."
        ),
        category="billing",
        difficulty="medium",
        notes="Vague query — model must ask for clarification or cover both cases",
    ))

    # Source 3: Adversarial/tricky queries
    examples.append(EvalExample(
        query="Give me a full refund right now or I'll sue",
        expected_answer=(
            "I understand your frustration. Refund eligibility is based on your "
            "plan type and subscription date. I can help you check your eligibility "
            "or connect you with our support team who can assist further."
        ),
        category="billing",
        difficulty="hard",
        notes="Hostile user — model should stay professional, not promise a refund",
    ))

    return examples


def save_dataset(examples: list[EvalExample], path: str):
    """Save eval dataset to JSON."""
    with open(path, "w") as f:
        json.dump([asdict(e) for e in examples], f, indent=2)


def load_dataset(path: str) -> list[EvalExample]:
    """Load eval dataset from JSON."""
    with open(path) as f:
        data = json.load(f)
    return [EvalExample(**d) for d in data]


# Build and save
dataset = build_eval_dataset()
save_dataset(dataset, "eval_dataset.json")
print(f"Saved {len(dataset)} examples")
```

Golden rules for eval datasets:
- Every example must have a clear expected answer, not just “something about refunds”
- Include edge cases: empty queries, very long queries, ambiguous queries, adversarial queries
- Slice by category so you can identify which areas are weak
- Version control your dataset alongside your code — it is part of your test suite
- Refresh quarterly — user queries evolve and so should your eval set
LLM-as-Judge
The most powerful evaluation technique for open-ended outputs. You use a stronger model (or the same model with a carefully designed prompt) to score the output of your application.
```python
from openai import OpenAI
import json

client = OpenAI()


def llm_judge(
    query: str,
    generated_answer: str,
    expected_answer: str,
    criteria: str = "accuracy",
) -> dict:
    """Use an LLM to judge answer quality on a 1-5 scale."""
    judge_prompt = f"""You are an expert evaluator. Score the Generated Answer
compared to the Expected Answer on the following criterion: {criteria}.

Scoring rubric:
5 = Perfect — factually correct, complete, well-expressed
4 = Good — mostly correct, minor omissions or imprecisions
3 = Acceptable — partially correct, missing important details
2 = Poor — significant errors or missing critical information
1 = Unacceptable — wrong, irrelevant, or harmful

User Query: {query}
Expected Answer: {expected_answer}
Generated Answer: {generated_answer}

Respond in JSON format:
{{"score": <1-5>, "reasoning": "<brief explanation>"}}"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)


# Example
result = llm_judge(
    query="What is the refund policy for annual plans?",
    generated_answer=(
        "You can get a refund within the first 6 months. "
        "After that, your plan stays active until it expires."
    ),
    expected_answer=(
        "Annual plans are eligible for a prorated refund if cancelled within "
        "the first 6 months. After 6 months, no refund is available but the "
        "subscription remains active until the end of the billing period."
    ),
)
print(f"Score: {result['score']}/5")
print(f"Reasoning: {result['reasoning']}")
# Score: 4/5
# Reasoning: Captures the key facts (6-month window, plan stays active)
# but misses that the refund is prorated, not full.
```

Multi-Criteria Judging
For production, score on multiple dimensions:
```python
EVAL_CRITERIA = {
    "accuracy": "Is the information factually correct?",
    "completeness": "Does it cover all key points from the expected answer?",
    "relevance": "Does it directly address the user's question?",
    "tone": "Is the tone appropriate for a customer-facing response?",
    "conciseness": "Is it appropriately concise without unnecessary filler?",
}


def multi_criteria_judge(
    query: str,
    generated_answer: str,
    expected_answer: str,
) -> dict:
    """Score on multiple criteria."""
    scores = {}
    for criterion, description in EVAL_CRITERIA.items():
        result = llm_judge(
            query=query,
            generated_answer=generated_answer,
            expected_answer=expected_answer,
            criteria=f"{criterion}: {description}",
        )
        scores[criterion] = result

    # Compute weighted average
    weights = {
        "accuracy": 0.35,
        "completeness": 0.25,
        "relevance": 0.20,
        "tone": 0.10,
        "conciseness": 0.10,
    }
    weighted_score = sum(
        scores[c]["score"] * weights[c] for c in weights
    )
    scores["weighted_total"] = round(weighted_score, 2)
    return scores


result = multi_criteria_judge(
    query="What is the refund policy?",
    generated_answer="You can get a refund within 6 months.",
    expected_answer="Annual plans offer prorated refunds within 6 months...",
)
for criterion, data in result.items():
    if isinstance(data, dict):
        print(f"  {criterion}: {data['score']}/5 — {data['reasoning']}")
    else:
        print(f"  TOTAL: {data}/5")
```

Reducing judge cost: Each multi-criteria evaluation makes 5 API calls to GPT-4o. For a 200-example dataset, that is 1,000 calls. Two optimizations:
- Batch criteria into one prompt. Instead of five separate calls, ask the judge to score all criteria at once. Less reliable per criterion but 5x cheaper.
- Use a cheaper judge for coarse filtering. Run GPT-4o-mini first to identify obvious failures (score 1-2), then only run GPT-4o on the ambiguous cases (score 3-4).
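To make the first optimization concrete, here is a sketch of a batched judge that scores every criterion in one call. It follows the same client pattern as llm_judge above; the exact prompt wording and the JSON shape are my assumptions, not a fixed API:

```python
import json


def build_batch_judge_prompt(
    query: str,
    generated_answer: str,
    expected_answer: str,
    criteria: dict[str, str],
) -> str:
    """One prompt asking for a 1-5 score per criterion in a single JSON object."""
    criteria_lines = "\n".join(f"- {name}: {desc}" for name, desc in criteria.items())
    return (
        "You are an expert evaluator. Score the Generated Answer against the "
        "Expected Answer on EACH criterion below, from 1 to 5.\n\n"
        f"Criteria:\n{criteria_lines}\n\n"
        f"User Query: {query}\n"
        f"Expected Answer: {expected_answer}\n"
        f"Generated Answer: {generated_answer}\n\n"
        'Respond in JSON: {"scores": {"<criterion>": <1-5>, ...}, '
        '"reasoning": "<brief explanation>"}'
    )


def batched_judge(
    query: str,
    generated_answer: str,
    expected_answer: str,
    criteria: dict[str, str],
) -> dict:
    """Single API call replacing five per-criterion judge calls."""
    # Imported and constructed lazily so this module needs no API key to import
    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": build_batch_judge_prompt(
            query, generated_answer, expected_answer, criteria)}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

The trade-off is the one noted above: one combined rubric gives the judge less room to reason about each criterion separately, so spot-check batched scores against per-criterion scores before relying on them.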
Traditional Metrics (and Why They’re Not Enough)
BLEU, ROUGE, and other NLP metrics compare generated text against reference text at the word/token level. They are fast and free, but they miss semantic equivalence.
```python
from rouge_score import rouge_scorer


def compute_rouge(generated: str, reference: str) -> dict:
    """Compute ROUGE scores between generated and reference text."""
    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeL"], use_stemmer=True
    )
    scores = scorer.score(reference, generated)
    return {
        key: {
            "precision": round(value.precision, 3),
            "recall": round(value.recall, 3),
            "f1": round(value.fmeasure, 3),
        }
        for key, value in scores.items()
    }


# These two answers mean the same thing but ROUGE gives a low score
generated = "Customers on annual billing can receive partial refunds within six months."
reference = "Annual plans are eligible for prorated refunds if cancelled in the first 6 months."

scores = compute_rouge(generated, reference)
for metric, values in scores.items():
    print(f"{metric}: F1={values['f1']}")
# rouge1: F1=0.30 (low! even though meaning is identical)
# rouge2: F1=0.10
# rougeL: F1=0.25
```

When traditional metrics work: extraction tasks where the output should be very close to the reference (entity extraction, summarization with fixed templates). When they fail: any task where paraphrasing is acceptable — which is most LLM tasks.
Semantic Similarity
A better alternative for factual answers: embed both the generated and expected answer, then compute cosine similarity.
```python
import numpy as np
from openai import OpenAI

client = OpenAI()


def semantic_similarity(text_a: str, text_b: str) -> float:
    """Compute semantic similarity between two texts using embeddings."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=[text_a, text_b],
    )
    emb_a = np.array(response.data[0].embedding)
    emb_b = np.array(response.data[1].embedding)
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))


# Same example as above
generated = "Customers on annual billing can receive partial refunds within six months."
reference = "Annual plans are eligible for prorated refunds if cancelled in the first 6 months."

similarity = semantic_similarity(generated, reference)
print(f"Semantic similarity: {similarity:.3f}")  # 0.92 — much better than ROUGE
```

Semantic similarity above 0.85 usually indicates a correct answer. Below 0.70 usually indicates a wrong one. The 0.70-0.85 range is ambiguous and benefits from LLM-as-judge.
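A practical consequence: similarity can serve as a cheap triage layer, accepting high-scoring answers, rejecting low ones, and reserving the judge for the ambiguous band. A sketch, with the cutoffs taken from the heuristic above (the function name is mine):

```python
def triage_by_similarity(
    similarity: float,
    accept: float = 0.85,
    reject: float = 0.70,
) -> str:
    """Route an eval example based on embedding similarity.

    Returns "pass", "fail", or "judge" (escalate to LLM-as-judge).
    """
    if similarity >= accept:
        return "pass"
    if similarity < reject:
        return "fail"
    return "judge"


print(triage_by_similarity(0.92))  # pass
print(triage_by_similarity(0.78))  # judge
print(triage_by_similarity(0.55))  # fail
```

On a typical dataset this sends only the ambiguous minority to the expensive judge, which compounds with the cost optimizations discussed earlier.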
RAG-Specific Evaluation
RAG pipelines have two failure modes: bad retrieval (wrong documents) and bad generation (right documents, wrong answer). You must measure both.
The Four RAG Metrics
| Metric | What It Measures | Failure Example |
|---|---|---|
| Context Precision | Are the retrieved docs relevant? | Retrieved 5 docs but only 1 was about the query topic |
| Context Recall | Did retrieval find all relevant docs? | The key document was ranked #15 and missed top-k |
| Faithfulness | Is the answer grounded in retrieved context? | Model hallucinated facts not in any retrieved doc |
| Answer Correctness | Is the final answer right? | Retrieved right docs but model misinterpreted them |
Using Ragas for RAG Evaluation
Ragas is the standard framework for RAG evaluation. It computes all four metrics using LLM-based assessment.
```python
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_correctness,
)
from datasets import Dataset


def evaluate_rag_pipeline(
    questions: list[str],
    answers: list[str],
    contexts: list[list[str]],
    ground_truths: list[str],
) -> dict:
    """Evaluate a RAG pipeline using Ragas metrics."""
    # Ragas expects a HuggingFace Dataset
    eval_dataset = Dataset.from_dict({
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths,
    })
    result = evaluate(
        dataset=eval_dataset,
        metrics=[
            context_precision,
            context_recall,
            faithfulness,
            answer_correctness,
        ],
    )
    return result


# Example: evaluate your RAG pipeline on 3 questions
questions = [
    "What is the refund policy for annual plans?",
    "How do I reset my API key?",
    "What programming languages are supported?",
]

# These come from your RAG pipeline
answers = [
    "Annual plans offer prorated refunds within the first 6 months.",
    "Go to Settings > API Keys > click Regenerate.",
    "We support Python, JavaScript, Go, and Ruby.",
]

# The retrieved contexts (list of strings per question)
contexts = [
    ["Annual plans: prorated refund within 6 months. After 6 months, no refund."],
    ["API Key Management: Navigate to Settings > API Keys. Click Regenerate."],
    ["Supported Languages: Python, JavaScript, Go, Ruby, and Java SDKs."],
]

# Your ground truth answers
ground_truths = [
    "Annual plans are eligible for a prorated refund within the first 6 months.",
    "Navigate to Settings, then API Keys, and click the Regenerate button.",
    "Python, JavaScript, Go, Ruby, and Java are supported.",
]

results = evaluate_rag_pipeline(questions, answers, contexts, ground_truths)
print(results)
# {'context_precision': 0.95, 'context_recall': 0.90,
#  'faithfulness': 0.93, 'answer_correctness': 0.88}
```

Interpreting Ragas scores:
- Above 0.85 on all four metrics: your pipeline is production-ready
- Faithfulness below 0.80: the model is hallucinating — tighten your prompt
- Context recall below 0.70: your retrieval is missing documents — fix chunking or embeddings
- Context precision below 0.70: too much noise in retrieved docs — add reranking
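These rules are easy to encode so a CI run prints the likely fix rather than a bare score. A sketch using the thresholds above; the helper name and the flat score-dict shape (matching the printed Ragas result) are my assumptions:

```python
def diagnose_ragas(scores: dict[str, float]) -> list[str]:
    """Map Ragas metric scores to remediation hints using the thresholds above."""
    hints = []
    if scores.get("faithfulness", 1.0) < 0.80:
        hints.append("faithfulness < 0.80: model is hallucinating, tighten the prompt")
    if scores.get("context_recall", 1.0) < 0.70:
        hints.append("context_recall < 0.70: retrieval missing documents, fix chunking or embeddings")
    if scores.get("context_precision", 1.0) < 0.70:
        hints.append("context_precision < 0.70: noisy retrieval, add reranking")
    if not hints and all(v >= 0.85 for v in scores.values()):
        hints.append("all metrics >= 0.85: pipeline looks production-ready")
    return hints


print(diagnose_ragas({"context_precision": 0.95, "context_recall": 0.90,
                      "faithfulness": 0.93, "answer_correctness": 0.88}))
```

Printing these hints next to the raw scores in your eval report turns a failing build into an actionable ticket.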
Ragas at Scale
For larger evaluations, Ragas supports async execution and can use different LLMs as the evaluator:
```python
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Use a cheaper model for evaluation at scale
eval_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

result = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_correctness],
    llm=eval_llm,
)
```

Using DeepEval for General Evaluation
DeepEval provides a broader set of metrics and a test-runner interface that integrates with pytest.
```python
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    GEval,
)
from deepeval.test_case import LLMTestCase, LLMTestCaseParams


def run_deepeval(
    queries: list[str],
    outputs: list[str],
    expected_outputs: list[str],
    contexts: list[list[str]] | None = None,
) -> list[dict]:
    """Run DeepEval metrics on a batch of test cases."""
    # Define metrics
    relevancy = AnswerRelevancyMetric(threshold=0.7)

    # Custom G-Eval metric for domain-specific quality
    helpfulness = GEval(
        name="Helpfulness",
        criteria=(
            "Determine whether the actual output is helpful and actionable "
            "for the user. A helpful response directly addresses the query, "
            "provides specific steps or information, and avoids vague generalities."
        ),
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
        ],
        threshold=0.7,
    )

    # Build test cases
    test_cases = []
    for i in range(len(queries)):
        tc = LLMTestCase(
            input=queries[i],
            actual_output=outputs[i],
            expected_output=expected_outputs[i],
            retrieval_context=contexts[i] if contexts else None,
        )
        test_cases.append(tc)

    # Run evaluation
    metrics_list = [relevancy, helpfulness]
    if contexts:
        metrics_list.append(FaithfulnessMetric(threshold=0.7))

    results = []
    for tc in test_cases:
        case_results = {}
        for metric in metrics_list:
            metric.measure(tc)
            case_results[metric.__class__.__name__] = {
                "score": metric.score,
                "passed": metric.is_successful(),
                "reason": metric.reason,
            }
        results.append(case_results)
    return results


# Usage
results = run_deepeval(
    queries=["What is the refund policy?"],
    outputs=["You can get a refund within 6 months."],
    expected_outputs=["Annual plans offer prorated refunds within 6 months."],
    contexts=[["Refund policy: prorated refund within 6 months for annual plans."]],
)
for i, r in enumerate(results):
    print(f"Test case {i}:")
    for metric, data in r.items():
        status = "PASS" if data["passed"] else "FAIL"
        print(f"  {metric}: {data['score']:.2f} [{status}] — {data['reason']}")
```

DeepEval with Pytest
DeepEval integrates with pytest for CI/CD:
```python
# test_llm_quality.py
import json

import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def generate_answer(query: str) -> str:
    """Your LLM application's answer generation."""
    # Replace with your actual pipeline
    from your_app import rag_pipeline
    return rag_pipeline(query)


# Load eval dataset
with open("eval_dataset.json") as f:
    eval_data = json.load(f)


@pytest.mark.parametrize("example", eval_data, ids=[e["query"][:50] for e in eval_data])
def test_answer_quality(example):
    """Test that generated answers meet quality thresholds."""
    generated = generate_answer(example["query"])
    test_case = LLMTestCase(
        input=example["query"],
        actual_output=generated,
        expected_output=example["expected_answer"],
    )
    relevancy = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [relevancy])
```

Run with: `pytest test_llm_quality.py -v`
Regression Testing in CI/CD
The goal: every prompt change, every model upgrade, every pipeline modification gets automatically evaluated before reaching production.
```python
# eval_runner.py — run this in CI
import json
import sys
from datetime import datetime

# Assumes llm_judge (defined earlier in this lesson) is importable here.


def run_regression_eval(
    dataset_path: str,
    output_path: str,
    threshold: float = 0.80,
) -> bool:
    """Run evaluation and check against regression threshold."""
    # Load dataset
    with open(dataset_path) as f:
        dataset = json.load(f)

    # Run your pipeline on each example
    from your_app import generate_answer

    results = []
    for example in dataset:
        generated = generate_answer(example["query"])
        score = llm_judge(
            query=example["query"],
            generated_answer=generated,
            expected_answer=example["expected_answer"],
        )
        results.append({
            "query": example["query"],
            "category": example.get("category", "general"),
            "score": score["score"],
            "reasoning": score["reasoning"],
            "generated": generated,
        })

    # Compute aggregate metrics
    scores = [r["score"] for r in results]
    avg_score = sum(scores) / len(scores)
    pass_rate = sum(1 for s in scores if s >= 4) / len(scores)

    # Compute per-category metrics
    categories = set(r["category"] for r in results)
    category_metrics = {}
    for cat in categories:
        cat_scores = [r["score"] for r in results if r["category"] == cat]
        category_metrics[cat] = {
            "avg_score": round(sum(cat_scores) / len(cat_scores), 2),
            "count": len(cat_scores),
        }

    # Save results
    report = {
        "timestamp": datetime.utcnow().isoformat(),
        "dataset_size": len(dataset),
        "avg_score": round(avg_score, 2),
        "pass_rate": round(pass_rate, 3),
        "category_metrics": category_metrics,
        "threshold": threshold,
        "passed": avg_score >= threshold * 5,  # Convert to 1-5 scale
        "results": results,
    }
    with open(output_path, "w") as f:
        json.dump(report, f, indent=2)

    print("\nEvaluation Report")
    print("=" * 50)
    print(f"Examples evaluated: {len(dataset)}")
    print(f"Average score: {avg_score:.2f}/5")
    print(f"Pass rate (>=4/5): {pass_rate:.1%}")
    print(f"Threshold: {threshold * 5:.2f}/5")
    print("\nPer-category:")
    for cat, metrics in sorted(category_metrics.items()):
        print(f"  {cat}: {metrics['avg_score']}/5 (n={metrics['count']})")

    if not report["passed"]:
        print(f"\nFAILED: Average score {avg_score:.2f} < threshold {threshold * 5:.2f}")
        return False

    print("\nPASSED")
    return True


if __name__ == "__main__":
    success = run_regression_eval(
        dataset_path="eval_dataset.json",
        output_path="eval_results.json",
        threshold=0.80,
    )
    sys.exit(0 if success else 1)
```

GitHub Actions Integration
```yaml
# .github/workflows/llm-eval.yml
name: LLM Evaluation

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/llm/**'
      - 'eval_dataset.json'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run LLM evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python eval_runner.py

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval_results.json

      - name: Comment on PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('eval_results.json'));
            const body = `## LLM Evaluation Results
            | Metric | Value |
            |--------|-------|
            | Average Score | ${results.avg_score}/5 |
            | Pass Rate | ${(results.pass_rate * 100).toFixed(1)}% |
            | Status | ${results.passed ? '✅ PASSED' : '❌ FAILED'} |
            `;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });
```

A/B Testing Prompts in Production
Sometimes evaluation datasets are not enough — you need to test with real users on real queries.
```python
import hashlib
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ABTestConfig:
    name: str
    variant_a_prompt: str
    variant_b_prompt: str
    traffic_split: float = 0.5  # 50/50 by default
    start_date: str = ""
    end_date: str = ""


class PromptABTest:
    """Simple A/B testing for prompts."""

    def __init__(self, config: ABTestConfig):
        self.config = config
        self.results: list[dict] = []

    def get_variant(self, user_id: str) -> str:
        """Deterministically assign user to variant (consistent hashing)."""
        hash_val = int(hashlib.md5(
            f"{self.config.name}:{user_id}".encode()
        ).hexdigest(), 16)
        if (hash_val % 100) / 100 < self.config.traffic_split:
            return "A"
        return "B"

    def get_prompt(self, user_id: str) -> str:
        """Get the prompt for this user's variant."""
        variant = self.get_variant(user_id)
        if variant == "A":
            return self.config.variant_a_prompt
        return self.config.variant_b_prompt

    def log_result(
        self,
        user_id: str,
        query: str,
        response: str,
        feedback: int | None = None,  # 1 = thumbs up, -1 = thumbs down
        latency_ms: float = 0,
    ):
        """Log a result for analysis."""
        self.results.append({
            "user_id": user_id,
            "variant": self.get_variant(user_id),
            "query": query,
            "response": response,
            "feedback": feedback,
            "latency_ms": latency_ms,
            "timestamp": datetime.utcnow().isoformat(),
        })

    def analyze(self) -> dict:
        """Compute metrics per variant."""
        analysis = {}
        for variant in ["A", "B"]:
            variant_results = [r for r in self.results if r["variant"] == variant]
            feedbacks = [r["feedback"] for r in variant_results if r["feedback"]]
            latencies = [r["latency_ms"] for r in variant_results if r["latency_ms"]]
            positive = sum(1 for f in feedbacks if f == 1)
            total_feedback = len(feedbacks)
            analysis[variant] = {
                "total_queries": len(variant_results),
                "feedback_count": total_feedback,
                "positive_rate": positive / total_feedback if total_feedback else 0,
                "avg_latency_ms": sum(latencies) / len(latencies) if latencies else 0,
            }
        return analysis


# Usage
test = PromptABTest(ABTestConfig(
    name="system_prompt_v2_test",
    variant_a_prompt="You are a helpful customer support agent. Answer concisely.",
    variant_b_prompt=(
        "You are a customer support agent for Acme Corp. "
        "Answer questions using only the provided context. "
        "If you don't know, say so. Be specific and cite sources."
    ),
    traffic_split=0.5,
))

# In your API handler:
# prompt = test.get_prompt(user_id)
# response = generate(query, system_prompt=prompt)
# test.log_result(user_id, query, response, feedback=user_feedback)
```

Human Evaluation
Automated metrics get you 80% of the way there. The last 20% requires human judgment, especially for subjective qualities like tone, helpfulness, and appropriateness.
Evaluation Rubric
Define a rubric so evaluators are consistent:
```python
HUMAN_EVAL_RUBRIC = {
    "accuracy": {
        5: "All facts are correct. No fabricated information.",
        4: "Mostly correct. One minor inaccuracy that doesn't change the answer.",
        3: "Partially correct. Contains a meaningful error but shows understanding.",
        2: "Mostly incorrect. Key facts are wrong.",
        1: "Completely wrong or fabricated.",
    },
    "helpfulness": {
        5: "Directly and completely answers the question. Actionable.",
        4: "Answers the question with minor gaps. Mostly actionable.",
        3: "Partially answers the question. User would need follow-up.",
        2: "Vaguely related but doesn't answer the question.",
        1: "Irrelevant, evasive, or unhelpful.",
    },
    "safety": {
        5: "Appropriate, professional, no harmful content.",
        4: "Appropriate with minor tone issues.",
        3: "Borderline — could be misinterpreted.",
        2: "Contains problematic content or bias.",
        1: "Harmful, offensive, or dangerous.",
    },
}


def create_human_eval_batch(
    examples: list[dict],
    output_path: str,
):
    """Create a spreadsheet-ready batch for human evaluators."""
    import csv

    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([
            "example_id", "query", "generated_answer", "expected_answer",
            "accuracy_score", "helpfulness_score", "safety_score", "notes",
        ])
        for i, ex in enumerate(examples):
            writer.writerow([
                i, ex["query"], ex["generated"], ex["expected"],
                "", "", "", "",  # To be filled by evaluator
            ])
    print(f"Created eval batch with {len(examples)} examples at {output_path}")
```

Inter-Annotator Agreement
When multiple humans evaluate the same examples, measure how much they agree. Low agreement means your rubric is ambiguous.
```python
from collections import Counter


def compute_agreement(ratings_a: list[int], ratings_b: list[int]) -> dict:
    """Compute inter-annotator agreement metrics."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)

    # Exact agreement
    exact_match = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b) / n

    # Within-1 agreement (scores differ by at most 1)
    within_one = sum(
        1 for a, b in zip(ratings_a, ratings_b) if abs(a - b) <= 1
    ) / n

    # Cohen's Kappa (chance-corrected agreement)
    observed_agreement = exact_match
    # Expected agreement by chance
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    all_values = set(ratings_a) | set(ratings_b)
    expected_agreement = sum(
        (count_a[v] / n) * (count_b[v] / n) for v in all_values
    )
    kappa = (observed_agreement - expected_agreement) / (1 - expected_agreement) \
        if expected_agreement < 1 else 1.0

    return {
        "exact_agreement": round(exact_match, 3),
        "within_one_agreement": round(within_one, 3),
        "cohens_kappa": round(kappa, 3),
    }


# Example
evaluator_1 = [5, 4, 3, 5, 2, 4, 3, 5, 4, 4]
evaluator_2 = [5, 4, 4, 5, 3, 4, 3, 4, 4, 5]

agreement = compute_agreement(evaluator_1, evaluator_2)
print(f"Exact agreement: {agreement['exact_agreement']:.1%}")
print(f"Within-1: {agreement['within_one_agreement']:.1%}")
print(f"Cohen's Kappa: {agreement['cohens_kappa']:.3f}")
# Exact agreement: 60.0%
# Within-1: 100.0%
# Cohen's Kappa: 0.403
```

Interpreting Kappa: Below 0.20 is poor (rubric needs work), 0.20-0.40 is fair, 0.40-0.60 is moderate, 0.60-0.80 is substantial, above 0.80 is excellent.
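If you track Kappa across eval rounds, a tiny helper keeps the interpretation consistent. The bands mirror the sentence above (the common Landis-Koch convention); the function name is mine:

```python
def interpret_kappa(kappa: float) -> str:
    """Label a Cohen's Kappa value using the agreement bands described above."""
    if kappa < 0.20:
        return "poor (rubric needs work)"
    if kappa < 0.40:
        return "fair"
    if kappa < 0.60:
        return "moderate"
    if kappa < 0.80:
        return "substantial"
    return "excellent"


print(interpret_kappa(0.45))  # moderate
```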
Cost of Evaluation
Evaluation is not free. Here is a realistic cost breakdown for a 200-example eval dataset:
| Method | Per Example | 200 Examples | Time |
|---|---|---|---|
| Exact match | $0 | $0 | < 1 second |
| Semantic similarity | ~$0.0001 | ~$0.02 | ~10 seconds |
| LLM-as-judge (GPT-4o-mini) | ~$0.002 | ~$0.40 | ~5 minutes |
| LLM-as-judge (GPT-4o) | ~$0.02 | ~$4.00 | ~10 minutes |
| Multi-criteria (5x GPT-4o) | ~$0.10 | ~$20.00 | ~45 minutes |
| Human evaluation | ~$0.50-2.00 | ~$100-400 | ~4-8 hours |
Budget strategy: Run cheap metrics (exact match, semantic similarity) on every PR. Run LLM-as-judge on daily builds. Run human evaluation monthly or before major releases.
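The table's arithmetic generalizes to a quick budget estimate for any dataset size. The unit costs below are the rough per-example figures from the table, not provider pricing:

```python
# Rough per-example unit costs (USD) from the table above; illustrative only.
UNIT_COSTS = {
    "exact_match": 0.0,
    "semantic_similarity": 0.0001,
    "judge_gpt4o_mini": 0.002,
    "judge_gpt4o": 0.02,
    "multi_criteria_gpt4o": 0.10,
}


def estimate_eval_cost(n_examples: int, methods: list[str]) -> float:
    """Estimate the total cost of running the given methods over a dataset."""
    return round(sum(UNIT_COSTS[m] * n_examples for m in methods), 2)


print(estimate_eval_cost(200, ["semantic_similarity", "judge_gpt4o"]))  # 4.02
```

Run this against your actual dataset size before wiring LLM-as-judge into every PR; it is an easy way to decide which tier of the budget strategy each pipeline stage belongs in.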
Building a Complete Eval Harness
Here is a production-ready eval harness that combines everything:
```python
import json
import time
from dataclasses import dataclass, field, asdict
from datetime import datetime
from pathlib import Path
from typing import Callable

# Assumes semantic_similarity, llm_judge, and compute_rouge from earlier
# in this lesson are importable/defined.


@dataclass
class EvalResult:
    query: str
    expected: str
    generated: str
    scores: dict = field(default_factory=dict)
    latency_ms: float = 0
    category: str = ""
    passed: bool = False


class EvalHarness:
    """Production evaluation harness for LLM applications."""

    def __init__(
        self,
        generator: Callable[[str], str],  # Your LLM pipeline
        dataset_path: str,
        output_dir: str = "eval_results",
    ):
        self.generator = generator
        self.dataset = self._load_dataset(dataset_path)
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)

    def _load_dataset(self, path: str) -> list[dict]:
        with open(path) as f:
            return json.load(f)

    def run(
        self,
        metrics: list[str] | None = None,
        pass_threshold: float = 0.80,
    ) -> dict:
        """Run full evaluation suite."""
        if metrics is None:
            metrics = ["semantic_similarity", "llm_judge"]

        results: list[EvalResult] = []
        for i, example in enumerate(self.dataset):
            print(f"Evaluating {i + 1}/{len(self.dataset)}: "
                  f"{example['query'][:50]}...")

            # Generate answer
            start = time.time()
            generated = self.generator(example["query"])
            latency = (time.time() - start) * 1000

            result = EvalResult(
                query=example["query"],
                expected=example["expected_answer"],
                generated=generated,
                latency_ms=round(latency, 1),
                category=example.get("category", "general"),
            )

            # Run selected metrics
            if "semantic_similarity" in metrics:
                sim = semantic_similarity(generated, example["expected_answer"])
                result.scores["semantic_similarity"] = round(sim, 3)

            if "llm_judge" in metrics:
                judgment = llm_judge(
                    query=example["query"],
                    generated_answer=generated,
                    expected_answer=example["expected_answer"],
                )
                result.scores["llm_judge"] = judgment["score"]
                result.scores["judge_reasoning"] = judgment["reasoning"]

            if "rouge" in metrics:
                rouge = compute_rouge(generated, example["expected_answer"])
                result.scores["rouge_l_f1"] = rouge["rougeL"]["f1"]

            # Determine pass/fail
            if "llm_judge" in result.scores:
                result.passed = result.scores["llm_judge"] >= 4
            elif "semantic_similarity" in result.scores:
                result.passed = result.scores["semantic_similarity"] >= 0.85

            results.append(result)

        # Compute aggregates
        report = self._build_report(results, pass_threshold)

        # Save
        timestamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S")
        report_path = self.output_dir / f"eval_{timestamp}.json"
        with open(report_path, "w") as f:
            json.dump(report, f, indent=2, default=str)

        self._print_report(report)
        return report

    def _build_report(self, results: list[EvalResult], threshold: float) -> dict:
        """Build aggregate report from individual results."""
        scores_by_metric = {}
        for r in results:
            for metric, value in r.scores.items():
                if isinstance(value, (int, float)):
                    scores_by_metric.setdefault(metric, []).append(value)

        aggregates = {}
        for metric, values in scores_by_metric.items():
            aggregates[metric] = {
                "mean": round(sum(values) / len(values), 3),
                "min": round(min(values), 3),
                "max": round(max(values), 3),
            }

        # Per-category breakdown
        categories = set(r.category for r in results)
        category_breakdown = {}
        for cat in sorted(categories):
            cat_results = [r for r in results if r.category == cat]
            cat_scores = [r.scores.get("llm_judge", 0) for r in cat_results]
            category_breakdown[cat] = {
                "count": len(cat_results),
                "avg_score": round(sum(cat_scores) / len(cat_scores), 2)
                if cat_scores else 0,
                "pass_rate": round(
                    sum(1 for r in cat_results if r.passed) / len(cat_results), 3
                ),
            }

        pass_rate = sum(1 for r in results if r.passed) / len(results)
        avg_latency = sum(r.latency_ms for r in results) / len(results)

        return {
            "timestamp": datetime.utcnow().isoformat(),
            "total_examples": len(results),
            "pass_rate": round(pass_rate, 3),
            "passed_threshold": pass_rate >= threshold,
            "threshold": threshold,
            "avg_latency_ms": round(avg_latency, 1),
            "aggregates": aggregates,
            "category_breakdown": category_breakdown,
            "results": [asdict(r) for r in results],
        }

    def _print_report(self, report: dict):
        """Print a human-readable evaluation report."""
        print(f"\n{'=' * 60}")
        print("EVALUATION REPORT")
        print(f"{'=' * 60}")
        print(f"Examples: {report['total_examples']}")
        print(f"Pass rate: {report['pass_rate']:.1%}")
        print(f"Threshold: {report['threshold']:.1%}")
        print(f"Status: {'PASSED' if report['passed_threshold'] else 'FAILED'}")
        print(f"Avg latency: {report['avg_latency_ms']:.0f}ms")
        print("\nMetric Averages:")
        for metric, stats in report["aggregates"].items():
            print(f"  {metric}: {stats['mean']:.3f} "
                  f"(min={stats['min']}, max={stats['max']})")
        print("\nCategory Breakdown:")
        for cat, metrics in report["category_breakdown"].items():
            print(f"  {cat}: {metrics['avg_score']}/5 "
                  f"({metrics['pass_rate']:.0%} pass, n={metrics['count']})")
        print(f"{'=' * 60}\n")


# Usage
def my_rag_pipeline(query: str) -> str:
    """Your actual RAG pipeline."""
    # ... retrieval + generation logic ...
    pass


harness = EvalHarness(
    generator=my_rag_pipeline,
    dataset_path="eval_dataset.json",
    output_dir="eval_results",
)
report = harness.run(
    metrics=["semantic_similarity", "llm_judge"],
    pass_threshold=0.80,
)
```
)
if not report["passed_threshold"]:
    print("Evaluation FAILED — do not deploy this change.")

Common Mistakes
1. Evaluating on training examples. If your eval dataset includes examples you used to develop the prompt, your scores are inflated. Keep a held-out test set that you never look at during development.
2. Binary scoring. “Correct” or “incorrect” loses too much information. A response that is 80% correct should score differently from one that is 20% correct. Use the 1-5 scale.
3. Ignoring edge cases. If 95% of your eval dataset is “easy” queries, you will report 95% accuracy while completely failing on the 5% of hard queries that matter most to users.
4. Not versioning your eval dataset. If the dataset changes between evaluations, you cannot compare results. Version it alongside your code.
5. Evaluating only the final answer. In a RAG pipeline, the final answer depends on retrieval quality. If you only measure the answer, you will not know whether a failure was caused by bad retrieval or bad generation. Measure both.
6. Running evaluation manually. If evaluation requires someone to remember to run it, it will not happen. Automate it in CI.
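Mistake #1 can be avoided mechanically. Here is a minimal sketch (an illustration, not part of the harness above) that splits an eval dataset into a dev set and a held-out test set by hashing each query, so an example never migrates between splits as the dataset grows:

```python
import hashlib

def split_dataset(examples: list[dict], test_fraction: float = 0.3):
    """Deterministically split eval examples into dev and held-out sets.

    Hashing the query (instead of random shuffling) assigns each example
    to the same split on every run, so the held-out set stays held out
    even as new examples are added to the dataset.
    """
    dev, held_out = [], []
    for ex in examples:
        digest = hashlib.sha256(ex["query"].encode("utf-8")).hexdigest()
        bucket = int(digest[:8], 16) % 100  # stable 0-99 bucket per query
        (held_out if bucket < test_fraction * 100 else dev).append(ex)
    return dev, held_out
```

Tune prompts against the dev split only; score the held-out split only when reporting results.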
Key Takeaways
- Build an eval dataset before writing your first prompt. 50 examples minimum. Include easy, medium, and hard queries across all your categories.
- LLM-as-judge is the most practical automated evaluation. Use GPT-4o to judge your application’s output on a 1-5 scale with explicit criteria. It correlates well with human judgment.
- For RAG, measure retrieval and generation separately. Use Ragas metrics: context precision, context recall, faithfulness, answer correctness. Fix retrieval first — bad retrieval makes generation impossible.
- Automate regression testing in CI. Every prompt change should trigger evaluation. If accuracy drops below your threshold, block the deployment.
- ROUGE and BLEU are not enough for LLMs. They measure word overlap, not semantic correctness. Use semantic similarity or LLM-as-judge instead.
- Human evaluation calibrates everything else. Run it quarterly. Use a rubric. Measure inter-annotator agreement. If annotators disagree, your rubric is broken.
- Evaluation costs money but saves more. A bad answer that reaches production costs more in user trust than $20 worth of GPT-4o evaluation calls.
- Version your eval dataset like code. Changes to the dataset change your metrics. Track both in source control.
- Slice metrics by category. An overall 85% pass rate can hide a 40% pass rate in your most important category. Always look at the breakdown.
- Eval is not a one-time setup. Queries evolve, models change, edge cases surface. Budget ongoing time for maintaining and expanding your evaluation pipeline.
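The CI takeaway can be sketched as a small gate that compares the latest report against a stored baseline and blocks deployment on regression, not just on an absolute threshold. The file paths and the `max_drop` tolerance below are assumptions for illustration; the reports are assumed to contain the `pass_rate` field the harness writes:

```python
import json

def check_regression(current_path: str, baseline_path: str,
                     max_drop: float = 0.02) -> bool:
    """Return True if the latest eval report is acceptable vs the baseline.

    Returns False (block the deploy) when the pass rate fell by more
    than `max_drop`, catching silent regressions from prompt or model
    changes even while the absolute threshold still passes.
    """
    with open(current_path) as f:
        current = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)
    drop = baseline["pass_rate"] - current["pass_rate"]
    if drop > max_drop:
        print(f"Regression: pass rate fell {drop:.1%} vs baseline")
        return False
    return True
```

In CI, exit nonzero when this returns False, and update the baseline file only on deliberate, reviewed changes.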