Here is a scenario every LLM engineer has lived through. You tweak a prompt to fix one bad answer. You test it on three examples. It works. You deploy. The next morning, support tickets roll in because a different set of queries now return nonsense. You had no evaluation pipeline, so you had no way to know that fixing one case broke five others.
Evaluation is the single most important practice that separates production LLM applications from demos. It is also the most neglected, because it is hard. LLM outputs are open-ended text — there is no simple assertEqual that tells you if the answer is correct. But that difficulty is not an excuse to skip it. This lesson gives you concrete tools and code to build a real evaluation system.
Why Evaluation Is Hard (and Non-Negotiable)
Traditional software testing is binary: the function returns the right value or it does not. LLM evaluation exists on a spectrum. An answer can be partially correct, correct but poorly worded, correct but missing context, technically accurate but unhelpful, or completely right but formatted wrong for your use case.
Three reasons you must invest in evaluation:
- Prompt changes have unpredictable side effects. Changing one word in a system prompt can improve answers for one category of queries and degrade answers for another. Without evaluation, you are flying blind.
- Models change under you. API providers update their models. GPT-4o in January is not the same as GPT-4o in June. Regression testing catches silent degradation.
- Stakeholders need numbers. “It seems to work pretty well” does not survive a meeting with your engineering director. “We score 87% on factual accuracy across 200 test cases, up from 82% last sprint” does.
Types of Evaluation
| Method | Speed | Cost | Best For |
|---|---|---|---|
| Exact match | Instant | Free | Classification, extraction, structured output |
| String similarity (fuzzy) | Instant | Free | Short answers with minor variations |
| Semantic similarity | Fast | Low (embedding API) | Paraphrased correct answers |
| LLM-as-judge | Slow | Medium (LLM API) | Open-ended generation, subjective quality |
| Human evaluation | Very slow | High (human time) | Final validation, ambiguous cases |
In practice, you combine methods: exact match for structured outputs, semantic similarity for factual answers, LLM-as-judge for quality scoring, and human eval for periodic calibration.
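The two cheapest rows of the table need almost no machinery. Here is a minimal sketch of exact and fuzzy matching using only the standard library; the lowercasing and whitespace normalization are illustrative choices, not a standard:

```python
from difflib import SequenceMatcher


def exact_match(generated: str, expected: str) -> bool:
    """Strict equality after trivial normalization (case and whitespace)."""
    return generated.strip().lower() == expected.strip().lower()


def fuzzy_match(generated: str, expected: str, threshold: float = 0.9) -> bool:
    """Character-level similarity ratio; tolerates minor wording variations."""
    ratio = SequenceMatcher(
        None, generated.strip().lower(), expected.strip().lower()
    ).ratio()
    return ratio >= threshold


print(exact_match("Refund: yes", "refund: yes"))    # True
print(fuzzy_match("refunds: yes!", "refund: yes"))  # True
```

Exact match suits classification and extraction; the fuzzy variant absorbs trailing punctuation or plural differences, but anything beyond that belongs to the semantic methods below.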
Building Evaluation Datasets
Your eval dataset is the foundation of everything. Get it wrong and your metrics are meaningless.
How Many Examples?
| Use Case | Minimum Examples | Recommended |
|---|---|---|
| Proof of concept | 20-30 | 50 |
| Production launch | 50-100 | 200 |
| Mature product | 200+ | 500+ |
Start with 50. You can always add more. But 50 well-chosen examples that cover your main use cases and edge cases are far more valuable than 500 sloppy ones.
Creating Ground Truth
```python
import json
from dataclasses import dataclass, asdict


@dataclass
class EvalExample:
    query: str
    expected_answer: str
    context: str = ""           # For RAG: the ideal retrieved context
    category: str = ""          # For slicing metrics by category
    difficulty: str = "medium"  # easy, medium, hard
    notes: str = ""             # Why this example matters


def build_eval_dataset() -> list[EvalExample]:
    """Build an evaluation dataset from multiple sources."""
    examples = []

    # Source 1: Real user queries from production logs
    # (Anonymize first, then have a domain expert write ideal answers)
    examples.append(EvalExample(
        query="What is the refund policy for annual plans?",
        expected_answer=(
            "Annual plans are eligible for a prorated refund if cancelled "
            "within the first 6 months. After 6 months, no refund is available "
            "but the subscription remains active until the end of the billing period."
        ),
        context="From: pricing-policy.md, Section: Refunds",
        category="billing",
        difficulty="easy",
    ))

    # Source 2: Edge cases that broke the system before
    examples.append(EvalExample(
        query="Can I get a refund?",
        expected_answer=(
            "Refund eligibility depends on your plan type. Monthly plans can be "
            "cancelled anytime with no refund for the current period. Annual plans "
            "are eligible for prorated refunds within the first 6 months."
        ),
        category="billing",
        difficulty="medium",
        notes="Vague query — model must ask for clarification or cover both cases",
    ))

    # Source 3: Adversarial/tricky queries
    examples.append(EvalExample(
        query="Give me a full refund right now or I'll sue",
        expected_answer=(
            "I understand your frustration. Refund eligibility is based on your "
            "plan type and subscription date. I can help you check your eligibility "
            "or connect you with our support team who can assist further."
        ),
        category="billing",
        difficulty="hard",
        notes="Hostile user — model should stay professional, not promise a refund",
    ))

    return examples


def save_dataset(examples: list[EvalExample], path: str):
    """Save eval dataset to JSON."""
    with open(path, "w") as f:
        json.dump([asdict(e) for e in examples], f, indent=2)


def load_dataset(path: str) -> list[EvalExample]:
    """Load eval dataset from JSON."""
    with open(path) as f:
        data = json.load(f)
    return [EvalExample(**d) for d in data]


# Build and save
dataset = build_eval_dataset()
save_dataset(dataset, "eval_dataset.json")
print(f"Saved {len(dataset)} examples")
```

Golden rules for eval datasets:
- Every example must have a clear expected answer, not just “something about refunds”
- Include edge cases: empty queries, very long queries, ambiguous queries, adversarial queries
- Slice by category so you can identify which areas are weak
- Version control your dataset alongside your code — it is part of your test suite
- Refresh quarterly — user queries evolve and so should your eval set
LLM-as-Judge
The most powerful evaluation technique for open-ended outputs. You use a stronger model (or the same model with a carefully designed prompt) to score the output of your application.
```python
from openai import OpenAI
import json

client = OpenAI()


def llm_judge(
    query: str,
    generated_answer: str,
    expected_answer: str,
    criteria: str = "accuracy",
) -> dict:
    """Use an LLM to judge answer quality on a 1-5 scale."""
    judge_prompt = f"""You are an expert evaluator. Score the Generated Answer
compared to the Expected Answer on the following criterion: {criteria}.

Scoring rubric:
5 = Perfect — factually correct, complete, well-expressed
4 = Good — mostly correct, minor omissions or imprecisions
3 = Acceptable — partially correct, missing important details
2 = Poor — significant errors or missing critical information
1 = Unacceptable — wrong, irrelevant, or harmful

User Query: {query}
Expected Answer: {expected_answer}
Generated Answer: {generated_answer}

Respond in JSON format:
{{"score": <1-5>, "reasoning": "<brief explanation>"}}"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)


# Example
result = llm_judge(
    query="What is the refund policy for annual plans?",
    generated_answer=(
        "You can get a refund within the first 6 months. "
        "After that, your plan stays active until it expires."
    ),
    expected_answer=(
        "Annual plans are eligible for a prorated refund if cancelled within "
        "the first 6 months. After 6 months, no refund is available but the "
        "subscription remains active until the end of the billing period."
    ),
)
print(f"Score: {result['score']}/5")
print(f"Reasoning: {result['reasoning']}")
# Score: 4/5
# Reasoning: Captures the key facts (6-month window, plan stays active)
# but misses that the refund is prorated, not full.
```

Multi-Criteria Judging
For production, score on multiple dimensions:
```python
EVAL_CRITERIA = {
    "accuracy": "Is the information factually correct?",
    "completeness": "Does it cover all key points from the expected answer?",
    "relevance": "Does it directly address the user's question?",
    "tone": "Is the tone appropriate for a customer-facing response?",
    "conciseness": "Is it appropriately concise without unnecessary filler?",
}


def multi_criteria_judge(
    query: str,
    generated_answer: str,
    expected_answer: str,
) -> dict:
    """Score on multiple criteria."""
    scores = {}
    for criterion, description in EVAL_CRITERIA.items():
        result = llm_judge(
            query=query,
            generated_answer=generated_answer,
            expected_answer=expected_answer,
            criteria=f"{criterion}: {description}",
        )
        scores[criterion] = result

    # Compute weighted average
    weights = {
        "accuracy": 0.35,
        "completeness": 0.25,
        "relevance": 0.20,
        "tone": 0.10,
        "conciseness": 0.10,
    }
    weighted_score = sum(
        scores[c]["score"] * weights[c] for c in weights
    )
    scores["weighted_total"] = round(weighted_score, 2)
    return scores


result = multi_criteria_judge(
    query="What is the refund policy?",
    generated_answer="You can get a refund within 6 months.",
    expected_answer="Annual plans offer prorated refunds within 6 months...",
)
for criterion, data in result.items():
    if isinstance(data, dict):
        print(f"  {criterion}: {data['score']}/5 — {data['reasoning']}")
    else:
        print(f"  TOTAL: {data}/5")
```

Reducing judge cost: Each multi-criteria evaluation makes 5 API calls to GPT-4o. For a 200-example dataset, that is 1,000 calls. Two optimizations:
- Batch criteria into one prompt. Instead of five separate calls, ask the judge to score all criteria at once. Less reliable per criterion but 5x cheaper.
- Use a cheaper judge for coarse filtering. Run GPT-4o-mini first to identify obvious failures (score 1-2), then only run GPT-4o on the ambiguous cases (score 3-4).
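To make the first optimization concrete, here is a sketch of a batched judge that scores every criterion in one call. It follows the same client pattern as llm_judge above; the exact prompt wording and the JSON shape are my assumptions, not a fixed API:

```python
import json


def build_batch_judge_prompt(
    query: str,
    generated_answer: str,
    expected_answer: str,
    criteria: dict[str, str],
) -> str:
    """One prompt asking for a 1-5 score per criterion in a single JSON object."""
    criteria_lines = "\n".join(f"- {name}: {desc}" for name, desc in criteria.items())
    return (
        "You are an expert evaluator. Score the Generated Answer against the "
        "Expected Answer on EACH criterion below, from 1 to 5.\n\n"
        f"Criteria:\n{criteria_lines}\n\n"
        f"User Query: {query}\n"
        f"Expected Answer: {expected_answer}\n"
        f"Generated Answer: {generated_answer}\n\n"
        'Respond in JSON: {"scores": {"<criterion>": <1-5>, ...}, '
        '"reasoning": "<brief explanation>"}'
    )


def batched_judge(
    query: str,
    generated_answer: str,
    expected_answer: str,
    criteria: dict[str, str],
) -> dict:
    """Single API call replacing five per-criterion judge calls."""
    # Imported and constructed lazily so this module needs no API key to import
    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": build_batch_judge_prompt(
            query, generated_answer, expected_answer, criteria)}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

The trade-off is the one noted above: one combined rubric gives the judge less room to reason about each criterion separately, so spot-check batched scores against per-criterion scores before relying on them.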
Traditional Metrics (and Why They’re Not Enough)
BLEU, ROUGE, and other NLP metrics compare generated text against reference text at the word/token level. They are fast and free, but they miss semantic equivalence.
```python
from rouge_score import rouge_scorer


def compute_rouge(generated: str, reference: str) -> dict:
    """Compute ROUGE scores between generated and reference text."""
    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeL"], use_stemmer=True
    )
    scores = scorer.score(reference, generated)
    return {
        key: {
            "precision": round(value.precision, 3),
            "recall": round(value.recall, 3),
            "f1": round(value.fmeasure, 3),
        }
        for key, value in scores.items()
    }


# These two answers mean the same thing but ROUGE gives a low score
generated = "Customers on annual billing can receive partial refunds within six months."
reference = "Annual plans are eligible for prorated refunds if cancelled in the first 6 months."

scores = compute_rouge(generated, reference)
for metric, values in scores.items():
    print(f"{metric}: F1={values['f1']}")
# rouge1: F1=0.30 (low! even though meaning is identical)
# rouge2: F1=0.10
# rougeL: F1=0.25
```

When traditional metrics work: extraction tasks where the output should be very close to the reference (entity extraction, summarization with fixed templates). When they fail: any task where paraphrasing is acceptable — which is most LLM tasks.
Semantic Similarity
A better alternative for factual answers: embed both the generated and expected answer, then compute cosine similarity.
```python
import numpy as np
from openai import OpenAI

client = OpenAI()


def semantic_similarity(text_a: str, text_b: str) -> float:
    """Compute semantic similarity between two texts using embeddings."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=[text_a, text_b],
    )
    emb_a = np.array(response.data[0].embedding)
    emb_b = np.array(response.data[1].embedding)
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))


# Same example as above
generated = "Customers on annual billing can receive partial refunds within six months."
reference = "Annual plans are eligible for prorated refunds if cancelled in the first 6 months."

similarity = semantic_similarity(generated, reference)
print(f"Semantic similarity: {similarity:.3f}")  # 0.92 — much better than ROUGE
```

Semantic similarity above 0.85 usually indicates a correct answer. Below 0.70 usually indicates a wrong one. The 0.70-0.85 range is ambiguous and benefits from LLM-as-judge.
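A practical consequence: similarity can serve as a cheap triage layer, accepting high-scoring answers, rejecting low ones, and reserving the judge for the ambiguous band. A sketch, with the cutoffs taken from the heuristic above (the function name is mine):

```python
def triage_by_similarity(
    similarity: float,
    accept: float = 0.85,
    reject: float = 0.70,
) -> str:
    """Route an eval example based on embedding similarity.

    Returns "pass", "fail", or "judge" (escalate to LLM-as-judge).
    """
    if similarity >= accept:
        return "pass"
    if similarity < reject:
        return "fail"
    return "judge"


print(triage_by_similarity(0.92))  # pass
print(triage_by_similarity(0.78))  # judge
print(triage_by_similarity(0.55))  # fail
```

On a typical dataset this sends only the ambiguous minority to the expensive judge, which compounds with the cost optimizations discussed earlier.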
RAG-Specific Evaluation
RAG pipelines have two failure modes: bad retrieval (wrong documents) and bad generation (right documents, wrong answer). You must measure both.
The Four RAG Metrics
| Metric | What It Measures | Failure Example |
|---|---|---|
| Context Precision | Are the retrieved docs relevant? | Retrieved 5 docs but only 1 was about the query topic |
| Context Recall | Did retrieval find all relevant docs? | The key document was ranked #15 and missed top-k |
| Faithfulness | Is the answer grounded in retrieved context? | Model hallucinated facts not in any retrieved doc |
| Answer Correctness | Is the final answer right? | Retrieved right docs but model misinterpreted them |
Using Ragas for RAG Evaluation
Ragas is the standard framework for RAG evaluation. It computes all four metrics using LLM-based assessment.
```python
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_correctness,
)
from datasets import Dataset


def evaluate_rag_pipeline(
    questions: list[str],
    answers: list[str],
    contexts: list[list[str]],
    ground_truths: list[str],
) -> dict:
    """Evaluate a RAG pipeline using Ragas metrics."""
    # Ragas expects a HuggingFace Dataset
    eval_dataset = Dataset.from_dict({
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths,
    })
    result = evaluate(
        dataset=eval_dataset,
        metrics=[
            context_precision,
            context_recall,
            faithfulness,
            answer_correctness,
        ],
    )
    return result


# Example: evaluate your RAG pipeline on 3 questions
questions = [
    "What is the refund policy for annual plans?",
    "How do I reset my API key?",
    "What programming languages are supported?",
]

# These come from your RAG pipeline
answers = [
    "Annual plans offer prorated refunds within the first 6 months.",
    "Go to Settings > API Keys > click Regenerate.",
    "We support Python, JavaScript, Go, and Ruby.",
]

# The retrieved contexts (list of strings per question)
contexts = [
    ["Annual plans: prorated refund within 6 months. After 6 months, no refund."],
    ["API Key Management: Navigate to Settings > API Keys. Click Regenerate."],
    ["Supported Languages: Python, JavaScript, Go, Ruby, and Java SDKs."],
]

# Your ground truth answers
ground_truths = [
    "Annual plans are eligible for a prorated refund within the first 6 months.",
    "Navigate to Settings, then API Keys, and click the Regenerate button.",
    "Python, JavaScript, Go, Ruby, and Java are supported.",
]

results = evaluate_rag_pipeline(questions, answers, contexts, ground_truths)
print(results)
# {'context_precision': 0.95, 'context_recall': 0.90,
#  'faithfulness': 0.93, 'answer_correctness': 0.88}
```

Interpreting Ragas scores:
- Above 0.85 on all four metrics: your pipeline is production-ready
- Faithfulness below 0.80: the model is hallucinating — tighten your prompt
- Context recall below 0.70: your retrieval is missing documents — fix chunking or embeddings
- Context precision below 0.70: too much noise in retrieved docs — add reranking
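These rules are easy to encode so a CI run prints the likely fix rather than a bare score. A sketch using the thresholds above; the helper name and the flat score-dict shape (matching the printed Ragas result) are my assumptions:

```python
def diagnose_ragas(scores: dict[str, float]) -> list[str]:
    """Map Ragas metric scores to remediation hints using the thresholds above."""
    hints = []
    if scores.get("faithfulness", 1.0) < 0.80:
        hints.append("faithfulness < 0.80: model is hallucinating, tighten the prompt")
    if scores.get("context_recall", 1.0) < 0.70:
        hints.append("context_recall < 0.70: retrieval missing documents, fix chunking or embeddings")
    if scores.get("context_precision", 1.0) < 0.70:
        hints.append("context_precision < 0.70: noisy retrieval, add reranking")
    if not hints and all(v >= 0.85 for v in scores.values()):
        hints.append("all metrics >= 0.85: pipeline looks production-ready")
    return hints


print(diagnose_ragas({"context_precision": 0.95, "context_recall": 0.90,
                      "faithfulness": 0.93, "answer_correctness": 0.88}))
```

Printing these hints next to the raw scores in your eval report turns a failing build into an actionable ticket.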
Ragas at Scale
For larger evaluations, Ragas supports async execution and can use different LLMs as the evaluator:
```python
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Use a cheaper model for evaluation at scale
eval_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

result = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_correctness],
    llm=eval_llm,
)
```

Using DeepEval for General Evaluation
DeepEval provides a broader set of metrics and a test-runner interface that integrates with pytest.
```python
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    GEval,
)
from deepeval.test_case import LLMTestCase, LLMTestCaseParams


def run_deepeval(
    queries: list[str],
    outputs: list[str],
    expected_outputs: list[str],
    contexts: list[list[str]] | None = None,
) -> list[dict]:
    """Run DeepEval metrics on a batch of test cases."""
    # Define metrics
    relevancy = AnswerRelevancyMetric(threshold=0.7)

    # Custom G-Eval metric for domain-specific quality
    helpfulness = GEval(
        name="Helpfulness",
        criteria=(
            "Determine whether the actual output is helpful and actionable "
            "for the user. A helpful response directly addresses the query, "
            "provides specific steps or information, and avoids vague generalities."
        ),
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
        ],
        threshold=0.7,
    )

    # Build test cases
    test_cases = []
    for i in range(len(queries)):
        tc = LLMTestCase(
            input=queries[i],
            actual_output=outputs[i],
            expected_output=expected_outputs[i],
            retrieval_context=contexts[i] if contexts else None,
        )
        test_cases.append(tc)

    # Run evaluation
    metrics_list = [relevancy, helpfulness]
    if contexts:
        metrics_list.append(FaithfulnessMetric(threshold=0.7))

    results = []
    for tc in test_cases:
        case_results = {}
        for metric in metrics_list:
            metric.measure(tc)
            case_results[metric.__class__.__name__] = {
                "score": metric.score,
                "passed": metric.is_successful(),
                "reason": metric.reason,
            }
        results.append(case_results)
    return results


# Usage
results = run_deepeval(
    queries=["What is the refund policy?"],
    outputs=["You can get a refund within 6 months."],
    expected_outputs=["Annual plans offer prorated refunds within 6 months."],
    contexts=[["Refund policy: prorated refund within 6 months for annual plans."]],
)
for i, r in enumerate(results):
    print(f"Test case {i}:")
    for metric, data in r.items():
        status = "PASS" if data["passed"] else "FAIL"
        print(f"  {metric}: {data['score']:.2f} [{status}] — {data['reason']}")
```

DeepEval with Pytest
DeepEval integrates with pytest for CI/CD:
```python
# test_llm_quality.py
import json

import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def generate_answer(query: str) -> str:
    """Your LLM application's answer generation."""
    # Replace with your actual pipeline
    from your_app import rag_pipeline
    return rag_pipeline(query)


# Load eval dataset
with open("eval_dataset.json") as f:
    eval_data = json.load(f)


@pytest.mark.parametrize("example", eval_data, ids=[e["query"][:50] for e in eval_data])
def test_answer_quality(example):
    """Test that generated answers meet quality thresholds."""
    generated = generate_answer(example["query"])
    test_case = LLMTestCase(
        input=example["query"],
        actual_output=generated,
        expected_output=example["expected_answer"],
    )
    relevancy = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [relevancy])
```

Run with: `pytest test_llm_quality.py -v`
Regression Testing in CI/CD
The goal: every prompt change, every model upgrade, every pipeline modification gets automatically evaluated before reaching production.
```python
# eval_runner.py — run this in CI
import json
import sys
from datetime import datetime

# Assumes llm_judge (defined earlier in this lesson) is importable here.


def run_regression_eval(
    dataset_path: str,
    output_path: str,
    threshold: float = 0.80,
) -> bool:
    """Run evaluation and check against regression threshold."""
    # Load dataset
    with open(dataset_path) as f:
        dataset = json.load(f)

    # Run your pipeline on each example
    from your_app import generate_answer

    results = []
    for example in dataset:
        generated = generate_answer(example["query"])
        score = llm_judge(
            query=example["query"],
            generated_answer=generated,
            expected_answer=example["expected_answer"],
        )
        results.append({
            "query": example["query"],
            "category": example.get("category", "general"),
            "score": score["score"],
            "reasoning": score["reasoning"],
            "generated": generated,
        })

    # Compute aggregate metrics
    scores = [r["score"] for r in results]
    avg_score = sum(scores) / len(scores)
    pass_rate = sum(1 for s in scores if s >= 4) / len(scores)

    # Compute per-category metrics
    categories = set(r["category"] for r in results)
    category_metrics = {}
    for cat in categories:
        cat_scores = [r["score"] for r in results if r["category"] == cat]
        category_metrics[cat] = {
            "avg_score": round(sum(cat_scores) / len(cat_scores), 2),
            "count": len(cat_scores),
        }

    # Save results
    report = {
        "timestamp": datetime.utcnow().isoformat(),
        "dataset_size": len(dataset),
        "avg_score": round(avg_score, 2),
        "pass_rate": round(pass_rate, 3),
        "category_metrics": category_metrics,
        "threshold": threshold,
        "passed": avg_score >= threshold * 5,  # Convert to 1-5 scale
        "results": results,
    }
    with open(output_path, "w") as f:
        json.dump(report, f, indent=2)

    print("\nEvaluation Report")
    print("=" * 50)
    print(f"Examples evaluated: {len(dataset)}")
    print(f"Average score: {avg_score:.2f}/5")
    print(f"Pass rate (>=4/5): {pass_rate:.1%}")
    print(f"Threshold: {threshold * 5:.2f}/5")
    print("\nPer-category:")
    for cat, metrics in sorted(category_metrics.items()):
        print(f"  {cat}: {metrics['avg_score']}/5 (n={metrics['count']})")

    if not report["passed"]:
        print(f"\nFAILED: Average score {avg_score:.2f} < threshold {threshold * 5:.2f}")
        return False

    print("\nPASSED")
    return True


if __name__ == "__main__":
    success = run_regression_eval(
        dataset_path="eval_dataset.json",
        output_path="eval_results.json",
        threshold=0.80,
    )
    sys.exit(0 if success else 1)
```

GitHub Actions Integration
```yaml
# .github/workflows/llm-eval.yml
name: LLM Evaluation

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/llm/**'
      - 'eval_dataset.json'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run LLM evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python eval_runner.py

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval_results.json

      - name: Comment on PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('eval_results.json'));
            const body = `## LLM Evaluation Results
            | Metric | Value |
            |--------|-------|
            | Average Score | ${results.avg_score}/5 |
            | Pass Rate | ${(results.pass_rate * 100).toFixed(1)}% |
            | Status | ${results.passed ? '✅ PASSED' : '❌ FAILED'} |
            `;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });
```

A/B Testing Prompts in Production
Sometimes evaluation datasets are not enough — you need to test with real users on real queries.
```python
import hashlib
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ABTestConfig:
    name: str
    variant_a_prompt: str
    variant_b_prompt: str
    traffic_split: float = 0.5  # 50/50 by default
    start_date: str = ""
    end_date: str = ""


class PromptABTest:
    """Simple A/B testing for prompts."""

    def __init__(self, config: ABTestConfig):
        self.config = config
        self.results: list[dict] = []

    def get_variant(self, user_id: str) -> str:
        """Deterministically assign user to variant (consistent hashing)."""
        hash_val = int(hashlib.md5(
            f"{self.config.name}:{user_id}".encode()
        ).hexdigest(), 16)
        if (hash_val % 100) / 100 < self.config.traffic_split:
            return "A"
        return "B"

    def get_prompt(self, user_id: str) -> str:
        """Get the prompt for this user's variant."""
        variant = self.get_variant(user_id)
        if variant == "A":
            return self.config.variant_a_prompt
        return self.config.variant_b_prompt

    def log_result(
        self,
        user_id: str,
        query: str,
        response: str,
        feedback: int | None = None,  # 1 = thumbs up, -1 = thumbs down
        latency_ms: float = 0,
    ):
        """Log a result for analysis."""
        self.results.append({
            "user_id": user_id,
            "variant": self.get_variant(user_id),
            "query": query,
            "response": response,
            "feedback": feedback,
            "latency_ms": latency_ms,
            "timestamp": datetime.utcnow().isoformat(),
        })

    def analyze(self) -> dict:
        """Compute metrics per variant."""
        analysis = {}
        for variant in ["A", "B"]:
            variant_results = [r for r in self.results if r["variant"] == variant]
            feedbacks = [r["feedback"] for r in variant_results if r["feedback"]]
            latencies = [r["latency_ms"] for r in variant_results if r["latency_ms"]]
            positive = sum(1 for f in feedbacks if f == 1)
            total_feedback = len(feedbacks)
            analysis[variant] = {
                "total_queries": len(variant_results),
                "feedback_count": total_feedback,
                "positive_rate": positive / total_feedback if total_feedback else 0,
                "avg_latency_ms": sum(latencies) / len(latencies) if latencies else 0,
            }
        return analysis


# Usage
test = PromptABTest(ABTestConfig(
    name="system_prompt_v2_test",
    variant_a_prompt="You are a helpful customer support agent. Answer concisely.",
    variant_b_prompt=(
        "You are a customer support agent for Acme Corp. "
        "Answer questions using only the provided context. "
        "If you don't know, say so. Be specific and cite sources."
    ),
    traffic_split=0.5,
))

# In your API handler:
# prompt = test.get_prompt(user_id)
# response = generate(query, system_prompt=prompt)
# test.log_result(user_id, query, response, feedback=user_feedback)
```

Human Evaluation
Automated metrics get you 80% of the way there. The last 20% requires human judgment, especially for subjective qualities like tone, helpfulness, and appropriateness.
Evaluation Rubric
Define a rubric so evaluators are consistent:
```python
HUMAN_EVAL_RUBRIC = {
    "accuracy": {
        5: "All facts are correct. No fabricated information.",
        4: "Mostly correct. One minor inaccuracy that doesn't change the answer.",
        3: "Partially correct. Contains a meaningful error but shows understanding.",
        2: "Mostly incorrect. Key facts are wrong.",
        1: "Completely wrong or fabricated.",
    },
    "helpfulness": {
        5: "Directly and completely answers the question. Actionable.",
        4: "Answers the question with minor gaps. Mostly actionable.",
        3: "Partially answers the question. User would need follow-up.",
        2: "Vaguely related but doesn't answer the question.",
        1: "Irrelevant, evasive, or unhelpful.",
    },
    "safety": {
        5: "Appropriate, professional, no harmful content.",
        4: "Appropriate with minor tone issues.",
        3: "Borderline — could be misinterpreted.",
        2: "Contains problematic content or bias.",
        1: "Harmful, offensive, or dangerous.",
    },
}


def create_human_eval_batch(
    examples: list[dict],
    output_path: str,
):
    """Create a spreadsheet-ready batch for human evaluators."""
    import csv

    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([
            "example_id", "query", "generated_answer", "expected_answer",
            "accuracy_score", "helpfulness_score", "safety_score", "notes",
        ])
        for i, ex in enumerate(examples):
            writer.writerow([
                i, ex["query"], ex["generated"], ex["expected"],
                "", "", "", "",  # To be filled by evaluator
            ])
    print(f"Created eval batch with {len(examples)} examples at {output_path}")
```

Inter-Annotator Agreement
When multiple humans evaluate the same examples, measure how much they agree. Low agreement means your rubric is ambiguous.
```python
from collections import Counter


def compute_agreement(ratings_a: list[int], ratings_b: list[int]) -> dict:
    """Compute inter-annotator agreement metrics."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)

    # Exact agreement
    exact_match = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b) / n

    # Within-1 agreement (scores differ by at most 1)
    within_one = sum(
        1 for a, b in zip(ratings_a, ratings_b) if abs(a - b) <= 1
    ) / n

    # Cohen's Kappa (chance-corrected agreement)
    observed_agreement = exact_match
    # Expected agreement by chance
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    all_values = set(ratings_a) | set(ratings_b)
    expected_agreement = sum(
        (count_a[v] / n) * (count_b[v] / n) for v in all_values
    )
    kappa = (observed_agreement - expected_agreement) / (1 - expected_agreement) \
        if expected_agreement < 1 else 1.0

    return {
        "exact_agreement": round(exact_match, 3),
        "within_one_agreement": round(within_one, 3),
        "cohens_kappa": round(kappa, 3),
    }


# Example
evaluator_1 = [5, 4, 3, 5, 2, 4, 3, 5, 4, 4]
evaluator_2 = [5, 4, 4, 5, 3, 4, 3, 4, 4, 5]

agreement = compute_agreement(evaluator_1, evaluator_2)
print(f"Exact agreement: {agreement['exact_agreement']:.1%}")
print(f"Within-1: {agreement['within_one_agreement']:.1%}")
print(f"Cohen's Kappa: {agreement['cohens_kappa']:.3f}")
# Exact agreement: 60.0%
# Within-1: 100.0%
# Cohen's Kappa: 0.403
```

Interpreting Kappa: Below 0.20 is poor (rubric needs work), 0.20-0.40 is fair, 0.40-0.60 is moderate, 0.60-0.80 is substantial, above 0.80 is excellent.
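If you track Kappa across eval rounds, a tiny helper keeps the interpretation consistent. The bands mirror the sentence above (the common Landis-Koch convention); the function name is mine:

```python
def interpret_kappa(kappa: float) -> str:
    """Label a Cohen's Kappa value using the agreement bands described above."""
    if kappa < 0.20:
        return "poor (rubric needs work)"
    if kappa < 0.40:
        return "fair"
    if kappa < 0.60:
        return "moderate"
    if kappa < 0.80:
        return "substantial"
    return "excellent"


print(interpret_kappa(0.45))  # moderate
```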
Cost of Evaluation
Evaluation is not free. Here is a realistic cost breakdown for a 200-example eval dataset:
| Method | Per Example | 200 Examples | Time |
|---|---|---|---|
| Exact match | $0 | $0 | < 1 second |
| Semantic similarity | ~$0.0001 | ~$0.02 | ~10 seconds |
| LLM-as-judge (GPT-4o-mini) | ~$0.002 | ~$0.40 | ~5 minutes |
| LLM-as-judge (GPT-4o) | ~$0.02 | ~$4.00 | ~10 minutes |
| Multi-criteria (5x GPT-4o) | ~$0.10 | ~$20.00 | ~45 minutes |
| Human evaluation | ~$0.50-2.00 | ~$100-400 | ~4-8 hours |
Budget strategy: Run cheap metrics (exact match, semantic similarity) on every PR. Run LLM-as-judge on daily builds. Run human evaluation monthly or before major releases.
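The table's arithmetic generalizes to a quick budget estimate for any dataset size. The unit costs below are the rough per-example figures from the table, not provider pricing:

```python
# Rough per-example unit costs (USD) from the table above; illustrative only.
UNIT_COSTS = {
    "exact_match": 0.0,
    "semantic_similarity": 0.0001,
    "judge_gpt4o_mini": 0.002,
    "judge_gpt4o": 0.02,
    "multi_criteria_gpt4o": 0.10,
}


def estimate_eval_cost(n_examples: int, methods: list[str]) -> float:
    """Estimate the total cost of running the given methods over a dataset."""
    return round(sum(UNIT_COSTS[m] * n_examples for m in methods), 2)


print(estimate_eval_cost(200, ["semantic_similarity", "judge_gpt4o"]))  # 4.02
```

Run this against your actual dataset size before wiring LLM-as-judge into every PR; it is an easy way to decide which tier of the budget strategy each pipeline stage belongs in.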
Building a Complete Eval Harness
Here is a production-ready eval harness that combines everything:
```python
import json
import time
from dataclasses import dataclass, field, asdict
from datetime import datetime
from pathlib import Path
from typing import Callable

# Assumes semantic_similarity, llm_judge, and compute_rouge from earlier
# in this lesson are importable/defined.


@dataclass
class EvalResult:
    query: str
    expected: str
    generated: str
    scores: dict = field(default_factory=dict)
    latency_ms: float = 0
    category: str = ""
    passed: bool = False


class EvalHarness:
    """Production evaluation harness for LLM applications."""

    def __init__(
        self,
        generator: Callable[[str], str],  # Your LLM pipeline
        dataset_path: str,
        output_dir: str = "eval_results",
    ):
        self.generator = generator
        self.dataset = self._load_dataset(dataset_path)
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)

    def _load_dataset(self, path: str) -> list[dict]:
        with open(path) as f:
            return json.load(f)

    def run(
        self,
        metrics: list[str] | None = None,
        pass_threshold: float = 0.80,
    ) -> dict:
        """Run full evaluation suite."""
        if metrics is None:
            metrics = ["semantic_similarity", "llm_judge"]

        results: list[EvalResult] = []
        for i, example in enumerate(self.dataset):
            print(f"Evaluating {i + 1}/{len(self.dataset)}: "
                  f"{example['query'][:50]}...")

            # Generate answer
            start = time.time()
            generated = self.generator(example["query"])
            latency = (time.time() - start) * 1000

            result = EvalResult(
                query=example["query"],
                expected=example["expected_answer"],
                generated=generated,
                latency_ms=round(latency, 1),
                category=example.get("category", "general"),
            )

            # Run selected metrics
            if "semantic_similarity" in metrics:
                sim = semantic_similarity(generated, example["expected_answer"])
                result.scores["semantic_similarity"] = round(sim, 3)

            if "llm_judge" in metrics:
                judgment = llm_judge(
                    query=example["query"],
                    generated_answer=generated,
                    expected_answer=example["expected_answer"],
                )
                result.scores["llm_judge"] = judgment["score"]
                result.scores["judge_reasoning"] = judgment["reasoning"]

            if "rouge" in metrics:
                rouge = compute_rouge(generated, example["expected_answer"])
                result.scores["rouge_l_f1"] = rouge["rougeL"]["f1"]

            # Determine pass/fail
            if "llm_judge" in result.scores:
                result.passed = result.scores["llm_judge"] >= 4
            elif "semantic_similarity" in result.scores:
                result.passed = result.scores["semantic_similarity"] >= 0.85

            results.append(result)

        # Compute aggregates
        report = self._build_report(results, pass_threshold)

        # Save
        timestamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S")
        report_path = self.output_dir / f"eval_{timestamp}.json"
        with open(report_path, "w") as f:
            json.dump(report, f, indent=2, default=str)

        self._print_report(report)
        return report

    def _build_report(self, results: list[EvalResult], threshold: float) -> dict:
        """Build aggregate report from individual results."""
        scores_by_metric = {}
        for r in results:
            for metric, value in r.scores.items():
                if isinstance(value, (int, float)):
                    scores_by_metric.setdefault(metric, []).append(value)

        aggregates = {}
        for metric, values in scores_by_metric.items():
            aggregates[metric] = {
                "mean": round(sum(values) / len(values), 3),
                "min": round(min(values), 3),
                "max": round(max(values), 3),
            }

        # Per-category breakdown
        categories = set(r.category for r in results)
        category_breakdown = {}
        for cat in sorted(categories):
            cat_results = [r for r in results if r.category == cat]
            cat_scores = [r.scores.get("llm_judge", 0) for r in cat_results]
            category_breakdown[cat] = {
                "count": len(cat_results),
                "avg_score": round(sum(cat_scores) / len(cat_scores), 2)
                if cat_scores else 0,
                "pass_rate": round(
                    sum(1 for r in cat_results if r.passed) / len(cat_results), 3
                ),
            }

        pass_rate = sum(1 for r in results if r.passed) / len(results)
        avg_latency = sum(r.latency_ms for r in results) / len(results)

        return {
            "timestamp": datetime.utcnow().isoformat(),
            "total_examples": len(results),
            "pass_rate": round(pass_rate, 3),
            "passed_threshold": pass_rate >= threshold,
            "threshold": threshold,
            "avg_latency_ms": round(avg_latency, 1),
            "aggregates": aggregates,
            "category_breakdown": category_breakdown,
            "results": [asdict(r) for r in results],
        }

    def _print_report(self, report: dict):
        """Print a human-readable evaluation report."""
        print(f"\n{'=' * 60}")
        print("EVALUATION REPORT")
        print(f"{'=' * 60}")
        print(f"Examples: {report['total_examples']}")
        print(f"Pass rate: {report['pass_rate']:.1%}")
        print(f"Threshold: {report['threshold']:.1%}")
        print(f"Status: {'PASSED' if report['passed_threshold'] else 'FAILED'}")
        print(f"Avg latency: {report['avg_latency_ms']:.0f}ms")
        print("\nMetric Averages:")
        for metric, stats in report["aggregates"].items():
            print(f"  {metric}: {stats['mean']:.3f} "
                  f"(min={stats['min']}, max={stats['max']})")
        print("\nCategory Breakdown:")
        for cat, metrics in report["category_breakdown"].items():
            print(f"  {cat}: {metrics['avg_score']}/5 "
                  f"({metrics['pass_rate']:.0%} pass, n={metrics['count']})")
        print(f"{'=' * 60}\n")


# Usage
def my_rag_pipeline(query: str) -> str:
    """Your actual RAG pipeline."""
    # ... retrieval + generation logic ...
    pass


harness = EvalHarness(
    generator=my_rag_pipeline,
    dataset_path="eval_dataset.json",
    output_dir="eval_results",
)
report = harness.run(
    metrics=["semantic_similarity", "llm_judge"],
    pass_threshold=0.80,
)
```
)
if not report["passed_threshold"]:
    print("Evaluation FAILED — do not deploy this change.")

Common Mistakes
1. Evaluating on training examples. If your eval dataset includes examples you used to develop the prompt, your scores are inflated. Keep a held-out test set that you never look at during development.
2. Binary scoring. “Correct” or “incorrect” loses too much information. A response that is 80% correct should score differently from one that is 20% correct. Use the 1-5 scale.
3. Ignoring edge cases. If 95% of your eval dataset is “easy” queries, you will report 95% accuracy while completely failing on the 5% of hard queries that matter most to users.
4. Not versioning your eval dataset. If the dataset changes between evaluations, you cannot compare results. Version it alongside your code.
5. Evaluating only the final answer. In a RAG pipeline, the final answer depends on retrieval quality. If you only measure the answer, you will not know whether a failure was caused by bad retrieval or bad generation. Measure both.
6. Running evaluation manually. If evaluation requires someone to remember to run it, it will not happen. Automate it in CI.
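Mistake #1 can be avoided mechanically. Here is a minimal sketch (an illustration, not part of the harness above) that splits an eval dataset into a dev set and a held-out test set by hashing each query, so an example never migrates between splits as the dataset grows:

```python
import hashlib

def split_dataset(examples: list[dict], test_fraction: float = 0.3):
    """Deterministically split eval examples into dev and held-out sets.

    Hashing the query (instead of random shuffling) assigns each example
    to the same split on every run, so the held-out set stays held out
    even as new examples are added to the dataset.
    """
    dev, held_out = [], []
    for ex in examples:
        digest = hashlib.sha256(ex["query"].encode("utf-8")).hexdigest()
        bucket = int(digest[:8], 16) % 100  # stable 0-99 bucket per query
        (held_out if bucket < test_fraction * 100 else dev).append(ex)
    return dev, held_out
```

Tune prompts against the dev split only; score the held-out split only when reporting results.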
Key Takeaways
- Build an eval dataset before writing your first prompt. 50 examples minimum. Include easy, medium, and hard queries across all your categories.
- LLM-as-judge is the most practical automated evaluation. Use GPT-4o to judge your application’s output on a 1-5 scale with explicit criteria. It correlates well with human judgment.
- For RAG, measure retrieval and generation separately. Use Ragas metrics: context precision, context recall, faithfulness, answer correctness. Fix retrieval first — bad retrieval makes generation impossible.
- Automate regression testing in CI. Every prompt change should trigger evaluation. If accuracy drops below your threshold, block the deployment.
- ROUGE and BLEU are not enough for LLMs. They measure word overlap, not semantic correctness. Use semantic similarity or LLM-as-judge instead.
- Human evaluation calibrates everything else. Run it quarterly. Use a rubric. Measure inter-annotator agreement. If annotators disagree, your rubric is broken.
- Evaluation costs money but saves more. A bad answer that reaches production costs more in user trust than $20 worth of GPT-4o evaluation calls.
- Version your eval dataset like code. Changes to the dataset change your metrics. Track both in source control.
- Slice metrics by category. An overall 85% pass rate can hide a 40% pass rate in your most important category. Always look at the breakdown.
- Eval is not a one-time setup. Queries evolve, models change, edge cases surface. Budget ongoing time for maintaining and expanding your evaluation pipeline.
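The CI takeaway can be sketched as a small gate that compares the latest report against a stored baseline and blocks deployment on regression, not just on an absolute threshold. The file paths and the `max_drop` tolerance below are assumptions for illustration; the reports are assumed to contain the `pass_rate` field the harness writes:

```python
import json

def check_regression(current_path: str, baseline_path: str,
                     max_drop: float = 0.02) -> bool:
    """Return True if the latest eval report is acceptable vs the baseline.

    Returns False (block the deploy) when the pass rate fell by more
    than `max_drop`, catching silent regressions from prompt or model
    changes even while the absolute threshold still passes.
    """
    with open(current_path) as f:
        current = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)
    drop = baseline["pass_rate"] - current["pass_rate"]
    if drop > max_drop:
        print(f"Regression: pass rate fell {drop:.1%} vs baseline")
        return False
    return True
```

In CI, exit nonzero when this returns False, and update the baseline file only on deliberate, reviewed changes.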