Lesson 02 · LLM Engineering in Production · 15 min read

Choosing Between OpenAI, Claude, and Open Source

April 01, 2026

TL;DR

No single model wins everything. GPT-4o is the safe default with the best ecosystem. Claude excels at long-context tasks, nuanced instructions, and code. Llama/Mistral are best when you need data privacy or cost control at scale. Gemini offers massive context windows. Pick based on your constraints: latency, cost, privacy, context length, and task type — not hype.

Choosing a model is one of the first decisions you make when building an LLM application, and one of the most consequential. Get it wrong and you are locked into a provider that is too expensive, too slow, or not capable enough for your task. Get it right and you have a foundation you can optimize from.

This lesson gives you the data and framework to make that decision. No hype, no brand loyalty — just trade-offs.

1. The Major Model Families

The LLM landscape in 2026 has consolidated around five major providers. Each has a distinct philosophy and set of trade-offs.

OpenAI (GPT-4o, GPT-4o mini, o1/o3)

The market leader with the broadest ecosystem. GPT-4o is the “safe default” that most teams start with.

  • GPT-4o: Flagship multimodal model. Strong at everything, best ecosystem (tools, plugins, fine-tuning support)
  • GPT-4o mini: 90% of GPT-4o quality at 6% of the cost. The workhorse for production applications
  • o1 / o3: Reasoning models that “think” before answering. Slower and more expensive but significantly better at math, logic, and complex multi-step problems

Anthropic (Claude Sonnet, Haiku, Opus)

Anthropic’s Claude family is known for instruction-following, long-context performance, and code quality.

  • Claude Sonnet: Best balance of quality and speed. Excels at code, analysis, and following nuanced instructions
  • Claude Haiku: Fast and cheap. Good for classification, extraction, and high-throughput tasks
  • Claude Opus: Most capable model. Best for complex reasoning, research, and tasks requiring deep understanding

Meta (Llama 3.1 / 3.2 / 4)

Open-weight models you can run yourself. The go-to choice for data privacy, customization, and cost control at scale.

  • Llama 3.1 405B: Competitive with GPT-4o on many benchmarks. Requires significant infrastructure to run
  • Llama 3.1 70B: Sweet spot for self-hosted deployments. Strong quality-to-cost ratio
  • Llama 3.2 3B/1B: Small models for edge devices, on-device inference, and extremely cost-sensitive applications

Mistral (Mistral Large, Small, Codestral)

European AI company with strong open-weight offerings and a focus on efficiency.

  • Mistral Large: Competitive with GPT-4o at lower cost
  • Mistral Small: Excellent for structured tasks — classification, extraction, routing
  • Codestral: Purpose-built for code generation and code understanding

Google (Gemini 1.5 / 2.0)

Massive context windows and strong multimodal capabilities. Deep integration with Google Cloud.

  • Gemini 1.5 Pro: 2M token context window — can process entire codebases or books in a single call
  • Gemini 2.0 Flash: Fast and cheap with good quality. Strong for high-throughput applications
  • Gemini 2.0 Pro: Google’s top model for complex reasoning

2. Head-to-Head Comparison

Here is the data that actually matters for production decisions.

Pricing comparison (per 1M tokens, as of early 2026)

Model Input price Output price Context window
GPT-4o $2.50 $10.00 128K
GPT-4o mini $0.15 $0.60 128K
o3 $10.00 $40.00 200K
Claude Sonnet $3.00 $15.00 200K
Claude Haiku $0.80 $4.00 200K
Claude Opus $15.00 $75.00 200K
Gemini 1.5 Pro $1.25 $5.00 2M
Gemini 2.0 Flash $0.10 $0.40 1M
Mistral Large $2.00 $6.00 128K
Mistral Small $0.10 $0.30 128K
Llama 3.1 70B (via Together) $0.88 $0.88 128K
Llama 3.1 8B (via Together) $0.18 $0.18 128K

Key takeaway: There is more than a 100x cost difference between the cheapest and most expensive options. A task running on Claude Opus at $75/M output tokens costs $0.75 per 10K output tokens. The same task on Mistral Small costs $0.003. If you can tolerate slightly lower quality, the savings are massive.
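
To see where those per-task numbers come from, here is the arithmetic as a minimal sketch (prices taken from the table above):

def output_cost(output_tokens: int, price_per_m_tokens: float) -> float:
    """Cost of generating `output_tokens` at a given per-1M-token output price."""
    return output_tokens / 1_000_000 * price_per_m_tokens

print(output_cost(10_000, 75.00))  # Claude Opus:   $0.75
print(output_cost(10_000, 0.30))   # Mistral Small: $0.003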

Latency comparison (typical, median)

Model Time to first token Tokens per second 500-token response
GPT-4o ~300ms ~80 tok/s ~6.5s
GPT-4o mini ~200ms ~120 tok/s ~4.4s
o3 ~2-30s (thinking) ~40 tok/s ~15-45s
Claude Sonnet ~400ms ~70 tok/s ~7.5s
Claude Haiku ~200ms ~150 tok/s ~3.5s
Gemini 2.0 Flash ~150ms ~200 tok/s ~2.7s
Llama 3.1 70B (self-hosted, A100) ~200ms ~50 tok/s ~10.2s

Reasoning models (o1, o3) are in a different category — they spend seconds to minutes “thinking” before generating output. Do not use them for latency-sensitive applications.
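
The last column is simply time-to-first-token plus generation time; a quick sketch of that estimate, using numbers from the table:

def est_response_seconds(ttft_s: float, tokens_per_s: float, output_tokens: int = 500) -> float:
    """Rough end-to-end latency: time to first token plus token generation time."""
    return ttft_s + output_tokens / tokens_per_s

print(est_response_seconds(0.3, 80))    # GPT-4o: ~6.5s
print(est_response_seconds(0.15, 200))  # Gemini 2.0 Flash: ~2.7s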

Strengths by task type

Task Best choice Why
Code generation Claude Sonnet, GPT-4o Best at following complex coding instructions
Data extraction / classification GPT-4o mini, Mistral Small, Haiku Fast, cheap, structured output support
Long document analysis Claude Sonnet (200K), Gemini 1.5 Pro (2M) Large context with strong recall
Complex reasoning / math o3, Claude Opus Designed for multi-step thinking
Creative writing Claude Sonnet, GPT-4o Best natural language quality
High throughput (>10K req/min) GPT-4o mini, Gemini Flash, self-hosted Llama Cost and speed at scale
Data-sensitive workloads Self-hosted Llama/Mistral Data never leaves your infrastructure
Multilingual GPT-4o, Gemini Best non-English performance

3. Benchmarks — And Why You Should Not Trust Them Blindly

Academic benchmarks give you a starting point, not a final answer.

Major benchmarks explained

Benchmark What it measures Score range
MMLU General knowledge across 57 subjects 0-100%
HumanEval Python code generation from docstrings 0-100% pass@1
MATH Mathematical problem solving 0-100%
GSM8K Grade school math word problems 0-100%
MT-Bench Multi-turn conversation quality 1-10 score
GPQA Graduate-level science questions 0-100%

Representative benchmark scores

Model MMLU HumanEval MATH GSM8K
GPT-4o 88.7 90.2 76.6 95.8
Claude Sonnet 88.7 92.0 78.3 96.4
Claude Opus 91.0 93.1 82.1 97.2
o3 89.5 92.7 96.7 98.9
Gemini 1.5 Pro 85.9 84.1 67.7 91.0
Llama 3.1 405B 88.6 89.0 73.8 96.8
Llama 3.1 70B 86.0 80.5 68.0 93.0
Mistral Large 84.0 81.0 65.2 91.2

Why benchmarks lie

Benchmarks are useful directionally but dangerous as decision criteria. Here is why:

  1. Benchmark gaming. Model providers optimize for benchmark scores. Some models are explicitly fine-tuned on benchmark-style questions, inflating scores without corresponding real-world improvement.

  2. Your task is not a benchmark. MMLU tests multiple-choice knowledge. HumanEval tests simple function generation. Your production task — extracting insurance claims from scanned PDFs, or generating personalized marketing copy — is nothing like these benchmarks.

  3. Benchmarks are static, models change. A model that scored 85% on MMLU six months ago might have been updated. Benchmark scores are snapshots in time.

The right approach: Use benchmarks to shortlist 2-3 candidates, then run your own evaluation on your actual task with your actual data.

import json
import time
from openai import OpenAI
from anthropic import Anthropic

def evaluate_model(provider, model, test_cases, judge_fn):
    """Run a set of test cases against a model and score results."""
    results = []

    for case in test_cases:
        start = time.time()

        if provider == "openai":
            client = OpenAI()
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": case["system"]},
                    {"role": "user", "content": case["input"]},
                ],
                temperature=0.0,
                max_tokens=case.get("max_tokens", 1024),
            )
            output = response.choices[0].message.content
            tokens_in = response.usage.prompt_tokens
            tokens_out = response.usage.completion_tokens

        elif provider == "anthropic":
            client = Anthropic()
            response = client.messages.create(
                model=model,
                system=case["system"],
                messages=[{"role": "user", "content": case["input"]}],
                temperature=0.0,
                max_tokens=case.get("max_tokens", 1024),
            )
            output = response.content[0].text
            tokens_in = response.usage.input_tokens
            tokens_out = response.usage.output_tokens

        else:
            raise ValueError(f"Unsupported provider: {provider}")

        latency = time.time() - start
        score = judge_fn(case["expected"], output)

        results.append({
            "case_id": case["id"],
            "score": score,
            "latency": latency,
            "tokens_in": tokens_in,
            "tokens_out": tokens_out,
        })

    return {
        "model": model,
        "avg_score": sum(r["score"] for r in results) / len(results),
        "avg_latency": sum(r["latency"] for r in results) / len(results),
        "total_tokens": sum(r["tokens_in"] + r["tokens_out"] for r in results),
        "details": results,
    }

4. The Decision Framework

Stop asking “which model is best?” and start asking “which model is best for my specific constraints?”

Step 1: Define your constraints

# Define your requirements as a checklist
requirements = {
    "max_latency_ms": 3000,       # User-facing? Need < 3s
    "max_cost_per_request": 0.01, # Budget constraint
    "min_quality_score": 0.85,    # From your own eval (0-1)
    "context_needed": 8000,       # Typical input size in tokens
    "data_privacy": "standard",   # "standard", "strict", "air-gapped"
    "output_format": "json",      # "text", "json", "code"
    "requests_per_month": 500000, # Scale requirement
    "availability": 0.999,        # Uptime requirement
}

Step 2: Apply the decision tree

Model selection decision tree — from requirements to model recommendation

Here is the decision logic in plain terms (a code sketch of the same tree follows below):

1. Does your data need to stay on your infrastructure?
   YES -> Self-hosted: Llama 3.1 70B or Mistral Large
   NO  -> Continue to step 2

2. Do you need >200K token context?
   YES -> Gemini 1.5 Pro (2M context)
   NO  -> Continue to step 3

3. Is this a complex reasoning / math / logic task?
   YES -> o3 (if budget allows) or Claude Opus
   NO  -> Continue to step 4

4. Is latency critical (< 2s response)?
   YES -> GPT-4o mini, Gemini 2.0 Flash, or Claude Haiku
   NO  -> Continue to step 5

5. Is cost the primary constraint (< $0.005/request)?
   YES -> GPT-4o mini, Mistral Small, or self-hosted Llama 8B
   NO  -> Continue to step 6

6. Is this a code-heavy or instruction-following task?
   YES -> Claude Sonnet
   NO  -> GPT-4o (safe default)
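
The same tree as a small helper, assuming the requirements dict from Step 1. The task_type argument and the exact thresholds are illustrative assumptions for this sketch, not part of any provider API:

def recommend_model(req: dict, task_type: str = "general") -> str:
    """Walk the decision tree above using the Step 1 requirements dict."""
    if req["data_privacy"] in ("strict", "air-gapped"):
        return "self-hosted: Llama 3.1 70B or Mistral Large"
    if req["context_needed"] > 200_000:
        return "gemini-1.5-pro"
    if task_type in ("reasoning", "math"):
        return "o3 (if budget allows) or claude-opus"
    if req["max_latency_ms"] < 2000:
        return "gpt-4o-mini, gemini-2.0-flash, or claude-haiku"
    if req["max_cost_per_request"] < 0.005:
        return "gpt-4o-mini, mistral-small, or self-hosted llama-8b"
    if task_type in ("code", "instruction-following"):
        return "claude-sonnet"
    return "gpt-4o"  # safe default

print(recommend_model(requirements))          # -> gpt-4o
print(recommend_model(requirements, "code"))  # -> claude-sonnet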

Step 3: Run your own eval

Never ship based on the decision tree alone. Always run a minimum viable evaluation:

# Minimum viable model evaluation
test_cases = [
    {
        "id": "extract_1",
        "system": "Extract the customer name, issue, and sentiment from the support ticket. Return JSON.",
        "input": "Hi, I'm John Smith. My order #4521 arrived damaged. The box was completely crushed. Very disappointed with the packaging quality.",
        "expected": {"name": "John Smith", "issue": "damaged order", "sentiment": "negative"},
        "max_tokens": 200,
    },
    # Add 20-50 representative test cases from your actual workload
]

def judge_extraction(expected: dict, output: str) -> float:
    """Score extraction quality. Returns 0.0-1.0."""
    try:
        parsed = json.loads(output)
        score = 0.0
        for key in expected:
            if key in parsed and str(parsed[key]).lower() == str(expected[key]).lower():
                score += 1.0 / len(expected)
        return score
    except json.JSONDecodeError:
        return 0.0  # Failed to produce valid JSON

# Test multiple models
candidates = [
    ("openai", "gpt-4o"),
    ("openai", "gpt-4o-mini"),
    ("anthropic", "claude-sonnet-4-20250514"),
    ("anthropic", "claude-3-5-haiku-20241022"),
]

for provider, model in candidates:
    result = evaluate_model(provider, model, test_cases, judge_extraction)
    monthly_cost = (
        result["total_tokens"] / len(test_cases) *
        requirements["requests_per_month"] / 1_000_000 * 5.0  # Rough avg price
    )
    print(f"{model:>30}: score={result['avg_score']:.2f}  "
          f"latency={result['avg_latency']:.1f}s  "
          f"est_monthly=${monthly_cost:,.0f}")

5. Open Source vs. API: The Real Trade-offs

This is not a philosophical choice. It is an engineering decision with concrete trade-offs.

When to use API providers

Advantage Detail
Zero infrastructure No GPUs, no model loading, no memory management
Latest models Access to frontier models immediately
Managed scaling Provider handles traffic spikes
Lower upfront cost Pay per token, no hardware investment
Built-in features Function calling, structured output, vision, prompt caching

When to self-host open source

Advantage Detail
Data privacy Inputs and outputs never leave your network
No per-token cost Fixed infrastructure cost regardless of volume
Customization Fine-tune on your data, modify system behavior
No rate limits You control throughput
Predictable latency No shared infrastructure, no noisy neighbors
Compliance Required for some regulated industries (healthcare, finance, government)

The break-even calculation

At what point does self-hosting become cheaper than API calls?

def self_host_vs_api_breakeven(
    monthly_requests: int,
    avg_tokens_per_request: int,  # Input + output combined
    api_price_per_m_tokens: float,
    gpu_monthly_cost: float,      # e.g., A100 80GB at ~$2/hr = $1,460/mo
    gpu_throughput_tokens_per_sec: int,
):
    """Compare API vs self-hosted costs."""
    # API cost
    total_tokens = monthly_requests * avg_tokens_per_request
    api_cost = (total_tokens / 1_000_000) * api_price_per_m_tokens

    # Self-hosted cost: how many GPUs do you need?
    tokens_per_month = monthly_requests * avg_tokens_per_request
    seconds_needed = tokens_per_month / gpu_throughput_tokens_per_sec
    hours_needed = seconds_needed / 3600
    gpus_needed = max(1, int(hours_needed / (30 * 24)) + 1)  # Hours per month
    self_hosted_cost = gpus_needed * gpu_monthly_cost

    return {
        "api_monthly_cost": round(api_cost, 2),
        "self_hosted_monthly_cost": round(self_hosted_cost, 2),
        "gpus_needed": gpus_needed,
        "savings": round(api_cost - self_hosted_cost, 2),
        "cheaper": "self-hosted" if self_hosted_cost < api_cost else "api",
    }

# Scenario: 1M requests/month, 2K tokens each, using GPT-4o equivalent
result = self_host_vs_api_breakeven(
    monthly_requests=1_000_000,
    avg_tokens_per_request=2000,
    api_price_per_m_tokens=6.25,     # Blended GPT-4o rate
    gpu_monthly_cost=1460,           # A100 80GB on-demand
    gpu_throughput_tokens_per_sec=500, # Llama 70B on A100, batched
)
print(json.dumps(result, indent=2))

Output:

{
  "api_monthly_cost": 12500.0,
  "self_hosted_monthly_cost": 2920.0,
  "gpus_needed": 2,
  "savings": 9580.0,
  "cheaper": "self-hosted"
}

Rule of thumb: Below ~100K requests/month, APIs are almost always cheaper when you factor in engineering time for infrastructure. Above ~500K requests/month, self-hosting starts to make financial sense — if you have the team to operate it.
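
To see the low-volume side of that rule of thumb, re-run the same function at 100K requests/month with all other assumptions unchanged:

# Same assumptions, but at 100K requests/month: ~$1,250 via the API versus
# ~$1,460 for a single dedicated A100, so the API comes out cheaper.
low_volume = self_host_vs_api_breakeven(
    monthly_requests=100_000,
    avg_tokens_per_request=2000,
    api_price_per_m_tokens=6.25,
    gpu_monthly_cost=1460,
    gpu_throughput_tokens_per_sec=500,
)
print(low_volume["cheaper"])  # "api"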

6. Running Open Source Models

If you decide to self-host, here are the practical options from simplest to most complex.

Option 1: Ollama (local development and prototyping)

# Install ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull llama3.1:70b
ollama run llama3.1:70b "Explain transformers in one paragraph"

Ollama exposes an OpenAI-compatible API by default:

from openai import OpenAI

# Point the OpenAI client at Ollama's local server
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required but unused
)

response = client.chat.completions.create(
    model="llama3.1:70b",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7,
    max_tokens=1024,
)

print(response.choices[0].message.content)

Use for: Local development, testing, demos. Not production.

Option 2: vLLM (production self-hosted inference)

vLLM is the standard for high-throughput LLM inference. It implements PagedAttention for efficient memory management and continuous batching for maximum GPU utilization.

# Install vLLM
pip install vllm

# Start the server (requires GPU)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --port 8000

Once the server is running, query it with the standard OpenAI client:

from openai import OpenAI

# vLLM also exposes an OpenAI-compatible API
client = OpenAI(
    base_url="http://your-server:8000/v1",
    api_key="unused",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain vLLM in 3 sentences."},
    ],
    temperature=0.0,
    max_tokens=512,
)

Use for: Production deployments where you need control over infrastructure, data privacy, or cost optimization at scale.

Option 3: Hosted open source (Together, Fireworks, Groq)

Get the benefits of open-source models without running infrastructure:

from openai import OpenAI

# Together.ai — hosted open-source models
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="your-together-api-key",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7,
    max_tokens=1024,
)

# Groq — optimized for speed
client_groq = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="your-groq-api-key",
)

response = client_groq.chat.completions.create(
    model="llama-3.1-70b-versatile",
    messages=[{"role": "user", "content": "Hello"}],
)

Use for: When you want open-source model quality and pricing without managing GPUs. Good middle ground.

7. Multi-Model Architectures

The most cost-effective production systems do not use a single model. They route requests to different models based on complexity, cost sensitivity, or task type.

The router pattern

from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()
anthropic_client = Anthropic()

def classify_complexity(query: str) -> str:
    """Use a cheap model to classify query complexity."""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Classify the complexity of this query as 'simple' or 'complex'. "
                "Simple: factual questions, basic tasks, classification. "
                "Complex: multi-step reasoning, code generation, analysis. "
                "Return ONLY the word 'simple' or 'complex'."
            )},
            {"role": "user", "content": query},
        ],
        temperature=0.0,
        max_tokens=10,
    )
    return response.choices[0].message.content.strip().lower()

def route_and_respond(query: str, system_prompt: str) -> dict:
    """Route query to appropriate model based on complexity."""
    complexity = classify_complexity(query)

    if complexity == "simple":
        # Cheap and fast model for simple queries
        response = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": query},
            ],
            temperature=0.0,
            max_tokens=1024,
        )
        return {
            "model": "gpt-4o-mini",
            "response": response.choices[0].message.content,
            "cost_tier": "low",
        }
    else:
        # Powerful model for complex queries
        response = anthropic_client.messages.create(
            model="claude-sonnet-4-20250514",
            system=system_prompt,
            messages=[{"role": "user", "content": query}],
            temperature=0.0,
            max_tokens=4096,
        )
        return {
            "model": "claude-sonnet-4-20250514",
            "response": response.content[0].text,
            "cost_tier": "high",
        }

# Usage
result = route_and_respond(
    "What is the capital of France?",
    "You are a helpful assistant. Be concise."
)
print(f"Routed to: {result['model']} (cost tier: {result['cost_tier']})")
print(f"Response: {result['response']}")

Cost impact of routing

For a typical application where 70% of queries are simple and 30% are complex:

Strategy Cost per 1M requests Savings
All GPT-4o $12,500 Baseline
All Claude Sonnet $18,000 -44% (more expensive)
Routed (mini + Sonnet) $5,950 52% savings
Routed (mini + GPT-4o) $4,550 64% savings

The classification call itself costs ~$0.0001 per request — negligible compared to the savings from routing.

Task-specific model selection

MODEL_CONFIG = {
    "classification": {
        "provider": "openai",
        "model": "gpt-4o-mini",
        "temperature": 0.0,
        "max_tokens": 50,
    },
    "extraction": {
        "provider": "openai",
        "model": "gpt-4o-mini",
        "temperature": 0.0,
        "max_tokens": 500,
    },
    "code_generation": {
        "provider": "anthropic",
        "model": "claude-sonnet-4-20250514",
        "temperature": 0.0,
        "max_tokens": 4096,
    },
    "summarization": {
        "provider": "openai",
        "model": "gpt-4o",
        "temperature": 0.3,
        "max_tokens": 1024,
    },
    "reasoning": {
        "provider": "openai",
        "model": "o3",
        "temperature": 1.0,  # o3 requires temperature=1
        "max_tokens": 8192,
    },
}

def call_model(task_type: str, messages: list[dict]) -> str:
    """Call the appropriate model based on task type."""
    config = MODEL_CONFIG[task_type]

    if config["provider"] == "openai":
        client = OpenAI()
        response = client.chat.completions.create(
            model=config["model"],
            messages=messages,
            temperature=config["temperature"],
            max_tokens=config["max_tokens"],
        )
        return response.choices[0].message.content

    elif config["provider"] == "anthropic":
        client = Anthropic()
        system = next((m["content"] for m in messages if m["role"] == "system"), "")
        user_msgs = [m for m in messages if m["role"] != "system"]
        response = client.messages.create(
            model=config["model"],
            system=system,
            messages=user_msgs,
            temperature=config["temperature"],
            max_tokens=config["max_tokens"],
        )
        return response.content[0].text

8. Real Cost Comparison: 1M Requests/Month

Let us make this concrete. Assume a typical application with:

  • Average input: 1,500 tokens (system prompt + user query)
  • Average output: 500 tokens
  • 1,000,000 requests per month

def monthly_cost(
    input_tokens: int,
    output_tokens: int,
    requests: int,
    input_price: float,  # per 1M tokens
    output_price: float, # per 1M tokens
) -> float:
    total_input = input_tokens * requests
    total_output = output_tokens * requests
    return (
        (total_input / 1_000_000) * input_price +
        (total_output / 1_000_000) * output_price
    )

scenarios = [
    ("GPT-4o",           2.50,  10.00),
    ("GPT-4o mini",      0.15,   0.60),
    ("Claude Sonnet",    3.00,  15.00),
    ("Claude Haiku",     0.80,   4.00),
    ("Gemini 2.0 Flash", 0.10,   0.40),
    ("Mistral Small",    0.10,   0.30),
    ("Llama 70B (Together)", 0.88, 0.88),
    ("Llama 8B (Together)",  0.18, 0.18),
]

print(f"{'Model':<25} {'Monthly cost':>14} {'Per request':>12}")
print("-" * 53)
for name, inp, out in scenarios:
    cost = monthly_cost(1500, 500, 1_000_000, inp, out)
    per_req = cost / 1_000_000
    print(f"{name:<25} ${cost:>12,.2f} ${per_req:>10.6f}")

Output:

Model                      Monthly cost   Per request
-----------------------------------------------------
GPT-4o                    $   8,750.00 $  0.008750
GPT-4o mini               $     525.00 $  0.000525
Claude Sonnet             $  11,250.00 $  0.011250
Claude Haiku              $   3,200.00 $  0.003200
Gemini 2.0 Flash          $     350.00 $  0.000350
Mistral Small             $     300.00 $  0.000300
Llama 70B (Together)      $   1,760.00 $  0.001760
Llama 8B (Together)       $     360.00 $  0.000360

The difference between Claude Sonnet at $11,250/month and Mistral Small at $300/month is 37x. If Mistral Small achieves 85% of the quality you need, that is $131,400/year in savings.

9. Provider Abstraction — Swapping Models with Minimal Code Changes

Build your application so you can switch providers in minutes, not days.

The adapter pattern

from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class LLMResponse:
    content: str
    input_tokens: int
    output_tokens: int
    model: str
    latency_ms: float

class LLMProvider(ABC):
    @abstractmethod
    def complete(
        self,
        messages: list[dict],
        temperature: float = 0.0,
        max_tokens: int = 1024,
    ) -> LLMResponse:
        pass

class OpenAIProvider(LLMProvider):
    def __init__(self, model: str = "gpt-4o"):
        from openai import OpenAI
        self.client = OpenAI()
        self.model = model

    def complete(self, messages, temperature=0.0, max_tokens=1024):
        import time
        start = time.time()
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
        )
        latency = (time.time() - start) * 1000
        return LLMResponse(
            content=response.choices[0].message.content,
            input_tokens=response.usage.prompt_tokens,
            output_tokens=response.usage.completion_tokens,
            model=self.model,
            latency_ms=latency,
        )

class AnthropicProvider(LLMProvider):
    def __init__(self, model: str = "claude-sonnet-4-20250514"):
        from anthropic import Anthropic
        self.client = Anthropic()
        self.model = model

    def complete(self, messages, temperature=0.0, max_tokens=1024):
        import time
        start = time.time()

        system = ""
        user_messages = []
        for m in messages:
            if m["role"] == "system":
                system = m["content"]
            else:
                user_messages.append(m)

        response = self.client.messages.create(
            model=self.model,
            system=system,
            messages=user_messages,
            temperature=temperature,
            max_tokens=max_tokens,
        )
        latency = (time.time() - start) * 1000
        return LLMResponse(
            content=response.content[0].text,
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            model=self.model,
            latency_ms=latency,
        )

class OllamaProvider(LLMProvider):
    def __init__(self, model: str = "llama3.1:70b"):
        from openai import OpenAI
        self.client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
        self.model = model

    def complete(self, messages, temperature=0.0, max_tokens=1024):
        import time
        start = time.time()
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
        )
        latency = (time.time() - start) * 1000
        return LLMResponse(
            content=response.choices[0].message.content,
            input_tokens=response.usage.prompt_tokens if response.usage else 0,
            output_tokens=response.usage.completion_tokens if response.usage else 0,
            model=self.model,
            latency_ms=latency,
        )

# Usage — swap providers by changing one line
provider = OpenAIProvider("gpt-4o-mini")
# provider = AnthropicProvider("claude-sonnet-4-20250514")
# provider = OllamaProvider("llama3.1:70b")

result = provider.complete(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain DNS in one sentence."},
    ],
    temperature=0.0,
)

print(f"Model: {result.model}")
print(f"Response: {result.content}")
print(f"Tokens: {result.input_tokens} in, {result.output_tokens} out")
print(f"Latency: {result.latency_ms:.0f}ms")

This pattern lets you:

  • A/B test models in production
  • Fall back to a secondary provider if the primary is down (see the fallback sketch below)
  • Switch models without touching business logic
  • Log and compare costs across providers
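
One example: a fallback wrapper over this interface is only a few lines (a sketch; the broad except is deliberate here, and in production you would narrow it to provider-specific errors):

class FallbackProvider(LLMProvider):
    """Try the primary provider; fall back to the secondary on any error."""

    def __init__(self, primary: LLMProvider, secondary: LLMProvider):
        self.primary = primary
        self.secondary = secondary

    def complete(self, messages, temperature=0.0, max_tokens=1024):
        try:
            return self.primary.complete(messages, temperature, max_tokens)
        except Exception:
            # Primary is down, rate-limited, or erroring: use the secondary.
            return self.secondary.complete(messages, temperature, max_tokens)

provider = FallbackProvider(
    primary=OpenAIProvider("gpt-4o-mini"),
    secondary=AnthropicProvider("claude-3-5-haiku-20241022"),
)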

10. Model Selection Checklist

Use this checklist when selecting a model for a new feature or project:

## Model Selection Checklist

### Requirements
- [ ] What is the task? (classification, generation, extraction, reasoning, code)
- [ ] What is the acceptable latency? (< 1s, < 5s, < 30s, doesn't matter)
- [ ] What is the budget per request? (< $0.001, < $0.01, < $0.10)
- [ ] What is the expected monthly volume? (< 10K, < 100K, < 1M, > 1M)
- [ ] What context length is needed? (< 4K, < 32K, < 128K, > 128K)

### Constraints
- [ ] Does data need to stay on-premise? (eliminates API providers)
- [ ] Are there compliance requirements? (SOC2, HIPAA, GDPR)
- [ ] Is the team able to operate self-hosted infrastructure?
- [ ] Is there an existing provider relationship or commitment?

### Evaluation
- [ ] Created 20+ test cases from real production data
- [ ] Tested at least 3 candidate models
- [ ] Measured: quality score, latency, cost per request
- [ ] Tested edge cases: long inputs, adversarial inputs, multilingual
- [ ] Confirmed output format reliability (JSON parsing success rate)

### Production Readiness
- [ ] Provider abstraction layer implemented (can swap models)
- [ ] Fallback model configured for primary provider outages
- [ ] Cost monitoring and alerting in place
- [ ] Rate limit handling with exponential backoff (see the sketch after this checklist)
- [ ] Model version pinned (not using "latest" alias in production)
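
For the rate-limit item above, a minimal retry helper with exponential backoff might look like this (a sketch using the OpenAI SDK's RateLimitError; swap in your provider's equivalent exception):

import random
import time

import openai
from openai import OpenAI

client = OpenAI()

def complete_with_backoff(messages: list[dict], model: str = "gpt-4o-mini", max_retries: int = 5) -> str:
    """Retry on rate limits with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(model=model, messages=messages)
            return response.choices[0].message.content
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Back off 1s, 2s, 4s, 8s ... plus random jitter before retrying.
            time.sleep(2 ** attempt + random.random())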

Key Takeaways

  • No single model wins everything. GPT-4o is the safe default with the best ecosystem. Claude excels at code and long-context tasks. Open-source models win on privacy and cost at scale. Choose based on your specific constraints.
  • Benchmarks are directional, not decisive. Use them to shortlist candidates, then run your own evaluation on your actual data. Benchmark gaming is real and widespread.
  • Cost differences are enormous. There is a 37x cost difference between premium and budget models. Many production tasks work fine with cheaper models. Always test the cheapest viable option first.
  • Multi-model architectures save money. Route simple queries to cheap models and complex queries to expensive ones. A classifier costs fractions of a cent and can save 50-60% on total LLM spend.
  • Self-hosting makes sense above ~500K requests/month if you have the team to operate it. Below that, API providers are almost always cheaper when you factor in engineering time.
  • Build a provider abstraction layer from day one. You will switch models. You will add fallbacks. You will A/B test. Make it easy on yourself.
  • Pin model versions in production. Models get updated. “gpt-4o” today might behave differently next month. Use specific version identifiers like “gpt-4o-2024-08-06” for reproducibility.
  • The open-source ecosystem is production-ready. vLLM, Ollama, and hosted providers like Together.ai make it practical to run Llama and Mistral models at scale with minimal operational overhead.
  • Your model choice is not permanent. The best model today will not be the best model in six months. Design your system so switching is a configuration change, not a rewrite.