Choosing a model is one of the first decisions you make when building an LLM application, and one of the most consequential. Get it wrong and you are locked into a provider that is too expensive, too slow, or not capable enough for your task. Get it right and you have a foundation you can optimize from.
This lesson gives you the data and framework to make that decision. No hype, no brand loyalty — just trade-offs.
1. The Major Model Families
The LLM landscape in 2026 has consolidated around five major providers. Each has a distinct philosophy and set of trade-offs.
OpenAI (GPT-4o, GPT-4o mini, o1/o3)
The market leader with the broadest ecosystem. GPT-4o is the “safe default” that most teams start with.
- GPT-4o: Flagship multimodal model. Strong at everything, best ecosystem (tools, plugins, fine-tuning support)
- GPT-4o mini: 90% of GPT-4o quality at 6% of the cost. The workhorse for production applications
- o1 / o3: Reasoning models that “think” before answering. Slower and more expensive but significantly better at math, logic, and complex multi-step problems
Anthropic (Claude Sonnet, Haiku, Opus)
Anthropic’s Claude family is known for instruction-following, long-context performance, and code quality.
- Claude Sonnet: Best balance of quality and speed. Excels at code, analysis, and following nuanced instructions
- Claude Haiku: Fast and cheap. Good for classification, extraction, and high-throughput tasks
- Claude Opus: Most capable model. Best for complex reasoning, research, and tasks requiring deep understanding
Meta (Llama 3.1 / 3.2 / 4)
Open-weight models you can run yourself. The go-to choice for data privacy, customization, and cost control at scale.
- Llama 3.1 405B: Competitive with GPT-4o on many benchmarks. Requires significant infrastructure to run
- Llama 3.1 70B: Sweet spot for self-hosted deployments. Strong quality-to-cost ratio
- Llama 3.2 3B/1B: Small models for edge devices, on-device inference, and extremely cost-sensitive applications
Mistral (Mistral Large, Small, Codestral)
European AI company with strong open-weight offerings and focus on efficiency.
- Mistral Large: Competitive with GPT-4o at lower cost
- Mistral Small: Excellent for structured tasks — classification, extraction, routing
- Codestral: Purpose-built for code generation and code understanding
Google (Gemini 1.5 / 2.0)
Massive context windows and strong multimodal capabilities. Deep integration with Google Cloud.
- Gemini 1.5 Pro: 2M token context window — can process entire codebases or books in a single call
- Gemini 2.0 Flash: Fast and cheap with good quality. Strong for high-throughput applications
- Gemini 2.0 Pro: Google’s top model for complex reasoning
2. Head-to-Head Comparison
Here is the data that actually matters for production decisions.
Pricing comparison (per 1M tokens, as of early 2026)
| Model | Input price | Output price | Context window |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 128K |
| GPT-4o mini | $0.15 | $0.60 | 128K |
| o3 | $10.00 | $40.00 | 200K |
| Claude Sonnet | $3.00 | $15.00 | 200K |
| Claude Haiku | $0.80 | $4.00 | 200K |
| Claude Opus | $15.00 | $75.00 | 200K |
| Gemini 1.5 Pro | $1.25 | $5.00 | 2M |
| Gemini 2.0 Flash | $0.10 | $0.40 | 1M |
| Mistral Large | $2.00 | $6.00 | 128K |
| Mistral Small | $0.10 | $0.30 | 128K |
| Llama 3.1 70B (via Together) | $0.88 | $0.88 | 128K |
| Llama 3.1 8B (via Together) | $0.18 | $0.18 | 128K |
Key takeaway: There is a cost difference of well over 100x between the cheapest and most expensive options. A task running on Claude Opus at $75/M output tokens costs $0.75 per 10K output tokens. The same task on Mistral Small costs $0.003. If you can tolerate slightly lower quality, the savings are massive.
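As a quick sanity check on that arithmetic (output prices taken from the table above):
# Output cost per 10K generated tokens, using prices from the table above
output_prices = {  # $ per 1M output tokens
    "Claude Opus": 75.00,
    "GPT-4o": 10.00,
    "GPT-4o mini": 0.60,
    "Mistral Small": 0.30,
}
for model, price in output_prices.items():
    print(f"{model:<14} ${price * 10_000 / 1_000_000:.4f} per 10K output tokens")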
Latency comparison (typical, median)
| Model | Time to first token | Tokens per second | 500-token response |
|---|---|---|---|
| GPT-4o | ~300ms | ~80 tok/s | ~6.5s |
| GPT-4o mini | ~200ms | ~120 tok/s | ~4.4s |
| o3 | ~2-30s (thinking) | ~40 tok/s | ~15-45s |
| Claude Sonnet | ~400ms | ~70 tok/s | ~7.5s |
| Claude Haiku | ~200ms | ~150 tok/s | ~3.5s |
| Gemini 2.0 Flash | ~150ms | ~200 tok/s | ~2.7s |
| Llama 3.1 70B (self-hosted, A100) | ~200ms | ~50 tok/s | ~10.2s |
Reasoning models (o1, o3) are in a different category — they spend seconds to minutes “thinking” before generating output. Do not use them for latency-sensitive applications.
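For the non-reasoning models, the last column is just arithmetic: time to first token plus generation time. A small helper, assuming the rough median numbers from the table, lets you estimate totals for your own response lengths:
def estimate_response_seconds(ttft_ms: float, tokens_per_sec: float, output_tokens: int) -> float:
    """Total response time = time to first token + generation time."""
    return ttft_ms / 1000 + output_tokens / tokens_per_sec

# Reproduce the 500-token column for a few rows of the table
for name, ttft, tps in [("GPT-4o", 300, 80), ("Claude Haiku", 200, 150), ("Gemini 2.0 Flash", 150, 200)]:
    print(f"{name:<17} ~{estimate_response_seconds(ttft, tps, 500):.1f}s for 500 tokens")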
Strengths by task type
| Task | Best choice | Why |
|---|---|---|
| Code generation | Claude Sonnet, GPT-4o | Best at following complex coding instructions |
| Data extraction / classification | GPT-4o mini, Mistral Small, Haiku | Fast, cheap, structured output support |
| Long document analysis | Claude Sonnet (200K), Gemini 1.5 Pro (2M) | Large context with strong recall |
| Complex reasoning / math | o3, Claude Opus | Designed for multi-step thinking |
| Creative writing | Claude Sonnet, GPT-4o | Best natural language quality |
| High throughput (>10K req/min) | GPT-4o mini, Gemini Flash, self-hosted Llama | Cost and speed at scale |
| Data-sensitive workloads | Self-hosted Llama/Mistral | Data never leaves your infrastructure |
| Multilingual | GPT-4o, Gemini | Best non-English performance |
3. Benchmarks — And Why You Should Not Trust Them Blindly
Academic benchmarks give you a starting point, not a final answer.
Major benchmarks explained
| Benchmark | What it measures | Score range |
|---|---|---|
| MMLU | General knowledge across 57 subjects | 0-100% |
| HumanEval | Python code generation from docstrings | 0-100% pass@1 |
| MATH | Mathematical problem solving | 0-100% |
| GSM8K | Grade school math word problems | 0-100% |
| MT-Bench | Multi-turn conversation quality | 1-10 score |
| GPQA | Graduate-level science questions | 0-100% |
Representative benchmark scores
| Model | MMLU | HumanEval | MATH | GSM8K |
|---|---|---|---|---|
| GPT-4o | 88.7 | 90.2 | 76.6 | 95.8 |
| Claude Sonnet | 88.7 | 92.0 | 78.3 | 96.4 |
| Claude Opus | 91.0 | 93.1 | 82.1 | 97.2 |
| o3 | 89.5 | 92.7 | 96.7 | 98.9 |
| Gemini 1.5 Pro | 85.9 | 84.1 | 67.7 | 91.0 |
| Llama 3.1 405B | 88.6 | 89.0 | 73.8 | 96.8 |
| Llama 3.1 70B | 86.0 | 80.5 | 68.0 | 93.0 |
| Mistral Large | 84.0 | 81.0 | 65.2 | 91.2 |
Why benchmarks lie
Benchmarks are useful directionally but dangerous as decision criteria. Here is why:
- Benchmark gaming. Model providers optimize for benchmark scores. Some models are explicitly fine-tuned on benchmark-style questions, inflating scores without corresponding real-world improvement.
- Your task is not a benchmark. MMLU tests multiple-choice knowledge. HumanEval tests simple function generation. Your production task — extracting insurance claims from scanned PDFs, or generating personalized marketing copy — is nothing like these benchmarks.
- Benchmarks are static, models change. A model that scored 85% on MMLU six months ago might have been updated since. Benchmark scores are snapshots in time.
The right approach: Use benchmarks to shortlist 2-3 candidates, then run your own evaluation on your actual task with your actual data.
import json
import time

from openai import OpenAI
from anthropic import Anthropic

def evaluate_model(provider, model, test_cases, judge_fn):
    """Run a set of test cases against a model and score results."""
    results = []
    for case in test_cases:
        start = time.time()
        if provider == "openai":
            client = OpenAI()
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": case["system"]},
                    {"role": "user", "content": case["input"]},
                ],
                temperature=0.0,
                max_tokens=case.get("max_tokens", 1024),
            )
            output = response.choices[0].message.content
            tokens_in = response.usage.prompt_tokens
            tokens_out = response.usage.completion_tokens
        elif provider == "anthropic":
            client = Anthropic()
            response = client.messages.create(
                model=model,
                system=case["system"],
                messages=[{"role": "user", "content": case["input"]}],
                temperature=0.0,
                max_tokens=case.get("max_tokens", 1024),
            )
            output = response.content[0].text
            tokens_in = response.usage.input_tokens
            tokens_out = response.usage.output_tokens
        else:
            raise ValueError(f"Unknown provider: {provider}")
        latency = time.time() - start
        score = judge_fn(case["expected"], output)
        results.append({
            "case_id": case["id"],
            "score": score,
            "latency": latency,
            "tokens_in": tokens_in,
            "tokens_out": tokens_out,
        })
    return {
        "model": model,
        "avg_score": sum(r["score"] for r in results) / len(results),
        "avg_latency": sum(r["latency"] for r in results) / len(results),
        "total_tokens": sum(r["tokens_in"] + r["tokens_out"] for r in results),
        "details": results,
    }

4. The Decision Framework
Stop asking “which model is best?” and start asking “which model is best for my specific constraints?”
Step 1: Define your constraints
# Define your requirements as a checklist
requirements = {
    "max_latency_ms": 3000,        # User-facing? Need < 3s
    "max_cost_per_request": 0.01,  # Budget constraint
    "min_quality_score": 0.85,     # From your own eval (0-1)
    "context_needed": 8000,        # Typical input size in tokens
    "data_privacy": "standard",    # "standard", "strict", "air-gapped"
    "output_format": "json",       # "text", "json", "code"
    "requests_per_month": 500000,  # Scale requirement
    "availability": 0.999,         # Uptime requirement
}

Step 2: Apply the decision tree
Here is the decision logic in plain terms:
1. Does your data need to stay on your infrastructure?
YES -> Self-hosted: Llama 3.1 70B or Mistral Large
NO -> Continue to step 2
2. Do you need >200K token context?
YES -> Gemini 1.5 Pro (2M context)
NO -> Continue to step 3
3. Is this a complex reasoning / math / logic task?
YES -> o3 (if budget allows) or Claude Opus
NO -> Continue to step 4
4. Is latency critical (< 2s response)?
YES -> GPT-4o mini, Gemini 2.0 Flash, or Claude Haiku
NO -> Continue to step 5
5. Is cost the primary constraint (< $0.005/request)?
YES -> GPT-4o mini, Mistral Small, or self-hosted Llama 8B
NO -> Continue to step 6
6. Is this a code-heavy or instruction-following task?
YES -> Claude Sonnet
NO -> GPT-4o (safe default)
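Here is the same tree as a minimal sketch in code, driven by the requirements dict from Step 1. The thresholds mirror the tree above; the task_type field is a hypothetical extra (it is not in the Step 1 dict), so adapt both to your own constraints.
# Minimal sketch of the decision tree, assuming the `requirements` dict
# from Step 1. `task_type` is a hypothetical extra field, not in that dict.
def choose_model(req: dict) -> str:
    if req["data_privacy"] in ("strict", "air-gapped"):
        return "self-hosted Llama 3.1 70B or Mistral Large"
    if req["context_needed"] > 200_000:
        return "gemini-1.5-pro"
    if req.get("task_type") == "reasoning":
        return "o3 (if budget allows) or claude-opus"
    if req["max_latency_ms"] < 2000:
        return "gpt-4o-mini, gemini-2.0-flash, or claude-haiku"
    if req["max_cost_per_request"] < 0.005:
        return "gpt-4o-mini, mistral-small, or self-hosted llama-8b"
    if req.get("task_type") in ("code", "instruction_following"):
        return "claude-sonnet"
    return "gpt-4o"  # safe default

print(choose_model(requirements))  # -> "gpt-4o" for the Step 1 example

Step 3: Run your own eval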
Never ship based on the decision tree alone. Always run a minimum viable evaluation:
# Minimum viable model evaluation
test_cases = [
    {
        "id": "extract_1",
        "system": "Extract the customer name, issue, and sentiment from the support ticket. Return JSON.",
        "input": "Hi, I'm John Smith. My order #4521 arrived damaged. The box was completely crushed. Very disappointed with the packaging quality.",
        "expected": {"name": "John Smith", "issue": "damaged order", "sentiment": "negative"},
        "max_tokens": 200,
    },
    # Add 20-50 representative test cases from your actual workload
]

def judge_extraction(expected: dict, output: str) -> float:
    """Score extraction quality. Returns 0.0-1.0."""
    try:
        parsed = json.loads(output)
        score = 0.0
        for key in expected:
            if key in parsed and str(parsed[key]).lower() == str(expected[key]).lower():
                score += 1.0 / len(expected)
        return score
    except json.JSONDecodeError:
        return 0.0  # Failed to produce valid JSON

# Test multiple models
candidates = [
    ("openai", "gpt-4o"),
    ("openai", "gpt-4o-mini"),
    ("anthropic", "claude-sonnet-4-20250514"),
    ("anthropic", "claude-3-5-haiku-20241022"),
]

for provider, model in candidates:
    result = evaluate_model(provider, model, test_cases, judge_extraction)
    monthly_cost = (
        result["total_tokens"] / len(test_cases) *
        requirements["requests_per_month"] / 1_000_000 * 5.0  # Rough avg price
    )
    print(f"{model:>30}: score={result['avg_score']:.2f} "
          f"latency={result['avg_latency']:.1f}s "
          f"est_monthly=${monthly_cost:,.0f}")

5. Open Source vs. API: The Real Trade-offs
This is not a philosophical choice. It is an engineering decision with concrete trade-offs.
When to use API providers
| Advantage | Detail |
|---|---|
| Zero infrastructure | No GPUs, no model loading, no memory management |
| Latest models | Access to frontier models immediately |
| Managed scaling | Provider handles traffic spikes |
| Lower upfront cost | Pay per token, no hardware investment |
| Built-in features | Function calling, structured output, vision, prompt caching |
When to self-host open source
| Advantage | Detail |
|---|---|
| Data privacy | Inputs and outputs never leave your network |
| No per-token cost | Fixed infrastructure cost regardless of volume |
| Customization | Fine-tune on your data, modify system behavior |
| No rate limits | You control throughput |
| Predictable latency | No shared infrastructure, no noisy neighbors |
| Compliance | Required for some regulated industries (healthcare, finance, government) |
The break-even calculation
At what point does self-hosting become cheaper than API calls?
import math

def self_host_vs_api_breakeven(
    monthly_requests: int,
    avg_tokens_per_request: int,         # Input + output combined
    api_price_per_m_tokens: float,
    gpu_monthly_cost: float,             # e.g., A100 80GB at ~$2/hr = $1,460/mo
    gpu_throughput_tokens_per_sec: int,
):
    """Compare API vs self-hosted costs."""
    total_tokens = monthly_requests * avg_tokens_per_request
    # API cost
    api_cost = (total_tokens / 1_000_000) * api_price_per_m_tokens
    # Self-hosted cost: how many GPUs do you need to generate that many tokens?
    seconds_needed = total_tokens / gpu_throughput_tokens_per_sec
    hours_needed = seconds_needed / 3600
    gpus_needed = max(1, math.ceil(hours_needed / (30 * 24)))  # ~720 GPU-hours/month each
    self_hosted_cost = gpus_needed * gpu_monthly_cost
    return {
        "api_monthly_cost": round(api_cost, 2),
        "self_hosted_monthly_cost": round(self_hosted_cost, 2),
        "gpus_needed": gpus_needed,
        "savings": round(api_cost - self_hosted_cost, 2),
        "cheaper": "self-hosted" if self_hosted_cost < api_cost else "api",
    }

# Scenario: 1M requests/month, 2K tokens each, using GPT-4o equivalent
result = self_host_vs_api_breakeven(
    monthly_requests=1_000_000,
    avg_tokens_per_request=2000,
    api_price_per_m_tokens=6.25,        # Blended GPT-4o rate
    gpu_monthly_cost=1460,              # A100 80GB on-demand
    gpu_throughput_tokens_per_sec=500,  # Llama 70B on A100, batched
)
print(json.dumps(result, indent=2))

Output:
{
  "api_monthly_cost": 12500.0,
  "self_hosted_monthly_cost": 2920.0,
  "gpus_needed": 2,
  "savings": 9580.0,
  "cheaper": "self-hosted"
}

Rule of thumb: Below ~100K requests/month, APIs are almost always cheaper when you factor in engineering time for infrastructure. Above ~500K requests/month, self-hosting starts to make financial sense — if you have the team to operate it. Note that this calculation assumes GPUs running near full utilization; real deployments need headroom for traffic peaks, which pushes the break-even point higher.
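To see where the raw-infrastructure crossover lands under these assumptions, sweep the volume with the same function (this ignores engineering time, which is why the rule of thumb above is more conservative):
# Sweep monthly volume; same assumptions as the scenario above
for requests in (100_000, 250_000, 500_000, 1_000_000):
    r = self_host_vs_api_breakeven(
        monthly_requests=requests,
        avg_tokens_per_request=2000,
        api_price_per_m_tokens=6.25,
        gpu_monthly_cost=1460,
        gpu_throughput_tokens_per_sec=500,
    )
    print(f"{requests:>9,} req/mo: api=${r['api_monthly_cost']:>9,.0f}  "
          f"self-hosted=${r['self_hosted_monthly_cost']:>7,.0f}  -> {r['cheaper']}")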
6. Running Open Source Models
If you decide to self-host, here are the practical options from simplest to most complex.
Option 1: Ollama (local development and prototyping)
# Install ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run a model
ollama pull llama3.1:70b
ollama run llama3.1:70b "Explain transformers in one paragraph"

Ollama exposes an OpenAI-compatible API by default:
from openai import OpenAI

# Point the OpenAI client at Ollama's local server
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required but unused
)

response = client.chat.completions.create(
    model="llama3.1:70b",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7,
    max_tokens=1024,
)
print(response.choices[0].message.content)

Use for: Local development, testing, demos. Not production.
Option 2: vLLM (production self-hosted inference)
vLLM is the standard for high-throughput LLM inference. It implements PagedAttention for efficient memory management and continuous batching for maximum GPU utilization.
# Install vLLM
pip install vllm

# Start the server (requires GPU)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --port 8000

from openai import OpenAI

# vLLM also exposes an OpenAI-compatible API
client = OpenAI(
    base_url="http://your-server:8000/v1",
    api_key="unused",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain vLLM in 3 sentences."},
    ],
    temperature=0.0,
    max_tokens=512,
)

Use for: Production deployments where you need control over infrastructure, data privacy, or cost optimization at scale.
Option 3: Hosted open source (Together, Fireworks, Groq)
Get the benefits of open-source models without running infrastructure:
from openai import OpenAI

# Together.ai — hosted open-source models
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="your-together-api-key",
)
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7,
    max_tokens=1024,
)

# Groq — optimized for speed
client_groq = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="your-groq-api-key",
)
response = client_groq.chat.completions.create(
    model="llama-3.1-70b-versatile",
    messages=[{"role": "user", "content": "Hello"}],
)

Use for: When you want open-source model quality and pricing without managing GPUs. Good middle ground.
7. Multi-Model Architectures
The most cost-effective production systems do not use a single model. They route requests to different models based on complexity, cost sensitivity, or task type.
The router pattern
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()
anthropic_client = Anthropic()

def classify_complexity(query: str) -> str:
    """Use a cheap model to classify query complexity."""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Classify the complexity of this query as 'simple' or 'complex'. "
                "Simple: factual questions, basic tasks, classification. "
                "Complex: multi-step reasoning, code generation, analysis. "
                "Return ONLY the word 'simple' or 'complex'."
            )},
            {"role": "user", "content": query},
        ],
        temperature=0.0,
        max_tokens=10,
    )
    return response.choices[0].message.content.strip().lower()

def route_and_respond(query: str, system_prompt: str) -> dict:
    """Route query to appropriate model based on complexity."""
    complexity = classify_complexity(query)
    if complexity == "simple":
        # Cheap and fast model for simple queries
        response = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": query},
            ],
            temperature=0.0,
            max_tokens=1024,
        )
        return {
            "model": "gpt-4o-mini",
            "response": response.choices[0].message.content,
            "cost_tier": "low",
        }
    else:
        # Powerful model for complex queries
        response = anthropic_client.messages.create(
            model="claude-sonnet-4-20250514",
            system=system_prompt,
            messages=[{"role": "user", "content": query}],
            temperature=0.0,
            max_tokens=4096,
        )
        return {
            "model": "claude-sonnet-4-20250514",
            "response": response.content[0].text,
            "cost_tier": "high",
        }

# Usage
result = route_and_respond(
    "What is the capital of France?",
    "You are a helpful assistant. Be concise."
)
print(f"Routed to: {result['model']} (cost tier: {result['cost_tier']})")
print(f"Response: {result['response']}")

Cost impact of routing
For a typical application where 70% of queries are simple and 30% are complex (assuming ~2,000 tokens per request at a blended input/output rate):
| Strategy | Cost per 1M requests | Savings |
|---|---|---|
| All GPT-4o | $12,500 | Baseline |
| All Claude Sonnet | $18,000 | -44% (more expensive) |
| Routed (mini + Sonnet) | $5,950 | 52% savings |
| Routed (mini + GPT-4o) | $4,550 | 64% savings |
The classification call itself costs ~$0.0001 per request — negligible compared to the savings from routing.
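The table's arithmetic, roughly: blend the per-request costs by traffic share and add the classifier overhead. The per-request figures below are assumptions derived from the table's $12,500 baseline, not quoted prices, and the results land within a few percent of the table's rounding:
def routed_cost_per_1m(simple_frac: float, cost_simple: float,
                       cost_complex: float, classifier_cost: float = 0.0001) -> float:
    """Blended cost per 1M requests for a simple/complex routing split."""
    per_request = (simple_frac * cost_simple
                   + (1 - simple_frac) * cost_complex
                   + classifier_cost)
    return per_request * 1_000_000

# Assumed per-request costs implied by the table's baseline
mini, gpt4o, sonnet = 0.00075, 0.0125, 0.018
print(f"Routed (mini + Sonnet): ${routed_cost_per_1m(0.7, mini, sonnet):,.0f}")
print(f"Routed (mini + GPT-4o): ${routed_cost_per_1m(0.7, mini, gpt4o):,.0f}")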
Task-specific model selection
MODEL_CONFIG = {
    "classification": {
        "provider": "openai",
        "model": "gpt-4o-mini",
        "temperature": 0.0,
        "max_tokens": 50,
    },
    "extraction": {
        "provider": "openai",
        "model": "gpt-4o-mini",
        "temperature": 0.0,
        "max_tokens": 500,
    },
    "code_generation": {
        "provider": "anthropic",
        "model": "claude-sonnet-4-20250514",
        "temperature": 0.0,
        "max_tokens": 4096,
    },
    "summarization": {
        "provider": "openai",
        "model": "gpt-4o",
        "temperature": 0.3,
        "max_tokens": 1024,
    },
    "reasoning": {
        "provider": "openai",
        "model": "o3",
        "temperature": 1.0,  # o3 requires temperature=1
        "max_tokens": 8192,
    },
}

def call_model(task_type: str, messages: list[dict]) -> str:
    """Call the appropriate model based on task type."""
    config = MODEL_CONFIG[task_type]
    if config["provider"] == "openai":
        client = OpenAI()
        params = {
            "model": config["model"],
            "messages": messages,
            "temperature": config["temperature"],
        }
        # o-series reasoning models take max_completion_tokens, not max_tokens
        if config["model"].startswith("o"):
            params["max_completion_tokens"] = config["max_tokens"]
        else:
            params["max_tokens"] = config["max_tokens"]
        response = client.chat.completions.create(**params)
        return response.choices[0].message.content
    elif config["provider"] == "anthropic":
        client = Anthropic()
        system = next((m["content"] for m in messages if m["role"] == "system"), "")
        user_msgs = [m for m in messages if m["role"] != "system"]
        response = client.messages.create(
            model=config["model"],
            system=system,
            messages=user_msgs,
            temperature=config["temperature"],
            max_tokens=config["max_tokens"],
        )
        return response.content[0].text

8. Real Cost Comparison: 1M Requests/Month
Let us make this concrete. Assume a typical application with:
- Average input: 1,500 tokens (system prompt + user query)
- Average output: 500 tokens
- 1,000,000 requests per month
def monthly_cost(
    input_tokens: int,
    output_tokens: int,
    requests: int,
    input_price: float,   # per 1M tokens
    output_price: float,  # per 1M tokens
) -> float:
    total_input = input_tokens * requests
    total_output = output_tokens * requests
    return (
        (total_input / 1_000_000) * input_price +
        (total_output / 1_000_000) * output_price
    )

scenarios = [
    ("GPT-4o", 2.50, 10.00),
    ("GPT-4o mini", 0.15, 0.60),
    ("Claude Sonnet", 3.00, 15.00),
    ("Claude Haiku", 0.80, 4.00),
    ("Gemini 2.0 Flash", 0.10, 0.40),
    ("Mistral Small", 0.10, 0.30),
    ("Llama 70B (Together)", 0.88, 0.88),
    ("Llama 8B (Together)", 0.18, 0.18),
]

print(f"{'Model':<25} {'Monthly cost':>14} {'Per request':>12}")
print("-" * 53)
for name, inp, out in scenarios:
    cost = monthly_cost(1500, 500, 1_000_000, inp, out)
    per_req = cost / 1_000_000
    print(f"{name:<25} ${cost:>12,.2f} ${per_req:>10.6f}")

Output:
Model                       Monthly cost  Per request
-----------------------------------------------------
GPT-4o                    $    8,750.00 $  0.008750
GPT-4o mini               $      525.00 $  0.000525
Claude Sonnet             $   12,000.00 $  0.012000
Claude Haiku              $    3,200.00 $  0.003200
Gemini 2.0 Flash          $      350.00 $  0.000350
Mistral Small             $      300.00 $  0.000300
Llama 70B (Together)      $    1,760.00 $  0.001760
Llama 8B (Together)       $      360.00 $  0.000360

The difference between Claude Sonnet at $12,000/month and Mistral Small at $300/month is 40x. If Mistral Small achieves 85% of the quality you need, that is $140,400/year in savings.
9. Provider Abstraction — Swapping Models with Minimal Code Changes
Build your application so you can switch providers in minutes, not days.
The adapter pattern
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class LLMResponse:
    content: str
    input_tokens: int
    output_tokens: int
    model: str
    latency_ms: float

class LLMProvider(ABC):
    @abstractmethod
    def complete(
        self,
        messages: list[dict],
        temperature: float = 0.0,
        max_tokens: int = 1024,
    ) -> LLMResponse:
        pass

class OpenAIProvider(LLMProvider):
    def __init__(self, model: str = "gpt-4o"):
        from openai import OpenAI
        self.client = OpenAI()
        self.model = model

    def complete(self, messages, temperature=0.0, max_tokens=1024):
        import time
        start = time.time()
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
        )
        latency = (time.time() - start) * 1000
        return LLMResponse(
            content=response.choices[0].message.content,
            input_tokens=response.usage.prompt_tokens,
            output_tokens=response.usage.completion_tokens,
            model=self.model,
            latency_ms=latency,
        )

class AnthropicProvider(LLMProvider):
    def __init__(self, model: str = "claude-sonnet-4-20250514"):
        from anthropic import Anthropic
        self.client = Anthropic()
        self.model = model

    def complete(self, messages, temperature=0.0, max_tokens=1024):
        import time
        start = time.time()
        system = ""
        user_messages = []
        for m in messages:
            if m["role"] == "system":
                system = m["content"]
            else:
                user_messages.append(m)
        response = self.client.messages.create(
            model=self.model,
            system=system,
            messages=user_messages,
            temperature=temperature,
            max_tokens=max_tokens,
        )
        latency = (time.time() - start) * 1000
        return LLMResponse(
            content=response.content[0].text,
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            model=self.model,
            latency_ms=latency,
        )

class OllamaProvider(LLMProvider):
    def __init__(self, model: str = "llama3.1:70b"):
        from openai import OpenAI
        self.client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
        self.model = model

    def complete(self, messages, temperature=0.0, max_tokens=1024):
        import time
        start = time.time()
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
        )
        latency = (time.time() - start) * 1000
        return LLMResponse(
            content=response.choices[0].message.content,
            input_tokens=response.usage.prompt_tokens if response.usage else 0,
            output_tokens=response.usage.completion_tokens if response.usage else 0,
            model=self.model,
            latency_ms=latency,
        )

# Usage — swap providers by changing one line
provider = OpenAIProvider("gpt-4o-mini")
# provider = AnthropicProvider("claude-sonnet-4-20250514")
# provider = OllamaProvider("llama3.1:70b")

result = provider.complete(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain DNS in one sentence."},
    ],
    temperature=0.0,
)
print(f"Model: {result.model}")
print(f"Response: {result.content}")
print(f"Tokens: {result.input_tokens} in, {result.output_tokens} out")
print(f"Latency: {result.latency_ms:.0f}ms")

This pattern lets you:
- A/B test models in production
- Fall back to a secondary provider if the primary is down (sketched below)
- Switch models without touching business logic
- Log and compare costs across providers
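For example, the fallback bullet needs only a thin wrapper over the LLMProvider interface. A sketch, catching broadly since exception types vary by SDK:
class FallbackProvider(LLMProvider):
    """Try the primary provider; fall back to the secondary on any error."""

    def __init__(self, primary: LLMProvider, secondary: LLMProvider):
        self.primary = primary
        self.secondary = secondary

    def complete(self, messages, temperature=0.0, max_tokens=1024):
        try:
            return self.primary.complete(messages, temperature, max_tokens)
        except Exception:
            # Primary failed (outage, rate limit, timeout), so use the backup
            return self.secondary.complete(messages, temperature, max_tokens)

provider = FallbackProvider(
    OpenAIProvider("gpt-4o-mini"),
    AnthropicProvider("claude-3-5-haiku-20241022"),
)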
10. Model Selection Checklist
Use this checklist when selecting a model for a new feature or project:
## Model Selection Checklist
### Requirements
- [ ] What is the task? (classification, generation, extraction, reasoning, code)
- [ ] What is the acceptable latency? (< 1s, < 5s, < 30s, doesn't matter)
- [ ] What is the budget per request? (< $0.001, < $0.01, < $0.10)
- [ ] What is the expected monthly volume? (< 10K, < 100K, < 1M, > 1M)
- [ ] What context length is needed? (< 4K, < 32K, < 128K, > 128K)
### Constraints
- [ ] Does data need to stay on-premise? (eliminates API providers)
- [ ] Are there compliance requirements? (SOC2, HIPAA, GDPR)
- [ ] Is the team able to operate self-hosted infrastructure?
- [ ] Is there an existing provider relationship or commitment?
### Evaluation
- [ ] Created 20+ test cases from real production data
- [ ] Tested at least 3 candidate models
- [ ] Measured: quality score, latency, cost per request
- [ ] Tested edge cases: long inputs, adversarial inputs, multilingual
- [ ] Confirmed output format reliability (JSON parsing success rate)
### Production Readiness
- [ ] Provider abstraction layer implemented (can swap models)
- [ ] Fallback model configured for primary provider outages
- [ ] Cost monitoring and alerting in place
- [ ] Rate limit handling with exponential backoff
- [ ] Model version pinned (not using "latest" alias in production)

Key Takeaways
- No single model wins everything. GPT-4o is the safe default with the best ecosystem. Claude excels at code and long-context tasks. Open-source models win on privacy and cost at scale. Choose based on your specific constraints.
- Benchmarks are directional, not decisive. Use them to shortlist candidates, then run your own evaluation on your actual data. Benchmark gaming is real and widespread.
- Cost differences are enormous. There is a 40x cost difference between premium and budget models in the comparison above, and well over 100x across the full pricing table. Many production tasks work fine with cheaper models. Always test the cheapest viable option first.
- Multi-model architectures save money. Route simple queries to cheap models and complex queries to expensive ones. A classifier costs fractions of a cent and can save 50-60% on total LLM spend.
- Self-hosting makes sense above ~500K requests/month if you have the team to operate it. Below that, API providers are almost always cheaper when you factor in engineering time.
- Build a provider abstraction layer from day one. You will switch models. You will add fallbacks. You will A/B test. Make it easy on yourself.
- Pin model versions in production. Models get updated. “gpt-4o” today might behave differently next month. Use specific version identifiers like “gpt-4o-2024-08-06” for reproducibility.
- The open-source ecosystem is production-ready. vLLM, Ollama, and hosted providers like Together.ai make it practical to run Llama and Mistral models at scale with minimal operational overhead.
- Your model choice is not permanent. The best model today will not be the best model in six months. Design your system so switching is a configuration change, not a rewrite.