In Lesson 3 you learned the fundamentals — system prompts, output formatting, role patterns. Those get you 80% of the way. This lesson covers the techniques that get you the remaining 20%, which is often the difference between “demo that works sometimes” and “production system your team trusts.” Chain of thought, few-shot learning, self-consistency, tree of thought, and the ReAct pattern are not academic novelties. They are engineering tools with measurable impact on output quality.
Zero-Shot vs Few-Shot Prompting
Zero-shot means you give the model a task with no examples. You rely entirely on the model’s training to figure out what you want.
Few-shot means you provide examples of the input-output mapping before asking the actual question. The model pattern-matches on your examples.
from openai import OpenAI
client = OpenAI()
# Zero-shot: no examples, just the task
zero_shot_prompt = """Classify the following customer message as one of:
billing, technical, account, general.
Message: "I can't log into my dashboard after resetting my password."
Category:"""
# Few-shot: provide examples first
few_shot_prompt = """Classify the following customer message as one of:
billing, technical, account, general.
Message: "My invoice shows the wrong amount for March."
Category: billing
Message: "The API returns 500 errors when I send batch requests."
Category: technical
Message: "I need to update the email address on my account."
Category: account
Message: "I can't log into my dashboard after resetting my password."
Category:"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": few_shot_prompt}],
temperature=0,
max_tokens=10,
)
print(response.choices[0].message.content)
# Output: account

Zero-shot works when the task is obvious and the model has seen similar patterns in training. Few-shot works when you need consistent formatting, edge-case handling, or domain-specific behavior.
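For comparison, here is the same call with the zero-shot prompt (a minimal sketch reusing the client and zero_shot_prompt defined above; the commented output is representative, not guaranteed):

```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": zero_shot_prompt}],
    temperature=0,
    max_tokens=10,
)
print(response.choices[0].message.content)
# Without examples to anchor the decision, an ambiguous message like
# this one can land in either "technical" or "account" from run to run
```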
How Many Examples Do You Need?
| Examples | When to Use | Tradeoff |
|---|---|---|
| 0 (zero-shot) | Simple, well-known tasks | Lowest cost, fastest |
| 1-2 | Format demonstration | Minimal token overhead |
| 3-5 | Classification, extraction | Good balance of quality vs cost |
| 5-10 | Complex or ambiguous tasks | Higher cost, diminishing returns |
| 10+ | Rarely worth it | Use fine-tuning instead |
The sweet spot for most production tasks is 3-5 examples. Beyond that, you are paying for tokens without getting proportional quality improvement. If you need 10+ examples to get the behavior right, fine-tuning is probably the better path.
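You can quantify that token overhead before a prompt ever reaches the API. A minimal sketch using the tiktoken library (recent tiktoken releases map gpt-4o to the o200k_base encoding; the example strings are illustrative):

```python
import tiktoken

# Look up the tokenizer gpt-4o uses
enc = tiktoken.encoding_for_model("gpt-4o")

examples = [
    'Message: "My invoice shows the wrong amount for March."\nCategory: billing',
    'Message: "The API returns 500 errors on batch requests."\nCategory: technical',
]
# Tokens these examples add to every single request
overhead = sum(len(enc.encode(ex)) for ex in examples)
print(f"Few-shot overhead: {overhead} tokens per request")
```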
Example Selection Strategies
Not all examples are equal. The examples you choose have a measurable impact on output quality.
from collections import defaultdict

def select_diverse_examples(query: str, example_pool: list, n: int = 5) -> list:
    """Select examples that cover different categories and edge cases.

    Strategies:
    1. Category coverage — at least one example per output class
    2. Similarity-based — examples semantically close to the query
    3. Difficulty-based — include at least one edge case
    """
# Group examples by their output category
by_category = defaultdict(list)
for ex in example_pool:
by_category[ex["category"]].append(ex)
selected = []
# First: one from each category for coverage
for category, examples in by_category.items():
if len(selected) >= n:
break
selected.append(examples[0])
# Then: fill remaining slots with examples most similar to the query
# (In production, you'd use embeddings for similarity)
remaining = [ex for ex in example_pool if ex not in selected]
for ex in remaining[:n - len(selected)]:
selected.append(ex)
    return selected

Key insight: Static examples are fine for prototypes. In production, dynamically selecting examples based on the input query (using embedding similarity) consistently outperforms fixed example sets; the DynamicFewShotSelector class later in this lesson implements exactly that.
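A quick usage sketch (the pool entries are illustrative; each needs the category key the function groups on):

```python
example_pool = [
    {"input": "Refund my last invoice", "category": "billing"},
    {"input": "API returns 429 errors", "category": "technical"},
    {"input": "Change my login email", "category": "account"},
    {"input": "What are your business hours?", "category": "general"},
    {"input": "I was charged twice this month", "category": "billing"},
]

selected = select_diverse_examples("Why was my card declined?", example_pool, n=4)
# One example per category is guaranteed before remaining slots are filled
for ex in selected:
    print(ex["category"], "->", ex["input"])
```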
Chain of Thought (CoT)
Chain of Thought prompting tells the model to show its reasoning before giving the final answer. It sounds trivially simple — and it is. That’s why it’s powerful.
Why CoT Works
LLMs generate text token by token. Each token is conditioned on everything that came before. When you force the model to output intermediate reasoning steps, those steps become additional context that guides the final answer. The model literally thinks better when it writes its thoughts down.
Without CoT, the model has to commit to an answer within its first few output tokens. With CoT, every reasoning token it writes feeds back in as context for the next forward pass, so the model effectively spends more compute on the problem before committing to an answer.
The Magic Phrase: “Let’s Think Step by Step”
The zero-shot CoT paper showed that simply appending “Let’s think step by step” to a prompt improved accuracy on math problems by 40-70%. Here is a comparison:
# Without CoT — model jumps to answer
prompt_direct = """A store has 45 apples. They sell 3/5 of them in the morning
and 1/3 of the remainder in the afternoon. How many apples are left?
Answer:"""
# With CoT — model shows its work
prompt_cot = """A store has 45 apples. They sell 3/5 of them in the morning
and 1/3 of the remainder in the afternoon. How many apples are left?
Let's think step by step."""

But in production, you want something more structured than “let’s think step by step.” You want explicit reasoning frameworks.
Structured CoT for Production
import json
from openai import OpenAI
client = OpenAI()
def analyze_with_cot(question: str) -> dict:
"""Use structured CoT to get both reasoning and a final answer."""
system_prompt = """You are an analytical assistant. For every question:
1. IDENTIFY the key components of the problem
2. ANALYZE each component systematically
3. REASON through the relationships between components
4. CONCLUDE with a clear, specific answer
Always output valid JSON with this structure:
{
"reasoning_steps": [
{"step": 1, "description": "...", "analysis": "..."},
{"step": 2, "description": "...", "analysis": "..."}
],
"conclusion": "...",
"confidence": "high|medium|low",
"assumptions": ["..."]
}"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": question},
],
temperature=0,
response_format={"type": "json_object"},
)
return json.loads(response.choices[0].message.content)
result = analyze_with_cot(
"Should we migrate our auth service from a monolith to a microservice? "
"We have 50K daily active users and a team of 4 engineers."
)
for step in result["reasoning_steps"]:
print(f"Step {step['step']}: {step['description']}")
print(f" -> {step['analysis']}")
print(f"\nConclusion: {result['conclusion']}")
print(f"Confidence: {result['confidence']}")The JSON structure is the key production pattern here. You get auditable reasoning (for logging and debugging), a clean final answer (for your application logic), and a confidence signal (for fallback decisions).
CoT with Anthropic’s Claude
Claude responds especially well to structured reasoning prompts. Here is the equivalent pattern using the Anthropic SDK:
import anthropic
import json
client = anthropic.Anthropic()
def claude_cot_analysis(question: str) -> dict:
"""Structured CoT analysis using Claude."""
message = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=2048,
system="""Analyze questions using structured reasoning. Output JSON:
{
"reasoning_steps": [{"step": N, "thought": "...", "evidence": "..."}],
"answer": "...",
"confidence": "high|medium|low"
}""",
messages=[
{"role": "user", "content": question}
],
)
    return json.loads(message.content[0].text)

Note: Claude can occasionally wrap the JSON in explanatory prose; production code should extract the JSON substring (or retry) before parsing.

Few-Shot Learning: Teaching by Demonstration
Few-shot learning is the most reliable way to control output format and handle edge cases. Instead of describing what you want, you show what you want.
The Anatomy of a Good Few-Shot Example
Each example should demonstrate exactly one pattern. Include:
- The input — representative of real queries
- The output — in the exact format you expect
- At least one edge case — show how to handle tricky inputs
def build_extraction_prompt(text: str) -> str:
"""Extract structured data from unstructured text using few-shot examples."""
examples = """
Extract company information from the text. Return JSON.
Text: "Acme Corp, founded in 2019 by Jane Smith, raised $5M in Series A.
They have 45 employees in San Francisco."
Result: {"company": "Acme Corp", "founded": 2019, "founder": "Jane Smith",
"funding": "$5M", "funding_round": "Series A", "employees": 45,
"location": "San Francisco"}
Text: "TechStartup is a pre-revenue company with 3 cofounders."
Result: {"company": "TechStartup", "founded": null, "founder": null,
"funding": null, "funding_round": null, "employees": null,
"location": null, "notes": "pre-revenue, 3 cofounders"}
Text: "BigCo (NYSE: BIG) reported $2.3B revenue in Q4 2025.
Headquartered in Austin with over 10,000 employees globally."
Result: {"company": "BigCo", "founded": null, "founder": null,
"funding": null, "funding_round": null, "employees": 10000,
"location": "Austin", "notes": "NYSE: BIG, $2.3B Q4 2025 revenue"}
"""
return f"""{examples}
Text: "{text}"
Result:"""

Notice the second example — it handles missing data (nulls) and adds a notes field for information that does not fit the schema. That single edge-case example prevents a whole class of errors in production.
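To close the loop, here is a sketch of sending the built prompt through the same client pattern used earlier (the input text is illustrative; since response_format is not set here, production code should wrap the json.loads in error handling):

```python
import json
from openai import OpenAI

client = OpenAI()

prompt = build_extraction_prompt(
    "DataFlow Inc raised $12M in Series B and moved its 80-person team to Denver."
)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
# The few-shot examples anchor the output format, so this should parse,
# but a malformed response would raise json.JSONDecodeError
extracted = json.loads(response.choices[0].message.content)
print(extracted["company"], extracted["location"])
```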
Dynamic Few-Shot Selection
Static examples work for simple tasks. For production systems handling diverse queries, select examples dynamically based on the input:
import numpy as np
from openai import OpenAI
client = OpenAI()
class DynamicFewShotSelector:
"""Select the most relevant few-shot examples using embedding similarity."""
def __init__(self, examples: list[dict]):
self.examples = examples
self.embeddings = self._embed_examples()
def _embed_examples(self) -> list[list[float]]:
"""Pre-compute embeddings for all examples."""
texts = [ex["input"] for ex in self.examples]
response = client.embeddings.create(
model="text-embedding-3-small",
input=texts,
)
return [item.embedding for item in response.data]
def select(self, query: str, n: int = 3) -> list[dict]:
"""Find the n most similar examples to the query."""
query_response = client.embeddings.create(
model="text-embedding-3-small",
input=query,
)
query_embedding = query_response.data[0].embedding
# Cosine similarity
similarities = []
for i, emb in enumerate(self.embeddings):
dot = np.dot(query_embedding, emb)
norm = np.linalg.norm(query_embedding) * np.linalg.norm(emb)
similarities.append((i, dot / norm))
similarities.sort(key=lambda x: x[1], reverse=True)
return [self.examples[i] for i, _ in similarities[:n]]
# Usage
example_pool = [
{"input": "Cancel my subscription", "output": "account", "reasoning": "subscription management"},
{"input": "API returns 403 forbidden", "output": "technical", "reasoning": "API error"},
{"input": "Wrong charge on my card", "output": "billing", "reasoning": "payment dispute"},
{"input": "How do I export my data?", "output": "technical", "reasoning": "feature question"},
{"input": "Update my company name", "output": "account", "reasoning": "account modification"},
{"input": "Refund for last month", "output": "billing", "reasoning": "refund request"},
]
selector = DynamicFewShotSelector(example_pool)
relevant_examples = selector.select("I was double-charged yesterday", n=3)
# Returns the billing-related examples, which are most relevant

Self-Consistency: Multiple Samples, Majority Vote
Self-consistency is the simplest way to improve accuracy on reasoning tasks. The idea: ask the model the same question multiple times with temperature > 0, then take the majority answer.
Why It Works
LLMs are probabilistic. On any single run, the model might take a wrong reasoning path. But if you run it 5 times and 4 out of 5 arrive at the same answer, that answer is very likely correct. The incorrect paths are random; the correct path is consistent.
Implementation
import json
from collections import Counter
from openai import OpenAI
client = OpenAI()
def self_consistent_answer(
question: str,
n_samples: int = 5,
model: str = "gpt-4o",
) -> dict:
"""Run multiple CoT samples and return the consensus answer."""
system = """Solve the problem step by step.
At the end, output your final answer on a new line in the format:
FINAL_ANSWER: <your answer>"""
# Generate multiple reasoning paths
response = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": system},
{"role": "user", "content": question},
],
temperature=0.7, # Must be > 0 for diversity
n=n_samples, # Request multiple completions in one API call
)
# Extract final answers from each path
answers = []
reasoning_paths = []
for choice in response.choices:
text = choice.message.content
reasoning_paths.append(text)
# Extract the final answer
for line in text.strip().split("\n"):
if line.startswith("FINAL_ANSWER:"):
answer = line.replace("FINAL_ANSWER:", "").strip()
answers.append(answer)
break
# Majority vote
if not answers:
return {"answer": None, "confidence": 0, "error": "No answers extracted"}
counter = Counter(answers)
most_common_answer, count = counter.most_common(1)[0]
confidence = count / len(answers)
return {
"answer": most_common_answer,
"confidence": confidence,
"vote_distribution": dict(counter),
"n_samples": len(answers),
"reasoning_paths": reasoning_paths,
}
result = self_consistent_answer(
"A bat and a ball cost $1.10 in total. The bat costs $1.00 more than "
"the ball. How much does the ball cost?"
)
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence']:.0%}")
print(f"Votes: {result['vote_distribution']}")
# Answer: $0.05
# Confidence: 100%
# Votes: {"$0.05": 5}

Cost and Latency Tradeoffs
Self-consistency multiplies your costs linearly. Five samples means 5x the tokens.
| Samples | Cost Multiplier | Typical Accuracy Gain | When to Use |
|---|---|---|---|
| 1 | 1x | Baseline | Default for most tasks |
| 3 | 3x | +5-10% on reasoning | Math, logic, classification |
| 5 | 5x | +8-15% on reasoning | High-stakes decisions |
| 10+ | 10x+ | Diminishing returns | Rarely justified |
Production tip: Use the n parameter instead of making separate API calls. It is faster (a single round trip) and cheaper (input tokens are billed once per request, not once per sample).
Tree of Thought (ToT)
Tree of Thought extends CoT by exploring multiple reasoning branches explicitly. Instead of a single chain, you generate several possible next steps at each stage, evaluate them, and only continue the promising ones.
When ToT Helps
ToT is overkill for most production tasks. Use it when:
- The problem has multiple valid approaches
- Early decisions constrain later options (planning, strategy)
- You need to compare tradeoffs between approaches
- The answer space is large and one wrong step cascades
Implementation
from openai import OpenAI
client = OpenAI()
def tree_of_thought(problem: str, breadth: int = 3, depth: int = 3) -> dict:
"""Explore multiple reasoning branches and select the best path."""
def generate_thoughts(context: str, step: int) -> list[str]:
"""Generate multiple possible next steps."""
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": (
f"You are solving a problem step by step. "
f"This is step {step}. Generate {breadth} different "
f"possible next steps. Output each on a separate line "
f"prefixed with a number."
)},
{"role": "user", "content": context},
],
temperature=0.8,
)
text = response.choices[0].message.content
thoughts = [
line.strip()
for line in text.split("\n")
if line.strip() and line.strip()[0].isdigit()
]
return thoughts[:breadth]
def evaluate_thought(problem: str, path: list[str], thought: str) -> float:
"""Score a thought on how promising it is (0-1)."""
context = f"Problem: {problem}\nSteps so far: {path}\nNext step: {thought}"
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": (
"Rate how promising this reasoning step is for solving "
"the problem. Output only a number between 0.0 and 1.0."
)},
{"role": "user", "content": context},
],
temperature=0,
max_tokens=10,
)
try:
return float(response.choices[0].message.content.strip())
except ValueError:
return 0.5
# Explore the tree
best_path = []
current_context = f"Problem: {problem}"
for step in range(1, depth + 1):
thoughts = generate_thoughts(current_context, step)
# Score each thought
scored = []
for thought in thoughts:
score = evaluate_thought(problem, best_path, thought)
scored.append((thought, score))
# Pick the best thought
scored.sort(key=lambda x: x[1], reverse=True)
best_thought = scored[0][0]
best_path.append(best_thought)
current_context += f"\nStep {step}: {best_thought}"
# Generate final answer from the best path
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "Synthesize the reasoning into a final answer."},
{"role": "user", "content": (
f"Problem: {problem}\n\n"
f"Reasoning path:\n" +
"\n".join(f"Step {i+1}: {s}" for i, s in enumerate(best_path))
)},
],
temperature=0,
)
return {
"reasoning_path": best_path,
"answer": response.choices[0].message.content,
    }

Caution: ToT is expensive. With this implementation, a 3-wide, 3-deep tree makes roughly 3 x (1 + 3) + 1 = 13 API calls: one generation call plus three evaluation calls per level, plus a final synthesis call. Use it only when accuracy justifies the cost.
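A usage sketch (the problem statement is illustrative):

```python
result = tree_of_thought(
    "Our API's p99 latency doubled after last week's deploy. "
    "Plan an investigation strategy.",
    breadth=3,
    depth=3,
)
for i, step in enumerate(result["reasoning_path"], 1):
    print(f"Step {i}: {step}")
print(f"\nAnswer: {result['answer']}")
```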
ReAct Pattern: Reasoning + Acting
ReAct interleaves reasoning (thinking about what to do) with acting (actually doing it — calling tools, querying databases, making API calls). This is the foundation of modern AI agents.
The ReAct Loop
Thought: I need to find the user's account status
Action: query_database(user_id=12345)
Observation: {"status": "active", "plan": "pro", "created": "2024-01-15"}
Thought: The user is active on Pro. Now I need to check their billing history
Action: get_billing_history(user_id=12345, last_n=3)
Observation: [{"date": "2025-03", "amount": 49.99}, ...]
Thought: I have all the info I need to answer the question
Answer: Your account is active on the Pro plan...

Implementation
import json
from openai import OpenAI
client = OpenAI()
# Define available tools
TOOLS = {
"search_docs": {
"description": "Search the knowledge base for relevant documents",
"parameters": {"query": "string"},
},
"get_user": {
"description": "Get user details by ID",
"parameters": {"user_id": "integer"},
},
"calculate": {
"description": "Evaluate a mathematical expression",
"parameters": {"expression": "string"},
},
}
def execute_tool(name: str, args: dict) -> str:
"""Execute a tool and return the result. Replace with real implementations."""
if name == "search_docs":
return json.dumps({"results": [
{"title": "Refund Policy", "content": "Full refund within 30 days..."}
]})
elif name == "get_user":
return json.dumps({"name": "Alice", "plan": "pro", "status": "active"})
elif name == "calculate":
return str(eval(args["expression"])) # Use a safe evaluator in production
return json.dumps({"error": f"Unknown tool: {name}"})
def react_loop(question: str, max_steps: int = 5) -> str:
"""Run a ReAct loop: interleave reasoning and tool use."""
tool_descriptions = "\n".join(
f"- {name}: {info['description']} (params: {info['parameters']})"
for name, info in TOOLS.items()
)
system = f"""You solve problems by reasoning and using tools.
Available tools:
{tool_descriptions}
At each step, output exactly one of:
THOUGHT: <your reasoning>
ACTION: <tool_name>({{"param": "value"}})
ANSWER: <final answer to the user>
Rules:
- Always THINK before acting
- After an observation, THINK about what it means
- When you have enough information, output ANSWER"""
messages = [
{"role": "system", "content": system},
{"role": "user", "content": question},
]
for step in range(max_steps):
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
temperature=0,
max_tokens=500,
)
output = response.choices[0].message.content.strip()
messages.append({"role": "assistant", "content": output})
# Check if we have a final answer
if output.startswith("ANSWER:"):
return output.replace("ANSWER:", "").strip()
# If there's an action, execute it
if "ACTION:" in output:
action_line = [l for l in output.split("\n") if l.startswith("ACTION:")][0]
action_str = action_line.replace("ACTION:", "").strip()
# Parse tool name and arguments
tool_name = action_str.split("(")[0].strip()
args_str = action_str.split("(", 1)[1].rsplit(")", 1)[0]
args = json.loads(args_str)
# Execute and feed back observation
result = execute_tool(tool_name, args)
observation = f"OBSERVATION: {result}"
messages.append({"role": "user", "content": observation})
return "I wasn't able to find a complete answer within the step limit."
answer = react_loop("What is the refund policy for user 42?")
print(answer)

The ReAct pattern is covered in depth in Lesson 13 (Building AI Agents with Tool Use). The key point here is that it combines CoT reasoning with real-world actions — the model reasons about what information it needs, fetches it, and reasons again.
Structured Output with CoT
One of the most common production problems: you want the model to reason (CoT) but also return structured data (JSON). Here is how to get both.
The Two-Pass Pattern
import json
from openai import OpenAI
client = OpenAI()
def analyze_and_extract(text: str) -> dict:
"""Two-pass: first reason, then extract structured data."""
# Pass 1: Reason about the text
reasoning_response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": (
"Analyze this customer support message. Think about: "
"1) What is the customer's problem? "
"2) How urgent is it? "
"3) What department should handle it? "
"4) What's the sentiment?"
)},
{"role": "user", "content": text},
],
temperature=0,
)
reasoning = reasoning_response.choices[0].message.content
# Pass 2: Extract structured data informed by the reasoning
extraction_response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": (
"Based on the analysis, extract structured data. "
"Output valid JSON only."
)},
{"role": "user", "content": f"Original message: {text}\n\nAnalysis: {reasoning}"},
],
temperature=0,
response_format={"type": "json_object"},
)
result = json.loads(extraction_response.choices[0].message.content)
result["_reasoning"] = reasoning # Attach reasoning for debugging
    return result

The Single-Pass Pattern (Preferred)
The two-pass pattern doubles your latency and cost. Often you can get the same result in one call by structuring the output format:
def analyze_and_extract_single_pass(text: str) -> dict:
"""Single pass: reason inside the JSON structure itself."""
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": """Analyze the customer message.
Output JSON with this exact structure:
{
"reasoning": {
"problem_identified": "what the customer's issue is",
"urgency_analysis": "why this urgency level",
"routing_logic": "why this department"
},
"result": {
"category": "billing|technical|account|general",
"urgency": "low|medium|high|critical",
"department": "string",
"sentiment": "positive|neutral|negative|angry",
"summary": "one sentence summary"
}
}"""},
{"role": "user", "content": text},
],
temperature=0,
response_format={"type": "json_object"},
)
return json.loads(response.choices[0].message.content)
result = analyze_and_extract_single_pass(
"This is the THIRD time I've asked about my refund. It's been 45 days. "
"Your policy says 30 days. I want to speak to a manager NOW."
)
print(json.dumps(result, indent=2))
# {
# "reasoning": {
# "problem_identified": "Customer has been waiting 45 days for a refund...",
# "urgency_analysis": "High urgency — repeat contact, policy violation...",
# "routing_logic": "Billing team with escalation to management..."
# },
# "result": {
# "category": "billing",
# "urgency": "critical",
# "department": "billing-escalations",
# "sentiment": "angry",
# "summary": "Repeat request for overdue refund, requesting manager."
# }
# }

The reasoning lives inside the JSON. You get auditability, structured output, and a single API call. This is the pattern to use in production.
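One caveat: response_format={"type": "json_object"} guarantees syntactically valid JSON, not your schema. A minimal validation sketch (the key sets mirror the schema above; a library like pydantic would be the sturdier production choice):

```python
VALID_CATEGORIES = {"billing", "technical", "account", "general"}
REQUIRED_RESULT_KEYS = {"category", "urgency", "department", "sentiment", "summary"}

def validate_analysis(result: dict) -> dict:
    """Fail fast if the single-pass output drifts from the expected schema."""
    for key in ("reasoning", "result"):
        if key not in result:
            raise ValueError(f"Missing top-level key: {key}")
    missing = REQUIRED_RESULT_KEYS - set(result["result"])
    if missing:
        raise ValueError(f"Missing result keys: {missing}")
    if result["result"]["category"] not in VALID_CATEGORIES:
        raise ValueError(f"Unexpected category: {result['result']['category']}")
    return result

validated = validate_analysis(analyze_and_extract_single_pass("Where is my refund?"))
```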
When Each Technique Helps (and When It Hurts)
| Technique | Best For | Accuracy Gain | Latency Impact | Cost Impact |
|---|---|---|---|---|
| Zero-shot | Simple, well-defined tasks | Baseline | None | None |
| Few-shot (3-5 examples) | Format consistency, classification | +10-20% | +5-15% (more tokens) | +5-15% |
| CoT | Math, logic, multi-step reasoning | +20-40% | +30-100% (more output) | +30-100% |
| Self-consistency (n=5) | High-stakes reasoning | +5-15% on top of CoT | ~1x if sampled in parallel | 5x |
| Tree of Thought | Complex planning, strategy | +5-20% on top of CoT | 10-20x | 10-20x |
| ReAct | Tasks needing external data | Enables new capabilities | Variable | Variable |
When Each Technique Hurts
CoT hurts when:
- The task is simple (adding reasoning to “translate this word” adds cost without benefit)
- Latency is critical (CoT generates 3-10x more tokens)
- The model is small (small models generate plausible-sounding but wrong reasoning)
Few-shot hurts when:
- Your examples are misleading or unrepresentative
- You are near the context window limit (examples eat tokens)
- The task changes frequently (maintaining example sets is overhead)
Self-consistency hurts when:
- The task is generative, not convergent (creative writing has no “correct” answer)
- Budget is tight (5x cost is real money at scale)
- Latency SLA is strict (even with parallel calls, you wait for the slowest one)
Production Patterns: Combining Techniques
In production, you rarely use one technique in isolation. Here is a pattern that combines several techniques with a fallback chain:
import json
from openai import OpenAI
from collections import Counter
client = OpenAI()
class ProductionPromptChain:
"""Combine multiple prompting techniques with fallbacks."""
def __init__(self, model: str = "gpt-4o"):
self.model = model
self.fast_model = "gpt-4o-mini" # For cheaper initial attempts
def classify_with_fallback(self, text: str, categories: list[str]) -> dict:
"""
Tiered approach:
1. Try fast model with few-shot (cheap, fast)
2. If confidence is low, try full model with CoT
3. If still uncertain, use self-consistency
"""
# Tier 1: Fast model + few-shot
tier1_result = self._few_shot_classify(text, categories, self.fast_model)
if tier1_result["confidence"] >= 0.9:
tier1_result["tier"] = 1
return tier1_result
# Tier 2: Full model + CoT
tier2_result = self._cot_classify(text, categories, self.model)
if tier2_result["confidence"] >= 0.8:
tier2_result["tier"] = 2
return tier2_result
# Tier 3: Self-consistency with CoT
tier3_result = self._self_consistent_classify(
text, categories, self.model, n=5
)
tier3_result["tier"] = 3
return tier3_result
def _few_shot_classify(self, text: str, categories: list, model: str) -> dict:
"""Simple few-shot classification."""
prompt = f"""Classify the text into one of: {', '.join(categories)}.
Text: "My payment failed but I was still charged"
Category: billing
Confidence: 0.95
Text: "The API keeps timing out"
Category: technical
Confidence: 0.92
Text: "{text}"
Category:"""
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
temperature=0,
max_tokens=50,
)
output = response.choices[0].message.content.strip()
lines = output.split("\n")
category = lines[0].strip()
confidence = 0.5
for line in lines:
if "Confidence:" in line:
try:
confidence = float(line.split(":")[-1].strip())
except ValueError:
pass
return {"category": category, "confidence": confidence}
def _cot_classify(self, text: str, categories: list, model: str) -> dict:
"""Classification with chain-of-thought reasoning."""
response = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": f"""Classify text into: {', '.join(categories)}.
Think through your reasoning:
1. What is the main topic?
2. What keywords indicate the category?
3. Are there ambiguities?
Then output JSON: {{"reasoning": "...", "category": "...", "confidence": 0.0-1.0}}"""},
{"role": "user", "content": text},
],
temperature=0,
response_format={"type": "json_object"},
)
return json.loads(response.choices[0].message.content)
def _self_consistent_classify(
self, text: str, categories: list, model: str, n: int = 5
) -> dict:
"""Self-consistency: multiple CoT paths + majority vote."""
response = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": f"""Classify into: {', '.join(categories)}.
Think step by step. End with: CATEGORY: <category>"""},
{"role": "user", "content": text},
],
temperature=0.7,
n=n,
)
answers = []
for choice in response.choices:
text_out = choice.message.content
for line in text_out.split("\n"):
if line.startswith("CATEGORY:"):
answers.append(line.replace("CATEGORY:", "").strip().lower())
break
if not answers:
return {"category": "unknown", "confidence": 0.0}
counter = Counter(answers)
best, count = counter.most_common(1)[0]
return {
"category": best,
"confidence": count / len(answers),
"votes": dict(counter),
}
# Usage
chain = ProductionPromptChain()
result = chain.classify_with_fallback(
"I got charged twice and the API is showing an error when I try to check my invoices",
categories=["billing", "technical", "account", "general"],
)
print(f"Category: {result['category']} (tier {result['tier']}, confidence {result['confidence']})")This pattern saves cost on easy cases (tier 1 uses the cheap model) while maintaining accuracy on hard cases (tier 3 uses self-consistency). In production, 70-80% of queries resolve at tier 1.
Real Benchmarks
These are representative numbers from production systems. Your mileage will vary depending on the task, model, and prompt quality.
Classification Task (5 categories, 500 test samples, GPT-4o)
| Technique | Accuracy | Avg Latency | Cost per 1K queries |
|---|---|---|---|
| Zero-shot | 82% | 0.4s | $0.30 |
| Few-shot (3 examples) | 89% | 0.5s | $0.38 |
| CoT | 91% | 1.1s | $0.85 |
| Few-shot + CoT | 93% | 1.2s | $0.92 |
| Self-consistency (n=5) | 95% | 1.3s (parallel) | $4.25 |
| Tiered fallback chain | 94% | 0.6s avg | $0.65 |
The tiered fallback chain is the winner for production: near-best accuracy at a fraction of the cost, because most queries resolve cheaply at tier 1.
Math/Reasoning Task (GSM8K subset, 200 samples)
| Technique | Accuracy | Notes |
|---|---|---|
| Direct (GPT-4o) | 76% | No reasoning |
| CoT (GPT-4o) | 92% | “Let’s think step by step” |
| CoT (GPT-4o-mini) | 71% | Smaller model, CoT helps less |
| Self-consistency (GPT-4o, n=5) | 95% | Majority vote over CoT |
| Tree of Thought (GPT-4o) | 94% | Similar to self-consistency but 4x the cost |
Key takeaway: CoT gives the biggest bang for the buck on reasoning tasks. Self-consistency adds another 3% but at 5x cost. Tree of Thought is rarely worth the extra cost over self-consistency.
Prompting Technique Decision Flowchart
Use this to decide which technique to apply:
Is the task simple and well-defined?
├── YES → Zero-shot (maybe few-shot for format control)
└── NO → Does it require multi-step reasoning?
├── YES → Use CoT
│ └── Is accuracy critical and latency budget > 2s?
│ ├── YES → Add self-consistency (n=3-5)
│ └── NO → CoT alone is sufficient
└── NO → Does it require consistent output format?
├── YES → Few-shot (3-5 examples)
└── NO → Does it require external data?
├── YES → ReAct pattern
            └── NO → Zero-shot with a good system prompt

For production systems handling diverse queries, the tiered fallback pattern is almost always the right answer. Start cheap, escalate when needed.
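The same logic as a routing function you could adapt; a sketch where the boolean flags stand in for whatever task metadata your system actually tracks:

```python
def choose_technique(
    simple_task: bool,
    multi_step_reasoning: bool,
    accuracy_critical: bool,
    latency_budget_s: float,
    needs_format_control: bool,
    needs_external_data: bool,
) -> str:
    """Encode the decision flowchart above as code."""
    if simple_task:
        return "few-shot" if needs_format_control else "zero-shot"
    if multi_step_reasoning:
        if accuracy_critical and latency_budget_s > 2.0:
            return "cot + self-consistency (n=3-5)"
        return "cot"
    if needs_format_control:
        return "few-shot (3-5 examples)"
    if needs_external_data:
        return "react"
    return "zero-shot + strong system prompt"

print(choose_technique(False, True, True, 5.0, False, False))
# cot + self-consistency (n=3-5)
```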
Key Takeaways
- Few-shot examples are the most reliable way to control output format. Use 3-5 examples. Select them dynamically based on the query for best results.
- Chain of Thought improves accuracy on reasoning tasks by 20-40%. The cost is more output tokens. Worth it for math, logic, multi-step analysis. Not worth it for simple lookups.
- Self-consistency is the simplest accuracy booster. Run 5 samples with temperature 0.7, take the majority vote. Use the n parameter for a single API call.
- Tree of Thought is rarely worth the cost in production. Self-consistency gives similar accuracy gains at a fraction of the price. Reserve ToT for offline analysis of complex problems.
- ReAct is the foundation of AI agents. Interleaving reasoning with tool use enables the model to gather information and reason about it iteratively. Lesson 13 covers this in depth.
- The tiered fallback pattern saves 50-70% on costs compared to using the best technique for every query. Most queries are easy — let the cheap model handle them.
- Structure your CoT output as JSON with reasoning fields alongside result fields. This gives you auditability (log the reasoning), structured data (parse the result), and single-pass efficiency.
- Measure before optimizing. Run your baseline (zero-shot), measure accuracy, then add techniques one at a time. Each technique adds cost and complexity — make sure the accuracy gain justifies it.