Lesson 04 · LLM Engineering in Production · 18 min read

Advanced Prompting — Chain of Thought, Few-Shot

April 01, 2026

TL;DR

Chain of thought (CoT) makes models show their work — and the work is better because of it. Few-shot examples teach by demonstration. Self-consistency runs multiple reasoning paths and picks the consensus. These aren't tricks — they're how you get reliable, auditable output from LLMs in production. Use CoT for math/logic, few-shot for format consistency, and self-consistency when accuracy matters more than latency.

In Lesson 3 you learned the fundamentals — system prompts, output formatting, role patterns. Those get you 80% of the way. This lesson covers the techniques that get you the remaining 20%, which is often the difference between “demo that works sometimes” and “production system your team trusts.” Chain of thought, few-shot learning, self-consistency, tree of thought, and the ReAct pattern are not academic novelties. They are engineering tools with measurable impact on output quality.

Zero-Shot vs Few-Shot Prompting

Zero-shot means you give the model a task with no examples. You rely entirely on the model’s training to figure out what you want.

Few-shot means you provide examples of the input-output mapping before asking the actual question. The model pattern-matches on your examples.

from openai import OpenAI

client = OpenAI()

# Zero-shot: no examples, just the task
zero_shot_prompt = """Classify the following customer message as one of: 
billing, technical, account, general.

Message: "I can't log into my dashboard after resetting my password."

Category:"""

# Few-shot: provide examples first
few_shot_prompt = """Classify the following customer message as one of: 
billing, technical, account, general.

Message: "My invoice shows the wrong amount for March."
Category: billing

Message: "The API returns 500 errors when I send batch requests."
Category: technical

Message: "I need to update the email address on my account."
Category: account

Message: "I can't log into my dashboard after resetting my password."
Category:"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": few_shot_prompt}],
    temperature=0,
    max_tokens=10,
)
print(response.choices[0].message.content)
# Output: account

Zero-shot works when the task is obvious and the model has seen similar patterns in training. Few-shot works when you need consistent formatting, edge-case handling, or domain-specific behavior.

How Many Examples Do You Need?

| Examples | When to Use | Tradeoff |
|---|---|---|
| 0 (zero-shot) | Simple, well-known tasks | Lowest cost, fastest |
| 1-2 | Format demonstration | Minimal token overhead |
| 3-5 | Classification, extraction | Good balance of quality vs cost |
| 5-10 | Complex or ambiguous tasks | Higher cost, diminishing returns |
| 10+ | Rarely worth it | Use fine-tuning instead |

The sweet spot for most production tasks is 3-5 examples. Beyond that, you are paying for tokens without getting proportional quality improvement. If you need 10+ examples to get the behavior right, fine-tuning is probably the better path.

Example Selection Strategies

Not all examples are equal. The examples you choose have a measurable impact on output quality.

def select_diverse_examples(query: str, example_pool: list, n: int = 5) -> list:
    """Select examples that cover different categories and edge cases.
    
    Strategies:
    1. Category coverage — at least one example per output class
    2. Similarity-based — examples semantically close to the query
    3. Difficulty-based — include at least one edge case
    """
    from collections import defaultdict
    
    # Group examples by their output category
    by_category = defaultdict(list)
    for ex in example_pool:
        by_category[ex["category"]].append(ex)
    
    selected = []
    
    # First: one from each category for coverage
    for category, examples in by_category.items():
        if len(selected) >= n:
            break
        selected.append(examples[0])
    
    # Then: fill remaining slots with examples most similar to the query
    # (In production, you'd use embeddings for similarity)
    remaining = [ex for ex in example_pool if ex not in selected]
    for ex in remaining[:n - len(selected)]:
        selected.append(ex)
    
    return selected

Key insight: Static examples are fine for prototypes. In production, dynamically selecting examples based on the input query (using embedding similarity) consistently outperforms fixed example sets.

Chain of Thought (CoT)

Chain of Thought prompting tells the model to show its reasoning before giving the final answer. It sounds trivially simple — and it is. That’s why it’s powerful.

Why CoT Works

LLMs generate text token by token. Each token is conditioned on everything that came before. When you force the model to output intermediate reasoning steps, those steps become additional context that guides the final answer. The model literally thinks better when it writes its thoughts down.

Without CoT, the model has to jump directly from problem to answer in a single forward pass. With CoT, it gets multiple forward passes to refine its reasoning.

The Magic Phrase: “Let’s Think Step by Step”

The paper that introduced zero-shot CoT showed that simply appending “Let’s think step by step” to a prompt improved accuracy on math problems by 40-70%. Here is a comparison:

# Without CoT — model jumps to answer
prompt_direct = """A store has 45 apples. They sell 3/5 of them in the morning 
and 1/3 of the remainder in the afternoon. How many apples are left?

Answer:"""

# With CoT — model shows its work
prompt_cot = """A store has 45 apples. They sell 3/5 of them in the morning 
and 1/3 of the remainder in the afternoon. How many apples are left?

Let's think step by step."""

But in production, you want something more structured than “let’s think step by step.” You want explicit reasoning frameworks.

Structured CoT for Production

import json
from openai import OpenAI

client = OpenAI()

def analyze_with_cot(question: str) -> dict:
    """Use structured CoT to get both reasoning and a final answer."""
    
    system_prompt = """You are an analytical assistant. For every question:

1. IDENTIFY the key components of the problem
2. ANALYZE each component systematically  
3. REASON through the relationships between components
4. CONCLUDE with a clear, specific answer

Always output valid JSON with this structure:
{
    "reasoning_steps": [
        {"step": 1, "description": "...", "analysis": "..."},
        {"step": 2, "description": "...", "analysis": "..."}
    ],
    "conclusion": "...",
    "confidence": "high|medium|low",
    "assumptions": ["..."]
}"""
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        temperature=0,
        response_format={"type": "json_object"},
    )
    
    return json.loads(response.choices[0].message.content)


result = analyze_with_cot(
    "Should we migrate our auth service from a monolith to a microservice? "
    "We have 50K daily active users and a team of 4 engineers."
)

for step in result["reasoning_steps"]:
    print(f"Step {step['step']}: {step['description']}")
    print(f"  -> {step['analysis']}")
print(f"\nConclusion: {result['conclusion']}")
print(f"Confidence: {result['confidence']}")

The JSON structure is the key production pattern here. You get auditable reasoning (for logging and debugging), a clean final answer (for your application logic), and a confidence signal (for fallback decisions).
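That confidence signal can drive fallback logic directly. Here is a minimal sketch of such routing; the thresholds, action names, and function are illustrative assumptions, not part of any SDK:

```python
# Illustrative routing on the "confidence" field returned by a structured
# CoT call like analyze_with_cot. Thresholds and actions are assumptions.

def route_on_confidence(result: dict) -> str:
    """Decide what to do with a CoT analysis based on its confidence signal."""
    confidence = result.get("confidence", "low")
    if confidence == "high":
        return "auto_apply"         # act on the conclusion directly
    elif confidence == "medium":
        return "retry_with_cot"     # e.g. re-run with a larger model or more samples
    else:
        return "escalate_to_human"  # log the reasoning_steps and hand off

print(route_on_confidence({"confidence": "high"}))  # auto_apply
print(route_on_confidence({}))                      # escalate_to_human
```

The point is not the specific thresholds but that the confidence field gives your application a hook for tiered handling instead of blindly trusting every answer.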

CoT with Anthropic’s Claude

Claude responds especially well to structured reasoning prompts. Here is the equivalent pattern using the Anthropic SDK:

import anthropic
import json

client = anthropic.Anthropic()

def claude_cot_analysis(question: str) -> dict:
    """Structured CoT analysis using Claude."""
    
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system="""Analyze questions using structured reasoning. Output JSON:
{
    "reasoning_steps": [{"step": N, "thought": "...", "evidence": "..."}],
    "answer": "...",
    "confidence": "high|medium|low"
}""",
        messages=[
            {"role": "user", "content": question}
        ],
    )
    
    return json.loads(message.content[0].text)

Few-Shot Learning: Teaching by Demonstration

Few-shot learning is the most reliable way to control output format and handle edge cases. Instead of describing what you want, you show what you want.

The Anatomy of a Good Few-Shot Example

Each example should demonstrate exactly one pattern. Include:

  1. The input — representative of real queries
  2. The output — in the exact format you expect
  3. At least one edge case — show how to handle tricky inputs

def build_extraction_prompt(text: str) -> str:
    """Extract structured data from unstructured text using few-shot examples."""
    
    examples = """
Extract company information from the text. Return JSON.

Text: "Acme Corp, founded in 2019 by Jane Smith, raised $5M in Series A. 
They have 45 employees in San Francisco."
Result: {"company": "Acme Corp", "founded": 2019, "founder": "Jane Smith", 
"funding": "$5M", "funding_round": "Series A", "employees": 45, 
"location": "San Francisco"}

Text: "TechStartup is a pre-revenue company with 3 cofounders."
Result: {"company": "TechStartup", "founded": null, "founder": null, 
"funding": null, "funding_round": null, "employees": null, 
"location": null, "notes": "pre-revenue, 3 cofounders"}

Text: "BigCo (NYSE: BIG) reported $2.3B revenue in Q4 2025. 
Headquartered in Austin with over 10,000 employees globally."
Result: {"company": "BigCo", "founded": null, "founder": null, 
"funding": null, "funding_round": null, "employees": 10000, 
"location": "Austin", "notes": "NYSE: BIG, $2.3B Q4 2025 revenue"}
"""
    
    return f"""{examples}
Text: "{text}"
Result:"""

Notice the second example — it handles missing data (nulls) and adds a notes field for information that does not fit the schema. That single edge-case example prevents a whole class of errors in production.
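On the consuming side you still want to guard against the model dropping a key or inventing an extra one. One possible normalizer (the schema keys mirror the examples above; the function itself is ours, not part of any library) keeps downstream code simple:

```python
# Sketch: force the model's extraction JSON into a fixed schema.
# Missing keys become null; unexpected keys are folded into "notes".

SCHEMA_KEYS = ["company", "founded", "founder", "funding",
               "funding_round", "employees", "location"]

def normalize_extraction(raw: dict) -> dict:
    """Normalize model output: every schema key present (null if missing),
    unknown keys preserved as text in the optional notes field."""
    result = {key: raw.get(key) for key in SCHEMA_KEYS}
    extras = {k: v for k, v in raw.items()
              if k not in SCHEMA_KEYS and k != "notes"}
    notes = raw.get("notes")
    if extras:
        extra_text = ", ".join(f"{k}={v}" for k, v in extras.items())
        notes = f"{notes}; {extra_text}" if notes else extra_text
    result["notes"] = notes
    return result

out = normalize_extraction({"company": "TechStartup", "ceo": "Ana"})
# out["company"] is kept, the other schema keys are None,
# and the unexpected "ceo" key lands in out["notes"]
```

This mirrors what the edge-case example teaches the model, but enforces it deterministically in code.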

Dynamic Few-Shot Selection

Static examples work for simple tasks. For production systems handling diverse queries, select examples dynamically based on the input:

import numpy as np
from openai import OpenAI

client = OpenAI()

class DynamicFewShotSelector:
    """Select the most relevant few-shot examples using embedding similarity."""
    
    def __init__(self, examples: list[dict]):
        self.examples = examples
        self.embeddings = self._embed_examples()
    
    def _embed_examples(self) -> list[list[float]]:
        """Pre-compute embeddings for all examples."""
        texts = [ex["input"] for ex in self.examples]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=texts,
        )
        return [item.embedding for item in response.data]
    
    def select(self, query: str, n: int = 3) -> list[dict]:
        """Find the n most similar examples to the query."""
        query_response = client.embeddings.create(
            model="text-embedding-3-small",
            input=query,
        )
        query_embedding = query_response.data[0].embedding
        
        # Cosine similarity
        similarities = []
        for i, emb in enumerate(self.embeddings):
            dot = np.dot(query_embedding, emb)
            norm = np.linalg.norm(query_embedding) * np.linalg.norm(emb)
            similarities.append((i, dot / norm))
        
        similarities.sort(key=lambda x: x[1], reverse=True)
        return [self.examples[i] for i, _ in similarities[:n]]


# Usage
example_pool = [
    {"input": "Cancel my subscription", "output": "account", "reasoning": "subscription management"},
    {"input": "API returns 403 forbidden", "output": "technical", "reasoning": "API error"},
    {"input": "Wrong charge on my card", "output": "billing", "reasoning": "payment dispute"},
    {"input": "How do I export my data?", "output": "technical", "reasoning": "feature question"},
    {"input": "Update my company name", "output": "account", "reasoning": "account modification"},
    {"input": "Refund for last month", "output": "billing", "reasoning": "refund request"},
]

selector = DynamicFewShotSelector(example_pool)
relevant_examples = selector.select("I was double-charged yesterday", n=3)
# Returns the billing-related examples, which are most relevant

Self-Consistency: Multiple Samples, Majority Vote

Self-consistency is the simplest way to improve accuracy on reasoning tasks. The idea: ask the model the same question multiple times with temperature > 0, then take the majority answer.

Why It Works

LLMs are probabilistic. On any single run, the model might take a wrong reasoning path. But if you run it 5 times and 4 out of 5 arrive at the same answer, that answer is very likely correct. The incorrect paths are random; the correct path is consistent.

Implementation

import json
from collections import Counter
from openai import OpenAI

client = OpenAI()

def self_consistent_answer(
    question: str,
    n_samples: int = 5,
    model: str = "gpt-4o",
) -> dict:
    """Run multiple CoT samples and return the consensus answer."""
    
    system = """Solve the problem step by step. 
At the end, output your final answer on a new line in the format:
FINAL_ANSWER: <your answer>"""
    
    # Generate multiple reasoning paths
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
        temperature=0.7,  # Must be > 0 for diversity
        n=n_samples,       # Request multiple completions in one API call
    )
    
    # Extract final answers from each path
    answers = []
    reasoning_paths = []
    
    for choice in response.choices:
        text = choice.message.content
        reasoning_paths.append(text)
        
        # Extract the final answer
        for line in text.strip().split("\n"):
            if line.startswith("FINAL_ANSWER:"):
                answer = line.replace("FINAL_ANSWER:", "").strip()
                answers.append(answer)
                break
    
    # Majority vote
    if not answers:
        return {"answer": None, "confidence": 0, "error": "No answers extracted"}
    
    counter = Counter(answers)
    most_common_answer, count = counter.most_common(1)[0]
    confidence = count / len(answers)
    
    return {
        "answer": most_common_answer,
        "confidence": confidence,
        "vote_distribution": dict(counter),
        "n_samples": len(answers),
        "reasoning_paths": reasoning_paths,
    }


result = self_consistent_answer(
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than "
    "the ball. How much does the ball cost?"
)

print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence']:.0%}")
print(f"Votes: {result['vote_distribution']}")
# Example output (vote counts can vary run to run at temperature 0.7):
# Answer: $0.05
# Confidence: 100%
# Votes: {"$0.05": 5}

Cost and Latency Tradeoffs

Self-consistency multiplies your costs linearly. Five samples means 5x the tokens.

| Samples | Cost Multiplier | Typical Accuracy Gain | When to Use |
|---|---|---|---|
| 1 | 1x | Baseline | Default for most tasks |
| 3 | 3x | +5-10% on reasoning | Math, logic, classification |
| 5 | 5x | +8-15% on reasoning | High-stakes decisions |
| 10+ | 10x+ | Diminishing returns | Rarely justified |

Production tip: Use the n parameter instead of making separate API calls. It is faster (single round trip) and sometimes cheaper (some providers batch internally).
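The linear scaling is easy to budget for. A rough upper-bound sketch, assuming every sample pays for both the shared prompt and its own completion (some providers bill the shared prompt only once when using n, so the real figure may be lower; the prices below are hypothetical):

```python
def self_consistency_cost(prompt_tokens: int, completion_tokens: int,
                          n_samples: int,
                          input_price_per_1k: float,
                          output_price_per_1k: float) -> float:
    """Upper-bound cost estimate for self-consistency: each sample is
    charged for the full prompt plus its own completion."""
    per_sample = (prompt_tokens / 1000) * input_price_per_1k \
               + (completion_tokens / 1000) * output_price_per_1k
    return n_samples * per_sample

# e.g. 500-token prompt, ~300-token CoT completions, 5 samples,
# hypothetical prices of $0.0025 / $0.01 per 1K tokens:
estimate = self_consistency_cost(500, 300, 5, 0.0025, 0.01)
```

Run this against your real token counts and pricing before turning self-consistency on for a high-volume endpoint.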

Tree of Thought (ToT)

Tree of Thought extends CoT by exploring multiple reasoning branches explicitly. Instead of a single chain, you generate several possible next steps at each stage, evaluate them, and only continue the promising ones.

When ToT Helps

ToT is overkill for most production tasks. Use it when:

  • The problem has multiple valid approaches
  • Early decisions constrain later options (planning, strategy)
  • You need to compare tradeoffs between approaches
  • The answer space is large and one wrong step cascades

Implementation

from openai import OpenAI

client = OpenAI()

def tree_of_thought(problem: str, breadth: int = 3, depth: int = 3) -> dict:
    """Explore multiple reasoning branches and select the best path."""
    
    def generate_thoughts(context: str, step: int) -> list[str]:
        """Generate multiple possible next steps."""
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": (
                    f"You are solving a problem step by step. "
                    f"This is step {step}. Generate {breadth} different "
                    f"possible next steps. Output each on a separate line "
                    f"prefixed with a number."
                )},
                {"role": "user", "content": context},
            ],
            temperature=0.8,
        )
        text = response.choices[0].message.content
        thoughts = [
            line.strip() 
            for line in text.split("\n") 
            if line.strip() and line.strip()[0].isdigit()
        ]
        return thoughts[:breadth]
    
    def evaluate_thought(problem: str, path: list[str], thought: str) -> float:
        """Score a thought on how promising it is (0-1)."""
        context = f"Problem: {problem}\nSteps so far: {path}\nNext step: {thought}"
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": (
                    "Rate how promising this reasoning step is for solving "
                    "the problem. Output only a number between 0.0 and 1.0."
                )},
                {"role": "user", "content": context},
            ],
            temperature=0,
            max_tokens=10,
        )
        try:
            return float(response.choices[0].message.content.strip())
        except ValueError:
            return 0.5
    
    # Explore the tree
    best_path = []
    current_context = f"Problem: {problem}"
    
    for step in range(1, depth + 1):
        thoughts = generate_thoughts(current_context, step)
        
        # Score each thought
        scored = []
        for thought in thoughts:
            score = evaluate_thought(problem, best_path, thought)
            scored.append((thought, score))
        
        # Pick the best thought
        scored.sort(key=lambda x: x[1], reverse=True)
        best_thought = scored[0][0]
        best_path.append(best_thought)
        
        current_context += f"\nStep {step}: {best_thought}"
    
    # Generate final answer from the best path
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Synthesize the reasoning into a final answer."},
            {"role": "user", "content": (
                f"Problem: {problem}\n\n"
                f"Reasoning path:\n" + 
                "\n".join(f"Step {i+1}: {s}" for i, s in enumerate(best_path))
            )},
        ],
        temperature=0,
    )
    
    return {
        "reasoning_path": best_path,
        "answer": response.choices[0].message.content,
    }

Caution: ToT is expensive. With breadth 3 and depth 3, the loop above makes 3 generation calls (one per step) + 9 evaluation calls (one per candidate thought) + 1 synthesis call = 13 API calls. Use it only when accuracy justifies the cost.

ReAct Pattern: Reasoning + Acting

ReAct interleaves reasoning (thinking about what to do) with acting (actually doing it — calling tools, querying databases, making API calls). This is the foundation of modern AI agents.

The ReAct Loop

Thought: I need to find the user's account status
Action: query_database(user_id=12345)
Observation: {"status": "active", "plan": "pro", "created": "2024-01-15"}
Thought: The user is active on Pro. Now I need to check their billing history
Action: get_billing_history(user_id=12345, last_n=3)
Observation: [{"date": "2025-03", "amount": 49.99}, ...]
Thought: I have all the info I need to answer the question
Answer: Your account is active on the Pro plan...

Implementation

import json
from openai import OpenAI

client = OpenAI()

# Define available tools
TOOLS = {
    "search_docs": {
        "description": "Search the knowledge base for relevant documents",
        "parameters": {"query": "string"},
    },
    "get_user": {
        "description": "Get user details by ID",
        "parameters": {"user_id": "integer"},
    },
    "calculate": {
        "description": "Evaluate a mathematical expression",
        "parameters": {"expression": "string"},
    },
}

def execute_tool(name: str, args: dict) -> str:
    """Execute a tool and return the result. Replace with real implementations."""
    if name == "search_docs":
        return json.dumps({"results": [
            {"title": "Refund Policy", "content": "Full refund within 30 days..."}
        ]})
    elif name == "get_user":
        return json.dumps({"name": "Alice", "plan": "pro", "status": "active"})
    elif name == "calculate":
        return str(eval(args["expression"]))  # Use a safe evaluator in production
    return json.dumps({"error": f"Unknown tool: {name}"})
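The eval call above is fine for a demo but dangerous with model-generated input. One possible safe replacement, restricted to basic arithmetic via the stdlib ast module (this is a sketch, not the only option; libraries like simpleeval exist too):

```python
import ast
import operator

# Whitelisted arithmetic operators; anything else is rejected.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_eval(expression: str) -> float:
    """Evaluate a plain arithmetic expression without executing code."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression: {expression!r}")
    return _eval(ast.parse(expression, mode="eval"))

print(safe_eval("2 * (3 + 4)"))  # 14
```

Anything that is not a number or a whitelisted operator (function calls, attribute access, names) raises ValueError instead of executing.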

def react_loop(question: str, max_steps: int = 5) -> str:
    """Run a ReAct loop: interleave reasoning and tool use."""
    
    tool_descriptions = "\n".join(
        f"- {name}: {info['description']} (params: {info['parameters']})"
        for name, info in TOOLS.items()
    )
    
    system = f"""You solve problems by reasoning and using tools.

Available tools:
{tool_descriptions}

At each step, output exactly one of:
THOUGHT: <your reasoning>
ACTION: <tool_name>({{"param": "value"}})
ANSWER: <final answer to the user>

Rules:
- Always THINK before acting
- After an observation, THINK about what it means
- When you have enough information, output ANSWER"""
    
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
    
    for step in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            temperature=0,
            max_tokens=500,
        )
        
        output = response.choices[0].message.content.strip()
        messages.append({"role": "assistant", "content": output})
        
        # Check if we have a final answer
        if output.startswith("ANSWER:"):
            return output.replace("ANSWER:", "").strip()
        
        # If there's an action, execute it
        if "ACTION:" in output:
            action_line = next(l for l in output.split("\n") if "ACTION:" in l)
            action_str = action_line.split("ACTION:", 1)[1].strip()
            
            # Parse tool name and arguments
            tool_name = action_str.split("(")[0].strip()
            args_str = action_str.split("(", 1)[1].rsplit(")", 1)[0]
            args = json.loads(args_str)
            
            # Execute and feed back observation
            result = execute_tool(tool_name, args)
            observation = f"OBSERVATION: {result}"
            messages.append({"role": "user", "content": observation})
    
    return "I wasn't able to find a complete answer within the step limit."


answer = react_loop("What is the refund policy for user 42?")
print(answer)

The ReAct pattern is covered in depth in Lesson 13 (Building AI Agents with Tool Use). The key point here is that it combines CoT reasoning with real-world actions — the model reasons about what information it needs, fetches it, and reasons again.

Structured Output with CoT

One of the most common production problems: you want the model to reason (CoT) but also return structured data (JSON). Here is how to get both.

The Two-Pass Pattern

import json
from openai import OpenAI

client = OpenAI()

def analyze_and_extract(text: str) -> dict:
    """Two-pass: first reason, then extract structured data."""
    
    # Pass 1: Reason about the text
    reasoning_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Analyze this customer support message. Think about: "
                "1) What is the customer's problem? "
                "2) How urgent is it? "
                "3) What department should handle it? "
                "4) What's the sentiment?"
            )},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    reasoning = reasoning_response.choices[0].message.content
    
    # Pass 2: Extract structured data informed by the reasoning
    extraction_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Based on the analysis, extract structured data. "
                "Output valid JSON only."
            )},
            {"role": "user", "content": f"Original message: {text}\n\nAnalysis: {reasoning}"},
        ],
        temperature=0,
        response_format={"type": "json_object"},
    )
    
    result = json.loads(extraction_response.choices[0].message.content)
    result["_reasoning"] = reasoning  # Attach reasoning for debugging
    return result

The Single-Pass Pattern (Preferred)

The two-pass pattern doubles your latency and cost. Often you can get the same result in one call by structuring the output format:

def analyze_and_extract_single_pass(text: str) -> dict:
    """Single pass: reason inside the JSON structure itself."""
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """Analyze the customer message. 
Output JSON with this exact structure:
{
    "reasoning": {
        "problem_identified": "what the customer's issue is",
        "urgency_analysis": "why this urgency level",
        "routing_logic": "why this department"
    },
    "result": {
        "category": "billing|technical|account|general",
        "urgency": "low|medium|high|critical",
        "department": "string",
        "sentiment": "positive|neutral|negative|angry",
        "summary": "one sentence summary"
    }
}"""},
            {"role": "user", "content": text},
        ],
        temperature=0,
        response_format={"type": "json_object"},
    )
    
    return json.loads(response.choices[0].message.content)


result = analyze_and_extract_single_pass(
    "This is the THIRD time I've asked about my refund. It's been 45 days. "
    "Your policy says 30 days. I want to speak to a manager NOW."
)

print(json.dumps(result, indent=2))
# {
#   "reasoning": {
#     "problem_identified": "Customer has been waiting 45 days for a refund...",
#     "urgency_analysis": "High urgency — repeat contact, policy violation...",
#     "routing_logic": "Billing team with escalation to management..."
#   },
#   "result": {
#     "category": "billing",
#     "urgency": "critical",
#     "department": "billing-escalations",
#     "sentiment": "angry",
#     "summary": "Repeat request for overdue refund, requesting manager."
#   }
# }

The reasoning lives inside the JSON. You get auditability, structured output, and a single API call. This is the pattern to use in production.
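Before trusting that JSON in application logic, it is worth a cheap structural check. A minimal validator for the schema above (the function and key set are ours, mirroring the prompt's structure):

```python
# Sketch: structural validation of the single-pass CoT output
# before routing on it. Keys mirror the prompt's "result" object.

REQUIRED_RESULT_KEYS = {"category", "urgency", "department", "sentiment", "summary"}

def validate_analysis(payload: dict) -> bool:
    """True if the payload has both the reasoning block and a complete result."""
    return (
        isinstance(payload.get("reasoning"), dict)
        and isinstance(payload.get("result"), dict)
        and REQUIRED_RESULT_KEYS <= payload["result"].keys()
    )

ok = validate_analysis({
    "reasoning": {"problem_identified": "overdue refund"},
    "result": {"category": "billing", "urgency": "critical",
               "department": "billing-escalations", "sentiment": "angry",
               "summary": "Repeat refund request."},
})
# ok is True; a payload missing "result" keys would return False
```

If validation fails, retry the call or fall back to a default route rather than letting a malformed payload propagate.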

When Each Technique Helps (and When It Hurts)

| Technique | Best For | Accuracy Gain | Latency Impact | Cost Impact |
|---|---|---|---|---|
| Zero-shot | Simple, well-defined tasks | Baseline | None | None |
| Few-shot (3-5 examples) | Format consistency, classification | +10-20% | +5-15% (more tokens) | +5-15% |
| CoT | Math, logic, multi-step reasoning | +20-40% | +30-100% (more output) | +30-100% |
| Self-consistency (n=5) | High-stakes reasoning | +5-15% on top of CoT | 5x (parallel) | 5x |
| Tree of Thought | Complex planning, strategy | +5-20% on top of CoT | 10-20x | 10-20x |
| ReAct | Tasks needing external data | Enables new capabilities | Variable | Variable |

When Each Technique Hurts

CoT hurts when:

  • The task is simple (adding reasoning to “translate this word” adds cost without benefit)
  • Latency is critical (CoT generates 3-10x more tokens)
  • The model is small (small models generate plausible-sounding but wrong reasoning)

Few-shot hurts when:

  • Your examples are misleading or unrepresentative
  • You are near the context window limit (examples eat tokens)
  • The task changes frequently (maintaining example sets is overhead)

Self-consistency hurts when:

  • The task is generative, not convergent (creative writing has no “correct” answer)
  • Budget is tight (5x cost is real money at scale)
  • Latency SLA is strict (even with parallel calls, you wait for the slowest one)

Production Patterns: Combining Techniques

In production, you rarely use one technique in isolation. Here is a pattern that combines several techniques with a fallback chain:

import json
from openai import OpenAI
from collections import Counter

client = OpenAI()

class ProductionPromptChain:
    """Combine multiple prompting techniques with fallbacks."""
    
    def __init__(self, model: str = "gpt-4o"):
        self.model = model
        self.fast_model = "gpt-4o-mini"  # For cheaper initial attempts
    
    def classify_with_fallback(self, text: str, categories: list[str]) -> dict:
        """
        Tiered approach:
        1. Try fast model with few-shot (cheap, fast)
        2. If confidence is low, try full model with CoT
        3. If still uncertain, use self-consistency
        """
        
        # Tier 1: Fast model + few-shot
        tier1_result = self._few_shot_classify(text, categories, self.fast_model)
        
        if tier1_result["confidence"] >= 0.9:
            tier1_result["tier"] = 1
            return tier1_result
        
        # Tier 2: Full model + CoT
        tier2_result = self._cot_classify(text, categories, self.model)
        
        if tier2_result["confidence"] >= 0.8:
            tier2_result["tier"] = 2
            return tier2_result
        
        # Tier 3: Self-consistency with CoT
        tier3_result = self._self_consistent_classify(
            text, categories, self.model, n=5
        )
        tier3_result["tier"] = 3
        return tier3_result
    
    def _few_shot_classify(self, text: str, categories: list, model: str) -> dict:
        """Simple few-shot classification."""
        prompt = f"""Classify the text into one of: {', '.join(categories)}.

Text: "My payment failed but I was still charged"
Category: billing
Confidence: 0.95

Text: "The API keeps timing out"  
Category: technical
Confidence: 0.92

Text: "{text}"
Category:"""
        
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            max_tokens=50,
        )
        
        output = response.choices[0].message.content.strip()
        lines = output.split("\n")
        category = lines[0].strip()
        confidence = 0.5
        
        for line in lines:
            if "Confidence:" in line:
                try:
                    confidence = float(line.split(":")[-1].strip())
                except ValueError:
                    pass
        
        return {"category": category, "confidence": confidence}
    
    def _cot_classify(self, text: str, categories: list, model: str) -> dict:
        """Classification with chain-of-thought reasoning."""
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": f"""Classify text into: {', '.join(categories)}.

Think through your reasoning:
1. What is the main topic?
2. What keywords indicate the category?
3. Are there ambiguities?

Then output JSON: {{"reasoning": "...", "category": "...", "confidence": 0.0-1.0}}"""},
                {"role": "user", "content": text},
            ],
            temperature=0,
            response_format={"type": "json_object"},
        )
        
        return json.loads(response.choices[0].message.content)
    
    def _self_consistent_classify(
        self, text: str, categories: list, model: str, n: int = 5
    ) -> dict:
        """Self-consistency: multiple CoT paths + majority vote."""
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": f"""Classify into: {', '.join(categories)}.
Think step by step. End with: CATEGORY: <category>"""},
                {"role": "user", "content": text},
            ],
            temperature=0.7,
            n=n,
        )
        
        answers = []
        for choice in response.choices:
            text_out = choice.message.content
            for line in text_out.split("\n"):
                if line.startswith("CATEGORY:"):
                    answers.append(line.replace("CATEGORY:", "").strip().lower())
                    break
        
        if not answers:
            return {"category": "unknown", "confidence": 0.0}
        
        counter = Counter(answers)
        best, count = counter.most_common(1)[0]
        
        return {
            "category": best,
            "confidence": count / len(answers),
            "votes": dict(counter),
        }


# Usage
chain = ProductionPromptChain()
result = chain.classify_with_fallback(
    "I got charged twice and the API is showing an error when I try to check my invoices",
    categories=["billing", "technical", "account", "general"],
)
print(f"Category: {result['category']} (tier {result['tier']}, confidence {result['confidence']})")

This pattern saves cost on easy cases (tier 1 uses the cheap model) while maintaining accuracy on hard cases (tier 3 uses self-consistency). In production, 70-80% of queries resolve at tier 1.

Real Benchmarks

These are representative numbers from production systems. Your mileage will vary depending on the task, model, and prompt quality.

Classification Task (5 categories, 500 test samples, GPT-4o)

Technique                  Accuracy  Avg Latency      Cost per 1K queries
Zero-shot                  82%       0.4s             $0.30
Few-shot (3 examples)      89%       0.5s             $0.38
CoT                        91%       1.1s             $0.85
Few-shot + CoT             93%       1.2s             $0.92
Self-consistency (n=5)     95%       1.3s (parallel)  $4.25
Tiered fallback chain      94%       0.6s avg         $0.65

The tiered fallback chain is the winner for production: near-best accuracy at a fraction of the cost, because most queries resolve cheaply at tier 1.
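You can sanity-check the blended figure with back-of-the-envelope arithmetic. The sketch below assumes a 75/15/10 resolution split across tiers, and assumes tier 1 runs on a much cheaper, faster small model (the $0.04 and 0.4s tier-1 figures are illustrative assumptions, not measurements); a query that escalates pays for every tier it passed through.

```python
# Blended cost/latency estimate for the tiered fallback chain.
# Tier 2/3 figures come from the table above; the tier-1 figures
# and the resolution split are illustrative assumptions.
tiers = [
    # (fraction resolved here, cost per 1K queries, latency in seconds)
    (0.75, 0.04, 0.4),   # tier 1: few-shot on a cheap small model
    (0.15, 0.92, 1.2),   # tier 2: few-shot + CoT
    (0.10, 4.25, 1.3),   # tier 3: self-consistency (n=5)
]

blended_cost = blended_latency = 0.0
cum_cost = cum_latency = 0.0
for fraction, cost, latency in tiers:
    # Escalated queries accumulate the cost of every tier they touched.
    cum_cost += cost
    cum_latency += latency
    blended_cost += fraction * cum_cost
    blended_latency += fraction * cum_latency

print(f"~${blended_cost:.2f} per 1K queries, ~{blended_latency:.2f}s average latency")
```

With these assumptions the blend lands around $0.70 per 1K queries with well under a second of average latency, the same ballpark as the tiered-chain row in the table.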

Math/Reasoning Task (GSM8K subset, 200 samples)

Technique                       Accuracy  Notes
Direct (GPT-4o)                 76%       No reasoning
CoT (GPT-4o)                    92%       “Let’s think step by step”
CoT (GPT-4o-mini)               71%       Smaller model, CoT helps less
Self-consistency (GPT-4o, n=5)  95%       Majority vote over CoT
Tree of Thought (GPT-4o)        94%       Similar to self-consistency but 4x the cost

Key takeaway: CoT gives the biggest bang for the buck on reasoning tasks. Self-consistency adds another 3 points but at roughly 5x the cost. Tree of Thought is rarely worth the extra cost over self-consistency.
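Mechanically, self-consistency on a math task is the same vote used in _self_consistent_classify above: sample several reasoning paths at temperature ~0.7, extract the final answer line from each, and take the majority. Here is a minimal, API-free sketch of the vote itself (the ANSWER: marker and the sample strings are illustrative, mirroring the CATEGORY: marker used earlier):

```python
from collections import Counter

def majority_answer(samples: list[str]) -> tuple[str, float]:
    """Extract the final 'ANSWER: <value>' line from each CoT sample and vote."""
    answers = []
    for sample in samples:
        for line in sample.splitlines():
            if line.startswith("ANSWER:"):
                answers.append(line.removeprefix("ANSWER:").strip())
                break
    if not answers:
        return "unknown", 0.0  # no sample produced a parseable answer
    best, count = Counter(answers).most_common(1)[0]
    return best, count / len(answers)

# Three hypothetical CoT samples for the same problem: one path makes
# an arithmetic slip, but the majority vote recovers the right answer.
samples = [
    "16 - 3 - 4 = 9 eggs, 9 x $2 = $18\nANSWER: 18",
    "She has 13 eggs left... 13 x $2 = $26\nANSWER: 26",
    "Sells 9 eggs at $2 each\nANSWER: 18",
]
best, agreement = majority_answer(samples)
print(best, agreement)  # '18' wins with 2 of 3 votes
```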

Prompting Technique Decision Flowchart

Use this to decide which technique to apply:

Is the task simple and well-defined?
├── YES → Zero-shot (maybe few-shot for format control)
└── NO → Does it require multi-step reasoning?
    ├── YES → Use CoT
    │   └── Is accuracy critical and latency budget > 2s?
    │       ├── YES → Add self-consistency (n=3-5)
    │       └── NO → CoT alone is sufficient
    └── NO → Does it require consistent output format?
        ├── YES → Few-shot (3-5 examples)
        └── NO → Does it require external data?
            ├── YES → ReAct pattern
            └── NO → Zero-shot with a good system prompt

For production systems handling diverse queries, the tiered fallback pattern is almost always the right answer. Start cheap, escalate when needed.
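If you prefer the flowchart as code, a literal translation might look like this (the function name and boolean flags are hypothetical, just an encoding of the tree above):

```python
def pick_technique(
    simple_task: bool,
    multi_step_reasoning: bool,
    accuracy_critical: bool,
    latency_budget_s: float,
    needs_format_control: bool,
    needs_external_data: bool,
) -> str:
    """Mechanical translation of the decision flowchart above."""
    if simple_task:
        return "zero-shot (few-shot for format control)"
    if multi_step_reasoning:
        if accuracy_critical and latency_budget_s > 2.0:
            return "CoT + self-consistency (n=3-5)"
        return "CoT"
    if needs_format_control:
        return "few-shot (3-5 examples)"
    if needs_external_data:
        return "ReAct"
    return "zero-shot with a good system prompt"

# A hard reasoning task with a generous latency budget:
print(pick_technique(False, True, True, 3.0, False, False))
# → CoT + self-consistency (n=3-5)
```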

Key Takeaways

  1. Few-shot examples are the most reliable way to control output format. Use 3-5 examples. Select them dynamically based on the query for best results.

  2. Chain of Thought improves accuracy on reasoning tasks by 15-40%. The cost is more output tokens. Worth it for math, logic, multi-step analysis. Not worth it for simple lookups.

  3. Self-consistency is the simplest accuracy booster. Run 5 samples with temperature 0.7 and take the majority vote. Use the n parameter to get all samples in a single API call instead of five sequential ones.

  4. Tree of Thought is rarely worth the cost in production. Self-consistency gives similar accuracy gains at a fraction of the price. Reserve ToT for offline analysis of complex problems.

  5. ReAct is the foundation of AI agents. Interleaving reasoning with tool use enables the model to gather information and reason about it iteratively. Lesson 13 covers this in depth.

  6. The tiered fallback pattern saves 50-70% on costs compared to using the best technique for every query. Most queries are easy — let the cheap model handle them.

  7. Structure your CoT output as JSON with reasoning fields alongside result fields. This gives you auditability (log the reasoning), structured data (parse the result), and single-pass efficiency.

  8. Measure before optimizing. Run your baseline (zero-shot), measure accuracy, then add techniques one at a time. Each technique adds cost and complexity — make sure the accuracy gain justifies it.
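Putting takeaway 8 into practice takes little more than a labeled sample set and an accuracy function. A minimal harness sketch (classify is a stand-in for whichever prompt variant you are measuring; the signature is hypothetical):

```python
from typing import Callable

def evaluate(classify: Callable[[str], str],
             labeled: list[tuple[str, str]]) -> float:
    """Accuracy of a classifier over (text, expected_category) pairs."""
    correct = sum(1 for text, expected in labeled if classify(text) == expected)
    return correct / len(labeled)

# Run the zero-shot baseline first, record it, then swap in few-shot,
# CoT, etc. one at a time; keep a technique only if the gain justifies
# its extra cost and latency.
labeled = [
    ("My invoice shows the wrong amount", "billing"),
    ("The API returns 500 errors", "technical"),
]
baseline = evaluate(lambda text: "billing", labeled)  # trivial stand-in
print(f"baseline accuracy: {baseline:.0%}")  # → baseline accuracy: 50%
```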