In Lesson 3, you built an agent that searches the web and answers questions. But for complex research questions — “Analyze the competitive landscape of AI agents in 2026” — a single search-and-answer loop isn’t enough. You need a system that plans, researches in parallel, and synthesizes like a human researcher would.
This lesson covers reasoning models and inference-time scaling techniques, then combines everything into a deep research system.
Reasoning and Thinking LLMs
Not all LLMs are created equal when it comes to hard problems. A new class of models — reasoning models — are specifically trained to think before they answer.
Overview of Reasoning Models
| Model | Provider | Key Feature | Best For |
|---|---|---|---|
| o1 | OpenAI | Hidden chain-of-thought, RL-trained | Math, code, analysis |
| o3 | OpenAI | Next-gen reasoning, configurable effort | Complex multi-step problems |
| o4-mini | OpenAI | Smaller, faster reasoning | Cost-efficient reasoning tasks |
| DeepSeek-R1 | DeepSeek | Open-source reasoning, visible CoT | Self-hostable reasoning |
| Claude with thinking | Anthropic | Extended thinking blocks | Complex analysis, long reasoning |
The key difference from standard LLMs: reasoning models were trained with reinforcement learning on verifiable tasks (math problems, code challenges) so they learned to break problems down, check their work, and try alternative approaches.
Using Reasoning Models
from openai import OpenAI
client = OpenAI()
# Standard model — fast but less accurate on hard problems
standard = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What is 27^3 + 14^4 - 892?"}],
)
print(f"GPT-4o: {standard.choices[0].message.content}")
# Reasoning model — slower but tackles complex problems
reasoning = client.chat.completions.create(
model="o3",
messages=[{"role": "user", "content": "What is 27^3 + 14^4 - 892?"}],
)
print(f"o3: {reasoning.choices[0].message.content}")
# o3 spends thinking tokens (billed but not shown) to reason through the calculation
# DeepSeek-R1 — open-source, visible chain of thought
from openai import OpenAI
ds_client = OpenAI(
base_url="https://api.deepseek.com",
api_key="your-deepseek-key",
)
response = ds_client.chat.completions.create(
model="deepseek-reasoner",
messages=[{
"role": "user",
"content": "Design a database schema for a multi-tenant SaaS application "
"with row-level security. Explain your reasoning."
}],
)
# R1 shows its reasoning chain in the response
print(response.choices[0].message.content)
# Anthropic Claude with extended thinking
from anthropic import Anthropic
anthropic = Anthropic()
response = anthropic.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=16000,
thinking={
"type": "enabled",
"budget_tokens": 10000, # Up to 10K tokens for thinking
},
messages=[{
"role": "user",
"content": "Analyze the trade-offs between microservices and monoliths "
"for a startup with 5 engineers serving 100K users."
}],
)
for block in response.content:
if block.type == "thinking":
print(f"[Thinking]: {block.thinking[:200]}...")
elif block.type == "text":
        print(f"\n[Answer]: {block.text}")
When to Use Reasoning Models
| Task Type | Standard LLM | Reasoning Model | Winner |
|---|---|---|---|
| Simple Q&A | Fast, cheap | Overkill | Standard |
| Creative writing | Great | Overthinks | Standard |
| Math/logic | Often wrong | Reliable | Reasoning |
| Complex analysis | Surface-level | Deep insights | Reasoning |
| Code debugging | Hit or miss | Systematic | Reasoning |
| Research planning | Adequate | Excellent | Reasoning |
| Multi-step reasoning | Needs CoT prompt | Built-in | Reasoning |
Rule of thumb: use reasoning models for planning, analysis, and verification. Use standard models for generation, conversation, and simple tasks.
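That routing rule can be captured in a tiny helper that maps task types to model tiers. The task categories and fallback choice below are illustrative, not an official taxonomy:

```python
# Route each task type to a model tier, following the rule of thumb above.
REASONING_TASKS = {"math", "planning", "analysis", "verification", "debugging"}
STANDARD_TASKS = {"qa", "creative_writing", "conversation", "generation"}

def pick_model(task_type: str) -> str:
    """Return a model name for the given task type."""
    if task_type in REASONING_TASKS:
        return "o3"            # strong reasoning; higher latency and cost
    if task_type in STANDARD_TASKS:
        return "gpt-4o"        # fast, cheap, good enough
    return "gpt-4o-mini"       # unknown or simple tasks: cheapest tier

print(pick_model("planning"))  # o3
print(pick_model("qa"))        # gpt-4o
```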
Inference-Time Techniques
These techniques make LLMs reason better at query time, without changing the model itself.
Inference-Time Scaling
The core insight: you can improve output quality by spending more compute at inference time. More tokens of reasoning = better answers. This is why reasoning models generate hidden “thinking tokens.”
# The simplest form: just ask the model to think more
# More output tokens = more reasoning = better answers
response = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": "Solve this step by step, showing ALL your work:\n\n"
"A train leaves station A at 8:00 AM traveling at 60 mph. "
"Another train leaves station B (300 miles away) at 9:00 AM "
"traveling at 80 mph toward station A. When do they meet?"
}],
max_tokens=2000, # Give it room to think
)
Chain-of-Thought (CoT) Prompting
Force the model to show its reasoning:
def cot_prompt(question: str) -> str:
response = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "system",
"content": "You are a careful problem solver. Always think step by step. "
"Show your reasoning before giving the final answer."
}, {
"role": "user",
"content": f"Think step by step:\n\n{question}"
}],
temperature=0.2,
)
    return response.choices[0].message.content
Self-Consistency
Run the same CoT prompt multiple times and take the majority answer:
import asyncio
from collections import Counter
from openai import AsyncOpenAI
async_client = AsyncOpenAI()
async def self_consistent_answer(question: str, n: int = 5) -> dict:
"""Sample N reasoning paths and take the majority answer."""
tasks = [
async_client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "system",
"content": "Think step by step, then put your final answer on the last line "
"prefixed with 'ANSWER: '."
}, {
"role": "user",
"content": question,
}],
temperature=0.7, # Higher temp for diverse reasoning paths
)
for _ in range(n)
]
responses = await asyncio.gather(*tasks)
answers = []
for r in responses:
text = r.choices[0].message.content
for line in text.strip().split("\n")[::-1]:
if line.strip().startswith("ANSWER:"):
answers.append(line.split("ANSWER:")[1].strip())
break
if not answers:
return {"answer": "No consensus", "confidence": 0}
counter = Counter(answers)
best_answer, count = counter.most_common(1)[0]
return {
"answer": best_answer,
"confidence": count / len(answers),
"all_answers": dict(counter),
    }
Sequential Revision
Generate, then ask the model to improve its own answer:
def sequential_revise(question: str, revisions: int = 2) -> str:
"""Generate an answer and revise it N times."""
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": question}],
)
answer = response.choices[0].message.content
for i in range(revisions):
response = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": f"""Review and improve this answer. Fix any errors, add missing details,
and improve clarity. If the answer is already correct and complete, return it unchanged.
Question: {question}
Current answer (revision {i}):
{answer}
Improved answer:"""
}],
temperature=0.3,
)
answer = response.choices[0].message.content
    return answer
Tree of Thoughts (ToT)
Explore multiple reasoning branches, evaluate each, and pursue the most promising:
import json
def tree_of_thoughts(question: str, breadth: int = 3, depth: int = 3) -> str:
"""Explore multiple reasoning paths via BFS."""
def generate_thoughts(context: str, n: int) -> list[str]:
response = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": f"Given this problem and progress so far, suggest {n} "
f"different next reasoning steps. Return a JSON object with a "
f"\"thoughts\" key holding an array of strings.\n\n"
f"Problem: {question}\n\nProgress: {context}"
}],
response_format={"type": "json_object"},
temperature=0.8,
)
data = json.loads(response.choices[0].message.content)
return data.get("thoughts", data.get("steps", []))[:n]
def evaluate_thought(thought: str) -> float:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": f"Rate this reasoning step from 0 to 1 for correctness "
f"and usefulness toward solving:\n{question}\n\n"
f"Step: {thought}\n\nScore (just the number):"
}],
temperature=0.0,
)
try:
return float(response.choices[0].message.content.strip())
except ValueError:
return 0.5
best_path = ""
current_paths = [""]
for d in range(depth):
candidates = []
for path in current_paths:
thoughts = generate_thoughts(path, breadth)
for thought in thoughts:
score = evaluate_thought(path + "\n" + thought)
candidates.append((path + "\n" + thought, score))
candidates.sort(key=lambda x: x[1], reverse=True)
current_paths = [c[0] for c in candidates[:breadth]]
best_path = current_paths[0]
final = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": f"Based on this reasoning, give a final answer:\n\n"
f"Question: {question}\n\nReasoning:\n{best_path}"
}],
)
    return final.choices[0].message.content
Search Against a Verifier
Use one model to generate solutions and another to verify:
def search_with_verifier(question: str, max_attempts: int = 5) -> str:
"""Generate candidate answers and verify each until one passes."""
for attempt in range(max_attempts):
candidate = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": question}],
temperature=0.7 + (attempt * 0.1), # Increase diversity with each try
).choices[0].message.content
verification = client.chat.completions.create(
model="o3", # Use reasoning model as verifier
messages=[{
"role": "user",
"content": f"Verify this answer. Check for logical errors, factual mistakes, "
f"and completeness. Respond with PASS or FAIL followed by explanation.\n\n"
f"Question: {question}\nAnswer: {candidate}"
}],
).choices[0].message.content
if verification.strip().upper().startswith("PASS"):
return candidate
    return candidate
Training-Time Techniques
These techniques improve the model’s reasoning ability during training — they’re how reasoning models are built.
SFT on Reasoning Data (STaR)
Self-Taught Reasoner (STaR): Generate reasoning traces, keep the ones that lead to correct answers, and fine-tune on them.
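The keep-if-correct filter at the heart of STaR can be sketched in a few lines. The `generate_trace` stub below stands in for sampling a chain of thought from a real base model:

```python
import random

def generate_trace(question: str, answer: str) -> tuple[str, str]:
    """Stub for a CoT model call: returns (reasoning_trace, predicted_answer).
    A real pipeline would sample from the base model here."""
    predicted = answer if random.random() < 0.6 else "wrong"
    return (f"Let's think step by step about: {question}", predicted)

def collect_star_data(dataset: list[tuple[str, str]],
                      samples_per_q: int = 4) -> list[dict]:
    """Keep only traces whose answer matches the gold label (STaR step 2b)."""
    kept = []
    for question, gold in dataset:
        for _ in range(samples_per_q):
            trace, predicted = generate_trace(question, gold)
            if predicted == gold:
                kept.append({"question": question, "trace": trace, "answer": gold})
    return kept

random.seed(0)
data = collect_star_data([("What is 2+2?", "4"), ("Capital of France?", "Paris")])
# Every kept example's answer matches its gold label by construction
assert all(ex["answer"] in {"4", "Paris"} for ex in data)
```

The collected traces become the fine-tuning set; the "rationalize" branch for wrong answers is omitted here for brevity.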
# Conceptual STaR pipeline
star_pipeline = """
1. Start with a base model and a set of (question, answer) pairs
2. For each question:
a. Generate reasoning trace + answer (with CoT prompt)
b. If answer is correct → keep the trace
c. If wrong → "rationalize" — give the correct answer and ask model to explain why
3. Fine-tune the model on the collected correct reasoning traces
4. Repeat from step 2 with the improved model
"""
Reinforcement Learning with a Verifier
Train a verifier model, then use RL to optimize the generator against it:
Generator creates solution → Verifier scores it → RL updates generator weights
        ↑                                                                  |
        └──────────────────────────────────────────────────────────────────┘
Reward Modeling: ORM vs PRM
Two approaches to scoring model outputs:
| Approach | Scores | Pros | Cons |
|---|---|---|---|
| ORM (Outcome Reward Model) | Final answer only | Simple, cheap to label | Doesn’t help with partial credit |
| PRM (Process Reward Model) | Each reasoning step | Catches errors early, more signal | Expensive to label, harder to train |
Process-level reward signals are a key ingredient in strong reasoning models: the model learns which steps lead to correct answers, not just which final answers are right.
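The aggregation difference is easy to see in code. An ORM produces one score per solution, while PRM step scores are often combined by taking the minimum, on the view that one bad step usually invalidates everything after it. A sketch with illustrative scores:

```python
def orm_score(final_answer_score: float) -> float:
    """ORM: one score for the whole solution."""
    return final_answer_score

def prm_score(step_scores: list[float]) -> float:
    """PRM: combine per-step scores; min is a common choice since
    one bad step usually breaks everything downstream."""
    return min(step_scores) if step_scores else 0.0

# Two solutions with the same final answer, different reasoning quality:
clean = [0.9, 0.85, 0.9]     # all steps sound
flawed = [0.9, 0.2, 0.9]     # one bad middle step
assert prm_score(clean) > prm_score(flawed)
print(prm_score(clean), prm_score(flawed))  # 0.85 0.2
```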
Self-Refinement
Train the model on (draft, feedback, improved_draft) triples so it learns to improve its own work.
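A sketch of what one such training example might look like when serialized into a prompt/completion pair (the field names are an illustrative assumption, not a standard format):

```python
# One self-refinement training example: the model learns the mapping
# (draft, feedback) -> improved_draft.
example = {
    "draft": "The mitochondria is the powerhouse of the cell.",
    "feedback": "Grammatical number: 'mitochondria' is plural.",
    "improved_draft": "Mitochondria are the powerhouses of the cell.",
}

# At training time this is serialized into a single prompt/completion pair:
prompt = (f"Draft:\n{example['draft']}\n\n"
          f"Feedback:\n{example['feedback']}\n\nImproved draft:")
completion = example["improved_draft"]
```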
Internalizing Search (Meta-CoT)
The frontier: teach the model to perform the search/backtracking internally, without external tools. The model learns to consider alternatives and backtrack within its own chain of thought.
Project: Build the Deep Research System
Now let’s build a system that plans research, deploys sub-agents in parallel, and synthesizes a comprehensive report.
Architecture
The system has three phases:
- Planning — A reasoning model decomposes the query into research subtasks
- Execution — Parallel sub-agents research each subtask via web search
- Synthesis — A reasoning model aggregates findings into a cited report
Step 1: Query Planning
# planner.py
import json
from openai import OpenAI
client = OpenAI()
def clarify_query(query: str) -> str:
"""Optionally refine an ambiguous query."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "system",
"content": "Determine if this research query is clear enough to proceed. "
"If it's ambiguous, return a refined version. "
"If it's clear, return it unchanged. "
"Only return the query text, nothing else."
}, {
"role": "user",
"content": query,
}],
temperature=0.2,
)
return response.choices[0].message.content.strip()
def decompose_query(query: str, num_subtasks: int = 4) -> list[dict]:
"""Use a reasoning model to decompose query into research subtasks."""
response = client.chat.completions.create(
model="o3", # Reasoning model for planning
messages=[{
"role": "system",
"content": f"""You are a research planner. Decompose the user's query into
{num_subtasks} independent research subtasks that, combined, will provide a comprehensive answer.
Return a JSON object with this structure:
{{
"research_plan": {{
"objective": "one-sentence summary of the research goal",
"subtasks": [
{{
"id": 1,
"title": "Short title",
"description": "What to research",
"search_queries": ["query1", "query2", "query3"],
"key_questions": ["What specific things to find out"]
}}
]
}}
}}
Make subtasks independent so they can run in parallel.
Each subtask should have 2-3 specific search queries."""
}, {
"role": "user",
"content": query,
}],
response_format={"type": "json_object"},
)
plan = json.loads(response.choices[0].message.content)
    return plan["research_plan"]
Step 2: Sub-Agent Execution
# sub_agent.py
import json
import asyncio
from openai import AsyncOpenAI
from tool_executor import web_search, fetch_webpage  # search tools from Lesson 3
async_client = AsyncOpenAI()
async def execute_subtask(subtask: dict) -> dict:
"""Execute a single research subtask by searching and reading."""
findings = []
for query in subtask["search_queries"]:
search_results = web_search(query, num_results=5)
results = json.loads(search_results)
findings.append({
"query": query,
"results": results,
})
        # Fetch the top 2 results from each search
for result in results[:2]:
try:
content = fetch_webpage(result["url"])
findings.append({
"url": result["url"],
"title": result["title"],
"content": content[:2000],
})
except Exception:
continue
# Summarize findings using the LLM
findings_text = json.dumps(findings, indent=2)[:8000]
response = await async_client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "system",
"content": "You are a research analyst. Summarize the provided search results "
"into key findings. Include specific data points, quotes, and URLs. "
"Be factual and cite sources."
}, {
"role": "user",
"content": f"Research task: {subtask['title']}\n"
f"Description: {subtask['description']}\n\n"
f"Key questions to answer:\n"
+ "\n".join(f"- {q}" for q in subtask["key_questions"])
+ f"\n\nSearch findings:\n{findings_text}"
}],
temperature=0.2,
)
return {
"subtask_id": subtask["id"],
"title": subtask["title"],
"summary": response.choices[0].message.content,
"sources": [
f.get("url", "") for f in findings
if isinstance(f, dict) and "url" in f
],
}
async def execute_all_subtasks(subtasks: list[dict]) -> list[dict]:
"""Run all subtasks in parallel."""
tasks = [execute_subtask(st) for st in subtasks]
results = await asyncio.gather(*tasks, return_exceptions=True)
completed = []
for r in results:
if isinstance(r, Exception):
print(f"Subtask failed: {r}")
else:
completed.append(r)
    return completed
Step 3: Synthesis and Citations
# synthesizer.py
import json
from openai import OpenAI
client = OpenAI()
def synthesize_report(query: str, plan: dict,
subtask_results: list[dict]) -> str:
"""Use a reasoning model to synthesize all findings into a report."""
results_block = ""
all_sources = []
for result in subtask_results:
results_block += f"\n\n## {result['title']}\n{result['summary']}"
all_sources.extend(result.get("sources", []))
unique_sources = list(dict.fromkeys(s for s in all_sources if s))
response = client.chat.completions.create(
model="o3", # Reasoning model for synthesis
messages=[{
"role": "system",
"content": """You are a senior research analyst writing a comprehensive report.
Your task:
1. Aggregate findings from multiple research sub-agents
2. Create a clear, well-structured report with sections
3. Cross-reference and deduplicate information
4. Identify patterns, contradictions, and key insights
5. Add inline citations as [1], [2], etc.
6. End with a "Sources" section listing all URLs
Report format:
- Executive Summary (2-3 sentences)
- Main sections with headers
- Key data points and statistics highlighted
- Analysis and insights (your synthesis, not just facts)
- Sources list at the end
Write for a knowledgeable audience. Be specific, use numbers, cite everything."""
}, {
"role": "user",
"content": f"Research query: {query}\n\n"
f"Research objective: {plan['objective']}\n\n"
f"Sub-agent findings:\n{results_block}\n\n"
f"Available sources:\n"
+ "\n".join(f"[{i+1}] {url}" for i, url in enumerate(unique_sources))
}],
max_tokens=4000,
)
return response.choices[0].message.content
def add_citations(report: str, sources: list[str]) -> str:
"""Post-process to ensure citations are properly formatted."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "system",
"content": "Review this report and ensure all factual claims have inline citations. "
"Add missing citations where needed. Keep the report text unchanged otherwise. "
"Append a formatted source list at the end."
}, {
"role": "user",
"content": f"Report:\n{report}\n\nAvailable sources:\n"
+ "\n".join(f"[{i+1}] {url}" for i, url in enumerate(sources))
}],
temperature=0.1,
)
    return response.choices[0].message.content
The Complete Deep Research Pipeline
# deep_research.py
import asyncio
import time
from planner import clarify_query, decompose_query
from sub_agent import execute_all_subtasks
from synthesizer import synthesize_report, add_citations
class DeepResearch:
def __init__(self):
self.stats = {}
async def research(self, query: str) -> dict:
"""Run the full deep research pipeline."""
start_time = time.time()
print(f"Starting deep research: {query}\n")
# Phase 1: Planning
print("Phase 1: Planning...")
refined_query = clarify_query(query)
plan = decompose_query(refined_query)
print(f" Objective: {plan['objective']}")
print(f" Subtasks: {len(plan['subtasks'])}")
for st in plan["subtasks"]:
print(f" {st['id']}. {st['title']}")
plan_time = time.time() - start_time
# Phase 2: Parallel execution
print(f"\nPhase 2: Executing {len(plan['subtasks'])} sub-agents in parallel...")
exec_start = time.time()
results = await execute_all_subtasks(plan["subtasks"])
exec_time = time.time() - exec_start
print(f" Completed {len(results)}/{len(plan['subtasks'])} subtasks in {exec_time:.1f}s")
# Phase 3: Synthesis
print("\nPhase 3: Synthesizing report...")
synth_start = time.time()
report = synthesize_report(refined_query, plan, results)
all_sources = []
for r in results:
all_sources.extend(r.get("sources", []))
unique_sources = list(dict.fromkeys(s for s in all_sources if s))
final_report = add_citations(report, unique_sources)
synth_time = time.time() - synth_start
total_time = time.time() - start_time
self.stats = {
"total_seconds": total_time,
"planning_seconds": plan_time,
"execution_seconds": exec_time,
"synthesis_seconds": synth_time,
"subtasks": len(plan["subtasks"]),
"sources": len(unique_sources),
}
print(f"\nDone in {total_time:.1f}s")
print(f" Planning: {plan_time:.1f}s | Execution: {exec_time:.1f}s | Synthesis: {synth_time:.1f}s")
print(f" Sources: {len(unique_sources)}")
return {
"query": refined_query,
"report": final_report,
"plan": plan,
"sources": unique_sources,
"stats": self.stats,
}
async def main():
researcher = DeepResearch()
result = await researcher.research(
"Analyze the competitive landscape of AI coding assistants in 2026. "
"Compare features, pricing, market share, and technology approaches."
)
print("\n" + "=" * 80)
print("DEEP RESEARCH REPORT")
print("=" * 80)
print(result["report"])
if __name__ == "__main__":
    asyncio.run(main())
Adding a FastAPI Endpoint
# server.py
from fastapi import FastAPI
from pydantic import BaseModel
from deep_research import DeepResearch
app = FastAPI(title="Deep Research API")
researcher = DeepResearch()
class ResearchRequest(BaseModel):
query: str
num_subtasks: int = 4
@app.post("/research")
async def research(req: ResearchRequest):
    # num_subtasks is accepted but not yet wired into DeepResearch.research;
    # the planner's default of 4 subtasks applies.
    result = await researcher.research(req.query)
    return result
Running It
# Install dependencies
pip install openai httpx beautifulsoup4 tavily-python fastapi uvicorn
# Set API keys
export OPENAI_API_KEY=sk-your-key
export TAVILY_API_KEY=tvly-your-key
# Run the deep research
python deep_research.py
Example output structure:
# Competitive Landscape of AI Coding Assistants (2026)
## Executive Summary
The AI coding assistant market reached $X billion in 2026, with three dominant
players... [1][2]
## Market Overview
### Market Size and Growth
...
## Player Comparison
| Feature | GitHub Copilot | Cursor | Claude Code | Windsurf |
|---------|---------------|--------|-------------|----------|
| ...
## Technology Approaches
...
## Analysis and Key Insights
1. The market is consolidating around...
2. Open-source alternatives are gaining...
## Sources
[1] https://...
[2] https://...
Optimizing the Pipeline
Cost Management
Deep research is expensive — multiple reasoning model calls plus many search queries. Strategies to reduce cost:
# Use cheaper models for simple tasks, reasoning models for planning/synthesis
MODEL_MAP = {
"planning": "o3", # Needs strong reasoning
"sub_agent_search": "gpt-4o-mini", # Simple extraction
"sub_agent_summary": "gpt-4o", # Good analysis
"synthesis": "o3", # Needs strong reasoning
"citations": "gpt-4o-mini", # Simple formatting
}
| Component | Model | Est. Cost per Research |
|---|---|---|
| Planning | o3 | $0.10–0.30 |
| Sub-agents (4×) | gpt-4o-mini | $0.02–0.08 |
| Search API | Tavily | $0.01–0.04 |
| Synthesis | o3 | $0.15–0.50 |
| Citations | gpt-4o-mini | $0.01–0.02 |
| Total |  | $0.30–0.95 |
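To see where the money actually goes on a given run, you can accumulate token usage from each API response into a per-stage tracker. A minimal sketch; the per-million-token prices are placeholders, so check current pricing before relying on them:

```python
# Accumulate per-stage spend; prices are illustrative placeholders.
PRICE_PER_M_TOKENS = {  # (input, output) USD per 1M tokens
    "o3": (2.00, 8.00),
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

class CostTracker:
    def __init__(self):
        self.by_stage: dict[str, float] = {}

    def record(self, stage: str, model: str,
               input_tokens: int, output_tokens: int) -> None:
        """Add one API call's cost to the running total for a stage."""
        p_in, p_out = PRICE_PER_M_TOKENS[model]
        cost = (input_tokens * p_in + output_tokens * p_out) / 1_000_000
        self.by_stage[stage] = self.by_stage.get(stage, 0.0) + cost

    def total(self) -> float:
        return sum(self.by_stage.values())

tracker = CostTracker()
tracker.record("planning", "o3", 1200, 2500)
tracker.record("synthesis", "o3", 6000, 3500)
tracker.record("citations", "gpt-4o-mini", 4000, 1500)
print(f"${tracker.total():.4f} total, breakdown: {tracker.by_stage}")
```

In the pipeline itself you would call `tracker.record(...)` after each completion, reading the counts from `response.usage.prompt_tokens` and `response.usage.completion_tokens`.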
Quality Improvements
- Iterative deepening — if the synthesizer identifies gaps, trigger additional sub-agent searches
- Source validation — verify URLs still return content before citing
- Cross-referencing — flag claims that appear in only one source
- Confidence scoring — rate each finding by how many independent sources confirm it
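The last two ideas combine naturally into a small helper that rates a finding by how many distinct source domains support it (the thresholds here are illustrative):

```python
from urllib.parse import urlparse

def confidence(claim_sources: list[str]) -> str:
    """Rate a finding by the number of distinct domains that support it."""
    domains = {urlparse(u).netloc for u in claim_sources if u}
    if len(domains) >= 3:
        return "high"
    if len(domains) == 2:
        return "medium"
    return "low"  # single-source claims should be flagged for review

assert confidence(["https://a.com/x", "https://b.com/y", "https://c.org/z"]) == "high"
assert confidence(["https://a.com/x", "https://a.com/y"]) == "low"  # same domain
```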
Key Takeaways
- Reasoning models think before answering — they trade latency and cost for accuracy on hard problems (math, planning, analysis)
- Inference-time scaling is powerful — CoT, self-consistency, and Tree of Thoughts can significantly improve any model’s output quality
- Use reasoning models strategically — for planning and synthesis, not for every call. Mix model tiers to optimize cost
- Deep research = plan + parallel execute + synthesize — this three-phase pattern is how ChatGPT, Gemini, and Perplexity implement their research features
- Training-time techniques (STaR, PRM, RL) are how reasoning models are built — understanding them helps you choose the right model for the job
- Always cite sources — a deep research system that doesn’t cite is just a hallucination generator
What’s Next
In the next lesson, we’ll build a Multi-modal Generation Agent that goes beyond text — generating images, audio, and video by orchestrating multiple AI models into a unified pipeline.
