Lesson 04 · Become an AI Engineer — Practical Guide · 12 min read

Build "Deep Research" with Web Search and Reasoning Models

April 17, 2026

TL;DR

This lesson covers reasoning and thinking LLMs (OpenAI's o-family, DeepSeek-R1), inference-time techniques (CoT, self-consistency, Tree of Thoughts), training-time techniques (STaR, RL with verifiers, ORM/PRM), and then builds a complete deep research pipeline — query planning, parallel sub-agent execution, and synthesis with citations.


In Lesson 3, you built an agent that searches the web and answers questions. But for complex research questions — “Analyze the competitive landscape of AI agents in 2026” — a single search-and-answer loop isn’t enough. You need a system that plans, researches in parallel, and synthesizes like a human researcher would.

This lesson teaches you about reasoning models, inference-time scaling, and then combines everything into a deep research system.

Reasoning and Thinking LLMs

Not all LLMs are created equal when it comes to hard problems. A new class of models — reasoning models — are specifically trained to think before they answer.

Reasoning Models — Standard vs Reasoning LLMs

Overview of Reasoning Models

| Model | Provider | Key Feature | Best For |
|-------|----------|-------------|----------|
| o1 | OpenAI | Hidden chain-of-thought, RL-trained | Math, code, analysis |
| o3 | OpenAI | Next-gen reasoning, configurable effort | Complex multi-step problems |
| o4-mini | OpenAI | Smaller, faster reasoning | Cost-efficient reasoning tasks |
| DeepSeek-R1 | DeepSeek | Open-source reasoning, visible CoT | Self-hostable reasoning |
| Claude with thinking | Anthropic | Extended thinking blocks | Complex analysis, long reasoning |

The key difference from standard LLMs: reasoning models were trained with reinforcement learning on verifiable tasks (math problems, code challenges) so they learned to break problems down, check their work, and try alternative approaches.

Using Reasoning Models

from openai import OpenAI

client = OpenAI()

# Standard model — fast but less accurate on hard problems
standard = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 27^3 + 14^3 - 892?"}],
)
print(f"GPT-4o: {standard.choices[0].message.content}")

# Reasoning model — slower but tackles complex problems
reasoning = client.chat.completions.create(
    model="o3",
    messages=[{"role": "user", "content": "What is 27^3 + 14^3 - 892?"}],
)
print(f"o3: {reasoning.choices[0].message.content}")
# o3 spends thinking tokens (billed but not shown) to reason through the calculation

# DeepSeek-R1 — open-source, visible chain of thought (OpenAI-compatible API)
from openai import OpenAI

ds_client = OpenAI(
    base_url="https://api.deepseek.com",
    api_key="your-deepseek-key",
)

response = ds_client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{
        "role": "user",
        "content": "Design a database schema for a multi-tenant SaaS application "
                   "with row-level security. Explain your reasoning."
    }],
)
# R1 returns its chain of thought in a separate `reasoning_content` field
print(response.choices[0].message.reasoning_content)  # visible reasoning chain
print(response.choices[0].message.content)            # final answer

# Anthropic Claude with extended thinking
from anthropic import Anthropic

anthropic = Anthropic()

response = anthropic.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # Up to 10K tokens for thinking
    },
    messages=[{
        "role": "user",
        "content": "Analyze the trade-offs between microservices and monoliths "
                   "for a startup with 5 engineers serving 100K users."
    }],
)

for block in response.content:
    if block.type == "thinking":
        print(f"[Thinking]: {block.thinking[:200]}...")
    elif block.type == "text":
        print(f"\n[Answer]: {block.text}")

When to Use Reasoning Models

| Task Type | Standard LLM | Reasoning Model | Winner |
|-----------|--------------|-----------------|--------|
| Simple Q&A | Fast, cheap | Overkill | Standard |
| Creative writing | Great | Overthinks | Standard |
| Math/logic | Often wrong | Reliable | Reasoning |
| Complex analysis | Surface-level | Deep insights | Reasoning |
| Code debugging | Hit or miss | Systematic | Reasoning |
| Research planning | Adequate | Excellent | Reasoning |
| Multi-step reasoning | Needs CoT prompt | Built-in | Reasoning |

Rule of thumb: use reasoning models for planning, analysis, and verification. Use standard models for generation, conversation, and simple tasks.
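One way to encode this rule of thumb is a small router that picks a model tier by task type. The task categories and model names below are illustrative assumptions, not an official mapping; adjust them to your provider's lineup:

```python
# Illustrative model router. The task labels and tier assignments here are
# assumptions for demonstration, not a canonical mapping.
REASONING_TASKS = {"planning", "analysis", "verification", "math", "debugging"}

def pick_model(task_type: str) -> str:
    """Route heavy-reasoning tasks to a reasoning model, the rest to a standard one."""
    if task_type in REASONING_TASKS:
        return "o3"      # slower and pricier, better on hard problems
    return "gpt-4o"      # fast default for generation and conversation
```

Usage: `pick_model("planning")` returns `"o3"`, while `pick_model("chat")` falls through to `"gpt-4o"`.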


Inference-Time Techniques

These techniques make LLMs reason better at query time, without changing the model itself.

Inference-Time Scaling Techniques

Inference-Time Scaling

The core insight: you can improve output quality by spending more compute at inference time. More tokens of reasoning = better answers. This is why reasoning models generate hidden “thinking tokens.”

# The simplest form: just ask the model to think more
# More output tokens = more reasoning = better answers

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Solve this step by step, showing ALL your work:\n\n"
                   "A train leaves station A at 8:00 AM traveling at 60 mph. "
                   "Another train leaves station B (300 miles away) at 9:00 AM "
                   "traveling at 80 mph toward station A. When do they meet?"
    }],
    max_tokens=2000,  # Give it room to think
)

Chain-of-Thought (CoT) Prompting

Force the model to show its reasoning:

def cot_prompt(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "You are a careful problem solver. Always think step by step. "
                       "Show your reasoning before giving the final answer."
        }, {
            "role": "user",
            "content": f"Think step by step:\n\n{question}"
        }],
        temperature=0.2,
    )
    return response.choices[0].message.content

Self-Consistency

Run the same CoT prompt multiple times and take the majority answer:

import asyncio
from collections import Counter
from openai import AsyncOpenAI

async_client = AsyncOpenAI()


async def self_consistent_answer(question: str, n: int = 5) -> dict:
    """Sample N reasoning paths and take the majority answer."""
    tasks = [
        async_client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "Think step by step, then put your final answer on the last line "
                           "prefixed with 'ANSWER: '."
            }, {
                "role": "user",
                "content": question,
            }],
            temperature=0.7,  # Higher temp for diverse reasoning paths
        )
        for _ in range(n)
    ]
    responses = await asyncio.gather(*tasks)

    answers = []
    for r in responses:
        text = r.choices[0].message.content
        for line in text.strip().split("\n")[::-1]:
            if line.strip().startswith("ANSWER:"):
                answers.append(line.split("ANSWER:")[1].strip())
                break

    if not answers:
        return {"answer": "No consensus", "confidence": 0}

    counter = Counter(answers)
    best_answer, count = counter.most_common(1)[0]
    return {
        "answer": best_answer,
        "confidence": count / len(answers),
        "all_answers": dict(counter),
    }

Sequential Revision

Generate, then ask the model to improve its own answer:

def sequential_revise(question: str, revisions: int = 2) -> str:
    """Generate an answer and revise it N times."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    answer = response.choices[0].message.content

    for i in range(revisions):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"""Review and improve this answer. Fix any errors, add missing details,
and improve clarity. If the answer is already correct and complete, return it unchanged.

Question: {question}

Current answer (revision {i}):
{answer}

Improved answer:"""
            }],
            temperature=0.3,
        )
        answer = response.choices[0].message.content

    return answer

Tree of Thoughts (ToT)

Explore multiple reasoning branches, evaluate each, and pursue the most promising:

import json


def tree_of_thoughts(question: str, breadth: int = 3, depth: int = 3) -> str:
    """Explore multiple reasoning paths via BFS."""

    def generate_thoughts(context: str, n: int) -> list[str]:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Given this problem and progress so far, suggest {n} "
                           f"different next reasoning steps. Return a JSON object "
                           f'of the form {{"thoughts": ["...", "..."]}}.\n\n'
                           f"Problem: {question}\n\nProgress: {context}"
            }],
            response_format={"type": "json_object"},
            temperature=0.8,
        )
        data = json.loads(response.choices[0].message.content)
        return data.get("thoughts", data.get("steps", []))[:n]

    def evaluate_thought(thought: str) -> float:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"Rate this reasoning step from 0 to 1 for correctness "
                           f"and usefulness toward solving:\n{question}\n\n"
                           f"Step: {thought}\n\nScore (just the number):"
            }],
            temperature=0.0,
        )
        try:
            return float(response.choices[0].message.content.strip())
        except ValueError:
            return 0.5

    best_path = ""
    current_paths = [""]

    for d in range(depth):
        candidates = []
        for path in current_paths:
            thoughts = generate_thoughts(path, breadth)
            for thought in thoughts:
                score = evaluate_thought(path + "\n" + thought)
                candidates.append((path + "\n" + thought, score))

        candidates.sort(key=lambda x: x[1], reverse=True)
        current_paths = [c[0] for c in candidates[:breadth]]
        best_path = current_paths[0]

    final = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Based on this reasoning, give a final answer:\n\n"
                       f"Question: {question}\n\nReasoning:\n{best_path}"
        }],
    )
    return final.choices[0].message.content

Search Against a Verifier

Use one model to generate solutions and another to verify:

def search_with_verifier(question: str, max_attempts: int = 5) -> str:
    """Generate candidate answers and verify each until one passes."""
    for attempt in range(max_attempts):
        candidate = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": question}],
            temperature=0.7 + (attempt * 0.1),  # Increase diversity with each try
        ).choices[0].message.content

        verification = client.chat.completions.create(
            model="o3",  # Use reasoning model as verifier
            messages=[{
                "role": "user",
                "content": f"Verify this answer. Check for logical errors, factual mistakes, "
                           f"and completeness. Respond with PASS or FAIL followed by explanation.\n\n"
                           f"Question: {question}\nAnswer: {candidate}"
            }],
        ).choices[0].message.content

        if verification.strip().upper().startswith("PASS"):
            return candidate

    # No candidate passed verification; return the last attempt as a best effort
    return candidate

Training-Time Techniques

These techniques improve the model’s reasoning ability during training — they’re how reasoning models are built.

SFT on Reasoning Data (STaR)

Self-Taught Reasoner (STaR): Generate reasoning traces, keep the ones that lead to correct answers, and fine-tune on them.

# Conceptual STaR pipeline
star_pipeline = """
1. Start with a base model and a set of (question, answer) pairs
2. For each question:
   a. Generate reasoning trace + answer (with CoT prompt)
   b. If answer is correct → keep the trace
   c. If wrong → "rationalize" — give the correct answer and ask model to explain why
3. Fine-tune the model on the collected correct reasoning traces
4. Repeat from step 2 with the improved model
"""
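The loop above can be sketched in code. Everything here is schematic: `generate_trace` and `rationalize` stand in for model calls, and the answer check is a plain string comparison:

```python
from typing import Callable

def star_round(
    pairs: list[tuple[str, str]],                      # (question, gold_answer)
    generate_trace: Callable[[str], tuple[str, str]],  # question -> (trace, answer)
    rationalize: Callable[[str, str], str],            # (question, gold) -> trace
) -> list[tuple[str, str, str]]:
    """One STaR iteration: collect (question, trace, answer) triples to fine-tune on."""
    kept = []
    for question, gold in pairs:
        trace, answer = generate_trace(question)
        if answer == gold:
            kept.append((question, trace, gold))  # correct: keep the trace
        else:
            # wrong: show the gold answer and ask the model to explain why it holds
            kept.append((question, rationalize(question, gold), gold))
    return kept

# Fine-tune on `kept`, then repeat the round with the improved model.
```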

Reinforcement Learning with a Verifier

Train a verifier model, then use RL to optimize the generator against it:

Generator creates solution → Verifier scores it → RL updates generator weights
                    ↑                                       |
                    └───────────────────────────────────────┘
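A minimal sketch of the data this loop produces. The generator and verifier are stand-in callables, and the actual weight update (PPO, GRPO, and similar) is left to an RL library:

```python
from typing import Callable

def collect_rl_batch(
    prompts: list[str],
    generate: Callable[[str], str],       # stand-in for the generator model
    verify: Callable[[str, str], float],  # stand-in verifier: (prompt, solution) -> reward
    samples_per_prompt: int = 4,
) -> list[tuple[str, str, float]]:
    """Sample solutions and score them. The (prompt, solution, reward) tuples
    are what an RL trainer would consume to update the generator's weights."""
    batch = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            solution = generate(prompt)
            batch.append((prompt, solution, verify(prompt, solution)))
    return batch
```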

Reward Modeling: ORM vs PRM

Two approaches to scoring model outputs:

| Approach | Scores | Pros | Cons |
|----------|--------|------|------|
| ORM (Outcome Reward Model) | Final answer only | Simple, cheap to label | Doesn't help with partial credit |
| PRM (Process Reward Model) | Each reasoning step | Catches errors early, more signal | Expensive to label, harder to train |

PRM-style supervision is a key ingredient in strong reasoning models: they learn which steps lead to correct answers, not just which final answers are right.
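The scoring difference can be shown concretely. Both reward models below are stand-in functions; taking the minimum of the step scores is one common (not the only) way to aggregate a PRM's output:

```python
def orm_score(final_answer: str, gold: str) -> float:
    """Outcome reward: 1 if the final answer matches, 0 otherwise."""
    return 1.0 if final_answer.strip() == gold.strip() else 0.0

def prm_score(step_scores: list[float]) -> float:
    """Process reward: aggregate per-step scores (min means one flawed
    step sinks the whole chain)."""
    return min(step_scores) if step_scores else 0.0

# A chain with a flawed middle step but a lucky correct answer:
# ORM gives full credit; PRM flags the bad reasoning.
# orm_score("42", "42")        -> 1.0
# prm_score([0.9, 0.2, 0.95])  -> 0.2
```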

Self-Refinement

Train the model on (draft, feedback, improved_draft) triples so it learns to improve its own work.
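Such training examples might look like this (a schematic record format, not a specific dataset schema):

```python
from dataclasses import dataclass

@dataclass
class RefinementExample:
    """One (draft, feedback, improved_draft) training triple."""
    prompt: str
    draft: str            # the model's first attempt
    feedback: str         # critique of the draft (human- or model-written)
    improved_draft: str   # the target: the draft with the feedback applied

example = RefinementExample(
    prompt="Explain HTTP caching in one paragraph.",
    draft="HTTP caching stores stuff so pages load faster.",
    feedback="Too vague. Mention Cache-Control, ETags, and revalidation.",
    improved_draft="HTTP caching lets clients reuse responses: Cache-Control "
                   "sets freshness lifetimes, and ETags let clients revalidate "
                   "stale entries cheaply with conditional requests.",
)
```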

Internalizing Search (Meta-CoT)

The frontier: teach the model to perform the search/backtracking internally, without external tools. The model learns to consider alternatives and backtrack within its own chain of thought.


Project: Build the Deep Research System

Now let’s build a system that plans research, deploys sub-agents in parallel, and synthesizes a comprehensive report.

Deep Research Pipeline

Architecture

The system has three phases:

  1. Planning — A reasoning model decomposes the query into research subtasks
  2. Execution — Parallel sub-agents research each subtask via web search
  3. Synthesis — A reasoning model aggregates findings into a cited report

Step 1: Query Planning

# planner.py
import json
from openai import OpenAI

client = OpenAI()


def clarify_query(query: str) -> str:
    """Optionally refine an ambiguous query."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Determine if this research query is clear enough to proceed. "
                       "If it's ambiguous, return a refined version. "
                       "If it's clear, return it unchanged. "
                       "Only return the query text, nothing else."
        }, {
            "role": "user",
            "content": query,
        }],
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()


def decompose_query(query: str, num_subtasks: int = 4) -> list[dict]:
    """Use a reasoning model to decompose query into research subtasks."""
    response = client.chat.completions.create(
        model="o3",  # Reasoning model for planning
        messages=[{
            "role": "system",
            "content": f"""You are a research planner. Decompose the user's query into
{num_subtasks} independent research subtasks that, combined, will provide a comprehensive answer.

Return a JSON object with this structure:
{{
  "research_plan": {{
    "objective": "one-sentence summary of the research goal",
    "subtasks": [
      {{
        "id": 1,
        "title": "Short title",
        "description": "What to research",
        "search_queries": ["query1", "query2", "query3"],
        "key_questions": ["What specific things to find out"]
      }}
    ]
  }}
}}

Make subtasks independent so they can run in parallel.
Each subtask should have 2-3 specific search queries."""
        }, {
            "role": "user",
            "content": query,
        }],
        response_format={"type": "json_object"},
    )
    plan = json.loads(response.choices[0].message.content)
    return plan["research_plan"]

Step 2: Sub-Agent Execution

# sub_agent.py
import json
import asyncio
from openai import AsyncOpenAI
from tool_executor import web_search, fetch_webpage

async_client = AsyncOpenAI()


async def execute_subtask(subtask: dict) -> dict:
    """Execute a single research subtask by searching and reading."""
    findings = []

    for query in subtask["search_queries"]:
        search_results = web_search(query, num_results=5)
        results = json.loads(search_results)
        findings.append({
            "query": query,
            "results": results,
        })

        # Fetch top 2 most promising pages
        for result in results[:2]:
            try:
                content = fetch_webpage(result["url"])
                findings.append({
                    "url": result["url"],
                    "title": result["title"],
                    "content": content[:2000],
                })
            except Exception:
                continue

    # Summarize findings using the LLM
    findings_text = json.dumps(findings, indent=2)[:8000]

    response = await async_client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "You are a research analyst. Summarize the provided search results "
                       "into key findings. Include specific data points, quotes, and URLs. "
                       "Be factual and cite sources."
        }, {
            "role": "user",
            "content": f"Research task: {subtask['title']}\n"
                       f"Description: {subtask['description']}\n\n"
                       f"Key questions to answer:\n"
                       + "\n".join(f"- {q}" for q in subtask["key_questions"])
                       + f"\n\nSearch findings:\n{findings_text}"
        }],
        temperature=0.2,
    )

    return {
        "subtask_id": subtask["id"],
        "title": subtask["title"],
        "summary": response.choices[0].message.content,
        "sources": [
            f.get("url", "") for f in findings
            if isinstance(f, dict) and "url" in f
        ],
    }


async def execute_all_subtasks(subtasks: list[dict]) -> list[dict]:
    """Run all subtasks in parallel."""
    tasks = [execute_subtask(st) for st in subtasks]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    completed = []
    for r in results:
        if isinstance(r, Exception):
            print(f"Subtask failed: {r}")
        else:
            completed.append(r)
    return completed

Step 3: Synthesis and Citations

# synthesizer.py
import json
from openai import OpenAI

client = OpenAI()


def synthesize_report(query: str, plan: dict,
                      subtask_results: list[dict]) -> str:
    """Use a reasoning model to synthesize all findings into a report."""
    results_block = ""
    all_sources = []

    for result in subtask_results:
        results_block += f"\n\n## {result['title']}\n{result['summary']}"
        all_sources.extend(result.get("sources", []))

    unique_sources = list(dict.fromkeys(s for s in all_sources if s))

    response = client.chat.completions.create(
        model="o3",  # Reasoning model for synthesis
        messages=[{
            "role": "system",
            "content": """You are a senior research analyst writing a comprehensive report.

Your task:
1. Aggregate findings from multiple research sub-agents
2. Create a clear, well-structured report with sections
3. Cross-reference and deduplicate information
4. Identify patterns, contradictions, and key insights
5. Add inline citations as [1], [2], etc.
6. End with a "Sources" section listing all URLs

Report format:
- Executive Summary (2-3 sentences)
- Main sections with headers
- Key data points and statistics highlighted
- Analysis and insights (your synthesis, not just facts)
- Sources list at the end

Write for a knowledgeable audience. Be specific, use numbers, cite everything."""
        }, {
            "role": "user",
            "content": f"Research query: {query}\n\n"
                       f"Research objective: {plan['objective']}\n\n"
                       f"Sub-agent findings:\n{results_block}\n\n"
                       f"Available sources:\n"
                       + "\n".join(f"[{i+1}] {url}" for i, url in enumerate(unique_sources))
        }],
        max_tokens=4000,
    )

    return response.choices[0].message.content


def add_citations(report: str, sources: list[str]) -> str:
    """Post-process to ensure citations are properly formatted."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Review this report and ensure all factual claims have inline citations. "
                       "Add missing citations where needed. Keep the report text unchanged otherwise. "
                       "Append a formatted source list at the end."
        }, {
            "role": "user",
            "content": f"Report:\n{report}\n\nAvailable sources:\n"
                       + "\n".join(f"[{i+1}] {url}" for i, url in enumerate(sources))
        }],
        temperature=0.1,
    )
    return response.choices[0].message.content

The Complete Deep Research Pipeline

# deep_research.py
import asyncio
import time
from planner import clarify_query, decompose_query
from sub_agent import execute_all_subtasks
from synthesizer import synthesize_report, add_citations


class DeepResearch:
    def __init__(self):
        self.stats = {}

    async def research(self, query: str) -> dict:
        """Run the full deep research pipeline."""
        start_time = time.time()
        print(f"Starting deep research: {query}\n")

        # Phase 1: Planning
        print("Phase 1: Planning...")
        refined_query = clarify_query(query)
        plan = decompose_query(refined_query)
        print(f"  Objective: {plan['objective']}")
        print(f"  Subtasks: {len(plan['subtasks'])}")
        for st in plan["subtasks"]:
            print(f"    {st['id']}. {st['title']}")
        plan_time = time.time() - start_time

        # Phase 2: Parallel execution
        print(f"\nPhase 2: Executing {len(plan['subtasks'])} sub-agents in parallel...")
        exec_start = time.time()
        results = await execute_all_subtasks(plan["subtasks"])
        exec_time = time.time() - exec_start
        print(f"  Completed {len(results)}/{len(plan['subtasks'])} subtasks in {exec_time:.1f}s")

        # Phase 3: Synthesis
        print("\nPhase 3: Synthesizing report...")
        synth_start = time.time()
        report = synthesize_report(refined_query, plan, results)

        all_sources = []
        for r in results:
            all_sources.extend(r.get("sources", []))
        unique_sources = list(dict.fromkeys(s for s in all_sources if s))

        final_report = add_citations(report, unique_sources)
        synth_time = time.time() - synth_start

        total_time = time.time() - start_time
        self.stats = {
            "total_seconds": total_time,
            "planning_seconds": plan_time,
            "execution_seconds": exec_time,
            "synthesis_seconds": synth_time,
            "subtasks": len(plan["subtasks"]),
            "sources": len(unique_sources),
        }

        print(f"\nDone in {total_time:.1f}s")
        print(f"  Planning: {plan_time:.1f}s | Execution: {exec_time:.1f}s | Synthesis: {synth_time:.1f}s")
        print(f"  Sources: {len(unique_sources)}")

        return {
            "query": refined_query,
            "report": final_report,
            "plan": plan,
            "sources": unique_sources,
            "stats": self.stats,
        }


async def main():
    researcher = DeepResearch()

    result = await researcher.research(
        "Analyze the competitive landscape of AI coding assistants in 2026. "
        "Compare features, pricing, market share, and technology approaches."
    )

    print("\n" + "=" * 80)
    print("DEEP RESEARCH REPORT")
    print("=" * 80)
    print(result["report"])


if __name__ == "__main__":
    asyncio.run(main())

Adding a FastAPI Endpoint

# server.py
import asyncio
from fastapi import FastAPI
from pydantic import BaseModel
from deep_research import DeepResearch

app = FastAPI(title="Deep Research API")
researcher = DeepResearch()


class ResearchRequest(BaseModel):
    query: str
    num_subtasks: int = 4


@app.post("/research")
async def research(req: ResearchRequest):
    # num_subtasks is accepted but not yet threaded through DeepResearch;
    # pass it along to decompose_query() if you want to honor it per-request.
    result = await researcher.research(req.query)
    return result

Running It

# Install dependencies
pip install openai httpx beautifulsoup4 tavily-python fastapi uvicorn

# Set API keys
export OPENAI_API_KEY=sk-your-key
export TAVILY_API_KEY=tvly-your-key

# Run the deep research
python deep_research.py

Example output structure:

# Competitive Landscape of AI Coding Assistants (2026)

## Executive Summary
The AI coding assistant market reached $X billion in 2026, with three dominant
players... [1][2]

## Market Overview
### Market Size and Growth
...

## Player Comparison
| Feature | GitHub Copilot | Cursor | Claude Code | Windsurf |
|---------|---------------|--------|-------------|----------|
| ...

## Technology Approaches
...

## Analysis and Key Insights
1. The market is consolidating around...
2. Open-source alternatives are gaining...

## Sources
[1] https://...
[2] https://...

Optimizing the Pipeline

Cost Management

Deep research is expensive — multiple reasoning model calls plus many search queries. Strategies to reduce cost:

# Use cheaper models for simple tasks, reasoning models for planning/synthesis
MODEL_MAP = {
    "planning": "o3",           # Needs strong reasoning
    "sub_agent_search": "gpt-4o-mini",  # Simple extraction
    "sub_agent_summary": "gpt-4o",      # Good analysis
    "synthesis": "o3",          # Needs strong reasoning
    "citations": "gpt-4o-mini", # Simple formatting
}

| Component | Model | Est. Cost per Research |
|-----------|-------|------------------------|
| Planning | o3 | $0.10–0.30 |
| Sub-agents (4×) | gpt-4o-mini | $0.02–0.08 |
| Search API | Tavily | $0.01–0.04 |
| Synthesis | o3 | $0.15–0.50 |
| Citations | gpt-4o-mini | $0.01–0.02 |
| **Total** | | **$0.30–0.95** |
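To keep spend visible in practice, you can tally token usage per call from the API responses. The per-token prices below are placeholders, not current rates; look up your provider's pricing page:

```python
# Placeholder prices in $ per 1M tokens. NOT current rates; check your provider.
PRICES = {
    "o3":          {"in": 10.0, "out": 40.0},
    "gpt-4o":      {"in": 2.5,  "out": 10.0},
    "gpt-4o-mini": {"in": 0.15, "out": 0.6},
}

class CostTracker:
    def __init__(self):
        self.total = 0.0

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
        """Add one call's cost to the running total and return it."""
        p = PRICES[model]
        cost = (input_tokens * p["in"] + output_tokens * p["out"]) / 1_000_000
        self.total += cost
        return cost

# After each call:
# tracker.record(model, resp.usage.prompt_tokens, resp.usage.completion_tokens)
```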

Quality Improvements

  • Iterative deepening — if the synthesizer identifies gaps, trigger additional sub-agent searches
  • Source validation — verify URLs still return content before citing
  • Cross-referencing — flag claims that appear in only one source
  • Confidence scoring — rate each finding by how many independent sources confirm it
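The last two bullets can be sketched as a pure post-processing step. The `findings` shape here is an assumption (claim text plus supporting source URLs), not the exact output of the pipeline above:

```python
def score_findings(findings: list[dict]) -> list[dict]:
    """Attach a confidence score based on independent-source count, and flag
    single-source claims for review.

    Each finding is assumed to look like {"claim": str, "sources": [url, ...]}.
    """
    scored = []
    for f in findings:
        n = len(set(f.get("sources", [])))  # deduplicate repeated URLs
        scored.append({
            **f,
            "confidence": min(1.0, n / 3),  # saturate at 3 independent sources
            "needs_review": n <= 1,         # the cross-referencing flag
        })
    return scored
```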

Key Takeaways

  1. Reasoning models think before answering — they trade latency and cost for accuracy on hard problems (math, planning, analysis)
  2. Inference-time scaling is powerful — CoT, self-consistency, and Tree of Thoughts can significantly improve any model’s output quality
  3. Use reasoning models strategically — for planning and synthesis, not for every call. Mix model tiers to optimize cost
  4. Deep research = plan + parallel execute + synthesize — this three-phase pattern is how ChatGPT, Gemini, and Perplexity implement their research features
  5. Training-time techniques (STaR, PRM, RL) are how reasoning models are built — understanding them helps you choose the right model for the job
  6. Always cite sources — a deep research system that doesn’t cite is just a hallucination generator

What’s Next

In the next lesson, we’ll build a Multi-modal Generation Agent that goes beyond text — generating images, audio, and video by orchestrating multiple AI models into a unified pipeline.