Lesson 13 · LLM Engineering in Production · 18 min read

Building AI Agents with Tool Use

April 01, 2026

TL;DR

AI agents = LLM + tools + a loop. The model decides which tool to call, you execute it, feed the result back, and repeat until the task is done. Use OpenAI's function calling or Anthropic's tool use for structured tool invocation. The ReAct pattern (Reason → Act → Observe) gives you auditable decision traces. In production: limit iterations, validate tool inputs, handle failures gracefully, and always have a human-in-the-loop escape hatch. Agents are powerful but unpredictable — constrain them.

An LLM by itself can only generate text. It cannot check the weather, query a database, run a calculation, or send an email. An AI agent is what you get when you give an LLM tools and let it decide when to use them.

The core idea is simple: the model receives a task, reasons about what to do, picks a tool, you execute the tool and return the result, and the model reasons again. This loop continues until the task is complete. That’s it. Everything else is engineering to make this loop reliable.

What Makes Something an “Agent”

An agent has three components:

  1. An LLM — the reasoning engine that decides what to do next
  2. Tools — functions the model can invoke (search, calculate, read files, call APIs)
  3. A loop — the orchestration that feeds tool results back to the model

If your code calls the LLM once and returns the response, that’s not an agent. If your code calls the LLM, which requests a tool call, you execute the tool, send the result back, and the LLM either requests another tool or produces a final answer — that’s an agent.

# The simplest possible agent loop (pseudocode)
while True:
    response = llm.generate(messages)
    if response.has_tool_calls():
        messages.append(response.message)  # keep the tool-call request in the history
        for tool_call in response.tool_calls:
            result = execute_tool(tool_call)
            messages.append(tool_result(result))
    else:
        return response.text  # Final answer

The power — and the danger — is that the model decides what happens next. You don’t write if/else branches for every scenario. The model reasons about the task and picks the right tool. This makes agents flexible, but also unpredictable.

Function Calling with OpenAI

OpenAI’s function calling API lets you define tools as JSON schemas. The model returns structured JSON telling you which function to call with which arguments.

Defining Tools

Tools are defined as a list of function descriptions:

import openai
import json

client = openai.OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city. Use this when the user asks about weather conditions.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name, e.g. 'San Francisco'"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["city"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Evaluate a mathematical expression. Use this for any arithmetic.",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "Math expression to evaluate, e.g. '(42 * 3) + 17'"
                    }
                },
                "required": ["expression"]
            }
        }
    }
]

The description fields are critical. The model reads them to decide when to use each tool. Vague descriptions produce unreliable tool selection. Be specific about what the tool does and when it should be used.

Making a Tool-Enabled Call

messages = [
    {"role": "system", "content": "You are a helpful assistant. Use tools when needed."},
    {"role": "user", "content": "What's the weather in Tokyo and what's 42 * 17?"}
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
    tool_choice="auto"  # Model decides whether to use tools
)

message = response.choices[0].message
print(f"Finish reason: {response.choices[0].finish_reason}")
# finish_reason will be "tool_calls" if the model wants to use tools
# finish_reason will be "stop" if the model is done
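
When finish_reason is "tool_calls", the structured requests live on message.tool_calls — a quick way to inspect them:

for tc in (message.tool_calls or []):
    # function.arguments is a JSON string, not a parsed dict
    print(tc.function.name, tc.function.arguments)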

Processing Tool Calls

When the model decides to use tools, you need to execute them and return results:

def execute_tool(name: str, arguments: dict) -> str:
    """Execute a tool and return the result as a string."""
    if name == "get_weather":
        # In production, this calls a real weather API
        return json.dumps({
            "city": arguments["city"],
            "temperature": 22,
            "unit": arguments.get("unit", "celsius"),
            "condition": "partly cloudy"
        })
    elif name == "calculate":
        try:
            # WARNING: eval is dangerous. Use a safe math parser in production.
            result = eval(arguments["expression"])
            return json.dumps({"result": result})
        except Exception as e:
            return json.dumps({"error": str(e)})
    else:
        return json.dumps({"error": f"Unknown tool: {name}"})


def run_agent(messages: list, tools: list, max_iterations: int = 10) -> str:
    """Run the agent loop until the model produces a final answer."""
    for i in range(max_iterations):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto"
        )

        message = response.choices[0].message

        # If no tool calls, the model is done
        if not message.tool_calls:
            return message.content

        # Add the assistant's message (with tool calls) to history
        messages.append(message)

        # Execute each tool call and add results
        for tool_call in message.tool_calls:
            arguments = json.loads(tool_call.function.arguments)
            result = execute_tool(tool_call.function.name, arguments)

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result
            })

        print(f"  Iteration {i+1}: executed {len(message.tool_calls)} tool(s)")

    return "Agent reached maximum iterations without a final answer."

The tool_call_id is essential. OpenAI uses it to match each tool result with the corresponding tool call. Omitting it causes an API error.

Parallel Tool Calls

When the model needs multiple independent pieces of information, it can request several tool calls in a single response. The example above already handles this — message.tool_calls is a list. OpenAI returns multiple entries when the model determines the calls don’t depend on each other:

# The model might return two tool calls at once:
# tool_calls[0]: get_weather(city="Tokyo")
# tool_calls[1]: calculate(expression="42 * 17")
# Execute both, return both results, then the model synthesizes the answer.

You can execute these in parallel using asyncio or concurrent.futures:

from concurrent.futures import ThreadPoolExecutor

def execute_tools_parallel(tool_calls):
    """Execute multiple tool calls in parallel."""
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = {}
        for tc in tool_calls:
            args = json.loads(tc.function.arguments)
            future = executor.submit(execute_tool, tc.function.name, args)
            futures[tc.id] = future

        results = []
        for tc in tool_calls:
            result = futures[tc.id].result(timeout=30)
            results.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": result
            })
        return results
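
If your service is already async, the same fan-out works with asyncio — a minimal sketch that pushes the synchronous execute_tool onto worker threads:

import asyncio

async def execute_tools_async(tool_calls):
    """Run multiple tool calls concurrently from async code."""
    async def run_one(tc):
        args = json.loads(tc.function.arguments)
        result = await asyncio.to_thread(execute_tool, tc.function.name, args)
        return {"role": "tool", "tool_call_id": tc.id, "content": result}

    return await asyncio.gather(*(run_one(tc) for tc in tool_calls))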

Tool Use with Anthropic

Anthropic’s tool use API follows a similar pattern but with different message structures. Instead of tool_calls in the response, Claude returns tool_use content blocks.

Defining Tools for Claude

import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "City name"
                }
            },
            "required": ["city"]
        }
    },
    {
        "name": "calculate",
        "description": "Evaluate a mathematical expression.",
        "input_schema": {
            "type": "object",
            "properties": {
                "expression": {
                    "type": "string",
                    "description": "Math expression to evaluate"
                }
            },
            "required": ["expression"]
        }
    }
]

The Anthropic Agent Loop

def run_anthropic_agent(user_message: str, tools: list, max_iterations: int = 10) -> str:
    messages = [{"role": "user", "content": user_message}]

    for i in range(max_iterations):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            tools=tools,
            messages=messages
        )

        # Check if the model wants to use tools
        if response.stop_reason == "tool_use":
            # Add assistant response to history
            messages.append({"role": "assistant", "content": response.content})

            # Extract tool use blocks and execute them
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })

            # Add tool results as a user message
            messages.append({"role": "user", "content": tool_results})
            print(f"  Iteration {i+1}: executed {len(tool_results)} tool(s)")

        elif response.stop_reason == "end_turn":
            # Extract the text response
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return ""

    return "Agent reached maximum iterations."

Key differences from OpenAI:

  • Tool results are sent as role: "user" messages with tool_result content blocks (not role: "tool")
  • The response contains content blocks — a mix of text and tool_use blocks
  • Stop reason is "tool_use" instead of finish_reason: "tool_calls"
  • Tool inputs are already parsed dicts (no json.loads needed)
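
The difference is clearest in the raw message shapes (IDs here are illustrative):

# OpenAI: tool results get their own role
{"role": "tool", "tool_call_id": "call_abc123", "content": "{...}"}

# Anthropic: tool results ride inside a user turn as content blocks
{"role": "user", "content": [
    {"type": "tool_result", "tool_use_id": "toolu_abc123", "content": "{...}"}
]}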

The ReAct Pattern

ReAct (Reason + Act) is a prompting pattern where you explicitly ask the model to think through its reasoning before choosing a tool. This gives you an auditable trace of why the agent did what it did.

The pattern follows a cycle: Reason → Act → Observe → Reason → …

REACT_SYSTEM_PROMPT = """You are a helpful assistant with access to tools.

For each step, follow this exact format:

Thought: [Your reasoning about what to do next]
Action: [The tool to use]
Action Input: [The input for the tool]

After receiving a tool result, you'll see:
Observation: [The tool result]

Then continue with another Thought, or provide your final answer:
Thought: I now have enough information to answer.
Final Answer: [Your complete answer to the user]

Always start with a Thought. Never skip the reasoning step."""
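
With the prompt-based variant you parse the model's text yourself. A minimal parser, assuming the model sticks to the format above:

import re

def parse_react_step(text: str) -> dict:
    """Extract Thought/Action/Action Input (or the Final Answer) from a ReAct response."""
    final = re.search(r"Final Answer:\s*(.+)", text, re.DOTALL)
    if final:
        return {"final_answer": final.group(1).strip()}
    thought = re.search(r"Thought:\s*(.+?)(?:\nAction:|\Z)", text, re.DOTALL)
    action = re.search(r"Action:\s*(\S+)", text)
    action_input = re.search(r"Action Input:\s*(.+)", text, re.DOTALL)
    return {
        "thought": thought.group(1).strip() if thought else None,
        "action": action.group(1) if action else None,
        "action_input": action_input.group(1).strip() if action_input else None,
    }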

Implementing ReAct with Structured Output

While the prompt-based ReAct above works, a more robust approach uses the function calling API for the “Act” step while letting the model reason in its text output:

def run_react_agent(query: str, tools: list, max_iterations: int = 10) -> dict:
    """
    ReAct agent that logs reasoning traces.
    Returns both the answer and the full trace.
    """
    messages = [
        {
            "role": "system",
            "content": (
                "You are a reasoning agent. Before using any tool, explain your "
                "thinking in your text response. After receiving tool results, "
                "reason about what you learned and what to do next. When you have "
                "enough information, provide your final answer without using tools."
            )
        },
        {"role": "user", "content": query}
    ]

    trace = []  # Audit log of every step

    for i in range(max_iterations):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto"
        )

        message = response.choices[0].message

        # Log the reasoning (text content)
        if message.content:
            trace.append({
                "step": i + 1,
                "type": "reasoning",
                "content": message.content
            })

        # If no tool calls, we have the final answer
        if not message.tool_calls:
            return {
                "answer": message.content,
                "trace": trace,
                "iterations": i + 1
            }

        # Process tool calls
        messages.append(message)

        for tool_call in message.tool_calls:
            arguments = json.loads(tool_call.function.arguments)

            trace.append({
                "step": i + 1,
                "type": "tool_call",
                "tool": tool_call.function.name,
                "arguments": arguments
            })

            result = execute_tool(tool_call.function.name, arguments)

            trace.append({
                "step": i + 1,
                "type": "observation",
                "result": result
            })

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result
            })

    return {
        "answer": "Max iterations reached.",
        "trace": trace,
        "iterations": max_iterations
    }


# Usage
result = run_react_agent("What's the weather in London and convert 72°F to Celsius?", tools)

print(f"Answer: {result['answer']}")
print(f"Steps taken: {result['iterations']}")
for step in result["trace"]:
    print(f"  [{step['type']}] {step.get('content') or step.get('tool') or step.get('result')}")

The trace is invaluable for debugging. When an agent produces a wrong answer, the trace shows you exactly where the reasoning went off track.

Designing Good Tools

Tool design is where most agent projects succeed or fail. The model’s ability to use a tool correctly depends almost entirely on how well you describe it.

Input Schemas

Be explicit about types, formats, and constraints:

# BAD: vague description, no constraints
{
    "name": "search",
    "description": "Search for stuff",
    "input_schema": {
        "type": "object",
        "properties": {
            "q": {"type": "string"}
        }
    }
}

# GOOD: specific description, clear parameter names, constraints documented
{
    "name": "search_knowledge_base",
    "description": (
        "Search the internal knowledge base for relevant documents. "
        "Returns up to 5 results ranked by relevance. Use this when the user "
        "asks about company policies, product documentation, or internal processes. "
        "Do NOT use this for general knowledge questions."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural language search query. Be specific — 'vacation policy for US employees' works better than 'vacation'."
            },
            "max_results": {
                "type": "integer",
                "description": "Maximum number of results to return (1-10). Default is 5.",
                "default": 5
            },
            "department": {
                "type": "string",
                "enum": ["engineering", "sales", "hr", "finance", "all"],
                "description": "Filter results to a specific department. Use 'all' if unsure."
            }
        },
        "required": ["query"]
    }
}

Output Formats

Tool outputs should be structured, concise, and include error context:

def search_knowledge_base(query: str, max_results: int = 5, department: str = "all") -> str:
    """Always return structured JSON from tools."""
    try:
        # `vector_store` stands in for your search backend (e.g., a vector DB client)
        results = vector_store.search(query, limit=max_results, filter_dept=department)

        return json.dumps({
            "status": "success",
            "query": query,
            "result_count": len(results),
            "results": [
                {
                    "title": r.title,
                    "snippet": r.text[:200],
                    "relevance_score": round(r.score, 3),
                    "source": r.metadata.get("source", "unknown"),
                    "last_updated": r.metadata.get("updated_at", "unknown")
                }
                for r in results
            ]
        })
    except Exception as e:
        return json.dumps({
            "status": "error",
            "error_type": type(e).__name__,
            "error_message": str(e),
            "suggestion": "Try a broader search query or different department filter."
        })

Key principles:

  • Always return JSON strings, not raw text
  • Include a status field so the model knows if the tool succeeded
  • On errors, include the error type, message, and a suggestion for what the model should try next
  • Truncate large results — the model doesn’t need 10,000 characters of raw data (see the helper below)
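
A tiny helper for that last point — a sketch that caps output size while telling the model the result was cut:

def truncate_tool_output(result: str, max_chars: int = 2000) -> str:
    """Cap tool output size and flag the truncation so the model knows."""
    if len(result) <= max_chars:
        return result
    return result[:max_chars] + f"\n...[truncated {len(result) - max_chars} characters]"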

Input Validation

Never trust the model’s tool inputs blindly:

def validate_and_execute(tool_name: str, arguments: dict) -> str:
    """Validate tool inputs before execution."""
    if tool_name == "calculate":
        expr = arguments.get("expression", "")

        # Block dangerous expressions
        dangerous_patterns = ["import", "exec", "eval", "__", "os.", "sys."]
        for pattern in dangerous_patterns:
            if pattern in expr:
                return json.dumps({
                    "status": "error",
                    "error_message": f"Expression contains blocked pattern: {pattern}"
                })

        # Limit length
        if len(expr) > 200:
            return json.dumps({
                "status": "error",
                "error_message": "Expression too long. Maximum 200 characters."
            })

    elif tool_name == "search_knowledge_base":
        query = arguments.get("query", "")
        if len(query) < 3:
            return json.dumps({
                "status": "error",
                "error_message": "Query too short. Provide at least 3 characters."
            })
        if len(query) > 500:
            arguments["query"] = query[:500]  # Truncate silently

    return execute_tool(tool_name, arguments)
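
Hand-rolled checks work, but each tool already declares a JSON schema — you can validate arguments against it directly. A sketch, assuming the third-party jsonschema package is installed:

from jsonschema import ValidationError, validate

def validate_against_schema(tool_def: dict, arguments: dict) -> str | None:
    """Check model-generated arguments against the tool's declared schema.
    Returns an error JSON string, or None if the arguments are valid."""
    schema = tool_def["function"]["parameters"]  # OpenAI-style tool definition
    try:
        validate(instance=arguments, schema=schema)
        return None
    except ValidationError as e:
        return json.dumps({
            "status": "error",
            "error_message": f"Invalid arguments: {e.message}"
        })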

Error Handling in Agent Loops

Tools fail. APIs time out. The model generates invalid arguments. Your agent must handle all of this gracefully.

import time
import traceback

def robust_agent_loop(
    messages: list,
    tools: list,
    max_iterations: int = 10,
    tool_timeout: float = 30.0
) -> dict:
    """
    Production-grade agent loop with error handling.
    """
    total_tokens = 0
    total_cost = 0.0
    errors = []

    for i in range(max_iterations):
        try:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                tools=tools,
                tool_choice="auto",
                timeout=60
            )
        except openai.APITimeoutError:
            errors.append({"iteration": i, "error": "LLM API timeout"})
            # Retry once
            try:
                response = client.chat.completions.create(
                    model="gpt-4o",
                    messages=messages,
                    tools=tools,
                    tool_choice="auto",
                    timeout=120
                )
            except Exception as e:
                return {
                    "answer": "I'm having trouble connecting to the AI service. Please try again.",
                    "errors": errors,
                    "tokens": total_tokens,
                    "cost": total_cost
                }
        except openai.APIError as e:
            errors.append({"iteration": i, "error": str(e)})
            return {
                "answer": "An error occurred. Please try again.",
                "errors": errors,
                "tokens": total_tokens,
                "cost": total_cost
            }

        # Track usage
        if response.usage:
            total_tokens += response.usage.total_tokens
            # gpt-4o list prices at time of writing: $2.50 / 1M input, $10 / 1M output
            total_cost += (response.usage.prompt_tokens * 0.0025 / 1000 +
                          response.usage.completion_tokens * 0.01 / 1000)

        message = response.choices[0].message

        if not message.tool_calls:
            return {
                "answer": message.content,
                "errors": errors,
                "tokens": total_tokens,
                "cost": total_cost,
                "iterations": i + 1
            }

        messages.append(message)

        for tool_call in message.tool_calls:
            # Parse arguments safely
            try:
                arguments = json.loads(tool_call.function.arguments)
            except json.JSONDecodeError:
                errors.append({
                    "iteration": i,
                    "error": f"Invalid JSON from model: {tool_call.function.arguments}"
                })
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": json.dumps({
                        "status": "error",
                        "error_message": "Invalid arguments. Please try again with valid JSON."
                    })
                })
                continue

            # Execute and measure latency. Note: this only flags slow tools after
            # the fact; a hard cutoff would need a separate thread or process.
            try:
                start = time.time()
                result = execute_tool(tool_call.function.name, arguments)
                elapsed = time.time() - start

                if elapsed > tool_timeout:
                    errors.append({
                        "iteration": i,
                        "error": f"Tool {tool_call.function.name} took {elapsed:.1f}s"
                    })

            except Exception as e:
                errors.append({
                    "iteration": i,
                    "error": f"Tool execution failed: {traceback.format_exc()}"
                })
                result = json.dumps({
                    "status": "error",
                    "error_message": f"Tool failed: {str(e)}",
                    "suggestion": "Try a different approach or ask the user for clarification."
                })

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result
            })

    return {
        "answer": "I wasn't able to complete the task within the allowed steps. Here's what I found so far.",
        "errors": errors,
        "tokens": total_tokens,
        "cost": total_cost,
        "iterations": max_iterations
    }

Human-in-the-Loop

Some actions are dangerous — deleting data, sending emails, making purchases. For these, require human confirmation:

DANGEROUS_TOOLS = {"delete_record", "send_email", "make_purchase", "modify_database"}

def agent_with_confirmation(
    messages: list,
    tools: list,
    confirm_callback,  # Function that asks the user for confirmation
    max_iterations: int = 10
) -> dict:
    """Agent that pauses for human approval on dangerous actions."""

    for i in range(max_iterations):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto"
        )

        message = response.choices[0].message

        if not message.tool_calls:
            return {"answer": message.content}

        messages.append(message)

        for tool_call in message.tool_calls:
            arguments = json.loads(tool_call.function.arguments)

            # Check if this tool requires confirmation
            if tool_call.function.name in DANGEROUS_TOOLS:
                approved = confirm_callback(
                    tool_name=tool_call.function.name,
                    arguments=arguments,
                    reasoning=message.content  # Show the model's reasoning
                )

                if not approved:
                    messages.append({
                        "role": "tool",
                        "tool_call_id": tool_call.id,
                        "content": json.dumps({
                            "status": "denied",
                            "message": "User denied this action. Ask the user how they'd like to proceed."
                        })
                    })
                    continue

            result = execute_tool(tool_call.function.name, arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result
            })

    return {"answer": "Max iterations reached."}


# Example confirmation callback for a CLI application
def cli_confirm(tool_name: str, arguments: dict, reasoning: str) -> bool:
    print(f"\n⚠️  Agent wants to execute: {tool_name}")
    print(f"   Arguments: {json.dumps(arguments, indent=2)}")
    print(f"   Reasoning: {reasoning}")
    response = input("   Approve? (y/n): ").strip().lower()
    return response == "y"

Agent Memory Patterns

Agents need to remember context across iterations. There are three common patterns:

1. Conversation History (Built-in)

The simplest form — the full message list is the memory. This is what all the examples above use. The downside is that long agent runs accumulate many messages, increasing token costs.

2. Scratchpad

A dedicated tool that lets the agent save and retrieve notes:

agent_scratchpad = {}

scratchpad_tools = [
    {
        "type": "function",
        "function": {
            "name": "save_note",
            "description": "Save a note to your scratchpad for later reference. Use this to track intermediate results, plans, or important findings.",
            "parameters": {
                "type": "object",
                "properties": {
                    "key": {"type": "string", "description": "A short label for this note"},
                    "value": {"type": "string", "description": "The content to save"}
                },
                "required": ["key", "value"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "read_notes",
            "description": "Read all saved notes from your scratchpad.",
            "parameters": {"type": "object", "properties": {}}
        }
    }
]

def execute_scratchpad(name: str, arguments: dict) -> str:
    if name == "save_note":
        agent_scratchpad[arguments["key"]] = arguments["value"]
        return json.dumps({"status": "saved", "key": arguments["key"]})
    elif name == "read_notes":
        return json.dumps({"notes": agent_scratchpad})
    return json.dumps({"status": "error", "error_message": f"Unknown tool: {name}"})

3. Summarization

For long-running agents, periodically summarize the conversation history to keep token counts manageable:

def summarize_history(messages: list, keep_recent: int = 4) -> list:
    """Compress old messages into a summary, keep recent ones intact."""
    if len(messages) <= keep_recent + 2:  # system + user + recent
        return messages

    system_msg = messages[0]
    old_messages = messages[1:-keep_recent]
    recent_messages = messages[-keep_recent:]

    # Use the LLM to summarize
    summary_response = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper model for summarization
        messages=[
            {"role": "system", "content": "Summarize this conversation concisely. Focus on key findings, decisions made, and what was accomplished."},
            {"role": "user", "content": json.dumps([{"role": m.get("role", ""), "content": str(m.get("content", ""))} for m in old_messages])}
        ],
        max_tokens=500
    )

    summary = summary_response.choices[0].message.content

    return [
        system_msg,
        {"role": "system", "content": f"Summary of earlier conversation: {summary}"},
        *recent_messages
    ]
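
Call it from inside the agent loop whenever the history grows past a threshold:

# Before each LLM call (the threshold is arbitrary — tune it for your workload)
if len(messages) > 20:
    messages = summarize_history(messages, keep_recent=4)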

Building a Multi-Tool Agent

Let’s build a practical agent with multiple tools — web search, calculation, and a knowledge base lookup:

import json
import time
import hashlib
from datetime import datetime

# --- Tool implementations ---

def web_search(query: str, num_results: int = 3) -> str:
    """Simulated web search. In production, use SerpAPI, Brave, or Tavily."""
    # Replace with a real search API call
    return json.dumps({
        "status": "success",
        "results": [
            {
                "title": f"Result {i+1} for: {query}",
                "url": f"https://example.com/result-{i+1}",
                "snippet": f"This is a search result about {query}..."
            }
            for i in range(num_results)
        ]
    })


def safe_calculate(expression: str) -> str:
    """Safe math evaluation without eval()."""
    import ast
    import operator

    ops = {
        ast.Add: operator.add,
        ast.Sub: operator.sub,
        ast.Mult: operator.mul,
        ast.Div: operator.truediv,
        ast.Pow: operator.pow,
        ast.USub: operator.neg,
    }

    def _eval(node):
        if isinstance(node, ast.Constant):
            return node.value
        elif isinstance(node, ast.BinOp):
            return ops[type(node.op)](_eval(node.left), _eval(node.right))
        elif isinstance(node, ast.UnaryOp):
            return ops[type(node.op)](_eval(node.operand))
        else:
            raise ValueError(f"Unsupported operation: {type(node)}")

    try:
        tree = ast.parse(expression, mode="eval")
        result = _eval(tree.body)
        return json.dumps({"status": "success", "result": result})
    except Exception as e:
        return json.dumps({"status": "error", "error_message": str(e)})


def get_current_time(timezone: str = "UTC") -> str:
    # Simplified: always returns UTC and echoes back the requested timezone
    return json.dumps({
        "status": "success",
        "time": datetime.utcnow().isoformat() + "Z",
        "timezone": timezone
    })


# --- Tool registry ---

TOOL_REGISTRY = {
    "web_search": web_search,
    "calculate": safe_calculate,
    "get_current_time": get_current_time,
}

TOOL_DEFINITIONS = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for current information. Use this for factual questions, recent events, or topics you're not sure about.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "num_results": {"type": "integer", "description": "Number of results (1-5)", "default": 3}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Evaluate a mathematical expression. Supports +, -, *, /, and ** (power).",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {"type": "string", "description": "Math expression, e.g. '(42 * 3) + 17'"}
                },
                "required": ["expression"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_current_time",
            "description": "Get the current date and time.",
            "parameters": {
                "type": "object",
                "properties": {
                    "timezone": {"type": "string", "description": "Timezone (default: UTC)"}
                }
            }
        }
    }
]


# --- The complete agent ---

class ProductionAgent:
    def __init__(self, system_prompt: str, tools: list, tool_registry: dict,
                 model: str = "gpt-4o", max_iterations: int = 10):
        self.system_prompt = system_prompt
        self.tools = tools
        self.tool_registry = tool_registry
        self.model = model
        self.max_iterations = max_iterations
        self.client = openai.OpenAI()

    def run(self, user_message: str) -> dict:
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": user_message}
        ]

        run_id = hashlib.md5(f"{user_message}{time.time()}".encode()).hexdigest()[:8]
        log = {
            "run_id": run_id,
            "query": user_message,
            "started_at": datetime.utcnow().isoformat(),
            "steps": [],
            "total_tokens": 0,
            "total_cost": 0.0
        }

        for i in range(self.max_iterations):
            step = {"iteration": i + 1, "tool_calls": []}

            try:
                response = self.client.chat.completions.create(
                    model=self.model,
                    messages=messages,
                    tools=self.tools,
                    tool_choice="auto"
                )
            except Exception as e:
                step["error"] = str(e)
                log["steps"].append(step)
                log["answer"] = f"Error: {str(e)}"
                return log

            # Track tokens
            if response.usage:
                tokens = response.usage.total_tokens
                cost = (response.usage.prompt_tokens * 0.0025 / 1000 +
                       response.usage.completion_tokens * 0.01 / 1000)
                log["total_tokens"] += tokens
                log["total_cost"] += cost
                step["tokens"] = tokens

            message = response.choices[0].message

            if message.content:
                step["reasoning"] = message.content

            # Final answer
            if not message.tool_calls:
                step["final_answer"] = True
                log["steps"].append(step)
                log["answer"] = message.content
                log["finished_at"] = datetime.utcnow().isoformat()
                log["iterations"] = i + 1
                return log

            # Process tool calls
            messages.append(message)

            for tc in message.tool_calls:
                tool_step = {
                    "tool": tc.function.name,
                    "arguments": json.loads(tc.function.arguments)
                }

                try:
                    func = self.tool_registry.get(tc.function.name)
                    if not func:
                        result = json.dumps({"status": "error", "error_message": f"Unknown tool: {tc.function.name}"})
                    else:
                        start = time.time()
                        result = func(**json.loads(tc.function.arguments))
                        tool_step["latency_ms"] = round((time.time() - start) * 1000)
                except Exception as e:
                    result = json.dumps({"status": "error", "error_message": str(e)})
                    tool_step["error"] = str(e)

                tool_step["result_preview"] = result[:200]
                step["tool_calls"].append(tool_step)

                messages.append({
                    "role": "tool",
                    "tool_call_id": tc.id,
                    "content": result
                })

            log["steps"].append(step)

        log["answer"] = "Max iterations reached."
        log["finished_at"] = datetime.utcnow().isoformat()
        log["iterations"] = self.max_iterations
        return log


# --- Usage ---

agent = ProductionAgent(
    system_prompt="You are a helpful research assistant. Use tools to find accurate information. Always cite your sources.",
    tools=TOOL_DEFINITIONS,
    tool_registry=TOOL_REGISTRY,
    max_iterations=8
)

result = agent.run("What time is it, and what's 2^10 * 3.14?")
print(f"Answer: {result['answer']}")
print(f"Tokens used: {result['total_tokens']}")
print(f"Cost: ${result['total_cost']:.4f}")
print(f"Iterations: {result['iterations']}")

Production Logging and Cost Tracking

Every agent run should be logged. In production, you need to answer questions like “why did the agent do that?” and “how much did this cost?”

import logging
import json
from dataclasses import dataclass, asdict
from datetime import datetime

logger = logging.getLogger("agent")
logger.setLevel(logging.INFO)

@dataclass
class AgentRunLog:
    run_id: str
    user_id: str
    query: str
    answer: str
    model: str
    total_tokens: int
    total_cost: float
    iterations: int
    tool_calls: list
    errors: list
    started_at: str
    finished_at: str
    latency_ms: float

    def to_json(self) -> str:
        return json.dumps(asdict(self), default=str)


def log_agent_run(run_log: AgentRunLog):
    """Log to structured JSON for analysis."""
    logger.info(run_log.to_json())

    # Also store in database for dashboards
    # db.insert("agent_runs", asdict(run_log))

    # Alert on anomalies
    if run_log.total_cost > 0.50:
        logger.warning(f"High-cost agent run: ${run_log.total_cost:.2f} for run {run_log.run_id}")

    if run_log.iterations >= 8:
        logger.warning(f"Agent used {run_log.iterations} iterations for run {run_log.run_id}")

    if run_log.errors:
        logger.error(f"Agent run {run_log.run_id} had {len(run_log.errors)} errors")

Testing Agents

Agents are non-deterministic by nature. Testing them requires a different approach than unit testing deterministic functions.

import unittest
from unittest.mock import patch, MagicMock

class TestAgentToolExecution(unittest.TestCase):
    """Test individual tools deterministically."""

    def test_safe_calculate_basic(self):
        result = json.loads(safe_calculate("2 + 3"))
        self.assertEqual(result["status"], "success")
        self.assertEqual(result["result"], 5)

    def test_safe_calculate_rejects_imports(self):
        """Ensure the calculator can't execute arbitrary code."""
        result = json.loads(safe_calculate("__import__('os').system('rm -rf /')"))
        self.assertEqual(result["status"], "error")

    def test_safe_calculate_division_by_zero(self):
        result = json.loads(safe_calculate("1 / 0"))
        self.assertEqual(result["status"], "error")


class TestAgentLoop(unittest.TestCase):
    """Test the agent loop with mocked LLM responses."""

    @patch("openai.OpenAI")
    def test_agent_respects_max_iterations(self, mock_client):
        """Agent should stop after max_iterations even if the model keeps requesting tools."""
        # Mock the LLM to always request a tool call.
        # `name` is a reserved MagicMock constructor kwarg, so it has to be
        # assigned as an attribute after creation.
        tool_call = MagicMock(id="call_1")
        tool_call.function.name = "calculate"
        tool_call.function.arguments = '{"expression": "1+1"}'

        mock_response = MagicMock()
        mock_response.choices[0].message.tool_calls = [tool_call]
        mock_response.choices[0].message.content = None
        mock_response.usage.total_tokens = 100
        mock_response.usage.prompt_tokens = 80
        mock_response.usage.completion_tokens = 20

        mock_client.return_value.chat.completions.create.return_value = mock_response

        agent = ProductionAgent(
            system_prompt="test",
            tools=TOOL_DEFINITIONS,
            tool_registry=TOOL_REGISTRY,
            max_iterations=3
        )
        agent.client = mock_client.return_value

        result = agent.run("test query")
        self.assertEqual(result["iterations"], 3)
        self.assertIn("Max iterations", result["answer"])


class TestAgentEndToEnd(unittest.TestCase):
    """
    Integration tests against the real API.
    These cost money — run sparingly; remove the skip decorator to enable.
    """

    @unittest.skip("Costs money — run manually")
    def test_agent_can_calculate(self):
        agent = ProductionAgent(
            system_prompt="You are a helpful assistant.",
            tools=TOOL_DEFINITIONS,
            tool_registry=TOOL_REGISTRY
        )
        result = agent.run("What is 123 * 456?")
        self.assertIn("56088", result["answer"])
        self.assertGreater(result["iterations"], 0)

When NOT to Use Agents

Agents add complexity, cost, and unpredictability. Use them only when you genuinely need dynamic tool selection. Here’s a decision framework:

Use a direct API call when:

  • You know exactly which tool to call and in what order
  • The task is a single step: summarize this text, translate this sentence

Use a chain (fixed sequence) when:

  • The steps are always the same: retrieve → generate → format
  • A RAG pipeline is a chain, not an agent

Use an agent when:

  • The user’s intent determines which tools to use
  • The number of steps varies per query
  • The model needs to react to intermediate results
  • You genuinely don’t know the execution path at design time

# This does NOT need an agent — it's always the same steps:
def answer_question(question: str) -> str:
    docs = retrieve(question)       # Always step 1
    answer = generate(question, docs)  # Always step 2
    return answer

# This DOES need an agent — the steps depend on the question:
# "What's the weather?" → 1 tool call
# "Compare weather in NYC and London, convert to Celsius" → 3 tool calls
# "Find today's weather and calculate heating costs" → 2+ tool calls

The rule of thumb: if you can draw the flow as a straight line, you don’t need an agent. If the flow is a tree that depends on runtime decisions, you do.

Key Takeaways

  1. An agent is LLM + tools + a loop. The model decides which tool to call, you execute it, return the result, and repeat. That’s the entire architecture.

  2. Tool descriptions are your most important design decision. The model relies on descriptions to choose tools. Vague descriptions lead to wrong tool selection. Be specific about what each tool does and when to use it.

  3. Always set iteration limits. Without a max_iterations guard, agents can loop forever. Start with 10 iterations and adjust based on your use case.

  4. Validate all tool inputs. The model generates tool arguments — treat them like untrusted user input. Validate types, check lengths, block dangerous patterns.

  5. Return structured errors from tools. When a tool fails, return a JSON error with a suggestion. This lets the model recover gracefully instead of hallucinating.

  6. The ReAct pattern gives you auditability. Forcing the model to reason before acting creates a decision trace you can log, review, and debug.

  7. Human-in-the-loop is not optional for dangerous actions. Any tool that modifies data, sends communications, or costs money should require human approval.

  8. Log every agent run end-to-end. Track run IDs, tool calls, token counts, costs, and errors. You will need this data when debugging production issues.

  9. Test tools deterministically, test agents with mocks. Unit test each tool in isolation. Mock LLM responses to test the agent loop logic. Run integration tests sparingly against real APIs.

  10. Most tasks don’t need agents. If the execution path is always the same, use a chain. Agents are for tasks where the model genuinely needs to decide what to do next based on runtime information.