You are going to build production systems on top of LLMs. Before you do, you need to understand what is actually happening inside the box. Not the research paper version. Not the “it’s just statistics” dismissal. The engineering version — the one that helps you debug, optimize, and make informed architectural decisions.
This lesson walks through the transformer pipeline from input to output. By the end, you will understand why context windows exist, why tokens cost money, why models hallucinate, and why prompt order matters.
1. The Transformer Architecture — A Programmer’s View
Every modern LLM — GPT-4o, Claude, Llama, Mistral, Gemini — is built on the transformer architecture. Published in 2017 in the paper “Attention Is All You Need,” it replaced earlier approaches (RNNs, LSTMs) because it could be massively parallelized during training.
Here is the pipeline at a high level:
Input text
→ Tokenization (split text into tokens)
→ Embedding (convert tokens to vectors)
→ Positional encoding (add position information)
→ N transformer layers (self-attention + feed-forward)
→ Output projection (vectors → probability over vocabulary)
→ Sampling (pick the next token)
→ Repeat until done

Think of it as a pipeline of transformations. Text goes in one end, a probability distribution over the next token comes out the other. Each stage has engineering implications that matter to you.
What you are paying for
When you call an LLM API, you are running an inference pass through this pipeline. Each generated token requires a full forward pass through all transformer layers. A model with 70 billion parameters needs to multiply your input through 70 billion weights — per token. That is why:
- Larger models cost more per token
- Output tokens cost more than input tokens (each output token triggers a new forward pass)
- Latency scales with model size and output length
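To make the cost relationship concrete, here is a back-of-the-envelope estimate. The per-million-token prices below are illustrative placeholders, not current rates; check your provider's pricing page:

def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimate the cost of a single request in dollars."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# A 2,000-token prompt with a 500-token answer at $2.50 / $10.00 per 1M tokens
print(f"${estimate_cost(2_000, 500, 2.50, 10.00):.4f}")  # -> $0.0100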
2. Tokenization — Where Text Becomes Numbers
Models do not see text. They see integers. Tokenization is the first step — converting raw text into a sequence of token IDs from a fixed vocabulary.
Tokenization in practice with tiktoken
import tiktoken
# Get the tokenizer for a specific model
enc = tiktoken.encoding_for_model("gpt-4o")
text = "The transformer architecture revolutionized NLP."
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}")
print("\nDecoded individually:")
# Decode each token individually to see the splits
for token_id in tokens:
print(f" {token_id:>6} -> '{enc.decode([token_id])}'")Output:
Text: The transformer architecture revolutionized NLP.
Token IDs: [976, 43578, 18112, 14110, 1463, 49855, 13]
Token count: 7
Decoded individually:
976 -> 'The'
43578 -> ' transformer'
18112 -> ' architecture'
14110 -> ' revolution'
1463 -> 'ized'
49855 -> ' NLP'
    13 -> '.'

Notice that “revolutionized” gets split into two tokens: “revolution” + “ized”. Common words are single tokens. Rare or long words get split into subword pieces. This is called Byte Pair Encoding (BPE).
Tokenization quirks that affect production code
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
# Different languages have very different token efficiencies
examples = {
"English": "Hello, how are you doing today?",
"Spanish": "Hola, como estas hoy?",
"Japanese": "こんにちは、今日はお元気ですか?",
"Code": "def calculate_sum(numbers: list[int]) -> int:",
"JSON": '{"name": "Alice", "age": 30, "active": true}',
}
for lang, text in examples.items():
tokens = enc.encode(text)
ratio = len(text) / len(tokens)
print(f"{lang:>10}: {len(tokens):>3} tokens | "
f"{len(text):>3} chars | "
f"{ratio:.1f} chars/token")Output:
English: 8 tokens | 31 chars | 3.9 chars/token
Spanish: 7 tokens | 22 chars | 3.1 chars/token
Japanese: 11 tokens | 16 chars | 1.5 chars/token
Code: 12 tokens | 46 chars | 3.8 chars/token
      JSON: 15 tokens |  43 chars | 2.9 chars/token

Production implications:
| Observation | Why it matters |
|---|---|
| Non-English text uses more tokens per character | Non-English users cost more per request |
| JSON is token-expensive (brackets, quotes, colons) | Structured output formats inflate your bill |
| Code is relatively efficient | Code generation tasks have reasonable cost profiles |
| Whitespace and formatting consume tokens | Minifying prompts can save 5-15% on tokens |
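As a rough illustration of the last two rows, here is a sketch comparing a pretty-printed JSON payload to a minified one. The payload is hypothetical; exact savings depend on your data:

import json
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

# Hypothetical payload that would be embedded in a prompt
payload = {"user": "Alice", "items": [1, 2, 3], "active": True}

pretty = json.dumps(payload, indent=2)                  # readable, but whitespace costs tokens
minified = json.dumps(payload, separators=(",", ":"))   # no spaces after ',' and ':'

for label, text in [("pretty", pretty), ("minified", minified)]:
    print(f"{label:>9}: {len(enc.encode(text))} tokens")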
Counting tokens before sending requests
Always count tokens before making API calls. This prevents context window errors and lets you estimate costs:
import tiktoken
def count_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
"""Count tokens in a chat messages array."""
enc = tiktoken.encoding_for_model(model)
total = 0
for message in messages:
total += 4 # message framing overhead
for key, value in message.items():
total += len(enc.encode(value))
total += 2 # reply priming
return total
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain transformers in one paragraph."},
]
token_count = count_tokens(messages)
print(f"Prompt tokens: {token_count}")
print(f"Estimated cost at GPT-4o rates: ${token_count * 2.50 / 1_000_000:.6f}")3. Embeddings — What Vectors Actually Represent
After tokenization, each token ID gets converted to an embedding vector — a dense array of floating-point numbers (typically 4096 to 12288 dimensions, depending on the model).
Token "cat" -> [0.12, -0.45, 0.78, ..., 0.33] (4096 numbers)
Token "dog" -> [0.14, -0.41, 0.75, ..., 0.29] (4096 numbers)
Token "lamp" -> [-0.82, 0.11, -0.23, ..., 0.67] (4096 numbers)The key insight: similar words end up near each other in this vector space. “cat” and “dog” have similar vectors because they appear in similar contexts during training. “lamp” is far away from both.
This is not hand-coded. The model learns these representations during training by adjusting billions of weights to minimize prediction error.
Why embeddings matter for production
Embeddings are not just an internal detail. They are the foundation of:
- Semantic search — Find documents similar to a query by comparing embedding vectors
- RAG (Retrieval-Augmented Generation) — Retrieve relevant context before prompting
- Classification — Cluster or classify text based on embedding similarity
- Deduplication — Find near-duplicate content without exact string matching
from openai import OpenAI
import numpy as np
client = OpenAI()
def get_embedding(text: str) -> list[float]:
"""Get the embedding vector for a text string."""
response = client.embeddings.create(
model="text-embedding-3-small",
input=text,
)
return response.data[0].embedding
def cosine_similarity(a: list[float], b: list[float]) -> float:
"""Compute cosine similarity between two vectors."""
a, b = np.array(a), np.array(b)
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Compare semantic similarity
pairs = [
("The cat sat on the mat", "A feline rested on the rug"),
("The cat sat on the mat", "Stock prices rose sharply today"),
("Python is a programming language", "Python is a type of snake"),
]
for text_a, text_b in pairs:
emb_a = get_embedding(text_a)
emb_b = get_embedding(text_b)
sim = cosine_similarity(emb_a, emb_b)
print(f"Similarity: {sim:.3f}")
print(f" A: {text_a}")
print(f" B: {text_b}\n")Output:
Similarity: 0.847
A: The cat sat on the mat
B: A feline rested on the rug
Similarity: 0.112
A: The cat sat on the mat
B: Stock prices rose sharply today
Similarity: 0.634
A: Python is a programming language
  B: Python is a type of snake

Positional encoding
Embeddings alone do not carry position information. The model needs to know that “the cat ate the fish” is different from “the fish ate the cat.” Positional encodings are added to the embedding vectors to encode where each token sits in the sequence.
Modern models use Rotary Position Embeddings (RoPE), which encode relative positions and can be extended to handle longer contexts than the model was originally trained on.
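For intuition, here is a minimal sketch of the original sinusoidal positional encoding from “Attention Is All You Need.” RoPE works differently (it rotates query and key vectors by position-dependent angles), but the goal is the same: make token position visible to attention.

import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Classic sinusoidal positional encodings: one d_model-dim vector per position."""
    positions = np.arange(seq_len)[:, None]                # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # even dimension indices
    angles = positions / np.power(10_000, dims / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

# These vectors are added to the token embeddings before the first layer
pe = sinusoidal_positions(seq_len=8, d_model=16)
print(pe.shape)  # (8, 16)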
4. Self-Attention — The Core Mechanism
Self-attention is what makes transformers work. It allows every token to “look at” every other token in the sequence and decide how much to pay attention to each one.
The intuition
Consider this sentence: “The animal didn’t cross the street because it was too tired.”
What does “it” refer to? “The animal.” You know this because you understand the causal relationship. Self-attention is the mechanism that lets the model figure this out — by computing a relevance score between “it” and every other token.
How it works (simplified)
For each token, the model computes three vectors from the embedding:
- Query (Q): “What am I looking for?”
- Key (K): “What do I contain?”
- Value (V): “What information do I provide?”
The attention score between two tokens is the dot product of the Query of one and the Key of the other. High score means “pay attention to this token.”
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

You do not need to implement this (a minimal sketch follows the table below), but understanding the mechanism explains several production behaviors:
| Behavior | Explanation via attention |
|---|---|
| Models “forget” information in long prompts | Attention scores get diluted across more tokens |
| Instruction at the start of a prompt works differently than at the end | Positional encoding affects attention patterns |
| The “lost in the middle” problem | Attention tends to be strongest at the beginning and end of the context |
| Models can follow “focus on X” instructions | You are biasing which tokens get high attention scores |
| Repetition in output | The model attends to its own recent output and reinforces patterns |
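Here is that minimal single-head sketch in NumPy. The random matrices stand in for learned weights; this illustrates the formula, not how production attention kernels are written:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                              # each token's context-mixed vector

rng = np.random.default_rng(0)
seq_len, d_k = 5, 8                                 # 5 tokens, 8-dim head
x = rng.normal(size=(seq_len, d_k))                 # stand-in for token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_k, d_k)) for _ in range(3))  # learned during training
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # (5, 8)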
Multi-head attention
The model does not compute attention once. It computes it multiple times in parallel (“heads”), each head potentially focusing on different relationships. One head might track syntactic relationships (subject-verb), another might track semantic relationships (pronoun-antecedent), another might track positional patterns.
A model like GPT-4 might have 96 attention heads per layer, across 120 layers. That is 11,520 separate attention computations per forward pass.
The quadratic cost problem
Here is the engineering punchline: attention is computed between every pair of tokens. For a sequence of length N, that is N x N comparisons. Double the context length, and you quadruple the computation.
| Context length | Attention computations | Relative cost |
|---|---|---|
| 1,000 tokens | 1,000,000 | 1x |
| 4,000 tokens | 16,000,000 | 16x |
| 32,000 tokens | 1,024,000,000 | 1,024x |
| 128,000 tokens | 16,384,000,000 | 16,384x |
This is why context windows have limits. It is not a software limitation — it is a fundamental computational cost. Models that advertise very large context windows (Gemini’s 2M tokens) use optimizations like sparse attention, sliding window attention, or linear attention approximations to manage this cost.
Production implication: Sending 128K tokens when you only need 4K is not just wasteful — it is dramatically more expensive and slower.
5. The Forward Pass — Input to Output
Let us trace a complete forward pass for generating one token:
1. Tokenize input: "What is the capital of" -> [3923, 374, 279, 6864, 315]
2. Embed tokens: Each ID -> 4096-dimensional vector
3. Add positional encoding: Inject position information
4. Pass through N transformer layers:
Layer 1: Self-attention -> Feed-forward -> Normalize
Layer 2: Self-attention -> Feed-forward -> Normalize
...
Layer N: Self-attention -> Feed-forward -> Normalize
5. Project final hidden state to vocabulary size:
4096-dim vector -> 100,000-dim vector (one value per vocab token)
6. Apply softmax to get probabilities:
" France": 0.82
" Paris": 0.07
" the": 0.03
" China": 0.01
...
7. Sample from distribution (controlled by temperature):
Selected token: " France"
8. Append to input, repeat from step 1:
"What is the capital of France" -> predict next tokenEach transformer layer has two sub-components:
- Self-attention: Each token attends to all others, rewriting its representation based on context
- Feed-forward network: A position-wise neural network that processes each token independently, adding capacity for pattern matching
The feed-forward layers are where most of the model’s “knowledge” is stored — factual associations, grammar rules, reasoning patterns. The attention layers are the routing mechanism that decides which knowledge to apply.
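Pulling steps 1-8 together, the generation loop looks roughly like this. Note that model_forward and the tokenizer interface here are hypothetical stand-ins for the pipeline above, not a real API:

import numpy as np

def generate(model_forward, tokenizer, prompt: str, max_new_tokens: int = 50) -> str:
    """Greedy autoregressive decoding: one full forward pass per generated token."""
    token_ids = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        probs = model_forward(token_ids)       # steps 2-6: embed, attend, project, softmax
        next_id = int(np.argmax(probs))        # step 7 with temperature=0 (greedy)
        token_ids.append(next_id)              # step 8: append and repeat
        if next_id == tokenizer.eos_token_id:  # stop when the model emits end-of-sequence
            break
    return tokenizer.decode(token_ids)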
6. Training vs. Inference — What You Are Paying For
Understanding the distinction between training and inference clarifies what LLM APIs actually provide.
Training (you never do this)
Training is the process of adjusting the model’s billions of parameters by showing it vast amounts of text. The training process:
- Pre-training: Feed the model trillions of tokens from books, websites, code. The model learns to predict the next token. This costs tens of millions of dollars in compute.
- Fine-tuning (SFT): Train on curated instruction-response pairs to make the model follow instructions.
- RLHF/RLAIF: Use human (or AI) feedback to align the model’s behavior with desired outputs — being helpful, harmless, and honest.
Pre-training cost estimates:
GPT-4: ~$100M+ in compute
Llama 3.1 405B: ~$30M in compute
Mistral Large:  ~$10M (estimated)

You do not pay for training when you use an API. The provider amortizes training costs across all its customers.
Inference (this is what you pay for)
Inference is the forward pass — running input through the trained model to get output. Every API call is an inference request.
# This single call runs inference
# Cost: input_tokens * input_price + output_tokens * output_price
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}],
)

KV Cache — why prompt tokens are cheaper than output tokens
During inference, the model uses a Key-Value (KV) cache. When processing the input prompt, the model computes attention Key and Value vectors for all input tokens and caches them. When generating output tokens, it only needs to compute Q, K, V for the new token and look up the cached K, V for all previous tokens.
This is why:
- Input tokens are cheaper: They are processed in parallel, in one batch
- Output tokens are more expensive: Each one requires a sequential forward pass
- Prompt caching saves money: If your prompt starts the same way every time, providers can reuse the cached KV pairs
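A toy sketch of the mechanism, with a single head and hypothetical projection matrices: K and V for earlier tokens are computed once and cached, so each decoding step only adds one new row:

import numpy as np

class KVCache:
    """Toy single-head KV cache: append K/V for each new token instead of recomputing them all."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x_new, W_q, W_k, W_v):
        # Only the NEW token's projections are computed on this step
        q = x_new @ W_q
        self.keys.append(x_new @ W_k)
        self.values.append(x_new @ W_v)
        K = np.stack(self.keys)            # cached keys for every token so far
        V = np.stack(self.values)          # cached values for every token so far
        scores = K @ q / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()           # softmax over all cached positions
        return weights @ V                 # attention output for the new token

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
cache = KVCache()
for _ in range(4):                         # four decoding steps, one new token each
    out = cache.step(rng.normal(size=d), W_q, W_k, W_v)
print(out.shape)  # (8,)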
7. Temperature, top_p, and Sampling
After the forward pass produces a probability distribution over the vocabulary, the model needs to pick one token. This is sampling, and the parameters you control determine how it works.
Temperature
Temperature scales the logits (raw scores) before applying softmax. Lower temperature makes the distribution sharper (peaky), higher makes it flatter (uniform).
import numpy as np
def apply_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
"""Apply temperature scaling to logits."""
if temperature == 0:
# Greedy: always pick the highest
result = np.zeros_like(logits)
result[np.argmax(logits)] = 1.0
return result
scaled = logits / temperature
exp = np.exp(scaled - np.max(scaled)) # Subtract max for numerical stability
return exp / exp.sum()
# Example logits for tokens: ["Paris", "France", "London", "Berlin"]
logits = np.array([5.0, 3.0, 1.5, 0.5])
for temp in [0.0, 0.3, 0.7, 1.0, 1.5]:
probs = apply_temperature(logits, temp)
print(f"temp={temp:.1f}: Paris={probs[0]:.3f} France={probs[1]:.3f} "
f"London={probs[2]:.3f} Berlin={probs[3]:.3f}")Output:
temp=0.0: Paris=1.000 France=0.000 London=0.000 Berlin=0.000
temp=0.3: Paris=0.999 France=0.001 London=0.000 Berlin=0.000
temp=0.7: Paris=0.938 France=0.054 London=0.006 Berlin=0.002
temp=1.0: Paris=0.850 France=0.115 London=0.026 Berlin=0.009
temp=1.5: Paris=0.709 France=0.187 London=0.069 Berlin=0.035

top_p (nucleus sampling)
Instead of considering all possible tokens, top_p keeps only the smallest set of tokens whose cumulative probability exceeds the threshold p. This cuts off the long tail of unlikely tokens.
def apply_top_p(probs: np.ndarray, top_p: float) -> np.ndarray:
"""Apply nucleus sampling — keep tokens until cumulative prob >= top_p."""
sorted_indices = np.argsort(probs)[::-1]
sorted_probs = probs[sorted_indices]
cumulative = np.cumsum(sorted_probs)
# Find cutoff
cutoff_idx = np.searchsorted(cumulative, top_p) + 1
mask = np.zeros_like(probs)
mask[sorted_indices[:cutoff_idx]] = 1
filtered = probs * mask
return filtered / filtered.sum() # Renormalize
# With top_p=0.9, we keep tokens until we hit 90% cumulative probability
probs = np.array([0.72, 0.10, 0.08, 0.05, 0.03, 0.02])
filtered = apply_top_p(probs, top_p=0.9)
print(f"Original: {probs}")
print(f"After top_p=0.9: {np.round(filtered, 3)}")When to use which
| Setting | Use case |
|---|---|
| temperature=0 | Classification, data extraction, structured output, any task where you need consistency |
| temperature=0.3, top_p=0.9 | Code generation, technical writing — mostly deterministic with slight variation |
| temperature=0.7, top_p=0.95 | General conversation, summarization — natural sounding with variety |
| temperature=1.0, top_p=1.0 | Creative writing, brainstorming — maximum diversity |
Do not set both temperature and top_p to non-default values at the same time. Pick one strategy. OpenAI’s documentation explicitly recommends altering one or the other, not both.
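In practice that looks like this (parameter names as in the OpenAI Chat Completions API; pick the temperature for the task and leave top_p at its default):

from openai import OpenAI

client = OpenAI()

# Deterministic task (extraction, classification): pin temperature, leave top_p alone
extraction = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[{"role": "user", "content": "Extract the date from: 'Invoice issued March 3.'"}],
)

# Open-ended task: a moderate temperature instead
brainstorm = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.7,
    messages=[{"role": "user", "content": "Suggest five names for a note-taking app."}],
)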
8. Why Models Hallucinate
Hallucination is not a bug in the software. It is a fundamental property of how these models work. Understanding why helps you design systems that mitigate it.
The core problem: no grounding mechanism
The model generates the most probable next token given the context. It has no way to verify whether its output is factually correct. There is no internal fact-checking step. There is no database lookup. There is only pattern completion.
Prompt: "The CEO of Google in 2024 is"
Model's process:
- "Sundar" has high probability (correct, based on training data)
- But so does "Satya" if context is ambiguous
- The model picks based on learned patterns, not a verified source

When hallucination is most likely
| Scenario | Risk level | Why |
|---|---|---|
| Obscure facts or recent events | High | Less training data to form strong patterns |
| Specific numbers, dates, URLs | High | Models are not precise retrieval systems |
| Multi-step reasoning | Medium-High | Errors compound across steps |
| Generating code for well-known libraries | Low | Massive training data with consistent patterns |
| Summarizing provided text | Low | The answer is in the context |
Production mitigation strategies
# Strategy 1: Ground responses in provided context (RAG)
messages = [
{"role": "system", "content": (
"Answer ONLY based on the provided context. "
"If the context doesn't contain the answer, say 'I don't know.' "
"Never make up information."
)},
{"role": "user", "content": f"Context:\n{retrieved_docs}\n\nQuestion: {question}"},
]
# Strategy 2: Ask the model to quote its sources
messages = [
{"role": "system", "content": (
"When answering, cite the specific passage from the provided "
"documents that supports your answer. Format: [Source: doc_name, paragraph X]"
)},
{"role": "user", "content": f"Documents:\n{docs}\n\nQuestion: {question}"},
]
# Strategy 3: Use structured output to constrain responses
messages = [
{"role": "system", "content": (
"Extract entities from the text. Return JSON with only the fields: "
"name, date, location. Use null for any field not found in the text. "
"Do NOT infer or guess missing values."
)},
{"role": "user", "content": f"Text: {text}"},
]

9. Why Prompt Order and Structure Matter
The transformer’s attention mechanism creates measurable biases based on position. This has direct implications for how you structure prompts.
The “lost in the middle” effect
Research has shown that LLMs pay the most attention to information at the beginning and end of the context window. Information buried in the middle gets less attention — literally lower attention scores.
Attention strength across position:
[HIGH] ... [decreasing] ... [LOW] ... [increasing] ... [HIGH]
 ^start                   ^middle                    ^end

Practical rule: Put your most important instructions and context at the beginning or end of the prompt. Never bury critical information in the middle of a long context.
Instruction positioning experiment
from openai import OpenAI
client = OpenAI()
# Same instruction, different position — different results
long_context = "... (imagine 50 paragraphs of text here) ..."
# Instruction at the START (better)
prompt_start = f"""Answer in exactly one sentence.
{long_context}
What is the main topic of the text above?"""
# Instruction at the END (also good)
prompt_end = f"""{long_context}
Based on the text above, what is the main topic?
Answer in exactly one sentence."""
# Instruction buried in the MIDDLE (worse)
half = len(long_context) // 2
prompt_middle = f"""{long_context[:half]}
Remember: answer in exactly one sentence.
{long_context[half:]}
What is the main topic?"""Recency bias in generation
The model’s output is influenced by what it just generated. This creates a recency bias — if the model starts going down a wrong path, it tends to continue in that direction because its own recent tokens have high attention scores.
This is why:
- Few-shot examples work: The model attends to the pattern in your examples and continues it (see the sketch after this list)
- Chain-of-thought helps: Breaking reasoning into steps gives the model correct intermediate tokens to attend to
- Bad first tokens can derail everything: If the model starts with “I’m not sure, but…” it tends to generate a less confident (and often less accurate) response
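A minimal sketch of that few-shot pattern in a messages array (the reviews and labels are made up for illustration):

# Few-shot prompting: the model attends to the example pattern and continues it
messages = [
    {"role": "system", "content": "Classify the sentiment of each review as positive or negative."},
    {"role": "user", "content": "Review: 'Battery died after two days.'"},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "Review: 'Setup took thirty seconds, works perfectly.'"},
    {"role": "assistant", "content": "positive"},
    # The real input goes last; the model continues the established pattern
    {"role": "user", "content": "Review: 'The screen scratches if you look at it wrong.'"},
]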
10. Practical Implications Summary
Here is the mental model you should carry into every engineering decision:
The cost model
# Your LLM costs are driven by this formula
cost = (input_tokens * input_price + output_tokens * output_price) * num_requests
# To reduce costs:
# 1. Reduce input_tokens (shorter prompts, less context)
# 2. Reduce output_tokens (set max_tokens, use structured output)
# 3. Use cheaper models (GPT-4o mini instead of GPT-4o)
# 4. Reduce num_requests (cache responses, batch similar queries)

The quality model
| Factor | Effect on quality |
|---|---|
| Clearer instructions | Better output |
| Relevant context in prompt | Reduces hallucination |
| temperature=0 | More consistent, less creative |
| Longer context | Diminishing returns, “lost in middle” |
| Better model | Generally better but 5x+ cost |
| Few-shot examples | Significantly better for format/style |
The latency model
Total latency = time_to_first_token + (output_tokens * time_per_token)
time_to_first_token: depends on input length and model size
time_per_token: roughly constant for a given model (~20-50ms for GPT-4o)
Example:
Input: 2000 tokens, Output: 500 tokens
TTFT: ~500ms
Generation: 500 * 30ms = 15s
  Total: ~15.5s (without streaming, user waits this entire time)
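Streaming does not shorten the total generation time, but it changes perceived latency: the user starts reading after roughly the time to first token instead of waiting for the full response. A minimal sketch with the OpenAI SDK:

from openai import OpenAI

client = OpenAI()

# Stream tokens as they are generated instead of waiting for the full response
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain transformers in one paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # text appears after ~TTFT, not after ~15s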
Quick reference: model parameter counts

| Model | Parameters | Layers | Hidden size | Attention heads |
|---|---|---|---|---|
| GPT-4o | ~200B (estimated) | ~120 | ~12,288 | ~96 |
| Claude 3.5 Sonnet | Undisclosed | - | - | - |
| Llama 3.1 70B | 70B | 80 | 8,192 | 64 |
| Llama 3.1 8B | 8B | 32 | 4,096 | 32 |
| Mistral 7B | 7B | 32 | 4,096 | 32 |
More parameters generally means better quality but higher cost and latency. The skill of LLM engineering is finding the smallest model that meets your quality bar.
Key Takeaways
- LLMs are next-token predictors built on the transformer architecture. Text goes in, gets tokenized, embedded, processed through attention layers, and a probability distribution over the next token comes out. Generation is an autoregressive loop — one token at a time.
- Tokenization determines your costs and limits. Everything is measured in tokens, not words. Different languages and formats have different token efficiencies. Always count tokens before making API calls.
- Embeddings encode meaning as vectors. Similar concepts end up near each other in vector space. This property powers semantic search, RAG, and classification — all tools you will use in production.
- Self-attention is the core mechanism. Every token attends to every other token, which is powerful but has quadratic cost. This is why context windows exist and why sending less context is almost always better.
- The KV cache makes input tokens cheaper than output tokens. Input is processed in parallel; output is generated sequentially. This directly affects API pricing and latency.
- Temperature and top_p control sampling. Use temperature=0 for deterministic tasks, 0.7 for general use. Do not tweak both simultaneously.
- Hallucination is inherent, not a bug. The model has no fact-checking mechanism. Mitigate with grounded context (RAG), structured output constraints, and explicit “say I don’t know” instructions.
- Prompt position matters. Information at the beginning and end of the context gets the most attention. Critical instructions should never be buried in the middle.
- Your three optimization levers are cost, quality, and latency. Every architectural decision trades between them. Understanding the transformer pipeline helps you make those trades intelligently.