You are going to build production systems on top of LLMs. Before you do, you need to understand what is actually happening inside the box. Not the research paper version. Not the “it’s just statistics” dismissal. The engineering version — the one that helps you debug, optimize, and make informed architectural decisions.
This lesson walks through the transformer pipeline from input to output. By the end, you will understand why context windows exist, why tokens cost money, why models hallucinate, and why prompt order matters.
1. The Transformer Architecture — A Programmer’s View
Every modern LLM — GPT-4o, Claude, Llama, Mistral, Gemini — is built on the transformer architecture. Published in 2017 in the paper “Attention Is All You Need,” it replaced earlier approaches (RNNs, LSTMs) because it could be massively parallelized during training.
Here is the pipeline at a high level:
Input text
→ Tokenization (split text into tokens)
→ Embedding (convert tokens to vectors)
→ Positional encoding (add position information)
→ N transformer layers (self-attention + feed-forward)
→ Output projection (vectors → probability over vocabulary)
→ Sampling (pick the next token)
→ Repeat until done

Think of it as a pipeline of transformations. Text goes in one end, a probability distribution over the next token comes out the other. Each stage has engineering implications that matter to you.
What you are paying for
When you call an LLM API, you are running an inference pass through this pipeline. Each generated token requires a full forward pass through all transformer layers. A model with 70 billion parameters needs to multiply your input through 70 billion weights — per token. That is why:
- Larger models cost more per token
- Output tokens cost more than input tokens (each output token triggers a new forward pass)
- Latency scales with model size and output length
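To make the cost relationship concrete, here is a back-of-the-envelope estimate. The per-million-token prices below are illustrative placeholders, not current rates; check your provider's pricing page:

def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimate the cost of a single request in dollars."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# A 2,000-token prompt with a 500-token answer at $2.50 / $10.00 per 1M tokens
print(f"${estimate_cost(2_000, 500, 2.50, 10.00):.4f}")  # -> $0.0100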
2. Tokenization — Where Text Becomes Numbers
Models do not see text. They see integers. Tokenization is the first step — converting raw text into a sequence of token IDs from a fixed vocabulary.
Tokenization in practice with tiktoken
import tiktoken
# Get the tokenizer for a specific model
enc = tiktoken.encoding_for_model("gpt-4o")
text = "The transformer architecture revolutionized NLP."
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}")
print("\nDecoded individually:")
# Decode each token individually to see the splits
for token_id in tokens:
print(f" {token_id:>6} -> '{enc.decode([token_id])}'")Output:
Text: The transformer architecture revolutionized NLP.
Token IDs: [976, 43578, 18112, 14110, 1463, 49855, 13]
Token count: 7
Decoded individually:
976 -> 'The'
43578 -> ' transformer'
18112 -> ' architecture'
14110 -> ' revolution'
1463 -> 'ized'
49855 -> ' NLP'
    13 -> '.'

Notice that “revolutionized” gets split into two tokens: “revolution” + “ized”. Common words are single tokens. Rare or long words get split into subword pieces. This is called Byte Pair Encoding (BPE).
Tokenization quirks that affect production code
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
# Different languages have very different token efficiencies
examples = {
"English": "Hello, how are you doing today?",
"Spanish": "Hola, como estas hoy?",
"Japanese": "こんにちは、今日はお元気ですか?",
"Code": "def calculate_sum(numbers: list[int]) -> int:",
"JSON": '{"name": "Alice", "age": 30, "active": true}',
}
for lang, text in examples.items():
tokens = enc.encode(text)
ratio = len(text) / len(tokens)
print(f"{lang:>10}: {len(tokens):>3} tokens | "
f"{len(text):>3} chars | "
f"{ratio:.1f} chars/token")Output:
English: 8 tokens | 31 chars | 3.9 chars/token
Spanish: 7 tokens | 22 chars | 3.1 chars/token
Japanese: 11 tokens | 16 chars | 1.5 chars/token
Code: 12 tokens | 46 chars | 3.8 chars/token
      JSON: 15 tokens |  43 chars | 2.9 chars/token

Production implications:
| Observation | Why it matters |
|---|---|
| Non-English text uses more tokens per character | Non-English users cost more per request |
| JSON is token-expensive (brackets, quotes, colons) | Structured output formats inflate your bill |
| Code is relatively efficient | Code generation tasks have reasonable cost profiles |
| Whitespace and formatting consume tokens | Minifying prompts can save 5-15% on tokens |
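As a rough illustration of the last two rows, here is a sketch comparing a pretty-printed JSON payload to a minified one. The payload is hypothetical; exact savings depend on your data:

import json
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

# Hypothetical payload that would be embedded in a prompt
payload = {"user": "Alice", "items": [1, 2, 3], "active": True}

pretty = json.dumps(payload, indent=2)                  # readable, but whitespace costs tokens
minified = json.dumps(payload, separators=(",", ":"))   # no spaces after ',' and ':'

for label, text in [("pretty", pretty), ("minified", minified)]:
    print(f"{label:>9}: {len(enc.encode(text))} tokens")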
Counting tokens before sending requests
Always count tokens before making API calls. This prevents context window errors and lets you estimate costs:
import tiktoken
def count_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
"""Count tokens in a chat messages array."""
enc = tiktoken.encoding_for_model(model)
total = 0
for message in messages:
total += 4 # message framing overhead
for key, value in message.items():
total += len(enc.encode(value))
total += 2 # reply priming
return total
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain transformers in one paragraph."},
]
token_count = count_tokens(messages)
print(f"Prompt tokens: {token_count}")
print(f"Estimated cost at GPT-4o rates: ${token_count * 2.50 / 1_000_000:.6f}")3. Embeddings — What Vectors Actually Represent
After tokenization, each token ID gets converted to an embedding vector — a dense array of floating-point numbers (typically 4096 to 12288 dimensions, depending on the model).
Token "cat" -> [0.12, -0.45, 0.78, ..., 0.33] (4096 numbers)
Token "dog" -> [0.14, -0.41, 0.75, ..., 0.29] (4096 numbers)
Token "lamp" -> [-0.82, 0.11, -0.23, ..., 0.67] (4096 numbers)The key insight: similar words end up near each other in this vector space. “cat” and “dog” have similar vectors because they appear in similar contexts during training. “lamp” is far away from both.
This is not hand-coded. The model learns these representations during training by adjusting billions of weights to minimize prediction error.
Why embeddings matter for production
Embeddings are not just an internal detail. They are the foundation of:
- Semantic search — Find documents similar to a query by comparing embedding vectors
- RAG (Retrieval-Augmented Generation) — Retrieve relevant context before prompting
- Classification — Cluster or classify text based on embedding similarity
- Deduplication — Find near-duplicate content without exact string matching
from openai import OpenAI
import numpy as np
client = OpenAI()
def get_embedding(text: str) -> list[float]:
"""Get the embedding vector for a text string."""
response = client.embeddings.create(
model="text-embedding-3-small",
input=text,
)
return response.data[0].embedding
def cosine_similarity(a: list[float], b: list[float]) -> float:
"""Compute cosine similarity between two vectors."""
a, b = np.array(a), np.array(b)
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Compare semantic similarity
pairs = [
("The cat sat on the mat", "A feline rested on the rug"),
("The cat sat on the mat", "Stock prices rose sharply today"),
("Python is a programming language", "Python is a type of snake"),
]
for text_a, text_b in pairs:
emb_a = get_embedding(text_a)
emb_b = get_embedding(text_b)
sim = cosine_similarity(emb_a, emb_b)
print(f"Similarity: {sim:.3f}")
print(f" A: {text_a}")
print(f" B: {text_b}\n")Output:
Similarity: 0.847
A: The cat sat on the mat
B: A feline rested on the rug
Similarity: 0.112
A: The cat sat on the mat
B: Stock prices rose sharply today
Similarity: 0.634
A: Python is a programming language
  B: Python is a type of snake

Positional encoding
Embeddings alone do not carry position information. The model needs to know that “the cat ate the fish” is different from “the fish ate the cat.” Positional encodings are added to the embedding vectors to encode where each token sits in the sequence.
Modern models use Rotary Position Embeddings (RoPE), which encode relative positions and can be extended to handle longer contexts than the model was originally trained on.
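For intuition, here is a minimal sketch of the original sinusoidal positional encoding from “Attention Is All You Need.” RoPE works differently (it rotates query and key vectors by position-dependent angles), but the goal is the same: make token position visible to attention.

import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Classic sinusoidal positional encodings: one d_model-dim vector per position."""
    positions = np.arange(seq_len)[:, None]                # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # even dimension indices
    angles = positions / np.power(10_000, dims / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

# These vectors are added to the token embeddings before the first layer
pe = sinusoidal_positions(seq_len=8, d_model=16)
print(pe.shape)  # (8, 16)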
4. Self-Attention — The Core Mechanism
Self-attention is what makes transformers work. It allows every token to “look at” every other token in the sequence and decide how much to pay attention to each one.
The intuition
Consider this sentence: “The animal didn’t cross the street because it was too tired.”
What does “it” refer to? “The animal.” You know this because you understand the causal relationship. Self-attention is the mechanism that lets the model figure this out — by computing a relevance score between “it” and every other token.
How it works (simplified)
For each token, the model computes three vectors from the embedding:
- Query (Q): “What am I looking for?”
- Key (K): “What do I contain?”
- Value (V): “What information do I provide?”
The attention score between two tokens is the dot product of the Query of one and the Key of the other. High score means “pay attention to this token.”
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

You do not need to implement this (a minimal sketch follows the table below), but understanding the mechanism explains several production behaviors:
| Behavior | Explanation via attention |
|---|---|
| Models “forget” information in long prompts | Attention scores get diluted across more tokens |
| Instruction at the start of a prompt works differently than at the end | Positional encoding affects attention patterns |
| The “lost in the middle” problem | Attention tends to be strongest at the beginning and end of the context |
| Models can follow “focus on X” instructions | You are biasing which tokens get high attention scores |
| Repetition in output | The model attends to its own recent output and reinforces patterns |
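Here is that minimal single-head sketch in NumPy. The random matrices stand in for learned weights; this illustrates the formula, not how production attention kernels are written:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                              # each token's context-mixed vector

rng = np.random.default_rng(0)
seq_len, d_k = 5, 8                                 # 5 tokens, 8-dim head
x = rng.normal(size=(seq_len, d_k))                 # stand-in for token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_k, d_k)) for _ in range(3))  # learned during training
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # (5, 8)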
Multi-head attention
The model does not compute attention once. It computes it multiple times in parallel (“heads”), each head potentially focusing on different relationships. One head might track syntactic relationships (subject-verb), another might track semantic relationships (pronoun-antecedent), another might track positional patterns.
A model like GPT-4 might have 96 attention heads per layer, across 120 layers. That is 11,520 separate attention computations per forward pass.
The quadratic cost problem
Here is the engineering punchline: attention is computed between every pair of tokens. For a sequence of length N, that is N x N comparisons. Double the context length, and you quadruple the computation.
| Context length | Attention computations | Relative cost |
|---|---|---|
| 1,000 tokens | 1,000,000 | 1x |
| 4,000 tokens | 16,000,000 | 16x |
| 32,000 tokens | 1,024,000,000 | 1,024x |
| 128,000 tokens | 16,384,000,000 | 16,384x |
This is why context windows have limits. It is not a software limitation — it is a fundamental computational cost. Models that advertise very large context windows (Gemini’s 2M tokens) use optimizations like sparse attention, sliding window attention, or linear attention approximations to manage this cost.
Production implication: Sending 128K tokens when you only need 4K is not just wasteful — it is dramatically more expensive and slower.
5. The Forward Pass — Input to Output
Let us trace a complete forward pass for generating one token:
1. Tokenize input: "What is the capital of" -> [3923, 374, 279, 6864, 315]
2. Embed tokens: Each ID -> 4096-dimensional vector
3. Add positional encoding: Inject position information
4. Pass through N transformer layers:
Layer 1: Self-attention -> Feed-forward -> Normalize
Layer 2: Self-attention -> Feed-forward -> Normalize
...
Layer N: Self-attention -> Feed-forward -> Normalize
5. Project final hidden state to vocabulary size:
4096-dim vector -> 100,000-dim vector (one value per vocab token)
6. Apply softmax to get probabilities:
" France": 0.82
" Paris": 0.07
" the": 0.03
" China": 0.01
...
7. Sample from distribution (controlled by temperature):
Selected token: " France"
8. Append to input, repeat from step 1:
"What is the capital of France" -> predict next tokenEach transformer layer has two sub-components:
- Self-attention: Each token attends to all others, rewriting its representation based on context
- Feed-forward network: A position-wise neural network that processes each token independently, adding capacity for pattern matching
The feed-forward layers are where most of the model’s “knowledge” is stored — factual associations, grammar rules, reasoning patterns. The attention layers are the routing mechanism that decides which knowledge to apply.
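Pulling steps 1-8 together, the generation loop looks roughly like this. Note that model_forward and the tokenizer interface here are hypothetical stand-ins for the pipeline above, not a real API:

import numpy as np

def generate(model_forward, tokenizer, prompt: str, max_new_tokens: int = 50) -> str:
    """Greedy autoregressive decoding: one full forward pass per generated token."""
    token_ids = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        probs = model_forward(token_ids)       # steps 2-6: embed, attend, project, softmax
        next_id = int(np.argmax(probs))        # step 7 with temperature=0 (greedy)
        token_ids.append(next_id)              # step 8: append and repeat
        if next_id == tokenizer.eos_token_id:  # stop when the model emits end-of-sequence
            break
    return tokenizer.decode(token_ids)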
6. Training vs. Inference — What You Are Paying For
Understanding the distinction between training and inference clarifies what LLM APIs actually provide.
Training (you never do this)
Training is the process of adjusting the model’s billions of parameters by showing it vast amounts of text. The training process:
- Pre-training: Feed the model trillions of tokens from books, websites, code. The model learns to predict the next token. This costs tens of millions of dollars in compute.
- Fine-tuning (SFT): Train on curated instruction-response pairs to make the model follow instructions.
- RLHF/RLAIF: Use human (or AI) feedback to align the model’s behavior with desired outputs — being helpful, harmless, and honest.
Pre-training cost estimates:
GPT-4: ~$100M+ in compute
Llama 3.1 405B: ~$30M in compute
Mistral Large:  ~$10M (estimated)

You do not pay for training when you use an API. The provider amortizes training costs across all its customers.
Inference (this is what you pay for)
Inference is the forward pass — running input through the trained model to get output. Every API call is an inference request.
# This single call runs inference
# Cost: input_tokens * input_price + output_tokens * output_price
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}],
)

KV Cache — why prompt tokens are cheaper than output tokens
During inference, the model uses a Key-Value (KV) cache. When processing the input prompt, the model computes attention Key and Value vectors for all input tokens and caches them. When generating output tokens, it only needs to compute Q, K, V for the new token and look up the cached K, V for all previous tokens.
This is why:
- Input tokens are cheaper: They are processed in parallel, in one batch
- Output tokens are more expensive: Each one requires a sequential forward pass
- Prompt caching saves money: If your prompt starts the same way every time, providers can reuse the cached KV pairs
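A toy sketch of the mechanism, with a single head and hypothetical projection matrices: K and V for earlier tokens are computed once and cached, so each decoding step only adds one new row:

import numpy as np

class KVCache:
    """Toy single-head KV cache: append K/V for each new token instead of recomputing them all."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x_new, W_q, W_k, W_v):
        # Only the NEW token's projections are computed on this step
        q = x_new @ W_q
        self.keys.append(x_new @ W_k)
        self.values.append(x_new @ W_v)
        K = np.stack(self.keys)            # cached keys for every token so far
        V = np.stack(self.values)          # cached values for every token so far
        scores = K @ q / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()           # softmax over all cached positions
        return weights @ V                 # attention output for the new token

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
cache = KVCache()
for _ in range(4):                         # four decoding steps, one new token each
    out = cache.step(rng.normal(size=d), W_q, W_k, W_v)
print(out.shape)  # (8,)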
7. Temperature, top_p, and Sampling
After the forward pass produces a probability distribution over the vocabulary, the model needs to pick one token. This is sampling, and the parameters you control determine how it works.
Temperature
Temperature scales the logits (raw scores) before applying softmax. Lower temperature makes the distribution sharper (peaky), higher makes it flatter (uniform).
import numpy as np
def apply_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
"""Apply temperature scaling to logits."""
if temperature == 0:
# Greedy: always pick the highest
result = np.zeros_like(logits)
result[np.argmax(logits)] = 1.0
return result
scaled = logits / temperature
exp = np.exp(scaled - np.max(scaled)) # Subtract max for numerical stability
return exp / exp.sum()
# Example logits for tokens: ["Paris", "France", "London", "Berlin"]
logits = np.array([5.0, 3.0, 1.5, 0.5])
for temp in [0.0, 0.3, 0.7, 1.0, 1.5]:
probs = apply_temperature(logits, temp)
print(f"temp={temp:.1f}: Paris={probs[0]:.3f} France={probs[1]:.3f} "
f"London={probs[2]:.3f} Berlin={probs[3]:.3f}")Output:
temp=0.0: Paris=1.000 France=0.000 London=0.000 Berlin=0.000
temp=0.3: Paris=0.999 France=0.001 London=0.000 Berlin=0.000
temp=0.7: Paris=0.938 France=0.054 London=0.006 Berlin=0.002
temp=1.0: Paris=0.850 France=0.115 London=0.026 Berlin=0.009
temp=1.5: Paris=0.709 France=0.187 London=0.069 Berlin=0.035

top_p (nucleus sampling)
Instead of considering all possible tokens, top_p keeps only the smallest set of tokens whose cumulative probability exceeds the threshold p. This cuts off the long tail of unlikely tokens.
def apply_top_p(probs: np.ndarray, top_p: float) -> np.ndarray:
"""Apply nucleus sampling — keep tokens until cumulative prob >= top_p."""
sorted_indices = np.argsort(probs)[::-1]
sorted_probs = probs[sorted_indices]
cumulative = np.cumsum(sorted_probs)
# Find cutoff
cutoff_idx = np.searchsorted(cumulative, top_p) + 1
mask = np.zeros_like(probs)
mask[sorted_indices[:cutoff_idx]] = 1
filtered = probs * mask
return filtered / filtered.sum() # Renormalize
# With top_p=0.9, we keep tokens until we hit 90% cumulative probability
probs = np.array([0.72, 0.10, 0.08, 0.05, 0.03, 0.02])
filtered = apply_top_p(probs, top_p=0.9)
print(f"Original: {probs}")
print(f"After top_p=0.9: {np.round(filtered, 3)}")When to use which
| Setting | Use case |
|---|---|
| temperature=0 | Classification, data extraction, structured output, any task where you need consistency |
| temperature=0.3, top_p=0.9 | Code generation, technical writing — mostly deterministic with slight variation |
| temperature=0.7, top_p=0.95 | General conversation, summarization — natural sounding with variety |
| temperature=1.0, top_p=1.0 | Creative writing, brainstorming — maximum diversity |
Do not set both temperature and top_p to non-default values at the same time. Pick one strategy. OpenAI’s documentation explicitly recommends altering one or the other, not both.
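In practice that looks like this (parameter names as in the OpenAI Chat Completions API; pick the temperature for the task and leave top_p at its default):

from openai import OpenAI

client = OpenAI()

# Deterministic task (extraction, classification): pin temperature, leave top_p alone
extraction = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[{"role": "user", "content": "Extract the date from: 'Invoice issued March 3.'"}],
)

# Open-ended task: a moderate temperature instead
brainstorm = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.7,
    messages=[{"role": "user", "content": "Suggest five names for a note-taking app."}],
)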
8. Why Models Hallucinate
Hallucination is not a bug in the software. It is a fundamental property of how these models work. Understanding why helps you design systems that mitigate it.
The core problem: no grounding mechanism
The model generates the most probable next token given the context. It has no way to verify whether its output is factually correct. There is no internal fact-checking step. There is no database lookup. There is only pattern completion.
Prompt: "The CEO of Google in 2024 is"
Model's process:
- "Sundar" has high probability (correct, based on training data)
- But so does "Satya" if context is ambiguous
- The model picks based on learned patterns, not a verified source

When hallucination is most likely
| Scenario | Risk level | Why |
|---|---|---|
| Obscure facts or recent events | High | Less training data to form strong patterns |
| Specific numbers, dates, URLs | High | Models are not precise retrieval systems |
| Multi-step reasoning | Medium-High | Errors compound across steps |
| Generating code for well-known libraries | Low | Massive training data with consistent patterns |
| Summarizing provided text | Low | The answer is in the context |
Production mitigation strategies
# Strategy 1: Ground responses in provided context (RAG)
messages = [
{"role": "system", "content": (
"Answer ONLY based on the provided context. "
"If the context doesn't contain the answer, say 'I don't know.' "
"Never make up information."
)},
{"role": "user", "content": f"Context:\n{retrieved_docs}\n\nQuestion: {question}"},
]
# Strategy 2: Ask the model to quote its sources
messages = [
{"role": "system", "content": (
"When answering, cite the specific passage from the provided "
"documents that supports your answer. Format: [Source: doc_name, paragraph X]"
)},
{"role": "user", "content": f"Documents:\n{docs}\n\nQuestion: {question}"},
]
# Strategy 3: Use structured output to constrain responses
messages = [
{"role": "system", "content": (
"Extract entities from the text. Return JSON with only the fields: "
"name, date, location. Use null for any field not found in the text. "
"Do NOT infer or guess missing values."
)},
{"role": "user", "content": f"Text: {text}"},
]

9. Why Prompt Order and Structure Matter
The transformer’s attention mechanism creates measurable biases based on position. This has direct implications for how you structure prompts.
The “lost in the middle” effect
Research has shown that LLMs pay the most attention to information at the beginning and end of the context window. Information buried in the middle gets less attention — literally lower attention scores.
Attention strength across position:
[HIGH] ... [decreasing] ... [LOW] ... [increasing] ... [HIGH]
 ^start                   ^middle                    ^end

Practical rule: Put your most important instructions and context at the beginning or end of the prompt. Never bury critical information in the middle of a long context.
Instruction positioning experiment
from openai import OpenAI
client = OpenAI()
# Same instruction, different position — different results
long_context = "... (imagine 50 paragraphs of text here) ..."
# Instruction at the START (better)
prompt_start = f"""Answer in exactly one sentence.
{long_context}
What is the main topic of the text above?"""
# Instruction at the END (also good)
prompt_end = f"""{long_context}
Based on the text above, what is the main topic?
Answer in exactly one sentence."""
# Instruction buried in the MIDDLE (worse)
half = len(long_context) // 2
prompt_middle = f"""{long_context[:half]}
Remember: answer in exactly one sentence.
{long_context[half:]}
What is the main topic?"""Recency bias in generation
The model’s output is influenced by what it just generated. This creates a recency bias — if the model starts going down a wrong path, it tends to continue in that direction because its own recent tokens have high attention scores.
This is why:
- Few-shot examples work: The model attends to the pattern in your examples and continues it (see the sketch after this list)
- Chain-of-thought helps: Breaking reasoning into steps gives the model correct intermediate tokens to attend to
- Bad first tokens can derail everything: If the model starts with “I’m not sure, but…” it tends to generate a less confident (and often less accurate) response
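A minimal sketch of that few-shot pattern in a messages array (the reviews and labels are made up for illustration):

# Few-shot prompting: the model attends to the example pattern and continues it
messages = [
    {"role": "system", "content": "Classify the sentiment of each review as positive or negative."},
    {"role": "user", "content": "Review: 'Battery died after two days.'"},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "Review: 'Setup took thirty seconds, works perfectly.'"},
    {"role": "assistant", "content": "positive"},
    # The real input goes last; the model continues the established pattern
    {"role": "user", "content": "Review: 'The screen scratches if you look at it wrong.'"},
]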
10. Practical Implications Summary
Here is the mental model you should carry into every engineering decision:
The cost model
# Your LLM costs are driven by this formula
cost = (input_tokens * input_price + output_tokens * output_price) * num_requests
# To reduce costs:
# 1. Reduce input_tokens (shorter prompts, less context)
# 2. Reduce output_tokens (set max_tokens, use structured output)
# 3. Use cheaper models (GPT-4o mini instead of GPT-4o)
# 4. Reduce num_requests (cache responses, batch similar queries)

The quality model
| Factor | Effect on quality |
|---|---|
| Clearer instructions | Better output |
| Relevant context in prompt | Reduces hallucination |
| temperature=0 | More consistent, less creative |
| Longer context | Diminishing returns, “lost in middle” |
| Better model | Generally better but 5x+ cost |
| Few-shot examples | Significantly better for format/style |
The latency model
Total latency = time_to_first_token + (output_tokens * time_per_token)
time_to_first_token: depends on input length and model size
time_per_token: roughly constant for a given model (~20-50ms for GPT-4o)
Example:
Input: 2000 tokens, Output: 500 tokens
TTFT: ~500ms
Generation: 500 * 30ms = 15s
  Total: ~15.5s (without streaming, user waits this entire time)
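Streaming does not shorten the total generation time, but it changes perceived latency: the user starts reading after roughly the time to first token instead of waiting for the full response. A minimal sketch with the OpenAI SDK:

from openai import OpenAI

client = OpenAI()

# Stream tokens as they are generated instead of waiting for the full response
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain transformers in one paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # text appears after ~TTFT, not after ~15s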
Quick reference: model parameter counts

| Model | Parameters | Layers | Hidden size | Attention heads |
|---|---|---|---|---|
| GPT-4o | ~200B (estimated) | ~120 | ~12,288 | ~96 |
| Claude 3.5 Sonnet | Undisclosed | - | - | - |
| Llama 3.1 70B | 70B | 80 | 8,192 | 64 |
| Llama 3.1 8B | 8B | 32 | 4,096 | 32 |
| Mistral 7B | 7B | 32 | 4,096 | 32 |
More parameters generally means better quality but higher cost and latency. The skill of LLM engineering is finding the smallest model that meets your quality bar.
Key Takeaways
- LLMs are next-token predictors built on the transformer architecture. Text goes in, gets tokenized, embedded, processed through attention layers, and a probability distribution over the next token comes out. Generation is an autoregressive loop — one token at a time.
- Tokenization determines your costs and limits. Everything is measured in tokens, not words. Different languages and formats have different token efficiencies. Always count tokens before making API calls.
- Embeddings encode meaning as vectors. Similar concepts end up near each other in vector space. This property powers semantic search, RAG, and classification — all tools you will use in production.
- Self-attention is the core mechanism. Every token attends to every other token, which is powerful but has quadratic cost. This is why context windows exist and why sending less context is almost always better.
- The KV cache makes input tokens cheaper than output tokens. Input is processed in parallel; output is generated sequentially. This directly affects API pricing and latency.
- Temperature and top_p control sampling. Use temperature=0 for deterministic tasks, 0.7 for general use. Do not tweak both simultaneously.
- Hallucination is inherent, not a bug. The model has no fact-checking mechanism. Mitigate with grounded context (RAG), structured output constraints, and explicit “say I don’t know” instructions.
- Prompt position matters. Information at the beginning and end of the context gets the most attention. Critical instructions should never be buried in the middle.
- Your three optimization levers are cost, quality, and latency. Every architectural decision trades between them. Understanding the transformer pipeline helps you make those trades intelligently.