You built a RAG pipeline. You embedded your documents, stored them in a vector database, and wrote a beautiful retrieval-augmented prompt. But the answers are wrong. Not hallucinated-wrong — the model faithfully answers from the retrieved context, but the context itself is garbage. The problem is almost always chunking.
Chunking is the least glamorous part of RAG and the most impactful. How you split your documents determines what your retriever can find, what context the model sees, and ultimately whether your answers are correct. Get it wrong and no amount of prompt engineering, reranking, or model upgrades will save you.
Why Chunking Matters More Than You Think
When you embed a chunk of text, you compress its entire meaning into a single vector. If that chunk contains three unrelated ideas, the embedding becomes a muddy average of all three — and it matches none of them well. If the chunk cuts a sentence in half, the embedding captures an incomplete thought that will never match a well-formed query.
Here is the core tension: smaller chunks are more precise (each one is about one thing), but larger chunks carry more context (the model has more to work with). Every chunking decision is a tradeoff between retrieval precision and context completeness.
Too small (50 chars): "The refund policy for enterprise"
→ Embedding is vague, matches too many queries
Too large (5000 chars): "Our refund policy... [followed by shipping info,
return windows, tax implications, contact details]"
→ Embedding is diluted, retrieval misses specifics
Right size (400 chars): "Enterprise customers can request a full refund within
30 days of purchase. After 30 days, a prorated refund
is available for annual plans. Contact your account
manager to initiate the process."
→ Embedding captures one coherent idea
Fixed-Size Chunking
The simplest approach: split text every N characters (or tokens), regardless of content.
def fixed_size_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
"""Split text into fixed-size chunks with overlap."""
chunks = []
start = 0
while start < len(text):
end = start + chunk_size
chunk = text[start:end]
chunks.append(chunk)
start = end - overlap # Step back by overlap amount
return chunks
# Example
text = "A" * 1200 # 1200 characters of text
chunks = fixed_size_chunk(text, chunk_size=500, overlap=50)
print(f"Number of chunks: {len(chunks)}") # 3
print(f"Chunk sizes: {[len(c) for c in chunks]}") # [500, 500, 250]When fixed-size works: Highly uniform text where every section has similar structure — think log files, CSV rows, or repetitive form data. Also useful as a baseline to compare against smarter strategies.
When it fails: Any natural language text. It splits mid-sentence, mid-word, mid-thought. A chunk might start with “…ment was approved on Tuesday.” and end with “The customer then reque…” — useless for retrieval.
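To make the failure concrete, here is a small illustration using the fixed_size_chunk function defined above (the passage and the deliberately tiny sizes are invented for demonstration):
text = (
    "The reimbursement was approved on Tuesday after a short review. "
    "The customer then requested an expedited transfer to their account."
)
for i, chunk in enumerate(fixed_size_chunk(text, chunk_size=60, overlap=10)):
    print(f"Chunk {i}: {chunk!r}")
# Several chunks start or end in the middle of a word, so their embeddings
# capture incomplete thoughts that no well-formed query will match.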
Recursive Character Splitting
This is the right default for most text. Instead of splitting at an arbitrary character count, you split on a hierarchy of separators: first try double newlines (paragraphs), then single newlines, then sentences, then words. Each level only triggers when the previous level produced chunks that are still too large.
def recursive_split(
text: str,
chunk_size: int = 500,
chunk_overlap: int = 50,
separators: list[str] | None = None,
) -> list[str]:
"""Recursively split text using a hierarchy of separators."""
if separators is None:
separators = ["\n\n", "\n", ". ", " ", ""]
chunks = []
separator = separators[0]
remaining_separators = separators[1:]
# Split on the current separator
splits = text.split(separator) if separator else list(text)
current_chunk = ""
for split in splits:
piece = split if not separator else split + separator
if len(current_chunk) + len(piece) <= chunk_size:
current_chunk += piece
else:
if current_chunk:
chunks.append(current_chunk.strip())
# If this single piece is too large, recurse with finer separator
if len(piece) > chunk_size and remaining_separators:
sub_chunks = recursive_split(
piece, chunk_size, chunk_overlap, remaining_separators
)
chunks.extend(sub_chunks)
current_chunk = ""
else:
current_chunk = piece
if current_chunk.strip():
chunks.append(current_chunk.strip())
# Apply overlap
if chunk_overlap > 0 and len(chunks) > 1:
chunks = _apply_overlap(chunks, chunk_overlap)
return chunks
def _apply_overlap(chunks: list[str], overlap: int) -> list[str]:
"""Add overlap between consecutive chunks."""
result = [chunks[0]]
for i in range(1, len(chunks)):
prev_tail = chunks[i - 1][-overlap:]
result.append(prev_tail + " " + chunks[i])
return result
In practice, most teams use LangChain’s RecursiveCharacterTextSplitter because it handles the edge cases well:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50,
separators=["\n\n", "\n", ". ", ", ", " ", ""],
length_function=len,
)
text = """
Introduction to Machine Learning
Machine learning is a subset of artificial intelligence that focuses
on building systems that learn from data. Unlike traditional programming,
where you write explicit rules, ML systems discover patterns automatically.
There are three main categories of machine learning:
Supervised Learning: The model learns from labeled examples. You provide
input-output pairs and the model learns the mapping function. Common
algorithms include linear regression, decision trees, and neural networks.
Unsupervised Learning: The model finds structure in unlabeled data.
Clustering and dimensionality reduction are typical tasks. K-means and
PCA are widely used algorithms.
Reinforcement Learning: The model learns by interacting with an
environment and receiving rewards or penalties. This approach powers
game-playing AIs and robotic control systems.
"""
chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks):
print(f"--- Chunk {i} ({len(chunk)} chars) ---")
print(chunk[:100] + "...")
print()
The separator hierarchy is the key insight. "\n\n" catches paragraph breaks. "\n" catches line breaks. ". " catches sentence boundaries. This means you almost always get chunks that start and end at natural boundaries.
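Chunk sizes above are measured in characters (length_function=len). If your limits are really in tokens, you can pass a tokenizer-based length function instead. A minimal sketch, assuming tiktoken is installed and that the cl100k_base encoding is an acceptable stand-in for your embedding model's tokenizer:
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

enc = tiktoken.get_encoding("cl100k_base")

def token_len(text: str) -> int:
    # Measure length in tokens instead of characters
    return len(enc.encode(text))

token_splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,      # now interpreted as roughly 256 tokens
    chunk_overlap=25,
    separators=["\n\n", "\n", ". ", ", ", " ", ""],
    length_function=token_len,
)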
Sentence-Based Chunking
Sometimes you want precise sentence boundaries. NLP libraries like spaCy and NLTK handle edge cases that regex cannot — abbreviations (Dr., U.S.A.), decimal numbers (3.14), and URLs.
import spacy
nlp = spacy.load("en_core_web_sm")
def sentence_chunk(
text: str,
max_sentences: int = 5,
overlap_sentences: int = 1,
) -> list[str]:
"""Chunk text by grouping sentences together."""
doc = nlp(text)
sentences = [sent.text.strip() for sent in doc.sents]
chunks = []
for i in range(0, len(sentences), max_sentences - overlap_sentences):
group = sentences[i : i + max_sentences]
chunks.append(" ".join(group))
return chunks
text = """
Dr. Smith published the results on Jan. 15. The study involved 3.5 million
data points collected across the U.S.A. Results showed a 23% improvement in
accuracy. The p-value was less than 0.001, which is statistically significant.
Follow-up studies are planned for Q3 2026. The team expects to expand the
dataset to 10 million records. Funding has been secured through 2028.
"""
chunks = sentence_chunk(text, max_sentences=3, overlap_sentences=1)
for i, chunk in enumerate(chunks):
print(f"Chunk {i}: {chunk}\n")NLTK alternative (lighter weight, no model download):
import nltk
nltk.download("punkt_tab", quiet=True)
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)
Sentence chunking is ideal when your source material is dense prose where every sentence carries distinct information — legal documents, medical records, research papers.
Semantic Chunking
The most sophisticated approach: use embeddings to decide where to split. The idea is that consecutive sentences about the same topic should stay together, and a split should happen when the topic shifts.
import numpy as np
from openai import OpenAI
client = OpenAI()
def get_embeddings(texts: list[str]) -> list[list[float]]:
"""Get embeddings for a batch of texts."""
response = client.embeddings.create(
model="text-embedding-3-small",
input=texts,
)
return [item.embedding for item in response.data]
def cosine_similarity(a: list[float], b: list[float]) -> float:
"""Compute cosine similarity between two vectors."""
a, b = np.array(a), np.array(b)
return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
def semantic_chunk(
text: str,
similarity_threshold: float = 0.75,
min_chunk_size: int = 100,
) -> list[str]:
"""Split text into semantically coherent chunks using embeddings."""
# Step 1: Split into sentences
import nltk
nltk.download("punkt_tab", quiet=True)
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)
if len(sentences) <= 1:
return [text]
# Step 2: Embed each sentence
embeddings = get_embeddings(sentences)
# Step 3: Find breakpoints where similarity drops
similarities = []
for i in range(len(embeddings) - 1):
sim = cosine_similarity(embeddings[i], embeddings[i + 1])
similarities.append(sim)
# Step 4: Split at low-similarity points
chunks = []
current_chunk = [sentences[0]]
for i, sim in enumerate(similarities):
if sim < similarity_threshold and len(" ".join(current_chunk)) >= min_chunk_size:
chunks.append(" ".join(current_chunk))
current_chunk = [sentences[i + 1]]
else:
current_chunk.append(sentences[i + 1])
if current_chunk:
chunks.append(" ".join(current_chunk))
return chunks
# Usage
text = """
Python is a high-level programming language known for its readability.
It was created by Guido van Rossum and first released in 1991. Python
supports multiple programming paradigms including procedural and
object-oriented programming.
The stock market experienced significant volatility last quarter.
The S&P 500 dropped 8% in March before recovering in April. Analysts
attribute the decline to rising interest rates and geopolitical tensions.
Machine learning models require large datasets for training. The quality
of training data directly impacts model performance. Data augmentation
techniques can help when labeled data is scarce.
"""
chunks = semantic_chunk(text, similarity_threshold=0.78)
for i, chunk in enumerate(chunks):
print(f"--- Semantic Chunk {i} ---")
print(chunk[:150])
print()
Semantic chunking correctly groups the three paragraphs (Python, stock market, ML) into separate chunks even if you removed the blank lines between them. A fixed-size splitter would blindly cut at character 500 regardless of topic.
The cost: You must embed every sentence before chunking, which means API calls and latency. For a 100-page document with 2,000 sentences, that is one embedding API call with 2,000 inputs. It works, but it is 10-100x slower than recursive splitting.
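If a document is long enough that a single request will not fit, batch the sentence embeddings. A minimal sketch; the batch size of 1,000 is an assumption, so check your provider's per-request input limit:
def get_embeddings_batched(texts: list[str], batch_size: int = 1000) -> list[list[float]]:
    """Embed texts in batches to stay under per-request input limits."""
    embeddings: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        embeddings.extend(item.embedding for item in response.data)
    return embeddings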
When to use it: High-value documents where retrieval quality matters more than indexing speed — legal contracts, technical specifications, medical literature.
Document-Aware Chunking
Structured documents (Markdown, HTML, PDFs with headings) carry explicit signals about where topics change. Use them.
import re
def markdown_chunk(text: str, max_chunk_size: int = 1000) -> list[dict]:
"""Chunk markdown by headings, preserving section hierarchy."""
# Split on headings (##, ###, etc.)
sections = re.split(r"(^#{1,4}\s+.+$)", text, flags=re.MULTILINE)
chunks = []
current_heading = "Introduction"
current_content = ""
for section in sections:
if re.match(r"^#{1,4}\s+", section):
# Save previous section
if current_content.strip():
chunks.append({
"content": current_content.strip(),
"heading": current_heading,
"char_count": len(current_content.strip()),
})
current_heading = section.strip("# \n")
current_content = ""
else:
current_content += section
# Don't forget the last section
if current_content.strip():
chunks.append({
"content": current_content.strip(),
"heading": current_heading,
"char_count": len(current_content.strip()),
})
# Split oversized sections with recursive splitting
final_chunks = []
for chunk in chunks:
if chunk["char_count"] > max_chunk_size:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=max_chunk_size, chunk_overlap=50
)
sub_texts = splitter.split_text(chunk["content"])
for j, sub in enumerate(sub_texts):
final_chunks.append({
"content": sub,
"heading": chunk["heading"],
"sub_section": j,
})
else:
final_chunks.append(chunk)
return final_chunks
# Example
markdown_doc = """
# API Reference
## Authentication
All API calls require a Bearer token in the Authorization header.
Tokens expire after 24 hours. Use the /auth/refresh endpoint to
get a new token without re-authenticating.
## Rate Limiting
The API enforces rate limits of 100 requests per minute per API key.
Exceeding the limit returns a 429 status code with a Retry-After
header indicating how many seconds to wait.
### Burst Limits
Short bursts of up to 20 requests per second are allowed. Sustained
traffic above 100 RPM will trigger rate limiting.
## Pagination
All list endpoints support cursor-based pagination. Pass the cursor
parameter from the previous response to get the next page. The default
page size is 20 items, configurable up to 100.
"""
chunks = markdown_chunk(markdown_doc)
for chunk in chunks:
print(f"[{chunk['heading']}] {chunk['content'][:80]}...")
print()
HTML-aware chunking follows the same principle — split on <h1>, <h2>, <section>, <article> tags. For PDFs, use libraries like pymupdf or unstructured that extract heading hierarchy.
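A minimal sketch of the HTML case, assuming BeautifulSoup (bs4) is available. It walks headings and body text in document order and groups text under the most recent heading:
from bs4 import BeautifulSoup

def html_chunk(html: str) -> list[dict]:
    """Group paragraph and list-item text under the nearest preceding heading."""
    soup = BeautifulSoup(html, "html.parser")
    chunks: list[dict] = []
    heading = ""
    for tag in soup.find_all(["h1", "h2", "h3", "h4", "p", "li"]):
        text = tag.get_text(" ", strip=True)
        if not text:
            continue
        if tag.name.startswith("h"):
            heading = text
        elif chunks and chunks[-1]["heading"] == heading:
            chunks[-1]["content"] += " " + text
        else:
            chunks.append({"heading": heading, "content": text})
    return chunks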
The Overlap Question
Overlap means repeating a portion of the previous chunk at the start of the next chunk. It exists to solve one problem: information that spans a chunk boundary.
Without overlap:
Chunk 1: "...the refund policy requires customers to"
Chunk 2: "submit a written request within 30 days."
With 50-char overlap:
Chunk 1: "...the refund policy requires customers to"
Chunk 2: "policy requires customers to submit a written request within 30 days."Without overlap, neither chunk contains the complete sentence. A query about “refund request deadline” might not match either chunk well enough to be retrieved.
How much overlap? 10-20% of chunk size is the standard range. For a 500-character chunk, use 50-100 characters of overlap. More overlap means more redundancy in your vector database (more storage, more embeddings to compute) but fewer missed boundary cases.
The cost of overlap: If you have 1000 chunks with 20% overlap, you effectively store 1200 chunks worth of text. Embedding cost increases proportionally.
| Chunk Size | Overlap | Effective Chunks per 100K chars | Embedding Cost Multiplier |
|---|---|---|---|
| 500 | 0 | 200 | 1.0x |
| 500 | 50 | 222 | 1.11x |
| 500 | 100 | 250 | 1.25x |
| 500 | 200 | 333 | 1.67x |
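The numbers in the table fall straight out of the step size: with overlap, each new chunk advances by chunk_size minus overlap characters, so the chunk count and the embedding cost scale by chunk_size / (chunk_size - overlap). A quick check:
total_chars = 100_000
chunk_size = 500
for overlap in (0, 50, 100, 200):
    step = chunk_size - overlap
    print(f"overlap={overlap:>3}: "
          f"{total_chars // step} chunks, "
          f"{chunk_size / step:.2f}x embedding cost")
# overlap=  0: 200 chunks, 1.00x embedding cost
# overlap= 50: 222 chunks, 1.11x embedding cost
# overlap=100: 250 chunks, 1.25x embedding cost
# overlap=200: 333 chunks, 1.67x embedding cost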
Chunk Size Experiments
The question everyone asks: what chunk size should I use? Here is a systematic way to find out for your data.
from openai import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
import numpy as np
import json
client = OpenAI()
def embed(texts: list[str]) -> list[list[float]]:
"""Batch embed texts."""
response = client.embeddings.create(
model="text-embedding-3-small",
input=texts,
)
return [d.embedding for d in response.data]
def cosine_sim(a, b):
a, b = np.array(a), np.array(b)
return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
def evaluate_chunk_size(
documents: list[str],
queries: list[dict], # [{"query": "...", "expected_content": "..."}]
chunk_sizes: list[int],
top_k: int = 5,
) -> dict:
"""Compare retrieval quality across different chunk sizes."""
results = {}
for size in chunk_sizes:
splitter = RecursiveCharacterTextSplitter(
chunk_size=size, chunk_overlap=int(size * 0.1)
)
# Chunk all documents
all_chunks = []
for doc in documents:
all_chunks.extend(splitter.split_text(doc))
# Embed all chunks
chunk_embeddings = embed(all_chunks)
# Test each query
hits = 0
for q in queries:
query_emb = embed([q["query"]])[0]
# Find top-k most similar chunks
sims = [cosine_sim(query_emb, ce) for ce in chunk_embeddings]
top_indices = np.argsort(sims)[-top_k:][::-1]
retrieved_text = " ".join(all_chunks[i] for i in top_indices)
# Check if expected content appears in retrieved chunks
if q["expected_content"].lower() in retrieved_text.lower():
hits += 1
results[size] = {
"num_chunks": len(all_chunks),
"recall": hits / len(queries),
"avg_chunk_chars": np.mean([len(c) for c in all_chunks]),
}
return results
# Example usage
documents = [open(f).read() for f in ["doc1.txt", "doc2.txt", "doc3.txt"]]
queries = [
{
"query": "What is the refund policy for enterprise?",
"expected_content": "30 days",
},
{
"query": "How do I reset my API key?",
"expected_content": "dashboard settings",
},
# ... 50-100 queries for a real evaluation
]
results = evaluate_chunk_size(
documents, queries, chunk_sizes=[200, 500, 1000, 2000]
)
for size, metrics in sorted(results.items()):
print(f"Chunk size {size:>5}: "
f"{metrics['num_chunks']:>4} chunks, "
f"recall={metrics['recall']:.2%}")Typical results from production systems:
| Chunk Size | Num Chunks | Retrieval Recall | Notes |
|---|---|---|---|
| 200 | 1,847 | 72% | Too granular — loses context |
| 500 | 739 | 88% | Good balance |
| 800 | 462 | 85% | Slightly diluted embeddings |
| 1,000 | 370 | 81% | Starting to lose precision |
| 2,000 | 185 | 68% | Too much noise per chunk |
The sweet spot is almost always 300-800 characters. Below 200, you lose too much context. Above 1000, the embedding gets diluted by mixed topics within a single chunk. But always test on your own data — the right size depends on your document style and query patterns.
Metadata Enrichment
A chunk in isolation often lacks context. “The process takes 3-5 business days” — which process? Attaching metadata to each chunk solves this.
import json
from dataclasses import dataclass, field
from datetime import datetime
@dataclass
class EnrichedChunk:
content: str
metadata: dict = field(default_factory=dict)
def enrich_chunks(
chunks: list[str],
document_title: str,
source_url: str,
section_headings: list[str] | None = None,
) -> list[EnrichedChunk]:
"""Add metadata to chunks for better retrieval context."""
enriched = []
for i, chunk in enumerate(chunks):
meta = {
"document_title": document_title,
"source_url": source_url,
"chunk_index": i,
"total_chunks": len(chunks),
"char_count": len(chunk),
"indexed_at": datetime.utcnow().isoformat(),
}
# Add section heading if available
if section_headings and i < len(section_headings):
meta["section"] = section_headings[i]
# Prepend context to the chunk text for better embeddings
context_prefix = f"Document: {document_title}"
if "section" in meta:
context_prefix += f" | Section: {meta['section']}"
enriched.append(EnrichedChunk(
content=f"{context_prefix}\n\n{chunk}",
metadata=meta,
))
return enriched
# Usage
chunks = ["The process takes 3-5 business days...", "Contact support at..."]
enriched = enrich_chunks(
chunks,
document_title="Enterprise Onboarding Guide",
source_url="https://docs.example.com/onboarding",
section_headings=["Timeline", "Support"],
)
for ec in enriched:
print(f"Content: {ec.content[:100]}...")
print(f"Metadata: {json.dumps(ec.metadata, indent=2)}\n")The context_prefix trick is important: by prepending “Document: Enterprise Onboarding Guide | Section: Timeline” to the chunk before embedding, the embedding captures both the content and its context. A query like “how long does enterprise onboarding take” will now match better because the chunk explicitly mentions “Enterprise Onboarding” in its embedded text.
Parent-Child Chunking
A hybrid approach: embed small, focused chunks for precision retrieval, but return the larger parent chunk to the LLM for context.
from dataclasses import dataclass
@dataclass
class ParentChildChunk:
child_content: str # Small, embedded for retrieval
parent_content: str # Large, sent to LLM
child_id: str
parent_id: str
def create_parent_child_chunks(
text: str,
parent_size: int = 2000,
child_size: int = 300,
child_overlap: int = 50,
) -> list[ParentChildChunk]:
"""Create two-level chunk hierarchy."""
from langchain.text_splitter import RecursiveCharacterTextSplitter
import hashlib
# Level 1: Large parent chunks
parent_splitter = RecursiveCharacterTextSplitter(
chunk_size=parent_size, chunk_overlap=200
)
parent_texts = parent_splitter.split_text(text)
# Level 2: Small child chunks within each parent
child_splitter = RecursiveCharacterTextSplitter(
chunk_size=child_size, chunk_overlap=child_overlap
)
all_chunks = []
for p_idx, parent in enumerate(parent_texts):
parent_id = hashlib.md5(parent[:100].encode()).hexdigest()[:12]
children = child_splitter.split_text(parent)
for c_idx, child in enumerate(children):
child_id = f"{parent_id}_c{c_idx}"
all_chunks.append(ParentChildChunk(
child_content=child,
parent_content=parent,
child_id=child_id,
parent_id=parent_id,
))
return all_chunks
# Retrieval flow:
# 1. Embed all child_content and store in vector DB
# 2. On query, search against child embeddings
# 3. Return the parent_content of the matched children to the LLM
#
# This gives you the best of both worlds:
# - Precise retrieval (small chunks match specific queries)
# - Rich context (LLM sees the full surrounding text)
chunks = create_parent_child_chunks(open("large_doc.txt").read())
print(f"Parents: {len(set(c.parent_id for c in chunks))}")
print(f"Children: {len(chunks)}")
print(f"Child sizes: {[len(c.child_content) for c in chunks[:5]]}")
print(f"Parent sizes: {[len(c.parent_content) for c in chunks[:5]]}")Parent-child chunking is used by many production RAG systems. It solves the precision-vs-context tradeoff without compromise. The downside is complexity — you need to store and manage the parent-child mapping.
Code-Aware Chunking
Source code has its own structure. Splitting Python by character count will break functions in half. Instead, split by language constructs.
import ast
def chunk_python_code(source_code: str) -> list[dict]:
"""Chunk Python source code by functions and classes."""
tree = ast.parse(source_code)
lines = source_code.split("\n")
chunks = []
for node in ast.walk(tree):
if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
start = node.lineno - 1
end = node.end_lineno
func_code = "\n".join(lines[start:end])
# Get docstring if present
docstring = ast.get_docstring(node) or ""
chunks.append({
"content": func_code,
"type": "function",
"name": node.name,
"docstring": docstring,
"line_start": node.lineno,
"line_end": node.end_lineno,
})
elif isinstance(node, ast.ClassDef):
start = node.lineno - 1
end = node.end_lineno
class_code = "\n".join(lines[start:end])
chunks.append({
"content": class_code,
"type": "class",
"name": node.name,
"docstring": ast.get_docstring(node) or "",
"line_start": node.lineno,
"line_end": node.end_lineno,
})
# Also capture module-level code that's not in any function/class
covered_lines = set()
for chunk in chunks:
covered_lines.update(range(chunk["line_start"], chunk["line_end"] + 1))
module_lines = []
for i, line in enumerate(lines, 1):
if i not in covered_lines and line.strip():
module_lines.append(line)
if module_lines:
chunks.insert(0, {
"content": "\n".join(module_lines),
"type": "module_level",
"name": "module",
"docstring": "",
})
return chunks
# Example
source = '''
import os
from pathlib import Path
CONFIG_DIR = Path.home() / ".config"
class DatabaseConnection:
"""Manages database connections with pooling."""
def __init__(self, url: str, pool_size: int = 5):
self.url = url
self.pool_size = pool_size
def connect(self):
"""Establish a connection from the pool."""
pass
def close(self):
"""Return connection to pool."""
pass
def process_file(path: str) -> dict:
"""Read and parse a configuration file."""
with open(path) as f:
return json.loads(f.read())
'''
chunks = chunk_python_code(source)
for chunk in chunks:
print(f"[{chunk['type']}] {chunk['name']}: {len(chunk['content'])} chars")For other languages (JavaScript, Go, Rust), use tree-sitter for AST parsing — it supports 100+ languages with a uniform API.
Production Chunking Pipeline
Here is a complete pipeline that handles multiple document types and produces enriched, ready-to-embed chunks.
import re
import hashlib
from dataclasses import dataclass, field
from pathlib import Path
from enum import Enum
from langchain.text_splitter import RecursiveCharacterTextSplitter
class DocType(Enum):
MARKDOWN = "markdown"
PLAIN_TEXT = "plain_text"
CODE = "code"
HTML = "html"
@dataclass
class ProcessedChunk:
chunk_id: str
content: str
content_for_embedding: str # May include metadata prefix
metadata: dict = field(default_factory=dict)
class ChunkingPipeline:
"""Production chunking pipeline with multiple strategies."""
def __init__(
self,
default_chunk_size: int = 500,
default_overlap: int = 50,
):
self.default_chunk_size = default_chunk_size
self.default_overlap = default_overlap
def process(
self,
text: str,
doc_type: DocType,
source: str = "",
title: str = "",
) -> list[ProcessedChunk]:
"""Process a document into enriched chunks."""
# Step 1: Detect doc type and apply appropriate strategy
if doc_type == DocType.MARKDOWN:
raw_chunks = self._chunk_markdown(text)
elif doc_type == DocType.CODE:
raw_chunks = self._chunk_code(text)
elif doc_type == DocType.HTML:
raw_chunks = self._chunk_html(text)
else:
raw_chunks = self._chunk_plain_text(text)
# Step 2: Enrich with metadata
processed = []
for i, chunk_data in enumerate(raw_chunks):
content = chunk_data["content"]
section = chunk_data.get("section", "")
chunk_id = hashlib.sha256(
f"{source}:{i}:{content[:50]}".encode()
).hexdigest()[:16]
# Build embedding text with context
embed_parts = []
if title:
embed_parts.append(f"Document: {title}")
if section:
embed_parts.append(f"Section: {section}")
embed_parts.append(content)
content_for_embedding = "\n".join(embed_parts)
processed.append(ProcessedChunk(
chunk_id=chunk_id,
content=content,
content_for_embedding=content_for_embedding,
metadata={
"source": source,
"title": title,
"section": section,
"chunk_index": i,
"char_count": len(content),
"doc_type": doc_type.value,
},
))
return processed
def _chunk_markdown(self, text: str) -> list[dict]:
"""Chunk markdown respecting heading structure."""
sections = re.split(r"(^#{1,4}\s+.+$)", text, flags=re.MULTILINE)
chunks = []
current_heading = ""
current_text = ""
for part in sections:
if re.match(r"^#{1,4}\s+", part):
if current_text.strip():
chunks.extend(self._split_section(
current_text.strip(), current_heading
))
current_heading = part.strip("# \n")
current_text = ""
else:
current_text += part
if current_text.strip():
chunks.extend(self._split_section(
current_text.strip(), current_heading
))
return chunks
def _split_section(self, text: str, heading: str) -> list[dict]:
"""Split a section that might be too large."""
if len(text) <= self.default_chunk_size:
return [{"content": text, "section": heading}]
splitter = RecursiveCharacterTextSplitter(
chunk_size=self.default_chunk_size,
chunk_overlap=self.default_overlap,
)
parts = splitter.split_text(text)
return [{"content": p, "section": heading} for p in parts]
def _chunk_plain_text(self, text: str) -> list[dict]:
"""Chunk plain text with recursive splitting."""
splitter = RecursiveCharacterTextSplitter(
chunk_size=self.default_chunk_size,
chunk_overlap=self.default_overlap,
)
parts = splitter.split_text(text)
return [{"content": p, "section": ""} for p in parts]
def _chunk_code(self, text: str) -> list[dict]:
"""Chunk source code by functions/classes."""
try:
chunks = chunk_python_code(text) # from earlier example
return [
{"content": c["content"], "section": f"{c['type']}: {c['name']}"}
for c in chunks
]
except SyntaxError:
return self._chunk_plain_text(text)
def _chunk_html(self, text: str) -> list[dict]:
"""Chunk HTML by stripping tags and splitting."""
from html.parser import HTMLParser
class TextExtractor(HTMLParser):
def __init__(self):
super().__init__()
self.parts = []
self.current_heading = ""
def handle_starttag(self, tag, attrs):
if tag in ("h1", "h2", "h3", "h4"):
self._tag = tag
def handle_data(self, data):
if hasattr(self, "_tag"):
self.current_heading = data.strip()
del self._tag
else:
if data.strip():
self.parts.append({
"content": data.strip(),
"section": self.current_heading,
})
extractor = TextExtractor()
extractor.feed(text)
# Merge small consecutive parts from the same section
merged = []
current = {"content": "", "section": ""}
for part in extractor.parts:
if (part["section"] == current["section"]
and len(current["content"]) + len(part["content"])
< self.default_chunk_size):
current["content"] += " " + part["content"]
else:
if current["content"]:
merged.append(current)
current = dict(part)
if current["content"]:
merged.append(current)
return merged
# Usage
pipeline = ChunkingPipeline(default_chunk_size=500, default_overlap=50)
# Process a markdown doc
md_chunks = pipeline.process(
text=open("docs/api-reference.md").read(),
doc_type=DocType.MARKDOWN,
source="docs/api-reference.md",
title="API Reference",
)
# Process source code
code_chunks = pipeline.process(
text=open("src/auth.py").read(),
doc_type=DocType.CODE,
source="src/auth.py",
title="Authentication Module",
)
for chunk in md_chunks[:3]:
print(f"ID: {chunk.chunk_id}")
print(f"Section: {chunk.metadata['section']}")
print(f"Embed text: {chunk.content_for_embedding[:100]}...")
print()
Evaluating Chunking Quality
How do you know if your chunking is good? Three methods.
Method 1: Human inspection. Read 50 random chunks. Can you understand each one in isolation? Does each chunk answer the question “what is this about?” If not, your chunks are too fragmented.
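The inspection pass needs nothing fancier than this (all_chunks here stands for whatever list of chunk strings your pipeline produced):
import random

for chunk in random.sample(all_chunks, min(50, len(all_chunks))):
    print(chunk)
    print("-" * 60)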
Method 2: Retrieval evaluation. Build a test set of queries with expected source documents. Measure what percentage of queries retrieve the right chunks. This is the experiment framework from the “Chunk Size Experiments” section above.
Method 3: Coherence scoring. Use an LLM to evaluate chunk quality.
def score_chunk_coherence(chunks: list[str], sample_size: int = 50) -> float:
"""Use an LLM to score chunk coherence."""
import random
sample = random.sample(chunks, min(sample_size, len(chunks)))
scores = []
for chunk in sample:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": (
"Rate the following text chunk on coherence from 1-5. "
"5 = complete, self-contained thought. "
"1 = fragment, starts/ends mid-sentence, or mixes unrelated topics. "
"Respond with just the number."
),
},
{"role": "user", "content": chunk},
],
max_tokens=5,
)
try:
score = int(response.choices[0].message.content.strip())
scores.append(score)
except ValueError:
continue
return sum(scores) / len(scores) if scores else 0.0
# Compare strategies
from langchain.text_splitter import RecursiveCharacterTextSplitter
text = open("large_document.txt").read()
fixed_chunks = fixed_size_chunk(text, chunk_size=500)
recursive_chunks = RecursiveCharacterTextSplitter(
chunk_size=500, chunk_overlap=50
).split_text(text)
print(f"Fixed-size coherence: {score_chunk_coherence(fixed_chunks):.2f}/5")
print(f"Recursive coherence: {score_chunk_coherence(recursive_chunks):.2f}/5")Typical results: fixed-size averages 2.5-3.0, recursive averages 3.5-4.0, semantic averages 4.0-4.5. The difference is meaningful — it directly translates to retrieval quality.
Decision Framework
Use this table to pick the right strategy for your documents.
| Document Type | Recommended Strategy | Chunk Size | Why |
|---|---|---|---|
| Blog posts, articles | Recursive character | 400-600 | Natural paragraph structure |
| API documentation | Markdown-aware | 300-500 | Headings mark topic changes |
| Legal contracts | Semantic | 500-800 | Dense prose, subtle topic shifts |
| Chat/conversation logs | Sentence-based | 200-400 | Each message is a unit |
| Source code | AST-based (code-aware) | Per function | Functions are natural units |
| PDF reports | Document-aware + recursive | 400-800 | Use headings, fall back to recursive |
| Knowledge base Q&A | Per question-answer pair | Varies | Each Q&A is a natural chunk |
| CSV/structured data | Per row or row group | Per row | Rows are records |
The production default: Start with RecursiveCharacterTextSplitter at 500 characters with 50 character overlap. It works well for 80% of cases. Only switch to a more sophisticated strategy when you have evidence (from evaluation) that it is not good enough.
Key Takeaways
- Chunking determines retrieval quality. No amount of model upgrades or prompt engineering fixes bad chunks. If the right information never gets retrieved, the answer will be wrong.
- Recursive character splitting is the right default. It respects natural text boundaries (paragraphs, sentences) and handles most document types well. Start here.
- The sweet spot is 300-800 characters. Below 200, chunks lack context. Above 1000, embeddings get diluted. Test with your own data to find the optimal size.
- Overlap prevents boundary losses. Use 10-20% overlap to catch information that spans chunk boundaries. The storage cost is minimal compared to the retrieval improvement.
- Semantic chunking is powerful but expensive. Reserve it for high-value documents where retrieval quality justifies the 10-100x slower indexing.
- Always enrich chunks with metadata. Section headings, document titles, and source URLs help both retrieval (better embeddings) and generation (LLM has more context).
- Parent-child chunking solves the precision-vs-context tradeoff. Embed small chunks for matching, retrieve large chunks for the LLM.
- Evaluate your chunks. Read them manually, measure retrieval recall, and score coherence. Chunking is not something you configure once and forget.
- Document-aware chunking beats recursive for structured documents. If your docs have headings, use them. They are explicit signals about topic boundaries.
- Read your chunks before deploying. If a human cannot understand a chunk in isolation, the retriever will not match it to the right query. This five-minute sanity check catches most chunking problems.