You built a RAG pipeline. You embedded your documents, stored them in a vector database, and wrote a beautiful retrieval-augmented prompt. But the answers are wrong. Not hallucinated-wrong — the model faithfully answers from the retrieved context, but the context itself is garbage. The problem is almost always chunking.
Chunking is the least glamorous part of RAG and the most impactful. How you split your documents determines what your retriever can find, what context the model sees, and ultimately whether your answers are correct. Get it wrong and no amount of prompt engineering, reranking, or model upgrades will save you.
Why Chunking Matters More Than You Think
When you embed a chunk of text, you compress its entire meaning into a single vector. If that chunk contains three unrelated ideas, the embedding becomes a muddy average of all three — and it matches none of them well. If the chunk cuts a sentence in half, the embedding captures an incomplete thought that will never match a well-formed query.
Here is the core tension: smaller chunks are more precise (each one is about one thing), but larger chunks carry more context (the model has more to work with). Every chunking decision is a tradeoff between retrieval precision and context completeness.
Too small (50 chars): "The refund policy for enterprise"
→ Embedding is vague, matches too many queries
Too large (5000 chars): "Our refund policy... [followed by shipping info,
return windows, tax implications, contact details]"
→ Embedding is diluted, retrieval misses specifics
Right size (400 chars): "Enterprise customers can request a full refund within
30 days of purchase. After 30 days, a prorated refund
is available for annual plans. Contact your account
manager to initiate the process."
→ Embedding captures one coherent idea
Fixed-Size Chunking
The simplest approach: split text every N characters (or tokens), regardless of content.
def fixed_size_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
"""Split text into fixed-size chunks with overlap."""
chunks = []
start = 0
while start < len(text):
end = start + chunk_size
chunk = text[start:end]
chunks.append(chunk)
start = end - overlap # Step back by overlap amount
return chunks
# Example
text = "A" * 1200 # 1200 characters of text
chunks = fixed_size_chunk(text, chunk_size=500, overlap=50)
print(f"Number of chunks: {len(chunks)}") # 3
print(f"Chunk sizes: {[len(c) for c in chunks]}") # [500, 500, 250]When fixed-size works: Highly uniform text where every section has similar structure — think log files, CSV rows, or repetitive form data. Also useful as a baseline to compare against smarter strategies.
When it fails: Any natural language text. It splits mid-sentence, mid-word, mid-thought. A chunk might start with “…ment was approved on Tuesday.” and end with “The customer then reque…” — useless for retrieval.
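To make the failure concrete, here is a small illustration using the fixed_size_chunk function defined above (the passage and the deliberately tiny sizes are invented for demonstration):
text = (
    "The reimbursement was approved on Tuesday after a short review. "
    "The customer then requested an expedited transfer to their account."
)
for i, chunk in enumerate(fixed_size_chunk(text, chunk_size=60, overlap=10)):
    print(f"Chunk {i}: {chunk!r}")
# Several chunks start or end in the middle of a word, so their embeddings
# capture incomplete thoughts that no well-formed query will match.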
Recursive Character Splitting
This is the right default for most text. Instead of splitting at an arbitrary character count, you split on a hierarchy of separators: first try double newlines (paragraphs), then single newlines, then sentences, then words. Each level only triggers when the previous level produced chunks that are still too large.
def recursive_split(
text: str,
chunk_size: int = 500,
chunk_overlap: int = 50,
separators: list[str] | None = None,
) -> list[str]:
"""Recursively split text using a hierarchy of separators."""
if separators is None:
separators = ["\n\n", "\n", ". ", " ", ""]
chunks = []
separator = separators[0]
remaining_separators = separators[1:]
# Split on the current separator
splits = text.split(separator) if separator else list(text)
current_chunk = ""
for split in splits:
piece = split if not separator else split + separator
if len(current_chunk) + len(piece) <= chunk_size:
current_chunk += piece
else:
if current_chunk:
chunks.append(current_chunk.strip())
# If this single piece is too large, recurse with finer separator
if len(piece) > chunk_size and remaining_separators:
sub_chunks = recursive_split(
piece, chunk_size, chunk_overlap, remaining_separators
)
chunks.extend(sub_chunks)
current_chunk = ""
else:
current_chunk = piece
if current_chunk.strip():
chunks.append(current_chunk.strip())
# Apply overlap
if chunk_overlap > 0 and len(chunks) > 1:
chunks = _apply_overlap(chunks, chunk_overlap)
return chunks
def _apply_overlap(chunks: list[str], overlap: int) -> list[str]:
"""Add overlap between consecutive chunks."""
result = [chunks[0]]
for i in range(1, len(chunks)):
prev_tail = chunks[i - 1][-overlap:]
result.append(prev_tail + " " + chunks[i])
return result
In practice, most teams use LangChain’s RecursiveCharacterTextSplitter because it handles the edge cases well:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50,
separators=["\n\n", "\n", ". ", ", ", " ", ""],
length_function=len,
)
text = """
Introduction to Machine Learning
Machine learning is a subset of artificial intelligence that focuses
on building systems that learn from data. Unlike traditional programming,
where you write explicit rules, ML systems discover patterns automatically.
There are three main categories of machine learning:
Supervised Learning: The model learns from labeled examples. You provide
input-output pairs and the model learns the mapping function. Common
algorithms include linear regression, decision trees, and neural networks.
Unsupervised Learning: The model finds structure in unlabeled data.
Clustering and dimensionality reduction are typical tasks. K-means and
PCA are widely used algorithms.
Reinforcement Learning: The model learns by interacting with an
environment and receiving rewards or penalties. This approach powers
game-playing AIs and robotic control systems.
"""
chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks):
print(f"--- Chunk {i} ({len(chunk)} chars) ---")
print(chunk[:100] + "...")
print()
The separator hierarchy is the key insight. "\n\n" catches paragraph breaks. "\n" catches line breaks. ". " catches sentence boundaries. This means you almost always get chunks that start and end at natural boundaries.
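Chunk sizes above are measured in characters (length_function=len). If your limits are really in tokens, you can pass a tokenizer-based length function instead. A minimal sketch, assuming tiktoken is installed and that the cl100k_base encoding is an acceptable stand-in for your embedding model's tokenizer:
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

enc = tiktoken.get_encoding("cl100k_base")

def token_len(text: str) -> int:
    # Measure length in tokens instead of characters
    return len(enc.encode(text))

token_splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,      # now interpreted as roughly 256 tokens
    chunk_overlap=25,
    separators=["\n\n", "\n", ". ", ", ", " ", ""],
    length_function=token_len,
)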
Sentence-Based Chunking
Sometimes you want precise sentence boundaries. NLP libraries like spaCy and NLTK handle edge cases that regex cannot — abbreviations (Dr., U.S.A.), decimal numbers (3.14), and URLs.
import spacy
nlp = spacy.load("en_core_web_sm")
def sentence_chunk(
text: str,
max_sentences: int = 5,
overlap_sentences: int = 1,
) -> list[str]:
"""Chunk text by grouping sentences together."""
doc = nlp(text)
sentences = [sent.text.strip() for sent in doc.sents]
chunks = []
for i in range(0, len(sentences), max_sentences - overlap_sentences):
group = sentences[i : i + max_sentences]
chunks.append(" ".join(group))
return chunks
text = """
Dr. Smith published the results on Jan. 15. The study involved 3.5 million
data points collected across the U.S.A. Results showed a 23% improvement in
accuracy. The p-value was less than 0.001, which is statistically significant.
Follow-up studies are planned for Q3 2026. The team expects to expand the
dataset to 10 million records. Funding has been secured through 2028.
"""
chunks = sentence_chunk(text, max_sentences=3, overlap_sentences=1)
for i, chunk in enumerate(chunks):
print(f"Chunk {i}: {chunk}\n")NLTK alternative (lighter weight, no model download):
import nltk
nltk.download("punkt_tab", quiet=True)
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)
Sentence chunking is ideal when your source material is dense prose where every sentence carries distinct information — legal documents, medical records, research papers.
Semantic Chunking
The most sophisticated approach: use embeddings to decide where to split. The idea is that consecutive sentences about the same topic should stay together, and a split should happen when the topic shifts.
import numpy as np
from openai import OpenAI
client = OpenAI()
def get_embeddings(texts: list[str]) -> list[list[float]]:
"""Get embeddings for a batch of texts."""
response = client.embeddings.create(
model="text-embedding-3-small",
input=texts,
)
return [item.embedding for item in response.data]
def cosine_similarity(a: list[float], b: list[float]) -> float:
"""Compute cosine similarity between two vectors."""
a, b = np.array(a), np.array(b)
return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
def semantic_chunk(
text: str,
similarity_threshold: float = 0.75,
min_chunk_size: int = 100,
) -> list[str]:
"""Split text into semantically coherent chunks using embeddings."""
# Step 1: Split into sentences
import nltk
nltk.download("punkt_tab", quiet=True)
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)
if len(sentences) <= 1:
return [text]
# Step 2: Embed each sentence
embeddings = get_embeddings(sentences)
# Step 3: Find breakpoints where similarity drops
similarities = []
for i in range(len(embeddings) - 1):
sim = cosine_similarity(embeddings[i], embeddings[i + 1])
similarities.append(sim)
# Step 4: Split at low-similarity points
chunks = []
current_chunk = [sentences[0]]
for i, sim in enumerate(similarities):
if sim < similarity_threshold and len(" ".join(current_chunk)) >= min_chunk_size:
chunks.append(" ".join(current_chunk))
current_chunk = [sentences[i + 1]]
else:
current_chunk.append(sentences[i + 1])
if current_chunk:
chunks.append(" ".join(current_chunk))
return chunks
# Usage
text = """
Python is a high-level programming language known for its readability.
It was created by Guido van Rossum and first released in 1991. Python
supports multiple programming paradigms including procedural and
object-oriented programming.
The stock market experienced significant volatility last quarter.
The S&P 500 dropped 8% in March before recovering in April. Analysts
attribute the decline to rising interest rates and geopolitical tensions.
Machine learning models require large datasets for training. The quality
of training data directly impacts model performance. Data augmentation
techniques can help when labeled data is scarce.
"""
chunks = semantic_chunk(text, similarity_threshold=0.78)
for i, chunk in enumerate(chunks):
print(f"--- Semantic Chunk {i} ---")
print(chunk[:150])
print()
Semantic chunking correctly groups the three paragraphs (Python, stock market, ML) into separate chunks even if you removed the blank lines between them. A fixed-size splitter would blindly cut at character 500 regardless of topic.
The cost: You must embed every sentence before chunking, which means API calls and latency. For a 100-page document with 2,000 sentences, that is one embedding API call with 2,000 inputs. It works, but it is 10-100x slower than recursive splitting.
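If a document is long enough that a single request will not fit, batch the sentence embeddings. A minimal sketch; the batch size of 1,000 is an assumption, so check your provider's per-request input limit:
def get_embeddings_batched(texts: list[str], batch_size: int = 1000) -> list[list[float]]:
    """Embed texts in batches to stay under per-request input limits."""
    embeddings: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        embeddings.extend(item.embedding for item in response.data)
    return embeddings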
When to use it: High-value documents where retrieval quality matters more than indexing speed — legal contracts, technical specifications, medical literature.
Document-Aware Chunking
Structured documents (Markdown, HTML, PDFs with headings) carry explicit signals about where topics change. Use them.
import re
def markdown_chunk(text: str, max_chunk_size: int = 1000) -> list[dict]:
"""Chunk markdown by headings, preserving section hierarchy."""
# Split on headings (##, ###, etc.)
sections = re.split(r"(^#{1,4}\s+.+$)", text, flags=re.MULTILINE)
chunks = []
current_heading = "Introduction"
current_content = ""
for section in sections:
if re.match(r"^#{1,4}\s+", section):
# Save previous section
if current_content.strip():
chunks.append({
"content": current_content.strip(),
"heading": current_heading,
"char_count": len(current_content.strip()),
})
current_heading = section.strip("# \n")
current_content = ""
else:
current_content += section
# Don't forget the last section
if current_content.strip():
chunks.append({
"content": current_content.strip(),
"heading": current_heading,
"char_count": len(current_content.strip()),
})
# Split oversized sections with recursive splitting
final_chunks = []
for chunk in chunks:
if chunk["char_count"] > max_chunk_size:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=max_chunk_size, chunk_overlap=50
)
sub_texts = splitter.split_text(chunk["content"])
for j, sub in enumerate(sub_texts):
final_chunks.append({
"content": sub,
"heading": chunk["heading"],
"sub_section": j,
})
else:
final_chunks.append(chunk)
return final_chunks
# Example
markdown_doc = """
# API Reference
## Authentication
All API calls require a Bearer token in the Authorization header.
Tokens expire after 24 hours. Use the /auth/refresh endpoint to
get a new token without re-authenticating.
## Rate Limiting
The API enforces rate limits of 100 requests per minute per API key.
Exceeding the limit returns a 429 status code with a Retry-After
header indicating how many seconds to wait.
### Burst Limits
Short bursts of up to 20 requests per second are allowed. Sustained
traffic above 100 RPM will trigger rate limiting.
## Pagination
All list endpoints support cursor-based pagination. Pass the cursor
parameter from the previous response to get the next page. The default
page size is 20 items, configurable up to 100.
"""
chunks = markdown_chunk(markdown_doc)
for chunk in chunks:
print(f"[{chunk['heading']}] {chunk['content'][:80]}...")
print()
HTML-aware chunking follows the same principle — split on <h1>, <h2>, <section>, <article> tags. For PDFs, use libraries like pymupdf or unstructured that extract heading hierarchy.
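A minimal sketch of the HTML case, assuming BeautifulSoup (bs4) is available. It walks headings and body text in document order and groups text under the most recent heading:
from bs4 import BeautifulSoup

def html_chunk(html: str) -> list[dict]:
    """Group paragraph and list-item text under the nearest preceding heading."""
    soup = BeautifulSoup(html, "html.parser")
    chunks: list[dict] = []
    heading = ""
    for tag in soup.find_all(["h1", "h2", "h3", "h4", "p", "li"]):
        text = tag.get_text(" ", strip=True)
        if not text:
            continue
        if tag.name.startswith("h"):
            heading = text
        elif chunks and chunks[-1]["heading"] == heading:
            chunks[-1]["content"] += " " + text
        else:
            chunks.append({"heading": heading, "content": text})
    return chunks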
The Overlap Question
Overlap means repeating a portion of the previous chunk at the start of the next chunk. It exists to solve one problem: information that spans a chunk boundary.
Without overlap:
Chunk 1: "...the refund policy requires customers to"
Chunk 2: "submit a written request within 30 days."
With 50-char overlap:
Chunk 1: "...the refund policy requires customers to"
Chunk 2: "policy requires customers to submit a written request within 30 days."Without overlap, neither chunk contains the complete sentence. A query about “refund request deadline” might not match either chunk well enough to be retrieved.
How much overlap? 10-20% of chunk size is the standard range. For a 500-character chunk, use 50-100 characters of overlap. More overlap means more redundancy in your vector database (more storage, more embeddings to compute) but fewer missed boundary cases.
The cost of overlap: If you have 1000 chunks with 20% overlap, you effectively store 1200 chunks worth of text. Embedding cost increases proportionally.
| Chunk Size | Overlap | Effective Chunks per 100K chars | Embedding Cost Multiplier |
|---|---|---|---|
| 500 | 0 | 200 | 1.0x |
| 500 | 50 | 222 | 1.11x |
| 500 | 100 | 250 | 1.25x |
| 500 | 200 | 333 | 1.67x |
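The numbers in the table fall straight out of the step size: with overlap, each new chunk advances by chunk_size minus overlap characters, so the chunk count and the embedding cost scale by chunk_size / (chunk_size - overlap). A quick check:
total_chars = 100_000
chunk_size = 500
for overlap in (0, 50, 100, 200):
    step = chunk_size - overlap
    print(f"overlap={overlap:>3}: "
          f"{total_chars // step} chunks, "
          f"{chunk_size / step:.2f}x embedding cost")
# overlap=  0: 200 chunks, 1.00x embedding cost
# overlap= 50: 222 chunks, 1.11x embedding cost
# overlap=100: 250 chunks, 1.25x embedding cost
# overlap=200: 333 chunks, 1.67x embedding cost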
Chunk Size Experiments
The question everyone asks: what chunk size should I use? Here is a systematic way to find out for your data.
from openai import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
import numpy as np
import json
client = OpenAI()
def embed(texts: list[str]) -> list[list[float]]:
"""Batch embed texts."""
response = client.embeddings.create(
model="text-embedding-3-small",
input=texts,
)
return [d.embedding for d in response.data]
def cosine_sim(a, b):
a, b = np.array(a), np.array(b)
return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
def evaluate_chunk_size(
documents: list[str],
queries: list[dict], # [{"query": "...", "expected_content": "..."}]
chunk_sizes: list[int],
top_k: int = 5,
) -> dict:
"""Compare retrieval quality across different chunk sizes."""
results = {}
for size in chunk_sizes:
splitter = RecursiveCharacterTextSplitter(
chunk_size=size, chunk_overlap=int(size * 0.1)
)
# Chunk all documents
all_chunks = []
for doc in documents:
all_chunks.extend(splitter.split_text(doc))
# Embed all chunks
chunk_embeddings = embed(all_chunks)
# Test each query
hits = 0
for q in queries:
query_emb = embed([q["query"]])[0]
# Find top-k most similar chunks
sims = [cosine_sim(query_emb, ce) for ce in chunk_embeddings]
top_indices = np.argsort(sims)[-top_k:][::-1]
retrieved_text = " ".join(all_chunks[i] for i in top_indices)
# Check if expected content appears in retrieved chunks
if q["expected_content"].lower() in retrieved_text.lower():
hits += 1
results[size] = {
"num_chunks": len(all_chunks),
"recall": hits / len(queries),
"avg_chunk_chars": np.mean([len(c) for c in all_chunks]),
}
return results
# Example usage
documents = [open(f).read() for f in ["doc1.txt", "doc2.txt", "doc3.txt"]]
queries = [
{
"query": "What is the refund policy for enterprise?",
"expected_content": "30 days",
},
{
"query": "How do I reset my API key?",
"expected_content": "dashboard settings",
},
# ... 50-100 queries for a real evaluation
]
results = evaluate_chunk_size(
documents, queries, chunk_sizes=[200, 500, 1000, 2000]
)
for size, metrics in sorted(results.items()):
print(f"Chunk size {size:>5}: "
f"{metrics['num_chunks']:>4} chunks, "
f"recall={metrics['recall']:.2%}")Typical results from production systems:
| Chunk Size | Num Chunks | Retrieval Recall | Notes |
|---|---|---|---|
| 200 | 1,847 | 72% | Too granular — loses context |
| 500 | 739 | 88% | Good balance |
| 800 | 462 | 85% | Slightly diluted embeddings |
| 1,000 | 370 | 81% | Starting to lose precision |
| 2,000 | 185 | 68% | Too much noise per chunk |
The sweet spot is almost always 300-800 characters. Below 200, you lose too much context. Above 1000, the embedding gets diluted by mixed topics within a single chunk. But always test on your own data — the right size depends on your document style and query patterns.
Metadata Enrichment
A chunk in isolation often lacks context. “The process takes 3-5 business days” — which process? Attaching metadata to each chunk solves this.
import json
from dataclasses import dataclass, field
from datetime import datetime
@dataclass
class EnrichedChunk:
content: str
metadata: dict = field(default_factory=dict)
def enrich_chunks(
chunks: list[str],
document_title: str,
source_url: str,
section_headings: list[str] | None = None,
) -> list[EnrichedChunk]:
"""Add metadata to chunks for better retrieval context."""
enriched = []
for i, chunk in enumerate(chunks):
meta = {
"document_title": document_title,
"source_url": source_url,
"chunk_index": i,
"total_chunks": len(chunks),
"char_count": len(chunk),
"indexed_at": datetime.utcnow().isoformat(),
}
# Add section heading if available
if section_headings and i < len(section_headings):
meta["section"] = section_headings[i]
# Prepend context to the chunk text for better embeddings
context_prefix = f"Document: {document_title}"
if "section" in meta:
context_prefix += f" | Section: {meta['section']}"
enriched.append(EnrichedChunk(
content=f"{context_prefix}\n\n{chunk}",
metadata=meta,
))
return enriched
# Usage
chunks = ["The process takes 3-5 business days...", "Contact support at..."]
enriched = enrich_chunks(
chunks,
document_title="Enterprise Onboarding Guide",
source_url="https://docs.example.com/onboarding",
section_headings=["Timeline", "Support"],
)
for ec in enriched:
print(f"Content: {ec.content[:100]}...")
print(f"Metadata: {json.dumps(ec.metadata, indent=2)}\n")The context_prefix trick is important: by prepending “Document: Enterprise Onboarding Guide | Section: Timeline” to the chunk before embedding, the embedding captures both the content and its context. A query like “how long does enterprise onboarding take” will now match better because the chunk explicitly mentions “Enterprise Onboarding” in its embedded text.
Parent-Child Chunking
A hybrid approach: embed small, focused chunks for precision retrieval, but return the larger parent chunk to the LLM for context.
from dataclasses import dataclass
@dataclass
class ParentChildChunk:
child_content: str # Small, embedded for retrieval
parent_content: str # Large, sent to LLM
child_id: str
parent_id: str
def create_parent_child_chunks(
text: str,
parent_size: int = 2000,
child_size: int = 300,
child_overlap: int = 50,
) -> list[ParentChildChunk]:
"""Create two-level chunk hierarchy."""
from langchain.text_splitter import RecursiveCharacterTextSplitter
import hashlib
# Level 1: Large parent chunks
parent_splitter = RecursiveCharacterTextSplitter(
chunk_size=parent_size, chunk_overlap=200
)
parent_texts = parent_splitter.split_text(text)
# Level 2: Small child chunks within each parent
child_splitter = RecursiveCharacterTextSplitter(
chunk_size=child_size, chunk_overlap=child_overlap
)
all_chunks = []
for p_idx, parent in enumerate(parent_texts):
parent_id = hashlib.md5(parent[:100].encode()).hexdigest()[:12]
children = child_splitter.split_text(parent)
for c_idx, child in enumerate(children):
child_id = f"{parent_id}_c{c_idx}"
all_chunks.append(ParentChildChunk(
child_content=child,
parent_content=parent,
child_id=child_id,
parent_id=parent_id,
))
return all_chunks
# Retrieval flow:
# 1. Embed all child_content and store in vector DB
# 2. On query, search against child embeddings
# 3. Return the parent_content of the matched children to the LLM
#
# This gives you the best of both worlds:
# - Precise retrieval (small chunks match specific queries)
# - Rich context (LLM sees the full surrounding text)
chunks = create_parent_child_chunks(open("large_doc.txt").read())
print(f"Parents: {len(set(c.parent_id for c in chunks))}")
print(f"Children: {len(chunks)}")
print(f"Child sizes: {[len(c.child_content) for c in chunks[:5]]}")
print(f"Parent sizes: {[len(c.parent_content) for c in chunks[:5]]}")Parent-child chunking is used by many production RAG systems. It solves the precision-vs-context tradeoff without compromise. The downside is complexity — you need to store and manage the parent-child mapping.
Code-Aware Chunking
Source code has its own structure. Splitting Python by character count will break functions in half. Instead, split by language constructs.
import ast
def chunk_python_code(source_code: str) -> list[dict]:
"""Chunk Python source code by functions and classes."""
tree = ast.parse(source_code)
lines = source_code.split("\n")
chunks = []
for node in ast.walk(tree):
if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
start = node.lineno - 1
end = node.end_lineno
func_code = "\n".join(lines[start:end])
# Get docstring if present
docstring = ast.get_docstring(node) or ""
chunks.append({
"content": func_code,
"type": "function",
"name": node.name,
"docstring": docstring,
"line_start": node.lineno,
"line_end": node.end_lineno,
})
elif isinstance(node, ast.ClassDef):
start = node.lineno - 1
end = node.end_lineno
class_code = "\n".join(lines[start:end])
chunks.append({
"content": class_code,
"type": "class",
"name": node.name,
"docstring": ast.get_docstring(node) or "",
"line_start": node.lineno,
"line_end": node.end_lineno,
})
# Also capture module-level code that's not in any function/class
covered_lines = set()
for chunk in chunks:
covered_lines.update(range(chunk["line_start"], chunk["line_end"] + 1))
module_lines = []
for i, line in enumerate(lines, 1):
if i not in covered_lines and line.strip():
module_lines.append(line)
if module_lines:
chunks.insert(0, {
"content": "\n".join(module_lines),
"type": "module_level",
"name": "module",
"docstring": "",
})
return chunks
# Example
source = '''
import os
from pathlib import Path
CONFIG_DIR = Path.home() / ".config"
class DatabaseConnection:
"""Manages database connections with pooling."""
def __init__(self, url: str, pool_size: int = 5):
self.url = url
self.pool_size = pool_size
def connect(self):
"""Establish a connection from the pool."""
pass
def close(self):
"""Return connection to pool."""
pass
def process_file(path: str) -> dict:
"""Read and parse a configuration file."""
with open(path) as f:
return json.loads(f.read())
'''
chunks = chunk_python_code(source)
for chunk in chunks:
print(f"[{chunk['type']}] {chunk['name']}: {len(chunk['content'])} chars")For other languages (JavaScript, Go, Rust), use tree-sitter for AST parsing — it supports 100+ languages with a uniform API.
Production Chunking Pipeline
Here is a complete pipeline that handles multiple document types and produces enriched, ready-to-embed chunks.
import re
import hashlib
from dataclasses import dataclass, field
from pathlib import Path
from enum import Enum
from langchain.text_splitter import RecursiveCharacterTextSplitter
class DocType(Enum):
MARKDOWN = "markdown"
PLAIN_TEXT = "plain_text"
CODE = "code"
HTML = "html"
@dataclass
class ProcessedChunk:
chunk_id: str
content: str
content_for_embedding: str # May include metadata prefix
metadata: dict = field(default_factory=dict)
class ChunkingPipeline:
"""Production chunking pipeline with multiple strategies."""
def __init__(
self,
default_chunk_size: int = 500,
default_overlap: int = 50,
):
self.default_chunk_size = default_chunk_size
self.default_overlap = default_overlap
def process(
self,
text: str,
doc_type: DocType,
source: str = "",
title: str = "",
) -> list[ProcessedChunk]:
"""Process a document into enriched chunks."""
# Step 1: Detect doc type and apply appropriate strategy
if doc_type == DocType.MARKDOWN:
raw_chunks = self._chunk_markdown(text)
elif doc_type == DocType.CODE:
raw_chunks = self._chunk_code(text)
elif doc_type == DocType.HTML:
raw_chunks = self._chunk_html(text)
else:
raw_chunks = self._chunk_plain_text(text)
# Step 2: Enrich with metadata
processed = []
for i, chunk_data in enumerate(raw_chunks):
content = chunk_data["content"]
section = chunk_data.get("section", "")
chunk_id = hashlib.sha256(
f"{source}:{i}:{content[:50]}".encode()
).hexdigest()[:16]
# Build embedding text with context
embed_parts = []
if title:
embed_parts.append(f"Document: {title}")
if section:
embed_parts.append(f"Section: {section}")
embed_parts.append(content)
content_for_embedding = "\n".join(embed_parts)
processed.append(ProcessedChunk(
chunk_id=chunk_id,
content=content,
content_for_embedding=content_for_embedding,
metadata={
"source": source,
"title": title,
"section": section,
"chunk_index": i,
"char_count": len(content),
"doc_type": doc_type.value,
},
))
return processed
def _chunk_markdown(self, text: str) -> list[dict]:
"""Chunk markdown respecting heading structure."""
sections = re.split(r"(^#{1,4}\s+.+$)", text, flags=re.MULTILINE)
chunks = []
current_heading = ""
current_text = ""
for part in sections:
if re.match(r"^#{1,4}\s+", part):
if current_text.strip():
chunks.extend(self._split_section(
current_text.strip(), current_heading
))
current_heading = part.strip("# \n")
current_text = ""
else:
current_text += part
if current_text.strip():
chunks.extend(self._split_section(
current_text.strip(), current_heading
))
return chunks
def _split_section(self, text: str, heading: str) -> list[dict]:
"""Split a section that might be too large."""
if len(text) <= self.default_chunk_size:
return [{"content": text, "section": heading}]
splitter = RecursiveCharacterTextSplitter(
chunk_size=self.default_chunk_size,
chunk_overlap=self.default_overlap,
)
parts = splitter.split_text(text)
return [{"content": p, "section": heading} for p in parts]
def _chunk_plain_text(self, text: str) -> list[dict]:
"""Chunk plain text with recursive splitting."""
splitter = RecursiveCharacterTextSplitter(
chunk_size=self.default_chunk_size,
chunk_overlap=self.default_overlap,
)
parts = splitter.split_text(text)
return [{"content": p, "section": ""} for p in parts]
def _chunk_code(self, text: str) -> list[dict]:
"""Chunk source code by functions/classes."""
try:
chunks = chunk_python_code(text) # from earlier example
return [
{"content": c["content"], "section": f"{c['type']}: {c['name']}"}
for c in chunks
]
except SyntaxError:
return self._chunk_plain_text(text)
def _chunk_html(self, text: str) -> list[dict]:
"""Chunk HTML by stripping tags and splitting."""
from html.parser import HTMLParser
class TextExtractor(HTMLParser):
def __init__(self):
super().__init__()
self.parts = []
self.current_heading = ""
def handle_starttag(self, tag, attrs):
if tag in ("h1", "h2", "h3", "h4"):
self._tag = tag
def handle_data(self, data):
if hasattr(self, "_tag"):
self.current_heading = data.strip()
del self._tag
else:
if data.strip():
self.parts.append({
"content": data.strip(),
"section": self.current_heading,
})
extractor = TextExtractor()
extractor.feed(text)
# Merge small consecutive parts from the same section
merged = []
current = {"content": "", "section": ""}
for part in extractor.parts:
if (part["section"] == current["section"]
and len(current["content"]) + len(part["content"])
< self.default_chunk_size):
current["content"] += " " + part["content"]
else:
if current["content"]:
merged.append(current)
current = dict(part)
if current["content"]:
merged.append(current)
return merged
# Usage
pipeline = ChunkingPipeline(default_chunk_size=500, default_overlap=50)
# Process a markdown doc
md_chunks = pipeline.process(
text=open("docs/api-reference.md").read(),
doc_type=DocType.MARKDOWN,
source="docs/api-reference.md",
title="API Reference",
)
# Process source code
code_chunks = pipeline.process(
text=open("src/auth.py").read(),
doc_type=DocType.CODE,
source="src/auth.py",
title="Authentication Module",
)
for chunk in md_chunks[:3]:
print(f"ID: {chunk.chunk_id}")
print(f"Section: {chunk.metadata['section']}")
print(f"Embed text: {chunk.content_for_embedding[:100]}...")
print()
Evaluating Chunking Quality
How do you know if your chunking is good? Three methods.
Method 1: Human inspection. Read 50 random chunks. Can you understand each one in isolation? Does each chunk answer the question “what is this about?” If not, your chunks are too fragmented.
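The inspection pass needs nothing fancier than this (all_chunks here stands for whatever list of chunk strings your pipeline produced):
import random

for chunk in random.sample(all_chunks, min(50, len(all_chunks))):
    print(chunk)
    print("-" * 60)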
Method 2: Retrieval evaluation. Build a test set of queries with expected source documents. Measure what percentage of queries retrieve the right chunks. This is the experiment framework from the “Chunk Size Experiments” section above.
Method 3: Coherence scoring. Use an LLM to evaluate chunk quality.
def score_chunk_coherence(chunks: list[str], sample_size: int = 50) -> float:
"""Use an LLM to score chunk coherence."""
import random
sample = random.sample(chunks, min(sample_size, len(chunks)))
scores = []
for chunk in sample:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": (
"Rate the following text chunk on coherence from 1-5. "
"5 = complete, self-contained thought. "
"1 = fragment, starts/ends mid-sentence, or mixes unrelated topics. "
"Respond with just the number."
),
},
{"role": "user", "content": chunk},
],
max_tokens=5,
)
try:
score = int(response.choices[0].message.content.strip())
scores.append(score)
except ValueError:
continue
return sum(scores) / len(scores) if scores else 0.0
# Compare strategies
from langchain.text_splitter import RecursiveCharacterTextSplitter
text = open("large_document.txt").read()
fixed_chunks = fixed_size_chunk(text, chunk_size=500)
recursive_chunks = RecursiveCharacterTextSplitter(
chunk_size=500, chunk_overlap=50
).split_text(text)
print(f"Fixed-size coherence: {score_chunk_coherence(fixed_chunks):.2f}/5")
print(f"Recursive coherence: {score_chunk_coherence(recursive_chunks):.2f}/5")Typical results: fixed-size averages 2.5-3.0, recursive averages 3.5-4.0, semantic averages 4.0-4.5. The difference is meaningful — it directly translates to retrieval quality.
Decision Framework
Use this table to pick the right strategy for your documents.
| Document Type | Recommended Strategy | Chunk Size | Why |
|---|---|---|---|
| Blog posts, articles | Recursive character | 400-600 | Natural paragraph structure |
| API documentation | Markdown-aware | 300-500 | Headings mark topic changes |
| Legal contracts | Semantic | 500-800 | Dense prose, subtle topic shifts |
| Chat/conversation logs | Sentence-based | 200-400 | Each message is a unit |
| Source code | AST-based (code-aware) | Per function | Functions are natural units |
| PDF reports | Document-aware + recursive | 400-800 | Use headings, fall back to recursive |
| Knowledge base Q&A | Per question-answer pair | Varies | Each Q&A is a natural chunk |
| CSV/structured data | Per row or row group | Per row | Rows are records |
The production default: Start with RecursiveCharacterTextSplitter at 500 characters with 50 character overlap. It works well for 80% of cases. Only switch to a more sophisticated strategy when you have evidence (from evaluation) that it is not good enough.
Key Takeaways
- Chunking determines retrieval quality. No amount of model upgrades or prompt engineering fixes bad chunks. If the right information never gets retrieved, the answer will be wrong.
- Recursive character splitting is the right default. It respects natural text boundaries (paragraphs, sentences) and handles most document types well. Start here.
- The sweet spot is 300-800 characters. Below 200, chunks lack context. Above 1000, embeddings get diluted. Test with your own data to find the optimal size.
- Overlap prevents boundary losses. Use 10-20% overlap to catch information that spans chunk boundaries. The storage cost is minimal compared to the retrieval improvement.
- Semantic chunking is powerful but expensive. Reserve it for high-value documents where retrieval quality justifies the 10-100x slower indexing.
- Always enrich chunks with metadata. Section headings, document titles, and source URLs help both retrieval (better embeddings) and generation (LLM has more context).
- Parent-child chunking solves the precision-vs-context tradeoff. Embed small chunks for matching, retrieve large chunks for the LLM.
- Evaluate your chunks. Read them manually, measure retrieval recall, and score coherence. Chunking is not something you configure once and forget.
- Document-aware chunking beats recursive for structured documents. If your docs have headings, use them. They are explicit signals about topic boundaries.
- Read your chunks before deploying. If a human cannot understand a chunk in isolation, the retriever will not match it to the right query. This five-minute sanity check catches most chunking problems.