In Lesson 5 you built a RAG pipeline using ChromaDB. That was the right choice for getting started fast. But production has different requirements: durability, scalability, high availability, monitoring, backups. The vector database you choose determines your operational burden for the next 2-3 years. This lesson gives you the information to make that decision well.
What Vector Databases Actually Do
A vector database stores high-dimensional vectors (embeddings) and retrieves the most similar ones to a query vector. That is it. Everything else — metadata filtering, persistence, replication, CRUD operations — is the packaging around that core operation.
The core operation is Approximate Nearest Neighbor (ANN) search, not exact K-Nearest Neighbor (KNN). Exact KNN compares the query vector against every stored vector. That is O(n) per query — fine for 10K vectors, unusable for 10M. ANN algorithms trade a small amount of accuracy for dramatically faster search by building index structures that narrow the search space.
ANN vs KNN
import numpy as np
import time
def exact_knn(query: np.ndarray, vectors: np.ndarray, k: int = 5) -> list[int]:
"""Exact KNN — compare against every vector. O(n)."""
distances = np.linalg.norm(vectors - query, axis=1)
return np.argsort(distances)[:k].tolist()
# With 1M vectors of dimension 1536:
# Exact KNN: ~500ms per query
# HNSW (ANN): ~2ms per query, 99%+ recall
The “approximate” part scares some people. In practice, modern ANN algorithms achieve 95-99%+ recall — meaning they find 95-99% of the true nearest neighbors. For RAG, this is more than enough. You are retrieving 5-20 chunks out of thousands or millions. Missing one marginally relevant chunk has negligible impact on answer quality.
Index Types: HNSW, IVF, Flat
Every vector database uses one or more of these index types. Understanding them helps you tune performance.
HNSW (Hierarchical Navigable Small World)
HNSW builds a multi-layer graph. The top layer has few nodes spread far apart (for coarse navigation). Lower layers have more nodes packed closer together (for fine-grained search). Query traversal starts at the top and drills down.
Strengths: Fast queries (~1-5ms at 1M vectors), high recall (99%+), no training required. Weaknesses: High memory usage (stores the graph in RAM), slower inserts than IVF, memory grows linearly with data.
Layer 3: [A] -------------- [F] (few nodes, big jumps)
Layer 2: [A] --- [C] --- [F] --- [H] (more nodes, smaller jumps)
Layer 1: [A]-[B]-[C]-[D]-[E]-[F]-[G]-[H] (all nodes, fine-grained)
When to use: Default choice for most workloads. Best balance of speed and accuracy.
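The knobs you tune on any HNSW implementation are the same: M (links per node), ef_construction (build-time search width), and ef (query-time search width). Below is a minimal sketch using the hnswlib library — not otherwise used in this lesson, with illustrative sizes — just to show where each parameter lives:
import numpy as np
import hnswlib

dim, n = 1536, 10_000
vectors = np.random.rand(n, dim).astype(np.float32)

# Build the graph: M = links per node, ef_construction = build-time search width
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=100)
index.add_items(vectors, np.arange(n))

# ef = query-time search width: raise it for higher recall, lower it for speed
index.set_ef(50)
labels, distances = index.knn_query(vectors[:1], k=5)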
IVF (Inverted File Index)
IVF clusters vectors into buckets using k-means. At query time, it finds the nearest clusters and only searches vectors within those clusters.
Strengths: Lower memory than HNSW, faster inserts, tunable speed/accuracy tradeoff. Weaknesses: Requires training step (k-means on your data), lower recall than HNSW at the same speed.
| Parameter | Effect |
|---|---|
| nlist (number of clusters) | More clusters = faster queries but lower recall |
| nprobe (clusters to search) | More probes = higher recall but slower queries |
When to use: Very large datasets (10M+ vectors) where HNSW memory is a constraint.
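The two parameters in the table map directly onto code. A minimal FAISS sketch — FAISS is not otherwise used in this lesson and the sizes are illustrative; it is shown only because it exposes nlist and nprobe directly:
import numpy as np
import faiss

dim, nlist = 1536, 100
vectors = np.random.rand(10_000, dim).astype(np.float32)

# Training step: k-means assigns vectors to nlist clusters
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist)
index.train(vectors)
index.add(vectors)

# nprobe = clusters scanned per query: higher recall, slower queries
index.nprobe = 10
distances, ids = index.search(vectors[:1], 5)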
Flat (Brute Force)
No index at all. Compares every vector. Exact results, guaranteed.
Strengths: Perfect recall, simple, no index build time. Weaknesses: O(n) query time. Unusable above ~100K vectors.
When to use: Small datasets, or as a baseline to measure ANN recall.
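If you keep a flat index around as a baseline, measuring recall takes a few lines. A sketch that reuses the exact_knn function from earlier as ground truth; ann_search here is a placeholder for whatever ANN query function you are evaluating:
def recall_at_k(ann_search, vectors, queries, k: int = 5) -> float:
    """Fraction of the true top-k neighbors that the ANN index also returns."""
    hits, total = 0, 0
    for query in queries:
        truth = set(exact_knn(query, vectors, k))   # brute-force ground truth
        approx = set(ann_search(query, k))          # IDs from the ANN index under test
        hits += len(truth & approx)
        total += k
    return hits / total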
Index Comparison Summary
| Index | Query Speed (1M vectors) | Memory | Recall | Build Time |
|---|---|---|---|---|
| Flat | ~500ms | Low | 100% | None |
| IVF | ~5-20ms | Medium | 90-98% | Minutes |
| HNSW | ~1-5ms | High | 95-99%+ | Minutes-Hours |
ChromaDB Deep Dive
ChromaDB is an in-process vector database designed for simplicity. It runs inside your Python process with zero configuration.
Setup and Basic Operations
pip install chromadb
import chromadb
# In-memory (for testing)
client = chromadb.Client()
# Persistent (for production)
client = chromadb.PersistentClient(path="./chroma_data")
# Create a collection
collection = client.get_or_create_collection(
name="documents",
metadata={
"hnsw:space": "cosine", # Distance metric: cosine, l2, or ip
"hnsw:M": 16, # HNSW connections per node (default 16)
"hnsw:construction_ef": 100, # Build-time search width (default 100)
"hnsw:search_ef": 50, # Query-time search width (default 10)
},
)
CRUD Operations
# CREATE — add documents
collection.add(
ids=["doc1", "doc2", "doc3"],
documents=[
"How to reset your password",
"Billing FAQ and payment methods",
"API rate limits and quotas",
],
metadatas=[
{"category": "account", "priority": "high"},
{"category": "billing", "priority": "medium"},
{"category": "technical", "priority": "low"},
],
)
# READ — get by ID
results = collection.get(ids=["doc1", "doc2"])
print(results["documents"]) # ["How to reset your password", "Billing FAQ..."]
# UPDATE — upsert (update if exists, insert if not)
collection.upsert(
ids=["doc1"],
documents=["Updated: How to reset your password — new 2FA flow"],
metadatas=[{"category": "account", "priority": "high", "updated": True}],
)
# DELETE
collection.delete(ids=["doc3"])
# DELETE with filter
collection.delete(where={"category": "billing"})
Similarity Search with Metadata Filtering
# Basic similarity search
results = collection.query(
query_texts=["I forgot my login credentials"],
n_results=5,
)
# Search with metadata filter
results = collection.query(
query_texts=["payment issue"],
n_results=5,
where={"category": "billing"}, # Only search billing documents
)
# Complex filters
results = collection.query(
query_texts=["urgent problem"],
n_results=10,
where={
"$and": [
{"category": {"$in": ["account", "billing"]}},
{"priority": {"$eq": "high"}},
]
},
)
# Access results
for i in range(len(results["documents"][0])):
print(f"Score: {1 - results['distances'][0][i]:.3f}")
print(f" Doc: {results['documents'][0][i][:100]}")
print(f" Meta: {results['metadatas'][0][i]}")Custom Embedding Functions
from chromadb.utils import embedding_functions
# OpenAI embeddings
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
api_key="sk-...",
model_name="text-embedding-3-small",
)
# Sentence Transformers (local, free)
st_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
model_name="all-MiniLM-L6-v2",
)
# Use with collection
collection = client.get_or_create_collection(
name="documents",
embedding_function=openai_ef, # Applied automatically on add/query
)
ChromaDB Limitations
- Single-process. No built-in clustering or replication. One process, one machine.
- Memory-bound. HNSW index lives in RAM. At 1M vectors with 1536 dimensions, that is roughly 6GB RAM just for the index (see the back-of-envelope estimate after this list).
- No authentication. Anyone who can connect to the Chroma server can read/write everything.
- Limited filtering. Metadata filters work but are not as expressive as SQL WHERE clauses.
- No backup tooling. You back up the data directory yourself.
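A back-of-envelope check on the memory point above, assuming float32 vectors (graph overhead varies by implementation, so treat the result as an estimate):
num_vectors, dim, bytes_per_float = 1_000_000, 1536, 4
raw_gb = num_vectors * dim * bytes_per_float / 1e9   # ~6.1 GB for the vectors alone
# HNSW also stores neighbor links (roughly M per node per layer), which
# typically adds another 10-20% on top of the raw vectors.
print(f"Raw vectors: {raw_gb:.1f} GB")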
Verdict: Excellent for prototyping, local development, and small production workloads (under 500K vectors). Move to pgvector or Pinecone when you need durability, auth, or scale.
pgvector Deep Dive
pgvector is a PostgreSQL extension that adds vector similarity search to your existing Postgres database. If you already run Postgres, this is the path of least resistance.
Installation
-- Enable the extension (requires pgvector installed on the server)
CREATE EXTENSION IF NOT EXISTS vector;
For Docker:
docker run -d \
--name pgvector \
-e POSTGRES_PASSWORD=password \
-p 5432:5432 \
pgvector/pgvector:pg16
Schema Design
-- Create table with vector column
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT NOT NULL,
embedding vector(1536), -- Match your embedding model's dimensions
metadata JSONB DEFAULT '{}',
source VARCHAR(500),
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
-- Metadata index for filtering
CREATE INDEX idx_documents_metadata ON documents USING GIN (metadata);
CREATE INDEX idx_documents_source ON documents (source);
Index Creation: IVFFlat vs HNSW
pgvector supports two index types. Choose based on your workload.
-- HNSW index (recommended for most workloads)
-- Slower to build, faster queries, higher recall
CREATE INDEX idx_documents_embedding_hnsw
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- IVFFlat index (for very large datasets)
-- Faster to build, requires training data, lower recall
CREATE INDEX idx_documents_embedding_ivf
ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100); -- Rule of thumb: sqrt(num_rows) for lists
| Index | Build Time (1M vectors) | Query Latency | Recall | Memory |
|---|---|---|---|---|
| None (flat) | N/A | ~800ms | 100% | Low |
| IVFFlat | ~5 min | ~10ms | 90-95% | Medium |
| HNSW | ~30 min | ~3ms | 98-99% | High |
Python Integration
import psycopg2
import json
from openai import OpenAI
oai_client = OpenAI()
def get_connection():
return psycopg2.connect(
host="localhost",
port=5432,
dbname="ragdb",
user="postgres",
password="password",
)
def insert_document(content: str, metadata: dict, source: str):
"""Embed and store a document."""
# Generate embedding
response = oai_client.embeddings.create(
model="text-embedding-3-small",
input=content,
)
embedding = response.data[0].embedding
conn = get_connection()
cur = conn.cursor()
cur.execute(
"""
INSERT INTO documents (content, embedding, metadata, source)
VALUES (%s, %s::vector, %s, %s)
RETURNING id
""",
(content, str(embedding), json.dumps(metadata), source),
)
doc_id = cur.fetchone()[0]
conn.commit()
cur.close()
conn.close()
return doc_id
def search_documents(
query: str,
n_results: int = 5,
source_filter: str = None,
metadata_filter: dict = None,
) -> list[dict]:
"""Search for similar documents using cosine similarity."""
# Embed the query
response = oai_client.embeddings.create(
model="text-embedding-3-small",
input=query,
)
query_embedding = response.data[0].embedding
conn = get_connection()
cur = conn.cursor()
# Build query with optional filters
sql = """
SELECT id, content, metadata, source,
1 - (embedding <=> %s::vector) as similarity
FROM documents
WHERE 1=1
"""
params = [str(query_embedding)]
if source_filter:
sql += " AND source = %s"
params.append(source_filter)
if metadata_filter:
for key, value in metadata_filter.items():
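# NOTE: the metadata key is interpolated into the SQL string below (only the value
# is parameterized), so pass only trusted, hard-coded keys in metadata_filter.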
sql += f" AND metadata->>'{key}' = %s"
params.append(str(value))
sql += " ORDER BY embedding <=> %s::vector LIMIT %s"
params.extend([str(query_embedding), n_results])
cur.execute(sql, params)
results = []
for row in cur.fetchall():
results.append({
"id": row[0],
"content": row[1],
"metadata": row[2],
"source": row[3],
"similarity": float(row[4]),
})
cur.close()
conn.close()
return results
# Usage
doc_id = insert_document(
content="Our refund policy allows full refunds within 30 days of purchase.",
metadata={"category": "policy", "version": "2.1"},
source="policies/refund.md",
)
results = search_documents("Can I get my money back?", n_results=3)
for r in results:
print(f"[{r['similarity']:.3f}] {r['content'][:100]}")Query Performance Tuning
-- Increase HNSW search width for higher recall (slower queries)
SET hnsw.ef_search = 100; -- Default is 40
-- Increase IVF probes for higher recall (slower queries)
SET ivfflat.probes = 10; -- Default is 1
-- Check index usage
EXPLAIN ANALYZE
SELECT id, content, 1 - (embedding <=> '[0.1, 0.2, ...]'::vector) as similarity
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 5;
pgvector Strengths and Limitations
Strengths:
- No new infrastructure if you already run Postgres
- Full SQL capabilities — joins, transactions, complex queries
- Battle-tested Postgres ecosystem: backups, replication, monitoring
- Metadata filtering is just SQL WHERE clauses
- ACID transactions — consistent reads and writes
Limitations:
- Performance ceiling around 5-10M vectors per table (depends on hardware)
- HNSW index lives in shared memory — competes with other Postgres workloads
- Horizontal scaling requires Postgres sharding (complex)
- No built-in multi-tenancy or namespace isolation
Verdict: The right choice for most production teams. If you run Postgres, use pgvector. You avoid introducing new infrastructure, and Postgres operational patterns (backups, monitoring, replication) apply directly.
Pinecone Deep Dive
Pinecone is a fully managed vector database. You do not run servers, manage indexes, or worry about scaling. You get an API endpoint and it handles the rest.
Setup
pip install pinecone
from pinecone import Pinecone, ServerlessSpec
pc = Pinecone(api_key="your-api-key")
# Create an index
pc.create_index(
name="knowledge-base",
dimension=1536, # Must match your embedding model
metric="cosine", # cosine, euclidean, or dotproduct
spec=ServerlessSpec(
cloud="aws",
region="us-east-1",
),
)
# Connect to the index
index = pc.Index("knowledge-base")
CRUD Operations
from openai import OpenAI
oai_client = OpenAI()
def embed(text: str) -> list[float]:
response = oai_client.embeddings.create(
model="text-embedding-3-small", input=text
)
return response.data[0].embedding
# UPSERT — insert or update vectors
index.upsert(
vectors=[
{
"id": "doc1",
"values": embed("How to reset your password"),
"metadata": {
"content": "How to reset your password",
"category": "account",
"source": "faq.md",
},
},
{
"id": "doc2",
"values": embed("Billing FAQ and payment methods"),
"metadata": {
"content": "Billing FAQ and payment methods",
"category": "billing",
"source": "faq.md",
},
},
],
namespace="production", # Namespaces isolate data within an index
)
# QUERY — similarity search
results = index.query(
vector=embed("forgot my login"),
top_k=5,
namespace="production",
include_metadata=True,
filter={
"category": {"$in": ["account", "technical"]},
},
)
for match in results.matches:
print(f"[{match.score:.3f}] {match.metadata['content'][:100]}")
# FETCH — get by ID
fetched = index.fetch(ids=["doc1"], namespace="production")
# DELETE
index.delete(ids=["doc2"], namespace="production")
# DELETE by filter
index.delete(
filter={"category": "billing"},
namespace="production",
)
# Index statistics
stats = index.describe_index_stats()
print(f"Total vectors: {stats.total_vector_count}")
print(f"Namespaces: {stats.namespaces}")Batch Upsert with Error Handling
def batch_upsert(
documents: list[dict],
namespace: str = "production",
batch_size: int = 100,
):
"""Upsert documents in batches with error handling."""
import time
total = len(documents)
for i in range(0, total, batch_size):
batch = documents[i:i + batch_size]
# Embed batch
texts = [doc["content"] for doc in batch]
response = oai_client.embeddings.create(
model="text-embedding-3-small",
input=texts,
)
# Build vectors
vectors = []
for j, doc in enumerate(batch):
vectors.append({
"id": doc["id"],
"values": response.data[j].embedding,
"metadata": {
"content": doc["content"][:1000], # Pinecone metadata limit: 40KB
**doc.get("metadata", {}),
},
})
# Upsert with retry
for attempt in range(3):
try:
index.upsert(vectors=vectors, namespace=namespace)
break
except Exception as e:
if attempt == 2:
raise
time.sleep(2 ** attempt)
print(f"Upserted {min(i + batch_size, total)}/{total}")Pinecone Pricing Model
Pinecone charges based on:
- Storage: per GB of stored vectors
- Read units: per query
- Write units: per upsert
Serverless pricing (as of 2026):
- Reads: ~$8 per million read units
- Writes: ~$2 per million write units
- Storage: ~$0.33 per GB/month
For 1M vectors at 1536 dimensions: roughly $2-5/month storage plus usage-based query costs. Affordable for most teams, but costs scale linearly with data and traffic.
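To see where the storage figure comes from, here is a small sketch using the rates quoted above; the monthly query volume and read-units-per-query are placeholders to swap for your own traffic, not Pinecone-published numbers:
num_vectors, dim, bytes_per_float = 1_000_000, 1536, 4
storage_gb = num_vectors * dim * bytes_per_float / 1e9   # ~6.1 GB of raw vectors
storage_cost = storage_gb * 0.33                          # ~$2/month at $0.33 per GB/month

monthly_queries = 500_000          # placeholder traffic
read_units_per_query = 1           # assumption -- actual consumption depends on the index
read_cost = monthly_queries * read_units_per_query / 1_000_000 * 8.0

print(f"Storage: ${storage_cost:.2f}/mo, reads: ${read_cost:.2f}/mo")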
Pinecone Strengths and Limitations
Strengths:
- Zero operations — no servers, no tuning, no backups
- Scales to billions of vectors
- Fast queries (~10-50ms) with consistent latency
- Namespaces for multi-tenancy
- Good metadata filtering
Limitations:
- Vendor lock-in — proprietary API, no self-hosted option
- Metadata size limit: 40KB per vector
- No full-text search (vectors only)
- Limited query expressiveness compared to SQL
- Cold start on serverless (first query after idle can be slow)
Verdict: Best choice for teams that want zero operational overhead and can accept vendor lock-in. Excellent at scale. Consider pgvector if you want to avoid lock-in.
Qdrant Deep Dive
Qdrant is a high-performance vector database built in Rust. It can be self-hosted or used as a managed cloud service. It offers the best self-hosted performance characteristics.
Self-Hosted Setup
# Docker
docker run -d \
--name qdrant \
-p 6333:6333 \
-p 6334:6334 \
-v $(pwd)/qdrant_storage:/qdrant/storage:z \
qdrant/qdrant
Python Client
pip install qdrant-client
from qdrant_client import QdrantClient
from qdrant_client.models import (
Distance, VectorParams, PointStruct,
Filter, FieldCondition, MatchValue,
)
client = QdrantClient(host="localhost", port=6333)
# Create collection
client.create_collection(
collection_name="documents",
vectors_config=VectorParams(
size=1536,
distance=Distance.COSINE,
),
)
# Insert vectors
from openai import OpenAI
oai = OpenAI()
def embed(text: str) -> list[float]:
response = oai.embeddings.create(model="text-embedding-3-small", input=text)
return response.data[0].embedding
client.upsert(
collection_name="documents",
points=[
PointStruct(
id=1,
vector=embed("Password reset instructions"),
payload={
"content": "Password reset instructions",
"category": "account",
"source": "faq.md",
},
),
PointStruct(
id=2,
vector=embed("Billing and payment FAQ"),
payload={
"content": "Billing and payment FAQ",
"category": "billing",
"source": "faq.md",
},
),
],
)
# Search with filter
results = client.query_points(
collection_name="documents",
query=embed("I can't log in"),
limit=5,
query_filter=Filter(
must=[
FieldCondition(
key="category",
match=MatchValue(value="account"),
),
]
),
)
for point in results.points:
print(f"[{point.score:.3f}] {point.payload['content']}")Qdrant Strengths and Limitations
Strengths:
- Built in Rust — excellent query performance and memory efficiency
- Rich filtering (nested conditions, geo, range, full-text)
- Payload (metadata) indexing for fast filtered queries
- Quantization support for reduced memory usage
- Self-hosted or managed cloud options
Limitations:
- Smaller ecosystem than Postgres or Pinecone
- Self-hosted means you manage infrastructure
- Less mature tooling for backups and monitoring
- Newer project — less battle-tested in production
Verdict: Best self-hosted option for teams that need high performance and are comfortable managing infrastructure. The Rust foundation gives it an edge on raw performance.
Head-to-Head Comparison
| Factor | ChromaDB | pgvector | Pinecone | Qdrant |
|---|---|---|---|---|
| Setup | pip install | Postgres extension | API key | Docker/binary |
| Operations | Zero | Postgres ops | Zero (managed) | Self-managed |
| Max scale | ~500K vectors | ~5-10M per table | Billions | ~100M self-hosted |
| Query latency (1M) | ~5ms | ~3-10ms | ~10-50ms | ~2-5ms |
| Insert speed | Fast | Medium | Medium | Fast |
| Metadata filtering | Basic | Full SQL | Good | Rich |
| Hybrid search | No | With pg_trgm | No | Yes |
| Authentication | None | Postgres auth | API key | API key |
| Backups | Manual | pg_dump | Automatic | Manual |
| Cost | Free | Postgres hosting | Pay-per-use | Free (self-hosted) |
| Vendor lock-in | Low | None | High | Low |
| Best for | Prototyping | Postgres shops | Zero-ops at scale | Self-hosted perf |
Performance Benchmarks
These benchmarks use OpenAI text-embedding-3-small (1536 dimensions), cosine similarity, and a single-machine setup. Results are from a 4-core, 16GB RAM machine.
Insert Speed (vectors per second)
| Database | 100K vectors | 1M vectors | 10M vectors |
|---|---|---|---|
| ChromaDB | 5,000/s | 3,500/s | N/A (OOM) |
| pgvector (no index) | 8,000/s | 6,000/s | 4,000/s |
| pgvector (HNSW) | 2,000/s | 1,200/s | 800/s |
| Pinecone (serverless) | 3,000/s | 2,500/s | 2,000/s |
| Qdrant | 10,000/s | 7,000/s | 5,000/s |
Note: pgvector with HNSW index is slow on inserts because the index updates on every write. Build the index after bulk loading for better performance.
Query Latency (p50, top-5 results)
| Database | 100K vectors | 1M vectors | 10M vectors |
|---|---|---|---|
| ChromaDB | 2ms | 5ms | N/A |
| pgvector (HNSW) | 2ms | 4ms | 12ms |
| pgvector (IVFFlat) | 5ms | 10ms | 25ms |
| Pinecone | 15ms | 20ms | 30ms |
| Qdrant | 1ms | 3ms | 8ms |
Pinecone latency includes the network round-trip to the managed service; the local databases have no network hop. In production, expect an additional 10-50ms on the Pinecone numbers depending on how far your application is from the index's region.
Query Latency with Metadata Filter (1M vectors)
| Database | No filter | 1 filter | 3 filters |
|---|---|---|---|
| ChromaDB | 5ms | 8ms | 15ms |
| pgvector | 4ms | 6ms | 10ms |
| Pinecone | 20ms | 25ms | 30ms |
| Qdrant | 3ms | 4ms | 6ms |
Qdrant and pgvector handle metadata filtering most efficiently. ChromaDB degrades more noticeably with complex filters.
Migration Patterns
Eventually you will outgrow ChromaDB and need to migrate. Here are tested patterns.
ChromaDB to pgvector
import chromadb
import psycopg2
import json
def migrate_chroma_to_pgvector(
chroma_path: str,
collection_name: str,
pg_connection_string: str,
):
"""Migrate all vectors from ChromaDB to pgvector."""
# Connect to ChromaDB
chroma = chromadb.PersistentClient(path=chroma_path)
collection = chroma.get_collection(collection_name)
# Get all data
total = collection.count()
print(f"Migrating {total} vectors from ChromaDB to pgvector...")
batch_size = 1000
conn = psycopg2.connect(pg_connection_string)
cur = conn.cursor()
for offset in range(0, total, batch_size):
# Fetch batch from ChromaDB
results = collection.get(
limit=batch_size,
offset=offset,
include=["documents", "metadatas", "embeddings"],
)
# Insert into pgvector
for i in range(len(results["ids"])):
cur.execute(
"""
INSERT INTO documents (content, embedding, metadata, source)
VALUES (%s, %s::vector, %s, %s)
ON CONFLICT DO NOTHING
""",
(
results["documents"][i],
str(results["embeddings"][i]),
json.dumps(results["metadatas"][i]),
results["metadatas"][i].get("source", "chromadb_migration"),
),
)
conn.commit()
print(f" Migrated {min(offset + batch_size, total)}/{total}")
cur.close()
conn.close()
print("Migration complete. Now build the HNSW index:")
print(" CREATE INDEX idx_embedding ON documents USING hnsw (embedding vector_cosine_ops);")ChromaDB to Pinecone
import chromadb
from pinecone import Pinecone
def migrate_chroma_to_pinecone(
chroma_path: str,
collection_name: str,
pinecone_api_key: str,
pinecone_index_name: str,
namespace: str = "default",
):
"""Migrate all vectors from ChromaDB to Pinecone."""
chroma = chromadb.PersistentClient(path=chroma_path)
collection = chroma.get_collection(collection_name)
pc = Pinecone(api_key=pinecone_api_key)
index = pc.Index(pinecone_index_name)
total = collection.count()
batch_size = 100 # Pinecone recommends smaller batches
for offset in range(0, total, batch_size):
results = collection.get(
limit=batch_size,
offset=offset,
include=["documents", "metadatas", "embeddings"],
)
vectors = []
for i in range(len(results["ids"])):
metadata = results["metadatas"][i] or {}
metadata["content"] = (results["documents"][i] or "")[:1000]
vectors.append({
"id": results["ids"][i],
"values": results["embeddings"][i],
"metadata": metadata,
})
index.upsert(vectors=vectors, namespace=namespace)
print(f" Migrated {min(offset + batch_size, total)}/{total}")
print("Migration complete.")Migration Checklist
Before migrating in production:
- Verify embedding compatibility. If you used a custom embedding function in ChromaDB, use the same model and parameters in the target database.
- Build indexes after bulk load. For pgvector, load all data first, then create the HNSW/IVF index. Building the index during inserts is 5-10x slower.
- Test query parity. Run your test suite against both databases and compare results (see the overlap-check sketch after this list). They will not be identical (different ANN implementations), but should be close.
- Run in parallel. Write to both databases during migration. Switch reads once you have confirmed parity.
- Monitor latency after migration. The new database may have different performance characteristics under your specific query patterns.
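A minimal sketch for the parity check in step 3. It assumes each database is wrapped in a search callable that returns dicts with an "id" key; adapt the accessors to your own client code:
def average_topk_overlap(search_old, search_new, test_queries, k: int = 5) -> float:
    """Average fraction of top-k result IDs the two databases agree on."""
    overlaps = []
    for query in test_queries:
        old_ids = {r["id"] for r in search_old(query, k)}
        new_ids = {r["id"] for r in search_new(query, k)}
        overlaps.append(len(old_ids & new_ids) / k)
    return sum(overlaps) / len(overlaps)

# Expect a high but imperfect overlap (e.g. 0.9+); different ANN implementations
# rarely agree on every neighbor.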
Production Considerations
Backups
| Database | Backup Strategy |
|---|---|
| ChromaDB | Copy the data directory (cp -r ./chroma_data ./backup) |
| pgvector | Standard Postgres: pg_dump, WAL archiving, pg_basebackup |
| Pinecone | Automatic (managed). Create collections from backup via API. |
| Qdrant | Snapshot API: POST /collections/{name}/snapshots |
Monitoring
Key metrics to track for any vector database:
METRICS = {
"query_latency_p50": "Should be < 50ms for good UX",
"query_latency_p99": "Should be < 200ms",
"insert_latency_p50": "Track for batch ingestion pipelines",
"index_size_bytes": "Monitor for capacity planning",
"total_vectors": "Track growth rate",
"recall_at_k": "Measure with a ground truth test set",
"filter_overhead_ms": "Time added by metadata filters",
"error_rate": "Failed queries / total queries",
}
For pgvector, add standard Postgres metrics: connection pool utilization, shared buffer hit ratio, WAL write rate, replication lag.
Scaling Patterns
Vertical scaling (bigger machine):
- Works up to ~5M vectors for pgvector, ~10M for Qdrant
- Simple but has a ceiling
Horizontal scaling:
- Pinecone: automatic (managed sharding)
- pgvector: Postgres sharding (Citus, partitioning by tenant)
- Qdrant: built-in sharding and replication
- ChromaDB: not supported (single process)
Multi-tenant isolation:
- Pinecone: namespaces (logical isolation within an index)
- pgvector: separate schemas or row-level security
- Qdrant: collections per tenant or payload-based filtering
- ChromaDB: separate collections
Decision Framework
Use this flowchart to pick the right vector database for your team:
Are you prototyping or in early development?
├── YES → ChromaDB (zero setup, iterate fast)
└── NO → Do you already run PostgreSQL?
├── YES → Will you exceed 5M vectors?
│ ├── YES → Consider Pinecone or Qdrant
│ └── NO → pgvector (no new infrastructure)
└── NO → Is operational simplicity your top priority?
├── YES → Pinecone (fully managed, zero ops)
└── NO → Do you need self-hosted for compliance/privacy?
├── YES → Qdrant (best self-hosted performance)
└── NO → Pinecone (default choice when not on Postgres)
The most common production path: start with ChromaDB during development, migrate to pgvector when you go to production (because you probably already run Postgres), and consider Pinecone or Qdrant only if you hit pgvector’s scale limits.
Code Examples Summary
Here is a unified interface that works with any of the four databases. This abstraction lets you swap implementations without changing your RAG pipeline.
from abc import ABC, abstractmethod
import json
from openai import OpenAI
class VectorStore(ABC):
"""Unified interface for vector databases."""
@abstractmethod
def add(self, ids: list[str], texts: list[str], metadatas: list[dict]) -> None:
"""Add documents to the store."""
pass
@abstractmethod
def search(self, query: str, n_results: int = 5, where: dict = None) -> list[dict]:
"""Search for similar documents."""
pass
@abstractmethod
def delete(self, ids: list[str]) -> None:
"""Delete documents by ID."""
pass
@abstractmethod
def count(self) -> int:
"""Return total document count."""
pass
class ChromaVectorStore(VectorStore):
def __init__(self, path: str, collection_name: str):
import chromadb
self.client = chromadb.PersistentClient(path=path)
self.collection = self.client.get_or_create_collection(collection_name)
def add(self, ids, texts, metadatas):
self.collection.add(ids=ids, documents=texts, metadatas=metadatas)
def search(self, query, n_results=5, where=None):
params = {"query_texts": [query], "n_results": n_results}
if where:
params["where"] = where
results = self.collection.query(**params)
return [
{"content": results["documents"][0][i], "metadata": results["metadatas"][0][i],
"score": 1 - results["distances"][0][i]}
for i in range(len(results["documents"][0]))
]
def delete(self, ids):
self.collection.delete(ids=ids)
def count(self):
return self.collection.count()
class PgVectorStore(VectorStore):
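    """pgvector-backed store. Assumes a documents table whose id column is TEXT
    (with a unique constraint, so ON CONFLICT (id) works) -- unlike the SERIAL id
    in the schema example earlier in this lesson."""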
def __init__(self, connection_string: str):
import psycopg2
self.conn_string = connection_string
self._embed_client = OpenAI()
def _embed(self, text: str) -> list[float]:
resp = self._embed_client.embeddings.create(
model="text-embedding-3-small", input=text
)
return resp.data[0].embedding
def _get_conn(self):
import psycopg2
return psycopg2.connect(self.conn_string)
def add(self, ids, texts, metadatas):
conn = self._get_conn()
cur = conn.cursor()
for text_id, text, meta in zip(ids, texts, metadatas):
embedding = self._embed(text)
cur.execute(
"INSERT INTO documents (id, content, embedding, metadata) "
"VALUES (%s, %s, %s::vector, %s) ON CONFLICT (id) DO UPDATE "
"SET content = EXCLUDED.content, embedding = EXCLUDED.embedding",
(text_id, text, str(embedding), json.dumps(meta)),
)
conn.commit()
cur.close()
conn.close()
def search(self, query, n_results=5, where=None):
embedding = self._embed(query)
conn = self._get_conn()
cur = conn.cursor()
sql = (
"SELECT content, metadata, 1 - (embedding <=> %s::vector) as score "
"FROM documents ORDER BY embedding <=> %s::vector LIMIT %s"
)
cur.execute(sql, (str(embedding), str(embedding), n_results))
results = [
{"content": row[0], "metadata": row[1], "score": float(row[2])}
for row in cur.fetchall()
]
cur.close()
conn.close()
return results
def delete(self, ids):
conn = self._get_conn()
cur = conn.cursor()
cur.execute("DELETE FROM documents WHERE id = ANY(%s)", (ids,))
conn.commit()
cur.close()
conn.close()
def count(self):
conn = self._get_conn()
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM documents")
result = cur.fetchone()[0]
cur.close()
conn.close()
return result
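# A Qdrant implementation of the same interface is a natural third backend. This is
# a sketch, not a drop-in: Qdrant requires integer or UUID point IDs, so string IDs
# are mapped through uuid5 here, and the `where` filter argument is not implemented.
class QdrantVectorStore(VectorStore):
    def __init__(self, host: str, collection_name: str, dim: int = 1536):
        from qdrant_client import QdrantClient
        from qdrant_client.models import Distance, VectorParams
        self.client = QdrantClient(host=host, port=6333)
        self.collection = collection_name
        if not self.client.collection_exists(collection_name):
            self.client.create_collection(
                collection_name=collection_name,
                vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
            )
        self._embed_client = OpenAI()
    def _embed(self, text: str) -> list[float]:
        resp = self._embed_client.embeddings.create(
            model="text-embedding-3-small", input=text
        )
        return resp.data[0].embedding
    def _point_id(self, text_id: str) -> str:
        import uuid
        return str(uuid.uuid5(uuid.NAMESPACE_URL, text_id))  # stable UUID per string ID
    def add(self, ids, texts, metadatas):
        from qdrant_client.models import PointStruct
        points = [
            PointStruct(
                id=self._point_id(text_id),
                vector=self._embed(text),
                payload={"content": text, **meta},
            )
            for text_id, text, meta in zip(ids, texts, metadatas)
        ]
        self.client.upsert(collection_name=self.collection, points=points)
    def search(self, query, n_results=5, where=None):
        results = self.client.query_points(
            collection_name=self.collection,
            query=self._embed(query),
            limit=n_results,
        )
        return [
            {"content": p.payload.get("content"), "metadata": p.payload, "score": p.score}
            for p in results.points
        ]
    def delete(self, ids):
        from qdrant_client.models import PointIdsList
        self.client.delete(
            collection_name=self.collection,
            points_selector=PointIdsList(points=[self._point_id(i) for i in ids]),
        )
    def count(self):
        return self.client.count(collection_name=self.collection).count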
# Usage — swap implementations without changing RAG code
# store = ChromaVectorStore("./chroma_db", "knowledge_base")
# store = PgVectorStore("postgresql://user:pass@localhost/ragdb")
# Your RAG pipeline doesn't care which one it is
# results = store.search("How do I reset my password?", n_results=5)
Key Takeaways
- ChromaDB for prototyping, pgvector for production. Most teams already run Postgres. Adding pgvector is one CREATE EXTENSION command. No new infrastructure, no new operational burden.
- HNSW is the default index choice. It gives the best query speed and recall. Use IVFFlat only when memory is constrained and you have 10M+ vectors.
- ANN search is “approximate” in theory, near-perfect in practice. At 99% recall, you miss 1 in 100 nearest neighbors. For RAG, this has no measurable impact on answer quality.
- Pinecone eliminates operations but introduces lock-in. If your team is small and does not want to manage infrastructure, it is a good trade. If you value portability, stick with pgvector or Qdrant.
- Qdrant is the performance leader for self-hosted. The Rust foundation gives it an edge on query latency and insert speed. Choose it when you need maximum performance and can manage your own infrastructure.
- Build a unified abstraction layer. The VectorStore interface pattern lets you swap databases without rewriting your RAG pipeline. This is cheap insurance against changing requirements.
- Migrate by running in parallel. Write to both old and new databases, compare results, then switch reads. Never do a hard cutover on a production system.
- The vector database is not the bottleneck for most teams. Retrieval quality depends more on your chunking strategy and embedding model than on which database you use. Get those right first (Lessons 5 and 7), then optimize the database.