using-document-databases
Document database implementation for flexible schema applications. Use when building content management, user profiles, catalogs, or event logging. Covers MongoDB (primary), DynamoDB, Firestore, schema design patterns, indexing strategies, and aggregation pipelines.
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install ancoleman-ai-design-components-using-document-databases
Repository
Skill path: skills/using-document-databases
Open repository
Best for
Primary workflow: Write Technical Docs.
Technical facets: Full Stack, Backend, Tech Writer, Designer.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: ancoleman.
This is still a mirrored public skill entry. Review the repository before installing into production workflows.
What it helps with
- Install using-document-databases into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/ancoleman/ai-design-components before adding using-document-databases to shared team environments
- Use using-document-databases for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: using-document-databases
description: Document database implementation for flexible schema applications. Use when building content management, user profiles, catalogs, or event logging. Covers MongoDB (primary), DynamoDB, Firestore, schema design patterns, indexing strategies, and aggregation pipelines.
---
# Document Database Implementation
Guide NoSQL document database selection and implementation for flexible schema applications across Python, TypeScript, Rust, and Go.
## When to Use This Skill
Use document databases when applications need:
- **Flexible schemas** - Data models evolve rapidly without migrations
- **Nested structures** - JSON-like hierarchical data
- **Horizontal scaling** - Built-in sharding and replication
- **Developer velocity** - Object-to-database mapping without ORM complexity
## Database Selection
### Quick Decision Framework
```
DEPLOYMENT ENVIRONMENT?
├── AWS-Native Application → DynamoDB
│     ✓ Serverless, auto-scaling, single-digit ms latency
│     ✗ Limited query flexibility
│
├── Firebase/GCP Ecosystem → Firestore
│     ✓ Real-time sync, offline support, mobile-first
│     ✗ More expensive for heavy reads
│
└── General-Purpose/Complex Queries → MongoDB
      ✓ Rich aggregation, full-text search, vector search
      ✓ ACID transactions, self-hosted or managed
```
### Database Comparison
| Database | Best For | Latency | Max Item | Query Language |
|----------|----------|---------|----------|----------------|
| **MongoDB** | General-purpose, complex queries | 1-5ms | 16MB | MQL (rich) |
| **DynamoDB** | AWS serverless, predictable performance | <10ms | 400KB | PartiQL (limited) |
| **Firestore** | Real-time apps, mobile-first | 50-200ms | 1MB | Firebase queries |
See `references/mongodb.md` for MongoDB details
See `references/dynamodb.md` for DynamoDB single-table design
See `references/firestore.md` for Firestore real-time patterns
## Schema Design Patterns
### Embedding vs Referencing
**Use the decision matrix in `references/schema-design-patterns.md`**
Quick guide:
| Relationship | Pattern | Example |
|--------------|---------|---------|
| One-to-Few | Embed | User addresses (2-3 max) |
| One-to-Many | Hybrid | Blog posts → comments |
| One-to-Millions | Reference | User → events (logging) |
| Many-to-Many | Reference | Products ↔ Categories |
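The Hybrid row above is worth a sketch. One common form is the subset pattern: embed a small, bounded sample of the related documents and reference the rest. These hypothetical blog-post documents (field names are illustrative, not from this skill) show the shape:

```python
# Subset pattern sketch: the post embeds only its newest comments, while
# the full comment history lives in a separate, referenced collection.
post = {
    "_id": "post123",
    "title": "Schema Design 101",
    "commentCount": 4821,
    "recentComments": [  # embedded subset, kept small and bounded
        {"author": "alice", "text": "Great post!"},
        {"author": "bob", "text": "Very helpful."},
    ],
}

comment = {  # full record, referenced back to the post
    "_id": "comment999",
    "postId": "post123",
    "author": "alice",
    "text": "Great post!",
}

# Rendering the post page reads one document; deep pagination of comments
# queries the comments collection by postId instead.
assert comment["postId"] == post["_id"]
```

The trade-off: reads of the common case stay single-document, while the embedded array can never grow without bound.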
### Embedding Example (MongoDB)
```javascript
// User with embedded addresses
{
  _id: ObjectId("..."),
  email: "[email protected]",
  name: "Jane Doe",
  addresses: [
    {
      type: "home",
      street: "123 Main St",
      city: "Boston",
      default: true
    }
  ],
  preferences: {
    theme: "dark",
    notifications: { email: true, sms: false }
  }
}
```
### Referencing Example (E-commerce)
```javascript
// Orders reference products
{
  _id: ObjectId("..."),
  userId: ObjectId("..."),
  items: [
    {
      productId: ObjectId("..."),  // Reference
      priceAtPurchase: 49.99,      // Denormalized (historical snapshot)
      quantity: 2
    }
  ],
  totalAmount: 99.98
}
```
**When to denormalize:**
- Frequently read together
- Historical snapshots (prices, names)
- Read-heavy workloads
## Indexing Strategies
### MongoDB Index Types
```javascript
// 1. Single field (unique email)
db.users.createIndex({ email: 1 }, { unique: true })

// 2. Compound index (ORDER MATTERS!)
db.orders.createIndex({ status: 1, createdAt: -1 })

// 3. Partial index (index a subset of documents)
db.orders.createIndex(
  { userId: 1 },
  { partialFilterExpression: { status: { $eq: "pending" }}}
)

// 4. TTL index (auto-delete after 30 days)
db.sessions.createIndex(
  { createdAt: 1 },
  { expireAfterSeconds: 2592000 }
)

// 5. Text index (full-text search)
db.articles.createIndex({
  title: "text",
  content: "text"
})
```
**Index Best Practices:**
- Add indexes for all query filters
- Compound index order: Equality → Range → Sort
- Use covering indexes (query + projection in index)
- Use `explain()` to verify index usage
- Monitor with Performance Advisor (Atlas)
**Validate indexes with the script:**
```bash
python scripts/validate_indexes.py
```
See `references/indexing-strategies.md` for complete guide.
## MongoDB Aggregation Pipelines
**Key Operators:** `$match` (filter), `$group` (aggregate), `$lookup` (join), `$unwind` (arrays), `$project` (reshape)
**For complete pipeline patterns and examples, see:** `references/aggregation-patterns.md`
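As a sketch of how these operators compose (hypothetical `orders` collection; field names are illustrative), a pipeline is just an ordered list of stage documents. With Motor it would run as `await db.orders.aggregate(pipeline).to_list(length=100)`:

```python
# Revenue per product: filter, flatten line items, aggregate, reshape, rank.
pipeline = [
    {"$match": {"status": "completed"}},  # filter first so indexes apply
    {"$unwind": "$items"},                # one document per line item
    {"$group": {                          # aggregate per product
        "_id": "$items.productId",
        "revenue": {"$sum": {"$multiply": ["$items.price", "$items.quantity"]}},
        "orders": {"$sum": 1},
    }},
    {"$project": {"_id": 0, "productId": "$_id", "revenue": 1, "orders": 1}},
    {"$sort": {"revenue": -1}},
]
```

Stage order matters: putting `$match` first lets the server use indexes before any documents are reshaped.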
## DynamoDB Single-Table Design
Design around your access patterns using composite partition/sort (PK/SK) keys, storing multiple entity types in one table.
**For complete single-table design patterns and GSI strategies, see:** `references/dynamodb.md`
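A minimal sketch of what that looks like (table, attribute, and prefix names here are illustrative assumptions, not part of this skill): entity type is encoded in composite-key prefixes so one partition holds a user together with all of that user's orders.

```python
# Composite-key layout for a single table holding users and their orders.
def user_key(user_id: str) -> dict:
    return {"PK": f"USER#{user_id}", "SK": "PROFILE"}

def order_key(user_id: str, order_id: str) -> dict:
    return {"PK": f"USER#{user_id}", "SK": f"ORDER#{order_id}"}

# With boto3, one Query on the shared partition key fetches the user
# profile plus every order, no Scan required:
#   table = boto3.resource("dynamodb").Table("AppTable")
#   table.query(KeyConditionExpression=Key("PK").eq("USER#42"))
assert order_key("42", "1001")["PK"] == user_key("42")["PK"]
```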
## Firestore Real-Time Patterns
Use `onSnapshot()` for real-time listeners and Firestore security rules for access control.
**For complete real-time patterns and security rules, see:** `references/firestore.md`
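For Python backends the same listener pattern is available through the `google-cloud-firestore` client's `on_snapshot()`. A minimal sketch, assuming a `firestore.Client` handle and an illustrative `messages` collection; the pure `apply_change` helper keeps a local cache in sync:

```python
# Maintain a local cache from Firestore change events.
cache: dict = {}

def apply_change(state: dict, change_type: str, doc_id: str, data: dict) -> dict:
    """Pure helper: apply one change event to a cache snapshot."""
    updated = dict(state)
    if change_type in ("ADDED", "MODIFIED"):
        updated[doc_id] = data
    elif change_type == "REMOVED":
        updated.pop(doc_id, None)
    return updated

def on_snapshot(doc_snapshots, changes, read_time):
    global cache
    for change in changes:
        # change.type.name is "ADDED", "MODIFIED", or "REMOVED"
        cache = apply_change(cache, change.type.name,
                             change.document.id, change.document.to_dict())

# Wiring (requires a google.cloud.firestore.Client instance named db):
# watch = db.collection("messages").on_snapshot(on_snapshot)
# watch.unsubscribe()  # stop listening
```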
## Multi-Language Examples
**Complete implementations available in `examples/` directory:**
- `examples/mongodb-fastapi/` - Python FastAPI + MongoDB
- `examples/mongodb-nextjs/` - TypeScript Next.js + MongoDB
- `examples/dynamodb-serverless/` - Python Lambda + DynamoDB
- `examples/firestore-react/` - React + Firestore real-time
## Frontend Skill Integration
- **Media Skill** - Use MongoDB GridFS for large file storage with metadata
- **AI Chat Skill** - MongoDB Atlas Vector Search for semantic conversation retrieval
- **Feedback Skill** - DynamoDB for high-throughput event logging with TTL
**For integration examples, see:** `references/skill-integrations.md`
## Performance Optimization
**Key practices:**
- Always use indexes for query filters (verify with `.explain()`)
- Use connection pooling (reuse clients across requests)
- Avoid collection scans in production
**For complete optimization guide, see:** `references/performance.md`
## Common Patterns
**Pagination:** Use cursor-based pagination for large datasets (recommended over offset)
**Soft Deletes:** Mark as deleted with timestamp instead of removing
**Audit Logs:** Store version history within documents
**For implementation details, see:** `references/common-patterns.md`
## Validation and Scripts
### Validate Index Coverage
```bash
# Run validation script
python scripts/validate_indexes.py --db myapp --collection orders
# Output:
# ✓ Query { status: "pending" } covered by index status_1
# ✗ Query { userId: "..." } missing index - add: { userId: 1 }
```
### Schema Analysis
```bash
# Analyze schema patterns
python scripts/analyze_schema.py --db myapp
# Output:
# Collection: users
# - Average document size: 2.4 KB
# - Embedding ratio: 87% (addresses, preferences)
# - Reference ratio: 13% (orderIds)
# Recommendation: Good balance
```
## Anti-Patterns to Avoid
**Unbounded Arrays:** Limit embedded arrays (use references for large collections)
**Over-Indexing:** Only index queried fields (indexes slow writes)
**DynamoDB Scans:** Always use Query with partition key (avoid Scan)
**For detailed anti-patterns, see:** `references/anti-patterns.md`
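For the unbounded-array case, one mitigation worth knowing is `$push` with `$each` and `$slice`, which caps an embedded array at the newest N entries. A sketch with illustrative field names:

```python
# Update document that appends an event but retains only the last 100,
# keeping the embedded array bounded.
keep_last_100 = {
    "$push": {
        "recentEvents": {
            "$each": [{"type": "login", "at": "2025-11-28T00:00:00Z"}],
            "$slice": -100,  # negative slice keeps the newest 100 entries
        }
    }
}
# e.g. await db.users.update_one({"_id": user_id}, keep_last_100)
```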
## Dependencies
### Python
```bash
# MongoDB
pip install motor pymongo
# DynamoDB
pip install boto3
# Firestore
pip install firebase-admin
```
### TypeScript
```bash
# MongoDB
npm install mongodb
# DynamoDB
npm install @aws-sdk/client-dynamodb @aws-sdk/util-dynamodb
# Firestore
npm install firebase firebase-admin
```
### Rust
```toml
# MongoDB
mongodb = "2.8"
# DynamoDB
aws-sdk-dynamodb = "1.0"
```
### Go
```bash
# MongoDB
go get go.mongodb.org/mongo-driver
# DynamoDB
go get github.com/aws/aws-sdk-go-v2/service/dynamodb
```
## Additional Resources
**Database-Specific Guides:**
- `references/mongodb.md` - Complete MongoDB documentation
- `references/dynamodb.md` - DynamoDB single-table patterns
- `references/firestore.md` - Firestore real-time guide
**Pattern Guides:**
- `references/schema-design-patterns.md` - Embedding vs referencing decisions
- `references/indexing-strategies.md` - Index optimization
- `references/aggregation-patterns.md` - MongoDB pipeline cookbook
- `references/common-patterns.md` - Pagination, soft deletes, audit logs
- `references/anti-patterns.md` - Mistakes to avoid
- `references/performance.md` - Query optimization
- `references/skill-integrations.md` - Frontend skill integration
**Examples:** `examples/mongodb-fastapi/`, `examples/mongodb-nextjs/`, `examples/dynamodb-serverless/`, `examples/firestore-react/`
---
## Referenced Files
> The following files are referenced in this skill and included for context.
### scripts/validate_indexes.py
```python
#!/usr/bin/env python3
"""
MongoDB Index Validation Script

Analyzes MongoDB collections and validates index coverage for common queries.
Suggests missing indexes and identifies unused indexes.

Usage:
    python validate_indexes.py --db myapp --collection orders
    python validate_indexes.py --db myapp --all
"""
import argparse
import json
import sys
from typing import Any, Dict, List

from pymongo import MongoClient
from pymongo.errors import OperationFailure


class IndexValidator:
    def __init__(self, uri: str, database: str):
        self.client = MongoClient(uri)
        self.db = self.client[database]

    def get_indexes(self, collection_name: str) -> List[Dict[str, Any]]:
        """Get all indexes for a collection"""
        collection = self.db[collection_name]
        return list(collection.list_indexes())

    def analyze_queries(self, collection_name: str) -> List[Dict[str, Any]]:
        """Analyze slow queries from profiler"""
        try:
            # Get profiling data (last 100 slow queries)
            profiler_data = list(self.db.system.profile.find({
                'ns': f"{self.db.name}.{collection_name}",
                'millis': {'$gte': 100}  # Queries taking >= 100ms
            }).sort('ts', -1).limit(100))
            return profiler_data
        except OperationFailure:
            print("⚠️ Profiling not enabled. Enable with: db.setProfilingLevel(1, { slowms: 100 })")
            return []

    def validate_index_coverage(self, collection_name: str) -> Dict[str, Any]:
        """Validate index coverage for a collection"""
        indexes = self.get_indexes(collection_name)
        # Extract index fields
        index_fields = {}
        for idx in indexes:
            key = idx.get('key', {})
            name = idx['name']
            if name != '_id_':  # Skip default _id index
                index_fields[name] = list(key.keys())
        # Common query patterns to check
        common_patterns = [
            {'email': 1},
            {'status': 1},
            {'userId': 1},
            {'createdAt': -1},
            {'status': 1, 'createdAt': -1},
            {'userId': 1, 'createdAt': -1}
        ]
        results = {
            'collection': collection_name,
            'existing_indexes': indexes,
            'coverage': [],
            'missing_indexes': [],
            'unused_indexes': []
        }
        # Check coverage for common patterns
        for pattern in common_patterns:
            covered = self._check_coverage(pattern, index_fields)
            results['coverage'].append({
                'query': pattern,
                'covered': covered['covered'],
                'index': covered.get('index')
            })
            if not covered['covered']:
                results['missing_indexes'].append(pattern)
        return results

    def _check_coverage(self, query_pattern: Dict, index_fields: Dict) -> Dict:
        """Check if query pattern is covered by existing indexes"""
        query_keys = list(query_pattern.keys())
        for idx_name, idx_keys in index_fields.items():
            # Covered when the query keys are a prefix of the index keys
            if len(query_keys) <= len(idx_keys):
                if idx_keys[:len(query_keys)] == query_keys:
                    return {'covered': True, 'index': idx_name}
        return {'covered': False}

    def suggest_indexes(self, collection_name: str) -> List[Dict[str, Any]]:
        """Suggest indexes based on slow queries"""
        slow_queries = self.analyze_queries(collection_name)
        suggestions = []
        for query in slow_queries:
            if query.get('planSummary', '').startswith('COLLSCAN'):
                # Collection scan detected
                command = query.get('command', {})
                filter_keys = command.get('filter', {}).keys()
                if filter_keys:
                    suggestions.append({
                        'reason': 'Collection scan detected',
                        'query': command.get('filter'),
                        'execution_time': query.get('millis'),
                        'suggested_index': {field: 1 for field in filter_keys}
                    })
        return suggestions

    def get_index_usage_stats(self, collection_name: str) -> List[Dict[str, Any]]:
        """Get index usage statistics"""
        collection = self.db[collection_name]
        try:
            stats = collection.aggregate([{'$indexStats': {}}])
            return list(stats)
        except OperationFailure:
            print("⚠️ Index stats not available (MongoDB 3.2+ required)")
            return []

    def print_report(self, collection_name: str):
        """Print comprehensive index report"""
        print(f"\n{'=' * 70}")
        print(f"Index Validation Report: {self.db.name}.{collection_name}")
        print(f"{'=' * 70}\n")
        # Existing indexes
        indexes = self.get_indexes(collection_name)
        print("📊 EXISTING INDEXES")
        print("-" * 70)
        for idx in indexes:
            key = idx.get('key', {})
            name = idx['name']
            unique = "UNIQUE" if idx.get('unique') else ""
            sparse = "SPARSE" if idx.get('sparse') else ""
            ttl = f"TTL ({idx.get('expireAfterSeconds')}s)" if idx.get('expireAfterSeconds') else ""
            flags = " ".join(filter(None, [unique, sparse, ttl]))
            print(f"  ✓ {name}")
            print(f"    Fields: {list(key.keys())}")
            if flags:
                print(f"    Flags: {flags}")
        # Coverage validation
        print("\n🔍 INDEX COVERAGE")
        print("-" * 70)
        validation = self.validate_index_coverage(collection_name)
        for item in validation['coverage']:
            query = item['query']
            if item['covered']:
                print(f"  ✓ Query {query} covered by index: {item['index']}")
            else:
                print(f"  ✗ Query {query} MISSING INDEX")
        # Missing indexes
        if validation['missing_indexes']:
            print("\n⚠️ RECOMMENDED INDEXES")
            print("-" * 70)
            for idx in validation['missing_indexes']:
                print(f"  → Add index: {idx}")
                print(f"    Command: db.{collection_name}.createIndex({json.dumps(idx)})")
        # Index usage stats
        usage_stats = self.get_index_usage_stats(collection_name)
        if usage_stats:
            print("\n📈 INDEX USAGE STATISTICS")
            print("-" * 70)
            for stat in usage_stats:
                name = stat['name']
                ops = stat['accesses']['ops']
                since = stat['accesses']['since']
                print(f"  • {name}: {ops} operations since {since}")
                if ops == 0 and name != '_id_':
                    print("    ⚠️ UNUSED - Consider removing")
        # Suggestions from slow queries
        suggestions = self.suggest_indexes(collection_name)
        if suggestions:
            print("\n💡 SUGGESTIONS FROM SLOW QUERIES")
            print("-" * 70)
            for suggestion in suggestions:
                print(f"  • Query: {suggestion['query']}")
                print(f"    Execution time: {suggestion['execution_time']} ms")
                print(f"    Suggested index: {suggestion['suggested_index']}")
        print("\n" + "=" * 70 + "\n")

    def validate_all_collections(self):
        """Validate indexes for all collections"""
        collections = self.db.list_collection_names()
        print(f"\nValidating {len(collections)} collections...\n")
        for coll in collections:
            if not coll.startswith('system.'):
                self.print_report(coll)


def main():
    parser = argparse.ArgumentParser(description='MongoDB Index Validator')
    parser.add_argument('--uri', default='mongodb://localhost:27017/',
                        help='MongoDB connection URI')
    parser.add_argument('--db', required=True, help='Database name')
    parser.add_argument('--collection',
                        help='Collection name (optional if --all is used)')
    parser.add_argument('--all', action='store_true',
                        help='Validate all collections')
    args = parser.parse_args()
    if not args.collection and not args.all:
        parser.error("Either --collection or --all must be specified")
    try:
        validator = IndexValidator(args.uri, args.db)
        if args.all:
            validator.validate_all_collections()
        else:
            validator.print_report(args.collection)
    except Exception as e:
        print(f"❌ Error: {e}", file=sys.stderr)
        sys.exit(1)


if __name__ == '__main__':
    main()
```
### references/mongodb.md
```markdown
# MongoDB Complete Guide
Comprehensive MongoDB reference covering collections, queries, indexes, aggregation framework, and Atlas features.
## Table of Contents
- [Setup and Connection](#setup-and-connection)
- [Collections and Documents](#collections-and-documents)
- [CRUD Operations](#crud-operations)
- [Query Operators](#query-operators)
- [Indexes](#indexes)
- [Aggregation Framework](#aggregation-framework)
- [Transactions](#transactions)
- [Atlas Features](#atlas-features)
- [Performance Tuning](#performance-tuning)
---
## Setup and Connection
### MongoDB Atlas (Managed Service)
```javascript
// Connection string format
mongodb+srv://<username>:<password>@cluster0.xxxxx.mongodb.net/myapp?retryWrites=true&w=majority
```
### Python (Motor - Async)
```python
from motor.motor_asyncio import AsyncIOMotorClient
from pymongo.server_api import ServerApi

# Connect to Atlas
client = AsyncIOMotorClient(
    "mongodb+srv://user:[email protected]/",
    server_api=ServerApi('1'),
    maxPoolSize=50,
    minPoolSize=10
)
db = client.myapp
users_collection = db.users

# Test connection
async def ping():
    await client.admin.command('ping')
    print("Connected to MongoDB!")
```
### TypeScript (Native Driver)
```typescript
import { MongoClient, ServerApiVersion } from 'mongodb'

const client = new MongoClient(process.env.MONGODB_URI!, {
  serverApi: {
    version: ServerApiVersion.v1,
    strict: true,
    deprecationErrors: true,
  }
})

await client.connect()
const db = client.db('myapp')
const users = db.collection('users')
```
### Connection Pooling Best Practices
```python
# Python: reuse one client across the app lifecycle.
# Create once at startup
client = AsyncIOMotorClient(
    uri,
    maxPoolSize=50,                # Max concurrent connections
    minPoolSize=10,                # Maintain minimum pool
    maxIdleTimeMS=30000,           # Close idle connections after 30s
    serverSelectionTimeoutMS=5000  # Time out server selection after 5s
)

# Use throughout the application
async def get_user(email: str):
    return await client.myapp.users.find_one({"email": email})
```
---
## Collections and Documents
### Document Structure
```javascript
{
  _id: ObjectId("507f1f77bcf86cd799439011"), // Auto-generated unique ID
  email: "[email protected]",
  name: "Jane Doe",
  age: 32,
  address: {                       // Embedded document
    street: "123 Main St",
    city: "Boston",
    state: "MA"
  },
  hobbies: ["reading", "cycling"], // Array
  metadata: {                      // Nested object
    createdAt: ISODate("2025-01-15"),
    updatedAt: ISODate("2025-11-28"),
    version: 3
  }
}
```
### Schema Validation (Optional)
```javascript
db.createCollection("users", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["email", "name"],
      properties: {
        email: {
          bsonType: "string",
          pattern: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
        },
        name: {
          bsonType: "string",
          minLength: 1
        },
        age: {
          bsonType: "int",
          minimum: 0,
          maximum: 150
        }
      }
    }
  },
  validationLevel: "moderate", // moderate or strict
  validationAction: "warn"     // warn or error
})
```
---
## CRUD Operations
### Create (Insert)
```python
from datetime import datetime
from bson import ObjectId

# Insert one
result = await db.users.insert_one({
    "email": "[email protected]",
    "name": "Jane Doe",
    "createdAt": datetime.utcnow()
})
user_id = result.inserted_id

# Insert many
users = [
    {"email": "[email protected]", "name": "User One"},
    {"email": "[email protected]", "name": "User Two"}
]
result = await db.users.insert_many(users)
inserted_ids = result.inserted_ids
```
### Read (Find)
```python
# Find one
user = await db.users.find_one({"email": "[email protected]"})

# Find many with filter
active_users = await db.users.find({"status": "active"}).to_list(length=100)

# Find with projection (select specific fields)
users = await db.users.find(
    {"status": "active"},
    {"name": 1, "email": 1, "_id": 0}  # Include name, email; exclude _id
).to_list(length=100)

# Find with sorting
users = await db.users.find().sort([
    ("createdAt", -1),  # Descending
    ("name", 1)         # Then ascending
]).to_list(length=100)

# Find with pagination (skip/limit; prefer cursors for large datasets)
page_size = 20
skip = (page_number - 1) * page_size
users = await db.users.find().skip(skip).limit(page_size).to_list(length=page_size)
```
### Update
```python
# Update one
result = await db.users.update_one(
    {"email": "[email protected]"},   # Filter
    {"$set": {"name": "Jane Smith"}}     # Update
)
print(f"Modified {result.modified_count} documents")

# Update many
result = await db.users.update_many(
    {"status": "inactive"},
    {"$set": {"archived": True}}
)

# Upsert (update or insert)
result = await db.users.update_one(
    {"email": "[email protected]"},
    {"$set": {"name": "New User"}},
    upsert=True  # Create if not found
)
```
### Delete
```python
# Delete one
result = await db.users.delete_one({"email": "[email protected]"})

# Delete many
result = await db.users.delete_many({"status": "archived"})
print(f"Deleted {result.deleted_count} documents")

# Soft delete (recommended)
result = await db.users.update_one(
    {"email": "[email protected]"},
    {"$set": {"deleted": True, "deletedAt": datetime.utcnow()}}
)
```
---
## Query Operators
### Comparison Operators
```python
# Equal
db.users.find({"age": 30})
# Not equal
db.users.find({"age": {"$ne": 30}})
# Greater than, greater than or equal
db.users.find({"age": {"$gt": 25}})
db.users.find({"age": {"$gte": 25}})
# Less than, less than or equal
db.users.find({"age": {"$lt": 40}})
db.users.find({"age": {"$lte": 40}})
# In array
db.users.find({"status": {"$in": ["active", "pending"]}})
# Not in array
db.users.find({"status": {"$nin": ["archived", "deleted"]}})
```
### Logical Operators
```python
# AND (implicit)
db.users.find({"status": "active", "age": {"$gte": 18}})
# AND (explicit)
db.users.find({"$and": [
    {"status": "active"},
    {"age": {"$gte": 18}}
]})

# OR
db.users.find({"$or": [
    {"status": "active"},
    {"status": "pending"}
]})

# NOT
db.users.find({"age": {"$not": {"$gte": 18}}})

# NOR (not or)
db.users.find({"$nor": [
    {"status": "archived"},
    {"deleted": True}
]})
```
### Element Operators
```python
# Field exists
db.users.find({"email": {"$exists": True}})
# Field type
db.users.find({"age": {"$type": "int"}})
```
### Array Operators
```python
# All elements match
db.users.find({"hobbies": {"$all": ["reading", "cycling"]}})
# Array size
db.users.find({"hobbies": {"$size": 3}})
# Element match (for embedded documents)
db.orders.find({
    "items": {
        "$elemMatch": {
            "productId": "PROD-123",
            "quantity": {"$gte": 2}
        }
    }
})
```
### String Operators (Regex)
```python
# Case-insensitive search
db.users.find({"name": {"$regex": "jane", "$options": "i"}})
# Starts with
db.users.find({"email": {"$regex": "^admin"}})
# Contains
db.users.find({"name": {"$regex": "doe"}})
```
---
## Indexes
### Index Types
```python
# Single field index
await db.users.create_index("email", unique=True)

# Compound index (order matters!)
await db.orders.create_index([
    ("status", 1),      # Ascending
    ("createdAt", -1)   # Descending
])

# Multikey index (on array fields)
await db.users.create_index("tags")  # Indexes each element

# Text index (full-text search)
await db.articles.create_index([
    ("title", "text"),
    ("content", "text")
])

# Geospatial index
await db.locations.create_index([("coordinates", "2dsphere")])

# Hashed index (for sharding)
await db.users.create_index([("userId", "hashed")])

# Wildcard index (for dynamic schemas)
await db.products.create_index([("specifications.$**", 1)])
```
### Index Options
```python
# Unique index
await db.users.create_index("email", unique=True)

# Partial index (index a subset of documents)
await db.orders.create_index(
    "userId",
    partialFilterExpression={"status": {"$eq": "pending"}}
)

# TTL index (auto-delete after expiration)
await db.sessions.create_index(
    "createdAt",
    expireAfterSeconds=86400  # 24 hours
)

# Sparse index (only index documents that have the field)
await db.users.create_index("phoneNumber", sparse=True)

# Case-insensitive index
await db.users.create_index(
    "email",
    collation={"locale": "en", "strength": 2}
)
```
### Index Management
```python
# List indexes
indexes = await db.users.list_indexes().to_list(length=100)
# Drop index
await db.users.drop_index("email_1")
# Drop all indexes (except _id)
await db.users.drop_indexes()
# Rebuild indexes
await db.users.reindex()
```
### Index Analysis
```python
# Explain query execution plan
explain = await db.users.find({"status": "active"}).explain()
print(f"Execution time: {explain['executionStats']['executionTimeMillis']} ms")
print(f"Documents examined: {explain['executionStats']['totalDocsExamined']}")
print(f"Keys examined: {explain['executionStats']['totalKeysExamined']}")
print(f"Stage: {explain['executionStats']['executionStages']['stage']}")
# Stage should be "IXSCAN" (index scan), not "COLLSCAN" (collection scan)
```
---
## Aggregation Framework
### Pipeline Stages
```javascript
// Complete aggregation example
db.orders.aggregate([
  // Stage 1: Match (filter)
  { $match: {
    orderDate: { $gte: ISODate("2025-11-01") },
    status: "completed"
  }},
  // Stage 2: Lookup (join)
  { $lookup: {
    from: "users",
    localField: "userId",
    foreignField: "_id",
    as: "user"
  }},
  // Stage 3: Unwind (flatten array)
  { $unwind: "$user" },
  // Stage 4: Unwind items
  { $unwind: "$items" },
  // Stage 5: Lookup products
  { $lookup: {
    from: "products",
    localField: "items.productId",
    foreignField: "_id",
    as: "product"
  }},
  // Stage 6: Unwind products
  { $unwind: "$product" },
  // Stage 7: Group by category
  { $group: {
    _id: "$product.category",
    totalRevenue: {
      $sum: { $multiply: ["$items.quantity", "$items.price"] }
    },
    totalQuantity: { $sum: "$items.quantity" },
    orderCount: { $sum: 1 },
    avgOrderValue: { $avg: "$totalAmount" }
  }},
  // Stage 8: Project (reshape)
  { $project: {
    _id: 0,
    category: "$_id",
    totalRevenue: 1,
    totalQuantity: 1,
    orderCount: 1,
    avgOrderValue: { $round: ["$avgOrderValue", 2] }
  }},
  // Stage 9: Sort
  { $sort: { totalRevenue: -1 }},
  // Stage 10: Limit
  { $limit: 10 }
])
```
### Grouping Accumulators
```javascript
// Sum, average, min, max
{ $group: {
  _id: "$category",
  total: { $sum: "$price" },
  avg: { $avg: "$price" },
  min: { $min: "$price" },
  max: { $max: "$price" },
  count: { $sum: 1 }
}}

// Push to array
{ $group: {
  _id: "$userId",
  orders: { $push: "$orderId" }
}}

// Add to set (unique values)
{ $group: {
  _id: "$userId",
  uniqueProducts: { $addToSet: "$productId" }
}}

// First, last
{ $group: {
  _id: "$userId",
  firstOrder: { $first: "$orderDate" },
  lastOrder: { $last: "$orderDate" }
}}
```
### Conditional Aggregation
```javascript
// $cond (if-then-else)
{ $project: {
  discount: {
    $cond: {
      if: { $gte: ["$totalAmount", 100] },
      then: { $multiply: ["$totalAmount", 0.1] },
      else: 0
    }
  }
}}

// $switch (multi-way branch)
{ $project: {
  tier: {
    $switch: {
      branches: [
        { case: { $gte: ["$totalAmount", 1000] }, then: "platinum" },
        { case: { $gte: ["$totalAmount", 500] }, then: "gold" },
        { case: { $gte: ["$totalAmount", 100] }, then: "silver" }
      ],
      default: "bronze"
    }
  }
}}
```
### Faceted Search
```javascript
// Multi-dimensional aggregation
db.products.aggregate([
  { $match: { category: "electronics" }},
  { $facet: {
    // Facet 1: Price ranges
    "priceRanges": [
      { $bucket: {
        groupBy: "$price",
        boundaries: [0, 50, 100, 200, 500],
        default: "500+",
        output: { count: { $sum: 1 }}
      }}
    ],
    // Facet 2: Top brands
    "topBrands": [
      { $group: { _id: "$brand", count: { $sum: 1 }}},
      { $sort: { count: -1 }},
      { $limit: 5 }
    ],
    // Facet 3: Results
    "results": [
      { $sort: { popularity: -1 }},
      { $limit: 20 }
    ]
  }}
])
```
---
## Transactions
### Multi-Document ACID Transactions
```python
from motor.motor_asyncio import AsyncIOMotorClient

client = AsyncIOMotorClient(uri)

async def transfer_funds(from_user_id, to_user_id, amount):
    async with await client.start_session() as session:
        async with session.start_transaction():
            # Debit from sender (the filter guards against overdraft)
            result = await db.accounts.update_one(
                {"userId": from_user_id, "balance": {"$gte": amount}},
                {"$inc": {"balance": -amount}},
                session=session
            )
            if result.modified_count == 0:
                # Raising inside the context aborts the transaction
                raise ValueError("Insufficient funds")
            # Credit to receiver
            await db.accounts.update_one(
                {"userId": to_user_id},
                {"$inc": {"balance": amount}},
                session=session
            )
        # Committed automatically when the transaction context exits cleanly
```
### Transaction Best Practices
- Keep transactions short (< 1 second)
- Limit to 1000 documents
- Use retryable writes for single operations
- Avoid long-running queries in transactions
- Use `writeConcern: "majority"` for consistency
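The retry advice above can lean on the driver: Motor/PyMongo sessions expose `with_transaction`, which starts, commits, and retries the transaction callback on transient errors. A sketch, assuming `client` and `db` handles exist as in the example above (`mark_paid` and `order_id` are hypothetical names):

```python
# Drivers tag retryable transaction failures with an error label; a manual
# retry loop would check for this label before retrying.
def is_transient(error_labels: set) -> bool:
    return "TransientTransactionError" in error_labels

# Hypothetical transaction body; receives the session from with_transaction.
async def mark_paid(session):
    await db.orders.update_one(
        {"_id": order_id},
        {"$set": {"status": "paid"}},
        session=session,
    )

async def settle_order():
    async with await client.start_session() as session:
        # Retries mark_paid on transient errors and commits on success
        await session.with_transaction(mark_paid)
```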
---
## Atlas Features
### Atlas Search (Lucene-based Full-Text)
```javascript
// Create search index (in Atlas UI or CLI)
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "title": { "type": "string" },
      "content": { "type": "string" },
      "tags": { "type": "stringFacet" }
    }
  }
}

// Search query
db.articles.aggregate([
  {
    $search: {
      index: "default",
      text: {
        query: "mongodb aggregation",
        path: ["title", "content"],
        fuzzy: { maxEdits: 1 }
      }
    }
  },
  {
    $project: {
      title: 1,
      content: 1,
      score: { $meta: "searchScore" }
    }
  },
  { $limit: 10 }
])
```
### Atlas Vector Search (AI/RAG)
```javascript
// Create vector search index
{
  "fields": [{
    "type": "vector",
    "path": "embedding",
    "numDimensions": 1536, // OpenAI ada-002
    "similarity": "cosine"
  }]
}

// Vector search query
db.messages.aggregate([
  {
    $vectorSearch: {
      queryVector: embeddingVector, // From OpenAI/etc
      path: "embedding",
      numCandidates: 100,
      limit: 5,
      index: "vector_index"
    }
  },
  {
    $project: {
      content: 1,
      score: { $meta: "vectorSearchScore" }
    }
  }
])
```
### Atlas Triggers (Database Events)
```javascript
// Trigger function (in Atlas)
exports = async function(changeEvent) {
  const { fullDocument, operationType } = changeEvent

  if (operationType === 'insert') {
    // Send welcome email
    await context.functions.execute("sendWelcomeEmail", fullDocument.email)
  }
}
```
---
## Performance Tuning
### Query Optimization
```python
# Use projection (fetch only needed fields)
users = await db.users.find(
    {"status": "active"},
    {"name": 1, "email": 1, "_id": 0}
).to_list(length=100)

# Use covered queries (filter + projection satisfied by the index)
# Index: { status: 1, name: 1, email: 1 }
# The query above then reads only the index, with no document fetch!
```
### Batch Operations
```python
# Mixed bulk writes in a single round trip
from pymongo import InsertOne, UpdateOne

requests = [
    InsertOne({"email": "[email protected]"}),
    UpdateOne({"email": "[email protected]"}, {"$set": {"name": "Updated"}})
]
result = await db.users.bulk_write(requests)
```
### Read Preference
```python
# Read from secondaries (eventual consistency)
from pymongo import ReadPreference

secondary_users = db.users.with_options(
    read_preference=ReadPreference.SECONDARY_PREFERRED
)
result = await secondary_users.find({"status": "active"}).to_list(length=100)
```
### Write Concern
```python
# Acknowledge writes only after a majority of replicas
from pymongo import WriteConcern

result = await db.users.with_options(
    write_concern=WriteConcern(w="majority", wtimeout=5000)
).insert_one({"email": "[email protected]"})
```
### Atlas Performance Advisor
- Automatically suggests indexes for slow queries
- Identifies unused indexes
- Analyzes query patterns
- Available in Atlas UI
---
## Common Patterns
### Pagination (Cursor-Based)
```python
# Better than skip/limit for large datasets
from bson import ObjectId

async def get_page(last_id=None, page_size=20):
    query = {}
    if last_id:
        query["_id"] = {"$gt": ObjectId(last_id)}
    results = await db.products.find(query) \
        .sort("_id", 1).limit(page_size).to_list(length=page_size)
    return {
        "data": results,
        "nextCursor": str(results[-1]["_id"]) if results else None
    }
```
### Soft Deletes
```python
# Partial index covering only non-deleted documents.
# Note: partial filter expressions do not support $ne, so documents must
# carry an explicit deleted: false field for this filter to match them.
await db.users.create_index(
    [("email", 1)],
    partialFilterExpression={"deleted": {"$eq": False}}
)

# Mark as deleted
await db.users.update_one(
    {"_id": user_id},
    {"$set": {"deleted": True, "deletedAt": datetime.utcnow()}}
)

# Query non-deleted
users = await db.users.find({"deleted": {"$ne": True}}).to_list(length=100)
```
### Audit Logs
```python
# Store version history within the document
{
    "_id": ObjectId("..."),
    "documentId": "doc123",
    "version": 3,
    "content": "Current content",
    "history": [
        {"version": 1, "content": "Original", "timestamp": "..."},
        {"version": 2, "content": "Updated", "timestamp": "..."}
    ]
}
```
---
## TypeScript Type Safety
```typescript
import { MongoClient, ObjectId } from 'mongodb'
// Don't extend Document -- its index signature would let any property through
interface User {
_id?: ObjectId
email: string
name: string
age?: number
createdAt: Date
}
const client = new MongoClient(process.env.MONGODB_URI!)
const db = client.db('myapp')
const users = db.collection<User>('users')
// Type-safe operations
const user = await users.findOne({ email: '[email protected]' })
if (user) {
console.log(user.name) // TypeScript knows this exists
}
await users.insertOne({
email: '[email protected]',
name: 'New User',
createdAt: new Date()
})
```
---
## Monitoring and Debugging
### Connection Pool Monitoring
```python
from pymongo import monitoring
class ConnectionPoolLogger(monitoring.ConnectionPoolListener):
def pool_created(self, event):
print(f"Pool created: {event.address}")
def connection_checked_out(self, event):
print(f"Connection checked out: {event.connection_id}")
monitoring.register(ConnectionPoolLogger())
```
### Slow Query Logging
```javascript
// Enable profiling (level 1 = slow queries only)
db.setProfilingLevel(1, { slowms: 100 })
// View slow queries
db.system.profile.find().sort({ ts: -1 }).limit(10)
```
---
This comprehensive guide covers MongoDB from basics to advanced patterns. For aggregation cookbook, see `aggregation-patterns.md`.
```
### references/dynamodb.md
```markdown
# DynamoDB Complete Guide
AWS DynamoDB single-table design, GSI patterns, and serverless best practices.
## Table of Contents
- [Core Concepts](#core-concepts)
- [Single-Table Design](#single-table-design)
- [Primary Keys](#primary-keys)
- [Global Secondary Indexes](#global-secondary-indexes)
- [Query Patterns](#query-patterns)
- [DynamoDB Streams](#dynamodb-streams)
- [Pricing Optimization](#pricing-optimization)
- [Performance Best Practices](#performance-best-practices)
---
## Core Concepts
### DynamoDB vs Traditional Databases
| Feature | DynamoDB | MongoDB/SQL |
|---------|----------|-------------|
| **Data Model** | Key-value, document | Document, relational |
| **Queries** | By primary key + GSI only | Flexible queries |
| **Scaling** | Automatic, unlimited | Manual sharding/replication |
| **Consistency** | Eventual (default), strong (optional) | Tunable |
| **Latency** | Single-digit ms guaranteed | Varies |
| **Pricing** | Per request or provisioned | Per instance/cluster |
### Key Terminology
- **Partition Key (PK)**: Hash key for data distribution
- **Sort Key (SK)**: Range key for sorting within partition
- **GSI**: Global Secondary Index (alternate query patterns)
- **LSI**: Local Secondary Index (different sort key, same PK)
- **WCU**: Write Capacity Unit (one write/sec of up to 1 KB)
- **RCU**: Read Capacity Unit (one strongly consistent read/sec of up to 4 KB; eventually consistent reads cost half)
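These units translate directly into arithmetic when sizing a table. A rough sketch of the rounding rules (standard table class; writes round up to the next 1 KB, reads to the next 4 KB):

```python
import math

def wcu_for_write(item_size_bytes: int) -> int:
    """1 WCU = one write/sec of an item up to 1 KB (rounded up)."""
    return math.ceil(item_size_bytes / 1024)

def rcu_for_read(item_size_bytes: int, strongly_consistent: bool = True) -> float:
    """1 RCU = one strongly consistent read/sec of up to 4 KB.
    Eventually consistent reads cost half as much."""
    units = math.ceil(item_size_bytes / 4096)
    return units if strongly_consistent else units / 2

print(wcu_for_write(3500))        # 4 WCUs for a 3.5 KB item
print(rcu_for_read(3500))         # 1 RCU (fits in one 4 KB unit)
print(rcu_for_read(9000, False))  # 1.5 RCUs (3 units, eventually consistent)
```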
---
## Single-Table Design
### Philosophy
Design for **access patterns**, not normalization. Store multiple entity types in one table using composite keys.
### Entity Types in One Table
```javascript
// Users table with multiple entity types
// User metadata
{
PK: "USER#12345",
SK: "METADATA",
email: "[email protected]",
name: "Jane Doe",
createdAt: "2025-01-15T10:00:00Z"
}
// User's orders
{
PK: "USER#12345",
SK: "ORDER#2025-001234",
orderNumber: "ORD-2025-001234",
totalAmount: 249.97,
status: "shipped",
orderDate: "2025-11-25T14:30:00Z"
}
// Order detail (different access pattern)
{
PK: "ORDER#2025-001234",
SK: "METADATA",
userId: "USER#12345",
totalAmount: 249.97,
status: "shipped"
}
// Order items
{
PK: "ORDER#2025-001234",
SK: "ITEM#001",
productId: "PROD-456",
quantity: 2,
price: 49.99,
name: "Widget Pro"
}
// Product catalog
{
PK: "PRODUCT#456",
SK: "METADATA",
name: "Widget Pro",
category: "widgets",
price: 49.99,
inventory: 245
}
```
### Access Patterns Enabled
```python
import boto3
from boto3.dynamodb.conditions import Key
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('AppData')
# 1. Get user metadata
response = table.query(
KeyConditionExpression=Key('PK').eq('USER#12345') & Key('SK').eq('METADATA')
)
# 2. Get all orders for user
response = table.query(
KeyConditionExpression=Key('PK').eq('USER#12345') & Key('SK').begins_with('ORDER#')
)
# 3. Get order with items
response = table.query(
KeyConditionExpression=Key('PK').eq('ORDER#2025-001234')
)
# 4. Get specific order
response = table.query(
KeyConditionExpression=Key('PK').eq('ORDER#2025-001234') & Key('SK').eq('METADATA')
)
```
### Composite Key Strategies
**Pattern 1: Entity Type Prefix**
```
PK: USER#12345
SK: METADATA
SK: ORDER#001
SK: ORDER#002
```
**Pattern 2: Hierarchical**
```
PK: TENANT#abc
SK: USER#12345
SK: USER#12345#ORDER#001
SK: USER#12345#ORDER#002
```
**Pattern 3: Reverse Relationship**
```
// Same item, two representations
PK: USER#12345, SK: ORDER#001
PK: ORDER#001, SK: USER#12345
```
---
## Primary Keys
### Partition Key Only
```python
# Create table with partition key only
dynamodb = boto3.client('dynamodb')
dynamodb.create_table(
TableName='Users',
KeySchema=[
{'AttributeName': 'userId', 'KeyType': 'HASH'} # PK only
],
AttributeDefinitions=[
{'AttributeName': 'userId', 'AttributeType': 'S'}
],
BillingMode='PAY_PER_REQUEST'
)
# Put item
table.put_item(Item={
'userId': 'user123',
'email': '[email protected]',
'name': 'Jane Doe'
})
# Get item
response = table.get_item(Key={'userId': 'user123'})
```
### Partition Key + Sort Key
```python
# Create table with composite key
dynamodb.create_table(
TableName='AppData',
KeySchema=[
{'AttributeName': 'PK', 'KeyType': 'HASH'}, # Partition Key
{'AttributeName': 'SK', 'KeyType': 'RANGE'} # Sort Key
],
AttributeDefinitions=[
{'AttributeName': 'PK', 'AttributeType': 'S'},
{'AttributeName': 'SK', 'AttributeType': 'S'}
],
BillingMode='PAY_PER_REQUEST'
)
```
### Partition Key Design
**Good Partition Keys:**
- High cardinality (many unique values)
- Even access distribution
- No hot partitions
```python
# GOOD: User ID (unique per user)
PK: "USER#12345"
# GOOD: Composite (tenant + user)
PK: "TENANT#abc#USER#12345"
# BAD: Status (hot partition on "active")
PK: "STATUS#active" # All active users in one partition!
# BAD: Date (hot partition for current date)
PK: "DATE#2025-11-28" # All today's writes to one partition!
```
**Fix Hot Partitions:**
```python
# Add random suffix to distribute writes
import random
suffix = random.randint(0, 9)
PK: f"DATE#2025-11-28#{suffix}"
```
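The suffix trick shifts cost to the read side: fetching the day back means one query per shard, merged client-side. A sketch, assuming a boto3 `Table` object like the earlier examples:

```python
NUM_SHARDS = 10  # must match the suffix range used on write

def shard_keys(date: str, num_shards: int = NUM_SHARDS) -> list[str]:
    """All partition keys a sharded date was written under."""
    return [f"DATE#{date}#{s}" for s in range(num_shards)]

def query_sharded_date(table, date: str) -> list[dict]:
    """Fan out one query per write shard and merge the results.
    `table` is a boto3 Table resource; requires boto3 at call time."""
    from boto3.dynamodb.conditions import Key
    items: list[dict] = []
    for pk in shard_keys(date):
        response = table.query(KeyConditionExpression=Key('PK').eq(pk))
        items.extend(response['Items'])
    return items

print(shard_keys("2025-11-28")[:2])  # ['DATE#2025-11-28#0', 'DATE#2025-11-28#1']
```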
---
## Global Secondary Indexes (GSI)
### When to Use GSI
- Query by attributes other than primary key
- Different sort order
- Sparse indexes (not all items have attribute)
### GSI Example: Query Orders by Status
```python
# Base table: PK = userId, SK = timestamp
# GSI: PK = status, SK = timestamp (query all pending orders)
dynamodb.create_table(
TableName='Orders',
KeySchema=[
{'AttributeName': 'userId', 'KeyType': 'HASH'},
{'AttributeName': 'timestamp', 'KeyType': 'RANGE'}
],
AttributeDefinitions=[
{'AttributeName': 'userId', 'AttributeType': 'S'},
{'AttributeName': 'timestamp', 'AttributeType': 'S'},
{'AttributeName': 'status', 'AttributeType': 'S'}
],
GlobalSecondaryIndexes=[
{
'IndexName': 'StatusIndex',
'KeySchema': [
{'AttributeName': 'status', 'KeyType': 'HASH'},
{'AttributeName': 'timestamp', 'KeyType': 'RANGE'}
],
'Projection': {'ProjectionType': 'ALL'}  # GSIs inherit the table's billing mode
}
],
BillingMode='PAY_PER_REQUEST'
)
# Query GSI
table = boto3.resource('dynamodb').Table('Orders')  # .Table needs the resource API, not the client
response = table.query(
IndexName='StatusIndex',
KeyConditionExpression=Key('status').eq('pending')
)
```
### GSI Projection Types
```python
# ALL: Project all attributes
'Projection': {'ProjectionType': 'ALL'}
# KEYS_ONLY: Project only keys (smallest index)
'Projection': {'ProjectionType': 'KEYS_ONLY'}
# INCLUDE: Project specific attributes
'Projection': {
'ProjectionType': 'INCLUDE',
'NonKeyAttributes': ['email', 'name', 'totalAmount']
}
```
### GSI Best Practices
1. **Use sparse indexes** (only items with attribute are indexed)
2. **Projection**: Use KEYS_ONLY or INCLUDE to reduce storage
3. **Limit GSIs**: Default quota is 20 per table
4. **GSI writes**: Every item update may update multiple GSIs (cost!)
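Point 4 is worth quantifying. With full-item projection, every write is re-applied to each GSI that indexes the item, so WCU cost multiplies; a rough sketch:

```python
import math

def total_wcus(item_size_bytes: int, num_gsis_touched: int) -> int:
    """Approximate WCUs for one write: the base table plus each GSI
    that indexes the item (assumes full-item projection)."""
    per_write = math.ceil(item_size_bytes / 1024)
    return per_write * (1 + num_gsis_touched)

print(total_wcus(2048, 0))  # 2 WCUs, base table only
print(total_wcus(2048, 3))  # 8 WCUs once 3 GSIs index the item
```

KEYS_ONLY or INCLUDE projections shrink the per-GSI write size, which is why the projection advice above also cuts write cost, not just storage.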
---
## Query Patterns
### Query vs Scan
```python
# GOOD: Query (efficient, uses index)
response = table.query(
KeyConditionExpression=Key('PK').eq('USER#12345')
)
# BAD: Scan (reads entire table, expensive!)
response = table.scan(
FilterExpression=Attr('email').eq('[email protected]')
)
```
### Query Operators
```python
from boto3.dynamodb.conditions import Key, Attr
# Equals
Key('PK').eq('USER#12345')
# Less than, greater than
Key('timestamp').lt('2025-11-01')
Key('timestamp').gte('2025-11-01')
# Between
Key('timestamp').between('2025-11-01', '2025-11-30')
# Begins with (sort key only)
Key('SK').begins_with('ORDER#')
```
### Filter Expressions (Post-Query Filtering)
```python
# Query + filter (filter after query, doesn't save RCUs)
response = table.query(
KeyConditionExpression=Key('PK').eq('USER#12345'),
FilterExpression=Attr('status').eq('active') & Attr('age').gte(18)
)
```
### Pagination
```python
# Paginate through large result sets
last_evaluated_key = None
while True:
if last_evaluated_key:
response = table.query(
KeyConditionExpression=Key('PK').eq('USER#12345'),
ExclusiveStartKey=last_evaluated_key
)
else:
response = table.query(
KeyConditionExpression=Key('PK').eq('USER#12345')
)
items = response['Items']
# Process items
last_evaluated_key = response.get('LastEvaluatedKey')
if not last_evaluated_key:
break
```
---
## Update Operations
### Update Expressions
```python
# Set attribute
table.update_item(
Key={'userId': 'user123'},
UpdateExpression='SET #name = :name',
ExpressionAttributeNames={'#name': 'name'},
ExpressionAttributeValues={':name': 'Jane Smith'}
)
# Increment counter (ADD initializes a missing attribute to 0;
# 'SET x = x + :inc' fails if the attribute doesn't exist yet)
table.update_item(
Key={'userId': 'user123'},
UpdateExpression='ADD loginCount :inc',
ExpressionAttributeValues={':inc': 1}
)
# Add to list
table.update_item(
Key={'userId': 'user123'},
UpdateExpression='SET tags = list_append(tags, :tag)',
ExpressionAttributeValues={':tag': ['new-tag']}
)
# Add to set
table.update_item(
Key={'userId': 'user123'},
UpdateExpression='ADD emailSet :email',
ExpressionAttributeValues={':email': {'[email protected]'}}
)
# Remove attribute
table.update_item(
Key={'userId': 'user123'},
UpdateExpression='REMOVE deprecated_field'
)
```
### Conditional Updates
```python
# Update only if condition met
from botocore.exceptions import ClientError
try:
table.update_item(
Key={'userId': 'user123'},
UpdateExpression='SET balance = balance - :amount',
ConditionExpression='balance >= :amount',
ExpressionAttributeValues={':amount': 100}
)
except ClientError as e:
if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
print("Insufficient balance")
```
### Atomic Counters
```python
# Increment view count (atomic)
response = table.update_item(
Key={'postId': 'post123'},
UpdateExpression='SET viewCount = if_not_exists(viewCount, :start) + :inc',
ExpressionAttributeValues={':start': 0, ':inc': 1},
ReturnValues='UPDATED_NEW'
)
print(f"New count: {response['Attributes']['viewCount']}")
```
---
## DynamoDB Streams
### Enable Change Data Capture
```python
# Create table with streams enabled
dynamodb.create_table(
TableName='Orders',
# ... (key schema, attributes)
StreamSpecification={
'StreamEnabled': True,
'StreamViewType': 'NEW_AND_OLD_IMAGES' # Full document before/after
}
)
```
### Stream View Types
- `KEYS_ONLY`: Only key attributes
- `NEW_IMAGE`: Entire item after update
- `OLD_IMAGE`: Entire item before update
- `NEW_AND_OLD_IMAGES`: Both before and after
### Process Streams with Lambda
```python
# Lambda function triggered by DynamoDB Stream
def lambda_handler(event, context):
for record in event['Records']:
if record['eventName'] == 'INSERT':
new_item = record['dynamodb']['NewImage']
print(f"New order: {new_item}")
# Send notification
send_email(new_item['email']['S'], "Order Confirmed")
elif record['eventName'] == 'MODIFY':
old_item = record['dynamodb']['OldImage']
new_item = record['dynamodb']['NewImage']
# Check if status changed
if old_item['status']['S'] != new_item['status']['S']:
print(f"Status changed: {old_item['status']['S']} -> {new_item['status']['S']}")
elif record['eventName'] == 'REMOVE':
old_item = record['dynamodb']['OldImage']
print(f"Order deleted: {old_item}")
```
---
## Pricing Optimization
### Billing Modes
| Mode | Use Case | Cost |
|------|----------|------|
| **On-Demand** | Unpredictable traffic, dev/test | ~$1.25/million writes, $0.25/million reads (us-east-1; verify current rates) |
| **Provisioned** | Predictable traffic | $0.47/WCU/month, $0.09/RCU/month |
| **Reserved** | Steady workloads (1-3 year) | Save up to 77% |
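A back-of-the-envelope comparison shows why utilization decides the billing mode. The figures below are the illustrative rates from the table; treat them as placeholders and check current regional pricing:

```python
# Illustrative monthly cost comparison (prices change; verify current rates)
ON_DEMAND_WRITE = 1.25 / 1_000_000   # $ per write request
PROVISIONED_WCU = 0.47               # $ per WCU per month
SECONDS_PER_MONTH = 30 * 24 * 3600

def on_demand_cost(writes_per_month: int) -> float:
    return writes_per_month * ON_DEMAND_WRITE

def provisioned_cost(peak_writes_per_sec: int) -> float:
    # Provisioned capacity must cover the peak, billed whether used or not
    return peak_writes_per_sec * PROVISIONED_WCU

# Steady 10 writes/sec all month: provisioned wins easily
writes = 10 * SECONDS_PER_MONTH
print(f"on-demand:   ${on_demand_cost(writes):.2f}")    # $32.40
print(f"provisioned: ${provisioned_cost(10):.2f}")      # $4.70

# Same total writes but a 200/sec peak: provisioned must cover the spike
print(f"provisioned at peak: ${provisioned_cost(200):.2f}")  # $94.00
```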
### Cost Optimization Strategies
```python
# 1. Use BatchGetItem (up to 100 items)
response = dynamodb.batch_get_item(
RequestItems={
'Users': {
'Keys': [
{'userId': 'user1'},
{'userId': 'user2'},
{'userId': 'user3'}
]
}
}
)
# 2. Use BatchWriteItem (up to 25 items)
dynamodb.batch_write_item(
RequestItems={
'Users': [
{'PutRequest': {'Item': {'userId': 'user1', 'name': 'User 1'}}},
{'PutRequest': {'Item': {'userId': 'user2', 'name': 'User 2'}}}
]
}
)
# 3. Use projection expressions (fetch only needed attributes)
response = table.query(
KeyConditionExpression=Key('PK').eq('USER#12345'),
ProjectionExpression='email, #name, createdAt',
ExpressionAttributeNames={'#name': 'name'}
)
# 4. Use eventually consistent reads (half the RCUs)
response = table.get_item(
Key={'userId': 'user123'},
ConsistentRead=False # Default
)
# 5. Use TTL for auto-expiring data (free deletes!)
table.meta.client.update_time_to_live(
TableName='Sessions',
TimeToLiveSpecification={
'Enabled': True,
'AttributeName': 'ttl'
}
)
# Add TTL to item (Unix timestamp)
import time
ttl_value = int(time.time()) + 86400 # Expire in 24 hours
table.put_item(Item={
'sessionId': 'session123',
'ttl': ttl_value
})
```
---
## Performance Best Practices
### Design for Even Distribution
```python
# GOOD: High cardinality partition key
PK: f"USER#{uuid.uuid4()}"
# GOOD: Composite key with multiple dimensions
PK: f"TENANT#{tenant_id}#USER#{user_id}"
# BAD: Low cardinality (hot partitions)
PK: f"STATUS#{status}" # Only 3 values: active, pending, archived
```
### Use Sparse Indexes
```python
# Only items that HAVE the GSI key attribute are indexed (saves storage).
# GSI key attributes must be string, number, or binary -- booleans are not
# allowed, so use e.g. a timestamp string instead of emailVerified: True
# GSI: PK = verifiedAt
table.put_item(Item={
'userId': 'user123',
'email': '[email protected]',
'verifiedAt': '2025-11-28T10:00:00Z'  # This item appears in GSI
})
table.put_item(Item={
'userId': 'user456',
'email': '[email protected]'
# No verifiedAt -> this item is NOT in the GSI
})
```
### Adjacent Item Pattern (Reduce Queries)
```python
# Store related items with adjacent sort keys
# Query once, get all related data
# User metadata
PK: "USER#12345", SK: "A#METADATA"
# User's addresses
PK: "USER#12345", SK: "B#ADDRESS#001"
PK: "USER#12345", SK: "B#ADDRESS#002"
# User's orders (most recent first)
PK: "USER#12345", SK: "C#ORDER#2025-11-28#001"
PK: "USER#12345", SK: "C#ORDER#2025-11-27#002"
# One query gets everything
response = table.query(
KeyConditionExpression=Key('PK').eq('USER#12345')
)
```
### Composite Sort Keys
```python
# Hierarchical data in sort key
SK: "COUNTRY#USA#STATE#MA#CITY#Boston"
# Query all items in USA
Key('SK').begins_with('COUNTRY#USA')
# Query all items in Massachusetts
Key('SK').begins_with('COUNTRY#USA#STATE#MA')
# Query all items in Boston
Key('SK').begins_with('COUNTRY#USA#STATE#MA#CITY#Boston')
```
---
## TypeScript Examples (AWS SDK v3)
```typescript
import {
DynamoDBClient,
PutItemCommand,
GetItemCommand,
QueryCommand,
UpdateItemCommand
} from '@aws-sdk/client-dynamodb'
import { marshall, unmarshall } from '@aws-sdk/util-dynamodb'
const client = new DynamoDBClient({ region: 'us-east-1' })
// Put item
await client.send(new PutItemCommand({
TableName: 'Users',
Item: marshall({
userId: 'user123',
email: '[email protected]',
name: 'Jane Doe',
createdAt: new Date().toISOString()
})
}))
// Get item
const response = await client.send(new GetItemCommand({
TableName: 'Users',
Key: marshall({ userId: 'user123' })
}))
const user = unmarshall(response.Item!)
// Query
const queryResponse = await client.send(new QueryCommand({
TableName: 'Orders',
KeyConditionExpression: 'PK = :pk',
ExpressionAttributeValues: marshall({
':pk': 'USER#12345'
})
}))
const items = queryResponse.Items!.map(item => unmarshall(item))
// Update
await client.send(new UpdateItemCommand({
TableName: 'Users',
Key: marshall({ userId: 'user123' }),
UpdateExpression: 'SET #name = :name',
ExpressionAttributeNames: { '#name': 'name' },
ExpressionAttributeValues: marshall({ ':name': 'Jane Smith' })
}))
```
---
## Common Patterns
### Multi-Tenant Architecture
```python
# Partition by tenant, sort by entity
PK: "TENANT#abc"
SK: "USER#12345"
# Query all users in tenant (the partition key only supports equality;
# begins_with is valid only on the sort key)
response = table.query(
KeyConditionExpression=Key('PK').eq('TENANT#abc') & Key('SK').begins_with('USER#')
)
```
### Time-Series Data
```python
# Partition by entity, sort by timestamp
PK: "SENSOR#sensor123"
SK: "2025-11-28T10:30:45.123Z"
# Query range
response = table.query(
KeyConditionExpression=Key('PK').eq('SENSOR#sensor123') &
Key('SK').between('2025-11-28T00:00:00Z', '2025-11-28T23:59:59Z')
)
# Use GSI for cross-sensor queries
# GSI: PK = date, SK = sensorId
```
### Versioning
```python
# Store versions with sort key
PK: "DOC#doc123"
SK: "VERSION#001"
SK: "VERSION#002"
SK: "VERSION#003"
# Get latest version
response = table.query(
KeyConditionExpression=Key('PK').eq('DOC#doc123'),
ScanIndexForward=False, # Descending order
Limit=1
)
```
---
This guide covers DynamoDB single-table design and AWS-specific patterns. For serverless DynamoDB examples, see `../examples/dynamodb-serverless/`.
```
### references/firestore.md
```markdown
# Firestore Complete Guide
Firebase/GCP Firestore real-time sync, security rules, and mobile-first patterns.
## Table of Contents
- [Core Concepts](#core-concepts)
- [Data Model](#data-model)
- [Real-Time Listeners](#real-time-listeners)
- [Security Rules](#security-rules)
- [Queries](#queries)
- [Offline Support](#offline-support)
- [Performance Best Practices](#performance-best-practices)
---
## Core Concepts
### Firestore vs Realtime Database
| Feature | Firestore | Realtime Database |
|---------|-----------|-------------------|
| **Data Model** | Collections & documents | JSON tree |
| **Queries** | Rich queries, indexes | Limited queries |
| **Scaling** | Automatic | Manual sharding |
| **Offline** | Full offline support | Limited |
| **Pricing** | Per operation | Per GB downloaded |
### Key Features
- **Real-time sync**: Live updates across all clients
- **Offline-first**: Local cache, auto-sync when online
- **Security rules**: Declarative access control
- **Atomic operations**: Batched writes, transactions
- **Automatic indexing**: Composite indexes for queries
---
## Data Model
### Collections and Documents
```
users (collection)
├── user123 (document)
│ ├── email: "[email protected]"
│ ├── name: "Jane Doe"
│ └── orders (subcollection)
│ ├── order001 (document)
│ └── order002 (document)
└── user456 (document)
└── ...
```
### Document Structure
```typescript
// Document in users collection
{
// Auto-generated ID
id: "user123",
// Document data
email: "[email protected]",
name: "Jane Doe",
age: 32,
address: {
street: "123 Main St",
city: "Boston",
state: "MA"
},
tags: ["premium", "verified"],
createdAt: Timestamp,
metadata: {
lastLogin: Timestamp,
loginCount: 42
}
}
```
**Document Limits:**
- Max size: 1 MB
- Max depth: 20 levels
- Max field name: 1,500 bytes
---
## Real-Time Listeners
### React Component with Real-Time Updates
```typescript
import { collection, query, where, onSnapshot } from 'firebase/firestore'
import { useEffect, useState } from 'react'
function OrderList({ userId }: { userId: string }) {
const [orders, setOrders] = useState([])
const [loading, setLoading] = useState(true)
useEffect(() => {
const q = query(
collection(db, 'orders'),
where('userId', '==', userId),
where('status', '==', 'pending')
)
// Real-time listener
const unsubscribe = onSnapshot(q, (snapshot) => {
const orderData = snapshot.docs.map(doc => ({
id: doc.id,
...doc.data()
}))
setOrders(orderData)
setLoading(false)
}, (error) => {
console.error('Error:', error)
})
// Cleanup on unmount
return () => unsubscribe()
}, [userId])
if (loading) return <div>Loading...</div>
return (
<div>
{orders.map(order => (
<div key={order.id}>{order.orderNumber}</div>
))}
</div>
)
}
```
### Listening to Document Changes
```typescript
import { doc, onSnapshot } from 'firebase/firestore'
// Listen to single document
const unsubscribe = onSnapshot(doc(db, 'users', userId), (doc) => {
if (doc.exists()) {
console.log('User data:', doc.data())
}
})
// Detect change type
onSnapshot(collection(db, 'orders'), (snapshot) => {
snapshot.docChanges().forEach((change) => {
if (change.type === 'added') {
console.log('New order:', change.doc.data())
}
if (change.type === 'modified') {
console.log('Modified order:', change.doc.data())
}
if (change.type === 'removed') {
console.log('Removed order:', change.doc.data())
}
})
})
```
---
## Security Rules
### Basic Rules Structure
```javascript
// firestore.rules
rules_version = '2';
service cloud.firestore {
match /databases/{database}/documents {
// Rules go here
}
}
```
### Common Patterns
```javascript
rules_version = '2';
service cloud.firestore {
match /databases/{database}/documents {
// 1. Public read, authenticated write
match /products/{productId} {
allow read: if true;
allow write: if request.auth != null;
}
// 2. User can only access their own data
match /users/{userId} {
allow read, write: if request.auth != null && request.auth.uid == userId;
}
// 3. Orders: users can only see their own
match /orders/{orderId} {
// Read: must be authenticated and own the order
allow read: if request.auth != null &&
resource.data.userId == request.auth.uid;
// Create: must be authenticated and set userId to their own ID
allow create: if request.auth != null &&
request.resource.data.userId == request.auth.uid;
// Update/Delete: must own the order
allow update, delete: if request.auth != null &&
resource.data.userId == request.auth.uid;
}
// 4. Admin-only writes
match /config/{document} {
allow read: if true;
allow write: if request.auth != null &&
request.auth.token.admin == true;
}
// 5. Validate data on write
match /posts/{postId} {
allow create: if request.auth != null &&
request.resource.data.title is string &&
request.resource.data.title.size() > 0 &&
request.resource.data.title.size() <= 100 &&
request.resource.data.userId == request.auth.uid;
}
// 6. Subcollection access
match /users/{userId}/orders/{orderId} {
allow read, write: if request.auth != null && request.auth.uid == userId;
}
// 7. Rate limiting (prevent abuse). On create, resource.data does not
// exist yet, so read the author's profile doc for the last post time
match /posts/{postId} {
allow create: if request.auth != null &&
request.time > get(/databases/$(database)/documents/users/$(request.auth.uid)).data.lastPost +
duration.value(1, 'm');
}
}
}
```
### Rule Functions
```javascript
// rules_version must be the first statement in the file
rules_version = '2';
service cloud.firestore {
match /databases/{database}/documents {
// Helper functions
function isSignedIn() {
return request.auth != null;
}
function isOwner(userId) {
return request.auth.uid == userId;
}
function isAdmin() {
return isSignedIn() && request.auth.token.admin == true;
}
function validString(field, minLen, maxLen) {
let value = request.resource.data[field];
return value is string &&
value.size() >= minLen &&
value.size() <= maxLen;
}
match /posts/{postId} {
allow read: if true;
allow create: if isSignedIn() &&
isOwner(request.resource.data.userId) &&
validString('title', 1, 100);
}
}
}
```
---
## Queries
### Basic Queries
```typescript
import { collection, query, where, or, orderBy, limit, startAfter, getDocs } from 'firebase/firestore'
// Simple equality
const q = query(
collection(db, 'orders'),
where('userId', '==', 'user123')
)
const snapshot = await getDocs(q)
snapshot.forEach(doc => console.log(doc.data()))
// Multiple conditions (AND)
const q = query(
collection(db, 'products'),
where('category', '==', 'electronics'),
where('price', '<=', 500),
orderBy('price', 'asc')
)
// OR queries (requires composite index)
const q = query(
collection(db, 'products'),
or(
where('category', '==', 'electronics'),
where('category', '==', 'books')
)
)
// In queries (up to 30 values in current SDKs; formerly 10)
const q = query(
collection(db, 'products'),
where('category', 'in', ['electronics', 'books', 'toys'])
)
// Array contains
const q = query(
collection(db, 'users'),
where('tags', 'array-contains', 'premium')
)
// Array contains any
const q = query(
collection(db, 'users'),
where('tags', 'array-contains-any', ['premium', 'verified'])
)
```
### Pagination
```typescript
// First page
const first = query(
collection(db, 'products'),
orderBy('price'),
limit(20)
)
const snapshot = await getDocs(first)
const lastVisible = snapshot.docs[snapshot.docs.length - 1]
// Next page
const next = query(
collection(db, 'products'),
orderBy('price'),
startAfter(lastVisible),
limit(20)
)
```
### Composite Indexes
Firestore automatically creates single-field indexes. Composite indexes are required for:
- Multiple inequality/range filters
- An equality or inequality filter combined with orderBy on a different field
- OR queries across different fields
```typescript
// Requires composite index: category (asc), price (asc)
const q = query(
collection(db, 'products'),
where('category', '==', 'electronics'),
orderBy('price', 'asc')
)
```
**Create index via Firebase Console or CLI:**
```bash
firebase deploy --only firestore:indexes
```
---
## CRUD Operations
### Create
```typescript
import { collection, addDoc, setDoc, doc } from 'firebase/firestore'
// Auto-generate ID
const docRef = await addDoc(collection(db, 'users'), {
email: '[email protected]',
name: 'Jane Doe',
createdAt: new Date()
})
console.log('Created with ID:', docRef.id)
// Custom ID
await setDoc(doc(db, 'users', 'user123'), {
email: '[email protected]',
name: 'Jane Doe'
})
// Merge (update if exists, create if not)
await setDoc(doc(db, 'users', 'user123'), {
lastLogin: new Date()
}, { merge: true })
```
### Read
```typescript
import { doc, getDoc, collection, getDocs } from 'firebase/firestore'
// Get single document
const docSnap = await getDoc(doc(db, 'users', 'user123'))
if (docSnap.exists()) {
console.log(docSnap.data())
}
// Get all documents in collection
const querySnapshot = await getDocs(collection(db, 'users'))
querySnapshot.forEach(doc => console.log(doc.id, doc.data()))
```
### Update
```typescript
import { doc, updateDoc, increment, arrayUnion, arrayRemove, serverTimestamp } from 'firebase/firestore'
// Update fields
await updateDoc(doc(db, 'users', 'user123'), {
name: 'Jane Smith',
'address.city': 'Boston' // Nested field
})
// Increment counter
await updateDoc(doc(db, 'posts', 'post123'), {
views: increment(1)
})
// Add to array (no duplicates)
await updateDoc(doc(db, 'users', 'user123'), {
tags: arrayUnion('premium')
})
// Remove from array
await updateDoc(doc(db, 'users', 'user123'), {
tags: arrayRemove('trial')
})
// Server timestamp
await updateDoc(doc(db, 'users', 'user123'), {
updatedAt: serverTimestamp()
})
```
### Delete
```typescript
import { doc, deleteDoc, deleteField } from 'firebase/firestore'
// Delete document
await deleteDoc(doc(db, 'users', 'user123'))
// Delete field
await updateDoc(doc(db, 'users', 'user123'), {
phoneNumber: deleteField()
})
```
---
## Transactions and Batches
### Transactions (Atomic Reads and Writes)
```typescript
import { doc, increment, runTransaction } from 'firebase/firestore'
// Transfer credits between users
await runTransaction(db, async (transaction) => {
const fromRef = doc(db, 'users', 'user123')
const toRef = doc(db, 'users', 'user456')
const fromDoc = await transaction.get(fromRef)
if (!fromDoc.exists()) {
throw new Error('User not found')
}
const currentBalance = fromDoc.data().credits
if (currentBalance < 100) {
throw new Error('Insufficient credits')
}
transaction.update(fromRef, { credits: currentBalance - 100 })
transaction.update(toRef, { credits: increment(100) })
})
```
### Batched Writes
```typescript
import { writeBatch } from 'firebase/firestore'
// Batch write (up to 500 operations)
const batch = writeBatch(db)
batch.set(doc(db, 'users', 'user1'), { name: 'User 1' })
batch.update(doc(db, 'users', 'user2'), { active: true })
batch.delete(doc(db, 'users', 'user3'))
await batch.commit()
```
---
## Offline Support
### Enable Offline Persistence
```typescript
import { initializeFirestore, persistentLocalCache } from 'firebase/firestore'
const db = initializeFirestore(app, {
localCache: persistentLocalCache()
})
```
### Offline Behavior
```typescript
import { onSnapshot } from 'firebase/firestore'
// Listener works offline
onSnapshot(collection(db, 'orders'), (snapshot) => {
snapshot.forEach(doc => {
// fromCache indicates if data is from local cache
console.log(`${doc.id} (from cache: ${doc.metadata.fromCache})`)
})
})
// Writes queued offline, synced when online
await addDoc(collection(db, 'orders'), {
userId: 'user123',
items: [...]
})
// If offline, write is queued and will sync when online
```
---
## Performance Best Practices
### Minimize Document Reads
```typescript
// BAD: Read same document multiple times
const userDoc = await getDoc(doc(db, 'users', userId))
// ... later ...
const userDoc2 = await getDoc(doc(db, 'users', userId)) // Duplicate read!
// GOOD: Cache document in memory
const userDoc = await getDoc(doc(db, 'users', userId))
const userData = userDoc.data()
// Use userData throughout component
```
### Use Subcollections for Large Arrays
```typescript
// BAD: Store unbounded array in document
{
userId: "user123",
orders: [
{ orderId: "order1", ... },
{ orderId: "order2", ... },
// ... 1000s of orders (exceeds 1MB limit!)
]
}
// GOOD: Use subcollection
// users/user123/orders/order1
// users/user123/orders/order2
const ordersRef = collection(db, 'users', userId, 'orders')
```
### Denormalize for Read Performance
```typescript
// Store frequently accessed data together
{
postId: "post123",
title: "My Post",
content: "...",
// Denormalize author info (instead of reference)
author: {
id: "user123",
name: "Jane Doe",
avatar: "/avatars/jane.jpg"
},
// Update pattern: when user updates profile, update all posts
}
```
### Use Server Timestamps
```typescript
import { serverTimestamp } from 'firebase/firestore'
// Better than client timestamp (avoids clock skew)
await addDoc(collection(db, 'posts'), {
title: 'My Post',
createdAt: serverTimestamp()
})
```
---
## Mobile Integration (React Native)
```typescript
import { initializeApp } from 'firebase/app'
import { getFirestore, collection, query, where, onSnapshot } from 'firebase/firestore'
import { useEffect, useState } from 'react'
import { FlatList } from 'react-native'
const firebaseConfig = {
apiKey: "...",
authDomain: "...",
projectId: "..."
}
const app = initializeApp(firebaseConfig)
const db = getFirestore(app)
function OrdersScreen({ userId }) {
const [orders, setOrders] = useState([])
useEffect(() => {
// where() must be wrapped in query(); it can't be passed to onSnapshot directly
const q = query(
collection(db, 'orders'),
where('userId', '==', userId)
)
const unsubscribe = onSnapshot(q, (snapshot) => {
const orderData = snapshot.docs.map(doc => ({
id: doc.id,
...doc.data()
}))
setOrders(orderData)
})
return () => unsubscribe()
}, [userId])
return (
<FlatList
data={orders}
keyExtractor={item => item.id}
renderItem={({ item }) => <OrderItem order={item} />}
/>
)
}
```
---
This guide covers Firestore real-time patterns and mobile-first architecture. For complete React implementation, see `../examples/firestore-react/`.
```
### references/schema-design-patterns.md
```markdown
# Schema Design Patterns
Embedding vs referencing decision framework for document databases.
## Table of Contents
- [Core Decision Matrix](#core-decision-matrix)
- [Embedding Pattern](#embedding-pattern)
- [Referencing Pattern](#referencing-pattern)
- [Hybrid Pattern](#hybrid-pattern)
- [Denormalization Strategies](#denormalization-strategies)
- [Anti-Patterns](#anti-patterns)
---
## Core Decision Matrix
```
┌─────────────────────────────────────────────────────────────────┐
│ EMBEDDING VS REFERENCING DECISION TREE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ RELATIONSHIP TYPE? │
│ ├── ONE-TO-FEW (1 to <10) │
│ │ → EMBED │
│ │ Example: User → Addresses (2-3 max) │
│ │ { │
│ │ userId: "123", │
│ │ addresses: [ │
│ │ { type: "home", street: "123 Main" }, │
│ │ { type: "work", street: "456 Office" } │
│ │ ] │
│ │ } │
│ │ │
│ ├── ONE-TO-MANY (10 to 1000) │
│ │ → HYBRID (embed summary, reference details) │
│ │ Example: Blog Post → Comments │
│ │ { │
│ │ postId: "post123", │
│ │ recentComments: [/* last 10 */], │
│ │ commentCount: 245 // Reference rest in comments coll │
│ │ } │
│ │ │
│ ├── ONE-TO-MILLIONS │
│ │ → REFERENCE │
│ │ Example: User → Events (logging) │
│ │ Users: { userId: "123" } │
│ │ Events: { userId: "123", timestamp: "...", type: "..." } │
│ │ │
│ └── MANY-TO-MANY │
│ → REFERENCE │
│ Example: Products ↔ Categories │
│ Products: { productId: "p1", categoryIds: ["c1", "c2"] } │
│ Categories: { categoryId: "c1", name: "Electronics" } │
│ │
└─────────────────────────────────────────────────────────────────┘
```
### Decision Factors
| Factor | Embed | Reference |
|--------|-------|-----------|
| **Relationship Size** | Few (<10) | Many (>100) |
| **Access Pattern** | Always together | Sometimes separate |
| **Update Frequency** | Child rarely changes | Child frequently changes |
| **Data Growth** | Bounded | Unbounded |
| **Query Pattern** | Parent + children | Children independent queries |
| **Consistency** | Always in sync | May be stale |
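The tree above can be condensed into a tiny helper. This is a sketch, not a library API — the thresholds (10 and 1000) come straight from the matrix and are heuristics, not hard limits:

```python
def choose_pattern(children_per_parent, many_to_many=False):
    """Map the decision tree to a storage pattern (thresholds are heuristics)."""
    if many_to_many:
        return "reference"
    if children_per_parent < 10:
        return "embed"      # one-to-few
    if children_per_parent <= 1000:
        return "hybrid"     # embed summary, reference details
    return "reference"      # one-to-millions

assert choose_pattern(3) == "embed"                  # user -> addresses
assert choose_pattern(245) == "hybrid"               # post -> comments
assert choose_pattern(1_000_000) == "reference"      # user -> events
assert choose_pattern(2, many_to_many=True) == "reference"
```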
---
## Embedding Pattern
### When to Embed
1. **One-to-few relationships** (< 10 subdocuments)
2. **Data always accessed together**
3. **Child data rarely changes**
4. **No need to query children independently**
5. **Strong consistency required**
### Example 1: User Profile
```javascript
// Good: Embed addresses (bounded, always accessed together)
{
_id: ObjectId("..."),
email: "[email protected]",
name: "Jane Doe",
// Embed addresses (few, fixed size)
addresses: [
{
type: "home",
street: "123 Main St",
city: "Boston",
state: "MA",
zip: "02101",
default: true
},
{
type: "work",
street: "456 Business Ave",
city: "Boston",
state: "MA",
zip: "02102",
default: false
}
],
// Embed preferences (key-value pairs)
preferences: {
theme: "dark",
language: "en-US",
notifications: {
email: true,
sms: false,
push: true
}
},
metadata: {
createdAt: ISODate("2025-01-15"),
lastLogin: ISODate("2025-11-28"),
loginCount: 42
}
}
```
**Benefits:**
- One query gets everything
- Atomic updates (update user and addresses together)
- No joins needed
- Strong consistency
**Query:**
```javascript
// Single query gets user with all addresses
const user = await db.users.findOne({ email: "[email protected]" })
console.log(user.addresses) // Immediately available
```
### Example 2: Product with Specifications
```javascript
{
_id: ObjectId("..."),
sku: "WGT-PRO-001",
name: "Widget Pro",
price: 49.99,
category: "widgets",
// Embed specifications (varies by product type)
specifications: {
weight: "2.5 lbs",
dimensions: {
length: 10,
width: 8,
height: 3,
unit: "inches"
},
color: "silver",
material: "aluminum",
warranty: "2 years"
},
// Embed images (few, fixed URLs)
images: [
{ url: "/images/wgt-pro-001-front.jpg", type: "front" },
{ url: "/images/wgt-pro-001-side.jpg", type: "side" }
],
inventory: 245,
tags: ["professional", "bestseller", "new"]
}
```
---
## Referencing Pattern
### When to Reference
1. **One-to-many or many-to-many** (unbounded growth)
2. **Children queried independently**
3. **Child data frequently changes**
4. **Large subdocuments** (approaching 16MB limit)
5. **Different access patterns for parent/children**
### Example 1: Blog Posts and Comments
```javascript
// Posts collection
{
_id: ObjectId("..."),
title: "MongoDB Schema Design",
slug: "mongodb-schema-design",
content: "...",
authorId: ObjectId("..."), // Reference to users collection
tags: ["mongodb", "database", "schema"],
publishedAt: ISODate("2025-11-15"),
// Summary stats (denormalized)
stats: {
views: 1247,
likes: 89,
commentCount: 23 // Count, not actual comments
}
}
// Comments collection (separate)
{
_id: ObjectId("..."),
postId: ObjectId("..."), // Reference to posts
authorId: ObjectId("..."), // Reference to users
content: "Great article!",
createdAt: ISODate("2025-11-16")
}
// Users collection (separate)
{
_id: ObjectId("..."),
username: "janedoe",
email: "[email protected]",
avatar: "/avatars/jane.jpg"
}
```
**Query Pattern:**
```javascript
// Get post
const post = await db.posts.findOne({ slug: "mongodb-schema-design" })
// Get comments for post (separate query)
const comments = await db.comments.find({ postId: post._id }).toArray()
// Join with aggregation pipeline
const postsWithComments = await db.posts.aggregate([
{ $match: { slug: "mongodb-schema-design" }},
{ $lookup: {
from: "comments",
localField: "_id",
foreignField: "postId",
as: "comments"
}}
]).toArray()
```
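`$lookup` performs a left outer join. The same shape in plain Python, with made-up sample data, shows what the `as: "comments"` array contains:

```python
posts = [{"_id": 1, "slug": "mongodb-schema-design"}]
comments = [
    {"_id": 10, "postId": 1, "content": "Great article!"},
    {"_id": 11, "postId": 1, "content": "Very helpful"},
    {"_id": 12, "postId": 2, "content": "Other post"},
]

# Left outer join: every post keeps an array of matching comments (possibly empty)
joined = [
    {**p, "comments": [c for c in comments if c["postId"] == p["_id"]]}
    for p in posts
]
assert len(joined[0]["comments"]) == 2
assert joined[0]["comments"][0]["content"] == "Great article!"
```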
### Example 2: E-commerce Orders
```javascript
// Orders collection
{
_id: ObjectId("..."),
orderNumber: "ORD-2025-001234",
userId: ObjectId("..."), // Reference to users
items: [
{
productId: ObjectId("..."), // Reference to products
quantity: 2,
priceAtPurchase: 49.99, // Denormalized (historical)
name: "Widget Pro" // Denormalized (performance)
},
{
productId: ObjectId("..."),
quantity: 1,
priceAtPurchase: 149.99,
name: "Gadget Ultra"
}
],
// Snapshot of shipping address (not reference)
shippingAddress: {
street: "123 Main St",
city: "Boston",
state: "MA",
zip: "02101"
},
totalAmount: 249.97,
status: "shipped",
orderDate: ISODate("2025-11-25"),
shippedDate: ISODate("2025-11-26")
}
// Products collection (referenced)
{
_id: ObjectId("..."),
name: "Widget Pro",
price: 49.99, // Current price (may differ from order)
inventory: 245,
category: "widgets"
}
```
---
## Hybrid Pattern
### Strategy: Embed Summary, Reference Details
Best for one-to-many relationships where you need both summary and detail access.
### Example 1: Blog Post with Comment Preview
```javascript
// Post with embedded recent comments
{
_id: ObjectId("..."),
title: "MongoDB Schema Design",
content: "...",
// Embed recent comments for preview
recentComments: [
{
commentId: ObjectId("..."),
authorName: "John Smith",
content: "Great article!",
createdAt: ISODate("2025-11-28")
},
{
commentId: ObjectId("..."),
authorName: "Alice Johnson",
content: "Very helpful, thanks!",
createdAt: ISODate("2025-11-27")
}
// Last 5-10 comments embedded
],
commentCount: 245, // Total count
// Full comments in separate collection
}
// Comments collection (all comments)
{
_id: ObjectId("..."),
postId: ObjectId("..."),
authorId: ObjectId("..."),
authorName: "John Smith", // Denormalized for performance
content: "Great article!",
createdAt: ISODate("2025-11-28")
}
```
**Update Pattern:**
```javascript
// When new comment added:
// 1. Insert into comments collection
const result = await db.comments.insertOne({
postId: postId,
authorId: userId,
authorName: "New User",
content: "New comment"
})
// 2. Update post with recent comments + count
await db.posts.updateOne(
{ _id: postId },
{
$push: {
recentComments: {
$each: [{
commentId: result.insertedId,
authorName: "New User",
content: "New comment",
createdAt: new Date()
}],
$position: 0, // Add to beginning
$slice: 10 // Keep only 10 most recent
}
},
$inc: { commentCount: 1 }
}
)
```
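The `$push` with `$position: 0` and `$slice: 10` keeps a newest-first list capped at 10 entries. A plain-Python model of that semantics (illustrative only):

```python
def push_recent(recent, comment, cap=10):
    """Mimic $push with $position: 0 (prepend) and $slice: cap (keep first cap)."""
    return ([comment] + recent)[:cap]

recent = []
for i in range(15):
    recent = push_recent(recent, {"content": f"comment {i}"})

assert len(recent) == 10                     # capped at 10
assert recent[0]["content"] == "comment 14"  # newest first
```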
### Example 2: Product with Review Stats
```javascript
{
_id: ObjectId("..."),
name: "Widget Pro",
price: 49.99,
// Embed review summary
reviews: {
averageRating: 4.6,
totalCount: 1247,
ratingDistribution: {
5: 892,
4: 245,
3: 78,
2: 21,
1: 11
},
// Embed featured reviews
featured: [
{
reviewId: ObjectId("..."),
rating: 5,
title: "Excellent product!",
excerpt: "This widget exceeded my expectations...",
author: "John Smith",
createdAt: ISODate("2025-11-20")
}
// 3-5 featured reviews
]
}
// Full reviews in separate collection
}
```
---
## Denormalization Strategies
### When to Denormalize
1. **Read-heavy workloads** (optimize reads over writes)
2. **Frequently accessed together**
3. **Historical snapshots** (prices, names at transaction time)
4. **Reduce query complexity** (avoid aggregation lookups)
### Example 1: Social Media Post
```javascript
{
_id: ObjectId("..."),
content: "Check out this amazing feature!",
authorId: ObjectId("..."), // Reference
// Denormalize author info (avoid join on every post read)
authorName: "Jane Doe", // Duplicated from users
authorAvatar: "/avatars/jane.jpg", // Duplicated from users
authorVerified: true, // Duplicated from users
createdAt: ISODate("2025-11-28"),
likes: 42,
comments: 17,
shares: 8
}
```
**Update Pattern:**
```javascript
// When user updates profile, update all their posts
await db.users.updateOne(
{ _id: userId },
{ $set: { name: "Jane Smith", avatar: "/avatars/jane-new.jpg" }}
)
// Propagate to posts (eventual consistency acceptable)
await db.posts.updateMany(
{ authorId: userId },
{ $set: {
authorName: "Jane Smith",
authorAvatar: "/avatars/jane-new.jpg"
}}
)
```
### Example 2: Order Historical Data
```javascript
{
_id: ObjectId("..."),
orderNumber: "ORD-2025-001234",
items: [
{
productId: ObjectId("..."),
// Denormalize product info (snapshot at purchase time)
name: "Widget Pro", // May change in products collection
sku: "WGT-PRO-001",
priceAtPurchase: 49.99, // Current price may differ
category: "widgets",
imageUrl: "/images/wgt.jpg"
}
],
totalAmount: 249.97
}
```
**Why?** Order shows what customer actually bought. If product name or price changes, order should reflect historical data.
---
## Anti-Patterns
### ❌ Unbounded Arrays
```javascript
// BAD: Array grows indefinitely
{
userId: "user123",
events: [/* 10,000+ events */] // Document too large! (>16MB limit)
}
// GOOD: Use references
{
userId: "user123",
recentEvents: [/* Last 10 events */] // Embed recent
}
// Store full history in separate collection
db.events.find({ userId: "user123" })
```
### ❌ Deep Nesting
```javascript
// BAD: Too many levels (hard to query, update)
{
company: {
departments: [
{
name: "Engineering",
teams: [
{
name: "Backend",
members: [
{
name: "Jane",
projects: [
{ name: "API", tasks: [...] }
]
}
]
}
]
}
]
}
}
// GOOD: Flatten with references
// Companies collection
{ _id: "comp1", name: "Acme Corp" }
// Departments collection
{ _id: "dept1", companyId: "comp1", name: "Engineering" }
// Teams collection
{ _id: "team1", departmentId: "dept1", name: "Backend" }
// Members collection
{ _id: "mem1", teamId: "team1", name: "Jane" }
```
### ❌ Massive Documents
```javascript
// BAD: Single document with all user data
{
userId: "user123",
profile: {...},
orders: [/* 1000s of orders */],
messages: [/* 10000s of messages */],
events: [/* 100000s of events */]
}
// Document size: >16MB (MongoDB limit!)
// GOOD: Separate collections
// Users: { userId: "user123", profile: {...} }
// Orders: { userId: "user123", ... }
// Messages: { userId: "user123", ... }
// Events: { userId: "user123", ... }
```
### ❌ Premature Referencing
```javascript
// BAD: Reference for small, bounded data
{
userId: "user123",
addressId: ObjectId("...") // Reference to addresses collection
}
// Requires join for every user fetch!
// GOOD: Embed small, bounded data
{
userId: "user123",
addresses: [
{ type: "home", street: "123 Main" } // Max 2-3 addresses
]
}
```
---
## Schema Evolution
### Adding Fields (Schema-less Advantage)
```javascript
// Old documents
{ userId: "user123", name: "Jane" }
// New documents (with new field)
{ userId: "user456", name: "John", phoneNumber: "555-1234" }
// Handle in application code
const user = await db.users.findOne({ userId: "user123" })
const phone = user.phoneNumber || null // Graceful handling
```
### Migrating Schema
```javascript
// Gradual migration (no downtime)
// 1. Application handles both old and new formats
// 2. Migrate documents over time
// Migration script
await db.users.updateMany(
{ version: { $exists: false }}, // Old documents
[
{
$set: {
fullName: { $concat: ["$firstName", " ", "$lastName"] },
version: 2
}
},
{ $unset: ["firstName", "lastName"] }
]
)
```
---
## Performance Considerations
### Document Size Limits
| Database | Max Document Size |
|----------|-------------------|
| **MongoDB** | 16 MB |
| **DynamoDB** | 400 KB |
| **Firestore** | 1 MB |
**Rule of thumb:** Keep documents < 1 MB for best performance.
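A quick way to sanity-check document size during development — this sketch uses JSON length as a rough proxy (BSON adds per-field type and name overhead, so treat the result as a lower bound):

```python
import json

LIMITS = {"mongodb": 16 * 1024 * 1024, "dynamodb": 400 * 1024, "firestore": 1 * 1024 * 1024}

def approx_size(doc):
    """Rough proxy for document size: UTF-8 JSON length."""
    return len(json.dumps(doc).encode("utf-8"))

doc = {"userId": "user123", "events": [{"type": "click", "page": "/checkout"}] * 1000}
assert approx_size(doc) < LIMITS["firestore"]                   # ~40 KB: fine everywhere
assert approx_size({"k": "x" * 500_000}) > LIMITS["dynamodb"]   # too big for DynamoDB
```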
### Read vs Write Optimization
| Pattern | Read Performance | Write Performance | Use Case |
|---------|------------------|-------------------|----------|
| **Embed** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | Read-heavy |
| **Reference** | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Write-heavy |
| **Denormalize** | ⭐⭐⭐⭐⭐ | ⭐⭐ | Read >> Writes |
---
This guide provides the decision framework for schema design. For database-specific patterns, see `mongodb.md`, `dynamodb.md`, `firestore.md`.
```
### references/indexing-strategies.md
```markdown
# MongoDB Indexing Strategies
Complete guide to index types, optimization, and best practices for document databases.
## Table of Contents
- [Index Types](#index-types)
- [Single Field Index](#single-field-index)
- [Compound Index](#compound-index)
- [Text Index (Full-Text Search)](#text-index-full-text-search)
- [Geospatial Index](#geospatial-index)
- [Partial Index (Index Subset)](#partial-index-index-subset)
- [TTL Index (Auto-Delete)](#ttl-index-auto-delete)
- [Sparse Index](#sparse-index)
- [Index Selection Rules](#index-selection-rules)
- [Covering Indexes](#covering-indexes)
- [Performance Optimization](#performance-optimization)
- [Index Intersection](#index-intersection)
- [Index Prefix Usage](#index-prefix-usage)
- [Monitoring and Analysis](#monitoring-and-analysis)
- [List Indexes](#list-indexes)
- [Index Usage Stats](#index-usage-stats)
- [Explain Query](#explain-query)
- [Slow Query Log](#slow-query-log)
- [Index Maintenance](#index-maintenance)
- [Rebuild Index](#rebuild-index)
- [Drop Unused Indexes](#drop-unused-indexes)
- [Best Practices](#best-practices)
- [Anti-Patterns](#anti-patterns)
- [Resources](#resources)
## Index Types
### Single Field Index
```javascript
db.users.createIndex({ email: 1 }, { unique: true }) // Ascending
db.posts.createIndex({ createdAt: -1 }) // Descending
```
### Compound Index
**Order matters!** Equality → Sort → Range (the ESR rule)
```javascript
// Query: WHERE status = 'active' AND createdAt > date ORDER BY createdAt
db.orders.createIndex({ status: 1, createdAt: -1 })
// Query: WHERE userId = 123 AND status = 'active' ORDER BY createdAt
db.orders.createIndex({ userId: 1, status: 1, createdAt: -1 })
```
### Text Index (Full-Text Search)
```javascript
db.articles.createIndex({
title: "text",
content: "text",
tags: "text"
}, {
weights: { title: 3, content: 1, tags: 2 }, // Title 3x more important
name: "article_text_index"
})
// Search
db.articles.find({ $text: { $search: "mongodb indexing" } })
.sort({ score: { $meta: "textScore" } })
```
### Geospatial Index
```javascript
db.locations.createIndex({ location: "2dsphere" })
// Find nearby
db.locations.find({
location: {
$near: {
$geometry: { type: "Point", coordinates: [-73.97, 40.77] },
$maxDistance: 5000, // 5km
}
}
})
```
### Partial Index (Index Subset)
```javascript
// Only index active users
db.users.createIndex(
{ email: 1 },
{ partialFilterExpression: { status: { $eq: "active" } } }
)
// Only index large orders
db.orders.createIndex(
{ userId: 1, createdAt: -1 },
{ partialFilterExpression: { amount: { $gte: 1000 } } }
)
```
### TTL Index (Auto-Delete)
```javascript
// Auto-delete sessions after 30 days
db.sessions.createIndex(
{ createdAt: 1 },
{ expireAfterSeconds: 2592000 } // 30 days
)
```
### Sparse Index
```javascript
// Only index documents with field
db.users.createIndex({ phone: 1 }, { sparse: true })
```
## Index Selection Rules
**Compound index order (the ESR rule):**
1. Equality filters first
2. Sort fields second
3. Range filters last
```javascript
// Query: status = X AND createdAt > Y ORDER BY createdAt
// Index: { status: 1, createdAt: -1 } ✓ Correct order
// Index: { createdAt: -1, status: 1 } ✗ Wrong order
```
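MongoDB's documentation phrases this as the ESR rule: equality fields first, then sort fields, then range fields (a field such as `createdAt` above can serve both sort and range). A sketch that composes an index key in that order — a hypothetical helper, not a driver API:

```python
def esr_index(equality, sort, range_):
    """Compose a compound-index key following the ESR rule (sketch)."""
    key = [(f, 1) for f in equality]       # equality fields first
    key += list(sort)                      # sort fields keep their direction
    # range fields last, skipping any field already covered by the sort
    key += [(f, 1) for f in range_ if all(f != existing for existing, _ in key)]
    return key

# WHERE userId = ? AND status = ? AND createdAt > ? ORDER BY createdAt DESC
key = esr_index(["userId", "status"], [("createdAt", -1)], ["createdAt"])
assert key == [("userId", 1), ("status", 1), ("createdAt", -1)]
```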
## Covering Indexes
Index includes all queried fields (no document fetch needed):
```javascript
// Query needs: userId, createdAt, amount
db.orders.createIndex({ userId: 1, createdAt: -1, amount: 1 })
// Query
db.orders.find(
{ userId: 123 },
{ userId: 1, createdAt: 1, amount: 1, _id: 0 } // Only indexed fields
).sort({ createdAt: -1 })
// COVERED - no document fetch!
```
## Performance Optimization
### Index Intersection
MongoDB can combine multiple indexes:
```javascript
db.orders.createIndex({ userId: 1 })
db.orders.createIndex({ status: 1 })
// MongoDB can intersect both indexes, though a single compound index is usually faster
db.orders.find({ userId: 123, status: "pending" })
```
### Index Prefix Usage
Compound indexes can support prefix queries:
```javascript
db.orders.createIndex({ userId: 1, status: 1, createdAt: -1 })
// Supports queries on:
// { userId }
// { userId, status }
// { userId, status, createdAt }
// Does NOT support:
// { status }
// { createdAt }
```
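The prefix rule can be checked mechanically. A sketch (comparing field names only, ignoring sort direction):

```python
def supports_query(index_fields, query_fields):
    """True if query_fields form a left prefix of the compound index
    (order within the prefix does not matter for equality filters)."""
    prefix = index_fields[:len(query_fields)]
    return set(prefix) == set(query_fields)

index = ["userId", "status", "createdAt"]
assert supports_query(index, ["userId"])
assert supports_query(index, ["userId", "status"])
assert supports_query(index, ["userId", "status", "createdAt"])
assert not supports_query(index, ["status"])      # not a left prefix
assert not supports_query(index, ["createdAt"])   # not a left prefix
```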
## Monitoring and Analysis
### List Indexes
```javascript
db.collection.getIndexes()
```
### Index Usage Stats
```javascript
db.collection.aggregate([{ $indexStats: {} }])
```
### Explain Query
```javascript
db.orders.explain("executionStats").find({ userId: 123 })
// Check for:
// - "stage": "IXSCAN" (good - uses index)
// - "stage": "COLLSCAN" (bad - full collection scan)
// - "totalDocsExamined" should be close to "nReturned"
```
### Slow Query Log
```javascript
// Enable profiling
db.setProfilingLevel(1, { slowms: 100 }) // Log queries >100ms
// View slow queries
db.system.profile.find().sort({ ts: -1 }).limit(10)
```
## Index Maintenance
### Rebuild Index
```javascript
db.collection.reIndex() // Rebuilds all indexes (deprecated since MongoDB 6.0; prefer dropping and recreating)
```
### Drop Unused Indexes
```javascript
// Find unused indexes
db.collection.aggregate([{ $indexStats: {} }])
// Drop index
db.collection.dropIndex("index_name")
```
## Best Practices
1. **Index queried fields** - Every filter/sort should have index
2. **Compound index order** - Equality, Sort, Range (the ESR rule)
3. **Use covering indexes** - Avoid document fetches
4. **Partial indexes** - Index only needed subset
5. **Monitor index usage** - Drop unused indexes
6. **Limit index count** - Each index slows writes
7. **Use explain()** - Verify index usage
8. **Index foreign keys** - For $lookup performance
9. **TTL indexes** - Auto-cleanup for temp data
10. **Text indexes** - One per collection max
## Anti-Patterns
❌ **Over-indexing** - 10+ indexes per collection slows writes
❌ **Wrong compound order** - Range before equality
❌ **No index on filters** - Collection scans on large data
❌ **Indexing low-cardinality fields** - Boolean fields rarely help
❌ **Ignore index size** - Indexes consume RAM
## Resources
- MongoDB Indexes: https://www.mongodb.com/docs/manual/indexes/
- Index Performance: https://www.mongodb.com/docs/manual/core/index-performance/
```
### references/aggregation-patterns.md
```markdown
# MongoDB Aggregation Pipeline Patterns
Complete guide to MongoDB aggregation framework for complex queries, analytics, and data transformations.
## Table of Contents
- [Aggregation Pipeline Basics](#aggregation-pipeline-basics)
- [Core Stages](#core-stages)
- [$match (Filter)](#match-filter)
- [$group (Aggregate)](#group-aggregate)
- [$project (Reshape)](#project-reshape)
- [$lookup (Join)](#lookup-join)
- [$unwind (Flatten Arrays)](#unwind-flatten-arrays)
- [Common Patterns](#common-patterns)
- [Top N per Category](#top-n-per-category)
- [Time-Based Aggregation](#time-based-aggregation)
- [Moving Average](#moving-average)
- [Pagination](#pagination)
- [Advanced Patterns](#advanced-patterns)
- [Faceted Search](#faceted-search)
- [Full-Text Search + Aggregation](#full-text-search-aggregation)
- [Conditional Aggregation](#conditional-aggregation)
- [Performance Optimization](#performance-optimization)
- [Index Usage](#index-usage)
- [$match and $project Early](#match-and-project-early)
- [Limit Result Size](#limit-result-size)
- [Pipeline Validation](#pipeline-validation)
- [Best Practices](#best-practices)
- [Resources](#resources)
## Aggregation Pipeline Basics
Pipeline stages process documents sequentially:
```javascript
db.collection.aggregate([
{ $match: { status: "active" } }, // Filter
{ $group: { _id: "$category" } }, // Group
{ $sort: { count: -1 } }, // Sort
{ $limit: 10 }, // Limit
])
```
## Core Stages
### $match (Filter)
```javascript
// Filter before expensive operations
db.orders.aggregate([
{ $match: {
status: "completed",
createdAt: { $gte: ISODate("2025-01-01") }
}},
// ... other stages
])
```
**Best practice:** Use $match early to reduce document count.
### $group (Aggregate)
```javascript
// Revenue by category
db.orders.aggregate([
{ $group: {
_id: "$category",
totalRevenue: { $sum: "$amount" },
avgOrder: { $avg: "$amount" },
count: { $sum: 1 },
maxOrder: { $max: "$amount" },
minOrder: { $min: "$amount" },
}},
])
```
**Accumulators:** `$sum`, `$avg`, `$max`, `$min`, `$first`, `$last`, `$push`, `$addToSet`
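`$group` behaves like an in-memory group-by. A plain-Python analogue of the revenue rollup above, with made-up orders:

```python
from collections import defaultdict

orders = [
    {"category": "widgets", "amount": 50},
    {"category": "widgets", "amount": 150},
    {"category": "gadgets", "amount": 100},
]

# Accumulate per-category totals, mirroring $sum: "$amount" and $sum: 1
groups = defaultdict(lambda: {"totalRevenue": 0, "count": 0})
for o in orders:
    g = groups[o["category"]]
    g["totalRevenue"] += o["amount"]
    g["count"] += 1

assert groups["widgets"] == {"totalRevenue": 200, "count": 2}
assert groups["gadgets"] == {"totalRevenue": 100, "count": 1}
```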
### $project (Reshape)
```javascript
// Select and transform fields
db.users.aggregate([
{ $project: {
fullName: { $concat: ["$firstName", " ", "$lastName"] },
email: 1, // Include field
_id: 0, // Exclude field
ageGroup: {
$cond: {
if: { $gte: ["$age", 18] },
then: "adult",
else: "minor"
}
}
}},
])
```
### $lookup (Join)
```javascript
// Join orders with users
db.orders.aggregate([
{ $lookup: {
from: "users",
localField: "userId",
foreignField: "_id",
as: "user"
}},
{ $unwind: "$user" }, // Flatten array
])
```
### $unwind (Flatten Arrays)
```javascript
// Expand array elements into separate documents
db.posts.aggregate([
{ $unwind: "$tags" }, // Create one doc per tag
{ $group: {
_id: "$tags",
count: { $sum: 1 }
}},
])
```
## Common Patterns
### Top N per Category
```javascript
// Top 3 products per category
db.products.aggregate([
{ $sort: { category: 1, sales: -1 } },
{ $group: {
_id: "$category",
products: { $push: "$$ROOT" },
}},
{ $project: {
category: "$_id",
topProducts: { $slice: ["$products", 3] }
}},
])
```
### Time-Based Aggregation
```javascript
// Daily revenue
db.orders.aggregate([
{ $match: {
createdAt: { $gte: ISODate("2025-11-01") }
}},
{ $group: {
_id: {
$dateToString: { format: "%Y-%m-%d", date: "$createdAt" }
},
revenue: { $sum: "$amount" },
orders: { $sum: 1 },
}},
{ $sort: { _id: 1 } },
])
```
### Moving Average
```javascript
// 7-day moving average
db.metrics.aggregate([
{ $setWindowFields: {
sortBy: { date: 1 },
output: {
movingAvg: {
$avg: "$value",
window: { documents: [-6, 0] } // Current + 6 previous
}
}
}},
])
```
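To see what the `[-6, 0]` document window computes, here is the same moving average in plain Python (note the window is shorter at the start of the series):

```python
def moving_avg(values, window=7):
    """Average over the current value and up to window-1 preceding ones
    (the documents window [-(window-1), 0] in $setWindowFields terms)."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - (window - 1))
        chunk = values[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

vals = [10, 20, 30, 40, 50, 60, 70, 80]
avgs = moving_avg(vals)
assert avgs[0] == 10   # only one value in the window yet
assert avgs[6] == 40   # mean of 10..70
assert avgs[7] == 50   # mean of 20..80
```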
### Pagination
```javascript
// Efficient pagination
db.products.aggregate([
{ $match: { category: "electronics" } },
{ $sort: { createdAt: -1 } },
{ $skip: 20 }, // Page 2 (skip 20)
{ $limit: 10 }, // 10 per page
])
// Total count for pagination
db.products.aggregate([
{ $match: { category: "electronics" } },
{ $facet: {
items: [
{ $skip: 20 },
{ $limit: 10 },
],
totalCount: [
{ $count: "count" },
],
}},
])
```
## Advanced Patterns
### Faceted Search
```javascript
// Multiple aggregations in one query
db.products.aggregate([
{ $match: { price: { $lte: 1000 } } },
{ $facet: {
byCategory: [
{ $group: { _id: "$category", count: { $sum: 1 } } },
{ $sort: { count: -1 } },
],
byBrand: [
{ $group: { _id: "$brand", count: { $sum: 1 } } },
{ $sort: { count: -1 } },
],
priceRanges: [
{ $bucket: {
groupBy: "$price",
boundaries: [0, 100, 500, 1000],
default: "Other",
output: { count: { $sum: 1 } }
}},
],
}},
])
```
### Full-Text Search + Aggregation
```javascript
db.articles.aggregate([
{ $match: { $text: { $search: "mongodb aggregation" } } },
{ $addFields: {
score: { $meta: "textScore" }
}},
{ $sort: { score: -1 } },
{ $limit: 10 },
])
```
### Conditional Aggregation
```javascript
// Revenue by payment method
db.orders.aggregate([
{ $group: {
_id: null,
creditCardRevenue: {
$sum: { $cond: [
{ $eq: ["$paymentMethod", "credit_card"] },
"$amount",
0
]}
},
paypalRevenue: {
$sum: { $cond: [
{ $eq: ["$paymentMethod", "paypal"] },
"$amount",
0
]}
},
}},
])
```
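The `$sum`-over-`$cond` trick is a conditional sum. The plain-Python equivalent, with sample orders, makes the semantics explicit:

```python
orders = [
    {"paymentMethod": "credit_card", "amount": 100},
    {"paymentMethod": "paypal", "amount": 40},
    {"paymentMethod": "credit_card", "amount": 60},
]

def revenue_by(orders, method):
    """Add the amount only when the method matches, 0 otherwise (like $cond)."""
    return sum(o["amount"] if o["paymentMethod"] == method else 0 for o in orders)

assert revenue_by(orders, "credit_card") == 160
assert revenue_by(orders, "paypal") == 40
```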
## Performance Optimization
### Index Usage
```javascript
// Use indexes for $match and $sort
db.orders.createIndex({ status: 1, createdAt: -1 });
db.orders.aggregate([
{ $match: { status: "active" } }, // Uses index
{ $sort: { createdAt: -1 } }, // Uses index
// ... other stages
])
// Check index usage
db.orders.explain().aggregate([...])
```
### $match and $project Early
```javascript
// ✅ Good: Filter and project early
db.large_collection.aggregate([
{ $match: { active: true } }, // Reduce documents
{ $project: { needed_field: 1 } }, // Reduce field count
{ $lookup: ... }, // Expensive operation on smaller dataset
])
// ❌ Bad: Expensive operations on full collection
db.large_collection.aggregate([
{ $lookup: ... }, // Processes all documents
{ $match: { active: true } }, // Filter after expensive operation
])
```
### Limit Result Size
```javascript
// Limit intermediate results
db.products.aggregate([
{ $match: { inStock: true } },
{ $sort: { popularity: -1 } },
{ $limit: 100 }, // Limit before expensive stages
{ $lookup: { /* join details */ } },
])
```
## Pipeline Validation
```javascript
// $merge must be the LAST stage of a pipeline, so save each checkpoint with its own run
db.orders.aggregate([
{ $match: { status: "pending" } },
{ $merge: { into: "debug_stage1" } }, // Inspect debug_stage1
])
db.orders.aggregate([
{ $match: { status: "pending" } },
{ $group: { _id: "$userId", total: { $sum: "$amount" } } },
{ $merge: { into: "debug_stage2" } }, // Inspect debug_stage2
])
```
## Best Practices
1. **$match early** - Filter before expensive operations
2. **Use indexes** - Ensure $match and $sort use indexes
3. **$project only needed fields** - Reduce memory usage
4. **Limit results** - Use $limit early when possible
5. **Avoid $lookup on large collections** - Index foreign keys
6. **Test with explain()** - Verify index usage
7. **Use $facet sparingly** - Multiple sub-pipelines are expensive
8. **Consider denormalization** - Avoid $lookup for frequent queries
## Resources
- MongoDB Aggregation Docs: https://www.mongodb.com/docs/manual/aggregation/
- Aggregation Pipeline Operators: https://www.mongodb.com/docs/manual/reference/operator/aggregation/
```
### references/skill-integrations.md
```markdown
# Document Database Integration with Frontend Skills
Integration patterns for connecting MongoDB/DynamoDB with frontend component skills.
## Table of Contents
- [Forms Skill Integration](#forms-skill-integration)
- [Form Submission → MongoDB](#form-submission-mongodb)
- [Tables Skill Integration](#tables-skill-integration)
- [MongoDB → TanStack Table](#mongodb-tanstack-table)
- [Search-Filter Skill Integration](#search-filter-skill-integration)
- [MongoDB Full-Text Search](#mongodb-full-text-search)
- [Media Skill Integration](#media-skill-integration)
- [GridFS for Large Files](#gridfs-for-large-files)
- [Dashboard Skill Integration](#dashboard-skill-integration)
- [Real-Time Metrics with Change Streams](#real-time-metrics-with-change-streams)
- [AI Chat Skill Integration](#ai-chat-skill-integration)
- [MongoDB Atlas Vector Search](#mongodb-atlas-vector-search)
- [Feedback Skill Integration](#feedback-skill-integration)
- [Event Logging with TTL](#event-logging-with-ttl)
- [Best Practices](#best-practices)
- [Connection Reuse](#connection-reuse)
- [Error Handling](#error-handling)
- [Validation](#validation)
- [Resources](#resources)
## Forms Skill Integration
### Form Submission → MongoDB
```typescript
// Frontend (React Hook Form)
import { useForm } from 'react-hook-form';
function CreateUserForm() {
const { register, handleSubmit } = useForm();
const onSubmit = async (data) => {
await fetch('/api/users', {
method: 'POST',
body: JSON.stringify(data),
});
};
return (
<form onSubmit={handleSubmit(onSubmit)}>
<input {...register('email')} />
<input {...register('name')} />
<button>Submit</button>
</form>
);
}
```

```python
# Backend (FastAPI + MongoDB)
from datetime import datetime
from motor.motor_asyncio import AsyncIOMotorClient
from pydantic import BaseModel
class UserCreate(BaseModel):
email: str
name: str
@app.post("/api/users")
async def create_user(user: UserCreate):
result = await db.users.insert_one({
"email": user.email,
"name": user.name,
"createdAt": datetime.utcnow(),
})
return {"id": str(result.inserted_id)}
```
## Tables Skill Integration
### MongoDB → TanStack Table
```typescript
// Frontend
import { useQuery } from '@tanstack/react-query';
import { useReactTable, getCoreRowModel } from '@tanstack/react-table';
function UserTable() {
const { data } = useQuery({
queryKey: ['users'],
queryFn: () => fetch('/api/users').then(r => r.json()),
});
const table = useReactTable({
data: data?.users || [],
columns,
getCoreRowModel: getCoreRowModel(),
});
return <Table />;
}
// Backend (Cursor Pagination)
@app.get("/api/users")
async def list_users(cursor: Optional[str] = None, limit: int = 20):
query = {}
if cursor:
query["_id"] = {"$lt": ObjectId(cursor)}
users = await db.users.find(query).sort("_id", -1).limit(limit).to_list(limit)
return {
"users": users,
"nextCursor": str(users[-1]["_id"]) if users else None,
"hasMore": len(users) == limit,
}
```
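The cursor logic above can be exercised without a database. This sketch uses integers in place of ObjectIds (which, like ObjectIds, sort by creation order):

```python
def page(docs, cursor=None, limit=20):
    """Cursor pagination over docs sorted by descending _id."""
    docs = sorted(docs, key=lambda d: d["_id"], reverse=True)
    if cursor is not None:
        docs = [d for d in docs if d["_id"] < cursor]  # mirrors {"_id": {"$lt": cursor}}
    items = docs[:limit]
    return {
        "users": items,
        "nextCursor": items[-1]["_id"] if items else None,
        "hasMore": len(items) == limit,
    }

data = [{"_id": i} for i in range(1, 51)]  # _id 1..50
p1 = page(data, limit=20)
assert p1["users"][0]["_id"] == 50 and p1["nextCursor"] == 31
p2 = page(data, cursor=p1["nextCursor"], limit=20)
assert p2["users"][0]["_id"] == 30 and p2["nextCursor"] == 11
p3 = page(data, cursor=p2["nextCursor"], limit=20)
assert len(p3["users"]) == 10 and p3["hasMore"] is False
```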
## Search-Filter Skill Integration
### MongoDB Full-Text Search
```javascript
// Create text index
db.products.createIndex({
name: "text",
description: "text",
tags: "text"
})
// Search with filters
db.products.find({
$text: { $search: "laptop" },
price: { $lte: 1000 },
category: "electronics",
}).sort({ score: { $meta: "textScore" } })
```
```python
# Backend (FastAPI)
@app.get("/api/products/search")
async def search_products(
q: str,
category: Optional[str] = None,
maxPrice: Optional[float] = None,
):
query = { "$text": { "$search": q } }
if category:
query["category"] = category
if maxPrice is not None:
query["price"] = { "$lte": maxPrice }
products = await db.products.find(query).sort([
("score", { "$meta": "textScore" })
]).to_list(20)
return {"products": products}
```
## Media Skill Integration
### GridFS for Large Files
```python
from motor.motor_asyncio import AsyncIOMotorGridFSBucket
fs = AsyncIOMotorGridFSBucket(db)
# Upload file
@app.post("/api/upload")
async def upload_file(file: UploadFile):
file_id = await fs.upload_from_stream(
filename=file.filename,
source=file.file,
metadata={
"contentType": file.content_type,
"uploadedBy": current_user.id,
"uploadedAt": datetime.utcnow(),
}
)
return {"fileId": str(file_id)}
# Download file
@app.get("/api/files/{file_id}")
async def download_file(file_id: str):
grid_out = await fs.open_download_stream(ObjectId(file_id))
return StreamingResponse(
grid_out,
media_type=grid_out.metadata.get("contentType"),
)
```
## Dashboard Skill Integration
### Real-Time Metrics with Change Streams
```python
# Backend (FastAPI + SSE)
from fastapi.responses import StreamingResponse
@app.get("/api/metrics/stream")
async def stream_metrics():
async def generate():
async with db.orders.watch() as stream:
async for change in stream:
if change["operationType"] == "insert":
# Calculate updated metrics
metrics = await calculate_metrics()
yield f"data: {json.dumps(metrics)}\n\n"
return StreamingResponse(
generate(),
media_type="text/event-stream",
)
```
```typescript
// Frontend
useEffect(() => {
const es = new EventSource('/api/metrics/stream');
es.onmessage = (e) => {
const metrics = JSON.parse(e.data);
setDashboardMetrics(metrics);
};
return () => es.close();
}, []);
```
## AI Chat Skill Integration
### MongoDB Atlas Vector Search
```python
# Create vector search index (Atlas UI or API)
# Index definition:
{
"mappings": {
"fields": {
"embedding": {
"type": "knnVector",
"dimensions": 1024,
"similarity": "cosine"
}
}
}
}
# Semantic search in chat history
@app.post("/api/chat/search")
async def search_chat_history(query: str):
query_embedding = voyage_ai.embed(query)
pipeline = [
{
"$vectorSearch": {
"index": "vector_index",
"path": "embedding",
"queryVector": query_embedding,
"numCandidates": 100,
"limit": 5,
}
},
{
"$project": {
"message": 1,
"timestamp": 1,
"score": { "$meta": "vectorSearchScore" }
}
}
]
results = await db.chat_messages.aggregate(pipeline).to_list(5)
return {"results": results}
```
## Feedback Skill Integration
### Event Logging with TTL
```javascript
// Auto-delete events after 90 days
db.events.createIndex(
{ createdAt: 1 },
{ expireAfterSeconds: 7776000 } // 90 days
)
// Log user events
db.events.insertOne({
userId: 123,
type: "button_click",
button: "submit",
page: "/checkout",
createdAt: new Date(),
})
// Aggregate for analytics
db.events.aggregate([
{ $match: { type: "button_click" } },
{ $group: {
_id: { button: "$button", page: "$page" },
count: { $sum: 1 }
}},
{ $sort: { count: -1 } },
])
```
## Best Practices
### Connection Reuse
```typescript
// ✓ Singleton pattern
let client: MongoClient;
export async function getDatabase() {
if (!client) {
client = new MongoClient(process.env.MONGODB_URI!);
await client.connect();
}
return client.db('myapp');
}
// ✗ Don't create new connections per request
async function handler() {
const client = new MongoClient(uri); // BAD
await client.connect();
}
```
### Error Handling
```typescript
try {
await db.users.insertOne({ email: "[email protected]" });
} catch (error: any) {
if (error.code === 11000) {
return { error: "Email already exists" };
}
throw error;
}
```
### Validation
```javascript
// Schema validation (MongoDB 3.6+)
db.createCollection("users", {
validator: {
$jsonSchema: {
bsonType: "object",
required: ["email", "name"],
properties: {
email: {
bsonType: "string",
pattern: "^.+@.+\\..+$" // double backslash so the literal dot survives the JS string
},
age: {
bsonType: "int",
minimum: 0,
maximum: 120
}
}
}
}
})
```
## Resources
- Motor (Async Python): https://motor.readthedocs.io/
- MongoDB Node Driver: https://www.mongodb.com/docs/drivers/node/
- Prisma MongoDB: https://www.prisma.io/docs/concepts/database-connectors/mongodb
```
### references/performance.md
```markdown
# MongoDB Performance Optimization
Query optimization, index strategies, and production tuning for document databases.
## Table of Contents
- [Query Optimization](#query-optimization)
- [1. Use Indexes](#1-use-indexes)
- [2. Covered Queries](#2-covered-queries)
- [3. Projection (Select Only Needed Fields)](#3-projection-select-only-needed-fields)
- [Index Performance](#index-performance)
- [Index Selectivity](#index-selectivity)
- [Index Size](#index-size)
- [Compound Index Prefix](#compound-index-prefix)
- [Connection Pooling](#connection-pooling)
- [Aggregation Optimization](#aggregation-optimization)
- [Pipeline Order](#pipeline-order)
- [$lookup Performance](#lookup-performance)
- [Write Performance](#write-performance)
- [Bulk Inserts](#bulk-inserts)
- [Limit Index Count](#limit-index-count)
- [Monitoring Queries](#monitoring-queries)
- [Enable Profiling](#enable-profiling)
- [Query Analysis](#query-analysis)
- [Production Tuning](#production-tuning)
- [WiredTiger Cache](#wiredtiger-cache)
- [Read Preference](#read-preference)
- [Write Concern](#write-concern)
- [Best Practices](#best-practices)
- [Performance Checklist](#performance-checklist)
- [Resources](#resources)
## Query Optimization
### 1. Use Indexes
```javascript
// Check if query uses index
db.orders.explain("executionStats").find({ status: "pending" })
// Look for:
{
"executionStats": {
"executionSuccess": true,
"nReturned": 100, // Documents returned
"totalDocsExamined": 100, // Should match nReturned
"executionTimeMillis": 5,
"stage": "IXSCAN" // ✓ Index scan (good)
}
}
// Bad example (no index):
"stage": "COLLSCAN", // ✗ Collection scan (bad)
"totalDocsExamined": 1000000, // Scanned all docs
"nReturned": 100, // Only returned 100
```
**Target:** `totalDocsExamined / nReturned ≈ 1.0`
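The ratio can be checked programmatically from explain output. A minimal Python sketch (the dict shape mirrors the `executionStats` fields above; the helper name is illustrative):

```python
def scan_ratio(execution_stats):
    """Docs examined per doc returned. Near 1.0 means the index does the
    filtering; large values point at a collection scan or a weak index."""
    returned = execution_stats["nReturned"]
    examined = execution_stats["totalDocsExamined"]
    if returned == 0:
        return float("inf") if examined else 0.0
    return examined / returned

# Indexed query: every examined document was returned
print(scan_ratio({"nReturned": 100, "totalDocsExamined": 100}))        # 1.0
# Collection scan: a million documents examined for 100 results
print(scan_ratio({"nReturned": 100, "totalDocsExamined": 1_000_000}))  # 10000.0
```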
### 2. Covered Queries
Query entirely satisfied by index (no document fetch):
```javascript
db.users.createIndex({ userId: 1, email: 1, name: 1 })
// Covered query (fast)
db.users.find(
{ userId: 123 },
{ userId: 1, email: 1, name: 1, _id: 0 } // Only indexed fields
)
// Check explain output:
"totalDocsExamined": 0 // ✓ Zero docs examined (covered)
```
### 3. Projection (Select Only Needed Fields)
```javascript
// ❌ Fetch entire document
db.users.find({ _id: userId })
// ✅ Project only needed fields
db.users.find({ _id: userId }, { email: 1, name: 1 })
```
## Index Performance
### Index Selectivity
**Good indexes** are selective (filter out most documents):
```javascript
// ✓ Good: email (unique, very selective)
db.users.createIndex({ email: 1 })
// ✗ Bad: boolean field (low selectivity)
db.users.createIndex({ isActive: 1 }) // Only 2 values, not helpful
```
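Selectivity can be estimated as distinct values divided by document count. A rough Python sketch (the function and thresholds are illustrative, not a driver API):

```python
def selectivity(n_distinct, n_docs):
    """Rough selectivity estimate for a single-field index:
    1.0 means every value is unique (ideal); near 0 means the
    index barely narrows the search (e.g. a boolean flag)."""
    return n_distinct / n_docs

# Unique email on 1M users: perfectly selective
print(selectivity(1_000_000, 1_000_000))  # 1.0
# Boolean isActive on 1M users: nearly useless
print(selectivity(2, 1_000_000))          # 2e-06
```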
### Index Size
```javascript
// Check index sizes
db.users.stats().indexSizes
// Keep indexes in RAM
// Formula: Total index size < Available RAM
```
### Compound Index Prefix
```javascript
db.orders.createIndex({ userId: 1, status: 1, createdAt: -1 })
// Supports:
// { userId } ✓
// { userId, status } ✓
// { userId, status, createdAt } ✓
// Does NOT support:
// { status } ✗
// { createdAt } ✗
// { status, createdAt } ✗
```
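The prefix rule above can be expressed as a small predicate. A Python sketch (an illustrative helper, not part of any driver):

```python
def index_supports(index_fields, query_fields):
    """True when the queried fields form a leading prefix of a
    compound index (order within the query filter doesn't matter)."""
    prefix = index_fields[: len(query_fields)]
    return set(prefix) == set(query_fields)

index = ["userId", "status", "createdAt"]
print(index_supports(index, {"userId"}))                # True
print(index_supports(index, {"userId", "status"}))      # True
print(index_supports(index, {"status"}))                # False
print(index_supports(index, {"status", "createdAt"}))   # False
```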
## Connection Pooling
```javascript
const { MongoClient } = require('mongodb');
const client = new MongoClient(uri, {
maxPoolSize: 10, // Max connections
minPoolSize: 2, // Min connections kept open
maxIdleTimeMS: 30000, // Close idle connections after 30s
});
```
**Pool sizing:**
- Web API: 10-20 connections
- Serverless: 1-2 connections per function instance; reuse the client across invocations
- Background workers: 5-10 connections
## Aggregation Optimization
### Pipeline Order
```javascript
// ✓ Optimize: Filter early, limit early
db.orders.aggregate([
{ $match: { status: "completed" } }, // Reduce documents
{ $match: { amount: { $gte: 100 } } }, // Further reduce
{ $sort: { createdAt: -1 } },
{ $limit: 100 }, // Limit before expensive operations
  { $lookup: { from: "users", localField: "userId", foreignField: "_id", as: "user" } }, // Expensive, but on 100 docs only
])
// ✗ Bad: Expensive operations on all docs
db.orders.aggregate([
  { $lookup: { from: "users", localField: "userId", foreignField: "_id", as: "user" } }, // Join all orders
{ $match: { status: "completed" } }, // Filter after expensive join
{ $limit: 100 },
])
```
### $lookup Performance
```javascript
// Index the foreignField on the joined collection
db.orderItems.createIndex({ orderId: 1 }) // matched by the $lookup below
// Limit lookup results
db.orders.aggregate([
{ $lookup: {
from: "orderItems",
localField: "_id",
foreignField: "orderId",
as: "items",
pipeline: [
{ $limit: 10 }, // Limit items per order
]
}},
])
```
## Write Performance
### Bulk Inserts
```javascript
// ✓ Batch inserts (10-100x faster)
db.events.insertMany(documents, { ordered: false })
// ✗ Individual inserts (slow)
for (const doc of documents) {
await db.events.insertOne(doc);
}
```
### Limit Index Count
**Each index slows writes:**
- 0 indexes: Fastest writes, slow reads
- 5 indexes: Balanced
- 10+ indexes: Slow writes, need justification
```javascript
// Check index impact
db.users.stats().indexSizes
// Drop unused indexes
db.users.dropIndex("unused_index_name")
```
## Monitoring Queries
### Enable Profiling
```javascript
// Profile slow queries (>100ms)
db.setProfilingLevel(1, { slowms: 100 })
// View slow queries
db.system.profile.find().sort({ ts: -1 }).limit(10).pretty()
// Disable profiling
db.setProfilingLevel(0)
```
### Query Analysis
```javascript
// Find queries missing indexes
db.system.profile.find({
"planSummary": { $eq: "COLLSCAN" }
}).limit(10)
```
## Production Tuning
### WiredTiger Cache
```yaml
# Set in mongod.conf
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 4  # 50-80% of available RAM
```
### Read Preference
```javascript
// Distribute reads to replicas
const client = new MongoClient(uri, {
readPreference: 'secondaryPreferred', // Read from secondary if available
});
```
### Write Concern
```javascript
// Balance durability vs performance
db.collection.insertOne(doc, {
writeConcern: {
w: "majority", // Wait for majority acknowledgment
j: true, // Wait for journal
wtimeout: 5000 // Timeout after 5s
}
})
```
## Best Practices
1. **Indexes for all queries** - Avoid collection scans
2. **Explain every query** - Verify index usage
3. **Connection pooling** - Reuse connections
4. **Batch operations** - Use bulkWrite, insertMany
5. **Monitor slow queries** - Set profiling level
6. **Limit document size** - <1MB for performance
7. **Project only needed** - Don't fetch unused fields
8. **Use covered queries** - Index-only access
9. **Pagination** - Use cursor-based, not offset
10. **Regular maintenance** - Monitor index usage, rebuild if needed
## Performance Checklist
- [ ] All queries have supporting indexes
- [ ] Index usage verified with explain()
- [ ] Slow query profiling enabled (>100ms)
- [ ] Connection pool sized appropriately
- [ ] Bulk operations used for batch inserts
- [ ] Documents <1MB (ideally <100KB)
- [ ] Projections used (not fetching all fields)
- [ ] Pagination implemented (cursor-based)
- [ ] Read preference configured for replicas
- [ ] WiredTiger cache sized correctly
## Resources
- MongoDB Performance: https://www.mongodb.com/docs/manual/administration/analyzing-mongodb-performance/
- Query Optimization: https://www.mongodb.com/docs/manual/core/query-optimization/
```
### references/common-patterns.md
```markdown
# Common MongoDB Patterns
Frequently-used patterns for pagination, soft deletes, audit logs, and data modeling.
## Table of Contents
- [Pagination Patterns](#pagination-patterns)
- [Cursor-Based (Recommended)](#cursor-based-recommended)
- [Offset-Based (Simple Cases)](#offset-based-simple-cases)
- [Range-Based (Time Series)](#range-based-time-series)
- [Soft Deletes](#soft-deletes)
- [Pattern 1: Boolean Flag](#pattern-1-boolean-flag)
- [Pattern 2: Status Field](#pattern-2-status-field)
- [Audit Logs](#audit-logs)
- [Pattern 1: Embedded History](#pattern-1-embedded-history)
- [Pattern 2: Separate Audit Collection](#pattern-2-separate-audit-collection)
- [Versioning Documents](#versioning-documents)
- [Pattern: Version Number + History](#pattern-version-number-history)
- [Counter Pattern (Atomic Increments)](#counter-pattern-atomic-increments)
- [Hierarchical Data](#hierarchical-data)
- [Pattern 1: Parent Reference](#pattern-1-parent-reference)
- [Pattern 2: Materialized Path](#pattern-2-materialized-path)
- [Upsert Pattern](#upsert-pattern)
- [Bulk Operations](#bulk-operations)
- [Caching Pattern](#caching-pattern)
- [Best Practices](#best-practices)
- [Resources](#resources)
## Pagination Patterns
### Cursor-Based (Recommended)
**Handles real-time changes, no skipped records:**
```javascript
// First page
db.posts.find({})
.sort({ _id: -1 })
.limit(20)
// Next page (using last _id as cursor)
db.posts.find({ _id: { $lt: lastSeenId } })
.sort({ _id: -1 })
.limit(20)
```
**API response:**
```javascript
{
"items": [...],
"nextCursor": "507f1f77bcf86cd799439011",
"hasMore": true
}
```
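The cursor flow can be sketched end to end. A common trick is to fetch one extra document to learn whether another page exists; a Python sketch (helper names are illustrative, and `_id` values are simplified to integers):

```python
def next_page_filter(last_seen_id=None):
    """Filter for the next page of a descending-_id scan:
    empty on the first page, otherwise everything before the cursor."""
    return {} if last_seen_id is None else {"_id": {"$lt": last_seen_id}}

def page_response(fetched, per_page):
    """Shape one page plus cursor metadata (mirrors the API response
    above); `fetched` should hold up to per_page + 1 documents."""
    has_more = len(fetched) > per_page
    items = fetched[:per_page]
    return {
        "items": items,
        "nextCursor": items[-1]["_id"] if has_more else None,
        "hasMore": has_more,
    }

# Three documents fetched for a page size of two -> one more page exists
resp = page_response([{"_id": 9}, {"_id": 8}, {"_id": 7}], per_page=2)
print(resp["nextCursor"], resp["hasMore"])   # 8 True
print(next_page_filter(resp["nextCursor"]))  # {'_id': {'$lt': 8}}
```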
### Offset-Based (Simple Cases)
**Only for static datasets <10K records:**
```javascript
const page = 2;
const perPage = 20;
db.posts.find({})
.sort({ createdAt: -1 })
.skip(page * perPage)
.limit(perPage)
```
**Problem:** Performance degrades with large skip values.
### Range-Based (Time Series)
```javascript
// Page by date range
db.events.find({
createdAt: {
$gte: ISODate("2025-12-01"),
$lt: ISODate("2025-12-02")
}
})
```
## Soft Deletes
### Pattern 1: Boolean Flag
```javascript
// Schema
{
_id: ObjectId("..."),
email: "[email protected]",
deletedAt: null, // or ISODate("...")
isDeleted: false,
}
// Soft delete
db.users.updateOne(
{ _id: userId },
{ $set: { isDeleted: true, deletedAt: new Date() } }
)
// Query (exclude deleted)
db.users.find({ isDeleted: { $ne: true } })
// Create index for efficient querying
db.users.createIndex({ isDeleted: 1, createdAt: -1 })
```
### Pattern 2: Status Field
```javascript
// Schema with status
{
_id: ObjectId("..."),
status: "active", // active | archived | deleted
statusChangedAt: ISODate("..."),
}
// Archive (soft delete)
db.posts.updateOne(
{ _id: postId },
{ $set: { status: "archived", statusChangedAt: new Date() } }
)
// Query active only
db.posts.find({ status: "active" })
// Index
db.posts.createIndex({ status: 1, createdAt: -1 })
```
## Audit Logs
### Pattern 1: Embedded History
```javascript
{
_id: ObjectId("..."),
email: "[email protected]",
name: "John Doe",
history: [
{
action: "created",
timestamp: ISODate("2025-01-01T00:00:00Z"),
by: ObjectId("admin_id"),
},
{
action: "updated",
timestamp: ISODate("2025-02-01T00:00:00Z"),
by: ObjectId("user_id"),
changes: { name: { old: "John", new: "John Doe" } },
},
],
}
```
**Use when:** <100 updates per document
### Pattern 2: Separate Audit Collection
```javascript
// Main collection
db.users.updateOne({ _id: userId }, { $set: { name: "New Name" } })
// Audit log collection
db.audit_log.insertOne({
collection: "users",
documentId: userId,
action: "update",
changes: { name: { old: "John", new: "New Name" } },
userId: currentUserId,
timestamp: new Date(),
ip: "192.168.1.1",
})
```
**Use when:** Many updates, need queryable audit history
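Building the `changes` map for an audit entry is a straightforward field diff. A Python sketch (top-level fields only; the helper is illustrative):

```python
def field_changes(old_doc, new_doc, ignore=("_id",)):
    """{field: {"old": ..., "new": ...}} for every top-level field
    that differs between two versions of a document."""
    changes = {}
    for key in set(old_doc) | set(new_doc):
        if key in ignore:
            continue
        before, after = old_doc.get(key), new_doc.get(key)
        if before != after:
            changes[key] = {"old": before, "new": after}
    return changes

print(field_changes({"_id": 1, "name": "John"}, {"_id": 1, "name": "New Name"}))
# {'name': {'old': 'John', 'new': 'New Name'}}
```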
## Versioning Documents
### Pattern: Version Number + History
```javascript
{
_id: ObjectId("..."),
version: 3,
content: "Current content",
versions: [
{ version: 1, content: "Original", createdAt: ISODate("...") },
{ version: 2, content: "Updated", createdAt: ISODate("...") },
{ version: 3, content: "Current content", createdAt: ISODate("...") },
],
}
// Update with versioning
db.documents.updateOne(
{ _id: docId },
{
$inc: { version: 1 },
$set: { content: newContent },
$push: {
versions: {
version: currentVersion + 1,
content: newContent,
createdAt: new Date(),
}
}
}
)
```
## Counter Pattern (Atomic Increments)
```javascript
// Page view counter
db.posts.updateOne(
{ _id: postId },
{ $inc: { views: 1 } }
)
// Multiple counters
db.posts.updateOne(
{ _id: postId },
{
$inc: { views: 1, shares: 1 },
$set: { lastViewed: new Date() }
}
)
```
## Hierarchical Data
### Pattern 1: Parent Reference
```javascript
// Category tree
{
_id: ObjectId("..."),
name: "Electronics",
parentId: null, // Root category
}
{
_id: ObjectId("..."),
name: "Laptops",
parentId: ObjectId("electronics_id"), // Child
}
// Find all children
db.categories.find({ parentId: electronicsId })
// Find path to root
function getPath(categoryId) {
const path = [];
let current = db.categories.findOne({ _id: categoryId });
while (current) {
path.unshift(current.name);
current = current.parentId ?
db.categories.findOne({ _id: current.parentId }) :
null;
}
return path;
}
```
### Pattern 2: Materialized Path
```javascript
{
_id: ObjectId("..."),
name: "Laptops",
path: "Electronics,Computers,Laptops", // Full path
}
// Find all descendants
db.categories.find({ path: /^Electronics,Computers/ })
// Index for performance
db.categories.createIndex({ path: 1 })
```
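The descendant query above generalizes to a small helper. A Python sketch (illustrative; note the anchor and the escaping, which the raw `/^Electronics,Computers/` form skips):

```python
import re

def descendants_pattern(path):
    """Anchored pattern matching a node and everything beneath it,
    mirroring the /^Electronics,Computers/ query above. re.escape
    protects against regex metacharacters in category names."""
    return re.compile("^" + re.escape(path))

def depth(path):
    """Tree depth implied by a comma-separated materialized path."""
    return path.count(",") + 1

pat = descendants_pattern("Electronics,Computers")
print(bool(pat.match("Electronics,Computers,Laptops")))  # True
print(bool(pat.match("Electronics,Phones")))             # False
print(depth("Electronics,Computers,Laptops"))            # 3
```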
## Upsert Pattern
```javascript
// Insert if not exists, update if exists
db.users.updateOne(
{ email: "[email protected]" },
{
$set: { name: "John", lastLogin: new Date() },
$setOnInsert: { createdAt: new Date() }, // Only on insert
},
{ upsert: true }
)
```
## Bulk Operations
```javascript
// Batch inserts (faster than individual)
db.users.insertMany([
{ email: "[email protected]", name: "User 1" },
{ email: "[email protected]", name: "User 2" },
// ... thousands more
], { ordered: false }) // Continue on error
// Bulk write operations
db.users.bulkWrite([
{
insertOne: {
document: { email: "[email protected]", name: "New User" }
}
},
{
updateOne: {
filter: { _id: ObjectId("...") },
update: { $set: { name: "Updated" } }
}
},
{
deleteOne: {
filter: { _id: ObjectId("...") }
}
}
])
```
## Caching Pattern
```javascript
// Cache frequently accessed data
db.users.aggregate([
{ $match: { _id: userId } },
{ $lookup: {
from: "posts",
localField: "_id",
foreignField: "userId",
as: "recentPosts",
pipeline: [
{ $sort: { createdAt: -1 } },
{ $limit: 5 }
]
}},
{ $project: {
email: 1,
name: 1,
postCount: { $size: "$recentPosts" },
recentPosts: { $slice: ["$recentPosts", 3] }
}},
{ $merge: {
into: "user_cache",
whenMatched: "replace",
whenNotMatched: "insert"
}}
])
```
## Best Practices
1. **Index all queries** - Every filter should have supporting index
2. **Compound index order** - Equality, Sort, Range (the ESR rule)
3. **Use partial indexes** - Index only needed subset
4. **Cursor pagination** - Avoid skip for large offsets
5. **Soft deletes** - Preserve data, add isDeleted flag
6. **Audit critical changes** - Log who/what/when
7. **Version important docs** - Maintain history
8. **Use upserts** - Idempotent operations
9. **Bulk operations** - Batch for performance
10. **Monitor slow queries** - Enable profiling
## Resources
- MongoDB Schema Design: https://www.mongodb.com/docs/manual/core/data-modeling-introduction/
- Index Best Practices: https://www.mongodb.com/docs/manual/applications/indexes/
```
### references/anti-patterns.md
```markdown
# MongoDB Anti-Patterns
Common mistakes and how to avoid them.
## Table of Contents
- [1. Unbounded Arrays](#1-unbounded-arrays)
- [2. Over-Embedding](#2-over-embedding)
- [3. Collection Scans](#3-collection-scans)
- [4. Large Documents (>1MB)](#4-large-documents-1mb)
- [5. Inefficient $lookup](#5-inefficient-lookup)
- [6. Massive Projections](#6-massive-projections)
- [7. Client-Side Joins](#7-client-side-joins)
- [8. Unbounded $group](#8-unbounded-group)
- [9. No Error Handling](#9-no-error-handling)
- [10. Wrong Data Types](#10-wrong-data-types)
- [Summary of Best Practices](#summary-of-best-practices)
- [Resources](#resources)
## 1. Unbounded Arrays
❌ **Problem:**
```javascript
{
_id: ObjectId("..."),
userId: 123,
events: [
{ type: "login", timestamp: "..." },
{ type: "click", timestamp: "..." },
// ... 10,000 more events (document grows forever)
]
}
```
✅ **Solution:** Use separate collection with reference
```javascript
// User document (bounded)
{ _id: ObjectId("..."), userId: 123, email: "..." }
// Events collection
{ _id: ObjectId("..."), userId: 123, type: "login", timestamp: "..." }
// Query recent events
db.events.find({ userId: 123 }).sort({ timestamp: -1 }).limit(100)
```
**Rule:** Arrays with potential for >100 elements should be separate collections.
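When some embedding is still wanted, MongoDB's `$push` with `$each` plus a negative `$slice` caps the array at the newest N elements. A Python sketch building that update document (the helper is illustrative; the operator syntax is standard MongoDB):

```python
def capped_push(field, item, max_len=100):
    """Update doc that appends `item` and keeps only the newest
    `max_len` elements of the array (negative $slice keeps the tail)."""
    return {"$push": {field: {"$each": [item], "$slice": -max_len}}}

update = capped_push("recentEvents", {"type": "login"}, max_len=100)
print(update)
# {'$push': {'recentEvents': {'$each': [{'type': 'login'}], '$slice': -100}}}
```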
## 2. Over-Embedding
❌ **Problem:**
```javascript
{
_id: ObjectId("..."),
title: "Blog Post",
author: {
_id: ObjectId("..."),
name: "John",
email: "[email protected]",
bio: "Long biography...",
socialLinks: [...], // Embedded author data duplicated in every post
},
comments: [
{
author: { /* full author embedded again */ },
replies: [
{ author: { /* embedded again */ } } // Deeply nested
]
}
]
}
```
✅ **Solution:** Reference pattern
```javascript
// Post (minimal author info)
{
_id: ObjectId("..."),
title: "Blog Post",
authorId: ObjectId("author_id"), // Reference only
authorName: "John", // Denormalize only display name
}
// Fetch author details when needed
const post = db.posts.findOne({ _id: postId })
const author = db.users.findOne({ _id: post.authorId })
```
## 3. Collection Scans
❌ **Problem:**
```javascript
// No index on status
db.orders.find({ status: "pending" }) // COLLSCAN on 1M documents
```
✅ **Solution:** Create index
```javascript
db.orders.createIndex({ status: 1, createdAt: -1 })
db.orders.find({ status: "pending" }) // IXSCAN
```
## 4. Large Documents (>1MB)
❌ **Problem:**
```javascript
{
_id: ObjectId("..."),
productId: 123,
largeImage: "<base64 encoded 5MB image>", // 16MB doc limit approaching
}
```
✅ **Solution:** Use GridFS or external storage
```javascript
// Reference S3/GridFS
{
_id: ObjectId("..."),
productId: 123,
imageUrl: "https://cdn.example.com/products/123.jpg",
thumbnailUrl: "https://cdn.example.com/products/123_thumb.jpg",
}
```
## 5. Inefficient $lookup
❌ **Problem:**
```javascript
// $lookup whose foreignField has no index
db.users.aggregate([
  { $lookup: {
      from: "orders", // orders.userId is not indexed
      localField: "_id",
      foreignField: "userId",
      as: "orders"
  }}
])
```
✅ **Solution:** Index foreign keys
```javascript
// Index the foreignField on the joined collection
// (_id always has an index, so lookups joining on _id are already covered)
db.orders.createIndex({ userId: 1 })
// Or denormalize frequently accessed fields
{
  _id: ObjectId("..."),
  userId: 123,
  userName: "John", // Denormalized for display
  userEmail: "[email protected]",
}
```
## 6. Massive Projections
❌ **Problem:**
```javascript
// Selecting all fields when only need few
db.users.find({}, { password: 0 }) // Returns everything except password
```
✅ **Solution:** Explicit projection
```javascript
// Select only needed fields
db.users.find({}, { email: 1, name: 1, _id: 0 })
// Covering index (no document fetch)
db.users.createIndex({ email: 1, name: 1 })
db.users.find({}, { email: 1, name: 1, _id: 0 })
```
## 7. Client-Side Joins
❌ **Problem:**
```javascript
// N+1 queries
const posts = await db.posts.find({}).toArray();
for (const post of posts) {
post.author = await db.users.findOne({ _id: post.authorId }); // N queries!
}
```
✅ **Solution:** Aggregation or denormalization
```javascript
// Aggregation (server-side join)
db.posts.aggregate([
{ $lookup: {
from: "users",
localField: "authorId",
foreignField: "_id",
as: "author"
}},
{ $unwind: "$author" },
])
// Or denormalize
{
_id: ObjectId("..."),
authorId: ObjectId("..."),
authorName: "John", // Cached for display
}
```
## 8. Unbounded $group
❌ **Problem:**
```javascript
// Group by high-cardinality field
db.events.aggregate([
{ $group: {
_id: "$userId", // Millions of unique users
events: { $push: "$$ROOT" } // Massive memory usage
}}
])
```
✅ **Solution:** Add $match and $limit
```javascript
db.events.aggregate([
{ $match: { timestamp: { $gte: recentDate } } }, // Filter first
{ $group: { _id: "$userId", count: { $sum: 1 } } }, // Don't $push all docs
{ $limit: 1000 }, // Limit results
])
```
## 9. No Error Handling
❌ **Problem:**
```javascript
const user = await db.users.insertOne({ email: "[email protected]" });
// Throws on duplicate email if unique index exists
```
✅ **Solution:** Handle errors
```javascript
try {
const user = await db.users.insertOne({ email: "[email protected]" });
} catch (error) {
if (error.code === 11000) { // Duplicate key error
throw new Error("Email already exists");
}
throw error;
}
```
## 10. Wrong Data Types
❌ **Problem:**
```javascript
{
createdAt: "2025-12-03T10:00:00Z", // String, not Date
price: "49.99", // String, not Number
isActive: "true", // String, not Boolean
}
```
✅ **Solution:** Use proper types
```javascript
{
createdAt: ISODate("2025-12-03T10:00:00Z"), // Date object
price: 49.99, // Number
isActive: true, // Boolean
}
```
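A thin coercion layer at the write path catches stringly-typed fields before they reach the database. A Python sketch (a hypothetical helper; the rules would be defined per collection):

```python
from datetime import datetime

def coerce_types(doc):
    """Normalize common stringly-typed fields before insert.
    Hypothetical helper: extend the rules per collection."""
    out = dict(doc)
    if isinstance(out.get("createdAt"), str):
        # Accept a trailing 'Z' as UTC
        out["createdAt"] = datetime.fromisoformat(out["createdAt"].replace("Z", "+00:00"))
    if isinstance(out.get("price"), str):
        out["price"] = float(out["price"])
    if isinstance(out.get("isActive"), str):
        out["isActive"] = out["isActive"].lower() == "true"
    return out

doc = coerce_types({"createdAt": "2025-12-03T10:00:00Z", "price": "49.99", "isActive": "true"})
print(type(doc["price"]).__name__, doc["isActive"])  # float True
```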
## Summary of Best Practices
✅ **DO:**
- Reference for one-to-many (>100 related docs)
- Index all query filters
- Use cursor-based pagination
- Implement soft deletes for important data
- Denormalize display fields
- Use proper data types
- Handle errors explicitly
- Limit array sizes (<100 elements)
- Use aggregation for complex queries
- Monitor slow queries
❌ **DON'T:**
- Embed unbounded arrays
- Do client-side joins (N+1 queries)
- Skip indexing query fields
- Store large binary data in documents
- Use offset pagination for large datasets
- Store strings when you need dates/numbers
- Ignore 16MB document limit
- Over-normalize (too many $lookups)
- Create 10+ indexes per collection
- Use $regex without index
## Resources
- MongoDB Anti-Patterns: https://www.mongodb.com/developer/products/mongodb/schema-design-anti-pattern-summary/
```