deploying-on-aws
Selecting and implementing AWS services and architectural patterns. Use when designing AWS cloud architectures, choosing compute/storage/database services, implementing serverless or container patterns, or applying AWS Well-Architected Framework principles.
Install command
npx @skill-hub/cli install ancoleman-ai-design-components-deploying-on-aws
Repository
Skill path: skills/deploying-on-aws
Best for
Primary workflow: Run DevOps.
Technical facets: Full Stack, Backend, DevOps.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: ancoleman.
This is still a mirrored public skill entry. Review the repository before installing into production workflows.
What it helps with
- Install deploying-on-aws into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/ancoleman/ai-design-components before adding deploying-on-aws to shared team environments
- Use deploying-on-aws for development workflows
Original source / Raw SKILL.md
---
name: deploying-on-aws
description: Selecting and implementing AWS services and architectural patterns. Use when designing AWS cloud architectures, choosing compute/storage/database services, implementing serverless or container patterns, or applying AWS Well-Architected Framework principles.
---
# AWS Patterns
## Purpose
This skill provides decision frameworks and implementation patterns for Amazon Web Services. Navigate AWS's 200+ services through proven selection criteria, architectural patterns, and Well-Architected Framework principles. Focus on practical service selection, cost-aware design, and modern 2025 patterns including Lambda SnapStart, EventBridge Pipes, and S3 Express One Zone.
Use this skill when designing AWS solutions, selecting services for specific workloads, implementing serverless or container architectures, or optimizing existing AWS infrastructure for cost, performance, and reliability.
## When to Use This Skill
Invoke this skill when:
- Choosing between Lambda, Fargate, ECS, EKS, or EC2 for compute workloads
- Selecting database services (RDS, Aurora, DynamoDB) based on access patterns
- Designing VPC architecture for multi-tier applications
- Implementing serverless patterns with API Gateway and Lambda
- Building container-based microservices on ECS or EKS
- Applying AWS Well-Architected Framework to designs
- Optimizing AWS costs while maintaining performance
- Implementing security best practices (IAM, KMS, encryption)
## Core Service Selection Frameworks
### Compute Service Selection
**Decision Flow:**
```
Execution Duration:
  <15 minutes → Evaluate Lambda
  >15 minutes → Evaluate containers or VMs
Event-Driven/Scheduled:
  YES → Lambda (serverless)
  NO → Consider traffic patterns
Containerized:
  YES → Need Kubernetes?
    YES → EKS
    NO → ECS (Fargate or EC2)
  NO → Evaluate EC2 or containerize first
Special Requirements:
  GPU/Windows/BYOL licensing → EC2
  Predictable high traffic → EC2 or ECS on EC2 (cost optimization)
  Variable traffic → Lambda or Fargate
```
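The decision flow above can be sketched as a small function. This is one possible ordering of the checks, not an official AWS algorithm; the `Workload` fields and the returned labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

LAMBDA_MAX_MINUTES = 15  # Lambda's maximum execution time


@dataclass
class Workload:
    """Hypothetical workload description; field names are illustrative."""
    max_runtime_minutes: float
    event_driven: bool = False
    containerized: bool = False
    needs_kubernetes: bool = False
    needs_gpu_windows_or_byol: bool = False
    predictable_high_traffic: bool = False


def choose_compute(w: Workload) -> str:
    """Return a first-pass compute recommendation following the decision flow."""
    # Special requirements override everything else.
    if w.needs_gpu_windows_or_byol:
        return "EC2"
    # Short, event-driven or scheduled work fits Lambda.
    if w.max_runtime_minutes <= LAMBDA_MAX_MINUTES and w.event_driven:
        return "Lambda"
    if w.containerized:
        if w.needs_kubernetes:
            return "EKS"
        # Steady, high traffic favors EC2-backed capacity for cost.
        return "ECS on EC2" if w.predictable_high_traffic else "ECS on Fargate"
    # Not containerized: EC2 for steady load, Lambda if the runtime fits.
    if w.predictable_high_traffic:
        return "EC2"
    if w.max_runtime_minutes <= LAMBDA_MAX_MINUTES:
        return "Lambda"
    return "EC2 (or containerize first)"


print(choose_compute(Workload(max_runtime_minutes=5, event_driven=True)))
print(choose_compute(Workload(max_runtime_minutes=120, containerized=True)))
```

Treat the result as a starting point for evaluation; cost and team expertise can still change the answer.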
**Quick Reference:**
| Workload | Primary Choice | Cost Model | Key Benefit |
|----------|---------------|------------|-------------|
| API Backend | Lambda + API Gateway | Pay per request | Auto-scale, no servers |
| Microservices | ECS on Fargate | Pay for runtime | Simple operations |
| Kubernetes Apps | EKS | $73/mo + compute | Portability, ecosystem |
| Batch Jobs | Lambda or Fargate Spot | Request/spot pricing | Cost efficiency |
| Long-Running | EC2 Reserved Instances | 30-60% savings | Predictable cost |
For detailed service comparisons including cost examples, performance characteristics, and use case guidance, see `references/compute-services.md`.
### Database Service Selection
**Decision Matrix by Access Pattern:**
| Access Pattern | Data Model | Primary Choice | Key Criteria |
|----------------|------------|----------------|--------------|
| Transactional (OLTP) | Relational | Aurora | Performance + HA |
| Simple CRUD | Relational | RDS PostgreSQL | Cost vs. features |
| Key-Value Lookups | NoSQL | DynamoDB | Serverless scale |
| Document Storage | JSON/BSON | DynamoDB or DocumentDB | Flexibility vs. MongoDB compat |
| Caching | In-Memory | ElastiCache Redis | Speed + durability |
| Analytics (OLAP) | Columnar | Redshift/Athena | Dedicated vs. serverless |
| Time-Series | Timestamped | Timestream | Purpose-built |
**Query Complexity Guide:**
- **Simple Key-Value:** DynamoDB (single-digit ms latency)
- **Moderate Joins (2-3 tables):** Aurora or RDS (cost vs. performance)
- **Complex Analytics:** Redshift (dedicated) or Athena (serverless, query S3)
- **Real-Time Streams:** DynamoDB Streams + Lambda
For engine and instance class selection, cost comparisons, and migration strategies, see `references/database-services.md`.
### Storage Service Selection
**Primary Decision Tree:**
```
Data Type:
  Objects (files, media) → S3 + lifecycle policies
  Blocks (databases, boot volumes) → EBS
  Shared Files (cross-instance) → Evaluate protocol
File Protocol Required:
  NFS (Linux) → EFS
  SMB (Windows) → FSx for Windows
  High-Performance HPC → FSx for Lustre
  Multi-Protocol + Enterprise → FSx for NetApp ONTAP
```
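As a rough illustration, the decision tree above reduces to a lookup. The `data_type` and `protocol` labels below are hypothetical inputs invented for this sketch; only the service names come from the tree.

```python
def choose_storage(data_type: str, protocol: str = "") -> str:
    """Map the storage decision tree to a service name.

    data_type: "object", "block", or "shared-file" (labels chosen for this sketch).
    protocol:  "nfs", "smb", "lustre", or "multi"; only used for "shared-file".
    """
    if data_type == "object":
        return "S3 + lifecycle policies"
    if data_type == "block":
        return "EBS"
    if data_type == "shared-file":
        by_protocol = {
            "nfs": "EFS",
            "smb": "FSx for Windows",
            "lustre": "FSx for Lustre",
            "multi": "FSx for NetApp ONTAP",
        }
        if protocol in by_protocol:
            return by_protocol[protocol]
        raise ValueError(f"unknown file protocol: {protocol!r}")
    raise ValueError(f"unknown data type: {data_type!r}")


print(choose_storage("object"))
print(choose_storage("shared-file", "nfs"))
```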
**Cost Comparison (1TB/month):**
| Service | Monthly Cost | Access Pattern |
|---------|--------------|----------------|
| S3 Standard | $23 | Frequent access |
| S3 Standard-IA | $12.50 | Infrequent (>30 days) |
| S3 Glacier Instant | $4 | Archive, instant retrieval |
| EBS gp3 | $80 | Block storage |
| EFS Standard | $300 | Shared files, frequent |
| EFS IA | $25 | Shared files, infrequent |
**Recommendation:** Use S3 for 80%+ of storage needs. Use EFS/FSx only when shared file access is required.
For S3 storage classes, EBS volume types, and lifecycle policy examples, see `references/storage-services.md`.
## Serverless Architecture Patterns
### Pattern 1: REST API (Lambda + API Gateway + DynamoDB)
**Architecture:**
```
Client → API Gateway (HTTP API) → Lambda → DynamoDB
                                    ↓
                                  S3 (file uploads)
```
**Use When:**
- Building RESTful APIs with CRUD operations
- Variable or unpredictable traffic
- Minimal operational overhead desired
- Pay-per-request cost model acceptable
**Cost Estimate (1M requests/month):**
- API Gateway: ~$1.00 with an HTTP API (~$3.50 with a REST API)
- Lambda: $3.53
- DynamoDB: ~$7.50
- **Total: ~$12/month** (vs. Fargate ~$35+, EC2 ~$50+)
**Key Components:**
- API Gateway HTTP API (cheaper than REST API)
- Lambda with appropriate memory allocation (1024MB typically optimal)
- DynamoDB on-demand billing (for variable traffic)
- CloudWatch Logs for debugging
See `examples/cdk/serverless-api/` and `examples/terraform/serverless-api/` for complete implementations.
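The Lambda line in the cost estimate above can be reproduced with a short calculation. This is a sketch under stated assumptions: the rates are the us-east-1 figures listed in `references/compute-services.md`, the 200 ms average duration is assumed (the estimate does not state one), the 1024MB memory size comes from the key components list, and the free tier is ignored.

```python
REQUEST_PRICE_PER_MILLION = 0.20    # USD per 1M requests
PRICE_PER_GB_SECOND = 0.0000166667  # USD per GB-second (us-east-1)


def lambda_monthly_cost(requests: int, avg_duration_s: float, memory_mb: int) -> float:
    """Estimate monthly Lambda cost in USD, ignoring the free tier."""
    request_cost = requests / 1_000_000 * REQUEST_PRICE_PER_MILLION
    # Compute is billed in GB-seconds: invocations x duration x memory in GB.
    gb_seconds = requests * avg_duration_s * (memory_mb / 1024)
    return round(request_cost + gb_seconds * PRICE_PER_GB_SECOND, 2)


# Assumed inputs: 1M requests, 200 ms average duration, 1024 MB memory.
print(lambda_monthly_cost(1_000_000, 0.2, 1024))
```

Under those assumptions the result matches the $3.53 figure. Halving the duration or the memory size roughly halves the compute portion, which is why memory tuning matters for cost.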
### Pattern 2: Event-Driven Processing (EventBridge + Lambda + SQS)
**Architecture:**
```
S3 Upload → EventBridge Rule → Lambda (process) → DynamoDB (metadata)
                                 ↓
                               SQS (downstream tasks)
```
**Use When:**
- Asynchronous file processing
- Decoupled microservices communication
- Fan-out patterns (one event, multiple consumers)
- Need retry logic and dead-letter queues
**Key Features (2025):**
- **EventBridge Pipes:** Simplified source → filter → enrichment → target
- **Lambda Response Streaming:** Stream responses up to 20MB
- **Step Functions Distributed Map:** Process millions of items in parallel
See `references/serverless-patterns.md` for additional patterns including Step Functions orchestration, API Gateway WebSockets, and Lambda SnapStart configuration.
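To make the S3 Upload → EventBridge Rule step concrete, here is a sketch of the event pattern such a rule could use, plus a deliberately simplified matcher for reasoning about it locally. The bucket name is a placeholder, the `matches` helper is hypothetical, and the bucket needs EventBridge notifications enabled before S3 events reach the rule.

```python
import json

# Event pattern for an EventBridge rule; "my-upload-bucket" is a placeholder.
S3_UPLOAD_PATTERN = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": ["my-upload-bucket"]}},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: each pattern leaf lists the allowed exact values.

    EventBridge supports more operators (prefix, numeric, anything-but, ...).
    This sketch only covers exact-value matching on nested fields.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


sample_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {"bucket": {"name": "my-upload-bucket"}, "object": {"key": "uploads/a.png"}},
}
print(json.dumps(S3_UPLOAD_PATTERN, indent=2))
print(matches(S3_UPLOAD_PATTERN, sample_event))
```

Fields the pattern does not mention, such as `object.key` here, are ignored, so a narrow pattern keeps unrelated events away from the Lambda function.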
## Container Architecture Patterns
### Pattern 1: ECS on Fargate (Serverless Containers)
**Architecture:**
```
ALB → ECS Service (Fargate tasks) → RDS Aurora
        ↓
      ElastiCache Redis
```
**Use When:**
- Containerized applications without cluster management
- Variable traffic with auto-scaling
- Avoid EC2 instance management
- Docker-based deployment
**Key Components:**
- Application Load Balancer (path-based routing)
- ECS Cluster with Fargate launch type
- Task definitions (CPU, memory, container image)
- Auto-scaling based on CPU/memory or custom metrics
- Service Connect for built-in service discovery and service mesh capabilities
**Cost Model (2 vCPU, 4GB RAM, 24/7):**
- Fargate: ~$70/month
- ALB: ~$20/month
- RDS Aurora db.t3.medium: ~$50/month
- **Total: ~$140/month**
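The cost model above can be checked against the per-vCPU-hour and per-GB-hour Fargate rates listed in `references/compute-services.md` (Linux, us-east-1). The 730-hour month is an assumption of this sketch, and the ALB and Aurora figures are taken as-is from the list above.

```python
VCPU_HOUR = 0.04048    # USD per vCPU-hour (Linux, us-east-1)
GB_HOUR = 0.004445     # USD per GB-hour
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def fargate_monthly_cost(vcpu: float, memory_gb: float, tasks: int = 1) -> float:
    """Monthly USD cost of always-on Fargate tasks of the given size."""
    hourly = vcpu * VCPU_HOUR + memory_gb * GB_HOUR
    return round(hourly * HOURS_PER_MONTH * tasks, 2)


fargate = fargate_monthly_cost(vcpu=2, memory_gb=4)
total = round(fargate + 20 + 50, 2)  # plus ~$20 ALB and ~$50 Aurora from the list above
print(fargate, total)
```

The exact result lands slightly above the rounded ~$70 and ~$140 figures, which is expected for an estimate. Pass `tasks=2` or more to see how running redundant tasks across AZs scales the compute line.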
### Pattern 2: EKS (Kubernetes on AWS)
**Use When:**
- Kubernetes expertise exists in team
- Multi-cloud or hybrid cloud strategy
- Need Kubernetes ecosystem (Helm, Operators, Istio)
- Complex workload orchestration requirements
**Key Features (2025):**
- **EKS Auto Mode:** Fully managed node lifecycle
- **EKS Pod Identities:** Simplified IAM (an alternative to IRSA)
- **EKS Hybrid Nodes:** Run on-premises nodes
**Cost Considerations:**
- EKS control plane: $73/month per cluster
- Worker nodes: Fargate or EC2 pricing
- Use EKS on Fargate for simplicity, EC2 for cost optimization
For ECS task definitions, EKS cluster setup with CDK/Terraform, and service mesh patterns, see `references/container-patterns.md`.
## Networking Essentials
### VPC Architecture
**Standard 3-Tier Pattern:**
```
VPC: 10.0.0.0/16
Per Availability Zone (deploy across 3 AZs):
  Public Subnet: 10.0.X.0/24 (ALB, NAT Gateway)
  Private Subnet: 10.0.1X.0/24 (ECS, Lambda, app tier)
  Database Subnet: 10.0.2X.0/24 (RDS, Aurora, isolated)
```
**Best Practices:**
- Use /16 for VPC CIDR (65,536 IPs for growth)
- Use /24 for subnet CIDRs (256 IPs, 251 usable)
- Deploy across minimum 2 AZs (3 recommended) for high availability
- Use Security Groups (stateful) for instance-level firewall
- Enable VPC Flow Logs for troubleshooting
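The subnet layout and address counts above can be verified with Python's standard `ipaddress` module. This sketch assumes `X` in the pattern is the zero-based AZ index (so the private tier uses 10.0.10.0/24, 10.0.11.0/24, and so on); the function and constant names are illustrative.

```python
import ipaddress

VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")
AWS_RESERVED_PER_SUBNET = 5  # AWS reserves 5 addresses in every subnet
TIER_OFFSETS = {"public": 0, "private": 10, "database": 20}


def plan_subnets(az_count: int = 3) -> dict:
    """Return {(tier, az_index): network} for each tier in each AZ."""
    plan = {}
    for tier, offset in TIER_OFFSETS.items():
        for az in range(az_count):
            subnet = ipaddress.ip_network(f"10.0.{offset + az}.0/24")
            if not subnet.subnet_of(VPC_CIDR):
                raise ValueError(f"{subnet} is outside {VPC_CIDR}")
            plan[(tier, az)] = subnet
    return plan


for (tier, az), subnet in plan_subnets().items():
    usable = subnet.num_addresses - AWS_RESERVED_PER_SUBNET
    print(f"{tier:<8} az{az}  {subnet}  usable={usable}")
```

Three tiers across three AZs use only 9 of the 256 possible /24 blocks in the /16, which leaves room for additional tiers or larger subnets later.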
### Load Balancing
**Service Selection:**
| Load Balancer | Protocol | Use Case | Key Feature |
|---------------|----------|----------|-------------|
| ALB | HTTP/HTTPS | Web apps, APIs | Path/host routing, Lambda targets |
| NLB | TCP/UDP | High performance | Static IP, ultra-low latency |
| GWLB | Layer 3 | Security appliances | Inline inspection |
**ALB Features:**
- Path-based routing: `/api` → backend, `/web` → frontend
- Host-based routing: `api.example.com`, `web.example.com`
- WebSocket and gRPC support
- Integration with Lambda (serverless backends)
For CloudFront CDN patterns, Route 53 routing policies, and VPC peering configurations, see `references/networking.md`.
## Security Best Practices
### IAM Principles
**Least Privilege Pattern:**
```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": "arn:aws:s3:::my-bucket/uploads/*"
  }]
}
```
**Core Practices:**
- Use IAM roles (not users) for applications
- Implement least privilege (grant minimum permissions needed)
- Enable MFA for privileged users
- Use IAM Access Analyzer to validate policies
- Leverage AWS Organizations SCPs for guardrails
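As a toy illustration of the least privilege idea, the sketch below scans a policy document for wildcard grants. It is not a substitute for IAM Access Analyzer; `find_broad_grants` is a hypothetical helper and only looks at `Allow` statements with `*` or `service:*` actions and `*` resources.

```python
import json

POLICY_JSON = """
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": "arn:aws:s3:::my-bucket/uploads/*"
  }]
}
"""


def _as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return [value] if isinstance(value, str) else list(value)


def find_broad_grants(policy: dict) -> list:
    """Return human-readable findings for Allow statements with wildcards."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be an object
        statements = [statements]
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        for action in _as_list(stmt.get("Action", [])):
            # Flags "*" and service-wide grants such as "s3:*".
            if action == "*" or action.endswith(":*"):
                findings.append(f"statement {index}: broad action {action!r}")
        for resource in _as_list(stmt.get("Resource", [])):
            if resource == "*":
                findings.append(f"statement {index}: resource '*'")
    return findings


print(find_broad_grants(json.loads(POLICY_JSON)))
```

The example policy produces no findings because both the actions and the resource are scoped. A statement granting `s3:*` on `*` would be reported twice.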
### Data Protection
**Encryption Requirements:**
| Service | At-Rest Encryption | In-Transit Encryption |
|---------|-------------------|----------------------|
| S3 | SSE-S3 or SSE-KMS | HTTPS (TLS 1.2+) |
| EBS | KMS encryption | Encrypted between instance and volume when EBS encryption is enabled |
| RDS/Aurora | KMS encryption | TLS connections |
| DynamoDB | KMS encryption | HTTPS API |
**Secrets Management:**
- **Secrets Manager:** Database credentials with automatic rotation
- **Parameter Store:** Application configuration (free tier available)
- **KMS:** Encryption key management (customer-managed keys)
For WAF rules, GuardDuty configuration, and network security patterns, see `references/security.md`.
## AWS Well-Architected Framework
### Six Pillars Overview
**1. Operational Excellence**
- Infrastructure as code (CDK, Terraform, CloudFormation)
- Automated deployments (CI/CD pipelines)
- Observability (CloudWatch Logs, Metrics, X-Ray)
- Runbooks and playbooks for common operations
**2. Security**
- Strong identity foundation (IAM roles and policies)
- Defense in depth (Security Groups, NACLs, WAF)
- Data protection (encryption at rest and in transit)
- Detective controls (CloudTrail, GuardDuty, Security Hub)
**3. Reliability**
- Multi-AZ deployments (RDS Multi-AZ, Aurora replicas)
- Auto-scaling (EC2 ASG, ECS Service Auto Scaling)
- Backup and recovery (automated snapshots, cross-region)
- Chaos engineering (Fault Injection Simulator)
**4. Performance Efficiency**
- Right-size resources (use Compute Optimizer)
- Use managed services (reduce operational overhead)
- Caching strategies (CloudFront, ElastiCache, DAX)
- Monitor and optimize continuously
**5. Cost Optimization**
- Right-sizing compute (match capacity to demand)
- Pricing models (Reserved Instances, Savings Plans, Spot)
- Storage optimization (S3 Intelligent-Tiering, lifecycle policies)
- Cost monitoring (Cost Explorer, Budgets, Trusted Advisor)
**6. Sustainability (Added 2021)**
- Use Graviton processors (AWS cites up to 60% less energy for the same performance; Graviton3 offers up to 25% better performance than Graviton2)
- Optimize workload placement (renewable energy regions)
- Storage efficiency (delete unused data, compression)
- Software optimization (efficient code, async processing)
For detailed pillar implementation guides, architectural review checklists, and Well-Architected Tool integration, see `references/well-architected.md`.
## Infrastructure as Code
### Tool Selection
**AWS CDK (Cloud Development Kit):**
- **Languages:** TypeScript, Python, Java, C#, Go
- **Best For:** AWS-native workloads, type-safe infrastructure
- **Key Benefit:** High-level constructs, synthesizes to CloudFormation
- **Example:** `examples/cdk/serverless-api/`
**Terraform:**
- **Language:** HCL (HashiCorp Configuration Language)
- **Best For:** Multi-cloud environments
- **Key Benefit:** Largest ecosystem, mature state management
- **Example:** `examples/terraform/serverless-api/`
**CloudFormation:**
- **Language:** YAML or JSON
- **Best For:** Native AWS integration, no additional tools
- **Key Benefit:** AWS service support on day 1
- **Example:** `examples/cloudformation/lambda-api.yaml`
### CDK Quick Start
```bash
# Install CDK CLI
npm install -g aws-cdk
# Initialize new project
cdk init app --language=typescript
npm install
# Deploy infrastructure
cdk bootstrap # One-time setup
cdk deploy
```
### Terraform Quick Start
```bash
# Install Terraform
brew install terraform # macOS
# Initialize project
terraform init
# Preview changes
terraform plan
# Apply changes
terraform apply
```
For complete working examples with VPC networking, multi-tier applications, and event-driven architectures, see the `examples/` directory.
## Cost Optimization Strategies
### Compute Cost Optimization
**Right-Sizing:**
- Use AWS Compute Optimizer for EC2/Lambda recommendations
- Monitor CloudWatch metrics (CPU, memory utilization)
- Start conservatively, scale based on actual usage
**Pricing Models:**
| Model | Commitment | Savings | Best For |
|-------|------------|---------|----------|
| On-Demand | None | 0% | Variable workloads |
| Savings Plans | 1-3 years | 30-40% | Flexible compute |
| Reserved Instances | 1-3 years | 30-60% | Predictable workloads |
| Spot Instances | None | 60-90% | Fault-tolerant tasks |
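To see what the savings ranges in the table mean in dollars, the sketch below applies them to a single on-demand figure. The ranges are copied from the table as-is, the example input is the m5.large monthly price from `references/compute-services.md`, and the function name is illustrative.

```python
# Savings ranges copied from the table above, as (low, high) fractions.
SAVINGS = {
    "On-Demand": (0.0, 0.0),
    "Savings Plans": (0.30, 0.40),
    "Reserved Instances": (0.30, 0.60),
    "Spot Instances": (0.60, 0.90),
}


def monthly_cost_range(on_demand_monthly: float, model: str) -> tuple:
    """Return (best_case, worst_case) monthly cost in USD for a pricing model."""
    low, high = SAVINGS[model]
    return (round(on_demand_monthly * (1 - high), 2),
            round(on_demand_monthly * (1 - low), 2))


# Example input: an m5.large at $70.08/month on-demand.
for model in SAVINGS:
    print(model, monthly_cost_range(70.08, model))
```

Actual discounts depend on term length, payment option, instance family, and region, so treat these ranges as planning figures rather than quotes.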
**Graviton Advantage:**
- Graviton3 instances: up to 25% better compute performance vs. Graviton2
- Up to 60% less energy for the same performance than comparable EC2 instances
- Available: EC2, Lambda, Fargate, RDS, ElastiCache
### Storage Cost Optimization
**S3 Lifecycle Policies:**
```
Day 0-30: S3 Standard ($0.023/GB-month)
Day 30-90: S3 Standard-IA ($0.0125/GB-month)
Day 90-365: S3 Glacier Instant Retrieval ($0.004/GB-month)
Day 365+: S3 Glacier Deep Archive ($0.00099/GB-month)
```
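The tiering schedule above maps onto an S3 lifecycle configuration with one rule and three transitions. The sketch below builds that document in the JSON shape accepted by `aws s3api put-bucket-lifecycle-configuration`, and computes the storage-only monthly cost per tier. The rule ID and the `logs/` prefix are hypothetical, and request, retrieval, and minimum-duration charges are not modeled.

```python
import json

# (transition day, S3 StorageClass value, USD per GB-month from the tiers above)
TIERS = [
    (0, "STANDARD", 0.023),
    (30, "STANDARD_IA", 0.0125),
    (90, "GLACIER_IR", 0.004),
    (365, "DEEP_ARCHIVE", 0.00099),
]


def lifecycle_configuration(prefix: str = "") -> dict:
    """Build a lifecycle configuration with one rule and three transitions."""
    transitions = [
        {"Days": day, "StorageClass": storage_class}
        for day, storage_class, _ in TIERS
        if day > 0  # objects start in STANDARD, so day 0 needs no transition
    ]
    return {
        "Rules": [{
            "ID": "tier-down-with-age",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},
            "Transitions": transitions,
        }]
    }


def monthly_cost_per_tier(size_gb: float) -> dict:
    """Storage-only monthly cost of size_gb in each tier."""
    return {storage_class: round(size_gb * price, 2) for _, storage_class, price in TIERS}


print(json.dumps(lifecycle_configuration("logs/"), indent=2))
print(monthly_cost_per_tier(1000))
```

For 1,000 GB the per-tier costs line up with the 1TB cost comparison earlier in this skill, which makes the savings from each transition easy to read off.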
**EBS Optimization:**
- Use gp3 volumes (20% cheaper than gp2, configurable IOPS)
- Delete unused snapshots
- Archive old snapshots (75% cheaper)
**Monitoring:**
- Enable AWS Cost Explorer (free)
- Set up AWS Budgets with alerts
- Use Cost Allocation Tags for attribution
- Review Trusted Advisor cost checks
## Common Patterns and Examples
### Serverless Three-Tier Application
```
CloudFront (CDN)
→ S3 (React frontend)
→ API Gateway (REST API)
→ Lambda (business logic)
→ DynamoDB (data)
→ S3 (file storage)
```
**Complete CDK implementation:** `examples/cdk/three-tier-app/`
**Complete Terraform implementation:** `examples/terraform/three-tier-app/`
### Containerized Microservices
```
Route 53 (DNS)
→ CloudFront (CDN)
→ ALB (load balancer)
→ ECS Fargate (services)
→ RDS Aurora (database)
→ ElastiCache Redis (cache)
```
**Complete implementation:** `examples/cdk/ecs-fargate/`
### Event-Driven Data Pipeline
```
S3 Upload
→ EventBridge Rule
→ Lambda (transform)
→ Kinesis Firehose
→ S3 Data Lake
→ Athena (query)
```
**Complete implementation:** `examples/cdk/event-driven/`
## Integration with Other Skills
### Related Skills
- **infrastructure-as-code** - Multi-cloud IaC concepts, CDK and Terraform patterns
- **kubernetes-operations** - EKS cluster operations, kubectl, Helm charts
- **building-ci-pipelines** - CodePipeline, CodeBuild, GitHub Actions → AWS
- **secret-management** - Secrets Manager rotation, Parameter Store hierarchies
- **observability** - CloudWatch advanced queries, X-Ray distributed tracing
- **security-hardening** - IAM policy best practices, security automation
- **disaster-recovery** - Multi-region strategies, backup automation
### Cross-Skill Patterns
**EKS + kubernetes-operations:**
- Use this skill for EKS cluster provisioning (CDK/Terraform)
- Use kubernetes-operations for kubectl, Helm, application deployment
**Secrets Management:**
- Use this skill for Secrets Manager/Parameter Store setup
- Use secret-management skill for rotation policies, access patterns
**CI/CD Integration:**
- Use this skill for CodePipeline infrastructure
- Use building-ci-pipelines skill for pipeline configuration
## Reference Documentation
### Detailed Guides
- **Compute Services:** `references/compute-services.md` - Lambda, Fargate, ECS, EKS, EC2 deep dive
- **Database Services:** `references/database-services.md` - RDS, Aurora, DynamoDB, ElastiCache comparison
- **Storage Services:** `references/storage-services.md` - S3 classes, EBS types, EFS/FSx selection
- **Networking:** `references/networking.md` - VPC design, load balancing, CloudFront, Route 53
- **Security:** `references/security.md` - IAM patterns, KMS, Secrets Manager, WAF
- **Serverless Patterns:** `references/serverless-patterns.md` - Advanced Lambda, Step Functions, EventBridge
- **Container Patterns:** `references/container-patterns.md` - ECS Service Connect, EKS Pod Identities
- **Well-Architected:** `references/well-architected.md` - Six pillars implementation guide
### Working Examples
- **CDK Examples:** `examples/cdk/` - TypeScript implementations
- **Terraform Examples:** `examples/terraform/` - HCL implementations
- **CloudFormation Examples:** `examples/cloudformation/` - YAML templates
### Utility Scripts
- **Cost Estimation:** `scripts/cost-estimate.sh` - Estimate infrastructure costs
- **Resource Audit:** `scripts/resource-audit.sh` - Audit AWS resources
- **Security Check:** `scripts/security-check.sh` - Basic security validation
## AWS Service Updates (2025)
**Recent Innovations to Consider:**
- **Lambda SnapStart:** Near-instant cold starts for Java functions
- **Lambda Response Streaming:** Stream responses up to 20MB
- **EventBridge Pipes:** Simplified event processing (source → filter → enrichment → target)
- **S3 Express One Zone:** Up to 10x faster than S3 Standard, single-digit millisecond latency
- **ECS Service Connect:** Built-in service mesh for ECS
- **EKS Auto Mode:** Fully managed Kubernetes node lifecycle
- **EKS Pod Identities:** Simplified IAM for pods (an alternative to IRSA, which remains supported)
- **Aurora Limitless Database:** Horizontal write scaling beyond single-writer limit
- **DynamoDB Standard-IA:** Table class for infrequently accessed data, with up to 60% lower storage cost
- **RDS Blue/Green Deployments:** Version upgrades with minimal downtime (switchover typically takes under a minute)
---
## Quick Decision Checklist
**Before choosing a service, answer:**
1. **Traffic Pattern:** Predictable or variable? (affects compute choice)
2. **Data Model:** Relational, key-value, document, or graph? (affects database choice)
3. **Access Pattern:** Frequent, infrequent, or archive? (affects storage class)
4. **Latency Requirements:** Milliseconds, seconds, or minutes acceptable?
5. **Scaling Needs:** Vertical (bigger instances) or horizontal (more instances)?
6. **Operational Overhead:** Prefer managed services or need control?
7. **Cost Sensitivity:** Optimize for cost, performance, or balance?
8. **Compliance Requirements:** Data residency, encryption, audit logging needed?
**Then consult the relevant decision framework in this skill or detailed references.**
## Getting Started
**For New AWS Projects:**
1. Define architecture using Well-Architected Framework pillars
2. Choose compute service using decision tree (Lambda/Fargate/ECS/EKS/EC2)
3. Select database based on access patterns and data model
4. Design VPC with 3-tier subnet architecture
5. Implement IaC using CDK or Terraform (see examples/)
6. Apply security best practices (IAM, encryption, logging)
7. Set up monitoring and cost tracking
**For Existing AWS Projects:**
1. Run AWS Trusted Advisor for recommendations
2. Review Well-Architected Framework pillars
3. Optimize costs (right-size, Reserved Instances, storage lifecycle)
4. Migrate to modern services (EC2 → Fargate, RDS → Aurora)
5. Improve security posture (enable GuardDuty, implement least privilege)
6. Automate with IaC (reverse-engineer to Terraform or CDK)
---
## Referenced Files
> The following files are referenced in this skill and included for context.
### references/compute-services.md
```markdown
# AWS Compute Services - Deep Dive
## Table of Contents
1. [Lambda (Serverless Functions)](#lambda-serverless-functions)
2. [Fargate (Serverless Containers)](#fargate-serverless-containers)
3. [ECS (Elastic Container Service)](#ecs-elastic-container-service)
4. [EKS (Elastic Kubernetes Service)](#eks-elastic-kubernetes-service)
5. [EC2 (Virtual Machines)](#ec2-virtual-machines)
6. [Service Comparison Matrix](#service-comparison-matrix)
7. [Migration Paths](#migration-paths)
---
## Lambda (Serverless Functions)
### Overview
AWS Lambda runs code without provisioning servers. Pay only for compute time consumed. Supports multiple languages and automatic scaling.
### Key Specifications (2025)
- **Execution Time:** 1ms to 15 minutes maximum
- **Memory:** 128MB to 10,240MB (increments of 1MB)
- **Storage:** 512MB to 10GB ephemeral storage (/tmp)
- **Deployment Package:** 50MB zipped, 250MB unzipped
- **Concurrent Executions:** 1,000 default (can increase via quota)
- **Supported Runtimes:** Node.js, Python, Java, Go, .NET, Ruby, Custom (containers)
### Performance Features (2025)
**Lambda SnapStart (Java):**
- Near-instant cold starts for Java functions
- Caches initialized execution environment
- Up to 10x faster startup vs. traditional Java
**Lambda Response Streaming:**
- Stream responses up to 20MB
- Progressive results for large payloads
- Ideal for generative AI, video processing
**Provisioned Concurrency:**
- Pre-initialized execution environments
- Removes cold starts for the provisioned environments (responses in double-digit milliseconds)
- Predictable performance for latency-sensitive apps
### Cost Model
**Request Pricing:**
- Free tier: 1M requests/month (perpetual)
- $0.20 per 1M requests thereafter
**Compute Pricing (us-east-1):**
- $0.0000166667 per GB-second
- Free tier: 400,000 GB-seconds/month
**Example Calculations:**
```
Scenario: API with 5M requests/month, 512MB memory, 200ms avg execution
Requests: (5M - 1M free) × $0.20/1M = $0.80
Compute: 5M × 0.2s × 0.5GB = 500,000 GB-seconds
         (500,000 - 400,000 free) × $0.0000166667 = $1.67
Total: $2.47/month (free tier applied to both requests and compute)
```
### Use Cases
**Ideal:**
- API backends (via API Gateway)
- File processing (S3 triggers)
- Scheduled jobs (EventBridge cron)
- Stream processing (Kinesis, DynamoDB Streams)
- WebHooks and event handlers
**Avoid:**
- Long-running tasks (>15 minutes)
- Stateful applications
- Predictable high throughput (EC2 cheaper at scale)
- Large deployment packages (>250MB)
### Best Practices
1. **Optimize Memory Allocation:**
- CPU scales with memory (1,769MB = 1 vCPU)
- Test different memory sizes (more memory = faster execution = lower cost)
- Use AWS Lambda Power Tuning tool
2. **Reduce Cold Starts:**
- Minimize dependencies
- Use SnapStart for Java
- Provision concurrency for critical functions
- Keep functions warm with scheduled pings (if cost-effective)
3. **Environment Variables:**
- Use for configuration (no code changes)
- Encrypt sensitive values with KMS
- Consider Parameter Store or Secrets Manager for secrets
4. **Observability:**
- Enable X-Ray tracing
- Use structured logging (JSON)
- Create CloudWatch dashboards
- Set up alarms for errors and throttling
---
## Fargate (Serverless Containers)
### Overview
AWS Fargate runs containers without managing servers. Pay for vCPU and memory used. Works with ECS and EKS.
### Key Specifications
**CPU Options:**
- 0.25 vCPU to 16 vCPU
- Must match valid CPU/memory combinations
**Memory Options:**
- 0.5GB to 120GB
- Scales with CPU selection
**Platform Versions:**
- **1.4.0:** Current default, supports EFS, container insights
- **1.3.0:** Legacy, missing some features
### Cost Model (Linux, us-east-1)
**Per vCPU-hour:** $0.04048
**Per GB-hour:** $0.004445
**Example Configurations:**
| vCPU | Memory | Hourly | Monthly (24/7, 730 h) |
|------|--------|--------|----------------|
| 0.25 | 0.5GB | $0.0123 | $9.01 |
| 0.5 | 1GB | $0.0247 | $18.02 |
| 1 | 2GB | $0.0494 | $36.04 |
| 2 | 4GB | $0.0987 | $72.08 |
| 4 | 8GB | $0.1975 | $144.16 |
**Fargate Spot:**
- Up to 70% discount vs. on-demand Fargate
- Can be interrupted with 2-minute notice
- Ideal for fault-tolerant batch jobs
### Use Cases
**Ideal:**
- Containerized microservices
- Batch processing (with Fargate Spot)
- CI/CD build agents
- Variable traffic applications
- Multi-hour running tasks
**Avoid:**
- Extremely cost-sensitive at high scale (EC2 cheaper)
- GPU workloads (use EC2)
- Stateful apps requiring persistent local storage
- High-performance computing
### Best Practices
1. **Task Sizing:**
- Start small, monitor CloudWatch Container Insights
- Scale up based on actual utilization
- Use Application Auto Scaling
2. **Networking:**
- Use awsvpc network mode (required)
- Each task gets ENI with private IP
- Use Security Groups for network isolation
3. **Storage:**
- Ephemeral storage: 20GB default (can increase to 200GB)
- Persistent storage: Mount EFS volumes
- Logs: Send to CloudWatch Logs
---
## ECS (Elastic Container Service)
### Overview
AWS-native container orchestration. Simpler than Kubernetes. Tight integration with AWS services.
### Launch Types
**Fargate:**
- Serverless, no EC2 management
- Pay per task
**EC2:**
- Manage EC2 instances
- Lower cost at scale
- More control
**External:**
- Run on on-premises servers
- ECS Anywhere
### Key Features (2025)
**ECS Service Connect:**
- Built-in service mesh
- Service discovery without custom code
- Load balancing and circuit breaking
**ECS Exec:**
- Interactive shell access to containers
- Debugging without SSH
**Capacity Providers:**
- Auto-scale between Fargate and EC2
- Mix spot and on-demand instances
### Cost Model
**No ECS Control Plane Fees:**
- Only pay for underlying compute (Fargate or EC2)
**Example (10 services, t3.medium EC2):**
- EC2: 10 × $30/month = $300
- ECS: $0
- **Total: $300/month**
### Use Cases
**Ideal:**
- Docker-based applications
- AWS-native deployments
- Workloads that do not need the full Kubernetes feature set
- Tight ALB/CloudWatch/IAM integration
**Avoid:**
- Multi-cloud portability needed (use EKS)
- Team already has Kubernetes expertise (EKS is usually the better fit)
- Need Kubernetes ecosystem (Helm, Operators)
---
## EKS (Elastic Kubernetes Service)
### Overview
Managed Kubernetes control plane. Full Kubernetes compatibility. Multi-cloud/hybrid portability.
### Key Specifications
**Control Plane:**
- Highly available across 3 AZs
- Automatic version upgrades
- Integrated with AWS IAM
**Supported Versions:**
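- Kubernetes 1.25 to 1.28 when this guide was written; the supported range changes several times a year, so check the current EKS documentation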
- Automatic minor version upgrades available
### Key Features (2025)
**EKS Auto Mode:**
- Fully managed node lifecycle
- Automatic capacity provisioning
- No manual node group management
**EKS Pod Identities:**
- Simplified IAM for pods
- Alternative to IRSA (IAM Roles for Service Accounts), which remains supported
- Easier setup and debugging
**EKS Hybrid Nodes:**
- Run Kubernetes nodes on-premises
- Consistent management plane
### Cost Model
**Control Plane:** $0.10/hour = $73/month per cluster
**Worker Nodes:**
- Fargate: Per-task pricing
- EC2: Instance pricing
- On-Demand, Reserved, or Spot
**Example (3 m5.large nodes on-demand):**
- Control plane: $73/month
- Nodes: 3 × $70 = $210/month
- **Total: $283/month**
### Use Cases
**Ideal:**
- Kubernetes expertise exists
- Multi-cloud/hybrid strategy
- Complex orchestration needs
- Kubernetes ecosystem required (Helm, Operators, Istio)
**Avoid:**
- Team lacks Kubernetes knowledge
- Simple workloads (over-engineering)
- Cost-sensitive (ECS cheaper)
### Best Practices
1. **Node Groups:**
- Use managed node groups
- Mix on-demand (baseline) + spot (burst)
- Use Auto Mode for simplicity
2. **Networking:**
- Use AWS VPC CNI plugin
- Enable Pod Security Groups
- Use Network Policies for isolation
3. **Storage:**
- Use EBS CSI Driver for persistent volumes
- Use EFS CSI Driver for shared storage
- Implement StorageClasses for automation
---
## EC2 (Virtual Machines)
### Overview
Virtual servers in the cloud. Full OS control. Widest instance type selection.
### Instance Families
**General Purpose (T, M):**
- Balanced CPU, memory, network
- t3: Burstable, cost-effective
- m5: Consistent performance
**Compute Optimized (C):**
- High CPU-to-memory ratio
- Batch processing, HPC, gaming
**Memory Optimized (R, X):**
- High memory-to-CPU ratio
- Databases, caches, in-memory analytics
**Storage Optimized (I, D):**
- High IOPS, throughput
- NoSQL databases, data warehousing
**Accelerated Computing (P, G, Inf):**
- GPU, FPGA, inference
- ML training, rendering, genomics
### Pricing Models
**On-Demand:**
- Pay by the second
- No commitment
- Highest cost
**Reserved Instances:**
- 1-year or 3-year commitment
- 30-60% savings
- Predictable workloads
**Savings Plans:**
- 1-year or 3-year commitment
- Flexible across instance families
- 30-40% savings
**Spot Instances:**
- Use spare EC2 capacity at the current Spot price (no bidding required)
- 60-90% savings
- Can be interrupted with 2-minute notice
### Cost Examples (us-east-1, On-Demand)
| Instance | vCPU | Memory | Hourly | Monthly | Use Case |
|----------|------|--------|--------|---------|----------|
| t3.micro | 2 | 1GB | $0.0104 | $7.60 | Dev/test |
| t3.medium | 2 | 4GB | $0.0416 | $30.37 | Small apps |
| m5.large | 2 | 8GB | $0.096 | $70.08 | General purpose |
| c5.xlarge | 4 | 8GB | $0.17 | $124.10 | Compute heavy |
| r5.large | 2 | 16GB | $0.126 | $91.98 | Memory heavy |
### Use Cases
**Ideal:**
- Maximum OS control
- GPU/FPGA workloads
- Windows Server
- BYOL licensing
- Predictable high traffic (with Reserved Instances)
**Avoid:**
- Variable traffic (use Lambda/Fargate)
- Minimal ops desired
- Serverless patterns applicable
---
## Service Comparison Matrix
| Criteria | Lambda | Fargate | ECS (EC2) | EKS | EC2 |
|----------|--------|---------|-----------|-----|-----|
| **Ops Overhead** | Minimal | Low | Medium | High | High |
| **Cost (variable)** | Excellent | Good | Fair | Fair | Poor |
| **Cost (predictable)** | Poor | Fair | Good | Good | Excellent |
| **Cold Start** | Yes | No | No | No | No |
| **Max Runtime** | 15 min | Unlimited | Unlimited | Unlimited | Unlimited |
| **Portability** | Low | Medium | Medium | High | High |
| **Scaling Speed** | Instant | Fast | Medium | Medium | Slow |
| **State Management** | None | Limited | Good | Excellent | Excellent |
---
## Migration Paths
### VM to Containers
```
EC2 → ECS on EC2 → ECS on Fargate → Serverless (Lambda)
```
**Step 1: EC2 → ECS on EC2**
- Containerize application (Dockerfile)
- Deploy to ECS with EC2 launch type
- Benefit: Better resource utilization
**Step 2: ECS on EC2 → ECS on Fargate**
- Migrate task definitions to Fargate
- Remove EC2 instance management
- Benefit: No server operations
**Step 3: ECS on Fargate → Lambda (if applicable)**
- Refactor to event-driven functions
- Use API Gateway for HTTP
- Benefit: Pay-per-request pricing
### Monolith to Microservices
```
Single EC2 → Multiple ECS Services → Lambda Functions
```
**Strategy:**
- Identify bounded contexts
- Extract services incrementally
- Use API Gateway or ALB for routing
- Implement service mesh (ECS Service Connect)
### On-Premises to AWS
```
On-Prem VMs → EC2 → Containers (ECS/EKS) → Serverless
```
**Tools:**
- AWS Application Migration Service (MGN)
- VM Import/Export
- Database Migration Service (DMS)
```
### references/database-services.md
```markdown
# AWS Database Services - Deep Dive
## Table of Contents
1. [RDS (Relational Database Service)](#rds-relational-database-service)
2. [Aurora (AWS-Native Relational)](#aurora-aws-native-relational)
3. [DynamoDB (NoSQL)](#dynamodb-nosql)
4. [DocumentDB (MongoDB Compatible)](#documentdb-mongodb-compatible)
5. [ElastiCache and MemoryDB](#elasticache-and-memorydb)
6. [Database Selection Decision Tree](#database-selection-decision-tree)
7. [Migration Strategies](#migration-strategies)
---
## RDS (Relational Database Service)
### Supported Engines
| Engine | Example Versions | Best For |
|--------|---------------|----------|
| PostgreSQL | 15.x | Modern apps, JSON support |
| MySQL | 8.0.x | Legacy compatibility |
| MariaDB | 10.11.x | MySQL fork, enhanced |
| Oracle | 19c, 21c | Enterprise apps, BYOL |
| SQL Server | 2019, 2022 | Microsoft ecosystem |
### Instance Classes
**General Purpose (db.t3, db.m5):**
- Balanced CPU, memory
- t3: Burstable, cost-effective
- m5: Consistent performance
**Memory Optimized (db.r5, db.x2):**
- High memory-to-CPU ratio
- Large datasets, caching
- r5: Widely available; newer generations (r6g, r6i, r7g) usually offer better price-performance
**Burstable (db.t4g - Graviton):**
- ARM-based processors
- 40% better price-performance
- Sustainable performance
### Cost Model (PostgreSQL db.t3.medium, us-east-1)
**Instance:** $0.068/hour = $49.64/month
**Storage (gp3):** $0.115/GB-month
**Backup Storage:** Free up to 100% of provisioned DB storage; backup storage beyond that is billed per GB-month
**Example Configuration:**
- db.t3.medium instance: $49.64/month
- 100GB gp3 storage: $11.50/month
- **Total: $61.14/month**
### Multi-AZ Deployments
**How it Works:**
- Synchronous replication to standby in different AZ
- Automatic failover (60-120 seconds)
- Same endpoint (no app changes)
**Cost:** 2x instance cost + storage in both AZs
**Use When:**
- Production workloads
- High availability required (99.95% SLA)
- Automatic failover needed
### Read Replicas
**Purpose:**
- Offload read traffic
- Scale horizontally
- Analytics on replica (no impact on primary)
**Limitations:**
- Up to 15 replicas per instance
- Asynchronous replication (eventual consistency)
- Cross-region supported
**Cost:** Standard instance pricing per replica
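### Blue/Green Deployments
**Purpose:**
- Near-zero-downtime version upgrades (switchover typically takes under a minute)
- Test changes in production clone
**How it Works:**
1. Create green environment (clone)
2. Test in green environment
3. Switch traffic (blue → green)
4. Validate after switchover; the old blue environment is retained but no longer receives changes, so plan the rollback path before switching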
**Use Cases:**
- Major version upgrades
- Schema changes
- Performance testing
### Best Practices
1. **Enable Automated Backups:**
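- Retention: 1-35 days (7 days is a common choice)
- Point-in-time recovery
- Minimal performance impact (Multi-AZ backups are taken from the standby; Single-AZ instances can see a brief I/O pause)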
2. **Use Parameter Groups:**
- Customize DB engine settings
- Apply best practices per workload
- Version control parameter changes
3. **Monitor Performance:**
- Enable Performance Insights (free for 7 days)
- Track slow queries
- Set up CloudWatch alarms
4. **Security:**
- Enable encryption at rest (KMS)
- Use TLS for connections
- Store credentials in Secrets Manager
- Apply security group restrictions
---
## Aurora (AWS-Native Relational)
### Overview
AWS-designed database compatible with MySQL and PostgreSQL. Higher performance, availability, and durability than standard RDS.
### Architecture
**Storage:**
- Automatically scales 10GB to 128TB
- 6 copies across 3 AZs
- Self-healing storage
- Continuous backup to S3
**Compute:**
- Primary instance (read-write)
- Up to 15 read replicas
- Replica lag typically well under 100 ms
### Performance Improvements
**vs. MySQL:**
- Up to 5x throughput (AWS benchmark claim)
- Same applications, drivers
**vs. PostgreSQL:**
- Up to 3x throughput (AWS benchmark claim)
- PostgreSQL 11-15 compatibility
### Aurora Serverless v2
**Use Cases:**
- Variable workloads (dev/test, seasonal)
- Unpredictable traffic
- Multi-tenant applications
**Scaling:**
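- Minimum: 0.5 ACU (1 ACU ≈ 2 GiB RAM with proportional CPU and networking); newer engine versions can scale to 0 ACU with auto-pause
- Maximum: 128 ACU (256 ACU on newer engine versions)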
- Scales in 0.5 ACU increments
- Sub-second scaling
**Cost Model:**
- $0.12 per ACU-hour (us-east-1)
- Storage: $0.10/GB-month
- I/O: $0.20 per 1M requests
**Example Calculation:**
```
Workload: 8 hours/day active, 2 ACU baseline, 10 ACU peak
ACU-hours/month: (2 × 16hr + 10 × 8hr) × 30 days = 3,360
Cost: 3,360 × $0.12 = $403.20/month
Storage (100GB): $10/month
Total: ~$413/month
```
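The same calculation as a small Python sketch, so the baseline, peak, and storage figures can be varied. It assumes the us-east-1 prices quoted above and, like the example, leaves out I/O charges; `LoadPhase` and `serverless_v2_monthly_cost` are illustrative names, not part of any AWS SDK.

```python
from dataclasses import dataclass

ACU_HOUR_PRICE = 0.12          # USD per ACU-hour, as quoted above
STORAGE_PRICE_GB_MONTH = 0.10  # USD per GB-month, as quoted above


@dataclass(frozen=True)
class LoadPhase:
    """A block of the day spent at a steady ACU level."""
    acus: float
    hours_per_day: float


def serverless_v2_monthly_cost(phases, storage_gb, days=30):
    """Return ACU-hours plus compute, storage, and total cost (I/O excluded)."""
    if abs(sum(p.hours_per_day for p in phases) - 24) > 1e-9:
        raise ValueError("phases must cover exactly 24 hours per day")
    acu_hours = sum(p.acus * p.hours_per_day for p in phases) * days
    compute = acu_hours * ACU_HOUR_PRICE
    storage = storage_gb * STORAGE_PRICE_GB_MONTH
    return {
        "acu_hours": acu_hours,
        "compute": round(compute, 2),
        "storage": round(storage, 2),
        "total": round(compute + storage, 2),
    }


if __name__ == "__main__":
    # 16 hours at the 2 ACU baseline, 8 hours at the 10 ACU peak, 100 GB stored.
    print(serverless_v2_monthly_cost([LoadPhase(2, 16), LoadPhase(10, 8)], 100))
```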
### Aurora Limitless Database (2024+)
**Purpose:**
- Horizontal write scaling
- Sharding managed by Aurora
- Millions of transactions per second
**Use Cases:**
- Highest-scale OLTP workloads
- Multi-tenant SaaS applications
- Gaming leaderboards
**How it Works:**
- Data automatically sharded
- Distributed SQL processing
- Appears as single database to applications
### Aurora Global Database
**Purpose:**
- Cross-region replication (<1 second lag)
- Disaster recovery
- Low-latency global reads
**Architecture:**
- Primary region (read-write)
- Up to 5 secondary regions (read-only)
- Dedicated infrastructure for replication
**Cost:**
- Replication: $0.10 per GB transferred
- Instances in secondary regions charged normally
### Cost Comparison: Aurora vs. RDS
**Aurora Advantages:**
- No manual backups needed (continuous to S3)
- Faster replication (typically under 100 ms vs. seconds)
- Higher availability (99.99% vs. 99.95%)
- Automatic failover to replicas
**Aurora Premium:**
- 20% more expensive than RDS for equivalent instance
- Worth it for production workloads
**Example:**
- RDS PostgreSQL db.r5.large: $0.24/hour = $175/month
- Aurora PostgreSQL r5.large: $0.29/hour = $212/month
- Difference: $37/month (20% premium)
### Best Practices
1. **Use Aurora for Production:**
- Better availability than RDS
- Automatic storage scaling
- Fast failover
2. **Leverage Read Replicas:**
- Create reader endpoint (automatic load balancing)
- Offload analytics to replicas
- Use custom endpoints for workload isolation
3. **Enable Backtrack (MySQL):**
- Rewind DB to specific point in time
- No restore from backup needed
- Minutes instead of hours
---
## DynamoDB (NoSQL)
### Overview
Fully managed NoSQL database. Single-digit millisecond latency. Virtually unlimited horizontal scaling.
### Data Model
**Primary Key Options:**
1. **Partition Key Only:**
- Unique identifier
- Example: UserID
2. **Partition Key + Sort Key:**
- Composite primary key
- Example: UserID (partition) + Timestamp (sort)
- Enables range queries
**Attributes:**
- Flexible schema (no predefined columns)
- Supports strings, numbers, binary, lists, maps, sets
### Capacity Modes
**On-Demand:**
- Pay per request
- No capacity planning
- Automatic scaling
**Pricing (us-east-1):**
- Write: $1.25 per million write request units
- Read: $0.25 per million read request units
- Storage: $0.25/GB-month
**Provisioned:**
- Specify read/write capacity units
- Predictable cost
- Auto-scaling available
**Pricing (us-east-1):**
- Write: $0.00065 per WCU-hour
- Read: $0.00013 per RCU-hour
- Storage: $0.25/GB-month
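To compare the two capacity modes, it helps to find the utilization at which provisioned capacity becomes cheaper than on-demand. The Python sketch below uses only the prices listed above and assumes each request consumes exactly one request unit (small items), with no auto-scaling or reserved capacity; the function names are illustrative.

```python
HOURS_PER_MONTH = 730
SECONDS_PER_HOUR = 3600

# Prices quoted above (us-east-1).
ON_DEMAND_WRITE_PER_MILLION = 1.25
ON_DEMAND_READ_PER_MILLION = 0.25
PROVISIONED_WCU_HOUR = 0.00065
PROVISIONED_RCU_HOUR = 0.00013


def on_demand_monthly(writes: int, reads: int) -> float:
    """Monthly request cost in on-demand mode (storage excluded)."""
    return (writes / 1e6 * ON_DEMAND_WRITE_PER_MILLION
            + reads / 1e6 * ON_DEMAND_READ_PER_MILLION)


def provisioned_monthly(wcu: int, rcu: int) -> float:
    """Monthly cost of holding fixed provisioned capacity (storage excluded)."""
    return (wcu * PROVISIONED_WCU_HOUR + rcu * PROVISIONED_RCU_HOUR) * HOURS_PER_MONTH


def break_even_utilization(unit_hour_price: float, per_million_price: float) -> float:
    """Fraction of a capacity unit that must be used for provisioned to win.

    One capacity unit serves one request per second, so 3,600 per hour.
    """
    on_demand_cost_at_full_use = SECONDS_PER_HOUR * per_million_price / 1e6
    return unit_hour_price / on_demand_cost_at_full_use


if __name__ == "__main__":
    print(round(break_even_utilization(PROVISIONED_WCU_HOUR, ON_DEMAND_WRITE_PER_MILLION), 3))
```

With these prices the break-even sits at roughly 14% sustained utilization, which is why steady workloads tend to favor provisioned mode and spiky ones favor on-demand.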
### Table Classes
**Standard:**
- Default table class
- $0.25/GB-month
**Standard-IA (Infrequent Access):**
- 60% cheaper storage
- $0.10/GB-month
- Higher per-request cost
- Use when storage dominates the bill (rarely accessed tables)
### Global Tables
**Purpose:**
- Multi-region, active-active replication
- Sub-second replication lag
- Automatic conflict resolution
**Use Cases:**
- Global applications
- Disaster recovery
- Low-latency global access
**Cost:**
- Replication: $0.000002 per replicated write
- Storage and read/write throughput billed in every replica Region
### DynamoDB Streams
**Purpose:**
- Real-time change data capture
- Trigger Lambda on insert/update/delete
- Audit logging, analytics pipelines
**Retention:**
- 24 hours of change data
**Use Cases:**
- Event-driven architectures
- Data replication
- Aggregation pipelines
### DynamoDB Accelerator (DAX)
**Purpose:**
- In-memory caching layer
- Microsecond latency
- Fully managed
**Performance:**
- Cache hit: <1ms latency
- Cache miss: ~10ms (DynamoDB read)
**Cost (dax.t3.small):**
- $0.04/hour = $29/month per node
- Recommended minimum of 3 nodes for HA = $87/month (a single node works for dev/test)
**Use When:**
- Need <1ms latency
- Read-heavy workload
- Can afford caching cost
### Best Practices
1. **Design Partition Keys:**
- Distribute access evenly
- Avoid hot partitions
- Use high-cardinality attributes (UserID, not Country)
2. **Use Global Secondary Indexes (GSI):**
- Query alternate access patterns
- Different partition/sort keys
- Eventually consistent reads
- Plan for 20 GSIs limit
3. **Use Local Secondary Indexes (LSI):**
- Same partition key, different sort key
- Strongly consistent reads
- Must create at table creation
- 5 LSI limit
4. **Enable Point-in-Time Recovery:**
- Restore to any second in last 35 days
- $0.20/GB-month, billed on table size
5. **Use PartiQL:**
- SQL-like query language
- Easier than low-level API
- Supports SELECT, INSERT, UPDATE, DELETE
---
## DocumentDB (MongoDB Compatible)
### Overview
Managed document database compatible with MongoDB 4.0+. Scales to millions of requests per second.
### Architecture
**Storage:**
- Automatically scales to 128TB
- 6 copies across 3 AZs (like Aurora)
- Continuous backup to S3
**Compute:**
- Primary instance (read-write)
- Up to 15 read replicas
### MongoDB Compatibility
**Supported:**
- MongoDB 3.6, 4.0, and 5.0 APIs
- Drivers and tools (Compass, mongosh)
- Most MongoDB queries and aggregations
**Not Supported:**
- Some advanced MongoDB features
- Check compatibility guide for specifics
### Cost Model (db.t3.medium)
**Instance:** $0.073/hour = $53/month
**Storage:** $0.10/GB-month
**I/O:** $0.20 per 1M requests
**Example:**
- Instance: $53/month
- 100GB storage: $10/month
- 10M I/O requests: $2/month
- **Total: $65/month**
### Use Cases
**Ideal:**
- Existing MongoDB workloads
- Document-oriented data
- JSON data storage
- Flexible schemas
**Consider Alternatives:**
- Simple key-value: Use DynamoDB (cheaper)
- Need native MongoDB: Use MongoDB Atlas
- Complex transactions: Use Aurora PostgreSQL
---
## ElastiCache and MemoryDB
### ElastiCache for Redis
**Purpose:**
- In-memory caching
- Session storage
- Real-time analytics
**Cost (cache.t3.medium):**
- $0.068/hour = $49.64/month per node
**Use Cases:**
- Database query caching
- Session store
- Leaderboards
- Rate limiting
- Pub/sub messaging
**Limitations:**
- Not a durable primary store: snapshots and replicas reduce, but do not eliminate, data loss when a node fails
- Use MemoryDB for durability
### ElastiCache for Memcached
**Purpose:**
- Simple caching layer
- Horizontal scaling via sharding
**vs. Redis:**
- Simpler (no advanced data structures)
- Multi-threaded (better CPU utilization)
- No persistence
**Use When:**
- Simple caching needed
- Horizontal scaling priority
- Don't need Redis features
### MemoryDB for Redis
**Purpose:**
- Redis-compatible with Multi-AZ durability
- Primary database (not just cache)
**Performance:**
- Microsecond read latency
- Single-digit millisecond write latency
- Durable across AZ failures
**Cost (db.t4g.small):**
- $0.061/hour = $44.53/month per node
**vs. ElastiCache Redis:**
- 20% more expensive
- Durable (survives restarts)
- Use as primary database
**Use Cases:**
- Real-time applications needing persistence
- Gaming leaderboards with durability
- Session stores with HA requirements
---
## Database Selection Decision Tree
```
Q1: What is the data model?
├─ Relational (tables with joins) → Q2
├─ Document (JSON/BSON) → Q5
├─ Key-Value → DynamoDB
├─ Graph → Neptune
└─ Time-Series → Timestream
Q2: What is the scale requirement?
├─ <64TB, standard RDS features → RDS
└─ >64TB or need highest performance → Aurora
Q3: What engine do you need?
├─ PostgreSQL → RDS PostgreSQL or Aurora PostgreSQL
├─ MySQL → RDS MySQL or Aurora MySQL
└─ Oracle/SQL Server → RDS (Aurora not available)
Q4: What availability do you need?
├─ Dev/Test → RDS Single-AZ
├─ Production → RDS Multi-AZ or Aurora
└─ Global → Aurora Global Database
Q5: Document database specifics:
├─ Simple key-value with JSON → DynamoDB
├─ MongoDB compatibility required → DocumentDB
└─ Complex MongoDB features → MongoDB Atlas
Q6: Caching needs:
├─ Simple cache, no persistence → ElastiCache (Redis or Memcached)
├─ Cache with durability → MemoryDB
└─ Microsecond latency for DynamoDB → DAX
```
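The tree above can be encoded as a pair of small functions, which is handy for architecture checklists or tests. This is a minimal Python sketch of the branches as written (data model, scale, engine, MongoDB needs, caching); the function and parameter names are illustrative, and real selection will weigh more factors than these.

```python
def choose_database(data_model, *, engine=None, size_tb=0.0,
                    needs_highest_performance=False,
                    needs_mongodb_compatibility=False,
                    needs_complex_mongodb_features=False):
    """Map the decision-tree questions to a service name."""
    model = data_model.lower()
    if model == "relational":
        # Q3: Oracle and SQL Server are only offered on RDS.
        if engine in {"oracle", "sqlserver"}:
            return "RDS"
        # Q2: large datasets or top performance point to Aurora.
        if size_tb > 64 or needs_highest_performance:
            return "Aurora"
        return "RDS"
    if model == "document":
        # Q5: document database specifics.
        if needs_complex_mongodb_features:
            return "MongoDB Atlas"
        if needs_mongodb_compatibility:
            return "DocumentDB"
        return "DynamoDB"
    direct = {"key-value": "DynamoDB", "graph": "Neptune", "time-series": "Timestream"}
    if model in direct:
        return direct[model]
    raise ValueError(f"unknown data model: {data_model!r}")


def choose_cache(*, needs_durability=False, fronting_dynamodb=False):
    """Q6: pick a caching layer."""
    if fronting_dynamodb:
        return "DAX"
    if needs_durability:
        return "MemoryDB"
    return "ElastiCache"


if __name__ == "__main__":
    print(choose_database("relational", engine="postgresql", size_tb=2))
    print(choose_cache(needs_durability=True))
```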
---
## Migration Strategies
### On-Premises to AWS
**Relational Databases:**
1. **AWS Database Migration Service (DMS):**
- Minimal downtime
- Continuous replication
- Supports heterogeneous migrations (Oracle → PostgreSQL)
2. **Native Tools:**
- MySQL: mysqldump, binlog replication
- PostgreSQL: pg_dump, logical replication
- Oracle: Data Pump, GoldenGate
**NoSQL Databases:**
1. **MongoDB to DocumentDB:**
- Use AWS DMS
- mongodump/mongorestore for smaller DBs
2. **Self-Managed to DynamoDB:**
- Application-level migration
- Dual-write pattern (old + new)
- Validate and cutover
### RDS to Aurora
**Minimal-Downtime Migration (RDS MySQL or PostgreSQL):**
1. Create Aurora read replica from RDS instance
2. Once replica lag reaches zero, pause writes and promote the replica to a standalone Aurora cluster
3. Update application endpoints
4. Decommission RDS instance
**Timeframe:** 1-2 hours depending on size
### DynamoDB to Aurora (or vice versa)
**Strategy:**
- Application-level migration
- Dual-write pattern
- Gradually shift reads
- Validate data consistency
**Tooling:**
- AWS DMS (limited support)
- Custom scripts
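The dual-write strategy can be sketched as a thin repository wrapper: every write goes to both stores, reads shift gradually by percentage, and a consistency check reports keys that differ. In this Python sketch plain dictionaries stand in for the DynamoDB and Aurora clients, and `DualWriteRepository` is a hypothetical name, so treat it as an outline of the control flow rather than production code.

```python
import zlib


class DualWriteRepository:
    """Write to both stores; route a configurable share of reads to the new one."""

    def __init__(self, old_store, new_store, read_new_percent=0):
        self.old_store = old_store      # stand-in for the current database
        self.new_store = new_store      # stand-in for the migration target
        self.read_new_percent = read_new_percent

    def put(self, key, value):
        # The old store stays the source of truth until cutover.
        self.old_store[key] = value
        self.new_store[key] = value

    def get(self, key):
        # Hash the key so a given key is always served by the same store.
        bucket = zlib.crc32(key.encode("utf-8")) % 100
        store = self.new_store if bucket < self.read_new_percent else self.old_store
        return store.get(key)

    def mismatched_keys(self):
        """Keys whose values differ between stores (including missing keys)."""
        keys = set(self.old_store) | set(self.new_store)
        return sorted(k for k in keys
                      if self.old_store.get(k) != self.new_store.get(k))


if __name__ == "__main__":
    repo = DualWriteRepository({"user#1": {"name": "Ada"}}, {})
    repo.put("user#2", {"name": "Lin"})
    print(repo.mismatched_keys())  # rows that still need backfilling
```

Raising `read_new_percent` in small steps while `mismatched_keys()` stays empty mirrors the "gradually shift reads" and "validate data consistency" steps.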
### Self-Managed to Managed
**Benefits:**
- Automated backups
- Automatic failover
- Managed upgrades
- Built-in monitoring
**Considerations:**
- Test performance (might differ)
- Validate feature compatibility
- Plan rollback strategy
```
### references/storage-services.md
```markdown
# AWS Storage Services - Deep Dive
## Table of Contents
1. [S3 (Simple Storage Service)](#s3-simple-storage-service)
2. [EBS (Elastic Block Store)](#ebs-elastic-block-store)
3. [EFS (Elastic File System)](#efs-elastic-file-system)
4. [FSx Family](#fsx-family)
5. [Storage Selection Guide](#storage-selection-guide)
---
## S3 (Simple Storage Service)
### Storage Classes
| Class | Use Case | Durability | Availability | Cost/GB | Retrieval Cost |
|-------|----------|------------|--------------|---------|----------------|
| **Standard** | Frequent access | 99.999999999% | 99.99% | $0.023 | Free |
| **Intelligent-Tiering** | Unknown/changing | 99.999999999% | 99.9% | $0.023-$0.00099 | Free |
| **Standard-IA** | Infrequent (>30 days) | 99.999999999% | 99.9% | $0.0125 | $0.01/GB |
| **One Zone-IA** | Non-critical | 99.999999999% | 99.5% | $0.01 | $0.01/GB |
| **Glacier Instant** | Archive, instant | 99.999999999% | 99.9% | $0.004 | $0.03/GB |
| **Glacier Flexible** | Archive, minutes to hours | 99.999999999% | 99.99% | $0.0036 | $0.01-$0.03/GB |
| **Glacier Deep Archive** | Long-term (7-10yr) | 99.999999999% | 99.99% | $0.00099 | $0.02/GB + 12hr |
| **S3 Express One Zone** | High perf (NEW) | 99.999999999% | 99.95% | $0.16 | Free |
### S3 Intelligent-Tiering
**How it Works:**
- Automatic optimization across up to 5 access tiers (the two archive tiers are opt-in)
- No retrieval fees
- Small monitoring fee: $0.0025 per 1,000 objects per month
**Tiers (by time since last access):**
1. Frequent Access (default): $0.023/GB
2. Infrequent Access (after 30 days): $0.0125/GB
3. Archive Instant Access (after 90 days): $0.004/GB
4. Archive Access (opt-in, 90+ days, configurable): $0.0036/GB
5. Deep Archive Access (opt-in, 180+ days, configurable): $0.00099/GB
**Use When:**
- Unknown or changing access patterns
- Want automated cost optimization
- Can tolerate small monitoring fee
### S3 Express One Zone (2024)
**Performance:**
- Up to 10x faster data access than S3 Standard
- Single-digit millisecond latency
- Hundreds of thousands of requests per second
**Use Cases:**
- Low-latency data processing
- ML training data access
- High-performance computing
**Cost Trade-off:**
- $0.16/GB vs. $0.023/GB (7x more expensive)
- No data transfer charges within same AZ
### Lifecycle Policies
**Example Configuration:**
```json
{
  "Rules": [{
    "ID": "Archive old data",
    "Status": "Enabled",
    "Filter": { "Prefix": "logs/" },
    "Transitions": [
      { "Days": 30, "StorageClass": "STANDARD_IA" },
      { "Days": 90, "StorageClass": "GLACIER_IR" },
      { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
    ],
    "Expiration": { "Days": 2555 }
  }]
}
```
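To reason about what this rule does to a given object, the Python sketch below maps an object's key and age to the storage class the rule would eventually place it in. It models only the rule shown above and ignores the fact that S3 applies lifecycle actions asynchronously, so real transitions can lag the thresholds; `storage_class_at` is an illustrative helper, not an AWS API.

```python
# Thresholds copied from the lifecycle rule above.
PREFIX = "logs/"
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER_IR"), (365, "DEEP_ARCHIVE")]
EXPIRATION_DAYS = 2555  # roughly seven years


def storage_class_at(key: str, age_days: int):
    """Return the expected storage class, or None once the object has expired."""
    if not key.startswith(PREFIX):
        return "STANDARD"  # the rule's filter does not match this object
    if age_days >= EXPIRATION_DAYS:
        return None
    current = "STANDARD"
    for threshold, storage_class in TRANSITIONS:
        if age_days >= threshold:
            current = storage_class
    return current


if __name__ == "__main__":
    for age in (0, 30, 100, 400, 2555):
        print(age, storage_class_at("logs/app.log", age))
```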
### S3 Features
**Versioning:**
- Preserve all versions of objects
- Protect against accidental deletion
- Enable MFA delete for extra protection
**Replication:**
- Cross-Region Replication (CRR): Disaster recovery
- Same-Region Replication (SRR): Compliance, lower latency
**S3 Object Lambda:**
- Transform objects on retrieval
- Redact PII, resize images, convert formats
**S3 Batch Operations:**
- Perform operations on billions of objects
- Copy, tag, restore from Glacier, invoke Lambda
### Best Practices
1. Enable versioning for critical data
2. Use lifecycle policies to reduce costs
3. Enable S3 Intelligent-Tiering for unknown patterns
4. Encrypt at rest (SSE-S3 or SSE-KMS)
5. Use CloudFront for frequently accessed content
6. Enable access logging for auditing
---
## EBS (Elastic Block Store)
### Volume Types
| Type | IOPS | Throughput | Cost/GB | Use Case |
|------|------|------------|---------|----------|
| **gp3** | 3,000-16,000 | 125-1,000 MB/s | $0.08 | General purpose (recommended) |
| **gp2** | 100-16,000 (3 IOPS/GB, burst to 3,000) | 250 MB/s max | $0.10 | Legacy general purpose |
| **io2 Block Express** | 256,000 | 4,000 MB/s | $0.125 + IOPS | Highest performance |
| **io2** | 64,000 | 1,000 MB/s | $0.125 + IOPS | Critical workloads |
| **st1** (HDD) | 500 | 500 MB/s | $0.045 | Throughput-optimized |
| **sc1** (HDD) | 250 | 250 MB/s | $0.015 | Cold storage |
### gp3 vs. gp2
**gp3 Advantages:**
- 20% cheaper ($0.08 vs. $0.10)
- Independent IOPS and throughput scaling
- Baseline: 3,000 IOPS, 125 MB/s
- Configurable up to 16,000 IOPS, 1,000 MB/s
**Recommendation:** Use gp3 for 99% of workloads
### EBS Snapshots
**Features:**
- Incremental backups to S3
- Copy across regions
- Create volumes from snapshots
**EBS Snapshots Archive:**
- 75% cheaper storage ($0.0125 vs. $0.05/GB-month)
- Restore takes 24-72 hours
- Use for compliance, long-term retention
**Fast Snapshot Restore (FSR):**
- Pre-warm snapshots for instant recovery
- $0.75/hour per AZ per snapshot
- Use for critical recovery scenarios
### Multi-Attach io2 Volumes
**Purpose:**
- Share volume across multiple EC2 instances
- Cluster file systems, HA applications
**Limitations:**
- Up to 16 instances
- Same AZ only
- Provisioned IOPS (io1/io2) volumes only, attached to Nitro-based instances
### Best Practices
1. Use gp3 for general purpose workloads
2. Enable EBS encryption by default
3. Automate snapshot creation (lifecycle policies)
4. Delete unused snapshots
5. Archive old snapshots (75% cost savings)
6. Use Fast Snapshot Restore for critical backups
---
## EFS (Elastic File System)
### Storage Classes
| Class | Cost/GB-month | Performance | Use Case |
|-------|---------------|-------------|----------|
| **Standard** | $0.30 | High | Frequent access |
| **Infrequent Access (IA)** | $0.025 | Lower | >30 days idle |
| **One Zone** | $0.16 | High | Non-critical |
| **One Zone-IA** | $0.0133 | Lower | Dev/test |
### Performance Modes
**General Purpose:**
- Low latency (<10ms)
- Up to 7,000 file ops/sec
- Default mode
**Max I/O:**
- Higher aggregate throughput
- Slightly higher latency
- Big data, media processing
### Throughput Modes
**Elastic (Default):**
- Automatically scales
- Pay only for actual throughput
- About $0.03/GB read and $0.06/GB written (us-east-1)
**Provisioned:**
- Specify throughput independent of storage
- Consistent high throughput
- $6.00/MB/s-month
**Bursting:**
- Throughput scales with storage size
- 50 MB/s per TB stored
- Legacy mode
### EFS Features (2025)
**Intelligent-Tiering:**
- Automatic movement to IA tier
- After 7, 14, 30, 60, or 90 days idle
- No retrieval fees
**EFS Replication:**
- Cross-region disaster recovery
- Near real-time replication
- $0.015/GB replicated
### Use Cases
**Ideal:**
- Shared file storage across EC2/Fargate/Lambda
- Content management systems
- Container persistent storage (ECS, EKS)
- Home directories
**Avoid:**
- Single-instance applications (use EBS)
- Windows workloads (use FSx for Windows)
- High-performance HPC (use FSx for Lustre)
### Best Practices
1. Enable Intelligent-Tiering (save 92% on IA files)
2. Use One Zone class for dev/test (50% cheaper)
3. Mount via NFS 4.1 for best compatibility
4. Use encryption in transit (TLS)
5. Monitor with CloudWatch metrics
---
## FSx Family
### FSx for Windows File Server
**Purpose:**
- Windows-native SMB file shares
- Active Directory integration
**Cost:**
- SSD: $0.13/GB-month + throughput (Single-AZ)
- HDD: $0.013/GB-month + throughput (Single-AZ)
**Use Cases:**
- Windows applications (SQL Server, IIS)
- Home directories
- SharePoint storage
### FSx for Lustre
**Purpose:**
- High-performance computing (HPC)
- Sub-millisecond latency
- 100+ GB/s throughput
**Cost:**
- Persistent SSD: $0.145/GB-month
- Scratch: $0.084/GB-month (no replication)
**Use Cases:**
- ML training
- Video rendering
- Genomics research
- Financial modeling
**S3 Integration:**
- Link to S3 bucket
- Lazy-load data on first access
- Write results back to S3
### FSx for NetApp ONTAP
**Purpose:**
- Enterprise NAS features
- Multi-protocol (NFS, SMB, iSCSI)
**Features:**
- Snapshots, clones
- Data compression, deduplication
- SnapMirror replication
- Hybrid cloud integration
**Cost:**
- SSD: $0.230/GB-month + IOPS
- Throughput: $0.50/MB/s-month
### FSx for OpenZFS
**Purpose:**
- Linux ZFS file systems
- Up to 12.5 GB/s throughput
**Features:**
- Snapshots, clones
- Compression
- Point-in-time recovery
**Cost:**
- SSD: $0.150/GB-month + throughput
---
## Storage Selection Guide
### Decision Matrix
```
Use Case → Service
Objects (files, media, static assets) → S3
├─ Frequent access → S3 Standard
├─ Infrequent (>30 days) → S3 Standard-IA
├─ Archive → S3 Glacier (Instant, Flexible, Deep)
└─ Unknown pattern → S3 Intelligent-Tiering
Block storage (databases, boot volumes) → EBS
├─ General purpose → gp3
├─ High IOPS → io2 or io2 Block Express
└─ Throughput optimized → st1 (HDD)
Shared files (NFS) → EFS or FSx
├─ Linux NFS → EFS
├─ Windows SMB → FSx for Windows
├─ High-performance HPC → FSx for Lustre
└─ Enterprise NAS → FSx for NetApp ONTAP
Container storage:
├─ Ephemeral → Local SSD
├─ Persistent single-container → EBS
└─ Persistent shared → EFS or FSx
```
### Cost Comparison (1TB for 1 Month)
| Service | Monthly Cost | Access Pattern |
|---------|--------------|----------------|
| S3 Standard | $23 | Frequent |
| S3 Standard-IA | $12.50 | Infrequent |
| S3 Glacier Instant | $4 | Archive, instant |
| S3 Deep Archive | $0.99 | Long-term archive |
| EBS gp3 | $80 | Block storage |
| EFS Standard | $300 | Shared files, frequent |
| EFS IA | $25 | Shared files, infrequent |
| FSx for Lustre | ~$145 | High-performance HPC |
### Performance Comparison
| Service | Latency | IOPS | Throughput |
|---------|---------|------|------------|
| S3 Standard | ~100ms | N/A | Unlimited |
| S3 Express One Zone | <10ms | 100,000+ | Unlimited |
| EBS gp3 | <1ms | 16,000 | 1,000 MB/s |
| EBS io2 Block Express | <1ms | 256,000 | 4,000 MB/s |
| EFS General Purpose | <10ms | 7,000 ops/s | Elastic |
| FSx for Lustre | <1ms | Millions | 100+ GB/s |
### Lifecycle Cost Optimization
**Scenario: 100TB data with 10% active**
**Without Optimization:**
- S3 Standard: 100TB × $23 = $2,300/month
**With Lifecycle Policies:**
- Active (10TB): S3 Standard = $230
- Infrequent (30TB): S3 Standard-IA = $375
- Archive (60TB): S3 Glacier Instant = $240
- **Total: $845/month (63% savings)**
**With Intelligent-Tiering:**
- Automatic optimization
- Similar savings
- Small monitoring fee (~$250 for 100M objects)
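The scenario arithmetic can be reproduced with a few lines of Python, which makes it easy to try other splits between tiers. The per-TB prices are the rounded figures from the cost comparison table, and the monitoring fee uses the $0.0025 per 1,000 objects rate; function names are illustrative.

```python
# Rounded monthly price per TB, from the cost comparison table.
PRICE_PER_TB = {"STANDARD": 23.0, "STANDARD_IA": 12.5, "GLACIER_IR": 4.0}
MONITORING_FEE_PER_1000_OBJECTS = 0.0025  # Intelligent-Tiering


def s3_monthly_cost(tb_by_class: dict) -> float:
    """Sum storage cost across classes for a given TB distribution."""
    return sum(PRICE_PER_TB[cls] * tb for cls, tb in tb_by_class.items())


def savings_fraction(baseline: float, optimized: float) -> float:
    """Share of the baseline cost removed by the optimized layout."""
    return 1.0 - optimized / baseline


def monitoring_fee(object_count: int) -> float:
    """Monthly Intelligent-Tiering monitoring fee for a number of objects."""
    return object_count / 1000 * MONITORING_FEE_PER_1000_OBJECTS


if __name__ == "__main__":
    baseline = s3_monthly_cost({"STANDARD": 100})
    tiered = s3_monthly_cost({"STANDARD": 10, "STANDARD_IA": 30, "GLACIER_IR": 60})
    print(baseline, tiered, round(savings_fraction(baseline, tiered) * 100))
```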
```
### references/serverless-patterns.md
```markdown
# AWS Serverless Architecture Patterns
## Table of Contents
- [Lambda Function Patterns](#lambda-function-patterns)
- [API Gateway Patterns](#api-gateway-patterns)
- [Step Functions Orchestration](#step-functions-orchestration)
- [EventBridge Patterns](#eventbridge-patterns)
- [DynamoDB Integration](#dynamodb-integration)
- [Lambda SnapStart Configuration](#lambda-snapstart-configuration)
- [Response Streaming](#response-streaming)
- [Error Handling and Retry Logic](#error-handling-and-retry-logic)
- [Performance Optimization](#performance-optimization)
- [Anti-Patterns](#anti-patterns)
## Lambda Function Patterns
### Basic REST API Handler
**Pattern:** Single Lambda function handling HTTP requests through API Gateway.
**Use When:**
- Building CRUD APIs with predictable traffic
- Each request completes within API Gateway's integration timeout (about 30 seconds), well below Lambda's 15-minute maximum
- Need automatic scaling
- Cost optimization for variable workloads
**Architecture:**
```
API Gateway HTTP API → Lambda Function → DynamoDB/RDS
                                       → S3 (optional)
```
**Key Configuration:**
```yaml
# Lambda configuration (illustrative values, not a complete template)
Runtime: python3.12 # or nodejs20.x
Memory: 1024 # MB (price-performance sweet spot)
Timeout: 30 # seconds (API Gateway limit)
ReservedConcurrency: null # no per-function cap; still bounded by the account concurrency limit
ProvisionedConcurrency: 0 # avoid unless cold start critical
```
**Cost Characteristics:**
- Free tier: 1M requests/month, 400,000 GB-seconds compute
- Beyond free tier: $0.20 per 1M requests + $0.0000166667 per GB-second
- 1M requests at 1GB memory, 500ms duration: 500,000 GB-seconds, about $8.53/month before the free tier (about $1.67 after it)
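The pricing formula above is easy to get wrong by a factor of two, so here is a small Python sketch that applies it directly. It uses only the request and GB-second prices and free-tier allowances listed above; `lambda_monthly_cost` is an illustrative helper. For 1M requests at 1 GB and 500 ms it yields roughly $8.53 before the free tier.

```python
REQUEST_PRICE_PER_MILLION = 0.20
GB_SECOND_PRICE = 0.0000166667
FREE_REQUESTS = 1_000_000
FREE_GB_SECONDS = 400_000


def lambda_monthly_cost(requests: int, memory_mb: int, avg_duration_ms: float,
                        apply_free_tier: bool = False) -> float:
    """Estimate monthly Lambda cost from request count, memory, and duration."""
    # Compute is billed in GB-seconds: memory (GB) x duration (s) x invocations.
    gb_seconds = requests * (memory_mb / 1024) * (avg_duration_ms / 1000)
    billable_requests = requests
    if apply_free_tier:
        billable_requests = max(0, requests - FREE_REQUESTS)
        gb_seconds = max(0.0, gb_seconds - FREE_GB_SECONDS)
    request_cost = billable_requests / 1e6 * REQUEST_PRICE_PER_MILLION
    compute_cost = gb_seconds * GB_SECOND_PRICE
    return round(request_cost + compute_cost, 2)


if __name__ == "__main__":
    print(lambda_monthly_cost(1_000_000, 1024, 500))
    print(lambda_monthly_cost(1_000_000, 1024, 500, apply_free_tier=True))
```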
**Memory Allocation Guidance:**
- 128-256 MB: Simple data transformations, S3 processing
- 512-1024 MB: API handlers, moderate database queries
- 1536-3008 MB: Complex processing, ML inference, heavy I/O
- 10240 MB: Maximum, for CPU-intensive workloads
### Event-Driven Processing
**Pattern:** Lambda triggered by events from S3, DynamoDB Streams, or EventBridge.
**Architecture:**
```
S3 Upload → EventBridge Event → Lambda (transform) → DynamoDB (metadata)
                                                   → SQS (downstream)
```
**Configuration for Event Sources:**
**S3 Event:**
```yaml
EventSourceArn: !GetAtt MyBucket.Arn
Events:
  - s3:ObjectCreated:*
Filter:
  S3Key:
    Rules:
      - Name: prefix
        Value: uploads/
      - Name: suffix
        Value: .csv
```
**DynamoDB Stream:**
```yaml
EventSourceArn: !GetAtt MyTable.StreamArn
StartingPosition: LATEST
BatchSize: 100
MaximumBatchingWindowInSeconds: 10
ParallelizationFactor: 10 # Up to 10 concurrent batches per shard
```
**Batch Processing Configuration:**
- BatchSize: 1-10,000 for SQS/Kinesis
- MaximumBatchingWindowInSeconds: 0-300 (wait for batch to fill)
- ParallelizationFactor: 1-10 (concurrent executions per shard)
### Scheduled Task Pattern
**Pattern:** Lambda function running on schedule using EventBridge Rules.
**Use When:**
- Periodic data processing (ETL jobs)
- Cleanup tasks (delete expired records)
- Report generation
- Health checks and monitoring
**EventBridge Schedule Syntax:**
```
rate(5 minutes) # Every 5 minutes
rate(1 hour) # Every hour
rate(1 day) # Daily
cron(0 9 * * ? *) # 9 AM UTC daily
cron(0 0 ? * MON-FRI *) # Midnight weekdays
cron(0/15 * * * ? *) # Every 15 minutes
```
**Best Practices:**
- EventBridge Rules evaluate cron expressions in UTC (use EventBridge Scheduler if a time zone is needed)
- Set appropriate timeout (max 15 minutes)
- Implement idempotency (safe to retry)
- Use CloudWatch Logs for debugging
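Idempotency is the practice most often skipped, so a minimal Python sketch follows. It keys each run on the `id` field that EventBridge includes in the event and skips work already recorded. The `store` dictionary is a stand-in for a durable table (for example a DynamoDB item written with a conditional put), and the check-then-write here is not atomic, which a real implementation would need to address.

```python
def handle_scheduled_event(event: dict, store: dict, work):
    """Run `work` at most once per event id and return its recorded result.

    `store` is a stand-in for a durable idempotency table; `work` is the
    function that performs the actual scheduled task.
    """
    event_id = event["id"]
    if event_id in store:
        # A retry or duplicate delivery: return the earlier result untouched.
        return store[event_id]
    result = work(event)
    store[event_id] = result
    return result


if __name__ == "__main__":
    processed = {}
    sample = {"id": "evt-1", "time": "2025-01-01T09:00:00Z"}
    build_report = lambda e: f"report for {e['time']}"
    print(handle_scheduled_event(sample, processed, build_report))
    print(handle_scheduled_event(sample, processed, build_report))  # no second run
```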
## API Gateway Patterns
### HTTP API vs REST API
**Decision Matrix:**
| Feature | HTTP API | REST API |
|---------|----------|----------|
| Cost | $1.00/million | $3.50/million |
| Latency | ~35% lower | Standard |
| JWT Authorization | Native | Custom authorizer needed |
| Request Validation | No | Yes |
| API Keys | No | Yes |
| Usage Plans | No | Yes |
| WebSocket | Separate | No |
**Recommendation:** Use HTTP API for 90% of use cases. Use REST API only when request validation, API keys, or usage plans required.
### Lambda Proxy Integration
**Pattern:** Pass entire request to Lambda, return entire response.
**Request Format Received:**
```json
{
  "version": "2.0",
  "routeKey": "POST /items",
  "rawPath": "/items",
  "headers": {
    "content-type": "application/json",
    "user-agent": "Mozilla/5.0"
  },
  "queryStringParameters": {
    "filter": "active"
  },
  "body": "{\"name\":\"item1\"}",
  "isBase64Encoded": false,
  "requestContext": {
    "requestId": "abc123",
    "http": {
      "method": "POST",
      "path": "/items"
    }
  }
}
```
**Required Response Format:**
```json
{
  "statusCode": 200,
  "headers": {
    "Content-Type": "application/json",
    "Access-Control-Allow-Origin": "*"
  },
  "body": "{\"message\":\"Success\"}"
}
```
**Common Errors:**
- Missing statusCode field (required)
- Body not stringified (must be string, not object)
- Relying on header casing (HTTP header names are case-insensitive, and payload format 2.0 delivers request header names in lowercase)
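A minimal Python handler that consumes the request format and produces the response format shown above might look like the following. It is a sketch assuming payload format 2.0; the response fields beyond `message` are illustrative, and `_response` is a local helper rather than part of any AWS library.

```python
import base64
import json


def _response(status_code: int, payload: dict) -> dict:
    """Build a proxy-integration response; the body must be a string."""
    return {
        "statusCode": status_code,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def handler(event, context):
    """Handle an HTTP API (payload format 2.0) request."""
    method = event["requestContext"]["http"]["method"]
    raw_body = event.get("body") or "{}"
    if event.get("isBase64Encoded"):
        raw_body = base64.b64decode(raw_body).decode("utf-8")
    try:
        data = json.loads(raw_body)
    except json.JSONDecodeError:
        return _response(400, {"message": "Invalid JSON body"})
    return _response(200, {
        "message": "Success",
        "method": method,
        "name": data.get("name"),
    })
```

Routing every return path through one helper keeps `statusCode` present and the body stringified, which covers the first two errors in the list.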
### CORS Configuration
**For HTTP API:**
```yaml
CorsConfiguration:
  AllowOrigins:
    - https://example.com
  AllowMethods:
    - GET
    - POST
    - PUT
    - DELETE
  AllowHeaders:
    - Content-Type
    - Authorization
  MaxAge: 300
  AllowCredentials: true
```
**For REST API:**
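Add an OPTIONS method (typically a mock integration) that returns the CORS headers. With Lambda proxy integration, the function must also return the CORS headers on every response.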
### Custom Domain Names
**Pattern:** Map custom domain to API Gateway endpoint.
**Requirements:**
1. ACM certificate in us-east-1 (for edge-optimized) or same region (for regional)
2. Route 53 hosted zone or external DNS provider
3. API mapping configuration
**Configuration:**
```yaml
DomainName:
  DomainName: api.example.com
  CertificateArn: arn:aws:acm:us-east-1:123456789012:certificate/abc
  EndpointConfiguration:
    Types:
      - REGIONAL # or EDGE for CloudFront distribution
ApiMapping:
  DomainName: api.example.com
  ApiId: !Ref HttpApi
  Stage: prod
  ApiMappingKey: v1 # results in api.example.com/v1
```
## Step Functions Orchestration
### Express Workflows vs Standard Workflows
**Decision Matrix:**
| Feature | Express Workflow | Standard Workflow |
|---------|-----------------|-------------------|
| Max Duration | 5 minutes | 1 year |
| Pricing Model | Per request + duration | Per state transition |
| Execution Guarantee | At-least-once | Exactly-once |
| Execution History | CloudWatch Logs | Built-in (90 days) |
| Cost | $1.00 per 1M requests + duration charges | $25.00 per 1M state transitions |
**Use Cases:**
- **Express:** API response orchestration, real-time data processing
- **Standard:** Long-running workflows, ETL pipelines, human approval
### Common State Types
**Task State (Lambda Invocation):**
```json
{
  "ProcessData": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessFunction",
    "TimeoutSeconds": 300,
    "Retry": [{
      "ErrorEquals": ["States.TaskFailed"],
      "IntervalSeconds": 2,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    }],
    "Catch": [{
      "ErrorEquals": ["States.ALL"],
      "Next": "HandleError"
    }],
    "Next": "NextState"
  }
}
```
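The `Retry` block above waits `IntervalSeconds` before the first retry and multiplies the wait by `BackoffRate` for each later one. The Python sketch below computes that schedule so the worst-case added latency is visible; `retry_delays` is an illustrative helper, and it ignores optional settings such as a maximum delay or jitter.

```python
def retry_delays(interval_seconds: float, max_attempts: int,
                 backoff_rate: float) -> list:
    """Return the wait before each retry attempt, in seconds."""
    return [interval_seconds * backoff_rate ** attempt
            for attempt in range(max_attempts)]


if __name__ == "__main__":
    # Values from the Task state above: 2 s interval, 3 retries, rate 2.0.
    delays = retry_delays(2, 3, 2.0)
    print(delays, "total wait:", sum(delays))
```

With the values shown, the retries wait 2, 4, and 8 seconds, so the task can add up to 14 seconds of backoff before the `Catch` clause takes over.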
**Choice State (Conditional Branching):**
```json
{
  "CheckStatus": {
    "Type": "Choice",
    "Choices": [{
      "Variable": "$.status",
      "StringEquals": "SUCCESS",
      "Next": "SuccessState"
    }, {
      "Variable": "$.status",
      "StringEquals": "FAILED",
      "Next": "FailureState"
    }],
    "Default": "DefaultState"
  }
}
```
**Map State (Parallel Processing):**
```json
{
  "ProcessItems": {
    "Type": "Map",
    "ItemsPath": "$.items",
    "MaxConcurrency": 10,
    "Iterator": {
      "StartAt": "ProcessItem",
      "States": {
        "ProcessItem": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessItem",
          "End": true
        }
      }
    },
    "Next": "Aggregate"
  }
}
```
### Distributed Map
**Use When:**
- Processing thousands to millions of items
- Items stored in S3 (JSON array or CSV)
- Need massive parallelism (10,000+ concurrent executions)
**Configuration:**
```json
{
  "ProcessLargeDataset": {
    "Type": "Map",
    "ItemReader": {
      "Resource": "arn:aws:states:::s3:getObject",
      "ReaderConfig": { "InputType": "JSON" },
      "Parameters": {
        "Bucket": "my-bucket",
        "Key": "data.json"
      }
    },
    "ItemSelector": {
      "item.$": "$$.Map.Item.Value"
    },
    "MaxConcurrency": 10000,
    "ToleratedFailurePercentage": 5,
    "ResultWriter": {
      "Resource": "arn:aws:states:::s3:putObject",
      "Parameters": {
        "Bucket": "output-bucket",
        "Prefix": "results/"
      }
    },
    "ItemProcessor": {
      "ProcessorConfig": {
        "Mode": "DISTRIBUTED",
        "ExecutionType": "STANDARD"
      },
      "StartAt": "ProcessItem",
      "States": {
        "ProcessItem": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:Process",
          "End": true
        }
      }
    },
    "End": true
  }
}
```
**Performance:**
- Can process millions of items in minutes
- Automatic result aggregation to S3
- Built-in fault tolerance
## EventBridge Patterns
### Event-Driven Architecture
**Pattern:** Decouple producers and consumers using EventBridge.
**Architecture:**
```
Producer Service → EventBridge Event Bus → EventBridge Rule → Target (Lambda/SQS/Step Functions)
→ Target (Additional consumers)
```
**Event Pattern Matching:**
```json
{
"source": ["myapp.orders"],
"detail-type": ["Order Placed"],
"detail": {
"amount": [{"numeric": [">=", 100]}],
"status": ["confirmed"]
}
}
```
**Advanced Filtering:**
```json
{
"source": ["myapp.users"],
"detail": {
"location": {
"state": [
{"prefix": "US-"},
"CA-BC",
"CA-ON"
]
},
"metadata": {
"plan": [
{"anything-but": "free"}
]
}
}
}
```
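To reason about which events a pattern accepts before deploying a rule, it can help to evaluate the pattern locally. The sketch below covers only the operators used above (exact match, `prefix`, `anything-but`, `numeric`); `matches` is a hypothetical helper and a simplified approximation of EventBridge's matching, not its implementation.

```python
import operator

_NUMERIC_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
                "<=": operator.le, "=": operator.eq}


def _matches_value(matcher, value) -> bool:
    """Check one event value against one entry of a pattern array."""
    if not isinstance(matcher, dict):
        return matcher == value                       # exact match
    if "prefix" in matcher:
        return isinstance(value, str) and value.startswith(matcher["prefix"])
    if "anything-but" in matcher:
        excluded = matcher["anything-but"]
        excluded = excluded if isinstance(excluded, list) else [excluded]
        return value not in excluded
    if "numeric" in matcher:
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        rules = matcher["numeric"]                    # e.g. [">=", 100]
        return all(_NUMERIC_OPS[rules[i]](value, rules[i + 1])
                   for i in range(0, len(rules), 2))
    raise ValueError(f"unsupported matcher: {matcher}")


def matches(pattern: dict, event: dict) -> bool:
    """Every pattern key must be present; any array entry may match."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):                # nested object: recurse
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif not any(_matches_value(m, event[key]) for m in expected):
            return False
    return True


ORDER_PATTERN = {
    "source": ["myapp.orders"],
    "detail-type": ["Order Placed"],
    "detail": {"amount": [{"numeric": [">=", 100]}], "status": ["confirmed"]},
}
USER_PATTERN = {
    "source": ["myapp.users"],
    "detail": {
        "location": {"state": [{"prefix": "US-"}, "CA-BC", "CA-ON"]},
        "metadata": {"plan": [{"anything-but": "free"}]},
    },
}

order = {"source": "myapp.orders", "detail-type": "Order Placed",
         "detail": {"amount": 150, "status": "confirmed"}}
print(matches(ORDER_PATTERN, order))
```

For authoritative results, test patterns against sample events with the EventBridge console or API before relying on them.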
### EventBridge Pipes (2022+ Feature)
**Pattern:** Simplified event processing with built-in filtering and enrichment.
**Architecture:**
```
Source (SQS/Kinesis/DynamoDB) → Filter → Enrichment (Lambda/API) → Target
```
**Use When:**
- Need to filter events before processing
- Require data enrichment from external API
- Want simpler configuration than Lambda + EventBridge
**Configuration:**
```yaml
Pipe:
  Source: !GetAtt SourceQueue.Arn
  SourceParameters:
    SqsQueueParameters:
      BatchSize: 10
      MaximumBatchingWindowInSeconds: 5
    # Filters live under SourceParameters.FilterCriteria
    FilterCriteria:
      Filters:
        - Pattern: |
            {
              "body": {
                "amount": [{"numeric": [">", 100]}]
              }
            }
  Enrichment: !GetAtt EnrichmentFunction.Arn
  EnrichmentParameters:
    InputTemplate: |
      {
        "orderId": <$.body.orderId>,
        "customerId": <$.body.customerId>
      }
  Target: !GetAtt TargetFunction.Arn
  TargetParameters:
    InputTemplate: |
      {
        "enrichedData": <$.body>,
        "timestamp": <$.metadata.timestamp>
      }
```
**Benefits:**
- No custom Lambda code for routing
- Built-in transformation using JSONPath
- Automatic retries and DLQ support
### Schema Registry
**Pattern:** Define and version event schemas for type safety.
**Benefits:**
- Auto-generate code bindings (Java, Python, TypeScript)
- Schema validation
- Version management
- Discovery of available events
**Example Schema:**
```json
{
"openapi": "3.0.0",
"info": {
"version": "1.0.0",
"title": "OrderPlaced"
},
"paths": {},
"components": {
"schemas": {
"OrderPlaced": {
"type": "object",
"required": ["orderId", "customerId", "amount"],
"properties": {
"orderId": {
"type": "string",
"format": "uuid"
},
"customerId": {
"type": "string"
},
"amount": {
"type": "number",
"minimum": 0
},
"status": {
"type": "string",
"enum": ["pending", "confirmed", "shipped"]
}
}
}
}
}
}
```
## DynamoDB Integration
### Single-Table Design
**Pattern:** Store multiple entity types in one table using GSIs.
**Primary Key Design:**
```
PK: CUSTOMER#123 SK: METADATA
PK: CUSTOMER#123 SK: ORDER#456
PK: CUSTOMER#123 SK: ORDER#789
PK: ORDER#456 SK: METADATA
PK: ORDER#456 SK: ITEM#1
```
**Access Patterns:**
1. Get customer: GetItem PK=CUSTOMER#123, SK=METADATA
2. Get customer orders: Query PK=CUSTOMER#123, SK begins_with ORDER#
3. Get order details: Query PK=ORDER#456
**GSI for Inverted Access:**
```
GSI1PK: ORDER#456 GSI1SK: CUSTOMER#123
GSI1PK: ORDER#789 GSI1SK: CUSTOMER#123
```
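Find the customer that owns an order by querying GSI1 with GSI1PK=ORDER#456. Orders for a customer are already served by the base table (access pattern 2).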
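The key layout can be exercised without a real table. The sketch below keeps the items from the primary key design in a plain list and imitates a `Query` with an optional `begins_with` condition on the sort key; `TABLE` and `query` are hypothetical stand-ins, not DynamoDB API calls.

```python
# Items from the primary key design above, held in memory for illustration
TABLE = [
    {"PK": "CUSTOMER#123", "SK": "METADATA"},
    {"PK": "CUSTOMER#123", "SK": "ORDER#456"},
    {"PK": "CUSTOMER#123", "SK": "ORDER#789"},
    {"PK": "ORDER#456", "SK": "METADATA"},
    {"PK": "ORDER#456", "SK": "ITEM#1"},
]


def query(pk: str, sk_prefix: str = "") -> list[dict]:
    """Imitate Query: exact partition key, optional begins_with on the sort key."""
    hits = [item for item in TABLE
            if item["PK"] == pk and item["SK"].startswith(sk_prefix)]
    return sorted(hits, key=lambda item: item["SK"])  # results come back in SK order


customer_orders = query("CUSTOMER#123", "ORDER#")   # access pattern 2
order_details = query("ORDER#456")                  # access pattern 3
print([item["SK"] for item in customer_orders])
print([item["SK"] for item in order_details])
```

Because one partition holds the customer and its orders, a single query returns related items without a join.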
### Lambda + DynamoDB Best Practices
**Batch Operations:**
```python
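import boto3

dynamodb = boto3.resource('dynamodb')  # resource API accepts native Python types
item1 = {'PK': 'CUSTOMER#123', 'SK': 'METADATA'}
item2 = {'PK': 'CUSTOMER#456', 'SK': 'METADATA'}

# Use batch_write_item for multiple puts (25 items max)
write_response = dynamodb.batch_write_item(
    RequestItems={
        'MyTable': [
            {'PutRequest': {'Item': item1}},
            {'PutRequest': {'Item': item2}},
        ]
    }
)
# Resubmit anything listed in write_response['UnprocessedItems']

# Use batch_get_item for multiple reads (100 items max)
response = dynamodb.batch_get_item(
    RequestItems={
        'MyTable': {
            'Keys': [
                {'PK': 'CUSTOMER#123', 'SK': 'METADATA'},
                {'PK': 'CUSTOMER#456', 'SK': 'METADATA'},
            ]
        }
    }
)
# Resubmit anything listed in response['UnprocessedKeys']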
```
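Because a single batch write accepts at most 25 items, larger lists have to be split before they are sent. A minimal sketch of that chunking step (the helper name `chunked` is hypothetical; sending each chunk and retrying unprocessed items is left out):

```python
from typing import Iterator

MAX_BATCH_WRITE = 25  # per-call limit stated above


def chunked(items: list, size: int = MAX_BATCH_WRITE) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


items = [{"PK": f"CUSTOMER#{i}", "SK": "METADATA"} for i in range(60)]
batches = list(chunked(items))
print([len(batch) for batch in batches])
```

With the boto3 resource API, `Table.batch_writer()` performs this buffering and resubmission automatically and is usually the simpler choice.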
**Connection Reuse:**
```python
# Initialize client OUTSIDE handler for reuse
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('MyTable')


def lambda_handler(event, context):
    # Reuses the connection opened by an earlier invocation of this environment
    response = table.get_item(Key={'PK': 'CUSTOMER#123', 'SK': 'METADATA'})
    return response.get('Item')
```
**DynamoDB Streams Processing:**
```python
def lambda_handler(event, context):
    # Images arrive in DynamoDB JSON; which ones are present depends on StreamViewType
    for record in event['Records']:
        if record['eventName'] == 'INSERT':
            new_item = record['dynamodb']['NewImage']
            # Process new item
        elif record['eventName'] == 'MODIFY':
            old_item = record['dynamodb']['OldImage']
            new_item = record['dynamodb']['NewImage']
            # Process update
        elif record['eventName'] == 'REMOVE':
            old_item = record['dynamodb']['OldImage']
            # Process deletion
```
## Lambda SnapStart Configuration
### Java Cold Start Optimization
**Use When:**
- Using a Java 11 or later managed runtime (SnapStart has also supported Python 3.12+ and .NET 8+ since late 2024)
- Cold start latency is critical (APIs, synchronous processing)
- Application initialization is expensive (framework startup, connection pools)
**Performance Improvement:**
- Cold start: 10-15 seconds → 200-500 milliseconds
- Warm start: No change (still fast)
- Cost: Same as regular Lambda
**Configuration:**
```yaml
Function:
Runtime: java17
SnapStart:
ApplyOn: PublishedVersions
AutoPublishAlias: live
```
**Requirements:**
1. SnapStart applies only to published versions (not $LATEST)
2. Point an alias at the published version so callers have a stable target
3. Invoke the alias (or version) ARN, not the unqualified function ARN
**Code Considerations:**
**Avoid:**
```java
// Network connections in initialization
private static HttpClient client = HttpClient.newBuilder()
.connectTimeout(Duration.ofSeconds(10))
.build(); // Don't create in static initializer
```
**Prefer:**
```java
// Lazy initialization
private static HttpClient client;
private static HttpClient getClient() {
if (client == null) {
client = HttpClient.newBuilder()
.connectTimeout(Duration.ofSeconds(10))
.build();
}
return client;
}
```
**Uniqueness Requirements:**
- Generate unique IDs in handler, not initialization
- Refresh credentials on each invocation
- Don't cache sensitive data in static variables
## Response Streaming
### Large Response Pattern (2023+ Feature)
**Use When:**
- Response size exceeds 6 MB (synchronous limit)
- Need to stream data to client incrementally
- Generating large reports or files
**Maximum Size:**
- Standard response: 6 MB
- Streaming response: 20 MB
**Configuration:**
```yaml
Function:
Runtime: nodejs20.x # Native response streaming requires a Node.js managed runtime
InvokeMode: RESPONSE_STREAM
```
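**Python Implementation:**
Managed Python runtimes do not stream natively: a generator returned from the handler is not sent to the client incrementally. To stream from Python, run a web framework behind the AWS Lambda Web Adapter or use a custom runtime, and hand a chunk generator such as the one below to the framework's streaming response.
```python
import json


def generate_data():
    # Produce newline-delimited JSON, one chunk at a time
    for i in range(1000):
        yield json.dumps({'chunk': i}) + '\n'
```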
**Node.js Implementation:**
```javascript
import { Readable } from 'stream';
import { pipeline } from 'stream/promises';

export const handler = awslambda.streamifyResponse(
  async (event, responseStream, context) => {
    // pipeline() resolves only after every chunk is written and the stream is closed
    await pipeline(Readable.from(generateData()), responseStream);
  }
);

function* generateData() {
  for (let i = 0; i < 1000; i++) {
    yield JSON.stringify({ chunk: i }) + '\n';
  }
}
```
**API Gateway Integration:**
Not supported with API Gateway. Use Lambda Function URL.
**Function URL Configuration:**
```yaml
FunctionUrl:
AuthType: AWS_IAM # or NONE for public
InvokeMode: RESPONSE_STREAM
Cors:
AllowOrigins:
- '*'
AllowMethods:
- GET
- POST
```
## Error Handling and Retry Logic
### Retry Configuration
**Automatic Retries by Invocation Type:**
| Invocation Type | Retries | Use Case |
|----------------|---------|----------|
| Synchronous (API Gateway) | 0 | Client should retry |
| Asynchronous (S3, EventBridge) | 2 | AWS retries automatically |
| Event Source (SQS, Kinesis) | Until success or TTL | Configurable |
**Custom Retry Configuration:**
```yaml
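EventSourceMapping:
  FunctionName: !Ref ProcessFunction
  # These retry settings apply to Kinesis and DynamoDB stream sources only.
  # For SQS, retries are governed by the queue's redrive policy (maxReceiveCount).
  EventSourceArn: !GetAtt MyStream.Arn
  StartingPosition: LATEST
  MaximumRetryAttempts: 3
  MaximumRecordAgeInSeconds: 3600 # Discard after 1 hour
  BisectBatchOnFunctionError: true # Split failed batch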
```
**Asynchronous Retry Configuration:**
```yaml
Function:
EventInvokeConfig:
MaximumRetryAttempts: 1 # 0-2 retries
MaximumEventAgeInSeconds: 3600 # Discard after 1 hour
DestinationConfig:
OnSuccess:
Destination: !GetAtt SuccessQueue.Arn
OnFailure:
Destination: !GetAtt DLQ.Arn
```
### Dead Letter Queue (DLQ)
**Pattern:** Send failed events to SQS or SNS for manual processing.
**Configuration:**
```yaml
Function:
DeadLetterConfig:
TargetArn: !GetAtt DLQ.Arn
DLQ:
Type: AWS::SQS::Queue
Properties:
MessageRetentionPeriod: 1209600 # 14 days
VisibilityTimeout: 300
```
**DLQ Processing:**
```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


# Separate Lambda to process DLQ
def dlq_handler(event, context):
    for record in event['Records']:
        body = json.loads(record['body'])
        # Log to CloudWatch or external monitoring
        logger.error(f"Failed event: {body}")
        # Send alert to SNS/PagerDuty
        # Store in S3 for analysis
```
### Idempotency
**Pattern:** Ensure safe retries using idempotency keys.
**Implementation (Python):**
```python
import hashlib
import json
import logging
import time

import boto3

logger = logging.getLogger()
dynamodb = boto3.resource('dynamodb')
idempotency_table = dynamodb.Table('IdempotencyTable')


def lambda_handler(event, context):
    # Generate idempotency key from event
    key = hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()
    # Check if already processed
    try:
        response = idempotency_table.get_item(Key={'RequestId': key})
        if 'Item' in response:
            # Already processed, return cached result
            return json.loads(response['Item']['Result'])
    except Exception as e:
        logger.error(f"Idempotency check failed: {e}")
    # Process event (process_event is the application's own business logic)
    result = process_event(event)
    # Store result
    try:
        idempotency_table.put_item(
            Item={
                'RequestId': key,
                'Result': json.dumps(result),
                'TTL': int(time.time()) + 86400  # 24 hours
            }
        )
    except Exception as e:
        logger.error(f"Failed to store idempotency: {e}")
    return result
```
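The key derivation is the part of this pattern that most often goes wrong, so it is worth checking in isolation. The sketch below reuses the same hashing approach; `idempotency_key` is a hypothetical helper name and the sample events are made up.

```python
import hashlib
import json


def idempotency_key(event: dict) -> str:
    """Hash the event with sorted keys so field order does not change the key."""
    canonical = json.dumps(event, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


first_delivery = {"orderId": "456", "amount": 100}
retry_same_payload = {"amount": 100, "orderId": "456"}                     # same data, different order
retry_with_timestamp = {"orderId": "456", "amount": 100, "attempt": 2}     # extra field

print(idempotency_key(first_delivery) == idempotency_key(retry_same_payload))
print(idempotency_key(first_delivery) == idempotency_key(retry_with_timestamp))
```

If retries can carry changing fields such as timestamps or attempt counters, derive the key from a stable business identifier instead of the whole event.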
## Performance Optimization
### Memory vs CPU Tradeoff
**Key Insight:** CPU allocation scales linearly with memory.
| Memory | vCPU | Best For |
|--------|------|----------|
| 128 MB | ~0.07 vCPU | Simple transformations |
| 512 MB | ~0.29 vCPU | Light API handlers |
| 1024 MB | ~0.58 vCPU | General purpose |
| 1769 MB | 1.0 vCPU | CPU-bound tasks |
| 3538 MB | 2.0 vCPU | Heavy processing |
| 10240 MB | 6.0 vCPU | Maximum performance |
**Cost vs Performance:**
- 128 MB, 1000ms = $0.0000020833 per invocation (duration charge only)
- 1024 MB, 125ms = $0.0000020833 per invocation (8x faster, same cost!)
**Recommendation:** Use AWS Lambda Power Tuning to find optimal memory.
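The comparison is easy to reproduce for other configurations. This sketch assumes the on-demand rate of $0.0000166667 per GB-second and ignores the per-request charge; `duration_cost` is a hypothetical helper.

```python
PRICE_PER_GB_SECOND = 0.0000166667  # assumed on-demand rate


def duration_cost(memory_mb: int, duration_ms: float) -> float:
    """Duration charge for one invocation, in dollars."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND


small = duration_cost(128, 1000)   # little memory, slow
large = duration_cost(1024, 125)   # 8x the memory, 8x faster
print(f"128 MB:  ${small:.10f}")
print(f"1024 MB: ${large:.10f}")
```

More memory only pays off when the function actually runs proportionally faster, which is what Power Tuning measures.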
### Connection Pooling
**Pattern:** Reuse database connections across invocations.
**RDS Connection Pooling:**
```python
import os

import pymysql

# Initialize OUTSIDE handler
connection = None


def get_connection():
    global connection
    if connection is None or not connection.open:
        connection = pymysql.connect(
            host=os.environ['DB_HOST'],
            user=os.environ['DB_USER'],
            password=os.environ['DB_PASSWORD'],
            database=os.environ['DB_NAME'],
            connect_timeout=5,
            cursorclass=pymysql.cursors.DictCursor
        )
    return connection


def lambda_handler(event, context):
    conn = get_connection()
    with conn.cursor() as cursor:
        cursor.execute("SELECT * FROM users WHERE id = %s", (event['userId'],))
        result = cursor.fetchone()
    return result
```
**RDS Proxy (Recommended):**
- Manages connection pooling automatically
- Reduces failover time by up to 66% (AWS's figure for Aurora and RDS)
- Built-in failover
- IAM authentication support
**Configuration:**
```yaml
DBProxy:
Type: AWS::RDS::DBProxy
Properties:
EngineFamily: POSTGRESQL
Auth:
- AuthScheme: SECRETS
SecretArn: !Ref DBSecret
RoleArn: !GetAtt ProxyRole.Arn
VpcSubnetIds:
- !Ref PrivateSubnet1
- !Ref PrivateSubnet2
```
### Provisioned Concurrency
**Use When:**
- Cold starts are unacceptable (<100ms p99 required)
- Predictable high traffic (not cost-effective for variable traffic)
- Cost is secondary to performance
**Cost Model:**
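- Provisioned concurrency: $0.0000041667 per GB-second for the configured capacity, plus $0.0000097222 per GB-second of execution time
- On-demand: $0.0000166667 per GB-second of execution time
- Provisioned capacity is billed continuously (charged even when idle)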
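Whether provisioned concurrency is cheaper depends on how busy each provisioned environment is. The sketch below estimates the break-even utilization; the execution-time rate for provisioned environments ($0.0000097222 per GB-second) is an assumed us-east-1 x86 figure, request charges are ignored, and the function names are hypothetical.

```python
ON_DEMAND_PER_GB_S = 0.0000166667
PROVISIONED_PER_GB_S = 0.0000041667           # billed for every second it is enabled
PROVISIONED_DURATION_PER_GB_S = 0.0000097222  # assumed rate while executing


def hourly_cost_on_demand(memory_gb: float, utilization: float) -> float:
    """Cost of one environment-hour when busy for `utilization` (0..1) of the time."""
    return memory_gb * 3600 * utilization * ON_DEMAND_PER_GB_S


def hourly_cost_provisioned(memory_gb: float, utilization: float) -> float:
    return memory_gb * 3600 * (PROVISIONED_PER_GB_S
                               + utilization * PROVISIONED_DURATION_PER_GB_S)


def breakeven_utilization() -> float:
    """Utilization above which provisioned concurrency is the cheaper option."""
    return PROVISIONED_PER_GB_S / (ON_DEMAND_PER_GB_S - PROVISIONED_DURATION_PER_GB_S)


print(f"break-even utilization: {breakeven_utilization():.0%}")
print(hourly_cost_on_demand(1.0, 0.2), hourly_cost_provisioned(1.0, 0.2))
```

Below the break-even point, provisioned concurrency is a latency purchase rather than a saving.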
**Configuration:**
```yaml
Function:
ProvisionedConcurrencyConfig:
ProvisionedConcurrentExecutions: 10
AutoScalingConfig:
MinCapacity: 5
MaxCapacity: 100
TargetValue: 0.70 # 70% utilization
```
**Application Auto Scaling:**
```yaml
ScalableTarget:
ServiceNamespace: lambda
ResourceId: !Sub "function:${FunctionName}:${Alias}"
ScalableDimension: lambda:function:ProvisionedConcurrency
MinCapacity: 5
MaxCapacity: 100
ScalingPolicy:
PolicyType: TargetTrackingScaling
TargetTrackingScalingPolicyConfiguration:
TargetValue: 0.70
PredefinedMetricSpecification:
PredefinedMetricType: LambdaProvisionedConcurrencyUtilization
```
## Anti-Patterns
### Don't: Run Long-Running Tasks
**Problem:** Lambda has 15-minute timeout limit.
**Solution:** Use Step Functions, ECS Fargate, or Batch.
### Don't: Store State in /tmp
**Problem:** /tmp is ephemeral and limited to 512 MB by default (configurable up to 10 GB since 2022).
**Solution:** Use S3, EFS, or ElastiCache for persistent storage.
### Don't: Use Lambda for Predictable High Traffic
**Problem:** More expensive than Fargate/EC2 at constant utilization.
**Breakeven Analysis:**
- Lambda: $0.0000166667 per GB-second
- Fargate (1 vCPU, 2GB): $0.000011 per GB-second at 100% utilization
**Solution:** Use Fargate or EC2 for always-on workloads.
### Don't: Hardcode Configuration
**Problem:** Requires redeployment for configuration changes.
**Solution:** Use environment variables, Parameter Store, or AppConfig.
### Don't: Ignore Cold Start Impact
**Problem:** First invocation is slow (Java: 10-15s, Python: 200-500ms).
**Solutions:**
1. Use SnapStart (Java only)
2. Provision concurrency (expensive)
3. Keep functions warm (cron trigger)
4. Minimize dependencies
5. Use lighter runtimes (Node.js, Python over Java)
### Don't: Process Large Files in Lambda
**Problem:** Memory tops out at 10 GB, /tmp at 10 GB, and execution at 15 minutes.
**Solution:** Use S3 Select, Athena, or EMR for large data processing.
### Don't: Synchronous Step Functions for APIs
**Problem:** Express workflows have a 5-minute limit, and Standard workflows cannot be started synchronously, so the API has to poll or use a callback to get the result.
**Solution:** Use direct Lambda integration or asynchronous workflows.
### Don't: Ignore Concurrent Execution Limits
**Problem:** Account limit of 1,000 concurrent executions (can request increase).
**Solution:**
1. Request limit increase
2. Use reserved concurrency to prevent throttling
3. Implement backpressure (SQS rate limiting)
4. Use Provisioned Concurrency for critical functions
### Don't: Skip VPC Best Practices
**Problem:** Lambda in VPC requires ENIs (slow cold starts pre-2019).
**Modern Solution (Post-2019):**
- Hyperplane ENIs (shared across functions)
- No cold start penalty
- Use VPC for RDS/ElastiCache access
- Use VPC endpoints for AWS services (avoid NAT Gateway costs)
**Configuration:**
```yaml
Function:
VpcConfig:
SecurityGroupIds:
- !Ref LambdaSecurityGroup
SubnetIds:
- !Ref PrivateSubnet1
- !Ref PrivateSubnet2
```
```
### references/container-patterns.md
```markdown
# AWS Container Patterns
## Table of Contents
- [ECS Service Patterns](#ecs-service-patterns)
- [EKS Cluster Patterns](#eks-cluster-patterns)
- [Fargate vs EC2 Launch Types](#fargate-vs-ec2-launch-types)
- [Task Definition Best Practices](#task-definition-best-practices)
- [Service Discovery and Load Balancing](#service-discovery-and-load-balancing)
- [Auto Scaling Strategies](#auto-scaling-strategies)
- [Service Connect (Service Mesh)](#service-connect-service-mesh)
- [EKS Pod Identities](#eks-pod-identities)
- [Container Security](#container-security)
- [Logging and Monitoring](#logging-and-monitoring)
- [Cost Optimization](#cost-optimization)
- [Anti-Patterns](#anti-patterns)
## ECS Service Patterns
### Basic Web Application (Fargate + ALB)
**Pattern:** Containerized web service with auto-scaling and load balancing.
**Architecture:**
```
Internet → ALB (Application Load Balancer)
→ Target Group
→ ECS Service (Fargate tasks)
→ RDS or DynamoDB
→ ElastiCache (optional)
```
**Use When:**
- Building containerized web applications
- Need auto-scaling without managing servers
- Docker-based deployment workflow
- Team lacks Kubernetes expertise
**Task Definition Components:**
```yaml
TaskDefinition:
Family: web-app
NetworkMode: awsvpc # Required for Fargate
RequiresCompatibilities:
- FARGATE
Cpu: '512' # 0.5 vCPU
Memory: '1024' # 1 GB
ExecutionRoleArn: !GetAtt ExecutionRole.Arn # Pull image, logs
TaskRoleArn: !GetAtt TaskRole.Arn # Container permissions
ContainerDefinitions:
- Name: web
Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/web-app:latest
PortMappings:
- ContainerPort: 80
Protocol: tcp
Environment:
- Name: NODE_ENV
Value: production
Secrets:
- Name: DB_PASSWORD
ValueFrom: arn:aws:secretsmanager:us-east-1:123456789012:secret:db-pass
LogConfiguration:
LogDriver: awslogs
Options:
awslogs-group: /ecs/web-app
awslogs-region: us-east-1
awslogs-stream-prefix: web
HealthCheck:
Command:
- CMD-SHELL
- curl -f http://localhost/health || exit 1
Interval: 30
Timeout: 5
Retries: 3
StartPeriod: 60
```
**Service Configuration:**
```yaml
Service:
ServiceName: web-service
Cluster: !Ref ECSCluster
TaskDefinition: !Ref TaskDefinition
DesiredCount: 2
LaunchType: FARGATE
NetworkConfiguration:
AwsvpcConfiguration:
Subnets:
- !Ref PrivateSubnet1
- !Ref PrivateSubnet2
SecurityGroups:
- !Ref ServiceSecurityGroup
AssignPublicIp: DISABLED
LoadBalancers:
- TargetGroupArn: !Ref TargetGroup
ContainerName: web
ContainerPort: 80
HealthCheckGracePeriodSeconds: 60
DeploymentConfiguration:
MaximumPercent: 200
MinimumHealthyPercent: 100
DeploymentCircuitBreaker:
Enable: true
Rollback: true
```
**Cost Estimate (2 tasks, 24/7):**
- Fargate (0.5 vCPU, 1GB): ~$35/month
- ALB: ~$20/month
- Data transfer: ~$5/month
- **Total: ~$60/month**
### Background Worker Pattern
**Pattern:** Process asynchronous jobs from SQS queue.
**Architecture:**
```
SQS Queue → ECS Service (Fargate tasks)
→ DynamoDB (job status)
→ S3 (results)
```
**Use When:**
- Processing background jobs (image processing, report generation)
- Need auto-scaling based on queue depth
- Variable workload patterns
- Prefer containers over Lambda (>15 min runtime, >10GB memory)
**Task Definition Differences:**
```yaml
ContainerDefinitions:
- Name: worker
Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/worker:latest
# No PortMappings needed
Environment:
- Name: QUEUE_URL
Value: !Ref WorkQueue
- Name: BATCH_SIZE
Value: '10'
Essential: true
```
**Service Configuration:**
```yaml
Service:
ServiceName: worker-service
DesiredCount: 1 # Scale based on queue depth
# No LoadBalancers configuration
```
**Auto Scaling by Queue Depth:**
```yaml
ScalableTarget:
ServiceNamespace: ecs
ResourceId: !Sub "service/${Cluster}/${ServiceName}"
ScalableDimension: ecs:service:DesiredCount
MinCapacity: 1
MaxCapacity: 20
ScalingPolicy:
PolicyType: TargetTrackingScaling
TargetTrackingScalingPolicyConfiguration:
CustomizedMetricSpecification:
MetricName: ApproximateNumberOfMessagesVisible
Namespace: AWS/SQS
Statistic: Average
Dimensions:
- Name: QueueName
Value: !GetAtt WorkQueue.QueueName
TargetValue: 100 # Target ~100 visible messages in the queue (queue-wide, not per task)
ScaleInCooldown: 300
ScaleOutCooldown: 60
```
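`ApproximateNumberOfMessagesVisible` is a queue-wide number, so scaling on backlog per task needs one more step: dividing the backlog by what a single task can absorb. A minimal sketch of that calculation, using the capacity bounds from the block above (`desired_task_count` and the per-task figure are illustrative assumptions):

```python
import math


def desired_task_count(visible_messages: int, messages_per_task: int,
                       min_tasks: int = 1, max_tasks: int = 20) -> int:
    """Tasks needed to keep the backlog per task at or below the target."""
    needed = math.ceil(visible_messages / messages_per_task)
    return max(min_tasks, min(max_tasks, needed))


for backlog in (0, 250, 5000):
    print(backlog, desired_task_count(backlog, messages_per_task=100))
```

In practice the same ratio can be published as a custom metric (or computed with CloudWatch metric math) and used as the target-tracking metric.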
### Scheduled Task Pattern
**Pattern:** Run containerized cron jobs using EventBridge.
**Architecture:**
```
EventBridge Rule (schedule) → ECS Task (Fargate)
→ Process data
→ Store results in S3
```
**Use When:**
- Scheduled batch processing
- Need more than 15 minutes runtime (Lambda limit)
- Require more than 10 GB memory
- Complex dependencies or large Docker images
**EventBridge Rule:**
```yaml
ScheduledRule:
ScheduleExpression: cron(0 2 * * ? *) # 2 AM UTC daily
State: ENABLED
Targets:
- Arn: !GetAtt ECSCluster.Arn
RoleArn: !GetAtt EventsRole.Arn
EcsParameters:
TaskDefinitionArn: !Ref TaskDefinition
LaunchType: FARGATE
NetworkConfiguration:
AwsVpcConfiguration:
Subnets:
- !Ref PrivateSubnet1
SecurityGroups:
- !Ref TaskSecurityGroup
AssignPublicIp: DISABLED
TaskCount: 1
```
**Benefits over Lambda:**
- No 15-minute timeout
- Up to 120 GB memory (Fargate)
- More familiar Docker ecosystem
- Can use same image as production service
## EKS Cluster Patterns
### Production-Ready EKS Cluster
**Pattern:** Managed Kubernetes cluster with best practices.
**Architecture:**
```
VPC (10.0.0.0/16)
├── Public Subnets (NAT Gateways, Load Balancers)
├── Private Subnets (EKS worker nodes)
└── Database Subnets (RDS, ElastiCache)
EKS Control Plane (AWS-managed)
└── Node Groups (EC2 or Fargate)
└── Pods (application containers)
```
**Use When:**
- Team has Kubernetes expertise
- Need Kubernetes ecosystem (Helm, Operators, Istio)
- Multi-cloud or hybrid cloud strategy
- Complex orchestration requirements
- Migrating from on-premises Kubernetes
**Cluster Configuration:**
```yaml
EKSCluster:
Name: production-cluster
Version: '1.28'
RoleArn: !GetAtt ClusterRole.Arn
ResourcesVpcConfig:
SubnetIds:
- !Ref PrivateSubnet1
- !Ref PrivateSubnet2
- !Ref PrivateSubnet3
EndpointPublicAccess: false # Private cluster
EndpointPrivateAccess: true
SecurityGroupIds:
- !Ref ClusterSecurityGroup
Logging:
ClusterLogging:
EnabledTypes:
- Type: api
- Type: audit
- Type: authenticator
- Type: controllerManager
- Type: scheduler
EncryptionConfig:
- Resources:
- secrets
Provider:
KeyArn: !GetAtt KMSKey.Arn
```
**Managed Node Group (Recommended):**
```yaml
NodeGroup:
ClusterName: !Ref EKSCluster
NodegroupName: general-purpose
NodeRole: !GetAtt NodeRole.Arn
Subnets:
- !Ref PrivateSubnet1
- !Ref PrivateSubnet2
ScalingConfig:
MinSize: 2
MaxSize: 10
DesiredSize: 3
InstanceTypes:
- t3.medium
- t3a.medium # AMD alternative, 10% cheaper
AmiType: AL2_x86_64 # Amazon Linux 2
CapacityType: ON_DEMAND # or SPOT for 70% savings
UpdateConfig:
MaxUnavailable: 1 # Rolling updates
Labels:
environment: production
workload-type: general
Taints:
- Key: dedicated
Value: general
Effect: NO_SCHEDULE # EKS API enum; NoSchedule is the Kubernetes spelling
```
**Fargate Profile (Serverless Nodes):**
```yaml
FargateProfile:
ClusterName: !Ref EKSCluster
FargateProfileName: serverless-workloads
PodExecutionRoleArn: !GetAtt FargateRole.Arn
Subnets:
- !Ref PrivateSubnet1
- !Ref PrivateSubnet2
Selectors:
- Namespace: serverless
Labels:
compute-type: fargate
- Namespace: kube-system
Labels:
k8s-app: kube-dns # CoreDNS on Fargate
```
**Cost Breakdown:**
- EKS control plane: $73/month
- Worker nodes (3x t3.medium, 24/7): ~$94/month
- NAT Gateways (2x): ~$65/month
- Data transfer: ~$10/month
- **Total: ~$242/month minimum**
### EKS Auto Mode (2024+ Feature)
**Pattern:** Fully managed node lifecycle, auto-scaling, and upgrades.
**Use When:**
- Want maximum automation
- Minimize operational overhead
- AWS-managed node pools are acceptable
- Simplicity matters more than cost
**Configuration:**
```yaml
EKSCluster:
ComputeConfig:
Enabled: true # Enable Auto Mode
NodePools:
- general-purpose # AWS-managed
NodeRoleArn: !GetAtt AutoModeNodeRole.Arn
```
**Auto Mode Features:**
- Automatic node provisioning
- Auto-scaling without Cluster Autoscaler
- Automated OS patches and upgrades
- Built-in cost optimization
**Cost Impact:**
- ~10-15% higher than self-managed nodes
- Savings from reduced operational overhead
- Automatic spot instance usage
### Multi-Tenant EKS Pattern
**Pattern:** Isolated namespaces with resource quotas and network policies.
**Use When:**
- Multiple teams or applications sharing cluster
- Need cost efficiency (shared control plane)
- Need logical isolation between tenants (for hard isolation, use separate clusters)
- Centralized cluster management
**Namespace Isolation:**
```yaml
apiVersion: v1
kind: Namespace
metadata:
name: team-a
labels:
team: team-a
---
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-a-quota
namespace: team-a
spec:
hard:
requests.cpu: "10"
requests.memory: 20Gi
limits.cpu: "20"
limits.memory: 40Gi
persistentvolumeclaims: "5"
services.loadbalancers: "2"
---
apiVersion: v1
kind: LimitRange
metadata:
name: team-a-limits
namespace: team-a
spec:
limits:
- max:
cpu: "4"
memory: 8Gi
min:
cpu: "100m"
memory: 128Mi
type: Container
```
**Network Policies (Deny by Default):**
```yaml
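apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: team-a
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-within-namespace
  namespace: team-a
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress # deny-all blocks egress too, so the sending side must be allowed as well
  ingress:
    - from:
        - podSelector: {} # Allow from same namespace
  egress:
    - to:
        - podSelector: {} # Allow to same namespace
  # DNS lookups against kube-system still need their own egress rule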
```
**RBAC (Role-Based Access Control):**
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: team-a-developer
namespace: team-a
rules:
- apiGroups: ["", "apps", "batch"]
resources: ["pods", "deployments", "jobs", "services"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
resources: ["secrets", "configmaps"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: team-a-developers
namespace: team-a
subjects:
- kind: Group
name: team-a
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: Role
name: team-a-developer
apiGroup: rbac.authorization.k8s.io
```
## Fargate vs EC2 Launch Types
### Decision Matrix
| Factor | Fargate | EC2 |
|--------|---------|-----|
| **Management** | AWS manages instances | Self-managed instances |
| **Pricing** | Per task-second | Per instance-hour |
| **Scaling** | Task-level (instant) | Instance-level (slower) |
| **Cost (predictable)** | Higher | Lower (RI/SP savings) |
| **Cost (variable)** | Lower | Higher (idle capacity) |
| **Start Time** | ~30 seconds | Instant (if capacity) |
| **Instance Access** | No SSH access | Full SSH access |
| **Customization** | Limited | Full control |
| **Spot Support** | Fargate Spot (70% off) | EC2 Spot (70% off) |
### Cost Comparison Example
**Workload:** Web service, 2 vCPU, 4 GB RAM, 24/7
**Fargate:**
- Pricing: 2 vCPU × $0.04048/hour + 4 GB × $0.004445/hour = $0.09874/hour
- Monthly: $0.09874 × 730 hours = $72.08/month
**EC2 (t3.medium: 2 vCPU, 4 GB):**
- On-Demand: $0.0416/hour = $30.37/month
- 1-year Reserved Instance: $0.0208/hour = $15.18/month
- 3-year Reserved Instance: $0.0125/hour = $9.13/month
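The same arithmetic can be rerun for other task sizes. This sketch assumes Fargate is billed at $0.04048 per vCPU-hour and $0.004445 per GB-hour (us-east-1, Linux/x86) and uses the EC2 hourly rates listed above; the function names are hypothetical.

```python
HOURS_PER_MONTH = 730


def fargate_monthly(vcpu: float, memory_gb: float,
                    vcpu_hour: float = 0.04048, gb_hour: float = 0.004445) -> float:
    """Monthly cost of one always-on Fargate task (both rates are assumptions)."""
    return (vcpu * vcpu_hour + memory_gb * gb_hour) * HOURS_PER_MONTH


def ec2_monthly(hourly_rate: float) -> float:
    """Monthly cost of one always-on EC2 instance."""
    return hourly_rate * HOURS_PER_MONTH


print(f"Fargate 2 vCPU / 4 GB: ${fargate_monthly(2, 4):.2f}")
print(f"t3.medium on-demand:   ${ec2_monthly(0.0416):.2f}")
print(f"t3.medium 1-year RI:   ${ec2_monthly(0.0208):.2f}")
```

Note that t3 instances are burstable, so a like-for-like comparison under sustained CPU load would use a fixed-performance instance family.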
**Recommendation:**
- **Variable traffic:** Fargate (scale to zero)
- **Predictable 24/7:** EC2 with Reserved Instances
- **Development/Test:** Fargate (simpler, pay only when testing)
- **Production high-volume:** EC2 with RI (50-70% cost savings)
### Fargate Spot
**Pattern:** Run fault-tolerant tasks at 70% discount.
**Use When:**
- Batch processing jobs
- CI/CD build workers
- Data processing pipelines
- Development/test environments
**Configuration:**
```yaml
Service:
CapacityProviderStrategy:
- CapacityProvider: FARGATE_SPOT
Weight: 4
Base: 0
- CapacityProvider: FARGATE
Weight: 1
Base: 2 # Always maintain 2 on-demand tasks
```
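`Base` tasks are placed first, and the remainder is divided according to `Weight`. The sketch below shows how the strategy above would split a given task count; `split_tasks` is a hypothetical helper, and assigning any rounding leftover to the highest-weight provider is an arbitrary choice made for the sketch, not documented ECS behavior.

```python
STRATEGY = [
    {"CapacityProvider": "FARGATE_SPOT", "Weight": 4, "Base": 0},
    {"CapacityProvider": "FARGATE", "Weight": 1, "Base": 2},
]


def split_tasks(desired: int, strategy: list[dict]) -> dict[str, int]:
    """Distribute `desired` tasks: Base first, then the rest by Weight."""
    counts: dict[str, int] = {}
    remaining = desired
    for entry in strategy:                       # satisfy Base first
        base = min(entry.get("Base", 0), remaining)
        counts[entry["CapacityProvider"]] = base
        remaining -= base
    total_weight = sum(entry["Weight"] for entry in strategy)
    assigned = 0
    for entry in strategy:                       # split the remainder by Weight
        share = remaining * entry["Weight"] // total_weight
        counts[entry["CapacityProvider"]] += share
        assigned += share
    heaviest = max(strategy, key=lambda entry: entry["Weight"])
    counts[heaviest["CapacityProvider"]] += remaining - assigned  # rounding leftover
    return counts


print(split_tasks(12, STRATEGY))
```

With 12 desired tasks, 2 are pinned to on-demand by `Base` and the other 10 are split 4:1 between Spot and on-demand.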
**Interruption Handling:**
- 2-minute warning before interruption
- Listen for SIGTERM signal
- Gracefully finish current work
- Task automatically restarted
**Python Example:**
```python
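import signal

shutting_down = False


def sigterm_handler(signum, frame):
    # Only set a flag so the batch in progress can finish
    global shutting_down
    print("SIGTERM received, gracefully shutting down")
    shutting_down = True


signal.signal(signal.SIGTERM, sigterm_handler)

# Main processing loop
while not shutting_down:
    process_batch()  # application-specific unit of work

cleanup()  # application-specific: flush buffers, release locks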
```
## Task Definition Best Practices
### Resource Allocation
**CPU and Memory Combinations (Fargate):**
| CPU (vCPU) | Memory Options (GB) |
|------------|---------------------|
| 0.25 | 0.5, 1, 2 |
| 0.5 | 1, 2, 3, 4 |
| 1 | 2, 3, 4, 5, 6, 7, 8 |
| 2 | 4-16 (1 GB increments) |
| 4 | 8-30 (1 GB increments) |
| 8 | 16-60 (4 GB increments) |
| 16 | 32-120 (8 GB increments) |
**Right-Sizing Approach:**
1. Start with 0.5 vCPU, 1 GB (common web apps)
2. Monitor CloudWatch metrics (CPUUtilization, MemoryUtilization)
3. Increase if consistently >70% utilization
4. Decrease if consistently <30% utilization
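Right-sizing has to land on a combination Fargate accepts. The sketch below encodes the table above and picks the smallest valid size that covers a requirement; `FARGATE_MEMORY_GB`, `is_valid_size` and `smallest_fit` are hypothetical names.

```python
# vCPU -> allowed memory sizes in GB, transcribed from the table above
FARGATE_MEMORY_GB = {
    0.25: [0.5, 1, 2],
    0.5: [1, 2, 3, 4],
    1: list(range(2, 9)),
    2: list(range(4, 17)),
    4: list(range(8, 31)),
    8: list(range(16, 61, 4)),
    16: list(range(32, 121, 8)),
}


def is_valid_size(vcpu: float, memory_gb: float) -> bool:
    return memory_gb in FARGATE_MEMORY_GB.get(vcpu, [])


def smallest_fit(vcpu_needed: float, memory_needed_gb: float):
    """Return the smallest (vcpu, memory_gb) pair covering both requirements."""
    for vcpu in sorted(FARGATE_MEMORY_GB):
        if vcpu < vcpu_needed:
            continue
        for memory in FARGATE_MEMORY_GB[vcpu]:
            if memory >= memory_needed_gb:
                return vcpu, memory
    return None  # requirement exceeds the largest Fargate size


print(is_valid_size(0.25, 4))
print(smallest_fit(0.5, 6))
```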
### Health Checks
**Container Health Check:**
```yaml
HealthCheck:
Command:
- CMD-SHELL
- curl -f http://localhost:8080/health || exit 1
Interval: 30 # Seconds between checks
Timeout: 5 # Max time for check to complete
Retries: 3 # Consecutive failures before unhealthy
StartPeriod: 60 # Grace period for startup
```
**Load Balancer Health Check:**
```yaml
TargetGroup:
HealthCheckPath: /health
HealthCheckProtocol: HTTP
HealthCheckIntervalSeconds: 30
HealthCheckTimeoutSeconds: 5
HealthyThresholdCount: 2 # Consecutive successes
UnhealthyThresholdCount: 3 # Consecutive failures
Matcher:
HttpCode: 200 # or 200-299
```
**Best Practices:**
- Use dedicated health check endpoint
- Check critical dependencies (database connectivity)
- Return appropriate status codes
- Keep health checks fast (<1 second)
- Log health check failures
### Environment Variables vs Secrets
**Environment Variables (Plain Text):**
```yaml
Environment:
- Name: NODE_ENV
Value: production
- Name: LOG_LEVEL
Value: info
- Name: API_ENDPOINT
Value: https://api.example.com
```
**Use For:**
- Non-sensitive configuration
- Public endpoints
- Feature flags
- Environment names
**Secrets (Encrypted):**
```yaml
Secrets:
- Name: DB_PASSWORD
ValueFrom: arn:aws:secretsmanager:us-east-1:123456789012:secret:db-pass
- Name: API_KEY
ValueFrom: arn:aws:ssm:us-east-1:123456789012:parameter/api-key
```
**Use For:**
- Database credentials
- API keys
- OAuth tokens
- Private certificates
**IAM Permissions Required (on the task execution role):**
```json
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": [
"secretsmanager:GetSecretValue",
"ssm:GetParameters"
],
"Resource": [
"arn:aws:secretsmanager:us-east-1:123456789012:secret:*",
"arn:aws:ssm:us-east-1:123456789012:parameter/*"
]
}]
}
```
### Logging Configuration
**CloudWatch Logs (Default):**
```yaml
LogConfiguration:
LogDriver: awslogs
Options:
awslogs-group: /ecs/my-service
awslogs-region: us-east-1
awslogs-stream-prefix: task
awslogs-create-group: true
```
**FireLens (Advanced Routing):**
```yaml
LogConfiguration:
LogDriver: awsfirelens
Options:
Name: datadog
apikey: !Ref DatadogApiKey
dd_service: my-service
dd_source: ecs
dd_tags: env:production,team:platform
```
**Supported Destinations:**
- CloudWatch Logs
- Datadog
- New Relic
- Splunk
- Elasticsearch
- S3 (via Kinesis Firehose)
## Service Discovery and Load Balancing
### AWS Cloud Map (Service Discovery)
**Pattern:** DNS-based service discovery for microservices.
**Use When:**
- Microservices architecture
- Service-to-service communication
- Dynamic service endpoints
- No load balancer needed (internal traffic)
**Configuration:**
```yaml
ServiceDiscoveryService:
Name: backend-api
DnsConfig:
NamespaceId: !Ref PrivateNamespace
DnsRecords:
- Type: A
TTL: 10
HealthCheckCustomConfig:
FailureThreshold: 1
PrivateNamespace:
Name: internal.example.com
Vpc: !Ref VPC
```
**Service Registration:**
```yaml
Service:
ServiceRegistries:
- RegistryArn: !GetAtt ServiceDiscoveryService.Arn
ContainerName: backend
ContainerPort: 8080
```
**Usage in Application:**
```python
# Resolve service using DNS
import socket

import requests

hostname = "backend-api.internal.example.com"
ip_address = socket.gethostbyname(hostname)
# Make request
response = requests.get(f"http://{ip_address}:8080/api")
```
**Benefits:**
- No load balancer cost
- Automatic registration/deregistration
- Health check integration
- Low latency (DNS caching)
### Application Load Balancer Integration
**Path-Based Routing:**
```yaml
ListenerRule1:
Priority: 1
Conditions:
- Field: path-pattern
Values:
- /api/*
Actions:
- Type: forward
TargetGroupArn: !Ref BackendTargetGroup
ListenerRule2:
Priority: 2
Conditions:
- Field: path-pattern
Values:
- /admin/*
Actions:
- Type: forward
TargetGroupArn: !Ref AdminTargetGroup
```
**Host-Based Routing:**
```yaml
ListenerRule:
Conditions:
- Field: host-header
Values:
- api.example.com
- api-staging.example.com
Actions:
- Type: forward
TargetGroupArn: !Ref ApiTargetGroup
```
**Header-Based Routing:**
```yaml
ListenerRule:
Conditions:
- Field: http-header
HttpHeaderConfig:
HttpHeaderName: X-API-Version
Values:
- v2
Actions:
- Type: forward
TargetGroupArn: !Ref V2TargetGroup
```
## Auto Scaling Strategies
### Target Tracking (Recommended)
**CPU-Based Scaling:**
```yaml
ScalingPolicy:
PolicyType: TargetTrackingScaling
TargetTrackingScalingPolicyConfiguration:
TargetValue: 70.0 # 70% CPU utilization
PredefinedMetricSpecification:
PredefinedMetricType: ECSServiceAverageCPUUtilization
ScaleInCooldown: 300 # 5 minutes
ScaleOutCooldown: 60 # 1 minute
```
**Memory-Based Scaling:**
```yaml
TargetTrackingScalingPolicyConfiguration:
TargetValue: 80.0 # 80% memory utilization
PredefinedMetricSpecification:
PredefinedMetricType: ECSServiceAverageMemoryUtilization
```
**ALB Request Count Scaling:**
```yaml
TargetTrackingScalingPolicyConfiguration:
  TargetValue: 1000 # 1000 requests per target
  PredefinedMetricSpecification:
    PredefinedMetricType: ALBRequestCountPerTarget
    ResourceLabel: !Sub
      - ${LoadBalancerFullName}/${TargetGroupFullName}
      - LoadBalancerFullName: !GetAtt ALB.LoadBalancerFullName
        TargetGroupFullName: !GetAtt TargetGroup.TargetGroupFullName
```
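**How the Target Value Drives Task Count (Python sketch):**

Target tracking keeps the metric near `TargetValue` by adding or removing tasks. A rough way to reason about it is a proportional estimate: if the average metric is above target, the task count grows by about the same ratio. The function below is a simplified approximation under that assumption, not the service's actual algorithm (which works through CloudWatch alarms and cooldowns); `desired_count` and its `min_count`/`max_count` defaults are hypothetical.

```python
import math


def desired_count(current_count, metric_value, target_value, min_count=1, max_count=20):
    """Approximate the task count that brings the metric back to the target."""
    if current_count <= 0 or target_value <= 0:
        raise ValueError("current_count and target_value must be positive")
    # Proportional estimate: per-task load scales inversely with task count.
    proposed = math.ceil(current_count * metric_value / target_value)
    # Clamp to the scalable target's capacity limits.
    return max(min_count, min(max_count, proposed))


# 4 tasks averaging 90% CPU against a 70% target
print(desired_count(4, 90.0, 70.0))
```

The same estimate applies to the memory and request-count policies; only the metric and target change.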
### Step Scaling (Advanced)
**Use When:**
- Need different scaling behavior at different thresholds
- Complex scaling logic
- Multiple alarm thresholds
**Configuration:**
```yaml
ScalingPolicy:
  PolicyType: StepScaling
  StepScalingPolicyConfiguration:
    AdjustmentType: PercentChangeInCapacity
    Cooldown: 300
    MetricAggregationType: Average
    StepAdjustments:
      - MetricIntervalLowerBound: 0
        MetricIntervalUpperBound: 10
        ScalingAdjustment: 10 # Add 10% capacity
      - MetricIntervalLowerBound: 10
        MetricIntervalUpperBound: 20
        ScalingAdjustment: 20 # Add 20% capacity
      - MetricIntervalLowerBound: 20
        ScalingAdjustment: 30 # Add 30% capacity
```
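**Reading the Step Intervals (Python sketch):**

The interval bounds are offsets from the CloudWatch alarm threshold, not absolute metric values. The sketch below shows which step fires for a given metric value, assuming a hypothetical alarm threshold of 70 (the configuration above does not state one). `STEPS`, `step_adjustment`, and `scaled_capacity` are illustrative names.

```python
import math

# Mirrors the StepAdjustments: (lower_bound, upper_bound, percent_change).
STEPS = [(0, 10, 10), (10, 20, 20), (20, None, 30)]


def step_adjustment(metric_value, threshold, steps=STEPS):
    """Return the percent change for the step whose interval contains the breach."""
    breach = metric_value - threshold
    for lower, upper, percent in steps:
        if breach >= lower and (upper is None or breach < upper):
            return percent
    return 0  # metric is below the alarm threshold


def scaled_capacity(current, metric_value, threshold, steps=STEPS):
    """Apply a PercentChangeInCapacity adjustment, adding at least one task."""
    percent = step_adjustment(metric_value, threshold, steps)
    if percent == 0:
        return current
    return current + max(1, math.floor(current * percent / 100))


# Metric at 85 with a threshold of 70: breach of 15 falls in the 10-20 step
print(step_adjustment(85, 70), scaled_capacity(10, 85, 70))
```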
### Scheduled Scaling
**Use When:**
- Predictable traffic patterns (business hours)
- Batch processing at specific times
- Cost optimization (scale down at night)
**Configuration:**
```yaml
ScheduledAction1:
  ScheduledActionName: scale-up-morning
  Schedule: cron(0 7 ? * MON-FRI *) # 7 AM weekdays (UTC unless Timezone is set)
  ScalableTargetAction:
    MinCapacity: 5
    MaxCapacity: 20
ScheduledAction2:
  ScheduledActionName: scale-down-evening
  Schedule: cron(0 19 * * ? *) # 7 PM daily; one of day-of-month/day-of-week must be ?
  ScalableTargetAction:
    MinCapacity: 1
    MaxCapacity: 5
```
## Service Connect (Service Mesh)
### Built-In Service Mesh (2023+ Feature)
**Use When:**
- Need service-to-service communication
- Want observability without code changes
- Require retry logic and circuit breaking
- Avoid complexity of Istio or App Mesh
**Architecture:**
```
Service A → Envoy Proxy → Service B Endpoint
→ Service B Load Balancing
→ CloudMap Discovery
```
**Configuration:**
```yaml
Cluster:
  ServiceConnectDefaults:
    Namespace: internal.local
Service:
  ServiceConnectConfiguration:
    Enabled: true
    Namespace: internal.local
    Services:
      - PortName: http
        ClientAliases:
          - Port: 8080
            DnsName: backend-api
        IngressPortOverride: 8080
    LogConfiguration:
      LogDriver: awslogs
      Options:
        awslogs-group: /ecs/service-connect
        awslogs-stream-prefix: proxy
```
**Benefits:**
- Zero code changes
- Automatic retries
- Circuit breaking
- Connection pooling
- Metrics and tracing
- TLS encryption between services (optional)
**Observability:**
- Request success/failure rates
- Latency percentiles (p50, p99)
- Active connections
- Retry counts
- All metrics in CloudWatch
## EKS Pod Identities
### Simplified IAM for Pods (2024+ Feature)
**Use When:**
- Running applications on EKS
- Need AWS service access from pods
- Want simpler configuration than IRSA
**Old Way (IRSA - IAM Roles for Service Accounts):**
```yaml
# Complex OIDC provider setup required
# Per-namespace service account annotations
# Trust relationship with OIDC provider
```
**New Way (Pod Identities):**
```yaml
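# Plain ServiceAccount: Pod Identity needs no role-arn annotation
# (the eks.amazonaws.com/role-arn annotation is the IRSA mechanism).
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: default
# The role is bound separately with a pod identity association, e.g.:
#   aws eks create-pod-identity-association --cluster-name <cluster> \
#     --namespace default --service-account my-app \
#     --role-arn arn:aws:iam::123456789012:role/MyAppRole
# The EKS Pod Identity Agent add-on must be installed on the cluster.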
```
**IAM Role Configuration:**
```yaml
MyAppRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            Service: pods.eks.amazonaws.com
          Action:
            - sts:AssumeRole
            - sts:TagSession # Required by EKS Pod Identity
          Condition:
            StringEquals:
              aws:SourceAccount: !Ref AWS::AccountId
            ArnEquals:
              aws:SourceArn: !Sub 'arn:aws:eks:${AWS::Region}:${AWS::AccountId}:cluster/${ClusterName}'
    Policies:
      - PolicyName: S3Access
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Action:
                - s3:GetObject
                - s3:PutObject
              Resource: !Sub '${Bucket.Arn}/*'
```
**Pod Specification:**
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  namespace: default
spec:
  serviceAccountName: my-app
  containers:
    - name: app
      image: my-app:latest
      # AWS SDK picks up the role credentials automatically
```
**Benefits over IRSA:**
- Simpler setup (no OIDC provider)
- Works with any namespace
- Faster credential rotation
- Better audit logging
## Container Security
### Image Scanning
**ECR Image Scanning:**
```yaml
Repository:
  ImageScanningConfiguration:
    ScanOnPush: true
  ImageTagMutability: IMMUTABLE # Prevent tag overwrites
```
**Scan Results Integration:**
- Automatic CVE detection
- Severity classification (Critical, High, Medium, Low)
- Integration with Security Hub
- Findings in ECR console
**Best Practices:**
- Scan all images before deployment
- Block deployments with critical vulnerabilities
- Regularly rebuild base images
- Use minimal base images (alpine, distroless)
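**Deployment Gate Sketch (Python):**

Blocking deployments on critical findings usually comes down to a small check in the CI/CD pipeline. The sketch below assumes the pipeline has already fetched the scan result and extracted the `findingSeverityCounts` map that ECR returns with image scan findings; the `deployment_allowed` helper and its `blocked` parameter are hypothetical.

```python
def deployment_allowed(severity_counts, blocked=("CRITICAL",)):
    """Return False when the scan reports any finding at a blocked severity."""
    return all(severity_counts.get(level, 0) == 0 for level in blocked)


# Example severity counts as they might appear in a scan result
scan = {"HIGH": 3, "MEDIUM": 10}
print(deployment_allowed(scan))                               # only CRITICAL blocks
print(deployment_allowed(scan, blocked=("CRITICAL", "HIGH")))  # stricter policy
```

A stricter policy (blocking `HIGH` as well) is a one-argument change, which makes it easy to tighten the gate per environment.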
### Runtime Security
**Read-Only Root Filesystem:**
```yaml
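ContainerDefinitions:
  - Name: app
    ReadonlyRootFilesystem: true
    MountPoints:
      - SourceVolume: tmp
        ContainerPath: /tmp
        ReadOnly: false
Volumes:
  # No Host.SourcePath: the volume is task-scoped scratch space.
  # Host source paths are not supported on Fargate and would share
  # the instance's /tmp between tasks on EC2.
  - Name: tmp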
```
**Drop Capabilities (Kubernetes):**
```yaml
securityContext:
  allowPrivilegeEscalation: false
  runAsNonRoot: true
  runAsUser: 1000
  capabilities:
    drop:
      - ALL
    add:
      - NET_BIND_SERVICE # Only if needed
```
**Network Segmentation:**
```yaml
SecurityGroup:
  SecurityGroupIngress:
    - IpProtocol: tcp
      FromPort: 80
      ToPort: 80
      SourceSecurityGroupId: !Ref ALBSecurityGroup
    # No SSH access
  SecurityGroupEgress:
    - IpProtocol: tcp
      FromPort: 443
      ToPort: 443
      CidrIp: 0.0.0.0/0 # HTTPS only
    - IpProtocol: tcp
      FromPort: 5432
      ToPort: 5432
      DestinationSecurityGroupId: !Ref DBSecurityGroup
```
## Logging and Monitoring
### Container Insights
**Enable Container Insights:**
```yaml
Cluster:
  ClusterSettings:
    - Name: containerInsights
      Value: enabled
```
**Metrics Collected:**
- CPU utilization (cluster, service, task)
- Memory utilization
- Network throughput
- Disk I/O
- Task count
**Log Insights Queries:**
```sql
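# Find recent errors (choose the time range, e.g. last hour, when running the query)
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Latency for one endpoint in 5-minute buckets
# (assumes structured logs with duration, status, and path fields)
fields @timestamp, duration, status
| filter path = "/api/users"
| stats avg(duration), max(duration), count() by bin(5m)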
```
### X-Ray Tracing
**Enable X-Ray:**
```yaml
ContainerDefinitions:
  - Name: xray-daemon
    Image: amazon/aws-xray-daemon
    PortMappings:
      - ContainerPort: 2000
        Protocol: udp
  - Name: app
    Environment:
      - Name: AWS_XRAY_DAEMON_ADDRESS
        Value: localhost:2000
```
**Application Code (Python):**
```python
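from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.ext.flask.middleware import XRayMiddleware
from flask import Flask, jsonify

app = Flask(__name__)

# The middleware needs a segment name; "my-app" is a placeholder
xray_recorder.configure(service="my-app")
XRayMiddleware(app, xray_recorder)

@app.route('/api/users')
def get_users():
    # Automatically traced
    users = []  # placeholder: load users from your data store
    return jsonify(users)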
```
## Cost Optimization
### EC2 Launch Type Optimization
**Spot Instances (70% Savings):**
```yaml
CapacityProvider:
  Name: spot-capacity
  AutoScalingGroupProvider:
    AutoScalingGroupArn: !Ref SpotASG
    ManagedScaling:
      Status: ENABLED
      TargetCapacity: 100
    ManagedTerminationProtection: ENABLED
Service:
  CapacityProviderStrategy:
    - CapacityProvider: spot-capacity
      Weight: 4
    - CapacityProvider: ondemand-capacity
      Weight: 1
      Base: 2 # Always maintain 2 on-demand
```
**Graviton Processors (20% Cost Savings):**
```yaml
LaunchTemplate:
  InstanceType: t4g.medium # Graviton-based
  # 20% cheaper than t3.medium
  # 40% better price-performance
```
**Reserved Instances:**
- 1-year: 30-40% savings
- 3-year: 50-60% savings
- Use for predictable baseline capacity
### Task-Level Optimization
**Right-Sizing:**
- Monitor CloudWatch metrics weekly
- Reduce over-provisioned resources
- Use Compute Optimizer recommendations
**Reduce Data Transfer:**
- Use VPC endpoints (avoid NAT Gateway costs)
- Place services in same AZ when possible
- Use CloudFront for static assets
**Storage Optimization:**
- Use task ephemeral storage (Fargate includes 20 GiB at no extra charge)
- Avoid EBS volumes unless necessary
- Clean up unused ECR images
## Anti-Patterns
### Don't: Run Databases in Containers
**Problem:** Stateful data, performance overhead, operational complexity.
**Solution:** Use RDS, Aurora, or DynamoDB.
### Don't: Use Latest Tag
**Problem:** Unpredictable deployments, difficult rollbacks.
**Solution:** Use immutable tags (commit SHA, semantic versions).
```yaml
# Bad
Image: my-app:latest
# Good
Image: my-app:v1.2.3
Image: my-app:commit-abc123
```
### Don't: Store Secrets in Environment Variables
**Problem:** Exposed in logs, console, API responses.
**Solution:** Use Secrets Manager or Parameter Store.
### Don't: Run Single Replica
**Problem:** No high availability, downtime during deployments.
**Solution:** Run minimum 2 replicas across multiple AZs.
### Don't: Ignore Resource Limits
**Problem:** Resource starvation, OOM kills, cascading failures.
**Solution:** Set appropriate CPU and memory limits.
```yaml
# Kubernetes
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"
```
### Don't: Use Default VPC
**Problem:** No subnet segmentation, poor security posture.
**Solution:** Create custom VPC with private subnets.
### Don't: Skip Health Checks
**Problem:** Traffic sent to unhealthy tasks, user-facing errors.
**Solution:** Implement comprehensive health checks.
### Don't: Ignore Deployment Circuit Breakers
**Problem:** Bad deployments can take down entire service.
**Solution:** Enable circuit breakers with automatic rollback.
```yaml
DeploymentConfiguration:
  DeploymentCircuitBreaker:
    Enable: true
    Rollback: true
```
```
### references/networking.md
```markdown
# AWS Networking - Deep Dive
## Table of Contents
- [VPC Architecture](#vpc-architecture)
- [Standard 3-Tier Pattern](#standard-3-tier-pattern)
- [Security Groups vs. NACLs](#security-groups-vs-nacls)
- [NAT Gateway](#nat-gateway)
- [Load Balancers](#load-balancers)
- [Application Load Balancer (ALB)](#application-load-balancer-alb)
- [Network Load Balancer (NLB)](#network-load-balancer-nlb)
- [CloudFront (CDN)](#cloudfront-cdn)
- [Route 53 (DNS)](#route-53-dns)
- [Routing Policies](#routing-policies)
- [VPC Peering](#vpc-peering)
- [PrivateLink](#privatelink)
- [Transit Gateway](#transit-gateway)
## VPC Architecture
### Standard 3-Tier Pattern
```
VPC: 10.0.0.0/16 (65,536 IPs)
  Availability Zone A:
    Public Subnet: 10.0.1.0/24 (256 IPs: ALB, NAT Gateway, Bastion)
    Private Subnet: 10.0.11.0/24 (256 IPs: ECS, Lambda, App Servers)
    Database Subnet: 10.0.21.0/24 (256 IPs: RDS, Aurora, ElastiCache)
  Availability Zone B:
    Public Subnet: 10.0.2.0/24
    Private Subnet: 10.0.12.0/24
    Database Subnet: 10.0.22.0/24
  Availability Zone C:
    Public Subnet: 10.0.3.0/24
    Private Subnet: 10.0.13.0/24
    Database Subnet: 10.0.23.0/24
```
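**Deriving the Layout (Python sketch):**

The subnet plan above can be generated and checked with the standard `ipaddress` module, which helps avoid typos when the same layout is repeated across environments. The `tier_subnets` helper and the `layout` dictionary are hypothetical names; the offsets (0, 10, 20) simply reproduce the numbering used in the diagram.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnets_24 = list(vpc.subnets(new_prefix=24))  # 256 possible /24 blocks


def tier_subnets(offset, az_count=3):
    """Return one /24 per AZ, starting at 10.0.<offset + 1>.0/24."""
    return [subnets_24[offset + i] for i in range(1, az_count + 1)]


layout = {
    "public": tier_subnets(0),     # 10.0.1.0/24 .. 10.0.3.0/24
    "private": tier_subnets(10),   # 10.0.11.0/24 .. 10.0.13.0/24
    "database": tier_subnets(20),  # 10.0.21.0/24 .. 10.0.23.0/24
}

# AWS reserves 5 addresses in every subnet, so a /24 has 251 usable IPs.
usable_per_subnet = subnets_24[1].num_addresses - 5

for tier, nets in layout.items():
    print(tier, [str(n) for n in nets])
```

Leaving gaps between tiers (1-3, 11-13, 21-23) keeps room to add AZs or widen a tier later without renumbering.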
### Security Groups vs. NACLs
| Feature | Security Groups | Network ACLs |
|---------|----------------|--------------|
| **Level** | Instance (ENI) | Subnet |
| **State** | Stateful | Stateless |
| **Rules** | Allow only | Allow + Deny |
| **Return Traffic** | Automatic | Must configure |
| **Evaluation** | All rules | Numbered order |
### NAT Gateway
**Purpose:** Enable private subnet instances to access internet for updates, APIs
**Cost (us-east-1):**
- $0.045/hour = $32.85/month
- $0.045/GB processed
- Deploy one per AZ for HA
**Alternative:** NAT Instance (EC2) - cheaper but manual management
---
## Load Balancers
### Application Load Balancer (ALB)
**Features:**
- Layer 7 (HTTP/HTTPS)
- Path-based routing: `/api` → backend, `/web` → frontend
- Host-based routing: `api.example.com`, `web.example.com`
- WebSocket support
- Lambda targets (serverless backends)
- HTTP/2, gRPC support
**Cost:**
- $0.0225/hour = $16.43/month
- $0.008/LCU-hour (Load Balancer Capacity Unit)
- Minimum ~$20/month
### Network Load Balancer (NLB)
**Features:**
- Layer 4 (TCP/UDP)
- Ultra-low latency (<100 microseconds)
- Millions of requests/second
- Static IP addresses (Elastic IP)
- PrivateLink support
**Cost:**
- $0.0225/hour = $16.43/month
- $0.006/NLCU-hour
**Use When:**
- Extreme performance needed
- Static IPs required
- Non-HTTP protocols
---
## CloudFront (CDN)
**Purpose:** Global content delivery network with 450+ edge locations
**Features:**
- Cache static content (images, CSS, JS)
- Dynamic content acceleration
- DDoS protection (AWS Shield)
- Lambda@Edge for edge compute
**Cost:**
- Data transfer: $0.085/GB (first 10TB, decreases)
- Requests: $0.0075 per 10,000
- Free tier: 1TB transfer, 10M requests/month (always free, not limited to 12 months)
**Cache Behaviors:**
- Match path patterns: `/images/*`, `/api/*`
- TTL configuration per pattern
- Origin types: S3, ALB, custom
---
## Route 53 (DNS)
### Routing Policies
| Policy | Use Case |
|--------|----------|
| **Simple** | Single resource |
| **Weighted** | A/B testing, gradual migration (10% → 90%) |
| **Latency** | Route to lowest-latency region |
| **Failover** | Active-passive disaster recovery |
| **Geolocation** | Route based on user location |
| **Geoproximity** | Route based on resource location + bias |
| **Multi-value** | Return multiple IPs with health checks |
**Cost:**
- Hosted zone: $0.50/month
- Queries: $0.40 per million
- Health checks: $0.50/month each
---
## VPC Peering
**Purpose:** Connect two VPCs privately (same region or cross-region)
**Characteristics:**
- Non-transitive (A↔B, B↔C doesn't mean A↔C)
- No overlapping CIDR blocks
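- Data transfer: $0.01/GB in each direction across AZs in the same region (free within the same AZ); cross-region traffic is billed at inter-region rates (roughly $0.02/GB, varies by region pair)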
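
**Checking Peering Constraints (Python sketch):**

Both constraints above are easy to verify before creating a peering connection. This sketch uses the standard `ipaddress` module for the CIDR check and a plain set of pairs to show why peering is non-transitive. The helper names and the VPC labels `A`, `B`, `C` are hypothetical.

```python
import ipaddress


def can_peer(cidr_a, cidr_b):
    """Peering requires that the two VPC CIDR blocks do not overlap."""
    return not ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


def can_reach(src, dst, peerings):
    """Peering is non-transitive: only a direct connection carries traffic."""
    return (src, dst) in peerings or (dst, src) in peerings


peerings = {("A", "B"), ("B", "C")}
print(can_peer("10.0.0.0/16", "10.1.0.0/16"))   # distinct ranges
print(can_peer("10.0.0.0/16", "10.0.128.0/17"))  # second range sits inside the first
print(can_reach("A", "C", peerings))             # no direct A-C peering
```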
---
## PrivateLink
**Purpose:** Privately access AWS services or third-party services
**Use Cases:**
- Access S3, DynamoDB without internet gateway
- SaaS vendor connections
- Shared services across accounts
**Cost:**
- $0.01/hour per AZ = $7.30/month per AZ
- $0.01/GB processed
---
## Transit Gateway
**Purpose:** Hub-and-spoke network architecture for 100s of VPCs
**Features:**
- Connect VPCs, on-premises networks (VPN, Direct Connect)
- Transitive routing
- Route table customization
**Cost:**
- $0.05/hour per attachment = $36.50/month
- $0.02/GB processed
**Use When:**
- >5 VPCs to connect
- Need centralized routing
- Hybrid cloud architecture
```
### references/security.md
```markdown
# AWS Security Best Practices
## Table of Contents
- [IAM (Identity and Access Management)](#iam-identity-and-access-management)
- [Least Privilege Policy Example](#least-privilege-policy-example)
- [IAM Best Practices](#iam-best-practices)
- [IAM Roles for Common Services](#iam-roles-for-common-services)
- [KMS (Key Management Service)](#kms-key-management-service)
- [Key Types](#key-types)
- [Encryption Patterns](#encryption-patterns)
- [KMS API Costs](#kms-api-costs)
- [Secrets Manager](#secrets-manager)
- [Cost Comparison](#cost-comparison)
- [When to Use Each](#when-to-use-each)
- [Automatic Rotation Example](#automatic-rotation-example)
- [WAF (Web Application Firewall)](#waf-web-application-firewall)
- [Managed Rule Groups](#managed-rule-groups)
- [Custom Rules](#custom-rules)
- [Cost Model](#cost-model)
- [GuardDuty (Threat Detection)](#guardduty-threat-detection)
- [Security Hub](#security-hub)
- [Network Security](#network-security)
- [Security Group Strategy](#security-group-strategy)
- [VPC Flow Logs](#vpc-flow-logs)
- [Encryption Checklist](#encryption-checklist)
- [Data at Rest](#data-at-rest)
- [Data in Transit](#data-in-transit)
- [Key Management](#key-management)
- [Compliance Frameworks](#compliance-frameworks)
- [AWS Artifact](#aws-artifact)
- [AWS Config](#aws-config)
- [Incident Response](#incident-response)
- [CloudTrail](#cloudtrail)
- [Security Automation](#security-automation)
## IAM (Identity and Access Management)
### Least Privilege Policy Example
```json
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": "arn:aws:s3:::my-bucket/uploads/*",
"Condition": {
"IpAddress": {
"aws:SourceIp": "203.0.113.0/24"
}
}
}]
}
```
### IAM Best Practices
1. **Use Roles, Not Users** for applications
2. **Enable MFA** for privileged users
3. **Use IAM Access Analyzer** to validate policies
4. **Implement Permission Boundaries** to cap the maximum permissions a user or role can be granted
5. **Rotate Credentials** regularly (90 days)
6. **Use AWS Organizations SCPs** for guardrails
### IAM Roles for Common Services
**Lambda Execution Role:**
```json
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {"Service": "lambda.amazonaws.com"},
"Action": "sts:AssumeRole"
}]
}
```
**ECS Task Role:**
```json
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {"Service": "ecs-tasks.amazonaws.com"},
"Action": "sts:AssumeRole"
}]
}
```
---
## KMS (Key Management Service)
### Key Types
| Type | Cost | Rotation | Use Case |
|------|------|----------|----------|
| **AWS Managed** | Free | Automatic (every year) | S3, EBS, RDS default |
| **Customer Managed** | $1/month | Manual or automatic | Custom policies |
| **Custom Key Store** | $1/month + CloudHSM | Manual | FIPS 140-2 Level 3 |
### Encryption Patterns
**Server-Side Encryption (SSE):**
- S3: SSE-S3 (free), SSE-KMS ($1/month key + API costs)
- EBS: Enable encryption by default at the account/region level (recommended)
- RDS: Enable at creation
**Client-Side Encryption:**
- Encrypt before sending to AWS
- Application manages keys
- Use KMS Encrypt API or encryption SDK
### KMS API Costs
- $0.03 per 10,000 requests
- Free tier: 20,000 requests/month; AWS managed keys have no monthly key fee
- Monitor usage for high-volume applications
---
## Secrets Manager
### Cost Comparison
| Service | Cost/Secret | Rotation | Secret Size |
|---------|-------------|----------|-------------|
| **Secrets Manager** | $0.40/month | Automatic (Lambda) | 64KB |
| **Parameter Store (Standard)** | Free | Manual | 4KB |
| **Parameter Store (Advanced)** | $0.05/month | Manual | 8KB |
### When to Use Each
**Secrets Manager:**
- Database credentials with rotation
- API keys requiring rotation
- Multi-region replication needed
**Parameter Store:**
- Application configuration
- Non-rotating secrets
- Cost-sensitive scenarios
### Automatic Rotation Example
**RDS MySQL Credentials:**
1. Secrets Manager invokes Lambda every 30 days
2. Lambda creates new user with same permissions
3. Tests new credentials
4. Updates secret
5. Deletes old user
---
## WAF (Web Application Firewall)
### Managed Rule Groups
| Rule Group | Purpose | Cost |
|------------|---------|------|
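| **Core Rule Set** | OWASP Top 10 | No subscription fee (standard rule and request charges apply) |
| **SQL Injection** | Database attacks | No subscription fee (standard rule and request charges apply) |
| **Known Bad Inputs** | CVE signatures | No subscription fee (standard rule and request charges apply) |
| **IP Reputation** | Block malicious IPs | No subscription fee (standard rule and request charges apply) |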
### Custom Rules
**Rate Limiting:**
```json
{
"Name": "RateLimitRule",
"Priority": 1,
"Statement": {
"RateBasedStatement": {
"Limit": 2000,
"AggregateKeyType": "IP"
}
},
"Action": {"Block": {}}
}
```
**Geo-Blocking:**
```json
{
"Name": "BlockCountries",
"Priority": 2,
"Statement": {
"GeoMatchStatement": {
"CountryCodes": ["CN", "RU"]
}
},
"Action": {"Block": {}}
}
```
### Cost Model
- Web ACL: $5/month
- Rules: $1/month per rule
- Requests: $0.60 per million
- Example: 1 ACL + 5 rules + 10M requests = $5 + $5 + $6 = $16/month
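**Cost Estimate Sketch (Python):**

The example above is a straight sum of three line items, which makes it easy to re-run for other traffic levels. The function below uses the list prices quoted in this section as defaults; `waf_monthly_cost` and its parameter names are hypothetical, and real bills may include extra charges (for example, paid managed rule groups).

```python
def waf_monthly_cost(web_acls, rules, requests_millions,
                     acl_price=5.00, rule_price=1.00, request_price=0.60):
    """Monthly WAF cost in USD: web ACLs + rules + requests (per million)."""
    return (web_acls * acl_price
            + rules * rule_price
            + requests_millions * request_price)


# 1 web ACL, 5 rules, 10M requests
print(f"${waf_monthly_cost(1, 5, 10):.2f}/month")
```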
---
## GuardDuty (Threat Detection)
**Purpose:** Intelligent threat detection using ML
**Data Sources:**
- VPC Flow Logs
- CloudTrail event logs
- DNS logs
**Cost:**
- CloudTrail: $4.50 per million events
- VPC Flow Logs: $0.50 per million events analyzed
- DNS logs: $0.40 per million queries
**Use Cases:**
- Detect compromised instances
- Identify reconnaissance attempts
- Find unauthorized access
---
## Security Hub
**Purpose:** Centralized security dashboard, compliance checks
**Features:**
- CIS AWS Foundations Benchmark
- PCI DSS compliance
- Integration with GuardDuty, Inspector, Macie
**Cost:**
- Security checks: $0.001 per check
- Findings ingestion: $0.00003 per finding
---
## Network Security
### Security Group Strategy
**Principle:** Default deny, explicit allow
**Example: Web Tier**
```
Inbound:
- Port 443 (HTTPS) from 0.0.0.0/0
- Port 80 (HTTP) from 0.0.0.0/0
Outbound:
- All traffic (stateful return traffic automatic)
```
**Example: App Tier**
```
Inbound:
- Port 8080 from web-tier-sg
- Port 3000 from web-tier-sg
Outbound:
- Port 5432 to database-tier-sg (PostgreSQL)
- Port 443 to 0.0.0.0/0 (API calls)
```
### VPC Flow Logs
**Purpose:** Network traffic analysis, troubleshooting, security monitoring
**Cost:**
- $0.50 per GB ingested to CloudWatch Logs
- Can send to S3 for cheaper storage ($0.023/GB)
**Analysis Tools:**
- CloudWatch Logs Insights for queries
- Athena for S3-stored logs
- Third-party tools (Splunk, Datadog)
---
## Encryption Checklist
### Data at Rest
- [ ] S3: SSE-S3 or SSE-KMS enabled
- [ ] EBS: Encryption by default enabled
- [ ] RDS: Encryption enabled at creation
- [ ] DynamoDB: Encryption enabled (free)
- [ ] EFS: Encryption enabled
- [ ] ElastiCache: At-rest encryption (Redis only)
### Data in Transit
- [ ] ALB/NLB: HTTPS listeners with TLS 1.2+
- [ ] CloudFront: HTTPS required
- [ ] RDS: Force SSL connections
- [ ] DynamoDB: HTTPS API calls
- [ ] S3: Bucket policies require HTTPS
### Key Management
- [ ] Use AWS KMS for customer-managed keys
- [ ] Enable automatic key rotation (365 days)
- [ ] Use key policies for access control
- [ ] Monitor KMS API usage (CloudTrail)
---
## Compliance Frameworks
### AWS Artifact
**Purpose:** Access compliance reports (SOC, PCI, ISO, HIPAA)
**Available Reports:**
- SOC 1, 2, 3
- PCI DSS
- ISO 27001, 27017, 27018
- HIPAA BAA
### AWS Config
**Purpose:** Resource inventory, configuration compliance
**Rules Examples:**
- encrypted-volumes: All EBS volumes encrypted
- s3-bucket-public-read-prohibited: No public S3 buckets
- rds-multi-az-support: RDS instances Multi-AZ
- iam-password-policy: Strong password requirements
**Cost:**
- $0.003 per configuration item
- $0.001 per rule evaluation
---
## Incident Response
### CloudTrail
**Purpose:** API audit logs for all AWS actions
**Features:**
- 90-day event history (free)
- Longer retention via S3 trail
- Multi-region trails
- Log file integrity validation
**Cost:**
- First copy of management events: Free
- S3 storage: Standard S3 rates
- CloudWatch Logs integration: $0.50/GB
### Security Automation
**Example: Auto-Remediation with Lambda**
1. Config Rule detects non-compliant resource
2. EventBridge triggers Lambda
3. Lambda remediates (e.g., enable encryption)
4. SNS notification sent
**Common Automations:**
- Revoke overly permissive security groups
- Enable encryption on new resources
- Delete public S3 buckets
- Rotate IAM access keys >90 days old
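**Remediation Handler Sketch (Python):**

The four-step flow above can be sketched as a Lambda handler that reacts to an AWS Config compliance-change event delivered by EventBridge. The `detail` field names used here (`configRuleName`, `resourceId`, `newEvaluationResult.complianceType`) are assumed from the Config event format and should be checked against a real event. The remediation function is a placeholder that returns a message instead of calling AWS APIs, and `notify` stands in for publishing to SNS.

```python
def remediate_unencrypted_volume(resource_id):
    """Hypothetical remediation hook; a real one would call AWS APIs here."""
    return f"remediation requested for {resource_id}"


# Map Config rule names to remediation functions (illustrative).
REMEDIATIONS = {"encrypted-volumes": remediate_unencrypted_volume}


def lambda_handler(event, context, remediations=REMEDIATIONS, notify=print):
    """Remediate a non-compliant resource and send a notification."""
    detail = event.get("detail", {})
    compliance = detail.get("newEvaluationResult", {}).get("complianceType")
    if compliance != "NON_COMPLIANT":
        return None  # nothing to do for compliant resources
    action = remediations.get(detail.get("configRuleName"))
    if action is None:
        return None  # no automation registered for this rule
    result = action(detail["resourceId"])
    notify(result)  # stand-in for an SNS publish
    return result
```

Keeping the rule-to-action mapping in one dictionary makes it clear which findings are auto-remediated and which still need a human.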
```
### references/well-architected.md
```markdown
# AWS Well-Architected Framework - Implementation Guide
## Table of Contents
- [Six Pillars](#six-pillars)
- [1. Operational Excellence](#1-operational-excellence)
- [2. Security](#2-security)
- [3. Reliability](#3-reliability)
- [4. Performance Efficiency](#4-performance-efficiency)
- [5. Cost Optimization](#5-cost-optimization)
- [6. Sustainability](#6-sustainability)
- [Well-Architected Review Process](#well-architected-review-process)
- [1. Define Workload](#1-define-workload)
- [2. Assess Against Pillars](#2-assess-against-pillars)
- [3. Prioritize Improvements](#3-prioritize-improvements)
- [4. Implement Changes](#4-implement-changes)
- [5. Re-Review Regularly](#5-re-review-regularly)
- [Architecture Patterns by Pillar](#architecture-patterns-by-pillar)
- [Operational Excellence Pattern](#operational-excellence-pattern)
- [Security Pattern](#security-pattern)
- [Reliability Pattern](#reliability-pattern)
- [Performance Pattern](#performance-pattern)
- [Cost Optimization Pattern](#cost-optimization-pattern)
- [Checklist by Pillar](#checklist-by-pillar)
- [Operational Excellence](#operational-excellence)
- [Security](#security)
- [Reliability](#reliability)
- [Performance Efficiency](#performance-efficiency)
- [Cost Optimization](#cost-optimization)
- [Sustainability](#sustainability)
## Six Pillars
### 1. Operational Excellence
**Design Principles:**
- Perform operations as code
- Make frequent, small, reversible changes
- Refine operations procedures frequently
- Anticipate failure
- Learn from all operational events
**Implementation:**
**Infrastructure as Code:**
- Use CDK, Terraform, or CloudFormation
- Version control all infrastructure
- Peer review changes via pull requests
- Automate deployments (CI/CD)
**Deployment Strategies:**
- Blue-green deployments (instant rollback)
- Canary releases (gradual traffic shift)
- Feature flags (runtime toggles)
**Observability:**
- CloudWatch Logs (structured JSON logging)
- CloudWatch Metrics (custom metrics)
- X-Ray (distributed tracing)
- CloudWatch Alarms (proactive alerts)
**Runbooks:**
- Document common operations
- Automate with Systems Manager Automation
- Test via GameDays
- Version control runbooks
---
### 2. Security
**Design Principles:**
- Implement strong identity foundation
- Enable traceability
- Apply security at all layers
- Automate security best practices
- Protect data in transit and at rest
- Keep people away from data
- Prepare for security events
**Implementation:**
**Identity:**
- Use IAM roles for applications
- Implement least privilege
- Enable MFA for privileged users
- Use AWS Organizations for multi-account governance
**Detection:**
- Enable CloudTrail (all regions)
- Enable VPC Flow Logs
- Enable GuardDuty (threat detection)
- Enable Security Hub (compliance)
**Protection:**
- Encrypt data at rest (KMS)
- Encrypt data in transit (TLS 1.2+)
- Use WAF for application layer protection
- Implement security groups and NACLs
**Incident Response:**
- Automate responses with EventBridge + Lambda
- Pre-deploy forensic tools
- Practice incident response via GameDays
---
### 3. Reliability
**Design Principles:**
- Automatically recover from failure
- Test recovery procedures
- Scale horizontally for resilience
- Stop guessing capacity
- Manage change via automation
**Implementation:**
**Multi-AZ Architecture:**
- RDS Multi-AZ (automatic failover)
- Aurora replicas across AZs
- ECS/EKS tasks distributed across AZs
- ALB/NLB across multiple AZs
**Auto-Scaling:**
- EC2 Auto Scaling Groups
- ECS Service Auto Scaling
- DynamoDB Auto Scaling
- Application Auto Scaling
**Backup and Recovery:**
- RDS automated backups (7-35 days)
- S3 versioning and replication
- EBS snapshots (automated lifecycle)
- Cross-region backups
**Chaos Engineering:**
- AWS Fault Injection Simulator
- Test failure scenarios regularly
- Validate recovery procedures
**Change Management:**
- Infrastructure as code (no manual changes)
- Blue-green deployments
- Automated testing (unit, integration, e2e)
---
### 4. Performance Efficiency
**Design Principles:**
- Democratize advanced technologies
- Go global in minutes
- Use serverless architectures
- Experiment more often
- Consider mechanical sympathy
**Implementation:**
**Compute Optimization:**
- Use Compute Optimizer for rightsizing
- Lambda for event-driven workloads
- Fargate for containers (no EC2 management)
- Graviton processors (up to 25% better performance, up to 60% less energy)
**Storage Optimization:**
- S3 Intelligent-Tiering (auto-optimize)
- EBS gp3 (20% cheaper than gp2)
- EFS Intelligent-Tiering (save up to 92% on IA files)
**Database Optimization:**
- Use RDS Performance Insights
- Implement read replicas (offload reads)
- Use DynamoDB DAX (microsecond latency)
- ElastiCache for caching
**Caching Strategy:**
```
User → CloudFront (static content)
→ API Gateway (API response caching)
→ Lambda
→ DAX (DynamoDB cache)
→ DynamoDB
```
**Global Delivery:**
- CloudFront for static content
- Global Accelerator for TCP/UDP
- Route 53 latency-based routing
- Aurora Global Database (<1s replication)
---
### 5. Cost Optimization
**Design Principles:**
- Implement cloud financial management
- Adopt a consumption model
- Measure overall efficiency
- Stop spending on undifferentiated heavy lifting
- Analyze and attribute expenditure
**Implementation:**
**Right-Sizing:**
- Use Compute Optimizer recommendations
- Monitor CloudWatch metrics (CPU, memory)
- Start small, scale based on data
- Automate scaling (don't over-provision)
**Pricing Models:**
| Model | Commitment | Savings | Best For |
|-------|------------|---------|----------|
| On-Demand | None | 0% | Variable workloads |
| Savings Plans | 1-3 years | 30-40% | Flexible compute commitment |
| Reserved Instances | 1-3 years | 30-60% | Predictable, specific instances |
| Spot Instances | None | 60-90% | Fault-tolerant, flexible workloads |
**Storage Optimization:**
- S3 Intelligent-Tiering (auto-optimize to cheapest tier)
- S3 Lifecycle policies (transition to Glacier)
- EBS gp3 (20% cheaper than gp2)
- Delete unused EBS snapshots
- Archive old snapshots (75% cheaper)
**Monitoring:**
- AWS Cost Explorer (visualize spending)
- AWS Budgets (set alerts)
- Cost Allocation Tags (attribute costs)
- Trusted Advisor (cost optimization checks)
**Example Cost Optimization:**
```
Before:
- 10 m5.large on-demand 24/7 = $700/month
- S3 Standard for all data (100TB) = $2,300/month
- Total: $3,000/month
After:
- 5 m5.large Reserved (baseline) = $175/month (50% savings vs. $350 on-demand)
- 5 m5.large Spot (variable) = $35/month (90% savings vs. $350 on-demand)
- S3 Intelligent-Tiering (100TB) = $845/month (63% savings)
- Total: $1,055/month (65% savings)
```
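**Checking Savings Percentages (Python sketch):**

Savings figures like the ones above are easy to get wrong when mixing discounts and totals, so it helps to recompute them from the before/after amounts. The helper below is a hypothetical utility, shown here on the S3 line item.

```python
def percent_savings(before, after):
    """Savings as a rounded percentage of the original monthly cost."""
    if before <= 0:
        raise ValueError("before must be positive")
    return round((before - after) / before * 100)


# S3 Standard ($2,300) vs. Intelligent-Tiering ($845) for the same 100TB
print(f"S3 savings: {percent_savings(2300, 845)}%")
```

Applying the same check to each compute line item against its on-demand price is a quick way to confirm that the quoted discount and the dollar amount agree.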
---
### 6. Sustainability
**Design Principles:**
- Understand your impact
- Establish sustainability goals
- Maximize utilization
- Anticipate and adopt more efficient hardware
- Use managed services
- Reduce downstream impact
**Implementation:**
**Energy-Efficient Compute:**
- Use Graviton3 instances (up to 60% less energy)
- Lambda (pay per request, no idle)
- Fargate (no EC2 overhead)
**Region Selection:**
- Choose regions with renewable energy
- AWS publishes carbon footprint reports
- Example: US West (Oregon) uses 100% renewable
**Storage Efficiency:**
- Delete unused data
- Compress data
- Use appropriate storage tiers
- S3 Intelligent-Tiering (auto-optimize)
**Software Optimization:**
- Optimize code for performance (less CPU = less energy)
- Async processing (batch operations)
- Minimize data transfer (use caching, edge locations)
**Measure Impact:**
- Customer Carbon Footprint Tool (in AWS Billing Console)
- Track carbon emissions per service
- Set reduction goals
---
## Well-Architected Review Process
### 1. Define Workload
- Application name and purpose
- Architecture diagram
- Traffic patterns
- Compliance requirements
### 2. Assess Against Pillars
Use AWS Well-Architected Tool (free):
- Answer questions per pillar
- Identify high and medium risk issues
- Generate improvement plan
### 3. Prioritize Improvements
**Risk Levels:**
- High Risk (HRI): Address immediately
- Medium Risk (MRI): Plan to address
- None: No issues identified
**Example Findings:**
| Pillar | Issue | Risk | Recommendation |
|--------|-------|------|----------------|
| Reliability | Single AZ deployment | HRI | Deploy Multi-AZ |
| Security | No CloudTrail | HRI | Enable CloudTrail |
| Cost | On-demand only | MRI | Purchase Reserved Instances |
| Performance | No caching | MRI | Add CloudFront, ElastiCache |
### 4. Implement Changes
- Create tickets for each improvement
- Use infrastructure as code
- Test changes in non-production
- Measure impact
### 5. Re-Review Regularly
- Quarterly reviews for production workloads
- After major architecture changes
- Before significant events (sales, launches)
---
## Architecture Patterns by Pillar
### Operational Excellence Pattern
```
GitHub Repository (IaC)
→ GitHub Actions (CI/CD)
→ CDK Deploy
→ CloudFormation Stack
→ Infrastructure
→ CloudWatch Logs/Metrics/Alarms
→ SNS Notifications
→ On-call rotation
```
### Security Pattern
```
User Request
→ WAF (block threats)
→ CloudFront (DDoS protection)
→ ALB (TLS termination)
→ ECS Tasks (app in private subnet)
→ RDS (encrypted, private subnet)
→ Secrets Manager (credentials)
→ CloudTrail (audit logs)
→ GuardDuty (threat detection)
```
### Reliability Pattern
```
Multi-Region Active-Active:
Region A (Primary):
Route 53 (latency-based)
→ CloudFront
→ ALB (3 AZs)
→ ECS Fargate (auto-scaling)
→ Aurora Global (primary)
Region B (Secondary):
Route 53 (latency-based)
→ CloudFront
→ ALB (3 AZs)
→ ECS Fargate (auto-scaling)
→ Aurora Global (read-only)
```
### Performance Pattern
```
User Request
→ Route 53 (latency-based routing)
→ CloudFront (edge caching)
→ API Gateway (API caching)
→ Lambda (Provisioned Concurrency)
→ DAX (DynamoDB cache)
→ DynamoDB Global Tables
```
### Cost Optimization Pattern
```
Compute:
- Baseline: Reserved Instances (predictable)
- Variable: Spot Instances (fault-tolerant tasks)
- Serverless: Lambda (event-driven)
Storage:
- Hot: S3 Standard (frequent access)
- Warm: S3 Standard-IA (infrequent)
- Cold: S3 Glacier (archive)
- Auto: S3 Intelligent-Tiering (unknown or changing access patterns)
Database:
- Production: Aurora (performance + HA)
- Dev/Test: Aurora Serverless v2 (pay per use)
- Cache: ElastiCache (reduce DB load)
```
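The hot/warm/cold storage tiers above map naturally onto an S3 lifecycle rule. The sketch below only builds the rule dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` accepts; the helper name `build_lifecycle_rule`, the rule ID, and the 30/90-day thresholds are assumptions chosen for illustration, not values from this skill.

```python
def build_lifecycle_rule(prefix="", ia_after_days=30, glacier_after_days=90,
                         expire_after_days=None):
    """Build one S3 lifecycle rule: Standard -> Standard-IA -> Glacier."""
    # S3 requires at least 30 days in Standard before a transition to Standard-IA.
    if ia_after_days < 30:
        raise ValueError("ia_after_days must be at least 30")
    if glacier_after_days <= ia_after_days:
        raise ValueError("glacier_after_days must be greater than ia_after_days")

    rule = {
        "ID": "tiering-hot-warm-cold",  # hypothetical rule name
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }
    if expire_after_days is not None:
        rule["Expiration"] = {"Days": expire_after_days}
    return rule


rule = build_lifecycle_rule(prefix="logs/", expire_after_days=365)
print(rule)
```

To apply it, the rule would be passed as `LifecycleConfiguration={"Rules": [rule]}` to `put_bucket_lifecycle_configuration` on a boto3 S3 client, with the bucket name supplied by the caller.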
---
## Checklist by Pillar
### Operational Excellence
- [ ] Infrastructure as code (CDK/Terraform)
- [ ] CI/CD pipeline automated
- [ ] Structured logging (JSON)
- [ ] Custom CloudWatch metrics
- [ ] Alarms configured with SNS
- [ ] Runbooks documented
- [ ] Disaster recovery tested
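As one way to satisfy the structured logging item, the sketch below emits one JSON object per log line using only the Python standard library. The `JsonFormatter` class, the `get_logger` helper, and the convention of passing fields through `extra={"context": {...}}` are assumptions made for this example, not an AWS-provided API.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge fields passed as extra={"context": {...}} (assumed convention).
        context = getattr(record, "context", None)
        if context:
            payload.update(context)
        return json.dumps(payload, default=str)


def get_logger(name, stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


logger = get_logger("orders")
logger.info("order created", extra={"context": {"order_id": "o-123", "latency_ms": 42}})
```

JSON lines like these can be filtered by field in CloudWatch Logs Insights, which is the main reason to prefer them over free-form text.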
### Security
- [ ] IAM roles (no hardcoded credentials)
- [ ] MFA enabled for privileged users
- [ ] CloudTrail enabled (all regions)
- [ ] VPC Flow Logs enabled
- [ ] GuardDuty enabled
- [ ] Encryption at rest (all services)
- [ ] TLS 1.2+ in transit
- [ ] Secrets Manager for credentials
- [ ] Security groups follow least privilege
### Reliability
- [ ] Multi-AZ deployments
- [ ] Auto-scaling configured
- [ ] Automated backups enabled
- [ ] Cross-region backups (critical data)
- [ ] Health checks on load balancers
- [ ] RDS Multi-AZ or Aurora
- [ ] Route 53 health checks
- [ ] Chaos engineering tested
### Performance Efficiency
- [ ] Right-sized instances (Compute Optimizer)
- [ ] Caching implemented (CloudFront, ElastiCache)
- [ ] CDN for static content
- [ ] Read replicas for databases
- [ ] Asynchronous processing where applicable
- [ ] Monitoring and alerting active
### Cost Optimization
- [ ] Reserved Instances or Savings Plans
- [ ] Spot Instances for fault-tolerant workloads
- [ ] S3 lifecycle policies configured
- [ ] Unused resources deleted (EBS, snapshots)
- [ ] Cost allocation tags applied
- [ ] AWS Budgets configured
- [ ] Rightsizing recommendations reviewed monthly
### Sustainability
- [ ] Graviton instances where supported
- [ ] Renewable energy regions preferred
- [ ] S3 Intelligent-Tiering enabled
- [ ] Lambda for event-driven workloads (no idle capacity)
- [ ] Auto-scaling to match demand
- [ ] Carbon footprint monitored
```
### examples/cloudformation/lambda-api.yaml
```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Production-ready Lambda + API Gateway + DynamoDB REST API'

Parameters:
  Stage:
    Type: String
    Default: prod
    AllowedValues:
      - dev
      - staging
      - prod
    Description: Deployment stage
  DomainName:
    Type: String
    Default: ''
    Description: Custom domain name for API (optional, e.g., api.example.com)
  CertificateArn:
    Type: String
    Default: ''
    Description: ACM certificate ARN for custom domain (required if DomainName is specified)

Conditions:
  HasCustomDomain: !Not [!Equals [!Ref DomainName, '']]

Resources:
  # DynamoDB Table
  ItemsTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: !Sub '${AWS::StackName}-items'
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: itemId
          AttributeType: S
        - AttributeName: userId
          AttributeType: S
        - AttributeName: createdAt
          AttributeType: N
      KeySchema:
        - AttributeName: itemId
          KeyType: HASH
      GlobalSecondaryIndexes:
        - IndexName: UserIdIndex
          KeySchema:
            - AttributeName: userId
              KeyType: HASH
            - AttributeName: createdAt
              KeyType: RANGE
          Projection:
            ProjectionType: ALL
      StreamSpecification:
        StreamViewType: NEW_AND_OLD_IMAGES
      PointInTimeRecoverySpecification:
        PointInTimeRecoveryEnabled: true
      SSESpecification:
        SSEEnabled: true
      Tags:
        - Key: Environment
          Value: !Ref Stage

  # Lambda Execution Role
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: DynamoDBAccess
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - dynamodb:GetItem
                  - dynamodb:PutItem
                  - dynamodb:UpdateItem
                  - dynamodb:DeleteItem
                  - dynamodb:Query
                  - dynamodb:Scan
                Resource:
                  - !GetAtt ItemsTable.Arn
                  - !Sub '${ItemsTable.Arn}/index/*'

  # Lambda Function
  ApiFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: !Sub '${AWS::StackName}-api'
      Runtime: python3.12
      Handler: index.lambda_handler
      Role: !GetAtt LambdaExecutionRole.Arn
      Timeout: 30
      MemorySize: 1024
      Environment:
        Variables:
          TABLE_NAME: !Ref ItemsTable
          STAGE: !Ref Stage
      Code:
        ZipFile: |
          import json
          import os
          import time
          import uuid
          from decimal import Decimal

          import boto3

          dynamodb = boto3.resource('dynamodb')
          table = dynamodb.Table(os.environ['TABLE_NAME'])

          class DecimalEncoder(json.JSONEncoder):
              """Serialize DynamoDB Decimal values as JSON numbers."""
              def default(self, obj):
                  if isinstance(obj, Decimal):
                      # Keep whole numbers (e.g. timestamps) as integers
                      return int(obj) if obj % 1 == 0 else float(obj)
                  return super().default(obj)

          def cors_headers():
              return {
                  'Content-Type': 'application/json',
                  'Access-Control-Allow-Origin': '*',
                  'Access-Control-Allow-Methods': 'GET,POST,PUT,DELETE,OPTIONS',
                  'Access-Control-Allow-Headers': 'Content-Type,Authorization'
              }

          def response(status_code, body):
              return {
                  'statusCode': status_code,
                  'headers': cors_headers(),
                  'body': json.dumps(body, cls=DecimalEncoder)
              }

          def lambda_handler(event, context):
              try:
                  http_method = event.get('requestContext', {}).get('http', {}).get('method')
                  # Match on routeKey (e.g. 'GET /items/{id}') rather than rawPath:
                  # on a named stage rawPath can carry the stage prefix ('/prod/items').
                  route_key = event.get('routeKey', '')
                  path_params = event.get('pathParameters') or {}
                  query_params = event.get('queryStringParameters') or {}

                  # CORS preflight
                  if http_method == 'OPTIONS':
                      return response(200, {'message': 'OK'})

                  # Parse body for POST/PUT
                  body = {}
                  if event.get('body'):
                      try:
                          body = json.loads(event['body'])
                      except json.JSONDecodeError:
                          return response(400, {'error': 'Invalid JSON body'})

                  # Route requests
                  if route_key == 'GET /items':
                      return get_items(query_params)
                  elif route_key == 'POST /items':
                      return create_item(body)
                  elif route_key == 'GET /items/{id}':
                      return get_item(path_params.get('id'))
                  elif route_key == 'PUT /items/{id}':
                      return update_item(path_params.get('id'), body)
                  elif route_key == 'DELETE /items/{id}':
                      return delete_item(path_params.get('id'))
                  elif route_key == 'GET /health':
                      return response(200, {'status': 'healthy'})
                  else:
                      return response(404, {'error': 'Not found'})
              except Exception as e:
                  print(f"Error: {str(e)}")
                  return response(500, {'error': 'Internal server error'})

          def get_items(query_params):
              """List up to 100 items, optionally filtered by userId"""
              try:
                  user_id = query_params.get('userId')
                  if user_id:
                      # Query by GSI, newest first
                      result = table.query(
                          IndexName='UserIdIndex',
                          KeyConditionExpression='userId = :userId',
                          ExpressionAttributeValues={':userId': user_id},
                          ScanIndexForward=False,
                          Limit=100
                      )
                  else:
                      # Scan (first page only; no pagination)
                      result = table.scan(Limit=100)
                  return response(200, {
                      'items': result.get('Items', []),
                      'count': len(result.get('Items', []))
                  })
              except Exception as e:
                  print(f"Error getting items: {str(e)}")
                  return response(500, {'error': 'Failed to get items'})

          def get_item(item_id):
              """Get single item by ID"""
              try:
                  if not item_id:
                      return response(400, {'error': 'Missing itemId'})
                  result = table.get_item(Key={'itemId': item_id})
                  if 'Item' not in result:
                      return response(404, {'error': 'Item not found'})
                  return response(200, result['Item'])
              except Exception as e:
                  print(f"Error getting item: {str(e)}")
                  return response(500, {'error': 'Failed to get item'})

          def create_item(body):
              """Create new item"""
              try:
                  if not body.get('userId') or not body.get('name'):
                      return response(400, {'error': 'Missing required fields: userId, name'})
                  item_id = str(uuid.uuid4())
                  # Epoch seconds (UTC); datetime.utcnow() is deprecated in Python 3.12
                  timestamp = int(time.time())
                  item = {
                      'itemId': item_id,
                      'userId': body['userId'],
                      'name': body['name'],
                      'description': body.get('description', ''),
                      'createdAt': timestamp,
                      'updatedAt': timestamp
                  }
                  table.put_item(Item=item)
                  return response(201, item)
              except Exception as e:
                  print(f"Error creating item: {str(e)}")
                  return response(500, {'error': 'Failed to create item'})

          def update_item(item_id, body):
              """Update existing item"""
              try:
                  if not item_id:
                      return response(400, {'error': 'Missing itemId'})

                  # Check if item exists
                  existing = table.get_item(Key={'itemId': item_id})
                  if 'Item' not in existing:
                      return response(404, {'error': 'Item not found'})

                  timestamp = int(time.time())

                  # Build update expression
                  update_expr = 'SET updatedAt = :updatedAt'
                  expr_values = {':updatedAt': timestamp}
                  expr_names = {}
                  if 'name' in body:
                      # 'name' is a DynamoDB reserved word, so use a placeholder
                      update_expr += ', #name = :name'
                      expr_values[':name'] = body['name']
                      expr_names['#name'] = 'name'
                  if 'description' in body:
                      update_expr += ', description = :description'
                      expr_values[':description'] = body['description']

                  update_args = {
                      'Key': {'itemId': item_id},
                      'UpdateExpression': update_expr,
                      'ExpressionAttributeValues': expr_values,
                      'ReturnValues': 'ALL_NEW'
                  }
                  # boto3 rejects ExpressionAttributeNames=None, so pass it only when used
                  if expr_names:
                      update_args['ExpressionAttributeNames'] = expr_names

                  result = table.update_item(**update_args)
                  return response(200, result['Attributes'])
              except Exception as e:
                  print(f"Error updating item: {str(e)}")
                  return response(500, {'error': 'Failed to update item'})

          def delete_item(item_id):
              """Delete item"""
              try:
                  if not item_id:
                      return response(400, {'error': 'Missing itemId'})

                  # Check if item exists
                  existing = table.get_item(Key={'itemId': item_id})
                  if 'Item' not in existing:
                      return response(404, {'error': 'Item not found'})

                  table.delete_item(Key={'itemId': item_id})
                  return response(200, {'message': 'Item deleted successfully'})
              except Exception as e:
                  print(f"Error deleting item: {str(e)}")
                  return response(500, {'error': 'Failed to delete item'})
      Tags:
        - Key: Environment
          Value: !Ref Stage

  # Lambda Log Group
  ApiLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub '/aws/lambda/${ApiFunction}'
      RetentionInDays: 7

  # HTTP API Gateway
  HttpApi:
    Type: AWS::ApiGatewayV2::Api
    Properties:
      Name: !Sub '${AWS::StackName}-api'
      ProtocolType: HTTP
      CorsConfiguration:
        AllowOrigins:
          - '*'
        AllowMethods:
          - GET
          - POST
          - PUT
          - DELETE
          - OPTIONS
        AllowHeaders:
          - Content-Type
          - Authorization
        MaxAge: 300

  # API Stage
  ApiStage:
    Type: AWS::ApiGatewayV2::Stage
    Properties:
      ApiId: !Ref HttpApi
      StageName: !Ref Stage
      AutoDeploy: true
      AccessLogSettings:
        DestinationArn: !GetAtt ApiAccessLogGroup.Arn
        Format: '$context.requestId $context.error.message $context.error.messageString'
      DefaultRouteSettings:
        ThrottlingBurstLimit: 100
        ThrottlingRateLimit: 50

  # API Access Logs
  ApiAccessLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub '/aws/apigateway/${AWS::StackName}'
      RetentionInDays: 7

  # Lambda Integration
  LambdaIntegration:
    Type: AWS::ApiGatewayV2::Integration
    Properties:
      ApiId: !Ref HttpApi
      IntegrationType: AWS_PROXY
      IntegrationUri: !GetAtt ApiFunction.Arn
      PayloadFormatVersion: '2.0'

  # API Routes
  HealthRoute:
    Type: AWS::ApiGatewayV2::Route
    Properties:
      ApiId: !Ref HttpApi
      RouteKey: 'GET /health'
      Target: !Sub 'integrations/${LambdaIntegration}'

  GetItemsRoute:
    Type: AWS::ApiGatewayV2::Route
    Properties:
      ApiId: !Ref HttpApi
      RouteKey: 'GET /items'
      Target: !Sub 'integrations/${LambdaIntegration}'

  CreateItemRoute:
    Type: AWS::ApiGatewayV2::Route
    Properties:
      ApiId: !Ref HttpApi
      RouteKey: 'POST /items'
      Target: !Sub 'integrations/${LambdaIntegration}'

  GetItemRoute:
    Type: AWS::ApiGatewayV2::Route
    Properties:
      ApiId: !Ref HttpApi
      RouteKey: 'GET /items/{id}'
      Target: !Sub 'integrations/${LambdaIntegration}'

  UpdateItemRoute:
    Type: AWS::ApiGatewayV2::Route
    Properties:
      ApiId: !Ref HttpApi
      RouteKey: 'PUT /items/{id}'
      Target: !Sub 'integrations/${LambdaIntegration}'

  DeleteItemRoute:
    Type: AWS::ApiGatewayV2::Route
    Properties:
      ApiId: !Ref HttpApi
      RouteKey: 'DELETE /items/{id}'
      Target: !Sub 'integrations/${LambdaIntegration}'

  OptionsRoute:
    Type: AWS::ApiGatewayV2::Route
    Properties:
      ApiId: !Ref HttpApi
      RouteKey: 'OPTIONS /{proxy+}'
      Target: !Sub 'integrations/${LambdaIntegration}'

  # Lambda Permission for API Gateway
  ApiGatewayInvokePermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref ApiFunction
      Action: lambda:InvokeFunction
      Principal: apigateway.amazonaws.com
      SourceArn: !Sub 'arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:${HttpApi}/*'

  # Custom Domain (Optional)
  CustomDomain:
    Type: AWS::ApiGatewayV2::DomainName
    Condition: HasCustomDomain
    Properties:
      DomainName: !Ref DomainName
      DomainNameConfigurations:
        - CertificateArn: !Ref CertificateArn
          EndpointType: REGIONAL

  ApiMapping:
    Type: AWS::ApiGatewayV2::ApiMapping
    Condition: HasCustomDomain
    Properties:
      ApiId: !Ref HttpApi
      DomainName: !Ref CustomDomain
      Stage: !Ref ApiStage

  # CloudWatch Alarm for Errors
  ApiErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub '${AWS::StackName}-api-errors'
      AlarmDescription: Alert when the Lambda error count is high
      MetricName: Errors
      Namespace: AWS/Lambda
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 10
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: FunctionName
          Value: !Ref ApiFunction

  # CloudWatch Alarm for Throttling
  ApiThrottleAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub '${AWS::StackName}-api-throttles'
      AlarmDescription: Alert when the Lambda function is being throttled
      MetricName: Throttles
      Namespace: AWS/Lambda
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: FunctionName
          Value: !Ref ApiFunction

Outputs:
  ApiEndpoint:
    Description: HTTP API Gateway endpoint URL
    Value: !Sub 'https://${HttpApi}.execute-api.${AWS::Region}.amazonaws.com/${Stage}'
    Export:
      Name: !Sub '${AWS::StackName}-ApiEndpoint'
  CustomDomainEndpoint:
    Description: Custom domain endpoint (if configured)
    Condition: HasCustomDomain
    Value: !Sub 'https://${DomainName}'
  DynamoDBTableName:
    Description: DynamoDB table name
    Value: !Ref ItemsTable
    Export:
      Name: !Sub '${AWS::StackName}-TableName'
  LambdaFunctionArn:
    Description: Lambda function ARN
    Value: !GetAtt ApiFunction.Arn
    Export:
      Name: !Sub '${AWS::StackName}-FunctionArn'
  ApiId:
    Description: HTTP API Gateway ID
    Value: !Ref HttpApi
    Export:
      Name: !Sub '${AWS::StackName}-ApiId'
```
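The `update_item` handler in the template builds its DynamoDB `UpdateExpression` dynamically from whichever fields the request body contains. The standalone sketch below isolates that technique so it can be tested without AWS access. The helper name `build_update_kwargs` is hypothetical; the one detail it relies on is that optional parameters such as `ExpressionAttributeNames` should be left out entirely when unused, since boto3 does not accept `None` for them.

```python
def build_update_kwargs(item_id, body, timestamp):
    """Return keyword arguments for a DynamoDB Table.update_item call."""
    set_clauses = ["updatedAt = :updatedAt"]
    values = {":updatedAt": timestamp}
    names = {}

    if "name" in body:
        # `name` is a DynamoDB reserved word, so it needs a #placeholder.
        set_clauses.append("#name = :name")
        values[":name"] = body["name"]
        names["#name"] = "name"
    if "description" in body:
        set_clauses.append("description = :description")
        values[":description"] = body["description"]

    kwargs = {
        "Key": {"itemId": item_id},
        "UpdateExpression": "SET " + ", ".join(set_clauses),
        "ExpressionAttributeValues": values,
        "ReturnValues": "ALL_NEW",
    }
    # Include the names map only when at least one placeholder is used.
    if names:
        kwargs["ExpressionAttributeNames"] = names
    return kwargs


print(build_update_kwargs("abc", {"name": "Widget"}, 1700000000))
```

With a boto3 `Table` resource, the result would be applied as `table.update_item(**build_update_kwargs(item_id, body, timestamp))`.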