deploying-on-aws

Selecting and implementing AWS services and architectural patterns. Use when designing AWS cloud architectures, choosing compute/storage/database services, implementing serverless or container patterns, or applying AWS Well-Architected Framework principles.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars: 318.

Hot score: 99.

Updated: March 20, 2026.

Overall rating: C (3.8).

Composite score: 3.8.

Best-practice grade: B (72.0).

Install command

npx @skill-hub/cli install ancoleman-ai-design-components-deploying-on-aws

Repository

ancoleman/ai-design-components

Skill path: skills/deploying-on-aws

Best for

Primary workflow: Run DevOps.

Technical facets: Full Stack, Backend, DevOps.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: ancoleman.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

  • Install deploying-on-aws into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/ancoleman/ai-design-components before adding deploying-on-aws to shared team environments
  • Use deploying-on-aws for development workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: deploying-on-aws
description: Selecting and implementing AWS services and architectural patterns. Use when designing AWS cloud architectures, choosing compute/storage/database services, implementing serverless or container patterns, or applying AWS Well-Architected Framework principles.
---

# AWS Patterns

## Purpose

This skill provides decision frameworks and implementation patterns for Amazon Web Services. Navigate AWS's 200+ services through proven selection criteria, architectural patterns, and Well-Architected Framework principles. Focus on practical service selection, cost-aware design, and modern 2025 patterns including Lambda SnapStart, EventBridge Pipes, and S3 Express One Zone.

Use this skill when designing AWS solutions, selecting services for specific workloads, implementing serverless or container architectures, or optimizing existing AWS infrastructure for cost, performance, and reliability.

## When to Use This Skill

Invoke this skill when:

- Choosing between Lambda, Fargate, ECS, EKS, or EC2 for compute workloads
- Selecting database services (RDS, Aurora, DynamoDB) based on access patterns
- Designing VPC architecture for multi-tier applications
- Implementing serverless patterns with API Gateway and Lambda
- Building container-based microservices on ECS or EKS
- Applying AWS Well-Architected Framework to designs
- Optimizing AWS costs while maintaining performance
- Implementing security best practices (IAM, KMS, encryption)

## Core Service Selection Frameworks

### Compute Service Selection

**Decision Flow:**

```
Execution Duration:
  <15 minutes → Evaluate Lambda
  >15 minutes → Evaluate containers or VMs

Event-Driven/Scheduled:
  YES → Lambda (serverless)
  NO → Consider traffic patterns

Containerized:
  YES → Need Kubernetes?
    YES → EKS
    NO → ECS (Fargate or EC2)
  NO → Evaluate EC2 or containerize first

Special Requirements:
  GPU/Windows/BYOL licensing → EC2
  Predictable high traffic → EC2 or ECS on EC2 (cost optimization)
  Variable traffic → Lambda or Fargate
```
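
The decision flow above can be sketched as a small function. This is a minimal illustration, not a definitive selector: the `Workload` fields and the order in which the checks run are assumptions made for the example, since the flow itself does not fix a strict precedence.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical workload description; field names are illustrative."""
    max_runtime_minutes: float
    event_driven: bool = False
    containerized: bool = False
    needs_kubernetes: bool = False
    needs_gpu_windows_or_byol: bool = False
    predictable_high_traffic: bool = False


def suggest_compute(w: Workload) -> str:
    """Return a first-pass compute suggestion following the decision flow."""
    # Assumption: special requirements take precedence over everything else.
    if w.needs_gpu_windows_or_byol:
        return "EC2"
    if w.containerized:
        if w.needs_kubernetes:
            return "EKS"
        # Steady, high traffic favors EC2-backed capacity for cost.
        return "ECS on EC2" if w.predictable_high_traffic else "ECS on Fargate"
    # Lambda only fits work that finishes inside the 15-minute limit.
    short_enough = w.max_runtime_minutes < 15
    if short_enough and (w.event_driven or not w.predictable_high_traffic):
        return "Lambda"
    return "EC2 (or containerize first)"


print(suggest_compute(Workload(max_runtime_minutes=2, event_driven=True)))
```

Treat the result as a starting point for the quick-reference table below rather than a final answer.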

**Quick Reference:**

| Workload | Primary Choice | Cost Model | Key Benefit |
|----------|---------------|------------|-------------|
| API Backend | Lambda + API Gateway | Pay per request | Auto-scale, no servers |
| Microservices | ECS on Fargate | Pay for runtime | Simple operations |
| Kubernetes Apps | EKS | $73/mo + compute | Portability, ecosystem |
| Batch Jobs | Lambda or Fargate Spot | Request/spot pricing | Cost efficiency |
| Long-Running | EC2 Reserved Instances | 30-60% savings | Predictable cost |

For detailed service comparisons including cost examples, performance characteristics, and use case guidance, see `references/compute-services.md`.

### Database Service Selection

**Decision Matrix by Access Pattern:**

| Access Pattern | Data Model | Primary Choice | Key Criteria |
|----------------|------------|----------------|--------------|
| Transactional (OLTP) | Relational | Aurora | Performance + HA |
| Simple CRUD | Relational | RDS PostgreSQL | Cost vs. features |
| Key-Value Lookups | NoSQL | DynamoDB | Serverless scale |
| Document Storage | JSON/BSON | DynamoDB | Flexibility vs. MongoDB compat |
| Caching | In-Memory | ElastiCache Redis | Speed + durability |
| Analytics (OLAP) | Columnar | Redshift/Athena | Dedicated vs. serverless |
| Time-Series | Timestamped | Timestream | Purpose-built |

**Query Complexity Guide:**

- **Simple Key-Value:** DynamoDB (single-digit ms latency)
- **Moderate Joins (2-3 tables):** Aurora or RDS (cost vs. performance)
- **Complex Analytics:** Redshift (dedicated) or Athena (serverless, query S3)
- **Real-Time Streams:** DynamoDB Streams + Lambda

For database engine comparisons, cost comparisons, and migration patterns, see `references/database-services.md`.

### Storage Service Selection

**Primary Decision Tree:**

```
Data Type:
  Objects (files, media) → S3 + lifecycle policies
  Blocks (databases, boot volumes) → EBS
  Shared Files (cross-instance) → Evaluate protocol

File Protocol Required:
  NFS (Linux) → EFS
  SMB (Windows) → FSx for Windows
  High-Performance HPC → FSx for Lustre
  Multi-Protocol + Enterprise → FSx for NetApp ONTAP
```

**Cost Comparison (1TB/month):**

| Service | Monthly Cost | Access Pattern |
|---------|--------------|----------------|
| S3 Standard | $23 | Frequent access |
| S3 Standard-IA | $12.50 | Infrequent (>30 days) |
| S3 Glacier Instant | $4 | Archive, instant retrieval |
| EBS gp3 | $80 | Block storage |
| EFS Standard | $300 | Shared files, frequent |
| EFS IA | $25 | Shared files, infrequent |

**Recommendation:** Use S3 for 80%+ of storage needs. Use EFS/FSx only when shared file access is required.

For S3 storage classes, EBS volume types, and lifecycle policy examples, see `references/storage-services.md`.
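
To compare services at a data volume other than 1TB, the table can be turned into per-GB rates. The sketch below assumes the rates implied by the table (monthly cost divided by 1,000 GB) and covers storage charges only; request, retrieval, and transfer fees are left out, and the function names are illustrative.

```python
# Per-GB monthly prices implied by the cost comparison table (cost / 1000 GB).
PRICE_PER_GB_MONTH = {
    "S3 Standard": 0.023,
    "S3 Standard-IA": 0.0125,
    "S3 Glacier Instant": 0.004,
    "EBS gp3": 0.08,
    "EFS Standard": 0.30,
    "EFS IA": 0.025,
}


def monthly_storage_cost(service: str, gigabytes: float) -> float:
    """Storage-only cost in USD for one month."""
    return round(PRICE_PER_GB_MONTH[service] * gigabytes, 2)


def cheapest_first(gigabytes: float) -> list[tuple[str, float]]:
    """All services with their monthly cost, sorted from cheapest to most expensive."""
    costs = [(name, monthly_storage_cost(name, gigabytes)) for name in PRICE_PER_GB_MONTH]
    return sorted(costs, key=lambda item: item[1])


for name, cost in cheapest_first(250):
    print(f"{name:<20} ${cost:,.2f}")
```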

## Serverless Architecture Patterns

### Pattern 1: REST API (Lambda + API Gateway + DynamoDB)

**Architecture:**
```
Client → API Gateway (HTTP API) → Lambda → DynamoDB
                                        ↓
                                       S3 (file uploads)
```

**Use When:**
- Building RESTful APIs with CRUD operations
- Variable or unpredictable traffic
- Minimal operational overhead desired
- Pay-per-request cost model acceptable

**Cost Estimate (1M requests/month):**
- API Gateway (HTTP API): ~$1.00 (a REST API would cost ~$3.50)
- Lambda: $3.53
- DynamoDB: ~$7.50
- **Total: ~$12/month** (vs. Fargate ~$35+, EC2 ~$50+)
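
The Lambda line in this estimate can be reproduced from the request and GB-second rates quoted in `references/compute-services.md`. The sketch below assumes 1024MB of memory and a 200ms average duration, which is an inference that happens to match the $3.53 figure, and it ignores the free tier.

```python
REQUEST_PRICE_PER_MILLION = 0.20    # USD per 1M requests
PRICE_PER_GB_SECOND = 0.0000166667  # USD, us-east-1


def lambda_monthly_cost(requests: int, memory_mb: int, avg_duration_ms: float) -> float:
    """On-demand Lambda cost in USD for one month, ignoring the free tier."""
    request_cost = requests / 1_000_000 * REQUEST_PRICE_PER_MILLION
    gb_seconds = requests * (avg_duration_ms / 1000) * (memory_mb / 1024)
    return round(request_cost + gb_seconds * PRICE_PER_GB_SECOND, 2)


# Assumed profile: 1M requests, 1024MB, 200ms average duration.
print(lambda_monthly_cost(1_000_000, 1024, 200))
```

Halving either the memory or the duration roughly halves the compute share of the bill, which is why memory tuning matters for cost.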

**Key Components:**
- API Gateway HTTP API (cheaper than REST API)
- Lambda with appropriate memory allocation (1024MB typically optimal)
- DynamoDB on-demand billing (for variable traffic)
- CloudWatch Logs for debugging

See `examples/cdk/serverless-api/` and `examples/terraform/serverless-api/` for complete implementations.
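
As a rough sketch of the Lambda side of this pattern, the handler below routes requests using the API Gateway HTTP API payload format 2.0. The `{id}` path parameter and the routes are hypothetical, and an in-memory dictionary stands in for the DynamoDB table so the example runs without AWS access.

```python
import json

# In-memory stand-in for a DynamoDB table (assumption for a runnable sketch).
_ITEMS: dict[str, dict] = {}


def _response(status: int, body: dict) -> dict:
    """Build a response in the shape API Gateway expects from a Lambda proxy."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }


def handler(event: dict, context: object) -> dict:
    """Handle PUT and GET on a hypothetical /items/{id} route."""
    method = event["requestContext"]["http"]["method"]
    item_id = (event.get("pathParameters") or {}).get("id")

    if method == "PUT" and item_id:
        _ITEMS[item_id] = json.loads(event.get("body") or "{}")
        return _response(200, {"id": item_id, "item": _ITEMS[item_id]})
    if method == "GET" and item_id:
        if item_id not in _ITEMS:
            return _response(404, {"error": "not found"})
        return _response(200, {"id": item_id, "item": _ITEMS[item_id]})
    return _response(400, {"error": "unsupported route"})
```

In a real deployment the dictionary reads and writes would become DynamoDB calls, and the function's IAM role would be limited to that one table.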

### Pattern 2: Event-Driven Processing (EventBridge + Lambda + SQS)

**Architecture:**
```
S3 Upload → EventBridge Rule → Lambda (process) → DynamoDB (metadata)
                                              ↓
                                            SQS (downstream tasks)
```

**Use When:**
- Asynchronous file processing
- Decoupled microservices communication
- Fan-out patterns (one event, multiple consumers)
- Need retry logic and dead-letter queues

**Key Features (2025):**
- **EventBridge Pipes:** Simplified source → filter → enrichment → target
- **Lambda Response Streaming:** Stream responses up to 20MB
- **Step Functions Distributed Map:** Process millions of items in parallel

See `references/serverless-patterns.md` for additional patterns including Step Functions orchestration, API Gateway WebSockets, and Lambda SnapStart configuration.
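
The processing step in this pattern might look like the sketch below. It assumes the bucket has EventBridge notifications enabled, so the function receives an S3 "Object Created" event with the bucket and object under `detail`. The returned metadata dictionary is illustrative, and the DynamoDB and SQS calls are left out to keep the example self-contained.

```python
def handler(event: dict, context: object) -> dict:
    """Extract object metadata from an S3 'Object Created' event sent via EventBridge."""
    if event.get("source") != "aws.s3" or event.get("detail-type") != "Object Created":
        raise ValueError("unexpected event type")

    detail = event["detail"]
    metadata = {
        "bucket": detail["bucket"]["name"],
        "key": detail["object"]["key"],
        "size_bytes": detail["object"].get("size", 0),
        "event_time": event.get("time"),
    }
    # A real function would store `metadata` in DynamoDB and enqueue
    # downstream work on SQS at this point.
    return metadata
```

Raising on unexpected events lets Lambda's retry and dead-letter handling deal with malformed input instead of silently dropping it.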

## Container Architecture Patterns

### Pattern 1: ECS on Fargate (Serverless Containers)

**Architecture:**
```
ALB → ECS Service (Fargate tasks) → RDS Aurora
                                 ↓
                           ElastiCache Redis
```

**Use When:**
- Containerized applications without cluster management
- Variable traffic with auto-scaling
- Need to avoid EC2 instance management
- Docker-based deployment

**Key Components:**
- Application Load Balancer (path-based routing)
- ECS Cluster with Fargate launch type
- Task definitions (CPU, memory, container image)
- Auto-scaling based on CPU/memory or custom metrics
- Service Connect for built-in service mesh (2025 feature)

**Cost Model (2 vCPU, 4GB RAM, 24/7):**
- Fargate: ~$70/month
- ALB: ~$20/month
- RDS Aurora db.t3.medium: ~$50/month
- **Total: ~$140/month**
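
The Fargate figure follows from the per-vCPU and per-GB hourly rates listed in `references/compute-services.md`. A minimal sketch, assuming those Linux us-east-1 rates and a 730-hour month:

```python
VCPU_HOUR = 0.04048    # USD per vCPU-hour
GB_HOUR = 0.004445     # USD per GB-hour
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def fargate_monthly_cost(vcpu: float, memory_gb: float, tasks: int = 1) -> float:
    """Cost in USD of running `tasks` identical Fargate tasks around the clock."""
    hourly = vcpu * VCPU_HOUR + memory_gb * GB_HOUR
    return round(hourly * HOURS_PER_MONTH * tasks, 2)


print(fargate_monthly_cost(2, 4))           # one task, 2 vCPU / 4GB
print(fargate_monthly_cost(2, 4, tasks=3))  # three tasks behind the ALB
```

A single 2 vCPU, 4GB task comes to roughly $72 under these assumptions, in line with the ~$70 estimate above.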

### Pattern 2: EKS (Kubernetes on AWS)

**Use When:**
- Kubernetes expertise exists in team
- Multi-cloud or hybrid cloud strategy
- Need Kubernetes ecosystem (Helm, Operators, Istio)
- Complex workload orchestration requirements

**Key Features (2025):**
- **EKS Auto Mode:** Fully managed node lifecycle
- **EKS Pod Identities:** Simplified IAM (alternative to IRSA, which remains supported)
- **EKS Hybrid Nodes:** Run on-premises nodes

**Cost Considerations:**
- EKS control plane: $73/month per cluster
- Worker nodes: Fargate or EC2 pricing
- Use EKS on Fargate for simplicity, EC2 for cost optimization

For ECS task definitions, EKS cluster setup with CDK/Terraform, and service mesh patterns, see `references/container-patterns.md`.

## Networking Essentials

### VPC Architecture

**Standard 3-Tier Pattern:**

```
VPC: 10.0.0.0/16

Per Availability Zone (deploy across 3 AZs):
  Public Subnet:    10.0.X.0/24   (ALB, NAT Gateway)
  Private Subnet:   10.0.1X.0/24  (ECS, Lambda, app tier)
  Database Subnet:  10.0.2X.0/24  (RDS, Aurora, isolated)
```

**Best Practices:**
- Use /16 for VPC CIDR (65,536 IPs for growth)
- Use /24 for subnet CIDRs (256 IPs, 251 usable)
- Deploy across minimum 2 AZs (3 recommended) for high availability
- Use Security Groups (stateful) for instance-level firewall
- Enable VPC Flow Logs for troubleshooting
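
The subnet layout can be generated with Python's standard `ipaddress` module. This sketch assumes that `X` in the pattern above is the zero-based AZ index, so public subnets start at `10.0.0.0/24`, private at `10.0.10.0/24`, and database at `10.0.20.0/24`. The function names are illustrative.

```python
import ipaddress


def plan_subnets(vpc_cidr: str = "10.0.0.0/16", az_count: int = 3) -> dict[str, list[str]]:
    """Carve one public, private, and database /24 per AZ out of the VPC range."""
    if not 1 <= az_count <= 10:
        raise ValueError("this layout leaves room for at most 10 AZs per tier")
    blocks = list(ipaddress.ip_network(vpc_cidr).subnets(new_prefix=24))
    plan: dict[str, list[str]] = {"public": [], "private": [], "database": []}
    for az in range(az_count):
        plan["public"].append(str(blocks[az]))         # 10.0.X.0/24
        plan["private"].append(str(blocks[10 + az]))   # 10.0.1X.0/24
        plan["database"].append(str(blocks[20 + az]))  # 10.0.2X.0/24
    return plan


def usable_ips(cidr: str) -> int:
    """AWS reserves five addresses in every subnet."""
    return ipaddress.ip_network(cidr).num_addresses - 5


for tier, cidrs in plan_subnets().items():
    print(tier, cidrs)
print(usable_ips("10.0.0.0/24"))
```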

### Load Balancing

**Service Selection:**

| Load Balancer | Protocol | Use Case | Key Feature |
|---------------|----------|----------|-------------|
| ALB | HTTP/HTTPS | Web apps, APIs | Path/host routing, Lambda targets |
| NLB | TCP/UDP | High performance | Static IP, ultra-low latency |
| GWLB | Layer 3 | Security appliances | Inline inspection |

**ALB Features:**
- Path-based routing: `/api` → backend, `/web` → frontend
- Host-based routing: `api.example.com`, `web.example.com`
- WebSocket and gRPC support
- Integration with Lambda (serverless backends)

For CloudFront CDN patterns, Route 53 routing policies, and VPC peering configurations, see `references/networking.md`.

## Security Best Practices

### IAM Principles

**Least Privilege Pattern:**

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": "arn:aws:s3:::my-bucket/uploads/*"
  }]
}
```

**Core Practices:**
- Use IAM roles (not users) for applications
- Implement least privilege (grant minimum permissions needed)
- Enable MFA for privileged users
- Use IAM Access Analyzer to validate policies
- Leverage AWS Organizations SCPs for guardrails
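
IAM Access Analyzer is the right tool for real policy validation. As a rough illustration of what least privilege rules out, the sketch below flags `Allow` statements that use wildcard actions or a `*` resource. The `find_wildcards` helper is hypothetical and checks far less than Access Analyzer does.

```python
def find_wildcards(policy: dict) -> list[str]:
    """Return human-readable findings for overly broad Allow statements."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    findings = []
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        for action in actions:
            if action == "*" or action.endswith(":*"):
                findings.append(f"statement {index}: broad action {action!r}")
        if "*" in resources:
            findings.append(f"statement {index}: resource '*'")
    return findings


scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": "arn:aws:s3:::my-bucket/uploads/*",
    }],
}
print(find_wildcards(scoped_policy))
```

The scoped policy from this section produces no findings, while a statement such as `"Action": "s3:*", "Resource": "*"` would be flagged twice.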

### Data Protection

**Encryption Requirements:**

| Service | At-Rest Encryption | In-Transit Encryption |
|---------|-------------------|----------------------|
| S3 | SSE-S3 or SSE-KMS | HTTPS (TLS 1.2+) |
| EBS | KMS encryption | Encrypted between instance and volume (with EBS encryption enabled) |
| RDS/Aurora | KMS encryption | TLS connections |
| DynamoDB | KMS encryption | HTTPS API |

**Secrets Management:**
- **Secrets Manager:** Database credentials with automatic rotation
- **Parameter Store:** Application configuration (free tier available)
- **KMS:** Encryption key management (customer-managed keys)

For WAF rules, GuardDuty configuration, and network security patterns, see `references/security.md`.

## AWS Well-Architected Framework

### Six Pillars Overview

**1. Operational Excellence**
- Infrastructure as code (CDK, Terraform, CloudFormation)
- Automated deployments (CI/CD pipelines)
- Observability (CloudWatch Logs, Metrics, X-Ray)
- Runbooks and playbooks for common operations

**2. Security**
- Strong identity foundation (IAM roles and policies)
- Defense in depth (Security Groups, NACLs, WAF)
- Data protection (encryption at rest and in transit)
- Detective controls (CloudTrail, GuardDuty, Security Hub)

**3. Reliability**
- Multi-AZ deployments (RDS Multi-AZ, Aurora replicas)
- Auto-scaling (EC2 ASG, ECS Service Auto Scaling)
- Backup and recovery (automated snapshots, cross-region)
- Chaos engineering (Fault Injection Simulator)

**4. Performance Efficiency**
- Right-size resources (use Compute Optimizer)
- Use managed services (reduce operational overhead)
- Caching strategies (CloudFront, ElastiCache, DAX)
- Monitor and optimize continuously

**5. Cost Optimization**
- Right-sizing compute (match capacity to demand)
- Pricing models (Reserved Instances, Savings Plans, Spot)
- Storage optimization (S3 Intelligent-Tiering, lifecycle policies)
- Cost monitoring (Cost Explorer, Budgets, Trusted Advisor)

**6. Sustainability (Added 2021)**
- Use Graviton processors (up to 60% less energy for the same performance; Graviton3 offers up to 25% better performance than Graviton2)
- Optimize workload placement (renewable energy regions)
- Storage efficiency (delete unused data, compression)
- Software optimization (efficient code, async processing)

For detailed pillar implementation guides, architectural review checklists, and Well-Architected Tool integration, see `references/well-architected.md`.

## Infrastructure as Code

### Tool Selection

**AWS CDK (Cloud Development Kit):**
- **Languages:** TypeScript, Python, Java, C#, Go
- **Best For:** AWS-native workloads, type-safe infrastructure
- **Key Benefit:** High-level constructs, synthesizes to CloudFormation
- **Example:** `examples/cdk/serverless-api/`

**Terraform:**
- **Language:** HCL (HashiCorp Configuration Language)
- **Best For:** Multi-cloud environments
- **Key Benefit:** Largest ecosystem, mature state management
- **Example:** `examples/terraform/serverless-api/`

**CloudFormation:**
- **Language:** YAML or JSON
- **Best For:** Native AWS integration, no additional tools
- **Key Benefit:** AWS service support on day 1
- **Example:** `examples/cloudformation/lambda-api.yaml`

### CDK Quick Start

```bash
# Install CDK CLI
npm install -g aws-cdk

# Initialize new project
cdk init app --language=typescript
npm install

# Deploy infrastructure
cdk bootstrap  # One-time setup
cdk deploy
```

### Terraform Quick Start

```bash
# Install Terraform
brew install terraform  # macOS

# Initialize project
terraform init

# Preview changes
terraform plan

# Apply changes
terraform apply
```

For complete working examples with VPC networking, multi-tier applications, and event-driven architectures, see the `examples/` directory.

## Cost Optimization Strategies

### Compute Cost Optimization

**Right-Sizing:**
- Use AWS Compute Optimizer for EC2/Lambda recommendations
- Monitor CloudWatch metrics (CPU, memory utilization)
- Start conservatively, scale based on actual usage

**Pricing Models:**

| Model | Commitment | Savings | Best For |
|-------|------------|---------|----------|
| On-Demand | None | 0% | Variable workloads |
| Savings Plans | 1-3 years | 30-40% | Flexible compute |
| Reserved Instances | 1-3 years | 30-60% | Predictable workloads |
| Spot Instances | None | 60-90% | Fault-tolerant tasks |
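
To see what these ranges mean for a concrete bill, the savings percentages can be applied to an on-demand baseline. A minimal sketch, assuming the ranges in the table above; actual discounts depend on term, payment option, instance family, and region.

```python
# (minimum saving, maximum saving) as fractions, taken from the table above.
SAVINGS_RANGE = {
    "On-Demand": (0.0, 0.0),
    "Savings Plans": (0.30, 0.40),
    "Reserved Instances": (0.30, 0.60),
    "Spot Instances": (0.60, 0.90),
}


def cost_range(on_demand_monthly: float, model: str) -> tuple[float, float]:
    """Return (best case, worst case) monthly cost in USD under a pricing model."""
    low_saving, high_saving = SAVINGS_RANGE[model]
    best = round(on_demand_monthly * (1 - high_saving), 2)
    worst = round(on_demand_monthly * (1 - low_saving), 2)
    return best, worst


# Example baseline: one m5.large on-demand at about $70.08/month.
for model in SAVINGS_RANGE:
    print(model, cost_range(70.08, model))
```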

**Graviton Advantage:**
- Graviton3 instances: up to 25% better performance vs. Graviton2
- Up to 60% less energy for the same performance than comparable EC2 instances
- Available: EC2, Lambda, Fargate, RDS, ElastiCache

### Storage Cost Optimization

**S3 Lifecycle Policies:**
```
Day 0-30:    S3 Standard              ($0.023/GB-month)
Day 30-90:   S3 Standard-IA           ($0.0125/GB-month)
Day 90-365:  S3 Glacier Instant       ($0.004/GB-month)
Day 365+:    S3 Glacier Deep Archive  ($0.00099/GB-month)
```
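
The tiering schedule above can be expressed as an S3 lifecycle configuration. The sketch below builds the configuration as a plain dictionary and prints it as JSON. The rule ID and the `logs/` prefix are placeholders, and the transition days are taken from the schedule above.

```python
import json

# (object age in days at which the transition happens, S3 storage class)
TRANSITIONS = [
    (30, "STANDARD_IA"),
    (90, "GLACIER_IR"),
    (365, "DEEP_ARCHIVE"),
]


def lifecycle_configuration(prefix: str = "") -> dict:
    """Build a lifecycle configuration with one rule that tiers objects down by age."""
    return {
        "Rules": [{
            "ID": "tier-down-by-age",  # placeholder rule name
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},
            "Transitions": [
                {"Days": days, "StorageClass": storage_class}
                for days, storage_class in TRANSITIONS
            ],
        }]
    }


def storage_class_at(age_days: int) -> str:
    """Storage class an object of the given age would be in under this schedule."""
    current = "STANDARD"
    for days, storage_class in TRANSITIONS:
        if age_days >= days:
            current = storage_class
    return current


print(json.dumps(lifecycle_configuration("logs/"), indent=2))
```

The dictionary follows the shape accepted by the S3 `PutBucketLifecycleConfiguration` API, for example through boto3's `put_bucket_lifecycle_configuration`. Review the rule against retrieval and minimum-storage-duration charges before applying it to a real bucket.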

**EBS Optimization:**
- Use gp3 volumes (20% cheaper than gp2, configurable IOPS)
- Delete unused snapshots
- Archive old snapshots (75% cheaper)

**Monitoring:**
- Enable AWS Cost Explorer (free)
- Set up AWS Budgets with alerts
- Use Cost Allocation Tags for attribution
- Review Trusted Advisor cost checks

## Common Patterns and Examples

### Serverless Three-Tier Application

```
CloudFront (CDN)
  → S3 (React frontend)
  → API Gateway (REST API)
    → Lambda (business logic)
      → DynamoDB (data)
      → S3 (file storage)
```

**Complete CDK implementation:** `examples/cdk/three-tier-app/`
**Complete Terraform implementation:** `examples/terraform/three-tier-app/`

### Containerized Microservices

```
Route 53 (DNS)
  → CloudFront (CDN)
    → ALB (load balancer)
      → ECS Fargate (services)
        → RDS Aurora (database)
        → ElastiCache Redis (cache)
```

**Complete implementation:** `examples/cdk/ecs-fargate/`

### Event-Driven Data Pipeline

```
S3 Upload
  → EventBridge Rule
    → Lambda (transform)
      → Amazon Data Firehose (formerly Kinesis Data Firehose)
        → S3 Data Lake
          → Athena (query)
```

**Complete implementation:** `examples/cdk/event-driven/`

## Integration with Other Skills

### Related Skills

- **infrastructure-as-code** - Multi-cloud IaC concepts, CDK and Terraform patterns
- **kubernetes-operations** - EKS cluster operations, kubectl, Helm charts
- **building-ci-pipelines** - CodePipeline, CodeBuild, GitHub Actions → AWS
- **secret-management** - Secrets Manager rotation, Parameter Store hierarchies
- **observability** - CloudWatch advanced queries, X-Ray distributed tracing
- **security-hardening** - IAM policy best practices, security automation
- **disaster-recovery** - Multi-region strategies, backup automation

### Cross-Skill Patterns

**EKS + kubernetes-operations:**
- Use this skill for EKS cluster provisioning (CDK/Terraform)
- Use kubernetes-operations for kubectl, Helm, application deployment

**Secrets Management:**
- Use this skill for Secrets Manager/Parameter Store setup
- Use secret-management skill for rotation policies, access patterns

**CI/CD Integration:**
- Use this skill for CodePipeline infrastructure
- Use building-ci-pipelines skill for pipeline configuration

## Reference Documentation

### Detailed Guides

- **Compute Services:** `references/compute-services.md` - Lambda, Fargate, ECS, EKS, EC2 deep dive
- **Database Services:** `references/database-services.md` - RDS, Aurora, DynamoDB, ElastiCache comparison
- **Storage Services:** `references/storage-services.md` - S3 classes, EBS types, EFS/FSx selection
- **Networking:** `references/networking.md` - VPC design, load balancing, CloudFront, Route 53
- **Security:** `references/security.md` - IAM patterns, KMS, Secrets Manager, WAF
- **Serverless Patterns:** `references/serverless-patterns.md` - Advanced Lambda, Step Functions, EventBridge
- **Container Patterns:** `references/container-patterns.md` - ECS Service Connect, EKS Pod Identities
- **Well-Architected:** `references/well-architected.md` - Six pillars implementation guide

### Working Examples

- **CDK Examples:** `examples/cdk/` - TypeScript implementations
- **Terraform Examples:** `examples/terraform/` - HCL implementations
- **CloudFormation Examples:** `examples/cloudformation/` - YAML templates

### Utility Scripts

- **Cost Estimation:** `scripts/cost-estimate.sh` - Estimate infrastructure costs
- **Resource Audit:** `scripts/resource-audit.sh` - Audit AWS resources
- **Security Check:** `scripts/security-check.sh` - Basic security validation

## AWS Service Updates (2025)

**Recent Innovations to Consider:**

- **Lambda SnapStart:** Faster cold starts from cached snapshots of initialized execution environments (introduced for Java, later extended to Python and .NET)
- **Lambda Response Streaming:** Stream responses up to 20MB
- **EventBridge Pipes:** Simplified event processing (source → filter → enrichment → target)
- **S3 Express One Zone:** Up to 10x faster data access than S3 Standard, single-digit millisecond latency
- **ECS Service Connect:** Built-in service mesh for ECS
- **EKS Auto Mode:** Fully managed Kubernetes node lifecycle
- **EKS Pod Identities:** Simplified IAM for pods (alternative to IRSA, which remains supported)
- **Aurora Limitless Database:** Horizontal write scaling beyond single-writer limit
- **DynamoDB Standard-IA:** Table class for infrequently accessed data with up to 60% lower storage cost
- **RDS Blue/Green Deployments:** Version upgrades with minimal downtime (switchover typically takes under a minute)

---

## Quick Decision Checklist

**Before choosing a service, answer:**

1. **Traffic Pattern:** Predictable or variable? (affects compute choice)
2. **Data Model:** Relational, key-value, document, or graph? (affects database choice)
3. **Access Pattern:** Frequent, infrequent, or archive? (affects storage class)
4. **Latency Requirements:** Milliseconds, seconds, or minutes acceptable?
5. **Scaling Needs:** Vertical (bigger instances) or horizontal (more instances)?
6. **Operational Overhead:** Prefer managed services or need control?
7. **Cost Sensitivity:** Optimize for cost, performance, or balance?
8. **Compliance Requirements:** Data residency, encryption, audit logging needed?

**Then consult the relevant decision framework in this skill or detailed references.**

## Getting Started

**For New AWS Projects:**

1. Define architecture using Well-Architected Framework pillars
2. Choose compute service using decision tree (Lambda/Fargate/ECS/EKS/EC2)
3. Select database based on access patterns and data model
4. Design VPC with 3-tier subnet architecture
5. Implement IaC using CDK or Terraform (see examples/)
6. Apply security best practices (IAM, encryption, logging)
7. Set up monitoring and cost tracking

**For Existing AWS Projects:**

1. Run AWS Trusted Advisor for recommendations
2. Review Well-Architected Framework pillars
3. Optimize costs (right-size, Reserved Instances, storage lifecycle)
4. Migrate to modern services (EC2 → Fargate, RDS → Aurora)
5. Improve security posture (enable GuardDuty, implement least privilege)
6. Automate with IaC (reverse-engineer to Terraform or CDK)


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### references/compute-services.md

```markdown
# AWS Compute Services - Deep Dive

## Table of Contents

1. [Lambda (Serverless Functions)](#lambda-serverless-functions)
2. [Fargate (Serverless Containers)](#fargate-serverless-containers)
3. [ECS (Elastic Container Service)](#ecs-elastic-container-service)
4. [EKS (Elastic Kubernetes Service)](#eks-elastic-kubernetes-service)
5. [EC2 (Virtual Machines)](#ec2-virtual-machines)
6. [Service Comparison Matrix](#service-comparison-matrix)
7. [Migration Paths](#migration-paths)

---

## Lambda (Serverless Functions)

### Overview

AWS Lambda runs code without provisioning servers. Pay only for compute time consumed. Supports multiple languages and automatic scaling.

### Key Specifications (2025)

- **Execution Time:** 1ms to 15 minutes maximum
- **Memory:** 128MB to 10,240MB (increments of 1MB)
- **Storage:** 512MB to 10GB ephemeral storage (/tmp)
- **Deployment Package:** 50MB zipped (direct upload), 250MB unzipped; container images up to 10GB
- **Concurrent Executions:** 1,000 default (can increase via quota)
- **Supported Runtimes:** Node.js, Python, Java, Go, .NET, Ruby, Custom (containers)

### Performance Features (2025)

**Lambda SnapStart (Java):**
- Near-instant cold starts for Java functions
- Caches initialized execution environment
- Up to 10x faster startup vs. traditional Java

**Lambda Response Streaming:**
- Stream responses up to 20MB
- Progressive results for large payloads
- Ideal for generative AI, video processing

**Provisioned Concurrency:**
- Pre-initialized execution environments
- Removes cold-start initialization from the request path (AWS cites double-digit millisecond response times)
- Predictable performance for latency-sensitive apps

### Cost Model

**Request Pricing:**
- Free tier: 1M requests/month (perpetual)
- $0.20 per 1M requests thereafter

**Compute Pricing (us-east-1):**
- $0.0000166667 per GB-second
- Free tier: 400,000 GB-seconds/month

**Example Calculations:**

```
Scenario: API with 5M requests/month, 512MB memory, 200ms avg execution

Requests: (5M - 1M free) × $0.20/1M = $0.80
Compute: 5M × 0.2s × 0.5GB = 500,000 GB-seconds
         (500,000 - 400,000 free) × $0.0000166667 = $1.67
Total: $2.47/month (without the free tier: $1.00 + $8.33 = $9.33/month)
```

### Use Cases

**Ideal:**
- API backends (via API Gateway)
- File processing (S3 triggers)
- Scheduled jobs (EventBridge cron)
- Stream processing (Kinesis, DynamoDB Streams)
- WebHooks and event handlers

**Avoid:**
- Long-running tasks (>15 minutes)
- Stateful applications
- Predictable high throughput (EC2 cheaper at scale)
- Large zip deployment packages (>250MB unzipped; container images allow up to 10GB)

### Best Practices

1. **Optimize Memory Allocation:**
   - CPU scales with memory (1,769MB = 1 vCPU)
   - Test different memory sizes (more memory = faster execution = lower cost)
   - Use AWS Lambda Power Tuning tool

2. **Reduce Cold Starts:**
   - Minimize dependencies
   - Use SnapStart for Java
   - Provision concurrency for critical functions
   - Keep functions warm with scheduled pings (if cost-effective)

3. **Environment Variables:**
   - Use for configuration (no code changes)
   - Encrypt sensitive values with KMS
   - Consider Parameter Store or Secrets Manager for secrets

4. **Observability:**
   - Enable X-Ray tracing
   - Use structured logging (JSON)
   - Create CloudWatch dashboards
   - Set up alarms for errors and throttling

---

## Fargate (Serverless Containers)

### Overview

AWS Fargate runs containers without managing servers. Pay for vCPU and memory used. Works with ECS and EKS.

### Key Specifications

**CPU Options:**
- 0.25 vCPU to 16 vCPU
- Must match valid CPU/memory combinations

**Memory Options:**
- 0.5GB to 120GB
- Scales with CPU selection

**Platform Versions:**
- **1.4.0:** Current default, supports EFS, container insights
- **1.3.0:** Legacy, missing some features

### Cost Model (Linux, us-east-1)

**Per vCPU-hour:** $0.04048
**Per GB-hour:** $0.004445

**Example Configurations:**

| vCPU | Memory | Hourly | Monthly (24/7, 730 hours) |
|------|--------|--------|---------------------------|
| 0.25 | 0.5GB | $0.0123 | $9.01 |
| 0.5 | 1GB | $0.0247 | $18.02 |
| 1 | 2GB | $0.0494 | $36.04 |
| 2 | 4GB | $0.0987 | $72.08 |
| 4 | 8GB | $0.1975 | $144.16 |

**Fargate Spot:**
- 70% discount vs. on-demand Fargate
- Can be interrupted with 2-minute notice
- Ideal for fault-tolerant batch jobs

### Use Cases

**Ideal:**
- Containerized microservices
- Batch processing (with Fargate Spot)
- CI/CD build agents
- Variable traffic applications
- Multi-hour running tasks

**Avoid:**
- Extremely cost-sensitive at high scale (EC2 cheaper)
- GPU workloads (use EC2)
- Stateful apps requiring persistent local storage
- High-performance computing

### Best Practices

1. **Task Sizing:**
   - Start small, monitor CloudWatch Container Insights
   - Scale up based on actual utilization
   - Use Application Auto Scaling

2. **Networking:**
   - Use awsvpc network mode (required)
   - Each task gets ENI with private IP
   - Use Security Groups for network isolation

3. **Storage:**
   - Ephemeral storage: 20GB default (can increase to 200GB)
   - Persistent storage: Mount EFS volumes
   - Logs: Send to CloudWatch Logs

---

## ECS (Elastic Container Service)

### Overview

AWS-native container orchestration. Simpler than Kubernetes. Tight integration with AWS services.

### Launch Types

**Fargate:**
- Serverless, no EC2 management
- Pay per task

**EC2:**
- Manage EC2 instances
- Lower cost at scale
- More control

**External:**
- Run on on-premises servers
- ECS Anywhere

### Key Features (2025)

**ECS Service Connect:**
- Built-in service mesh
- Service discovery without custom code
- Load balancing and circuit breaking

**ECS Exec:**
- Interactive shell access to containers
- Debugging without SSH

**Capacity Providers:**
- Auto-scale between Fargate and EC2
- Mix spot and on-demand instances

### Cost Model

**No ECS Control Plane Fees:**
- Only pay for underlying compute (Fargate or EC2)

**Example (10 services, t3.medium EC2):**
- EC2: 10 × $30/month = $300
- ECS: $0
- **Total: $300/month**

### Use Cases

**Ideal:**
- Docker-based applications
- AWS-native deployments
- Orchestration needs that do not justify Kubernetes
- Tight ALB/CloudWatch/IAM integration

**Avoid:**
- Multi-cloud portability needed (use EKS)
- Team already has Kubernetes expertise (EKS may fit better)
- Need Kubernetes ecosystem (Helm, Operators)

---

## EKS (Elastic Kubernetes Service)

### Overview

Managed Kubernetes control plane. Full Kubernetes compatibility. Multi-cloud/hybrid portability.

### Key Specifications

**Control Plane:**
- Highly available across 3 AZs
- Automatic version upgrades
- Integrated with AWS IAM

**Supported Versions:**
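- Kubernetes 1.25 to 1.28 listed at the time of writing; EKS retires older minor versions on a rolling schedule, so check the current supported-version list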
- Automatic minor version upgrades available

### Key Features (2025)

**EKS Auto Mode:**
- Fully managed node lifecycle
- Automatic capacity provisioning
- No manual node group management

**EKS Pod Identities:**
- Simplified IAM for pods
- Alternative to IRSA (IAM Roles for Service Accounts), which remains supported
- Easier setup and debugging

**EKS Hybrid Nodes:**
- Run Kubernetes nodes on-premises
- Consistent management plane

### Cost Model

**Control Plane:** $0.10/hour = $73/month per cluster

**Worker Nodes:**
- Fargate: Per-task pricing
- EC2: Instance pricing
- On-Demand, Reserved, or Spot

**Example (3 m5.large nodes on-demand):**
- Control plane: $73/month
- Nodes: 3 × $70 = $210/month
- **Total: $283/month**

### Use Cases

**Ideal:**
- Kubernetes expertise exists
- Multi-cloud/hybrid strategy
- Complex orchestration needs
- Kubernetes ecosystem required (Helm, Operators, Istio)

**Avoid:**
- Team lacks Kubernetes knowledge
- Simple workloads (over-engineering)
- Cost-sensitive (ECS cheaper)

### Best Practices

1. **Node Groups:**
   - Use managed node groups
   - Mix on-demand (baseline) + spot (burst)
   - Use Auto Mode for simplicity

2. **Networking:**
   - Use AWS VPC CNI plugin
   - Enable Pod Security Groups
   - Use Network Policies for isolation

3. **Storage:**
   - Use EBS CSI Driver for persistent volumes
   - Use EFS CSI Driver for shared storage
   - Implement StorageClasses for automation

---

## EC2 (Virtual Machines)

### Overview

Virtual servers in the cloud. Full OS control. Widest instance type selection.

### Instance Families

**General Purpose (T, M):**
- Balanced CPU, memory, network
- t3: Burstable, cost-effective
- m5: Consistent performance

**Compute Optimized (C):**
- High CPU-to-memory ratio
- Batch processing, HPC, gaming

**Memory Optimized (R, X):**
- High memory-to-CPU ratio
- Databases, caches, in-memory analytics

**Storage Optimized (I, D):**
- High IOPS, throughput
- NoSQL databases, data warehousing

**Accelerated Computing (P, G, Inf):**
- GPU, FPGA, inference
- ML training, rendering, genomics

### Pricing Models

**On-Demand:**
- Pay by the second
- No commitment
- Highest cost

**Reserved Instances:**
- 1-year or 3-year commitment
- 30-60% savings
- Predictable workloads

**Savings Plans:**
- 1-year or 3-year commitment
- Flexible across instance families
- 30-40% savings

**Spot Instances:**
- Use spare EC2 capacity at the current Spot price (no bidding required)
- 60-90% savings
- Can be interrupted with 2-minute notice

### Cost Examples (us-east-1, On-Demand)

| Instance | vCPU | Memory | Hourly | Monthly | Use Case |
|----------|------|--------|--------|---------|----------|
| t3.micro | 2 | 1GB | $0.0104 | $7.60 | Dev/test |
| t3.medium | 2 | 4GB | $0.0416 | $30.37 | Small apps |
| m5.large | 2 | 8GB | $0.096 | $70.08 | General purpose |
| c5.xlarge | 4 | 8GB | $0.17 | $124.10 | Compute heavy |
| r5.large | 2 | 16GB | $0.126 | $91.98 | Memory heavy |

### Use Cases

**Ideal:**
- Maximum OS control
- GPU/FPGA workloads
- Windows Server
- BYOL licensing
- Predictable high traffic (with Reserved Instances)

**Avoid:**
- Variable traffic (use Lambda/Fargate)
- Minimal ops desired
- Serverless patterns applicable

---

## Service Comparison Matrix

| Criteria | Lambda | Fargate | ECS (EC2) | EKS | EC2 |
|----------|--------|---------|-----------|-----|-----|
| **Ops Overhead** | Minimal | Low | Medium | High | High |
| **Cost (variable)** | Excellent | Good | Fair | Fair | Poor |
| **Cost (predictable)** | Poor | Fair | Good | Good | Excellent |
| **Cold Start** | Yes | No | No | No | No |
| **Max Runtime** | 15 min | Unlimited | Unlimited | Unlimited | Unlimited |
| **Portability** | Low | Medium | Medium | High | High |
| **Scaling Speed** | Instant | Fast | Medium | Medium | Slow |
| **State Management** | None | Limited | Good | Excellent | Excellent |

---

## Migration Paths

### VM to Containers

```
EC2 → ECS on EC2 → ECS on Fargate → Serverless (Lambda)
```

**Step 1: EC2 → ECS on EC2**
- Containerize application (Dockerfile)
- Deploy to ECS with EC2 launch type
- Benefit: Better resource utilization

**Step 2: ECS on EC2 → ECS on Fargate**
- Migrate task definitions to Fargate
- Remove EC2 instance management
- Benefit: No server operations

**Step 3: ECS on Fargate → Lambda (if applicable)**
- Refactor to event-driven functions
- Use API Gateway for HTTP
- Benefit: Pay-per-request pricing

### Monolith to Microservices

```
Single EC2 → Multiple ECS Services → Lambda Functions
```

**Strategy:**
- Identify bounded contexts
- Extract services incrementally
- Use API Gateway or ALB for routing
- Implement service mesh (ECS Service Connect)

### On-Premises to AWS

```
On-Prem VMs → EC2 → Containers (ECS/EKS) → Serverless
```

**Tools:**
- AWS Application Migration Service (MGN)
- VM Import/Export
- Database Migration Service (DMS)

```

### references/database-services.md

```markdown
# AWS Database Services - Deep Dive

## Table of Contents

1. [RDS (Relational Database Service)](#rds-relational-database-service)
2. [Aurora (AWS-Native Relational)](#aurora-aws-native-relational)
3. [DynamoDB (NoSQL)](#dynamodb-nosql)
4. [DocumentDB (MongoDB Compatible)](#documentdb-mongodb-compatible)
5. [ElastiCache and MemoryDB](#elasticache-and-memorydb)
6. [Database Selection Decision Tree](#database-selection-decision-tree)
7. [Migration Strategies](#migration-strategies)

---

## RDS (Relational Database Service)

### Supported Engines

| Engine | Latest Version | Best For |
|--------|---------------|----------|
| PostgreSQL | 15.x | Modern apps, JSON support |
| MySQL | 8.0.x | Legacy compatibility |
| MariaDB | 10.11.x | MySQL fork, enhanced |
| Oracle | 19c, 21c | Enterprise apps, BYOL |
| SQL Server | 2019, 2022 | Microsoft ecosystem |

### Instance Classes

**General Purpose (db.t3, db.m5):**
- Balanced CPU, memory
- t3: Burstable, cost-effective
- m5: Consistent performance

**Memory Optimized (db.r5, db.x2):**
- High memory-to-CPU ratio
- Large datasets, caching
- r5: Widely available; newer generations (such as r6g and r6i) also exist

**Burstable (db.t4g - Graviton):**
- ARM-based processors
- 40% better price-performance
- Sustainable performance

### Cost Model (PostgreSQL db.t3.medium, us-east-1)

**Instance:** $0.068/hour = $49.64/month
**Storage (gp3):** $0.115/GB-month
**Backup Storage:** Free up to the size of the provisioned database storage (automated backups)

**Example Configuration:**
- db.t3.medium instance: $49.64/month
- 100GB gp3 storage: $11.50/month
- **Total: $61.14/month**

### Multi-AZ Deployments

**How it Works:**
- Synchronous replication to standby in different AZ
- Automatic failover (60-120 seconds)
- Same endpoint (no app changes)

**Cost:** 2x instance cost + storage in both AZs

**Use When:**
- Production workloads
- High availability required (99.95% SLA)
- Automatic failover needed
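
As a rough sketch of the Multi-AZ cost rule, the function below doubles both the instance and storage charges. It assumes a 730-hour month and the db.t3.medium and gp3 rates quoted in the cost model; `rds_monthly_cost` is a hypothetical helper, not an AWS API.

```python
HOURS_PER_MONTH = 730  # assumption: same month length as the cost model


def rds_monthly_cost(hourly_rate: float, storage_gb: int,
                     storage_rate: float, multi_az: bool = False) -> float:
    """Estimate monthly RDS cost; Multi-AZ doubles instance and storage charges."""
    single_az = hourly_rate * HOURS_PER_MONTH + storage_gb * storage_rate
    return round(single_az * (2 if multi_az else 1), 2)


if __name__ == "__main__":
    print("Single-AZ:", rds_monthly_cost(0.068, 100, 0.115))
    print("Multi-AZ: ", rds_monthly_cost(0.068, 100, 0.115, multi_az=True))
```

Backup storage beyond the free allowance and data transfer are left out, so treat the result as a lower bound.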

### Read Replicas

**Purpose:**
- Offload read traffic
- Scale horizontally
- Analytics on replica (no impact on primary)

**Limitations:**
- Up to 15 replicas per instance
- Asynchronous replication (eventual consistency)
- Cross-region supported

**Cost:** Standard instance pricing per replica

### Blue/Green Deployments

**Purpose:**
- Version upgrades with minimal downtime (switchover typically completes in under a minute)
- Test changes in production clone

**How it Works:**
1. Create green environment (clone)
2. Test in green environment
3. Switch traffic (blue → green)
4. Rollback if issues detected

**Use Cases:**
- Major version upgrades
- Schema changes
- Performance testing

### Best Practices

1. **Enable Automated Backups:**
   - Retention: 7-35 days
   - Point-in-time recovery
   - Minimal performance impact on Multi-AZ deployments (backups are taken from the standby)

2. **Use Parameter Groups:**
   - Customize DB engine settings
   - Apply best practices per workload
   - Version control parameter changes

3. **Monitor Performance:**
   - Enable Performance Insights (7 days of data retention at no charge)
   - Track slow queries
   - Set up CloudWatch alarms

4. **Security:**
   - Enable encryption at rest (KMS)
   - Use TLS for connections
   - Store credentials in Secrets Manager
   - Apply security group restrictions

---

## Aurora (AWS-Native Relational)

### Overview

AWS-designed database compatible with MySQL and PostgreSQL. Higher performance, availability, and durability than standard RDS.

### Architecture

**Storage:**
- Automatically scales 10GB to 128TB
- 6 copies across 3 AZs
- Self-healing storage
- Continuous backup to S3

**Compute:**
- Primary instance (read-write)
- Up to 15 read replicas
- Sub-10ms replica lag

### Performance Improvements

**vs. MySQL:**
- 5x throughput improvement
- Same applications, drivers

**vs. PostgreSQL:**
- 3x throughput improvement
- PostgreSQL 11-15 compatibility

### Aurora Serverless v2

**Use Cases:**
- Variable workloads (dev/test, seasonal)
- Unpredictable traffic
- Multi-tenant applications

**Scaling:**
- Minimum: 0.5 ACU (1 ACU ≈ 2 GiB RAM with corresponding CPU and networking)
- Maximum: 128 ACU
- Scales in 0.5 ACU increments
- Sub-second scaling

**Cost Model:**
- $0.12 per ACU-hour (us-east-1)
- Storage: $0.10/GB-month
- I/O: $0.20 per 1M requests

**Example Calculation:**
```
Workload: 8 hours/day active, 2 ACU baseline, 10 ACU peak

ACU-hours/month: (2 × 16hr + 10 × 8hr) × 30 days = 3,360
Cost: 3,360 × $0.12 = $403.20/month
Storage (100GB): $10/month
Total: ~$413/month
```
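
The calculation above can be reproduced with a short function. This is a sketch that assumes a two-level daily load profile (baseline and peak) and the prices listed in the cost model; I/O charges are left out, as they are in the example.

```python
ACU_HOUR_PRICE = 0.12  # USD per ACU-hour, from the cost model
STORAGE_PRICE = 0.10   # USD per GB-month, from the cost model


def serverless_v2_monthly(baseline_acu, peak_acu, peak_hours_per_day,
                          storage_gb, days=30):
    """Return (acu_hours, total_usd) for a baseline/peak daily profile."""
    off_peak_hours = 24 - peak_hours_per_day
    acu_hours = (baseline_acu * off_peak_hours
                 + peak_acu * peak_hours_per_day) * days
    total = acu_hours * ACU_HOUR_PRICE + storage_gb * STORAGE_PRICE
    return acu_hours, round(total, 2)


if __name__ == "__main__":
    hours, cost = serverless_v2_monthly(2, 10, 8, storage_gb=100)
    print(f"{hours} ACU-hours, ${cost}/month")
```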

### Aurora Limitless Database (2024+)

**Purpose:**
- Horizontal write scaling
- Sharding managed by Aurora
- Millions of transactions per second

**Use Cases:**
- Highest-scale OLTP workloads
- Multi-tenant SaaS applications
- Gaming leaderboards

**How it Works:**
- Data automatically sharded
- Distributed SQL processing
- Appears as single database to applications

### Aurora Global Database

**Purpose:**
- Cross-region replication (<1 second lag)
- Disaster recovery
- Low-latency global reads

**Architecture:**
- Primary region (read-write)
- Up to 5 secondary regions (read-only)
- Dedicated infrastructure for replication

**Cost:**
- Replication: $0.10 per GB transferred
- Instances in secondary regions charged normally

### Cost Comparison: Aurora vs. RDS

**Aurora Advantages:**
- No manual backups needed (continuous to S3)
- Faster replication (sub-10ms vs. seconds)
- Higher availability (99.99% vs. 99.95%)
- Automatic failover to replicas

**Aurora Premium:**
- 20% more expensive than RDS for equivalent instance
- Worth it for production workloads

**Example:**
- RDS PostgreSQL db.r5.large: $0.24/hour = $175/month
- Aurora PostgreSQL r5.large: $0.29/hour = $212/month
- Difference: $37/month (20% premium)

### Best Practices

1. **Use Aurora for Production:**
   - Better availability than RDS
   - Automatic storage scaling
   - Fast failover

2. **Leverage Read Replicas:**
   - Create reader endpoint (automatic load balancing)
   - Offload analytics to replicas
   - Use custom endpoints for workload isolation

3. **Enable Backtrack (MySQL):**
   - Rewind DB to specific point in time
   - No restore from backup needed
   - Minutes instead of hours

---

## DynamoDB (NoSQL)

### Overview

Fully managed NoSQL database. Single-digit millisecond latency. Virtually unlimited horizontal scaling.

### Data Model

**Primary Key Options:**

1. **Partition Key Only:**
   - Unique identifier
   - Example: UserID

2. **Partition Key + Sort Key:**
   - Composite primary key
   - Example: UserID (partition) + Timestamp (sort)
   - Enables range queries

**Attributes:**
- Flexible schema (no predefined columns)
- Supports strings, numbers, binary, lists, maps, sets
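
To make the two primary key options concrete, the sketch below builds a table definition as plain Python data. The field names follow DynamoDB's CreateTable request shape; the helper `table_definition`, the table names, and the choice of string-typed keys are illustrative assumptions.

```python
from typing import Optional


def table_definition(table_name: str, partition_key: str,
                     sort_key: Optional[str] = None) -> dict:
    """Build keyword arguments shaped like a DynamoDB CreateTable request."""
    key_schema = [{"AttributeName": partition_key, "KeyType": "HASH"}]
    attributes = [{"AttributeName": partition_key, "AttributeType": "S"}]
    if sort_key is not None:
        # A sort key turns the primary key into a composite key.
        key_schema.append({"AttributeName": sort_key, "KeyType": "RANGE"})
        attributes.append({"AttributeName": sort_key, "AttributeType": "S"})
    return {
        "TableName": table_name,
        "KeySchema": key_schema,
        "AttributeDefinitions": attributes,
        "BillingMode": "PAY_PER_REQUEST",
    }


if __name__ == "__main__":
    print(table_definition("Users", "UserID"))
    print(table_definition("UserEvents", "UserID", "Timestamp"))
```

Only key attributes are declared; all other attributes stay schemaless. With boto3 installed and credentials configured, the result could be passed as `client.create_table(**definition)`.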

### Capacity Modes

**On-Demand:**
- Pay per request
- No capacity planning
- Automatic scaling

**Pricing (us-east-1):**
- Write: $1.25 per million write request units
- Read: $0.25 per million read request units
- Storage: $0.25/GB-month

**Provisioned:**
- Specify read/write capacity units
- Predictable cost
- Auto-scaling available

**Pricing (us-east-1):**
- Write: $0.00065 per WCU-hour
- Read: $0.00013 per RCU-hour
- Storage: $0.25/GB-month
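
One way to choose between the two modes is to estimate the utilization at which provisioned capacity becomes cheaper. The sketch below uses the prices listed above and assumes one capacity unit serves one request unit per second; auto-scaling headroom and reserved capacity are ignored.

```python
SECONDS_PER_HOUR = 3600


def break_even_utilization(on_demand_per_million: float,
                           provisioned_per_unit_hour: float) -> float:
    """Utilization above which provisioned capacity costs less than on-demand.

    Assumes one capacity unit handles one request unit per second.
    """
    # Cost of driving one unit's worth of traffic for an hour on on-demand.
    on_demand_per_unit_hour = on_demand_per_million / 1_000_000 * SECONDS_PER_HOUR
    return provisioned_per_unit_hour / on_demand_per_unit_hour


if __name__ == "__main__":
    print(f"Writes: {break_even_utilization(1.25, 0.00065):.1%}")
    print(f"Reads:  {break_even_utilization(0.25, 0.00013):.1%}")
```

With these prices, a table that keeps its provisioned units busy more than roughly 14% of the time tends to be cheaper in provisioned mode.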

### Storage Classes (2024+)

**Standard:**
- Default storage class
- $0.25/GB-month

**Standard-IA (Infrequent Access):**
- 60% cheaper storage
- $0.10/GB-month
- Higher per-request cost
- Use for tables accessed <2 times/month

### Global Tables

**Purpose:**
- Multi-region, active-active replication
- Sub-second replication lag
- Automatic conflict resolution

**Use Cases:**
- Global applications
- Disaster recovery
- Low-latency global access

**Cost:**
- Replication: $0.000002 per replicated write
- Standard storage and read charges for the replica table in each region

### DynamoDB Streams

**Purpose:**
- Real-time change data capture
- Trigger Lambda on insert/update/delete
- Audit logging, analytics pipelines

**Retention:**
- 24 hours of change data

**Use Cases:**
- Event-driven architectures
- Data replication
- Aggregation pipelines

### DynamoDB Accelerator (DAX)

**Purpose:**
- In-memory caching layer
- Microsecond latency
- Fully managed

**Performance:**
- Cache hit: <1ms latency
- Cache miss: ~10ms (DynamoDB read)

**Cost (dax.t3.small):**
- $0.04/hour = $29/month per node
- Recommended minimum of 3 nodes for HA = ~$87/month

**Use When:**
- Need <1ms latency
- Read-heavy workload
- Can afford caching cost

### Best Practices

1. **Design Partition Keys:**
   - Distribute access evenly
   - Avoid hot partitions
   - Use high-cardinality attributes (UserID, not Country)

2. **Use Global Secondary Indexes (GSI):**
   - Query alternate access patterns
   - Different partition/sort keys
   - Eventually consistent reads
   - Plan for 20 GSIs limit

3. **Use Local Secondary Indexes (LSI):**
   - Same partition key, different sort key
   - Strongly consistent reads
   - Must create at table creation
   - 5 LSI limit

4. **Enable Point-in-Time Recovery:**
   - Restore to any second in last 35 days
   - $0.20/GB-month, billed on table size

5. **Use PartiQL:**
   - SQL-like query language
   - Easier than low-level API
   - Supports SELECT, INSERT, UPDATE, DELETE
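
The partition key advice in practice 1 can be illustrated with a small simulation. This is a sketch only: SHA-256 stands in for DynamoDB's internal hash function, which is not public, and the eight partitions and traffic mix are hypothetical.

```python
import hashlib
from collections import Counter


def max_partition_share(keys, partitions=8):
    """Return the fraction of requests landing on the busiest partition."""
    counts = Counter(
        int(hashlib.sha256(key.encode()).hexdigest(), 16) % partitions
        for key in keys
    )
    return max(counts.values()) / len(keys)


# Hypothetical traffic: 9,000 requests keyed two different ways.
user_keys = [f"user-{i}" for i in range(9000)]
country_keys = ["US", "DE", "JP"] * 3000

if __name__ == "__main__":
    print(f"UserID key:  {max_partition_share(user_keys):.1%} on busiest partition")
    print(f"Country key: {max_partition_share(country_keys):.1%} on busiest partition")
```

With only three distinct country values, at least a third of all requests land on one partition no matter how the hash behaves, while the high-cardinality key spreads close to evenly.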

---

## DocumentDB (MongoDB Compatible)

### Overview

Managed document database compatible with MongoDB 4.0+. Scales to millions of requests per second.

### Architecture

**Storage:**
- Automatically scales to 128TB
- 6 copies across 3 AZs (like Aurora)
- Continuous backup to S3

**Compute:**
- Primary instance (read-write)
- Up to 15 read replicas

### MongoDB Compatibility

**Supported:**
- MongoDB 3.6, 4.0, and 5.0 APIs
- Drivers and tools (Compass, mongosh)
- Most MongoDB queries and aggregations

**Not Supported:**
- Some advanced MongoDB features
- Check compatibility guide for specifics

### Cost Model (db.t3.medium)

**Instance:** $0.073/hour = $53/month
**Storage:** $0.10/GB-month
**I/O:** $0.20 per 1M requests

**Example:**
- Instance: $53/month
- 100GB storage: $10/month
- 10M I/O requests: $2/month
- **Total: $65/month**

### Use Cases

**Ideal:**
- Existing MongoDB workloads
- Document-oriented data
- JSON data storage
- Flexible schemas

**Consider Alternatives:**
- Simple key-value: Use DynamoDB (cheaper)
- Need native MongoDB: Use MongoDB Atlas
- Complex transactions: Use Aurora PostgreSQL

---

## ElastiCache and MemoryDB

### ElastiCache for Redis

**Purpose:**
- In-memory caching
- Session storage
- Real-time analytics

**Cost (cache.t3.medium):**
- $0.068/hour = $49.64/month per node

**Use Cases:**
- Database query caching
- Session store
- Leaderboards
- Rate limiting
- Pub/sub messaging

**Limitations:**
- Not a durable store: recent writes can be lost on node failure, even with snapshots and replicas
- Use MemoryDB for durability

### ElastiCache for Memcached

**Purpose:**
- Simple caching layer
- Horizontal scaling via sharding

**vs. Redis:**
- Simpler (no advanced data structures)
- Multi-threaded (better CPU utilization)
- No persistence

**Use When:**
- Simple caching needed
- Horizontal scaling priority
- Don't need Redis features

### MemoryDB for Redis (2024+)

**Purpose:**
- Redis-compatible with Multi-AZ durability
- Primary database (not just cache)

**Performance:**
- Microsecond read latency
- Single-digit millisecond write latency
- Durable across AZ failures

**Cost (db.t4g.small):**
- $0.061/hour = $44.53/month per node

**vs. ElastiCache Redis:**
- 20% more expensive
- Durable (survives restarts)
- Use as primary database

**Use Cases:**
- Real-time applications needing persistence
- Gaming leaderboards with durability
- Session stores with HA requirements

---

## Database Selection Decision Tree

```
Q1: What is the data model?
  ├─ Relational (tables with joins) → Q2
  ├─ Document (JSON/BSON) → Q5
  ├─ Key-Value → DynamoDB
  ├─ Graph → Neptune
  └─ Time-Series → Timestream

Q2: What is the scale requirement?
  ├─ <64TB, standard RDS features → RDS
  └─ >64TB or need highest performance → Aurora

Q3: What engine do you need?
  ├─ PostgreSQL → RDS PostgreSQL or Aurora PostgreSQL
  ├─ MySQL → RDS MySQL or Aurora MySQL
  └─ Oracle/SQL Server → RDS (Aurora not available)

Q4: What availability do you need?
  ├─ Dev/Test → RDS Single-AZ
  ├─ Production → RDS Multi-AZ or Aurora
  └─ Global → Aurora Global Database

Q5: Document database specifics:
  ├─ Simple key-value with JSON → DynamoDB
  ├─ MongoDB compatibility required → DocumentDB
  └─ Complex MongoDB features → MongoDB Atlas

Q6: Caching needs:
  ├─ Simple cache, no persistence → ElastiCache (Redis or Memcached)
  ├─ Cache with durability → MemoryDB
  └─ Microsecond latency for DynamoDB → DAX
```
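
The relational and document branches of the tree can be sketched as a function. The parameter names and the order in which the questions are checked are assumptions made for illustration; the caching question (Q6) is omitted.

```python
def choose_database(data_model: str, *, mongodb_compatible: bool = False,
                    engine: str = "postgresql", global_reach: bool = False,
                    size_tb: float = 1.0) -> str:
    """Map a few workload traits to a service, following Q1 to Q5."""
    direct = {"key-value": "DynamoDB", "graph": "Neptune",
              "time-series": "Timestream"}
    if data_model in direct:
        return direct[data_model]
    if data_model == "document":
        return "DocumentDB" if mongodb_compatible else "DynamoDB"
    if data_model != "relational":
        raise ValueError(f"unknown data model: {data_model}")
    if engine in ("oracle", "sqlserver"):
        return "RDS"  # Aurora is not available for these engines
    if global_reach:
        return "Aurora Global Database"
    return "Aurora" if size_tb > 64 else "RDS"


if __name__ == "__main__":
    print(choose_database("relational", size_tb=100))
    print(choose_database("document", mongodb_compatible=True))
```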

---

## Migration Strategies

### On-Premises to AWS

**Relational Databases:**

1. **AWS Database Migration Service (DMS):**
   - Minimal downtime
   - Continuous replication
   - Supports heterogeneous migrations (Oracle → PostgreSQL)

2. **Native Tools:**
   - MySQL: mysqldump, binlog replication
   - PostgreSQL: pg_dump, logical replication
   - Oracle: Data Pump, GoldenGate

**NoSQL Databases:**

1. **MongoDB to DocumentDB:**
   - Use AWS DMS
   - mongodump/mongorestore for smaller DBs

2. **Self-Managed to DynamoDB:**
   - Application-level migration
   - Dual-write pattern (old + new)
   - Validate and cutover

### RDS to Aurora

**Minimal-Downtime Migration:**

1. Create Aurora read replica from RDS instance
2. Promote replica to standalone Aurora cluster
3. Update application endpoints
4. Decommission RDS instance

**Timeframe:** 1-2 hours depending on size

### DynamoDB to Aurora (or vice versa)

**Strategy:**
- Application-level migration
- Dual-write pattern
- Gradually shift reads
- Validate data consistency

**Tooling:**
- AWS DMS (limited support)
- Custom scripts
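
The dual-write pattern can be sketched with two in-memory dictionaries standing in for the old and new databases. `DualWriteRepository` is a hypothetical class; a real migration would also need error handling for partial writes and a backfill of existing data.

```python
class DualWriteRepository:
    """Write to both stores, read from the old one until cutover."""

    def __init__(self, old_store: dict, new_store: dict):
        self.old_store = old_store
        self.new_store = new_store
        self.read_from_new = False  # flipped once validation passes

    def put(self, key, value):
        self.old_store[key] = value  # the old store stays the source of truth
        self.new_store[key] = value

    def get(self, key):
        store = self.new_store if self.read_from_new else self.old_store
        return store[key]

    def mismatched_keys(self):
        """Keys whose values differ or are missing in the new store."""
        return sorted(k for k, v in self.old_store.items()
                      if self.new_store.get(k) != v)


if __name__ == "__main__":
    repo = DualWriteRepository({"a": 1}, {})
    repo.put("b", 2)
    print("Needs backfill:", repo.mismatched_keys())
```

Reads shift to the new store only after `mismatched_keys()` comes back empty, which corresponds to the validate-then-cutover step.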

### Self-Managed to Managed

**Benefits:**
- Automated backups
- Automatic failover
- Managed upgrades
- Built-in monitoring

**Considerations:**
- Test performance (might differ)
- Validate feature compatibility
- Plan rollback strategy

```

### references/storage-services.md

```markdown
# AWS Storage Services - Deep Dive

## Table of Contents

1. [S3 (Simple Storage Service)](#s3-simple-storage-service)
2. [EBS (Elastic Block Store)](#ebs-elastic-block-store)
3. [EFS (Elastic File System)](#efs-elastic-file-system)
4. [FSx Family](#fsx-family)
5. [Storage Selection Guide](#storage-selection-guide)

---

## S3 (Simple Storage Service)

### Storage Classes

| Class | Use Case | Durability | Availability | Cost/GB | Retrieval Cost |
|-------|----------|------------|--------------|---------|----------------|
| **Standard** | Frequent access | 99.999999999% | 99.99% | $0.023 | Free |
| **Intelligent-Tiering** | Unknown/changing | 99.999999999% | 99.9% | $0.023-$0.00099 | Free |
| **Standard-IA** | Infrequent (>30 days) | 99.999999999% | 99.9% | $0.0125 | $0.01/GB |
| **One Zone-IA** | Non-critical | 99.999999999% | 99.5% | $0.01 | $0.01/GB |
| **Glacier Instant** | Archive, instant | 99.999999999% | 99.9% | $0.004 | $0.03/GB |
| **Glacier Flexible** | Archive, minutes to hours | 99.999999999% | 99.99% | $0.0036 | $0.01-$0.03/GB |
| **Glacier Deep Archive** | Long-term (7-10yr) | 99.999999999% | 99.99% | $0.00099 | $0.02/GB + 12hr |
| **S3 Express One Zone** | High perf (NEW) | 99.999999999% | 99.95% | $0.16 | Free |

### S3 Intelligent-Tiering

**How it Works:**
- Automatic optimization across 5 tiers
- No retrieval fees
- Small monitoring fee: $0.0025 per 1,000 objects

**Tiers:**
1. Frequent Access (0-30 days): $0.023/GB
2. Infrequent Access (30-90 days): $0.0125/GB
3. Archive Instant Access (90+ days): $0.004/GB
4. Archive Access (optional, opt-in; 90+ days, configurable): $0.0036/GB
5. Deep Archive Access (optional, opt-in; 180+ days, configurable): $0.00099/GB

**Use When:**
- Unknown or changing access patterns
- Want automated cost optimization
- Can tolerate small monitoring fee

### S3 Express One Zone (2024)

**Performance:**
- 10x faster than S3 Standard
- Single-digit millisecond latency
- Hundreds of thousands of requests per second

**Use Cases:**
- Low-latency data processing
- ML training data access
- High-performance computing

**Cost Trade-off:**
- $0.16/GB vs. $0.023/GB (7x more expensive)
- No data transfer charges within same AZ

### Lifecycle Policies

**Example Configuration:**

```json
{
  "Rules": [{
    "ID": "Archive old data",
    "Status": "Enabled",
    "Filter": { "Prefix": "logs/" },
    "Transitions": [
      { "Days": 30, "StorageClass": "STANDARD_IA" },
      { "Days": 90, "StorageClass": "GLACIER_IR" },
      { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
    ],
    "Expiration": { "Days": 2555 }
  }]
}
```
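
To check what the policy above does to an object over time, the sketch below maps an object's age to its expected storage class. The thresholds are copied from the example; `EXPIRED` is a label invented here, and actual transitions may lag because lifecycle rules run asynchronously.

```python
# Transitions and expiration copied from the example policy.
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER_IR"), (365, "DEEP_ARCHIVE")]
EXPIRATION_DAYS = 2555  # roughly seven years


def storage_class_at(age_days: int) -> str:
    """Return the storage class an object is expected to be in at a given age."""
    if age_days >= EXPIRATION_DAYS:
        return "EXPIRED"  # hypothetical label: the object has been deleted
    current = "STANDARD"
    for threshold, storage_class in TRANSITIONS:
        if age_days >= threshold:
            current = storage_class
    return current


if __name__ == "__main__":
    for age in (0, 45, 100, 400, 3000):
        print(age, storage_class_at(age))
```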

### S3 Features

**Versioning:**
- Preserve all versions of objects
- Protect against accidental deletion
- Enable MFA delete for extra protection

**Replication:**
- Cross-Region Replication (CRR): Disaster recovery
- Same-Region Replication (SRR): Compliance, lower latency

**S3 Object Lambda:**
- Transform objects on retrieval
- Redact PII, resize images, convert formats

**S3 Batch Operations:**
- Perform operations on billions of objects
- Copy, tag, restore from Glacier, invoke Lambda

### Best Practices

1. Enable versioning for critical data
2. Use lifecycle policies to reduce costs
3. Enable S3 Intelligent-Tiering for unknown patterns
4. Encrypt at rest (SSE-S3 or SSE-KMS)
5. Use CloudFront for frequently accessed content
6. Enable access logging for auditing

---

## EBS (Elastic Block Store)

### Volume Types

| Type | IOPS | Throughput | Cost/GB | Use Case |
|------|------|------------|---------|----------|
| **gp3** | 3,000-16,000 | 125-1,000 MB/s | $0.08 | General purpose (recommended) |
| **gp2** | 100-16,000 (3 IOPS/GB) | 250 MB/s max | $0.10 | Legacy general purpose |
| **io2 Block Express** | 256,000 | 4,000 MB/s | $0.125 + IOPS | Highest performance |
| **io2** | 64,000 | 1,000 MB/s | $0.125 + IOPS | Critical workloads |
| **st1** (HDD) | 500 | 500 MB/s | $0.045 | Throughput-optimized |
| **sc1** (HDD) | 250 | 250 MB/s | $0.015 | Cold storage |

### gp3 vs. gp2

**gp3 Advantages:**
- 20% cheaper ($0.08 vs. $0.10)
- Independent IOPS and throughput scaling
- Baseline: 3,000 IOPS, 125 MB/s
- Configurable up to 16,000 IOPS, 1,000 MB/s

**Recommendation:** Use gp3 for 99% of workloads
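
A quick way to size a gp2-to-gp3 migration is to compare the gp2 baseline IOPS with the gp3 baseline and estimate the storage saving. This sketch uses the per-GB prices from the table and the gp2 rule of 3 IOPS per GB; it ignores the extra charges gp3 applies above 3,000 IOPS or 125 MB/s.

```python
GP2_PRICE = 0.10  # USD per GB-month, from the table
GP3_PRICE = 0.08  # USD per GB-month, from the table


def gp2_baseline_iops(size_gb: int) -> int:
    """gp2 baseline: 3 IOPS per GB, floored at 100 and capped at 16,000."""
    return max(100, min(16_000, 3 * size_gb))


def monthly_savings_on_gp3(size_gb: int) -> float:
    """Storage-only saving in USD from moving a volume from gp2 to gp3."""
    return round(size_gb * (GP2_PRICE - GP3_PRICE), 2)


if __name__ == "__main__":
    size = 500
    print(f"gp2 baseline: {gp2_baseline_iops(size)} IOPS")
    print(f"Saving on gp3: ${monthly_savings_on_gp3(size)}/month")
```

Volumes under 1,000 GB have a gp2 baseline below 3,000 IOPS, so they gain baseline performance as well as a lower price on gp3.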

### EBS Snapshots

**Features:**
- Incremental backups to S3
- Copy across regions
- Create volumes from snapshots

**EBS Snapshots Archive (2024):**
- 75% cheaper storage ($0.0125 vs. $0.05/GB-month)
- Restore takes 24-72 hours
- Use for compliance, long-term retention

**Fast Snapshot Restore (FSR):**
- Pre-warm snapshots for instant recovery
- $0.75/hour per AZ per snapshot
- Use for critical recovery scenarios

### Multi-Attach io2 Volumes

**Purpose:**
- Share volume across multiple EC2 instances
- Cluster file systems, HA applications

**Limitations:**
- Up to 16 instances
- Same AZ only
- io2 volumes only

### Best Practices

1. Use gp3 for general purpose workloads
2. Enable EBS encryption by default
3. Automate snapshot creation (lifecycle policies)
4. Delete unused snapshots
5. Archive old snapshots (75% cost savings)
6. Use Fast Snapshot Restore for critical backups

---

## EFS (Elastic File System)

### Storage Classes

| Class | Cost/GB-month | Performance | Use Case |
|-------|---------------|-------------|----------|
| **Standard** | $0.30 | High | Frequent access |
| **Infrequent Access (IA)** | $0.025 | Lower | >30 days idle |
| **One Zone** | $0.16 | High | Non-critical |
| **One Zone-IA** | $0.0133 | Lower | Dev/test |

### Performance Modes

**General Purpose:**
- Low latency (<10ms)
- Up to 7,000 file ops/sec
- Default mode

**Max I/O:**
- Higher aggregate throughput
- Slightly higher latency
- Big data, media processing

### Throughput Modes

**Elastic (Default):**
- Automatically scales
- Pay only for actual throughput
- $0.30/GB transferred

**Provisioned:**
- Specify throughput independent of storage
- Consistent high throughput
- $6.00/MB/s-month

**Bursting:**
- Throughput scales with storage size
- 50 MB/s per TB stored
- Legacy mode

### EFS Features (2025)

**Intelligent-Tiering:**
- Automatic movement to IA tier
- After 7, 14, 30, 60, or 90 days idle
- No retrieval fees

**EFS Replication:**
- Cross-region disaster recovery
- Near real-time replication
- $0.015/GB replicated

### Use Cases

**Ideal:**
- Shared file storage across EC2/Fargate/Lambda
- Content management systems
- Container persistent storage (ECS, EKS)
- Home directories

**Avoid:**
- Single-instance applications (use EBS)
- Windows workloads (use FSx for Windows)
- High-performance HPC (use FSx for Lustre)

### Best Practices

1. Enable Intelligent-Tiering (save 92% on IA files)
2. Use One Zone class for dev/test (50% cheaper)
3. Mount via NFS 4.1 for best compatibility
4. Use encryption in transit (TLS)
5. Monitor with CloudWatch metrics

---

## FSx Family

### FSx for Windows File Server

**Purpose:**
- Windows-native SMB file shares
- Active Directory integration

**Cost:**
- SSD: $0.13/GB-month + throughput
- HDD: $0.013/GB-month + throughput

**Use Cases:**
- Windows applications (SQL Server, IIS)
- Home directories
- SharePoint storage

### FSx for Lustre

**Purpose:**
- High-performance computing (HPC)
- Sub-millisecond latency
- 100+ GB/s throughput

**Cost:**
- Persistent SSD: $0.145/GB-month
- Scratch: $0.084/GB-month (no replication)

**Use Cases:**
- ML training
- Video rendering
- Genomics research
- Financial modeling

**S3 Integration:**
- Link to S3 bucket
- Lazy-load data on first access
- Write results back to S3

### FSx for NetApp ONTAP

**Purpose:**
- Enterprise NAS features
- Multi-protocol (NFS, SMB, iSCSI)

**Features:**
- Snapshots, clones
- Data compression, deduplication
- SnapMirror replication
- Hybrid cloud integration

**Cost:**
- SSD: $0.230/GB-month + IOPS
- Throughput: $0.50/MB/s-month

### FSx for OpenZFS

**Purpose:**
- Linux ZFS file systems
- Up to 12.5 GB/s throughput

**Features:**
- Snapshots, clones
- Compression
- Point-in-time recovery

**Cost:**
- SSD: $0.150/GB-month + throughput

---

## Storage Selection Guide

### Decision Matrix

```
Use Case → Service

Objects (files, media, static assets) → S3
  ├─ Frequent access → S3 Standard
  ├─ Infrequent (>30 days) → S3 Standard-IA
  ├─ Archive → S3 Glacier (Instant, Flexible, Deep)
  └─ Unknown pattern → S3 Intelligent-Tiering

Block storage (databases, boot volumes) → EBS
  ├─ General purpose → gp3
  ├─ High IOPS → io2 or io2 Block Express
  └─ Throughput optimized → st1 (HDD)

Shared files (NFS/SMB) → EFS or FSx
  ├─ Linux NFS → EFS
  ├─ Windows SMB → FSx for Windows
  ├─ High-performance HPC → FSx for Lustre
  └─ Enterprise NAS → FSx for NetApp ONTAP

Container storage:
  ├─ Ephemeral → Local SSD
  ├─ Persistent single-container → EBS
  └─ Persistent shared → EFS or FSx
```

### Cost Comparison (1TB for 1 Month)

| Service | Monthly Cost | Access Pattern |
|---------|--------------|----------------|
| S3 Standard | $23 | Frequent |
| S3 Standard-IA | $12.50 | Infrequent |
| S3 Glacier Instant | $4 | Archive, instant |
| S3 Deep Archive | $0.99 | Long-term archive |
| EBS gp3 | $80 | Block storage |
| EFS Standard | $300 | Shared files, frequent |
| EFS IA | $25 | Shared files, infrequent |
| FSx for Lustre | ~$145 | High-performance HPC |

### Performance Comparison

| Service | Latency | IOPS | Throughput |
|---------|---------|------|------------|
| S3 Standard | ~100ms | N/A | Unlimited |
| S3 Express One Zone | <10ms | 100,000+ | Unlimited |
| EBS gp3 | <1ms | 16,000 | 1,000 MB/s |
| EBS io2 Block Express | <1ms | 256,000 | 4,000 MB/s |
| EFS General Purpose | <10ms | 7,000 ops/s | Elastic |
| FSx for Lustre | <1ms | Millions | 100+ GB/s |

### Lifecycle Cost Optimization

**Scenario: 100TB data with 10% active**

**Without Optimization:**
- S3 Standard: 100TB × $23 = $2,300/month

**With Lifecycle Policies:**
- Active (10TB): S3 Standard = $230
- Infrequent (30TB): S3 Standard-IA = $375
- Archive (60TB): S3 Glacier Instant = $240
- **Total: $845/month (63% savings)**

**With Intelligent-Tiering:**
- Automatic optimization
- Similar savings
- Small monitoring fee (~$250 for 100M objects)
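
The scenario arithmetic can be checked with a few lines. This sketch uses the per-TB prices from the cost comparison table and leaves out request, retrieval, and transition fees.

```python
# USD per TB-month, taken from the cost comparison table.
PRICE_PER_TB = {"standard": 23.0, "standard_ia": 12.5, "glacier_instant": 4.0}


def tiered_cost(tb_by_class: dict) -> float:
    """Sum monthly storage cost for a mapping of storage class to TB stored."""
    return sum(PRICE_PER_TB[name] * tb for name, tb in tb_by_class.items())


baseline = tiered_cost({"standard": 100})
optimized = tiered_cost({"standard": 10, "standard_ia": 30, "glacier_instant": 60})
savings_pct = round((1 - optimized / baseline) * 100)

if __name__ == "__main__":
    print(f"${baseline:,.0f} -> ${optimized:,.0f} ({savings_pct}% savings)")
```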

```

### references/serverless-patterns.md

```markdown
# AWS Serverless Architecture Patterns

## Table of Contents

- [Lambda Function Patterns](#lambda-function-patterns)
- [API Gateway Patterns](#api-gateway-patterns)
- [Step Functions Orchestration](#step-functions-orchestration)
- [EventBridge Patterns](#eventbridge-patterns)
- [DynamoDB Integration](#dynamodb-integration)
- [Lambda SnapStart Configuration](#lambda-snapstart-configuration)
- [Response Streaming](#response-streaming)
- [Error Handling and Retry Logic](#error-handling-and-retry-logic)
- [Performance Optimization](#performance-optimization)
- [Anti-Patterns](#anti-patterns)

## Lambda Function Patterns

### Basic REST API Handler

**Pattern:** Single Lambda function handling HTTP requests through API Gateway.

**Use When:**
- Building CRUD APIs with predictable traffic
- Execution time under 15 minutes
- Need automatic scaling
- Cost optimization for variable workloads

**Architecture:**
```
API Gateway HTTP API → Lambda Function → DynamoDB/RDS
                                      → S3 (optional)
```

**Key Configuration:**
```yaml
# Lambda configuration (illustrative keys, not a complete template)
Runtime: python3.12          # or nodejs20.x
Memory: 1024                 # MB; price-performance sweet spot
Timeout: 30                  # seconds; API Gateway HTTP API integration limit
ReservedConcurrency: null    # no reserved limit; scales up to the account concurrency quota
ProvisionedConcurrency: 0    # avoid unless cold starts are critical
```

**Cost Characteristics:**
- Free tier: 1M requests/month, 400,000 GB-seconds compute
- Beyond free tier: $0.20 per 1M requests + $0.0000166667 per GB-second
- 1M requests at 1GB memory, 500ms duration: ~$8.53/month before the free tier (500,000 GB-seconds ≈ $8.33, plus $0.20 for requests)
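
The pricing formula can be sketched as a function of request count, memory, and duration. The rates and free-tier allowances are the ones listed above; `lambda_monthly_cost` is a hypothetical helper that ignores data transfer and provisioned concurrency.

```python
REQUEST_PRICE = 0.20 / 1_000_000  # USD per request
GB_SECOND_PRICE = 0.0000166667    # USD per GB-second
FREE_REQUESTS = 1_000_000         # monthly free-tier requests
FREE_GB_SECONDS = 400_000         # monthly free-tier compute


def lambda_monthly_cost(requests, memory_gb, duration_s, free_tier=False):
    """Estimate monthly Lambda cost in USD, rounded to cents."""
    gb_seconds = requests * memory_gb * duration_s
    if free_tier:
        requests = max(0, requests - FREE_REQUESTS)
        gb_seconds = max(0, gb_seconds - FREE_GB_SECONDS)
    return round(requests * REQUEST_PRICE + gb_seconds * GB_SECOND_PRICE, 2)


if __name__ == "__main__":
    print(lambda_monthly_cost(1_000_000, 1, 0.5))
    print(lambda_monthly_cost(1_000_000, 1, 0.5, free_tier=True))
```

For 1M requests at 1 GB and 500 ms this gives about $8.53 before the free tier and about $1.67 after it.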

**Memory Allocation Guidance:**
- 128-256 MB: Simple data transformations, S3 processing
- 512-1024 MB: API handlers, moderate database queries
- 1536-3008 MB: Complex processing, ML inference, heavy I/O
- 10240 MB: Maximum, for CPU-intensive workloads

### Event-Driven Processing

**Pattern:** Lambda triggered by events from S3, DynamoDB Streams, or EventBridge.

**Architecture:**
```
S3 Upload → EventBridge Event → Lambda (transform) → DynamoDB (metadata)
                                                   → SQS (downstream)
```

**Configuration for Event Sources:**

**S3 Event:**
```yaml
EventSourceArn: !GetAtt MyBucket.Arn
Events:
  - s3:ObjectCreated:*
Filter:
  S3Key:
    Rules:
      - Name: prefix
        Value: uploads/
      - Name: suffix
        Value: .csv
```

**DynamoDB Stream:**
```yaml
EventSourceArn: !GetAtt MyTable.StreamArn
StartingPosition: LATEST
BatchSize: 100
MaximumBatchingWindowInSeconds: 10
ParallelizationFactor: 10  # Up to 10 concurrent batches per shard
```

**Batch Processing Configuration:**
- BatchSize: 1-10,000 for SQS/Kinesis
- MaximumBatchingWindowInSeconds: 0-300 (wait for batch to fill)
- ParallelizationFactor: 1-10 (concurrent executions per shard)
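
Because the parallelization factor applies per shard, the upper bound on concurrent invocations for a stream is the shard count multiplied by the factor. A minimal sketch, with `max_concurrent_invocations` as a hypothetical helper:

```python
def max_concurrent_invocations(shards: int, parallelization_factor: int = 1) -> int:
    """Upper bound on concurrent Lambda invocations for a stream event source."""
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("ParallelizationFactor must be between 1 and 10")
    return shards * parallelization_factor


if __name__ == "__main__":
    print(max_concurrent_invocations(shards=4, parallelization_factor=10))
```

This bound is useful when sizing downstream resources such as database connection limits.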

### Scheduled Task Pattern

**Pattern:** Lambda function running on schedule using EventBridge Rules.

**Use When:**
- Periodic data processing (ETL jobs)
- Cleanup tasks (delete expired records)
- Report generation
- Health checks and monitoring

**EventBridge Schedule Syntax:**
```
rate(5 minutes)                    # Every 5 minutes
rate(1 hour)                       # Every hour
rate(1 day)                        # Daily
cron(0 9 * * ? *)                 # 9 AM UTC daily
cron(0 0 ? * MON-FRI *)           # Midnight weekdays
cron(0/15 * * * ? *)              # Every 15 minutes
```
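
The two expression forms can be told apart with a small parser. This is a sketch that only splits the expression into named parts; it does not validate field ranges or the rule that one of day-of-month and day-of-week must be `?`. The field names in `CRON_FIELDS` are labels chosen here.

```python
import re

CRON_FIELDS = ("minutes", "hours", "day_of_month", "month", "day_of_week", "year")


def parse_schedule(expression: str) -> dict:
    """Split an EventBridge schedule expression into its parts."""
    rate = re.fullmatch(r"rate\((\d+) (minutes?|hours?|days?)\)", expression)
    if rate:
        return {"type": "rate", "value": int(rate.group(1)), "unit": rate.group(2)}
    cron = re.fullmatch(r"cron\((.+)\)", expression)
    if cron:
        fields = cron.group(1).split()
        if len(fields) != 6:
            raise ValueError("EventBridge cron expressions need six fields")
        return {"type": "cron", **dict(zip(CRON_FIELDS, fields))}
    raise ValueError(f"not a schedule expression: {expression}")


if __name__ == "__main__":
    print(parse_schedule("rate(5 minutes)"))
    print(parse_schedule("cron(0 0 ? * MON-FRI *)"))
```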

**Best Practices:**
- Use UTC timezone for cron expressions
- Set appropriate timeout (max 15 minutes)
- Implement idempotency (safe to retry)
- Use CloudWatch Logs for debugging

## API Gateway Patterns

### HTTP API vs REST API

**Decision Matrix:**

| Feature | HTTP API | REST API |
|---------|----------|----------|
| Cost | $1.00/million | $3.50/million |
| Latency | ~35% lower | Standard |
| JWT Authorization | Native | Custom authorizer needed |
| Request Validation | No | Yes |
| API Keys | No | Yes |
| Usage Plans | No | Yes |
| WebSocket | Separate | No |

**Recommendation:** Use HTTP API for 90% of use cases. Use REST API only when request validation, API keys, or usage plans required.

### Lambda Proxy Integration

**Pattern:** Pass entire request to Lambda, return entire response.

**Request Format Received:**
```json
{
  "version": "2.0",
  "routeKey": "POST /items",
  "rawPath": "/items",
  "headers": {
    "content-type": "application/json",
    "user-agent": "Mozilla/5.0"
  },
  "queryStringParameters": {
    "filter": "active"
  },
  "body": "{\"name\":\"item1\"}",
  "isBase64Encoded": false,
  "requestContext": {
    "requestId": "abc123",
    "http": {
      "method": "POST",
      "path": "/items"
    }
  }
}
```

**Required Response Format:**
```json
{
  "statusCode": 200,
  "headers": {
    "Content-Type": "application/json",
    "Access-Control-Allow-Origin": "*"
  },
  "body": "{\"message\":\"Success\"}"
}
```

**Common Errors:**
- Missing statusCode field (required)
- Body not stringified (must be string, not object)
- Mismatched header names (payload format 2.0 delivers request header names in lowercase)
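
A minimal Python handler that avoids the first two errors might look like the sketch below. The echo behavior and the `_response` helper are illustrative assumptions; base64-encoded bodies are not handled.

```python
import json


def _response(status_code: int, payload: dict) -> dict:
    """Build a proxy-integration response with a stringified JSON body."""
    return {
        "statusCode": status_code,  # required field
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),  # must be a string, not a dict
    }


def handler(event, context):
    """Echo the posted item back in the response shape API Gateway expects."""
    try:
        item = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return _response(400, {"message": "Request body is not valid JSON"})
    return _response(200, {"message": "Success", "item": item})


if __name__ == "__main__":
    print(handler({"body": "{\"name\":\"item1\"}"}, None))
```

Routing every return path through one helper keeps `statusCode` present and the body stringified even on error responses.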

### CORS Configuration

**For HTTP API:**
```yaml
CorsConfiguration:
  AllowOrigins:
    - https://example.com
  AllowMethods:
    - GET
    - POST
    - PUT
    - DELETE
  AllowHeaders:
    - Content-Type
    - Authorization
  MaxAge: 300
  AllowCredentials: true
```

**For REST API:**
Must implement OPTIONS method with CORS headers in Lambda response.

### Custom Domain Names

**Pattern:** Map custom domain to API Gateway endpoint.

**Requirements:**
1. ACM certificate in us-east-1 (for edge-optimized) or same region (for regional)
2. Route 53 hosted zone or external DNS provider
3. API mapping configuration

**Configuration:**
```yaml
DomainName:
  DomainName: api.example.com
  CertificateArn: arn:aws:acm:us-east-1:123456789012:certificate/abc
  EndpointConfiguration:
    Types:
      - REGIONAL  # or EDGE for CloudFront distribution

ApiMapping:
  DomainName: api.example.com
  ApiId: !Ref HttpApi
  Stage: prod
  ApiMappingKey: v1  # results in api.example.com/v1
```

## Step Functions Orchestration

### Express Workflows vs Standard Workflows

**Decision Matrix:**

| Feature | Express Workflow | Standard Workflow |
|---------|-----------------|-------------------|
| Max Duration | 5 minutes | 1 year |
| Pricing Model | Per request and duration | Per state transition |
| Execution Guarantee | At-least-once | Exactly-once |
| Execution History | CloudWatch Logs | Built-in (90 days) |
| Cost | $1.00 per 1M requests, plus duration charges | $25.00 per 1M state transitions |

**Use Cases:**
- **Express:** API response orchestration, real-time data processing
- **Standard:** Long-running workflows, ETL pipelines, human approval

### Common State Types

**Task State (Lambda Invocation):**
```json
{
  "ProcessData": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessFunction",
    "TimeoutSeconds": 300,
    "Retry": [{
      "ErrorEquals": ["States.TaskFailed"],
      "IntervalSeconds": 2,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    }],
    "Catch": [{
      "ErrorEquals": ["States.ALL"],
      "Next": "HandleError"
    }],
    "Next": "NextState"
  }
}
```
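
The `Retry` fields combine into an exponential backoff schedule: the first retry waits `IntervalSeconds`, and each later wait is multiplied by `BackoffRate`. A small sketch of that arithmetic (the function name `retry_delays` is hypothetical):

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Wait before each retry: interval_seconds * backoff_rate ** retry_index."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


delays = retry_delays(interval_seconds=2, max_attempts=3, backoff_rate=2.0)
print(delays)       # [2.0, 4.0, 8.0]
print(sum(delays))  # 14.0
```

For the Task state above, the retries add 14 seconds of waiting on top of up to four task attempts, each bounded by `TimeoutSeconds`.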

**Choice State (Conditional Branching):**
```json
{
  "CheckStatus": {
    "Type": "Choice",
    "Choices": [{
      "Variable": "$.status",
      "StringEquals": "SUCCESS",
      "Next": "SuccessState"
    }, {
      "Variable": "$.status",
      "StringEquals": "FAILED",
      "Next": "FailureState"
    }],
    "Default": "DefaultState"
  }
}
```

**Map State (Parallel Processing):**
```json
{
  "ProcessItems": {
    "Type": "Map",
    "ItemsPath": "$.items",
    "MaxConcurrency": 10,
    "Iterator": {
      "StartAt": "ProcessItem",
      "States": {
        "ProcessItem": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessItem",
          "End": true
        }
      }
    },
    "Next": "Aggregate"
  }
}
```

### Distributed Map (2022+ Feature)

**Use When:**
- Processing thousands to millions of items
- Items stored in S3 (JSON array or CSV)
- Need massive parallelism (up to 10,000 concurrent child executions)

**Configuration:**
```json
{
  "ProcessLargeDataset": {
    "Type": "Map",
    "ItemReader": {
      "Resource": "arn:aws:states:::s3:getObject",
      "ReaderConfig": {
        "InputType": "JSON"
      },
      "Parameters": {
        "Bucket": "my-bucket",
        "Key": "data.json"
      }
    },
    "ItemSelector": {
      "item.$": "$$.Map.Item.Value"
    },
    "MaxConcurrency": 10000,
    "ToleratedFailurePercentage": 5,
    "ResultWriter": {
      "Resource": "arn:aws:states:::s3:putObject",
      "Parameters": {
        "Bucket": "output-bucket",
        "Prefix": "results/"
      }
    },
    "ItemProcessor": {
      "ProcessorConfig": {
        "Mode": "DISTRIBUTED",
        "ExecutionType": "STANDARD"
      },
      "StartAt": "ProcessItem",
      "States": {
        "ProcessItem": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:Process",
          "End": true
        }
      }
    },
    "End": true
  }
}
```

**Performance:**
- Can process millions of items in minutes
- Automatic result aggregation to S3
- Built-in fault tolerance

## EventBridge Patterns

### Event-Driven Architecture

**Pattern:** Decouple producers and consumers using EventBridge.

**Architecture:**
```
Producer Service → EventBridge Event Bus → EventBridge Rule → Target (Lambda/SQS/Step Functions)
                                                            → Target (Additional consumers)
```

**Event Pattern Matching:**
```json
{
  "source": ["myapp.orders"],
  "detail-type": ["Order Placed"],
  "detail": {
    "amount": [{"numeric": [">=", 100]}],
    "status": ["confirmed"]
  }
}
```

**Advanced Filtering:**
```json
{
  "source": ["myapp.users"],
  "detail": {
    "location": {
      "state": [
        {"prefix": "US-"},
        "CA-BC",
        "CA-ON"
      ]
    },
    "metadata": {
      "plan": [
        {"anything-but": "free"}
      ]
    }
  }
}
```
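
To reason about what a pattern will match before deploying a rule, it can help to model the semantics locally. The sketch below is a simplified illustration, not EventBridge's implementation: it covers only exact values, `prefix`, `anything-but`, and `numeric`, and it ignores array-valued event fields. The function names are hypothetical.

```python
import operator

NUMERIC_OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
               ">=": operator.ge, ">": operator.gt}


def matches_value(candidate, value):
    """Check one pattern entry (a literal or a filter object) against a value."""
    if not isinstance(candidate, dict):
        return candidate == value
    if "prefix" in candidate:
        return isinstance(value, str) and value.startswith(candidate["prefix"])
    if "anything-but" in candidate:
        excluded = candidate["anything-but"]
        excluded = excluded if isinstance(excluded, list) else [excluded]
        return value not in excluded
    if "numeric" in candidate:
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        spec = candidate["numeric"]  # e.g. [">=", 100] or [">", 0, "<=", 5]
        pairs = zip(spec[0::2], spec[1::2])
        return all(NUMERIC_OPS[op](value, bound) for op, bound in pairs)
    return False


def matches(pattern, event):
    """Keys in the pattern are ANDed; entries in each value list are ORed."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif not any(matches_value(c, actual) for c in expected):
            return False
    return True


order_pattern = {
    "source": ["myapp.orders"],
    "detail-type": ["Order Placed"],
    "detail": {"amount": [{"numeric": [">=", 100]}], "status": ["confirmed"]},
}
order_event = {
    "source": "myapp.orders",
    "detail-type": "Order Placed",
    "detail": {"amount": 150, "status": "confirmed"},
}
print(matches(order_pattern, order_event))  # True
```

The key point the model captures is that every field named in the pattern must match, while each list offers alternatives for that one field.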

### EventBridge Pipes (2022+ Feature)

**Pattern:** Simplified event processing with built-in filtering and enrichment.

**Architecture:**
```
Source (SQS/Kinesis/DynamoDB) → Filter → Enrichment (Lambda/API) → Target
```

**Use When:**
- Need to filter events before processing
- Require data enrichment from external API
- Want simpler configuration than Lambda + EventBridge

**Configuration:**
```yaml
Pipe:
  Source: !GetAtt SourceQueue.Arn
  SourceParameters:
    SqsQueueParameters:
      BatchSize: 10
      MaximumBatchingWindowInSeconds: 5

  Filter:
    Pattern: |
      {
        "body": {
          "amount": [{"numeric": [">", 100]}]
        }
      }

  Enrichment: !GetAtt EnrichmentFunction.Arn
  EnrichmentParameters:
    InputTemplate: |
      {
        "orderId": <$.body.orderId>,
        "customerId": <$.body.customerId>
      }

  Target: !GetAtt TargetFunction.Arn
  TargetParameters:
    InputTemplate: |
      {
        "enrichedData": <$.body>,
        "timestamp": <$.metadata.timestamp>
      }
```

**Benefits:**
- No custom Lambda code for routing
- Built-in transformation using JSONPath
- Automatic retries and DLQ support

### Schema Registry

**Pattern:** Define and version event schemas for type safety.

**Benefits:**
- Auto-generate code bindings (Java, Python, TypeScript)
- Schema validation
- Version management
- Discovery of available events

**Example Schema:**
```json
{
  "openapi": "3.0.0",
  "info": {
    "version": "1.0.0",
    "title": "OrderPlaced"
  },
  "paths": {},
  "components": {
    "schemas": {
      "OrderPlaced": {
        "type": "object",
        "required": ["orderId", "customerId", "amount"],
        "properties": {
          "orderId": {
            "type": "string",
            "format": "uuid"
          },
          "customerId": {
            "type": "string"
          },
          "amount": {
            "type": "number",
            "minimum": 0
          },
          "status": {
            "type": "string",
            "enum": ["pending", "confirmed", "shipped"]
          }
        }
      }
    }
  }
}
```

## DynamoDB Integration

### Single-Table Design

**Pattern:** Store multiple entity types in one table using generic composite keys (PK/SK), adding GSIs for access patterns the base table cannot serve.

**Primary Key Design:**
```
PK: CUSTOMER#123         SK: METADATA
PK: CUSTOMER#123         SK: ORDER#456
PK: CUSTOMER#123         SK: ORDER#789
PK: ORDER#456            SK: METADATA
PK: ORDER#456            SK: ITEM#1
```

**Access Patterns:**
1. Get customer: GetItem with PK=CUSTOMER#123, SK=METADATA
2. Get customer orders: Query PK=CUSTOMER#123, SK begins_with ORDER#
3. Get order details: Query PK=ORDER#456
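
The access patterns above can be checked against the key design without touching DynamoDB. This is a minimal in-memory sketch: the `query` helper only mimics a Query with an exact partition key and an optional `begins_with` on the sort key, and the non-key attributes are made up for illustration.

```python
ITEMS = [
    {"PK": "CUSTOMER#123", "SK": "METADATA", "name": "Example Customer"},
    {"PK": "CUSTOMER#123", "SK": "ORDER#456", "total": 40},
    {"PK": "CUSTOMER#123", "SK": "ORDER#789", "total": 75},
    {"PK": "ORDER#456", "SK": "METADATA", "status": "shipped"},
    {"PK": "ORDER#456", "SK": "ITEM#1", "sku": "widget"},
]


def query(pk, sk_prefix=""):
    """Mimic a Query: exact partition key, optional begins_with on the sort key."""
    found = [i for i in ITEMS if i["PK"] == pk and i["SK"].startswith(sk_prefix)]
    # A real Query returns items ordered by sort key within the partition.
    return sorted(found, key=lambda i: i["SK"])


customer = query("CUSTOMER#123", "METADATA")
orders = query("CUSTOMER#123", "ORDER#")
order_details = query("ORDER#456")

print([o["SK"] for o in orders])         # ['ORDER#456', 'ORDER#789']
print([d["SK"] for d in order_details])  # ['ITEM#1', 'METADATA']
```

Because related items share a partition key, each access pattern is a single query rather than a join across tables.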

**GSI for Inverted Access:**
```
GSI1PK: ORDER#456        GSI1SK: CUSTOMER#123
GSI1PK: ORDER#789        GSI1SK: CUSTOMER#123
```

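Look up the customer for an order by querying GSI1PK=ORDER#456. Orders for a customer are already served by the base table (PK=CUSTOMER#123, SK begins_with ORDER#).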

### Lambda + DynamoDB Best Practices

**Batch Operations:**
```python
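import boto3

dynamodb = boto3.resource('dynamodb')

# Placeholder items; replace with real records
item1 = {'PK': 'CUSTOMER#123', 'SK': 'METADATA'}
item2 = {'PK': 'CUSTOMER#456', 'SK': 'METADATA'}

# Use batch_write_item for multiple puts (25 items max per call)
write_response = dynamodb.batch_write_item(
    RequestItems={
        'MyTable': [
            {'PutRequest': {'Item': item1}},
            {'PutRequest': {'Item': item2}},
        ]
    }
)
# Retry anything returned in write_response['UnprocessedItems']

# Use batch_get_item for multiple reads (100 items max per call)
response = dynamodb.batch_get_item(
    RequestItems={
        'MyTable': {
            'Keys': [
                {'PK': 'CUSTOMER#123', 'SK': 'METADATA'},
                {'PK': 'CUSTOMER#456', 'SK': 'METADATA'},
            ]
        }
    }
)
# Retry anything returned in response['UnprocessedKeys']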
```
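
Since each write call accepts at most 25 items, larger lists have to be split first. A minimal sketch of the chunking step, using hypothetical helper names and no AWS calls:

```python
def chunked(items, size=25):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_put_requests(table_name, items):
    """Build one RequestItems payload per batch of up to 25 puts."""
    return [
        {table_name: [{"PutRequest": {"Item": item}} for item in batch]}
        for batch in chunked(items, 25)
    ]


items = [{"PK": f"CUSTOMER#{n}", "SK": "METADATA"} for n in range(60)]
payloads = build_put_requests("MyTable", items)
print([len(p["MyTable"]) for p in payloads])  # [25, 25, 10]
```

Each payload would be passed as `RequestItems` to `batch_write_item`. When using the boto3 Table resource, `table.batch_writer()` handles the chunking and the resending of unprocessed items for you.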

**Connection Reuse:**
```python
# Initialize client OUTSIDE handler for reuse
import boto3
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('MyTable')

def lambda_handler(event, context):
    # Reuses existing connection
    response = table.get_item(Key={'PK': 'CUSTOMER#123', 'SK': 'METADATA'})
```

**DynamoDB Streams Processing:**
```python
def lambda_handler(event, context):
    for record in event['Records']:
        if record['eventName'] == 'INSERT':
            new_item = record['dynamodb']['NewImage']
            # Process new item
        elif record['eventName'] == 'MODIFY':
            old_item = record['dynamodb']['OldImage']
            new_item = record['dynamodb']['NewImage']
            # Process update
        elif record['eventName'] == 'REMOVE':
            old_item = record['dynamodb']['OldImage']
            # Process deletion
```

## Lambda SnapStart Configuration

### Java Cold Start Optimization

**Use When:**
- Using a Java 11 or later managed runtime (SnapStart has also supported Python 3.12+ and .NET 8+ since late 2024)
- Cold start latency is critical (APIs, synchronous processing)
- Application initialization is expensive (framework startup, connection pools)

**Performance Improvement:**
- Cold start: 10-15 seconds → 200-500 milliseconds
- Warm start: No change (still fast)
- Cost: No additional charge on Java runtimes (Python and .NET incur snapshot caching and restore charges)

**Configuration:**
```yaml
Function:
  Runtime: java17
  SnapStart:
    ApplyOn: PublishedVersions
  AutoPublishAlias: live
```

**Requirements:**
1. Applies only to published versions (not $LATEST)
2. Point an alias at the published version (`AutoPublishAlias` does this in SAM)
3. Invoke the alias or version ARN, not the unqualified function ARN

**Code Considerations:**

**Avoid:**
```java
// Relying on network connections opened during initialization:
// state captured in the snapshot can be stale after restore
private static HttpClient client = HttpClient.newBuilder()
    .connectTimeout(Duration.ofSeconds(10))
    .build();  // Don't create in static initializer
```

**Prefer:**
```java
// Lazy initialization
private static HttpClient client;

private static HttpClient getClient() {
    if (client == null) {
        client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .build();
    }
    return client;
}
```

**Uniqueness Requirements:**
- Generate unique IDs in handler, not initialization
- Refresh credentials on each invocation
- Don't cache sensitive data in static variables

## Response Streaming

### Large Response Pattern (2023+ Feature)

**Use When:**
- Response size exceeds 6 MB (synchronous limit)
- Need to stream data to client incrementally
- Generating large reports or files

**Maximum Size:**
- Standard response: 6 MB
- Streaming response: 20 MB (a soft limit at launch; check current Lambda quotas)

**Configuration:**
```yaml
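Function:
  Runtime: nodejs20.x  # Other runtimes need a custom runtime or the Lambda Web Adapter
  # InvokeMode: RESPONSE_STREAM is set on the function URL, not on the function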
```

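**Python Implementation (custom runtime or Lambda Web Adapter only):**
```python
import json

# The managed Python runtime buffers the response and cannot serialize a
# generator, so this pattern only streams behind a custom runtime or the
# Lambda Web Adapter.
def lambda_handler(event, context):
    def generate_data():
        # Stream data in chunks
        for i in range(1000):
            yield json.dumps({'chunk': i}) + '\n'

    return generate_data()
```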

**Node.js Implementation:**
```javascript
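import { Readable } from 'node:stream';
import { pipeline } from 'node:stream/promises';

export const handler = awslambda.streamifyResponse(
    async (event, responseStream, context) => {
        // pipeline() waits for the stream to finish and propagates errors
        await pipeline(Readable.from(generateData()), responseStream);
    }
);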

function* generateData() {
    for (let i = 0; i < 1000; i++) {
        yield JSON.stringify({ chunk: i }) + '\n';
    }
}
```

**API Gateway Integration:**
Not supported with API Gateway. Use Lambda Function URL.

**Function URL Configuration:**
```yaml
FunctionUrl:
  AuthType: AWS_IAM  # or NONE for public
  InvokeMode: RESPONSE_STREAM
  Cors:
    AllowOrigins:
      - '*'
    AllowMethods:
      - GET
      - POST
```

## Error Handling and Retry Logic

### Retry Configuration

**Automatic Retries by Invocation Type:**

| Invocation Type | Retries | Use Case |
|----------------|---------|----------|
| Synchronous (API Gateway) | 0 | Client should retry |
| Asynchronous (S3, EventBridge) | 2 | AWS retries automatically |
| Event Source (SQS, Kinesis) | Until success or TTL | Configurable |

**Custom Retry Configuration:**
```yaml
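# These settings apply to stream sources (Kinesis, DynamoDB Streams).
# For SQS, retries are governed by the queue's redrive policy (maxReceiveCount).
EventSourceMapping:
  FunctionName: !Ref ProcessFunction
  EventSourceArn: !GetAtt MyStream.Arn
  MaximumRetryAttempts: 3
  MaximumRecordAgeInSeconds: 3600  # Discard after 1 hour
  BisectBatchOnFunctionError: true  # Split failed batch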

**Asynchronous Retry Configuration:**
```yaml
Function:
  EventInvokeConfig:
    MaximumRetryAttempts: 1  # 0-2 retries
    MaximumEventAgeInSeconds: 3600  # Discard after 1 hour
    DestinationConfig:
      OnSuccess:
        Destination: !GetAtt SuccessQueue.Arn
      OnFailure:
        Destination: !GetAtt DLQ.Arn
```

### Dead Letter Queue (DLQ)

**Pattern:** Send failed events to SQS or SNS for manual processing.

**Configuration:**
```yaml
Function:
  DeadLetterConfig:
    TargetArn: !GetAtt DLQ.Arn

DLQ:
  Type: AWS::SQS::Queue
  Properties:
    MessageRetentionPeriod: 1209600  # 14 days
    VisibilityTimeout: 300
```

**DLQ Processing:**
```python
# Separate Lambda to process DLQ
import json
import logging

logger = logging.getLogger()

def dlq_handler(event, context):
    for record in event['Records']:
        body = json.loads(record['body'])
        # Log to CloudWatch or external monitoring
        logger.error(f"Failed event: {body}")
        # Send alert to SNS/PagerDuty
        # Store in S3 for analysis
```

### Idempotency

**Pattern:** Ensure safe retries using idempotency keys.

**Implementation (Python):**
```python
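import boto3
import hashlib
import json
import logging
import time

logger = logging.getLogger()

dynamodb = boto3.resource('dynamodb')
idempotency_table = dynamodb.Table('IdempotencyTable')

# process_event(event) is the business logic, defined elsewhere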

def lambda_handler(event, context):
    # Generate idempotency key from event
    key = hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()

    # Check if already processed
    try:
        response = idempotency_table.get_item(Key={'RequestId': key})
        if 'Item' in response:
            # Already processed, return cached result
            return json.loads(response['Item']['Result'])
    except Exception as e:
        logger.error(f"Idempotency check failed: {e}")

    # Process event
    result = process_event(event)

    # Store result
    try:
        idempotency_table.put_item(
            Item={
                'RequestId': key,
                'Result': json.dumps(result),
                'TTL': int(time.time()) + 86400  # 24 hours
            }
        )
    except Exception as e:
        logger.error(f"Failed to store idempotency: {e}")

    return result
```
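
The core of the pattern can be exercised without DynamoDB. In the sketch below, `InMemoryStore` is a hypothetical stand-in for the table, and `put_if_absent` plays the role of a conditional write (in DynamoDB, a `put_item` with `ConditionExpression='attribute_not_exists(RequestId)'`). It is an illustration of the idea, not a production implementation.

```python
import hashlib
import json


def idempotency_key(event):
    """Hash the event with sorted keys so equal payloads give equal keys."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemoryStore:
    """Stand-in for the idempotency table."""

    def __init__(self):
        self._results = {}

    def put_if_absent(self, key, value):
        # Mimics a conditional write: only the first writer wins.
        if key in self._results:
            return False
        self._results[key] = value
        return True

    def get(self, key):
        return self._results.get(key)


def handle_once(event, store, process):
    """Run `process` only for events whose key has not been stored yet."""
    key = idempotency_key(event)
    cached = store.get(key)
    if cached is not None:
        return cached
    result = process(event)
    store.put_if_absent(key, result)
    return result
```

Sorting keys before hashing matters: a retried event whose fields arrive in a different order still maps to the same key. For production use, a maintained library such as Powertools for AWS Lambda offers an idempotency utility that also handles in-progress executions.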

## Performance Optimization

### Memory vs CPU Tradeoff

**Key Insight:** CPU allocation scales linearly with memory.

| Memory | vCPU | Best For |
|--------|------|----------|
| 128 MB | ~0.07 vCPU | Simple transformations |
| 512 MB | ~0.29 vCPU | Light API handlers |
| 1024 MB | ~0.58 vCPU | General purpose |
| 1769 MB | 1.0 vCPU | CPU-bound tasks |
| 3538 MB | 2.0 vCPU | Heavy processing |
| 10240 MB | 6.0 vCPU | Maximum performance |

**Cost vs Performance:**
- 128 MB, 1000ms = $0.0000020833 per invocation
- 1024 MB, 125ms = $0.0000020833 per invocation (8x faster, same cost!)

**Recommendation:** Use AWS Lambda Power Tuning to find optimal memory.
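
The duration charge is memory (in GB) times billed seconds times the per-GB-second rate, which is why more memory can be cost-neutral when it shortens the run. A quick sketch of the arithmetic for the two configurations compared above, assuming a rate of $0.0000166667 per GB-second and ignoring the separate per-request charge:

```python
PRICE_PER_GB_SECOND = 0.0000166667  # assumed on-demand rate; varies by Region and architecture


def duration_cost(memory_mb, duration_ms):
    """Duration charge for one invocation, excluding the per-request fee."""
    gb = memory_mb / 1024
    seconds = duration_ms / 1000
    return gb * seconds * PRICE_PER_GB_SECOND


slow = duration_cost(128, 1000)
fast = duration_cost(1024, 125)
print(f"{slow:.10f}")  # 0.0000020833
print(f"{fast:.10f}")  # 0.0000020833
```

The break-even only holds if the function is CPU-bound enough to speed up in proportion to the extra memory, which is what Power Tuning measures.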

### Connection Pooling

**Pattern:** Reuse database connections across invocations.

**RDS Connection Pooling:**
```python
import os

import pymysql

# Initialize OUTSIDE handler
connection = None

def get_connection():
    global connection
    if connection is None or not connection.open:
        connection = pymysql.connect(
            host=os.environ['DB_HOST'],
            user=os.environ['DB_USER'],
            password=os.environ['DB_PASSWORD'],
            database=os.environ['DB_NAME'],
            connect_timeout=5,
            cursorclass=pymysql.cursors.DictCursor
        )
    return connection

def lambda_handler(event, context):
    conn = get_connection()
    with conn.cursor() as cursor:
        cursor.execute("SELECT * FROM users WHERE id = %s", (event['userId'],))
        result = cursor.fetchone()
    return result
```

**RDS Proxy (Recommended):**
- Manages connection pooling automatically
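- Reduces database connection overhead by multiplexing many function connections over a shared pool
- Built-in failover (AWS cites failover times reduced by up to 66%)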
- IAM authentication support

**Configuration:**
```yaml
DBProxy:
  Type: AWS::RDS::DBProxy
  Properties:
    EngineFamily: POSTGRESQL
    Auth:
      - AuthScheme: SECRETS
        SecretArn: !Ref DBSecret
    RoleArn: !GetAtt ProxyRole.Arn
    VpcSubnetIds:
      - !Ref PrivateSubnet1
      - !Ref PrivateSubnet2
```

### Provisioned Concurrency

**Use When:**
- Cold starts are unacceptable (<100ms p99 required)
- Predictable high traffic (not cost-effective for variable traffic)
- Cost is secondary to performance

**Cost Model:**
- Provisioned concurrency: $0.0000041667 per GB-second
- On-demand: $0.0000166667 per GB-second
- Provisioned is always running (charged even when idle)

**Configuration:**
```yaml
Function:
  AutoPublishAlias: live  # Provisioned concurrency attaches to a version or alias
  ProvisionedConcurrencyConfig:
    ProvisionedConcurrentExecutions: 10
  # Scaling is configured separately through Application Auto Scaling

**Application Auto Scaling:**
```yaml
ScalableTarget:
  ServiceNamespace: lambda
  ResourceId: !Sub "function:${FunctionName}:${Alias}"
  ScalableDimension: lambda:function:ProvisionedConcurrency
  MinCapacity: 5
  MaxCapacity: 100

ScalingPolicy:
  PolicyType: TargetTrackingScaling
  TargetTrackingScalingPolicyConfiguration:
    TargetValue: 0.70
    PredefinedMetricSpecification:
      PredefinedMetricType: LambdaProvisionedConcurrencyUtilization
```

## Anti-Patterns

### Don't: Run Long-Running Tasks

**Problem:** Lambda has 15-minute timeout limit.

**Solution:** Use Step Functions, ECS Fargate, or Batch.

### Don't: Store State in /tmp

**Problem:** /tmp is ephemeral and limited to 512 MB by default (configurable up to 10 GB).

**Solution:** Use S3, EFS, or ElastiCache for persistent storage.

### Don't: Use Lambda for Predictable High Traffic

**Problem:** More expensive than Fargate/EC2 at constant utilization.

**Breakeven Analysis:**
- Lambda: $0.0000166667 per GB-second
- Fargate (1 vCPU, 2 GB): ~$0.0000069 per GB-second at 100% utilization (about $0.049/hour for the task at us-east-1 rates)

**Solution:** Use Fargate or EC2 for always-on workloads.

### Don't: Hardcode Configuration

**Problem:** Requires redeployment for configuration changes.

**Solution:** Use environment variables, Parameter Store, or AppConfig.

### Don't: Ignore Cold Start Impact

**Problem:** First invocation is slow (Java: 10-15s, Python: 200-500ms).

**Solutions:**
1. Use SnapStart (Java, plus Python 3.12+ and .NET 8+)
2. Provision concurrency (expensive)
3. Keep functions warm (cron trigger)
4. Minimize dependencies
5. Use lighter runtimes (Node.js, Python over Java)

### Don't: Process Large Files in Lambda

**Problem:** Memory tops out at 10 GB, ephemeral storage at 10 GB, and execution time at 15 minutes.

**Solution:** Use S3 Select, Athena, or EMR for large data processing.

### Don't: Synchronous Step Functions for APIs

**Problem:** Express workflows have 5-minute limit, standard workflows are slow.

**Solution:** Use direct Lambda integration or asynchronous workflows.

### Don't: Ignore Concurrent Execution Limits

**Problem:** Default quota of 1,000 concurrent executions per account per Region (can request increase).

**Solution:**
1. Request limit increase
2. Use reserved concurrency to prevent throttling
3. Implement backpressure (SQS rate limiting)
4. Use Provisioned Concurrency for critical functions
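
Whether a workload will hit the limit can be estimated up front: required concurrency is roughly requests per second multiplied by average duration in seconds. A small sketch of that estimate, with hypothetical function names and the 1,000 default used as the limit:

```python
import math

ACCOUNT_LIMIT = 1000  # default concurrency quota cited above


def required_concurrency(requests_per_second, avg_duration_ms):
    """Concurrent executions needed to sustain the given request rate."""
    return math.ceil(requests_per_second * avg_duration_ms / 1000)


def headroom(requests_per_second, avg_duration_ms, limit=ACCOUNT_LIMIT):
    """Remaining concurrency; a negative value means throttling is likely."""
    return limit - required_concurrency(requests_per_second, avg_duration_ms)


print(required_concurrency(500, 200))  # 100
print(headroom(4000, 500))             # -1000
```

The estimate covers steady-state traffic only; bursts and other functions sharing the same account quota reduce the real headroom.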

### Don't: Skip VPC Best Practices

**Problem:** Lambda in VPC requires ENIs (slow cold starts pre-2019).

**Modern Solution (Post-2019):**
- Hyperplane ENIs (shared across functions)
- No cold start penalty
- Use VPC for RDS/ElastiCache access
- Use VPC endpoints for AWS services (avoid NAT Gateway costs)

**Configuration:**
```yaml
Function:
  VpcConfig:
    SecurityGroupIds:
      - !Ref LambdaSecurityGroup
    SubnetIds:
      - !Ref PrivateSubnet1
      - !Ref PrivateSubnet2
```

```

### references/container-patterns.md

```markdown
# AWS Container Patterns

## Table of Contents

- [ECS Service Patterns](#ecs-service-patterns)
- [EKS Cluster Patterns](#eks-cluster-patterns)
- [Fargate vs EC2 Launch Types](#fargate-vs-ec2-launch-types)
- [Task Definition Best Practices](#task-definition-best-practices)
- [Service Discovery and Load Balancing](#service-discovery-and-load-balancing)
- [Auto Scaling Strategies](#auto-scaling-strategies)
- [Service Connect (Service Mesh)](#service-connect-service-mesh)
- [EKS Pod Identities](#eks-pod-identities)
- [Container Security](#container-security)
- [Logging and Monitoring](#logging-and-monitoring)
- [Cost Optimization](#cost-optimization)
- [Anti-Patterns](#anti-patterns)

## ECS Service Patterns

### Basic Web Application (Fargate + ALB)

**Pattern:** Containerized web service with auto-scaling and load balancing.

**Architecture:**
```
Internet → ALB (Application Load Balancer)
           → Target Group
             → ECS Service (Fargate tasks)
               → RDS or DynamoDB
               → ElastiCache (optional)
```

**Use When:**
- Building containerized web applications
- Need auto-scaling without managing servers
- Docker-based deployment workflow
- Team lacks Kubernetes expertise

**Task Definition Components:**
```yaml
TaskDefinition:
  Family: web-app
  NetworkMode: awsvpc  # Required for Fargate
  RequiresCompatibilities:
    - FARGATE
  Cpu: '512'  # 0.5 vCPU
  Memory: '1024'  # 1 GB
  ExecutionRoleArn: !GetAtt ExecutionRole.Arn  # Pull image, logs
  TaskRoleArn: !GetAtt TaskRole.Arn  # Container permissions

  ContainerDefinitions:
    - Name: web
      Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/web-app:latest
      PortMappings:
        - ContainerPort: 80
          Protocol: tcp
      Environment:
        - Name: NODE_ENV
          Value: production
      Secrets:
        - Name: DB_PASSWORD
          ValueFrom: arn:aws:secretsmanager:us-east-1:123456789012:secret:db-pass
      LogConfiguration:
        LogDriver: awslogs
        Options:
          awslogs-group: /ecs/web-app
          awslogs-region: us-east-1
          awslogs-stream-prefix: web
      HealthCheck:
        Command:
          - CMD-SHELL
          - curl -f http://localhost/health || exit 1
        Interval: 30
        Timeout: 5
        Retries: 3
        StartPeriod: 60
```

**Service Configuration:**
```yaml
Service:
  ServiceName: web-service
  Cluster: !Ref ECSCluster
  TaskDefinition: !Ref TaskDefinition
  DesiredCount: 2
  LaunchType: FARGATE

  NetworkConfiguration:
    AwsvpcConfiguration:
      Subnets:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroups:
        - !Ref ServiceSecurityGroup
      AssignPublicIp: DISABLED

  LoadBalancers:
    - TargetGroupArn: !Ref TargetGroup
      ContainerName: web
      ContainerPort: 80

  HealthCheckGracePeriodSeconds: 60

  DeploymentConfiguration:
    MaximumPercent: 200
    MinimumHealthyPercent: 100
    DeploymentCircuitBreaker:
      Enable: true
      Rollback: true
```

**Cost Estimate (2 tasks, 24/7):**
- Fargate (0.5 vCPU, 1GB): ~$35/month
- ALB: ~$20/month
- Data transfer: ~$5/month
- **Total: ~$60/month**

### Background Worker Pattern

**Pattern:** Process asynchronous jobs from SQS queue.

**Architecture:**
```
SQS Queue → ECS Service (Fargate tasks)
             → DynamoDB (job status)
             → S3 (results)
```

**Use When:**
- Processing background jobs (image processing, report generation)
- Need auto-scaling based on queue depth
- Variable workload patterns
- Prefer containers over Lambda (>15 min runtime, >10GB memory)

**Task Definition Differences:**
```yaml
ContainerDefinitions:
  - Name: worker
    Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/worker:latest
    # No PortMappings needed
    Environment:
      - Name: QUEUE_URL
        Value: !Ref WorkQueue
      - Name: BATCH_SIZE
        Value: '10'
    Essential: true
```

**Service Configuration:**
```yaml
Service:
  ServiceName: worker-service
  DesiredCount: 1  # Scale based on queue depth
  # No LoadBalancers configuration
```

**Auto Scaling by Queue Depth:**
```yaml
ScalableTarget:
  ServiceNamespace: ecs
  ResourceId: !Sub "service/${Cluster}/${ServiceName}"
  ScalableDimension: ecs:service:DesiredCount
  MinCapacity: 1
  MaxCapacity: 20

ScalingPolicy:
  PolicyType: TargetTrackingScaling
  TargetTrackingScalingPolicyConfiguration:
    CustomizedMetricSpecification:
      MetricName: ApproximateNumberOfMessagesVisible
      Namespace: AWS/SQS
      Statistic: Average
      Dimensions:
        - Name: QueueName
          Value: !GetAtt WorkQueue.QueueName
    TargetValue: 100  # Target queue depth; for a true per-task target, publish a backlog-per-task metric
    ScaleInCooldown: 300
    ScaleOutCooldown: 60
```
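
The intent behind a per-task target is simple division: the number of tasks needed is the queue backlog divided by the messages one task should own, clamped to the service's capacity bounds. A sketch of that calculation, using the capacity values from the configuration above and a hypothetical function name:

```python
import math


def desired_tasks(visible_messages, target_per_task=100, min_tasks=1, max_tasks=20):
    """Tasks needed so each one owns about `target_per_task` messages."""
    wanted = math.ceil(visible_messages / target_per_task)
    return max(min_tasks, min(max_tasks, wanted))


for depth in (0, 250, 1000, 5000):
    print(depth, desired_tasks(depth))
# 0 1
# 250 3
# 1000 10
# 5000 20
```

This is only the target calculation, not how Application Auto Scaling decides. A common way to get this behavior from target tracking is to publish a custom backlog-per-task metric (queue depth divided by running task count) and track that instead of raw queue depth.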

### Scheduled Task Pattern

**Pattern:** Run containerized cron jobs using EventBridge.

**Architecture:**
```
EventBridge Rule (schedule) → ECS Task (Fargate)
                               → Process data
                               → Store results in S3
```

**Use When:**
- Scheduled batch processing
- Need more than 15 minutes runtime (Lambda limit)
- Require more than 10 GB memory
- Complex dependencies or large Docker images

**EventBridge Rule:**
```yaml
ScheduledRule:
  ScheduleExpression: cron(0 2 * * ? *)  # 2 AM UTC daily
  State: ENABLED
  Targets:
    - Arn: !GetAtt ECSCluster.Arn
      RoleArn: !GetAtt EventsRole.Arn
      EcsParameters:
        TaskDefinitionArn: !Ref TaskDefinition
        LaunchType: FARGATE
        NetworkConfiguration:
          AwsVpcConfiguration:
            Subnets:
              - !Ref PrivateSubnet1
            SecurityGroups:
              - !Ref TaskSecurityGroup
            AssignPublicIp: DISABLED
        TaskCount: 1
```

**Benefits over Lambda:**
- No 15-minute timeout
- Up to 120 GB memory (Fargate)
- More familiar Docker ecosystem
- Can use same image as production service

## EKS Cluster Patterns

### Production-Ready EKS Cluster

**Pattern:** Managed Kubernetes cluster with best practices.

**Architecture:**
```
VPC (10.0.0.0/16)
├── Public Subnets (NAT Gateways, Load Balancers)
├── Private Subnets (EKS worker nodes)
└── Database Subnets (RDS, ElastiCache)

EKS Control Plane (AWS-managed)
└── Node Groups (EC2 or Fargate)
    └── Pods (application containers)
```

**Use When:**
- Team has Kubernetes expertise
- Need Kubernetes ecosystem (Helm, Operators, Istio)
- Multi-cloud or hybrid cloud strategy
- Complex orchestration requirements
- Migrating from on-premises Kubernetes

**Cluster Configuration:**
```yaml
EKSCluster:
  Name: production-cluster
  Version: '1.28'
  RoleArn: !GetAtt ClusterRole.Arn

  ResourcesVpcConfig:
    SubnetIds:
      - !Ref PrivateSubnet1
      - !Ref PrivateSubnet2
      - !Ref PrivateSubnet3
    EndpointPublicAccess: false  # Private cluster
    EndpointPrivateAccess: true
    SecurityGroupIds:
      - !Ref ClusterSecurityGroup

  Logging:
    ClusterLogging:
      EnabledTypes:
        - Type: api
        - Type: audit
        - Type: authenticator
        - Type: controllerManager
        - Type: scheduler

  EncryptionConfig:
    - Resources:
        - secrets
      Provider:
        KeyArn: !GetAtt KMSKey.Arn
```

**Managed Node Group (Recommended):**
```yaml
NodeGroup:
  ClusterName: !Ref EKSCluster
  NodegroupName: general-purpose
  NodeRole: !GetAtt NodeRole.Arn

  Subnets:
    - !Ref PrivateSubnet1
    - !Ref PrivateSubnet2

  ScalingConfig:
    MinSize: 2
    MaxSize: 10
    DesiredSize: 3

  InstanceTypes:
    - t3.medium
    - t3a.medium  # AMD alternative, 10% cheaper

  AmiType: AL2_x86_64  # Amazon Linux 2
  CapacityType: ON_DEMAND  # or SPOT for 70% savings

  UpdateConfig:
    MaxUnavailable: 1  # Rolling updates

  Labels:
    environment: production
    workload-type: general

  Taints:  # Pods need a matching toleration to schedule on these nodes
    - Key: dedicated
      Value: general
      Effect: NO_SCHEDULE
```

**Fargate Profile (Serverless Nodes):**
```yaml
FargateProfile:
  ClusterName: !Ref EKSCluster
  FargateProfileName: serverless-workloads
  PodExecutionRoleArn: !GetAtt FargateRole.Arn

  Subnets:
    - !Ref PrivateSubnet1
    - !Ref PrivateSubnet2

  Selectors:
    - Namespace: serverless
      Labels:
        compute-type: fargate
    - Namespace: kube-system
      Labels:
        k8s-app: kube-dns  # CoreDNS on Fargate
```

**Cost Breakdown:**
- EKS control plane: $73/month
- Worker nodes (3x t3.medium, 24/7): ~$91/month
- NAT Gateways (2x): ~$65/month
- Data transfer: ~$10/month
- **Total: ~$240/month minimum**

### EKS Auto Mode (2024+ Feature)

**Pattern:** Fully managed node lifecycle, auto-scaling, and upgrades.

**Use When:**
- Want maximum automation
- Minimize operational overhead
- AWS-managed node pools are acceptable
- Cost is less critical than simplicity

**Configuration:**
```yaml
EKSCluster:
  ComputeConfig:
    Enabled: true  # Enable Auto Mode
    NodePools:
      - general-purpose  # AWS-managed
    NodeRoleArn: !GetAtt AutoModeNodeRole.Arn
```

**Auto Mode Features:**
- Automatic node provisioning
- Auto-scaling without Cluster Autoscaler
- Automated OS patches and upgrades
- Built-in cost optimization

**Cost Impact:**
- ~10-15% higher than self-managed nodes
- Savings from reduced operational overhead
- Automatic spot instance usage

### Multi-Tenant EKS Pattern

**Pattern:** Isolated namespaces with resource quotas and network policies.

**Use When:**
- Multiple teams or applications sharing cluster
- Need cost efficiency (shared control plane)
- Need isolation between tenants at the namespace level (hard isolation usually calls for separate clusters)
- Centralized cluster management

**Namespace Isolation:**
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
  labels:
    team: team-a
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    persistentvolumeclaims: "5"
    services.loadbalancers: "2"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-limits
  namespace: team-a
spec:
  limits:
    - max:
        cpu: "4"
        memory: 8Gi
      min:
        cpu: "100m"
        memory: 128Mi
      type: Container
```

**Network Policies (Deny by Default):**
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: team-a
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-within-namespace
  namespace: team-a
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}  # Allow from same namespace
```

**RBAC (Role-Based Access Control):**
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-a-developer
  namespace: team-a
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "deployments", "jobs", "services"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["secrets", "configmaps"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-developers
  namespace: team-a
subjects:
  - kind: Group
    name: team-a
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-a-developer
  apiGroup: rbac.authorization.k8s.io
```

## Fargate vs EC2 Launch Types

### Decision Matrix

| Factor | Fargate | EC2 |
|--------|---------|-----|
| **Management** | AWS manages instances | Self-managed instances |
| **Pricing** | Per task-second | Per instance-hour |
| **Scaling** | Task-level (instant) | Instance-level (slower) |
| **Cost (predictable)** | Higher | Lower (RI/SP savings) |
| **Cost (variable)** | Lower | Higher (idle capacity) |
| **Start Time** | ~30 seconds | Instant (if capacity) |
| **Instance Access** | No SSH access | Full SSH access |
| **Customization** | Limited | Full control |
| **Spot Support** | Fargate Spot (up to 70% off) | EC2 Spot (up to 90% off) |

### Cost Comparison Example

**Workload:** Web service, 2 vCPU, 4 GB RAM, 24/7

**Fargate:**
- Pricing: 2 vCPU × $0.04048/hour + 4 GB × $0.004445/hour = ~$0.09874/hour
- Monthly: $0.09874 × 730 hours = ~$72.08/month

**EC2 (t3.medium: 2 vCPU, 4 GB):**
- On-Demand: $0.0416/hour = $30.37/month
- 1-year Reserved Instance: $0.0208/hour = $15.18/month
- 3-year Reserved Instance: $0.0125/hour = $9.13/month

**Recommendation:**
- **Variable traffic:** Fargate (scale to zero)
- **Predictable 24/7:** EC2 with Reserved Instances
- **Development/Test:** Fargate (simpler, pay only when testing)
- **Production high-volume:** EC2 with RI (50-70% cost savings)

### Fargate Spot

**Pattern:** Run fault-tolerant tasks at a discount of up to 70%.

**Use When:**
- Batch processing jobs
- CI/CD build workers
- Data processing pipelines
- Development/test environments

**Configuration:**
```yaml
Service:
  CapacityProviderStrategy:
    - CapacityProvider: FARGATE_SPOT
      Weight: 4
      Base: 0
    - CapacityProvider: FARGATE
      Weight: 1
      Base: 2  # Always maintain 2 on-demand tasks
```
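
To see how `Base` and `Weight` interact, the split can be approximated in a few lines. This is a sketch under stated assumptions: base tasks are placed first and the remainder is spread in proportion to weight. ECS's exact rounding may differ, and `distribute_tasks` is a hypothetical helper.

```python
def distribute_tasks(desired_count, strategy):
    """Approximate how tasks split across capacity providers.

    strategy: list of dicts with "provider", "weight", and "base" keys.
    """
    placed = {entry["provider"]: 0 for entry in strategy}
    remaining = desired_count

    # Base tasks are satisfied first.
    for entry in strategy:
        base = min(entry["base"], remaining)
        placed[entry["provider"]] += base
        remaining -= base

    # Each remaining task goes to the provider furthest below its weighted share.
    weighted = {entry["provider"]: 0 for entry in strategy}
    candidates = [entry for entry in strategy if entry["weight"] > 0]
    for _ in range(remaining if candidates else 0):
        pick = min(candidates, key=lambda e: weighted[e["provider"]] / e["weight"])
        weighted[pick["provider"]] += 1
        placed[pick["provider"]] += 1
    return placed


strategy = [
    {"provider": "FARGATE_SPOT", "weight": 4, "base": 0},
    {"provider": "FARGATE", "weight": 1, "base": 2},
]
print(distribute_tasks(12, strategy))  # {'FARGATE_SPOT': 8, 'FARGATE': 4}
```

With 12 desired tasks, two land on on-demand Fargate as the base and the other ten split 4:1, so a Spot interruption can never take the service below its on-demand floor.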

**Interruption Handling:**
- 2-minute warning before interruption
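- Listen for SIGTERM signal (raise the container `stopTimeout` from its 30-second default, up to 120 seconds, to use the full window)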
- Gracefully finish current work
- Task automatically restarted

**Python Example:**
```python
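import signal
import time

shutdown_requested = False

def sigterm_handler(signum, frame):
    # Only set a flag so the main loop can finish the current batch first
    global shutdown_requested
    print("SIGTERM received, gracefully shutting down")
    shutdown_requested = True

signal.signal(signal.SIGTERM, sigterm_handler)

def process_batch():
    # Placeholder: poll the queue and process one batch of work
    time.sleep(1)

def cleanup():
    # Placeholder: flush buffers, close connections, release locks
    pass

# Main processing loop
while not shutdown_requested:
    process_batch()

cleanup()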

## Task Definition Best Practices

### Resource Allocation

**CPU and Memory Combinations (Fargate):**

| CPU (vCPU) | Memory Options (GB) |
|------------|---------------------|
| 0.25 | 0.5, 1, 2 |
| 0.5 | 1, 2, 3, 4 |
| 1 | 2, 3, 4, 5, 6, 7, 8 |
| 2 | 4-16 (1 GB increments) |
| 4 | 8-30 (1 GB increments) |
| 8 | 16-60 (4 GB increments) |
| 16 | 32-120 (8 GB increments) |

**Right-Sizing Approach:**
1. Start with 0.5 vCPU, 1 GB (common web apps)
2. Monitor CloudWatch metrics (CPUUtilization, MemoryUtilization)
3. Increase if consistently >70% utilization
4. Decrease if consistently <30% utilization
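
The thresholds above translate into a small decision helper. This is a sketch only: the function name is hypothetical, and "consistently" is interpreted as every sample crossing the threshold, which is an assumption to tune per workload.

```python
def rightsizing_action(samples, high=70.0, low=30.0):
    """Suggest a sizing action from utilization percentages.

    "Consistently" is interpreted here as every sample crossing the
    threshold, which is an assumption; tune it for your workload.
    """
    if not samples:
        return "insufficient-data"
    if all(s > high for s in samples):
        return "increase"
    if all(s < low for s in samples):
        return "decrease"
    return "keep"

cpu_last_week = [82.0, 75.5, 91.2, 78.3]  # hypothetical daily averages
print(rightsizing_action(cpu_last_week))
```

Feeding it CloudWatch averages for both CPUUtilization and MemoryUtilization gives one recommendation per dimension; size for whichever is the tighter constraint.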

### Health Checks

**Container Health Check:**
```yaml
HealthCheck:
  Command:
    - CMD-SHELL
    - curl -f http://localhost:8080/health || exit 1
  Interval: 30  # Seconds between checks
  Timeout: 5  # Max time for check to complete
  Retries: 3  # Consecutive failures before unhealthy
  StartPeriod: 60  # Grace period for startup
```

**Load Balancer Health Check:**
```yaml
TargetGroup:
  HealthCheckPath: /health
  HealthCheckProtocol: HTTP
  HealthCheckIntervalSeconds: 30
  HealthCheckTimeoutSeconds: 5
  HealthyThresholdCount: 2  # Consecutive successes
  UnhealthyThresholdCount: 3  # Consecutive failures
  Matcher:
    HttpCode: 200  # or 200-299
```

**Best Practices:**
- Use dedicated health check endpoint
- Check critical dependencies (database connectivity)
- Return appropriate status codes
- Keep health checks fast (<1 second)
- Log health check failures
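
One way to apply these practices is to keep the health logic in a plain function that any web framework can call from its `/health` route. The sketch below is framework-agnostic; `run_health_checks` and the check names are hypothetical, and the one-second budget mirrors the guideline above.

```python
import json
import time

def run_health_checks(checks, budget_seconds=1.0):
    """Run named dependency checks and build an HTTP-style response.

    `checks` maps a name to a zero-argument callable that returns True
    when the dependency is reachable. Returns (status_code, json_body).
    """
    started = time.monotonic()
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:  # a crashing check counts as a failure
            results[name] = f"error: {exc}"
    elapsed = time.monotonic() - started
    healthy = all(value == "ok" for value in results.values())
    if elapsed > budget_seconds:
        print(f"health check slow: {elapsed:.2f}s")  # log, but do not fail
    if not healthy:
        print(f"health check failed: {results}")
    status = 200 if healthy else 503
    return status, json.dumps({"healthy": healthy, "checks": results})

# Hypothetical check; replace with a real query such as SELECT 1
status, body = run_health_checks({"database": lambda: True})
```

Returning 503 on any failed dependency lets both the container health check (`curl -f`) and the load balancer matcher treat the task as unhealthy.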

### Environment Variables vs Secrets

**Environment Variables (Plain Text):**
```yaml
Environment:
  - Name: NODE_ENV
    Value: production
  - Name: LOG_LEVEL
    Value: info
  - Name: API_ENDPOINT
    Value: https://api.example.com
```

**Use For:**
- Non-sensitive configuration
- Public endpoints
- Feature flags
- Environment names

**Secrets (Encrypted):**
```yaml
Secrets:
  - Name: DB_PASSWORD
    ValueFrom: arn:aws:secretsmanager:us-east-1:123456789012:secret:db-pass
  - Name: API_KEY
    ValueFrom: arn:aws:ssm:us-east-1:123456789012:parameter/api-key
```

**Use For:**
- Database credentials
- API keys
- OAuth tokens
- Private certificates

**IAM Permissions Required (task execution role):**
```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "secretsmanager:GetSecretValue",
      "ssm:GetParameters"
    ],
    "Resource": [
      "arn:aws:secretsmanager:us-east-1:123456789012:secret:*",
      "arn:aws:ssm:us-east-1:123456789012:parameter/*"
    ]
  }]
}
```

### Logging Configuration

**CloudWatch Logs (Default):**
```yaml
LogConfiguration:
  LogDriver: awslogs
  Options:
    awslogs-group: /ecs/my-service
    awslogs-region: us-east-1
    awslogs-stream-prefix: task
    awslogs-create-group: "true"  # Requires logs:CreateLogGroup on the task execution role
```

**FireLens (Advanced Routing):**
```yaml
LogConfiguration:
  LogDriver: awsfirelens
  Options:
    Name: datadog
    apikey: !Ref DatadogApiKey
    dd_service: my-service
    dd_source: ecs
    dd_tags: env:production,team:platform
```

**Supported Destinations:**
- CloudWatch Logs
- Datadog
- New Relic
- Splunk
- Elasticsearch
- S3 (via Kinesis Firehose)

## Service Discovery and Load Balancing

### AWS Cloud Map (Service Discovery)

**Pattern:** DNS-based service discovery for microservices.

**Use When:**
- Microservices architecture
- Service-to-service communication
- Dynamic service endpoints
- No load balancer needed (internal traffic)

**Configuration:**
```yaml
ServiceDiscoveryService:
  Name: backend-api
  DnsConfig:
    NamespaceId: !Ref PrivateNamespace
    DnsRecords:
      - Type: A
        TTL: 10
  HealthCheckCustomConfig:
    FailureThreshold: 1

PrivateNamespace:
  Name: internal.example.com
  Vpc: !Ref VPC
```

**Service Registration:**
```yaml
Service:
  ServiceRegistries:
    - RegistryArn: !GetAtt ServiceDiscoveryService.Arn
      ContainerName: backend
      ContainerPort: 8080
```

**Usage in Application:**
```python
# Resolve service using DNS
import socket

import requests

hostname = "backend-api.internal.example.com"
ip_address = socket.gethostbyname(hostname)

# Make request (set a timeout so a stale record fails fast)
response = requests.get(f"http://{ip_address}:8080/api", timeout=5)
```

**Benefits:**
- No load balancer cost
- Automatic registration/deregistration
- Health check integration
- Low latency (DNS caching)

### Application Load Balancer Integration

**Path-Based Routing:**
```yaml
ListenerRule1:
  Priority: 1
  Conditions:
    - Field: path-pattern
      Values:
        - /api/*
  Actions:
    - Type: forward
      TargetGroupArn: !Ref BackendTargetGroup

ListenerRule2:
  Priority: 2
  Conditions:
    - Field: path-pattern
      Values:
        - /admin/*
  Actions:
    - Type: forward
      TargetGroupArn: !Ref AdminTargetGroup
```

**Host-Based Routing:**
```yaml
ListenerRule:
  Conditions:
    - Field: host-header
      Values:
        - api.example.com
        - api-staging.example.com
  Actions:
    - Type: forward
      TargetGroupArn: !Ref ApiTargetGroup
```

**Header-Based Routing:**
```yaml
ListenerRule:
  Conditions:
    - Field: http-header
      HttpHeaderConfig:
        HttpHeaderName: X-API-Version
        Values:
          - v2
  Actions:
    - Type: forward
      TargetGroupArn: !Ref V2TargetGroup
```

## Auto Scaling Strategies

### Target Tracking (Recommended)

**CPU-Based Scaling:**
```yaml
ScalingPolicy:
  PolicyType: TargetTrackingScaling
  TargetTrackingScalingPolicyConfiguration:
    TargetValue: 70.0  # 70% CPU utilization
    PredefinedMetricSpecification:
      PredefinedMetricType: ECSServiceAverageCPUUtilization
    ScaleInCooldown: 300  # 5 minutes
    ScaleOutCooldown: 60  # 1 minute
```

**Memory-Based Scaling:**
```yaml
TargetTrackingScalingPolicyConfiguration:
  TargetValue: 80.0  # 80% memory utilization
  PredefinedMetricSpecification:
    PredefinedMetricType: ECSServiceAverageMemoryUtilization
```

**ALB Request Count Scaling:**
```yaml
TargetTrackingScalingPolicyConfiguration:
  TargetValue: 1000  # 1000 requests per target
  PredefinedMetricSpecification:
    PredefinedMetricType: ALBRequestCountPerTarget
    ResourceLabel: !Sub
      - ${LoadBalancerFullName}/${TargetGroupFullName}
      - LoadBalancerFullName: !GetAtt ALB.LoadBalancerFullName
        TargetGroupFullName: !GetAtt TargetGroup.TargetGroupFullName
```
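
Target tracking adjusts capacity roughly in proportion to how far the metric is from its target. Application Auto Scaling does not publish its exact algorithm, so the following is only an approximation for building intuition; the function name and the min/max defaults are assumptions.

```python
import math

def desired_task_count(current_tasks, metric_value, target_value,
                       min_tasks=1, max_tasks=20):
    """Estimate the task count that brings the metric back to target."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    raw = current_tasks * metric_value / target_value
    # Round up so the estimate never under-provisions, then clamp
    return max(min_tasks, min(max_tasks, math.ceil(raw)))

# 4 tasks averaging 90% CPU against a 70% target
print(desired_task_count(4, 90.0, 70.0))
```

The same proportional reasoning applies to the memory and request-count policies; only the metric and target value change.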

### Step Scaling (Advanced)

**Use When:**
- Need different scaling behavior at different thresholds
- Complex scaling logic
- Multiple alarm thresholds

**Configuration:**
```yaml
ScalingPolicy:
  PolicyType: StepScaling
  StepScalingPolicyConfiguration:
    AdjustmentType: PercentChangeInCapacity
    Cooldown: 300
    MetricAggregationType: Average
    StepAdjustments:
      - MetricIntervalLowerBound: 0
        MetricIntervalUpperBound: 10
        ScalingAdjustment: 10  # Add 10% capacity
      - MetricIntervalLowerBound: 10
        MetricIntervalUpperBound: 20
        ScalingAdjustment: 20  # Add 20% capacity
      - MetricIntervalLowerBound: 20
        ScalingAdjustment: 30  # Add 30% capacity
```
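
Step bounds are offsets from the CloudWatch alarm threshold, not absolute metric values. The sketch below evaluates the three steps above, assuming a hypothetical alarm threshold of 70; lower bounds are treated as inclusive and upper bounds as exclusive, and percentage results are rounded down with a minimum of one task.

```python
import math

# (lower_bound, upper_bound, percent_change), offsets from the alarm threshold
STEPS = [(0, 10, 10), (10, 20, 20), (20, None, 30)]

def step_adjustment(metric_value, alarm_threshold, steps=STEPS):
    """Return the percent capacity change for a breaching metric value."""
    breach = metric_value - alarm_threshold
    for lower, upper, percent in steps:
        # Lower bound inclusive, upper bound exclusive; None means unbounded
        if breach >= lower and (upper is None or breach < upper):
            return percent
    return 0  # metric is below the threshold: no scale-out

def new_capacity(current, percent):
    """Apply a PercentChangeInCapacity adjustment, adding at least one task."""
    if percent == 0:
        return current
    return current + max(1, math.floor(current * percent / 100))

# Hypothetical: 10 tasks, metric at 85 against an alarm threshold of 70
print(new_capacity(10, step_adjustment(85, 70)))
```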

### Scheduled Scaling

**Use When:**
- Predictable traffic patterns (business hours)
- Batch processing at specific times
- Cost optimization (scale down at night)

**Configuration:**
```yaml
ScheduledAction1:
  ScheduledActionName: scale-up-morning
  Schedule: cron(0 7 ? * MON-FRI *)  # 7 AM weekdays (UTC unless Timezone is set)
  ScalableTargetAction:
    MinCapacity: 5
    MaxCapacity: 20

ScheduledAction2:
  ScheduledActionName: scale-down-evening
  Schedule: cron(0 19 * * ? *)  # 7 PM daily (UTC unless Timezone is set)
  ScalableTargetAction:
    MinCapacity: 1
    MaxCapacity: 5
```

## Service Connect (Service Mesh)

### Built-In Service Mesh (2023+ Feature)

**Use When:**
- Need service-to-service communication
- Want observability without code changes
- Require retry logic and circuit breaking
- Avoid complexity of Istio or App Mesh

**Architecture:**
```
Service A → Envoy Proxy → Service B Endpoint
                        → Service B Load Balancing
                        → CloudMap Discovery
```

**Configuration:**
```yaml
Cluster:
  ServiceConnectDefaults:
    Namespace: internal.local

Service:
  ServiceConnectConfiguration:
    Enabled: true
    Namespace: internal.local
    Services:
      - PortName: http
        ClientAliases:
          - Port: 8080
            DnsName: backend-api
        IngressPortOverride: 8080
    LogConfiguration:
      LogDriver: awslogs
      Options:
        awslogs-group: /ecs/service-connect
        awslogs-stream-prefix: proxy
```

**Benefits:**
- Zero code changes
- Automatic retries
- Circuit breaking
- Connection pooling
- Metrics and tracing
- mTLS encryption (optional)

**Observability:**
- Request success/failure rates
- Latency percentiles (p50, p99)
- Active connections
- Retry counts
- All metrics in CloudWatch

## EKS Pod Identities

### Simplified IAM for Pods (2024+ Feature)

**Use When:**
- Running applications on EKS
- Need AWS service access from pods
- Want simpler configuration than IRSA

**Old Way (IRSA - IAM Roles for Service Accounts):**
```yaml
# Complex OIDC provider setup required
# Per-namespace service account annotations
# Trust relationship with OIDC provider
```

**New Way (Pod Identities):**
```yaml
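# No service account annotation is needed. An association links the
# service account to the role (requires the EKS Pod Identity Agent add-on).
MyAppPodIdentity:
  Type: AWS::EKS::PodIdentityAssociation
  Properties:
    ClusterName: !Ref ClusterName
    Namespace: default
    ServiceAccount: my-app
    RoleArn: !GetAtt MyAppRole.Arn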
```

**IAM Role Configuration:**
```yaml
MyAppRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            Service: pods.eks.amazonaws.com
          Action:
            - sts:AssumeRole
            - sts:TagSession  # Required by EKS Pod Identity
          Condition:
            StringEquals:
              aws:SourceAccount: !Ref AWS::AccountId
            ArnEquals:
              aws:SourceArn: !Sub 'arn:aws:eks:${AWS::Region}:${AWS::AccountId}:cluster/${ClusterName}'
    Policies:
      - PolicyName: S3Access
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Action:
                - s3:GetObject
                - s3:PutObject
              Resource: !Sub '${Bucket.Arn}/*'
```

**Pod Specification:**
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  namespace: default
spec:
  serviceAccountName: my-app
  containers:
    - name: app
      image: my-app:latest
      # AWS SDK obtains role credentials automatically via the Pod Identity Agent
```

**Benefits over IRSA:**
- Simpler setup (no OIDC provider)
- Works with any namespace
- Faster credential rotation
- Better audit logging

## Container Security

### Image Scanning

**ECR Image Scanning:**
```yaml
Repository:
  ImageScanningConfiguration:
    ScanOnPush: true
  ImageTagMutability: IMMUTABLE  # Prevent tag overwrites
```

**Scan Results Integration:**
- Automatic CVE detection
- Severity classification (Critical, High, Medium, Low)
- Integration with Security Hub
- Findings in ECR console

**Best Practices:**
- Scan all images before deployment
- Block deployments with critical vulnerabilities
- Regularly rebuild base images
- Use minimal base images (alpine, distroless)

### Runtime Security

**Read-Only Root Filesystem:**
```yaml
ContainerDefinitions:
  - Name: app
    ReadonlyRootFilesystem: true
    MountPoints:
      - SourceVolume: tmp
        ContainerPath: /tmp
        ReadOnly: false

Volumes:
  - Name: tmp  # No Host path: ephemeral task storage (also works on Fargate)
```

**Drop Capabilities (Kubernetes):**
```yaml
securityContext:
  allowPrivilegeEscalation: false
  runAsNonRoot: true
  runAsUser: 1000
  capabilities:
    drop:
      - ALL
    add:
      - NET_BIND_SERVICE  # Only if needed
```

**Network Segmentation:**
```yaml
SecurityGroup:
  Ingress:
    - IpProtocol: tcp
      FromPort: 80
      ToPort: 80
      SourceSecurityGroupId: !Ref ALBSecurityGroup
    # No SSH access
  Egress:
    - IpProtocol: tcp
      FromPort: 443
      ToPort: 443
      CidrIp: 0.0.0.0/0  # HTTPS only
    - IpProtocol: tcp
      FromPort: 5432
      ToPort: 5432
      DestinationSecurityGroupId: !Ref DBSecurityGroup
```

## Logging and Monitoring

### Container Insights

**Enable Container Insights:**
```yaml
Cluster:
  ClusterSettings:
    - Name: containerInsights
      Value: enabled
```

**Metrics Collected:**
- CPU utilization (cluster, service, task)
- Memory utilization
- Network throughput
- Disk I/O
- Task count

**Log Insights Queries:**
```sql
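# Find the 100 most recent errors (set the time range to the last hour in the console)
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Latency statistics for /api/users in 5-minute buckets
# (assumes JSON logs with path, duration, and status fields)
fields @timestamp, duration, status
| filter path = "/api/users"
| stats avg(duration), max(duration), count() by bin(5m)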
```

### X-Ray Tracing

**Enable X-Ray:**
```yaml
ContainerDefinitions:
  - Name: xray-daemon
    Image: amazon/aws-xray-daemon
    PortMappings:
      - ContainerPort: 2000
        Protocol: udp
  - Name: app
    Environment:
      - Name: AWS_XRAY_DAEMON_ADDRESS
        Value: localhost:2000
```

**Application Code (Python):**
```python
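from flask import Flask, jsonify
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.ext.flask.middleware import XRayMiddleware

app = Flask(__name__)
# A service name is required so segments can be named
xray_recorder.configure(service='my-service')
XRayMiddleware(app, xray_recorder)

@app.route('/api/users')
def get_users():
    # Incoming requests are traced automatically by the middleware
    users = []  # Placeholder: load users from your data store
    return jsonify(users)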
```

## Cost Optimization

### EC2 Launch Type Optimization

**Spot Instances (Up to 90% Savings):**
```yaml
CapacityProvider:
  Name: spot-capacity
  AutoScalingGroupProvider:
    AutoScalingGroupArn: !Ref SpotASG
    ManagedScaling:
      Status: ENABLED
      TargetCapacity: 100
    ManagedTerminationProtection: ENABLED

Service:
  CapacityProviderStrategy:
    - CapacityProvider: spot-capacity
      Weight: 4
    - CapacityProvider: ondemand-capacity
      Weight: 1
      Base: 2  # Always maintain 2 on-demand
```

**Graviton Processors (20% Cost Savings):**
```yaml
LaunchTemplate:
  InstanceType: t4g.medium  # Graviton-based
  # 20% cheaper than t3.medium
  # Up to 40% better price-performance
```

**Reserved Instances:**
- 1-year: 30-40% savings
- 3-year: 50-60% savings
- Use for predictable baseline capacity

### Task-Level Optimization

**Right-Sizing:**
- Monitor CloudWatch metrics weekly
- Reduce over-provisioned resources
- Use Compute Optimizer recommendations

**Reduce Data Transfer:**
- Use VPC endpoints (avoid NAT Gateway costs)
- Place services in same AZ when possible
- Use CloudFront for static assets

**Storage Optimization:**
- Use task ephemeral storage where possible (Fargate includes 20 GiB at no extra charge)
- Avoid EBS volumes unless necessary
- Clean up unused ECR images

## Anti-Patterns

### Don't: Run Databases in Containers

**Problem:** Stateful data, performance overhead, operational complexity.

**Solution:** Use RDS, Aurora, or DynamoDB.

### Don't: Use Latest Tag

**Problem:** Unpredictable deployments, difficult rollbacks.

**Solution:** Use immutable tags (commit SHA, semantic versions).

```yaml
# Bad
Image: my-app:latest

# Good
Image: my-app:v1.2.3
Image: my-app:commit-abc123
```

### Don't: Store Secrets in Environment Variables

**Problem:** Exposed in logs, console, API responses.

**Solution:** Use Secrets Manager or Parameter Store.

### Don't: Run Single Replica

**Problem:** No high availability, downtime during deployments.

**Solution:** Run minimum 2 replicas across multiple AZs.

### Don't: Ignore Resource Limits

**Problem:** Resource starvation, OOM kills, cascading failures.

**Solution:** Set appropriate CPU and memory limits.

```yaml
# Kubernetes
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"
```

### Don't: Use Default VPC

**Problem:** No subnet segmentation, poor security posture.

**Solution:** Create custom VPC with private subnets.

### Don't: Skip Health Checks

**Problem:** Traffic sent to unhealthy tasks, user-facing errors.

**Solution:** Implement comprehensive health checks.

### Don't: Ignore Deployment Circuit Breakers

**Problem:** Bad deployments can take down entire service.

**Solution:** Enable circuit breakers with automatic rollback.

```yaml
DeploymentConfiguration:
  DeploymentCircuitBreaker:
    Enable: true
    Rollback: true
```

```

### references/networking.md

```markdown
# AWS Networking - Deep Dive


## Table of Contents

- [VPC Architecture](#vpc-architecture)
  - [Standard 3-Tier Pattern](#standard-3-tier-pattern)
  - [Security Groups vs. NACLs](#security-groups-vs-nacls)
  - [NAT Gateway](#nat-gateway)
- [Load Balancers](#load-balancers)
  - [Application Load Balancer (ALB)](#application-load-balancer-alb)
  - [Network Load Balancer (NLB)](#network-load-balancer-nlb)
- [CloudFront (CDN)](#cloudfront-cdn)
- [Route 53 (DNS)](#route-53-dns)
  - [Routing Policies](#routing-policies)
- [VPC Peering](#vpc-peering)
- [PrivateLink](#privatelink)
- [Transit Gateway](#transit-gateway)

## VPC Architecture

### Standard 3-Tier Pattern

```
VPC: 10.0.0.0/16 (65,536 IPs)

Availability Zone A:
  Public Subnet:    10.0.1.0/24  (256 IPs: ALB, NAT Gateway, Bastion)
  Private Subnet:   10.0.11.0/24 (256 IPs: ECS, Lambda, App Servers)
  Database Subnet:  10.0.21.0/24 (256 IPs: RDS, Aurora, ElastiCache)

Availability Zone B:
  Public Subnet:    10.0.2.0/24
  Private Subnet:   10.0.12.0/24
  Database Subnet:  10.0.22.0/24

Availability Zone C:
  Public Subnet:    10.0.3.0/24
  Private Subnet:   10.0.13.0/24
  Database Subnet:  10.0.23.0/24
```
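
The layout above can be generated and validated with Python's standard `ipaddress` module. This is a sketch of the numbering scheme only (public 1-3, private 11-13, database 21-23); the dictionary and zone names are illustrative.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=24))  # 256 possible /24 subnets

# Third-octet offsets used by the layout above: public 1-3, private 11-13,
# database 21-23 (one per Availability Zone)
TIER_OFFSETS = {"public": 0, "private": 10, "database": 20}
ZONES = ["a", "b", "c"]

plan = {
    (tier, zone): subnets[offset + index]
    for tier, offset in TIER_OFFSETS.items()
    for index, zone in enumerate(ZONES, start=1)
}

for (tier, zone), net in plan.items():
    # AWS reserves 5 addresses in every subnet
    print(f"{tier:<9} AZ-{zone}: {net} ({net.num_addresses - 5} usable)")
```

Leaving gaps between the tiers (4-10, 14-20) keeps room to add Availability Zones later without renumbering.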

### Security Groups vs. NACLs

| Feature | Security Groups | Network ACLs |
|---------|----------------|--------------|
| **Level** | Instance (ENI) | Subnet |
| **State** | Stateful | Stateless |
| **Rules** | Allow only | Allow + Deny |
| **Return Traffic** | Automatic | Must configure |
| **Evaluation** | All rules | Numbered order |
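
The "Evaluation" row is the difference that most often surprises people, so here is a simplified model of both behaviors. It covers inbound TCP on a single port only; the rule tuples and function names are hypothetical and ignore protocols, port ranges, and return traffic.

```python
import ipaddress

def nacl_allows(rules, source_ip, port):
    """Network ACL: the lowest-numbered matching rule wins."""
    for number, cidr, rule_port, action in sorted(rules):
        in_cidr = ipaddress.ip_address(source_ip) in ipaddress.ip_network(cidr)
        if in_cidr and port == rule_port:
            return action == "allow"
    return False  # implicit final deny

def security_group_allows(rules, source_ip, port):
    """Security group: allow-only rules, any match permits the traffic."""
    return any(
        ipaddress.ip_address(source_ip) in ipaddress.ip_network(cidr)
        and port == rule_port
        for cidr, rule_port in rules
    )

nacl_rules = [
    (100, "203.0.113.0/24", 443, "deny"),   # block one range first
    (200, "0.0.0.0/0", 443, "allow"),       # then allow everyone else
]
sg_rules = [("0.0.0.0/0", 443)]

print(nacl_allows(nacl_rules, "203.0.113.7", 443))
print(security_group_allows(sg_rules, "203.0.113.7", 443))
```

Because security groups cannot express a deny, blocking a specific range requires a NACL rule with a lower number than the broader allow.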

### NAT Gateway

**Purpose:** Enable private subnet instances to access internet for updates, APIs

**Cost (us-east-1):**
- $0.045/hour = $32.85/month
- $0.045/GB processed
- Deploy one per AZ for HA

**Alternative:** NAT Instance (EC2) - cheaper but manual management
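
Because the hourly charge applies per gateway, the one-per-AZ recommendation multiplies the fixed cost. A rough estimator using the us-east-1 rates listed above (the function name and the example workload are hypothetical):

```python
def nat_gateway_monthly_cost(gateways, gb_processed,
                             hourly_rate=0.045, per_gb_rate=0.045,
                             hours=730):
    """Monthly NAT Gateway cost: fixed hourly charge plus data processing."""
    return gateways * hourly_rate * hours + gb_processed * per_gb_rate

# Hypothetical workload: one gateway per AZ across 3 AZs, 1 TB processed
print(f"${nat_gateway_monthly_cost(3, 1000):.2f}/month")
```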

---

## Load Balancers

### Application Load Balancer (ALB)

**Features:**
- Layer 7 (HTTP/HTTPS)
- Path-based routing: `/api` → backend, `/web` → frontend
- Host-based routing: `api.example.com`, `web.example.com`
- WebSocket support
- Lambda targets (serverless backends)
- HTTP/2, gRPC support

**Cost:**
- $0.0225/hour = $16.43/month
- $0.008/LCU-hour (Load Balancer Capacity Unit)
- Minimum ~$20/month

### Network Load Balancer (NLB)

**Features:**
- Layer 4 (TCP/UDP)
- Ultra-low latency (<100 microseconds)
- Millions of requests/second
- Static IP addresses (Elastic IP)
- PrivateLink support

**Cost:**
- $0.0225/hour = $16.43/month
- $0.006/NLCU-hour

**Use When:**
- Extreme performance needed
- Static IPs required
- Non-HTTP protocols

---

## CloudFront (CDN)

**Purpose:** Global content delivery network with 450+ edge locations

**Features:**
- Cache static content (images, CSS, JS)
- Dynamic content acceleration
- DDoS protection (AWS Shield)
- Lambda@Edge for edge compute

**Cost:**
- Data transfer: $0.085/GB (first 10TB, decreases)
- Requests: $0.0075 per 10,000
- Free tier: 1 TB transfer out, 10M requests/month (always free)

**Cache Behaviors:**
- Match path patterns: `/images/*`, `/api/*`
- TTL configuration per pattern
- Origin types: S3, ALB, custom

---

## Route 53 (DNS)

### Routing Policies

| Policy | Use Case |
|--------|----------|
| **Simple** | Single resource |
| **Weighted** | A/B testing, gradual migration (10% → 90%) |
| **Latency** | Route to lowest-latency region |
| **Failover** | Active-passive disaster recovery |
| **Geolocation** | Route based on user location |
| **Geoproximity** | Route based on resource location + bias |
| **Multi-value** | Return multiple IPs with health checks |

**Cost:**
- Hosted zone: $0.50/month
- Queries: $0.40 per million
- Health checks: $0.50/month each
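
For the weighted policy, each record receives a share of responses equal to its weight divided by the sum of all weights. A minimal sketch of that arithmetic (record names are hypothetical):

```python
def traffic_share(weights):
    """Expected fraction of DNS responses per record: weight / sum of weights."""
    total = sum(weights.values())
    if total == 0:
        # Route 53 treats all-zero weights as equal probability
        return {name: 1 / len(weights) for name in weights}
    return {name: weight / total for name, weight in weights.items()}

# Hypothetical migration: start by sending 10% of traffic to the new stack
print(traffic_share({"legacy": 90, "new": 10}))
```

Shifting traffic gradually then means updating the two weights step by step; resolver caching means the observed split lags each change by up to the record TTL.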

---

## VPC Peering

**Purpose:** Connect two VPCs privately (same region or cross-region)

**Characteristics:**
- Non-transitive (A↔B, B↔C doesn't mean A↔C)
- No overlapping CIDR blocks
- Data transfer: $0.01/GB (same region), $0.02/GB (cross-region)

---

## PrivateLink

**Purpose:** Privately access AWS services or third-party services

**Use Cases:**
- Access AWS services such as S3 and DynamoDB without an internet gateway (both also offer free gateway endpoints)
- SaaS vendor connections
- Shared services across accounts

**Cost:**
- $0.01/hour per AZ = $7.30/month per AZ
- $0.01/GB processed

---

## Transit Gateway

**Purpose:** Hub-and-spoke network architecture for 100s of VPCs

**Features:**
- Connect VPCs, on-premises networks (VPN, Direct Connect)
- Transitive routing
- Route table customization

**Cost:**
- $0.05/hour per attachment = $36.50/month
- $0.02/GB processed

**Use When:**
- >5 VPCs to connect
- Need centralized routing
- Hybrid cloud architecture

```

### references/security.md

```markdown
# AWS Security Best Practices


## Table of Contents

- [IAM (Identity and Access Management)](#iam-identity-and-access-management)
  - [Least Privilege Policy Example](#least-privilege-policy-example)
  - [IAM Best Practices](#iam-best-practices)
  - [IAM Roles for Common Services](#iam-roles-for-common-services)
- [KMS (Key Management Service)](#kms-key-management-service)
  - [Key Types](#key-types)
  - [Encryption Patterns](#encryption-patterns)
  - [KMS API Costs](#kms-api-costs)
- [Secrets Manager](#secrets-manager)
  - [Cost Comparison](#cost-comparison)
  - [When to Use Each](#when-to-use-each)
  - [Automatic Rotation Example](#automatic-rotation-example)
- [WAF (Web Application Firewall)](#waf-web-application-firewall)
  - [Managed Rule Groups](#managed-rule-groups)
  - [Custom Rules](#custom-rules)
  - [Cost Model](#cost-model)
- [GuardDuty (Threat Detection)](#guardduty-threat-detection)
- [Security Hub](#security-hub)
- [Network Security](#network-security)
  - [Security Group Strategy](#security-group-strategy)
  - [VPC Flow Logs](#vpc-flow-logs)
- [Encryption Checklist](#encryption-checklist)
  - [Data at Rest](#data-at-rest)
  - [Data in Transit](#data-in-transit)
  - [Key Management](#key-management)
- [Compliance Frameworks](#compliance-frameworks)
  - [AWS Artifact](#aws-artifact)
  - [AWS Config](#aws-config)
- [Incident Response](#incident-response)
  - [CloudTrail](#cloudtrail)
  - [Security Automation](#security-automation)

## IAM (Identity and Access Management)

### Least Privilege Policy Example

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "s3:GetObject",
      "s3:PutObject"
    ],
    "Resource": "arn:aws:s3:::my-bucket/uploads/*",
    "Condition": {
      "IpAddress": {
        "aws:SourceIp": "203.0.113.0/24"
      }
    }
  }]
}
```
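
To see how the three parts of this policy narrow access, the sketch below mimics the match logic locally. It is a simplification, not IAM's evaluation engine: real IAM also handles explicit denies, multiple statements, and case-insensitive action names, and `is_allowed` is a hypothetical helper.

```python
import ipaddress
from fnmatch import fnmatchcase

POLICY = {
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": "arn:aws:s3:::my-bucket/uploads/*",
    "SourceIp": "203.0.113.0/24",
}

def is_allowed(action, resource, source_ip, policy=POLICY):
    """Return True only when action, resource, and source IP all match.

    Anything that does not match falls through to IAM's implicit deny.
    """
    return (
        action in policy["Action"]
        and fnmatchcase(resource, policy["Resource"])
        and ipaddress.ip_address(source_ip)
        in ipaddress.ip_network(policy["SourceIp"])
    )

print(is_allowed("s3:GetObject",
                 "arn:aws:s3:::my-bucket/uploads/report.csv",
                 "203.0.113.10"))
```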

### IAM Best Practices

1. **Use Roles, Not Users** for applications
2. **Enable MFA** for privileged users
3. **Use IAM Access Analyzer** to validate policies
4. **Implement Permission Boundaries** to cap the maximum permissions an identity can be granted
5. **Rotate Credentials** regularly (90 days)
6. **Use AWS Organizations SCPs** for guardrails

### IAM Roles for Common Services

**Lambda Execution Role (trust policy):**
```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"Service": "lambda.amazonaws.com"},
    "Action": "sts:AssumeRole"
  }]
}
```

**ECS Task Role (trust policy):**
```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"Service": "ecs-tasks.amazonaws.com"},
    "Action": "sts:AssumeRole"
  }]
}
```

---

## KMS (Key Management Service)

### Key Types

| Type | Cost | Rotation | Use Case |
|------|------|----------|----------|
| **AWS Managed** | Free | Automatic (every year) | S3, EBS, RDS default |
| **Customer Managed** | $1/month | Manual or automatic | Custom policies |
| **Custom Key Store** | $1/month + CloudHSM | Manual | FIPS 140-2 Level 3 |

### Encryption Patterns

**Server-Side Encryption (SSE):**
- S3: SSE-S3 (free), SSE-KMS ($1/month key + API costs)
- EBS: Enable encryption by default (recommended)
- RDS: Enable at creation

**Client-Side Encryption:**
- Encrypt before sending to AWS
- Application manages keys
- Use KMS Encrypt API or encryption SDK

### KMS API Costs

- $0.03 per 10,000 requests
- Free tier: 20,000 requests/month; AWS managed keys have no monthly key fee
- Monitor usage for high-volume applications

---

## Secrets Manager

### Cost Comparison

| Service | Cost/Secret | Rotation | Secret Size |
|---------|-------------|----------|-------------|
| **Secrets Manager** | $0.40/month | Automatic (Lambda) | 64KB |
| **Parameter Store (Standard)** | Free | Manual | 4KB |
| **Parameter Store (Advanced)** | $0.05/month | Manual | 8KB |

### When to Use Each

**Secrets Manager:**
- Database credentials with rotation
- API keys requiring rotation
- Multi-region replication needed

**Parameter Store:**
- Application configuration
- Non-rotating secrets
- Cost-sensitive scenarios

### Automatic Rotation Example

**RDS MySQL Credentials:**
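1. Secrets Manager invokes the rotation Lambda on schedule (for example, every 30 days)
2. Lambda creates new credentials with the same permissions and stages them as a pending secret version
3. Lambda tests the new credentials
4. Lambda marks the new version as current, updating the secret
5. Old credentials are retired (the alternating-users strategy keeps the previous user valid until the next rotation)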
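
Secrets Manager drives rotation by invoking the Lambda function once per step, passing `Step`, `SecretId`, and `ClientRequestToken` in the event; the step names are `createSecret`, `setSecret`, `testSecret`, and `finishSecret`. The skeleton below shows only the dispatch. The step bodies are hypothetical stand-ins, and a real function would call Secrets Manager and the database inside them.

```python
def make_rotation_handler(steps):
    """Build a Lambda-style handler that dispatches on the rotation step.

    `steps` maps each step name to a callable taking (secret_id, token).
    """
    required = {"createSecret", "setSecret", "testSecret", "finishSecret"}
    missing = required - set(steps)
    if missing:
        raise ValueError(f"missing step handlers: {sorted(missing)}")

    def handler(event, context=None):
        step = event["Step"]
        if step not in required:
            raise ValueError(f"unknown rotation step: {step}")
        return steps[step](event["SecretId"], event["ClientRequestToken"])

    return handler

# Hypothetical stand-ins that only record which step ran
STEP_NAMES = ("createSecret", "setSecret", "testSecret", "finishSecret")
calls = []
handler = make_rotation_handler({
    name: (lambda secret_id, token, name=name: calls.append(name))
    for name in STEP_NAMES
})
for step in STEP_NAMES:
    handler({"Step": step, "SecretId": "db-pass", "ClientRequestToken": "t1"})
```

Keeping each step idempotent matters because Secrets Manager may retry a step with the same token after a failure.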

---

## WAF (Web Application Firewall)

### Managed Rule Groups

| Rule Group | Purpose | Cost |
|------------|---------|------|
| **Core Rule Set** | OWASP Top 10 | No extra fee (standard rule charge) |
| **SQL Injection** | Database attacks | No extra fee (standard rule charge) |
| **Known Bad Inputs** | CVE signatures | No extra fee (standard rule charge) |
| **IP Reputation** | Block malicious IPs | No extra fee (standard rule charge) |

### Custom Rules

**Rate Limiting:**
```json
{
  "Name": "RateLimitRule",
  "Priority": 1,
  "Statement": {
    "RateBasedStatement": {
      "Limit": 2000,
      "AggregateKeyType": "IP"
    }
  },
  "Action": {"Block": {}}
}
```
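
A rate-based rule counts requests per aggregation key (here, client IP) over a trailing window and blocks keys that exceed the limit. The sketch below illustrates the idea with an exact sliding window of 300 seconds; WAF's own counting is managed by the service and may be approximate, and the class name is hypothetical.

```python
from collections import defaultdict, deque

class RateLimiter:
    """Block a key once it exceeds `limit` requests in a sliding window."""

    def __init__(self, limit=2000, window_seconds=300):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(deque)  # key (e.g. client IP) -> timestamps

    def allow(self, key, now):
        timestamps = self.hits[key]
        # Drop requests that have aged out of the window
        while timestamps and timestamps[0] <= now - self.window:
            timestamps.popleft()
        timestamps.append(now)  # blocked requests still count toward the rate
        return len(timestamps) <= self.limit

# Small limit so the behavior is easy to follow
limiter = RateLimiter(limit=3, window_seconds=300)
decisions = [limiter.allow("203.0.113.7", now=t) for t in (0, 1, 2, 3, 400)]
print(decisions)
```

The fourth request is rejected because it is the fourth inside the window; by the last request the earlier ones have expired, so the client is allowed again.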

**Geo-Blocking:**
```json
{
  "Name": "BlockCountries",
  "Priority": 2,
  "Statement": {
    "GeoMatchStatement": {
      "CountryCodes": ["CN", "RU"]
    }
  },
  "Action": {"Block": {}}
}
```

### Cost Model

- Web ACL: $5/month
- Rules: $1/month per rule
- Requests: $0.60 per million
- Example: 1 ACL + 5 rules + 10M requests = $5 + $5 + $6 = $16/month

---

## GuardDuty (Threat Detection)

**Purpose:** Intelligent threat detection using ML

**Data Sources:**
- VPC Flow Logs
- CloudTrail event logs
- DNS logs

**Cost:**
- CloudTrail: $4.50 per million events
- VPC Flow Logs: $0.50 per million events analyzed
- DNS logs: $0.40 per million queries

**Use Cases:**
- Detect compromised instances
- Identify reconnaissance attempts
- Find unauthorized access

---

## Security Hub

**Purpose:** Centralized security dashboard, compliance checks

**Features:**
- CIS AWS Foundations Benchmark
- PCI DSS compliance
- Integration with GuardDuty, Inspector, Macie

**Cost:**
- Security checks: $0.001 per check
- Findings ingestion: $0.00003 per finding

---

## Network Security

### Security Group Strategy

**Principle:** Default deny, explicit allow

**Example: Web Tier**
```
Inbound:
  - Port 443 (HTTPS) from 0.0.0.0/0
  - Port 80 (HTTP) from 0.0.0.0/0

Outbound:
  - All traffic (stateful return traffic automatic)
```

**Example: App Tier**
```
Inbound:
  - Port 8080 from web-tier-sg
  - Port 3000 from web-tier-sg

Outbound:
  - Port 5432 to database-tier-sg (PostgreSQL)
  - Port 443 to 0.0.0.0/0 (API calls)
```

### VPC Flow Logs

**Purpose:** Network traffic analysis, troubleshooting, security monitoring

**Cost:**
- $0.50 per GB ingested to CloudWatch Logs
- Can send to S3 for cheaper storage ($0.023/GB)

**Analysis Tools:**
- CloudWatch Insights for queries
- Athena for S3-stored logs
- Third-party tools (Splunk, Datadog)

---

## Encryption Checklist

### Data at Rest

- [ ] S3: SSE-S3 or SSE-KMS enabled
- [ ] EBS: Encryption by default enabled
- [ ] RDS: Encryption enabled at creation
- [ ] DynamoDB: Encryption enabled (free)
- [ ] EFS: Encryption enabled
- [ ] ElastiCache: At-rest encryption (Redis only)

### Data in Transit

- [ ] ALB/NLB: HTTPS listeners with TLS 1.2+
- [ ] CloudFront: HTTPS required
- [ ] RDS: Force SSL connections
- [ ] DynamoDB: HTTPS API calls
- [ ] S3: Bucket policies require HTTPS

### Key Management

- [ ] Use AWS KMS for customer-managed keys
- [ ] Enable automatic key rotation (365 days)
- [ ] Use key policies for access control
- [ ] Monitor KMS API usage (CloudTrail)

---

## Compliance Frameworks

### AWS Artifact

**Purpose:** Access compliance reports (SOC, PCI, ISO, HIPAA)

**Available Reports:**
- SOC 1, 2, 3
- PCI DSS
- ISO 27001, 27017, 27018
- HIPAA BAA

### AWS Config

**Purpose:** Resource inventory, configuration compliance

**Rules Examples:**
- encrypted-volumes: All EBS volumes encrypted
- s3-bucket-public-read-prohibited: No public S3 buckets
- rds-multi-az-enabled: RDS instances Multi-AZ
- iam-password-policy: Strong password requirements

**Cost:**
- $0.003 per configuration item
- $0.001 per rule evaluation

---

## Incident Response

### CloudTrail

**Purpose:** API audit logs for all AWS actions

**Features:**
- 90-day event history (free)
- Longer retention via S3 trail
- Multi-region trails
- Log file integrity validation

**Cost:**
- First copy of management events: Free
- S3 storage: Standard S3 rates
- CloudWatch Logs integration: $0.50/GB

### Security Automation

**Example: Auto-Remediation with Lambda**

1. Config Rule detects non-compliant resource
2. EventBridge triggers Lambda
3. Lambda remediates (e.g., enable encryption)
4. SNS notification sent
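
The four steps above map onto a small Lambda handler. This sketch assumes the EventBridge event for a Config rule compliance change carries `resourceId`, `configRuleName`, and `newEvaluationResult.complianceType` under `detail`; verify the event shape in your account. The `remediate` and `notify` callables are hypothetical and injected so the logic can run without AWS access.

```python
def make_remediation_handler(remediate, notify):
    """Build a Lambda-style handler for Config compliance-change events.

    `remediate(resource_id)` and `notify(message)` are injected so the
    logic can be tested without AWS access.
    """
    def handler(event, context=None):
        detail = event.get("detail", {})
        compliance = detail.get("newEvaluationResult", {}).get("complianceType")
        if compliance != "NON_COMPLIANT":
            return {"action": "none"}
        resource_id = detail["resourceId"]
        remediate(resource_id)
        notify(f"Remediated {resource_id} for rule {detail.get('configRuleName')}")
        return {"action": "remediated", "resource": resource_id}

    return handler

# Hypothetical stand-ins for the real remediation and SNS publish calls
remediated, messages = [], []
handler = make_remediation_handler(remediated.append, messages.append)
result = handler({
    "detail": {
        "resourceId": "vol-0abc",
        "configRuleName": "encrypted-volumes",
        "newEvaluationResult": {"complianceType": "NON_COMPLIANT"},
    }
})
```

In a deployed function, `remediate` would wrap the relevant boto3 call and `notify` would publish to the SNS topic.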

**Common Automations:**
- Revoke overly permissive security groups
- Enable encryption on new resources
- Remove public access from S3 buckets
- Rotate IAM access keys >90 days old

```

### references/well-architected.md

```markdown
# AWS Well-Architected Framework - Implementation Guide


## Table of Contents

- [Six Pillars](#six-pillars)
  - [1. Operational Excellence](#1-operational-excellence)
  - [2. Security](#2-security)
  - [3. Reliability](#3-reliability)
  - [4. Performance Efficiency](#4-performance-efficiency)
  - [5. Cost Optimization](#5-cost-optimization)
  - [6. Sustainability](#6-sustainability)
- [Well-Architected Review Process](#well-architected-review-process)
  - [1. Define Workload](#1-define-workload)
  - [2. Assess Against Pillars](#2-assess-against-pillars)
  - [3. Prioritize Improvements](#3-prioritize-improvements)
  - [4. Implement Changes](#4-implement-changes)
  - [5. Re-Review Regularly](#5-re-review-regularly)
- [Architecture Patterns by Pillar](#architecture-patterns-by-pillar)
  - [Operational Excellence Pattern](#operational-excellence-pattern)
  - [Security Pattern](#security-pattern)
  - [Reliability Pattern](#reliability-pattern)
  - [Performance Pattern](#performance-pattern)
  - [Cost Optimization Pattern](#cost-optimization-pattern)
- [Checklist by Pillar](#checklist-by-pillar)
  - [Operational Excellence](#operational-excellence)
  - [Security](#security)
  - [Reliability](#reliability)
  - [Performance Efficiency](#performance-efficiency)
  - [Cost Optimization](#cost-optimization)
  - [Sustainability](#sustainability)

## Six Pillars

### 1. Operational Excellence

**Design Principles:**
- Perform operations as code
- Make frequent, small, reversible changes
- Refine operations procedures frequently
- Anticipate failure
- Learn from all operational events

**Implementation:**

**Infrastructure as Code:**
- Use CDK, Terraform, or CloudFormation
- Version control all infrastructure
- Peer review changes via pull requests
- Automate deployments (CI/CD)

**Deployment Strategies:**
- Blue-green deployments (instant rollback)
- Canary releases (gradual traffic shift)
- Feature flags (runtime toggles)

**Observability:**
- CloudWatch Logs (structured JSON logging)
- CloudWatch Metrics (custom metrics)
- X-Ray (distributed tracing)
- CloudWatch Alarms (proactive alerts)

**Runbooks:**
- Document common operations
- Automate with Systems Manager Automation
- Test via GameDays
- Version control runbooks

---

### 2. Security

**Design Principles:**
- Implement strong identity foundation
- Enable traceability
- Apply security at all layers
- Automate security best practices
- Protect data in transit and at rest
- Keep people away from data
- Prepare for security events

**Implementation:**

**Identity:**
- Use IAM roles for applications
- Implement least privilege
- Enable MFA for privileged users
- Use AWS Organizations for multi-account governance

**Detection:**
- Enable CloudTrail (all regions)
- Enable VPC Flow Logs
- Enable GuardDuty (threat detection)
- Enable Security Hub (compliance)

**Protection:**
- Encrypt data at rest (KMS)
- Encrypt data in transit (TLS 1.2+)
- Use WAF for application layer protection
- Implement security groups and NACLs

**Incident Response:**
- Automate responses with EventBridge + Lambda
- Pre-deploy forensic tools
- Practice incident response via GameDays
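
An automated response usually starts with a small triage function that reads the event and picks an action. The sketch below assumes a GuardDuty finding delivered through EventBridge (`source` of `aws.guardduty`, with `severity`, `type`, and `resource.instanceDetails.instanceId` under `detail`). The action names and the 7.0 threshold are hypothetical, and the function only returns a plan rather than calling any AWS API.

```python
def plan_response(event: dict, min_severity: float = 7.0) -> dict:
    """Decide what to do with a GuardDuty finding delivered by EventBridge."""
    if event.get("source") != "aws.guardduty":
        return {"action": "ignore", "reason": "not a GuardDuty event"}

    detail = event.get("detail", {})
    severity = float(detail.get("severity", 0))
    instance_id = (
        detail.get("resource", {}).get("instanceDetails", {}).get("instanceId")
    )

    if severity >= min_severity and instance_id:
        # A real handler would act here, e.g. swap in an isolation security group.
        return {"action": "isolate-instance", "instance_id": instance_id}
    if severity >= min_severity:
        return {"action": "page-on-call", "finding_type": detail.get("type")}
    return {"action": "log-only", "severity": severity}
```

Separating the decision from the side effects keeps the logic easy to exercise during GameDays without touching production resources.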

---

### 3. Reliability

**Design Principles:**
- Automatically recover from failure
- Test recovery procedures
- Scale horizontally for resilience
- Stop guessing capacity
- Manage change via automation

**Implementation:**

**Multi-AZ Architecture:**
- RDS Multi-AZ (automatic failover)
- Aurora replicas across AZs
- ECS/EKS tasks distributed across AZs
- ALB/NLB across multiple AZs

**Auto-Scaling:**
- EC2 Auto Scaling Groups
- ECS Service Auto Scaling
- DynamoDB Auto Scaling
- Application Auto Scaling

**Backup and Recovery:**
- RDS automated backups (retention configurable up to 35 days)
- S3 versioning and replication
- EBS snapshots (automated lifecycle)
- Cross-region backups

**Chaos Engineering:**
- AWS Fault Injection Service (FIS, formerly Fault Injection Simulator)
- Test failure scenarios regularly
- Validate recovery procedures

**Change Management:**
- Infrastructure as code (no manual changes)
- Blue-green deployments
- Automated testing (unit, integration, e2e)

---

### 4. Performance Efficiency

**Design Principles:**
- Democratize advanced technologies
- Go global in minutes
- Use serverless architectures
- Experiment more often
- Consider mechanical sympathy

**Implementation:**

**Compute Optimization:**
- Use Compute Optimizer for rightsizing
- Lambda for event-driven workloads
- Fargate for containers (no EC2 management)
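- Graviton processors (AWS cites up to 25% better compute performance for Graviton3 over Graviton2, and up to 60% less energy for the same performance than comparable EC2 instances)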

**Storage Optimization:**
- S3 Intelligent-Tiering (auto-optimize)
- EBS gp3 (up to 20% lower price per GB than gp2)
- EFS Intelligent-Tiering (Infrequent Access storage costs up to 92% less than EFS Standard)

**Database Optimization:**
- Use RDS Performance Insights
- Implement read replicas (offload reads)
- Use DynamoDB DAX (microsecond latency)
- ElastiCache for caching

**Caching Strategy:**
```
User → CloudFront (static content)
    → API Gateway (API response caching)
      → Lambda
        → DAX (DynamoDB cache)
          → DynamoDB
```
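
Each caching layer in the chain above follows the same cache-aside idea: answer from the cache when the entry is fresh, otherwise load from the layer below and store the result. The sketch below shows that logic with an in-memory dictionary standing in for DAX or ElastiCache; `TtlCache` and its parameters are illustrative names, not part of any AWS SDK.

```python
import time
from typing import Callable


class TtlCache:
    """Cache-aside helper with a fixed time-to-live per entry."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key: str, loader: Callable[[str], object]) -> object:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the backing store is not touched
        value = loader(key)  # cache miss or expired: read from the backing store
        self._entries[key] = (now + self._ttl, value)
        return value


cache = TtlCache(ttl_seconds=60)
# The lambda stands in for a database read.
print(cache.get_or_load("item-1", lambda key: {"id": key}))
```

The TTL is the main tuning knob: longer values reduce load on the database but serve staler data.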

**Global Delivery:**
- CloudFront for static content
- Global Accelerator for TCP/UDP
- Route 53 latency-based routing
- Aurora Global Database (<1s replication)

---

### 5. Cost Optimization

**Design Principles:**
- Implement cloud financial management
- Adopt a consumption model
- Measure overall efficiency
- Stop spending on undifferentiated heavy lifting
- Analyze and attribute expenditure

**Implementation:**

**Right-Sizing:**
- Use Compute Optimizer recommendations
- Monitor CloudWatch metrics (CPU, memory)
- Start small, scale based on data
- Automate scaling (don't over-provision)

**Pricing Models:**

| Model | Commitment | Typical Savings vs. On-Demand | Best For |
|-------|------------|-------------------------------|----------|
| On-Demand | None | 0% | Variable workloads |
| Savings Plans | 1 or 3 years | 30-40% | Flexible compute commitment |
| Reserved Instances | 1 or 3 years | 30-60% | Predictable, specific instances |
| Spot Instances | None | 60-90% | Fault-tolerant, flexible workloads |

**Storage Optimization:**
- S3 Intelligent-Tiering (auto-optimize to cheapest tier)
- S3 Lifecycle policies (transition to Glacier)
- EBS gp3 (up to 20% lower price per GB than gp2)
- Delete unused EBS snapshots
- Archive rarely accessed snapshots with EBS Snapshots Archive (up to 75% cheaper storage)
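
A lifecycle policy is a list of rules, each with a filter and one or more transitions. The sketch below builds a rule in the dictionary shape that the S3 lifecycle configuration API accepts (`ID`, `Filter`, `Status`, `Transitions`). The `lifecycle_rule` helper, the `logs/` prefix, and the 90-day threshold are assumptions, and applying the configuration to a bucket is left out.

```python
def lifecycle_rule(prefix: str, glacier_after_days: int) -> dict:
    """Build one S3 lifecycle rule that moves objects under `prefix` to Glacier."""
    if glacier_after_days < 0:
        raise ValueError("glacier_after_days must not be negative")
    return {
        "ID": f"archive-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": glacier_after_days, "StorageClass": "GLACIER"}],
    }


# Hypothetical policy: archive everything under logs/ after 90 days.
lifecycle_configuration = {"Rules": [lifecycle_rule("logs/", 90)]}
print(lifecycle_configuration)
```

Scoping rules by prefix keeps frequently read data in S3 Standard while older data moves to cheaper storage automatically.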

**Monitoring:**
- AWS Cost Explorer (visualize spending)
- AWS Budgets (set alerts)
- Cost Allocation Tags (attribute costs)
- Trusted Advisor (cost optimization checks)

**Example Cost Optimization:**

```
Before:
- 10 m5.large on-demand 24/7 = $700/month
- S3 Standard for all data (100TB) = $2,300/month
- Total: $3,000/month

After:
- 5 m5.large Reserved (baseline) = $175/month (50% off $350 on-demand)
- 5 m5.large Spot (variable) = $35/month (90% off $350 on-demand)
- S3 Intelligent-Tiering (100TB) = $845/month (63% savings)
- Total: $1,055/month (65% savings)
```
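
The arithmetic behind an estimate like this can be sketched in a few lines. The sketch assumes the $70 per instance-month on-demand rate implied by the "Before" figures and measures each discount against the on-demand cost of the same five instances. The helper names and the 50% and 90% discount levels are illustrative, not quoted AWS prices.

```python
def monthly_cost(count: int, on_demand_rate: float, discount: float = 0.0) -> float:
    """Monthly cost for `count` units at a fractional discount off on-demand."""
    return round(count * on_demand_rate * (1 - discount), 2)


def savings_pct(before: float, after: float) -> float:
    """Percentage saved going from `before` to `after`."""
    return round(100 * (before - after) / before, 1)


M5_LARGE = 70.0         # assumed on-demand $ per instance-month
S3_STANDARD = 2300.0    # assumed S3 Standard cost for 100 TB
S3_INTELLIGENT = 845.0  # assumed S3 Intelligent-Tiering cost for 100 TB

before = monthly_cost(10, M5_LARGE) + S3_STANDARD
after = (
    monthly_cost(5, M5_LARGE, discount=0.50)    # Reserved baseline
    + monthly_cost(5, M5_LARGE, discount=0.90)  # Spot for variable load
    + S3_INTELLIGENT
)
print(before, after, savings_pct(before, after))
```

Comparing each line item against its own on-demand baseline avoids overstating or understating the blended savings.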

---

### 6. Sustainability

**Design Principles:**
- Understand your impact
- Establish sustainability goals
- Maximize utilization
- Anticipate and adopt more efficient hardware
- Use managed services
- Reduce downstream impact

**Implementation:**

**Energy-Efficient Compute:**
- Use Graviton3 instances (up to 60% less energy for the same performance than comparable EC2 instances)
- Lambda (pay per request, no idle)
- Fargate (no EC2 overhead)

**Region Selection:**
- Choose regions with renewable energy
- AWS publishes carbon footprint reports
- Example: AWS has reported US West (Oregon) as powered by 100% renewable energy; check AWS's current sustainability data before relying on this

**Storage Efficiency:**
- Delete unused data
- Compress data
- Use appropriate storage tiers
- S3 Intelligent-Tiering (auto-optimize)

**Software Optimization:**
- Optimize code for performance (less CPU = less energy)
- Async processing (batch operations)
- Minimize data transfer (use caching, edge locations)

**Measure Impact:**
- Customer Carbon Footprint Tool (in AWS Billing Console)
- Track carbon emissions per service
- Set reduction goals

---

## Well-Architected Review Process

### 1. Define Workload

- Application name and purpose
- Architecture diagram
- Traffic patterns
- Compliance requirements

### 2. Assess Against Pillars

Use the AWS Well-Architected Tool (free):
- Answer questions per pillar
- Identify high and medium risk issues
- Generate improvement plan

### 3. Prioritize Improvements

**Risk Levels:**
- High Risk (HRI): Address immediately
- Medium Risk (MRI): Plan to address
- None: No issues identified

**Example Findings:**

| Pillar | Issue | Risk | Recommendation |
|--------|-------|------|----------------|
| Reliability | Single AZ deployment | HRI | Deploy Multi-AZ |
| Security | No CloudTrail | HRI | Enable CloudTrail |
| Cost | On-demand only | MRI | Purchase Reserved Instances |
| Performance | No caching | MRI | Add CloudFront, ElastiCache |
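
Prioritizing findings amounts to sorting them by risk level so that HRIs are handled first. A minimal sketch using the example findings from the table follows; the `Finding` class and `RISK_ORDER` mapping are illustrative and are not part of the Well-Architected Tool.

```python
from dataclasses import dataclass

# Lower number means higher priority.
RISK_ORDER = {"HRI": 0, "MRI": 1, "None": 2}


@dataclass(frozen=True)
class Finding:
    pillar: str
    issue: str
    risk: str
    recommendation: str


def prioritize(findings: list) -> list:
    """Sort findings by risk; sorted() is stable, so ties keep their input order."""
    return sorted(findings, key=lambda f: RISK_ORDER[f.risk])


findings = [
    Finding("Cost", "On-demand only", "MRI", "Purchase Reserved Instances"),
    Finding("Reliability", "Single AZ deployment", "HRI", "Deploy Multi-AZ"),
    Finding("Performance", "No caching", "MRI", "Add CloudFront, ElastiCache"),
    Finding("Security", "No CloudTrail", "HRI", "Enable CloudTrail"),
]

for finding in prioritize(findings):
    print(f"[{finding.risk}] {finding.pillar}: {finding.recommendation}")
```

The sorted list maps directly onto the ticket backlog described in the next step.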

### 4. Implement Changes

- Create tickets for each improvement
- Use infrastructure as code
- Test changes in non-production
- Measure impact

### 5. Re-Review Regularly

- Quarterly reviews for production workloads
- After major architecture changes
- Before significant events (sales, launches)

---

## Architecture Patterns by Pillar

### Operational Excellence Pattern

```
GitHub Repository (IaC)
  → GitHub Actions (CI/CD)
    → CDK Deploy
      → CloudFormation Stack
        → Infrastructure
          → CloudWatch Logs/Metrics/Alarms
            → SNS Notifications
              → On-call rotation
```

### Security Pattern

```
User Request
  → WAF (block threats)
    → CloudFront (DDoS protection)
      → ALB (TLS termination)
        → ECS Tasks (app in private subnet)
          → RDS (encrypted, private subnet)
          → Secrets Manager (credentials)
          → CloudTrail (audit logs)
          → GuardDuty (threat detection)
```

### Reliability Pattern

```
Multi-Region Active-Active (reads served in both regions; writes go to the Aurora primary in Region A):

Region A (Primary):
  Route 53 (latency-based)
    → CloudFront
      → ALB (3 AZs)
        → ECS Fargate (auto-scaling)
          → Aurora Global (primary)

Region B (Secondary):
  Route 53 (latency-based)
    → CloudFront
      → ALB (3 AZs)
        → ECS Fargate (auto-scaling)
          → Aurora Global (read-only)
```

### Performance Pattern

```
User Request
  → Route 53 (latency-based routing)
    → CloudFront (edge caching)
      → API Gateway (API caching)
        → Lambda (Provisioned Concurrency)
          → DAX (DynamoDB cache)
            → DynamoDB Global Tables
```

### Cost Optimization Pattern

```
Compute:
  - Baseline: Reserved Instances (predictable)
  - Variable: Spot Instances (fault-tolerant tasks)
  - Serverless: Lambda (event-driven)

Storage:
  - Hot: S3 Standard (frequent access)
  - Warm: S3 Standard-IA (infrequent)
  - Cold: S3 Glacier (archive)
  - Auto: S3 Intelligent-Tiering (unknown)

Database:
  - Production: Aurora (performance + HA)
  - Dev/Test: Aurora Serverless v2 (pay per use)
  - Cache: ElastiCache (reduce DB load)
```

---

## Checklist by Pillar

### Operational Excellence
- [ ] Infrastructure as code (CDK/Terraform)
- [ ] CI/CD pipeline automated
- [ ] Structured logging (JSON)
- [ ] Custom CloudWatch metrics
- [ ] Alarms configured with SNS
- [ ] Runbooks documented
- [ ] Disaster recovery tested

### Security
- [ ] IAM roles (no hardcoded credentials)
- [ ] MFA enabled for privileged users
- [ ] CloudTrail enabled (all regions)
- [ ] VPC Flow Logs enabled
- [ ] GuardDuty enabled
- [ ] Encryption at rest (all services)
- [ ] TLS 1.2+ in transit
- [ ] Secrets Manager for credentials
- [ ] Security groups follow least privilege

### Reliability
- [ ] Multi-AZ deployments
- [ ] Auto-scaling configured
- [ ] Automated backups enabled
- [ ] Cross-region backups (critical data)
- [ ] Health checks on load balancers
- [ ] RDS Multi-AZ or Aurora
- [ ] Route 53 health checks
- [ ] Chaos engineering tested

### Performance Efficiency
- [ ] Right-sized instances (Compute Optimizer)
- [ ] Caching implemented (CloudFront, ElastiCache)
- [ ] CDN for static content
- [ ] Read replicas for databases
- [ ] Asynchronous processing where applicable
- [ ] Monitoring and alerting active

### Cost Optimization
- [ ] Reserved Instances or Savings Plans
- [ ] Spot Instances for fault-tolerant workloads
- [ ] S3 lifecycle policies configured
- [ ] Unused resources deleted (EBS, snapshots)
- [ ] Cost allocation tags applied
- [ ] AWS Budgets configured
- [ ] Rightsizing recommendations reviewed monthly

### Sustainability
- [ ] Graviton instances where supported
- [ ] Renewable energy regions preferred
- [ ] S3 Intelligent-Tiering enabled
- [ ] Lambda for event-driven (no idle)
- [ ] Auto-scaling to match demand
- [ ] Carbon footprint monitored

```

### examples/cloudformation/lambda-api.yaml

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Production-ready Lambda + API Gateway + DynamoDB REST API'

Parameters:
  Stage:
    Type: String
    Default: prod
    AllowedValues:
      - dev
      - staging
      - prod
    Description: Deployment stage

  DomainName:
    Type: String
    Default: ''
    Description: Custom domain name for API (optional, e.g., api.example.com)

  CertificateArn:
    Type: String
    Default: ''
    Description: ACM certificate ARN for custom domain (required if DomainName is specified)

Conditions:
  HasCustomDomain: !Not [!Equals [!Ref DomainName, '']]

Resources:
  # DynamoDB Table
  ItemsTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: !Sub '${AWS::StackName}-items'
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: itemId
          AttributeType: S
        - AttributeName: userId
          AttributeType: S
        - AttributeName: createdAt
          AttributeType: N
      KeySchema:
        - AttributeName: itemId
          KeyType: HASH
      GlobalSecondaryIndexes:
        - IndexName: UserIdIndex
          KeySchema:
            - AttributeName: userId
              KeyType: HASH
            - AttributeName: createdAt
              KeyType: RANGE
          Projection:
            ProjectionType: ALL
      StreamSpecification:
        StreamViewType: NEW_AND_OLD_IMAGES
      PointInTimeRecoverySpecification:
        PointInTimeRecoveryEnabled: true
      SSESpecification:
        SSEEnabled: true
      Tags:
        - Key: Environment
          Value: !Ref Stage

  # Lambda Execution Role
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: DynamoDBAccess
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - dynamodb:GetItem
                  - dynamodb:PutItem
                  - dynamodb:UpdateItem
                  - dynamodb:DeleteItem
                  - dynamodb:Query
                  - dynamodb:Scan
                Resource:
                  - !GetAtt ItemsTable.Arn
                  - !Sub '${ItemsTable.Arn}/index/*'

  # Lambda Function
  ApiFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: !Sub '${AWS::StackName}-api'
      Runtime: python3.12
      Handler: index.lambda_handler
      Role: !GetAtt LambdaExecutionRole.Arn
      Timeout: 30
      MemorySize: 1024
      Environment:
        Variables:
          TABLE_NAME: !Ref ItemsTable
          STAGE: !Ref Stage
      Code:
        ZipFile: |
          import json
          import os
          import boto3
          import uuid
          from datetime import datetime
          from decimal import Decimal

          dynamodb = boto3.resource('dynamodb')
          table = dynamodb.Table(os.environ['TABLE_NAME'])

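          class DecimalEncoder(json.JSONEncoder):
              """Serialize DynamoDB Decimal values as JSON numbers."""
              def default(self, obj):
                  if isinstance(obj, Decimal):
                      # Keep whole numbers (e.g. timestamps) as ints, not 1700000000.0
                      return int(obj) if obj % 1 == 0 else float(obj)
                  return super().default(obj)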

          def cors_headers():
              return {
                  'Content-Type': 'application/json',
                  'Access-Control-Allow-Origin': '*',
                  'Access-Control-Allow-Methods': 'GET,POST,PUT,DELETE,OPTIONS',
                  'Access-Control-Allow-Headers': 'Content-Type,Authorization'
              }

          def response(status_code, body):
              return {
                  'statusCode': status_code,
                  'headers': cors_headers(),
                  'body': json.dumps(body, cls=DecimalEncoder)
              }

          def lambda_handler(event, context):
              try:
                  request_context = event.get('requestContext') or {}
                  http_method = (request_context.get('http') or {}).get('method')
                  # Route on routeKey (e.g. "GET /items/{id}"). Unlike rawPath, it does
                  # not carry the stage prefix that a named stage such as /prod adds.
                  route_key = event.get('routeKey', '')
                  path_params = event.get('pathParameters') or {}
                  query_params = event.get('queryStringParameters') or {}
                  item_id = path_params.get('id')

                  # CORS preflight
                  if http_method == 'OPTIONS':
                      return response(200, {'message': 'OK'})

                  # Parse body for POST/PUT
                  body = {}
                  if event.get('body'):
                      try:
                          body = json.loads(event['body'])
                      except json.JSONDecodeError:
                          return response(400, {'error': 'Request body must be valid JSON'})

                  # Route requests
                  if route_key == 'GET /items':
                      return get_items(query_params)
                  elif route_key == 'POST /items':
                      return create_item(body)
                  elif route_key == 'GET /items/{id}':
                      return get_item(item_id)
                  elif route_key == 'PUT /items/{id}':
                      return update_item(item_id, body)
                  elif route_key == 'DELETE /items/{id}':
                      return delete_item(item_id)
                  elif route_key == 'GET /health':
                      return response(200, {'status': 'healthy'})
                  else:
                      return response(404, {'error': 'Not found'})

              except Exception as e:
                  print(f"Error: {str(e)}")
                  return response(500, {'error': 'Internal server error'})

          def get_items(query_params):
              """List all items, optionally filtered by userId"""
              try:
                  user_id = query_params.get('userId')

                  if user_id:
                      # Query by GSI
                      result = table.query(
                          IndexName='UserIdIndex',
                          KeyConditionExpression='userId = :userId',
                          ExpressionAttributeValues={':userId': user_id},
                          ScanIndexForward=False,
                          Limit=100
                      )
                  else:
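                      # Scan the first page only (at most 100 items evaluated); use
                      # LastEvaluatedKey to paginate. Scans read the whole table and get costly.
                      result = table.scan(Limit=100)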

                  return response(200, {
                      'items': result.get('Items', []),
                      'count': len(result.get('Items', []))
                  })
              except Exception as e:
                  print(f"Error getting items: {str(e)}")
                  return response(500, {'error': 'Failed to get items'})

          def get_item(item_id):
              """Get single item by ID"""
              try:
                  if not item_id:
                      return response(400, {'error': 'Missing itemId'})

                  result = table.get_item(Key={'itemId': item_id})

                  if 'Item' not in result:
                      return response(404, {'error': 'Item not found'})

                  return response(200, result['Item'])
              except Exception as e:
                  print(f"Error getting item: {str(e)}")
                  return response(500, {'error': 'Failed to get item'})

          def create_item(body):
              """Create new item"""
              try:
                  if not body.get('userId') or not body.get('name'):
                      return response(400, {'error': 'Missing required fields: userId, name'})

                  item_id = str(uuid.uuid4())
                  # datetime.utcnow() is deprecated in Python 3.12; now().timestamp() yields epoch seconds
                  timestamp = int(datetime.now().timestamp())

                  item = {
                      'itemId': item_id,
                      'userId': body['userId'],
                      'name': body['name'],
                      'description': body.get('description', ''),
                      'createdAt': timestamp,
                      'updatedAt': timestamp
                  }

                  table.put_item(Item=item)

                  return response(201, item)
              except Exception as e:
                  print(f"Error creating item: {str(e)}")
                  return response(500, {'error': 'Failed to create item'})

          def update_item(item_id, body):
              """Update existing item"""
              try:
                  if not item_id:
                      return response(400, {'error': 'Missing itemId'})

                  # Check if item exists
                  existing = table.get_item(Key={'itemId': item_id})
                  if 'Item' not in existing:
                      return response(404, {'error': 'Item not found'})

                  # datetime.utcnow() is deprecated in Python 3.12; now().timestamp() yields epoch seconds
                  timestamp = int(datetime.now().timestamp())

                  # Build update expression
                  update_expr = 'SET updatedAt = :updatedAt'
                  expr_values = {':updatedAt': timestamp}

                  if 'name' in body:
                      update_expr += ', #name = :name'
                      expr_values[':name'] = body['name']

                  if 'description' in body:
                      update_expr += ', description = :description'
                      expr_values[':description'] = body['description']

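                  update_kwargs = {
                      'Key': {'itemId': item_id},
                      'UpdateExpression': update_expr,
                      'ExpressionAttributeValues': expr_values,
                      'ReturnValues': 'ALL_NEW'
                  }
                  # boto3 rejects None for ExpressionAttributeNames, so only
                  # pass it when the reserved word "name" is actually aliased.
                  if 'name' in body:
                      update_kwargs['ExpressionAttributeNames'] = {'#name': 'name'}

                  result = table.update_item(**update_kwargs)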

                  return response(200, result['Attributes'])
              except Exception as e:
                  print(f"Error updating item: {str(e)}")
                  return response(500, {'error': 'Failed to update item'})

          def delete_item(item_id):
              """Delete item"""
              try:
                  if not item_id:
                      return response(400, {'error': 'Missing itemId'})

                  # Check if item exists
                  existing = table.get_item(Key={'itemId': item_id})
                  if 'Item' not in existing:
                      return response(404, {'error': 'Item not found'})

                  table.delete_item(Key={'itemId': item_id})

                  return response(200, {'message': 'Item deleted successfully'})
              except Exception as e:
                  print(f"Error deleting item: {str(e)}")
                  return response(500, {'error': 'Failed to delete item'})
      Tags:
        - Key: Environment
          Value: !Ref Stage

  # Lambda Log Group
  ApiLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub '/aws/lambda/${ApiFunction}'
      RetentionInDays: 7

  # HTTP API Gateway
  HttpApi:
    Type: AWS::ApiGatewayV2::Api
    Properties:
      Name: !Sub '${AWS::StackName}-api'
      ProtocolType: HTTP
      CorsConfiguration:
        AllowOrigins:
          - '*'
        AllowMethods:
          - GET
          - POST
          - PUT
          - DELETE
          - OPTIONS
        AllowHeaders:
          - Content-Type
          - Authorization
        MaxAge: 300

  # API Stage
  ApiStage:
    Type: AWS::ApiGatewayV2::Stage
    Properties:
      ApiId: !Ref HttpApi
      StageName: !Ref Stage
      AutoDeploy: true
      AccessLogSettings:
        DestinationArn: !GetAtt ApiAccessLogGroup.Arn
        Format: '{"requestId":"$context.requestId","ip":"$context.identity.sourceIp","routeKey":"$context.routeKey","status":"$context.status","error":"$context.error.message"}'
      DefaultRouteSettings:
        ThrottlingBurstLimit: 100
        ThrottlingRateLimit: 50

  # API Access Logs
  ApiAccessLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub '/aws/apigateway/${AWS::StackName}'
      RetentionInDays: 7

  # Lambda Integration
  LambdaIntegration:
    Type: AWS::ApiGatewayV2::Integration
    Properties:
      ApiId: !Ref HttpApi
      IntegrationType: AWS_PROXY
      IntegrationUri: !GetAtt ApiFunction.Arn
      PayloadFormatVersion: '2.0'

  # API Routes
  HealthRoute:
    Type: AWS::ApiGatewayV2::Route
    Properties:
      ApiId: !Ref HttpApi
      RouteKey: 'GET /health'
      Target: !Sub 'integrations/${LambdaIntegration}'

  GetItemsRoute:
    Type: AWS::ApiGatewayV2::Route
    Properties:
      ApiId: !Ref HttpApi
      RouteKey: 'GET /items'
      Target: !Sub 'integrations/${LambdaIntegration}'

  CreateItemRoute:
    Type: AWS::ApiGatewayV2::Route
    Properties:
      ApiId: !Ref HttpApi
      RouteKey: 'POST /items'
      Target: !Sub 'integrations/${LambdaIntegration}'

  GetItemRoute:
    Type: AWS::ApiGatewayV2::Route
    Properties:
      ApiId: !Ref HttpApi
      RouteKey: 'GET /items/{id}'
      Target: !Sub 'integrations/${LambdaIntegration}'

  UpdateItemRoute:
    Type: AWS::ApiGatewayV2::Route
    Properties:
      ApiId: !Ref HttpApi
      RouteKey: 'PUT /items/{id}'
      Target: !Sub 'integrations/${LambdaIntegration}'

  DeleteItemRoute:
    Type: AWS::ApiGatewayV2::Route
    Properties:
      ApiId: !Ref HttpApi
      RouteKey: 'DELETE /items/{id}'
      Target: !Sub 'integrations/${LambdaIntegration}'

  OptionsRoute:
    Type: AWS::ApiGatewayV2::Route
    Properties:
      ApiId: !Ref HttpApi
      RouteKey: 'OPTIONS /{proxy+}'
      Target: !Sub 'integrations/${LambdaIntegration}'

  # Lambda Permission for API Gateway
  ApiGatewayInvokePermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref ApiFunction
      Action: lambda:InvokeFunction
      Principal: apigateway.amazonaws.com
      SourceArn: !Sub 'arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:${HttpApi}/*'

  # Custom Domain (Optional)
  CustomDomain:
    Type: AWS::ApiGatewayV2::DomainName
    Condition: HasCustomDomain
    Properties:
      DomainName: !Ref DomainName
      DomainNameConfigurations:
        - CertificateArn: !Ref CertificateArn
          EndpointType: REGIONAL

  ApiMapping:
    Type: AWS::ApiGatewayV2::ApiMapping
    Condition: HasCustomDomain
    Properties:
      ApiId: !Ref HttpApi
      DomainName: !Ref CustomDomain
      Stage: !Ref ApiStage

  # CloudWatch Alarm for Errors
  ApiErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub '${AWS::StackName}-api-errors'
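      AlarmDescription: Lambda invocation errors exceeded 10 per minute for 2 consecutive minutes (exceptions the handler catches and returns as HTTP 500 are not counted)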
      MetricName: Errors
      Namespace: AWS/Lambda
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 10
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: FunctionName
          Value: !Ref ApiFunction

  # CloudWatch Alarm for Throttling
  ApiThrottleAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub '${AWS::StackName}-api-throttles'
      AlarmDescription: Alert when Lambda invocations are throttled (concurrency limit reached)
      MetricName: Throttles
      Namespace: AWS/Lambda
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: FunctionName
          Value: !Ref ApiFunction

Outputs:
  ApiEndpoint:
    Description: HTTP API Gateway endpoint URL
    Value: !Sub 'https://${HttpApi}.execute-api.${AWS::Region}.amazonaws.com/${Stage}'
    Export:
      Name: !Sub '${AWS::StackName}-ApiEndpoint'

  CustomDomainEndpoint:
    Description: Custom domain endpoint (if configured)
    Condition: HasCustomDomain
    Value: !Sub 'https://${DomainName}'

  DynamoDBTableName:
    Description: DynamoDB table name
    Value: !Ref ItemsTable
    Export:
      Name: !Sub '${AWS::StackName}-TableName'

  LambdaFunctionArn:
    Description: Lambda function ARN
    Value: !GetAtt ApiFunction.Arn
    Export:
      Name: !Sub '${AWS::StackName}-FunctionArn'

  ApiId:
    Description: HTTP API Gateway ID
    Value: !Ref HttpApi
    Export:
      Name: !Sub '${AWS::StackName}-ApiId'

```
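
To exercise a handler like the one in this template without deploying, it helps to build events shaped like the HTTP API payload format 2.0. The sketch below is a minimal, hypothetical test helper: `make_event` and `split_route_key` are illustrative names, only a few fields of the real event are included, and the stage prefix on `rawPath` for named stages is an assumption worth verifying against a deployed API.

```python
from typing import Optional


def make_event(method: str, route: str, path: str, stage: str = "prod",
               path_params: Optional[dict] = None, body: Optional[str] = None) -> dict:
    """Build a minimal payload-format-2.0 style event for local handler tests."""
    # Assumption: named stages prefix the path, the $default stage does not.
    staged_path = path if stage == "$default" else f"/{stage}{path}"
    event = {
        "version": "2.0",
        "routeKey": f"{method} {route}",
        "rawPath": staged_path,
        "requestContext": {
            "stage": stage,
            "http": {"method": method, "path": staged_path},
        },
    }
    if path_params:
        event["pathParameters"] = path_params
    if body is not None:
        event["body"] = body
    return event


def split_route_key(event: dict) -> tuple:
    """Return (method, route template) from routeKey, e.g. ('GET', '/items/{id}')."""
    method, _, route = event.get("routeKey", "").partition(" ")
    return method, route


event = make_event("GET", "/items/{id}", "/items/42", path_params={"id": "42"})
print(event["rawPath"])        # /prod/items/42
print(split_route_key(event))  # ('GET', '/items/{id}')
```

Matching on the route template rather than the raw path keeps routing logic independent of the stage name or any custom domain mapping.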
