aws-lambda
Design, build, deploy, test, and debug serverless applications with AWS Lambda. Triggers on phrases like: Lambda function, event source, serverless application, API Gateway, EventBridge, Step Functions, serverless API, event-driven architecture, Lambda trigger. For deploying non-serverless apps to AWS, use deploy-on-aws plugin instead.
Packaged view
This page reorganizes the original catalog entry to put fit, installation, and workflow context first. The original raw source appears below.
Install command
npx @skill-hub/cli install awslabs-agent-plugins-aws-lambda
Repository
Skill path: plugins/aws-serverless/skills/aws-lambda
Best for
Primary workflow: Run DevOps.
Technical facets: Full Stack, Backend, DevOps, Designer, Testing, Integration.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: awslabs.
This is a mirrored public skill entry. Review the repository before installing it into production workflows.
What it helps with
- Install aws-lambda into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/awslabs/agent-plugins before adding aws-lambda to shared team environments
- Use aws-lambda for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: aws-lambda
description: "Design, build, deploy, test, and debug serverless applications with AWS Lambda. Triggers on phrases like: Lambda function, event source, serverless application, API Gateway, EventBridge, Step Functions, serverless API, event-driven architecture, Lambda trigger. For deploying non-serverless apps to AWS, use deploy-on-aws plugin instead."
argument-hint: "[what are you building?]"
---
# AWS Lambda Serverless Development
Design, build, deploy, and debug serverless applications with AWS serverless services. This skill provides access to serverless development guidance through the AWS Serverless MCP Server, helping you to build production-ready serverless applications with best practices built-in.
Use SAM CLI for project initialization and deployment, Lambda Web Adapter for web applications, or Event Source Mappings for event-driven architectures. AWS handles infrastructure provisioning, scaling, and monitoring automatically.
**Key capabilities:**
- **SAM CLI Integration**: Initialize, build, deploy, and test serverless applications
- **Web Application Deployment**: Deploy full-stack applications with Lambda Web Adapter
- **Event Source Mappings**: Configure Lambda triggers for DynamoDB, Kinesis, SQS, Kafka
- **Lambda durable functions**: Resilient multi-step applications with checkpointing — see the [durable-functions skill](../aws-lambda-durable-functions/) for guidance
- **Schema Management**: Type-safe EventBridge integration with schema registry
- **Observability**: CloudWatch logs, metrics, and X-Ray tracing
- **Performance Optimization**: Right-sizing, cost optimization, and troubleshooting
## When to Load Reference Files
Load the appropriate reference file based on what the user is working on:
- **Getting started**, **what to build**, **project type decision**, or **working with existing projects** -> see [references/getting-started.md](references/getting-started.md)
- **SAM**, **CDK**, **deployment**, **IaC templates**, **CDK constructs**, or **CI/CD pipelines** -> see the [aws-serverless-deployment skill](../aws-serverless-deployment/) (separate skill in this plugin)
- **Web app deployment**, **Lambda Web Adapter**, **API endpoints**, **CORS**, **authentication**, **custom domains**, or **sam local start-api** -> see [references/web-app-deployment.md](references/web-app-deployment.md)
- **Event sources**, **DynamoDB Streams**, **Kinesis**, **SQS**, **Kafka**, **S3 notifications**, or **SNS** -> see [references/event-sources.md](references/event-sources.md)
- **EventBridge**, **event bus**, **event patterns**, **event design**, **Pipes**, or **schema registry** -> see [references/event-driven-architecture.md](references/event-driven-architecture.md)
- **Durable functions**, **checkpointing**, **replay model**, **saga pattern**, or **long-running Lambda workflows** -> see the [durable-functions skill](../aws-lambda-durable-functions/) (separate skill in this plugin with full SDK reference, testing, and deployment guides)
- **Orchestration**, **workflows**, or **Durable Functions vs Step Functions** -> see [references/orchestration-and-workflows.md](references/orchestration-and-workflows.md)
- **Step Functions**, **ASL**, **state machines**, **JSONata**, **Distributed Map**, or **SDK integrations** -> see [references/step-functions.md](references/step-functions.md)
- **Step Functions testing**, **TestState API**, **mocking service integrations**, or **state machine unit tests** -> see [references/step-functions-testing.md](references/step-functions-testing.md)
- **Observability**, **logging**, **tracing**, **metrics**, **alarms**, or **dashboards** -> see [references/observability.md](references/observability.md)
- **Optimization**, **cold starts**, **memory tuning**, **cost**, or **streaming** -> see [references/optimization.md](references/optimization.md)
- **Powertools**, **idempotency**, **feature flags**, **parameters**, **parser**, **batch processing**, or **data masking** -> see [references/powertools.md](references/powertools.md)
- **Troubleshooting**, **errors**, **debugging**, or **deployment failures** -> see [references/troubleshooting.md](references/troubleshooting.md)
## Best Practices
### Project Setup
- Do: Use `sam_init` or `cdk init` with an appropriate template for your use case
- Do: Set global defaults for timeout, memory, runtime, and tracing (`Globals` in SAM, construct props in CDK)
- Do: Use AWS Lambda Powertools for structured logging, tracing, metrics (EMF), idempotency, and batch processing — available for Python, TypeScript, Java, and .NET
- Don't: Copy-paste templates from the internet without understanding the resource configuration
- Don't: Use the same memory and timeout values for all functions regardless of workload
### Security
- Do: Follow least-privilege IAM policies scoped to specific resources and actions
- Do: Use `secure_esm_*` tools to generate correct IAM policies for event source mappings
- Do: Store secrets in AWS Secrets Manager or SSM Parameter Store, never in environment variables
- Do: Use VPC endpoints instead of NAT Gateways for AWS service access when possible
- Do: Enable Amazon GuardDuty Lambda Protection to monitor function network activity for threats (cryptocurrency mining, data exfiltration, C2 callbacks)
- Don't: Use wildcard (`*`) resource ARNs or actions in IAM policies
- Don't: Hardcode credentials or secrets in application code or templates
- Don't: Store user data or sensitive information in module-level variables — execution environments can be reused across different callers
### Idempotency
- Do: Write idempotent function code — Lambda delivers events **at least once**, so duplicate invocations must be safe
- Do: Use the AWS Lambda Powertools Idempotency utility (backed by DynamoDB) for critical operations
- Do: Validate and deduplicate events at the start of the handler before performing side effects
- Don't: Assume an event will only ever be processed once
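The deduplication pattern can be sketched in a few lines. This uses an in-memory store purely for illustration; in production use the Powertools Idempotency utility backed by DynamoDB, since execution environments are recycled and do not share memory:

```python
import hashlib
import json

# Illustration only: in production this would be a DynamoDB table written
# with a conditional put, not a module-level dict.
_processed: dict = {}

def idempotency_key(event: dict) -> str:
    """Derive a stable key from the parts of the event that define the operation."""
    payload = json.dumps(event, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def handler(event: dict, context=None) -> dict:
    key = idempotency_key(event)
    if key in _processed:
        # Duplicate delivery: return the saved result, skip side effects
        return _processed[key]
    result = {"orderId": event["orderId"], "status": "processed"}  # side effects here
    _processed[key] = result
    return result
```

The key point is that the duplicate check happens before any side effect, so an at-least-once delivery of the same event is harmless.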
For topic-specific best practices, see the dedicated guide files in the reference table above.
## Lambda Limits Quick Reference
Limits that developers commonly hit:
| Resource | Limit |
| -------------------------------------------- | ----------------------------------- |
| Function timeout | 900 seconds (15 minutes) |
| Memory | 128 MB – 10,240 MB |
| 1 vCPU equivalent | 1,769 MB memory |
| Synchronous payload (request + response) | 6 MB each |
| Async invocation payload                     | 256 KB                              |
| Streamed response | 200 MB |
| Deployment package (.zip, uncompressed) | 250 MB |
| Deployment package (.zip upload, compressed) | 50 MB |
| Container image | 10 GB |
| Layers per function | 5 |
| Environment variables (aggregate) | 4 KB |
| `/tmp` ephemeral storage | 512 MB – 10,240 MB |
| Account concurrent executions (default) | 1,000 (requestable increase) |
| Burst scaling rate | 1,000 new executions per 10 seconds |
Check Service Quotas for your account limits: `aws lambda get-account-settings`
## Troubleshooting Quick Reference
| Error | Cause | Solution |
| ----------------------------------- | ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| `Build Failed` | Missing dependencies | Run `sam_build` with `use_container: true` |
| `Stack is in ROLLBACK_COMPLETE` | Previous deploy failed | Delete stack with `aws cloudformation delete-stack`, redeploy |
| `IteratorAge` increasing | Stream consumer falling behind | Increase `ParallelizationFactor` and `BatchSize`. Use `esm_optimize` |
| EventBridge events silently dropped | No DLQ, retries exhausted | Add `RetryPolicy` + `DeadLetterConfig` to rule target |
| Step Functions failing silently | No retry on Task state | Add `Retry` with `Lambda.ServiceException`, `Lambda.AWSLambdaException` |
| Durable Function not resuming | Missing IAM permissions | Add `lambda:CheckpointDurableExecution` and `lambda:GetDurableExecutionState` — see [durable-functions skill](../aws-lambda-durable-functions/) |
For detailed troubleshooting, see [references/troubleshooting.md](references/troubleshooting.md).
## Configuration
### AWS CLI Setup
This skill requires that AWS credentials are configured on the host machine:
**Verify access**: Run `aws sts get-caller-identity` to confirm credentials are valid
### SAM CLI Setup
1. **Install SAM CLI**: Follow the [SAM CLI installation guide](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/install-sam-cli.html)
2. **Verify**: Run `sam --version`
### Container Runtime Setup
1. **Install a Docker compatible container runtime**: Required for `sam_local_invoke` and container-based builds
2. **Verify**: Use an appropriate command such as `docker --version` or `finch --version`
### MCP Server Configuration
**Write access is enabled by default.** The plugin ships with `--allow-write` in `.mcp.json`, so the MCP server can create projects, generate IaC, and deploy on behalf of the user.
Access to sensitive data (like Lambda and API Gateway logs) is **not** enabled by default. To grant it, add `--allow-sensitive-data-access` to `.mcp.json`.
### SAM Template Validation Hook
This plugin includes a `PostToolUse` hook that runs `sam validate` automatically after any edit to `template.yaml` or `template.yml`. If validation fails, the error is returned as a system message so you can fix it immediately. The hook requires SAM CLI and `jq` to be installed; if either is missing, validation is skipped with a system message. Users can disable it via `/hooks`.
**Verify**: Run `jq --version`
## Language selection
Default: TypeScript
Override syntax:
- "use Python" → Generate Python code
- "use JavaScript" → Generate JavaScript code
When not specified, ALWAYS use TypeScript
## IaC framework selection
Default: CDK
Override syntax:
- "use CloudFormation" → Generate YAML templates
- "use SAM" → Generate YAML templates
When not specified, ALWAYS use CDK
### Serverless MCP Server Unavailable
- Inform user: "AWS Serverless MCP not responding"
- Ask: "Proceed without MCP support?"
- DO NOT continue without user confirmation
## Resources
- [AWS SAM Documentation](https://docs.aws.amazon.com/serverless-application-model/)
- [AWS Lambda Documentation](https://docs.aws.amazon.com/lambda/)
- [AWS Lambda Powertools](https://docs.aws.amazon.com/powertools/)
- [AWS CDK Documentation](https://docs.aws.amazon.com/cdk/)
- [AWS Serverless MCP Server](https://github.com/awslabs/mcp/tree/main/src/aws-serverless-mcp-server)
---
## Referenced Files
> The following files are referenced in this skill and included for context.
### references/getting-started.md
```markdown
# Getting Started with AWS Serverless Development
## Prerequisites
Verify these tools before proceeding:
```bash
aws --version # AWS CLI
aws sts get-caller-identity # Credentials configured
sam --version # SAM CLI
```
**Verify** that any Docker-compatible container runtime is installed (Docker, Finch, Podman, etc.). Use the appropriate command for your runtime (e.g., `finch --version`).
If `aws sts get-caller-identity` fails, ask user to set up credentials. If using CDK instead of SAM, also run `cdk --version` — see [cdk-project-setup.md](../../aws-serverless-deployment/references/cdk-project-setup.md).
## What Are You Building?
### REST/HTTP API
An API backend serving JSON over HTTPS — the most common serverless pattern.
**Quick start:**
- Template: `hello-world` (single function + API Gateway) or `quick-start-web` (web framework)
- Runtime: `nodejs22.x` or `python3.12`
- Architecture: `arm64`
**Read next:**
- [sam-project-setup.md](../../aws-serverless-deployment/references/sam-project-setup.md) — project scaffolding, deployment workflow, handler examples, container image packaging for large dependencies
- [web-app-deployment.md](web-app-deployment.md) — API endpoint selection (HTTP API vs REST API vs Function URL vs ALB), CORS, custom domains, authentication
### Full-Stack Web Application
A frontend (React, Vue, Angular, Next.js) with a backend API, deployed together.
**Quick start:**
- Template: `quick-start-web`
- Use `deploy_webapp` with `deployment_type: "fullstack"` for S3 + CloudFront + Lambda + API Gateway
**Read next:**
- [web-app-deployment.md](web-app-deployment.md) — Lambda Web Adapter, project structure, frontend updates
### Event Processor
A Lambda function triggered by a queue, stream, or database change — SQS, Kinesis, DynamoDB Streams, Kafka, or DocumentDB.
**Quick start:**
- Template: `hello-world` (then add an event source in `template.yaml`)
- Use `esm_guidance` to get the correct ESM configuration for your source
- Use `secure_esm_*` tools to generate least-privilege IAM policies
**Read next:**
- [event-sources.md](event-sources.md) — source-specific configuration, event filtering, batch processing examples
- [observability.md](observability.md) — structured logging, tracing, and monitoring for event processors
- [optimization.md](optimization.md) — ESM tuning parameters
### File/Object Processor
A Lambda function triggered when files are uploaded to or deleted from S3 — image processing, file validation, data import, thumbnail generation.
**Quick start:**
- Template: `hello-world` (then add an S3 event in `template.yaml`)
- Use prefix/suffix filters to limit triggers to specific paths or file types
**Read next:**
- [event-sources.md](event-sources.md) — S3 event notification configuration and recursive trigger prevention
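A SAM sketch of prefix/suffix filtering (bucket and path names are illustrative). Writing output to a different prefix or bucket than the one the trigger watches is what prevents recursive invocations:

```yaml
ThumbnailFunction:
  Type: AWS::Serverless::Function
  Properties:
    Handler: src/handlers/thumbnail.handler
    Events:
      ImageUpload:
        Type: S3
        Properties:
          Bucket: !Ref UploadBucket
          Events: s3:ObjectCreated:*
          Filter:
            S3Key:
              Rules:
                - Name: prefix
                  Value: uploads/   # trigger only on this path
                - Name: suffix
                  Value: .jpg       # and only for this file type
```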
### Notification Fan-Out
One event triggers multiple independent consumers — order notifications, alert distribution, cross-service communication.
**Quick start:**
- Create an SNS topic and subscribe multiple Lambda functions
- Use filter policies to route subsets of messages to specific consumers
**Read next:**
- [event-sources.md](event-sources.md) — SNS subscription configuration, filter policies, and DLQ setup
- [event-driven-architecture.md](event-driven-architecture.md) — for complex routing with EventBridge instead of SNS
### Event-Driven Architecture
Multiple services communicating through events on EventBridge — decoupled, independently deployable.
**Quick start:**
- Create a custom event bus (never use the default bus for application events)
- Define event schemas with `metadata` envelope for idempotency and tracing
- Use `search_schema` and `describe_schema` for schema discovery
**Read next:**
- [event-driven-architecture.md](event-driven-architecture.md) — event bus setup, event patterns, event design, Pipes, archive and replay
- [observability.md](observability.md) — correlation ID propagation, EventBridge metrics, and alarm strategy
- [orchestration-and-workflows.md](orchestration-and-workflows.md) — if you need reliable sequencing or human-in-the-loop
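One possible shape for such a `metadata` envelope, sketched in Python (the field names are illustrative, not a standard):

```python
import uuid
from datetime import datetime, timezone
from typing import Optional

def make_event(source: str, detail_type: str, data: dict,
               correlation_id: Optional[str] = None) -> dict:
    """Wrap business data in a metadata envelope for idempotency and tracing."""
    return {
        "Source": source,
        "DetailType": detail_type,
        "Detail": {
            "metadata": {
                "eventId": str(uuid.uuid4()),  # consumers deduplicate on this
                "correlationId": correlation_id or str(uuid.uuid4()),
                "occurredAt": datetime.now(timezone.utc).isoformat(),
            },
            "data": data,
        },
    }
```

Keeping business data under `data` and cross-cutting fields under `metadata` lets consumers deduplicate and trace without knowing each event's schema.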
### Multi-Step Workflow or AI Pipeline
A workflow with sequential steps, parallel execution, human approval, or checkpointing — order processing, document pipelines, agentic AI.
**Quick start:**
- **Python 3.11+ or Node.js 22+**: Use Lambda durable functions for workflows expressed as code — see the [durable-functions skill](../../aws-lambda-durable-functions/) for comprehensive guidance
- **Any runtime**: Use Step Functions for visual orchestration with 200+ AWS service integrations
- **High-throughput, short-lived**: Use Step Functions Express (100k+ exec/sec)
**Read next:**
- [orchestration-and-workflows.md](orchestration-and-workflows.md) — Step Functions ASL, testing, patterns, and a Durable Functions vs Step Functions comparison
### Scheduled Job
A Lambda function triggered on a cron schedule — reports, cleanup tasks, data sync.
Add a `Schedule` event to your function in `template.yaml`:
```yaml
MyScheduledFunction:
  Type: AWS::Serverless::Function
  Properties:
    Handler: src/handlers/report.handler
    Events:
      DailyReport:
        Type: Schedule
        Properties:
          Schedule: cron(0 8 * * ? *) # 8:00 AM UTC daily
          Enabled: true
```
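When the job only needs a fixed interval rather than a specific time of day, `Schedule` also accepts `rate()` expressions (event name is illustrative):

```yaml
Events:
  HourlySync:
    Type: Schedule
    Properties:
      Schedule: rate(1 hour)   # e.g. rate(5 minutes), rate(1 day)
```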
**Read next:**
- [sam-project-setup.md](../../aws-serverless-deployment/references/sam-project-setup.md) — project setup and deployment workflow
## Working with Existing Projects
When joining or modifying an existing SAM project:
1. Look for `template.yaml` (or `template.yml`) at the project root
2. Check `samconfig.toml` for deployment configuration and environment profiles
3. Run `sam_build` to verify the project builds
4. Use `sam_logs` and `get_metrics` to understand current behavior before making changes
For CDK projects, look for `cdk.json` and run `cdk synth` to verify synthesis. See [cdk-project-setup.md](../../aws-serverless-deployment/references/cdk-project-setup.md).
```
### references/web-app-deployment.md
```markdown
# Web Application Deployment Guide
## Overview
Deploy web applications to AWS Serverless using the `deploy_webapp` tool and Lambda Web Adapter. This covers backend APIs, static frontends, and full-stack applications — from endpoint selection through custom domains and frontend updates.
## Deployment Types
Choose the deployment type based on your application:
| Type | Use Case | What Gets Created |
| ------------- | --------------------------- | -------------------------------------- |
| **backend** | API services, microservices | Lambda + API Gateway |
| **frontend** | Static sites, SPAs | S3 + CloudFront |
| **fullstack** | Complete web apps | Lambda + API Gateway + S3 + CloudFront |
## API Endpoint Options
| Option | Best For | Notes |
| ----------------------------- | ------------------------------------------------------------------- | --------------------------------------------------------------- |
| **HTTP API (API Gateway v2)** | Most REST/HTTP APIs | 70% cheaper than REST API, lower latency |
| **REST API (API Gateway v1)** | APIs needing WAF, caching, usage plans, request transforms | More features, higher cost |
| **Lambda Function URL** | Simple HTTPS endpoints, webhooks, single-function APIs | No API Gateway; free (pay only for Lambda). Supports streaming. |
| **Application Load Balancer** | High-traffic APIs with mixed Lambda/container targets, existing ALB | Fixed hourly cost; efficient at high request volumes |
### Lambda Function URLs
The simplest option for a single Lambda function that needs an HTTPS endpoint:
```yaml
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    FunctionUrlConfig:
      AuthType: AWS_IAM # or NONE for public endpoints
      Cors:
        AllowOrigins:
          - "https://myapp.example.com"
```
Use `AuthType: NONE` only for public webhook receivers or assets where you handle auth in the function. For internal services, use `AWS_IAM` and sign requests with SigV4.
**Prefer Function URLs over API Gateway when:**
- Serving payloads larger than 10 MB (API Gateway's response limit)
- Handling requests longer than 29 seconds (API Gateway's integration timeout)
- Building a Lambdalith where per-endpoint metrics are managed in-code
- Internal service-to-service calls authenticated with AWS_IAM
**Prefer API Gateway when:**
- You need per-endpoint CloudWatch metrics without custom instrumentation
- You require Cognito authorizers, usage plans, API keys, or request validation
- You are exposing WebSocket APIs
- You want WAF integration at the API level
### CORS
Configure CORS on the API Gateway for browser-based frontend access:
- Set `AllowOrigin` to your frontend domain in production (avoid `*`)
- Include necessary headers: `Content-Type`, `Authorization`, `X-Api-Key`
- Set appropriate `MaxAge` for preflight caching
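For an HTTP API in SAM, those settings translate to something like the following (the origin domain is illustrative):

```yaml
MyHttpApi:
  Type: AWS::Serverless::HttpApi
  Properties:
    CorsConfiguration:
      AllowOrigins:
        - "https://myapp.example.com"   # not "*" in production
      AllowHeaders:
        - Content-Type
        - Authorization
        - X-Api-Key
      AllowMethods:
        - GET
        - POST
        - OPTIONS
      MaxAge: 600   # cache preflight responses for 10 minutes
```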
### Custom Domains
Use the `configure_domain` tool to set up custom domains with:
- ACM certificate (must be in us-east-1 for CloudFront)
- Route 53 DNS record
- API Gateway base path mapping
## Lambda Web Adapter
Lambda Web Adapter allows standard web frameworks to run on Lambda without code changes. The `deploy_webapp` tool automatically configures it.
**How it works:**
- Adds the Lambda Web Adapter layer to your function
- Sets `AWS_LAMBDA_EXEC_WRAPPER` to `/opt/bootstrap`
- Configures the `PORT` environment variable for your application
- Your framework listens on that port as it would normally
**Custom startup**: For applications needing pre-start steps (migrations, config loading), provide a startup script that runs setup commands before `exec`-ing your application.
**Supported backend frameworks:** Express.js, FastAPI, Flask, Spring Boot, ASP.NET Core, Gin
**Supported frontend frameworks:** React, Vue.js, Angular, Next.js, Svelte
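For reference, the manual wiring that `deploy_webapp` performs looks roughly like this in SAM. The layer ARN (region, architecture, account, and version) is a placeholder; check the Lambda Web Adapter releases for the current value:

```yaml
ExpressFunction:
  Type: AWS::Serverless::Function
  Properties:
    Handler: run.sh            # startup script that execs your server, e.g. node app.js
    Runtime: nodejs22.x
    Layers:
      # Placeholder ARN: verify against the Lambda Web Adapter release notes
      - arn:aws:lambda:us-east-1:753240598075:layer:LambdaAdapterLayerX86:25
    Environment:
      Variables:
        AWS_LAMBDA_EXEC_WRAPPER: /opt/bootstrap
        PORT: "8080"           # your framework listens here as usual
```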
## Project Structure
### Backend-Only
```text
my-backend/
├── src/
│ ├── app.js # Express application
│ ├── routes/ # API routes
│ └── middleware/ # Custom middleware
├── package.json
└── Dockerfile # Optional
```
### Frontend-Only
```text
my-frontend/
├── dist/ # Built assets
│ ├── index.html
│ └── assets/
└── package.json
```
### Full-Stack
```text
my-fullstack-app/
├── frontend/
│ ├── dist/ # Built frontend
│ └── package.json
├── backend/
│ ├── src/
│ └── package.json
└── deployment-config.json
```
## Frontend Updates
Use `update_webapp_frontend` to push new frontend assets to S3 and optionally invalidate the CloudFront cache. For zero-downtime updates, use content-hashed filenames so old and new assets can coexist.
## Database Integration
### RDS with Lambda
- Place Lambda in VPC with private subnets for RDS access
- Use VPC endpoints to avoid NAT Gateway costs for AWS service calls
- Set connection pool max to 1 per Lambda instance — but note that Lambda Managed Instances handle multiple concurrent requests, so use a proper connection pool there
- Store connection strings in Secrets Manager
For DynamoDB optimization (billing, key design, query patterns), see [optimization.md](optimization.md).
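Execution-environment reuse is why module-scope initialization matters: anything created outside the handler survives across warm invocations. A minimal sketch, where the hypothetical `connect()` stands in for opening an RDS connection pool or fetching a secret:

```python
_init_count = 0

def connect() -> dict:
    """Stand-in for expensive setup such as opening a DB connection pool."""
    global _init_count
    _init_count += 1
    return {"connection": "db-pool"}

# Module scope: runs once per execution environment (cold start), not per request.
pool = connect()

def handler(event: dict, context=None) -> dict:
    # Warm invocations reuse `pool`; only a cold start pays the connect() cost.
    return {"usedPool": pool["connection"], "initCount": _init_count}
```

The flip side is the security caveat above: the same reuse means module-level state leaks across callers, so never cache user data there.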
## Environment Management
Use `samconfig.toml` environment-specific sections for multi-environment deployments. See [sam-project-setup.md](../../aws-serverless-deployment/references/sam-project-setup.md) for configuration details.
Store environment-specific secrets in Secrets Manager or SSM Parameter Store, referenced by environment name.
## Authentication & Authorization
### Choosing an Auth Approach
| Approach | Best For | API Type |
| ------------------------------------------- | --------------------------------------- | -------------------- |
| **Cognito User Pools + JWT authorizer** | User sign-up/sign-in for your own app | HTTP API (v2) |
| **Cognito User Pools + Cognito authorizer** | Same, with built-in token validation | REST API (v1) |
| **Lambda authorizer (token-based)** | Custom auth logic, third-party IdPs | Both |
| **Lambda authorizer (request-based)** | Multi-source auth (headers, query, IP) | Both |
| **IAM authorization** | Service-to-service, internal APIs | Both + Function URLs |
| **API keys + usage plans** | Rate limiting third-party API consumers | REST API (v1) only |
### Cognito User Pools vs Identity Pools
- **User Pools** = authentication (who are you?). Handles sign-up, sign-in, MFA, and issues JWT tokens. This is what most web APIs need.
- **Identity Pools** = authorization (what AWS resources can you access?). Exchanges tokens (from User Pools, Google, Facebook) for temporary AWS credentials. Use this when clients need direct AWS access (S3 upload from browser, IoT).
Most API backends only need User Pools + a JWT or Cognito authorizer on API Gateway.
### JWT Authorizer (HTTP API)
The simplest and cheapest option for HTTP APIs backed by Cognito:
```yaml
MyCognitoUserPool:
  Type: AWS::Cognito::UserPool

MyCognitoUserPoolClient:
  Type: AWS::Cognito::UserPoolClient
  Properties:
    UserPoolId: !Ref MyCognitoUserPool
    ExplicitAuthFlows:
      - ALLOW_USER_SRP_AUTH
      - ALLOW_REFRESH_TOKEN_AUTH

MyApi:
  Type: AWS::Serverless::HttpApi
  Properties:
    Auth:
      DefaultAuthorizer: MyCognitoAuth
      Authorizers:
        MyCognitoAuth:
          AuthorizationScopes:
            - email
          IdentitySource: $request.header.Authorization
          JwtConfiguration:
            issuer: !Sub "https://cognito-idp.${AWS::Region}.amazonaws.com/${MyCognitoUserPool}"
            audience:
              - !Ref MyCognitoUserPoolClient
```
JWT authorizers validate the token signature and claims at the API Gateway level — no Lambda invocation needed for auth. This is free (no extra cost beyond HTTP API pricing).
### Lambda Authorizer
Use Lambda authorizers when you need custom auth logic: validating tokens from a third-party IdP, checking database-backed permissions, or combining multiple auth signals.
**Token-based** authorizers receive only the `Authorization` header. **Request-based** authorizers receive the full request (headers, query string, path, IP) — use these when auth depends on more than just a bearer token.
```yaml
MyApi:
  Type: AWS::Serverless::Api
  Properties:
    Auth:
      DefaultAuthorizer: MyLambdaAuth
      Authorizers:
        MyLambdaAuth:
          FunctionArn: !GetAtt AuthFunction.Arn
          FunctionPayloadType: TOKEN
          Identity:
            Header: Authorization
            ReauthorizeEvery: 300 # Cache auth result for 5 minutes
```
Set `ReauthorizeEvery` > 0 to cache authorization results and avoid invoking the authorizer Lambda on every request. A value of 300 (5 minutes) is a reasonable default.
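The authorizer function itself returns an IAM policy document in the shape API Gateway expects. A sketch in Python, where the token check is a placeholder for real JWT validation or a permissions lookup:

```python
def make_policy(principal_id: str, effect: str, method_arn: str) -> dict:
    """Build the response shape API Gateway expects from a Lambda authorizer."""
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": effect,        # "Allow" or "Deny"
                "Resource": method_arn,
            }],
        },
        "context": {"tier": "standard"},  # optional key/values passed to the backend
    }

def handler(event: dict, context=None) -> dict:
    token = event.get("authorizationToken", "")
    # Placeholder check: real code would validate a JWT signature or
    # look the token up in a database
    effect = "Allow" if token == "valid-token" else "Deny"
    return make_policy("user-123", effect, event["methodArn"])
```

Returning `Deny` (rather than raising) produces a 403; raising an unhandled error produces a 500, so explicit policies give callers clearer failures.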
### API Keys and Usage Plans
API keys are for tracking and throttling third-party API consumers — they are **not** an authentication mechanism. Always combine API keys with a proper authorizer.
```yaml
MyApi:
  Type: AWS::Serverless::Api
  Properties:
    Auth:
      ApiKeyRequired: true
      UsagePlan:
        CreateUsagePlan: PER_API
        Throttle:
          BurstLimit: 100
          RateLimit: 50
        Quota:
          Limit: 10000
          Period: MONTH
```
API keys and usage plans are REST API (v1) only. HTTP APIs do not support them.
### Auth Best Practices
- [ ] Use JWT authorizers with HTTP API for most web applications — cheapest, lowest latency
- [ ] Cache Lambda authorizer results (`ReauthorizeEvery` > 0) to avoid per-request invocations
- [ ] Never use API keys as the sole authentication mechanism — they are for usage tracking, not identity
- [ ] Use IAM authorization (SigV4) for service-to-service calls, not shared API keys
- [ ] Store JWT client IDs and secrets in SSM Parameter Store, not in template literals
## Performance
- **Caching**: Use CloudFront caching for static assets. Disable caching for API paths.
- **Response streaming**: For LLM/AI responses, large payloads (> 6 MB), or long-running operations, use Lambda response streaming to reduce TTFB. See [optimization.md](optimization.md) for configuration.
For cold start optimization, memory right-sizing, and connection pooling, see [optimization.md](optimization.md).
```
### references/event-sources.md
```markdown
# Lambda Event Sources Guide
## Overview
This guide covers the most common Lambda event sources — both polling-based (Event Source Mappings) and push-based (S3 notifications, SNS subscriptions). Use `esm_guidance` for polling source setup recommendations and `esm_optimize` for performance tuning.
**Delivery guarantee:** All Lambda event sources deliver events **at least once**. Your function must be idempotent — the same record may be processed more than once. Use the AWS Lambda Powertools Idempotency utility (backed by DynamoDB) to handle duplicates safely.
## Polling-Based Event Sources (ESM)
Event Source Mappings use a Lambda-managed poller that reads from the source, batches records, and invokes your function. You control throughput via batch size, concurrency, and parallelization.
### DynamoDB Streams
**Use case:** React to data changes in DynamoDB tables
**Key configuration:**
- `StartingPosition`: `LATEST` for new records only, `TRIM_HORIZON` for all
- `BatchSize`: 1-10000 (default 100)
- `ParallelizationFactor`: 1-10 (default 1, increase for throughput)
- `BisectBatchOnFunctionError`: Enable to isolate poison records
- `MaximumRetryAttempts`: Set to prevent infinite retries (default unlimited)
- `MaximumBatchingWindowInSeconds`: Buffer time before invoking (0-300)
**Best practices:**
- Enable `BisectBatchOnFunctionError` and set `MaximumRetryAttempts` to 3
- Configure a dead-letter queue for records that exhaust retries
- Use `ParallelizationFactor` > 1 when processing can't keep up
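Put together, a SAM event source with those settings looks roughly like this (table and queue names are illustrative):

```yaml
Events:
  TableStream:
    Type: DynamoDB
    Properties:
      Stream: !GetAtt OrdersTable.StreamArn
      StartingPosition: LATEST
      BatchSize: 100
      ParallelizationFactor: 2
      BisectBatchOnFunctionError: true
      MaximumRetryAttempts: 3
      DestinationConfig:
        OnFailure:
          Destination: !GetAtt FailedRecordsQueue.Arn   # DLQ for exhausted retries
```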
### Kinesis Streams
**Use case:** Process real-time streaming data
**Key configuration:**
- `BatchSize`: 1-10000 (default 100)
- `ParallelizationFactor`: 1-10 (should not exceed shard count)
- `MaximumBatchingWindowInSeconds`: Buffer time (0-300)
- `TumblingWindowInSeconds`: For aggregation scenarios (0-900)
- `StartingPosition`: `LATEST` or `TRIM_HORIZON`
**Best practices:**
- Higher batch sizes reduce invocation costs but increase timeout risk
- Use tumbling windows for time-based aggregation (counts, sums, averages)
- Enable enhanced fan-out when multiple consumers read from the same stream
### SQS Queues
**Use case:** Decouple components with reliable messaging
**Key configuration:**
- `BatchSize`: 1-10000 (default 10)
- `MaximumBatchingWindowInSeconds`: Buffer time (0-300)
- `MaximumConcurrency`: Limit concurrent Lambda invocations
- `FunctionResponseTypes`: Set to `["ReportBatchItemFailures"]` to avoid reprocessing successful messages
**FIFO queue considerations:**
- Use `BatchSize: 1` for strict ordering
- Limit `MaximumConcurrency` to prevent out-of-order processing
- Use message group IDs for parallel processing within groups
**Python SQS batch processor with partial failure reporting:**
```python
from aws_lambda_powertools import Logger
from aws_lambda_powertools.utilities.batch import (
    BatchProcessor, EventType, process_partial_response,
)
from aws_lambda_powertools.utilities.data_classes.sqs_event import SQSRecord

logger = Logger()
processor = BatchProcessor(event_type=EventType.SQS)

def process_record(record: SQSRecord):
    body = record.json_body
    logger.info("Processing order", order_id=body["orderId"])
    # Business logic here — raise to mark this record as failed
    return {"orderId": body["orderId"], "status": "processed"}

@logger.inject_lambda_context
def handler(event, context):
    return process_partial_response(
        event=event, record_handler=process_record, processor=processor, context=context,
    )
```
**TypeScript SQS batch processor:**
```typescript
import {
  BatchProcessor,
  EventType,
  processPartialResponse,
} from '@aws-lambda-powertools/batch';
import { Logger } from '@aws-lambda-powertools/logger';
import type { SQSHandler, SQSRecord } from 'aws-lambda';

const processor = new BatchProcessor(EventType.SQS);
const logger = new Logger();

const recordHandler = async (record: SQSRecord): Promise<void> => {
  const body = JSON.parse(record.body);
  logger.info('Processing order', { orderId: body.orderId });
};

export const handler: SQSHandler = async (event, context) =>
  processPartialResponse(event, recordHandler, processor, { context });
```
Both examples require `FunctionResponseTypes: ["ReportBatchItemFailures"]` in the ESM configuration.
**Best practices:**
- Always enable `ReportBatchItemFailures` for partial failure handling
- Set queue `VisibilityTimeout` to at least the Lambda function timeout (AWS recommends six times the timeout)
- Configure a DLQ with a `maxReceiveCount` of 3-5
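The visibility-timeout and DLQ guidance above can be sketched in SAM — a minimal fragment with illustrative resource names, assuming a 30-second function timeout:
```yaml
OrderQueue:
  Type: AWS::SQS::Queue
  Properties:
    VisibilityTimeout: 180  # 6x the 30s function timeout
    RedrivePolicy:
      deadLetterTargetArn: !GetAtt OrderDLQ.Arn
      maxReceiveCount: 5  # move to DLQ after 5 failed receives
OrderDLQ:
  Type: AWS::SQS::Queue
  Properties:
    MessageRetentionPeriod: 1209600  # 14 days
```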
### MSK/Kafka
**Use case:** Process high-throughput streaming data from Kafka
**Key configuration:**
- `Topics`: List of Kafka topics to consume
- `BatchSize`: 1-10000 (default 100)
- `MaximumBatchingWindowInSeconds`: Buffer time (0-300)
- `StartingPosition`: `LATEST` or `TRIM_HORIZON`
- `ConsumerGroupId`: Consumer group identifier
**Network requirements:**
- Lambda must have VPC access to the MSK cluster
- Security groups must allow traffic on ports 9092 (plaintext) or 9094 (TLS)
- Use IAM authentication or SASL/SCRAM for authentication
**Best practices:**
- Use `esm_kafka_troubleshoot` for connectivity issues
- Generate IAM policies with `secure_esm_msk_policy`
**Powertools Kafka Consumer:** For Kafka events with Avro, Protobuf, or JSON Schema payloads, use the Kafka Consumer utility to deserialize records automatically instead of manually parsing byte arrays:
```python
from aws_lambda_powertools.utilities.kafka import (
    ConsumerRecords, SchemaConfig, kafka_consumer,
)

# ORDER_AVRO_SCHEMA is your Avro schema definition string for the topic's values
schema_config = SchemaConfig(value_schema_type="AVRO", value_schema=ORDER_AVRO_SCHEMA)

@kafka_consumer(schema_config=schema_config)
def handler(event: ConsumerRecords, context):
    for record in event.records:
        order = record.value  # already deserialized from Avro
        print(f"Order: {order['orderId']}")
```
Available for Python, TypeScript, Java, and .NET. Supports Avro, Protobuf, and JSON Schema with both AWS Glue Schema Registry and Confluent Schema Registry.
### Amazon MQ
**Use case:** Process messages from ActiveMQ or RabbitMQ brokers
Lambda connects to Amazon MQ using a VPC-attached ESM. Configure the broker in a private subnet and ensure Lambda's security group can reach the broker port (61617 for ActiveMQ over OpenWire/TLS, 5671 for RabbitMQ over AMQP/TLS).
### Self-Managed Kafka
**Use case:** Process messages from a Kafka cluster you operate (not MSK)
Use a self-managed Kafka ESM when your cluster is not AWS-managed. Lambda connects via the network configuration you provide. Supports SASL/PLAIN, SASL/SCRAM, mTLS, and VPC connectivity.
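A minimal SAM sketch of a self-managed Kafka ESM with SASL/SCRAM authentication — broker addresses, topic, consumer group, and the credentials secret are all placeholders:
```yaml
Events:
  KafkaOrders:
    Type: SelfManagedKafka
    Properties:
      KafkaBootstrapServers:
        - broker-1.example.com:9092
        - broker-2.example.com:9092
      Topics:
        - orders
      ConsumerGroupId: order-consumers
      BatchSize: 100
      SourceAccessConfigurations:
        - Type: SASL_SCRAM_512_AUTH
          URI: !Ref KafkaCredentialsSecretArn  # Secrets Manager secret ARN
```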
### Amazon DocumentDB (with change streams)
**Use case:** React to changes in DocumentDB collections
Similar to DynamoDB Streams — Lambda polls the DocumentDB change stream. Requires DocumentDB change streams to be enabled on the collection and Lambda to have VPC access to the cluster.
## Push-Based Event Sources
These sources invoke Lambda directly via asynchronous invocation — no ESM poller involved. Lower latency than polling sources, but less control over throughput and concurrency.
### S3 Event Notifications
**Use case:** React to object uploads, deletions, or modifications — image processing, file validation, data import, thumbnail generation
**SAM template:**
```yaml
ImageProcessor:
Type: AWS::Serverless::Function
Properties:
Handler: src/handlers/process_image.handler
Runtime: python3.12
Architectures: [arm64]
Policies:
- S3ReadPolicy:
BucketName: !Ref UploadBucket
Events:
ImageUploaded:
Type: S3
Properties:
Bucket: !Ref UploadBucket
Events: s3:ObjectCreated:*
Filter:
S3Key:
Rules:
- Name: prefix
Value: uploads/
- Name: suffix
Value: .jpg
UploadBucket:
Type: AWS::S3::Bucket
```
**Key configuration:**
- `Events`: Event types to trigger on — `s3:ObjectCreated:*`, `s3:ObjectRemoved:*`, `s3:ObjectCreated:Put`, etc.
- `Filter.S3Key.Rules`: Prefix and/or suffix filters to limit which objects trigger the function
- `Bucket`: Must reference an `AWS::S3::Bucket` declared in the same SAM template
**Best practices:**
- Avoid recursive triggers — if your function writes back to the same bucket that triggers it, Lambda will loop infinitely. Use a separate output bucket or a different prefix
- S3 delivers events at least once and NOT in order — write idempotent handlers and don't depend on event sequencing
- URL-decode the object key (`urllib.parse.unquote_plus` in Python, `decodeURIComponent` in JS) — S3 URL-encodes special characters in keys
- Use prefix/suffix filters to limit invocations to relevant objects
- For complex routing (multiple consumers, content-based filtering, cross-account), use S3 → EventBridge instead — enable EventBridge notifications on the bucket and create rules
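For the S3 → EventBridge route mentioned above, a minimal sketch — bucket and function names are illustrative, and the `detail-type` follows the standard S3 EventBridge event:
```yaml
UploadBucket:
  Type: AWS::S3::Bucket
  Properties:
    NotificationConfiguration:
      EventBridgeConfiguration:
        EventBridgeEnabled: true
ObjectCreatedHandler:
  Type: AWS::Serverless::Function
  Properties:
    Handler: src/handlers/on_object_created.handler
    Runtime: python3.12
    Events:
      ObjectCreated:
        Type: EventBridgeRule
        Properties:
          Pattern:
            source:
              - aws.s3
            detail-type:
              - Object Created
            detail:
              bucket:
                name:
                  - !Ref UploadBucket
```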
**Python S3 event handler:**
```python
import urllib.parse
import boto3
from aws_lambda_powertools import Logger
logger = Logger()
s3 = boto3.client('s3')
@logger.inject_lambda_context
def handler(event, context):
for record in event['Records']:
bucket = record['s3']['bucket']['name']
key = urllib.parse.unquote_plus(record['s3']['object']['key'])
size = record['s3']['object']['size']
logger.info("Processing object", bucket=bucket, key=key, size=size)
response = s3.get_object(Bucket=bucket, Key=key)
# Process the object content
```
**TypeScript S3 event handler:**
```typescript
import { Logger } from '@aws-lambda-powertools/logger';
import { Tracer } from '@aws-lambda-powertools/tracer';
import type { S3Event, Context } from 'aws-lambda';
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';
const logger = new Logger();
const tracer = new Tracer();
const s3 = tracer.captureAWSv3Client(new S3Client({}));
export const handler = async (event: S3Event, context: Context): Promise<void> => {
logger.addContext(context);
for (const record of event.Records) {
const bucket = record.s3.bucket.name;
const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
const size = record.s3.object.size;
logger.info('Processing object', { bucket, key, size });
const response = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
// Process the object content
}
};
```
### SNS Subscriptions
**Use case:** Fan-out processing — one published message triggers multiple independent Lambda consumers
**SAM template:**
```yaml
OrderNotifier:
Type: AWS::Serverless::Function
Properties:
Handler: src/handlers/notify.handler
Runtime: python3.12
Architectures: [arm64]
Events:
OrderEvent:
Type: SNS
Properties:
Topic: !Ref OrderTopic
FilterPolicy:
event_type:
- order_placed
- order_shipped
FilterPolicyScope: MessageAttributes
RedrivePolicy:
deadLetterTargetArn: !GetAtt OrderDLQ.Arn
OrderTopic:
Type: AWS::SNS::Topic
OrderDLQ:
Type: AWS::SQS::Queue
Properties:
MessageRetentionPeriod: 1209600 # 14 days
```
**Key configuration:**
- `Topic`: ARN or reference to the SNS topic
- `FilterPolicy`: JSON filter to receive only matching messages — reduces invocations and cost
- `FilterPolicyScope`: `MessageAttributes` (default) or `MessageBody`
- `RedrivePolicy`: DLQ ARN for messages that fail delivery — configured at the subscription level, not the topic
**Best practices:**
- Use filter policies to reduce invocations — filter at the subscription level, not in handler code
- Configure a redrive policy (DLQ) on every subscription — SNS retries server-side errors up to 100,015 times over 23 days; client-side errors (deleted function) go directly to DLQ
- Set DLQ `MessageRetentionPeriod` to 14 days (maximum) for investigation time
- SNS delivers at least once; write idempotent handlers
- FIFO SNS topics do NOT support Lambda subscriptions — use standard topics only
- For simple point-to-point delivery, prefer SQS → Lambda (ESM) over SNS → Lambda
- For complex event routing with pattern matching, prefer EventBridge over SNS
**Python SNS event handler:**
```python
import json
from aws_lambda_powertools import Logger
logger = Logger()
@logger.inject_lambda_context
def handler(event, context):
for record in event['Records']:
message = json.loads(record['Sns']['Message'])
subject = record['Sns'].get('Subject', '')
message_id = record['Sns']['MessageId']
logger.info("Processing SNS message", message_id=message_id, subject=subject)
# Process the message
```
**TypeScript SNS event handler:**
```typescript
import { Logger } from '@aws-lambda-powertools/logger';
import type { SNSEvent, Context } from 'aws-lambda';
const logger = new Logger();
export const handler = async (event: SNSEvent, context: Context): Promise<void> => {
logger.addContext(context);
for (const record of event.Records) {
const message = JSON.parse(record.Sns.Message);
const subject = record.Sns.Subject ?? '';
const messageId = record.Sns.MessageId;
logger.info('Processing SNS message', { messageId, subject });
// Process the message
}
};
```
## Event Filtering
### ESM Filtering (Polling Sources)
ESM event filtering lets Lambda evaluate filter criteria **before invoking your function**, reducing unnecessary invocations and costs.
**Add filters in SAM template:**
```yaml
MyFunction:
Type: AWS::Serverless::Function
Events:
MySQSEvent:
Type: SQS
Properties:
Queue: !GetAtt MyQueue.Arn
FilterCriteria:
Filters:
- Pattern: '{"body": {"eventType": ["ORDER_CREATED"]}}'
```
**Filter pattern syntax** matches against the event structure:
- Exact match: `{"body": {"status": ["ACTIVE"]}}`
- Prefix match: `{"body": {"id": [{"prefix": "order-"}]}}`
- Numeric range: `{"body": {"amount": [{"numeric": [">", 100]}]}}`
- Exists check: `{"body": {"metadata": [{"exists": true}]}}`
Filtering is supported for SQS, Kinesis, DynamoDB Streams, MSK, self-managed Kafka, and MQ. Records that don't match filters are dropped before Lambda is invoked — SQS messages are deleted, stream records are skipped.
### Push Source Filtering
Push-based sources use their own filtering mechanisms (not ESM FilterCriteria):
- **S3**: Prefix/suffix filters on the object key via `Filter.S3Key.Rules` in the SAM template
- **SNS**: Subscription filter policies via `FilterPolicy` — supports matching on `MessageAttributes` (default) or `MessageBody`
## ESM Provisioned Mode
For high-throughput Kafka (MSK or self-managed) and SQS workloads, provisioned mode provides:
- **3x faster autoscaling** compared to default mode
- **16x higher maximum capacity**
- Manual control over minimum and maximum event pollers
Enable in SAM template:
```yaml
MySQSEvent:
Type: SQS
Properties:
Queue: !GetAtt MyQueue.Arn
ProvisionedPollerConfig:
MinimumPollers: 2
MaximumPollers: 20
```
Use provisioned mode when default autoscaling is too slow to absorb traffic spikes without message backlog.
## Batch Size Guidelines
| Priority | Small (1-10) | Medium (10-100) | Large (100-1000+) |
| ---------------- | ------------------------- | --------------- | --------------------------------------- |
| **Latency** | Lowest | Moderate | Higher |
| **Cost** | Higher (more invocations) | Balanced | Lower (fewer invocations) |
| **Timeout risk** | Low | Low | Higher (more processing per invocation) |
## Error Handling
- **Stream sources** (DynamoDB, Kinesis): Records retry until success, expiry, or max retries. Enable `BisectBatchOnFunctionError` and set `MaximumRetryAttempts`.
- **SQS**: Failed messages return to the queue after visibility timeout. Use `ReportBatchItemFailures` for partial batch success.
- **Kafka**: Similar to stream sources. Failed batches retry based on ESM configuration.
Always configure a dead-letter queue or on-failure destination to capture records that cannot be processed.
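For stream sources, the retry and on-failure settings above look like this in SAM — a sketch with illustrative resource names:
```yaml
Events:
  OrderStream:
    Type: Kinesis
    Properties:
      Stream: !GetAtt OrderStream.Arn
      StartingPosition: LATEST
      BatchSize: 100
      BisectBatchOnFunctionError: true  # split failing batches to isolate bad records
      MaximumRetryAttempts: 3
      MaximumRecordAgeInSeconds: 3600  # give up on records older than 1 hour
      DestinationConfig:
        OnFailure:
          Destination: !GetAtt FailedRecordsQueue.Arn
```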
## Monitoring
For event source metrics to alarm on (IteratorAge, Errors, Throttles, DLQ depth), alarm thresholds, and dashboard setup, see [observability.md](observability.md).
## Schema Integration
For type-safe event processing with EventBridge:
1. Use `search_schema` to find event schemas
2. Use `describe_schema` to get the full definition
3. Generate typed handlers based on the schema
```
### references/event-driven-architecture.md
```markdown
# Event-Driven Architecture Guide
## Choreography vs Orchestration
The most important architectural decision in an event-driven system is whether services coordinate through choreography or orchestration.
| | Choreography | Orchestration |
| -------------------- | -------------------------------------------------------- | --------------------------------------------------------------------------------------- |
| **How it works** | Services emit events; other services react independently | A central coordinator (Step Functions, Durable Functions) directs each step |
| **Coupling** | Loose — publisher doesn't know about consumers | Tighter — coordinator knows about each step |
| **Visibility** | Distributed; hard to see the full flow | Centralized; execution history in one place |
| **Failure handling** | Each service handles its own failures | Central error handling and retry logic |
| **Best for** | Independent services reacting to facts about the world | Business-critical workflows requiring audit trails, visibility, and reliable sequencing |
**Use choreography (EventBridge + Lambda)** when services are genuinely independent and don't need to know the outcome of each other's processing.
**Use orchestration (Step Functions / Lambda durable functions)** when you need reliable sequencing, compensating transactions, human approval steps, or the ability to visualize and debug the full workflow.
In practice, most systems use both: orchestration within a bounded context, choreography between bounded contexts.
---
## EventBridge Concepts
### Event Bus Types
| Type | Use Case |
| --------------- | ------------------------------------------------------------------------------ |
| **Default bus** | Receives AWS service events (EC2 state changes, S3 events, CodePipeline, etc.) |
| **Custom bus** | Your application events; recommended for all custom event routing |
| **Partner bus** | Receives events from SaaS partners (Datadog, Zendesk, Stripe, etc.) |
Always use a **custom event bus** for application events — keeps your events separate from AWS service noise and simplifies IAM and monitoring.
### Standard EventBridge Event Structure
Every event on the bus has this envelope (added automatically by EventBridge):
```json
{
"version": "0",
"id": "12345678-1234-1234-1234-123456789012",
"source": "com.mycompany.orders",
"detail-type": "OrderPlaced",
"account": "123456789012",
"time": "2025-01-15T10:30:00Z",
"region": "us-east-1",
"resources": [],
"detail": {
"orderId": "ord-987",
"userId": "usr-123",
"total": 49.99
}
}
```
- `source` — identifies the publishing service (`com.mycompany.orders`)
- `detail-type` — identifies the event type (`OrderPlaced`)
- `detail` — your business payload (up to 1 MB per event entry)
### SAM Configuration
**Custom event bus:**
```yaml
OrderEventBus:
Type: AWS::Events::EventBus
Properties:
Name: order-events
OrderEventBusArn:
Type: AWS::SSM::Parameter
Properties:
Name: /myapp/order-event-bus-arn
Value: !GetAtt OrderEventBus.Arn
Type: String
```
**Lambda publishing events:**
```yaml
MyFunction:
Type: AWS::Serverless::Function
Properties:
Policies:
- Statement:
- Effect: Allow
Action: events:PutEvents
Resource: !GetAtt OrderEventBus.Arn
Environment:
Variables:
EVENT_BUS_ARN: !GetAtt OrderEventBus.Arn
```
**Lambda subscribing via rule:**
```yaml
ProcessOrderFunction:
Type: AWS::Serverless::Function
Properties:
Events:
OrderPlaced:
Type: EventBridgeRule
Properties:
EventBusName: !Ref OrderEventBus
Pattern:
source:
- com.mycompany.orders
detail-type:
- OrderPlaced
RetryPolicy:
MaximumRetryAttempts: 3
MaximumEventAgeInSeconds: 3600
DeadLetterConfig:
Type: SQS
QueueLogicalId: ProcessOrderDLQ
ProcessOrderDLQ:
Type: AWS::SQS::Queue
Properties:
MessageRetentionPeriod: 1209600 # 14 days
```
### Publishing Events from Lambda
**Python:**
```python
import json
import os
from datetime import datetime, timezone
import uuid
import boto3
from aws_lambda_powertools import Logger
logger = Logger()
events_client = boto3.client("events")
event_bus_arn = os.environ["EVENT_BUS_ARN"]
def publish_order_placed(order: dict):
events_client.put_events(
Entries=[{
"EventBusName": event_bus_arn,
"Source": "com.mycompany.orders",
"DetailType": "OrderPlaced",
"Detail": json.dumps({
"metadata": {
"id": str(uuid.uuid4()),
"version": "1",
"timestamp": datetime.now(timezone.utc).isoformat(),
"correlationId": logger.get_correlation_id(),
"service": "order-service",
},
"data": order,
}),
}]
)
```
**TypeScript:**
```typescript
import { EventBridgeClient, PutEventsCommand } from '@aws-sdk/client-eventbridge';
import { randomUUID } from 'crypto';
const client = new EventBridgeClient({});
const eventBusArn = process.env.EVENT_BUS_ARN!;
async function publishOrderPlaced(order: Record<string, unknown>): Promise<void> {
await client.send(new PutEventsCommand({
Entries: [{
EventBusName: eventBusArn,
Source: 'com.mycompany.orders',
DetailType: 'OrderPlaced',
Detail: JSON.stringify({
metadata: {
id: randomUUID(),
version: '1',
timestamp: new Date().toISOString(),
service: 'order-service',
},
data: order,
}),
}],
}));
}
```
---
## Event Patterns (Content-Based Routing)
EventBridge rules use **event patterns** to filter which events reach a target. Patterns match against the standard envelope fields and anything inside `detail`.
**All values are arrays** — a field matches if the event value equals any element in the array:
```json
{
"source": ["com.mycompany.orders"],
"detail-type": ["OrderPlaced", "OrderUpdated"],
"detail": {
"status": ["CONFIRMED"],
"total": [{ "numeric": [">", 100] }]
}
}
```
**Common pattern operators:**
| Operator | Example | Matches |
| --------------- | ----------------------------------------------- | ------------------------- |
| Exact match | `"status": ["ACTIVE"]` | status equals "ACTIVE" |
| Multiple values | `"status": ["ACTIVE", "PENDING"]` | status is either |
| Prefix | `"id": [{"prefix": "ord-"}]` | id starts with "ord-" |
| Anything-but | `"status": [{"anything-but": ["CANCELLED"]}]` | status is not "CANCELLED" |
| Exists | `"refundId": [{"exists": true}]` | refundId field is present |
| Numeric range | `"amount": [{"numeric": [">=", 0, "<", 1000]}]` | 0 ≤ amount < 1000 |
| Null | `"coupon": [null]` | coupon field is null |
**Up to 5 targets per rule.** If you need to fan out to more consumers, route the event to an SNS topic or SQS queue and fan out from there.
---
## Event Design
### Event Envelopes
Wrap your business payload in a custom metadata layer inside `detail`. This provides consistent fields for filtering, deduplication, and observability across all events regardless of the transport (EventBridge, SNS, SQS, Kinesis):
```json
{
"metadata": {
"id": "01HXHMF28A94NS7NSHC5GM80F4",
"version": "1",
"timestamp": "2025-01-15T10:30:00Z",
"domain": "orders",
"service": "order-service",
"correlationId": "req-abc123"
},
"data": {
"orderId": "ord-987",
"userId": "usr-123",
"total": 49.99
}
}
```
**Key metadata fields:**
- `id` — unique event identifier; use for idempotency deduplication
- `version` — schema version of the `data` payload
- `timestamp` — when the event occurred (not when it was received)
- `correlationId` — trace ID that flows through the entire request chain
- `domain` / `service` — for filtering and observability
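As a minimal illustration of using `metadata.id` for deduplication — in-memory only; a real consumer would use a durable store such as DynamoDB (e.g. via the Powertools idempotency utility):
```python
_seen_ids: set[str] = set()

def process_once(event: dict, process) -> bool:
    """Invoke `process` on the event's data at most once per metadata.id.

    Returns False if this event ID was already handled (duplicate delivery).
    """
    event_id = event["metadata"]["id"]
    if event_id in _seen_ids:
        return False
    process(event["data"])
    _seen_ids.add(event_id)
    return True
```
Because EventBridge, SNS, and SQS all deliver at least once, every consumer needs some form of this check.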
### Light Events vs Rich Events
**Light events** carry only IDs and directly relevant fields. Consumers fetch additional data they need.
**Rich events** include expanded entities — the complete state of the object at the time of the event.
| | Light events | Rich events |
| ------------------------- | ----------------------------------------- | ------------------------------------- |
| **Payload size** | Small | Large |
| **Subscriber complexity** | Higher (must hydrate) | Lower (self-contained) |
| **Race conditions** | Risk between event publish and data fetch | None |
| **Coupling** | Consumer coupled to publisher's API | Consumer coupled to event schema only |
**Guidance:**
- **Within a bounded context** (same team, same domain): light events are fine — you control both sides
- **Across bounded contexts** (different teams, different domains): prefer rich events — unknown consumers shouldn't need to call back into your service to understand what happened
### Event Versioning
Prefer the **no-breaking-changes policy**: always add new fields, never remove or rename existing fields, never change a field's type. Consumers that ignore unknown fields continue working without any changes.
When a breaking change is unavoidable, the cleanest approach is versioning in the `detail-type`:
```text
OrderPlaced.v1 → OrderPlaced.v2
```
Consumers subscribe to specific versions. The publisher emits both versions during the migration window, then retires `v1` once all consumers have migrated.
**Avoid:** using Lambda versions/aliases or API Gateway stages to version event-driven integrations — IAM roles don't version alongside function versions, which creates subtle permission bugs.
---
## Retry Policy and Dead-Letter Queues
EventBridge retries failed Lambda invocations with exponential backoff. Configure both `RetryPolicy` and a DLQ on every rule target that processes important events:
```yaml
Events:
OrderEvent:
Type: EventBridgeRule
Properties:
Pattern:
source: [com.mycompany.orders]
RetryPolicy:
MaximumRetryAttempts: 3 # 0–185; default 185
MaximumEventAgeInSeconds: 3600 # 60–86400; default 86400 (24h)
DeadLetterConfig:
Type: SQS
QueueLogicalId: MyDLQ
```
- `MaximumRetryAttempts` — how many times EventBridge retries before sending to DLQ
- `MaximumEventAgeInSeconds` — EventBridge stops retrying after this age, even if retries remain
- Without a DLQ, events that exhaust retries are silently dropped
**Process your DLQ actively.** Set a CloudWatch alarm on `ApproximateNumberOfMessagesVisible` and reprocess events by replaying them back to the event bus.
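A sketch of the replay step — EventBridge delivers the original event envelope as the SQS message body, so reprocessing means turning each DLQ message back into a `PutEvents` entry (queue URL, bus name, and client wiring are assumptions):
```python
import json

def dlq_message_to_entry(body: str, event_bus_name: str) -> dict:
    """Rebuild a PutEvents entry from a dead-lettered EventBridge event."""
    event = json.loads(body)
    return {
        "EventBusName": event_bus_name,
        "Source": event["source"],
        "DetailType": event["detail-type"],
        "Detail": json.dumps(event["detail"]),
    }

# Replay loop sketch (boto3 clients assumed configured):
# for msg in sqs.receive_message(QueueUrl=dlq_url, MaxNumberOfMessages=10).get("Messages", []):
#     events_client.put_events(Entries=[dlq_message_to_entry(msg["Body"], "order-events")])
#     sqs.delete_message(QueueUrl=dlq_url, ReceiptHandle=msg["ReceiptHandle"])
```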
---
## Archive and Replay
EventBridge can archive all events (or a filtered subset) and replay them at any time. This is invaluable for:
- Reprocessing events after deploying a bug fix
- Bootstrapping new consumers with historical events
- Disaster recovery
**Create an archive in SAM:**
```yaml
OrderEventArchive:
Type: AWS::Events::Archive
Properties:
SourceArn: !GetAtt OrderEventBus.Arn
EventPattern:
source:
- com.mycompany.orders
RetentionDays: 30
```
Replay from the console, CLI, or API by specifying the archive, a time range, and optionally a different target bus.
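A CLI sketch of a replay — the ARNs, replay name, and time range are placeholders:
```bash
aws events start-replay \
  --replay-name reprocess-after-fix \
  --event-source-arn arn:aws:events:us-east-1:123456789012:archive/OrderEventArchive \
  --event-start-time 2025-01-15T00:00:00Z \
  --event-end-time 2025-01-15T12:00:00Z \
  --destination '{"Arn": "arn:aws:events:us-east-1:123456789012:event-bus/order-events"}'
```
Replayed events carry a `replay-name` field in the envelope, so consumers can distinguish them from live traffic if needed.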
---
## EventBridge Pipes
Pipes provide **point-to-point** integrations following a Source → Filter → Enrich → Target pattern. Use Pipes when you need to:
- Connect a stream/queue source to a target with enrichment (e.g., add customer data before sending to Step Functions)
- Filter events before they reach the target (you only pay for events that pass the filter)
- Transform payloads without writing a Lambda function
```yaml
OrderPipe:
Type: AWS::Pipes::Pipe
Properties:
Source: !GetAtt OrderQueue.Arn
SourceParameters:
SqsQueueParameters:
BatchSize: 10
Filter:
Filters:
- Pattern: '{"body": {"eventType": ["ORDER_CREATED"]}}'
Enrichment: !GetAtt EnrichOrderFunction.Arn
Target: !Ref OrderProcessingStateMachine
TargetParameters:
StepFunctionStateMachineParameters:
InvocationType: FIRE_AND_FORGET
RoleArn: !GetAtt PipeRole.Arn
```
**Pipes vs Rules:**
- Use **Pipes** for point-to-point (one source → one target) with enrichment or transformation
- Use **Rules** for fan-out (one event → multiple targets) or when the source is the event bus itself
---
## Schema Registry
The EventBridge Schema Registry discovers and stores schemas for events on your bus. Use it to generate typed code bindings and enforce contracts.
**Discover schemas automatically** by enabling schema discovery on your event bus:
```yaml
OrderBusDiscovery:
Type: AWS::EventSchemas::Discoverer
Properties:
SourceArn: !GetAtt OrderEventBus.Arn
```
Once events flow through the bus, schemas appear in the registry. Then use the MCP tools to work with them:
1. `list_registries` — browse available registries
2. `search_schema` with keywords (e.g., `"order"`) — find relevant schemas
3. `describe_schema` — get the full schema definition
4. Download code bindings from the console (Java, Python, TypeScript) — generates typed event classes
**Use schemas as contracts:** consumer teams reference a specific schema version. The publisher must not make breaking changes to a versioned schema without bumping the version.
---
## Push vs Poll
| Pattern | Services | Characteristics |
| -------- | ---------------------------------- | -------------------------------------------------------------------------------------- |
| **Push** | EventBridge, SNS | Low latency; event delivered as soon as it occurs; minimal compute waste at low volume |
| **Poll** | SQS+ESM, Kinesis, DynamoDB Streams | Full throughput control; ordered processing; higher latency; better for batching |
Choose **push (EventBridge)** when:
- You need real-time fan-out to multiple independent consumers
- Services are loosely coupled and don't need ordering guarantees
- You want content-based routing without consumer-side filtering code
Choose **poll (SQS/Kinesis)** when:
- You need strict ordering within a partition or message group
- Consumer needs to control throughput (e.g., protect a downstream database)
- You need large batch sizes for cost efficiency (e.g., bulk database writes)
---
## Observability
For EventBridge metrics to alarm on, CloudWatch Logs Insights queries, and correlation ID propagation patterns, see [observability.md](observability.md).
**Key principle:** X-Ray traces Lambda execution but does not automatically connect publisher to consumer across the event bus. Propagate `correlationId` in the event `metadata` envelope (see Event Envelopes above) and use CloudWatch Logs Insights to reconstruct cross-service request chains.
```
### references/orchestration-and-workflows.md
```markdown
# Orchestration and Workflows Guide
## Choosing an Orchestration Approach
| Approach | Best For | Runtime |
| ---------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Lambda durable functions** | Multi-step business logic and AI/ML pipelines expressed as sequential code, with checkpointing and human-in-the-loop — see the [durable-functions skill](../../aws-lambda-durable-functions/) | Python 3.13+ managed runtimes ship with the Durable Execution SDK pre-installed; the SDK itself supports Python 3.11+, which you can use by bringing your own OCI container image. Node.js 22+ |
| **Step Functions Standard** | Cross-service orchestration, long-running auditable workflows, non-idempotent operations | Any (JSON/YAML ASL definition) |
| **Step Functions Express** | High-volume, short-lived event processing, idempotent operations (100k+ exec/sec) | Any |
| **EventBridge + Lambda** | Loosely coupled event-driven choreography with no central coordinator — see [event-driven-architecture.md](event-driven-architecture.md) | Any |
**Key distinction:** Lambda durable functions keep the workflow logic inside your Lambda code using standard language constructs. Step Functions define the workflow as a separate graph-based state machine that calls Lambda (and 9,000+ API actions across 200+ AWS services). Use durable functions when the workflow is tightly coupled to business logic written in Python or Node.js. Use Step Functions when you need visual design, cross-service coordination, or native service integrations without Lambda as an intermediary.
---
## Lambda durable functions
Lambda Durable Functions enable resilient multi-step applications that execute for up to one year, with automatic checkpointing, replay, and suspension — without consuming compute charges during wait periods.
### Durable functions vs Step Functions
| | Durable functions | Step Functions |
| ------------------------ | -------------------------------------------------------- | --------------------------------------------------------------------- |
| **Programming model** | Sequential code with `context.step()` | Graph-based state machine (ASL JSON/YAML) |
| **Runtimes** | Python 3.13+, Node.js 22+ | Any (runtime-agnostic) |
| **Workflow definition** | Inside your Lambda function code | Separate `.asl.json` file |
| **AWS integrations** | Via SDK calls inside steps | 9,000+ native API actions (no Lambda needed) |
| **Execution visibility** | CloudWatch Logs + `get-durable-execution-history` | Step Functions console, execution history API |
| **Max duration** | Up to 1 year | Standard: 1 year, Express: 5 minutes |
| **Execution semantics** | At-least-once with checkpointing | Standard: exactly-once, Express: at-least-once |
| **Billing** | Active compute time only (free during waits) | Per state transition (Standard) or per execution + duration (Express) |
| **Best for** | Business logic workflows, AI pipelines, code-first teams | Cross-service orchestration, visual workflows, polyglot teams |
**For comprehensive durable functions guidance** — including the SDK, programming model, replay rules, testing, error handling, and deployment patterns — see the [durable-functions skill](../../aws-lambda-durable-functions/) in this plugin.
---
## AWS Step Functions
For comprehensive Step Functions guidance — Standard vs Express workflows, ASL definitions, JSONata, SDK integrations, Distributed Map, testing, and best practices — see [step-functions.md](step-functions.md).
```
### references/step-functions.md
```markdown
# AWS Step Functions
Step Functions provides visual workflow orchestration with native integrations to 9,000+ API actions across 200+ AWS services. Define workflows as state machines in Amazon States Language (ASL).
## Standard vs Express Workflows
| | Standard | Express |
| --------------------------------- | ------------------------------------ | ------------------------------------------- |
| **Max duration** | 1 year | 5 minutes |
| **Execution semantics** | Exactly-once | At-least-once (async) / At-most-once (sync) |
| **Execution history** | Retained 90 days, queryable via API | CloudWatch Logs only |
| **Max throughput** | 2,000 exec/sec | 100,000 exec/sec |
| **Pricing model** | Per state transition | Per execution count + duration |
| **`.sync` / `.waitForTaskToken`** | Supported | Not supported |
| **Best for** | Auditable, non-idempotent operations | High-volume, idempotent event processing |
**Choose Standard** for: payment processing, order fulfillment, compliance workflows, anything that must never execute twice.
**Choose Express** for: IoT data ingestion, streaming transformations, mobile backends, high-throughput short-lived processing.
## Key State Types
| State | Purpose |
| ------------------ | ------------------------------------------------------------------------------------ |
| `Task` | Execute work — invoke Lambda, call any AWS service via SDK integration |
| `Choice` | Branch based on input data conditions (no `Next` required on branches) |
| `Parallel` | Execute multiple branches concurrently; waits for all branches to complete |
| `Map` | Iterate over an array; use Distributed Map mode for up to 10M items from S3/DynamoDB |
| `Wait` | Pause for a fixed duration or until a specific timestamp |
| `Pass` | Pass input to output, optionally injecting or transforming data |
| `Succeed` / `Fail` | End execution successfully or with an error and cause |
## SAM Template
```yaml
Resources:
MyWorkflow:
Type: AWS::Serverless::StateMachine
Properties:
DefinitionUri: statemachine/my_workflow.asl.json
Type: STANDARD # or EXPRESS
DefinitionSubstitutions:
ProcessFunctionArn: !GetAtt ProcessFunction.Arn
ResultsTable: !Ref ResultsTable
Policies:
- LambdaInvokePolicy:
FunctionName: !Ref ProcessFunction
- DynamoDBWritePolicy:
TableName: !Ref ResultsTable
Tracing:
Enabled: true
Logging:
Destinations:
- CloudWatchLogsLogGroup:
LogGroupArn: !GetAtt WorkflowLogGroup.Arn
IncludeExecutionData: true
Level: ERROR # Use ALL for debugging, ERROR in production
WorkflowLogGroup:
Type: AWS::Logs::LogGroup
Properties:
RetentionInDays: 30
```
## State Machine Definition (ASL)
Use `DefinitionSubstitutions` to inject ARNs — never hardcode them:
```json
{
"Comment": "Order processing workflow",
"QueryLanguage": "JSONata",
"StartAt": "ProcessOrder",
"States": {
"ProcessOrder": {
"Type": "Task",
"Resource": "${ProcessFunctionArn}",
"Retry": [
{
"ErrorEquals": [
"Lambda.ServiceException",
"Lambda.AWSLambdaException",
"Lambda.TooManyRequestsException"
],
"IntervalSeconds": 2,
"MaxAttempts": 3,
"BackoffRate": 2
}
],
"Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "HandleError" }],
"Next": "SaveResult"
},
"SaveResult": {
"Type": "Task",
"Resource": "arn:aws:states:::dynamodb:putItem",
"Arguments": {
"TableName": "${ResultsTable}",
"Item": {
"id": { "S": "{% $states.input.orderId %}" },
"status": { "S": "completed" }
}
},
"End": true
},
"HandleError": {
"Type": "Fail",
"Error": "OrderProcessingFailed"
}
}
}
```
## JSONata — Recommended Query Language
JSONata is the modern, preferred way to reference and transform data in ASL. It replaces the five JSONPath I/O fields (`InputPath`, `Parameters`, `ResultSelector`, `ResultPath`, `OutputPath`) with just two: `Arguments` (inputs) and `Output` (result shape).
**Enable at the top level** to apply to all states:
```json
{ "QueryLanguage": "JSONata", "StartAt": "...", "States": {...} }
```
**Or per-state** to migrate incrementally:
```json
{ "Type": "Task", "QueryLanguage": "JSONata", ... }
```
**Expression syntax** — wrap expressions in `{% %}`:
```json
"Arguments": {
"userId": "{% $states.input.user.id %}",
"greeting": "{% 'Hello, ' & $states.input.user.name %}",
"total": "{% $sum($states.input.items.price) %}"
}
```
**Built-in Step Functions JSONata functions:**
- `$uuid()` — generate a v4 UUID
- `$parse(str)` — deserialize a JSON string to an object
- `$partition(array, size)` — split array into chunks
- `$range(start, end, step)` — generate a number array
- `$hash(value, algorithm)` — compute MD5/SHA-256/etc. hash
**JSONPath is still supported** and is the default if `QueryLanguage` is omitted — existing state machines do not need to be migrated.
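Two of these built-ins are sketched below in Python purely to illustrate their semantics — the real functions run inside ASL expressions, not in your handler code:

```python
import json

def partition(array, size):
    # $partition(array, size): split an array into chunks of at most `size`
    return [array[i:i + size] for i in range(0, len(array), size)]

def parse(s):
    # $parse(str): deserialize a JSON string into an object
    return json.loads(s)

print(partition([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4], [5]]
print(parse('{"a": 1}'))              # {'a': 1}
```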
## Integration Patterns
| Pattern | ARN suffix | Behaviour |
| --------------------- | ------------------- | --------------------------------------------------------------------- |
| **Request Response** | _(none)_ | Call service, proceed after HTTP 200 |
| **Run a Job** | `.sync` | Call service, wait for job completion |
| **Wait for Callback** | `.waitForTaskToken` | Pass `$$.Task.Token`, pause until `SendTaskSuccess`/`SendTaskFailure` |
**Wait for Callback** is the human-in-the-loop pattern: pass the task token to an external system (email, Slack, ticketing), call `sfn:SendTaskSuccess` with the token when approved.
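A minimal approval handler might resume the paused execution like this (a sketch — `sfn` is a boto3 Step Functions client passed in by the caller, and the surrounding lookup of the token from your store is omitted):

```python
import json

def resume_workflow(sfn, task_token: str, result: dict) -> None:
    # sfn: a boto3 Step Functions client. Called once the external
    # system (email, Slack, ticketing) signals approval.
    sfn.send_task_success(taskToken=task_token, output=json.dumps(result))
```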
## SDK Integrations — Avoid Lambda for Simple AWS Calls
Step Functions can call any AWS service API directly without a Lambda intermediary. This saves both cost and latency for simple operations:
```json
"SaveToDynamoDB": {
"Type": "Task",
"Resource": "arn:aws:states:::dynamodb:putItem",
"Arguments": {
"TableName": "my-table",
"Item": { "id": { "S": "{% $states.input.id %}" } }
},
"End": true
}
```
```json
"PublishEvent": {
"Type": "Task",
"Resource": "arn:aws:states:::events:putEvents",
"Arguments": {
"Entries": [{
"EventBusName": "my-bus",
"Source": "my.service",
"DetailType": "OrderPlaced",
"Detail": "{% $states.input %}"
}]
},
"End": true
}
```
Avoiding Lambda intermediaries for simple DynamoDB reads/writes, SNS publishes, SQS sends, and EventBridge puts eliminates invocation latency and cost.
## Distributed Map — Large-Scale Processing
`Map` state with `Mode: DISTRIBUTED` processes up to 10 million items from S3, DynamoDB, or inline arrays, with each item running as an independent child workflow:
```json
"ProcessFiles": {
"Type": "Map",
"ItemProcessor": {
"ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS" },
"StartAt": "ProcessSingleFile",
"States": { "ProcessSingleFile": { "Type": "Task", "Resource": "${ProcessFunctionArn}", "End": true } }
},
"MaxConcurrency": 100,
"ItemReader": {
"Resource": "arn:aws:states:::s3:listObjectsV2",
"Parameters": { "Bucket.$": "$.bucket", "Prefix.$": "$.prefix" }
},
"End": true
}
```
## Testing
For testing Step Functions workflows, see [step-functions-testing.md](step-functions-testing.md) — covers TestState API (mocking, inspection levels, retry simulation, chained tests) and Step Functions Local (Docker).
## Anti-Polling Pattern
The typical polling loop — `Wait → Check Status → Choice → loop` — is an expensive anti-pattern in Standard workflows because every state transition is billed. Replace it with the **callback + event-driven** approach:
1. Lambda starts the long-running task and receives a task token (`$$.Task.Token`)
2. Store the task token alongside the job ID in DynamoDB
3. Use `.waitForTaskToken` to pause the state machine at zero cost
4. When the job completes, an EventBridge rule triggers a Lambda that looks up the token and calls `sfn:SendTaskSuccess`
```json
"StartJob": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
"Arguments": {
"FunctionName": "${StartJobFunctionArn}",
"Payload": {
"taskToken": "{% $states.context.Task.Token %}",
"input": "{% $states.input %}"
}
},
"HeartbeatSeconds": 3600,
"Next": "ProcessResult"
}
```
For third-party APIs that don't emit events, pass a callback URL to the external service so it can POST back to your endpoint when done, which then calls `SendTaskSuccess`.
**Lambda durable functions alternative:** `context.wait_for_callback()` / `context.waitForCallback()` implements the same pattern without manual token management.
## Fan-Out / Fan-In
| Scale | Recommended approach |
| --------------------------------- | --------------------------------------------------------------------------- |
| Up to 40 items | Step Functions `Map` state (Inline mode) |
| Up to 10 million items | Step Functions `Map` state (Distributed mode, child Express workflows) |
| Millions of items, cost-sensitive | Custom: S3 → Lambda fan-out → SQS workers → DynamoDB tracking → aggregation |
For most teams, Step Functions Distributed Map is the right trade-off between cost and operational simplicity. A custom S3+SQS+DynamoDB solution is meaningfully cheaper at very high item counts but carries significant implementation overhead.
## Timeout Handling
Always set **both** `TimeoutSeconds` and `HeartbeatSeconds` on Task states. Without them, a hung downstream call can hold the execution open indefinitely:
```json
"CallExternalAPI": {
"Type": "Task",
"Resource": "${FunctionArn}",
"TimeoutSeconds": 300,
"HeartbeatSeconds": 60,
"Retry": [...]
}
```
- `TimeoutSeconds` — maximum total time for the state (including retries)
- `HeartbeatSeconds` — maximum time between heartbeat signals; fails faster when a worker disappears silently
**Handling Express workflow timeouts:** Express workflows do not publish `TIMED_OUT` events to EventBridge. Wrap Express workflows inside a parent Standard workflow — the Standard workflow can catch the timeout and trigger remediation.
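A sketch of that wrapper pattern — `${ExpressWorkflowArn}`, the timeout value, and the state names are placeholders:

```json
"RunExpressChild": {
  "Type": "Task",
  "Resource": "arn:aws:states:::states:startExecution.sync:2",
  "Arguments": {
    "StateMachineArn": "${ExpressWorkflowArn}",
    "Input": "{% $states.input %}"
  },
  "TimeoutSeconds": 300,
  "Catch": [
    { "ErrorEquals": ["States.Timeout", "States.TaskFailed"], "Next": "Remediate" }
  ],
  "Next": "Done"
}
```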
## Best Practices
- **Always add `Retry` on Task states** — Lambda returns transient errors (`Lambda.ServiceException`, `Lambda.AWSLambdaException`, `Lambda.TooManyRequestsException`) under load; without retry, these fail the execution
- **Use `Catch` for error routing** — route failures to a dedicated error-handling state rather than letting the execution fail silently
- **Use `DefinitionSubstitutions`** — never hardcode ARNs or table names in `.asl.json` files
- **Use JSONata for new workflows** — it produces simpler, more readable definitions than JSONPath
- **Use SDK integrations directly** — call DynamoDB, SNS, SQS, EventBridge, etc. without a Lambda wrapper for simple operations
- **Enable X-Ray tracing** (`Tracing.Enabled: true`) for end-to-end visibility across Step Functions and Lambda spans
- **Set logging to `Level: ERROR` in production** and `Level: ALL` when debugging; `IncludeExecutionData: true` is required to see input/output in logs
- **Standard workflows**: prefer for non-idempotent operations — exactly-once semantics prevent accidental double-charges or duplicate records
- **Express workflows**: ensure downstream operations are idempotent — at-least-once delivery means tasks may run more than once
```
### references/step-functions-testing.md
```markdown
# AWS Step Functions Testing
## TestState API
TestState API enables unit and integration testing of Step Functions without deployment. Key capabilities:
- **Mock service integrations** — Test without invoking real services
- **Advanced states** — Map, Parallel, Activity, `.sync`, `.waitForTaskToken` (require mocks)
- **Control execution** — Simulate retries, Map iterations, error scenarios
- **Chain tests** — Use output→input to test execution paths
- **Optional IAM** — When mocking, `roleArn` optional
```bash
aws stepfunctions test-state \
--definition '{"Type":"Task","Resource":"arn:aws:states:::lambda:invoke","Arguments":{...},"End":true}' \
--input '{"data":"value"}' \
--mock '{"result":"{\"StatusCode\":200,\"Payload\":{\"body\":\"success\"}}"}' \
--inspection-level DEBUG
```
## Inspection Levels
| Level | Returns | Use Case |
| --------- | ------------------------------------------------------------------------------- | ------------------- |
| **INFO** | `output`, `status`, `nextState` | Quick validation |
| **DEBUG** | + `afterInputPath`, `afterParameters`, `afterResultSelector`, `afterResultPath` | Data flow debugging |
| **TRACE** | + HTTP `request`/`response` (use `--reveal-secrets` for auth) | HTTP Task debugging |
## Critical: Service-Specific Mock Structure
**⚠️ Mocks MUST match AWS service API response schema exactly** — field names (case-sensitive), types, required fields.
### Finding Mock Structure
1. Identify service from `Resource` ARN: `arn:aws:states:::lambda:invoke` → Lambda `Invoke` API
2. Consult AWS SDK docs for that API's Response Syntax
3. Structure mock to match
### Common Service Mocks
| Service | API | Mock Structure | Example |
| --------------- | ---------------- | --------------------------------------- | ----------------------------------------------------------------------------- |
| Lambda          | `Invoke`         | `{StatusCode, Payload, FunctionError?}` | `'{"result":"{\"StatusCode\":200,\"Payload\":{\"body\":\"ok\"}}"}'`           |
| DynamoDB | `PutItem` | `{Attributes?}` | `'{"result":"{\"Attributes\":{\"id\":{\"S\":\"123\"}}}"}'` |
| DynamoDB | `GetItem` | `{Item?}` | `'{"result":"{\"Item\":{\"id\":{\"S\":\"123\"}}}"}'` |
| SNS | `Publish` | `{MessageId}` | `'{"result":"{\"MessageId\":\"abc-123\"}"}'` |
| SQS | `SendMessage` | `{MessageId, MD5OfMessageBody}` | `'{"result":"{\"MessageId\":\"xyz\",\"MD5OfMessageBody\":\"...\"}"}'` |
| EventBridge | `PutEvents` | `{FailedEntryCount, Entries[]}` | `'{"result":"{\"FailedEntryCount\":0,\"Entries\":[{\"EventId\":\"123\"}]}"}'` |
| S3 | `PutObject` | `{ETag, VersionId?}` | `'{"result":"{\"ETag\":\"\\\"abc123\\\"\"}"}'` |
| Step Functions | `StartExecution` | `{ExecutionArn, StartDate}` | `'{"result":"{\"ExecutionArn\":\"arn:...\",\"StartDate\":\"...\"}"}'` |
| Secrets Manager | `GetSecretValue` | `{ARN, Name, SecretString?}` | `'{"result":"{\"Name\":\"MySecret\",\"SecretString\":\"...\"}"}'` |
**For `.sync` patterns:** Mock the **polling API** (e.g., `startExecution.sync:2` → mock `DescribeExecution`, NOT `StartExecution`)
### Mock Syntax
**Success:** `--mock '{"result":"<service API response JSON>"}'`\
**Error:** `--mock '{"errorOutput":{"error":"ErrorCode","cause":"description"}}'`\
**Validation:** `--mock '{"fieldValidationMode":"STRICT|PRESENT|NONE","result":"..."}'`
**Validation modes:**
- `STRICT` (default): All required fields, correct types — use in CI/CD
- `PRESENT`: Only validate fields present — flexible testing
- `NONE`: No validation — quick prototyping only
## Testing Map States
Tests Map's **input/output processing**, not iterations inside. Mock = entire Map output.
```bash
aws stepfunctions test-state \
--definition '{
"Type":"Map",
"ItemsPath":"$.items",
"ItemSelector":{"value.$":"$$.Map.Item.Value"},
"ItemProcessor":{"ProcessorConfig":{"Mode":"INLINE"},...},
"End":true
}' \
--input '{"items":[1,2,3]}' \
--mock '{"result":"[10,20,30]"}' \
--inspection-level DEBUG
```
**DEBUG returns:** `afterItemsPath`, `afterItemSelector`, `afterItemBatcher`, `toleratedFailureCount`, `maxConcurrency`
**Distributed Map:** Provide data in input (as if read from S3)\
**Failure threshold testing:** Use `--state-configuration '{"mapIterationFailureCount":N}'`\
**Testing state within Map:** `--state-name` auto-populates `$$.Map.Item.Index`, `$$.Map.Item.Value`
## Testing Parallel States
Mock = JSON array, one element per branch (in definition order):
```bash
--mock '{"result":"[{\"branch1\":\"result1\"},{\"branch2\":\"result2\"}]"}'
```
## Testing Error Handling
### Retry Logic
```bash
--state-configuration '{"retrierRetryCount":1}' \
--mock '{"errorOutput":{"error":"Lambda.ServiceException","cause":"..."}}' \
--inspection-level DEBUG
```
Response includes: `status:"RETRIABLE"`, `retryBackoffIntervalSeconds`, `retryIndex`
### Catch Handlers
```bash
--mock '{"errorOutput":{"error":"Lambda.TooManyRequestsException","cause":"..."}}' \
--inspection-level DEBUG
```
Response includes: `status:"CAUGHT_ERROR"`, `nextState`, `catchIndex`, error in `output` via `ResultPath`
### Error Propagation in Map/Parallel
```bash
--state-name "ChildState" \
--state-configuration '{"errorCausedByState":"ChildState"}' \
--mock '{"errorOutput":{"error":"States.TaskFailed","cause":"..."}}'
```
## Testing .sync and .waitForTaskToken
**Required:** Must provide mock (validation exception otherwise)
### .sync Patterns
Mock the **polling API**, not initial call:
```bash
# startExecution.sync:2 → mock DescribeExecution
--mock '{"result":"{\"Status\":\"SUCCEEDED\",\"Output\":\"{...}\"}"}'
```
Common patterns: `startExecution.sync:2`→`DescribeExecution`, `batch:submitJob.sync`→`DescribeJobs`, `glue:startJobRun.sync`→`GetJobRun`
### .waitForTaskToken
```bash
--context '{"Task":{"Token":"test-token-123"}}' \
--mock '{"result":"{\"StatusCode\":200,\"Payload\":{\"status\":\"approved\"}}"}'
```
## Activity States
Require mock:
```bash
--definition '{"Type":"Task","Resource":"arn:aws:states:...:activity:MyActivity",...}' \
--mock '{"result":"{\"result\":\"completed\"}"}'
```
## Chaining Tests (Integration Testing)
```bash
RESULT_1=$(aws stepfunctions test-state --state-name "State1" ... | jq -r '.output')
NEXT_1=$(... | jq -r '.nextState')
RESULT_2=$(aws stepfunctions test-state --state-name "$NEXT_1" --input "$RESULT_1" ...)
```
Validates: data transformations, state transitions, end-to-end paths
## Context Fields
Test states referencing execution context:
```bash
--context '{
"Execution":{"Id":"arn:...","Name":"test-123","StartTime":"2024-01-01T10:00:00.000Z"},
"State":{"Name":"ProcessData","EnteredTime":"2024-01-01T10:00:05.000Z"},
"Task":{"Token":"test-token-abc123"}
}'
```
## HTTP Tasks (TRACE)
```bash
--resource "arn:aws:states:::http:invoke" \
--inspection-level TRACE \
--reveal-secrets # Requires states:RevealSecrets permission
```
Returns: `inspectionData.request` (method, URL, headers, body), `inspectionData.response` (status, headers, body)
## Troubleshooting
| Error | Fix |
| ----------------------- | ---------------------------------------------- |
| Invalid field type | Check AWS SDK docs for correct types |
| Required field missing | Add field OR use `fieldValidationMode:PRESENT` |
| .sync validation failed | Mock polling API, not initial call |
**Debug workflow:**
1. Start `fieldValidationMode:NONE` for logic testing
2. Switch to `PRESENT` for partial validation
3. Use `STRICT` in CI/CD
## Test Automation Pattern
```bash
#!/bin/bash
test_state() {
local state_name=$1
local input=$2
local mock=$3
aws stepfunctions test-state \
--definition "$(cat statemachine.asl.json)" \
--state-name "$state_name" \
--input "$input" \
--mock "$mock" \
--inspection-level DEBUG
}
# Test chain
RESULT=$(test_state "State1" '{"id":"123"}' '{"result":"..."}' | jq -r '.output')
test_state "State2" "$RESULT" '{"result":"..."}'
```
## Best Practices
1. **Always verify mock structure** against AWS SDK docs for the specific service
2. **For .sync, mock polling API** (DescribeX/GetX), not initial call
3. **Use STRICT validation in CI/CD** to catch mismatches early
4. **Test all error paths** with appropriate error codes
5. **Chain tests** to validate multi-state execution paths
6. **Start with NONE→PRESENT→STRICT** when developing mocks
7. **Use DEBUG for data flow**, TRACE for HTTP debugging
8. **Mock external dependencies** to isolate state machine logic
9. **Test Map failure thresholds** with `mapIterationFailureCount`
10. **Never commit `--reveal-secrets` output** to version control
## Quick Reference
```bash
# Basic test
aws stepfunctions test-state --definition '{...}' --input '{...}' --mock '{...}'
# Test specific state in state machine
aws stepfunctions test-state --definition "$(cat sm.json)" --state-name "MyState" --input '{...}' --mock '{...}'
# Test retry (2nd attempt)
--state-configuration '{"retrierRetryCount":1}' --mock '{"errorOutput":{...}}'
# Test Map failure threshold
--state-configuration '{"mapIterationFailureCount":5}' --mock '{"errorOutput":{...}}'
# Test with context
--context '{"Execution":{"Id":"..."}, "Task":{"Token":"..."}}'
# HTTP Task with secrets
--inspection-level TRACE --reveal-secrets
# Mock validation modes
--mock '{"fieldValidationMode":"STRICT|PRESENT|NONE","result":"..."}'
```
## Step Functions Local (Docker)
Run a local emulator for integration testing. Note that Step Functions Local is not officially supported by AWS and does not have full feature parity with the cloud service:
```bash
docker run -p 8083:8083 amazon/aws-stepfunctions-local
# Run alongside sam local start-lambda for Lambda-integrated tests
sam local start-lambda &
docker run -p 8083:8083 \
-e LAMBDA_ENDPOINT=http://host.docker.internal:3001 \
amazon/aws-stepfunctions-local
```
Then use the AWS CLI with `--endpoint-url http://localhost:8083` to create and execute state machines locally.
For most use cases, the TestState API (above) is preferred — it tests against real AWS service behavior without requiring Docker or a local emulator.
```
### references/observability.md
```markdown
# Observability Guide
## Strategy
Serverless observability relies on three pillars — logs, metrics, and traces — but the approach differs from traditional infrastructure. There are no servers to SSH into, functions are distributed by nature, and cold starts create visibility gaps. Every function should emit structured logs, publish custom metrics, and participate in distributed traces from day one.
Use AWS Lambda Powertools as the consistent instrumentation layer across all functions. It handles Logger, Tracer, and Metrics with minimal boilerplate. For installation and the full utilities reference, see [powertools.md](powertools.md).
### Decorator Stacking Order
When combining multiple Powertools decorators on a single handler, order matters. Decorators execute outer-to-inner on invocation and inner-to-outer on return. Stack them in this order so that logging context is available first, metrics flush after business logic completes, and tracing captures the handler body itself:
**Python:**
```python
from aws_lambda_powertools import Logger, Metrics, Tracer
logger = Logger()
metrics = Metrics()
tracer = Tracer()
@logger.inject_lambda_context(correlation_id_path="requestContext.requestId")
@metrics.log_metrics(capture_cold_start_metric=True)
@tracer.capture_lambda_handler
def handler(event, context):
# Logger context available first (outermost)
# Tracer captures the full handler execution (innermost)
# Metrics flush after handler returns (middle, on exit)
...
```
**TypeScript (middy middleware):**
```typescript
import { Logger, injectLambdaContext } from '@aws-lambda-powertools/logger';
import { Metrics, logMetrics } from '@aws-lambda-powertools/metrics';
import { Tracer, captureLambdaHandler } from '@aws-lambda-powertools/tracer';
import middy from '@middy/core';
const logger = new Logger();
const metrics = new Metrics();
const tracer = new Tracer();
const lambdaHandler = async (event: any, context: any) => {
// Business logic
};
export const handler = middy(lambdaHandler)
.use(injectLambdaContext(logger))
.use(logMetrics(metrics))
.use(captureLambdaHandler(tracer));
```
Middy middleware executes in registration order (top-to-bottom on request, bottom-to-top on response), achieving the same effect as the Python decorator stack.
## Structured Logging
Use structured JSON logging so CloudWatch Logs Insights can query across fields rather than parsing free-text messages.
**Required fields in every log entry:**
- `request_id` — Lambda request ID for correlating logs to a single invocation
- `function_name` — identifies which function emitted the log
- `level` — DEBUG, INFO, WARN, ERROR
- Business context (user ID, order ID, operation type) as fields, not embedded in message strings
**Python:**
```python
from aws_lambda_powertools import Logger
logger = Logger() # Reads POWERTOOLS_SERVICE_NAME from env
@logger.inject_lambda_context(correlation_id_path="requestContext.requestId")
def handler(event, context):
logger.info("Processing order", order_id=event["orderId"], amount=event["total"])
# Output: {"level":"INFO","message":"Processing order","order_id":"ord-123",
# "amount":49.99,"request_id":"...","function_name":"...","correlation_id":"..."}
```
**TypeScript:**
```typescript
import { Logger } from '@aws-lambda-powertools/logger';
import type { Context } from 'aws-lambda';
const logger = new Logger();
export const handler = async (event: any, context: Context) => {
logger.addContext(context);
logger.info('Processing order', { orderId: event.orderId, amount: event.total });
};
```
**Log level strategy:**
| Environment | Level | Rationale |
| ----------- | ----- | --------------------------------------------------------- |
| Development | DEBUG | Full visibility during development |
| Staging | INFO | Verify behavior without noise |
| Production | WARN | Minimize log volume and cost; drop to INFO when debugging |
Set `LOG_LEVEL` or `POWERTOOLS_LOG_LEVEL` via environment variable — no code changes needed to adjust per environment.
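For example, a SAM template can thread the level through a deploy-time parameter (the parameter name is illustrative):

```yaml
Parameters:
  LogLevel:
    Type: String
    Default: INFO
    AllowedValues: [DEBUG, INFO, WARN, ERROR]
Globals:
  Function:
    Environment:
      Variables:
        POWERTOOLS_LOG_LEVEL: !Ref LogLevel
```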
Enable `POWERTOOLS_LOG_DEDUPLICATION_DISABLED=true` in test environments to prevent log deduplication issues with test frameworks.
## Distributed Tracing
X-Ray traces the execution path through your Lambda function and all downstream AWS SDK calls. Enable it globally in your SAM template:
```yaml
Globals:
Function:
Tracing: Active
```
**Python:**
```python
from aws_lambda_powertools import Tracer
tracer = Tracer()
@tracer.capture_lambda_handler
def handler(event, context):
order = get_order(event["orderId"])
return order
@tracer.capture_method
def get_order(order_id: str):
tracer.put_annotation(key="orderId", value=order_id)
tracer.put_metadata(key="orderSource", value="dynamodb")
# Annotations are indexed and searchable in X-Ray console
# Metadata is attached but not indexed
return table.get_item(Key={"orderId": order_id})["Item"]
```
**TypeScript:**
```typescript
import { Tracer } from '@aws-lambda-powertools/tracer';
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient } from '@aws-sdk/lib-dynamodb';
const tracer = new Tracer();
const client = tracer.captureAWSv3Client(
DynamoDBDocumentClient.from(new DynamoDBClient({}))
);
export const handler = async (event: any, context: any) => {
tracer.putAnnotation('orderId', event.orderId);
const subsegment = tracer.getSegment()!.addNewSubsegment('processOrder');
try {
// Custom logic traced as a subsegment
} finally {
subsegment.close();
}
};
```
**Key concepts:**
- **Annotations** — indexed key-value pairs you can filter on in the X-Ray console (e.g., find all traces for `orderId=ord-123`)
- **Metadata** — non-indexed data attached to a segment for debugging context
- **`captureAWSv3Client`** — automatically traces all AWS SDK calls (DynamoDB, S3, SQS, etc.) without manual subsegments
- **Subsegments** — wrap custom logic blocks to measure their duration separately
- **Cold start annotation** — add `ColdStart: true/false` as an annotation so you can filter X-Ray traces by cold start status and measure cold start impact on latency separately from warm invocations. Use the `capture_cold_start_metric=True` option on `@metrics.log_metrics` to track cold starts automatically via EMF metrics.
**Sampling rules:** X-Ray defaults to 1 request/second reservoir + 5% of additional requests. For high-throughput functions, this is usually sufficient. Lower the percentage if tracing costs are a concern. Configure custom rules via the X-Ray console or API.
**Limitation:** X-Ray trace context does **not** propagate across EventBridge, SQS, or SNS. The publisher and consumer appear as separate traces. Use correlation IDs (see below) to reconstruct cross-service request chains in logs.
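Since trace context is dropped at these boundaries, propagate a correlation ID yourself. A sketch — the `correlation_id` field name and the bus/source values are illustrative, not a convention of any library:

```python
import json

def build_event_entry(detail: dict, correlation_id: str) -> dict:
    # Publisher side: embed the correlation ID in the event detail
    # before calling events:PutEvents.
    return {
        "EventBusName": "my-bus",
        "Source": "my.service",
        "DetailType": "OrderPlaced",
        "Detail": json.dumps({**detail, "correlation_id": correlation_id}),
    }

def extract_correlation_id(event: dict) -> str:
    # Consumer side: EventBridge delivers `detail` as a parsed object;
    # attach the recovered ID to every log line (e.g., via the logger).
    return event["detail"]["correlation_id"]
```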
## CloudWatch Application Signals
Application Signals provides APM-style capabilities on top of X-Ray, giving you service-level visibility without building custom dashboards or alarms from scratch. It uses the AWS Distro for OpenTelemetry (ADOT) Lambda layer to auto-instrument your functions.
**What it provides:**
- **Service dependency map** — visual topology showing how your Lambda functions connect to downstream services (DynamoDB, S3, SQS, other Lambda functions, external APIs)
- **Pre-built service dashboards** — per-service latency (p50, p90, p99), error rate, throughput, and fault breakdown without manual widget configuration
- **SLO tracking** — define Service Level Objectives (latency p99 < 500ms, availability > 99.9%) and monitor compliance over rolling windows
- **Anomaly detection** — automatic alerting when a service deviates from its learned baseline
**Enable via SAM template:**
```yaml
Globals:
Function:
Tracing: Active
Layers:
- !Sub arn:aws:lambda:${AWS::Region}:901920570463:layer:aws-otel-python-amd64-ver-1-25-0:1
Environment:
Variables:
AWS_LAMBDA_EXEC_WRAPPER: /opt/otel-instrument
OTEL_AWS_APPLICATION_SIGNALS_ENABLED: "true"
OTEL_METRICS_EXPORTER: none
OTEL_TRACES_SAMPLER: xray
```
Layer ARNs vary by runtime and architecture. Check the [ADOT Lambda layer documentation](https://aws-otel.github.io/docs/getting-started/lambda/) for the correct ARN for your runtime (Python, Node.js, Java, .NET) and architecture (amd64, arm64).
**SLO configuration** happens in the CloudWatch console or via API after deployment — define SLIs (latency percentile, error rate, availability) and set objectives with burn rate alerting.
**When to use Application Signals vs plain X-Ray:**
| Scenario | Recommendation |
| -------------------------------------- | -------------------------------------------------------------- |
| Single function, basic tracing | X-Ray with Powertools Tracer |
| Multi-service system, need service map | Application Signals |
| SLO tracking and compliance reporting | Application Signals |
| Custom trace annotations and filtering | X-Ray with Powertools Tracer (complements Application Signals) |
Application Signals and Powertools Tracer are complementary — ADOT handles auto-instrumentation and service-level metrics, while Powertools adds custom annotations, metadata, and fine-grained subsegments. Use both together for full coverage.
**Pricing:** Application Signals charges per service signal ingested. For low-traffic services the cost is negligible; for high-throughput services, review the [Application Signals pricing page](https://aws.amazon.com/cloudwatch/pricing/) and use X-Ray sampling rules to control trace volume.
## Custom Metrics
**Use Embedded Metric Format (EMF)** instead of calling `cloudwatch:PutMetricData`. EMF writes metrics as structured log entries that CloudWatch parses asynchronously — zero latency overhead and no extra API cost.
**Python:**
```python
from aws_lambda_powertools import Metrics
from aws_lambda_powertools.metrics import MetricUnit
metrics = Metrics() # Reads POWERTOOLS_METRICS_NAMESPACE from env
@metrics.log_metrics(capture_cold_start_metric=True)
def handler(event, context):
metrics.add_dimension(name="environment", value="prod")
metrics.add_metric(name="OrdersProcessed", unit=MetricUnit.Count, value=1)
metrics.add_metric(name="OrderTotal", unit=MetricUnit.Count, value=event["total"])
```
**TypeScript:**
```typescript
import { Metrics, MetricUnit } from '@aws-lambda-powertools/metrics';
const metrics = new Metrics();
export const handler = async (event: any) => {
metrics.addDimension('environment', 'prod');
metrics.addMetric('OrdersProcessed', MetricUnit.Count, 1);
metrics.addMetric('OrderTotal', MetricUnit.Count, event.total);
metrics.publishStoredMetrics();
};
```
**Dimensions:** Standard dimensions should include service name and environment. Avoid high-cardinality dimensions (user IDs, request IDs) — each unique combination creates a separate CloudWatch metric and incurs cost.
**Resolution:** Standard resolution (60-second aggregation) is appropriate for most use cases. Use high resolution (1-second) only for latency-sensitive SLAs where you need sub-minute granularity.
`capture_cold_start_metric=True` (Python) automatically publishes a `ColdStart` metric so you can track cold start frequency without manual instrumentation.
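For reference, the flushed EMF log entry looks roughly like this — CloudWatch parses the `_aws` envelope asynchronously and materializes the listed metrics (namespace and values illustrative):

```json
{
  "_aws": {
    "Timestamp": 1704067200000,
    "CloudWatchMetrics": [
      {
        "Namespace": "OrderService",
        "Dimensions": [["service", "environment"]],
        "Metrics": [{ "Name": "OrdersProcessed", "Unit": "Count" }]
      }
    ]
  },
  "service": "order-service",
  "environment": "prod",
  "OrdersProcessed": 1
}
```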
### Technical Metrics vs Business KPIs
Distinguish between **technical metrics** (errors, duration, throttles) and **business KPI metrics** (orders processed, revenue, user signups). Both are emitted via EMF, but they serve different audiences and alarm strategies.
**Technical metrics** are infrastructure-facing — they tell you whether the system is healthy. AWS provides most of these automatically; you supplement with custom metrics for gaps (e.g., cold start count).
**Business KPI metrics** are product-facing — they tell you whether the system is doing its job. These must be explicitly instrumented:
```python
# Technical metric — system health
metrics.add_metric(name="OrderProcessingErrors", unit=MetricUnit.Count, value=1)
# Business KPI metric — product health
metrics.add_metric(name="OrdersPlaced", unit=MetricUnit.Count, value=1)
metrics.add_metric(name="OrderRevenue", unit=MetricUnit.Count, value=order["total"])
```
**Alarm differently:** Technical metric alarms page the on-call engineer. Business KPI alarms (e.g., orders drop to zero) should notify the product team and may indicate a business issue rather than an infrastructure failure.
**Metric aggregation limit:** CloudWatch EMF supports up to 100 metrics per log entry. If a single invocation emits more than 100 metrics, split them across multiple EMF blobs by calling `metrics.flush_metrics()` (Python) or `metrics.publishStoredMetrics()` (TypeScript) mid-handler, then continuing to add metrics.
## CloudWatch Alarms
### Lambda Metrics
| Metric | What it means | Recommended alarm |
| ---------------------- | ---------------------------------- | -------------------------- |
| `Errors` | Invocations that returned an error | Error rate > 1% over 5 min |
| `Duration` (p90) | Latency affecting most users | p90 > 1.5x your baseline |
| `Duration` (p99) | Latency outliers | p99 > 3x your baseline |
| `Throttles` | Rejected due to concurrency limits | > 0 |
| `ConcurrentExecutions` | Current concurrent invocations | > 80% of account limit |
Use **p90 for early warning** (catches widespread degradation) and **p99 for tail latency** (catches outlier slowness). Alert on p90 first — if p90 is breaching, most users are affected.
### Event Source Metrics
| Metric | What it means | Alarm |
| ---------------------------------------- | -------------------------------------------- | ----- |
| `IteratorAge` (streams) | Lag between record production and processing | > 60s |
| DLQ `ApproximateNumberOfMessagesVisible` | Messages that exhausted retries | > 0 |
### EventBridge Metrics
| Metric | What it means | Alarm |
| ----------------------- | -------------------------------------------- | ----------------- |
| `FailedInvocations` | Target invocations that failed after retries | > 0 |
| `ThrottledRules` | Rules throttled due to target limits | > 0 |
| `DeadLetterInvocations` | Events sent to DLQ | > 0 |
| `MatchedEvents` | Events that matched a rule | Anomaly detection |
### Alarm Best Practices
- Use **anomaly detection** for functions with variable traffic instead of static thresholds — CloudWatch learns the expected pattern and alerts on deviations
- Use **composite alarms** to reduce alert fatigue — combine multiple signals (e.g., error rate AND duration spike) before paging
- Set alarm actions to SNS for notifications; chain SNS → Lambda for auto-remediation (e.g., increase reserved concurrency on throttle alarm)
- Use `get_metrics` to retrieve current values before setting thresholds — base alarms on observed behavior, not guesses
## CloudWatch Logs Insights
Useful queries for serverless debugging and analysis.
**Trace a correlation ID across services:**
```text
fields @timestamp, @message
| filter @message like /corr-id-value/
| sort @timestamp asc
```
Run this query across multiple log groups simultaneously to follow a request through the entire service chain.
**Cold start frequency:**
```text
filter @type = "REPORT" and ispresent(@initDuration)
| stats count() as coldStarts, avg(@initDuration) as avgInitMs by bin(1h)
```
**Error patterns:**
```text
filter @message like /ERROR/
| stats count() as errors by @message
| sort errors desc
| limit 20
```
**Duration percentiles over time:**
```text
filter @type = "REPORT"
| stats avg(@duration) as avgMs, pct(@duration, 99) as p99Ms by bin(5m)
```
**Slowest invocations:**
```text
filter @type = "REPORT"
| sort @duration desc
| limit 10
```
## Lambda Insights
Lambda Insights provides enhanced monitoring with per-function CPU utilization, memory usage, network throughput, and disk I/O — metrics that standard Lambda monitoring does not expose.
**Enable via SAM template:**
```yaml
Globals:
Function:
Layers:
- !Sub arn:aws:lambda:${AWS::Region}:580247275435:layer:LambdaInsightsExtension:53
Policies:
- CloudWatchLambdaInsightsExecutionRolePolicy
```
**When to use:**
- Diagnosing memory leaks (memory usage grows across invocations)
- Identifying CPU-bound functions that need more memory (and thus more CPU)
- Network bottlenecks from slow downstream calls
- Understanding disk I/O patterns for `/tmp`-heavy workloads
**Pricing:** Lambda Insights writes performance log events to CloudWatch Logs and publishes enhanced metrics — you pay standard CloudWatch Logs ingestion and metrics pricing based on invocation volume. Enable selectively for functions you are actively troubleshooting, not across all functions by default.
## Dashboards
### Two-Tier Dashboard Strategy
Build two dashboards per service, each serving a different audience:
**High-level dashboard** (for SREs, managers, stakeholders):
- Overall error rate and availability (SLO compliance)
- Business KPI trends (orders placed, revenue, active users)
- Cross-service health summary (one row per service: green/yellow/red)
- Time range: 24 hours or 7 days
**Low-level dashboard** (for developers debugging issues):
- Per-function: Errors, Duration p90/p99, ConcurrentExecutions, Throttles, ColdStarts
- Per-API: 4xx rate, 5xx rate, Latency p99, Request count
- Per-table: ThrottledReadRequests, ThrottledWriteRequests, ConsumedReadCapacityUnits
- Per-stream: IteratorAge, GetRecords.IteratorAgeMilliseconds
- Time range: 1 hour or 3 hours
The high-level dashboard answers "is the system healthy?" The low-level dashboard answers "why is the system unhealthy?"
### Dashboard as Code
**SAM template:**
```yaml
ServerlessDashboard:
Type: AWS::CloudWatch::Dashboard
Properties:
DashboardName: !Sub "${AWS::StackName}-health"
DashboardBody: !Sub |
{
"widgets": [
{
"type": "metric",
"properties": {
"metrics": [
["AWS/Lambda", "Errors", "FunctionName", "${MyFunction}", {"stat": "Sum"}],
["AWS/Lambda", "Duration", "FunctionName", "${MyFunction}", {"stat": "p99"}],
["AWS/Lambda", "Throttles", "FunctionName", "${MyFunction}", {"stat": "Sum"}]
],
"period": 300,
"region": "${AWS::Region}",
"title": "Lambda Health"
}
}
]
}
```
## Correlation ID Propagation
X-Ray traces break at async boundaries (EventBridge, SQS, SNS). Propagate a `correlationId` through event metadata to reconstruct the full request chain in logs.
**Pattern:** The producing function generates or forwards a correlation ID. The consuming function extracts it and injects it into Logger. All log entries from both functions share the same correlation ID, queryable via Logs Insights.
**Producer (Python):**
```python
import json
import boto3
from aws_lambda_powertools import Logger
logger = Logger()
events_client = boto3.client("events")
@logger.inject_lambda_context
def handler(event, context):
events_client.put_events(Entries=[{
"Source": "orders.service",
"DetailType": "OrderPlaced",
"EventBusName": "my-app-bus",
"Detail": json.dumps({
"metadata": {
"correlationId": logger.get_correlation_id() or context.aws_request_id,
},
"data": {"orderId": "ord-123", "total": 49.99},
}),
}])
```
**Consumer (Python):**
```python
from aws_lambda_powertools import Logger
logger = Logger()
@logger.inject_lambda_context(correlation_id_path="detail.metadata.correlationId")
def handler(event, context):
logger.info("Processing event")
# Every log entry now includes correlation_id from the producer
```
This pattern works identically for SQS (inject in message attributes) and SNS (inject in message attributes). The key is consistency: always inject `correlationId` in the same location so consumers can extract it with a predictable path.
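A minimal sketch of the SQS variant, building the `send_message` parameters so the correlation ID travels as a message attribute; the queue URL is an illustrative placeholder:

```python
import json

def build_send_message(correlation_id: str, body: dict) -> dict:
    """Build SQS send_message kwargs carrying correlationId as a message attribute."""
    return {
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue",
        "MessageBody": json.dumps(body),
        "MessageAttributes": {
            "correlationId": {
                "DataType": "String",
                "StringValue": correlation_id,
            },
        },
    }
```

Pass the result to `sqs_client.send_message(**params)`; in the consuming Lambda, each SQS record exposes the attribute at `record["messageAttributes"]["correlationId"]["stringValue"]`, giving consumers the predictable path this pattern depends on.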
For the event envelope structure (including `correlationId`, `domain`, `service` fields), see the Event Envelopes section in [event-driven-architecture.md](event-driven-architecture.md).
## Environment Variables
| Variable | Purpose |
| ------------------------------ | ------------------------------------------- |
| `POWERTOOLS_SERVICE_NAME` | Service name for Logger, Tracer, Metrics |
| `POWERTOOLS_LOG_LEVEL` | Log level (DEBUG, INFO, WARN, ERROR) |
| `POWERTOOLS_METRICS_NAMESPACE` | CloudWatch Metrics namespace |
| `POWERTOOLS_DEV` | Enable verbose output for local development |
Set these in the `Globals.Function.Environment.Variables` section of your SAM template. `POWERTOOLS_SERVICE_NAME` and `POWERTOOLS_METRICS_NAMESPACE` are required for Metrics; Logger and Tracer will use `POWERTOOLS_SERVICE_NAME` but fall back to the function name if unset.
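A minimal SAM snippet wiring these variables; the service name and namespace values are illustrative:

```yaml
Globals:
  Function:
    Environment:
      Variables:
        POWERTOOLS_SERVICE_NAME: orders
        POWERTOOLS_METRICS_NAMESPACE: MyApp
        POWERTOOLS_LOG_LEVEL: INFO
```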
## Cost Management
Observability has a cost. Manage it deliberately.
**Log retention:** Set CloudWatch log retention per environment — don't pay to store debug logs forever.
| Environment | Retention | Rationale |
| ----------- | --------- | ----------------------------- |
| Development | 7 days | Short-lived debugging |
| Staging | 30 days | Enough for release validation |
| Production | 90 days | Incident investigation window |
| Compliance | 365+ days | Regulatory requirements |
```yaml
MyFunctionLogGroup:
Type: AWS::Logs::LogGroup
Properties:
LogGroupName: !Sub "/aws/lambda/${MyFunction}"
RetentionInDays: 90
```
**X-Ray sampling:** The default (1 req/sec + 5%) is fine for most workloads. For functions processing thousands of requests per second, the 5% creates significant trace volume. Create a custom sampling rule with a lower fixed rate.
**Metrics cardinality:** Every unique combination of dimensions creates a separate CloudWatch metric. A dimension with 10,000 unique values (like `userId`) creates 10,000 metrics. Use annotations in X-Ray traces for high-cardinality identifiers, not metric dimensions.
**Lambda Insights:** Adds CloudWatch Logs ingestion and metrics costs per enabled function. The cost scales with invocation volume, so high-throughput functions cost more. Enable for functions you're actively investigating, disable when done.
```
### references/optimization.md
```markdown
# Optimization Guide
## Memory and CPU Right-Sizing
Lambda allocates CPU proportionally to memory. The goal is to find the configuration where cost-per-invocation is minimized while meeting latency requirements.
**Strategy:**
1. Use `get_metrics` to measure current duration, memory utilization, and invocation count
2. Test with different memory settings using AWS Lambda Power Tuning
3. Choose the memory level where cost (duration x memory price) is lowest
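The cost formula in step 3 can be made concrete with a small worked example. The per-GB-second rate below is the published x86_64 on-demand price at the time of writing; treat it as an assumption and substitute current pricing:

```python
PRICE_PER_GB_SECOND = 0.0000166667  # assumed x86_64 rate; check current pricing

def duration_cost(memory_mb: int, duration_ms: float) -> float:
    """Duration cost per invocation (duration x memory price), excluding the per-request charge."""
    return (memory_mb / 1024) * (duration_ms / 1000) * PRICE_PER_GB_SECOND

# More memory can *lower* cost when the extra CPU shortens duration enough:
cost_512 = duration_cost(512, 800)    # 512 MB running for 800 ms
cost_1024 = duration_cost(1024, 350)  # 1024 MB running for 350 ms
```

Here the 1024 MB configuration is cheaper per invocation despite the higher per-ms rate, which is exactly the curve Lambda Power Tuning maps empirically.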
**General guidelines:**
- 128 MB: Lightweight tasks (routing, simple transformations)
- 512 MB: Standard API handlers, moderate data processing
- 1024 MB: Compute-intensive tasks, image processing
- 3008+ MB: ML inference, large data processing
**CPU scaling:** Lambda allocates CPU proportionally to memory. At **1,769 MB** a function has the equivalent of one full vCPU. Functions below this threshold share a single vCPU. If your function is CPU-bound, the optimal memory setting is often 1,769 MB or higher.
**arm64 (Graviton) savings:** arm64 is approximately **20% cheaper per GB-second** than x86_64 and typically provides equal or faster performance. This is the single easiest cost optimization for most functions. Use x86_64 only when you depend on x86-only native binaries — older Python C extension wheels (NumPy, Pandas pre-built for x86), Node.js native addons compiled for x86, or vendor-provided Lambda layers that ship only x86 binaries. Most pure-Python, pure-Node.js, and Java/Go/.NET workloads run on arm64 without changes.
### AWS Lambda Power Tuning
[Lambda Power Tuning](https://github.com/alexcasalboni/aws-lambda-power-tuning) is an open-source Step Functions state machine that automates memory/cost optimization. It invokes your function at multiple memory settings, measures duration and cost, and recommends the optimal configuration.
**How it works:**
1. Deploy the state machine into your AWS account (one-time setup)
2. Execute it with your function ARN and the memory values to test
3. It invokes your function N times at each memory setting, collects metrics
4. Returns the optimal memory size plus a visualization URL showing the cost/performance curve
**Deploy via SAR (simplest):**
```bash
aws serverlessrepo create-cloud-formation-change-set \
--application-id arn:aws:serverlessrepo:us-east-1:451282441545:applications/aws-lambda-power-tuning \
--stack-name lambda-power-tuning \
--capabilities CAPABILITY_IAM
```
Or deploy via SAM CLI:
```bash
sam init --location https://github.com/alexcasalboni/aws-lambda-power-tuning
sam deploy --guided
```
**Run the state machine:**
```json
{
"lambdaARN": "arn:aws:lambda:us-east-1:123456789012:function:my-function",
"powerValues": [128, 256, 512, 1024, 1769, 3008],
"num": 50,
"payload": {}
}
```
| Parameter | Description |
| ------------- | ---------------------------------------------------------- |
| `lambdaARN` | ARN of the function to tune |
| `powerValues` | Memory sizes (MB) to test — include 1769 (1 vCPU boundary) |
| `num` | Invocations per memory setting (50-100 for stable results) |
| `payload` | Event payload to pass to each invocation |
**Output:** The state machine returns the cheapest and fastest configurations with exact cost-per-invocation at each memory level. It also generates a visualization URL (data encoded client-side, nothing sent to external servers) showing the cost/performance tradeoff curve.
**When to use:**
- Before launching a new function to production — right-size from the start
- When `get_metrics` shows memory utilization is consistently very low or very high
- After significant code changes that affect compute profile
- Periodically (quarterly) for long-running production functions
## Cold Start Optimization
Cold starts affect latency on the first invocation after idle time or scaling events.
**Checklist:**
- [ ] Initialize SDK clients and database connections outside the handler function
- [ ] Use `lru_cache` or module-level variables for configuration that doesn't change
- [ ] Minimize deployment package size (exclude dev dependencies, use layers for shared code)
- [ ] Choose a fast-starting runtime (Python, Node.js) for latency-sensitive paths
- [ ] Consider `arm64` architecture for faster cold starts
- [ ] Use provisioned concurrency only for consistently latency-sensitive endpoints
**When to use provisioned concurrency:**
- API endpoints with strict latency SLAs
- Functions called synchronously where cold starts are user-visible
- Not recommended for asynchronous or batch processing workloads
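Two items from the checklist above (module-level initialization and `lru_cache`) can be sketched as follows; `FEATURE_CONFIG` is an assumed environment variable for this illustration:

```python
import json
import os
from functools import lru_cache

# Parsed once per execution environment rather than on every invocation.
@lru_cache(maxsize=1)
def load_config() -> dict:
    return json.loads(os.environ.get("FEATURE_CONFIG", "{}"))

def handler(event, context):
    config = load_config()  # cache hit on every warm invocation
    return {"statusCode": 200, "flags": config}
```

The same placement rule applies to SDK clients and database connections: create them at module scope so the cold start pays the cost once and warm invocations reuse them.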
## Lambda SnapStart
SnapStart reduces cold start latency by taking a snapshot of the initialized execution environment, then resuming from that snapshot on subsequent invocations. This provides sub-second startup performance with minimal code changes.
**Supported runtimes:** Java 11+, Python 3.12+, .NET 8+
**Not supported with:** provisioned concurrency, EFS, ephemeral storage > 512 MB, container images
**Enable in SAM template:**
```yaml
MyFunction:
Type: AWS::Serverless::Function
Properties:
Runtime: python3.12
SnapStart:
ApplyOn: PublishedVersions
```
SnapStart only works on **published versions**, not `$LATEST`. Always use an alias pointing to a published version.
**Critical — handle uniqueness correctly:**
```python
# WRONG: unique value captured in snapshot, reused across invocations
import uuid
CORRELATION_ID = str(uuid.uuid4())
# CORRECT: generate unique values inside the handler
def handler(event, context):
correlation_id = str(uuid.uuid4())
```
**Re-establish connections on restore:** Use runtime hooks to reconnect databases or refresh credentials after snapshot restoration — connection state is not guaranteed. In Python, the `snapshot-restore-py` package provides `@register_before_snapshot` and `@register_after_restore` decorators; in Java, implement the CRaC `beforeCheckpoint`/`afterRestore` methods.
**When to use SnapStart vs provisioned concurrency:**
| Scenario | Recommendation |
| ------------------------------------------ | ----------------------- |
| Tolerate ~100–200 ms restore time | SnapStart |
| Require < 10 ms latency | Provisioned concurrency |
| Java/Python/.NET with heavy initialization | SnapStart |
| Infrequently invoked functions | Neither |
**Pricing:** Free for Java. Python and .NET incur a caching charge (minimum 3 hours) plus a restoration charge.
## Lambda Managed Instances
Lambda Managed Instances run your function on dedicated EC2 instances from your own account, while AWS still manages OS patching, load balancing, and auto-scaling. Unlike regular Lambda, each instance handles **multiple concurrent requests**, so your code must be thread-safe.
**When to use:**
- Consistent high-throughput workloads (hundreds to thousands of requests per second)
- Workloads that benefit from warm connection pools shared across concurrent requests
- Scenarios where cold starts must be eliminated at lower cost than provisioned concurrency
**When NOT to use:**
- Bursty or unpredictable traffic — instances take tens of seconds to launch (vs. seconds for regular Lambda)
- Low-volume applications
- Any code that is not thread-safe (module-level mutable state, non-reentrant libraries)
**Cost model:** EC2 instance cost + 15% premium + $0.20 per million requests. Compatible with existing EC2 Savings Plans. GPU instances are not supported.
**Configure via AWS CLI** (SAM template support may vary — check latest CloudFormation docs):
```bash
aws lambda create-function \
--function-name my-function \
--runtime python3.12 \
--handler app.handler \
--role arn:aws:iam::123456789012:role/my-role \
--code S3Bucket=my-bucket,S3Key=my-code.zip \
--compute-config '{"Mode": "ManagedInstances"}'
```
**Thread safety requirement:** Because multiple requests execute concurrently in the same environment, any module-level state must be read-only after initialization. Use connection pools designed for concurrent access (e.g., psycopg3 AsyncConnectionPool, SQLAlchemy async pools).
## Cost Optimization
### Decision Framework
| Scenario | Recommendation |
| ------------------------ | ---------------------------------------------------------- |
| Unpredictable traffic | On-demand billing, no provisioned concurrency |
| Steady baseline + spikes | Provisioned concurrency for baseline, on-demand for spikes |
| Batch processing | Maximize batch size, optimize memory for cost |
| Infrequently called | Minimize memory, accept cold starts |
### Key Cost Levers
- **Memory**: Lower memory is cheaper per-ms, but if it makes duration longer, net cost may increase
- **Timeout**: Set to actual max expected duration + buffer, not the maximum 900s
- **Reserved concurrency**: Caps maximum concurrent executions to prevent runaway costs
- **Storage**: Use S3 lifecycle policies to transition objects to cheaper tiers
- **Logs**: Set CloudWatch log retention to the minimum needed (7-30 days for dev, longer for prod/compliance)
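Two of these levers expressed in a SAM sketch (the values are illustrative, not recommendations):

```yaml
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    Timeout: 30                        # actual max expected duration + buffer, not 900
    ReservedConcurrentExecutions: 50   # caps concurrency, bounds worst-case spend
```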
## API Gateway Optimization
- Enable caching for read-heavy GET endpoints (0.5 GB cache is the minimum size)
- Use request validation at the gateway level to reject bad requests before invoking Lambda
- Use HTTP APIs (v2) instead of REST APIs when you don't need REST API-specific features (cheaper, lower latency)
## Response Streaming
Response streaming sends data to the client incrementally rather than buffering the complete response. This dramatically reduces time-to-first-byte (TTFB) for workloads where output is generated progressively.
### When to Use Streaming
| Scenario | Benefit |
| -------------------------------------- | ------------------------------------------------------------- |
| LLM/Bedrock responses | Users see tokens appear in real time instead of waiting |
| Responses > 6 MB | Streaming bypasses the 6 MB sync payload limit (up to 200 MB) |
| Long-running operations (up to 15 min) | Keeps the HTTP connection alive and sends progress |
| Large file/dataset delivery | Stream directly without S3 pre-signed URL workarounds |
### Lambda Side — `streamifyResponse`
Wrap your handler with `awslambda.streamifyResponse()` (Node.js runtime). Use `HttpResponseStream.from()` to attach HTTP metadata before writing to the stream.
```typescript
const streamingHandler = async (event: any, responseStream: NodeJS.WritableStream) => {
const httpStream = awslambda.HttpResponseStream.from(responseStream, {
statusCode: 200,
headers: { 'Content-Type': 'text/event-stream', 'Cache-Control': 'no-cache' },
});
// Write tokens in Server-Sent Events format
for await (const chunk of someStream) {
httpStream.write(`data: ${JSON.stringify({ token: chunk })}\n\n`);
}
httpStream.write('data: [DONE]\n\n');
httpStream.end();
};
export const handler = awslambda.streamifyResponse(streamingHandler);
```
**Runtime support:** Node.js only for `streamifyResponse`. Python and other runtimes can stream via Lambda Function URLs using a different mechanism.
### API Gateway REST API — Enable Streaming
API Gateway REST API response streaming requires two changes in your OpenAPI definition:
1. Use the `/response-streaming-invocations` Lambda ARN path (not `/invocations`)
2. Set `responseTransferMode: STREAM` on the integration
```yaml
# openapi.yaml
x-amazon-apigateway-integration:
type: AWS_PROXY
httpMethod: POST
uri:
Fn::Sub: "arn:aws:apigateway:${AWS::Region}:lambda:path/2021-11-15/functions/${MyFunction.Arn}/response-streaming-invocations"
responseTransferMode: STREAM
passthroughBehavior: when_no_match
```
Compare to the standard (non-streaming) path:
```yaml
# Standard (buffered) integration
uri:
Fn::Sub: "arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${MyFunction.Arn}/invocations"
```
This feature is available on all endpoint types — regional, private, and edge-optimized.
### Lambda Function URLs — Enable Streaming
Function URLs also support streaming without API Gateway:
```yaml
MyFunction:
Type: AWS::Serverless::Function
Properties:
FunctionUrlConfig:
AuthType: AWS_IAM
InvokeMode: RESPONSE_STREAM # default is BUFFERED
```
### Server-Sent Events (SSE) Pattern
For LLM and AI streaming, use SSE format — it's natively supported by browsers and easy to consume:
```text
data: {"token":"Hello"}\n\n
data: {"token":" world"}\n\n
data: [DONE]\n\n
```
Client-side consumption (JavaScript):
```javascript
const es = new EventSource('/streaming');
es.onmessage = (e) => {
if (e.data === '[DONE]') { es.close(); return; }
appendToken(JSON.parse(e.data).token);
};
```
### Key Limits
| Resource | Limit |
| -------------------------------------- | ---------- |
| Max streamed response size | 200 MB |
| Standard (buffered) sync response | 6 MB |
| Max integration timeout with streaming | 15 minutes |
## DynamoDB Optimization
- Use single-table design with composite keys (PK/SK) for efficient access patterns
- Use `Query` instead of `Scan` wherever possible
- Project only needed attributes to reduce read capacity usage
- Use ON_DEMAND billing for unpredictable workloads, PROVISIONED with auto-scaling for steady workloads
- Use GSIs with KEYS_ONLY projection when you only need to look up primary keys
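The first three bullets combine into a single `Query` call; this sketch builds the low-level client parameters, with table, key, and attribute names as illustrative assumptions:

```python
def build_order_query(customer_id: str) -> dict:
    """Query parameters for a single-table design: Query (not Scan), projected attributes only."""
    return {
        "TableName": "app-table",
        "KeyConditionExpression": "PK = :pk AND begins_with(SK, :sk)",
        "ExpressionAttributeValues": {
            ":pk": {"S": f"CUSTOMER#{customer_id}"},
            ":sk": {"S": "ORDER#"},
        },
        # Project only needed attributes to reduce read capacity usage
        "ProjectionExpression": "SK, orderTotal, orderStatus",
    }
```

Pass the result to `dynamodb_client.query(**params)`; because the partition key pins the query to one item collection, DynamoDB reads only the matching items instead of scanning the table.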
## Event Source Mapping Tuning
Use `esm_optimize` to get source-specific recommendations. General guidelines:
| Source | Key Tuning Parameters |
| ---------------- | -------------------------------------------------------------------------------- |
| DynamoDB Streams | `BatchSize` (1-10000), `ParallelizationFactor` (1-10) |
| Kinesis | `BatchSize` (1-10000), `ParallelizationFactor` (1-10), `TumblingWindowInSeconds` |
| SQS | `BatchSize` (1-10000), `MaximumConcurrency`, `MaximumBatchingWindowInSeconds` |
| Kafka/MSK | `BatchSize` (1-10000), `MaximumBatchingWindowInSeconds` |
## Monitoring
For Lambda metrics, event source metrics, EventBridge metrics, alarm configuration, and dashboard setup, see [observability.md](observability.md). Use `get_metrics` to retrieve current values.
## AWS Lambda Powertools
For Powertools installation, core utilities reference, deep dives (Feature Flags, Parameters, Parser, Environment Variable Validation, Streaming), and structured logging guidance, see [powertools.md](powertools.md).
```
### references/powertools.md
```markdown
# AWS Lambda Powertools
Lambda Powertools is the recommended library for implementing observability and reliability patterns with minimal boilerplate. It is available for Python, TypeScript, Java, and .NET.
**Install:**
| Runtime | Command |
| ---------- | ----------------------------------------------------------------------------------------------------------- |
| Python | `pip install aws-lambda-powertools` |
| TypeScript | `npm i @aws-lambda-powertools/logger @aws-lambda-powertools/tracer @aws-lambda-powertools/metrics` |
| Java | Add `software.amazon.lambda:powertools-tracing`, `powertools-logging`, `powertools-metrics` to Maven/Gradle |
| .NET | `dotnet add package AWS.Lambda.Powertools.Logging` (plus `.Tracing`, `.Metrics`) |
**SAM init templates:**
- Python: `sam init --app-template hello-world-powertools-python`
- TypeScript: `sam init --app-template hello-world-powertools-typescript`
**Core utilities:**
| Utility | What it does |
| ----------------------------- | -------------------------------------------------------------------------------------------------------------- |
| **Logger** | Structured JSON logging with Lambda context automatically injected |
| **Tracer** | X-Ray tracing with decorators; traces handler + downstream calls |
| **Metrics** | Custom CloudWatch metrics via Embedded Metric Format (EMF) — async, no API call overhead |
| **Idempotency** | Prevent duplicate execution using DynamoDB as idempotency store |
| **Batch** | Partial batch failure handling for SQS, Kinesis, DynamoDB Streams |
| **Parameters** | Cached retrieval from SSM Parameter Store, Secrets Manager, AppConfig |
| **Parser** | Event validation with typed models — Python uses Pydantic, TypeScript uses Zod |
| **Feature Flags** | Rule-based feature toggles backed by AppConfig |
| **Event Handler** | Route REST/HTTP API, GraphQL, and Bedrock Agent events with decorators (Python, TS, Java, .NET) |
| **Kafka Consumer** | Deserialize Kafka events (Avro, Protobuf, JSON Schema) for MSK and self-managed Kafka (Python, TS, Java, .NET) |
| **Data Masking** | Redact or encrypt sensitive fields for compliance (Python, TS) |
| **Event Source Data Classes** | Typed data classes for all Lambda event sources (Python) |
**Use Embedded Metric Format (EMF) for custom metrics** — zero latency overhead and no extra API cost. See [observability.md](observability.md) for setup and code examples.
**Parameters caching** reduces cold start impact from secrets retrieval. The Parameters utility caches values for a configurable TTL (default 5 seconds), avoiding an API call on every invocation.
**Key environment variables:**
| Variable | Purpose |
| ------------------------------- | -------------------------------------------------------- |
| `POWERTOOLS_PARAMETERS_MAX_AGE` | Cache TTL in seconds for Parameters utility (default: 5) |
For observability environment variables (`POWERTOOLS_SERVICE_NAME`, `POWERTOOLS_LOG_LEVEL`, `POWERTOOLS_METRICS_NAMESPACE`, `POWERTOOLS_DEV`), see [observability.md](observability.md).
**Parser validation:** Python uses Pydantic models; TypeScript uses Zod schemas. Both provide compile-time type safety and runtime validation for Lambda event payloads.
## Structured Logging
For structured logging best practices, Logger setup (Python and TypeScript), log level strategy, and Logs Insights queries, see [observability.md](observability.md).
## Deep Dives
### Feature Flags
Use Feature Flags for runtime configuration changes without redeployment — percentage rollouts, user-targeted flags, and kill switches. The backend is AWS AppConfig, which supports deployment strategies (linear, canary, all-at-once) with automatic rollback.
**Python:**
```python
from aws_lambda_powertools.utilities.feature_flags import AppConfigStore, FeatureFlags
app_config = AppConfigStore(environment="prod", application="my-app", name="features")
feature_flags = FeatureFlags(store=app_config)
def handler(event, context):
# Boolean flag with default
dark_mode = feature_flags.evaluate(name="dark_mode", default=False)
# Rules-based flag (percentage rollout, user targeting)
new_checkout = feature_flags.evaluate(
name="new_checkout",
context={"username": event["requestContext"]["authorizer"]["claims"]["sub"]},
default=False,
)
```
**TypeScript:**
```typescript
import { AppConfigProvider } from '@aws-lambda-powertools/parameters/appconfig';
import { FeatureFlags } from '@aws-lambda-powertools/parameters/feature-flags';
const provider = new AppConfigProvider({
environment: 'prod',
application: 'my-app',
name: 'features',
});
const featureFlags = new FeatureFlags({ provider });
export const handler = async (event: any) => {
const darkMode = await featureFlags.evaluate('dark_mode', false);
const newCheckout = await featureFlags.evaluate('new_checkout', false, {
username: event.requestContext.authorizer.claims.sub,
});
};
```
### Parameters
The Parameters utility provides cached retrieval from SSM Parameter Store, Secrets Manager, and AppConfig with a configurable TTL.
**Default TTL:** 5 seconds. Override globally with the `POWERTOOLS_PARAMETERS_MAX_AGE` environment variable (in seconds) or per-call with `max_age`.
**Python:**
```python
from aws_lambda_powertools.utilities.parameters import get_parameter, get_secret
# Cached for 300 seconds
db_host = get_parameter("/my-app/prod/db-host", max_age=300)
# Secrets Manager — decrypted and cached
db_password = get_secret("my-app/prod/db-password", max_age=300)
```
**TypeScript:**
```typescript
import { getParameter } from '@aws-lambda-powertools/parameters/ssm';
import { getSecret } from '@aws-lambda-powertools/parameters/secrets';
const dbHost = await getParameter('/my-app/prod/db-host', { maxAge: 300 });
const dbPassword = await getSecret('my-app/prod/db-password', { maxAge: 300 });
```
### Parser (Input Validation)
Validate event payloads at the system boundary using typed schemas. Catches malformed input before your business logic runs.
**Python (Pydantic):**
```python
from pydantic import BaseModel, field_validator
from aws_lambda_powertools.utilities.parser import event_parser
from aws_lambda_powertools.utilities.parser.envelopes import ApiGatewayEnvelope
class OrderRequest(BaseModel):
order_id: str
amount: float
currency: str = "USD"
@field_validator("amount")
def amount_must_be_positive(cls, v):
if v <= 0:
raise ValueError("amount must be positive")
return v
@event_parser(model=OrderRequest, envelope=ApiGatewayEnvelope)
def handler(event: OrderRequest, context):
# event is already validated and typed
return {"statusCode": 200, "body": f"Order {event.order_id}: {event.amount} {event.currency}"}
```
**TypeScript (Zod):**
```typescript
import { z } from 'zod';
import { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';
const OrderRequest = z.object({
orderId: z.string(),
amount: z.number().positive(),
currency: z.string().default('USD'),
});
type OrderRequest = z.infer<typeof OrderRequest>;
export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
const parsed = OrderRequest.safeParse(JSON.parse(event.body ?? '{}'));
if (!parsed.success) {
return { statusCode: 400, body: JSON.stringify({ errors: parsed.error.issues }) };
}
const order = parsed.data;
return { statusCode: 200, body: JSON.stringify({ orderId: order.orderId }) };
};
```
### Environment Variable Validation
Validate required environment variables at module load time (outside the handler) so misconfigured functions fail immediately on cold start rather than producing cryptic errors mid-invocation.
**Python (Pydantic):**
```python
import os
from pydantic import BaseModel
class Config(BaseModel):
table_name: str
event_bus_arn: str
log_level: str = "INFO"
# Validated once at module load — fails fast if TABLE_NAME or EVENT_BUS_ARN is missing
config = Config(
table_name=os.environ["TABLE_NAME"],
event_bus_arn=os.environ["EVENT_BUS_ARN"],
log_level=os.environ.get("LOG_LEVEL", "INFO"),
)
```
**TypeScript (Zod):**
```typescript
import { z } from 'zod';
const Config = z.object({
TABLE_NAME: z.string(),
EVENT_BUS_ARN: z.string(),
LOG_LEVEL: z.string().default('INFO'),
});
// Validated once at module load
const config = Config.parse(process.env);
```
This pattern catches missing or malformed configuration at deploy time (via smoke tests) or immediately on first invocation, rather than on the specific code path that uses the variable.
### Streaming (Python)
For processing S3 objects larger than available Lambda memory, use the Powertools Streaming utility. It provides a stream-like interface that reads data in chunks without loading the entire object into memory.
```python
from aws_lambda_powertools.utilities.streaming.s3_object import S3Object
def handler(event, context):
bucket = event["Records"][0]["s3"]["bucket"]["name"]
key = event["Records"][0]["s3"]["object"]["key"]
s3_object = S3Object(bucket=bucket, key=key)
for line in s3_object:
process_line(line)
```
This is particularly useful for CSV/JSON-lines processing, log analysis, and any workload where the input file exceeds the function's memory allocation.
```
### references/troubleshooting.md
```markdown
# Troubleshooting Guide
## Symptom-Based Diagnosis
### High Latency
**Possible causes (check in order):**
1. Cold start — check if latency is only on first invocations after idle
2. Under-provisioned memory — more memory = more CPU
3. Slow external calls — database, HTTP APIs, other AWS services
4. Large deployment package — increases cold start time
**Diagnosis steps:**
- Use `get_metrics` to check duration (average vs p99) and memory utilization
- Enable X-Ray tracing to identify which segment is slow
- Check if function is in a VPC (adds ENI setup time on cold start)
**Resolution:**
- For cold starts: initialize SDK clients outside handler, reduce package size, consider provisioned concurrency
- For slow external calls: use connection reuse, add VPC endpoints, increase timeout
- For CPU-bound work: increase memory allocation
### Function Errors
**Possible causes:**
1. Unhandled exceptions in application code
2. Timeout exceeded
3. Out of memory (OOM)
4. Permission denied on AWS API calls
**Diagnosis steps:**
- Use `sam_logs` to retrieve recent CloudWatch logs for the function
- Look for `Task timed out`, `Runtime.ExitError`, or `AccessDeniedException` messages
- Check the error rate trend with `get_metrics`
**Resolution by error type:**
| Error Message | Cause | Fix |
| -------------------------------- | ---------------------------- | ------------------------------------------------------- |
| `Task timed out after X seconds` | Execution exceeded timeout | Increase timeout, increase memory, optimize code |
| `Runtime.ExitError` | OOM or process crash | Increase memory, check for memory leaks |
| `AccessDeniedException` | Missing IAM permission | Add the required action to the function's IAM role |
| `ResourceNotFoundException` | Wrong resource ARN or region | Verify the resource exists in the correct region |
| `TooManyRequestsException` | Concurrency limit reached | Increase reserved concurrency or request limit increase |
### Async Invocations and Throttling
**Common misconception:** Async Lambda invocations (`InvocationType: Event`) are subject to throttling like sync invocations.
**Reality:** Async invocations are **always accepted** — Lambda's Event Invoke Frontend queues the request without checking concurrency limits. The invocation call itself never returns a throttle error. Throttling is only checked later when the internal poller attempts to run the function synchronously. If throttled at that point, the event is returned to the internal queue and retried for **up to 6 hours**.
**Practical implications:**
- You do not need SNS or EventBridge as a throttle-protection buffer in front of async Lambda invocations — direct Lambda-to-Lambda async calls are safe
- If you need guaranteed delivery with a DLQ, configure one on the function directly
- Async invocations with reserved concurrency set to 0 will still be accepted but will fail during processing — they will retry and eventually go to the DLQ or on-failure destination
### Throttling and Concurrency Limits
**Symptoms:** `TooManyRequestsException`, 429 errors from API Gateway, `Throttles` metric rising
**Key concurrency facts:**
- Default account limit: **1,000 concurrent executions** per region (shared across all functions)
- **Concurrency ≠ requests per second**: Concurrency = avg_RPS × avg_duration_seconds
- Burst limit: each function can scale by **1,000 new concurrent executions every 10 seconds** (on-demand)
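The concurrency formula above can be turned into a quick capacity check (the helper name is illustrative, not an AWS API):

```python
def required_concurrency(avg_rps: float, avg_duration_s: float) -> float:
    """Concurrency consumed = average requests per second x average duration (s)."""
    return avg_rps * avg_duration_s

# 100 RPS at an average duration of 2 s consumes 200 concurrent
# executions -- a fifth of the default 1,000 per-region limit.
# The same 100 RPS at 200 ms consumes only ~20, which is why
# shaving duration is often cheaper than raising quotas.
```

This is also why a latency regression in a downstream dependency can surface as throttling: duration goes up, so concurrency consumed goes up at the same request rate.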
**Diagnosis:**
1. Check `ConcurrentExecutions` and `Throttles` metrics in CloudWatch with `get_metrics`
2. Check if a single function is consuming all available concurrency
3. Run `aws lambda get-account-settings` to see your account limit vs reserved allocations
**Resolution:**
- Set reserved concurrency on high-priority functions to guarantee capacity
- Set reserved concurrency on lower-priority functions to cap their usage
- Request a concurrency quota increase via AWS Service Quotas if the account limit is the bottleneck
- Use SQS as a buffer in front of Lambda to absorb traffic spikes without throttling
### Deployment Failures
**Common errors and solutions:**
| Error | Cause | Solution |
| ------------------------------------- | -------------------------------------------- | ------------------------------------------------------------------------------------ |
| `Build Failed` | Missing dependencies or incompatible runtime | Run `sam_build` with `use_container: true`, verify `requirements.txt`/`package.json` |
| `CREATE_FAILED` on IAM role | Missing `CAPABILITY_IAM` | Add `capabilities = "CAPABILITY_IAM"` to samconfig.toml |
| `ROLLBACK_COMPLETE` | Resource creation failed | Check CloudFormation events for the specific resource failure |
| `No changes to deploy` | No diff from last deploy | Verify `sam_build` ran, check correct samconfig profile |
| `Stack is in ROLLBACK_COMPLETE state` | Previous deploy failed | Delete the stack with `aws cloudformation delete-stack`, then redeploy |
## API Gateway Issues
### CORS Errors
**Symptoms:** Browser blocking requests, `Access-Control-Allow-Origin` errors
**Checklist:**
- Verify CORS is configured on the API Gateway (AllowOrigin, AllowMethods, AllowHeaders)
- Check that OPTIONS method returns correct headers
- Ensure AllowOrigin matches the frontend domain (not `*` in production)
- Verify Lambda response includes CORS headers if using proxy integration
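For the proxy-integration case in the last checklist item, a minimal sketch of a handler that emits its own CORS headers (the allowed origin is a placeholder; match it to your frontend domain):

```python
import json

ALLOWED_ORIGIN = "https://app.example.com"  # placeholder frontend domain

def handler(event, context):
    # With proxy integration, API Gateway forwards the response as-is,
    # so the function itself must include CORS headers on every response.
    return {
        "statusCode": 200,
        "headers": {
            "Access-Control-Allow-Origin": ALLOWED_ORIGIN,
            "Access-Control-Allow-Methods": "GET,POST,OPTIONS",
            "Access-Control-Allow-Headers": "Content-Type,Authorization",
        },
        "body": json.dumps({"ok": True}),
    }
```

Error responses need the same headers too — a 500 without `Access-Control-Allow-Origin` shows up in the browser as a CORS failure, masking the real error.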
### 5xx Errors
**Symptoms:** API returning 500/502/503 errors
**Diagnosis:**
- 502 Bad Gateway: Lambda returned invalid response format. Check that response includes `statusCode` and `body`.
- 503 Service Unavailable: Lambda throttled. Check concurrency limits.
- 500 Internal Server Error: Check Lambda logs for unhandled exceptions.
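A 502 usually means the handler returned something API Gateway cannot map to an HTTP response. One defensive sketch (a pattern, not a required API): wrap the handler so unhandled exceptions still produce a well-formed proxy response, turning opaque 502s into diagnosable 500s with a log line:

```python
import functools
import json
import logging

logger = logging.getLogger()

def proxy_response(func):
    """Guarantee a valid {statusCode, body} shape even on failure."""
    @functools.wraps(func)
    def wrapper(event, context):
        try:
            return func(event, context)
        except Exception:
            logger.exception("Unhandled error")  # full traceback to CloudWatch
            return {
                "statusCode": 500,
                "body": json.dumps({"message": "Internal error"}),
            }
    return wrapper

@proxy_response
def handler(event, context):
    raise RuntimeError("boom")  # simulated bug
```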
## Event Source Mapping Issues
### When to Use Which Tool
| Symptom | Tool to Use |
| ---------------------------------- | -------------------------------- |
| Need to set up a new ESM | `esm_guidance` |
| ESM exists but performance is poor | `esm_optimize` |
| Kafka/MSK connection failing | `esm_kafka_troubleshoot` |
| Need IAM policy for ESM | `secure_esm_*` (source-specific) |
### DynamoDB Streams — High Iterator Age
**Symptoms:** `IteratorAge` metric increasing in CloudWatch
**Diagnosis steps:**
1. Check `ParallelizationFactor` — default is 1, maximum is 10
2. Check function duration — slow processing causes backlog
3. Check for poison records causing repeated retries
4. Check concurrency — throttling prevents scaling
**Resolution:**
- Increase `ParallelizationFactor` and `BatchSize`
- Enable `BisectBatchOnFunctionError` to isolate bad records
- Set `MaximumRetryAttempts` to limit retries on persistent failures
- Use `esm_optimize` for specific tuning recommendations
### Kinesis — Shard Throttling
**Symptoms:** `ReadProvisionedThroughputExceeded` errors
**Resolution:**
- Check if multiple consumers share the same shard (each shard supports 2 MB/s reads)
- Use enhanced fan-out for multiple consumers
- Consider switching to ON_DEMAND stream mode for automatic scaling
- Increase shard count for PROVISIONED mode
### SQS — Messages Going to DLQ
**Symptoms:** Messages accumulating in dead-letter queue
**Diagnosis:**
- Check the queue's `VisibilityTimeout` — AWS recommends at least 6× the function timeout (plus any batch window) so in-flight batches are not redelivered mid-processing
- Check for partial batch failures: enable `ReportBatchItemFailures` in `FunctionResponseTypes`
- Check `maxReceiveCount` in the redrive policy (too low causes premature DLQ routing)
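With `ReportBatchItemFailures` enabled, the handler returns only the failed message IDs, so successful records in the same batch are not redelivered (and don't count toward `maxReceiveCount`). A minimal sketch — `process` is a placeholder for real per-message work:

```python
def process(body: str) -> None:
    # placeholder for real per-message work
    if body == "bad":
        raise ValueError("cannot process")

def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process(record["body"])
        except Exception:
            # Report this message ID; only it returns to the queue
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Note the failure mode: if the handler raises instead of returning this shape, the *entire* batch is retried, which is exactly the premature-DLQ pattern this setting exists to avoid.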
### Kafka/MSK — Connection Failures
**Symptoms:** ESM stays in `Creating` or `Failed` state
**Use `esm_kafka_troubleshoot` with the error message.** Common causes:
- Lambda not in same VPC as MSK cluster
- Security group missing inbound rule on ports 9092/9094
- IAM authentication not configured correctly
- SASL/SCRAM secret not in the correct format
## VPC and Networking
### Lambda Cannot Reach AWS Services
**Symptoms:** Timeouts when calling DynamoDB, S3, SQS from VPC-attached Lambda
**Cause:** Lambda in VPC private subnets cannot reach AWS service endpoints without a path.
**Resolution options (choose one):**
- Add VPC gateway endpoints for DynamoDB and S3 (free, recommended)
- Add VPC interface endpoints for other services (per-hour + per-GB cost)
- Add NAT Gateway in public subnet (higher cost, required for internet access)
- **Use IPv6 + Egress-Only Internet Gateway** for internet access — Lambda now supports IPv6, so if your VPC has IPv6 CIDR blocks and an Egress-Only Internet Gateway, Lambda functions can reach the internet without a NAT Gateway, eliminating NAT Gateway costs entirely
### ENI Exhaustion
**Symptoms:** Lambda functions fail to start, `ENILimitReached` errors
**Resolution:**
- Use multiple subnets across AZs (each /24 subnet provides ~250 IPs)
- Set reserved concurrency to cap the maximum ENI usage
- Lambda uses Hyperplane ENIs which are shared, but high concurrency can still exhaust IPs
## Lambda SnapStart Issues
SnapStart issues are specific to Java 11+, Python 3.12+, and .NET 8+ functions with `SnapStart: ApplyOn: PublishedVersions`.
### Stale unique values
**Symptom:** All invocations share the same UUID, timestamp, or random value
**Cause:** Value was generated during initialization (before snapshot), so all restored environments use the same value.
**Fix:** Move any call to `uuid.uuid4()`, `random`, `time.time()`, or similar into the handler function body, not at module level.
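A minimal illustration of the fix — generate the value inside the handler, not at module level:

```python
import uuid

# WRONG: generated once at init, baked into the snapshot,
# so every restored environment reuses the same value:
# REQUEST_ID = str(uuid.uuid4())

def handler(event, context):
    # RIGHT: generated per invocation, after restore
    request_id = str(uuid.uuid4())
    return {"request_id": request_id}
```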
### Stale database connection after restore
**Symptom:** Database errors on the first call after a period of inactivity
**Cause:** The connection object was captured in the snapshot but the actual TCP connection is no longer valid after resume.
**Fix:** Validate the connection before use (e.g. a cheap `SELECT 1`), or register an after-restore runtime hook (e.g. `@register_after_restore` from the `snapshot-restore-py` package for Python, or a CRaC `afterRestore` hook for Java) to re-establish it after restoration.
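A sketch of the validate-before-use approach. The `FakeConnection`/`connect`/`ping` interface is illustrative — substitute your driver's equivalent liveness check, such as executing `SELECT 1`:

```python
class FakeConnection:
    """Stand-in for a real driver connection (illustrative)."""
    def __init__(self):
        self.alive = True

    def ping(self):
        # Real drivers: execute a cheap statement such as SELECT 1
        if not self.alive:
            raise ConnectionError("stale connection")

def connect():
    return FakeConnection()

_conn = connect()  # created at init, so it is captured in the snapshot

def get_connection():
    global _conn
    try:
        _conn.ping()       # validate before use
    except ConnectionError:
        _conn = connect()  # reconnect after restore invalidated it
    return _conn
```

The same lazy-revalidation pattern also covers ordinary long-idle warm environments, where NAT or database-side idle timeouts can silently drop the TCP connection.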
### SnapStart not activating
**Symptom:** Cold starts still slow despite SnapStart enabled
**Check:**
- SnapStart only applies to **published versions** and aliases pointing to them, not `$LATEST`
- The function is not using provisioned concurrency, EFS, or ephemeral storage > 512 MB
- The runtime is Java 11+, Python 3.12+, or .NET 8+
## Debugging Workflow
When a function is failing and the cause is unclear, follow this sequence:
1. **Check logs**: Use `sam_logs` to get recent log output
2. **Check metrics**: Use `get_metrics` to identify error rate, duration, and throttle trends
3. **Check configuration**: Verify timeout, memory, VPC, and IAM settings in the SAM template
4. **Test locally**: Use `sam_local_invoke` with the failing event payload to reproduce
5. **Test deployed function**: Use `sam remote invoke` with the failing event to test directly in AWS — bypasses local environment differences
6. **Trace calls**: Enable X-Ray tracing to identify which downstream call is failing
7. **Check dependencies**: Verify external services (databases, APIs) are reachable and healthy
```