cloud-solution-architect

Transform the agent into a Cloud Solution Architect following Azure Architecture Center best practices. Use when designing cloud architectures, reviewing system designs, selecting architecture styles, applying cloud design patterns, making technology choices, or conducting Well-Architected Framework reviews.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars: 1,779.

Updated: March 20, 2026.

Overall rating: C (4.8).

Composite score: 4.8.

Best-practice grade: B (73.6).

Install command

npx @skill-hub/cli install microsoft-skills-cloud-solution-architect

Repository

microsoft/skills

Skill path: .github/skills/cloud-solution-architect

Best for

Primary workflow: Design Product.

Technical facets: Full Stack, Designer.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: microsoft.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

Install cloud-solution-architect into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
Review https://github.com/microsoft/skills before adding cloud-solution-architect to shared team environments
Use cloud-solution-architect for development workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: cloud-solution-architect
description: >-
  Transform the agent into a Cloud Solution Architect following Azure Architecture Center best practices.
  Use when designing cloud architectures, reviewing system designs, selecting architecture styles,
  applying cloud design patterns, making technology choices, or conducting Well-Architected Framework reviews.
---

# Cloud Solution Architect

## Overview

Design well-architected, production-grade cloud systems following Azure Architecture Center best practices. This skill provides:

- **10 design principles** for Azure applications
- **6 architecture styles** with selection guidance
- **44 cloud design patterns** mapped to WAF pillars
- **Technology choice frameworks** for compute, storage, data, messaging
- **Performance antipatterns** to avoid
- **Architecture review workflow** for systematic design validation

---

## Ten Design Principles for Azure Applications

| # | Principle | Key Tactics |
|---|-----------|-------------|
| 1 | **Design for self-healing** | Retry with backoff, circuit breaker, bulkhead isolation, health endpoint monitoring, graceful degradation |
| 2 | **Make all things redundant** | Eliminate single points of failure, use availability zones, deploy multi-region, replicate data |
| 3 | **Minimize coordination** | Decouple services, use async messaging, embrace eventual consistency, use domain events |
| 4 | **Design to scale out** | Horizontal scaling, autoscaling rules, stateless services, avoid session stickiness, partition workloads |
| 5 | **Partition around limits** | Data partitioning (shard/hash/range), respect compute & network limits, use CDNs for static content |
| 6 | **Design for operations** | Structured logging, distributed tracing, metrics & dashboards, runbook automation, infrastructure as code |
| 7 | **Use managed services** | Prefer PaaS over IaaS, reduce operational burden, leverage built-in HA/DR/scaling |
| 8 | **Use an identity service** | Microsoft Entra ID, managed identity, RBAC, avoid storing credentials, zero-trust principles |
| 9 | **Design for evolution** | Loose coupling, versioned APIs, backward compatibility, async messaging for integration, feature flags |
| 10 | **Build for business needs** | Define SLAs/SLOs, establish RTO/RPO targets, domain-driven design, cost modeling, composite SLAs |

---

## Architecture Styles

| Style | Description | When to Use | Key Services |
|-------|-------------|-------------|--------------|
| **N-tier** | Horizontal layers (presentation, business, data) | Traditional enterprise apps, lift-and-shift | App Service, SQL Database, VNets |
| **Web-Queue-Worker** | Web frontend → message queue → backend worker | Moderate-complexity apps with long-running tasks | App Service, Service Bus, Functions |
| **Microservices** | Small autonomous services, bounded contexts, independent deploy | Complex domains, independent team scaling | AKS, Container Apps, API Management |
| **Event-driven** | Pub/sub model, event producers/consumers | Real-time processing, IoT, reactive systems | Event Hubs, Event Grid, Functions |
| **Big data** | Batch + stream processing pipeline | Analytics, ML pipelines, large-scale data | Synapse, Data Factory, Databricks |
| **Big compute** | HPC, parallel processing | Simulations, modeling, rendering, genomics | Batch, CycleCloud, HPC VMs |

### Selection Criteria

- **Domain complexity** → Microservices (high), N-tier (low-medium)
- **Team autonomy** → Microservices (independent teams), N-tier (single team)
- **Data volume** → Big data (terabyte scale and above), other styles (gigabyte scale)
- **Latency requirements** → Event-driven (real-time), Web-Queue-Worker (tolerant)

---

## Cloud Design Patterns

44 patterns organized by primary concern. WAF pillar mapping: **R**=Reliability, **S**=Security, **CO**=Cost Optimization, **OE**=Operational Excellence, **PE**=Performance Efficiency.

### Messaging & Communication

| Pattern | Summary | Pillars |
|---------|---------|---------|
| **Asynchronous Request-Reply** | Decouple request/response with polling or callbacks | R, PE |
| **Claim Check** | Split large messages; store payload separately, pass reference | R, PE |
| **Choreography** | Services coordinate via events without central orchestrator | R, OE |
| **Competing Consumers** | Multiple consumers process messages from shared queue concurrently | R, PE |
| **Messaging Bridge** | Connect incompatible messaging systems | R, OE |
| **Pipes and Filters** | Decompose complex processing into reusable filter stages | R, OE |
| **Priority Queue** | Prioritize requests so higher-priority work is processed first | R, PE |
| **Publisher/Subscriber** | Decouple senders from receivers via topics/subscriptions | R, PE |
| **Queue-Based Load Leveling** | Buffer requests with a queue to smooth intermittent loads | R, PE |
| **Sequential Convoy** | Process related messages in order while allowing parallel groups | R, PE |

### Reliability & Resilience

| Pattern | Summary | Pillars |
|---------|---------|---------|
| **Bulkhead** | Isolate resources per workload to prevent cascading failure | R |
| **Circuit Breaker** | Stop calling a failing service; fail fast to protect resources | R |
| **Compensating Transaction** | Undo previously committed steps when a later step fails | R |
| **Health Endpoint Monitoring** | Expose health checks for load balancers and orchestrators | R, OE |
| **Leader Election** | Coordinate distributed instances by electing a leader | R |
| **Retry** | Handle transient faults by retrying with exponential backoff | R |
| **Saga** | Manage data consistency across microservices with compensating transactions | R |
| **Scheduler Agent Supervisor** | Coordinate distributed actions with retry and failure handling | R |
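
The Circuit Breaker row above can be sketched as a small state machine. The following is a single-threaded illustration under assumed defaults: the class name, `failure_threshold`, `reset_timeout`, and the injectable `clock` are choices made for this example rather than anything prescribed by the Azure Architecture Center.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock                         # injectable so tests can control time
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "open":
            # Fail fast instead of spending threads and connections on a failing dependency.
            raise CircuitOpenError("dependency marked unavailable; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call in half-open state, or too many failures, (re)opens the circuit.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

This sketch lets every caller through while half-open; a production breaker would normally admit only a limited number of trial calls and would need locking for concurrent use.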

### Data Management

| Pattern | Summary | Pillars |
|---------|---------|---------|
| **Cache-Aside** | Load data on demand into cache from data store | PE |
| **CQRS** | Separate read and write models for independent scaling | PE, R |
| **Event Sourcing** | Store state as append-only sequence of domain events | R, OE |
| **Index Table** | Create indexes over frequently queried fields in data stores | PE |
| **Materialized View** | Pre-compute views over data for efficient queries | PE |
| **Sharding** | Distribute data across partitions for scale and performance | PE, R |
| **Static Content Hosting** | Serve static content from cloud storage/CDN directly | PE, CO |
| **Valet Key** | Grant clients limited direct access to storage resources | S, PE |
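
To make the Cache-Aside row concrete, here is a minimal sketch in Python. It assumes a caller-supplied `load` function that reads from the data store and a fixed TTL; the `CacheAside` class and its parameters are hypothetical, and a real deployment would typically use a distributed cache rather than a local dictionary.

```python
import time


class CacheAside:
    def __init__(self, load, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load          # function that reads a key from the backing data store
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so tests can control time
        self._entries = {}         # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]        # cache hit: no round trip to the data store
        value = self._load(key)    # cache miss or expired entry: read from the data store
        self._entries[key] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self, key):
        # Call after writing to the data store so the next read reloads fresh data.
        self._entries.pop(key, None)
```

The TTL bounds how stale a value can become when a writer forgets to invalidate, which is the staleness tradeoff the pattern accepts in exchange for fewer reads.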

### Design & Structure

| Pattern | Summary | Pillars |
|---------|---------|---------|
| **Ambassador** | Offload cross-cutting concerns to a helper sidecar proxy | OE |
| **Anti-Corruption Layer** | Translate between new and legacy system models | OE, R |
| **Backends for Frontends** | Create separate backends per frontend type (mobile, web, etc.) | OE, PE |
| **Compute Resource Consolidation** | Combine multiple workloads into fewer compute instances | CO |
| **External Configuration Store** | Externalize configuration from deployment packages | OE |
| **Sidecar** | Deploy helper components alongside the main service | OE |
| **Strangler Fig** | Incrementally migrate legacy systems by replacing pieces | OE, R |

### Security & Access

| Pattern | Summary | Pillars |
|---------|---------|---------|
| **Federated Identity** | Delegate authentication to an external identity provider | S |
| **Gatekeeper** | Protect services using a dedicated broker that validates requests | S |
| **Quarantine** | Isolate and validate external assets before allowing use | S |
| **Rate Limiting** | Control consumption rate of resources by consumers | R, S |
| **Throttling** | Control resource consumption to sustain SLAs under load | R, PE |
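
One common way to implement the consumption control described in the Rate Limiting and Throttling rows is a token bucket. The sketch below is an assumption of this example, not a technique named by the source; `TokenBucket`, `rate_per_second`, and `capacity` are hypothetical names.

```python
import time


class TokenBucket:
    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self.rate = rate_per_second      # tokens added per second
        self.capacity = capacity         # maximum burst size
        self._clock = clock              # injectable so tests can control time
        self._tokens = float(capacity)   # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost=1.0):
        """Return True and consume tokens if the request fits the budget, else False."""
        now = self._clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A caller that receives `False` might reject the request with HTTP 429, queue it, or degrade the response, depending on which SLA the limit is protecting.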

### Deployment & Scaling

| Pattern | Summary | Pillars |
|---------|---------|---------|
| **Deployment Stamps** | Deploy multiple independent copies of application components | R, PE |
| **Edge Workload Configuration** | Configure workloads differently across diverse edge devices | OE |
| **Gateway Aggregation** | Aggregate multiple backend calls into a single client request | PE |
| **Gateway Offloading** | Offload shared functionality (TLS termination, authentication) to a gateway | OE, S |
| **Gateway Routing** | Route requests to multiple backends using a single endpoint | OE |
| **Geode** | Deploy backends to multiple regions for active-active serving | R, PE |

See [Design Patterns Reference](./references/design-patterns.md) for detailed implementation guidance.

---

## Technology Choices

### Decision Framework

For each technology area, evaluate: **requirements → constraints → tradeoffs → select**.

| Area | Key Options | Selection Criteria |
|------|-------------|-------------------|
| **Compute** | App Service, Functions, Container Apps, AKS, VMs, Batch | Hosting model, scaling, cost, team skills |
| **Storage** | Blob Storage, Data Lake, Files, Disks, Managed Lustre | Access patterns, throughput, cost tier |
| **Data stores** | SQL Database, Cosmos DB, PostgreSQL, Redis, Table Storage | Consistency model, query patterns, scale |
| **Messaging** | Service Bus, Event Hubs, Event Grid, Queue Storage | Ordering, throughput, pub/sub vs queue |
| **Networking** | Front Door, Application Gateway, Load Balancer, Traffic Manager | Global vs regional, L4 vs L7, WAF |
| **AI services** | Azure OpenAI, AI Search, AI Foundry, Document Intelligence | Model needs, data grounding, orchestration |
| **Containers** | Container Apps, AKS, Container Instances | Operational control vs simplicity |

See [Technology Choices Reference](./references/technology-choices.md) for detailed decision trees.

---

## Best Practices

| Practice | Key Guidance |
|----------|-------------|
| **API design** | RESTful conventions, resource-oriented URIs, HATEOAS, versioning via URL path or header |
| **API implementation** | Async operations, pagination, idempotent PUT/DELETE, content negotiation, ETag caching |
| **Autoscaling** | Scale on metrics (CPU, queue depth, custom), cool-down periods, predictive scaling, scale-in protection |
| **Background jobs** | Use queues or scheduled triggers, idempotent processing, poison message handling, graceful shutdown |
| **Caching** | Cache-aside pattern, TTL policies, cache invalidation strategies, distributed cache for multi-instance |
| **CDN** | Static asset offloading, cache-busting with versioned URLs, geo-distribution, HTTPS enforcement |
| **Data partitioning** | Horizontal (sharding), vertical, functional partitioning; partition key selection for even distribution |
| **Partitioning strategies** | Hash-based, range-based, directory-based; rebalancing approach, cross-partition query avoidance |
| **Host name preservation** | Preserve original host header through proxies/gateways for cookies, redirects, auth flows |
| **Message encoding** | Schema evolution (Avro/Protobuf), backward/forward compatibility, schema registry |
| **Monitoring & diagnostics** | Structured logging, distributed tracing (W3C Trace Context), metrics, alerts, dashboards |
| **Transient fault handling** | Retry with exponential backoff + jitter, circuit breaker, idempotency keys, timeout budgets |

See [Best Practices Reference](./references/best-practices.md) for implementation details.
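
The "retry with exponential backoff + jitter" guidance in the transient fault handling row can be sketched as follows. This is a minimal illustration using the "full jitter" variant (a random delay between zero and the exponential cap); the function name, default values, and the choice of retryable exception types are assumptions of this example.

```python
import random
import time


def retry_with_backoff(operation, *, attempts=5, base_delay=0.1, max_delay=5.0,
                       retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep, rng=random.random):
    """Call operation(), retrying transient failures with capped exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # retry budget exhausted: surface the last error
            # The cap doubles on each attempt up to max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: a random delay in [0, cap) spreads out retries from many clients.
            sleep(rng() * cap)
```

Only exceptions listed in `retry_on` are retried; anything else propagates immediately, which keeps non-transient errors (such as validation failures) from being retried pointlessly. The operation should be idempotent, since it may run more than once.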

---

## Performance Antipatterns

Avoid these common patterns that degrade performance under load:

| Antipattern | Problem | Fix |
|-------------|---------|-----|
| **Busy Database** | Offloading too much processing to the database | Move logic to application tier, use caching |
| **Busy Front End** | Resource-intensive work on frontend request threads | Offload to background workers/queues |
| **Chatty I/O** | Many small I/O requests instead of fewer large ones | Batch requests, use bulk APIs, buffer writes |
| **Extraneous Fetching** | Retrieving more data than needed | Project only required fields, paginate, filter server-side |
| **Improper Instantiation** | Recreating expensive objects per request | Use singletons, connection pooling, IHttpClientFactory |
| **Monolithic Persistence** | Single data store for all data types | Polyglot persistence — right store for each workload |
| **No Caching** | Repeatedly fetching unchanged data | Cache-aside pattern, CDN, output caching, Redis |
| **Noisy Neighbor** | One tenant consuming all shared resources | Bulkhead isolation, per-tenant quotas, throttling |
| **Retry Storm** | Aggressive retries overwhelming a recovering service | Exponential backoff + jitter, circuit breaker, retry budgets |
| **Synchronous I/O** | Blocking threads on I/O operations | Async/await, non-blocking I/O, reactive streams |
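
As an example of the "batch requests" fix for Chatty I/O, the sketch below groups rows into chunks so that each round trip carries many rows. `FakeStore` is a hypothetical stand-in that only counts round trips; `batched` and `save_all` are likewise names invented for this illustration.

```python
from itertools import islice


class FakeStore:
    """Hypothetical data store client that counts network round trips."""

    def __init__(self):
        self.round_trips = 0
        self.rows = []

    def write_many(self, rows):
        self.round_trips += 1   # one call stands for one network round trip
        self.rows.extend(rows)


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


def save_all(store, rows, batch_size=100):
    # One round trip per batch instead of one per row.
    for chunk in batched(rows, batch_size):
        store.write_many(chunk)
```

Writing 250 rows with a batch size of 100 takes three calls here rather than 250; the right batch size in practice depends on payload limits and latency targets of the actual store.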

---

## Mission-Critical Design

For workloads targeting **99.99%+ SLO**, address these design areas:

| Design Area | Key Considerations |
|-------------|-------------------|
| **Application platform** | Multi-region active-active, availability zones, Container Apps or AKS with zone redundancy |
| **Application design** | Stateless services, idempotent operations, graceful degradation, bulkhead isolation |
| **Networking** | Azure Front Door (global LB), DDoS Protection, private endpoints, redundant connectivity |
| **Data platform** | Multi-region Cosmos DB, zone-redundant SQL, async replication, conflict resolution |
| **Deployment & testing** | Blue-green deployments, canary releases, chaos engineering, automated rollback |
| **Health modeling** | Composite health scores, dependency health tracking, automated remediation, SLI dashboards |
| **Security** | Zero-trust, managed identity everywhere, key rotation, WAF policies, threat modeling |
| **Operational procedures** | Automated runbooks, incident response playbooks, game days, postmortems |

See [Mission-Critical Reference](./references/mission-critical.md) for detailed guidance.
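
To put the 99.99% target in perspective, the arithmetic behind downtime budgets and composite availability can be sketched as below. The function names are hypothetical, the 30-day window is an assumption, and the redundancy formula assumes that copies fail independently, which real deployments only approximate.

```python
def downtime_budget_minutes(slo, days=30):
    """Minutes of allowed downtime over a window for a given availability target."""
    return (1.0 - slo) * days * 24 * 60


def serial_availability(*availabilities):
    """Composite availability when every dependency must be up (values multiply)."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def redundant_availability(availability, copies):
    """Availability of N independent copies where any one copy can serve requests."""
    return 1.0 - (1.0 - availability) ** copies
```

Under these assumptions, a 99.99% target leaves roughly 4.3 minutes of downtime per 30 days, compared with about 43 minutes at 99.9%. Chaining dependencies in series always lowers the composite figure below the weakest component, which is why the design areas above lean on redundancy across zones and regions.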

---

## Well-Architected Framework (WAF) Pillars

Every architecture decision should be evaluated against all five pillars:

| Pillar | Focus | Key Questions |
|--------|-------|---------------|
| **Reliability** | Resiliency, availability, disaster recovery | What is the RTO/RPO? How does it handle failures? Is there redundancy? |
| **Security** | Threat protection, identity, data protection | Is identity managed? Is data encrypted? Are there network controls? |
| **Cost Optimization** | Cost management, efficiency, right-sizing | Is compute right-sized? Are there reserved instances? Is there waste? |
| **Operational Excellence** | Monitoring, deployment, automation | Is deployment automated? Is there observability? Are there runbooks? |
| **Performance Efficiency** | Scaling, load testing, performance targets | Can it scale horizontally? Are there performance baselines? Is caching used? |

### WAF Tradeoff Matrix

| Optimizing for... | May impact... |
|-------------------|---------------|
| Reliability (redundancy) | Cost (more resources) |
| Security (isolation) | Performance (added latency) |
| Cost (consolidation) | Reliability (shared failure domains) |
| Performance (caching) | Cost (cache infrastructure), Reliability (stale data) |

---

## Architecture Review Workflow

When reviewing or designing a system, follow this structured approach:

### Step 1: Identify Requirements

```
Functional: What must the system do?
Non-functional:
  - Availability target (e.g., 99.9%, 99.99%)
  - Latency requirements (p50, p95, p99)
  - Throughput (requests/sec, messages/sec)
  - Data residency and compliance
  - Recovery targets (RTO, RPO)
  - Cost constraints
```

### Step 2: Select Architecture Style

Match requirements to an architecture style using the Architecture Styles table and the Selection Criteria list above.

### Step 3: Choose Technology Stack

Use the technology choices decision framework. Prefer managed services (PaaS) over IaaS.

### Step 4: Apply Design Patterns

Select relevant patterns from the 44 cloud design patterns based on identified concerns.

### Step 5: Address Cross-Cutting Concerns

- **Identity & access** — Microsoft Entra ID, managed identity, RBAC
- **Monitoring** — Application Insights, Azure Monitor, Log Analytics
- **Security** — Network segmentation, encryption at rest/in transit, Key Vault
- **CI/CD** — GitHub Actions, Azure DevOps Pipelines, infrastructure as code

### Step 6: Validate Against WAF Pillars

Review each pillar systematically. Document tradeoffs explicitly.

### Step 7: Document Decisions

Use Architecture Decision Records (ADRs):

```markdown
# ADR-NNN: [Decision Title]

## Status: [Proposed | Accepted | Deprecated]

## Context
[What is the issue we're addressing?]

## Decision
[What did we decide and why?]

## Consequences
[What are the positive and negative impacts?]
```

---

## References

- [Design Patterns Reference](./references/design-patterns.md) — Detailed pattern implementations
- [Technology Choices Reference](./references/technology-choices.md) — Decision trees for Azure services
- [Best Practices Reference](./references/best-practices.md) — Implementation guidance
- [Mission-Critical Reference](./references/mission-critical.md) — High-availability design

---

## Source

Content derived from the [Azure Architecture Center](https://learn.microsoft.com/en-us/azure/architecture/) — Microsoft's official guidance for cloud solution architecture on Azure. Covers design principles, architecture styles, cloud design patterns, technology choices, best practices, performance antipatterns, mission-critical design, and the Well-Architected Framework.


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### references/design-patterns.md

```markdown
# Cloud Design Patterns

A comprehensive reference of all 44 cloud design patterns from the Azure Architecture Center, organized by concern area and detailed with problem context, usage scenarios, WAF pillar alignment, and related patterns.

---

## Patterns by Concern

### Availability / Reliability

Circuit Breaker · Compensating Transaction · Health Endpoint Monitoring · Leader Election · Queue-Based Load Leveling · Retry · Saga · Scheduler Agent Supervisor · Sequential Convoy · Bulkhead · Rate Limiting

### Data Management

Cache-Aside · CQRS · Event Sourcing · Index Table · Materialized View · Sharding · Valet Key · Claim Check

### Design / Implementation

Ambassador · Anti-Corruption Layer · Backends for Frontends · Compute Resource Consolidation · Deployment Stamps · External Configuration Store · Gateway Aggregation · Gateway Offloading · Gateway Routing · Sidecar · Strangler Fig · Federated Identity

### Messaging

Asynchronous Request-Reply · Choreography · Claim Check · Competing Consumers · Messaging Bridge · Pipes and Filters · Priority Queue · Publisher/Subscriber · Queue-Based Load Leveling · Sequential Convoy

### Performance / Scalability

Cache-Aside · Geode · Throttling · Deployment Stamps · CQRS

### Security

Gatekeeper · Quarantine · Valet Key · Federated Identity · Throttling

---

## Pattern Reference

### 1. Ambassador

**Create helper services that send network requests on behalf of a consumer service or application.**

**Problem:** Applications need common connectivity features such as monitoring, logging, routing, security (TLS), and resiliency patterns. Legacy or difficult-to-modify apps may not support these features natively. Network calls require substantial configuration for concerns such as circuit breaking, routing, metering, and telemetry.

**When to use:**

- You need a common set of client connectivity features across multiple languages or frameworks
- The connectivity concern is owned by infrastructure teams or another specialized team
- You need to modernize or simplify legacy app networking without modifying source code
- You want to standardize observability and resiliency across polyglot services

**WAF Pillars:** Reliability, Security

**Related patterns:** Sidecar, Gateway Routing, Gateway Offloading

---

### 2. Anti-Corruption Layer

**Implement a façade or adapter layer between a modern application and a legacy system.**

**Problem:** During migration, the new system must often integrate with the legacy system's data model or API, which may use outdated schemas, protocols, or design conventions. Allowing the modern application to depend on legacy contracts contaminates its design and limits future evolution.

**When to use:**

- A migration is planned over multiple phases and the old and new systems must coexist
- The new system's domain model differs significantly from the legacy system
- You want to prevent legacy coupling from leaking into modern components
- Two bounded contexts (DDD) need to communicate but have incompatible models

**WAF Pillars:** Operational Excellence

**Related patterns:** Strangler Fig, Gateway Routing

---

### 3. Asynchronous Request-Reply

**Decouple backend processing from a frontend host, where backend processing needs to be asynchronous but the frontend still needs a clear response.**

**Problem:** In many architectures, the client expects an immediate acknowledgement while the actual work happens in the background. The client needs a way to learn the result of the background operation without holding a long-lived connection or repeatedly guessing when processing completes.

**When to use:**

- Backend processing may take seconds to minutes and the client should not block
- Client-side code such as a browser app cannot provide a callback endpoint
- You want to expose an HTTP API where the server initiates long-running work and the client polls for results
- You need an alternative to WebSocket or server-sent events for status updates

**WAF Pillars:** Performance Efficiency

**Related patterns:** Competing Consumers, Pipes and Filters, Queue-Based Load Leveling

---

### 4. Backends for Frontends

**Create separate backend services to be consumed by specific frontend applications or interfaces.**

**Problem:** A general-purpose backend API tends to accumulate conflicting requirements from different frontends (mobile, web, desktop, IoT). Over time the backend becomes bloated and changes for one frontend risk breaking another. Release cadences diverge and the single backend becomes a bottleneck.

**When to use:**

- A shared backend that serves multiple frontends requires significant development overhead to maintain
- You want to optimize each backend for the constraints of a specific client (bandwidth, latency, payload shape)
- Different frontend teams need independent release cycles for their backend logic
- A single backend would require complex, client-specific branching logic

**WAF Pillars:** Reliability, Security, Performance Efficiency

**Related patterns:** Gateway Aggregation, Gateway Offloading, Gateway Routing

---

### 5. Bulkhead

**Isolate elements of an application into pools so that if one fails, the others continue to function.**

**Problem:** A cloud-based application may call multiple downstream services. If a single downstream becomes slow or unresponsive, the caller's thread pool or connection pool can be exhausted, causing cascading failure that takes down unrelated functionality.

**When to use:**

- You need to protect critical consumers from failures in non-critical downstream dependencies
- A single noisy tenant or request type should not degrade service for others
- You want to limit the blast radius of a downstream fault
- The application calls multiple services with differing SLAs

**WAF Pillars:** Reliability, Security, Performance Efficiency

**Related patterns:** Circuit Breaker, Retry, Throttling, Queue-Based Load Leveling

---

### 6. Cache-Aside

**Load data on demand into a cache from a data store.**

**Problem:** Applications frequently read the same data from a data store. Repeated round trips increase latency and reduce throughput. Data stores may throttle or become expensive under high read loads.

**When to use:**

- The data store is read-heavy and the same data is requested frequently
- The data store does not natively provide caching
- Data can tolerate short periods of staleness
- You want to reduce cost and latency of repeated reads

**WAF Pillars:** Reliability, Performance Efficiency

**Related patterns:** Materialized View, Event Sourcing, CQRS

---

### 7. Choreography

**Have each component of the system participate in the decision-making process about the workflow of a business transaction, instead of relying on a central point of control.**

**Problem:** A central orchestrator can become a single point of failure, a performance bottleneck, or a source of tight coupling. Changes to the workflow require changes to the orchestrator, which may be owned by a different team.

**When to use:**

- Services are independently deployable and owned by separate teams
- The workflow changes frequently and a central orchestrator would become a maintenance burden
- You want to avoid a single point of failure in workflow coordination
- Services need loose coupling and can react to events

**WAF Pillars:** Operational Excellence, Performance Efficiency

**Related patterns:** Publisher/Subscriber, Saga, Competing Consumers

---

### 8. Circuit Breaker

**Handle faults that might take a variable amount of time to fix when connecting to a remote service or resource.**

**Problem:** Transient faults are handled by the Retry pattern, but when a downstream service is unavailable for an extended period, retries waste resources and block callers. Continuing to send requests to a failing service prevents it from recovering and wastes the caller's threads, connections, and compute.

**When to use:**

- A remote dependency experiences recurring or prolonged outages
- You need to fail fast rather than make callers wait for a timeout
- You want to give a failing downstream time to recover before sending more requests
- You want to surface degraded functionality instead of hard failures

**WAF Pillars:** Reliability, Performance Efficiency

**Related patterns:** Retry, Bulkhead, Health Endpoint Monitoring, Ambassador

---

### 9. Claim Check

**Split a large message into a claim check and payload to avoid overwhelming the messaging infrastructure.**

**Problem:** Message brokers and queues often have size limits and charge per message. Sending large payloads (images, documents, datasets) through the messaging channel wastes bandwidth, increases cost, and may hit transport limits.

**When to use:**

- The message payload exceeds the messaging system's size limit
- You want to reduce messaging costs by keeping message bodies small
- Not all consumers need the full payload; some only need metadata
- You need to protect sensitive payload data by separating access control from the messaging channel

**WAF Pillars:** Reliability, Security, Cost Optimization, Performance Efficiency

**Related patterns:** Competing Consumers, Pipes and Filters, Publisher/Subscriber

---

### 10. Compensating Transaction

**Undo the work performed by a series of steps, which together define an eventually consistent operation.**

**Problem:** In distributed systems, multi-step operations cannot rely on traditional ACID transactions. If a later step fails, the previous steps have already committed. The system needs a mechanism to reverse or compensate for the work done by the completed steps.

**When to use:**

- Multi-step operations span multiple services or data stores that do not share a transaction coordinator
- You need to maintain consistency when a step in a distributed workflow fails
- Rolling back is semantically meaningful (refund a charge, release a reservation)
- The cost of inconsistency outweighs the complexity of compensation logic

**WAF Pillars:** Reliability

**Related patterns:** Saga, Retry, Scheduler Agent Supervisor

---

### 11. Competing Consumers

**Enable multiple concurrent consumers to process messages received on the same messaging channel.**

**Problem:** At peak load, a single consumer cannot keep up with the volume of incoming messages. Messages queue up, latency increases, and the system may breach its SLAs.

**When to use:**

- The workload varies significantly and you need to scale message processing dynamically
- You require high availability—if one consumer fails, others continue processing
- Multiple messages are independent and can be processed in parallel
- You want to distribute work across multiple instances or nodes

**WAF Pillars:** Reliability, Cost Optimization, Performance Efficiency

**Related patterns:** Queue-Based Load Leveling, Priority Queue, Publisher/Subscriber, Pipes and Filters

---

### 12. Compute Resource Consolidation

**Consolidate multiple tasks or operations into a single computational unit.**

**Problem:** Deploying each small task or component as a separate service introduces operational overhead: more deployments, monitoring endpoints, and infrastructure cost. Many lightweight components are underutilized most of the time, wasting allocated compute.

**When to use:**

- Several lightweight processes have low CPU or memory usage individually
- You want to reduce deployment and operational overhead
- Processes share the same scaling profile and lifecycle
- Communication between tasks benefits from being in-process rather than over the network

**WAF Pillars:** Cost Optimization, Operational Excellence, Performance Efficiency

**Related patterns:** Sidecar, Backends for Frontends

---

### 13. CQRS (Command and Query Responsibility Segregation)

**Segregate operations that read data from operations that update data by using separate interfaces.**

**Problem:** In traditional CRUD architectures, the same data model is used for reads and writes. This creates tension: read models want denormalized, query-optimized shapes while write models want normalized, consistency-optimized shapes. As the system grows, the shared model becomes a compromise that serves neither concern well.

**When to use:**

- Read and write workloads are asymmetric (far more reads than writes, or vice versa)
- Read and write models have different schema requirements
- You want to scale read and write sides independently
- The domain benefits from an event-driven or task-based style rather than CRUD

**WAF Pillars:** Performance Efficiency

**Related patterns:** Event Sourcing, Materialized View, Cache-Aside

---

### 14. Deployment Stamps

**Deploy multiple independent copies of application components, including data stores.**

**Problem:** A single shared deployment for all tenants or regions creates coupling. A fault in one tenant's workload can affect all tenants. Regulatory requirements may mandate data residency. Scaling the entire deployment for a single tenant's spike is wasteful.

**When to use:**

- You need to isolate tenants for compliance, performance, or fault isolation
- Your application must serve multiple geographic regions with data residency requirements
- You need independent scaling per tenant group or region
- You want blue/green or canary deployments at the stamp level

**WAF Pillars:** Operational Excellence, Performance Efficiency

**Related patterns:** Geode, Sharding, Throttling

---

### 15. Edge Workload Configuration

**Centrally configure workloads that run at the edge, managing configuration drift and deployment consistency across heterogeneous edge devices.**

**Problem:** Edge devices are numerous, heterogeneous, and often intermittently connected. Deploying and configuring workloads individually is error-prone. Configuration drift between devices causes inconsistent behavior and difficult debugging.

**When to use:**

- You manage a fleet of edge devices running the same workload with differing local parameters
- Edge devices have intermittent connectivity and must operate independently
- You need a central source of truth for configuration with local overrides
- Audit and compliance require tracking which configuration each device is running

**WAF Pillars:** Operational Excellence

**Related patterns:** External Configuration Store, Sidecar, Ambassador

---

### 16. Event Sourcing

**Use an append-only store to record the full series of events that describe actions taken on data in a domain.**

**Problem:** Traditional CRUD stores only keep current state—history is lost. Audit, debugging, temporal queries, and replays are impossible without supplementary logging, which is often incomplete or out of sync.

**When to use:**

- You need a complete, immutable audit trail of all changes
- The business logic benefits from replaying or projecting events into different views
- You want to decouple the write model from the read model (often combined with CQRS)
- You need to reconstruct past states for debugging or regulatory purposes

**WAF Pillars:** Reliability, Performance Efficiency

**Related patterns:** CQRS, Materialized View, Compensating Transaction, Saga

---

### 17. External Configuration Store

**Move configuration information out of the application deployment package to a centralized location.**

**Problem:** Configuration files deployed alongside the application binary require redeployment to change. Different environments (dev, staging, prod) need different values. Sharing configuration across multiple services is difficult when each has its own config file.

**When to use:**

- Multiple services share common configuration settings
- You need to update configuration without redeploying or restarting services
- You want centralized access control and audit logging for configuration
- Configuration must differ across environments but be managed in a single system

**WAF Pillars:** Operational Excellence

**Related patterns:** Edge Workload Configuration, Sidecar

---

### 18. Federated Identity

**Delegate authentication to an external identity provider.**

**Problem:** Building and maintaining your own identity store introduces security risks (password storage, credential rotation, MFA implementation). Users must manage separate credentials for each application, leading to password fatigue and a weaker security posture.

**When to use:**

- Users already have identities in an enterprise directory or social provider
- You want to enable single sign-on (SSO) across multiple applications
- Business partners need to access your application with their own credentials
- You want to offload identity management (MFA, password policy) to a specialized provider

**WAF Pillars:** Reliability, Security, Performance Efficiency

**Related patterns:** Gatekeeper, Valet Key, Gateway Offloading

---

### 19. Gatekeeper

**Protect applications and services by using a dedicated host instance that acts as a broker between clients and the application or service.**

**Problem:** Services that expose public endpoints are vulnerable to malicious attacks. Placing validation, authentication, and sanitization logic inside the service mixes security concerns with business logic, and if the service is compromised the attacker gains access to everything the service can reach.

**When to use:**

- Applications handle sensitive data or high-value transactions
- You need a centralized point for request validation and sanitization
- You want to limit the attack surface by isolating security checks from the trusted host
- Compliance requires an explicit security boundary between public and private tiers

**WAF Pillars:** Security

**Related patterns:** Valet Key, Gateway Routing, Gateway Offloading, Federated Identity

---

### 20. Gateway Aggregation

**Use a gateway to aggregate multiple individual requests into a single request.**

**Problem:** A client (especially mobile) may need data from multiple backend microservices to render a single page or view. Making many fine-grained calls from the client increases latency (multiple round trips), battery usage, and complexity. The client must understand the topology of the backend.

**When to use:**

- A client needs to make multiple calls to different backend services for a single operation
- Network latency between the client and backend is significant (mobile, IoT, remote clients)
- You want to reduce chattiness and simplify client code
- Backend services are fine-grained microservices and the client should not know their topology

**WAF Pillars:** Reliability, Security, Operational Excellence, Performance Efficiency

**Related patterns:** Backends for Frontends, Gateway Offloading, Gateway Routing

---

### 21. Gateway Offloading

**Offload shared or specialized service functionality to a gateway proxy.**

**Problem:** Cross-cutting concerns such as TLS termination, authentication, rate limiting, logging, and compression are duplicated across every service. Each team must implement, configure, and maintain these features independently, leading to inconsistency and wasted effort.

**When to use:**

- Multiple services share cross-cutting concerns (TLS, auth, rate limiting, logging)
- You want to standardize and centralize these features instead of duplicating them per service
- You need to reduce operational complexity for individual service teams
- The gateway is already in the request path and adding features there avoids extra hops

**WAF Pillars:** Reliability, Security, Cost Optimization, Operational Excellence, Performance Efficiency

**Related patterns:** Gateway Aggregation, Gateway Routing, Sidecar, Ambassador

---

### 22. Gateway Routing

**Route requests to multiple services using a single endpoint.**

**Problem:** Clients must know the addresses of multiple services. Adding, removing, or relocating a service requires client updates. Exposing internal service topology increases coupling and complicates DNS and load balancing.

**When to use:**

- You want to expose multiple services behind a single URL with path- or header-based routing
- You need to decouple client URLs from the internal service topology
- You want to simplify client configuration and DNS management
- You need to support versioned APIs or blue/green routing at the gateway level

**WAF Pillars:** Reliability, Operational Excellence, Performance Efficiency

**Related patterns:** Gateway Aggregation, Gateway Offloading, Backends for Frontends

---

### 23. Geode

**Deploy backend services into a set of geographical nodes, each of which can service any client request in any region.**

**Problem:** Users across the globe experience high latency when all traffic routes to a single region. A single-region deployment is also a single point of failure.

**When to use:**

- Users are globally distributed and expect low-latency access
- You need active-active multi-region availability
- Data can be replicated to every node so that any node can serve any request (if residency rules pin data to specific regions, consider Deployment Stamps instead)
- A single-region failure should not take down the service globally

**WAF Pillars:** Reliability, Performance Efficiency

**Related patterns:** Deployment Stamps, Sharding, Cache-Aside, Static Content Hosting

---

### 24. Health Endpoint Monitoring

**Implement functional checks in an application that external tools can access through exposed endpoints at regular intervals.**

**Problem:** Without health checks, failures are detected only when users report them. Load balancers, orchestrators, and monitoring systems need a programmatic way to determine whether an instance is healthy, ready to accept traffic, and functioning correctly.

**When to use:**

- You use a load balancer or orchestrator (Kubernetes, App Service) that needs liveness and readiness signals
- You want automated alerting when a service degrades
- You need to verify downstream dependencies (database, cache, third-party API) are reachable
- You want to enable self-healing by removing unhealthy instances from rotation

**WAF Pillars:** Reliability, Operational Excellence, Performance Efficiency

**Related patterns:** Circuit Breaker, Retry, Ambassador

---

### 25. Index Table

**Create indexes over the fields in data stores that are frequently referenced by queries.**

**Problem:** Many NoSQL stores support queries efficiently only on the primary or partition key. Queries on non-key fields result in full scans, which are slow and expensive at scale.

**When to use:**

- Queries frequently filter or sort on non-key attributes
- The data store does not support secondary indexes natively
- You want to trade additional storage and write overhead for faster reads
- Read performance on secondary fields is critical to the user experience

**WAF Pillars:** Reliability, Performance Efficiency

**Related patterns:** Materialized View, Sharding, CQRS

---

### 26. Leader Election

**Coordinate the actions performed by a collection of collaborating instances by electing one instance as the leader that assumes responsibility for managing the others.**

**Problem:** Multiple identical instances need to coordinate shared work (aggregation, scheduling, rebalancing). Without coordination, instances may duplicate effort, conflict, or corrupt shared state.

**When to use:**

- A group of peer instances needs exactly one to perform coordination tasks
- The leader must be automatically re-elected if it fails
- You want to avoid a statically assigned coordinator that becomes a single point of failure
- Distributed locks or consensus are needed for coordination

**WAF Pillars:** Reliability

**Related patterns:** Scheduler Agent Supervisor, Competing Consumers

---

### 27. Materialized View

**Generate prepopulated views over the data in one or more data stores when the data isn't ideally formatted for required query operations.**

**Problem:** The normalized storage schema that is optimal for writes is often suboptimal for complex read queries. Joins across tables or services are expensive, slow, and sometimes impossible in NoSQL stores.

**When to use:**

- Read queries require joining data across multiple stores or tables
- The read pattern is well-known and relatively stable
- You can tolerate a short delay between a write and its appearance in the view
- You want to offload complex query logic from the application tier

**WAF Pillars:** Performance Efficiency

**Related patterns:** CQRS, Event Sourcing, Index Table, Cache-Aside

---

### 28. Messaging Bridge

**Build an intermediary to enable communication between messaging systems that are otherwise incompatible, due to protocol, format, or infrastructure differences.**

**Problem:** Organizations may operate multiple messaging systems (RabbitMQ, Azure Service Bus, Kafka, legacy MQ). Applications bound to different brokers cannot exchange messages natively, creating data silos and duplicated effort.

**When to use:**

- You need to integrate systems that use different messaging protocols or brokers
- A migration from one messaging platform to another must be gradual
- You want to decouple producers and consumers from a specific messaging technology
- Interoperability between cloud and on-premises messaging is required

**WAF Pillars:** Cost Optimization, Operational Excellence

**Related patterns:** Anti-Corruption Layer, Publisher/Subscriber, Strangler Fig

---

### 29. Pipes and Filters

**Decompose a task that performs complex processing into a series of separate elements that can be reused.**

**Problem:** Monolithic processing pipelines are difficult to test, scale, and reuse. A single stage's failure brings down the entire pipeline. Different stages may have different scaling needs that cannot be addressed independently.

**When to use:**

- Processing can be broken into discrete, independent steps
- Different steps have different scaling or deployment requirements
- You want to reorder, add, or remove processing stages without rewriting the pipeline
- Individual steps should be independently testable and reusable

**WAF Pillars:** Reliability

**Related patterns:** Competing Consumers, Queue-Based Load Leveling, Choreography

---

### 30. Priority Queue

**Prioritize requests sent to services so that requests with a higher priority are received and processed more quickly than those with a lower priority.**

**Problem:** A standard FIFO queue treats all messages equally. High-priority messages (payment confirmations, alerts) wait behind low-priority messages (reports, analytics), degrading the experience for critical operations.

**When to use:**

- The system serves multiple clients or tenants with different SLAs
- Certain message types must be processed within tighter time bounds
- You need to guarantee that premium or critical workloads are not starved
- Different priorities map to different processing costs or deadlines

**WAF Pillars:** Reliability, Performance Efficiency

**Related patterns:** Competing Consumers, Queue-Based Load Leveling, Throttling

---

### 31. Publisher/Subscriber

**Enable an application to announce events to multiple interested consumers asynchronously, without coupling the senders to the receivers.**

**Problem:** A service that must notify multiple consumers directly becomes tightly coupled to each consumer. Adding a new consumer requires changing the producer. Synchronous calls from the producer to each consumer increase latency and reduce availability.

**When to use:**

- Multiple consumers need to react to the same event independently
- Producers and consumers should evolve independently
- You need to add new consumers without modifying the producer
- The system benefits from temporal decoupling (consumers process events at their own pace)

**WAF Pillars:** Reliability, Security, Cost Optimization, Operational Excellence, Performance Efficiency

**Related patterns:** Choreography, Competing Consumers, Event Sourcing, Queue-Based Load Leveling

---

### 32. Quarantine

**Ensure external assets meet a team-agreed quality level before being authorized for consumption.**

**Problem:** External artifacts (container images, packages, data files, IaC modules) may contain vulnerabilities, malware, or misconfigurations. Consuming them without validation exposes the system to supply-chain attacks and compliance violations.

**When to use:**

- You consume third-party container images, packages, or libraries
- Compliance requires scanning artifacts before deployment
- You want to gate promotion of artifacts through quality tiers (untested → tested → approved)
- You need traceability for which version of an external asset is deployed

**WAF Pillars:** Security, Operational Excellence

**Related patterns:** Gatekeeper, Gateway Offloading, Pipes and Filters

---

### 33. Queue-Based Load Leveling

**Use a queue that acts as a buffer between a task and a service it invokes, to smooth intermittent heavy loads.**

**Problem:** Spikes in demand can overwhelm a service, causing throttling, errors, or outages. Over-provisioning to handle peak load wastes resources during normal periods.

**When to use:**

- The workload is bursty but the backend service scales slowly or is expensive to over-provision
- You need to decouple the rate at which work is submitted from the rate at which it is processed
- You want to flatten traffic spikes without losing requests
- Temporary unavailability of the backend should not result in data loss

**WAF Pillars:** Reliability, Cost Optimization, Performance Efficiency

**Related patterns:** Competing Consumers, Priority Queue, Throttling, Bulkhead

---

### 34. Rate Limiting

**Control the rate of requests a client or service can send to or receive from another service, to prevent overconsumption of resources.**

**Problem:** Without rate limiting, a misbehaving or compromised client can exhaust backend resources, causing service degradation for all consumers. Uncontrolled traffic can also trigger cloud provider throttling or incur excessive costs.

**When to use:**

- You need to protect shared resources from overconsumption
- You want to enforce fair usage across tenants or clients
- The backend has known capacity limits and you want to stay below them
- You need to prevent cascade failures caused by sudden traffic surges

**WAF Pillars:** Reliability

**Related patterns:** Throttling, Bulkhead, Circuit Breaker, Queue-Based Load Leveling

---

### 35. Retry

**Enable an application to handle anticipated, temporary failures when it tries to connect to a service or network resource, by transparently retrying a failed operation.**

**Problem:** Cloud environments experience transient faults—brief network glitches, service restarts, temporary throttling. If the application treats every failure as permanent, it unnecessarily degrades the user experience or triggers costly failovers.

**When to use:**

- The failure is likely transient (HTTP 429, 503, timeout)
- The operation is idempotent, so repeating the same request is safe
- You want to improve resilience without complex failover infrastructure
- The downstream service is expected to recover within a short window

**WAF Pillars:** Reliability

**Related patterns:** Circuit Breaker, Bulkhead, Ambassador, Health Endpoint Monitoring

---

### 36. Saga

**Manage data consistency across microservices in distributed transaction scenarios.**

**Problem:** Traditional distributed transactions (two-phase commit) are often impractical across microservices with independent databases. Without a coordination mechanism, partial failures leave the system in an inconsistent state.

**When to use:**

- A business operation spans multiple microservices, each with its own data store
- You need eventual consistency with well-defined compensating actions
- Two-phase commit is not available or would introduce unacceptable coupling
- Each step can be reversed by a compensating transaction

**WAF Pillars:** Reliability

**Related patterns:** Compensating Transaction, Choreography, Scheduler Agent Supervisor, Event Sourcing

---

### 37. Scheduler Agent Supervisor

**Coordinate a set of distributed actions as a single operation. If any of the actions fail, try to handle the failures transparently, or else undo the work that was performed.**

**Problem:** Distributed workflows involve multiple steps across different services. You need a way to track progress, detect failures, retry or compensate, and ensure the workflow reaches a terminal state.

**When to use:**

- A workflow spans multiple remote services that must be orchestrated
- You need centralized monitoring and control of workflow progress
- Failures should be retried or compensated automatically
- The workflow must guarantee completion or rollback

**WAF Pillars:** Reliability, Performance Efficiency

**Related patterns:** Saga, Compensating Transaction, Leader Election, Retry

---

### 38. Sequential Convoy

**Process a set of related messages in a defined order, without blocking processing of other groups of messages.**

**Problem:** Message ordering is required within a logical group (e.g., all events for a single order), but a strict global order would serialize the entire system and destroy throughput. Different groups are independent and can be processed in parallel.

**When to use:**

- Messages within a group must be processed in order (e.g., order events, session events)
- Different groups can be processed concurrently
- Out-of-order processing within a group would cause data corruption or incorrect results
- You need high throughput across groups while preserving per-group ordering

**WAF Pillars:** Reliability

**Related patterns:** Competing Consumers, Priority Queue, Queue-Based Load Leveling

---

### 39. Sharding

**Divide a data store into a set of horizontal partitions or shards.**

**Problem:** A single database server has finite storage, compute, and I/O capacity. As data volume and query load grow, vertical scaling becomes prohibitively expensive or hits hardware limits. A single server is also a single point of failure.

**When to use:**

- A single data store cannot handle the storage or throughput requirements
- You need to distribute data geographically for latency or compliance
- You want to scale horizontally by adding more nodes
- Different subsets of data have different access patterns or SLAs

**WAF Pillars:** Reliability, Cost Optimization

**Related patterns:** Index Table, Materialized View, Geode, Deployment Stamps

---

### 40. Sidecar

**Deploy components of an application into a separate process or container to provide isolation and encapsulation.**

**Problem:** Applications need cross-cutting functionality (logging, monitoring, configuration, networking). Embedding these directly into the application creates tight coupling, language lock-in, and shared failure domains.

**When to use:**

- The cross-cutting component must run alongside the main application but in isolation
- The main application and the sidecar can be developed in different languages
- You want to extend or add functionality without modifying the main application
- The sidecar has a different lifecycle or scaling requirement than the main application

**WAF Pillars:** Security, Operational Excellence

**Related patterns:** Ambassador, Gateway Offloading, Compute Resource Consolidation

---

### 41. Static Content Hosting

**Deploy static content to a cloud-based storage service that can deliver it directly to the client.**

**Problem:** Serving static files (images, scripts, stylesheets, documents) from the application server wastes compute that should be reserved for dynamic content. It also limits scalability and increases cost.

**When to use:**

- The application serves static files that do not change per request
- You want to reduce the load on your web servers
- A CDN can accelerate delivery to geographically distributed users
- You want cost-effective, highly available static file hosting

**WAF Pillars:** Cost Optimization

**Related patterns:** Valet Key, Geode, Cache-Aside

---

### 42. Strangler Fig

**Incrementally migrate a legacy system by gradually replacing specific pieces of functionality with new applications and services.**

**Problem:** Rewriting a large legacy system from scratch is risky, expensive, and often fails. The legacy system must continue operating during the migration. A big-bang cutover is not feasible due to risk and business continuity requirements.

**When to use:**

- You are replacing a monolithic legacy application incrementally
- You need the old and new systems to coexist during migration
- You want to reduce migration risk by delivering value incrementally
- The legacy system is too large or complex for a single-phase replacement

**WAF Pillars:** Reliability, Cost Optimization, Operational Excellence

**Related patterns:** Anti-Corruption Layer, Gateway Routing, Sidecar

---

### 43. Throttling

**Control the consumption of resources used by an instance of an application, an individual tenant, or an entire service.**

**Problem:** Sudden spikes in demand or a misbehaving tenant can exhaust shared resources, degrading the experience for all users. Over-provisioning to handle worst-case load is expensive and wasteful.

**When to use:**

- You need to enforce SLAs or quotas per tenant in a multi-tenant system
- The backend has known capacity limits and exceeding them causes instability
- You want to degrade gracefully under load rather than fail entirely
- You need to prevent a single workload from monopolizing shared resources

**WAF Pillars:** Reliability, Security, Cost Optimization, Performance Efficiency

**Related patterns:** Rate Limiting, Bulkhead, Queue-Based Load Leveling, Priority Queue

---

### 44. Valet Key

**Use a token or key that provides clients with restricted direct access to a specific resource or service.**

**Problem:** Routing all data transfers through the application server creates a bottleneck and increases cost. Giving clients direct access to the data store without any restriction is a security risk.

**When to use:**

- You want to minimize the load on the application server for data-intensive transfers (uploads/downloads)
- You need to grant time-limited, scope-limited access to a specific resource
- You want to offload transfer bandwidth to a storage service (Blob Storage, S3)
- Clients need temporary access without full credentials to the data store

**WAF Pillars:** Security, Cost Optimization, Performance Efficiency

**Related patterns:** Gatekeeper, Static Content Hosting, Federated Identity

---

> Source: [Azure Architecture Center — Cloud Design Patterns](https://learn.microsoft.com/en-us/azure/architecture/patterns/)

```
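
To make one entry from the catalog above concrete, the Retry pattern (35) can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not an Azure SDK API: `TransientError`, `retry_with_backoff`, `make_flaky`, and all parameter names are hypothetical, and the exponential backoff with full jitter is one common choice that the catalog itself does not prescribe. The operation is assumed to be idempotent, as the pattern's "When to use" list requires.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. HTTP 429/503, timeouts)."""


def retry_with_backoff(operation, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep, rng=random.random):
    """Call `operation` until it succeeds or `max_attempts` is exhausted.

    Only TransientError is retried; any other exception propagates immediately.
    `sleep` and `rng` are injectable so the schedule can be tested without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and let the caller (or a circuit breaker) decide
            # Exponential ceiling capped at max_delay; full jitter spreads out
            # clients so they do not all retry at the same instant.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * ceiling)


def make_flaky(failures):
    """Build a demo operation that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientError("simulated 503")
        return f"ok after {state['calls']} calls"

    return operation


if __name__ == "__main__":
    # Two simulated transient failures, then success on the third call.
    print(retry_with_backoff(make_flaky(2), base_delay=0.01))
```

Re-raising after the last attempt keeps the sketch composable with the related Circuit Breaker pattern: the retry loop handles short blips, and persistent failures surface to whatever wraps it.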

### references/technology-choices.md

```markdown
# Azure Technology Choice Decision Frameworks

Decision frameworks for selecting the right Azure service in each category. Use these tables to compare options based on scale, cost, complexity, and use case fit.

## Decision Approach

1. **Start with requirements** — workload type, scale needs, team expertise
2. **Use the comparison tables** — narrow to 2-3 candidates
3. **Follow the decision trees** — Azure Architecture Center provides flowcharts for compute, data store, load balancing, and messaging
4. **Validate with constraints** — budget, compliance, regional availability, existing infrastructure

---

## 1. Compute

Choose a compute service based on control needs, scaling model, and operational complexity.

| Service | Best For | Scale | Complexity | Cost Model |
|---|---|---|---|---|
| Azure VMs | Full control, lift-and-shift, custom OS | Manual/VMSS | High | Per VM size (per-second billing) |
| App Service | Web apps, APIs, mobile backends | Built-in autoscale | Low | Per App Service plan |
| Azure Functions | Event-driven, short-lived processes | Consumption-based auto | Very Low | Per execution |
| AKS | Microservices, complex orchestration | Node/pod autoscaling | High | Per node VM |
| Container Apps | Serverless containers, microservices | KEDA-based autoscale | Medium | Per vCPU/memory/s |
| Container Instances | Simple containers, batch jobs | Per-instance | Very Low | Per second |

**Quick decision:**
- Need full OS control? → **VMs**
- Web app or API with minimal ops? → **App Service**
- Short-lived event-driven code? → **Functions**
- Complex microservices with K8s expertise? → **AKS**
- Microservices without K8s management? → **Container Apps**
- Run a container quickly, no orchestration? → **Container Instances**

---

## 2. Storage

Choose a storage service based on data structure, access patterns, and scale.

| Service | Best For | Access Pattern | Scale | Cost |
|---|---|---|---|---|
| Blob Storage | Unstructured data, media, backups | REST API, SDK | Massive | Per GB + operations |
| Azure Files | SMB/NFS file shares, lift-and-shift | File system mount | TB-scale | Per GB provisioned |
| Queue Storage | Simple message queuing | Pull-based | High throughput | Very low per message |
| Table Storage | NoSQL key-value data | REST API | TB-scale | Per GB + operations |
| Data Lake Storage | Big data analytics, hierarchical namespace | ABFS, REST | Massive | Per GB, tiered |

**Quick decision:**
- Blobs, images, videos, backups? → **Blob Storage**
- Need a mounted file share (SMB/NFS)? → **Azure Files**
- Simple async message queue? → **Queue Storage**
- Key-value NoSQL without Cosmos DB cost? → **Table Storage**
- Big data analytics with hierarchical namespace? → **Data Lake Storage**

---

## 3. Database

Choose a database based on data model, consistency needs, and scale requirements.

| Service | Best For | Consistency | Scale | Cost Model |
|---|---|---|---|---|
| Azure SQL | Relational, OLTP, enterprise apps | Strong (ACID) | Up to Hyperscale | DTU or vCore-based |
| Cosmos DB | Global distribution, multi-model, low latency | Tunable (5 levels) | Unlimited horizontal | RU/s + storage |
| Azure Database for PostgreSQL | Open-source relational, PostGIS, JSON | Strong (ACID) | Flexible Server auto | vCore-based |
| Azure Database for MySQL | Open-source relational, web apps | Strong (ACID) | Flexible Server auto | vCore-based |

**Quick decision:**
- Enterprise SQL Server workloads? → **Azure SQL**
- Global distribution or single-digit-ms latency? → **Cosmos DB**
- Open-source relational with spatial/JSON? → **PostgreSQL**
- Open-source relational for web apps? → **MySQL**

---

## 4. Messaging

Choose a messaging service based on delivery guarantees, throughput, and integration pattern.

| Service | Best For | Delivery | Throughput | Cost |
|---|---|---|---|---|
| Service Bus | Enterprise messaging, ordered delivery, transactions | At-least-once, at-most-once | Moderate-high | Per operation + unit |
| Event Hubs | Event streaming, telemetry, big data ingestion | At-least-once, partitioned | Very high (millions/s) | Per TU/PU + ingress |
| Event Grid | Event-driven reactive programming, webhooks | At-least-once | High | Per operation |
| Queue Storage | Simple async messaging, decoupling | At-least-once | Moderate | Very low per message |

**Quick decision:**
- Enterprise messaging with ordering/transactions? → **Service Bus**
- High-volume event streaming or telemetry? → **Event Hubs**
- Reactive event routing (resource events, webhooks)? → **Event Grid**
- Simple, cheap async decoupling? → **Queue Storage**

---

## 5. Networking

Choose a load balancing service based on traffic scope, protocol layer, and feature needs.

| Service | Best For | Scope | Layer | Features |
|---|---|---|---|---|
| Azure Front Door | Global HTTP(S) load balancing, CDN, WAF | Global | Layer 7 | CDN, WAF, SSL offload, caching |
| Application Gateway | Regional HTTP(S) load balancing, WAF | Regional | Layer 7 | WAF, URL routing, SSL termination |
| Azure Load Balancer | TCP/UDP traffic distribution | Regional | Layer 4 | High perf, zone redundant |
| Traffic Manager | DNS-based global traffic routing | Global | DNS | Failover, performance, geographic routing |

**Quick decision:**
- Global HTTP(S) with CDN and WAF? → **Front Door**
- Regional HTTP(S) with WAF? → **Application Gateway**
- Regional TCP/UDP load balancing? → **Load Balancer**
- DNS-based global failover? → **Traffic Manager**

---

## 6. AI Services

Choose an AI service based on customization needs and model type.

| Service | Best For | Complexity | Scale / Pricing |
|---|---|---|---|
| Azure OpenAI | LLMs, GPT models, generative AI | Medium | API-based, token pricing |
| Azure AI Services | Pre-built AI (vision, speech, language) | Low | API-based, per transaction |
| Azure Machine Learning | Custom ML models, MLOps, training | High | Compute cluster-based |

**Quick decision:**
- Need GPT/LLM capabilities? → **Azure OpenAI**
- Pre-built vision, speech, or language? → **AI Services**
- Custom model training and MLOps? → **Azure Machine Learning**

---

## 7. Containers

Choose a container service based on orchestration needs and operational complexity.

| Service | Best For | Orchestration | Complexity | Cost |
|---|---|---|---|---|
| AKS | Full Kubernetes, complex workloads | Full K8s control plane | High | Per node VM |
| Container Apps | Serverless containers, microservices, event-driven | Managed (built on K8s) | Medium | Per vCPU-second and GiB-second |
| Container Instances | Simple containers, sidecar groups, batch | None (per-instance) | Very Low | Per second |

**Quick decision:**
- Need full Kubernetes API and control? → **AKS**
- Serverless containers with event-driven scaling? → **Container Apps**
- Run a single container or batch job quickly? → **Container Instances**

---

## Related Decision Trees

The Azure Architecture Center provides detailed flowcharts for these decisions:

- [Choose a compute service](https://learn.microsoft.com/en-us/azure/architecture/guide/technology-choices/compute-decision-tree)
- [Compare Container Apps with other options](https://learn.microsoft.com/en-us/azure/architecture/guide/technology-choices/compute-decision-tree#compare-container-options)
- [Choose a data store](https://learn.microsoft.com/en-us/azure/architecture/guide/technology-choices/data-store-overview)
- [Load balancing decision tree](https://learn.microsoft.com/en-us/azure/architecture/guide/technology-choices/load-balancing-overview)
- [Compare messaging services](https://learn.microsoft.com/en-us/azure/architecture/guide/technology-choices/messaging)

> Source: [Azure Architecture Center](https://learn.microsoft.com/en-us/azure/architecture/)

```
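
The Networking quick-decision list in the reference above can be read as a small lookup on two questions: is the traffic global or regional, and is it HTTP(S) or not. The following is a minimal sketch of that reading, not an official selector. `TrafficProfile` and `choose_load_balancer` are hypothetical names, and mapping global non-HTTP traffic to Traffic Manager is a simplifying assumption based on it being the only global option in the table that is not HTTP-specific.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficProfile:
    """Hypothetical description of the traffic a workload must balance."""
    global_scope: bool  # True for multi-region traffic, False for a single region
    http: bool          # True for HTTP(S), False for raw TCP/UDP


def choose_load_balancer(profile: TrafficProfile) -> str:
    """Map a traffic profile onto the four services in the Networking table."""
    if profile.global_scope:
        # Global HTTP(S) goes to the Layer 7 edge service; everything else
        # global falls back to DNS-based routing (assumption, see lead-in).
        return "Azure Front Door" if profile.http else "Traffic Manager"
    # Regional traffic: Layer 7 for HTTP(S), Layer 4 for TCP/UDP.
    return "Application Gateway" if profile.http else "Azure Load Balancer"


for profile in (
    TrafficProfile(global_scope=True, http=True),
    TrafficProfile(global_scope=False, http=False),
):
    print(profile, "->", choose_load_balancer(profile))
```

A real selection would also weigh WAF, caching, and cost requirements, which this sketch leaves out.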

### references/best-practices.md

```markdown
# Cloud Application Best Practices

Twelve best practices from the Azure Architecture Center for designing, building, and operating cloud applications.

---

## 1. API Design

Design RESTful web APIs that promote platform independence and loose coupling between clients and services.

### Key Recommendations

- Organize APIs around resources using nouns, not verbs, in URIs
- Use standard HTTP methods (GET, POST, PUT, PATCH, DELETE) with correct semantics
- Use plural nouns for collection endpoints (e.g., `/orders`, `/customers`)
- Support HATEOAS to enable client navigation of the API without prior knowledge
- Design coarse-grained operations to avoid chatty request patterns
- Do not expose internal database structure through the API surface
- Version APIs to manage breaking changes without disrupting existing clients
- Return appropriate HTTP status codes and consistent error response bodies

### WAF Pillar Alignment

Performance Efficiency · Operational Excellence

### Common Mistakes

- Using verbs in URIs (e.g., `/getOrders`) instead of resource-based paths
- Exposing database schema directly through API contracts
- Creating chatty APIs that require multiple round-trips for a single logical operation

---

## 2. API Implementation

Implement web APIs to be efficient, responsive, scalable, and available for consuming clients.

### Key Recommendations

- Make actions idempotent so retries are safe (especially PUT and DELETE)
- Support content negotiation via `Accept` and `Content-Type` headers
- Follow the HTTP specification for status codes, methods, and headers
- Handle exceptions gracefully and return meaningful error responses
- Support resource discovery through links and metadata
- Limit and paginate large result sets to minimize network traffic
- Handle large requests asynchronously using `202 Accepted` with status polling
- Compress responses where appropriate to reduce payload size

### WAF Pillar Alignment

Operational Excellence

### Common Mistakes

- Not handling large requests asynchronously, causing timeouts
- Not minimizing network traffic through pagination, filtering, or compression

---

## 3. Autoscaling

Dynamically allocate and deallocate resources to match performance requirements while optimizing cost.

### Key Recommendations

- Use Azure Monitor autoscale and built-in platform autoscaling features
- Scale based on metrics that directly correlate with load (CPU, queue length, request rate)
- Combine schedule-based and metric-based scaling for predictable traffic patterns
- Set appropriate minimum, maximum, and default instance counts
- Configure scale-in rules as carefully as scale-out rules
- Use cooldown periods to prevent oscillation (flapping)
- Plan for the delay between triggering a scale event and resources becoming available

### WAF Pillar Alignment

Performance Efficiency · Cost Optimization

### Common Mistakes

- Not setting appropriate minimum and maximum limits for scaling
- Not considering scale-in behavior, leading to premature resource removal
- Using metrics that do not accurately reflect application load

---

## 4. Background Jobs

Implement batch processing, long-running tasks, and workflows as background jobs decoupled from the user interface.

### Key Recommendations

- Use Azure platform services such as Functions, WebJobs, and Batch for hosting
- Trigger background jobs with events, schedules, or message queues
- Return results to calling tasks through queues, events, or shared storage
- Design jobs to be independently deployable, scalable, and versioned
- Handle partial failures and support safe restarts with checkpointing
- Monitor job health with logging, metrics, and alerting
- Implement graceful shutdown to allow in-progress work to complete

### WAF Pillar Alignment

Operational Excellence

### Common Mistakes

- Not handling failures or partial completion within long-running jobs
- Not monitoring background job health, missing silent failures

---

## 5. Caching

Copy frequently read, rarely modified data to fast storage close to the application to improve performance.

### Key Recommendations

- Cache data that is read often but changes infrequently
- Use Azure Cache for Redis for distributed, high-throughput caching
- Set appropriate expiration policies (TTL) to balance freshness and hit rates
- Handle cache misses gracefully with a cache-aside pattern
- Address concurrency issues when multiple processes update the same cached data
- Implement cache invalidation strategies aligned with data change patterns
- Pre-populate caches for known hot data during application startup

### WAF Pillar Alignment

Performance Efficiency

### Common Mistakes

- Caching highly volatile data that expires before it can be served
- Not handling cache invalidation, serving stale data to users
- Cache stampede — many concurrent requests regenerating the same expired entry

---

## 6. CDN (Content Delivery Network)

Use CDNs to deliver static and dynamic web content efficiently to users from edge locations worldwide.

### Key Recommendations

- Offload static assets (images, scripts, stylesheets) to the CDN to reduce origin load
- Configure appropriate cache-control headers for each content type
- Version static content via file names or query strings for reliable cache busting
- Use HTTPS and enforce TLS for secure delivery
- Plan for CDN fallback so the application degrades gracefully if the CDN is unavailable
- Handle deployment and versioning so users receive updated content promptly

### WAF Pillar Alignment

Performance Efficiency

### Common Mistakes

- Not versioning content, causing users to receive stale cached assets after deployments
- Setting improper cache headers, resulting in under-caching or over-caching

---

## 7. Data Partitioning

Divide data stores into partitions to improve scalability, availability, and performance while reducing contention and storage costs.

### Key Recommendations

- Choose partition keys that distribute data and load evenly across partitions
- Use horizontal (sharding), vertical, or functional partitioning based on access patterns
- Minimize cross-partition queries to avoid performance degradation
- Design partitions to match the most common query patterns
- Plan for rebalancing as data volume and access patterns evolve
- Consider partition limits and throughput caps of the target data store
- Reduce contention and storage costs by separating hot and cold data

### WAF Pillar Alignment

Performance Efficiency · Cost Optimization

### Common Mistakes

- Creating hotspots by selecting a partition key with skewed distribution
- Not considering the cost and latency of cross-partition queries

---

## 8. Data Partitioning Strategies (by Service)

Apply service-specific partitioning strategies across Azure SQL Database, Azure Table Storage, Azure Blob Storage, and other data services.

### Key Recommendations

- Shard Azure SQL Database to distribute data for horizontal scaling
- Design Azure Table Storage partition keys around query access patterns
- Organize Azure Blob Storage using virtual directories and naming conventions
- Align partition boundaries with the most frequent query predicates
- Reduce latency by co-locating related data within the same partition
- Monitor partition metrics and rebalance when skew is detected

### WAF Pillar Alignment

Performance Efficiency · Cost Optimization

### Common Mistakes

- Not aligning the partition strategy with actual query patterns, causing full scans
- Ignoring service-specific partition limits and throttling thresholds

---

## 9. Host Name Preservation

Preserve the original HTTP host name between reverse proxies and backend web applications to avoid issues with cookies, redirects, and CORS.

### Key Recommendations

- Forward the original `Host` header from the reverse proxy to the backend
- Configure Azure Front Door, Application Gateway, and API Management for host preservation
- Ensure cookies are set with the correct domain matching the original host name
- Verify redirect URLs reference the external host name, not the internal backend address
- Test CORS configurations end-to-end with the preserved host name
- Document host name flow across all network hops in the architecture

### WAF Pillar Alignment

Reliability

### Common Mistakes

- Not preserving host headers, causing redirect loops or incorrect absolute URLs
- Breaking session cookies because the cookie domain does not match the forwarded host

---

## 10. Message Encoding

Choose the right payload structure, encoding format, and serialization library for asynchronous messages exchanged between distributed components.

### Key Recommendations

- Evaluate JSON, Avro, Protobuf, and MessagePack based on performance and interoperability needs
- Use schema registries to enforce and version message contracts
- Validate incoming messages against their schemas before processing
- Prefer compact binary formats (Protobuf, Avro) for high-throughput, latency-sensitive paths
- Use JSON for human-readable messages and broad ecosystem compatibility
- Consider backward and forward compatibility when evolving message schemas

### WAF Pillar Alignment

Security

### Common Mistakes

- Using an inefficient encoding format for high-volume message streams
- Not validating message schemas, allowing malformed data into the processing pipeline

---

## 11. Monitoring and Diagnostics

Track system health, usage, and performance through a comprehensive monitoring pipeline that turns raw data into alerts, reports, and automated triggers.

### Key Recommendations

- Instrument applications with structured logging, metrics, and distributed tracing
- Use Azure Monitor, Application Insights, and Log Analytics as the monitoring backbone
- Define actionable alerts with clear thresholds, severity levels, and response procedures
- Detect and correct issues before they affect users by monitoring leading indicators
- Correlate telemetry across services using distributed trace context (e.g., correlation IDs)
- Establish performance baselines and track deviations over time
- Build dashboards for operational visibility across all tiers of the architecture
- Review and tune alert rules regularly to reduce noise

### WAF Pillar Alignment

Operational Excellence

### Common Mistakes

- Insufficient logging, making incident root-cause analysis slow or impossible
- Not using distributed tracing in microservice or multi-tier architectures
- Alert fatigue from poorly tuned thresholds that generate excessive false positives

---

## 12. Transient Fault Handling

Detect and handle transient faults caused by momentary loss of network connectivity, temporary service unavailability, or resource throttling.

### Key Recommendations

- Implement retry logic with exponential backoff and jitter for transient failures
- Use a circuit breaker pattern to stop retrying when failures are persistent
- Distinguish transient faults (e.g., HTTP 429, 503) from permanent errors (e.g., 400, 404)
- Leverage built-in retry policies in Azure SDKs before adding custom retry logic
- Avoid duplicating retry layers across middleware and application code
- Log every retry attempt for post-incident analysis
- Set a maximum retry count and total timeout to bound retry duration

### WAF Pillar Alignment

Reliability

### Common Mistakes

- Retrying non-transient faults (e.g., authentication failures, bad requests)
- Not using exponential backoff, overwhelming a recovering service with constant retries
- Retry storms caused by multiple layers retrying simultaneously without coordination

---

> Source: [Azure Architecture Center](https://learn.microsoft.com/en-us/azure/architecture/)

```
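
The transient fault handling recommendations in the reference above (retry with exponential backoff and jitter, retry only transient faults, bound the retry count, log each attempt) can be sketched as follows. This is an illustrative sketch under stated assumptions, not an Azure SDK API: `HttpError`, `backoff_delay`, and `call_with_retry` are hypothetical names, the transient set is limited to the two status codes the text mentions, and the delay uses the "full jitter" variant. In practice, the built-in retry policies of the Azure SDKs should be preferred where available.

```python
import random
import time

# Status codes the reference lists as transient; everything else is treated as permanent.
TRANSIENT_STATUS = {429, 503}


class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0, rng=random.random) -> float:
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retry(operation, max_retries: int = 4, sleep=time.sleep):
    """Run operation(), retrying transient HttpErrors up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except HttpError as err:
            # Permanent errors and exhausted budgets are re-raised immediately.
            if err.status not in TRANSIENT_STATUS or attempt == max_retries:
                raise
            delay = backoff_delay(attempt)
            # Log every retry attempt for post-incident analysis.
            print(f"retry {attempt + 1}/{max_retries} in {delay:.2f}s after HTTP {err.status}")
            sleep(delay)


# Usage: an operation that is throttled twice, then succeeds.
_attempts = {"count": 0}


def _throttled_then_ok():
    _attempts["count"] += 1
    if _attempts["count"] < 3:
        raise HttpError(429)
    return "ok"


print(call_with_retry(_throttled_then_ok, sleep=lambda seconds: None))
```

The `sleep` parameter is injectable so the logic can be exercised without waiting. A production version would also enforce a total timeout and sit behind a circuit breaker, as the recommendations note.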

### references/mission-critical.md

```markdown
# Mission-Critical Architecture on Azure

Guidance for designing mission-critical workloads on Azure that prioritize cloud-native capabilities to maximize reliability and operational effectiveness.

**Target SLO:** **99.99%** or higher — permitted annual downtime: **52 minutes 35 seconds**.

Every design decision in this guidance is intended to meet this target SLO.

| SLO Target | Permitted Annual Downtime | Typical Use Case |
|---|---|---|
| 99.9% | 8 hours 45 minutes | Standard business apps |
| 99.95% | 4 hours 22 minutes | Important business apps |
| 99.99% | 52 minutes 35 seconds | Mission-critical workloads |
| 99.999% | 5 minutes 15 seconds | Safety-critical systems |

---

## Key Design Strategies

### 1. Redundancy in Layers

Deploy redundancy at every layer of the architecture to eliminate single points of failure.

- Deploy to multiple regions in an **active-active** model — application distributed across 2+ Azure regions handling active user traffic simultaneously
- Utilize **availability zones** for all considered services — distributing components across physically separate datacenters inside a region
- Choose resources that support **global distribution** natively
- Apply zone-redundant configurations for all stateful services
- Ensure data replication meets RPO requirements across regions

**Azure services:** Azure Front Door (global routing), Azure Traffic Manager (DNS failover), Azure Cosmos DB (multi-region writes), Azure SQL (geo-replication)

### 2. Deployment Stamps

Deploy regional stamps as scale units — a logical set of resources that can be independently provisioned to keep up with demand changes.

- Each stamp is a **self-contained scale unit** with its own compute, caching, and local state
- A stamp can contain multiple nested scale units (e.g., Frontend APIs and Background processors scale independently)
- **No dependencies between scale units** — they only communicate with shared services outside the stamp
- Scale units are **temporary/ephemeral** — store persistent system-of-record data only in the replicated database
- Use stamps for blue-green deployments by rolling out new units, validating, and gradually shifting traffic

**Key benefit:** Compartmentalization enables independent scaling and fault isolation per region.

### 3. Reliable and Repeatable Deployments

Apply the principle of Infrastructure as Code (IaC) for version control and standardized operations.

- Use **Terraform** or **Bicep** for infrastructure definition with version control
- Implement **zero-downtime blue/green deployment** pipelines — build and release pipelines fully automated
- Apply **environment consistency** — use the same deployment pipeline code across production and pre-production environments
- Integrate **continuous validation** — automated testing as part of DevOps processes
- Include synchronized **load and chaos testing** to validate both application code and underlying infrastructure
- Deploy stamps as a **single operational unit** — never partially deploy a stamp

### 4. Operational Insights

Build comprehensive observability without introducing single points of failure.

- Use **federated workspaces** for observability data — monitoring data for global and regional resources stored independently
- A centralized observability store is **NOT recommended** (it becomes a single point of failure)
- Use **cross-workspace querying** to achieve a unified data sink and single pane of glass for operations
- Construct a **layered health model** that maps application health to a traffic light model, so raw telemetry is read in context
- Calculate health scores for each **individual component**, then **aggregate them at the user flow level**
- Combine the scores with key non-functional requirements (such as performance) as coefficients to quantify application health

---

## Design Areas

Each design area must be addressed for a mission-critical architecture.

| Design Area | Description | Key Concerns |
|---|---|---|
| **Application platform** | Infrastructure choices and mitigations for potential failure cases | AKS vs App Service, availability zones, containerization |
| **Application design** | Design patterns that allow for scaling and error handling | Stateless services, async messaging, queue-based decoupling |
| **Networking and connectivity** | Network considerations for routing incoming traffic to stamps | Global load balancing, WAF, DDoS protection, private endpoints |
| **Data platform** | Choices in data store technologies | Volume, velocity, variety, veracity; active-active vs active-passive |
| **Deployment and testing** | Strategies for CI/CD pipelines and automation | Blue/green deployments, load testing, chaos testing |
| **Health modeling** | Observability through customer impact analysis | Correlated monitoring, traffic light model, health scores |
| **Security** | Mitigation of attack vectors | Microsoft Zero Trust model, identity-based access, encryption |
| **Operational procedures** | Processes related to runtime operations | Deployment SOPs, key management, patching, incident response |

---

## Active-Active Multi-Region Architecture

The core topology for mission-critical workloads distributes the application across multiple Azure regions.

### Architecture Characteristics

- Application distributed across **2+ Azure regions** handling active user traffic simultaneously
- Each region contains independent **deployment stamps** (scale units)
- **Azure Front Door** provides global routing, SSL termination, and WAF at the edge
- Scale units have **no cross-dependencies** — they communicate only with shared services (e.g., global database, DNS)
- Persistent data resides only in the **replicated database** — stamps store no durable local state
- When scale units are replaced or retired, applications reconnect transparently

### Data Replication Strategies

| Strategy | Writes | Reads | Consistency | Best For |
|---|---|---|---|---|
| Active-passive (Azure SQL) | Single primary region | All regions via read replicas | Strong | Relational data, ACID transactions |
| Active-active (Cosmos DB) | All regions | All regions | Tunable (5 levels) | Document/key-value data, global apps |
| Write-behind (Redis → SQL) | Redis first, async to SQL | Redis or SQL | Eventual | High-throughput writes, rate limiting |

### Regional Stamp Composition

Each stamp typically includes:

- **Compute tier** — App Service or AKS with multiple instances across availability zones
- **Caching tier** — Azure Managed Redis for session state, rate limiting, feature flags
- **Configuration** — Azure App Configuration for settings (capacity correlates with requests/second)
- **Secrets** — Azure Key Vault for certificates and secrets
- **Networking** — Virtual network with private endpoints, NSGs, and service endpoints

---

## Health Modeling and Traffic Light Approach

Health modeling provides the foundation for automated operational decisions.

### Building the Health Model

1. **Identify user flows** — map critical paths through the application (e.g., "user login", "checkout", "search")
2. **Decompose into components** — each flow depends on specific compute, data, and network components
3. **Assign health scores** — each component reports a health score based on metrics (latency, error rate, saturation)
4. **Aggregate per flow** — combine component scores weighted by criticality to produce a flow-level health score
5. **Apply traffic light** — map aggregate scores to **Green** (healthy), **Yellow** (degraded), **Red** (unhealthy)

### Health Score Coefficients

| Factor | Metric Examples | Weight Guidance |
|---|---|---|
| Availability | Error rate, HTTP 5xx ratio | High — directly impacts users |
| Performance | P95 latency, request duration | Medium — affects user experience |
| Saturation | CPU %, memory %, queue depth | Medium — indicates future problems |
| Freshness | Data replication lag, cache age | Lower — depends on consistency needs |

### Operational Actions by Health State

| State | Meaning | Automated Action |
|---|---|---|
| 🟢 Green | All components healthy | Normal operations |
| 🟡 Yellow | Degraded but functional | Alert on-call, increase monitoring frequency |
| 🔴 Red | Critical failure detected | Trigger failover, page on-call, block deployments |

---

## Zero-Downtime Deployment (Blue/Green)

Deployment must never cause downtime in a mission-critical system.

### Blue/Green Process

1. **Provision new stamp** — deploy a complete new scale unit ("green") alongside the existing one ("blue")
2. **Run validation** — execute automated smoke tests, integration tests, and synthetic transactions against the green stamp
3. **Canary traffic** — route a small percentage of production traffic (e.g., 5%) to the green stamp
4. **Monitor health** — compare health scores between blue and green stamps over a defined observation period
5. **Gradual shift** — increase traffic to green stamp in increments (5% → 25% → 50% → 100%)
6. **Decommission blue** — once green is fully validated, tear down the blue stamp

### Key Requirements

- Build and release pipelines must be **fully automated** — no manual deployment steps
- Use the **same pipeline code** for all environments (dev, staging, production)
- Each stamp deployed as a **single operational unit** — never partial
- Rollback is achieved by **shifting traffic back** to the previous stamp (still running during validation)
- **Continuous validation** runs throughout the deployment, not just at the end

---

## Chaos Engineering and Continuous Validation

Proactive failure testing ensures recovery mechanisms work before real incidents occur.

### Chaos Engineering Practices

- Use **Azure Chaos Studio** to run controlled experiments against production or pre-production environments
- Test failure modes: availability zone outage, network partition, dependency failure, CPU/memory pressure
- Run chaos experiments as part of the **CI/CD pipeline** — every deployment is validated under fault conditions
- **Synchronized load and chaos testing** — inject faults while the system is under realistic load

### Validation Checklist

- [ ] Health model detects injected faults within SLO-defined time windows
- [ ] Automated failover completes within target RTO
- [ ] No data loss exceeding target RPO during regional failover
- [ ] Application degrades gracefully (reduced functionality, not total failure)
- [ ] Alerts fire correctly and reach the on-call team
- [ ] Runbooks and automated remediation execute successfully

---

## Application Platform Considerations

### Platform Options

| Platform | Best For | Availability Zone Support | Complexity |
|---|---|---|---|
| **Azure App Service** | Web apps, APIs, PaaS-first approach | Yes (zone-redundant) | Low-Medium |
| **AKS** | Complex microservices, full K8s control | Yes (zone-redundant node pools) | High |
| **Container Apps** | Serverless containers, event-driven | Yes | Medium |

### Recommendations

- **Prioritize availability zones** for all production workloads — spread across physically separate datacenters
- **Containerize workloads** for reliability and portability between platforms
- Ensure all services in a scale unit support availability zones — don't mix zonal and non-zonal services
- For latency-sensitive or chatty workloads, weigh the added cost and latency of cross-zone traffic

---

## Data Platform Considerations

### Choosing a Primary Database

| Scenario | Recommended Service | Deployment Model |
|---|---|---|
| Relational data, ACID transactions | **Azure SQL** | Active-passive with geo-replication |
| Global distribution, multi-model | **Azure Cosmos DB** | Active-active with multi-region writes |
| Multiple microservice databases | **Mixed (polyglot)** | Per-service database with appropriate model |

### Azure SQL in Mission-Critical

- Azure SQL does **not** natively support active-active concurrent writes in multiple regions
- Use **active-passive** strategy: single primary region for writes, read replicas in secondary regions
- **Partial active-active** possible at the application tier — route reads to local replicas, writes to primary
- Configure **auto-failover groups** for automated regional failover

### Azure Managed Redis in Mission-Critical

- Use Azure Managed Redis within or alongside each scale unit for:
  - **Cache data** — rebuildable, repopulated on demand
  - **Session state** — user sessions during scale unit lifetime
  - **Rate limit counters** — per-user and per-tenant throttling
  - **Feature flags** — dynamic configuration without redeployment
  - **Coordination metadata** — distributed locks, leader election
- **Active geo-replication** enables Redis data to replicate asynchronously across regions
- Design cached data as either **rebuildable** (repopulate without availability impact) or **durable auxiliary state** (protected by persistence and geo-replication)

---

## Security in Mission-Critical

### Zero Trust Principles

- **Verify explicitly** — authenticate and authorize based on all available data points (identity, location, device, service)
- **Use least privilege access** — limit user access with Just-In-Time and Just-Enough-Access (JIT/JEA)
- **Assume breach** — minimize blast radius and segment access, verify end-to-end encryption, use analytics for threat detection

### Security Controls

| Layer | Control | Azure Service |
|---|---|---|
| Edge | DDoS protection, WAF | Azure Front Door, Azure DDoS Protection |
| Identity | Managed identities, RBAC | Microsoft Entra ID, Azure RBAC |
| Network | Private endpoints, NSGs | Azure Private Link, Virtual Network |
| Data | Encryption at rest and in transit | Azure Key Vault, TDE, TLS 1.2+ |
| Operations | Privileged access management | Microsoft Entra PIM, Azure Bastion |

---

## Operational Procedures

### Key Operational Processes

| Process | Description | Automation Level |
|---|---|---|
| **Deployment** | Blue/green with automated validation | Fully automated |
| **Scaling** | Stamp provisioning and decommissioning | Automated with manual approval gates |
| **Key rotation** | Certificate and secret rotation | Automated via Key Vault policies |
| **Patching** | OS and runtime updates | Automated via platform (PaaS) or pipeline (IaaS) |
| **Incident response** | Detection, triage, mitigation, resolution | Semi-automated (alert → runbook → human) |
| **Capacity planning** | Forecast demand, pre-provision stamps | Manual with data-driven analysis |

### Runbook Requirements

- All operational runbooks must be **tested in pre-production** with the same chaos/load scenarios as production
- **Automated remediation** is preferred over manual intervention for known failure modes
- Runbooks must include **rollback procedures** for every change type
- **Post-incident reviews** (blameless) must feed back into health model and chaos experiment improvements

---

> Source: [Azure Architecture Center](https://learn.microsoft.com/en-us/azure/architecture/)

```
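
The health modeling steps in the reference above (score each component, aggregate per user flow weighted by criticality, then map to a traffic light) can be sketched as a weighted average with two thresholds. This is a minimal sketch under assumptions: `ComponentHealth`, `flow_health`, and `traffic_light` are hypothetical names, scores are normalized to the range 0.0 to 1.0, and the weights and the 0.9 / 0.7 thresholds are illustrative values that the source does not prescribe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentHealth:
    """Health of one component in a user flow (all values are illustrative)."""
    name: str
    score: float   # 0.0 (unhealthy) to 1.0 (healthy)
    weight: float  # relative criticality within the flow


def flow_health(components: list[ComponentHealth]) -> float:
    """Aggregate component scores into a flow-level score, weighted by criticality."""
    total_weight = sum(c.weight for c in components)
    if total_weight <= 0:
        raise ValueError("component weights must sum to a positive value")
    return sum(c.score * c.weight for c in components) / total_weight


def traffic_light(score: float, green_at: float = 0.9, yellow_at: float = 0.7) -> str:
    """Map a flow-level score to green, yellow, or red using assumed thresholds."""
    if score >= green_at:
        return "green"
    if score >= yellow_at:
        return "yellow"
    return "red"


# Hypothetical "checkout" flow with a degraded database.
checkout = [
    ComponentHealth("frontend-api", score=1.0, weight=3.0),
    ComponentHealth("database", score=0.5, weight=3.0),
    ComponentHealth("cache", score=1.0, weight=1.0),
]
score = flow_health(checkout)
print(f"checkout: {score:.2f} -> {traffic_light(score)}")  # checkout: 0.79 -> yellow
```

In this example the degraded database pulls the flow to yellow rather than red because the other components remain healthy. A stricter model might take the minimum score of the critical components instead of an average.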