
building-with-envoy-gateway

This skill provides detailed guidance for implementing Envoy Gateway in Kubernetes, covering Gateway API, traffic policies, rate limiting, TLS, and AI-specific routing with Envoy AI Gateway. It includes concrete YAML examples, decision tables, and architecture diagrams for production deployments.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars: 159
Hot score: 96
Updated: March 20, 2026
Overall rating: A (8.3)
Composite score: 6.5
Best-practice grade: B (71.9)

Install command

npx @skill-hub/cli install panaversity-agentfactory-building-with-envoy-gateway
Tags: kubernetes, api-gateway, traffic-management, llm-ops, cloud-native

Repository

panaversity/agentfactory

Skill path: .claude/skills/building-with-envoy-gateway

Open repository

Best for

Primary workflow: Research & Ops.

Technical facets: DevOps, Backend, Data / AI.

Target audience: Platform engineers and SREs responsible for Kubernetes ingress, API gateway management, and AI agent infrastructure who need to implement production-grade traffic control.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: panaversity.

This is a mirrored public skill entry. Review the repository before installing it into production workflows.

What it helps with

  • Install building-with-envoy-gateway into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/panaversity/agentfactory before adding building-with-envoy-gateway to shared team environments
  • Use building-with-envoy-gateway for devops workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: building-with-envoy-gateway
description: Build production traffic engineering for Kubernetes with Envoy Gateway, Gateway API, KEDA autoscaling, and Envoy AI Gateway. Use when implementing ingress, rate limiting, traffic routing, TLS, autoscaling, or LLM traffic management.
allowed-tools: Read, Grep, Glob, Edit, Write, Bash, WebSearch, WebFetch
model: claude-sonnet-4-20250514
---

# Traffic Engineering with Envoy Gateway

## Persona

You are a Platform Engineer specializing in Kubernetes traffic management and API gateway patterns. You've deployed Envoy Gateway in production for high-traffic AI agent platforms. You understand Gateway API as the new Kubernetes standard, Envoy Gateway's extension CRDs, KEDA event-driven autoscaling, and Envoy AI Gateway for LLM traffic. You follow CNCF best practices and can implement the full traffic stack: ingress routing, rate limiting, circuit breaking, TLS/mTLS, autoscaling, and AI-specific traffic management.

## When to Use This Skill

Activate when the user mentions:
- Envoy Gateway, Gateway API, GatewayClass
- HTTPRoute, GRPCRoute, TCPRoute, TLSRoute
- BackendTrafficPolicy, ClientTrafficPolicy, SecurityPolicy
- Rate limiting, circuit breaking, retries, load balancing
- TLS termination, mTLS, CertManager
- KEDA, ScaledObject, event-driven autoscaling
- Envoy AI Gateway, token-based rate limiting, provider fallback
- Ingress replacement, Traefik, Kong migration
- Canary deployments, blue-green, traffic splitting
- HPA, VPA, autoscaling for AI agents

## Core Concepts

### Gateway API: The New Kubernetes Standard

| Resource | Purpose | Scope |
|----------|---------|-------|
| **GatewayClass** | Defines gateway implementation (like StorageClass for networking) | Cluster |
| **Gateway** | Traffic entry point with listeners (ports, protocols, hostnames) | Namespace |
| **HTTPRoute** | L7 routing rules (path, headers, query params, methods) | Namespace |
| **GRPCRoute** | gRPC-specific routing with Protocol Buffers | Namespace |
| **ReferenceGrant** | Cross-namespace resource access control | Namespace |

### Envoy Gateway Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                    Control Plane                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │   Gateway    │  │     xDS      │  │    Infra     │       │
│  │  Translator  │──│   Server     │──│   Manager    │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
│         │                 │                                  │
│         ▼                 │                                  │
│    Gateway API            │                                  │
│    + Extensions           │                                  │
└───────────────────────────│──────────────────────────────────┘
                            │ xDS Protocol
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                     Data Plane                               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │ Envoy Proxy  │  │ Envoy Proxy  │  │ Envoy Proxy  │       │
│  │  (replica)   │  │  (replica)   │  │  (replica)   │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
└─────────────────────────────────────────────────────────────┘
```

### Envoy Gateway Extension CRDs

| CRD | Purpose | Target | Key Features |
|-----|---------|--------|--------------|
| **BackendTrafficPolicy** | Gateway-to-backend traffic | HTTPRoute, Gateway | Rate limiting, retries, circuit breaker, load balancing |
| **ClientTrafficPolicy** | Client-to-gateway connections | Gateway | TLS, timeouts, keepalive, connection limits |
| **SecurityPolicy** | Authentication & authorization | HTTPRoute, Gateway | JWT, OIDC, Basic Auth, IP allowlist, CORS |
| **EnvoyProxy** | Proxy deployment config | GatewayClass | Replicas, resources, telemetry |
| **Backend** | Advanced endpoint config | - | FQDN, mTLS client certs |

## Decision Logic

### Which Policy for Which Scenario?

| Scenario | Policy | Configuration |
|----------|--------|---------------|
| Rate limit all traffic globally | BackendTrafficPolicy | `rateLimit.global` with Redis backend |
| Rate limit per-instance (cost-effective) | BackendTrafficPolicy | `rateLimit.local` |
| Retry transient failures | BackendTrafficPolicy | `retry.attempts`, `retry.retryOn` |
| Circuit breaker for unreliable backends | BackendTrafficPolicy | `healthChecks` + outlier detection |
| TLS termination at gateway | ClientTrafficPolicy | `tls.certificateRefs` |
| Client connection timeouts | ClientTrafficPolicy | `timeout.http` |
| JWT token validation | SecurityPolicy | `jwt.providers` with JWKS |
| SSO with identity provider | SecurityPolicy | `oidc.provider` |
| IP-based access control | SecurityPolicy | `authorization.rules` with `ipAddress` |

### Authentication Method Selection

```
Is enterprise SSO needed?
├── Yes → Use OIDC (delegate to identity provider)
└── No → Is stateless API auth acceptable?
    ├── Yes → Use JWT (validate JWKS locally)
    └── No → Is it simple internal API?
        ├── Yes → Use Basic Auth or API Key
        └── No → Use External Authorization service
```

### Rate Limiting Strategy

```
Need cross-instance coordination?
├── Yes → Global Rate Limit (requires Redis)
│         Use for: org-wide limits, preventing resource exhaustion
└── No → Local Rate Limit (per-proxy bucket)
         Use for: per-region limits, cost-effective protection
```

## Workflow: Full Traffic Stack Setup

### 1. Install Envoy Gateway via Helm

```bash
# Install Envoy Gateway via its OCI Helm chart
helm install eg oci://docker.io/envoyproxy/gateway-helm \
  --version v1.6.1 \
  -n envoy-gateway-system \
  --create-namespace

# Verify installation
kubectl wait --for=condition=Available deployment/envoy-gateway \
  -n envoy-gateway-system --timeout=120s

# Install Gateway API CRDs if not present
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.1/standard-install.yaml
```

### 2. Create GatewayClass and Gateway

```yaml
# gateway-class.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
# gateway.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: task-api-gateway
  namespace: default
spec:
  gatewayClassName: envoy-gateway
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: Same
  - name: https
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - kind: Secret
        name: tls-cert
    allowedRoutes:
      namespaces:
        from: Same
```

### 3. Create HTTPRoute for Application

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: task-api-route
  namespace: default
spec:
  parentRefs:
  - name: task-api-gateway
  hostnames:
  - "api.example.com"
  rules:
  # API endpoints with versioning
  - matches:
    - path:
        type: PathPrefix
        value: /api/v1/tasks
    backendRefs:
    - name: task-api
      port: 8000

  # Health check endpoint
  - matches:
    - path:
        type: Exact
        value: /health
    backendRefs:
    - name: task-api
      port: 8000

  # Traffic splitting for canary
  - matches:
    - path:
        type: PathPrefix
        value: /api/v2/tasks
    backendRefs:
    - name: task-api-v2
      port: 8000
      weight: 10
    - name: task-api-v1
      port: 8000
      weight: 90
```

### 4. Apply Rate Limiting (BackendTrafficPolicy)

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: task-api-rate-limit
  namespace: default
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: task-api-route

  rateLimit:
    type: Global
    global:
      rules:
      # Per-user rate limit (distinct header)
      - clientSelectors:
        - headers:
          - type: Distinct
            name: x-user-id
        limit:
          requests: 100
          unit: Minute

      # Anonymous users (no x-user-id header)
      - clientSelectors:
        - headers:
          - name: x-user-id
            invert: true
        limit:
          requests: 10
          unit: Minute

  # Retry policy
  retry:
    numRetries: 3
    perRetryTimeout: 5s
    retryOn:
    - "5xx"
    - "reset"
    - "connect-failure"
    backoff:
      baseInterval: 100ms
      maxInterval: 10s
```
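
`type: Global` above only takes effect once Envoy Gateway's rate limit service is wired to a Redis backend in the EnvoyGateway bootstrap config (typically held in the `envoy-gateway-config` ConfigMap). A minimal sketch — the Redis service address is an assumption for a cluster-local deployment:

```yaml
# envoy-gateway.yaml — EnvoyGateway config enabling the Redis-backed
# global rate limit service. The Redis URL below is illustrative.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyGateway
provider:
  type: Kubernetes
gateway:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
rateLimit:
  backend:
    type: Redis
    redis:
      url: redis.redis-system.svc.cluster.local:6379
```

Without this backend configured, policies with `type: Global` are accepted but enforce nothing; `type: Local` needs no external store.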

### 5. Configure Circuit Breaking

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: task-api-resilience
  namespace: default
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: task-api-route

  healthCheck:
    active:
      type: HTTP
      http:
        path: /health
        expectedStatuses:
        - 200
      interval: 10s
      timeout: 1s
      unhealthyThreshold: 3
      healthyThreshold: 1

  circuitBreaker:
    maxConnections: 100
    maxPendingRequests: 50
    maxRequests: 1000
```

### 6. Configure TLS with CertManager

```yaml
# Install CertManager first
# kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.16.0/cert-manager.yaml

# cluster-issuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: [email protected]
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          ingressClassName: envoy
---
# certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-tls
  namespace: default
spec:
  secretName: tls-cert
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
  - api.example.com
```

### 7. JWT Authentication (SecurityPolicy)

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: SecurityPolicy
metadata:
  name: jwt-auth
  namespace: default
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: task-api-route

  jwt:
    providers:
    - name: auth0
      issuer: https://your-tenant.auth0.com/
      audiences:
      - https://api.example.com
      remoteJWKS:
        uri: https://your-tenant.auth0.com/.well-known/jwks.json
      claimToHeaders:
      - claim: sub
        header: x-user-id
      - claim: permissions
        header: x-user-permissions
```

### 8. Install KEDA for Autoscaling

```bash
# Install KEDA
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace
```

### 9. Configure KEDA ScaledObject

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: task-api-scaler
  namespace: default
spec:
  scaleTargetRef:
    name: task-api
    kind: Deployment
  minReplicaCount: 1
  maxReplicaCount: 20

  triggers:
  # Scale based on Prometheus metrics (request rate)
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      metricName: http_requests_per_second
      query: sum(rate(envoy_http_downstream_rq_total{envoy_cluster_name="task-api"}[1m]))
      threshold: "100"

  # Scale based on Kafka consumer lag
  - type: kafka
    metadata:
      bootstrapServers: kafka.default:9092
      consumerGroup: task-processors
      topic: task-events
      lagThreshold: "50"
```

## Key Patterns

### Traffic Splitting for Canary Deployments

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: canary-route
spec:
  parentRefs:
  - name: api-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:
    # Stable version: 90%
    - name: api-stable
      port: 8000
      weight: 90
    # Canary version: 10%
    - name: api-canary
      port: 8000
      weight: 10
```

### Header-Based A/B Testing

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: ab-test-route
spec:
  parentRefs:
  - name: api-gateway
  rules:
  # Beta users (header match)
  - matches:
    - headers:
      - name: x-beta-user
        value: "true"
    backendRefs:
    - name: api-v2
      port: 8000

  # All other users
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: api-v1
      port: 8000
```

### Envoy AI Gateway for LLM Traffic

```yaml
# For AI agent traffic management. Field names below are illustrative —
# check the Envoy AI Gateway docs for the current AIGatewayRoute schema.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: llm-router
spec:
  backends:
  # Primary: OpenAI
  - name: openai
    priority: 0
    provider: openai
    model: gpt-4
    auth:
      type: APIKey
      apiKeyRef:
        name: openai-key

  # Fallback: Anthropic
  - name: anthropic
    priority: 1
    provider: anthropic
    model: claude-3-opus
    modelNameOverride: gpt-4
    auth:
      type: APIKey
      apiKeyRef:
        name: anthropic-key

  # Token-based rate limiting
  rateLimit:
    tokenBudget:
      perUser: 100000
      perMinute: 10000
```

## Safety & Guardrails

### NEVER
- Expose management endpoints (health checks, metrics) without authentication
- Use LocalRateLimit when cross-instance coordination is required
- Skip TLS for production traffic
- Set rate limits too high initially (start conservative, increase based on monitoring)
- Use weight 0 for all backends in traffic splitting (will fail)
- Deploy without health checks on backends

### ALWAYS
- Start with strict rate limits and loosen based on actual usage
- Use ReferenceGrant for cross-namespace access
- Configure health checks before enabling circuit breakers
- Test canary deployments with small traffic percentages first
- Monitor 429 (rate limit) and 503 (circuit breaker) responses
- Use mTLS for backend traffic in production
- Set appropriate timeouts (start with 30s, tune based on P99)
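
The "monitor 429 and 503 responses" guideline can be wired into alerting. A sketch assuming the Prometheus Operator and Envoy's standard Prometheus metric names — verify the exact metric and label names against your scrape output, and tune thresholds to your traffic:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gateway-traffic-alerts
  namespace: monitoring
spec:
  groups:
  - name: envoy-gateway
    rules:
    # Sustained 4xx spikes often mean rate limits are tripping (429s)
    - alert: HighClientErrorRate
      expr: sum(rate(envoy_http_downstream_rq_xx{envoy_response_code_class="4"}[5m])) > 10
      for: 10m
    # 5xx spikes can indicate open circuit breakers or unhealthy backends
    - alert: HighServerErrorRate
      expr: sum(rate(envoy_http_downstream_rq_xx{envoy_response_code_class="5"}[5m])) > 1
      for: 5m
```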

### Cost Engineering
- KEDA scale-to-zero can save 40-70% on idle workloads
- Token-based rate limiting prevents LLM cost overruns
- Local rate limiting avoids Redis costs when global isn't needed
- Schedule non-production gateways to scale down outside business hours
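
The last point maps directly onto KEDA's cron scaler. A sketch for a non-production deployment — the names, timezone, and schedule are placeholders:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: staging-api-hours
  namespace: staging
spec:
  scaleTargetRef:
    name: staging-api        # placeholder Deployment name
  minReplicaCount: 0         # scale to zero outside the cron window
  maxReplicaCount: 5
  triggers:
  - type: cron
    metadata:
      timezone: America/New_York
      start: 0 8 * * 1-5     # 08:00 Mon-Fri
      end: 0 18 * * 1-5      # 18:00 Mon-Fri
      desiredReplicas: "2"
```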

## TaskManager Example

Complete traffic engineering setup for Task API:

### Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-api
  namespace: default
  labels:
    app: task-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: task-api
  template:
    metadata:
      labels:
        app: task-api
    spec:
      containers:
      - name: task-api
        image: task-api:latest
        ports:
        - containerPort: 8000
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 15
          periodSeconds: 20
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
---
apiVersion: v1
kind: Service
metadata:
  name: task-api
  namespace: default
spec:
  selector:
    app: task-api
  ports:
  - port: 8000
    targetPort: 8000
```

### Full Gateway Configuration

```yaml
# Gateway with TLS
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: task-gateway
spec:
  gatewayClassName: envoy-gateway
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - kind: Secret
        name: task-api-tls
---
# HTTPRoute with versioned paths
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: task-route
spec:
  parentRefs:
  - name: task-gateway
  hostnames:
  - tasks.example.com
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api/v1
    backendRefs:
    - name: task-api
      port: 8000
---
# Rate limiting + retries
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: task-traffic
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: task-route
  rateLimit:
    type: Global
    global:
      rules:
      - limit:
          requests: 100
          unit: Second
  retry:
    numRetries: 3
    retryOn: ["5xx", "reset"]
---
# JWT authentication
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: SecurityPolicy
metadata:
  name: task-auth
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: task-route
  jwt:
    providers:
    - name: task-auth
      issuer: https://auth.example.com
      remoteJWKS:
        uri: https://auth.example.com/.well-known/jwks.json
---
# KEDA autoscaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: task-scaler
spec:
  scaleTargetRef:
    name: task-api
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      query: sum(rate(http_requests_total{app="task-api"}[1m]))
      threshold: "50"
```

## References

For detailed patterns, see:
- `references/gateway-api-patterns.md` - HTTPRoute matching examples
- `references/envoy-gateway-crds.md` - Full CRD reference
- `references/keda-scalers.md` - KEDA scaler configurations
- `references/ai-gateway.md` - Envoy AI Gateway patterns


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### references/gateway-api-patterns.md

```markdown
# Gateway API Patterns

## HTTPRoute Matching Reference

### Path Matching Types

| Type | Syntax | Matches | Example |
|------|--------|---------|---------|
| **PathPrefix** | `/api` | `/api`, `/api/`, `/api/users` | Most common for API routes |
| **Exact** | `/health` | Only `/health` exactly | Health checks, specific endpoints |
| **RegularExpression** | `^/api/v[0-9]+/.*` | `/api/v1/users`, `/api/v2/tasks` | Version patterns |

### Header Matching

```yaml
# Exact header match
matches:
- headers:
  - name: x-api-key
    value: secret123

# Header exists (any value)
matches:
- headers:
  - name: Authorization
    type: RegularExpression
    value: ".*"

# Header prefix match
matches:
- headers:
  - name: Content-Type
    type: RegularExpression
    value: "application/json.*"
```

### Query Parameter Matching

```yaml
# Exact match
matches:
- queryParams:
  - name: version
    value: "2"

# Regex match
matches:
- queryParams:
  - name: page
    type: RegularExpression
    value: "[0-9]+"
```

### Method Matching

```yaml
# Specific HTTP method
matches:
- method: POST
  path:
    type: PathPrefix
    value: /api/tasks

# Multiple methods
rules:
- matches:
  - method: GET
    path: {type: PathPrefix, value: /api/tasks}
  - method: POST
    path: {type: PathPrefix, value: /api/tasks}
  backendRefs:
  - name: task-api
    port: 8000
```

### Combined Matching (AND Logic)

```yaml
# All conditions must match
matches:
- path:
    type: PathPrefix
    value: /admin
  headers:
  - name: x-role
    value: admin
  method: DELETE
```

### Multiple Rules (OR Logic)

```yaml
rules:
# Rule 1: Admin users
- matches:
  - headers:
    - name: x-role
      value: admin
  backendRefs:
  - name: admin-api
    port: 8000

# Rule 2: Regular users (fallback)
- matches:
  - path:
      type: PathPrefix
      value: /api
  backendRefs:
  - name: user-api
    port: 8000
```

## Traffic Splitting Patterns

### Canary Deployment (Gradual Rollout)

```yaml
# Start with 5%, increase to 100%
rules:
- matches:
  - path: {type: PathPrefix, value: /api}
  backendRefs:
  - name: api-v1
    port: 8000
    weight: 95
  - name: api-v2
    port: 8000
    weight: 5
```

### Blue-Green with Header Switch

```yaml
# Blue (current production)
- matches:
  - path: {type: PathPrefix, value: /api}
  backendRefs:
  - name: api-blue
    port: 8000

# Green (new version) - header activated
- matches:
  - path: {type: PathPrefix, value: /api}
    headers:
    - name: x-version
      value: green
  backendRefs:
  - name: api-green
    port: 8000
```

### A/B Testing by User Segment

```yaml
# Beta users (10% of traffic)
- matches:
  - headers:
    - name: x-user-segment
      value: beta
  backendRefs:
  - name: api-experimental
    port: 8000

# Control group (90% of traffic)
- matches:
  - path: {type: PathPrefix, value: /}
  backendRefs:
  - name: api-stable
    port: 8000
```

## GRPCRoute Patterns

### Basic gRPC Service Routing

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GRPCRoute
metadata:
  name: grpc-route
spec:
  parentRefs:
  - name: grpc-gateway
  hostnames:
  - grpc.example.com
  rules:
  - matches:
    - method:
        service: tasks.TaskService
        method: CreateTask
    backendRefs:
    - name: task-grpc
      port: 50051
```

### gRPC with Reflection

```yaml
rules:
# Reflection service (for grpcurl discovery)
- matches:
  - method:
      service: grpc.reflection.v1alpha.ServerReflection
  backendRefs:
  - name: task-grpc
    port: 50051

# Main service
- matches:
  - method:
      service: tasks.TaskService
  backendRefs:
  - name: task-grpc
    port: 50051
```

## Cross-Namespace Routing

### ReferenceGrant for Cross-Namespace Backend

```yaml
# In target namespace (where Service lives)
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-route-access
  namespace: backend-ns
spec:
  from:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    namespace: frontend-ns
  to:
  - group: ""
    kind: Service
    name: backend-service
```

### HTTPRoute Referencing Cross-Namespace Service

```yaml
# In frontend-ns
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: frontend-route
  namespace: frontend-ns
spec:
  parentRefs:
  - name: main-gateway
    namespace: gateway-ns
  rules:
  - backendRefs:
    - name: backend-service
      namespace: backend-ns  # Cross-namespace reference
      port: 8080
```

## URL Rewriting

### Path Prefix Replacement

```yaml
rules:
- matches:
  - path:
      type: PathPrefix
      value: /v1/api
  filters:
  - type: URLRewrite
    urlRewrite:
      path:
        type: ReplacePrefixMatch
        replacePrefixMatch: /api
  backendRefs:
  - name: api
    port: 8000
# /v1/api/tasks -> /api/tasks
```

### Full Path Replacement

```yaml
filters:
- type: URLRewrite
  urlRewrite:
    path:
      type: ReplaceFullPath
      replaceFullPath: /new-path
```

### Hostname Rewrite

```yaml
filters:
- type: URLRewrite
  urlRewrite:
    hostname: internal.example.com
```

## Redirects

### HTTP to HTTPS Redirect

```yaml
rules:
- matches:
  - path:
      type: PathPrefix
      value: /
  filters:
  - type: RequestRedirect
    requestRedirect:
      scheme: https
      statusCode: 301
```

### Domain Redirect

```yaml
filters:
- type: RequestRedirect
  requestRedirect:
    hostname: new-domain.example.com
    statusCode: 301
```

## Request/Response Header Manipulation

### Add Headers

```yaml
filters:
- type: RequestHeaderModifier
  requestHeaderModifier:
    add:
    - name: x-request-id
      value: ${request_id}
    - name: x-forwarded-proto
      value: https
```

### Set/Override Headers

```yaml
filters:
- type: RequestHeaderModifier
  requestHeaderModifier:
    set:
    - name: Host
      value: internal-api.example.com
```

### Remove Headers

```yaml
filters:
- type: RequestHeaderModifier
  requestHeaderModifier:
    remove:
    - x-internal-header
    - x-debug
```

### Response Headers

```yaml
filters:
- type: ResponseHeaderModifier
  responseHeaderModifier:
    add:
    - name: X-Content-Type-Options
      value: nosniff
    - name: Strict-Transport-Security
      value: max-age=31536000
```

```

### references/envoy-gateway-crds.md

```markdown
# Envoy Gateway CRD Reference

## BackendTrafficPolicy

Controls traffic from gateway to backend services.

### Rate Limiting

#### Global Rate Limit (Redis-backed)

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: global-rate-limit
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: api-route

  rateLimit:
    type: Global
    global:
      rules:
      # Rule 1: Per-user limit (distinct header values)
      - clientSelectors:
        - headers:
          - type: Distinct
            name: x-user-id
        limit:
          requests: 100
          unit: Minute

      # Rule 2: Anonymous limit (header not present)
      - clientSelectors:
        - headers:
          - name: x-user-id
            invert: true
        limit:
          requests: 10
          unit: Minute

      # Rule 3: Admin users (specific value, higher limit)
      - clientSelectors:
        - headers:
          - name: x-role
            value: admin
        limit:
          requests: 1000
          unit: Minute
```

#### Local Rate Limit (Per-Proxy)

```yaml
rateLimit:
  type: Local
  local:
    rules:
    - limit:
        requests: 100
        unit: Second
```

#### Cost-Based Rate Limiting (v1.3.0+)

```yaml
rateLimit:
  type: Global
  global:
    rules:
    - limit:
        requests: 1000
        unit: Hour
      cost: 10  # Each request costs 10 units
```

### Retry Policy

```yaml
retry:
  numRetries: 3
  perRetryTimeout: 5s
  retryOn:
  - "5xx"                    # Server errors
  - "reset"                  # Connection reset
  - "connect-failure"        # Connection failed
  - "retriable-4xx"          # 409 conflict, etc.
  - "gateway-error"          # 502, 503, 504
  backoff:
    baseInterval: 100ms
    maxInterval: 10s
```

### Circuit Breaker / Health Check

```yaml
healthCheck:
  active:
    type: HTTP
    http:
      path: /health
      expectedStatuses:
      - 200
      - 204
    interval: 10s
    timeout: 1s
    unhealthyThreshold: 3
    healthyThreshold: 1

circuitBreaker:
  maxConnections: 100
  maxPendingRequests: 50
  maxRequests: 1000
  maxRetries: 3
```

### Load Balancing

```yaml
loadBalancer:
  type: RoundRobin     # RoundRobin, LeastRequest, Random, ConsistentHash

  # For ConsistentHash
  consistentHash:
    type: Header
    header:
      name: x-user-id
```

### Timeouts

```yaml
timeout:
  tcp:
    connectTimeout: 10s
  http:
    connectionIdleTimeout: 1h
    requestTimeout: 30s
```

### Fault Injection (Testing)

```yaml
faultInjection:
  delay:
    fixedDelay: 500ms
    percentage: 10        # 10% of requests delayed
  abort:
    httpStatus: 503
    percentage: 5         # 5% of requests aborted
```

## ClientTrafficPolicy

Controls client-to-gateway connections.

### TLS Configuration

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: ClientTrafficPolicy
metadata:
  name: tls-policy
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: main-gateway

  tls:
    minVersion: "1.2"
    maxVersion: "1.3"
    ciphers:
    - ECDHE-RSA-AES128-GCM-SHA256
    - ECDHE-RSA-AES256-GCM-SHA384
    alpnProtocols:
    - h2
    - http/1.1
```

### Connection Timeouts

```yaml
timeout:
  http:
    requestReceivedTimeout: 30s
    requestSendTimeout: 30s
```

### TCP Keepalive

```yaml
tcpKeepalive:
  probes: 3
  idleTime: 10m
  interval: 10s
```

### Client IP Detection

```yaml
clientIPDetection:
  xForwardedFor:
    numTrustedHops: 1
  customHeader:
    name: X-Real-IP
    failOpen: true
```

### HTTP Settings

```yaml
http1Settings:
  http10Compatible: false
  preserveHeaderCase: true

http2Settings:
  initialConnectionWindowSize: 1048576
  initialStreamWindowSize: 65536
  maxConcurrentStreams: 100
```

### Proxy Protocol

```yaml
proxyProtocol:
  version: V2
```

## SecurityPolicy

Authentication, authorization, and access control.

### JWT Authentication

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: SecurityPolicy
metadata:
  name: jwt-policy
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: api-route

  jwt:
    providers:
    - name: auth0
      issuer: https://your-tenant.auth0.com/
      audiences:
      - https://api.example.com
      remoteJWKS:
        uri: https://your-tenant.auth0.com/.well-known/jwks.json
        timeout: 10s
        backoff:
          baseInterval: 1s
          maxInterval: 10s
      claimToHeaders:
      - claim: sub
        header: x-user-id
      - claim: permissions
        header: x-permissions
      - claim: email
        header: x-user-email
```

### OIDC Authentication

```yaml
oidc:
  provider:
    issuer: https://accounts.google.com
    authorizationEndpoint: https://accounts.google.com/o/oauth2/v2/auth
    tokenEndpoint: https://oauth2.googleapis.com/token
  clientID: "123456.apps.googleusercontent.com"
  clientSecret:
    name: google-oauth
    key: client-secret
  redirectURL: https://app.example.com/oauth2/callback
  logoutPath: /logout
  cookieSameSite: Lax
  cookieDomain: .example.com
  scopes:
  - openid
  - profile
  - email
```

### Basic Auth

```yaml
basicAuth:
  users:
    name: basic-auth-secret
```

### API Key

```yaml
apiKeyAuth:
  extractFrom:
    headers:
    - name: x-api-key
    - name: Authorization  # Bearer <api-key>
  credentialsSecret:
    name: api-keys
```

### Authorization Rules

```yaml
authorization:
  defaultAction: Deny
  rules:
  # Allow admin users
  - name: admin-access
    action: Allow
    principal:
      jwt:
        provider: auth0
        claims:
        - name: role
          values: ["admin"]

  # Allow internal IPs
  - name: internal-access
    action: Allow
    principal:
      ipAddress:
      - "10.0.0.0/8"
      - "172.16.0.0/12"

  # Allow specific paths without auth
  - name: public-paths
    action: Allow
    principal:
      ipAddress:
      - "0.0.0.0/0"
    match:
      - path:
          type: Exact
          value: /health
```

### CORS

```yaml
cors:
  allowOrigins:
  - type: Exact
    value: https://app.example.com
  - type: RegularExpression
    value: https://.*\.example\.com
  allowMethods:
  - GET
  - POST
  - PUT
  - DELETE
  - OPTIONS
  allowHeaders:
  - Authorization
  - Content-Type
  - X-Request-ID
  exposeHeaders:
  - X-Request-ID
  maxAge: 86400s
  allowCredentials: true
```

### External Authorization

```yaml
extAuth:
  grpc:
    backendRef:
      name: ext-authz-service
      port: 9000
    timeout: 10s

  # OR HTTP
  http:
    backendRef:
      name: ext-authz-service
      port: 8080
    path: /authorize
    headersToBackend:
    - Authorization
    - X-Request-ID
```

## EnvoyProxy

Proxy deployment and lifecycle configuration.

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy
  namespace: envoy-gateway-system
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyDeployment:
        replicas: 3
        container:
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 2000m
              memory: 2Gi
        pod:
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      gateway.envoyproxy.io/owning-gateway-name: main-gateway
                  topologyKey: kubernetes.io/hostname

      envoyService:
        type: LoadBalancer
        annotations:
          service.beta.kubernetes.io/aws-load-balancer-type: nlb

  telemetry:
    metrics:
      prometheus:
        enabled: true
    accessLog:
      settings:
      - format:
          type: JSON
        sinks:
        - type: File
          file:
            path: /dev/stdout

  backendTLS:
    clientCertificateRef:
      kind: Secret
      name: backend-client-cert
```

## Backend

Advanced endpoint configuration.

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: Backend
metadata:
  name: external-api
spec:
  endpoints:
  - fqdn:
      hostname: api.external-service.com
      port: 443

  # OR IP-based
  - ip:
      address: 10.0.0.100
      port: 8080

  tls:
    # Client cert for mTLS
    clientCertificateRef:
      kind: Secret
      name: client-cert

    # CA to verify backend
    caCertificateRefs:
    - kind: ConfigMap
      name: backend-ca

    # SNI for TLS handshake
    sni: api.external-service.com
```

## EnvoyExtensionPolicy

Custom extensions (Wasm, Lua).

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyExtensionPolicy
metadata:
  name: custom-extension
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: api-route

  wasm:
  - name: custom-auth
    rootID: auth_filter
    code:
      type: HTTP
      http:
        url: https://example.com/wasm/auth.wasm
        sha256: abc123...
    config:
      "@type": type.googleapis.com/google.protobuf.StringValue
      value: '{"key": "value"}'
```

## Policy Attachment and Merging

### Policy Hierarchy

1. **Route-level** policies override **Gateway-level** policies
2. Multiple policies can target the same resource; how their rules combine is controlled by `mergeType`

### Merge Types

```yaml
# On BackendTrafficPolicy
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: api-route

  # How to merge with Gateway-level policy
  # Replace: completely replace (default)
  # StrategicMerge: merge rules
  mergeType: StrategicMerge
```

### Target Selection

```yaml
# By name
targetRefs:
- group: gateway.networking.k8s.io
  kind: HTTPRoute
  name: specific-route

# By label (dynamic)
targetSelectors:
- matchLabels:
    app: task-api
    tier: frontend
```

```

### references/keda-scalers.md

```markdown
# KEDA Scaler Reference

## Core Concepts

KEDA extends Kubernetes HPA with event-driven scaling, including scale-to-zero capability.

### Architecture

```
Event Sources (Kafka, Prometheus, HTTP, etc.)
         │
         ▼
┌─────────────────┐
│  KEDA Operator  │ ◄── Watches ScaledObject/ScaledJob
└────────┬────────┘
         │ Creates/Manages
         ▼
┌─────────────────┐
│       HPA       │ ◄── Standard Kubernetes HPA
└────────┬────────┘
         │ Scales
         ▼
┌─────────────────┐
│   Deployment    │
└─────────────────┘
```

## ScaledObject Reference

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: app-scaler
  namespace: default
spec:
  # Target workload
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment        # or StatefulSet, Custom Resource
    name: my-deployment

  # Scaling bounds
  minReplicaCount: 0        # Scale to zero when idle
  maxReplicaCount: 100      # Upper limit
  # idleReplicaCount only supports 0 and must be less than minReplicaCount,
  # so omit it when minReplicaCount is already 0

  # Polling and cooldown
  pollingInterval: 30       # Check triggers every N seconds
  cooldownPeriod: 300       # Wait before scaling down (seconds)

  # Fallback if scaler fails
  fallback:
    failureThreshold: 3
    replicas: 1

  # Advanced settings
  advanced:
    restoreToOriginalReplicaCount: true
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
          - type: Percent
            value: 10
            periodSeconds: 60

  # Event triggers
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      query: sum(rate(http_requests_total[1m]))
      threshold: "100"
```
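Each trigger feeds the standard HPA calculation: for an AverageValue-style metric, KEDA sizes the workload to roughly `ceil(metricValue / threshold)`, clamped to the bounds above. A simplified sketch of that arithmetic (the real HPA additionally applies stabilization windows and scaling policies):

```python
import math

def desired_replicas(metric_value: float, threshold: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Approximate KEDA/HPA sizing for an AverageValue-style trigger."""
    desired = math.ceil(metric_value / threshold)
    return max(min_replicas, min(max_replicas, desired))

# 450 req/s against a 100 req/s-per-replica threshold -> 5 replicas
print(desired_replicas(450, 100, min_replicas=0, max_replicas=100))  # → 5
```

This is why `threshold` is best read as "how much of this metric one replica can absorb", not as an on/off switch.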

## Key Scalers for AI Agents

### Prometheus Scaler

Scale based on custom Prometheus metrics.

```yaml
triggers:
- type: prometheus
  metadata:
    serverAddress: http://prometheus.monitoring:9090
    metricName: custom_metric
    query: |
      sum(rate(http_requests_total{app="task-api"}[1m]))
    threshold: "100"
    activationThreshold: "0"  # Scale from zero once the metric is > 0

    # Authentication (optional)
    authModes: "bearer"
  authenticationRef:
    name: prometheus-auth
```

**Example Queries:**

```yaml
# Request rate per second
query: sum(rate(http_requests_total{app="task-api"}[1m]))

# Queue depth
query: sum(task_queue_depth{app="task-processor"})

# P95 latency (scale when slow)
query: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate percentage
query: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# Active connections
query: sum(envoy_http_downstream_cx_active{app="task-api"})
```

### Kafka Scaler

Scale based on consumer lag (unprocessed messages).

```yaml
triggers:
- type: kafka
  metadata:
    bootstrapServers: kafka.default:9092
    consumerGroup: task-processors
    topic: task-events
    lagThreshold: "50"           # Scale up if lag > 50
    activationLagThreshold: "1"  # Activate from zero if lag > 1
    offsetResetPolicy: latest

    # Optional settings
    allowIdleConsumers: "false"
    excludePersistentLag: "false"
    limitToPartitionsWithLag: "false"
    partitionLimitation: "0,1,2"
    version: "1.0.0"

    # SASL authentication
    sasl: plaintext
    tls: enable
  authenticationRef:
    name: kafka-auth
```

**TriggerAuthentication for Kafka:**

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: kafka-auth
spec:
  secretTargetRef:
  - parameter: sasl
    name: kafka-credentials
    key: sasl
  - parameter: username
    name: kafka-credentials
    key: username
  - parameter: password
    name: kafka-credentials
    key: password
  - parameter: tls
    name: kafka-credentials
    key: tls
  - parameter: ca
    name: kafka-credentials
    key: ca
```

### HTTP Scaler (HTTP Add-on)

Scale based on HTTP request volume.

```yaml
# First, install HTTP Add-on:
# helm install keda-http-addon kedacore/keda-add-ons-http -n keda

triggers:
- type: http
  metadata:
    scalingMetric: requestRate
    targetValue: "100"           # 100 requests per second per replica
    pathPrefixes: "/api"
    hosts: "api.example.com"

    # OR scale on concurrency
    scalingMetric: concurrency
    targetValue: "10"            # 10 concurrent requests per replica
```

**HTTPScaledObject (alternative):**

```yaml
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: http-app
spec:
  hosts:
  - api.example.com
  pathPrefixes:
  - /api
  scaleTargetRef:
    name: task-api
    kind: Deployment
  replicas:
    min: 1
    max: 50
  targetPendingRequests: 100
```

### Cron Scaler

Scale based on time schedules.

```yaml
triggers:
- type: cron
  metadata:
    timezone: America/New_York
    start: 0 8 * * *           # 8:00 AM
    end: 0 18 * * *            # 6:00 PM
    desiredReplicas: "10"

# Multiple schedules (business hours + off-hours)
- type: cron
  metadata:
    timezone: UTC
    start: 0 9 * * 1-5         # Weekdays 9 AM
    end: 0 17 * * 1-5          # Weekdays 5 PM
    desiredReplicas: "20"

- type: cron
  metadata:
    timezone: UTC
    start: 0 17 * * 1-5        # After hours
    end: 0 9 * * 1-5           # Before hours
    desiredReplicas: "2"
```

### PostgreSQL Scaler

Scale based on database query results.

```yaml
triggers:
- type: postgresql
  metadata:
    connectionFromEnv: POSTGRES_CONNECTION_STRING
    query: "SELECT COUNT(*) FROM tasks WHERE status = 'pending'"
    targetQueryValue: "10"      # 10 pending tasks per replica
    activationTargetQueryValue: "1"
```

### Redis Scaler

Scale based on Redis list/stream length.

```yaml
triggers:
- type: redis
  metadata:
    address: redis.default:6379
    listName: task-queue
    listLength: "50"            # Scale if queue > 50
    activationListLength: "1"   # Activate from zero if queue > 1

    # OR for Redis Streams
    stream: task-stream
    consumerGroup: processors
    pendingEntriesCount: "100"
```

### AWS SQS Scaler

Scale based on SQS queue depth.

```yaml
triggers:
- type: aws-sqs-queue
  metadata:
    queueURL: https://sqs.us-east-1.amazonaws.com/123456789/task-queue
    queueLength: "50"
    awsRegion: us-east-1
    activationQueueLength: "1"
  authenticationRef:
    name: aws-credentials
```

### RabbitMQ Scaler

Scale based on RabbitMQ queue.

```yaml
triggers:
- type: rabbitmq
  metadata:
    host: amqp://rabbitmq.default:5672
    queueName: task-queue
    mode: QueueLength           # or MessageRate
    value: "50"                 # 50 messages triggers scale
    activationValue: "1"
```

## Multi-Trigger Patterns

### Fan-In: Multiple Event Sources

```yaml
# Scale if ANY trigger exceeds its threshold; the highest computed replica count wins
triggers:
# CPU-based (resource)
- type: cpu
  metricType: Utilization
  metadata:
    value: "70"

# Queue-based (Kafka)
- type: kafka
  metadata:
    bootstrapServers: kafka:9092
    topic: events
    lagThreshold: "100"

# Request-based (Prometheus)
- type: prometheus
  metadata:
    query: sum(rate(http_requests_total[1m]))
    threshold: "200"
```
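With several triggers, KEDA computes a replica count per trigger and scales to the largest, so exceeding any single threshold is enough to scale up. A sketch of that reduction (metric values are illustrative):

```python
import math

def multi_trigger_replicas(triggers: list[tuple[float, float]],
                           min_r: int, max_r: int) -> int:
    """Each trigger is (metric_value, threshold); the highest demand wins."""
    desired = max(math.ceil(value / threshold) for value, threshold in triggers)
    return max(min_r, min(max_r, desired))

# CPU demands 2, Kafka lag demands 7, Prometheus demands 3 -> scale to 7
print(multi_trigger_replicas([(140, 70), (650, 100), (550, 200)], 1, 100))  # → 7
```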

### Hybrid: Scheduled + Demand

```yaml
# Business hours: high capacity
- type: cron
  metadata:
    timezone: UTC
    start: 0 8 * * 1-5
    end: 0 18 * * 1-5
    desiredReplicas: "20"

# Demand-based overlay
- type: prometheus
  metadata:
    query: sum(task_queue_depth)
    threshold: "10"
```

### Cost-Optimized: Off-Peak Batch Processing

```yaml
# Scale up during cheap compute hours
- type: cron
  metadata:
    timezone: UTC
    start: 0 22 * * *          # 10 PM
    end: 0 6 * * *             # 6 AM
    desiredReplicas: "50"

# Process backlog during cheap hours
- type: kafka
  metadata:
    topic: batch-jobs
    lagThreshold: "10"
```

## ScaledJob for Batch Workloads

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: batch-processor
spec:
  jobTargetRef:
    parallelism: 1
    completions: 1
    activeDeadlineSeconds: 600
    backoffLimit: 2
    template:
      spec:
        containers:
        - name: processor
          image: batch-processor:latest
        restartPolicy: Never

  pollingInterval: 30
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  maxReplicaCount: 10

  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      topic: batch-tasks
      consumerGroup: batch-processors
      lagThreshold: "1"
```

## TriggerAuthentication

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: secret-auth
spec:
  # From Kubernetes Secret
  secretTargetRef:
  - parameter: connection
    name: db-secret
    key: connection-string
  - parameter: password
    name: db-secret
    key: password

---
# Cluster-wide authentication
apiVersion: keda.sh/v1alpha1
kind: ClusterTriggerAuthentication
metadata:
  name: aws-auth
spec:
  podIdentity:
    provider: aws-eks
  # OR
  secretTargetRef:
  - parameter: awsAccessKeyID
    name: aws-creds
    key: access-key
  - parameter: awsSecretAccessKey
    name: aws-creds
    key: secret-key
```

## Integration with Envoy Gateway

### Scaling on Gateway Metrics

```yaml
# Envoy Gateway exposes metrics to Prometheus
triggers:
- type: prometheus
  metadata:
    serverAddress: http://prometheus:9090
    query: |
      sum(rate(envoy_http_downstream_rq_total{
        envoy_cluster_name=~".*task-api.*"
      }[1m]))
    threshold: "100"
```

### Scaling on Gateway Error Rate

```yaml
triggers:
- type: prometheus
  metadata:
    query: |
      sum(rate(envoy_http_downstream_rq_xx{
        envoy_cluster_name=~".*task-api.*",
        envoy_response_code_class="5"
      }[5m])) /
      sum(rate(envoy_http_downstream_rq_total{
        envoy_cluster_name=~".*task-api.*"
      }[5m])) * 100
    threshold: "5"  # Scale up when the 5xx rate exceeds 5%
```

### Kedify (Gateway API Native HTTP Scaling)

```yaml
# Uses Envoy as the interceptor instead of HTTP Add-on
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kedify-scaler
spec:
  scaleTargetRef:
    name: task-api
  triggers:
  - type: http
    metadata:
      scalingMetric: requestRate
      targetValue: "100"
      # Kedify auto-wires with HTTPRoute
```

```

### references/ai-gateway.md

```markdown
# Envoy AI Gateway Reference

## Overview

Envoy AI Gateway is a specialized gateway for LLM traffic management, built on Envoy Gateway. It addresses challenges unique to AI traffic that traditional API gateways don't handle:

- **Token-based billing**: LLM providers charge per token, not per request
- **Variable request cost**: A single request can cost 100 or 10,000 tokens
- **Multi-provider routing**: Need fallback across OpenAI, Anthropic, Gemini, etc.
- **Model-specific policies**: Different models need different rate limits

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                    Tier 1 Gateway                           │
│  (Authentication, Global Routing, Cost Protection)          │
├─────────────────────────────────────────────────────────────┤
│  • Token Rate Limiting (per-user, per-model)                │
│  • Provider Fallback (OpenAI → Anthropic → Gemini)          │
│  • Response Caching & Deduplication                         │
│  • Credential Injection (secure, centralized)               │
└─────────────────────────────────────────────────────────────┘
                ↓ (External)     ↓ (Internal)
        ┌───────────────────────────────────┐
        │  OpenAI  │  Anthropic  │  Gemini  │
        └───────────────────────────────────┘
                            ↓
                ┌───────────────────────┐
                │   Tier 2 Gateway      │
                │  (Self-Hosted Models) │
                │  vLLM, KServe, etc.   │
                └───────────────────────┘
```

## Token-Based Rate Limiting

### How It Works

1. Request flows through AI Gateway to LLM provider
2. Response contains token usage (input_tokens, output_tokens)
3. AI Gateway extracts token counts (OpenAI schema format)
4. Deducts from user's token budget
5. If budget exceeded, returns 429
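The ledger behind steps 2–5 can be sketched in a few lines; class and method names here are illustrative, not the gateway's actual API:

```python
import time
from collections import defaultdict

class TokenBudget:
    """Per-user token ledger: deduct reported usage, reject over budget."""
    def __init__(self, limit: int, period_s: float):
        self.limit, self.period_s = limit, period_s
        self.used = defaultdict(int)
        self.window_start = defaultdict(lambda: time.monotonic())

    def record(self, user: str, input_tokens: int, output_tokens: int) -> int:
        """Deduct usage reported by the provider; return HTTP status to emit."""
        now = time.monotonic()
        if now - self.window_start[user] > self.period_s:  # window rollover
            self.used[user], self.window_start[user] = 0, now
        self.used[user] += input_tokens + output_tokens
        return 429 if self.used[user] > self.limit else 200

budget = TokenBudget(limit=10_000, period_s=60)
print(budget.record("alice", 800, 1_200))    # 2,000 used  -> 200
print(budget.record("alice", 3_000, 6_000))  # 11,000 used -> 429
```

Note the inherent lag: usage is only known after the response arrives, so a single oversized request can overshoot the budget before the 429 kicks in.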

### Configuration

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: llm-route
spec:
  # Token-based rate limiting
  llmRequestCosts:
    - type: InputToken
      weight: 1        # Cost per input token
    - type: OutputToken
      weight: 3        # Output tokens cost 3x (more expensive)
    - type: TotalToken
      weight: 1        # OR just use total

  # Per-model cost multipliers
  modelCosts:
    - model: gpt-4
      multiplier: 10   # GPT-4 = 10x base cost
    - model: gpt-3.5-turbo
      multiplier: 1    # GPT-3.5 = base cost
    - model: claude-3-opus
      multiplier: 15   # Claude Opus = 15x base cost
    - model: claude-3-sonnet
      multiplier: 3    # Claude Sonnet = 3x base cost

  # Budget per user
  tokenBudget:
    perUser:
      limit: 100000    # 100K tokens per user
      period: 30d      # Per month
    perMinute:
      limit: 10000     # 10K tokens per minute burst
```
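The token weights and per-model multipliers combine multiplicatively when a request is charged against a budget. A sketch of the arithmetic (the function is illustrative; the gateway's internal accounting may differ):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_weight: int = 1, output_weight: int = 3,
                 model_multiplier: int = 1) -> int:
    """Budget units charged for one request."""
    base = input_tokens * input_weight + output_tokens * output_weight
    return base * model_multiplier

# 500 in / 200 out on gpt-4 (10x): (500*1 + 200*3) * 10 = 11,000 units
print(request_cost(500, 200, model_multiplier=10))  # → 11000
```

The same prompt on gpt-3.5-turbo (multiplier 1) would cost 1,100 units, which is the lever that makes cheap models attractive for routine agent calls.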

### Extracting User from Request

```yaml
# Token budget tracked per user
userIdentification:
  header: x-user-id
  # OR from JWT claim
  jwt:
    claim: sub
    provider: auth0
```

## Provider Fallback

### Priority-Based Routing

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: resilient-llm
spec:
  backends:
  # Primary: OpenAI GPT-4
  - name: openai-gpt4
    priority: 0
    provider: openai
    model: gpt-4
    auth:
      type: APIKey
      apiKeyRef:
        name: openai-key

  # First fallback: Anthropic Claude
  - name: anthropic-claude
    priority: 1
    provider: anthropic
    model: claude-3-opus
    modelNameOverride: gpt-4  # Unified client interface
    auth:
      type: APIKey
      apiKeyRef:
        name: anthropic-key

  # Second fallback: Google Gemini
  - name: google-gemini
    priority: 2
    provider: google-gemini
    model: gemini-pro
    modelNameOverride: gpt-4
    auth:
      type: APIKey
      apiKeyRef:
        name: gemini-key

  # Fallback triggers
  failover:
    on:
    - rateLimit         # Provider rate limited
    - timeout           # Request timeout
    - error             # 5xx error
    - budgetExceeded    # Token budget exhausted
```
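Conceptually, failover walks the backends in priority order until one succeeds, retrying only on the configured trigger conditions. A minimal sketch (the `call_provider` callback and backend names are hypothetical):

```python
FAILOVER_ON = {"rateLimit", "timeout", "error", "budgetExceeded"}

def route_with_fallback(backends, request, call_provider):
    """Try backends by ascending priority; call_provider returns
    (response, failure_reason_or_None)."""
    last_reason = None
    for backend in sorted(backends, key=lambda b: b["priority"]):
        response, reason = call_provider(backend, request)
        if reason is None:
            return response
        if reason not in FAILOVER_ON:
            raise RuntimeError(f"non-retryable failure: {reason}")
        last_reason = reason
    raise RuntimeError(f"all providers failed (last: {last_reason})")

# Demo: the primary is rate limited, so traffic shifts to the fallback.
def fake_call(backend, request):
    return (None, "rateLimit") if backend["priority"] == 0 else ("ok", None)

backends = [{"name": "anthropic-claude", "priority": 1},
            {"name": "openai-gpt4", "priority": 0}]
print(route_with_fallback(backends, {}, fake_call))  # → ok
```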

### Model Name Virtualization

```yaml
# Clients always request "gpt-4"
# Gateway routes to available backend
backends:
- name: openai
  model: gpt-4
  priority: 0
- name: anthropic
  model: claude-3-opus
  modelNameOverride: gpt-4  # Appears as gpt-4 to client
  priority: 1
```

## Supported Providers

| Provider | Models | Auth Type |
|----------|--------|-----------|
| **OpenAI** | gpt-4, gpt-4-turbo, gpt-3.5-turbo | API Key |
| **Anthropic** | claude-3-opus, claude-3-sonnet, claude-3-haiku | API Key |
| **Google Gemini** | gemini-pro, gemini-ultra | API Key |
| **AWS Bedrock** | Claude, Titan, Llama | IAM Role |
| **Azure OpenAI** | GPT-4, GPT-3.5 | API Key |
| **Mistral** | mistral-large, mistral-medium | API Key |
| **Cohere** | command, command-light | API Key |
| **Groq** | llama2-70b, mixtral | API Key |
| **DeepSeek** | deepseek-chat | API Key |
| **Together AI** | Various open models | API Key |

## Cost Control Patterns

### Pattern 1: Tiered Service Plans

```yaml
# Free tier: 10K tokens/day
userTiers:
- name: free
  selector:
    header: x-plan
    value: free
  tokenBudget:
    daily: 10000
  allowedModels:
  - gpt-3.5-turbo

# Pro tier: 100K tokens/day
- name: pro
  selector:
    header: x-plan
    value: pro
  tokenBudget:
    daily: 100000
  allowedModels:
  - gpt-4
  - gpt-3.5-turbo

# Enterprise: Unlimited with cost tracking
- name: enterprise
  selector:
    header: x-plan
    value: enterprise
  tokenBudget:
    unlimited: true
  costTracking:
    enabled: true
    alertThreshold: 1000  # Alert at $1000/day
```

### Pattern 2: Request Caching

```yaml
caching:
  enabled: true
  ttl: 3600s              # Cache for 1 hour
  keyComponents:
  - prompt                # Same prompt = cache hit
  - model                 # Different models = different cache
  - temperature           # Different temp = different cache
  maxSize: 10GB
  backend:
    type: redis
    address: redis:6379
```

### Pattern 3: Request Deduplication

```yaml
deduplication:
  enabled: true
  window: 5s              # Merge identical requests within 5s
  keyComponents:
  - prompt
  - model
  # 5 users asking same question = 1 LLM call
```
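The dedup key behaves like a short-lived cache keyed on the request's identity components. A minimal single-process sketch (a production gateway also merges concurrent in-flight requests; this version just collapses repeats within the window):

```python
import time, hashlib, json

class WindowDeduplicator:
    """Serve identical (prompt, model) requests from a short-lived cache
    so N identical calls within the window become 1 upstream call."""
    def __init__(self, window_s: float = 5.0):
        self.window_s = window_s
        self._cache = {}  # key -> (timestamp, response)

    def _key(self, prompt: str, model: str) -> str:
        return hashlib.sha256(json.dumps([prompt, model]).encode()).hexdigest()

    def call(self, prompt, model, upstream):
        key = self._key(prompt, model)
        now = time.monotonic()
        hit = self._cache.get(key)
        if hit and now - hit[0] < self.window_s:
            return hit[1]
        response = upstream(prompt, model)
        self._cache[key] = (now, response)
        return response

calls = []
def upstream(prompt, model):
    calls.append(1)
    return f"answer:{prompt}"

d = WindowDeduplicator()
for _ in range(5):
    d.call("what is envoy?", "gpt-4", upstream)
print(len(calls))  # → 1
```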

### Pattern 4: Retry with Cheaper Model

```yaml
retry:
  maxAttempts: 3
  fallbackStrategy: cheaperModel
  fallbackOrder:
  - gpt-4          # Try expensive first
  - gpt-3.5-turbo  # Fall back to cheap
  - gemini-flash   # Then even cheaper
```

## Security Configuration

### Credential Injection

```yaml
# Centralized API key management
backends:
- name: openai
  auth:
    type: APIKey
    apiKeyRef:
      name: llm-credentials
      key: openai-api-key
    # Key injected as Authorization header
    headerName: Authorization
    headerPrefix: "Bearer "
```

### Request Sanitization

```yaml
security:
  # Remove sensitive headers before forwarding
  stripHeaders:
  - x-internal-auth
  - x-user-password

  # Validate prompt content (PII detection)
  promptValidation:
    enabled: true
    blockPatterns:
    - "ssn:\\s*\\d{3}-\\d{2}-\\d{4}"  # SSN pattern
    - "credit_card:\\s*\\d{16}"        # Credit card
```
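The `blockPatterns` entries are ordinary regular expressions; a sketch of how such a check applies to an incoming prompt:

```python
import re

# Patterns copied from the blockPatterns example above
BLOCK_PATTERNS = [
    r"ssn:\s*\d{3}-\d{2}-\d{4}",   # SSN
    r"credit_card:\s*\d{16}",      # credit card
]

def prompt_allowed(prompt: str) -> bool:
    """Reject the prompt if any blocked pattern matches."""
    return not any(re.search(p, prompt) for p in BLOCK_PATTERNS)

print(prompt_allowed("summarize this doc"))            # → True
print(prompt_allowed("my ssn: 123-45-6789, help me"))  # → False
```

Regex-based PII detection is a coarse filter; it catches formatted identifiers but not free-form leaks, so treat it as one layer rather than a guarantee.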

## Observability

### Metrics Exported

```
# Token usage per user/model
ai_gateway_tokens_total{user="...", model="gpt-4", type="input"}
ai_gateway_tokens_total{user="...", model="gpt-4", type="output"}

# Request latency per provider
ai_gateway_request_duration_seconds{provider="openai", model="gpt-4"}

# Fallback events
ai_gateway_fallback_total{from="openai", to="anthropic", reason="rate_limit"}

# Cost per user (estimated)
ai_gateway_cost_dollars{user="...", model="gpt-4"}

# Cache hit rate
ai_gateway_cache_hits_total{model="gpt-4"}
ai_gateway_cache_misses_total{model="gpt-4"}
```

### Integration with Prometheus/Grafana

```yaml
# ServiceMonitor for AI Gateway metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ai-gateway
spec:
  selector:
    matchLabels:
      app: ai-gateway
  endpoints:
  - port: metrics
    path: /stats/prometheus
    interval: 30s
```

## Complete Example: AI Agent Platform

```yaml
---
# AI Gateway configuration for multi-tenant agent platform
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: agent-platform
spec:
  # Multi-provider backends
  backends:
  - name: openai-primary
    priority: 0
    provider: openai
    model: gpt-4-turbo
    auth:
      apiKeyRef: {name: openai-key}

  - name: anthropic-fallback
    priority: 1
    provider: anthropic
    model: claude-3-sonnet
    modelNameOverride: gpt-4
    auth:
      apiKeyRef: {name: anthropic-key}

  # Token-based rate limiting
  tokenRateLimit:
    perUser:
      inputTokensPerMinute: 5000
      outputTokensPerMinute: 10000
      totalTokensPerDay: 100000

    # Different limits for expensive models
    perModel:
      gpt-4-turbo:
        tokensPerMinute: 2000
      gpt-3.5-turbo:
        tokensPerMinute: 10000

  # Cost control
  costControl:
    maxDailyCostPerUser: 10.00  # $10/day cap
    alertThreshold: 8.00         # Alert at 80%
    modelPricing:
      gpt-4-turbo:
        inputPer1k: 0.01
        outputPer1k: 0.03
      claude-3-sonnet:
        inputPer1k: 0.003
        outputPer1k: 0.015

  # Caching for repeated prompts
  caching:
    enabled: true
    ttl: 3600s
    backend: redis

  # Failover configuration
  failover:
    on: [rateLimit, timeout, budgetExceeded]
    retryAttempts: 2
    retryDelay: 1s

---
# User identification from JWT
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: SecurityPolicy
metadata:
  name: agent-auth
spec:
  targetRefs:
  - kind: AIGatewayRoute
    name: agent-platform
  jwt:
    providers:
    - name: agent-auth
      issuer: https://auth.agent-platform.com
      remoteJWKS:
        uri: https://auth.agent-platform.com/.well-known/jwks.json
      claimToHeaders:
      - claim: sub
        header: x-user-id
      - claim: plan
        header: x-user-plan
```
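Using the `modelPricing` figures from the example above, the per-request dollar cost works out as follows (a sketch; the gateway's actual accounting is internal):

```python
PRICING = {  # per 1K tokens, copied from the modelPricing block above
    "gpt-4-turbo":     {"input": 0.01,  "output": 0.03},
    "claude-3-sonnet": {"input": 0.003, "output": 0.015},
}

def usd_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request under the example's pricing table."""
    p = PRICING[model]
    return input_tokens / 1000 * p["input"] + output_tokens / 1000 * p["output"]

# 2,000 in + 1,000 out on gpt-4-turbo: 2*0.01 + 1*0.03 = $0.05
print(round(usd_cost("gpt-4-turbo", 2000, 1000), 4))  # → 0.05
```

At those rates a $10/day cap (`maxDailyCostPerUser`) absorbs roughly 200 such requests on gpt-4-turbo, which is why the cheaper fallback matters for chatty agents.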

## Why AI Gateway for Agent Platforms

| Traditional Gateway | AI Gateway |
|---------------------|------------|
| 1 request = 1 unit | 1 request = N tokens (variable) |
| Fixed cost per endpoint | Dynamic cost per model/token |
| No provider awareness | Multi-provider with fallback |
| Treats requests as uniform | Aware of variable per-request token cost |
| Single backend | Intelligent cross-backend routing |
| No model policies | Per-model rate limits/costs |

**For AI agents specifically:**
- Agents make thousands of LLM calls across providers
- Each call has different token cost
- Agent budget control is critical (runaway = massive bill)
- Multi-provider resilience prevents lock-in
- Visibility into which agents consume budget

```

building-with-envoy-gateway | SkillHub