
cfn-docker-wave-execution

This skill orchestrates Docker container execution in parallel waves with memory-aware spawning. It parses batching plans, spawns containers with tier-based memory limits, monitors execution via polling, collects results with exit code analysis, and performs cleanup. It's designed for running multiple agent containers simultaneously in CI/CD or development workflows.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars: 14.

Hot score: 86.

Updated: March 20, 2026.

Overall rating: A (7.2).

Composite score: 5.3.

Best-practice grade: C (64.8).

Install command

npx @skill-hub/cli install masharratt-claude-flow-novice-waves
docker-orchestration, parallel-processing, container-management, batch-execution, memory-management

Repository

masharratt/claude-flow-novice

Skill path: .claude/cfn-extras/skills/deprecated/cfn-docker-runtime/lib/waves


Best for

Primary workflow: Run DevOps.

Technical facets: DevOps.

Target audience: Developers and DevOps engineers working with large TypeScript/Python projects who need to run parallel error-fixing agents in Docker containers, particularly in CI/CD pipelines with memory constraints.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: masharratt.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

  • Install cfn-docker-wave-execution into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/masharratt/claude-flow-novice before adding cfn-docker-wave-execution to shared team environments
  • Use cfn-docker-wave-execution for devops workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: cfn-docker-wave-execution
description: Orchestrate Docker container execution across parallel agent waves with memory-aware spawning
version: 1.0.0
tags: [docker, wave-execution, container-orchestration, parallel-spawning]
status: production
---

# CFN Docker Wave Execution Skill

**Purpose:** Orchestrate Docker container execution across parallel agent waves with memory-aware spawning, comprehensive status tracking, and graceful cleanup.

**Status:** Production Ready (v1.0.0)

---

## Table of Contents

1. [Overview](#overview)
2. [Architecture](#architecture)
3. [Modules](#modules)
4. [Usage](#usage)
5. [Configuration](#configuration)
6. [Integration Patterns](#integration-patterns)
7. [Error Handling](#error-handling)
8. [Performance](#performance)
9. [Troubleshooting](#troubleshooting)
10. [Success Criteria](#success-criteria)

---

## Overview

### What This Skill Does

Docker Wave Execution transforms error batching plans from `cfn-error-batching-strategy` into parallel Docker container execution:

1. **Parse batching plan JSON** from error batching strategy
2. **Spawn containers** with memory-tier-aware limits and environment configuration
3. **Monitor execution** with Docker API polling and health tracking
4. **Collect results** from exited containers with exit code analysis
5. **Clean up** containers and volumes after completion

### Key Features

- **Memory-tier alignment:** Automatic memory limit mapping (Tier 1→512MB, Tier 2→600MB, etc.)
- **Parallel spawning:** Batch-based container creation respecting Docker daemon limits
- **Real-time monitoring:** Poll-based status tracking with configurable timeout
- **Exit code analysis:** Distinguish success (0), failure (1+), and timeout scenarios
- **Log preservation:** Retain container logs before removal for failed containers
- **Network isolation:** Optional isolated network per wave or shared network
- **Resource cleanup:** Automatic container and volume removal with safety checks

### When to Use

- Spawning 10+ agent containers for parallel error fixing
- Memory-constrained Docker environments (limited host resources)
- Large TypeScript/Python projects with 50+ error files
- Iteration-heavy CFN Loops requiring repeated wave execution
- Production CI/CD pipelines requiring fail-never semantics

### Integration Points

**Upstream:** `cfn-error-batching-strategy` → Wave plan JSON
**Downstream:** Result aggregation → `cfn-loop-orchestration`
**Dependencies:** Docker CLI, jq, coreutils

---

## Architecture

### Data Flow

```
┌────────────────────────────────┐
│ Wave Plan (from batching)      │
│ {                              │
│  "waves": [{                   │
│    "wave_number": 1,           │
│    "batches": [...]            │
│  }]                            │
│ }                              │
└────────────┬───────────────────┘
             ↓
┌────────────────────────────────┐
│ spawn-wave.sh                  │
│ - Parse wave JSON              │
│ - Create containers            │
│ - Set environment vars         │
└────────────┬───────────────────┘
             ↓
┌────────────────────────────────┐
│ Running Containers             │
│ [container-1, container-2, ...]│
└────────────┬───────────────────┘
             ↓
┌────────────────────────────────┐
│ monitor-wave.sh                │
│ - Poll container status        │
│ - Track exit codes             │
│ - Timeout handling             │
└────────────┬───────────────────┘
             ↓
┌────────────────────────────────┐
│ Execution Results              │
│ {                              │
│  "completed": 28,              │
│  "failed": 0,                  │
│  "timeout": 0                  │
│ }                              │
└────────────┬───────────────────┘
             ↓
┌────────────────────────────────┐
│ cleanup-wave.sh                │
│ - Remove containers            │
│ - Preserve logs (if failed)    │
│ - Clean volumes                │
└────────────────────────────────┘
```

### Module Responsibilities

| Module | Responsibility | Exit Code |
|--------|-----------------|-----------|
| `spawn-wave.sh` | Create containers with proper configuration | 0=success, 1=error, 2=validation |
| `monitor-wave.sh` | Track container status with timeout | 0=all complete, 1=failure, 2=timeout |
| `cleanup-wave.sh` | Remove containers and artifacts | 0=success, 1=partial, 2=error |
| `lib/docker-helpers.sh` | Shared utilities and Docker wrappers | N/A (sourced) |

---

## Modules

### 1. spawn-wave.sh

**Purpose:** Spawn Docker containers from a wave plan with memory-tier-aware limits.

**Usage:**
```bash
./.claude/skills/cfn-docker-wave-execution/spawn-wave.sh \
  --wave-plan ./waves.json \
  --wave-number 1 \
  --base-image claude-flow-novice:latest \
  --workspace /workspace \
  --network cfn-network \
  --output spawned.json
```

**Input Format (wave-plan.json):**
```json
{
  "waves": [
    {
      "wave_number": 1,
      "batch_count": 28,
      "memory_needed": "14.5GB",
      "parallelism": 28,
      "batches": [
        {
          "batch_id": "iter1-batch-1",
          "tier": 1,
          "memory": "512m",
          "files": ["src/Button.tsx"],
          "task_prompt": "Fix TypeScript errors in Button.tsx"
        }
      ]
    }
  ]
}
```
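
As a rough illustration of how a spawner might read this plan, the sketch below lists the batches of one wave with `jq`. The function name `list_wave_batches` is hypothetical and not part of the skill; only the field names (`waves`, `wave_number`, `batches`, `batch_id`, `tier`, `memory`) come from the input format above.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print one tab-separated line per batch of the
# requested wave, in the order batch_id, tier, memory.
list_wave_batches() {
  local plan_file="$1" wave_number="$2"
  jq -r --argjson n "$wave_number" '
    .waves[]
    | select(.wave_number == $n)
    | .batches[]
    | [.batch_id, .tier, .memory]
    | @tsv
  ' "$plan_file"
}

# Example (not executed here):
#   while IFS=$'\t' read -r batch_id tier memory; do
#     echo "batch=$batch_id tier=$tier memory=$memory"
#   done < <(list_wave_batches ./waves.json 1)
```

The `task_prompt` field is left out of the tab-separated output on purpose, because free text may contain tabs or newlines; it can be fetched per batch with a separate `jq` query.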

**Output Format:**
```json
{
  "wave_number": 1,
  "spawned_at": "2025-11-14T10:30:45Z",
  "containers": [
    {
      "container_id": "abc123def456",
      "container_name": "cfn-wave1-batch1",
      "batch_id": "iter1-batch-1",
      "tier": 1,
      "memory_limit": "512m",
      "status": "running",
      "started_at": "2025-11-14T10:30:46Z"
    }
  ],
  "total_spawned": 28,
  "total_memory": "14.5GB"
}
```

**Options:**
- `--wave-plan FILE`: Path to batching plan JSON (required)
- `--wave-number N`: Wave number to spawn (required)
- `--base-image IMAGE`: Docker image to use (default: claude-flow-novice:latest)
- `--workspace PATH`: Mount point for workspace (default: /workspace)
- `--network NAME`: Docker network name (default: cfn-network)
- `--environment VAR=VALUE`: Additional env vars (repeatable)
- `--output FILE`: Write container manifest to file
- `--dry-run`: Show what would be spawned without creating
- `--parallel N`: Max concurrent spawns (default: 5)
- `--verbose`: Enable detailed logging
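
One way a `--parallel N` limit could work is sketched below. This is an assumption about the approach, not the script's actual implementation: `run_limited` and `spawn_one` are hypothetical names, and `spawn_one` is only a placeholder for the real per-batch `docker run` step.

```bash
#!/usr/bin/env bash

# Placeholder for the real per-batch spawn step (hypothetical).
spawn_one() {
  echo "would spawn container for batch $1"
}

# Start one background job per batch, keeping at most $1 jobs in flight.
# Returns 0 if every job succeeded, 1 otherwise.
run_limited() {
  local max="$1"; shift
  local -a pids=()
  local batch pid failures=0
  for batch in "$@"; do
    if (( ${#pids[@]} >= max )); then
      # All slots are taken: wait for the oldest job, then free its slot.
      wait "${pids[0]}" || failures=$((failures + 1))
      pids=("${pids[@]:1}")
    fi
    spawn_one "$batch" &
    pids+=("$!")
  done
  # Collect the jobs that are still outstanding.
  for pid in "${pids[@]}"; do
    wait "$pid" || failures=$((failures + 1))
  done
  (( failures == 0 ))
}

# Example: run_limited 5 iter1-batch-1 iter1-batch-2 iter1-batch-3
```

Waiting for the oldest job instead of any finished job is slightly less efficient, but it keeps the sketch portable to Bash versions without `wait -n`.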

**Exit Codes:**
- `0`: All containers spawned successfully
- `1`: One or more containers failed to spawn
- `2`: Validation error (missing file, invalid JSON)

**Implementation Details:**

1. **Validation Phase:**
   - Verify wave-plan.json exists and is valid JSON
   - Check Docker daemon accessibility
   - Validate base image exists or pull from registry
   - Verify workspace mount point exists

2. **Container Spawning:**
   - For each batch in wave:
     - Extract memory tier from batch JSON
     - Map tier to memory limit via helper function
     - Create container with `docker run --memory <limit> --memory-reservation <limit>`
     - Mount workspace: `-v /workspace:/workspace:rw`
     - Set network: `--network cfn-network`
     - Set environment: `-e BATCH_ID=<id> -e TASK_PROMPT=<prompt> -e TASK_ID=<id>`
     - Run detached: `-d`
   - Limit parallelism to avoid Docker daemon overload

3. **Result Tracking:**
   - Collect container IDs in array
   - Write container manifest to output file
   - Report total spawned and total memory
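
The container spawning step above can be sketched as a function that assembles the `docker run` arguments for one batch. The function name `build_run_args`, the global array `RUN_ARGS`, and the `--name` flag are assumptions made for illustration; the flags for memory, network, volume, and environment follow the list above, and the defaults reuse the documented `CFN_DOCKER_*` variables.

```bash
#!/usr/bin/env bash

# Hypothetical helper: fill the global array RUN_ARGS with the arguments
# for one detached container. Nothing is executed here.
build_run_args() {
  local name="$1" batch_id="$2" task_id="$3" memory="$4" prompt="$5"
  local image="${CFN_DOCKER_IMAGE:-claude-flow-novice:latest}"
  local network="${CFN_DOCKER_NETWORK:-cfn-network}"
  local workspace="${CFN_DOCKER_WORKSPACE:-/workspace}"

  RUN_ARGS=(
    run -d
    --name "$name"
    --memory "$memory" --memory-reservation "$memory"
    --network "$network"
    -v "$workspace:/workspace:rw"
    -e "BATCH_ID=$batch_id"
    -e "TASK_ID=$task_id"
    -e "TASK_PROMPT=$prompt"
    "$image"
  )
}

# Example (not executed here):
#   build_run_args cfn-wave1-batch1 iter1-batch-1 task-1 512m "Fix TypeScript errors in Button.tsx"
#   container_id=$(docker "${RUN_ARGS[@]}")
```

Keeping the arguments in an array means a prompt that contains spaces or quotes is passed to Docker as a single argument, without any `eval` or manual escaping.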

### 2. monitor-wave.sh

**Purpose:** Poll Docker containers for status until completion or timeout.

**Usage:**
```bash
./.claude/skills/cfn-docker-wave-execution/monitor-wave.sh \
  --containers ./spawned.json \
  --wave-number 1 \
  --timeout 1800 \
  --poll-interval 5 \
  --output results.json
```

**Input Format:**
```json
{
  "wave_number": 1,
  "containers": [
    {
      "container_id": "abc123",
      "batch_id": "batch-1",
      "memory_limit": "512m"
    }
  ]
}
```

**Output Format:**
```json
{
  "wave_number": 1,
  "monitoring_duration": 287,
  "completion_status": "complete",
  "containers": [
    {
      "container_id": "abc123",
      "batch_id": "batch-1",
      "status": "exited",
      "exit_code": 0,
      "exit_status": "success",
      "started_at": "2025-11-14T10:30:46Z",
      "completed_at": "2025-11-14T10:35:33Z"
    }
  ],
  "metrics": {
    "total": 28,
    "running": 0,
    "exited": 28,
    "success": 27,
    "failed": 1,
    "timeout": 0
  }
}
```

**Options:**
- `--containers FILE`: Spawned containers manifest (required)
- `--wave-number N`: Wave number (for filtering, optional)
- `--timeout SECONDS`: Max wait time (default: 1800 = 30 min)
- `--poll-interval SECONDS`: Check frequency (default: 5)
- `--output FILE`: Write results to file
- `--preserve-logs`: Keep container logs for analysis
- `--verbose`: Enable detailed polling output

**Exit Codes:**
- `0`: All containers completed successfully
- `1`: One or more containers failed (exit code != 0)
- `2`: Timeout reached before all containers completed

**Implementation Details:**

1. **Polling Loop:**
   - Start monitoring loop with `$timeout` seconds limit
   - Every `$poll_interval` seconds:
     - Run `docker ps --all` to get container status
     - For each container: extract exit code via `docker inspect`
     - Categorize: running, exited-success (0), exited-failed (!=0)
     - Update progress tracking

2. **Status Tracking:**
   - Maintain counts: running, exited, success, failed, timeout
   - Record timestamps: started_at, completed_at
   - Track exit codes for all exited containers

3. **Timeout Handling:**
   - If timeout reached with containers still running:
     - Set exit_status = "timeout"
     - Increment timeout counter
     - Return exit code 2

4. **Progress Reporting:**
   - Log current status every poll interval
   - Show: "Running: 5, Completed: 23, Failed: 0, Timeout: 0"
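
A minimal sketch of the polling loop described above follows. The function names (`inspect_container`, `classify_container`, `poll_wave`) are hypothetical, and the sketch treats every state other than `exited` as still running, which is a simplification. The return codes mirror the documented exit codes of `monitor-wave.sh`.

```bash
#!/usr/bin/env bash

# Print "<status> <exit_code>" for one container.
inspect_container() {
  docker inspect --format '{{.State.Status}} {{.State.ExitCode}}' "$1"
}

# Classify one container as running, success, or failed.
classify_container() {
  local status exit_code
  read -r status exit_code < <(inspect_container "$1")
  if [[ $status != "exited" ]]; then
    echo "running"
  elif [[ $exit_code -eq 0 ]]; then
    echo "success"
  else
    echo "failed"
  fi
}

# Usage: poll_wave TIMEOUT_SECONDS POLL_INTERVAL CONTAINER_ID...
# Returns 0 (all succeeded), 1 (at least one failed), 2 (timeout).
poll_wave() {
  local timeout="$1" interval="$2"; shift 2
  local deadline=$(( SECONDS + timeout ))
  local running success failed id
  while :; do
    running=0 success=0 failed=0
    for id in "$@"; do
      case "$(classify_container "$id")" in
        running) running=$((running + 1)) ;;
        success) success=$((success + 1)) ;;
        failed)  failed=$((failed + 1)) ;;
      esac
    done
    echo "Running: $running, Completed: $success, Failed: $failed" >&2
    (( running == 0 )) && break
    (( SECONDS >= deadline )) && return 2
    sleep "$interval"
  done
  (( failed == 0 ))
}

# Example: poll_wave 1800 5 abc123 def456
```

Progress goes to stderr so that stdout stays free for a JSON result, which is how the usage examples in this document consume the monitor's output.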

### 3. cleanup-wave.sh

**Purpose:** Remove containers and clean up Docker artifacts.

**Usage:**
```bash
./.claude/skills/cfn-docker-wave-execution/cleanup-wave.sh \
  --wave-number 1 \
  --pattern "cfn-wave1-*" \
  --preserve-failed-logs \
  --output cleanup-report.json
```

**Input Options:**
- `--wave-number N`: Clean containers from specific wave
- `--pattern PATTERN`: Cleanup containers matching pattern
- `--containers FILE`: Cleanup from manifest file

**Output Format:**
```json
{
  "cleanup_at": "2025-11-14T10:36:00Z",
  "containers_removed": 28,
  "logs_preserved": 1,
  "volumes_cleaned": 14,
  "errors": [],
  "summary": "Successfully removed 28 containers, preserved logs from 1 failed container"
}
```

**Options:**
- `--wave-number N`: Wave to cleanup (required)
- `--pattern PATTERN`: Container name pattern (default: cfn-wave$N-*)
- `--preserve-failed-logs`: Keep logs from failed containers
- `--preserve-all-logs`: Keep all logs regardless of exit code
- `--dry-run`: Show what would be removed
- `--output FILE`: Write report to file
- `--verbose`: Enable detailed logging

**Exit Codes:**
- `0`: All containers removed successfully
- `1`: Partial cleanup (some removals failed)
- `2`: Critical error (failed to cleanup majority)

**Implementation Details:**

1. **Container Discovery:**
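   - Use `docker ps -a --filter "name=$PATTERN"` to find containers (Docker's `name` filter matches any part of the name and does not expand shell globs such as a trailing `*`)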
   - Extract container IDs and names

2. **Log Preservation:**
   - If container has exit code != 0 and `--preserve-failed-logs`:
     - Run `docker logs <container> > logs/<container-id>.log 2>&1` (`docker logs` replays the container's stderr on stderr, so both streams are redirected)
     - Store in `.claude/artifacts/container-logs/` directory

3. **Container Removal:**
   - For each container:
     - Run `docker rm <container-id>`
     - Track success/failure

4. **Volume Cleanup:**
   - Find dangling volumes from removed containers
   - Remove with `docker volume rm <volume-id>`
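
The discovery, log preservation, and removal steps above might look roughly like this. It is a sketch under assumptions: the function name `cleanup_wave_containers` is hypothetical, container names are assumed to follow the `cfn-wave<N>-...` scheme from the spawn manifest, and the filter uses an anchored regular expression so that wave 1 does not also match wave 10. Volume cleanup is omitted.

```bash
#!/usr/bin/env bash

# Usage: cleanup_wave_containers WAVE_NUMBER [LOG_DIR]
# Saves logs of failed containers, removes every container of the wave,
# prints a short summary, and returns 1 if any removal failed.
cleanup_wave_containers() {
  local wave="$1" log_dir="${2:-.claude/artifacts/container-logs}"
  local id exit_code removed=0 failed=0
  mkdir -p "$log_dir"

  while read -r id; do
    [[ -n $id ]] || continue
    exit_code=$(docker inspect --format '{{.State.ExitCode}}' "$id")
    if [[ $exit_code -ne 0 ]]; then
      # Capture stdout and stderr of the failed container before removal.
      docker logs "$id" > "$log_dir/$id.log" 2>&1
    fi
    if docker rm "$id" > /dev/null; then
      removed=$((removed + 1))
    else
      failed=$((failed + 1))
    fi
  done < <(docker ps -a --filter "name=^/?cfn-wave${wave}-" --format '{{.ID}}')

  echo "removed=$removed failed=$failed"
  (( failed == 0 ))
}

# Example: cleanup_wave_containers 1 ./logs
```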

---

## lib/docker-helpers.sh

**Purpose:** Shared utility functions for Docker operations.

**Functions:**

### parse_memory(string)
```bash
parse_memory "512m"    # Returns: 536870912 (bytes)
parse_memory "1g"      # Returns: 1073741824
parse_memory "100"     # Returns: 100 (no unit = bytes)
```

Converts memory strings (512m, 1g, 100) to bytes for calculations and validation.
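
For readers who want to see what such a conversion involves, here is one possible implementation. It is a sketch and not the helper's actual source; it assumes binary units (1k = 1024 bytes), accepts an optional trailing `b`, and rejects anything else, including decimal values.

```bash
#!/usr/bin/env bash

# Convert a memory string such as "512m", "1g", or "100" to bytes.
# Prints the byte count, or returns 1 with a message on stderr.
parse_memory() {
  local re='^([0-9]+)([kKmMgG]?)[bB]?$'
  if [[ $1 =~ $re ]]; then
    # 10# forces base 10 so that leading zeros are not read as octal.
    local value=$(( 10#${BASH_REMATCH[1]} ))
    case "${BASH_REMATCH[2]}" in
      k|K) echo $(( value * 1024 )) ;;
      m|M) echo $(( value * 1024 * 1024 )) ;;
      g|G) echo $(( value * 1024 * 1024 * 1024 )) ;;
      *)   echo "$value" ;;
    esac
  else
    echo "parse_memory: invalid memory string: $1" >&2
    return 1
  fi
}
```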

### get_container_status(container_id)
```bash
get_container_status "abc123def456"
# Output: "running" | "exited" | "failed"
```

Returns container status by checking `docker inspect` output.

### wait_for_containers(container_ids[], timeout)
```bash
declare -a CONTAINERS=("abc123" "def456")
wait_for_containers CONTAINERS[@] 1800

# Returns: 0 (all completed), 1 (some failed), 2 (timeout)
```

Blocks until all containers complete or timeout is reached.

### extract_exit_code(container_id)
```bash
extract_exit_code "abc123def456"
# Output: 0 | 1 | 124 (timeout; 124 is the exit status used by timeout(1))
```

Gets exit code from exited container via `docker inspect`.
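
When analysing exit codes, it can help to map the common values to a label. The mapping below is an illustrative assumption, not the skill's own categorisation, and `describe_exit_code` is a hypothetical name. The numeric conventions themselves are standard: 124 is used by `timeout(1)`, and codes above 128 mean the process was ended by signal number (code - 128).

```bash
#!/usr/bin/env bash

# Map a container exit code to a coarse label.
describe_exit_code() {
  case "$1" in
    0)   echo "success" ;;
    124) echo "timeout" ;;      # exit status used by timeout(1)
    137) echo "killed" ;;       # 128 + 9 (SIGKILL), e.g. OOM killer or docker kill
    143) echo "terminated" ;;   # 128 + 15 (SIGTERM), e.g. docker stop
    *)   echo "failed" ;;
  esac
}

# Example (not executed here):
#   code=$(docker inspect --format '{{.State.ExitCode}}' "$container_id")
#   describe_exit_code "$code"
```

For memory-constrained waves, `docker inspect --format '{{.State.OOMKilled}}' <container-id>` reports whether a container was stopped by the out-of-memory killer, which is a more direct signal than exit code 137 alone.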

### validate_docker_access()
```bash
if ! validate_docker_access; then
  echo "Docker not accessible"
  exit 1
fi
```

Checks Docker daemon accessibility and socket permissions.

### create_container_manifest(container_id, batch_id, tier)
```bash
create_container_manifest "abc123" "batch-1" 1
# Returns: JSON object with container metadata
```

Generates container metadata object for tracking.

### log_container(container_id, output_dir)
```bash
log_container "abc123def456" "/tmp/logs"
# Preserves container logs to /tmp/logs/abc123def456.log
```

Extracts and preserves container logs.

---

## Usage

### Basic Wave Execution

```bash
#!/bin/bash
set -euo pipefail

# 1. Generate batching plan and save it to a file, so spawn-wave.sh can
#    validate it and then parse it (a process substitution can be read only once)
./.claude/skills/cfn-error-batching-strategy/cli.sh \
  --command "npx tsc --noEmit" \
  --workspace "/workspace" \
  --budget "40g" \
  --format json > wave-plan.json

# 2. Spawn Wave 1
SPAWNED=$(./.claude/skills/cfn-docker-wave-execution/spawn-wave.sh \
  --wave-plan ./wave-plan.json \
  --wave-number 1 \
  --base-image my-agent:latest \
  --workspace /workspace \
  --output wave1-spawned.json)

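# 3. Monitor Wave 1
# monitor-wave.sh exits 1 (failures) or 2 (timeout); capture the status so
# that `set -e` does not abort the script before the results are checked
MONITOR_STATUS=0
RESULTS=$(./.claude/skills/cfn-docker-wave-execution/monitor-wave.sh \
  --containers ./wave1-spawned.json \
  --timeout 1800 \
  --output wave1-results.json) || MONITOR_STATUS=$?

# 4. Check results
FAILED=$(echo "$RESULTS" | jq '.metrics.failed')
if [[ $MONITOR_STATUS -ne 0 || $FAILED -gt 0 ]]; then
  echo "Wave 1 had $FAILED failures (monitor exit code $MONITOR_STATUS)"
  exit 1
fi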

# 5. Cleanup
./.claude/skills/cfn-docker-wave-execution/cleanup-wave.sh \
  --wave-number 1 \
  --preserve-failed-logs \
  --output wave1-cleanup.json

# 6. Process Wave 2 (if needed)
# ...
```

### Multi-Wave Orchestration

```bash
# Spawn all waves in sequence
for WAVE in 1 2 3; do
  echo "Processing Wave $WAVE..."

  SPAWNED=$(./.claude/skills/cfn-docker-wave-execution/spawn-wave.sh \
    --wave-plan ./batching-plan.json \
    --wave-number "$WAVE" \
    --output "wave$WAVE-spawned.json")

  RESULTS=$(./.claude/skills/cfn-docker-wave-execution/monitor-wave.sh \
    --containers "./wave$WAVE-spawned.json" \
    --timeout 1800 \
    --output "wave$WAVE-results.json")

  # Check for critical failures
  FAILED=$(echo "$RESULTS" | jq '.metrics.failed')
  if [[ $FAILED -gt 0 ]]; then
    echo "Wave $WAVE had failures, stopping iteration"
    break
  fi

  ./.claude/skills/cfn-docker-wave-execution/cleanup-wave.sh \
    --wave-number "$WAVE" \
    --preserve-failed-logs
done
```

### Integration with CFN Loop

```bash
# In orchestrate.sh or coordinator workflow
WAVE_NUM=1
SPAWNED_MANIFEST=$(./.claude/skills/cfn-docker-wave-execution/spawn-wave.sh \
  --wave-plan "$BATCHING_PLAN" \
  --wave-number "$WAVE_NUM" \
  --base-image "$AGENT_IMAGE" \
  --workspace /workspace \
  --output spawned-manifest.json)

EXECUTION_RESULTS=$(./.claude/skills/cfn-docker-wave-execution/monitor-wave.sh \
  --containers ./spawned-manifest.json \
  --timeout "$EXECUTION_TIMEOUT" \
  --preserve-logs)

# Process results for next iteration
FAILED_COUNT=$(echo "$EXECUTION_RESULTS" | jq '.metrics.failed')
COMPLETED_COUNT=$(echo "$EXECUTION_RESULTS" | jq '.metrics.success')

# Store for product owner review
echo "$EXECUTION_RESULTS" > iteration-"$WAVE_NUM"-results.json
```

---

## Configuration

### Environment Variables

```bash
# Docker configuration
CFN_DOCKER_IMAGE="claude-flow-novice:latest"
CFN_DOCKER_NETWORK="cfn-network"
CFN_DOCKER_WORKSPACE="/workspace"

# Spawning behavior
CFN_SPAWN_PARALLEL_LIMIT=5        # Max concurrent docker run commands
CFN_SPAWN_DRY_RUN=false            # Simulate without creating containers

# Monitoring behavior
CFN_MONITOR_TIMEOUT=1800           # 30 minutes default
CFN_MONITOR_POLL_INTERVAL=5        # Check every 5 seconds
CFN_MONITOR_PRESERVE_LOGS=false

# Cleanup behavior
CFN_CLEANUP_PRESERVE_FAILED=true   # Keep logs from failed containers
CFN_CLEANUP_DRY_RUN=false

# Logging
CFN_LOG_LEVEL="info"               # debug, info, warn, error
CFN_LOG_DIR=".artifacts/logs"
```

### Docker Network Setup

```bash
# Create cfn-network if it doesn't exist
docker network create cfn-network || true

# List available networks
docker network ls | grep cfn-network
```

### Memory Tier Mapping

Default tier-to-memory mappings (from batching strategy):

```json
{
  "tier_1": {"max_files": 1, "memory": "512m"},
  "tier_2": {"max_files": 3, "memory": "600m"},
  "tier_3": {"max_files": 8, "memory": "800m"},
  "tier_4": {"max_files": null, "memory": "1g"}
}
```

Custom mapping via environment:
```bash
export CFN_TIER_1_MEMORY="256m"
export CFN_TIER_2_MEMORY="512m"
export CFN_TIER_3_MEMORY="768m"
export CFN_TIER_4_MEMORY="2g"
```
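
The lookup with environment overrides could be implemented along these lines. The function name `tier_to_memory` is hypothetical; the defaults and the `CFN_TIER_<N>_MEMORY` variable names are taken from the mappings above.

```bash
#!/usr/bin/env bash

# Print the memory limit for a tier, preferring CFN_TIER_<N>_MEMORY
# when it is set and non-empty, otherwise the documented default.
tier_to_memory() {
  local tier="$1"
  local override_var="CFN_TIER_${tier}_MEMORY"
  local default
  case "$tier" in
    1) default="512m" ;;
    2) default="600m" ;;
    3) default="800m" ;;
    4) default="1g" ;;
    *) echo "tier_to_memory: unknown tier: $tier" >&2; return 1 ;;
  esac
  # ${!name} is Bash indirect expansion: the value of the variable
  # whose name is stored in override_var.
  echo "${!override_var:-$default}"
}

# Example: docker run --memory "$(tier_to_memory 2)" ...
```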

---

## Integration Patterns

### Pattern 1: Sequential Wave Execution

```bash
# Spawn all waves one at a time, waiting for completion
execute_all_waves() {
  local batching_plan="$1"
  local waves=$(jq -r '.waves | length' "$batching_plan")

  for ((wave = 1; wave <= waves; wave++)); do
    echo "[Wave $wave] Spawning containers..."
    spawn_wave "$batching_plan" "$wave"

    echo "[Wave $wave] Monitoring execution..."
    local results=$(monitor_wave "$wave")

    local failed=$(jq '.metrics.failed' <<<"$results")
    if [[ $failed -gt 0 ]]; then
      echo "[Wave $wave] FAILED: $failed containers exited with errors"
      return 1
    fi

    echo "[Wave $wave] Cleaning up..."
    cleanup_wave "$wave" --preserve-failed-logs
  done

  return 0
}
```

### Pattern 2: Wave Caching for Iterations

```bash
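# Preserve container logs between iterations for analysis
# (assumes $batching_plan is set by the caller)
execute_wave_with_caching() {
  local wave_num="$1"
  local iteration="$2"
  local cache_dir=".artifacts/wave-cache/$iteration"

  # Create the logs subdirectory too; the redirects below need it
  mkdir -p "$cache_dir/logs"

  # Spawn and monitor
  spawn_wave "$batching_plan" "$wave_num"
  local results
  results=$(monitor_wave "$wave_num")

  # Cache results and logs (this wave's containers only, stderr included)
  echo "$results" > "$cache_dir/wave-$wave_num-results.json"
  docker ps -a --filter "name=cfn-wave${wave_num}-" --format "{{.ID}}" | while read -r container; do
    docker logs "$container" > "$cache_dir/logs/$container.log" 2>&1
  done

  cleanup_wave "$wave_num" --preserve-all-logs --output-dir "$cache_dir/logs"

  # Return 0 on success, 1 if any container failed
  # (returning the raw count would wrap around above 255)
  local failed
  failed=$(jq '.metrics.failed' "$cache_dir/wave-$wave_num-results.json")
  [[ $failed -eq 0 ]]
}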
```

### Pattern 3: Fault Tolerance with Retry

```bash
# Retry the whole wave (all of its batches) up to max_retries times
execute_wave_with_retry() {
  local wave_num="$1"
  local max_retries=3
  local retry_count=0

  while [[ $retry_count -lt $max_retries ]]; do
    spawn_wave "$batching_plan" "$wave_num"
    local results=$(monitor_wave "$wave_num")
    local failed=$(jq '.metrics.failed' <<<"$results")

    if [[ $failed -eq 0 ]]; then
      echo "Wave $wave_num completed successfully"
      cleanup_wave "$wave_num"
      return 0
    fi

    echo "Wave $wave_num had $failed failures, retrying..."
    cleanup_wave "$wave_num" --preserve-failed-logs

    retry_count=$((retry_count + 1))
  done

  echo "Wave $wave_num failed after $max_retries retries"
  return 1
}
```

---

## Error Handling

### Docker Daemon Errors

**Error:** "Cannot connect to Docker daemon"

**Diagnosis:**
```bash
# Check if Docker is running
docker version

# Check socket permissions
ls -la /var/run/docker.sock

# Check Docker group membership
groups $USER | grep docker
```

**Solution:**
- Start Docker: `sudo systemctl start docker`
- Add user to docker group: `sudo usermod -aG docker $USER`
- Re-login to apply group changes

### Memory Limit Errors

**Error:** "docker: Error response from daemon: ... memory is too large"

**Diagnosis:**
```bash
# Check host available memory
free -h

# Check Docker memory settings
docker info | grep "Total Memory"

# Check memory assigned to containers
docker stats --no-stream   # one snapshot instead of a live stream
```

**Solution:**
- Reduce memory per container via tier configuration
- Increase Docker memory allocation
- Reduce parallelism (spawn fewer concurrent containers)
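
A pre-flight check can catch this before any container is started. The sketch below sums the limits of a wave and compares them with the memory the host reports as available. All three function names are hypothetical, only `m` and `g` suffixes with whole numbers are handled, and `available_mib` assumes a Linux host with `/proc/meminfo`.

```bash
#!/usr/bin/env bash

# Sum memory limits given in Docker notation ("512m", "1g"); prints MiB.
sum_limits_mib() {
  local total=0 limit
  for limit in "$@"; do
    case "$limit" in
      *g|*G) total=$(( total + ${limit%[gG]} * 1024 )) ;;
      *m|*M) total=$(( total + ${limit%[mM]} )) ;;
      *) echo "sum_limits_mib: unsupported unit: $limit" >&2; return 1 ;;
    esac
  done
  echo "$total"
}

# Print the available host memory in MiB (Linux only).
available_mib() {
  awk '/^MemAvailable:/ { print int($2 / 1024) }' /proc/meminfo
}

# Usage: wave_fits BUDGET_MIB LIMIT...
# Returns 0 if the limits fit into the budget, 1 otherwise.
wave_fits() {
  local budget_mib="$1"; shift
  local total
  total=$(sum_limits_mib "$@") || return 1
  (( total <= budget_mib ))
}

# Example (not executed here):
#   wave_fits "$(available_mib)" 512m 600m 1g || echo "wave exceeds available memory"
```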

### Network Errors

**Error:** "docker: Error response from daemon: network ... not found"

**Diagnosis:**
```bash
# List available networks
docker network ls

# Check cfn-network existence
docker network inspect cfn-network
```

**Solution:**
```bash
# Create network if missing
docker network create cfn-network

# Verify network created
docker network ls | grep cfn-network
```

### Image Errors

**Error:** "docker: Error response from daemon: image ... not found"

**Diagnosis:**
```bash
# List available images
docker images

# Check specific image
docker images | grep "claude-flow-novice"
```

**Solution:**
```bash
# Pull missing image
docker pull claude-flow-novice:latest

# Or build locally
docker build -t claude-flow-novice:latest .
```

---

## Performance

### Benchmarks

**Test Setup:** 28 containers per wave, 512MB-1GB memory limits, 5-second poll interval

| Metric | Value | Notes |
|--------|-------|-------|
| Spawn time (28 containers) | 2.3s | Up to 5 concurrent spawns (default `--parallel`) |
| Monitor time (all complete) | 287s | 4m 47s wall time |
| Poll overhead per interval | 0.8s | docker ps + docker inspect |
| Cleanup time (28 containers) | 1.2s | Parallel removal |
| **Total wave execution** | ~290s | Per wave (5m per wave typical) |

### Scalability

| Containers | Memory/Container | Total Memory | Spawn Time | Monitor Time | Notes |
|------------|-----------------|--------------|-----------|------------|-------|
| 10 | 512m | 5GB | 0.9s | 120s | Small wave |
| 28 | 600m avg | 15GB | 2.3s | 287s | Typical wave |
| 50 | 700m avg | 35GB | 4.1s | 450s | Large wave |
| 100 | 500m avg | 50GB | 8.2s | 600s | Very large wave |

### Memory Optimization

- Default tier limits prevent host memory exhaustion
- Wave-based execution allows garbage collection between waves
- Log preservation only for failed containers (optional)
- Unused volumes cleaned up automatically

---

## Troubleshooting

### Issue: Containers not spawning

**Symptoms:**
- spawn-wave.sh returns 0 but `total_spawned` is 0 in the output manifest
- No containers appear in `docker ps`

**Diagnosis:**
```bash
# Run with verbose output
./spawn-wave.sh --wave-plan waves.json --wave-number 1 --verbose

# Check Docker errors
docker events --filter "type=container" &  # Monitor in background
./spawn-wave.sh ...  # Re-run
```

**Solutions:**
- Check wave-plan JSON validity: `jq . waves.json`
- Verify image exists: `docker images | grep claude-flow-novice`
- Check Docker daemon: `docker ps` should work
- Check available disk space: `df -h`

### Issue: Containers timeout during monitoring

**Symptoms:**
- monitor-wave.sh returns exit code 2
- Containers marked as "timeout" instead of "exited"

**Diagnosis:**
```bash
# Check container logs
docker logs <container-id>

# Check if container is actually running
docker ps | grep <container-id>

# Monitor resource usage
docker stats <container-id>
```

**Solutions:**
- Increase timeout: `--timeout 3600` (1 hour)
- Check container image for infinite loops
- Verify agent code doesn't have unintended waits
- Increase the container's memory limit if it is swapping, for example by raising its tier mapping: `export CFN_TIER_4_MEMORY="2g"`

### Issue: Cleanup fails with "device or resource busy"

**Symptoms:**
- cleanup-wave.sh returns exit code 1
- "device or resource busy" errors in output

**Diagnosis:**
```bash
# Check if containers are still running
docker ps | grep <pattern>

# Check if volumes are in use
docker volume ls | grep <pattern>

# Check system open files
lsof | grep docker
```

**Solutions:**
- Wait longer before cleanup: `sleep 10 && cleanup-wave.sh`
- Force container removal: `docker rm -f <container-id>`
- Stop dependent containers first
- Restart Docker daemon: `sudo systemctl restart docker`

---

## Success Criteria

### Functional Requirements

- Wave plan JSON parsing and validation
- Container spawning with correct memory limits
- Status monitoring with polling mechanism
- Exit code collection and categorization
- Timeout detection and handling
- Container log preservation
- Safe cleanup with resource tracking

### Quality Requirements

- Bash strict mode (set -euo pipefail)
- Comprehensive error handling for Docker API
- Validation of all inputs (memory strings, JSON, patterns)
- Clear exit codes (0, 1, 2)
- Detailed logging with timestamps

### Performance Requirements

- Spawn 28+ containers in <5 seconds
- Poll overhead <2% of monitoring time
- Complete cleanup in <10 seconds
- Scale to 100+ containers without degradation

---

**Version:** 1.0.0
**Last Updated:** 2025-11-14
**Status:** Production Ready