SkillHub ClubResearch & OpsFull Stack

autonomous-operations

Autonomous homelab operations using OODA loop (Observe, Orient, Decide, Act) - use when reviewing system state, planning autonomous actions, or investigating operational issues

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars

Hot score

Updated

March 20, 2026

Overall rating

C3.5

Composite score

3.5

Best-practice grade

B75.9

Install command

npx @skill-hub/cli install vonrobak-fedora-homelab-containers-autonomous-operations

Repository

vonrobak/fedora-homelab-containers

Skill path: .claude/skills/autonomous-operations

Autonomous homelab operations using OODA loop (Observe, Orient, Decide, Act) - use when reviewing system state, planning autonomous actions, or investigating operational issues

Open repository

Best for

Primary workflow: Research & Ops.

Technical facets: Full Stack.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: vonrobak.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

Install autonomous-operations into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
Review https://github.com/vonrobak/fedora-homelab-containers before adding autonomous-operations to shared team environments
Use autonomous-operations for development workflows

Works across

Claude CodeCodex CLIGemini CLIOpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: autonomous-operations
description: Autonomous homelab operations using OODA loop (Observe, Orient, Decide, Act) - use when reviewing system state, planning autonomous actions, or investigating operational issues
---

# Autonomous Operations Engine

## Overview

The autonomous operations engine implements an OODA (Observe, Orient, Decide, Act) loop that orchestrates all homelab components into a self-managing system. It leverages existing infrastructure:

- **Context Framework** - Historical patterns and preferences
- **Auto-Remediation** - Playbook-based fixes
- **Predictive Analytics** - Resource exhaustion forecasting
- **Backup Integration** - Snapshot-protected operations
- **Homelab Intelligence** - Health scoring and diagnostics

**Philosophy:** Proactive over reactive. Predict and prevent rather than detect and fix.

## When to Use

**Always use for:**
- Reviewing current autonomous operations state
- Understanding what actions the system would take
- Investigating why an action was/wasn't taken
- Adjusting autonomy settings
- Querying the decision log

**Triggers:**
- User asks "what would the system do about..."
- User asks about autonomous actions taken
- User wants to review pending actions
- User asks to adjust autonomy level

## Architecture: OODA Loop

```
┌─────────────────────────────────────────────────────────────────────┐
│                           OBSERVE                                   │
│  • Predictive analytics (resource exhaustion forecasts)             │
│  • Health scoring (homelab-intel.sh)                                │
│  • Drift detection (check-drift.sh)                                 │
│  • Backup health verification                                       │
└────────────────────────────┬────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                           ORIENT                                    │
│  • Historical patterns (issue-history.json)                         │
│  • User preferences (preferences.yml)                               │
│  • Service overrides (traefik: no auto-restart)                     │
│  • Current state (autonomous-state.json)                            │
└────────────────────────────┬────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                           DECIDE                                    │
│  Confidence = (prediction × 0.30) + (historical × 0.30)             │
│             + (impact × 0.20) + (rollback × 0.20)                   │
│                                                                     │
│  ┌────────────────────┬──────────────────────────────────┐          │
│  │ Confidence + Risk  │ Action                           │          │
│  ├────────────────────┼──────────────────────────────────┤          │
│  │ >90% + Low Risk    │ AUTO-EXECUTE                     │          │
│  │ >80% + Med Risk    │ NOTIFY + EXECUTE                 │          │
│  │ >70% + Any Risk    │ QUEUE (for approval)             │          │
│  │ <70%               │ ALERT ONLY                       │          │
│  └────────────────────┴──────────────────────────────────┘          │
└────────────────────────────┬────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                            ACT                                      │
│  1. Create pre-action snapshot (BTRFS)                              │
│  2. Execute via appropriate skill/playbook                          │
│  3. Validate outcome (health check)                                 │
│  4. Auto-rollback if validation fails                               │
│  5. Log decision and outcome                                        │
└─────────────────────────────────────────────────────────────────────┘
```

## Core Scripts

### autonomous-check.sh

Evaluates current system state and outputs recommended actions:

```bash
# Run assessment (outputs JSON)
~/containers/scripts/autonomous-check.sh

# Verbose output
~/containers/scripts/autonomous-check.sh --verbose

# Output to file
~/containers/scripts/autonomous-check.sh --output /tmp/assessment.json
```

**Output structure:**
```json
{
  "timestamp": "2025-11-29T22:00:00+01:00",
  "health_score": 94,
  "observations": [...],
  "recommended_actions": [
    {
      "id": "action-001",
      "type": "disk-cleanup",
      "reason": "Disk predicted 90% full in 8 days",
      "confidence": 0.92,
      "risk": "low",
      "decision": "auto-execute"
    }
  ],
  "pending_queue": []
}
```

### autonomous-execute.sh

Executes approved actions from the queue:

```bash
# Execute pending low-risk actions
~/containers/scripts/autonomous-execute.sh

# Execute specific action
~/containers/scripts/autonomous-execute.sh --action-id action-001

# Dry run (simulate only)
~/containers/scripts/autonomous-execute.sh --dry-run
```

### Emergency Controls

```bash
# Emergency stop all autonomous operations
~/containers/scripts/autonomous-execute.sh --stop

# Pause (stop acting, keep observing)
~/containers/scripts/autonomous-execute.sh --pause

# Resume operations
~/containers/scripts/autonomous-execute.sh --resume

# View current state
~/containers/scripts/autonomous-execute.sh --status
```

## State Files

### autonomous-state.json

Location: `~/.claude/context/autonomous-state.json`

Tracks operational state:
```json
{
  "enabled": true,
  "mode": "active",
  "paused": false,
  "circuit_breaker": {
    "triggered": false,
    "consecutive_failures": 0,
    "last_failure": null
  },
  "last_check": "2025-11-29T22:00:00+01:00",
  "pending_actions": [],
  "cooldowns": {
    "jellyfin.restart": "2025-11-29T20:00:00+01:00"
  },
  "statistics": {
    "actions_24h": 2,
    "actions_7d": 8,
    "success_rate": 0.95
  }
}
```

### decision-log.json

Location: `~/.claude/context/decision-log.json`

Audit trail of all decisions:
```json
{
  "decisions": [
    {
      "id": "decision-001",
      "timestamp": "2025-11-29T03:00:00+01:00",
      "action_type": "disk-cleanup",
      "trigger": "prediction: disk 90% in 9 days",
      "confidence": 0.92,
      "risk": "low",
      "decision": "auto-execute",
      "outcome": "success",
      "details": "Freed 15 GB, new prediction: 21 days",
      "duration_seconds": 45
    }
  ]
}
```

## Integration Points

### With Existing Playbooks

The engine routes actions to existing remediation playbooks:

| Action Type | Playbook | Risk |
|-------------|----------|------|
| disk-cleanup | `.claude/remediation/playbooks/disk-cleanup.yml` | Low |
| service-restart | `.claude/remediation/playbooks/service-restart.yml` | Low |
| drift-reconciliation | `.claude/remediation/playbooks/drift-reconciliation.yml` | Medium |
| resource-pressure | `.claude/remediation/playbooks/resource-pressure.yml` | Medium |

### With Preferences

Respects all settings in `~/.claude/context/preferences.yml`:

- `risk_tolerance` - Affects confidence thresholds
- `service_overrides` - Per-service restrictions (traefik: no auto-restart)
- `safety.max_restarts_per_hour` - Rate limiting
- `safety.restart_cooldown_seconds` - Cooldown periods

### With Backups

Every action is protected:

1. Pre-action snapshot created via `btrfs-snapshot-backup.sh`
2. Snapshot tagged: `autonomous-<action>-<timestamp>`
3. Auto-rollback if validation fails
4. Snapshot retained for 7 days minimum

## Decision Matrix

### Risk Classification

| Action | Data Risk | Availability Risk | Overall |
|--------|-----------|-------------------|---------|
| Disk cleanup | None (snapshot) | None | **LOW** |
| Service restart | None (stateless) | 2-5s downtime | **LOW** |
| Config reconciliation | None (snapshot) | Potential failure | **MEDIUM** |
| Multi-service change | Medium | Cascading risk | **HIGH** |

### Confidence Calculation

```
confidence = (
    prediction_confidence × 0.30 +    # How sure about the problem?
    historical_success × 0.30 +       # Has this fix worked before?
    impact_certainty × 0.20 +         # Are we sure about outcome?
    rollback_feasibility × 0.20       # Can we undo if it fails?
)
```

**Example: Disk cleanup**
```
prediction_confidence = 0.89  # Linear trend, high confidence
historical_success = 1.00     # 12/12 successful cleanups
impact_certainty = 0.95       # Known outcome
rollback_feasibility = 1.00   # Instant snapshot restore

confidence = 0.89×0.3 + 1.0×0.3 + 0.95×0.2 + 1.0×0.2
           = 0.267 + 0.300 + 0.190 + 0.200
           = 0.957 (95.7%)

Decision: AUTO-EXECUTE (>90% + low risk)
```

## Safety Controls

### Circuit Breaker

Automatically pauses after 3 consecutive failures:

```json
{
  "circuit_breaker": {
    "threshold": 3,
    "triggered": true,
    "consecutive_failures": 3,
    "last_failure": "2025-11-29T04:00:00+01:00",
    "reason": "service-restart failed 3 times"
  }
}
```

**Recovery:** Manual reset via `--resume` or automatically after 24 hours.

### Cooldowns

Prevent action storms:

- Service restart: 5 minute cooldown between restarts
- Disk cleanup: 1 hour cooldown
- Drift reconciliation: 15 minute cooldown

### Service Overrides

Critical services have special handling:

```yaml
# From preferences.yml
service_overrides:
  traefik:
    auto_restart: false         # Never auto-restart
    requires_confirmation: true
  authelia:
    auto_restart: false         # SSO is critical
    requires_confirmation: true
  prometheus:
    auto_restart: true          # Can auto-restart
    restart_timeout_seconds: 90
```

## Verification Feedback Loop

**NEW: Actions are verified and confidence scores are updated based on outcomes.**

This creates a learning system - successful actions increase autonomy, failed actions increase caution.

### Enhanced ACT Phase

The ACT phase now includes verification and confidence learning:

```
┌─────────────────────────────────────────────────────────────────────┐
│                       ACT (Enhanced)                                │
│                                                                     │
│  1. Create pre-action snapshot (BTRFS)                              │
│  2. Capture before-state (homelab-intel.sh --json)                  │
│  3. Execute via appropriate skill/playbook                          │
│  4. Wait for stabilization (10-30s depending on action)             │
│                                                                     │
│  5. VERIFY OUTCOME (NEW)                                            │
│     ├─ Run verify-autonomous-outcome.sh                             │
│     ├─ Compare before/after state                                   │
│     ├─ For services: run service-validator                          │
│     └─ Calculate verification confidence                            │
│                                                                     │
│  6. UPDATE CONFIDENCE (NEW)                                         │
│     ├─ Verified success: confidence +5%                             │
│     ├─ Warnings: confidence +2%                                     │
│     ├─ Failed verification: confidence -10%                         │
│     └─ Track in decision log                                        │
│                                                                     │
│  7. Auto-rollback if verification fails                             │
│  8. Log decision with verification details                          │
│  9. Update circuit breaker state                                    │
└─────────────────────────────────────────────────────────────────────┘
```

### Verification by Action Type

**disk-cleanup:**
```bash
# Verify disk space actually freed (>3% improvement)
~/containers/scripts/verify-autonomous-outcome.sh \
  disk-cleanup \
  /tmp/before-state.json \
  30  # wait 30s

# Success: Freed 12% → confidence +5%
# Minimal: Freed 1% → confidence +2% (warning)
# Failed: No space freed → confidence -10%, rollback
```

**service-restart:**
```bash
# Verify service healthy after restart
~/containers/scripts/verify-autonomous-outcome.sh \
  service-restart \
  /tmp/before-state-jellyfin.json \
  30

# Also run service-validator for comprehensive checks
~/.claude/skills/homelab-deployment/scripts/verify-deployment.sh \
  jellyfin \
  https://jellyfin.patriark.org \
  true

# Success: Service up + all checks pass → confidence +5%
# Warning: Service up but warnings → confidence +2%
# Failed: Service crashed or checks fail → confidence -10%, rollback
```

**drift-reconciliation:**
```bash
# Verify drift actually resolved
~/containers/scripts/verify-autonomous-outcome.sh \
  drift-reconciliation \
  /tmp/before-state.json \
  30

# Success: Drift status = MATCH → confidence +5%
# Failed: Drift still present → confidence -10%, rollback
```

**resource-pressure:**
```bash
# Verify memory/CPU pressure relieved
~/containers/scripts/verify-autonomous-outcome.sh \
  resource-pressure \
  /tmp/before-state.json \
  30

# Success: Pressure reduced >5% → confidence +5%
# Minimal: Pressure reduced <5% → confidence +2%
# Failed: Pressure not reduced → confidence -10%
```

### Confidence Learning

Confidence scores are tracked per action type and updated based on verification outcomes:

**Before verification (historical only):**
```json
{
  "action_type": "disk-cleanup",
  "base_confidence": 0.92,
  "historical_success_rate": 12/12,
  "recent_executions": []
}
```

**After 3 verified successes:**
```json
{
  "action_type": "disk-cleanup",
  "base_confidence": 0.97,  // +5% after each success
  "historical_success_rate": 15/15,
  "recent_executions": [
    {"date": "2026-01-01", "verified": true, "delta": +5},
    {"date": "2026-01-02", "verified": true, "delta": +5},
    {"date": "2026-01-03", "verified": true, "delta": +5}
  ],
  "trend": "increasing"
}
```

**After 1 failed verification:**
```json
{
  "action_type": "service-restart",
  "base_confidence": 0.79,  // -10% after failure
  "historical_success_rate": 14/15,
  "recent_executions": [
    {"date": "2026-01-01", "verified": true, "delta": +5},
    {"date": "2026-01-02", "verified": true, "delta": +5},
    {"date": "2026-01-03", "verified": false, "delta": -10}
  ],
  "trend": "stable (recent failure)"
}
```

### State File Updates

**autonomous-state.json** now includes verification tracking:

```json
{
  "enabled": true,
  "last_check": "2026-01-04T06:30:00+01:00",
  "verification": {
    "enabled": true,
    "last_verification": "2026-01-04T06:35:00+01:00",
    "verification_pass_rate_7d": 0.95,
    "failures_requiring_rollback": 0
  },
  "confidence_trends": {
    "disk-cleanup": {
      "current": 0.97,
      "trend": "increasing",
      "last_updated": "2026-01-03T06:30:00+01:00"
    },
    "service-restart": {
      "current": 0.89,
      "trend": "stable",
      "last_updated": "2026-01-02T06:30:00+01:00"
    },
    "drift-reconciliation": {
      "current": 0.82,
      "trend": "stable",
      "last_updated": "2026-01-01T06:30:00+01:00"
    }
  },
  "circuit_breaker": {
    "threshold": 3,
    "triggered": false,
    "consecutive_failures": 0
  }
}
```

**decision-log.json** now includes verification results:

```json
{
  "decisions": [
    {
      "id": "decision-042",
      "timestamp": "2026-01-04T06:30:00+01:00",
      "action_type": "service-restart",
      "service": "jellyfin",
      "trigger": "health check failing",
      "confidence": 0.87,
      "risk": "low",
      "decision": "auto-execute",
      "outcome": "success",
      "verification": {
        "status": "VERIFIED",
        "confidence": 95,
        "checks_passed": 6,
        "checks_warned": 1,
        "checks_failed": 0,
        "report_path": "/tmp/verification-jellyfin-1704351000.txt",
        "verification_time_seconds": 24
      },
      "confidence_delta": 5,
      "new_confidence": 0.92,
      "details": "Service restarted successfully, all verification checks passed",
      "duration_seconds": 45
    }
  ]
}
```

### Benefits of Verification Feedback

**1. Learning from Experience**
- Actions that consistently verify successfully → higher confidence → more autonomy
- Actions that fail verification → lower confidence → more caution/user approval

**2. Earlier Problem Detection**
- Verification catches issues immediately after action
- Auto-rollback prevents prolonged outages
- User sees verification reports in decision log

**3. Improved Decision Making**
- Confidence scores reflect actual success rates (not just historical)
- Recent failures appropriately reduce autonomy
- Recent successes appropriately increase autonomy

**4. Auditable Outcomes**
- Every action has verification report
- Can review why action succeeded/failed
- Track verification pass rates over time

### Example: Confidence Evolution

**Week 1: Initial deployment**
```
disk-cleanup confidence: 0.85 (based on historical data)
→ Execute action
→ Verify: 15% disk freed ✓
→ New confidence: 0.90 (+5%)
```

**Week 2: Continued success**
```
disk-cleanup confidence: 0.90
→ Execute action
→ Verify: 12% disk freed ✓
→ New confidence: 0.95 (+5%)
```

**Week 3: First failure**
```
disk-cleanup confidence: 0.95
→ Execute action
→ Verify: 0% disk freed ✗ (logs already rotated elsewhere)
→ Rollback triggered
→ New confidence: 0.85 (-10%)
```

**Week 4: Recovery**
```
disk-cleanup confidence: 0.85
→ Execute action
→ Verify: 8% disk freed ✓
→ New confidence: 0.90 (+5%)
```

**Result:** System learned that disk-cleanup success rate is ~75%, not 100%. Confidence stabilizes at ~0.90, which is more accurate than initial 0.95 assumption.

## Systemd Integration

### Timer: autonomous-check.timer

Runs daily assessment at 06:00 (after backups):

```ini
[Timer]
OnCalendar=*-*-* 06:00:00
RandomizedDelaySec=300
Persistent=true
```

### Service: autonomous-execute.service

Executes approved actions after check completes.

## Query Interface

### Via Claude Code

```
"What autonomous actions were taken this week?"
"Show me pending actions"
"Why didn't the system restart Jellyfin?"
"What's the current autonomous operations status?"
```

### Via Scripts

```bash
# Query decision log
~/containers/.claude/context/scripts/query-decisions.sh --last 7d

# Query pending actions
~/containers/.claude/context/scripts/query-decisions.sh --pending

# Query by action type
~/containers/.claude/context/scripts/query-decisions.sh --type disk-cleanup
```

## Reporting

### Weekly Intelligence Report Integration

The weekly report includes autonomous operations section:

```
## Autonomous Operations (Last 7 Days)

Actions Taken: 8
  - disk-cleanup: 2 (100% success)
  - service-restart: 5 (100% success)
  - drift-reconciliation: 1 (100% success)

Success Rate: 100%
Average Confidence: 91%

Pending Actions: 0
Circuit Breaker: Not triggered
```

### Discord Notifications

- Action execution: Notify on execute (configurable)
- Failures: Always notify
- Weekly summary: Included in intelligence report

## Troubleshooting

### Check Current State

```bash
# View state file
cat ~/.claude/context/autonomous-state.json | jq '.'

# View recent decisions
cat ~/.claude/context/decision-log.json | jq '.decisions[-5:]'

# Check if paused
cat ~/.claude/context/autonomous-state.json | jq '.paused'
```

### Circuit Breaker Triggered

```bash
# Check why
cat ~/.claude/context/autonomous-state.json | jq '.circuit_breaker'

# View failing action
cat ~/.claude/context/decision-log.json | jq '.decisions | map(select(.outcome == "failure")) | .[-3:]'

# Reset manually
~/containers/scripts/autonomous-execute.sh --resume
```

### Action Not Executing

Check in order:
1. Is autonomous operations enabled? (`enabled: true`)
2. Is it paused? (`paused: false`)
3. Is circuit breaker triggered?
4. Is the service in overrides with `auto_restart: false`?
5. Is there a cooldown active?
6. Is confidence below threshold?

## Success Metrics

- **Proactive ratio:** 80%+ issues prevented before impact
- **Success rate:** >95% of autonomous actions succeed
- **MTTR reduction:** Minutes instead of hours
- **Manual interventions:** <10% of issues require human action

---

**This skill orchestrates all homelab components into a self-managing system.**