
operating-production-services

SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response. Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing on-call practices. NOT for initial service development (use scaffolding skills instead).

Packaged view

This page reorganizes the original catalog entry to put fit, installability, and workflow context first. The original raw source appears below.

Stars: 22
Hot score: 88
Updated: March 20, 2026
Overall rating: C (1.8)
Composite score: 1.8
Best-practice grade: A (88.4)

Install command

npx @skill-hub/cli install mjunaidca-mjs-agent-skills-operating-production-services

Repository

mjunaidca/mjs-agent-skills

Skill path: .claude/skills/operating-production-services

Open repository

Best for

Primary workflow: Write Technical Docs.

Technical facets: Full Stack, Tech Writer.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: mjunaidca.

This is a mirrored public skill entry; review the repository before installing it into production workflows.

What it helps with

  • Install operating-production-services into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/mjunaidca/mjs-agent-skills before adding operating-production-services to shared team environments
  • Use operating-production-services for reliability and incident-response workflows (not initial service development)

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: operating-production-services
description: |
  SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response.
  Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing
  on-call practices. NOT for initial service development (use scaffolding skills instead).
---

# Operating Production Services

Production reliability patterns: measure what matters, learn from failures, improve systematically.

## Quick Reference

| Need | Go To |
|------|-------|
| Define reliability targets | [SLOs & Error Budgets](#slos--error-budgets) |
| Write incident report | [Postmortem Templates](#postmortem-templates) |
| Set up SLO alerting | [references/slo-alerting.md](references/slo-alerting.md) |

---

## SLOs & Error Budgets

### The Hierarchy

```
SLA (Contract) → SLO (Target) → SLI (Measurement)
```

### Common SLIs

```promql
# Availability: successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))

# Latency: requests below threshold / total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
```

### SLO Targets Reality Check

| SLO % | Downtime/Month | Downtime/Year |
|-------|----------------|---------------|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43 minutes | 8.76 hours |
| 99.95% | 22 minutes | 4.38 hours |
| 99.99% | 4.3 minutes | 52 minutes |

**Don't aim for 100%.** Each nine costs exponentially more.
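The downtime figures follow directly from the target: allowed downtime = (1 − SLO) × window. A quick sketch (30-day month assumed, as in the table; helper name is illustrative):

```python
def downtime_allowed_min(slo_pct: float, window_min: float) -> float:
    """Minutes of downtime an SLO permits over a window."""
    return (1 - slo_pct / 100) * window_min

MONTH_MIN = 30 * 24 * 60  # 43,200 minutes in a 30-day month
for slo in (99.0, 99.9, 99.95, 99.99):
    print(f"{slo}% -> {downtime_allowed_min(slo, MONTH_MIN):.1f} min/month")
# 99.9% -> 43.2 min/month, matching the table's 43 minutes
```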

### Error Budget

```
Error Budget = 1 - SLO Target
```

**Example:** 99.9% SLO = 0.1% error budget = 43 minutes/month

**Policy:**
| Budget Remaining | Action |
|------------------|--------|
| > 50% | Normal velocity |
| 10-50% | Postpone risky changes |
| < 10% | Freeze non-critical changes |
| 0% | Feature freeze, fix reliability |
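The policy table translates mechanically into a release gate; a hypothetical sketch using the thresholds above:

```python
def release_policy(budget_remaining_pct: float) -> str:
    """Map remaining error budget (%) to the policy table's action."""
    if budget_remaining_pct > 50:
        return "Normal velocity"
    if budget_remaining_pct >= 10:
        return "Postpone risky changes"
    if budget_remaining_pct > 0:
        return "Freeze non-critical changes"
    return "Feature freeze, fix reliability"

print(release_policy(35))  # Postpone risky changes
```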

See [references/slo-alerting.md](references/slo-alerting.md) for Prometheus recording rules and multi-window burn rate alerts.
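The burn-rate arithmetic behind those alerts is simple: a burn rate of x consumes the budget x times faster than the window allows, so a 28-day (672-hour) window empties in 672/x hours. A sketch of the numbers used in the reference file:

```python
WINDOW_HOURS = 28 * 24  # 672-hour SLO window

def budget_per_hour(burn_rate: float) -> float:
    """Fraction of error budget consumed each hour."""
    return burn_rate / WINDOW_HOURS

def hours_to_exhaust(burn_rate: float) -> float:
    """Hours until the budget is gone at a constant burn rate."""
    return WINDOW_HOURS / burn_rate

print(f"{hours_to_exhaust(14.4):.1f} h")  # 46.7 h: the "fast burn" ~2 days
```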

---

## Postmortem Templates

### The Blameless Principle

| Blame-Focused | Blameless |
|---------------|-----------|
| "Who caused this?" | "What conditions allowed this?" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |

### When to Write Postmortems

- SEV1/SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes

### Standard Template

```markdown
# Postmortem: [Incident Title]

**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEVX

## Executive Summary
One paragraph: what happened, impact, root cause, resolution.

## Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | First alert fired |
| HH:MM | On-call acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service recovered |

## Root Cause Analysis

### 5 Whys
1. Why did service fail? → [Answer]
2. Why did [1] happen? → [Answer]
3. Why did [2] happen? → [Answer]
4. Why did [3] happen? → [Answer]
5. Why did [4] happen? → [Root cause]

## Impact
- Customers affected: X
- Duration: X minutes
- Revenue impact: $X
- Support tickets: X

## Action Items
| Priority | Action | Owner | Due | Ticket |
|----------|--------|-------|-----|--------|
| P0 | [Immediate fix] | @name | Date | XXX-123 |
| P1 | [Prevent recurrence] | @name | Date | XXX-124 |
| P2 | [Improve detection] | @name | Date | XXX-125 |
```

### Quick Template (Minor Incidents)

```markdown
# Quick Postmortem: [Title]

**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEV3

## What Happened
One sentence description.

## Timeline
- HH:MM - Trigger
- HH:MM - Detection
- HH:MM - Resolution

## Root Cause
One sentence.

## Fix
- Immediate: [What was done]
- Long-term: [Ticket XXX-123]
```

---

## Postmortem Meeting Guide

### Structure (60 min)

1. **Opening (5 min)** - Remind: "We're here to learn, not blame"
2. **Timeline (15 min)** - Walk through events chronologically
3. **Analysis (20 min)** - What failed? Why? What allowed it?
4. **Action Items (15 min)** - Prioritize, assign owners, set dates
5. **Closing (5 min)** - Summarize learnings, confirm owners

### Facilitation Tips

- Redirect blame to systems: "What made this mistake possible?"
- Time-box tangents
- Document dissenting views
- Encourage quiet participants

---

## Anti-Patterns

| Don't | Do Instead |
|-------|------------|
| Aim for 100% SLO | Accept error budget exists |
| Skip small incidents | Small incidents reveal patterns |
| Orphan action items | Every item needs owner + date + ticket |
| Blame individuals | Ask "what conditions allowed this?" |
| Create busywork actions | Actions should prevent recurrence |

---

## Verification

Run: `python scripts/verify.py`

## References

- [references/slo-alerting.md](references/slo-alerting.md) - Prometheus rules, burn rate alerts, Grafana dashboards


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### references/slo-alerting.md

````markdown
# SLO Alerting Patterns

## Prometheus Recording Rules

```yaml
groups:
  - name: sli_recording
    interval: 30s
    rules:
      # Availability SLI (28-day window)
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))

      # Latency SLI (requests < 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))

  - name: slo_recording
    interval: 5m
    rules:
      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100

      # Burn rates for different windows
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_30m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_1h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_6h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          )) / (1 - 0.999)
```

## Multi-Window Burn Rate Alerts

Why multi-window? Single-window alerts are either too noisy (short window) or too slow (long window). Combining windows reduces false positives.

```yaml
groups:
  - name: slo_alerts
    rules:
      # Fast burn: 14.4x rate over 1 hour
      # Consumes 2% error budget in 1 hour
      - alert: SLOErrorBudgetBurnFast
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          slo:http_availability:burn_rate_5m > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn - {{ $value | printf \"%.1f\" }}x rate"
          description: "At current rate, error budget exhausted in {{ printf \"%.1f\" (div 100 $value) }} hours"

      # Slow burn: 6x rate over 6 hours
      # Consumes 5% error budget in 6 hours
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          slo:http_availability:burn_rate_6h > 6
          and
          slo:http_availability:burn_rate_30m > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn - {{ $value | printf \"%.1f\" }}x rate"

      # Budget exhausted
      - alert: SLOErrorBudgetExhausted
        expr: slo:http_availability:error_budget_remaining < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget exhausted"
          description: "Error budget: {{ $value | printf \"%.1f\" }}%"
```

## Burn Rate Reference

| Burn Rate | Budget Consumed/Hour | Time to Exhaust |
|-----------|---------------------|-----------------|
| 1x | 0.14% | 28 days |
| 6x | 0.86% | ~5 days |
| 14.4x | 2% | ~2 days |
| 36x | 5% | ~19 hours |

## Grafana Dashboard

```
┌────────────────────────────────────┐
│ SLO Status                          │
│ Current: 99.95% | Target: 99.9%    │
│ Status: ✓ Meeting SLO              │
├────────────────────────────────────┤
│ Error Budget                        │
│ Remaining: 65%                      │
│ ████████████░░░░░░░░ 65%           │
│ ~18 days at current burn rate      │
├────────────────────────────────────┤
│ Burn Rate (by window)              │
│ 5m:  1.2x  ░                       │
│ 1h:  0.8x  ░                       │
│ 6h:  0.5x  ░                       │
│ 28d: 0.3x  ░                       │
└────────────────────────────────────┘
```

### Key Queries

```promql
# Current SLO compliance
sli:http_availability:ratio * 100

# Error budget remaining
slo:http_availability:error_budget_remaining

# Days until exhausted at current burn
slo:http_availability:error_budget_remaining / 100 * 28
/ clamp_min(slo:http_availability:burn_rate_1h, 1)
```

## SLO Definition Template

```yaml
slos:
  - name: api_availability
    description: "API requests complete successfully"
    target: 99.9
    window: 28d
    sli:
      good: sum(rate(http_requests_total{status!~"5.."}[28d]))
      total: sum(rate(http_requests_total[28d]))
    alerts:
      fast_burn:
        burn_rate: 14.4
        short_window: 5m
        long_window: 1h
        severity: critical
      slow_burn:
        burn_rate: 6
        short_window: 30m
        long_window: 6h
        severity: warning

  - name: api_latency_p95
    description: "API requests complete within 500ms"
    target: 99.0
    window: 28d
    sli:
      good: sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      total: sum(rate(http_request_duration_seconds_count[28d]))
```

## Common Mistakes

| Mistake | Problem | Fix |
|---------|---------|-----|
| Single window alerts | Too noisy or too slow | Use multi-window |
| Missing burn rate | Don't know velocity | Add burn rate recording rules |
| 100% SLO target | No error budget | Accept 99.9% or lower |
| No dashboard | Can't see trends | Build SLO dashboard first |
| Alert on SLI directly | Missing context | Alert on burn rate instead |

````

### scripts/verify.py

```python
#!/usr/bin/env python3
"""Verify operating-production-services skill structure."""
import os
import sys

def main():
    skill_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

    required = [
        "SKILL.md",
        "references/slo-alerting.md",
    ]

    missing = [f for f in required if not os.path.exists(os.path.join(skill_dir, f))]

    if missing:
        print(f"X Missing: {', '.join(missing)}")
        sys.exit(1)

    # Check SKILL.md has key sections
    skill_path = os.path.join(skill_dir, "SKILL.md")
    with open(skill_path, 'r') as f:
        content = f.read()

    required_sections = ["SLOs", "Error Budget", "Postmortem", "5 Whys"]
    missing_sections = [s for s in required_sections if s not in content]

    if missing_sections:
        print(f"X Missing sections: {', '.join(missing_sections)}")
        sys.exit(1)

    print("OK operating-production-services skill ready")
    sys.exit(0)

if __name__ == "__main__":
    main()

```
