
k8s-troubleshoot

Debug Kubernetes pods, nodes, and workloads. Use when pods are failing, containers crash, nodes are unhealthy, or users mention debugging, troubleshooting, or diagnosing Kubernetes issues.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars
847
Hot score
99
Updated
March 20, 2026
Overall rating
C (4.4)
Composite score
4.4
Best-practice grade
B (77.6)

Install command

npx @skill-hub/cli install rohitg00-kubectl-mcp-server-k8s-troubleshoot

Repository

rohitg00/kubectl-mcp-server

Skill path: kubernetes-skills/claude/k8s-troubleshoot

Debug Kubernetes pods, nodes, and workloads. Use when pods are failing, containers crash, nodes are unhealthy, or users mention debugging, troubleshooting, or diagnosing Kubernetes issues.

Open repository

Best for

Primary workflow: Run DevOps.

Technical facets: Full Stack, DevOps, Testing.

Target audience: everyone.

License: Apache-2.0.

Original source

Catalog source: SkillHub Club.

Repository owner: rohitg00.

This is a mirrored public skill entry; review the repository before installing it into production workflows.

What it helps with

  • Install k8s-troubleshoot into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/rohitg00/kubectl-mcp-server before adding k8s-troubleshoot to shared team environments
  • Use k8s-troubleshoot for development workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: k8s-troubleshoot
description: Debug Kubernetes pods, nodes, and workloads. Use when pods are failing, containers crash, nodes are unhealthy, or users mention debugging, troubleshooting, or diagnosing Kubernetes issues.
license: Apache-2.0
metadata:
  author: rohitg00
  version: "1.0.0"
  tools: 15
  category: observability
---

# Kubernetes Troubleshooting

Expert debugging and diagnostics for Kubernetes clusters using kubectl-mcp-server tools.

## When to Apply

Use this skill when:
- User mentions: "debug", "troubleshoot", "diagnose", "failing", "crash", "not starting", "broken"
- Pod states: Pending, CrashLoopBackOff, ImagePullBackOff, OOMKilled, Error, Unknown
- Node issues: NotReady, MemoryPressure, DiskPressure, NetworkUnavailable, PIDPressure
- Keywords: "logs", "events", "describe", "why isn't it working", "stuck", "not responding"

## Priority Rules

| Priority | Rule | Impact | Tools |
|----------|------|--------|-------|
| 1 | Check pod status first | CRITICAL | `get_pods`, `describe_pod` |
| 2 | View recent events | CRITICAL | `get_events` |
| 3 | Inspect logs (including previous) | HIGH | `get_pod_logs` |
| 4 | Check resource metrics | HIGH | `get_pod_metrics` |
| 5 | Verify endpoints | MEDIUM | `get_endpoints` |
| 6 | Review network policies | MEDIUM | `get_network_policies` |
| 7 | Examine node status | LOW | `get_nodes`, `describe_node` |

## Quick Reference

| Symptom | First Tool | Next Steps |
|---------|------------|------------|
| Pod Pending | `describe_pod` | Check events, node capacity, resource requests |
| CrashLoopBackOff | `get_pod_logs(previous=True)` | Check exit code, resources, liveness probes |
| ImagePullBackOff | `describe_pod` | Verify image name, registry auth, network |
| OOMKilled | `get_pod_metrics` | Increase memory limits, check for memory leaks |
| ContainerCreating | `describe_pod` | Check PVC binding, secrets, configmaps |
| Terminating (stuck) | `describe_pod` | Check finalizers, PDBs, preStop hooks |

## Diagnostic Workflows

### Pod Not Starting

```
1. get_pods(namespace, label_selector) - Get pod status
2. describe_pod(name, namespace) - See events and conditions
3. get_events(namespace, field_selector="involvedObject.name=<pod>") - Check events
4. get_pod_logs(name, namespace, previous=True) - For crash loops
```

### Common Pod States

| State | Likely Cause | Tools to Use |
|-------|-------------|--------------|
| Pending | Scheduling issues | `describe_pod`, `get_nodes`, `get_events` |
| ImagePullBackOff | Registry/auth | `describe_pod`, check image name |
| CrashLoopBackOff | App crash | `get_pod_logs(previous=True)` |
| OOMKilled | Memory limit | `get_pod_metrics`, adjust limits |
| ContainerCreating | Volume/network | `describe_pod`, `get_pvc` |
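
The state table above can be collapsed into a small triage helper. This is an illustrative sketch (the helper and lookup table are not part of kubectl-mcp-server; the tool names are the ones this skill calls):

```python
# Illustrative lookup mirroring the "Common Pod States" table above.
NEXT_TOOLS = {
    "Pending": ["describe_pod", "get_nodes", "get_events"],
    "ImagePullBackOff": ["describe_pod"],
    "CrashLoopBackOff": ["get_pod_logs(previous=True)"],
    "OOMKilled": ["get_pod_metrics"],
    "ContainerCreating": ["describe_pod", "get_pvc"],
}

def next_tools(state: str) -> list[str]:
    """Return the tools to try first for a pod state, most useful first."""
    # describe_pod + get_events is a safe default for any unrecognized state.
    return NEXT_TOOLS.get(state, ["describe_pod", "get_events"])
```

Unknown states fall back to `describe_pod` plus `get_events`, which are safe first steps for any pod.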

### Node Issues

```
1. get_nodes() - List nodes and status
2. describe_node(name) - See conditions and capacity
3. Check: Ready, MemoryPressure, DiskPressure, PIDPressure
4. node_logs_tool(name, "kubelet") - Kubelet logs
```
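
Step 3's condition check can be sketched as follows. The condition dicts mirror the shape `describe_node` reports; the helper itself is hypothetical:

```python
# Hypothetical helper: a node is healthy when Ready is "True" and every
# pressure condition (MemoryPressure, DiskPressure, PIDPressure,
# NetworkUnavailable) is "False".
def unhealthy_conditions(conditions: list[dict]) -> list[str]:
    problems = []
    for cond in conditions:
        if cond["type"] == "Ready" and cond["status"] != "True":
            problems.append("NotReady")
        elif cond["type"] != "Ready" and cond["status"] == "True":
            # A pressure condition being True is itself the problem.
            problems.append(cond["type"])
    return problems
```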

## Deep Debugging Workflows

### CrashLoopBackOff Investigation

```
1. get_pod_logs(name, namespace, previous=True) - See why it crashed
2. describe_pod(name, namespace) - Check resource limits, probes
3. get_pod_metrics(name, namespace) - Memory/CPU at crash time
4. If OOM: compare requests/limits to actual usage
5. If app error: check logs for stack trace
```
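
Step 2's exit-code check follows the standard container convention: codes above 128 mean the process was killed by signal `code - 128`. A minimal sketch (the interpretation strings are heuristics, not kubectl-mcp-server output):

```python
import signal

def describe_exit_code(code: int) -> str:
    """Heuristic reading of a container exit code (128+N = killed by signal N)."""
    if code == 0:
        return "clean exit"
    if code > 128:
        name = signal.Signals(code - 128).name  # e.g. 137 -> SIGKILL
        hint = " (often OOMKilled)" if name == "SIGKILL" else ""
        return f"killed by {name}{hint}"
    return "application error; check logs for a stack trace"
```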

### Networking Issues

```
1. get_services(namespace) - Verify service exists
2. get_endpoints(namespace) - Check endpoint backends
3. If empty endpoints: pods don't match selector
4. get_network_policies(namespace) - Check traffic rules
5. For Cilium: cilium_endpoints_list_tool(), hubble_flows_query_tool()
```
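
Step 3's rule ("empty endpoints: pods don't match selector") is plain label matching: every selector key/value must appear on the pod, and extra pod labels are ignored. A minimal sketch of that check (helper name is illustrative):

```python
# Does a Service selector select a pod with these labels?
def selector_matches(selector: dict[str, str], pod_labels: dict[str, str]) -> bool:
    return all(pod_labels.get(key) == value for key, value in selector.items())
```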

### Storage Problems

```
1. get_pvc(namespace) - Check PVC status
2. describe_pvc(name, namespace) - See binding issues
3. get_storage_classes() - Verify provisioner exists
4. If Pending: check storage class, access modes
```
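
Step 4's access-mode check works the same way: a PV is a binding candidate for a PVC only if it offers every requested access mode. A sketch (helper name is illustrative):

```python
# A PVC's requested access modes must be a subset of what the PV offers.
# Valid modes: ReadWriteOnce, ReadOnlyMany, ReadWriteMany, ReadWriteOncePod.
def access_modes_compatible(requested: list[str], offered: list[str]) -> bool:
    return set(requested) <= set(offered)
```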

### DNS Resolution

```
1. kubectl_exec(pod, namespace, "nslookup kubernetes.default") - Test DNS
2. If fails: check coredns pods in kube-system
3. get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
4. get_pod_logs(name="<coredns-pod>", namespace="kube-system")
```

## Multi-Cluster Debugging

All tools support `context` parameter for targeting different clusters:

```python
get_pods(namespace="kube-system", context="production-cluster")
get_events(namespace="default", context="staging-cluster")
describe_pod(name="myapp-xyz", namespace="prod", context="prod-east")
```

## Diagnostic Scripts

For comprehensive diagnostics, run the bundled scripts:
- See [scripts/diagnose-pod.py](scripts/diagnose-pod.py) for automated pod analysis
- See [scripts/health-check.sh](scripts/health-check.sh) for cluster health checks

## Decision Tree

See [references/DECISION-TREE.md](references/DECISION-TREE.md) for visual troubleshooting flowcharts.

## Common Errors Reference

See [references/COMMON-ERRORS.md](references/COMMON-ERRORS.md) for error message explanations and fixes.

## Related Tools

### Core Diagnostics
- `get_pods`, `describe_pod`, `get_pod_logs`, `get_pod_metrics`
- `get_events`, `get_nodes`, `describe_node`
- `get_resource_usage`, `compare_namespaces`

### Advanced (Ecosystem)
- Cilium: `cilium_endpoints_list_tool`, `hubble_flows_query_tool`
- Istio: `istio_proxy_status_tool`, `istio_analyze_tool`

## Related Skills

- [k8s-diagnostics](../k8s-diagnostics/SKILL.md) - Metrics and health checks
- [k8s-incident](../k8s-incident/SKILL.md) - Emergency runbooks
- [k8s-networking](../k8s-networking/SKILL.md) - Network troubleshooting


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### scripts/diagnose-pod.py

```python
#!/usr/bin/env python3
"""
Pod Diagnostic Script
Collects comprehensive diagnostics for a pod.

Usage within Claude Code:
    This script is called by the k8s-troubleshoot skill to gather
    pod diagnostics in a structured format.
"""

import json
import sys
from typing import Any


def diagnose_pod(name: str, namespace: str, context: str = "") -> dict[str, Any]:
    """
    Collect comprehensive diagnostics for a pod.

    Args:
        name: Pod name
        namespace: Kubernetes namespace
        context: Optional kubeconfig context

    Returns:
        Dictionary with diagnostic information
    """
    diagnostics = {
        "pod": name,
        "namespace": namespace,
        "context": context or "current",
        "checks": [],
        "issues": [],
        "recommendations": []
    }

    # Note: In actual usage, Claude will call the MCP tools directly.
    # This script structure shows what diagnostics to collect.

    diagnostics["checks"] = [
        {
            "name": "pod_status",
            "tool": "get_pods",
            "params": {"namespace": namespace, "context": context},
            "description": "Get pod status and phase"
        },
        {
            "name": "pod_details",
            "tool": "describe_pod",
            "params": {"name": name, "namespace": namespace, "context": context},
            "description": "Get detailed pod description"
        },
        {
            "name": "pod_logs",
            "tool": "get_pod_logs",
            "params": {"name": name, "namespace": namespace, "previous": True, "context": context},
            "description": "Get logs (including previous container)"
        },
        {
            "name": "pod_events",
            "tool": "get_events",
            "params": {"namespace": namespace, "field_selector": f"involvedObject.name={name}", "context": context},
            "description": "Get events related to this pod"
        },
        {
            "name": "pod_metrics",
            "tool": "get_pod_metrics",
            "params": {"name": name, "namespace": namespace, "context": context},
            "description": "Get resource usage metrics"
        }
    ]

    return diagnostics


def analyze_pod_state(status: str) -> dict[str, Any]:
    """
    Analyze pod state and provide recommendations.

    Args:
        status: Pod status from describe

    Returns:
        Analysis with issues and recommendations
    """
    analysis = {
        "issues": [],
        "recommendations": []
    }

    # Common patterns
    patterns = {
        "CrashLoopBackOff": {
            "issue": "Container is crashing repeatedly",
            "checks": [
                "Check logs with get_pod_logs(previous=True)",
                "Check exit code in describe output",
                "Verify resource limits aren't too restrictive"
            ],
            "common_causes": [
                "Application error - check logs",
                "OOMKilled - increase memory limit",
                "Missing dependencies - check init containers"
            ]
        },
        "ImagePullBackOff": {
            "issue": "Cannot pull container image",
            "checks": [
                "Verify image name and tag",
                "Check imagePullSecrets",
                "Test registry accessibility"
            ],
            "common_causes": [
                "Wrong image name or tag",
                "Private registry without credentials",
                "Registry rate limiting"
            ]
        },
        "Pending": {
            "issue": "Pod cannot be scheduled",
            "checks": [
                "Check node resources",
                "Verify node selectors",
                "Check for taints/tolerations"
            ],
            "common_causes": [
                "Insufficient CPU/memory on nodes",
                "No nodes match selectors",
                "PVC not bound"
            ]
        },
        "ContainerCreating": {
            "issue": "Container stuck creating",
            "checks": [
                "Check events for mount errors",
                "Verify PVCs are bound",
                "Check image pull status"
            ],
            "common_causes": [
                "Volume mount failure",
                "Slow image pull",
                "Network plugin issue"
            ]
        }
    }

    for pattern, info in patterns.items():
        if pattern.lower() in status.lower():
            analysis["issues"].append(info["issue"])
            analysis["recommendations"].extend(info["checks"])
            analysis["common_causes"] = info["common_causes"]
            break

    return analysis


if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: diagnose-pod.py <pod-name> <namespace> [context]")
        sys.exit(1)

    pod_name = sys.argv[1]
    namespace = sys.argv[2]
    context = sys.argv[3] if len(sys.argv) > 3 else ""

    result = diagnose_pod(pod_name, namespace, context)
    print(json.dumps(result, indent=2))

```

### scripts/health-check.sh

```bash
#!/bin/bash
# Kubernetes Cluster Health Check Script
#
# This script provides a quick cluster health overview.
# Used by the k8s-troubleshoot skill for rapid triage.

set -euo pipefail

CONTEXT="${1:-}"
NAMESPACE="${2:-}"

# Color codes (disabled in non-interactive mode)
if [ -t 1 ]; then
    RED='\033[0;31m'
    GREEN='\033[0;32m'
    YELLOW='\033[1;33m'
    NC='\033[0m'
else
    RED=''
    GREEN=''
    YELLOW=''
    NC=''
fi

log_ok() { echo -e "${GREEN}[OK]${NC} $1"; }
log_warn() { echo -e "${YELLOW}[WARN]${NC} $1"; }
log_error() { echo -e "${RED}[ERROR]${NC} $1"; }

KUBECTL_OPTS=""
if [ -n "$CONTEXT" ]; then
    KUBECTL_OPTS="--context=$CONTEXT"
fi

echo "=== Kubernetes Cluster Health Check ==="
echo "Context: ${CONTEXT:-current}"
echo "Namespace: ${NAMESPACE:-all}"
echo ""

# Check node status
echo "--- Node Status ---"
# grep exits non-zero when nothing matches, which would abort the script
# under `set -o pipefail`; `|| true` preserves the "0" count and continues.
NOT_READY=$(kubectl $KUBECTL_OPTS get nodes --no-headers 2>/dev/null | grep -v " Ready" | wc -l | tr -d ' ' || true)
TOTAL_NODES=$(kubectl $KUBECTL_OPTS get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')

if [ "$NOT_READY" -eq 0 ]; then
    log_ok "All $TOTAL_NODES nodes are Ready"
else
    log_error "$NOT_READY of $TOTAL_NODES nodes are NOT Ready"
    kubectl $KUBECTL_OPTS get nodes | grep -v " Ready"
fi
echo ""

# Check system pods
echo "--- System Pods (kube-system) ---"
FAILED_SYSTEM=$(kubectl $KUBECTL_OPTS get pods -n kube-system --no-headers 2>/dev/null | grep -v "Running\|Completed" | wc -l | tr -d ' ' || true)

if [ "$FAILED_SYSTEM" -eq 0 ]; then
    log_ok "All system pods are healthy"
else
    log_error "$FAILED_SYSTEM system pods are unhealthy"
    kubectl $KUBECTL_OPTS get pods -n kube-system | grep -v "Running\|Completed"
fi
echo ""

# Check namespace pods if specified
if [ -n "$NAMESPACE" ]; then
    echo "--- Pods in $NAMESPACE ---"
    FAILED_NS=$(kubectl $KUBECTL_OPTS get pods -n "$NAMESPACE" --no-headers 2>/dev/null | grep -v "Running\|Completed" | wc -l | tr -d ' ' || true)

    if [ "$FAILED_NS" -eq 0 ]; then
        log_ok "All pods in $NAMESPACE are healthy"
    else
        log_warn "$FAILED_NS pods in $NAMESPACE are unhealthy"
        kubectl $KUBECTL_OPTS get pods -n "$NAMESPACE" | grep -v "Running\|Completed"
    fi
    echo ""
fi

# Check for pending PVCs
echo "--- Storage (Pending PVCs) ---"
PENDING_PVCS=$(kubectl $KUBECTL_OPTS get pvc --all-namespaces --no-headers 2>/dev/null | grep -v "Bound" | wc -l | tr -d ' ' || true)

if [ "$PENDING_PVCS" -eq 0 ]; then
    log_ok "No pending PVCs"
else
    log_warn "$PENDING_PVCS PVCs are pending"
    kubectl $KUBECTL_OPTS get pvc --all-namespaces | grep -v "Bound"
fi
echo ""

# Check recent events
echo "--- Recent Warning Events (last 10) ---"
kubectl $KUBECTL_OPTS get events --all-namespaces --field-selector type=Warning --sort-by='.lastTimestamp' 2>/dev/null | tail -10 || true
echo ""

echo "=== Health Check Complete ==="

```

### references/DECISION-TREE.md

```markdown
# Troubleshooting Decision Trees

Visual flowcharts for diagnosing Kubernetes issues.

## Pod Not Running

```
Pod Status?
├── Pending
│   ├── Events show "Insufficient cpu/memory"
│   │   └── Scale cluster or reduce requests
│   ├── Events show "no nodes available"
│   │   └── Check node taints, affinity rules
│   ├── Events show "PersistentVolumeClaim not found"
│   │   └── Create PVC or check storage class
│   └── No events
│       └── Check scheduler pods in kube-system
│
├── CrashLoopBackOff
│   ├── get_pod_logs(previous=True)
│   ├── Exit Code 137 (OOMKilled)
│   │   └── Increase memory limits
│   ├── Exit Code 1 (App Error)
│   │   └── Check application logs, config
│   └── Exit Code 127 (Command Not Found)
│       └── Check entrypoint/command in spec
│
├── ImagePullBackOff
│   ├── "unauthorized"
│   │   └── Create/update imagePullSecrets
│   ├── "not found"
│   │   └── Verify image name and tag
│   └── "timeout"
│       └── Check network, registry availability
│
├── ContainerCreating (stuck)
│   ├── Volume issues
│   │   └── Check PVC status, storage class
│   ├── ConfigMap/Secret not found
│   │   └── Create missing resources
│   └── Network issues
│       └── Check CNI pods
│
└── Running but not ready
    ├── Readiness probe failing
    │   └── Check probe config, app health
    └── Init containers not complete
        └── Check init container logs
```

## Service Not Accessible

```
Service unreachable?
├── get_endpoints(namespace) empty?
│   ├── Yes
│   │   ├── Pods exist with matching labels?
│   │   │   ├── No → Fix selector labels
│   │   │   └── Yes → Pods not ready
│   │   │       └── Fix pod readiness
│   │   └── Service selector correct?
│   │       └── Update service spec
│   └── No (endpoints exist)
│       ├── Check NetworkPolicy
│       │   └── get_network_policies(namespace)
│       ├── Check Service type
│       │   ├── ClusterIP → Only internal access
│       │   ├── NodePort → Access via node:port
│       │   └── LoadBalancer → Check cloud LB
│       └── DNS resolution working?
│           └── Test from inside pod
```

## Node Issues

```
Node Not Ready?
├── describe_node(name)
├── Conditions show:
│   ├── MemoryPressure
│   │   └── Eviction happening, free memory
│   ├── DiskPressure
│   │   └── Clean up images, logs
│   ├── PIDPressure
│   │   └── Kill zombie processes
│   └── NetworkUnavailable
│       └── Check CNI, kubelet
├── Kubelet not running?
│   └── Check systemctl status kubelet
└── Node cordoned?
    └── kubectl uncordon node
```

## Storage Issues

```
PVC Pending?
├── describe_pvc(name, namespace)
├── Events show:
│   ├── "no persistent volumes available"
│   │   ├── Dynamic provisioning enabled?
│   │   │   └── Check StorageClass exists
│   │   └── Static PV exists with matching spec?
│   │       └── Check access modes, capacity
│   ├── "waiting for first consumer"
│   │   └── Normal with WaitForFirstConsumer
│   └── "provisioning failed"
│       └── Check storage backend, quotas
```

## Deployment Not Progressing

```
Deployment stuck?
├── rollout_status(name, namespace)
├── Shows "waiting for rollout to finish"
│   ├── New pods starting?
│   │   ├── No → Check pod issues above
│   │   └── Yes but failing
│   │       └── Check pod logs
│   ├── Old pods not terminating?
│   │   ├── Check finalizers
│   │   └── Check PDBs (PodDisruptionBudget)
│   └── Deadline exceeded?
│       └── Increase progressDeadlineSeconds
```

## Quick Commands Reference

| Issue | First Command |
|-------|--------------|
| Pod not starting | `describe_pod(name, namespace)` |
| Logs needed | `get_pod_logs(name, namespace, previous=True)` |
| Events check | `get_events(namespace)` |
| Node issues | `describe_node(name)` |
| Service debug | `get_endpoints(namespace)` |
| Storage issues | `describe_pvc(name, namespace)` |

```

### references/COMMON-ERRORS.md

```markdown
# Common Kubernetes Error Messages

Quick reference for error messages and their solutions.

## Pod Errors

### CrashLoopBackOff

**Meaning:** Container keeps crashing and restarting.

**Causes:**
- Application crash on startup
- Missing configuration or secrets
- Resource exhaustion (OOM)
- Failing health checks

**Fix:**
```python
get_pod_logs(name, namespace, previous=True)
describe_pod(name, namespace)
```

### ImagePullBackOff / ErrImagePull

**Meaning:** Cannot pull container image.

**Causes:**
- Image doesn't exist
- Private registry without credentials
- Network issues
- Rate limiting (Docker Hub)

**Fix:**
```python
describe_pod(name, namespace)
```

### OOMKilled (Exit Code 137)

**Meaning:** Container exceeded memory limit.

**Causes:**
- Memory limit too low
- Memory leak in application
- Large data processing

**Fix:**
```python
get_pod_metrics(name, namespace)
```

### Exit Code 1

**Meaning:** Application error.

**Fix:** Check application logs for stack trace.

### Exit Code 127

**Meaning:** Command not found.

**Fix:** Check container image has required binaries.

### Exit Code 128+N

**Meaning:** Container killed by signal N.

| Exit Code | Signal | Meaning |
|-----------|--------|---------|
| 130 | SIGINT (2) | Interrupt |
| 137 | SIGKILL (9) | Killed (OOM) |
| 143 | SIGTERM (15) | Terminated |

## Scheduling Errors

### 0/N nodes are available

**Causes:**
- Insufficient resources
- Node selector/affinity not matching
- Taints without tolerations
- All nodes cordoned

**Fix:**
```python
describe_pod(name, namespace)
get_nodes()
```

### PodExceedsFreeCPU / PodExceedsFreeMemory

**Fix:** Reduce resource requests or add capacity.

### NodeNotReady

**Fix:**
```python
describe_node(name)
```

## Storage Errors

### PersistentVolumeClaim not found

**Fix:** Create the PVC before the pod.

### Unable to attach or mount volumes

**Causes:**
- PV already attached elsewhere
- Storage class misconfigured
- Node has no access to storage

**Fix:**
```python
describe_pvc(name, namespace)
get_storage_classes()
```

## Network Errors

### Connection refused

**Causes:**
- Service not exposed on expected port
- Application not listening
- Firewall/NetworkPolicy blocking

**Fix:**
```python
get_endpoints(namespace)
get_network_policies(namespace)
```

### No route to host

**Causes:**
- Node network issues
- CNI problems

**Fix:** Check CNI pods in kube-system.

### DNS resolution failed

**Fix:**
```python
get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
```

## RBAC Errors

### Forbidden: User cannot...

**Causes:**
- Missing Role/ClusterRole
- Missing RoleBinding/ClusterRoleBinding
- ServiceAccount not assigned

**Fix:**
```python
get_cluster_roles()
get_role_bindings(namespace)
```

## Admission Controller Errors

### Admission webhook denied

**Causes:**
- Policy violation (Kyverno/Gatekeeper)
- Webhook unavailable

**Fix:**
```python
kyverno_clusterpolicies_list_tool()
gatekeeper_constraints_list_tool()
```

## Resource Limit Errors

### Forbidden: exceeded quota

**Fix:**
```python
get_resource_quotas(namespace)
```

### LimitRange rejection

**Fix:** Adjust resource requests/limits to match LimitRange.

## Probe Failures

### Liveness probe failed

**Meaning:** Container will be restarted.

### Readiness probe failed

**Meaning:** Pod removed from service endpoints.

**Common causes:**
- Wrong port in probe config
- Application slow to start
- Dependencies unavailable

**Fix:**
```python
describe_pod(name, namespace)
get_pod_logs(name, namespace)
```

## Quick Lookup

| Error | First Tool |
|-------|------------|
| CrashLoopBackOff | `get_pod_logs(previous=True)` |
| ImagePullBackOff | `describe_pod` |
| Pending (scheduling) | `describe_pod`, `get_events` |
| PVC Pending | `describe_pvc` |
| Connection refused | `get_endpoints` |
| RBAC Forbidden | `get_role_bindings` |

```

### ../k8s-diagnostics/SKILL.md

```markdown
---
name: k8s-diagnostics
description: Kubernetes diagnostics for metrics, health checks, resource comparisons, and cluster analysis. Use when analyzing cluster health, comparing environments, or gathering diagnostic data.
license: Apache-2.0
metadata:
  author: rohitg00
  version: "1.0.0"
  tools: 10
  category: observability
---

# Kubernetes Diagnostics

Analyze cluster health and compare resources using kubectl-mcp-server's diagnostic tools.

## When to Apply

Use this skill when:
- User mentions: "metrics", "health check", "compare", "analysis", "capacity"
- Operations: cluster health assessment, environment comparison, resource analysis
- Keywords: "how much", "usage", "difference between", "capacity planning"

## Priority Rules

| Priority | Rule | Impact | Tools |
|----------|------|--------|-------|
| 1 | Check metrics-server before using metrics | CRITICAL | `get_resource_metrics` |
| 2 | Run health check before deployments | HIGH | `cluster_health_check` |
| 3 | Compare staging vs prod before release | MEDIUM | `compare_namespaces` |
| 4 | Document baseline metrics | LOW | `get_nodes_summary` |

## Quick Reference

| Task | Tool | Example |
|------|------|---------|
| Cluster health | `cluster_health_check` | `cluster_health_check()` |
| Pod metrics | `get_resource_metrics` | `get_resource_metrics(namespace)` |
| Node summary | `get_nodes_summary` | `get_nodes_summary()` |
| Compare envs | `compare_namespaces` | `compare_namespaces(ns1, ns2, type)` |
| List CRDs | `list_crds` | `list_crds()` |

## Resource Metrics

```python
get_resource_metrics(namespace="default")

get_node_metrics()

get_top_pods(namespace="default", sort_by="cpu")

get_top_pods(namespace="default", sort_by="memory")
```

## Cluster Health Check

```python
cluster_health_check()

get_cluster_info()
```

## Compare Environments

```python
compare_namespaces(
    namespace1="staging",
    namespace2="production",
    resource_type="deployment"
)

compare_namespaces(
    namespace1="default",
    namespace2="default",
    resource_type="deployment",
    context1="staging-cluster",
    context2="prod-cluster"
)
```

## API Discovery

```python
get_api_versions()

check_crd_exists(crd_name="certificates.cert-manager.io")

list_crds()
```

## Resource Analysis

```python
get_nodes_summary()

kubeconfig_view()

list_contexts_tool()
```

## Diagnostic Workflows

### Cluster Overview

```python
cluster_health_check()
get_nodes_summary()
get_events(namespace="")
list_crds()
```

### Pre-deployment Check

```python
get_resource_metrics(namespace="production")
get_nodes_summary()
compare_namespaces(namespace1="staging", namespace2="prod", resource_type="deployment")
```

### Post-incident Analysis

```python
get_events(namespace)
get_pod_logs(name, namespace, previous=True)
get_resource_metrics(namespace)
describe_node(name)
```

## Related Skills

- [k8s-troubleshoot](../k8s-troubleshoot/SKILL.md) - Debug issues
- [k8s-cost](../k8s-cost/SKILL.md) - Cost analysis
- [k8s-incident](../k8s-incident/SKILL.md) - Incident response

```

### ../k8s-incident/SKILL.md

```markdown
---
name: k8s-incident
description: Respond to Kubernetes incidents with runbooks and diagnostics. Use for outages, pod failures, node issues, network problems, and emergency response.
license: Apache-2.0
metadata:
  author: rohitg00
  version: "1.0.0"
  tools: 15
  category: observability
---

# Kubernetes Incident Response

Runbooks and diagnostic workflows for common Kubernetes incidents.

## When to Apply

Use this skill when:
- User mentions: "incident", "outage", "emergency", "down", "not working"
- Operations: emergency response, production issues, service degradation
- Keywords: "urgent", "broken", "fix", "restore", "recover"

## Priority Rules

| Priority | Rule | Impact | Tools |
|----------|------|--------|-------|
| 1 | Check control plane first | CRITICAL | `get_pods(namespace="kube-system")` |
| 2 | Assess node health | CRITICAL | `get_nodes` |
| 3 | Gather events before changes | HIGH | `get_events` |
| 4 | Document timeline | HIGH | Manual notes |
| 5 | Rollback if safe | MEDIUM | `rollback_deployment` |

## Quick Reference

| Incident | First Tool | Next Steps |
|----------|------------|------------|
| Pod failure | `get_pod_logs(previous=True)` | `describe_pod`, `get_events` |
| Node down | `describe_node` | Check kubelet logs |
| Service unreachable | `get_endpoints` | `get_network_policies` |
| Control plane | `get_pods(namespace="kube-system")` | Check API server logs |

## Incident Triage

### Quick Health Check

```python
get_nodes()
get_pods(namespace="kube-system")
get_events(namespace)
```

### Severity Assessment

| Indicator | Severity | Action |
|-----------|----------|--------|
| Multiple nodes NotReady | Critical | Escalate immediately |
| kube-system pods failing | Critical | Control plane issue |
| Single pod CrashLoop | Medium | Debug pod |
| High latency | Medium | Check resources |

## Runbook: Pod Failures

### CrashLoopBackOff

```python
get_pod_logs(name, namespace, previous=True)
describe_pod(name, namespace)
get_events(namespace, field_selector="involvedObject.name=<pod>")
get_pod_metrics(name, namespace)
```

**Common Causes:**
- OOMKilled → Increase memory limits
- Exit code 1 → Application error in logs
- Exit code 137 → Killed by OOM or SIGKILL
- Exit code 143 → Graceful SIGTERM

### ImagePullBackOff

```python
describe_pod(name, namespace)
get_secrets(namespace)
```

### Pending Pod

```python
describe_pod(name, namespace)
get_nodes()
get_events(namespace)
```

## Runbook: Node Issues

### Node NotReady

```python
describe_node(name)
get_events(namespace="", field_selector="involvedObject.name=<node>")
node_logs_tool(name, "kubelet")
```

### Node DiskPressure

```python
describe_node(name)
get_pods(field_selector="spec.nodeName=<node>")
```

## Runbook: Network Issues

### Service Not Accessible

```python
get_services(namespace)
get_endpoints(namespace)
get_pods(namespace, label_selector="<service-selector>")
get_network_policies(namespace)
```

### DNS Resolution Failures

```python
get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
get_pod_logs("coredns-xxx", "kube-system")
```

### With Cilium

```python
cilium_status_tool()
cilium_endpoints_list_tool(namespace)
hubble_flows_query_tool(namespace)
```

### With Istio

```python
istio_analyze_tool(namespace)
istio_proxy_status_tool()
```

## Runbook: Storage Issues

### PVC Pending

```python
describe_pvc(name, namespace)
get_storage_classes()
get_events(namespace)
```

### Pod Stuck in ContainerCreating

```python
describe_pod(name, namespace)
get_pvc(namespace)
get_events(namespace)
```

## Runbook: Control Plane Issues

### API Server Unavailable

```python
get_pods(namespace="kube-system", label_selector="component=kube-apiserver")
get_events(namespace="kube-system")
```

### etcd Issues

```python
get_pods(namespace="kube-system", label_selector="component=etcd")
get_pod_logs("etcd-xxx", "kube-system")
```

## Emergency Actions

### Force Delete Pod

```python
delete_pod(name, namespace, grace_period=0, force=True)
```

### Rollback Deployment

```python
rollback_deployment(name, namespace, revision=0)
```

### Helm Rollback

```python
rollback_helm_release(name, namespace, revision=1)
```

## Diagnostic Collection Script

For comprehensive incident diagnostics, see [scripts/collect-diagnostics.py](scripts/collect-diagnostics.py).

## Multi-Cluster Incident Response

Check all clusters:

```python
for context in ["prod-1", "prod-2", "staging"]:
    get_nodes(context=context)
    get_pods(namespace="kube-system", context=context)
    get_events(namespace="kube-system", context=context)
```

## Post-Incident

### Document Timeline

1. When did the incident start?
2. What was the impact?
3. What was the root cause?
4. What fixed it?

### Prevent Recurrence

- Add monitoring/alerting
- Improve resource limits
- Add readiness probes
- Document runbook

## Related Skills

- [k8s-troubleshoot](../k8s-troubleshoot/SKILL.md) - Detailed debugging
- [k8s-security](../k8s-security/SKILL.md) - Security incidents

```

### ../k8s-networking/SKILL.md

```markdown
---
name: k8s-networking
description: Kubernetes networking management for services, ingresses, endpoints, and network policies. Use when configuring connectivity, load balancing, or network isolation.
license: Apache-2.0
metadata:
  author: rohitg00
  version: "1.0.0"
  tools: 8
  category: networking
---

# Kubernetes Networking

Manage Kubernetes networking resources using kubectl-mcp-server's networking tools.

## When to Apply

Use this skill when:
- User mentions: "service", "ingress", "endpoint", "network policy", "load balancer"
- Operations: exposing applications, configuring routing, network isolation
- Keywords: "connectivity", "DNS", "traffic", "port", "firewall"

## Priority Rules

| Priority | Rule | Impact | Tools |
|----------|------|--------|-------|
| 1 | Check endpoints before troubleshooting services | CRITICAL | `get_endpoints` |
| 2 | Verify service selector matches pod labels | HIGH | `get_services`, `get_pods` |
| 3 | Review network policies for isolation | HIGH | `get_network_policies` |
| 4 | Test DNS resolution from within pods | MEDIUM | `kubectl_exec` |

## Quick Reference

| Task | Tool | Example |
|------|------|---------|
| List services | `get_services` | `get_services(namespace)` |
| Check backends | `get_endpoints` | `get_endpoints(namespace)` |
| List ingresses | `get_ingresses` | `get_ingresses(namespace)` |
| Network policies | `get_network_policies` | `get_network_policies(namespace)` |

## Services

```python
get_services(namespace="default")

describe_service(name="my-service", namespace="default")

create_service(
    name="my-service",
    namespace="default",
    selector={"app": "my-app"},
    ports=[{"port": 80, "targetPort": 8080}]
)

create_service(
    name="my-lb",
    namespace="default",
    type="LoadBalancer",
    selector={"app": "my-app"},
    ports=[{"port": 443, "targetPort": 8443}]
)
```

## Endpoints

```python
get_endpoints(namespace="default")
```

## Ingress

```python
get_ingresses(namespace="default")

describe_ingress(name="my-ingress", namespace="default")

kubectl_apply(manifest="""
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  namespace: default
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-service
            port:
              number: 80
""")
```

## Network Policies

```python
get_network_policies(namespace="default")

describe_network_policy(name="deny-all", namespace="default")

kubectl_apply(manifest="""
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: default
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
""")

kubectl_apply(manifest="""
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: web
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - port: 80
""")
```

## Troubleshooting Connectivity

```python
get_endpoints(namespace="default")

get_network_policies(namespace="default")

kubectl_exec(
    pod="debug-pod",
    namespace="default",
    command="nslookup my-service.default.svc.cluster.local"
)
```

## Related Skills

- [k8s-service-mesh](../k8s-service-mesh/SKILL.md) - Istio traffic management
- [k8s-cilium](../k8s-cilium/SKILL.md) - Cilium network policies

```
