debug-methodology
Systematic debugging and problem-solving methodology. Activate when encountering unexpected errors, service failures, regression bugs, deployment issues, or when a fix attempt has failed twice. Also activate when proposing ANY fix to verify it addresses root cause (not a workaround). Prevents patch-chaining, wrong-environment restarts, workaround addiction, and "drunk man" random fixes.
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install openclaw-skills-debug-methodology
Repository
Skill path: skills/abczsl520/debug-methodology
Best for
Primary workflow: Run DevOps.
Technical facets: Full Stack, DevOps, Testing.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: openclaw.
This is a mirrored public skill entry. Review the repository before installing it into production workflows.
What it helps with
- Install debug-methodology into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/openclaw/skills before adding debug-methodology to shared team environments
- Use debug-methodology for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: debug-methodology
description: Systematic debugging and problem-solving methodology. Activate when encountering unexpected errors, service failures, regression bugs, deployment issues, or when a fix attempt has failed twice. Also activate when proposing ANY fix to verify it addresses root cause (not a workaround). Prevents patch-chaining, wrong-environment restarts, workaround addiction, and "drunk man" random fixes.
---
# Debug Methodology
Systematic approach to debugging and problem-solving. Distilled from real production incidents and industry best practices.
## ⚠️ The Root Cause Imperative
**Every fix MUST target the root cause. Workarounds are forbidden unless explicitly approved.**
Before proposing ANY solution, pass the Root Cause Gate:
```
┌─────────────────────────────────────────────┐
│ ROOT CAUSE GATE │
│ │
│ 1. What is the ACTUAL problem? │
│ 2. WHY does it happen? (not just WHAT) │
│ 3. Does my fix eliminate the WHY? │
│ YES → proceed │
│ NO → this is a workaround → STOP │
│ │
│ Workaround test: │
│ "If I remove my fix, does the bug return?" │
│ YES → workaround (fix the cause instead)│
│ NO → genuine fix ✅ │
└─────────────────────────────────────────────┘
```
### The 5 Whys — Mandatory for Non-Obvious Problems
```
Problem: API returns 524 timeout
Why? → Cloudflare cuts connections >100s
Why? → The API call takes >100s
Why? → Using non-streaming request, server holds connection silent
Why? → Code uses regular fetch, not streaming
Fix: → Use streaming (server sends data continuously, Cloudflare won't cut)
❌ WRONG: Switch to faster model (workaround — avoids the timeout instead of fixing it)
✅ RIGHT: Use streaming API (root cause — Cloudflare needs ongoing data)
```
### Common Workaround Traps
| Problem | Workaround (❌) | Root Cause Fix (✅) |
|---------|----------------|-------------------|
| API timeout | Switch to faster model | Use streaming / fix the slow query |
| Data precision loss | Search by name instead of ID | Fix BigInt parsing |
| Search returns nothing | Try different search strategy | Fix the search implementation |
| Dependency conflict | Downgrade / pin version | Use correct environment (venv) |
| Feature doesn't work | Remove the feature | Debug why it fails |
**Self-check question**: "Am I solving the problem, or avoiding it?"
## Phase 1: STOP — Assess Before Acting
Before ANY fix attempt:
```
□ What is the EXACT symptom? (error message, behavior, screenshot)
□ When did it last work? What changed since then?
□ How is the service running? (process, env, startup command)
```
For running services:
```bash
ps -p <PID> -o command= # How was it started?
ls .venv/ venv/ env/ # Virtual environment?
which python3 && python3 --version
which node && node --version
```
**NEVER restart a service without first recording its original startup command.**
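Recording the startup command is one line of shell. A minimal sketch, using the current shell's PID as a stand-in for the service (substitute the real PID in practice); the output path is an arbitrary example:

```shell
#!/bin/sh
# Record the original startup command BEFORE any restart, so a revert is possible.
PID=$$                                   # stand-in: use the real service PID here
{ ps -p "$PID" -o command= 2>/dev/null \
  || tr '\0' ' ' < "/proc/$PID/cmdline"; } > /tmp/orig-startup-cmd.txt
cat /tmp/orig-startup-cmd.txt            # verify something was actually captured
```

The `/proc` fallback covers minimal containers whose `ps` lacks `-o command=`.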
## Phase 2: Hypothesize — Form ONE Theory
Priority order:
1. **Did I change something?** → diff/revert first
2. **Did the environment change?** → versions, deps, configs
3. **Did external inputs change?** → API responses, data formats
4. **Genuine new bug?** → only after ruling out 1-3
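Priority 1 maps directly onto a couple of git commands. A sketch run in a throwaway repo for illustration; in practice, run the last two commands in your project root before hypothesizing about anything else:

```shell
#!/bin/sh
# Priority 1 check: did I change something? Demonstrated in a throwaway repo.
DIR=$(mktemp -d)
cd "$DIR" || exit 1
git init -q .
git -c user.email=demo@example.com -c user.name=demo \
  commit -q --allow-empty -m "last known good"
echo "my recent edit" > file.txt        # simulate an uncommitted change
git status --short                      # uncommitted/untracked changes
git log --oneline -3                    # what landed recently?
```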
## Phase 3: Test — One Change at a Time
```
Change X → Test → Works? → Done
→ Fails? → REVERT X → new hypothesis
```
**Do NOT stack changes.**
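The loop above can be sketched with stub commands; `apply_change`, `run_test`, and `revert` are hypothetical placeholders for your real edit, test, and rollback steps:

```shell
#!/bin/sh
# One hypothesis per iteration: apply ONE change, test, revert on failure.
# All three functions are stand-ins; wire in your real commands.
apply_change() { echo "applied change X"; }
run_test()    { return 1; }            # simulate a failing test
revert()      { echo "reverted change X"; }

apply_change
if run_test; then
  echo "fix confirmed"
else
  revert                               # never stack a second change on top
fi
```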
## Phase 4: Patch-Chain Detection
**2 fix attempts failed → STOP. Revert ALL. Back to Phase 1.**
You are likely:
- Fixing symptoms of a wrong fix
- In the wrong environment entirely
- Misunderstanding the architecture
## Phase 5: Post-Fix Verification
After any fix, verify:
```
□ Does it solve the ORIGINAL problem? (not just silence the error)
□ Did I introduce new issues? (regression check)
□ Would removing my fix bring the bug back? (confirms causality)
□ Is the fix in the right layer? (not patching symptoms upstream)
```
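The causality check ("would removing my fix bring the bug back?") can be rehearsed with stubs; `bug_reproduces` and `FIX_APPLIED` are hypothetical stand-ins for your real reproduction step and fix state:

```shell
#!/bin/sh
# Causality check: with the fix removed, the bug SHOULD reappear.
FIX_APPLIED=0                     # pretend the fix has been reverted
bug_reproduces() { [ "$FIX_APPLIED" -eq 0 ]; }   # stand-in reproduction probe

if bug_reproduces; then
  echo "bug returned without the fix: causality confirmed"
else
  echo "bug gone even without the fix: the 'fix' changed nothing" >&2
fi
```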
## Anti-Patterns
### 🚨 Workaround Addiction (NEW — Most Common!)
Bypassing the problem instead of fixing it. "It's slower but works" / "Use a different approach".
→ **Ask: "Am I solving or avoiding?"** If avoiding → find the real fix.
→ Workarounds are ONLY acceptable when: (1) explicitly approved by user, (2) clearly labeled as temporary, (3) a TODO is created for the real fix.
### 🚨 Drunk Man Anti-Pattern
Randomly changing things until the problem disappears.
→ Each change needs a hypothesis.
### 🚨 Streetlight Anti-Pattern
Looking where comfortable, not where the problem is.
→ "Is this where the bug IS, or where I KNOW HOW TO LOOK?"
### 🚨 Cargo Cult Fix
Copying a fix without understanding why it works.
→ Understand the mechanism first.
### 🚨 Ignoring the User
User says "it broke after you changed X" → immediately diff X.
→ User observations are the most valuable data.
## Environment Checklist
```
□ Runtime: system or venv/nvm?
□ Dependencies: match expected versions?
□ Config: .env, config.json — recent changes?
□ Process manager: PM2/systemd — restart method?
□ Logs: tail -f before reproducing
□ Backup: snapshot before any change
```
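The checklist can be captured in one snapshot before any change. A sketch; the config file names (`.env`, `config.json`) come from the checklist itself and the snapshot path is an arbitrary example:

```shell
#!/bin/sh
# Environment snapshot: record runtime, venv, and config state before changes.
SNAP=/tmp/env-snapshot.txt
{
  echo "== runtime =="
  command -v python3 && python3 --version
  command -v node && node --version
  echo "== virtualenv candidates =="
  ls -d .venv venv env 2>/dev/null || echo "(none found)"
  echo "== config files =="
  ls -l .env config.json 2>/dev/null || echo "(none found)"
} > "$SNAP" 2>&1
echo "snapshot written to $SNAP"
```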
## Deployment Safety (Hardened SCP Flow)
**Iron Rule: NEVER edit files directly on the server. NEVER overwrite server files without backup.**
```
Standard deployment (every time, no exceptions):
1. PULL scp -r server:/opt/apps/<project>/ ./local-<project>/
(pull the files you need + related files)
2. EDIT Make changes locally
(complex multi-line → write full file, never sed)
3. VERIFY node -c *.js # syntax check
node -e "require('./file')" # module load check
(STOP if verification fails — do not proceed)
4. BACKUP ssh server "cp file file.bak.$(date +%s)"
5. PUSH scp ./local-file server:/opt/apps/<project>/file
6. RESTART pm2 restart <app>
(use SAME method as original — check ps/pm2 show first)
7. HEALTH curl -s http://localhost:<port>/health
pm2 logs <app> --lines 5 --nostream
(if unhealthy → revert backup immediately)
```
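Step 4's timestamped backup can be exercised locally first; on the server the identical pattern runs inside an `ssh` command. The file below is a throwaway example:

```shell
#!/bin/sh
# Timestamped backup before overwriting (deployment step 4), shown locally.
FILE=/tmp/demo-app.js
printf 'module.exports = {};\n' > "$FILE"
cp "$FILE" "$FILE.bak.$(date +%s)"   # same pattern as: ssh server "cp file file.bak.$(date +%s)"
ls "$FILE".bak.*                     # the backup a failed health check reverts to
```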
### Pull Scope Rules
```
Changing 1 file → pull that file + its imports/importers
Changing routes → also pull server.js (check mount points)
Changing frontend → also pull index.html (check script tags)
Changing config → also pull code that reads the config
Unsure what to pull → pull the whole project directory
```
### What NOT to Do
```
❌ sed -i for multi-line code on server
❌ Skip node -c after editing .js
❌ pm2 restart before syntax verification
❌ Tell user to refresh before health check passes
❌ Push without backup
```
## 🚨 Server Code Modification Rules
**Every code change on a server MUST be syntax-verified before restart/reload.**
```
After editing .js files:
□ node -c <file> # Syntax check
□ node -e "require('./<file>')" # Module load check (for route files)
□ FAIL → DO NOT restart. DO NOT tell user to refresh. Fix first.
After editing .html files:
□ Check critical tag closure (div/script/style)
□ grep -c '<div' file && grep -c '</div' file # Count match
Complex multi-line changes:
□ Write complete file locally → scp upload
□ NEVER use sed for multi-line code insertion (newlines get swallowed)
□ If sed is unavoidable → verify with node -c immediately after
Restart sequence:
□ node -c *.js passes → pm2 restart <app>
□ Check pm2 logs --lines 5 for startup errors
□ curl health endpoint to confirm service is up
```
**Why**: multi-line insertion with `sed -i` silently corrupts JS (the inserted newlines get swallowed, collapsing code onto a single line), causing syntax errors that break the entire page with no visible error to the user.
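The syntax gate above can be a guard in the deploy script. A sketch that writes a throwaway file and skips gracefully when `node` is not installed; the file path is an example:

```shell
#!/bin/sh
# Syntax gate: refuse to restart unless node --check passes.
FILE=/tmp/route-demo.js
printf 'const x = 1;\nmodule.exports = x;\n' > "$FILE"

if ! command -v node >/dev/null 2>&1; then
  echo "node not installed here; gate skipped"   # hedge for node-less environments
elif node --check "$FILE"; then
  echo "syntax OK: safe to restart"
else
  echo "syntax FAIL: do not restart, do not tell the user to refresh" >&2
fi
```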
## Decision Tree
```
Problem appears
├─ I just edited something? → DIFF → REVERT if suspect
├─ Service won't start? → CHECK startup command + env
├─ New error after fix? → STOP (patch chain!) → Revert → Phase 1
├─ User reports regression? → DIFF before/after
├─ Tempted to work around? → ROOT CAUSE GATE → fix the real issue
└─ Intermittent? → CHECK logs + external deps + timing
```
---
## Skill Companion Files
> Additional files collected from the skill directory layout.
### README.md
```markdown
# 🔍 Debug Methodology
**Systematic debugging methodology**: a general-purpose debugging standard for AI agents and developers.
> Distilled from real production incidents, combined with the methodologies of top industry engineers such as Nicole Tietz, Brendan Gregg, and Julia Evans.
## Why does this exist?
AI agents debugging code tend to fall into the following traps:
- 🚨 **Drunk Man anti-pattern**: randomly changing code until the problem disappears
- 🚨 **Streetlight anti-pattern**: searching only in familiar places, not where the problem actually is
- 🚨 **Patch-chaining**: fixing each new error as it appears, making things messier with every change
- 🚨 **Ignoring the user**: the user says "it broke after you changed X", yet the agent keeps guessing on its own
This methodology provides a **mandatory debugging workflow** that avoids these common mistakes.
## Core Workflow
```
Phase 1: STOP   → Understand the current state before acting (process, environment, startup command)
Phase 2: THINK  → Form one hypothesis (check your own changes first)
Phase 3: TEST   → Change one thing at a time; verify before continuing
Phase 4: DETECT → Two failed fixes? Revert everything and start over
```
## Quick Decision Tree
```
Error appears
├─ Was I just editing? → DIFF my changes → REVERT if suspect
├─ Service won't start? → CHECK startup command + environment
├─ New error after fix? → STOP (patch chain!) → Revert all → Phase 1
├─ User reports regression? → DIFF before/after their last known-good
└─ Intermittent? → CHECK logs + external dependencies + timing
```
## Using as an OpenClaw Skill
Place `SKILL.md` in your skills directory:
```bash
mkdir -p ~/.agents/skills/debug-methodology
cp SKILL.md ~/.agents/skills/debug-methodology/
```
After restarting OpenClaw, every session automatically loads this methodology when a debugging scenario arises.
## Full Specification
See [SKILL.md](SKILL.md), which includes:
- **4-phase debugging workflow** (STOP → Hypothesize → Test → Patch-Chain Detection)
- **4 anti-pattern warnings** (Drunk Man / Streetlight / Cargo Cult / Ignoring the User)
- **Environment checklist** (runtime / deps / config / process manager / logs / backup)
- **Deployment safety workflow** (pull → back up → edit → test → deploy → verify)
- **Quick decision tree**
## Origin
This methodology originated from a real production incident:
a simple timeout fix (doable in 2 steps) turned into a 10-step detour because the service was restarted without its virtual environment. The post-mortem showed that running a single `ps -p <PID> -o command=` at the start would have avoided the whole mess.
That incident was distilled into this general debugging standard and combined with industry best practices into a complete methodology.
## Install
```bash
clawhub install debug-methodology
```
## Wiki
See the [Wiki](../../wiki) for detailed case studies and extended material.
## 🔗 Part of the AI Dev Quality Suite
| Skill | Purpose | Install |
|-------|---------|---------|
| [bug-audit](https://github.com/abczsl520/bug-audit-skill) | Dynamic bug hunting, 200+ pitfall patterns | `clawhub install bug-audit` |
| [codex-review](https://github.com/abczsl520/codex-review) | Three-tier code review with adversarial testing | `clawhub install codex-review` |
| **debug-methodology** (this) | Root-cause debugging, prevents patch-chaining | `clawhub install debug-methodology` |
| [nodejs-project-arch](https://github.com/abczsl520/nodejs-project-arch) | AI-friendly architecture, 70-93% token savings | `clawhub install nodejs-project-arch` |
| [game-quality-gates](https://github.com/abczsl520/game-quality-gates) | 12 universal game dev quality checks | `clawhub install game-quality-gates` |
## License
MIT
```
### _meta.json
```json
{
"owner": "abczsl520",
"slug": "debug-methodology",
"displayName": "Debug Methodology",
"latest": {
"version": "1.2.0",
"publishedAt": 1772820095593,
"commit": "https://github.com/openclaw/skills/commit/12eb4d910ae1ea53ffe1766fc261f09be9dc67f5"
},
"history": [
{
"version": "1.1.0",
"publishedAt": 1772695043046,
"commit": "https://github.com/openclaw/skills/commit/90c3ac7d3323a179328dd130dc8e987602c38532"
},
{
"version": "1.0.0",
"publishedAt": 1772557747980,
"commit": "https://github.com/openclaw/skills/commit/0395c2008c554e688b46367d519c8f71bd3de9e2"
}
]
}
```