SkillHub Club · Ship Full Stack · Full Stack · Backend

voiceclaw

Local voice I/O for OpenClaw agents. Transcribe inbound audio/voice messages using local Whisper (whisper.cpp) and generate voice replies using local Piper TTS. Requires whisper, piper, and ffmpeg pre-installed on the system. All inference runs on-device — no network calls, no cloud APIs, no API keys. Use when an agent receives a voice/audio message and should respond in both voice and text, or when any text response should be synthesized and sent as audio. Triggers on: voice messages, audio attachments, respond in voice, send as audio, speak this, voiceclaw.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars: 3,111
Hot score: 99
Updated: March 20, 2026
Overall rating: C (4.0)
Composite score: 4.0
Best-practice grade: A (92.0)

Install command

npx @skill-hub/cli install openclaw-skills-voiceclaw

Repository

openclaw/skills

Skill path: skills/asif2bd/voiceclaw


Open repository

Best for

Primary workflow: Ship Full Stack.

Technical facets: Full Stack, Backend.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: openclaw.

This is a mirrored public skill entry. Review the repository before installing it into production workflows.

What it helps with

  • Install voiceclaw into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/openclaw/skills before adding voiceclaw to shared team environments
  • Use voiceclaw for development workflows

Works across

Claude Code · Codex CLI · Gemini CLI · OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: voiceclaw
description: "Local voice I/O for OpenClaw agents. Transcribe inbound audio/voice messages using local Whisper (whisper.cpp) and generate voice replies using local Piper TTS. Requires whisper, piper, and ffmpeg pre-installed on the system. All inference runs on-device — no network calls, no cloud APIs, no API keys. Use when an agent receives a voice/audio message and should respond in both voice and text, or when any text response should be synthesized and sent as audio. Triggers on: voice messages, audio attachments, respond in voice, send as audio, speak this, voiceclaw."
metadata:
  {
    "openclaw":
      {
        "requires": { "bins": ["whisper", "piper", "ffmpeg"] },
        "network": "none",
        "env":
          [
            { "name": "WHISPER_BIN", "description": "Path to whisper binary (default: auto-detected via which)" },
            { "name": "WHISPER_MODEL", "description": "Path to ggml-base.en.bin model file (default: ~/.cache/whisper/ggml-base.en.bin)" },
            { "name": "PIPER_BIN", "description": "Path to piper binary (default: auto-detected via which)" },
            { "name": "VOICECLAW_VOICES_DIR", "description": "Path to directory containing .onnx voice model files (default: ~/.local/share/piper/voices)" }
          ]
      }
  }
---

# VoiceClaw

Local-only voice I/O for OpenClaw agents.

- **STT:** `transcribe.sh` — converts audio to text via local Whisper binary
- **TTS:** `speak.sh` — converts text to speech via local Piper binary
- **Network calls: none** — both scripts run fully offline
- **No cloud APIs, no API keys required**

---

## Prerequisites

The following must be installed on the system before using this skill:

| Requirement | Purpose |
|---|---|
| `whisper` binary | Speech-to-text inference |
| `ggml-base.en.bin` model file | Whisper STT model |
| `piper` binary | Text-to-speech synthesis |
| `*.onnx` voice model files | Piper TTS voices |
| `ffmpeg` | Audio format conversion |

See **README.md** for installation and setup instructions.

---

## Environment Variables

| Variable | Default | Purpose |
|---|---|---|
| `WHISPER_BIN` | auto-detected via `which` | Path to whisper binary |
| `WHISPER_MODEL` | `~/.cache/whisper/ggml-base.en.bin` | Path to Whisper model file |
| `PIPER_BIN` | auto-detected via `which` | Path to piper binary |
| `VOICECLAW_VOICES_DIR` | `~/.local/share/piper/voices` | Directory containing `.onnx` voice model files |
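
Both scripts resolve these variables with plain `${VAR:-default}` parameter expansion. The sketch below is illustrative only (it is not part of the skill) and simply prints the paths that would be in effect on the current machine:

```shell
# Illustrative only — resolve the same defaults the skill scripts use
WHISPER_MODEL="${WHISPER_MODEL:-$HOME/.cache/whisper/ggml-base.en.bin}"
VOICES_DIR="${VOICECLAW_VOICES_DIR:-$HOME/.local/share/piper/voices}"
echo "Whisper model: $WHISPER_MODEL"
echo "Piper voices:  $VOICES_DIR"
```

Exporting either variable before invoking the scripts overrides the default without editing any files.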

---

## Verify Setup

```bash
which whisper && echo "STT binary: OK"
which piper   && echo "TTS binary: OK"
which ffmpeg  && echo "ffmpeg: OK"
ls "${WHISPER_MODEL:-$HOME/.cache/whisper/ggml-base.en.bin}" && echo "STT model: OK"
ls "${VOICECLAW_VOICES_DIR:-$HOME/.local/share/piper/voices}"/*.onnx 2>/dev/null | head -1 && echo "TTS voices: OK"
```

---

## Inbound Voice: Transcribe

```bash
# Transcribe audio → text (supports ogg, mp3, m4a, wav, flac)
TRANSCRIPT=$(bash scripts/transcribe.sh /path/to/audio.ogg)
```

Override model path:
```bash
WHISPER_MODEL=/path/to/ggml-base.en.bin bash scripts/transcribe.sh audio.ogg
```

---

## Outbound Voice: Speak

```bash
# Step 1: Generate WAV (local Piper — no network)
WAV=$(bash scripts/speak.sh "Your response here." /tmp/reply.wav en_US-lessac-medium)

# Step 2: Convert to OGG Opus (Telegram voice requirement)
ffmpeg -i "$WAV" -c:a libopus -b:a 32k /tmp/reply.ogg -y -loglevel error

# Step 3: Send via message tool (filePath=/tmp/reply.ogg)
```

Override voice directory:
```bash
VOICECLAW_VOICES_DIR=/path/to/voices bash scripts/speak.sh "Hello." /tmp/reply.wav
```

---

## Available Voices

| Voice | Style |
|---|---|
| `en_US-lessac-medium` | Neutral American (default) |
| `en_US-amy-medium` | Warm American female |
| `en_US-joe-medium` | American male |
| `en_US-kusal-medium` | Expressive American male |
| `en_US-danny-low` | Deep American male (fast) |
| `en_GB-alba-medium` | British female |
| `en_GB-northern_english_male-medium` | Northern British male |
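
A voice ID from the table is passed as the third argument to `speak.sh`, which sanitizes it and maps it to a model/config file pair. The mapping can be sketched as follows (illustrative, mirroring the script's logic rather than invoking it):

```shell
# Illustrative sketch of speak.sh's voice → model-file mapping
VOICES_DIR="${VOICECLAW_VOICES_DIR:-$HOME/.local/share/piper/voices}"
VOICE="en_GB-alba-medium"                             # third argument to speak.sh
SAFE=$(printf '%s' "$VOICE" | tr -cd 'a-zA-Z0-9_-')   # strip unsafe characters
echo "model:  $VOICES_DIR/$SAFE.onnx"
echo "config: $VOICES_DIR/$SAFE.onnx.json"
```

Because `tr -cd` deletes every character outside the allowed set, a hostile value like `../evil` collapses to `evil` and cannot escape the voices directory.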

---

## Agent Behavior Rules

1. **Voice in → Voice + Text out.** Always respond with both a voice reply and a text reply when a voice message is received.
2. **Include the transcript.** Show *"🎙️ I heard: [transcript]"* at the top of every text reply to a voice message.
3. **Keep voice responses concise.** Piper TTS works best under ~200 words — summarize for audio, include full detail in text.
4. **Local only.** Never use a cloud TTS/STT API. Only the local `whisper` and `piper` binaries.
5. **Send voice before text.** Send the audio file first, then follow with the text reply.
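
Rule 3 can be enforced mechanically before calling `speak.sh`. The skill itself does not ship such a helper; this is a hedged sketch that trims the TTS input to its first 200 words while the text reply keeps the full detail:

```shell
# Sketch only — cap the TTS input at 200 words (rule 3); send full text separately
RESPONSE="Deployment complete. All checks passed."   # full text reply
TTS_TEXT=$(printf '%s\n' "$RESPONSE" \
  | tr -s '[:space:]' '\n' \
  | head -n 200 \
  | tr '\n' ' ')
echo "$TTS_TEXT"
```

The pipeline splits the reply into one word per line, keeps the first 200, and joins them back with spaces; short replies pass through unchanged.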

---

## Full Example

```bash
# 1. Transcribe inbound voice message
TRANSCRIPT=$(bash path/to/voiceclaw/scripts/transcribe.sh /path/to/voice.ogg)

# 2. Compose reply and generate audio
RESPONSE="Deployment complete. All checks passed."
WAV=$(bash path/to/voiceclaw/scripts/speak.sh "$RESPONSE" /tmp/reply_$$.wav)
ffmpeg -i "$WAV" -c:a libopus -b:a 32k /tmp/reply_$$.ogg -y -loglevel error

# 3. Send voice + text
# message(action=send, filePath=/tmp/reply_$$.ogg, ...)
# reply: "🎙️ I heard: $TRANSCRIPT\n\n$RESPONSE"
```

---

## Troubleshooting

| Issue | Fix |
|---|---|
| `whisper: command not found` | Ensure whisper binary is installed and in PATH |
| Whisper model not found | Set `WHISPER_MODEL=/path/to/ggml-base.en.bin` |
| `piper: command not found` | Ensure piper binary is installed and in PATH |
| Voice model missing | Set `VOICECLAW_VOICES_DIR=/path/to/voices/` |
| OGG won't play on Telegram | Ensure `-c:a libopus` flag in ffmpeg command |


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### scripts/transcribe.sh

```bash
#!/usr/bin/env bash
# transcribe.sh — Convert audio to text using local Whisper (whisper.cpp)
# Usage: transcribe.sh <audio_file> [model_path]
# Output: prints transcript to stdout
# Supports: ogg, mp3, m4a, wav, flac (auto-converts to wav via ffmpeg)
#
# Environment variables:
#   WHISPER_BIN   — path to whisper binary (default: auto-detected via `which`)
#   WHISPER_MODEL — path to ggml model file (default: ~/.cache/whisper/ggml-base.en.bin)

set -euo pipefail

AUDIO_FILE="${1:-}"
MODEL="${2:-${WHISPER_MODEL:-$HOME/.cache/whisper/ggml-base.en.bin}}"
WHISPER_BIN="${WHISPER_BIN:-$(which whisper 2>/dev/null || echo whisper)}"
TMP_WAV="/tmp/voiceclaw_stt_$$.wav"

if [[ -z "$AUDIO_FILE" ]]; then
  echo "Usage: transcribe.sh <audio_file> [model_path]" >&2
  echo "  Env: WHISPER_BIN, WHISPER_MODEL" >&2
  exit 1
fi

if [[ ! -f "$AUDIO_FILE" ]]; then
  echo "Error: file not found: $AUDIO_FILE" >&2
  exit 1
fi

if [[ ! -f "$MODEL" ]]; then
  echo "Error: Whisper model not found: $MODEL" >&2
  echo "Set WHISPER_MODEL=/path/to/ggml-base.en.bin to point to your model." >&2
  echo "See README.md for one-time model download instructions." >&2
  exit 1
fi

cleanup() { rm -f "$TMP_WAV"; }
trap cleanup EXIT

# Convert to 16kHz mono WAV (Whisper requirement) — local ffmpeg, no network
ffmpeg -i "$AUDIO_FILE" -ar 16000 -ac 1 "$TMP_WAV" -y -loglevel error

# Transcribe — local whisper binary, no network
"$WHISPER_BIN" -m "$MODEL" "$TMP_WAV" 2>/dev/null \
  | grep -E '^\[' \
  | sed 's/\[[0-9:. ->]*\]  *//' \
  | tr '\n' ' ' \
  | sed 's/^[[:space:]]*//' \
  | sed 's/[[:space:]]*$//'

echo  # final newline

```

### scripts/speak.sh

```bash
#!/usr/bin/env bash
# speak.sh — Convert text to speech using local Piper TTS
# Usage: speak.sh "text to speak" [output_file.wav] [voice]
# Output: writes WAV to output_file, prints output path to stdout
# Available voices: en_US-amy-medium, en_US-joe-medium, en_US-lessac-medium,
#                   en_US-kusal-medium, en_US-danny-low,
#                   en_GB-alba-medium, en_GB-northern_english_male-medium
#
# Environment variables:
#   PIPER_BIN           — path to piper binary (default: auto-detected via `which`)
#   VOICECLAW_VOICES_DIR — path to folder containing *.onnx voice models
#                          (default: ~/.local/share/piper/voices)

set -euo pipefail

TEXT="${1:-}"
OUTPUT="${2:-/tmp/voiceclaw_tts_$$.wav}"
VOICE="${3:-en_US-lessac-medium}"
VOICES_DIR="${VOICECLAW_VOICES_DIR:-$HOME/.local/share/piper/voices}"
PIPER_BIN="${PIPER_BIN:-$(which piper 2>/dev/null || echo piper)}"

if [[ -z "$TEXT" ]]; then
  echo "Usage: speak.sh \"text\" [output.wav] [voice]" >&2
  echo "  Env: PIPER_BIN, VOICECLAW_VOICES_DIR" >&2
  exit 1
fi

# Sanitize voice name — allow only safe characters to prevent path traversal
VOICE=$(echo "$VOICE" | tr -cd 'a-zA-Z0-9_-')
if [[ -z "$VOICE" ]]; then
  echo "Error: voice name is empty after sanitization" >&2
  exit 1
fi

MODEL="$VOICES_DIR/$VOICE.onnx"
CONFIG="$VOICES_DIR/$VOICE.onnx.json"

if [[ ! -f "$MODEL" ]]; then
  echo "Error: voice model not found: $MODEL" >&2
  echo "Set VOICECLAW_VOICES_DIR=/path/to/voices/ or install piper voices." >&2
  echo "Available voices in $VOICES_DIR:" >&2
  ls "$VOICES_DIR"/*.onnx 2>/dev/null | xargs -n1 basename | sed 's/\.onnx$//' >&2 || echo "  (none found)" >&2
  exit 1
fi

# Generate WAV — local piper binary, no network
CONFIG_ARGS=()
[[ -f "$CONFIG" ]] && CONFIG_ARGS=(-c "$CONFIG")

echo "$TEXT" | "$PIPER_BIN" -m "$MODEL" "${CONFIG_ARGS[@]}" -f "$OUTPUT" 2>/dev/null

echo "$OUTPUT"

```



---

## Skill Companion Files

> Additional files collected from the skill directory layout.

### README.md

```markdown
# 🎙️ VoiceClaw — Local Voice I/O for OpenClaw Agents

**ClawHub:** [clawhub.ai/Asif2BD/voiceclaw](https://clawhub.ai/Asif2BD/voiceclaw) · **GitHub:** [github.com/Asif2BD/VoiceClaw](https://github.com/Asif2BD/VoiceClaw)

> Created by **[M Asif Rahman](https://github.com/Asif2BD)** — Founder of [MissionDeck.ai](https://missiondeck.ai) · [xCloud](https://xcloud.host) · [WPDevelopers](https://github.com/WPDevelopers)

A local-only voice skill for [OpenClaw](https://openclaw.ai) agents. Transcribe inbound voice messages with **Whisper** and reply with synthesized speech via **Piper TTS** — no cloud, no API keys, no paid services.

---

## Install

**Option 1 — Via ClawHub** *(if ClawHub is installed)*
```bash
clawhub install voiceclaw
```

**Option 2 — Via Git** *(no ClawHub needed)*
```bash
git clone https://github.com/Asif2BD/VoiceClaw.git ~/.openclaw/custom-skills/voiceclaw
```

**Option 3 — Download release** *(manual, no tools needed)*
```bash
curl -L https://github.com/Asif2BD/VoiceClaw/releases/latest/download/voiceclaw.skill -o voiceclaw.skill
unzip voiceclaw.skill -d ~/.openclaw/custom-skills/voiceclaw
```

> After any install method, restart OpenClaw for the skill to be detected.

---

## Requirements

- `whisper` — whisper.cpp binary ([install guide](https://github.com/ggerganov/whisper.cpp))
- Whisper model: `ggml-base.en.bin` — auto-downloaded on first use, or manually:
  ```bash
  # One-time setup only — not run by the skill scripts
  mkdir -p ~/.cache/whisper
  curl -L -o ~/.cache/whisper/ggml-base.en.bin \
    https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin
  ```
  > Set `WHISPER_MODEL=/path/to/ggml-base.en.bin` if your model is stored elsewhere.
- `piper` — TTS binary + `.onnx` voice models ([install guide](https://github.com/rhasspy/piper)). Set `VOICECLAW_VOICES_DIR=/path/to/voices/` to point to your voice models (default: `~/.local/share/piper/voices/`)
- `ffmpeg` — for audio format conversion

---

## What it does

- **Speech-to-Text**: Inbound voice/audio (OGG, MP3, WAV, M4A) → transcript text via Whisper.cpp
- **Text-to-Speech**: Agent text replies → voice audio via Piper TTS (7 English voices)
- **Agent behavior**: When a voice message arrives, the agent automatically responds with both voice + text
- **100% local**: No data sent anywhere — all inference runs on your server

---

## How it works

```
Inbound voice message (OGG/MP3/WAV)
        ↓
  ffmpeg → 16kHz mono WAV
        ↓
  whisper.cpp → transcript text
        ↓
  Agent reads transcript, composes reply
        ↓
  Piper TTS → WAV → OGG Opus
        ↓
  Voice reply + text transcript sent together
```

---

## Usage

```bash
# Transcribe a voice message
bash scripts/transcribe.sh /path/to/voice.ogg

# Generate a voice reply (returns path to WAV)
bash scripts/speak.sh "Your task is complete." /tmp/reply.wav

# Convert WAV → OGG Opus for Telegram
ffmpeg -i /tmp/reply.wav -c:a libopus -b:a 32k /tmp/reply.ogg -y
```

---

## Available Voices

| Voice ID | Style |
|---|---|
| `en_US-lessac-medium` | Neutral American (default) |
| `en_US-amy-medium` | Warm American female |
| `en_US-joe-medium` | American male |
| `en_US-kusal-medium` | Expressive American male |
| `en_US-danny-low` | Deep American male (fast) |
| `en_GB-alba-medium` | British female |
| `en_GB-northern_english_male-medium` | Northern British male |

Voice models live in the directory pointed to by `VOICECLAW_VOICES_DIR` (default: `~/.local/share/piper/voices/`). See `SKILL.md` for full agent integration instructions.

---

## Security

- **All processing is local** — no audio or text is sent to any cloud service or external API
- **Temp files are cleaned up** — audio is converted in `/tmp` and deleted immediately after transcription (bash `trap` on EXIT)
- **Voice names are sanitized** — input stripped to `[a-zA-Z0-9_-]` only, preventing path traversal
- **No network calls** — neither script makes any network request
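
The temp-file cleanup can be seen in isolation with a small sketch (not part of the skill): a `trap` on EXIT removes the temp file even if the work in between fails. The subshell here makes the trap fire as soon as the demo work is done:

```shell
# Sketch of the trap-on-EXIT cleanup pattern used by transcribe.sh
(
  TMP="/tmp/voiceclaw_demo_$$.wav"
  trap 'rm -f "$TMP"' EXIT
  touch "$TMP"
  # ... audio conversion would happen here ...
)
# after the subshell exits, the trap has already deleted the file
[ -f "/tmp/voiceclaw_demo_$$.wav" ] && echo "still there" || echo "cleaned up"
```

In `transcribe.sh` the trap is set on the script itself rather than a subshell, so cleanup also runs when `ffmpeg` or `whisper` exits with an error under `set -e`.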

---

## Author

**[M Asif Rahman](https://github.com/Asif2BD)**
- 🌐 [asif.im](https://asif.im)
- 🐙 [github.com/Asif2BD](https://github.com/Asif2BD)
- 🔑 [clawhub.ai/Asif2BD](https://clawhub.ai/Asif2BD)

---

## License

MIT © 2026 [M Asif Rahman](https://github.com/Asif2BD)

```

### _meta.json

```json
{
  "owner": "asif2bd",
  "slug": "voiceclaw",
  "displayName": "VoiceClaw",
  "latest": {
    "version": "1.0.6",
    "publishedAt": 1772203550157,
    "commit": "https://github.com/openclaw/skills/commit/d9fd3726c478264910b49beed20658e6ba57e263"
  },
  "history": []
}

```
