zhipu-asr

Automatic Speech Recognition (ASR) using Zhipu AI (BigModel) GLM-ASR model. Use when you need to transcribe audio files to text. Supports Chinese audio transcription with context prompts, custom hotwords, and multiple audio formats.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars: 3,084.

Updated: March 20, 2026.

Overall rating: C (5.0).

Composite score: 5.0.

Best-practice grade: B (81.2).

Install command

npx @skill-hub/cli install openclaw-skills-zhipu-asr

Repository

openclaw/skills

Skill path: skills/franklu0819-lang/zhipu-asr

Best for

Primary workflow: Analyze Data & AI.

Technical facets: Full Stack, Data / AI.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: openclaw.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

Install zhipu-asr into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
Review https://github.com/openclaw/skills before adding zhipu-asr to shared team environments
Use zhipu-asr for development workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: zhipu-asr
description: Automatic Speech Recognition (ASR) using Zhipu AI (BigModel) GLM-ASR model. Use when you need to transcribe audio files to text. Supports Chinese audio transcription with context prompts, custom hotwords, and multiple audio formats.
metadata:
  {
    "openclaw":
      {
        "requires": { "bins": ["jq", "curl", "ffmpeg"], "env": ["ZHIPU_API_KEY"] },
      },
  }
---

# Zhipu AI Automatic Speech Recognition (ASR)

Transcribe Chinese audio files to text using Zhipu AI's GLM-ASR model.

## Setup

**1. Get your API Key:**
Get a key from [Zhipu AI Console](https://bigmodel.cn/usercenter/proj-mgmt/apikeys)

**2. Set it in your environment:**
```bash
export ZHIPU_API_KEY="your-key-here"
```
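
Before the first run, it can help to confirm that the tools named in the skill metadata (`jq`, `curl`, `ffmpeg`) are on `PATH` and that the key is exported. A minimal sketch; the `missing_deps` helper is hypothetical and not part of the skill:

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the skill): print every command
# from the argument list that cannot be found on PATH.
missing_deps() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" > /dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Check the tools listed in the skill metadata.
missing=$(missing_deps jq curl ffmpeg)
if [ -n "$missing" ]; then
    echo "Missing required tools:" $missing >&2
fi

# Warn when the API key has not been exported.
[ -n "${ZHIPU_API_KEY:-}" ] || echo "ZHIPU_API_KEY is not set" >&2
```

The check only prints warnings, so it can sit at the top of a wrapper script without changing its exit behavior.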

## Supported Audio Formats

- **WAV** - Recommended, best quality
- **MP3** - Widely supported
- **OGG** - Auto-converted to MP3
- **M4A** - Auto-converted to MP3
- **AAC** - Auto-converted to MP3
- **FLAC** - Auto-converted to MP3
- **WMA** - Auto-converted to MP3

> **Note:** The script automatically converts unsupported formats to MP3 using ffmpeg. Only WAV and MP3 are accepted by the API, but you can use any format that ffmpeg supports.
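
The conversion can also be done by hand before calling the script. The sketch below mirrors the extension check and the ffmpeg settings used in `scripts/speech_to_text.sh` (16 kHz, mono, 64 kbit/s MP3); the function names `needs_conversion` and `convert_for_asr` are illustrative, not part of the skill:

```bash
#!/usr/bin/env bash
# Hypothetical helpers mirroring the script's extension check and ffmpeg settings.

# Succeeds (exit 0) when the file extension is anything other than wav/mp3.
needs_conversion() {
    local ext="${1##*.}"
    ext=$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')
    [ "$ext" != "wav" ] && [ "$ext" != "mp3" ]
}

# Convert any ffmpeg-readable file to 16 kHz mono MP3 at 64 kbit/s.
convert_for_asr() {
    local input="$1" output="$2"
    ffmpeg -i "$input" -acodec libmp3lame -ar 16000 -ac 1 -b:a 64k -y "$output"
}

# Usage (assumes a local voice_memo.m4a):
#   needs_conversion voice_memo.m4a && convert_for_asr voice_memo.m4a voice_memo.mp3
```

Converting ahead of time keeps the MP3 around, which avoids repeating the conversion when the same recording is transcribed more than once.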

## File Constraints

- **Maximum file size:** 25 MB
- **Maximum duration:** 30 seconds
- **Recommended sample rate:** 16000 Hz or higher
- **Audio channels:** Mono or stereo
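
The size and duration limits above can be checked before any request is sent. A minimal sketch, assuming `awk` is available for the fractional comparison; `size_ok` and `duration_ok` are hypothetical names, and unlike the bundled script this version applies the 30 second limit without a margin:

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight checks for the documented limits.
MAX_BYTES=$((25 * 1024 * 1024))   # 25 MB
MAX_SECONDS=30

# Succeeds when a size in bytes is within the 25 MB limit.
size_ok() {
    [ "$1" -le "$MAX_BYTES" ]
}

# Succeeds when a duration in seconds (possibly fractional) is within 30 s.
# awk performs the floating-point comparison; "+ 0" forces numeric context.
duration_ok() {
    awk -v d="$1" -v max="$MAX_SECONDS" 'BEGIN { exit !(d + 0 <= max + 0) }'
}

# Usage with a real file (GNU stat first, BSD stat as fallback; ffprobe ships with ffmpeg):
#   size=$(stat -c%s clip.wav 2>/dev/null || stat -f%z clip.wav)
#   dur=$(ffprobe -v error -show_entries format=duration \
#         -of default=noprint_wrappers=1:nokey=1 clip.wav)
#   size_ok "$size" && duration_ok "$dur" && echo "clip.wav is within limits"
```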

## Usage

### Basic Transcription

Transcribe an audio file with default settings:

```bash
bash scripts/speech_to_text.sh recording.wav
```

### Transcription with Context

Provide previous transcription or context for better accuracy:

```bash
bash scripts/speech_to_text.sh recording.wav "这是之前的转录内容，有助于提高准确性"
```

### Transcription with Hotwords

Use custom vocabulary to improve recognition of specific terms:

```bash
bash scripts/speech_to_text.sh recording.mp3 "" "人名,地名,专业术语,公司名称"
```

### Full Options

Combine context and hotwords:

```bash
bash scripts/speech_to_text.sh recording.wav "会议记录片段" "张三,李四,项目名称"
```

**Parameters:**
- `audio_file` (required): Path to the audio file. `.wav` and `.mp3` files are sent as-is; other formats are converted to MP3 with ffmpeg first
- `prompt` (optional): Previous transcription or context text (max 8000 chars)
- `hotwords` (optional): Comma-separated list of specific terms (max 100 terms)

## Features

### Context Prompts

**Why use context prompts:**
- Improves accuracy in long conversations
- Helps with domain-specific terminology
- Maintains consistency across multiple segments

**When to use:**
- Multi-part conversations or meetings
- Technical or specialized content
- Continuing from previous transcriptions

**Example:**
```bash
bash scripts/speech_to_text.sh part2.wav "第一部分的转录内容：讨论了项目进展和下一步计划"
```

### Hotwords

**What are hotwords:**
Custom vocabulary list that boosts recognition accuracy for specific terms.

**Best use cases:**
- Proper names (people, places)
- Domain-specific terminology
- Company names and products
- Technical jargon
- Industry-specific terms

**Examples:**
```bash
# Medical transcription
bash scripts/speech_to_text.sh medical.wav "" "患者,症状,诊断,治疗方案"

# Business meeting
bash scripts/speech_to_text.sh meeting.wav "" "张经理,李总,项目代号,预算"

# Tech discussion
bash scripts/speech_to_text.sh tech.wav "" "API,数据库,算法,框架"
```

## Workflow Examples

### Transcribe a Meeting

```bash
# Part 1
bash scripts/speech_to_text.sh meeting_part1.wav

# Part 2 with context
bash scripts/speech_to_text.sh meeting_part2.wav "第一部分讨论了项目进度" "张总,李经理,项目名称"

# Part 3 with context
bash scripts/speech_to_text.sh meeting_part3.wav "前两部分讨论了项目进度和预算" "张总,李经理,项目名称"
```

### Transcribe a Lecture

```bash
bash scripts/speech_to_text.sh lecture.wav "" "教授,课程名称,专业术语1,专业术语2"
```

### Process Multiple Files

```bash
for file in recording_*.wav; do
    bash scripts/speech_to_text.sh "$file"
done
```
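
When the files are consecutive segments of one recording, each transcript can be passed as the context prompt for the next segment. The sketch below assumes the script prints its JSON result on stdout (diagnostics go to stderr) and that `jq` is installed; `trim_context` and `transcribe_chain` are hypothetical helpers:

```bash
#!/usr/bin/env bash
# Hypothetical helpers for chaining context between segments.
MAX_PROMPT_CHARS=8000

# Print at most the last MAX_PROMPT_CHARS characters of the given text.
trim_context() {
    local text="$1"
    if [ "${#text}" -gt "$MAX_PROMPT_CHARS" ]; then
        text="${text: -$MAX_PROMPT_CHARS}"
    fi
    printf '%s' "$text"
}

# Transcribe files in order; each result is appended to the context
# that is passed as the prompt for the next file. Transcripts go to stdout.
transcribe_chain() {
    local hotwords="$1"; shift
    local context="" file text
    for file in "$@"; do
        text=$(bash scripts/speech_to_text.sh "$file" "$context" "$hotwords" 2>/dev/null \
            | jq -r '.text // empty')
        printf '%s\n' "$text"
        context=$(trim_context "${context}${text}")
    done
}

# Usage:
#   transcribe_chain "张总,李经理,项目名称" meeting_part1.wav meeting_part2.wav meeting_part3.wav
```

Keeping only the tail of the accumulated text keeps the prompt under the 8000-character limit while retaining the most recent context.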

## Audio Quality Tips

**Best practices for accurate transcription:**

1. **Clear audio source**
   - Minimize background noise
   - Use good quality microphone
   - Speak clearly and at moderate pace

2. **Optimal audio settings**
   - Sample rate: 16000 Hz or higher
   - Bit depth: 16-bit or higher
   - Single channel (mono) is sufficient

3. **File preparation**
   - Remove silence from beginning/end
   - Normalize audio levels
   - Ensure consistent volume

## Output Format

The script outputs JSON with:
- `id`: Task ID
- `created`: Request timestamp (Unix timestamp)
- `request_id`: Unique request identifier
- `model`: Model name used
- `text`: Transcribed text

Example output:
```json
{
  "id": "task-12345",
  "created": 1234567890,
  "request_id": "req-abc123",
  "model": "glm-asr-2512",
  "text": "你好，这是转录的文本内容"
}
```

## Troubleshooting

**File Size Issues:**
- Split audio files larger than 25 MB
- Reduce sample rate or bit depth
- Use compression (MP3) for smaller files

**Duration Issues:**
- Split recordings longer than 30 seconds
- Process segments separately
- Use context prompts to maintain continuity
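
One way to split a long recording is ffmpeg's segment muxer. The sketch below is an assumption about a reasonable workflow, not something the skill ships: it uses 25-second pieces to leave headroom under the 30-second limit and stream-copies the audio, which assumes a WAV input. The names `segment_count` and `split_audio` are hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helpers for splitting a long recording.

# Print how many segments are needed: ceil(total / segment), in whole seconds.
segment_count() {
    local total="$1" seg="$2"
    echo $(( (total + seg - 1) / seg ))
}

# Split the input into fixed-length pieces with ffmpeg's segment muxer.
# Output files are named <prefix>_000.wav, <prefix>_001.wav, ...
split_audio() {
    local input="$1" prefix="$2" seg="${3:-25}"
    ffmpeg -i "$input" -f segment -segment_time "$seg" -c copy "${prefix}_%03d.wav"
}

# Usage (assumes a WAV input):
#   split_audio interview.wav interview_part
```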

**Poor Accuracy:**
- Improve audio quality
- Use hotwords for specific terms
- Provide context prompts
- Ensure clear speech and minimal noise

**Format Issues:**
- Send .wav or .mp3 directly; other formats require ffmpeg for the automatic conversion
- Check file is not corrupted
- Verify audio can be played by standard players

## Limitations

- Maximum audio duration: 30 seconds per request
- File size limit: 25 MB
- Maximum hotwords: 100 terms
- Context prompt limit: 8000 characters
- Best performance with Chinese language audio

## Performance Notes

- Typical transcription time: 1-3 seconds
- Real-time or faster for most audio
- Processing time scales with audio quality and length


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### scripts/speech_to_text.sh

```bash
#!/bin/bash
# Zhipu AI Speech-to-Text Script
# Usage: ./speech_to_text.sh <audio_file> [prompt] [hotwords]

set -e

# Configuration
API_ENDPOINT="https://open.bigmodel.cn/api/paas/v4/audio/transcriptions"

# Get API key from environment
if [ -z "$ZHIPU_API_KEY" ]; then
    echo "Error: ZHIPU_API_KEY environment variable is not set" >&2
    echo "" >&2
    echo "To fix:" >&2
    echo "1. Get a key from https://bigmodel.cn/usercenter/proj-mgmt/apikeys" >&2
    echo "2. Run: export ZHIPU_API_KEY=\"your-key\"" >&2
    exit 1
fi

# Parse arguments
AUDIO_FILE="$1"
PROMPT="$2"
HOTWORDS="$3"

# Validate audio file
if [ -z "$AUDIO_FILE" ]; then
    echo "Usage: $0 <audio_file> [prompt] [hotwords]" >&2
    echo "" >&2
    echo "Examples:" >&2
    echo "  $0 recording.wav" >&2
    echo "  $0 recording.wav \"这是之前的转录内容\"" >&2
    echo "  $0 recording.mp3 \"\" \"人名,地名,专业术语\"" >&2
    echo "" >&2
    echo "Supported formats: .wav, .mp3, .ogg, .m4a, .aac, .flac, .wma" >&2
    echo "Max file size: 25 MB" >&2
    echo "Max duration: 30 seconds" >&2
    exit 1
fi

# Check if file exists
if [ ! -f "$AUDIO_FILE" ]; then
    echo "Error: Audio file not found: $AUDIO_FILE" >&2
    exit 1
fi

# Auto-convert audio format if needed
ORIGINAL_FILE="$AUDIO_FILE"
FILE_EXT="${AUDIO_FILE##*.}"
FILE_EXT=$(echo "$FILE_EXT" | tr '[:upper:]' '[:lower:]')

# Supported formats by API: wav, mp3
# Convert other formats to mp3
if [ "$FILE_EXT" != "wav" ] && [ "$FILE_EXT" != "mp3" ]; then
    echo "Converting audio format: $FILE_EXT → mp3" >&2

    # Check if ffmpeg is available
    if ! command -v ffmpeg &> /dev/null; then
        echo "Error: ffmpeg is required for format conversion" >&2
        echo "Install with: yum install -y ffmpeg" >&2
        exit 1
    fi

    # Create a temp directory for the converted audio
    # (mktemp -d is portable; mktemp --suffix is GNU-only)
    TEMP_DIR=$(mktemp -d)
    TEMP_AUDIO="$TEMP_DIR/converted.mp3"

    # Register cleanup before converting so the temp directory is
    # removed even if ffmpeg fails
    cleanup() {
        rm -rf "$TEMP_DIR"
    }
    trap cleanup EXIT

    # Convert to mp3 with optimal settings for ASR
    # Using libmp3lame for compatibility. ar=16000, ac=1 (mono) as per Zhipu best practices.
    if ! ffmpeg -i "$AUDIO_FILE" \
        -acodec libmp3lame \
        -ar 16000 \
        -ac 1 \
        -b:a 64k \
        -y \
        "$TEMP_AUDIO" 2>/dev/null; then
        echo "Error: ffmpeg could not convert $ORIGINAL_FILE" >&2
        exit 1
    fi

    # Replace AUDIO_FILE with converted file
    AUDIO_FILE="$TEMP_AUDIO"

    echo "Conversion complete: $TEMP_AUDIO" >&2
    echo "" >&2
fi

# Check file size (25 MB limit)
FILE_SIZE=$(stat -c%s "$AUDIO_FILE" 2>/dev/null || stat -f%z "$AUDIO_FILE" 2>/dev/null || echo "0")
MAX_SIZE=$((25 * 1024 * 1024))
if [ "$FILE_SIZE" -gt "$MAX_SIZE" ]; then
    echo "Error: File size exceeds 25 MB limit" >&2
    echo "Current size: $(($FILE_SIZE / 1024 / 1024)) MB" >&2
    exit 1
fi

# Check duration if ffprobe is available
if command -v ffprobe &> /dev/null; then
    # "|| true" keeps set -e from aborting when ffprobe cannot report a duration
    DURATION=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "$AUDIO_FILE" || true)
    # Compare with 30s limit (small margin for rounding); awk avoids a dependency on bc
    if [ -n "$DURATION" ] && awk -v d="$DURATION" 'BEGIN { exit !(d + 0 > 30.5) }'; then
        echo "Error: Audio duration ($DURATION s) exceeds 30 second limit" >&2
        exit 1
    fi
fi

# The request is sent as multipart/form-data, so the form fields
# (model, prompt, hotwords) are assembled in CURL_ARGS below.
# No JSON payload is built here.

# Make API request
echo "Transcribing audio file: $AUDIO_FILE" >&2
if [ -n "$PROMPT" ]; then
    echo "Using context prompt: $(echo "$PROMPT" | cut -c1-50)..." >&2
fi
if [ -n "$HOTWORDS" ]; then
    echo "Hotwords: $HOTWORDS" >&2
fi
echo "" >&2

# Build curl command arguments
CURL_ARGS=()
CURL_ARGS+=(-H "Authorization: Bearer $ZHIPU_API_KEY")
CURL_ARGS+=(-F "file=@$AUDIO_FILE")
CURL_ARGS+=(-F "model=glm-asr-2512")

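if [ -n "$PROMPT" ]; then
    # --form-string sends the value literally, even if it starts with @ or <
    CURL_ARGS+=(--form-string "prompt=$PROMPT")
fi

if [ -n "$HOTWORDS" ]; then
    # Send each comma-separated term as its own hotwords[] form field
    IFS=',' read -ra HW_ARRAY <<< "$HOTWORDS"
    for word in "${HW_ARRAY[@]}"; do
        # Trim surrounding whitespace in pure bash (xargs fails on terms containing quotes)
        word="${word#"${word%%[![:space:]]*}"}"
        word="${word%"${word##*[![:space:]]}"}"
        CURL_ARGS+=(--form-string "hotwords[]=$word")
    done
fi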

RESPONSE=$(curl -s -X POST "$API_ENDPOINT" "${CURL_ARGS[@]}")

# Check for errors
if echo "$RESPONSE" | jq -e '.error' > /dev/null 2>&1; then
    ERROR_MSG=$(echo "$RESPONSE" | jq -r '.error.message // .error')
    echo "Error: $ERROR_MSG" >&2
    exit 1
fi

# Extract and display result
echo "$RESPONSE" | jq '.'

TRANSCRIBED_TEXT=$(echo "$RESPONSE" | jq -r '.text // empty')

if [ -n "$TRANSCRIBED_TEXT" ]; then
    echo "" >&2
    echo "Transcribed text:" >&2
    echo "$TRANSCRIBED_TEXT" >&2
fi

```



---

## Skill Companion Files

> Additional files collected from the skill directory layout.

### README.md

```markdown
# Zhipu AI ASR Skill

Automatic Speech Recognition (ASR) using Zhipu AI (BigModel) GLM-ASR model. Transcribe Chinese audio files to text with high accuracy.

## Features

- 🎤 **Multiple Audio Formats**: WAV, MP3, OGG, M4A, AAC, FLAC, WMA
- 🇨🇳 **Chinese Language Support**: Optimized for Mandarin Chinese
- 📝 **Context Prompts**: Improve accuracy with previous transcription context
- 🔥 **Hotwords**: Custom vocabulary for specific terms (names, jargon, etc.)
- ⚡ **Fast Processing**: Real-time or faster transcription speed
- 🔄 **Auto Format Conversion**: Automatically converts unsupported formats to MP3

## Requirements

- `jq` - JSON processor
- `curl` - HTTP client used for the API request
- `ffmpeg` - Audio format conversion
- `ZHIPU_API_KEY` environment variable

## Quick Start

```bash
# Install dependencies (if needed)
sudo apt-get install jq ffmpeg

# Set your API key
export ZHIPU_API_KEY="your-key-here"

# Transcribe an audio file
bash scripts/speech_to_text.sh recording.wav

# With context and hotwords
bash scripts/speech_to_text.sh recording.wav "previous context" "term1,term2,term3"
```

## File Constraints

- **Max file size**: 25 MB
- **Max duration**: 30 seconds
- **Supported formats**: WAV (recommended), MP3
- **Other formats**: Auto-converted to MP3

## Use Cases

- 🎙️ Meeting transcription
- 📚 Lecture recording
- 💼 Voice memos
- 🎞️ Video subtitle generation
- 📞 Call recording transcription

## Author

franklu0819-lang

## License

MIT

```

### _meta.json

```json
{
  "owner": "franklu0819-lang",
  "slug": "zhipu-asr",
  "displayName": "Zhipu Asr",
  "latest": {
    "version": "1.0.2",
    "publishedAt": 1773357542254,
    "commit": "https://github.com/openclaw/skills/commit/034e746f7a6216acf95437cc0d7f6900755c6efb"
  },
  "history": [
    {
      "version": "1.0.1",
      "publishedAt": 1771684631424,
      "commit": "https://github.com/openclaw/skills/commit/983669383df4852cf5f78deae1f4619951f1ed59"
    }
  ]
}

```