zhipu-asr
Automatic Speech Recognition (ASR) using Zhipu AI (BigModel) GLM-ASR model. Use when you need to transcribe audio files to text. Supports Chinese audio transcription with context prompts, custom hotwords, and multiple audio formats.
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install openclaw-skills-zhipu-asr
Repository
Skill path: skills/franklu0819-lang/zhipu-asr
Best for
Primary workflow: Analyze Data & AI.
Technical facets: Full Stack, Data / AI.
Target audience: everyone.
License: Unknown (the catalog lists none; the bundled README.md states MIT).
Original source
Catalog source: SkillHub Club.
Repository owner: openclaw.
This is still a mirrored public skill entry. Review the repository before installing into production workflows.
What it helps with
- Install zhipu-asr into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/openclaw/skills before adding zhipu-asr to shared team environments
- Use zhipu-asr for development workflows
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: zhipu-asr
description: Automatic Speech Recognition (ASR) using Zhipu AI (BigModel) GLM-ASR model. Use when you need to transcribe audio files to text. Supports Chinese audio transcription with context prompts, custom hotwords, and multiple audio formats.
metadata:
  {
    "openclaw":
      {
        "requires": { "bins": ["jq", "curl", "ffmpeg"], "env": ["ZHIPU_API_KEY"] },
      },
  }
---
# Zhipu AI Automatic Speech Recognition (ASR)
Transcribe Chinese audio files to text using Zhipu AI's GLM-ASR model.
## Setup
**1. Get your API Key:**
Get a key from [Zhipu AI Console](https://bigmodel.cn/usercenter/proj-mgmt/apikeys)
**2. Set it in your environment:**
```bash
export ZHIPU_API_KEY="your-key-here"
```
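Before the first run, it can help to confirm that the tools named in the skill metadata (`jq`, `curl`, `ffmpeg`) and the API key are in place. The following is a minimal sketch; `missing_bins` and `check_asr_prereqs` are illustrative names and are not part of the skill.

```bash
# List any of the given commands that are not on PATH, one per line.
# `missing_bins` is a hypothetical helper, not part of the skill.
missing_bins() {
    for bin in "$@"; do
        command -v "$bin" >/dev/null 2>&1 || printf '%s\n' "$bin"
    done
    return 0
}

# Report missing prerequisites; succeeds only when everything is present.
check_asr_prereqs() {
    missing=$(missing_bins jq curl ffmpeg)
    if [ -n "$missing" ]; then
        # Unquoted on purpose so the names are joined on one line.
        echo "Missing tools:" $missing >&2
    fi
    if [ -z "${ZHIPU_API_KEY:-}" ]; then
        echo "ZHIPU_API_KEY is not set" >&2
    fi
    [ -z "$missing" ] && [ -n "${ZHIPU_API_KEY:-}" ]
}
```

Running `check_asr_prereqs && echo "ready"` prints `ready` only when all three tools and the key are available.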
## Supported Audio Formats
- **WAV** - Recommended, best quality
- **MP3** - Widely supported
- **OGG** - Auto-converted to MP3
- **M4A** - Auto-converted to MP3
- **AAC** - Auto-converted to MP3
- **FLAC** - Auto-converted to MP3
- **WMA** - Auto-converted to MP3
> **Note:** The API accepts only WAV and MP3. For any other format that ffmpeg can read, the script converts the file to MP3 before uploading it.
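
The conversion decision is based on the file extension alone. A minimal sketch of that check, plus a manual conversion with the same ffmpeg settings the bundled script uses, might look like this. The function names `needs_conversion` and `to_asr_mp3` are hypothetical.

```bash
# Succeed (exit 0) when the extension is neither wav nor mp3, i.e. when
# the script would convert the file before upload.
# `needs_conversion` is a hypothetical helper name.
needs_conversion() {
    ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
    [ "$ext" != "wav" ] && [ "$ext" != "mp3" ]
}

# Convert an ffmpeg-readable input ($1) to a 16 kHz mono MP3 at 64 kbit/s ($2),
# mirroring the settings in scripts/speech_to_text.sh.
to_asr_mp3() {
    ffmpeg -i "$1" -acodec libmp3lame -ar 16000 -ac 1 -b:a 64k -y "$2"
}
```

For example, `needs_conversion memo.m4a && to_asr_mp3 memo.m4a memo.mp3` converts ahead of time, which avoids repeating the conversion when the same file is transcribed more than once.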
## File Constraints
- **Maximum file size:** 25 MB
- **Maximum duration:** 30 seconds
- **Recommended sample rate:** 16000 Hz or higher
- **Audio channels:** Mono or stereo
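
These limits can be checked before any request is sent. The sketch below assumes GNU or BSD `stat` and `ffprobe` (shipped with ffmpeg); `within_asr_limits` and `check_audio_file` are illustrative names. Note that the bundled script tolerates up to 30.5 seconds, while this sketch applies the documented 30 seconds strictly.

```bash
# Succeed when SIZE_BYTES ($1) and DURATION_SECONDS ($2) fit the documented
# limits of 25 MB and 30 seconds. Hypothetical helper name.
within_asr_limits() {
    awk -v size="$1" -v dur="$2" 'BEGIN {
        max_size = 25 * 1024 * 1024
        exit !(size + 0 <= max_size && dur + 0 <= 30)
    }'
}

# Look up size and duration of a real file, then apply the check.
check_audio_file() {
    size=$(stat -c%s "$1" 2>/dev/null || stat -f%z "$1")
    dur=$(ffprobe -v error -show_entries format=duration \
        -of default=noprint_wrappers=1:nokey=1 "$1")
    within_asr_limits "$size" "$dur"
}
```

`check_audio_file recording.wav || echo "too large or too long"` reports a problem without calling the API.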
## Usage
### Basic Transcription
Transcribe an audio file with default settings:
```bash
bash scripts/speech_to_text.sh recording.wav
```
### Transcription with Context
Provide previous transcription or context for better accuracy:
```bash
bash scripts/speech_to_text.sh recording.wav "这是之前的转录内容,有助于提高准确性"
```
### Transcription with Hotwords
Use custom vocabulary to improve recognition of specific terms:
```bash
bash scripts/speech_to_text.sh recording.mp3 "" "人名,地名,专业术语,公司名称"
```
### Full Options
Combine context and hotwords:
```bash
bash scripts/speech_to_text.sh recording.wav "会议记录片段" "张三,李四,项目名称"
```
**Parameters:**
- `audio_file` (required): Path to audio file (.wav and .mp3 are sent as-is; other formats are converted to MP3 with ffmpeg)
- `prompt` (optional): Previous transcription or context text (max 8000 chars)
- `hotwords` (optional): Comma-separated list of specific terms (max 100 words)
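
The bundled script does not check the prompt length or the number of hotwords itself, so a caller may want to validate them first. This is a minimal sketch with hypothetical helper names; it assumes the limits count characters and comma-separated terms respectively.

```bash
# Count comma-separated hotwords, ignoring empty entries.
count_hotwords() {
    printf '%s' "$1" | tr ',' '\n' | awk 'NF { n++ } END { print n + 0 }'
}

# Succeed when the prompt ($1) and hotwords ($2) fit the documented limits.
# ${#1} counts characters in a UTF-8 locale and bytes in the C locale.
valid_asr_args() {
    [ "${#1}" -le 8000 ] && [ "$(count_hotwords "$2")" -le 100 ]
}
```

A call such as `valid_asr_args "$prompt" "$hotwords" || echo "arguments exceed limits" >&2` can guard the transcription command.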
## Features
### Context Prompts
**Why use context prompts:**
- Improves accuracy in long conversations
- Helps with domain-specific terminology
- Maintains consistency across multiple segments
**When to use:**
- Multi-part conversations or meetings
- Technical or specialized content
- Continuing from previous transcriptions
**Example:**
```bash
bash scripts/speech_to_text.sh part2.wav "第一部分的转录内容:讨论了项目进展和下一步计划"
```
### Hotwords
**What are hotwords:**
Custom vocabulary list that boosts recognition accuracy for specific terms.
**Best use cases:**
- Proper names (people, places)
- Domain-specific terminology
- Company names and products
- Technical jargon
- Industry-specific terms
**Examples:**
```bash
# Medical transcription
bash scripts/speech_to_text.sh medical.wav "" "患者,症状,诊断,治疗方案"
# Business meeting
bash scripts/speech_to_text.sh meeting.wav "" "张经理,李总,项目代号,预算"
# Tech discussion
bash scripts/speech_to_text.sh tech.wav "" "API,数据库,算法,框架"
```
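When a vocabulary list is maintained in a text file, it can be turned into the comma-separated argument with standard tools. The sketch below assumes one term per line; `hotwords_from_file` and `terms.txt` are hypothetical names.

```bash
# Turn a file with one term per line into the comma-separated string the
# script expects: blank lines and duplicates dropped, capped at 100 terms.
# `hotwords_from_file` is a hypothetical helper name.
hotwords_from_file() {
    awk 'NF && !seen[$0]++' "$1" | head -n 100 | paste -sd, -
}
```

Usage: `bash scripts/speech_to_text.sh medical.wav "" "$(hotwords_from_file terms.txt)"`.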
## Workflow Examples
### Transcribe a Meeting
```bash
# Part 1
bash scripts/speech_to_text.sh meeting_part1.wav
# Part 2 with context
bash scripts/speech_to_text.sh meeting_part2.wav "第一部分讨论了项目进度" "张总,李经理,项目名称"
# Part 3 with context
bash scripts/speech_to_text.sh meeting_part3.wav "前两部分讨论了项目进度和预算" "张总,李经理,项目名称"
```
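The manual steps above can also be chained, feeding each part's transcript into the next request as context. The following bash sketch is one possible approach, not the skill's own method: `tail_chars` and `transcribe_parts` are hypothetical helpers, and the context is trimmed to its last 8000 characters on the assumption that the most recent text is the most useful.

```bash
# Keep only the last MAX ($2) characters of TEXT ($1) so a growing transcript
# stays within the prompt limit. Requires bash for ${var:offset}.
tail_chars() {
    local text=$1 max=$2
    if [ "${#text}" -le "$max" ]; then
        printf '%s' "$text"
    else
        printf '%s' "${text:$(( ${#text} - max ))}"
    fi
}

# Transcribe parts in order, passing the accumulated transcript as context.
# Usage: transcribe_parts "hotword1,hotword2" part1.wav part2.wav ...
transcribe_parts() {
    local hotwords=$1 context="" text part
    shift
    for part in "$@"; do
        # The script prints JSON on stdout; progress messages go to stderr.
        text=$(bash scripts/speech_to_text.sh "$part" \
            "$(tail_chars "$context" 8000)" "$hotwords" | jq -r '.text // empty')
        printf '%s\n' "$text"
        context="$context$text"
    done
}
```

For the meeting above this would be `transcribe_parts "张总,李经理,项目名称" meeting_part1.wav meeting_part2.wav meeting_part3.wav`.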
### Transcribe a Lecture
```bash
bash scripts/speech_to_text.sh lecture.wav "" "教授,课程名称,专业术语1,专业术语2"
```
### Process Multiple Files
```bash
for file in recording_*.wav; do
    bash scripts/speech_to_text.sh "$file"
done
```
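For larger batches it is often convenient to save each transcript next to its audio file and to keep going when one file fails. A minimal sketch under those assumptions follows; `transcript_path` and `transcribe_batch` are hypothetical names.

```bash
# Map an audio path to a transcript path next to it:
# recordings/a.wav -> recordings/a.txt. Hypothetical helper name.
transcript_path() {
    printf '%s.txt' "${1%.*}"
}

# Transcribe every given file, saving the text and counting failures
# instead of stopping at the first error.
transcribe_batch() {
    failed=0
    for file in "$@"; do
        out=$(transcript_path "$file")
        # Capture the JSON first so the script's exit status is checked,
        # not the exit status of jq.
        if json=$(bash scripts/speech_to_text.sh "$file"); then
            printf '%s\n' "$json" | jq -r '.text // empty' > "$out"
            echo "ok: $file -> $out" >&2
        else
            echo "failed: $file" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}
```

Usage: `transcribe_batch recording_*.wav`.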
## Audio Quality Tips
**Best practices for accurate transcription:**
1. **Clear audio source**
- Minimize background noise
- Use good quality microphone
- Speak clearly and at moderate pace
2. **Optimal audio settings**
- Sample rate: 16000 Hz or higher
- Bit depth: 16-bit or higher
- Single channel (mono) is sufficient
3. **File preparation**
- Remove silence from beginning/end
- Normalize audio levels
- Ensure consistent volume
## Output Format
The script outputs JSON with:
- `id`: Task ID
- `created`: Request timestamp (Unix timestamp)
- `request_id`: Unique request identifier
- `model`: Model name used
- `text`: Transcribed text
Example output:
```json
{
  "id": "task-12345",
  "created": 1234567890,
  "request_id": "req-abc123",
  "model": "glm-asr-2512",
  "text": "你好,这是转录的文本内容"
}
```
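To use the transcript in a pipeline, the `text` field can be pulled out with `jq`, which the skill already lists as a requirement. The sketch below assumes the response shape shown in the example and the `error` handling used by the bundled script; `asr_text` is a hypothetical helper name.

```bash
# Read an API response on stdin; print the transcript, or print the error
# message to stderr and fail.
asr_text() {
    response=$(cat)
    if printf '%s' "$response" | jq -e '.error' >/dev/null 2>&1; then
        printf '%s' "$response" | jq -r '.error.message // .error' >&2
        return 1
    fi
    printf '%s' "$response" | jq -r '.text // empty'
}
```

Usage: `bash scripts/speech_to_text.sh recording.wav | asr_text > recording.txt`.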
## Troubleshooting
**File Size Issues:**
- Split audio files larger than 25 MB
- Reduce sample rate or bit depth
- Use compression (MP3) for smaller files
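
As a rough guide, uncompressed PCM audio takes sample rate × (bit depth ÷ 8) × channels × seconds bytes. The hypothetical helper below does that arithmetic; it ignores the small WAV header.

```bash
# Uncompressed PCM size in bytes (header excluded):
# sample_rate * bit_depth / 8 * channels * seconds.
# Arguments: sample_rate bit_depth channels seconds
pcm_bytes() {
    echo $(( $1 * $2 / 8 * $3 * $4 ))
}
```

`pcm_bytes 16000 16 1 30` gives 960000, a little under 1 MB, so at typical settings a WAV file that exceeds 25 MB is usually also longer than the 30-second limit and needs splitting rather than recompression.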
**Duration Issues:**
- Split recordings longer than 30 seconds
- Process segments separately
- Use context prompts to maintain continuity
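
Splitting can be done with ffmpeg's segment muxer. The following is a sketch rather than part of the skill: `chunk_count` and `split_audio` are hypothetical names, and the 29-second default is an assumed safety margin under the 30-second limit.

```bash
# Number of chunks needed for DURATION ($1) whole seconds at CHUNK ($2)
# seconds each (integer ceiling division).
chunk_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Split INPUT ($1) into numbered 16 kHz mono WAV chunks of at most CHUNK
# seconds ($2, default 29): name_part000.wav, name_part001.wav, ...
split_audio() {
    ffmpeg -i "$1" -f segment -segment_time "${2:-29}" \
        -ar 16000 -ac 1 -c:a pcm_s16le "${1%.*}_part%03d.wav"
}
```

A 95-second recording yields `chunk_count 95 29` = 4 chunks, which can then be transcribed in order with the previous text passed as the context prompt.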
**Poor Accuracy:**
- Improve audio quality
- Use hotwords for specific terms
- Provide context prompts
- Ensure clear speech and minimal noise
**Format Issues:**
- Use .wav or .mp3 where possible; other formats need ffmpeg installed for automatic conversion
- Check file is not corrupted
- Verify audio can be played by standard players
## Limitations
- Maximum audio duration: 30 seconds per request
- File size limit: 25 MB
- Maximum hotwords: 100 terms
- Context prompt limit: 8000 characters
- Best performance with Chinese language audio
## Performance Notes
- Typical transcription time: 1-3 seconds
- Real-time or faster for most audio
- Processing time scales with audio quality and length
---
## Referenced Files
> The following files are referenced in this skill and included for context.
### scripts/speech_to_text.sh
```bash
#!/bin/bash
# Zhipu AI Speech-to-Text Script
# Usage: ./speech_to_text.sh <audio_file> [prompt] [hotwords]
set -e
# Configuration
API_ENDPOINT="https://open.bigmodel.cn/api/paas/v4/audio/transcriptions"
# Get API key from environment
if [ -z "$ZHIPU_API_KEY" ]; then
    echo "Error: ZHIPU_API_KEY environment variable is not set" >&2
    echo "" >&2
    echo "To fix:" >&2
    echo "1. Get a key from https://bigmodel.cn/usercenter/proj-mgmt/apikeys" >&2
    echo "2. Run: export ZHIPU_API_KEY=\"your-key\"" >&2
    exit 1
fi
# Parse arguments
AUDIO_FILE="$1"
PROMPT="$2"
HOTWORDS="$3"
# Validate audio file
if [ -z "$AUDIO_FILE" ]; then
    echo "Usage: $0 <audio_file> [prompt] [hotwords]" >&2
    echo "" >&2
    echo "Examples:" >&2
    echo " $0 recording.wav" >&2
    echo " $0 recording.wav \"这是之前的转录内容\"" >&2
    echo " $0 recording.mp3 \"\" \"人名,地名,专业术语\"" >&2
    echo "" >&2
    echo "Supported formats: .wav, .mp3, .ogg, .m4a, .aac, .flac, .wma" >&2
    echo "Max file size: 25 MB" >&2
    echo "Max duration: 30 seconds" >&2
    exit 1
fi
# Check if file exists
if [ ! -f "$AUDIO_FILE" ]; then
    echo "Error: Audio file not found: $AUDIO_FILE" >&2
    exit 1
fi
# Auto-convert audio format if needed
ORIGINAL_FILE="$AUDIO_FILE"
FILE_EXT="${AUDIO_FILE##*.}"
FILE_EXT=$(echo "$FILE_EXT" | tr '[:upper:]' '[:lower:]')
# Supported formats by API: wav, mp3
# Convert other formats to mp3
if [ "$FILE_EXT" != "wav" ] && [ "$FILE_EXT" != "mp3" ]; then
    echo "Converting audio format: $FILE_EXT → mp3" >&2
    # Check if ffmpeg is available
    if ! command -v ffmpeg &> /dev/null; then
        echo "Error: ffmpeg is required for format conversion" >&2
        echo "Install it with your package manager, e.g. apt-get install ffmpeg or yum install -y ffmpeg" >&2
        exit 1
    fi
    # Create temp file for converted audio
    # (--suffix is a GNU mktemp option; BSD/macOS mktemp does not support it)
    TEMP_AUDIO=$(mktemp --suffix=.mp3)
    # Register cleanup before converting so the temp file is also removed
    # when ffmpeg fails and the script exits early
    cleanup() {
        rm -f "$TEMP_AUDIO"
    }
    trap cleanup EXIT
    # Convert to mp3 with optimal settings for ASR
    # Using libmp3lame for compatibility. ar=16000, ac=1 (mono) as per Zhipu best practices.
    if ! ffmpeg -i "$AUDIO_FILE" \
        -acodec libmp3lame \
        -ar 16000 \
        -ac 1 \
        -b:a 64k \
        -y \
        "$TEMP_AUDIO" 2>/dev/null; then
        echo "Error: ffmpeg could not convert $ORIGINAL_FILE" >&2
        exit 1
    fi
    # Replace AUDIO_FILE with converted file
    AUDIO_FILE="$TEMP_AUDIO"
    echo "Conversion complete: $TEMP_AUDIO" >&2
    echo "" >&2
fi
# Check file size (25 MB limit)
FILE_SIZE=$(stat -c%s "$AUDIO_FILE" 2>/dev/null || stat -f%z "$AUDIO_FILE" 2>/dev/null || echo "0")
MAX_SIZE=$((25 * 1024 * 1024))
if [ "$FILE_SIZE" -gt "$MAX_SIZE" ]; then
    echo "Error: File size exceeds 25 MB limit" >&2
    echo "Current size: $(($FILE_SIZE / 1024 / 1024)) MB" >&2
    exit 1
fi
# Check duration if ffprobe is available
if command -v ffprobe &> /dev/null; then
    DURATION=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "$AUDIO_FILE")
    # Compare with the 30 s limit, allowing a 0.5 s margin.
    # awk does the float comparison so that bc is not needed;
    # "d + 0" forces a numeric comparison even if ffprobe prints "N/A".
    if awk -v d="$DURATION" 'BEGIN { exit !(d + 0 > 30.5) }'; then
        echo "Error: Audio duration ($DURATION s) exceeds 30 second limit" >&2
        exit 1
    fi
fi
# Build base payload
# NOTE: PAYLOAD is not sent; the curl request below uses multipart form fields instead.
PAYLOAD=$(jq -n \
    --arg model "glm-asr-2512" \
    '{
        model: $model
    }')
# Add prompt if provided
if [ -n "$PROMPT" ]; then
    PAYLOAD=$(echo "$PAYLOAD" | jq --arg prompt "$PROMPT" '. + {prompt: $prompt}')
fi
# Add hotwords if provided
if [ -n "$HOTWORDS" ]; then
    # Convert comma-separated hotwords to JSON array.
    # sub() does the trimming because the trim builtin is missing from jq 1.6 and earlier.
    HOTWORDS_ARRAY=$(echo "$HOTWORDS" | jq -R 'split(",") | map(sub("^\\s+"; "") | sub("\\s+$"; ""))')
    PAYLOAD=$(echo "$PAYLOAD" | jq --argjson hotwords "$HOTWORDS_ARRAY" '. + {hotwords: $hotwords}')
fi
# Make API request
echo "Transcribing audio file: $AUDIO_FILE" >&2
if [ -n "$PROMPT" ]; then
    echo "Using context prompt: $(echo "$PROMPT" | cut -c1-50)..." >&2
fi
if [ -n "$HOTWORDS" ]; then
    echo "Hotwords: $HOTWORDS" >&2
fi
echo "" >&2
# Build curl command arguments
CURL_ARGS=()
CURL_ARGS+=(-H "Authorization: Bearer $ZHIPU_API_KEY")
CURL_ARGS+=(-F "file=@$AUDIO_FILE")
CURL_ARGS+=(-F "model=glm-asr-2512")
if [ -n "$PROMPT" ]; then
    # --form-string sends the value literally; with -F, a prompt that starts
    # with @ or < would be treated as a file to read
    CURL_ARGS+=(--form-string "prompt=$PROMPT")
fi
if [ -n "$HOTWORDS" ]; then
    # Convert comma-separated to array format for curl
    IFS=',' read -ra HW_ARRAY <<< "$HOTWORDS"
    for word in "${HW_ARRAY[@]}"; do
        word=$(echo "$word" | xargs)
        CURL_ARGS+=(--form-string "hotwords[]=$word")
    done
fi
# -s hides the progress meter; -S still prints transport errors
RESPONSE=$(curl -sS -X POST "$API_ENDPOINT" "${CURL_ARGS[@]}")
# Check for errors
if echo "$RESPONSE" | jq -e '.error' > /dev/null 2>&1; then
    ERROR_MSG=$(echo "$RESPONSE" | jq -r '.error.message // .error')
    echo "Error: $ERROR_MSG" >&2
    exit 1
fi
# Extract and display result
echo "$RESPONSE" | jq '.'
TRANSCRIBED_TEXT=$(echo "$RESPONSE" | jq -r '.text // empty')
if [ -n "$TRANSCRIBED_TEXT" ]; then
    echo "" >&2
    echo "Transcribed text:" >&2
    echo "$TRANSCRIBED_TEXT" >&2
fi
```
---
## Skill Companion Files
> Additional files collected from the skill directory layout.
### README.md
````markdown
# Zhipu AI ASR Skill
Automatic Speech Recognition (ASR) using Zhipu AI (BigModel) GLM-ASR model. Transcribe Chinese audio files to text with high accuracy.
## Features
- 🎤 **Multiple Audio Formats**: WAV, MP3, OGG, M4A, AAC, FLAC, WMA
- 🇨🇳 **Chinese Language Support**: Optimized for Mandarin Chinese
- 📝 **Context Prompts**: Improve accuracy with previous transcription context
- 🔥 **Hotwords**: Custom vocabulary for specific terms (names, jargon, etc.)
- ⚡ **Fast Processing**: Real-time or faster transcription speed
- 🔄 **Auto Format Conversion**: Automatically converts unsupported formats to MP3
## Requirements
- `jq` - JSON processor
- `curl` - HTTP client for the API request
- `ffmpeg` - Audio format conversion
- `ZHIPU_API_KEY` environment variable
## Quick Start
```bash
# Install dependencies (if needed)
sudo apt-get install jq curl ffmpeg
# Set your API key
export ZHIPU_API_KEY="your-key-here"
# Transcribe an audio file
bash scripts/speech_to_text.sh recording.wav
# With context and hotwords
bash scripts/speech_to_text.sh recording.wav "previous context" "term1,term2,term3"
```
## File Constraints
- **Max file size**: 25 MB
- **Max duration**: 30 seconds
- **Supported formats**: WAV (recommended), MP3
- **Other formats**: Auto-converted to MP3
## Use Cases
- 🎙️ Meeting transcription
- 📚 Lecture recording
- 💼 Voice memos
- 🎞️ Video subtitle generation
- 📞 Call recording transcription
## Author
franklu0819-lang
## License
MIT
````
### _meta.json
```json
{
  "owner": "franklu0819-lang",
  "slug": "zhipu-asr",
  "displayName": "Zhipu Asr",
  "latest": {
    "version": "1.0.2",
    "publishedAt": 1773357542254,
    "commit": "https://github.com/openclaw/skills/commit/034e746f7a6216acf95437cc0d7f6900755c6efb"
  },
  "history": [
    {
      "version": "1.0.1",
      "publishedAt": 1771684631424,
      "commit": "https://github.com/openclaw/skills/commit/983669383df4852cf5f78deae1f4619951f1ed59"
    }
  ]
}
```