speech-to-text

Expert in transcribing audio and video files to structured meeting minutes using VertexAI Gemini 2.5 Flash. **Use this skill when the user requests to transcribe audio/video files ('transcribe this audio', 'convert audio to text', 'get text from recording'), extract content from recordings, or when preprocessing audio for meeting summaries.** Automatically searches ~/Downloads when user mentions 'downloaded audio'. Supports MP3, WAV, M4A, AAC, OGG, FLAC formats with automatic attendee detection and speaker attribution.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Updated

March 20, 2026

Overall rating

C (2.8)

Composite score

2.8

Best-practice grade

B (81.2)

Install command

npx @skill-hub/cli install smorand-claude-config-speech-to-text

Repository

smorand/claude-config

Skill path: skills/speech-to-text

Best for

Primary workflow: Write Technical Docs.

Technical facets: Full Stack, Tech Writer.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: smorand.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

Install speech-to-text into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
Review https://github.com/smorand/claude-config before adding speech-to-text to shared team environments
Use speech-to-text for development workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: speech-to-text
description: Expert in transcribing audio and video files to structured meeting minutes using VertexAI Gemini 2.5 Flash. **Use this skill when the user requests to transcribe audio/video files ('transcribe this audio', 'convert audio to text', 'get text from recording'), extract content from recordings, or when preprocessing audio for meeting summaries.** Automatically searches ~/Downloads when user mentions 'downloaded audio'. Supports MP3, WAV, M4A, AAC, OGG, FLAC formats with automatic attendee detection and speaker attribution.
---

# Speech-to-Text Transcription Skill

Expert in transcribing audio and video files to structured meeting minutes with automatic attendee name detection, speaker attribution, and markdown formatting.

## Core Capabilities

### Audio Processing
- **Multi-Format Support:** MP3, WAV, M4A, AAC, OGG, FLAC
- **Auto-Chunking:** Splits files >30 minutes into 30-minute segments with 30-second overlaps
- **Parallel Processing:** Processes chunks concurrently (max 3 workers)
- **AI Merge Reconciliation:** One-pass merge of all chunks with overlap detection
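
The chunking rule is easier to follow with the arithmetic written out. The sketch below is not the binary's code: it only reproduces the start and duration values that appear in the sample log in the Logging section (chunk 2 starting at 1770.0s with a requested duration of 1830.0s), under the assumption that every chunk after the first starts 30 seconds before its 30-minute boundary. `chunk_plan` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "index start duration" (in seconds) for each
# chunk of a recording, given its total length in whole seconds.
# Assumption (inferred from the sample log, not from the source code):
# chunk i starts at i*1800 - 30 (never below 0) and is requested for 1830 s.
chunk_plan() {
    local total="$1" segment=1800 overlap=30
    # Recordings of 30 minutes or less are not split.
    if [ "$total" -le "$segment" ]; then
        echo "1 0 $total"
        return 0
    fi
    local count=$(( (total + segment - 1) / segment ))   # ceiling division
    local i start
    for (( i = 0; i < count; i++ )); do
        start=$(( i * segment - overlap ))
        if [ "$start" -lt 0 ]; then
            start=0
        fi
        echo "$(( i + 1 )) $start $(( segment + overlap ))"
    done
}

# The 2114-second recording from the sample log:
chunk_plan 2114
# 1 0 1830
# 2 1770 1830
```

The last chunk's requested duration can run past the end of the file, as it does in the sample log; a tool such as ffmpeg simply stops at end of input in that case.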

### Meeting Minutes Generation
- **Attendee Detection:** Automatically identifies speaker names from conversation
- **Speaker Attribution:** Associates each statement with actual speaker names (not "Speaker 1", "Speaker 2")
- **Structured Output:** Clean markdown with Attendees list and Minutes sections
- **Language Preservation:** Maintains original audio language

### Performance
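- **Instant Startup:** Compiled Go binary with no runtime or package dependencies (ffmpeg is needed only to chunk files >30 minutes)
- **No Environment Variables:** Default GCP project built in (`oa-data-btdpexploration-np`); `gcloud` authentication is still required (see Prerequisites)
- **Processing Time:** About 2-5 minutes for recordings under 30 minutes, 15-20 minutes or more for longer ones
- **Single Executable:** No Python, no virtualenv, no package installation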

## Usage Instructions

### Basic Transcription

```bash
# Transcribe audio and save to file
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting.mp3 -o ~/Downloads/meeting_transcript.md

# With custom meeting name
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/recording.mp3 -o transcript.md -m "Weekly Team Sync"

# Display to stdout (no file save)
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/audio.wav
```

### Command-Line Options

| Flag | Required | Default | Description |
|------|----------|---------|-------------|
| `<audio-file>` | **Yes** | - | Path to audio file (absolute path recommended) |
| `-o` | No | stdout | Output markdown file path |
| `-m` | No | filename | Custom meeting name for title |
| `-project` | No | `oa-data-btdpexploration-np` | GCP project ID |
| `-location` | No | `global` | GCP region for VertexAI |
| `-model` | No | `gemini-2.5-flash` | Gemini model to use |

### Finding Audio Files in Downloads

When user mentions "downloaded audio" or "audio from downloads":

```bash
# Search Downloads for audio files modified in the last 7 days (extensions matched case-insensitively)
find ~/Downloads -type f \( -iname "*.mp3" -o -iname "*.m4a" -o -iname "*.wav" -o -iname "*.ogg" -o -iname "*.aac" -o -iname "*.flac" \) -mtime -7 | sort -r

# Or list the 10 most recently modified files, newest first (errors from unmatched patterns are discarded)
ls -lt ~/Downloads/*.{mp3,m4a,wav,ogg,aac,flac} 2>/dev/null | head -10
```
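
The two commands above can be folded into one reusable function that returns the matches newest first, which is the order a user usually wants when confirming a file. This is a minimal sketch; `recent_audio` is a hypothetical helper name, and it assumes a `find` that supports `-iname` (both GNU and BSD `find` do).

```bash
#!/usr/bin/env bash
# Hypothetical helper: list audio files under a directory that were modified
# within the last N days, newest first. Extensions match in any letter case.
# Usage: recent_audio [directory] [days]   (defaults: ~/Downloads, 7)
recent_audio() {
    local dir="${1:-$HOME/Downloads}" days="${2:-7}"
    # "-exec ls -t {} +" hands all matches to ls, which sorts them by
    # modification time; if nothing matches, ls is not run at all.
    find "$dir" -type f \( \
        -iname '*.mp3' -o -iname '*.m4a' -o -iname '*.wav' -o \
        -iname '*.ogg' -o -iname '*.aac' -o -iname '*.flac' \
    \) -mtime "-$days" -exec ls -t {} +
}
```

Calling `recent_audio` with no arguments searches `~/Downloads` for the last 7 days; `recent_audio ~/recordings 30` widens the window to a month in another directory.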

## Typical Workflows

### Workflow 1: User Mentions Downloaded Audio

**User Request:** "Transcribe the downloaded meeting audio"

**Steps:**
1. Search `~/Downloads` for recent audio files (last 7 days)
2. List found files and ask user to confirm which one
3. Transcribe using absolute path
4. Save as `<filename>_transcript.md` in `~/Downloads`

```bash
# Find recent audio files in every supported format
find ~/Downloads -type f \( -name "*.mp3" -o -name "*.m4a" -o -name "*.wav" -o -name "*.ogg" -o -name "*.aac" -o -name "*.flac" \) -mtime -7

# Transcribe selected file
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting_2025-12-30.mp3 -o ~/Downloads/meeting_2025-12-30_transcript.md
```

### Workflow 2: User Provides Explicit Path

**User Request:** "Transcribe /home/sebastien/recordings/interview.wav"

**Steps:**
1. Use provided path directly
2. Save transcript next to original: `/home/sebastien/recordings/interview_transcript.md`

```bash
~/.claude/skills/speech-to-text/scripts/speech-to-text /home/sebastien/recordings/interview.wav -o /home/sebastien/recordings/interview_transcript.md
```

### Workflow 3: Preprocessing for Meeting Summary

**User Request:** "Create a summary of the recorded meeting"

**Steps:**
1. Use `speech-to-text` skill to transcribe audio → generates `.md` transcript
2. Pass transcript to `meetings-summary` skill → generates structured summary

```bash
# Step 1: Transcribe (this skill)
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting.mp3 -o /tmp/transcript.md

# Step 2: Summarize (meetings-summary skill)
# Pass /tmp/transcript.md to meetings-summary skill
```

### Workflow 4: Batch Processing Multiple Files

**User Request:** "Transcribe all MP3 files in Downloads"

**CRITICAL: Always process files sequentially (NEVER in parallel)**

```bash
# Process sequentially to avoid rate limits
for audio in ~/Downloads/*.mp3; do
    # Skip the unexpanded pattern when no .mp3 files exist
    [ -e "$audio" ] || continue
    echo "Processing: $audio"
    output="${audio%.mp3}_transcript.md"
    # Runs in the foreground, so the loop waits for completion before the next file
    ~/.claude/skills/speech-to-text/scripts/speech-to-text "$audio" -o "$output"
done
```

**❌ NEVER do this (parallel processing will fail):**
```bash
# DO NOT USE - will hit rate limits
for audio in ~/Downloads/*.mp3; do
    ~/.claude/skills/speech-to-text/scripts/speech-to-text "$audio" -o "${audio%.mp3}_transcript.md" &
done
wait
```

## Output Format

The transcription generates clean, structured markdown:

```markdown
# Meeting Name

## Attendees
- Sebastien Morand
- John Doe
- Jane Smith

## Minutes
- **Sebastien Morand**: Welcome everyone. Let's start with the quarterly review.
- **John Doe**: Thank you, Sebastien. I'd like to discuss the budget allocation for Q1.
- **Jane Smith**: I agree with John's points. Additionally, we should consider...
```

**Key Features:**
- **Section 1:** Attendees list (extracted from conversation)
- **Section 2:** Verbatim transcription with speaker attribution
- **Speaker names:** Detected from conversation (not generic "Speaker 1")
- **Clean markdown:** Ready for further processing or display

## Critical Constraints

### 1. Always Use Absolute Paths

**✅ Recommended:**
```bash
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting.mp3
~/.claude/skills/speech-to-text/scripts/speech-to-text /home/sebastien/audio/recording.wav
```

**⚠️ Avoid relative paths** (they resolve against the current working directory, which may not be where the file lives):
```bash
~/.claude/skills/speech-to-text/scripts/speech-to-text ./meeting.mp3  # May fail if cwd is wrong
```

### 2. Never Process Files in Parallel

**Why:** Gemini API rate limits will cause 429 errors when processing multiple files concurrently.

**Always process sequentially:**
- One file at a time
- Wait for completion before starting next
- Use `for` loops without `&` backgrounding

### 3. Default Output Location

Always save transcripts next to the original audio file unless user specifies otherwise:

- Input: `~/Downloads/meeting.mp3`
- Output: `~/Downloads/meeting_transcript.md`
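
The naming rule can be expressed as a small path transformation. This is an illustrative sketch; `transcript_path` is a hypothetical helper name, not something the binary provides.

```bash
#!/usr/bin/env bash
# Hypothetical helper: derive the default transcript path for an audio file
# by replacing its extension with "_transcript.md" in the same directory.
transcript_path() {
    local audio="$1" dir base
    dir=$(dirname -- "$audio")
    base=$(basename -- "$audio")
    # ${base%.*} drops the last extension only; a name without a dot is kept whole.
    printf '%s/%s_transcript.md\n' "$dir" "${base%.*}"
}

transcript_path /home/sebastien/recordings/interview.wav
# /home/sebastien/recordings/interview_transcript.md
```

Splitting the directory from the file name first avoids mangling paths whose directories contain dots.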

### 4. No Environment Variables Needed

The binary has a default project (`oa-data-btdpexploration-np`) built-in. Only use `-project` flag if user explicitly needs a different project.

## Prerequisites

### System Requirements

**ffmpeg** - Required for audio chunking (files >30 minutes)
```bash
# macOS
brew install ffmpeg

# Linux (Debian/Ubuntu)
sudo apt-get install ffmpeg
```

### GCP Setup

**Authentication:**
```bash
gcloud auth application-default login
```

**Enable VertexAI API:**
```bash
gcloud services enable aiplatform.googleapis.com --project=oa-data-btdpexploration-np
```

**No other dependencies required** - Go binary is self-contained.
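
Because a long transcription can run for many minutes, it may be worth checking these prerequisites up front. The sketch below is one possible preflight check; `preflight` is a hypothetical helper name. It treats ffmpeg as required even though this page says it is only needed for files >30 minutes, and it uses `gcloud auth application-default print-access-token` as a way to confirm that Application Default Credentials are usable.

```bash
#!/usr/bin/env bash
# Hypothetical preflight check: verify the tools listed under Prerequisites
# before starting a long transcription.
preflight() {
    local missing=0 tool
    for tool in ffmpeg gcloud; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    [ "$missing" -eq 0 ] || return 1
    # Application Default Credentials are usable if a token can be obtained.
    if ! gcloud auth application-default print-access-token >/dev/null 2>&1; then
        echo "no usable credentials; run: gcloud auth application-default login" >&2
        return 1
    fi
    echo "preflight ok"
}
```

A typical use is `preflight && ~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting.mp3 -o ~/Downloads/meeting_transcript.md`, so the transcription only starts when the checks pass.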

## Integration with Other Skills

### With `meetings-summary` Skill

This skill provides the audio-to-text conversion. For structured meeting summaries:

```bash
# 1. Transcribe audio (this skill)
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting.mp3 -o /tmp/transcript.md

# 2. Generate summary (meetings-summary skill)
# Pass /tmp/transcript.md to meetings-summary skill for action items, decisions, etc.
```

**Note:** This skill only converts audio to text. Use `meetings-summary` for:
- Extracting action items
- Identifying decisions
- Structuring topics
- Generating email summaries

### With `topic-manager` Skill

After transcription, topics can be extracted and stored:

```bash
# 1. Transcribe
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/one-on-one.mp3 -o /tmp/transcript.md

# 2. Extract topics (topic-manager skill)
# Process /tmp/transcript.md to update topic notes
```

## Troubleshooting

### Audio File Not Found

**Error:** `ERROR: Audio file not found`

**Solutions:**
- Use absolute paths for reliability
- Verify file exists: `ls -lh /full/path/to/audio.mp3`
- Check file extension matches actual format
- Ensure no typos in path
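
These checks can be scripted before calling the binary. The sketch below is illustrative: `check_audio` is a hypothetical helper name, and the format check relies on `ffprobe`, which ships with ffmpeg and is skipped if it is not installed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: confirm that a path points at a readable, non-empty
# file and, when ffprobe is available, print its container format and duration.
check_audio() {
    local f="$1"
    if [ ! -f "$f" ]; then
        echo "not found or not a regular file: $f" >&2
        return 1
    fi
    if [ ! -r "$f" ]; then
        echo "not readable: $f" >&2
        return 1
    fi
    if [ ! -s "$f" ]; then
        echo "file is empty: $f" >&2
        return 1
    fi
    # A format_name that disagrees with the file extension suggests a
    # renamed or mislabeled file.
    if command -v ffprobe >/dev/null 2>&1; then
        ffprobe -v error -show_entries format=format_name,duration \
            -of default=noprint_wrappers=1 "$f"
    fi
}
```

For example, `check_audio ~/Downloads/meeting.mp3` prints `format_name=` and `duration=` lines for a valid file, and returns a non-zero status if the file is missing, empty, or unreadable by ffprobe.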

### Authentication Errors

**Error:** `Failed to initialize VertexAI` or `authentication failed`

**Solutions:**
```bash
# Re-authenticate
gcloud auth application-default login

# Enable API
gcloud services enable aiplatform.googleapis.com --project=oa-data-btdpexploration-np
```

### Rate Limit Errors (429)

**Error:** `429: Too Many Requests`

**Solutions:**
- Wait 5-10 minutes before retrying
- Ensure files are processed sequentially (not in parallel)
- Check for other processes hitting the same API
- Verify you're not running multiple transcriptions concurrently
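
The wait-and-retry advice can be automated with a small wrapper. This is a sketch under stated assumptions: the binary's exit codes are not documented here, so the wrapper assumes any failure produces a non-zero exit status and retries on all of them, not only on 429. `transcribe_with_retry`, `STT_BIN`, `STT_ATTEMPTS`, and `STT_WAIT_SECONDS` are hypothetical names.

```bash
#!/usr/bin/env bash
# Assumption: STT_BIN may be overridden; by default it points at the binary
# path documented on this page.
STT_BIN="${STT_BIN:-$HOME/.claude/skills/speech-to-text/scripts/speech-to-text}"

# Hypothetical wrapper: retry a failed transcription after a fixed wait.
# Defaults: 3 attempts, 300 seconds (5 minutes) between attempts.
transcribe_with_retry() {
    local audio="$1" output="$2"
    local attempts="${STT_ATTEMPTS:-3}" wait_s="${STT_WAIT_SECONDS:-300}"
    local n
    for (( n = 1; n <= attempts; n++ )); do
        if "$STT_BIN" "$audio" -o "$output"; then
            return 0
        fi
        # Only wait when another attempt will follow.
        if [ "$n" -lt "$attempts" ]; then
            echo "Attempt $n failed; waiting ${wait_s}s before retrying" >&2
            sleep "$wait_s"
        fi
    done
    echo "Giving up on $audio after $attempts attempts" >&2
    return 1
}
```

A fixed wait keeps the sketch simple. Because it retries on every failure, a permanent error such as a missing file will also be retried, so checking the binary's stderr output before relying on it is advisable.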

### ffmpeg Not Found

**Error:** `ffmpeg: command not found`

**Solutions:**
```bash
# macOS
brew install ffmpeg

# Linux (Debian/Ubuntu)
sudo apt-get install ffmpeg

# Verify installation
ffmpeg -version
```

### Long Processing Time

**Expected Behavior:**
- Short recordings (<30 min): 2-5 minutes
- Long recordings (30-60 min): 15-20 minutes
- Very long recordings (>60 min): 30+ minutes

**Why:** Gemini 2.5 Flash processing time scales with audio duration.

**Optimization:**
- Lower bitrate audio (64-128 kbps) is sufficient for speech
- MP3 or M4A formats process fastest
- Keep files under 100 MB when possible

## Model Information

### Default Model: Gemini 2.5 Flash

**Model ID:** `gemini-2.5-flash`

**Features:**
- Multimodal (audio + text)
- Automatic language detection
- Optimized for cost and speed
- Best for meeting transcriptions
- High accuracy for speaker identification

### Alternative Models

**Gemini 2.5 Pro** - Higher accuracy, slower, more expensive
```bash
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/audio.mp3 -model gemini-2.5-pro
```

**When to use Pro:**
- Critical transcriptions requiring maximum accuracy
- Complex audio with multiple overlapping speakers
- Poor audio quality requiring advanced processing

## Performance Tips

1. **File Size:** Keep audio files under 100 MB when possible
2. **Format:** MP3 or M4A are fastest to process
3. **Quality:** Lower bitrate audio (64-128 kbps) is sufficient for speech
4. **Duration:** Files >60 minutes may take 30+ minutes to transcribe
5. **Sequential Processing:** Always process one file at a time

## Logging

All progress logs are written to stderr, each line prefixed with a timestamp in the format `[YYYY-MM-DD HH:MM:SS]`:

```
[2025-12-30 14:23:45] Analyzing recording: filename.ogg
[2025-12-30 14:23:45] File size: 22.45 MB
[2025-12-30 14:23:45] Recording duration: 35.23 minutes (2114.0 seconds)
[2025-12-30 14:23:45] Recording exceeds 30 minutes - will split into 2 chunks
[2025-12-30 14:23:45] Cutting recording into 2 chunks with 30-second overlaps
[2025-12-30 14:23:46] Creating chunk 1/2 - start: 0.0s, duration: 1830.0s
[2025-12-30 14:23:47] Creating chunk 2/2 - start: 1770.0s, duration: 1830.0s
[2025-12-30 14:23:47] Successfully created 2 chunk files
[2025-12-30 14:23:47] Starting parallel transcription of 2 chunks (max 3 concurrent)
[2025-12-30 14:25:30] Transcription of chunk 1/2 completed
[2025-12-30 14:25:42] Transcription of chunk 2/2 completed
[2025-12-30 14:25:42] Starting reconciliation of overlapping chunks
[2025-12-30 14:25:42] Merging all 2 chunks in one pass
[2025-12-30 14:25:42] Sending all chunks to AI for merge reconciliation
[2025-12-30 14:26:15] All chunks merged successfully
[2025-12-30 14:26:15] Reconciliation completed successfully
[2025-12-30 14:26:15] Finalizing output with meeting title
[2025-12-30 14:26:15] Writing final output to: minutes.md
[2025-12-30 14:26:15] Successfully wrote 15.67 KB to minutes.md
```
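
Since progress goes to stderr, it can be captured in its own file and inspected while or after a run. The sketch below assumes the log lines look like the sample above; `completed_chunks` is a hypothetical helper name and the file names are illustrative.

```bash
#!/usr/bin/env bash
# Capture the transcript and the progress log separately (illustrative paths):
#   ~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting.mp3 \
#       -o ~/Downloads/meeting_transcript.md 2> ~/Downloads/meeting.log

# Hypothetical helper: count how many chunks have finished, based on the
# "Transcription of chunk i/n completed" lines shown in the sample log.
completed_chunks() {
    grep -c 'Transcription of chunk [0-9][0-9]*/[0-9][0-9]* completed' "$1"
}
```

Running `completed_chunks ~/Downloads/meeting.log` against the sample log above would print `2`. Recordings that are not chunked may log different messages, so a count of `0` does not necessarily indicate a failure.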

## Skill Scope

### This Skill Handles

- ✅ Audio/video to text transcription
- ✅ Attendee name detection
- ✅ Speaker attribution
- ✅ Structured markdown output
- ✅ Multi-format audio support
- ✅ Long audio chunking and merging

### This Skill Does NOT Handle

- ❌ Meeting summaries (use `meetings-summary` skill)
- ❌ Action item extraction (use `meetings-summary` skill)
- ❌ Decision tracking (use `meetings-summary` skill)
- ❌ Audio editing or processing
- ❌ Audio analysis (sentiment, emotions, etc.)

## Example Interactions

### Example 1: Downloaded Audio

**User:** "I have a downloaded meeting recording, can you transcribe it?"

**Claude Actions:**
1. Search `~/Downloads` for audio files (last 7 days)
2. Display list of found files with size and date
3. Ask user to confirm which file
4. Transcribe selected file with absolute path
5. Save as `<filename>_transcript.md` in `~/Downloads`
6. Display success message with output location

---

### Example 2: Explicit Path with Summary

**User:** "Transcribe /home/sebastien/interview.wav and summarize it"

**Claude Actions:**
1. Use `speech-to-text` skill to transcribe
   - Output: `/home/sebastien/interview_transcript.md`
2. Load `meetings-summary` skill
3. Generate summary from transcript

---

### Example 3: Batch Processing

**User:** "Convert all my meeting recordings to text"

**Claude Actions:**
1. Ask user for directory location (default: `~/Downloads`)
2. List all audio files in directory
3. Ask for confirmation to process all files
4. Process sequentially (one by one, not parallel)
5. Save each transcript next to its audio file
6. Display progress and completion summary

## Skill Location

- **Binary:** `~/.claude/skills/speech-to-text/scripts/speech-to-text`
- **Documentation:** `/home/sebastien/projects/claude-config/skills/speech-to-text/`
- **Source Code:** Not included (compiled Go binary only)
- **Logs:** stderr (timestamped format)

## Author

**Sebastien MORAND**
Email: [email protected]
Role: CTO Data & AI at L'Oréal