Back to skills
SkillHub ClubShip Full StackFull Stack

whisper-transcription

Transcribe audio/video to text with word-level timestamps using OpenAI Whisper. Use when you need speech-to-text with accurate timing information for each word.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars
785
Hot score
99
Updated
March 20, 2026
Overall rating
C5.1
Composite score
5.1
Best-practice grade
S96.0

Install command

npx @skill-hub/cli install benchflow-ai-skillsbench-whisper-transcription

Repository

benchflow-ai/SkillsBench

Skill path: tasks-no-skills/video-filler-word-remover/environment/skills/whisper-transcription

Transcribe audio/video to text with word-level timestamps using OpenAI Whisper. Use when you need speech-to-text with accurate timing information for each word.

Open repository

Best for

Primary workflow: Ship Full Stack.

Technical facets: Full Stack.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: benchflow-ai.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

  • Install whisper-transcription into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/benchflow-ai/SkillsBench before adding whisper-transcription to shared team environments
  • Use whisper-transcription for development workflows

Works across

Claude CodeCodex CLIGemini CLIOpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: whisper-transcription
description: "Transcribe audio/video to text with word-level timestamps using OpenAI Whisper. Use when you need speech-to-text with accurate timing information for each word."
---

# Whisper Transcription

OpenAI Whisper provides accurate speech-to-text with word-level timestamps.

## Installation

```bash
pip install openai-whisper
```

## Model Selection

**Use the `tiny` model for fast transcription** - it's sufficient for most tasks and runs much faster:

| Model | Size | Speed | Accuracy |
|-------|------|-------|----------|
| tiny | 39 MB | Fastest | Good for clear speech |
| base | 74 MB | Fast | Better accuracy |
| small | 244 MB | Medium | High accuracy |

**Recommendation: Start with `tiny` - it handles clear interview/podcast audio well.**

## Basic Usage with Word Timestamps

```python
import whisper
import json

def transcribe_with_timestamps(audio_path, output_path):
    """
    Transcribe audio and get word-level timestamps.

    Args:
        audio_path: Path to audio/video file
        output_path: Path to save JSON output
    """
    # Use tiny model for speed
    model = whisper.load_model("tiny")

    # Transcribe with word timestamps
    result = model.transcribe(
        audio_path,
        word_timestamps=True,
        language="en"  # Specify language for better accuracy
    )

    # Extract words with timestamps
    words = []
    for segment in result["segments"]:
        if "words" in segment:
            for word_info in segment["words"]:
                words.append({
                    "word": word_info["word"].strip(),
                    "start": word_info["start"],
                    "end": word_info["end"]
                })

    with open(output_path, "w") as f:
        json.dump(words, f, indent=2)

    return words
```

## Detecting Specific Words

```python
def find_words(transcription, target_words):
    """
    Find specific words in transcription with their timestamps.

    Args:
        transcription: List of word dicts with 'word', 'start', 'end'
        target_words: Set of words to find (lowercase)

    Returns:
        List of matches with word and timestamp
    """
    matches = []
    target_lower = {w.lower() for w in target_words}

    for item in transcription:
        word = item["word"].lower().strip()
        # Remove punctuation for matching
        clean_word = ''.join(c for c in word if c.isalnum())

        if clean_word in target_lower:
            matches.append({
                "word": clean_word,
                "timestamp": item["start"]
            })

    return matches
```

## Complete Example: Find Filler Words

```python
import whisper
import json

# Filler words to detect
FILLER_WORDS = {
    "um", "uh", "hum", "hmm", "mhm",
    "like", "so", "well", "yeah", "okay",
    "basically", "actually", "literally"
}

def detect_fillers(audio_path, output_path):
    # Load tiny model (fast!)
    model = whisper.load_model("tiny")

    # Transcribe
    result = model.transcribe(audio_path, word_timestamps=True, language="en")

    # Find fillers
    fillers = []
    for segment in result["segments"]:
        for word_info in segment.get("words", []):
            word = word_info["word"].lower().strip()
            clean = ''.join(c for c in word if c.isalnum())

            if clean in FILLER_WORDS:
                fillers.append({
                    "word": clean,
                    "timestamp": round(word_info["start"], 2)
                })

    with open(output_path, "w") as f:
        json.dump(fillers, f, indent=2)

    return fillers

# Usage
detect_fillers("/root/input.mp4", "/root/annotations.json")
```

## Audio Extraction (if needed)

Whisper can process video files directly, but for cleaner results:

```bash
# Extract audio as 16kHz mono WAV
ffmpeg -i input.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav
```

## Multi-Word Phrases

For detecting phrases like "you know" or "I mean":

```python
def find_phrases(transcription, phrases):
    """Find multi-word phrases in transcription."""
    matches = []
    words = [w["word"].lower().strip() for w in transcription]

    for phrase in phrases:
        phrase_words = phrase.lower().split()
        phrase_len = len(phrase_words)

        for i in range(len(words) - phrase_len + 1):
            if words[i:i+phrase_len] == phrase_words:
                matches.append({
                    "word": phrase,
                    "timestamp": transcription[i]["start"]
                })

    return matches
```
whisper-transcription | SkillHub