whisper-transcription
Transcribe audio/video to text with word-level timestamps using OpenAI Whisper. Use when you need speech-to-text with accurate timing information for each word.
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install benchflow-ai-skillsbench-whisper-transcription
Repository
Skill path: tasks-no-skills/video-filler-word-remover/environment/skills/whisper-transcription
Transcribe audio/video to text with word-level timestamps using OpenAI Whisper. Use when you need speech-to-text with accurate timing information for each word.
Open repositoryBest for
Primary workflow: Ship Full Stack.
Technical facets: Full Stack.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: benchflow-ai.
This is still a mirrored public skill entry. Review the repository before installing into production workflows.
What it helps with
- Install whisper-transcription into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/benchflow-ai/SkillsBench before adding whisper-transcription to shared team environments
- Use whisper-transcription for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: whisper-transcription
description: "Transcribe audio/video to text with word-level timestamps using OpenAI Whisper. Use when you need speech-to-text with accurate timing information for each word."
---
# Whisper Transcription
OpenAI Whisper provides accurate speech-to-text with word-level timestamps.
## Installation
```bash
pip install openai-whisper
```
## Model Selection
**Use the `tiny` model for fast transcription** - it's sufficient for most tasks and runs much faster:
| Model | Size | Speed | Accuracy |
|-------|------|-------|----------|
| tiny | 39 MB | Fastest | Good for clear speech |
| base | 74 MB | Fast | Better accuracy |
| small | 244 MB | Medium | High accuracy |
**Recommendation: Start with `tiny` - it handles clear interview/podcast audio well.**
## Basic Usage with Word Timestamps
```python
import whisper
import json
def transcribe_with_timestamps(audio_path, output_path):
"""
Transcribe audio and get word-level timestamps.
Args:
audio_path: Path to audio/video file
output_path: Path to save JSON output
"""
# Use tiny model for speed
model = whisper.load_model("tiny")
# Transcribe with word timestamps
result = model.transcribe(
audio_path,
word_timestamps=True,
language="en" # Specify language for better accuracy
)
# Extract words with timestamps
words = []
for segment in result["segments"]:
if "words" in segment:
for word_info in segment["words"]:
words.append({
"word": word_info["word"].strip(),
"start": word_info["start"],
"end": word_info["end"]
})
with open(output_path, "w") as f:
json.dump(words, f, indent=2)
return words
```
## Detecting Specific Words
```python
def find_words(transcription, target_words):
"""
Find specific words in transcription with their timestamps.
Args:
transcription: List of word dicts with 'word', 'start', 'end'
target_words: Set of words to find (lowercase)
Returns:
List of matches with word and timestamp
"""
matches = []
target_lower = {w.lower() for w in target_words}
for item in transcription:
word = item["word"].lower().strip()
# Remove punctuation for matching
clean_word = ''.join(c for c in word if c.isalnum())
if clean_word in target_lower:
matches.append({
"word": clean_word,
"timestamp": item["start"]
})
return matches
```
## Complete Example: Find Filler Words
```python
import whisper
import json
# Filler words to detect
FILLER_WORDS = {
"um", "uh", "hum", "hmm", "mhm",
"like", "so", "well", "yeah", "okay",
"basically", "actually", "literally"
}
def detect_fillers(audio_path, output_path):
# Load tiny model (fast!)
model = whisper.load_model("tiny")
# Transcribe
result = model.transcribe(audio_path, word_timestamps=True, language="en")
# Find fillers
fillers = []
for segment in result["segments"]:
for word_info in segment.get("words", []):
word = word_info["word"].lower().strip()
clean = ''.join(c for c in word if c.isalnum())
if clean in FILLER_WORDS:
fillers.append({
"word": clean,
"timestamp": round(word_info["start"], 2)
})
with open(output_path, "w") as f:
json.dump(fillers, f, indent=2)
return fillers
# Usage
detect_fillers("/root/input.mp4", "/root/annotations.json")
```
## Audio Extraction (if needed)
Whisper can process video files directly, but for cleaner results:
```bash
# Extract audio as 16kHz mono WAV
ffmpeg -i input.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav
```
## Multi-Word Phrases
For detecting phrases like "you know" or "I mean":
```python
def find_phrases(transcription, phrases):
"""Find multi-word phrases in transcription."""
matches = []
words = [w["word"].lower().strip() for w in transcription]
for phrase in phrases:
phrase_words = phrase.lower().split()
phrase_len = len(phrase_words)
for i in range(len(words) - phrase_len + 1):
if words[i:i+phrase_len] == phrase_words:
matches.append({
"word": phrase,
"timestamp": transcription[i]["start"]
})
return matches
```