Editor's Note
whisper-transcribe
Install
npx skills add https://github.com/majiayu000/claude-skill-registry --skill whisper-transcribe
Whisper Transcribe Skill
Transcribe audio and video files to text using OpenAI's Whisper with contextual grounding from markdown files.
Purpose
Intelligent audio/video transcription that:
- Converts media files to accurate text transcripts
- Uses markdown context files to correct technical terms, names, and jargon
- Handles various audio/video formats (mp3, wav, m4a, mp4, webm, etc.)
When to Use
- User asks to transcribe an audio or video file
- User wants to convert a recording to text
- User mentions "whisper" in context of transcription
- User needs meeting notes or interview transcripts
- User has media files with domain-specific terminology
Installation
macOS (Recommended for MacBook Pro)
# Install via Homebrew (recommended)
brew install ffmpeg openai-whisper
# Verify installation (the whisper CLI has no --version flag; --help prints usage)
whisper --help
Linux/pip Installation
# Install ffmpeg first
sudo apt install ffmpeg # Debian/Ubuntu
# or: sudo dnf install ffmpeg # Fedora
# Install Whisper
pip install openai-whisper
Verify Installation
whisper --help
ffmpeg -version
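The same check can be scripted. The following is a minimal sketch, not part of the skill itself: the helper name `missing_tools` is hypothetical, and it only confirms that the `whisper` and `ffmpeg` executables are on `PATH`, not that they work.

```python
import shutil


def missing_tools(tools=("whisper", "ffmpeg"), which=shutil.which):
    """Return the names of required executables that are not found on PATH.

    `which` is injectable so the check can be exercised without touching
    the real environment (hypothetical helper, for illustration only).
    """
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("whisper and ffmpeg are both on PATH")
```

If either name is reported missing, the Troubleshooting section covers the install commands.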
Transcription Workflow
Step 1: Identify Media File and Context
- Locate the audio/video file to transcribe
- Check for markdown files in the same directory (context files)
- If no context files exist, optionally create one using assets/context-template.md
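Locating context files can be sketched in a few lines of Python. This is an illustration under the assumption that every `.md` file next to the media file counts as context; `find_context_files` is a hypothetical name, not a function confirmed to exist in the skill's script.

```python
from pathlib import Path


def find_context_files(media_path):
    """Return the markdown files in the media file's directory, sorted by path."""
    media_dir = Path(media_path).resolve().parent
    # Only direct siblings are considered; subdirectories are not searched.
    return sorted(p for p in media_dir.glob("*.md") if p.is_file())


if __name__ == "__main__":
    for path in find_context_files("audio.mp3"):
        print(path)
```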
Step 2: Run Whisper Transcription
Basic transcription:
whisper "/path/to/audio.mp3" --output_dir "/path/to/output"
With model selection (trade-off: speed vs accuracy):
# Fast (less accurate)
whisper "audio.mp3" --model tiny
# Balanced (recommended)
whisper "audio.mp3" --model base
# High quality
whisper "audio.mp3" --model small
# Best quality (slower, requires more RAM)
whisper "audio.mp3" --model medium
whisper "audio.mp3" --model large
With language specification:
whisper "audio.mp3" --language en
Output format options:
whisper "audio.mp3" --output_format txt # Plain text
whisper "audio.mp3" --output_format srt # Subtitles
whisper "audio.mp3" --output_format vtt # Web subtitles
whisper "audio.mp3" --output_format json # Detailed JSON
whisper "audio.mp3" --output_format all # All formats
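When the CLI is driven from a script, the options above can be assembled into one argument list. The sketch below uses only the flags shown in this section; the function names and default values are assumptions chosen for illustration.

```python
import subprocess


def build_whisper_command(media_path, output_dir=".", model="base",
                          language=None, output_format="txt"):
    """Build the argument list for the whisper CLI from the options above."""
    cmd = [
        "whisper", str(media_path),
        "--model", model,
        "--output_dir", str(output_dir),
        "--output_format", output_format,
    ]
    if language:
        # Only pass --language when the caller specifies one.
        cmd += ["--language", language]
    return cmd


def run_whisper(media_path, **options):
    """Run whisper and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_whisper_command(media_path, **options), check=True)
```

Passing a list rather than a shell string avoids quoting problems with paths that contain spaces.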
Step 3: Apply Context Grounding
Use the scripts/transcribe_with_context.py script for automated grounding, or manually apply corrections:
# Automated approach (recommended)
python scripts/transcribe_with_context.py /path/to/audio.mp3
For manual grounding:
- Read the transcript output
- Read all .md files in the media file's directory
- Extract terminology, names, and technical terms from context files
- Search transcript for likely misrecognitions
- Apply corrections based on context
Common corrections:
- "cooler net ease" -> "Kubernetes"
- "sequel" -> "SQL"
- "post gress" -> "Postgres"
- Names: Match phonetic variations to names in context files
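A simple way to apply such corrections is a case-insensitive, whole-phrase replacement. The sketch below hard-codes the three example corrections listed above; how scripts/transcribe_with_context.py actually performs grounding is not shown in this document, so treat this as one possible approach rather than the script's method.

```python
import re

# Example misrecognitions taken from the list above.
CORRECTIONS = {
    "cooler net ease": "Kubernetes",
    "sequel": "SQL",
    "post gress": "Postgres",
}


def apply_corrections(text, corrections=CORRECTIONS):
    """Replace each misrecognized phrase with its corrected term."""
    # Longest phrases first, so a longer phrase is never split by a shorter one.
    for wrong in sorted(corrections, key=len, reverse=True):
        right = corrections[wrong]
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # A function replacement keeps backslashes in `right` literal.
        text = pattern.sub(lambda match, right=right: right, text)
    return text


if __name__ == "__main__":
    print(apply_corrections("We run cooler net ease with a Sequel database."))
```

Blanket replacements like "sequel" -> "SQL" can also rewrite legitimate uses of the word, so review the result when the recording is not purely technical.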
Step 4: Save Corrected Transcript
Save the grounded transcript with a clear filename:
original_filename_transcript.txt
original_filename_transcript.md
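The naming convention can be derived from the media path with `pathlib`. This is a small sketch; `transcript_path` is a hypothetical helper name.

```python
from pathlib import Path


def transcript_path(media_path, suffix=".txt"):
    """Return <original_filename>_transcript<suffix> next to the media file."""
    media = Path(media_path)
    return media.with_name(f"{media.stem}_transcript{suffix}")


if __name__ == "__main__":
    print(transcript_path("recordings/standup.mp3"))
    print(transcript_path("recordings/standup.mp3", ".md"))
```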
Context Files
Context files are markdown files in the same directory as the media file. They provide grounding information to improve transcription accuracy.
What to Include in Context Files
- People: Names of speakers, team members, interviewees
- Technical Terms: Domain-specific vocabulary, product names
- Acronyms: Abbreviations and their expansions
- Organizations: Company names, department names
- Projects: Project codenames, feature names
Context File Example
See assets/context-template.md for a complete template.
# Meeting Context
## Speakers
- Richard Hightower (host)
- Jane Smith (engineering lead)
## Technical Terms
- Kubernetes (container orchestration)
- FastAPI (Python web framework)
- AlloyDB (Google Cloud database)
## Acronyms
- CI/CD - Continuous Integration/Continuous Deployment
- PR - Pull Request
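To use a context file programmatically, the terms have to be pulled out of it. The sketch below assumes the layout of the example above (`## ` section headings followed by `- ` bullets, with an optional parenthetical note or ` - ` expansion after each term); the real template may differ, and `extract_terms` is a hypothetical name.

```python
import re


def extract_terms(markdown_text):
    """Map each '## ' heading to the leading term of every bullet beneath it."""
    sections = {}
    current = None
    for raw_line in markdown_text.splitlines():
        line = raw_line.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif line.startswith("- ") and current is not None:
            # Drop a trailing "(note)" or " - expansion" so only the term remains.
            term = re.split(r"\s+\(|\s+-\s+", line[2:], maxsplit=1)[0].strip()
            if term:
                sections[current].append(term)
    return sections


if __name__ == "__main__":
    sample = "## Acronyms\n- CI/CD - Continuous Integration/Continuous Deployment\n"
    print(extract_terms(sample))
```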
Model Selection Guide
Use base for everyday transcription and medium for important recordings. See references/whisper-options.md for the full model comparison and all available options.
Quick reference, from fastest to most accurate: tiny (fastest) < base (balanced) < small (better) < medium (high) < large (best accuracy)
For a MacBook Pro with Apple Silicon, the small or medium model is recommended for the best speed/accuracy balance.
Troubleshooting
"whisper: command not found"
# macOS
brew install openai-whisper
# Linux
pip install openai-whisper
export PATH="$HOME/.local/bin:$PATH"
"ffmpeg not found"
# macOS
brew install ffmpeg
# Linux
sudo apt install ffmpeg
Out of memory errors
Use a smaller model:
whisper "audio.mp3" --model tiny
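Stepping down through the model sizes can be automated. The sketch below is an assumption-laden illustration: it treats any non-zero exit code from whisper as a reason to retry with the next smaller model, because this document does not describe a dedicated out-of-memory signal. The function names are hypothetical.

```python
import subprocess

# Model names in increasing size, as listed in the Model Selection Guide.
MODELS = ["tiny", "base", "small", "medium", "large"]


def fallback_models(model):
    """Return the requested model followed by every smaller one, largest first."""
    index = MODELS.index(model)
    return list(reversed(MODELS[:index + 1]))


def transcribe_with_fallback(media_path, model="medium", run=subprocess.run):
    """Try progressively smaller models until one run exits successfully.

    Assumption: a non-zero exit code is treated as "try a smaller model";
    it is not specifically detected as an out-of-memory condition.
    """
    for candidate in fallback_models(model):
        result = run(["whisper", str(media_path), "--model", candidate])
        if result.returncode == 0:
            return candidate
    raise RuntimeError(f"whisper failed for every model up to {model!r}")
```

The function returns the name of the model that succeeded, which is worth recording alongside the transcript since accuracy differs between models.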
Slow transcription
- Use the tiny or base model for faster results
- Ensure the correct architecture is being used (Apple Silicon vs Intel)
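One way to check the architecture is to ask the Python interpreter that runs Whisper what it was built for. This is a rough sketch using the standard `platform` module; `architecture_hint` is a hypothetical helper, and an `x86_64` result on a Mac cannot by itself distinguish an Intel machine from an Intel build running under Rosetta.

```python
import platform


def architecture_hint(system=None, machine=None):
    """Describe the architecture the current Python interpreter is running as."""
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        return "native Apple Silicon build of Python"
    if system == "Darwin" and machine == "x86_64":
        return "Intel build of Python (possibly under Rosetta on Apple Silicon)"
    return f"{system} on {machine}"


if __name__ == "__main__":
    print(architecture_hint())
```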
Resources
scripts/
The scripts/transcribe_with_context.py script automates the full workflow:
- Finds context files automatically
- Runs Whisper transcription
- Applies context-based corrections
- Saves the final transcript
Usage:
python scripts/transcribe_with_context.py /path/to/audio.mp3
references/
See references/whisper-options.md for complete CLI reference and advanced options.
assets/
The assets/context-template.md file provides a template for creating context files to improve transcription accuracy.