
I wanted to automatically convert podcast audio to text, so I tried OpenAI Whisper. Here's a summary of the process from installation to actual conversion.
## Quick Terminology
- Whisper: An open-source speech recognition model released by OpenAI, available in several sizes (tiny, base, small, medium, large).
- ffmpeg: An essential command-line tool for audio/video conversion and processing; Whisper uses it to read various audio formats.
- GPU (CUDA): If you have an NVIDIA GPU, Whisper runs much faster.
- log-Mel spectrogram: The time-frequency representation of the audio that is fed into the neural network.
- DecodingOptions / transcribe: Whisper's APIs for adjusting decoding and for transcribing entire files (see the short sketch after this list).
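To make these terms concrete, here is a minimal sketch of the high-level API (the model name "base" and the filename audio.mp3 are placeholders, not from the original article):

```python
import whisper

# Load a model and transcribe a local file in one call.
# "base" and "audio.mp3" are placeholder values - substitute your own.
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
```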
## Installation (Commands executable in Jupyter / Terminal)
1) Python Package Installation (Recommended)
Run in a Jupyter cell or terminal:
```bash
# (Option 1) Install the official package via pip
pip install -U openai-whisper

# (Option 2) Install directly from GitHub (latest source)
pip install -U git+https://github.com/openai/whisper.git

# Install PyTorch according to your system/CUDA setup.
# CPU-only example:
pip install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
# (For CUDA builds, follow the official PyTorch installation guide.)
```
Note: Whisper runs on top of PyTorch. If you want GPU (CUDA) acceleration, you must install a CUDA build of PyTorch.
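Before going further, it may help to verify the PyTorch install and whether a CUDA build is active. A quick sanity-check snippet:

```python
import torch

# Report the installed PyTorch version and whether CUDA is usable.
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```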
2) ffmpeg Installation (by Platform)
- Windows (using Scoop; the method used in the original article):

```powershell
# PowerShell (administrator privileges recommended)
Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
iwr -useb get.scoop.sh | iex
scoop install ffmpeg
```

- macOS (Homebrew):

```bash
brew install ffmpeg
```

- Ubuntu / Debian:

```bash
sudo apt update
sudo apt install -y ffmpeg
```
Restart Jupyter or your terminal after installation so the updated PATH is picked up.
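Whisper shells out to ffmpeg, so the executable must be discoverable on your PATH. A small sanity-check sketch (shutil.which simply looks the binary up on PATH):

```python
import shutil
import subprocess

# Check that ffmpeg is on PATH and print its version banner.
ffmpeg_path = shutil.which("ffmpeg")
print("ffmpeg found at:", ffmpeg_path)
if ffmpeg_path:
    out = subprocess.run([ffmpeg_path, "-version"], capture_output=True, text=True)
    print(out.stdout.splitlines()[0])  # e.g. "ffmpeg version 6.x ..."
```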
## Complete Python Code for Execution
Below is a complete example script that can be run directly in a local or cloud environment. (Save it as whisper_transcribe.py or paste it into a Jupyter cell.)
```python
# whisper_transcribe.py
# Required third-party libraries: torch, whisper ("time" is in the standard library)
# Installation (terminal): pip install -U openai-whisper torch

import time

import torch
import whisper

# ---------------------
# 1) GPU check and device setup
# ---------------------
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
device = "cuda" if torch.cuda.is_available() else "cpu"

# ---------------------
# 2) Load model
#    Model names: tiny, base, small, medium, large
#    (The original article used "base" as an example)
# ---------------------
model_name = "base"  # change to "tiny", "large", etc. as needed
print(f"Loading Whisper model '{model_name}' on {device} ...")
start_load = time.time()
model = whisper.load_model(model_name, device=device)
load_time = time.time() - start_load
print(f"Model loaded in {load_time:.1f} seconds.")

# (Optional) Check the model parameter count
def count_params(m):
    return sum(p.numel() for p in m.parameters())

try:
    total_params = count_params(model)
    print(f"Approx. model parameters: {total_params:,}")
except Exception as e:
    print("Could not count model params:", e)

# ---------------------
# 3) Load audio (the file must be in the local directory)
#    - load_audio: resamples to 16 kHz and returns a numpy array
#    - pad_or_trim: pads/trims to a 30-second window (as described in the original)
# ---------------------
audio_path = "realpython_podcast.mp3"  # replace with your actual filename
print("Loading audio:", audio_path)
audio = whisper.load_audio(audio_path)  # numpy array of audio samples
audio = whisper.pad_or_trim(audio)      # fit to Whisper's 30-second input window
mel = whisper.log_mel_spectrogram(audio).to(device)

# ---------------------
# 4) Language detection (detect_language)
# ---------------------
print("Detecting language...")
lang_probs = model.detect_language(mel)
# detect_language returns (tokens, probs) in some versions; handle both shapes:
if isinstance(lang_probs, tuple) and len(lang_probs) >= 2:
    _, probs = lang_probs
    detected = max(probs, key=probs.get)
else:
    probs = lang_probs
    detected = max(probs, key=probs.get) if isinstance(probs, dict) else "unknown"
print("Detected language:", detected)

# ---------------------
# 5) Decoding (check the first 30 seconds) - DecodingOptions + decode
# ---------------------
print("Decoding short snippet (first 30s) using DecodingOptions...")
options = whisper.DecodingOptions(fp16=False)  # fp16=False is safer on CPU
result_snippet = whisper.decode(model, mel, options)
print("Snippet text (first chunk):")
print(result_snippet.text)

# ---------------------
# 6) Full-file transcription (transcribe)
#    - model.transcribe() splits and processes the file internally
# ---------------------
print("Transcribing full file ... (this may take a while on CPU)")
t0 = time.time()
result_full = model.transcribe(audio_path)  # or model.transcribe(audio, **kwargs)
elapsed = time.time() - t0
print(f"Transcription finished in {elapsed/60:.2f} minutes ({elapsed:.1f} seconds).")

# Access the result text:
full_text = result_full["text"] if isinstance(result_full, dict) else str(result_full)
print("---- Transcript preview (first 500 chars) ----")
print(full_text[:500])
print("---- End preview ----")

# ---------------------
# 7) Save results
# ---------------------
out_txt = "transcript.txt"
with open(out_txt, "w", encoding="utf-8") as f:
    f.write(full_text)
print("Saved transcript to:", out_txt)
```
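model.transcribe() also accepts decoding options as keyword arguments, which the script hints at with model.transcribe(audio, **kwargs). A sketch of a few commonly useful ones (the values here are illustrative, and `model` comes from the script above):

```python
# language skips auto-detection, fp16=False avoids the half-precision
# warning on CPU, and verbose=True prints segments as they are decoded.
result = model.transcribe(
    "realpython_podcast.mp3",
    language="en",
    fp16=False,
    verbose=True,
)
print(result["language"], "-", len(result["segments"]), "segments")
```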

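The result dictionary also contains timestamped segments, which makes it easy to produce subtitles instead of plain text. A sketch that writes an SRT file from result_full in the script above (the output filename is arbitrary):

```python
# Each segment carries "start"/"end" times in seconds plus its "text".
def fmt_time(t: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    ms = int((t - int(t)) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("transcript.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result_full["segments"], start=1):
        f.write(f"{i}\n{fmt_time(seg['start'])} --> {fmt_time(seg['end'])}\n")
        f.write(seg["text"].strip() + "\n\n")
```

If you prefer not to write any Python at all, the pip package also installs a `whisper` command-line tool (e.g. `whisper realpython_podcast.mp3 --model base`) that writes .txt, .srt, and .vtt outputs directly.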





