
I wanted to automatically convert podcast audio to text, so I tried OpenAI Whisper. Here's a summary of the process from installation to actual conversion.
## Quick Terminology
- Whisper: An open-source speech recognition model released by OpenAI, available in several sizes: `tiny`, `base`, `small`, `medium`, `large`.
- ffmpeg: An essential command-line tool for audio/video conversion and processing. Whisper uses it to read various audio formats.
- GPU (CUDA): If you have an NVIDIA GPU, Whisper runs much faster.
- log-Mel spectrogram: A time-frequency representation of the audio (mel-scaled, log-compressed) that Whisper computes before feeding the signal into its neural network.
- DecodingOptions / transcribe: APIs in Whisper for adjusting decoding or transcribing entire files.
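To make the fixed-window idea concrete: Whisper works on 30-second windows sampled at 16 kHz, so every input is cut or zero-padded to 480,000 samples before the spectrogram step. A minimal sketch of what `pad_or_trim` does (the constants match Whisper's audio defaults; the function here is a simplified stand-in operating on a plain list):

```python
# Sketch of whisper.pad_or_trim's behavior on a plain Python list.
# Constants mirror Whisper's audio defaults (16 kHz sample rate, 30 s windows).
SAMPLE_RATE = 16_000
CHUNK_SECONDS = 30
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480_000 samples per window

def pad_or_trim(samples, length=N_SAMPLES):
    """Return exactly `length` samples: trim extras, zero-pad the shortfall."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))
```

The real implementation does the same thing on NumPy arrays / Torch tensors.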
## Installation (Commands executable in Jupyter / Terminal)
1) Python Package Installation (Recommended)
Run in a Jupyter cell or terminal:
```shell
# (Option 1) Install from PyPI via pip
pip install -U openai-whisper

# (Option 2) Install directly from GitHub (latest source)
pip install -U git+https://github.com/openai/whisper.git

# Install PyTorch according to your system/CUDA setup.
# CPU-only example:
pip install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
# (For CUDA builds, follow the official PyTorch installation guide)
```
Note: Whisper runs on top of PyTorch. If you want GPU (CUDA) acceleration, you must install a CUDA build of PyTorch.
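Before moving on, it's worth verifying that both packages are importable (note that the pip package is `openai-whisper` but the import name is `whisper`). A small stdlib-only check:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of import names that are not installed."""
    return [n for n in names if importlib.util.find_spec(n) is None]

missing = missing_packages(["torch", "whisper"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("torch and whisper are both importable")
```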
2) ffmpeg Installation (by Platform)
- Windows (using Scoop; the method used in the original article):

  ```shell
  # PowerShell (administrator privileges recommended)
  Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
  iwr -useb get.scoop.sh | iex
  scoop install ffmpeg
  ```

- macOS (Homebrew):

  ```shell
  brew install ffmpeg
  ```

- Ubuntu / Debian:

  ```shell
  sudo apt update
  sudo apt install -y ffmpeg
  ```
Restart Jupyter/terminal after installation.
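After installing, you can confirm that Whisper will actually find ffmpeg, since it invokes the binary on your PATH. A quick stdlib-only check:

```python
import shutil
import subprocess

def ffmpeg_on_path():
    """True if an ffmpeg binary is discoverable on PATH (what Whisper needs)."""
    return shutil.which("ffmpeg") is not None

if ffmpeg_on_path():
    # The first line of `ffmpeg -version` identifies the installed build.
    out = subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True)
    print(out.stdout.splitlines()[0])
else:
    print("ffmpeg not found - install it with one of the commands above")
```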
## 🐍 Complete Python Code for Execution
Below is a complete example script that can be run directly in local/cloud environments. (Save it as `whisper_transcribe.py` or paste it directly into a Jupyter cell.)
```python
# whisper_transcribe.py
# Required libraries: torch, whisper
# Installation (terminal): pip install -U openai-whisper torch
import time

import torch
import whisper

# ---------------------
# 1) GPU check and device setup
# ---------------------
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
device = "cuda" if torch.cuda.is_available() else "cpu"

# ---------------------
# 2) Load model
# Model names: tiny, base, small, medium, large
# (The original article used "base" as an example)
# ---------------------
model_name = "base"  # Change to "tiny", "large", etc. as needed
print(f"Loading Whisper model '{model_name}' on {device} ...")
start_load = time.time()
model = whisper.load_model(model_name, device=device)
print(f"Model loaded in {time.time() - start_load:.1f}s")

# (Optional) Check model parameter count
def count_params(m):
    return sum(p.numel() for p in m.parameters())

try:
    print(f"Approx. model parameters: {count_params(model):,}")
except Exception as e:
    print("Could not count parameters:", e)

# ---------------------
# 3) Load audio and build the log-Mel spectrogram
# ---------------------
audio_path = "audio.mp3"  # replace with the path to your own file
print("Input audio:", audio_path)
audio = whisper.load_audio(audio_path)
audio = whisper.pad_or_trim(audio)  # pad/trim to Whisper's 30-second window
mel = whisper.log_mel_spectrogram(audio).to(device)

# ---------------------
# 4) Detect the spoken language (from the first 30-second window)
# ---------------------
_, probs = model.detect_language(mel)
detected = max(probs, key=probs.get)
print("Detected language:", detected)

# ---------------------
# 5) Decode only the first 30-second window
# fp16 speeds things up on GPU; it must be False on CPU
# ---------------------
options = whisper.DecodingOptions(fp16=(device == "cuda"))
result_snippet = whisper.decode(model, mel, options)
print("First 30s snippet:")
print(result_snippet.text)

# ---------------------
# 6) Transcribe the entire file
# ---------------------
t0 = time.time()
result_full = model.transcribe(audio_path)
print(f"Full transcription took {time.time() - t0:.1f}s")

full_text = result_full["text"]
print("Transcript preview:")
print(full_text[:500])

# ---------------------
# 7) Save the transcript to a text file
# ---------------------
out_txt = "transcript.txt"
with open(out_txt, "w", encoding="utf-8") as f:
    f.write(full_text)
print("Saved transcript to:", out_txt)
```
The script above exactly follows the flow described in the original article (load audio → pad_or_trim → mel → detect_language → DecodingOptions + decode → model.transcribe → output/save).
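Beyond the plain text, `model.transcribe()` also returns a `"segments"` list with start/end timestamps, which is handy for podcasts. A sketch that turns those segments into SRT-style subtitle blocks (the segment keys `"start"`, `"end"`, `"text"` are what openai-whisper returns; the formatting helpers here are my own):

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render Whisper transcribe() segments as an SRT subtitle string."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Example with the shape transcribe() produces:
demo = [{"start": 0.0, "end": 3.5, "text": " Hello and welcome."}]
print(segments_to_srt(demo))
```

In the script above you would call `segments_to_srt(result_full["segments"])` and save the result with an `.srt` extension.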
TEST: Tested with Kehlani's "How It's Done"


- Conversion is fast for short audio files.
- Korean quality is significantly lower, so it's mainly usable for English.
- The transcription isn't 100% accurate; it feels like about 60-70% accuracy.
Conclusion: It seems quite usable for getting a general understanding of English videos.
## Actual Performance (Original Article Test)
Transcribing a 1 hour 10 minute RealPython podcast episode (the original experiment):
- CPU only: about 56 minutes
- GPU (cloud, CUDA): about 4 minutes

Conclusion: a GPU makes transcription dramatically faster. (Results vary significantly by model size and hardware.)
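Putting those figures in real-time-factor terms (same numbers as above, 70 minutes of audio):

```python
# Real-time factor: minutes of audio processed per minute of wall-clock time.
audio_min = 70            # 1 h 10 min podcast
cpu_min, gpu_min = 56, 4  # measured transcription times from the test above

print(f"CPU real-time factor: {audio_min / cpu_min:.2f}x")  # 1.25x
print(f"GPU real-time factor: {audio_min / gpu_min:.2f}x")  # 17.50x
print(f"GPU speedup over CPU: {cpu_min / gpu_min:.0f}x")    # 14x
```

So on CPU the `base` model barely keeps ahead of real time, while the GPU run is roughly 14x faster.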

## Practical Tips & Notes
- Without ffmpeg, Whisper can't read many audio formats (.m4a, .ogg, etc.), so make sure to install it.
- For large audio files (over 1 hour), memory and time become constraints: test with `base` or `tiny` first, then check accuracy with `medium`/`large`.
- Using `fp16=True` on a GPU makes decoding faster (but may cause stability issues in some environments).
- There are many Whisper versions/variants (community forks), so API calling conventions may differ slightly between versions.
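For the long-file tip above, one practical approach is to split the audio into chunks before transcribing. ffmpeg's segment muxer can do this without re-encoding; here is a sketch that only builds the command (the file names and the 10-minute chunk length are placeholders):

```python
def ffmpeg_split_command(src, chunk_seconds=600, out_pattern="chunk_%03d.mp3"):
    """Build an ffmpeg command that splits `src` into chunk_seconds-long files.

    Uses ffmpeg's segment muxer with stream copy (-c copy), so no re-encoding.
    """
    return [
        "ffmpeg", "-i", src,
        "-f", "segment",
        "-segment_time", str(chunk_seconds),
        "-c", "copy",
        out_pattern,
    ]

cmd = ffmpeg_split_command("podcast.mp3", chunk_seconds=600)
print(" ".join(cmd))
# Run with: subprocess.run(cmd, check=True), then transcribe each chunk in a loop.
```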
