
I wanted to automatically convert podcast audio to text, so I tried OpenAI Whisper. Here's a summary of the process from installation to actual conversion.

## Quick Terminology

  • Whisper: An open-source speech recognition model released by OpenAI (various sizes: tiny, base, small, medium, large).
  • ffmpeg: An essential command-line tool for audio/video conversion and processing. Whisper uses it to read various audio formats.
  • GPU (CUDA): If you have an NVIDIA GPU, Whisper runs much faster.
  • log-Mel spectrogram: A spectrogram representation of the audio, computed before it is fed into the neural network.
  • DecodingOptions / transcribe: APIs in Whisper for adjusting decoding or transcribing entire files.
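Whisper processes audio in fixed 30-second windows at 16 kHz, which is what `pad_or_trim` (used later in the script) enforces. A minimal NumPy sketch of that idea, not Whisper's actual implementation:

```python
import numpy as np

SAMPLE_RATE = 16_000               # Whisper resamples all audio to 16 kHz
CHUNK_SAMPLES = 30 * SAMPLE_RATE   # one 30-second window = 480,000 samples

def pad_or_trim_sketch(audio: np.ndarray, length: int = CHUNK_SAMPLES) -> np.ndarray:
    """Zero-pad or truncate so the array is exactly `length` samples."""
    if audio.shape[0] > length:
        return audio[:length]
    return np.pad(audio, (0, length - audio.shape[0]))

print(pad_or_trim_sketch(np.zeros(10)).shape)
```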


## Installation (Commands executable in Jupyter / Terminal)

1) Python Package Installation (Recommended)

Run in a Jupyter cell or terminal:

```bash
# (Option 1) Install from official repository via pip
pip install -U openai-whisper

# (Option 2) Install directly from GitHub (latest source)
pip install -U git+https://github.com/openai/whisper.git

# Install PyTorch according to your system/CUDA. Example: CPU-only or CUDA version
# CPU version example:
pip install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
# (For CUDA, follow the official PyTorch installation guide)
```

Note: Whisper runs on top of PyTorch. If you want GPU (CUDA) acceleration, you must install a CUDA build of PyTorch.
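Before moving on, you can sanity-check that the packages are importable without actually loading a model. This stdlib-only check uses `importlib.util.find_spec`, which reports whether a package is installed without importing it:

```python
import importlib.util

def check_packages(pkgs=("torch", "whisper")):
    """Return {package_name: True/False} depending on whether it is installed."""
    return {pkg: importlib.util.find_spec(pkg) is not None for pkg in pkgs}

for pkg, found in check_packages().items():
    print(f"{pkg}: {'found' if found else 'MISSING -> run the pip install above'}")
```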



2) ffmpeg Installation (by Platform)

  • Windows (using Scoop; the method used in the original article):

    ```powershell
    # PowerShell (administrator privileges recommended)
    Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
    iwr -useb get.scoop.sh | iex
    scoop install ffmpeg
    ```
  • macOS (Homebrew):

    ```bash
    brew install ffmpeg
    ```
  • Ubuntu / Debian:

    ```bash
    sudo apt update
    sudo apt install -y ffmpeg
    ```

Restart Jupyter/terminal after installation.
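To confirm ffmpeg is actually reachable from the environment Whisper will run in, you can check PATH from Python with the standard library (no third-party dependencies):

```python
import shutil
import subprocess

def ffmpeg_on_path() -> bool:
    """True if the ffmpeg binary is reachable on PATH."""
    return shutil.which("ffmpeg") is not None

if ffmpeg_on_path():
    # Print the first line of `ffmpeg -version`, e.g. "ffmpeg version 6.0 ..."
    out = subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True)
    print(out.stdout.splitlines()[0])
else:
    print("ffmpeg not found -- install it and restart Jupyter/terminal")
```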



## ๐Ÿ Complete Python Code for Execution

Below is a complete example script that can be run directly in local/cloud environments. (Save as whisper_transcribe.py or paste directly into a Jupyter cell.)

```python
# whisper_transcribe.py
# Required libraries: torch, whisper, time
# Installation (terminal): pip install -U openai-whisper torch

import time
import torch
import whisper

# ---------------------
# 1) GPU check and device setup
# ---------------------
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
device = "cuda" if torch.cuda.is_available() else "cpu"

# ---------------------
# 2) Load model
#    Model names: tiny, base, small, medium, large
#    (The original article used "base" as an example)
# ---------------------
model_name = "base"   # Change to "tiny" or "large" etc. as needed
print(f"Loading Whisper model '{model_name}' on {device} ...")
start_load = time.time()
model = whisper.load_model(model_name, device=device)
load_time = time.time() - start_load
print(f"Model loaded in {load_time:.1f} seconds.")

# ---------------------
# (Optional) Check model parameter count
# ---------------------
def count_params(m):
    return sum(p.numel() for p in m.parameters())

try:
    total_params = count_params(model)
    print(f"Approx. model parameters: {total_params:,}")
except Exception as e:
    print("Could not count model params:", e)

# ---------------------
# 3) Load audio (file must be in local directory)
#    - load_audio: adjusts sample rate and returns numpy array
#    - pad_or_trim: pads/trims to 30-second slots (as described in original)
# ---------------------
audio_path = "realpython_podcast.mp3"   # Replace with actual filename
print("Loading audio:", audio_path)
audio = whisper.load_audio(audio_path)  # (numpy) audio samples
audio = whisper.pad_or_trim(audio)      # Adjust length (method used in original)
mel = whisper.log_mel_spectrogram(audio).to(device)

# ---------------------
# 4) Language detection (detect_language)
# ---------------------
print("Detecting language...")
lang_probs = model.detect_language(mel)

# detect_language returns (language, probs) in some versions; handle generically:
if isinstance(lang_probs, tuple) and len(lang_probs) >= 2:
    _, probs = lang_probs
    detected = max(probs, key=probs.get)
else:
    # newer whisper wraps differently; attempt to read probs directly
    probs = lang_probs
    detected = max(probs, key=probs.get) if isinstance(probs, dict) else "unknown"
print("Detected language:", detected)

# ---------------------
# 5) Decoding (check first 30 seconds) - DecodingOptions + decode
# ---------------------
print("Decoding short snippet (first 30s) using DecodingOptions...")
options = whisper.DecodingOptions(fp16=False)  # fp16=False recommended for CPU stability
result_snippet = whisper.decode(model, mel, options)
print("Snippet text (first chunk):")
print(result_snippet.text)

# ---------------------
# 6) Full file transcription (transcribe)
#    - model.transcribe() internally splits and processes the file
# ---------------------
print("Transcribing full file ... (this may take time on CPU)")
t0 = time.time()
result_full = model.transcribe(audio_path)  # or model.transcribe(audio, **kwargs)
elapsed = time.time() - t0
print(f"Transcription finished in {elapsed/60:.2f} minutes ({elapsed:.1f} seconds).")

# Access result text:
full_text = result_full.get("text", "") if isinstance(result_full, dict) else str(result_full)
print("---- Transcript preview (first 500 chars) ----")
print(full_text[:500])
print("---- End preview ----")

# ---------------------
# 7) Save results
# ---------------------
out_txt = "transcript.txt"
with open(out_txt, "w", encoding="utf-8") as f:
    f.write(full_text)
print("Saved transcript to:", out_txt)
```
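Besides the flat `"text"` field, `model.transcribe()` also returns per-segment timestamps under `result_full["segments"]` (each segment has `start`, `end`, and `text` in seconds). A sketch of formatting them for display, shown here with a hand-made sample dict in the same shape so it runs without a model:

```python
# Hypothetical sample shaped like model.transcribe()'s return value:
result_full = {
    "text": " Hello and welcome to the podcast.",
    "segments": [
        {"start": 0.0, "end": 3.2, "text": " Hello and welcome"},
        {"start": 3.2, "end": 5.0, "text": " to the podcast."},
    ],
}

def fmt_ts(seconds: float) -> str:
    """Format seconds as MM:SS.ss for display."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes):02d}:{secs:05.2f}"

for seg in result_full["segments"]:
    print(f"[{fmt_ts(seg['start'])} -> {fmt_ts(seg['end'])}]{seg['text']}")
# [00:00.00 -> 00:03.20] Hello and welcome
# [00:03.20 -> 00:05.00] to the podcast.
```

The same loop is a natural starting point if you later want subtitle output (e.g. SRT), since those formats are just timestamped text lines.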