
I wanted to automatically convert podcast audio to text, so I tried OpenAI Whisper. Here's a summary of the process from installation to actual conversion.
## Quick Terminology
- Whisper: An open-source speech recognition model released by OpenAI, available in several sizes: `tiny`, `base`, `small`, `medium`, `large`.
- ffmpeg: An essential command-line tool for audio/video conversion and processing. Whisper uses it to read various audio formats.
- GPU (CUDA): If you have an NVIDIA GPU, Whisper runs much faster.
- log-Mel spectrogram: A time-frequency representation of the audio (mel-scaled, log-compressed) that Whisper computes before feeding the signal into its neural network.
- DecodingOptions / transcribe: APIs in Whisper for adjusting decoding or transcribing entire files.
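To make the fixed-window idea concrete: Whisper works on 30-second windows sampled at 16 kHz, so every input is cut or zero-padded to 480,000 samples before the spectrogram step. A minimal sketch of what `pad_or_trim` does (the constants match Whisper's audio defaults; the function here is a simplified stand-in operating on a plain list):

```python
# Sketch of whisper.pad_or_trim's behavior on a plain Python list.
# Constants mirror Whisper's audio defaults (16 kHz sample rate, 30 s windows).
SAMPLE_RATE = 16_000
CHUNK_SECONDS = 30
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480_000 samples per window

def pad_or_trim(samples, length=N_SAMPLES):
    """Return exactly `length` samples: trim extras, zero-pad the shortfall."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))
```

The real implementation does the same thing on NumPy arrays / Torch tensors.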
## Installation (Commands executable in Jupyter / Terminal)
1) Python Package Installation (Recommended)
Run in a Jupyter cell or terminal:
```shell
# (Option 1) Install from PyPI via pip
pip install -U openai-whisper

# (Option 2) Install directly from GitHub (latest source)
pip install -U git+https://github.com/openai/whisper.git

# Install PyTorch according to your system/CUDA setup.
# CPU-only example:
pip install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
# (For CUDA builds, follow the official PyTorch installation guide)
```
Note: Whisper runs on top of PyTorch. If you want GPU (CUDA) acceleration, you must install a CUDA build of PyTorch.
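Before moving on, it's worth verifying that both packages are importable (note that the pip package is `openai-whisper` but the import name is `whisper`). A small stdlib-only check:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of import names that are not installed."""
    return [n for n in names if importlib.util.find_spec(n) is None]

missing = missing_packages(["torch", "whisper"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("torch and whisper are both importable")
```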
2) ffmpeg Installation (by Platform)
- Windows (using Scoop; the method used in the original article):

  ```shell
  # PowerShell (administrator privileges recommended)
  Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
  iwr -useb get.scoop.sh | iex
  scoop install ffmpeg
  ```

- macOS (Homebrew):

  ```shell
  brew install ffmpeg
  ```

- Ubuntu / Debian:

  ```shell
  sudo apt update
  sudo apt install -y ffmpeg
  ```
Restart Jupyter/terminal after installation.
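After installing, you can confirm that Whisper will actually find ffmpeg, since it invokes the binary on your PATH. A quick stdlib-only check:

```python
import shutil
import subprocess

def ffmpeg_on_path():
    """True if an ffmpeg binary is discoverable on PATH (what Whisper needs)."""
    return shutil.which("ffmpeg") is not None

if ffmpeg_on_path():
    # The first line of `ffmpeg -version` identifies the installed build.
    out = subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True)
    print(out.stdout.splitlines()[0])
else:
    print("ffmpeg not found - install it with one of the commands above")
```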
## 🐍 Complete Python Code for Execution
Below is a complete example script that can be run directly in local/cloud environments. (Save it as `whisper_transcribe.py` or paste it directly into a Jupyter cell.)
```python
# whisper_transcribe.py
# Required libraries: torch, whisper
# Installation (terminal): pip install -U openai-whisper torch
import time

import torch
import whisper

# ---------------------
# 1) GPU check and device setup
# ---------------------
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
device = "cuda" if torch.cuda.is_available() else "cpu"

# ---------------------
# 2) Load model
# Model names: tiny, base, small, medium, large
# (The original article used "base" as an example)
# ---------------------
model_name = "base"  # Change to "tiny", "large", etc. as needed
print(f"Loading Whisper model '{model_name}' on {device} ...")
start_load = time.time()
model = whisper.load_model(model_name, device=device)
print(f"Model loaded in {time.time() - start_load:.1f}s")

# (Optional) Check model parameter count
def count_params(m):
    return sum(p.numel() for p in m.parameters())

try:
    print(f"Approx. model parameters: {count_params(model):,}")
except Exception as e:
    print("Could not count parameters:", e)

# ---------------------
# 3) Load audio and build the log-Mel spectrogram
# ---------------------
audio_path = "audio.mp3"  # replace with the path to your own file
print("Input audio:", audio_path)
audio = whisper.load_audio(audio_path)
audio = whisper.pad_or_trim(audio)  # pad/trim to Whisper's 30-second window
mel = whisper.log_mel_spectrogram(audio).to(device)

# ---------------------
# 4) Detect the spoken language (from the first 30-second window)
# ---------------------
_, probs = model.detect_language(mel)
detected = max(probs, key=probs.get)
print("Detected language:", detected)

# ---------------------
# 5) Decode only the first 30-second window
# fp16 speeds things up on GPU; it must be False on CPU
# ---------------------
options = whisper.DecodingOptions(fp16=(device == "cuda"))
result_snippet = whisper.decode(model, mel, options)
print("First 30s snippet:")
print(result_snippet.text)

# ---------------------
# 6) Transcribe the entire file
# ---------------------
t0 = time.time()
result_full = model.transcribe(audio_path)
print(f"Full transcription took {time.time() - t0:.1f}s")

full_text = result_full["text"]
print("Transcript preview:")
print(full_text[:500])

# ---------------------
# 7) Save the transcript to a text file
# ---------------------
out_txt = "transcript.txt"
with open(out_txt, "w", encoding="utf-8") as f:
    f.write(full_text)
print("Saved transcript to:", out_txt)
```
The script above exactly follows the flow described in the original article (load audio → pad_or_trim → mel → detect_language → DecodingOptions + decode → model.transcribe → output/save).
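Beyond the plain text, `model.transcribe()` also returns a `"segments"` list with start/end timestamps, which is handy for podcasts. A sketch that turns those segments into SRT-style subtitle blocks (the segment keys `"start"`, `"end"`, `"text"` are what openai-whisper returns; the formatting helpers here are my own):

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render Whisper transcribe() segments as an SRT subtitle string."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Example with the shape transcribe() produces:
demo = [{"start": 0.0, "end": 3.5, "text": " Hello and welcome."}]
print(segments_to_srt(demo))
```

In the script above you would call `segments_to_srt(result_full["segments"])` and save the result with an `.srt` extension.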
TEST: Tested with Kehlani's "How It's Done"


- Conversion is fast for short audio files.
- Korean quality is significantly lower, so it's mainly usable for English.
- The transcription isn't 100% accurate; it feels like about 60-70% accuracy.
Conclusion: It seems quite usable for getting a general understanding of English videos.
## Actual Performance (Original Article Test)
Transcribing a 1 hour 10 minute RealPython podcast episode (the original experiment):
- CPU only: about 56 minutes
- GPU (cloud, CUDA): about 4 minutes

Conclusion: a GPU makes transcription dramatically faster. (Results vary significantly by model size and hardware.)
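Putting those figures in real-time-factor terms (same numbers as above, 70 minutes of audio):

```python
# Real-time factor: minutes of audio processed per minute of wall-clock time.
audio_min = 70            # 1 h 10 min podcast
cpu_min, gpu_min = 56, 4  # measured transcription times from the test above

print(f"CPU real-time factor: {audio_min / cpu_min:.2f}x")  # 1.25x
print(f"GPU real-time factor: {audio_min / gpu_min:.2f}x")  # 17.50x
print(f"GPU speedup over CPU: {cpu_min / gpu_min:.0f}x")    # 14x
```

So on CPU the `base` model barely keeps ahead of real time, while the GPU run is roughly 14x faster.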

## Practical Tips & Notes
- Without ffmpeg, Whisper can't read many audio formats (.m4a, .ogg, etc.), so make sure to install it.
- For large audio files (over 1 hour), memory and time become constraints: test with `base` or `tiny` first, then check accuracy with `medium`/`large`.
- Using `fp16=True` on a GPU makes decoding faster (but may cause stability issues in some environments).
- There are many Whisper versions/variants (community forks), so API calling conventions may differ slightly between versions.
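For the long-file tip above, one practical approach is to split the audio into chunks before transcribing. ffmpeg's segment muxer can do this without re-encoding; here is a sketch that only builds the command (the file names and the 10-minute chunk length are placeholders):

```python
def ffmpeg_split_command(src, chunk_seconds=600, out_pattern="chunk_%03d.mp3"):
    """Build an ffmpeg command that splits `src` into chunk_seconds-long files.

    Uses ffmpeg's segment muxer with stream copy (-c copy), so no re-encoding.
    """
    return [
        "ffmpeg", "-i", src,
        "-f", "segment",
        "-segment_time", str(chunk_seconds),
        "-c", "copy",
        out_pattern,
    ]

cmd = ffmpeg_split_command("podcast.mp3", chunk_seconds=600)
print(" ".join(cmd))
# Run with: subprocess.run(cmd, check=True), then transcribe each chunk in a loop.
```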
