
I wanted to automatically convert podcast audio to text, so I tried OpenAI Whisper. Here's a summary of the process from installation to actual conversion.

## Quick Terminology

  • Whisper: An open-source speech recognition model released by OpenAI (various sizes: tiny, base, small, medium, large).
  • ffmpeg: An essential command-line tool for audio/video conversion and processing. Whisper uses it to read various audio formats.
  • GPU (CUDA): With an NVIDIA GPU, Whisper runs much faster.
  • log-Mel spectrogram: The spectrogram representation of the audio that is fed into the neural network.
  • DecodingOptions / transcribe: Whisper APIs for tuning how a 30-second segment is decoded and for transcribing entire files, respectively.


## Installation (Commands executable in Jupyter / Terminal)

1) Python Package Installation (Recommended)

Run in a Jupyter cell or terminal:

# (Option 1) Install from official repository via pip
pip install -U openai-whisper

# (Option 2) Install directly from GitHub (latest source)
pip install -U git+https://github.com/openai/whisper.git

# Install PyTorch according to your system/CUDA. Example: CPU-only or CUDA version
# CPU version example:
pip install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
# (For CUDA, follow the official PyTorch installation guide)

Note: Whisper runs on top of PyTorch. If you need GPU (CUDA) acceleration, you must install a CUDA build of PyTorch.
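A quick way to confirm which device Whisper will end up on is to check PyTorch directly. This is a minimal sketch using only the standard library plus an optional torch install; the import is guarded so it also runs before PyTorch is installed:

```python
# Sanity check: is PyTorch installed, and is a CUDA GPU visible?
import importlib.util

if importlib.util.find_spec("torch") is not None:
    import torch
    print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
else:
    print("PyTorch not found - run the pip install command above first")
```

If CUDA shows as unavailable even though you have an NVIDIA GPU, you likely installed a CPU-only wheel and need the CUDA build from the official PyTorch guide.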



2) ffmpeg Installation (by Platform)

  • Windows (using Scoop; the method used in the original article):
# PowerShell (administrator privileges recommended)
Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
iwr -useb get.scoop.sh | iex
scoop install ffmpeg
  • macOS (Homebrew):
brew install ffmpeg
  • Ubuntu / Debian:
sudo apt update
sudo apt install -y ffmpeg

Restart Jupyter/terminal after installation.
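To confirm the install took effect, you can check that the ffmpeg binary is reachable on PATH (Whisper shells out to it when reading audio). A minimal check using only the standard library:

```python
# Check whether the ffmpeg binary Whisper needs is reachable on PATH.
import shutil

ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path:
    print("ffmpeg found at:", ffmpeg_path)
else:
    print("ffmpeg NOT found - restart the terminal or re-run the install step")
```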



## 🐍 Complete Python Code for Execution

Below is a complete example script that can be run directly in local/cloud environments. (Save as whisper_transcribe.py or paste directly into a Jupyter cell.)

# whisper_transcribe.py
# Required libraries: torch, whisper, time
# Installation (terminal): pip install -U openai-whisper torch

import time
import torch
import whisper

# ---------------------
# 1) GPU check and device setup
# ---------------------
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
device = "cuda" if torch.cuda.is_available() else "cpu"

# ---------------------
# 2) Load model
#    Model names: tiny, base, small, medium, large
#    (The original article used "base" as an example)
# ---------------------
model_name = "base"   # Change to "tiny" or "large" etc. as needed
print(f"Loading Whisper model '{model_name}' on {device} ...")
start_load = time.time()
model = whisper.load_model(model_name, device=device)
load_time = time.time() - start_load
print(f"Model loaded in {load_time:.1f}s")

# (Optional) Check model parameter count
def count_params(m):
    return sum(p.numel() for p in m.parameters())
try:
    total_params = count_params(model)
    print(f"Approx. model parameters: {total_params:,}")
except Exception as e:
    print("Could not count parameters:", e)






# ---------------------
# 3) Load audio and compute log-Mel spectrogram
# ---------------------
audio_path = "podcast.mp3"  # replace with the path to your audio file
print("Loading audio:", audio_path)
audio = whisper.load_audio(audio_path)       # load and resample to 16 kHz
audio = whisper.pad_or_trim(audio)           # pad/trim to exactly 30 seconds
mel = whisper.log_mel_spectrogram(audio).to(device)




# ---------------------
# 4) Detect language (on the 30-second snippet)
# ---------------------
print("Detecting language ...")
lang_probs = model.detect_language(mel)

# Depending on the Whisper version, detect_language may return (tokens, probs)
if isinstance(lang_probs, tuple) and len(lang_probs) >= 2:
    _, probs = lang_probs
    detected = max(probs, key=probs.get)
else:
    probs = lang_probs
    detected = max(probs, key=probs.get) if isinstance(probs, dict) else None
print("Detected language:", detected)




# ---------------------
# 5) Decode the 30-second snippet with DecodingOptions
# ---------------------
print("Decoding 30-second snippet ...")
options = whisper.DecodingOptions(fp16=(device == "cuda"))  # fp16 only on GPU
result_snippet = whisper.decode(model, mel, options)
print("Snippet transcription:")
print(result_snippet.text)





# ---------------------
# 6) Transcribe the entire file
# ---------------------
print("Transcribing full file ...")
t0 = time.time()
result_full = model.transcribe(audio_path)  # handles long audio in 30-second windows
elapsed = time.time() - t0
print(f"Full transcription finished in {elapsed:.1f}s")

full_text = result_full.get("text", "") if isinstance(result_full, dict) else str(result_full)
print("--- transcript preview ---")
print(full_text[:500])  # first 500 characters; adjust as needed
print("--------------------------")




# ---------------------
# 7) Save the transcript to a text file
# ---------------------
out_txt = "transcript.txt"  # output file name (change as needed)
with open(out_txt, "w", encoding="utf-8") as f:
    f.write(full_text)
print("Saved transcript to", out_txt)




# ---------------------
# 8) Summary
# ---------------------
print("=== Summary ===")
print(f"Model: {model_name} (device: {device})")
print(f"Model load time: {load_time:.1f}s")
print(f"Transcription time: {elapsed:.1f}s -> saved to {out_txt}")

The script above exactly follows the flow described in the original article (load audio → pad_or_trim → mel → detect_language → DecodingOptions + decode → model.transcribe → output/save).


## Test: Kehlani's "How It's Done"


  • Conversion is fast for short audio files.
  • Korean quality is significantly lower, so it's mainly usable for English.
  • The transcription isn't 100% accurate; it feels like roughly 60-70% accuracy.

Conclusion: It seems quite usable for getting a general understanding of English videos.


## Actual Performance (Original Article Test)
  • 1 hour 10 minute RealPython podcast transcription test (original experiment)

    • CPU (default): about 56 minutes
    • GPU (cloud, CUDA): about 4 minutes
  • Conclusion: Having a GPU makes it much faster (varies significantly by model size and hardware).
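For a rough sense of the gap, the numbers above imply about a 14x speedup:

```python
# Speedup implied by the original experiment (a ~70-minute podcast).
cpu_minutes = 56   # transcription time on CPU
gpu_minutes = 4    # transcription time on a cloud CUDA GPU
speedup = cpu_minutes / gpu_minutes
print(f"GPU was roughly {speedup:.0f}x faster in this test")
```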



## Practical Tips & Notes
  • Without ffmpeg, you won't be able to read various formats (.m4a, .ogg, etc.), so make sure to install it.
  • For large audio files (over 1 hour), there are memory/time constraints, so test with base or tiny first, then check accuracy with medium/large.
  • Using fp16=True with GPU makes it faster (but may cause stability issues in some environments).
  • There are many Whisper versions/variants (community forks), so API calling methods may differ slightly depending on the version.
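One way to act on the fp16 tip is to derive the flag from the device string, so the same script runs on both CPU and GPU. `pick_fp16` is a hypothetical helper written for this sketch, not part of the Whisper API:

```python
def pick_fp16(device: str) -> bool:
    """Enable half precision only on CUDA devices; fp16 on CPU is unstable or unsupported."""
    return device.startswith("cuda")

# Hypothetical usage (assuming the openai-whisper package is installed):
#   model = whisper.load_model("base", device=device)
#   result = model.transcribe("podcast.mp3", fp16=pick_fp16(device))
print(pick_fp16("cuda"), pick_fp16("cpu"))
```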