Blog article: Comparing Whisper Models for Transcribing Queen’s “Don’t Stop Me Now” #3074
hardrockhodl started this conversation in Show and tell
- Thanks for this. This is very helpful.
- To collect the necessary data and to make it easier to reuse, I wrote a Python script.

```python
#!/usr/bin/env python3
"""
Script Summary:
This Python script automates the transcription of MP4 video files using the Whisper.cpp tool.
It performs the following steps:
1. Model Selection:
Displays a list of available Whisper models (e.g., base, medium, large-v3) and prompts the user to select one by entering a number (1–6).
2. File Selection:
Lists the five most recent MP4 files in a specified Downloads directory, allowing the user to choose one for transcription.
3. File Conversion:
Converts the selected MP4 file to WAV format using ffmpeg.
4. Transcription:
Transcribes the WAV file using the chosen Whisper model, leveraging the Whisper.cpp command-line tool with specified parameters (e.g., 8 threads, auto language detection, VTT output).
The script assumes the presence of ffmpeg, Whisper.cpp, and pre-trained model files in specified paths.
"""
import os
import subprocess
from pathlib import Path
# Define paths for Downloads directory and Whisper CLI executable
DOWNLOADS = Path("/path/to/Downloads")
WHISPER_CLI = Path("/path/to/whisper.cpp/build/bin/whisper-cli")
# Dictionary mapping model choices to their names and file paths
MODELS = {
"1": ("base", Path("/path/to/whisper.cpp/models/ggml-base.bin")),
"2": ("medium", Path("/path/to/whisper.cpp/models/ggml-medium.bin")),
"3": ("medium-q8", Path("/path/to/whisper.cpp/models/ggml-medium-q8_0.bin")),
"4": ("large-v3", Path("/path/to/whisper.cpp/models/ggml-large-v3.bin")),
"5": ("large-v3-turbo", Path("/path/to/whisper.cpp/models/ggml-large-v3-turbo.bin")),
"6": ("large-v3-turbo-q8", Path("/path/to/whisper.cpp/models/ggml-large-v3-turbo-q8_0.bin")),
}
# Display available models for user selection
print("Choose model:")
for key, (name, _) in MODELS.items():
    print(f"{key}: {name}")
model_choice = input("Model [1 - 6]: ")
MODEL = MODELS[model_choice][1]
print()
# Retrieve and display the 5 most recent MP4 files
mp4_files = sorted(DOWNLOADS.glob("*.mp4"), key=os.path.getmtime, reverse=True)[:5]
print("Recent MP4 files:")
for idx, file in enumerate(mp4_files, 1):
    print(f"{idx}: {file.name}")
# Prompt user to select an MP4 file for transcription
choice = int(input("Choose what file to transcribe [1 - 5]: ")) - 1
selected_mp4 = mp4_files[choice]
wav_file = selected_mp4.with_suffix(".wav")
# Convert the selected MP4 file to 16 kHz mono 16-bit PCM WAV using ffmpeg
# (the input format whisper.cpp's examples expect)
subprocess.run([
    "ffmpeg", "-y", "-i", str(selected_mp4),
    "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le",
    str(wav_file),
], check=True)

# Transcribe the WAV file using Whisper.cpp with the chosen parameters
subprocess.run([
    str(WHISPER_CLI),
    "-t", "8",            # use 8 threads
    "-p", "1",            # single processor
    "-m", str(MODEL),     # selected model file
    "-f", str(wav_file),  # input WAV file
    "-l", "auto",         # auto-detect language
    "-ovtt",              # output in VTT format
], check=True)
```

Here is sample output from the terminal. (I made a shell alias `transcribe` for the script so it's easier to remember.)

```
<user>@<computer> [%]
[~] > transcribe
Choose model:
1: base
2: medium
3: medium-q8
4: large-v3
5: large-v3-turbo
6: large-v3-turbo-q8
Model [1 - 6]: 4
Recent MP4 files:
1: queen.mp4
2: video-test.mp4
3: test.mp4
4: nihilist.mp4
5: dubbel-ciggen.mp4
Choose what file to transcribe [1 - 5]: 1
```
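One caveat: the script trusts its prompts, so a typo at either input raises a KeyError or IndexError. A minimal re-prompting sketch, if you want to harden it (the `ask_choice` helper is my own, not part of the script above):

```python
def ask_choice(prompt: str, valid: list[str]) -> str:
    """Re-prompt until the user enters one of the valid options."""
    while True:
        answer = input(prompt).strip()
        if answer in valid:
            return answer
        print(f"Please enter one of: {', '.join(valid)}")

# Drop-in replacements for the two input() calls:
# model_choice = ask_choice("Model [1 - 6]: ", list(MODELS))
# choice = int(ask_choice("Choose what file to transcribe [1 - 5]: ",
#                         [str(i) for i in range(1, len(mp4_files) + 1)])) - 1
```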
- You need to redact your PII from it (the command path), otherwise: thank you!
- I just posted an article on Medium, but it's behind a paywall, so I thought I'd share it here. Hope you like it.
AI Transcription Showdown
Comparing Whisper Models for Transcribing Queen’s “Don’t Stop Me Now”
Which Model Captures the Energy of Freddie Mercury’s Classic Hit?
Transcribing song lyrics from audio is no easy feat, especially for a high-octane track like Queen's *Don't Stop Me Now*, with its rapid vocals, vibrant vocalizations (e.g., "ooh, ooh, ooh"), and iconic "La-da-da-da-dah" ending. In this article, we put six Whisper models (`base`, `medium`, `medium-q8`, `large-v3`, `large-v3-turbo`, and `large-v3-turbo-q8_0`) to the test to find out which delivers the most accurate transcription. Using the Whisper.cpp framework on an Apple M1 Pro, we evaluate each model based on accuracy, completeness, fidelity, and processing time, offering insights for anyone tackling music transcription with AI.
Methodology
We processed an MP4 file of *Don't Stop Me Now* (duration: 3:31) using each Whisper model, generating VTT files with the transcribed lyrics. These transcriptions were compared against the official lyrics, focusing on three key criteria: accuracy, completeness, and fidelity.
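The post doesn't show how the comparison was scored, but a rough version is easy to automate. Here is a minimal sketch (my own approach, not necessarily the author's) that strips the cue metadata from a VTT file and computes a crude similarity ratio against a reference lyrics file; the file names are hypothetical:

```python
import re
from difflib import SequenceMatcher
from pathlib import Path

def vtt_text(path: Path) -> str:
    """Extract spoken text from a VTT file, skipping the header and cue timings."""
    kept = []
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line == "WEBVTT" or "-->" in line or line.isdigit():
            continue
        kept.append(line)
    return " ".join(kept)

def normalize(text: str) -> str:
    """Lowercase and drop punctuation so styling differences don't count as errors."""
    return re.sub(r"[^a-z' ]+", " ", text.lower())

# Usage (hypothetical file names):
transcript = vtt_text(Path("queen.wav.vtt"))
reference = Path("official_lyrics.txt").read_text(encoding="utf-8")
score = SequenceMatcher(None, normalize(transcript), normalize(reference)).ratio()
print(f"Similarity: {score:.1%}")
```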
Processing time was measured in minutes and seconds (converted from milliseconds) to assess efficiency. Model details, including size and architecture, were extracted from command-line outputs to provide context for performance differences.
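The benchmarking loop itself isn't shown in the post either. A sketch like the following could drive and time all six runs; it reuses the `MODELS` and `WHISPER_CLI` definitions from the script above, and the WAV path is an assumption. Whisper.cpp also prints its own per-stage timings in milliseconds at the end of each run.

```python
import subprocess
import time
from pathlib import Path

WAV = Path("/path/to/queen.wav")  # assumed: already converted as in the script

def format_seconds(s: float) -> str:
    """Render elapsed seconds as M:SS, as used in the rankings below."""
    minutes, seconds = divmod(round(s), 60)
    return f"{minutes}:{seconds:02d}"

for name, model in MODELS.values():
    start = time.perf_counter()
    subprocess.run(
        [str(WHISPER_CLI), "-m", str(model), "-f", str(WAV), "-l", "auto", "-ovtt"],
        check=True,
    )
    print(f"{name}: {format_seconds(time.perf_counter() - start)}")
```

Note that each run writes its VTT next to the input WAV, so rename or copy the output between runs if you want to keep all six transcripts.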
Results
Each model’s transcription was scrutinized for errors, omissions, and stylistic accuracy. Below, we summarize their performance and present a detailed ranking table for easy comparison.
Detailed Evaluation
1. `large-v3`: the most accurate and faithful transcription of the set, at the cost of the longest run time (52 seconds).
2. `medium-q8`: near-top-tier accuracy in a fraction of the time (17 seconds).
3. `medium`: output comparable to `medium-q8` despite a larger size (1533.14 MB), but ranked below `medium-q8` due to more errors.
4. `large-v3-turbo` (un-quantized): faster than the full `large-v3`, but missed vocalizations and trailed `medium-q8` on accuracy.
5. `large-v3-turbo-q8_0`: very fast (14 seconds), with a further drop in quality.
6. `base`: the fastest (6 seconds) and least accurate, mishandling the ending entirely.

Ranking Table

| Rank | Model | Processing time | Notes |
| --- | --- | --- | --- |
| 1 | `large-v3` | 52 s | Best accuracy and fidelity |
| 2 | `medium-q8` | 17 s | Best speed/quality balance |
| 3 | `medium` | n/a | More errors than `medium-q8` despite larger size (1533.14 MB) |
| 4 | `large-v3-turbo` | n/a | Missing vocalizations, incorrect ending |
| 5 | `large-v3-turbo-q8_0` | 14 s | Fast, lower quality |
| 6 | `base` | 6 s | Quick drafts only |
Insights and Recommendations

Top Pick for Quality: `large-v3`
The `large-v3` model is the clear winner for accuracy and fidelity, making it ideal for professional audio analysis or archival work. Its 52-second processing time is a trade-off for its superior results.

Best All-Rounder: `medium-q8`
The `medium-q8` model strikes an excellent balance, delivering near-top-tier accuracy in just 17 seconds. It's perfect for most users, especially on systems with limited resources.

Speed Considerations
The `large-v3-turbo-q8_0` (14 seconds) and `base` (6 seconds) models are the fastest but compromise on quality. They're suitable for quick drafts or less complex audio.

Turbo Limitations
Both `large-v3-turbo` models (un-quantized and q8_0) prioritize speed with fewer text layers (4 vs. 32), but this leads to missing vocalizations and incorrect endings, making them less ideal for music transcription.

Challenges with the Ending
The "La-da-da-da-dah" ending is a hallmark of *Don't Stop Me Now*, yet most models struggled with it:
- `medium-q8` and `medium` approximated it as "Ah, da, da, da" and "La, la, la," respectively, coming closer than the rest.
- `large-v3-turbo` and `large-v3-turbo-q8_0` replaced it with "Thank you" or omitted it, possibly misinterpreting audio artifacts.
- `large-v3` skipped it entirely, focusing on earlier sections.
- `base` labeled it "(singing in foreign language)," a significant error.

This highlights a broader challenge in AI transcription: handling non-lyrical vocalizations and song endings, where models may misinterpret or oversimplify.
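A quick way to see how each model handled the ending is to print the last few cue blocks of every VTT side by side. A small sketch, assuming (hypothetically) the six transcripts were saved as `queen.<model>.vtt` in the current directory:

```python
from pathlib import Path

def last_cues(vtt: Path, n: int = 3) -> str:
    """Return the last n cue blocks (blank-line separated) of a VTT file."""
    blocks = vtt.read_text(encoding="utf-8").strip().split("\n\n")
    return "\n\n".join(blocks[-n:])

for vtt in sorted(Path(".").glob("queen.*.vtt")):
    print(f"== {vtt.name} ==")
    print(last_cues(vtt))
    print()
```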
Conclusion
Transcribing *Don't Stop Me Now* showcased the strengths and limitations of Whisper models. The `large-v3` model leads for precision, while `medium-q8` offers a practical compromise between speed and quality. The `large-v3-turbo` models, though fast, stumble on lyrical complexity, and the `base` model is best avoided for music. Future improvements in handling vocalizations and endings could make these models even more powerful for music transcription.

Whether you're a researcher, musician, or AI enthusiast, choose your model based on your needs:
- `large-v3` for top quality
- `medium-q8` for versatility
- `large-v3-turbo-q8_0` for speed

Try them out and let us know your results!