Blog article: Comparing Whisper Models for Transcribing Queen’s “Don’t Stop Me Now” #3074
hardrockhodl started this conversation in Show and tell
- Thanks for this. This is very helpful.
- To collect the necessary data and to make it easier to reuse, I wrote a Python script.

```python
#!/usr/bin/env python3
"""
Script Summary:
This Python script automates the transcription of MP4 video files using the Whisper.cpp tool.
It performs the following steps:
1. Model Selection:
Displays a list of available Whisper models (e.g., base, medium, large-v3) and prompts the user to select one by entering a number (1–6).
2. File Selection:
Lists the five most recent MP4 files in a specified Downloads directory, allowing the user to choose one for transcription.
3. File Conversion:
Converts the selected MP4 file to WAV format using ffmpeg.
4. Transcription:
Transcribes the WAV file using the chosen Whisper model, leveraging the Whisper.cpp command-line tool with specified parameters (e.g., 8 threads, auto language detection, VTT output).
The script assumes the presence of ffmpeg, Whisper.cpp, and pre-trained model files in specified paths.
"""
import os
import subprocess
from pathlib import Path
# Define paths for Downloads directory and Whisper CLI executable
DOWNLOADS = Path("/path/to/Downloads")
WHISPER_CLI = Path("/path/to/whisper.cpp/build/bin/whisper-cli")
# Dictionary mapping model choices to their names and file paths
MODELS = {
"1": ("base", Path("/path/to/whisper.cpp/models/ggml-base.bin")),
"2": ("medium", Path("/path/to/whisper.cpp/models/ggml-medium.bin")),
"3": ("medium-q8", Path("/path/to/whisper.cpp/models/ggml-medium-q8_0.bin")),
"4": ("large-v3", Path("/path/to/whisper.cpp/models/ggml-large-v3.bin")),
"5": ("large-v3-turbo", Path("/path/to/whisper.cpp/models/ggml-large-v3-turbo.bin")),
"6": ("large-v3-turbo-q8", Path("/path/to/whisper.cpp/models/ggml-large-v3-turbo-q8_0.bin")),
}
# Display available models for user selection
print("Choose model:")
for key, (name, _) in MODELS.items():
    print(f"{key}: {name}")
model_choice = input("Model [1 - 6]: ")
MODEL = MODELS[model_choice][1]
print()
# Retrieve and display the 5 most recent MP4 files
mp4_files = sorted(DOWNLOADS.glob("*.mp4"), key=os.path.getmtime, reverse=True)[:5]
print("Recent MP4 files:")
for idx, file in enumerate(mp4_files, 1):
    print(f"{idx}: {file.name}")
# Prompt user to select an MP4 file for transcription
choice = int(input("Choose what file to transcribe [1 - 5]: ")) - 1
selected_mp4 = mp4_files[choice]
wav_file = selected_mp4.with_suffix(".wav")
# Convert the selected MP4 file to 16 kHz mono 16-bit PCM WAV using ffmpeg
# (the input format whisper.cpp's examples expect)
subprocess.run([
    "ffmpeg", "-y", "-i", str(selected_mp4),
    "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le",
    str(wav_file),
], check=True)

# Transcribe the WAV file using Whisper.cpp with the chosen parameters
subprocess.run([
    str(WHISPER_CLI),
    "-t", "8",            # use 8 threads
    "-p", "1",            # single processor
    "-m", str(MODEL),     # selected model file
    "-f", str(wav_file),  # input WAV file
    "-l", "auto",         # auto-detect language
    "-ovtt",              # output in VTT format
], check=True)
```

Here is sample output from the terminal. (I made a shell alias `transcribe` for the script so it's easier to remember.)

```
<user>@<computer> [%]
[~] > transcribe
Choose model:
1: base
2: medium
3: medium-q8
4: large-v3
5: large-v3-turbo
6: large-v3-turbo-q8
Model [1 - 6]: 4
Recent MP4 files:
1: queen.mp4
2: video-test.mp4
3: test.mp4
4: nihilist.mp4
5: dubbel-ciggen.mp4
Choose what file to transcribe [1 - 5]: 1
```
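One caveat: the script trusts its prompts, so a typo at either input raises a KeyError or IndexError. A minimal re-prompting sketch, if you want to harden it (the `ask_choice` helper is my own, not part of the script above):

```python
def ask_choice(prompt: str, valid: list[str]) -> str:
    """Re-prompt until the user enters one of the valid options."""
    while True:
        answer = input(prompt).strip()
        if answer in valid:
            return answer
        print(f"Please enter one of: {', '.join(valid)}")

# Drop-in replacements for the two input() calls:
# model_choice = ask_choice("Model [1 - 6]: ", list(MODELS))
# choice = int(ask_choice("Choose what file to transcribe [1 - 5]: ",
#                         [str(i) for i in range(1, len(mp4_files) + 1)])) - 1
```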
- You need to redact your PII from it (the command path), otherwise: thank you!
- I just posted an article on Medium, but it's behind a paywall, so I thought I'd share it here. Hope you like it.
AI Transcription Showdown
Comparing Whisper Models for Transcribing Queen’s “Don’t Stop Me Now”
Which Model Captures the Energy of Freddie Mercury’s Classic Hit?
Transcribing song lyrics from audio is no easy feat, especially for a high-octane track like Queen's *Don't Stop Me Now*, with its rapid vocals, vibrant vocalizations (e.g., "ooh, ooh, ooh"), and iconic "La-da-da-da-dah" ending. In this article, we put six Whisper models (`base`, `medium`, `medium-q8`, `large-v3`, `large-v3-turbo`, and `large-v3-turbo-q8_0`) to the test to find out which delivers the most accurate transcription. Using the Whisper.cpp framework on an Apple M1 Pro, we evaluate each model based on accuracy, completeness, fidelity, and processing time, offering insights for anyone tackling music transcription with AI.
Methodology
We processed an MP4 file of *Don't Stop Me Now* (duration: 3:31) using each Whisper model, generating VTT files with the transcribed lyrics. These transcriptions were compared against the official lyrics, focusing on three key criteria: accuracy, completeness, and fidelity.
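The post doesn't show how the comparison was scored, but a rough version is easy to automate. Here is a minimal sketch (my own approach, not necessarily the author's) that strips the cue metadata from a VTT file and computes a crude similarity ratio against a reference lyrics file; the file names are hypothetical:

```python
import re
from difflib import SequenceMatcher
from pathlib import Path

def vtt_text(path: Path) -> str:
    """Extract spoken text from a VTT file, skipping the header and cue timings."""
    kept = []
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line == "WEBVTT" or "-->" in line or line.isdigit():
            continue
        kept.append(line)
    return " ".join(kept)

def normalize(text: str) -> str:
    """Lowercase and drop punctuation so styling differences don't count as errors."""
    return re.sub(r"[^a-z' ]+", " ", text.lower())

# Usage (hypothetical file names):
transcript = vtt_text(Path("queen.wav.vtt"))
reference = Path("official_lyrics.txt").read_text(encoding="utf-8")
score = SequenceMatcher(None, normalize(transcript), normalize(reference)).ratio()
print(f"Similarity: {score:.1%}")
```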
Processing time was measured in minutes and seconds (converted from milliseconds) to assess efficiency. Model details, including size and architecture, were extracted from command-line outputs to provide context for performance differences.
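The benchmarking loop itself isn't shown in the post either. A sketch like the following could drive and time all six runs; it reuses the `MODELS` and `WHISPER_CLI` definitions from the script above, and the WAV path is an assumption. Whisper.cpp also prints its own per-stage timings in milliseconds at the end of each run.

```python
import subprocess
import time
from pathlib import Path

WAV = Path("/path/to/queen.wav")  # assumed: already converted as in the script

def format_seconds(s: float) -> str:
    """Render elapsed seconds as M:SS, as used in the rankings below."""
    minutes, seconds = divmod(round(s), 60)
    return f"{minutes}:{seconds:02d}"

for name, model in MODELS.values():
    start = time.perf_counter()
    subprocess.run(
        [str(WHISPER_CLI), "-m", str(model), "-f", str(WAV), "-l", "auto", "-ovtt"],
        check=True,
    )
    print(f"{name}: {format_seconds(time.perf_counter() - start)}")
```

Note that each run writes its VTT next to the input WAV, so rename or copy the output between runs if you want to keep all six transcripts.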
Results
Each model’s transcription was scrutinized for errors, omissions, and stylistic accuracy. Below, we summarize their performance and present a detailed ranking table for easy comparison.
Detailed Evaluation
1. `large-v3`: the most accurate and faithful transcription of the set, at the cost of the longest run time (52 seconds).
2. `medium-q8`: near-top-tier accuracy in a fraction of the time (17 seconds).
3. `medium`: output comparable to `medium-q8` despite a larger size (1533.14 MB), but ranked below `medium-q8` due to more errors.
4. `large-v3-turbo` (un-quantized): faster than the full `large-v3`, but missed vocalizations and trailed `medium-q8` on accuracy.
5. `large-v3-turbo-q8_0`: very fast (14 seconds), with a further drop in quality.
6. `base`: the fastest (6 seconds) and least accurate, mishandling the ending entirely.

Ranking Table

| Rank | Model | Processing time | Notes |
| --- | --- | --- | --- |
| 1 | `large-v3` | 52 s | Best accuracy and fidelity |
| 2 | `medium-q8` | 17 s | Best speed/quality balance |
| 3 | `medium` | n/a | More errors than `medium-q8` despite larger size (1533.14 MB) |
| 4 | `large-v3-turbo` | n/a | Missing vocalizations, incorrect ending |
| 5 | `large-v3-turbo-q8_0` | 14 s | Fast, lower quality |
| 6 | `base` | 6 s | Quick drafts only |
Insights and Recommendations

Top Pick for Quality: `large-v3`
The `large-v3` model is the clear winner for accuracy and fidelity, making it ideal for professional audio analysis or archival work. Its 52-second processing time is a trade-off for its superior results.

Best All-Rounder: `medium-q8`
The `medium-q8` model strikes an excellent balance, delivering near-top-tier accuracy in just 17 seconds. It's perfect for most users, especially on systems with limited resources.

Speed Considerations
The `large-v3-turbo-q8_0` (14 seconds) and `base` (6 seconds) models are the fastest but compromise on quality. They're suitable for quick drafts or less complex audio.

Turbo Limitations
Both `large-v3-turbo` models (un-quantized and q8_0) prioritize speed with fewer text layers (4 vs. 32), but this leads to missing vocalizations and incorrect endings, making them less ideal for music transcription.

Challenges with the Ending
The "La-da-da-da-dah" ending is a hallmark of *Don't Stop Me Now*, yet most models struggled with it:
- `medium-q8` and `medium` approximated it as "Ah, da, da, da" and "La, la, la," respectively, coming closer than the rest.
- `large-v3-turbo` and `large-v3-turbo-q8_0` replaced it with "Thank you" or omitted it, possibly misinterpreting audio artifacts.
- `large-v3` skipped it entirely, focusing on earlier sections.
- `base` labeled it "(singing in foreign language)," a significant error.

This highlights a broader challenge in AI transcription: handling non-lyrical vocalizations and song endings, where models may misinterpret or oversimplify.
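A quick way to see how each model handled the ending is to print the last few cue blocks of every VTT side by side. A small sketch, assuming (hypothetically) the six transcripts were saved as `queen.<model>.vtt` in the current directory:

```python
from pathlib import Path

def last_cues(vtt: Path, n: int = 3) -> str:
    """Return the last n cue blocks (blank-line separated) of a VTT file."""
    blocks = vtt.read_text(encoding="utf-8").strip().split("\n\n")
    return "\n\n".join(blocks[-n:])

for vtt in sorted(Path(".").glob("queen.*.vtt")):
    print(f"== {vtt.name} ==")
    print(last_cues(vtt))
    print()
```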
Conclusion
Transcribing *Don't Stop Me Now* showcased the strengths and limitations of Whisper models. The `large-v3` model leads for precision, while `medium-q8` offers a practical compromise between speed and quality. The `large-v3-turbo` models, though fast, stumble on lyrical complexity, and the `base` model is best avoided for music. Future improvements in handling vocalizations and endings could make these models even more powerful for music transcription.

Whether you're a researcher, musician, or AI enthusiast, choose your model based on your needs:
- `large-v3` for top quality
- `medium-q8` for versatility
- `large-v3-turbo-q8_0` for speed

Try them out and let us know your results!