How to efficiently stream audio into whisper.cpp for real-time transcription? #3314

d3r3k-d4nk · 2025-07-09T21:30:21Z

I’m trying to use whisper.cpp for real-time transcription. What’s the best way to stream audio chunks while keeping context intact? Any tips for chunk size, overlap, or performance?
Appreciate any guidance!

officiallyutso · 2025-07-09T21:32:50Z

Yes, you can achieve this with a bit of logic on your side.

Here’s how:

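1. Split the full audio into manageable chunks (e.g., 30 s or 1 min each). 30 s matches the window the Whisper model processes at a time.

2. Run whisper_full() on each chunk in order. whisper_full_parallel() is also available, but it splits the audio you give it across several processors on its own, and accuracy can be worse at those split points.

3. For every chunk, append the transcribed text to a single buffer.

4. To maintain context and improve accuracy at chunk boundaries, you can optionally include a few seconds of overlap between chunks. The overlapped audio gets transcribed twice, so you need to drop the repeated words when joining the text.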

This way, the output looks like a continuous transcription, even though you processed it in parts.

Also, whisper.cpp itself doesn't stitch audio; it just processes what you feed it. So the stitching is up to your implementation.
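
For live audio, the same idea can be applied to a buffer that is filled as samples arrive. Below is a minimal sketch of such a buffer, assuming 16 kHz mono float samples. SlidingWindow and its methods are hypothetical names made up for illustration, not part of whisper.cpp. Each window it returns starts with the last keep_samples samples of the previous window, which is roughly what the step/length/keep options of the stream example in the whisper.cpp repo are for.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of whisper.cpp): buffers incoming samples and
// returns fixed-size windows, each beginning with the tail of the previous one.
class SlidingWindow {
public:
    SlidingWindow(size_t window_samples, size_t keep_samples)
        : window_(window_samples), keep_(keep_samples) {
        if (window_ == 0 || keep_ >= window_) {
            throw std::invalid_argument("keep_samples must be smaller than window_samples");
        }
    }

    // Append newly captured samples (e.g. from a microphone callback).
    void push(const float* data, size_t n) {
        buf_.insert(buf_.end(), data, data + n);
    }

    // True once at least one full window is buffered.
    bool ready() const { return buf_.size() >= window_; }

    // Return the next window. Afterwards the buffer still holds the last
    // keep_ samples of that window, so the following window overlaps it.
    // Returns an empty vector if no full window is available yet.
    std::vector<float> pop() {
        if (!ready()) {
            return {};
        }
        std::vector<float> out(buf_.begin(), buf_.begin() + window_);
        buf_.erase(buf_.begin(), buf_.begin() + (window_ - keep_));
        return out;
    }

private:
    size_t window_;
    size_t keep_;
    std::vector<float> buf_;
};
```

In a capture loop you would call push() whenever new audio arrives and, while ready() is true, pass the result of pop() to whisper_full(). Window and overlap sizes are a trade-off between latency and accuracy, so treat any specific values as a starting point to tune.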

Let me know if you want a code snippet to help you get started!

3 replies

d3r3k-d4nk Jul 9, 2025
Author

Okay, that seems reasonable. A code snippet would work as well.

officiallyutso Jul 9, 2025

Yeah, sure:

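#include "whisper.h"
#include "common-whisper.h" // from the whisper.cpp examples; declares read_audio_data()
                            // (older checkouts: read_wav() in examples/common.h)

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Split PCM samples into fixed-size chunks. Consecutive chunks share
// overlap_samples samples. Returns no chunks if the sizes are invalid.
std::vector<std::vector<float>> split_audio(const std::vector<float>& pcm, size_t chunk_samples, size_t overlap_samples = 0) {
    std::vector<std::vector<float>> chunks;
    if (chunk_samples == 0 || overlap_samples >= chunk_samples) {
        return chunks; // the step would be zero or negative and the loop would never end
    }
    const size_t step = chunk_samples - overlap_samples;

    for (size_t start = 0; start < pcm.size(); start += step) {
        const size_t end = std::min(start + chunk_samples, pcm.size());
        chunks.emplace_back(pcm.begin() + start, pcm.begin() + end);
        if (end == pcm.size()) {
            break; // this chunk reaches the end; avoid a tiny trailing chunk made only of overlap
        }
    }

    return chunks;
}

int main() {
    // Load model
    struct whisper_context* ctx = whisper_init_from_file_with_params(
        "models/ggml-base.en.bin", whisper_context_default_params());
    if (ctx == nullptr) {
        std::cerr << "Failed to load model\n";
        return 1;
    }

    // Load audio as 16 kHz mono float PCM
    std::vector<float> pcmf32;
    std::vector<std::vector<float>> pcmf32s; // per-channel data, unused for mono
    if (!read_audio_data("input.wav", pcmf32, pcmf32s, false)) {
        std::cerr << "Failed to read input.wav\n";
        whisper_free(ctx);
        return 1;
    }

    // Split into 30-second chunks (assuming 16kHz sample rate)
    const size_t chunk_samples = 30 * 16000;
    const size_t overlap_samples = 2 * 16000; // optional 2-second overlap
    auto chunks = split_audio(pcmf32, chunk_samples, overlap_samples);

    // Loop over chunks and transcribe
    whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    std::string full_transcript;
    for (size_t i = 0; i < chunks.size(); ++i) {
        if (whisper_full(ctx, params, chunks[i].data(), static_cast<int>(chunks[i].size())) != 0) {
            std::cerr << "[Chunk " << i + 1 << "/" << chunks.size() << "] Failed\n";
            continue;
        }

        // Get text result (text from the overlapped audio will appear twice)
        const int n_segments = whisper_full_n_segments(ctx);
        for (int j = 0; j < n_segments; ++j) {
            full_transcript += whisper_full_get_segment_text(ctx, j);
        }

        std::cout << "[Chunk " << i + 1 << "/" << chunks.size() << "] Done\n";
    }

    std::cout << "\nFull Transcript:\n" << full_transcript << "\n";

    whisper_free(ctx);
    return 0;
}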

Notes:

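whisper.h itself does not include a WAV loader. The audio-loading helper comes from the whisper.cpp examples: read_audio_data() in examples/common-whisper.h in recent versions, read_wav() in examples/common.h in older ones.
overlap_samples helps avoid cutting off words between chunks. The overlapped audio is transcribed twice, though, so expect repeated words at chunk boundaries unless you remove them when stitching.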
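
One way to handle the repeated text is to place every segment on a global timeline and skip segments that fall into audio the previous chunk already covered. Here is a rough sketch of that, under a few assumptions: the Segment struct and append_chunk() are hypothetical names, and the times are in units of 10 ms, which is how the whisper.cpp examples treat the values from whisper_full_get_segment_t0() and whisper_full_get_segment_t1(). It is a simple heuristic, not an exact alignment, and it can drop or keep a segment that straddles the boundary.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record for one transcribed segment; times are in 10 ms units.
struct Segment {
    int64_t t0;
    int64_t t1;
    std::string text;
};

// Shift chunk-local segments onto the global timeline and append them.
// A segment is skipped when it starts before the current end of the timeline,
// i.e. when it lies in audio that the previous chunk already transcribed.
void append_chunk(std::vector<Segment>& timeline,
                  const std::vector<Segment>& local,
                  int64_t chunk_start) {
    for (const Segment& s : local) {
        const int64_t g0 = chunk_start + s.t0;
        const int64_t g1 = chunk_start + s.t1;
        if (!timeline.empty() && g0 < timeline.back().t1) {
            continue; // already covered by an earlier chunk
        }
        timeline.push_back({g0, g1, s.text});
    }
}
```

With 30-second chunks and a 2-second overlap, chunk i starts 28 seconds after chunk i - 1, so chunk_start would be i * 2800 in these units.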

You could improve this further by aligning text with timestamps or handling silence detection.
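
For the silence part, a simple option is to move each cut to the quietest spot near where you would have cut anyway. The sketch below is plain energy-based detection and does not use anything from whisper.cpp. find_quiet_cut() and its parameters are hypothetical names for illustration, and a real voice activity detector would be more robust against background noise.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: look at frames of `frame` samples within `radius`
// samples of `target` and return the middle of the frame with the lowest
// energy (sum of squared samples). Falls back to `target`, clamped to the
// audio length, if no full frame fits in the search range.
size_t find_quiet_cut(const std::vector<float>& pcm, size_t target, size_t radius, size_t frame) {
    size_t best = std::min(target, pcm.size());
    if (pcm.empty() || frame == 0) {
        return best;
    }

    const size_t lo = target > radius ? target - radius : 0;
    const size_t hi = std::min(target + radius, pcm.size());
    double best_energy = std::numeric_limits<double>::max();

    for (size_t start = lo; start + frame <= hi; start += frame) {
        double energy = 0.0;
        for (size_t i = start; i < start + frame; ++i) {
            energy += static_cast<double>(pcm[i]) * pcm[i];
        }
        if (energy < best_energy) {
            best_energy = energy;
            best = start + frame / 2;
        }
    }
    return best;
}
```

For example, with 16 kHz audio you could search one second either side of the nominal 30-second mark using 20 ms frames: find_quiet_cut(pcm, 30 * 16000, 16000, 320).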

Answer selected by d3r3k-d4nk

d3r3k-d4nk Jul 9, 2025
Author

Okay, thanks
