# Whisper Transcription + Summarization + Diarization API
This blueprint provides a complete solution for running **audio/video transcription**, **speaker diarization**, and **summarization** via a RESTful API. It integrates [Faster-Whisper](https://github.com/guillaumekln/faster-whisper) for efficient transcription, [pyannote.audio](https://github.com/pyannote/pyannote-audio) for diarization, and Hugging Face instruction-tuned LLMs (e.g., Mistral-7B) for summarization. It supports multi-GPU acceleration, real-time streaming logs, and JSON/text output formats.
The overall architecture consists of several key stages. First, audio is converted using ffmpeg and optionally denoised using a hybrid method combining Demucs (for structured background removal) with either noisereduce or DeepFilterNet (for static noise). Next, silence-aware chunking is applied using pydub to segment speech cleanly without breaking mid-sentence. The Whisper model then transcribes each chunk, optionally followed by speaker diarization using pyannote.audio. Finally, if summarization is enabled, an instruction-tuned LLM such as Mistral-7B generates concise, structured summaries. Outputs are written to `.txt`, `.log`, and `.json` files, optionally embedded with speaker turns and summaries.
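To make the silence-aware chunking step concrete, here is a minimal, library-free sketch of the idea. The real pipeline uses pydub's silence detection on actual audio; this illustration operates on a plain list of per-frame amplitude values, and all names and thresholds are hypothetical.

```python
# Simplified illustration of silence-aware chunking: cut only inside
# sufficiently long silent runs, so speech is never split mid-sentence.
def split_on_silence(frames, silence_thresh=0.05, min_silence_len=3):
    """Split `frames` into chunks at runs of at least `min_silence_len`
    consecutive frames whose amplitude falls below `silence_thresh`."""
    chunks, current, silent_run = [], [], 0
    for amp in frames:
        if amp < silence_thresh:
            silent_run += 1
        else:
            # A long-enough silent run ends the current chunk before
            # the new speech frame begins.
            if silent_run >= min_silence_len and current:
                chunks.append(current)
                current = []
            silent_run = 0
        current.append(amp)
    if current:
        chunks.append(current)
    return chunks

# Two bursts of speech separated by a clear pause yield two chunks.
frames = [0.8, 0.7, 0.9, 0.01, 0.0, 0.02, 0.01, 0.6, 0.75]
print(len(split_on_silence(frames)))  # → 2
```

In the actual pipeline, each resulting chunk would be handed to Whisper independently, which keeps per-chunk latency bounded and allows chunks to be transcribed in parallel across GPUs.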
| Parameter | Type | Description |
|---|---|---|
|`audio_url`|`string`| URL to a Pre-Authenticated Request (PAR) of the audio file stored in OCI Object Storage. |
|`model`|`string`| Whisper model name to use (`base`, `medium`, `turbo`, etc.). |
|`summary`|`bool`| (Optional) Whether to generate a summary at the end. If `true` and no custom model path is provided, `mistralai/Mistral-7B-Instruct-v0.1` is loaded from Hugging Face. Requires `hf_token`. |
|`speaker`|`bool`| (Optional) Whether to enable speaker diarization. Requires `hf_token`. If `false`, all segments are labeled "Speaker 1". |
|`max_speakers`|`int`| (Optional) Expected number of speakers; helps improve diarization accuracy. |
|`denoise`|`bool`| (Optional) Apply basic denoising to improve quality in noisy recordings. |
|`streaming`|`bool`| (Optional) Enable real-time log streaming of transcription chunks and progress updates. |
|`hf_token`|`string`| (Optional) Hugging Face token, required for loading models like Mistral or enabling speaker diarization. |
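The parameters above can be combined into a single JSON request body. The sketch below builds such a request with only the standard library; the endpoint path (`/transcribe`), host/port, and placeholder URL/token are assumptions — adjust them to your deployment.

```python
import json
import urllib.request

# Request body using the parameters documented in the table above.
payload = {
    "audio_url": "https://objectstorage.example.com/p/<PAR>/audio.mp3",  # placeholder PAR URL
    "model": "medium",
    "summary": True,
    "speaker": True,
    "max_speakers": 2,
    "denoise": False,
    "streaming": False,
    "hf_token": "hf_xxx",  # placeholder Hugging Face token
}

req = urllib.request.Request(
    "http://localhost:8000/transcribe",  # assumed host, port, and route
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# response = urllib.request.urlopen(req)  # uncomment against a running server
```

Since `summary` and `speaker` are both `true`, this request would require a valid `hf_token`; omitting the optional fields falls back to transcription only.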