Commit 2bf88e9 (parent 5b1df6b): Update README.md
docs/whisper_transcription/README.md (69 additions, 109 deletions)
# Whisper Transcription API

### Transcription + Summarization + Diarization Pipeline (FastAPI-powered)

This blueprint provides a complete solution for running **audio/video transcription**, **speaker diarization**, and **summarization** via a RESTful API. It integrates [Faster-Whisper](https://github.com/guillaumekln/faster-whisper) for efficient transcription, [pyannote.audio](https://github.com/pyannote/pyannote-audio) for diarization, and Hugging Face instruction-tuned LLMs (e.g., Mistral-7B) for summarization. It supports multi-GPU acceleration, real-time streaming logs, and JSON/text output formats.

---

## Key Features

| Capability           | Description                                                                                     |
|----------------------|-------------------------------------------------------------------------------------------------|
| Transcription        | Fast, multi-GPU inference with Faster-Whisper                                                   |
| Summarization        | Uses Mistral-7B (or other HF models) to create summaries of long transcripts                    |
| Speaker Diarization  | Global speaker labeling via pyannote.audio                                                      |
| Denoising            | Hybrid removal of background noise using Demucs and noisereduce                                 |
| Real-Time Streaming  | Logs stream live via HTTP if enabled                                                            |
| Format Compatibility | Supports `.mp3`, `.wav`, `.flac`, `.aac`, `.m4a`, `.mp4`, `.webm`, `.mov`, `.mkv`, `.avi`, etc. |

---

## Deployment on OCI Blueprint

In the deployment section of Blueprint, add a recipe such as the following.

### Sample Recipe (Service Mode)

```json
{
  "recipe_id": "whisper_transcription",
  "recipe_mode": "service",
  "deployment_name": "whisper-transcription-a10",
  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:whisper_transcription_v8",
  "recipe_node_shape": "VM.GPU.A10.2",
  "recipe_replica_count": 1,
  "recipe_container_port": "8000",
  "recipe_nvidia_gpu_count": 2,
  "recipe_node_pool_size": 1,
  "recipe_node_boot_volume_size_in_gbs": 200,
  "recipe_ephemeral_storage_size": 100,
  "recipe_shared_memory_volume_size_limit_in_mb": 200
}
```
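Before submitting, a quick sanity check of the recipe can catch mismatches (for example, `VM.GPU.A10.2` provides 2 GPUs, so `recipe_nvidia_gpu_count` should agree). A minimal sketch; the shape-to-GPU mapping and helper name here are illustrative assumptions, not an official OCI catalog:

```python
# Sanity-check a Blueprint recipe before submission.
# SHAPE_GPUS is an illustrative assumption, not an official OCI shape list.
import json

SHAPE_GPUS = {"VM.GPU.A10.1": 1, "VM.GPU.A10.2": 2}

def check_recipe(recipe: dict) -> list:
    """Return a list of human-readable problems found in the recipe."""
    problems = []
    if recipe.get("recipe_mode") != "service":
        problems.append("recipe_mode should be 'service' for this deployment")
    shape = recipe.get("recipe_node_shape", "")
    expected = SHAPE_GPUS.get(shape)
    if expected is not None and recipe.get("recipe_nvidia_gpu_count") != expected:
        problems.append(
            f"{shape} provides {expected} GPUs, but the recipe requests "
            f"{recipe.get('recipe_nvidia_gpu_count')}"
        )
    return problems

recipe = json.loads("""
{
  "recipe_id": "whisper_transcription",
  "recipe_mode": "service",
  "recipe_node_shape": "VM.GPU.A10.2",
  "recipe_nvidia_gpu_count": 2
}
""")
print(check_recipe(recipe))  # → [] (no obvious problems)
```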
### Endpoint

```
POST https://<YOUR_DEPLOYMENT>.nip.io/transcribe
```

**Example:**
`https://whisper-transcription-a10-6666.130-162-199-33.nip.io/transcribe`

---
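Beyond `curl`, the endpoint can be called from any HTTP client that supports multipart form data. A sketch that assembles the form fields in Python (the helper name is ours; the field names match the API parameters documented in this README, and the actual network call is shown only as a comment since it needs a live deployment):

```python
# Assemble the multipart form fields for a POST to /transcribe.
# Boolean values are lowered to "true"/"false" strings, matching the
# curl -F examples in this README.
def build_transcribe_form(audio_url, model="turbo", summary=False,
                          speaker=False, streaming=False, denoise=False,
                          hf_token=None, max_speakers=None):
    form = {
        "audio_url": audio_url,
        "model": model,
        "summary": str(summary).lower(),
        "speaker": str(speaker).lower(),
        "streaming": str(streaming).lower(),
        "denoise": str(denoise).lower(),
    }
    if hf_token:
        form["hf_token"] = hf_token
    if max_speakers is not None:
        form["max_speakers"] = str(max_speakers)
    return form

form = build_transcribe_form("https://<YOUR_PAR_URL>", model="turbo",
                             summary=True, speaker=True, max_speakers=2,
                             hf_token="hf_xxxxxxx")
# With a live deployment this could be sent via requests, e.g.:
#   import requests
#   r = requests.post("https://<YOUR_DEPLOYMENT>.nip.io/transcribe",
#                     files={k: (None, v) for k, v in form.items()},
#                     verify=False, stream=True)
print(form["model"], form["speaker"])  # → turbo true
```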

## API Parameters

| Name               | Type   | Description                                                                                         |
|--------------------|--------|-----------------------------------------------------------------------------------------------------|
| `audio_url`        | string | URL to the audio file in OCI Object Storage (requires a Pre-Authenticated Request, PAR)             |
| `model`            | string | Whisper model to use: `base`, `medium`, `large`, `turbo`, etc.                                      |
| `summary`          | bool   | Whether to generate a summary (default: `false`). Requires `hf_token` if no model path is provided  |
| `speaker`          | bool   | Whether to run diarization (default: `false`). Requires `hf_token`; if `false`, all segments are labeled "Speaker 1" |
| `max_speakers`     | int    | (Optional) Maximum number of speakers expected, which improves diarization accuracy                 |
| `denoise`          | bool   | (Optional) Whether to apply noise reduction                                                         |
| `streaming`        | bool   | (Optional) Enables real-time logs via the `/stream_log` endpoint                                    |
| `hf_token`         | string | Hugging Face access token (required for diarization or HF-hosted summarizers)                       |
| `prop_decrease`    | float  | (Optional) Level of noise suppression, range 0.0–1.0 (default: 0.7)                                 |
| `summarized_model` | string | (Optional) Path or HF model ID of the summarizer. Default: `mistralai/Mistral-7B-Instruct-v0.1`     |
| `ground_truth`     | string | (Optional) Path to a reference transcript (`.txt`) file for WER (Word Error Rate) evaluation        |

---
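When `ground_truth` is supplied, the pipeline reports WER via `jiwer`. Conceptually, WER is the word-level edit distance between reference and hypothesis divided by the reference word count. A self-contained illustration of that definition (not the pipeline's own code):

```python
# Illustrative word error rate: word-level Levenshtein distance divided
# by the number of reference words. This mirrors what jiwer.wer computes
# for simple inputs; it is not the pipeline's implementation.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the quick brown fox", "the quik brown fox"))  # → 0.25
```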

## Example cURL Command

```bash
curl -k -N -L -X POST https://<YOUR_DEPLOYMENT>.nip.io/transcribe \
  -F "audio_url=<YOUR_PAR_URL>" \
  -F "model=turbo" \
  -F "summary=true" \
  -F "speaker=true" \
  -F "streaming=true" \
  -F "denoise=false" \
  -F "hf_token=hf_xxxxxxx" \
  -F "max_speakers=2"
```

---

## Output Files

Each processed audio file generates the following:

- `*.txt` – Human-readable transcript with speaker turns and timestamps
- `*.json` – Full structured metadata: transcript, summary, diarization
- `*.log` – Detailed processing log (useful for debugging or auditing)

---
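This README does not pin down the `*.json` schema. Assuming segments carry speaker labels and timestamps, a consumer could flatten them into readable turns; note that the keys `segments`, `speaker`, `start`, and `text` below are illustrative assumptions, so inspect a real output file for the actual structure:

```python
# Flatten a hypothetical result JSON into readable speaker turns.
# The keys used here ("segments", "speaker", "start", "text") are
# assumptions for illustration; check a real *.json output for the
# actual schema.
def format_turns(result: dict) -> str:
    lines = []
    for seg in result.get("segments", []):
        lines.append(f"[{seg['start']:.1f}s] {seg['speaker']}: {seg['text']}")
    return "\n".join(lines)

sample = {
    "segments": [
        {"speaker": "Speaker 1", "start": 0.0, "text": "Hello there."},
        {"speaker": "Speaker 2", "start": 3.2, "text": "Hi!"},
    ],
    "summary": "A short greeting.",
}
print(format_turns(sample))
# → [0.0s] Speaker 1: Hello there.
#   [3.2s] Speaker 2: Hi!
```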

## Streaming Logs

If `streaming=true`, the response will contain a log filename:

```json
{
  "meta": "logfile_name",
  "logfile": "transcription_log_remote_audio_<timestamp>.log"
}
```

To stream logs in real time (for example, from another terminal):

```bash
curl -N https://<YOUR_DEPLOYMENT>.nip.io/stream_log/<log_filename>
```

This shows chunk-wise transcription output live, followed by the summary at the end. If `streaming=false`, the API instead returns the entire transcription (and summary, if requested) in a single JSON response once processing completes.

---
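A client can pull the log filename out of that first JSON message and build the `/stream_log` URL itself. A small sketch (the helper name is ours; the base URL is whatever your deployment resolves to):

```python
import json

# Given the initial JSON message from /transcribe (when streaming=true),
# build the URL for following the log via the /stream_log endpoint.
def stream_log_url(base_url: str, first_message: str) -> str:
    payload = json.loads(first_message)
    return f"{base_url.rstrip('/')}/stream_log/{payload['logfile']}"

msg = '{"meta": "logfile_name", "logfile": "transcription_log_remote_audio_<timestamp>.log"}'
url = stream_log_url("https://<YOUR_DEPLOYMENT>.nip.io", msg)
print(url)
# The log itself would then be followed with, e.g.:
#   curl -N <url>
```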

## Hugging Face Access

To enable diarization, accept the model terms at:
https://huggingface.co/pyannote/segmentation

Then generate a token at:
https://huggingface.co/settings/tokens

---

## Dependencies

| Package              | Purpose                        |
|----------------------|--------------------------------|
| `faster-whisper`     | Core transcription engine      |
| `transformers`       | Summarization via Hugging Face |
| `pyannote.audio`     | Speaker diarization            |
| `pydub`, `librosa`   | Audio chunking and processing  |
| `noisereduce`        | Static noise reduction         |
| `demucs`             | Vocal separation / denoising   |
| `fastapi`, `uvicorn` | REST API server                |
| `jiwer`              | WER evaluation                 |

---

## Final Notes

- The Whisper model is cached per variant on the GPU for faster repeated runs.
- Diarization runs globally over the entire audio, not chunk-by-chunk.
- Denoising (Demucs vocal isolation) is optional and GPU-intensive, but improves quality on noisy files.
