Releases: QuentinFuxa/WhisperLiveKit
0.2.8
Dependency and Compatibility Changes
- Removed Triton <3 requirement
- Tested compatibility with Python 3.14 and 3.15
Performance Improvements
- Simulstreaming backend now defaults to MLX-Whisper (if available) or Faster-Whisper (if available) encoders, paired with Whisper cross-attention and decoder using an AlignAtt policy, for increased speed. Can be disabled using `--disable-fast-encoder`.
- Encoders are loaded once and shared in Simulstreaming, reducing vRAM usage
- Only the decoder of Whisper is loaded when using a different encoder, reducing vRAM usage
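The encoder preference described above (MLX-Whisper first, then Faster-Whisper, otherwise the stock Whisper encoder) can be sketched as a simple availability check. This is an illustrative sketch only, not the project's code: the function names `is_installed` and `pick_encoder` and the returned labels are assumptions, while `mlx_whisper` and `faster_whisper` are the import names of the two optional packages.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


def pick_encoder(disable_fast_encoder: bool = False, installed=is_installed) -> str:
    """Pick an encoder label following the preference order in the release note.

    The labels returned here are illustrative, not identifiers from the project.
    """
    if disable_fast_encoder:
        # Mirrors the effect of --disable-fast-encoder: keep the stock encoder.
        return "whisper"
    # Preference order: MLX-Whisper first, then Faster-Whisper.
    for module_name, label in (
        ("mlx_whisper", "mlx-whisper"),
        ("faster_whisper", "faster-whisper"),
    ):
        if installed(module_name):
            return label
    # Neither optional package is present: fall back to the stock encoder.
    return "whisper"
```

Passing the availability check as a parameter keeps the selection logic easy to exercise without installing either package.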
Frontend Enhancements
- Added a microphone picker
- Loads the UI as a single inline HTML file (instead of separate CSS, JS, SVGs and HTML files) for simplified deployment
Bug Fixes and Improvements
- Resolved warmup error when no connection is provided or when the language is set to auto
- Added pip timeout and retries in Dockerfile when installing Torch/TorchVision/TorchAudio
- Fixed issue where an exception is raised when language is set to 'auto' and task is set to 'translation'
- Enabled auto-detection of language for warmup if not specified
0.2.7
0.2.7: Diarization Improvements
- New default backend: Sortformer is now the default diarization backend, replacing Diart
- 6x faster processing: Reduced latency from ~2s to ~0.3s on CPU
- Significantly improved speaker detection (Constraint: currently supports up to 4 speakers)
- Shared model loading: A single Sortformer model (`SortformerDiarization`) is now shared across users and instances to reduce memory footprint. Speaker caches, frames, etc. are handled per user in `SortformerDiarizationOnline`.
- Enhanced alignment: Improved time and token synchronization between transcription and diarization results
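The split between one shared model and per-user state can be pictured with the minimal sketch below. The classes `SharedModel` and `UserSession` are hypothetical stand-ins and do not reproduce the actual `SortformerDiarization` or `SortformerDiarizationOnline` interfaces; the point is only that the expensive object is built once while caches stay per user.

```python
from dataclasses import dataclass, field


class SharedModel:
    """Hypothetical stand-in for an expensive model that is loaded only once."""

    load_count = 0  # counts how many times the "model" was constructed

    def __init__(self):
        SharedModel.load_count += 1

    def infer(self, chunk: bytes) -> int:
        # Placeholder inference: pretend every chunk belongs to speaker 0.
        return 0


@dataclass
class UserSession:
    """Per-user state: a reference to the shared model plus a private frame cache."""

    model: SharedModel
    frames: list = field(default_factory=list)

    def push(self, chunk: bytes) -> int:
        self.frames.append(chunk)  # state stays local to this user
        return self.model.infer(chunk)


model = SharedModel()        # loaded once
alice = UserSession(model)   # both sessions reuse the same model object
bob = UserSession(model)
alice.push(b"\x00\x01")
```

After this runs, `SharedModel.load_count` is 1 and only Alice's `frames` list contains data.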
0.2.6
- Voice Activity Control (VAC) by Default: VAC is now enabled by default to improve transcription accuracy by filtering out non-speech segments before transcription and diarization. You can disable it with the `--no-vac` flag.
- Simulstreaming Backend Enhancements:
  - The `simulstreaming` backend is now the default transcription backend.
  - Improved timestamp accuracy for audio segments longer than 30 seconds.
  - Backend models are now recycled to optimize resource usage, by removing Whisper hooks at the end of a transcription.
  - Added the ability to preload multiple backend models using the `--preloaded_model_count` argument, for when several users are expected.
- Diarization with Silences: The `diart` diarization backend now correctly handles pauses and silences, improving speaker turn detection.
- Time Handling: Aligned time handling between the backend and the frontend for better synchronization.
- WebSocket Communication: Buffering is disabled during silent periods.
- Default Model: The default model is now `base`.
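The idea of preloading several backend models and recycling them between transcriptions can be sketched as a small pool. This is an assumption-laden illustration, not the project's implementation: `ModelPool`, `load_model`, and `borrow` are hypothetical names, and only the `preloaded_model_count` parameter name is taken from the flag mentioned above.

```python
import queue
from contextlib import contextmanager


class ModelPool:
    """Hypothetical pool that preloads models and recycles them after use."""

    def __init__(self, load_model, preloaded_model_count: int = 1):
        self._load_model = load_model
        self._idle = queue.Queue()
        # Pay the loading cost up front, before any user connects.
        for _ in range(preloaded_model_count):
            self._idle.put(load_model())

    @contextmanager
    def borrow(self):
        try:
            model = self._idle.get_nowait()
        except queue.Empty:
            # Pool exhausted: load an extra model on demand.
            model = self._load_model()
        try:
            yield model
        finally:
            # Return the model to the pool so the next session can reuse it.
            self._idle.put(model)
```

A session would wrap its work in `with pool.borrow() as model:` so that the model goes back to the pool even if the transcription raises.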
0.2.5
Build & Dependencies
- Migrated to `pyproject.toml` - Replaced `setup.py` with PEP-recommended packaging bda72b8
- Removed NumPy version constraint - No longer restricted to `numpy < 2.0.0` 197293e
Backend Architecture
- Refactored SimulStreaming backend separation - Improved architecture to allow multiple users to share the same backend Whisper model instance d098af3 197293e
- Enhanced performance monitoring - Lag metrics now update every 0.1 seconds and are independent of token emission frequency 2bbdc70
- Reduced hallucinations - SimulStreaming is now less likely to generate false transcriptions during silent periods 87b9ed6
Frontend Improvements
- Enhanced silence indicators - Now displays three distinct types of silences:
  - Model-detected silences (`[BLANK_AUDIO]`)
  - Token emission gaps
  - End-of-transcription silences
  38b4ebe
- Dark theme support - Added dark mode 4e56130
- Improved UX during transcription by @davidgumberg

0.2.4
Bug Fixes
- Diarization Queue Audio Overlap: Fixed a bug where `diarization_queue` was sent the entire `self.pcm_buffer` on every iteration, instead of just the latest chunk. PR by @choomegan (commit)
- License Display Error: Fixed dual license warning display when using simulstreaming backend. 46efbdf
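The audio overlap bug is easiest to see in a toy reproduction. The sketch below is a simplified illustration under assumed names (`pcm_buffer`, `buggy_queue`, `fixed_queue`); the real buffer handling in the project may differ, for example by consuming the buffer elsewhere.

```python
from collections import deque

pcm_buffer = bytearray()
buggy_queue = deque()
fixed_queue = deque()

for chunk in (b"aa", b"bb", b"cc"):
    pcm_buffer.extend(chunk)
    # Buggy pattern: the whole accumulated buffer is queued on every iteration,
    # so earlier audio is delivered to the consumer again and again.
    buggy_queue.append(bytes(pcm_buffer))
    # Fixed pattern: only the newly received chunk is queued.
    fixed_queue.append(chunk)

print(b"".join(buggy_queue))  # b'aaaabbaabbcc'
print(b"".join(fixed_queue))  # b'aabbcc'
```

With the buggy pattern the consumer receives overlapping audio whose total length grows quadratically with the number of chunks, which is consistent with the overlap described in the fix.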
Enhancements
- Improved Punctuation Splitting for Diarization: Enhanced the `use_punctuation_split` logic to improve diarization results. Commits: 3ad3683, 5b9977c, 56114d3
- Deployment Guide Update: Fixed and clarified the Deployment Guide in the README. PR by @luisla-rivas (commit)
- Architecture Update: e40b5a3
- Dockerfile Improvements: Updated Dockerfile to install `build-essential` and update the PyTorch version. (Idea from @callumgarven) (commit)
Core Updates
- Update to latest version of SimulStreaming: Fixes warmup with >30s audio files
- SimulStreaming Whisper Core Update: Updated SimulStreaming whisper core from version 20230918 to 20250625. Solves tensor mismatch on some GPUs due to Triton version. Commits: 8e056cb, 4cfed6e
0.2.2
New:
- Replace ffmpeg-python with raw ffmpeg calls:
  - Fixes systematic crashes after 9 minutes on some machines
  - Improves reboot and restart handling
  - Allows ffmpeg to restart without crashing the server on conversion errors
- Update to latest SimulWhisper:
  - Adds compatibility with English-only models
  - Infers word-level timestamps for better diarization alignment
  - Other improvements: https://github.com/ufal/SimulStreaming/commits/main/
- Prevent buffer from growing indefinitely when no tokens are created
- Fix Hugging Face token file handling in Docker
- Remove default 8000 port in WebSocket when no port is provided
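For the move from ffmpeg-python to raw ffmpeg calls, a minimal sketch of driving ffmpeg through `subprocess` pipes is shown below. It is not the project's code: the helper names `build_ffmpeg_cmd`, `start_ffmpeg`, and `ensure_running` are hypothetical, and the 16 kHz mono 16-bit PCM output format is an assumption based on what Whisper models typically consume.

```python
import subprocess


def build_ffmpeg_cmd(sample_rate: int = 16000, channels: int = 1) -> list[str]:
    """Build an ffmpeg command that decodes stdin to raw 16-bit PCM on stdout."""
    return [
        "ffmpeg",
        "-hide_banner",
        "-loglevel", "error",
        "-i", "pipe:0",          # read the encoded stream from stdin
        "-f", "s16le",           # raw signed 16-bit little-endian output
        "-acodec", "pcm_s16le",
        "-ac", str(channels),
        "-ar", str(sample_rate),
        "pipe:1",                # write PCM to stdout
    ]


def start_ffmpeg(cmd=None) -> subprocess.Popen:
    """Start the converter process with piped stdin and stdout."""
    return subprocess.Popen(
        cmd or build_ffmpeg_cmd(),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )


def ensure_running(proc: subprocess.Popen, cmd=None) -> subprocess.Popen:
    """Return the process if it is still alive, otherwise start a fresh one."""
    if proc.poll() is None:
        return proc
    return start_ffmpeg(cmd)
```

Owning the process directly is what makes a restart-on-failure check like `ensure_running` possible: a conversion error ends the child process, not the server.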
0.2.1
New SimulStreaming backend for transcription. Associated preprint: https://arxiv.org/abs/2506.17077
- Up to 5 times faster on tiny model: #134 (comment)
- Requires installing the extra: `pip install whisperlivekit[simulstreaming]`. Dual licensed: https://github.com/ufal/SimulStreaming?tab=readme-ov-file#-licence-and-contributions
- Use it with `--backend simulstreaming`
- SimulStreaming limitations for now:
  - No buffer preview is available
  - Diarization may be less precise
  - Punctuation can be less accurate
  - English-only models (tiny.en, base.en, medium.en) are not compatible for now
0.1.9
Faster Diarization, Smarter Speaker Splitting
- Faster diarization, with buffering logic and fixed-size audio chunks, now aligned with --min-chunk-size for improved real-time performance
- Punctuation-based speaker splitting (beta): enables more natural transitions using --punctuation-split
- Custom diarization models: use --segmentation-model and --embedding-model to specify alternate backends. See here to get a list of available models
0.1.8
Changed
`TranscriptionEngine` (ex `WhisperLiveKit`) can now be initialized with parameters directly via its constructor (e.g., `TranscriptionEngine(backend="faster-whisper", model="small")`), giving greater flexibility for programmatic use in addition to command-line argument parsing.
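The two ways of configuring the engine can be related with a short sketch. The helper `engine_kwargs` is hypothetical, and its two options and their defaults are illustrative rather than the project's actual argument list; the `TranscriptionEngine` calls are left as comments because the import is omitted here.

```python
import argparse


def engine_kwargs(argv=None) -> dict:
    """Parse a small, illustrative subset of options into constructor kwargs."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--backend", default="faster-whisper")
    parser.add_argument("--model", default="small")
    return vars(parser.parse_args(argv))


# Direct construction, as in the release note:
#   engine = TranscriptionEngine(backend="faster-whisper", model="small")
# Construction from parsed command-line style arguments:
#   engine = TranscriptionEngine(**engine_kwargs(["--model", "small"]))
```

Either path ends in the same keyword arguments, which is what makes the constructor usable both from scripts and from a command-line entry point.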
Moved
- New module `whisperlivekit.parse_args` for handling command-line argument parsing.
- New module `whisperlivekit.web.web_interface` for serving the web interface HTML.
0.1.7 -> 0.1.8: 993a835