Releases: QuentinFuxa/WhisperLiveKit

0.2.8

02 Sep 19:29

Dependency and Compatibility Changes

  • Removed Triton <3 requirement
  • Tested compatibility with Python 3.14 and 3.15

Performance Improvements

  • The SimulStreaming backend now defaults to an MLX-Whisper encoder or, failing that, a Faster-Whisper encoder (whichever is available), paired with Whisper's cross-attention and decoder using an AlignAtt policy, for increased speed. This can be disabled with --disable-fast-encoder
  • Encoders are loaded once and shared in SimulStreaming, reducing VRAM usage
  • Only the Whisper decoder is loaded when a different encoder is used, reducing VRAM usage
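
The encoder preference described above can be pictured as a simple availability check. The following is a minimal sketch, not the project's actual code: the function name pick_encoder and its is_available parameter are hypothetical, and it assumes the optional packages are importable as mlx_whisper and faster_whisper.

```python
from importlib.util import find_spec
from typing import Callable


def pick_encoder(
    disable_fast_encoder: bool = False,
    is_available: Callable[[str], bool] = lambda name: find_spec(name) is not None,
) -> str:
    """Return the module name of the encoder to use (hypothetical helper)."""
    if not disable_fast_encoder:
        # Preference order taken from the release note:
        # MLX-Whisper first, then Faster-Whisper.
        for module_name in ("mlx_whisper", "faster_whisper"):
            if is_available(module_name):
                return module_name
    # Nothing faster is installed, or the fast path was disabled:
    # fall back to the reference Whisper encoder.
    return "whisper"
```

Passing is_available explicitly keeps the selection logic testable on machines where neither optional package is installed.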

Frontend Enhancements

  • Added a microphone picker
  • Loads the UI as a single inline HTML file (instead of separate CSS, JS, SVGs and HTML files) for simplified deployment

Bug Fixes and Improvements

  • Resolved warmup error when no connection is provided or when the language is set to auto
  • Added pip timeout and retries in Dockerfile when installing Torch/TorchVision/TorchAudio
  • Fixed issue where an exception is raised when language is set to 'auto' and task is set to 'translation'
  • Enabled auto-detection of language for warmup if not specified

0.2.7

27 Aug 16:29

0.2.7: Diarization Improvements

  • New default backend: Sortformer is now the default diarization backend, replacing Diart
  • 6x faster processing: Reduced latency from ~2s to ~0.3s on CPU
  • Significantly improved speaker detection (limitation: currently supports at most 4 speakers)
  • Shared model loading: A single Sortformer model (SortformerDiarization) is now shared across users and instances to reduce the memory footprint. Per-user state such as speaker caches and frames is handled in SortformerDiarizationOnline
  • Enhanced alignment: Improved time and token synchronization between transcription and diarization results
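
The split between one shared model and per-user state can be illustrated with a generic pattern. This is a sketch under assumptions, not the project's classes: SharedModel, get_shared_model and UserSession are hypothetical stand-ins for the roles the notes assign to SortformerDiarization and SortformerDiarizationOnline.

```python
import threading
from dataclasses import dataclass, field


class SharedModel:
    """Stand-in for a heavy model that should be loaded only once."""

    load_count = 0  # how many times the "weights" were loaded

    def __init__(self) -> None:
        SharedModel.load_count += 1


_model = None
_model_lock = threading.Lock()


def get_shared_model() -> SharedModel:
    """Load the model on first use and return the same instance afterwards."""
    global _model
    with _model_lock:
        if _model is None:
            _model = SharedModel()
        return _model


@dataclass
class UserSession:
    """Per-user state: a reference to the shared model plus private buffers."""

    model: SharedModel = field(default_factory=get_shared_model)
    speaker_cache: dict = field(default_factory=dict)
    frames: list = field(default_factory=list)
```

Each new UserSession reuses the already loaded model, while its speaker_cache and frames stay private to that user.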

0.2.6

21 Aug 12:35
  • Voice Activity Control (VAC) by Default: VAC is now enabled by default to improve transcription accuracy by filtering out non-speech segments before processing transcription & diarization. You can disable it with the --no-vac flag.

  • Simulstreaming Backend Enhancements:

    • The simulstreaming backend is now the default transcription backend.
    • Improved timestamp accuracy for audio segments longer than 30 seconds.
    • Backend models are now recycled to optimize resource usage, by removing Whisper hooks at the end of a transcription.
    • Added the ability to preload multiple backend models with the --preloaded_model_count argument, for when several users are expected.
  • Diarization with Silences: The diart diarization backend now correctly handles pauses and silences, improving speaker turn detection.

  • Time Handling: Aligned time handling between the backend and the frontend for better synchronization.

  • WebSocket Communication: Buffering is disabled during silent periods.

  • Default Model: The default model is now base.

0.2.5

13 Aug 08:24

Build & Dependencies

  • Migrated to pyproject.toml - Replaced setup.py with PEP-recommended packaging bda72b8
  • Removed NumPy version constraint - No longer restricted to numpy < 2.0.0 197293e

Backend Architecture

  • Refactored SimulStreaming backend separation - Improved architecture to allow multiple users to share the same backend Whisper model instance d098af3 197293e
  • Enhanced performance monitoring - Lag metrics now update every 0.1 seconds and are independent of token emission frequency 2bbdc70
  • Reduced hallucinations - SimulStreaming is now less likely to generate false transcriptions during silent periods 87b9ed6
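
One way to decouple lag reporting from token emission is to refresh the metric on a fixed interval. The sketch below only illustrates that idea; the LagMonitor class and its tick method are hypothetical and are not taken from the project.

```python
import time
from typing import Optional


class LagMonitor:
    """Tracks how far processing lags behind the received audio."""

    def __init__(self, interval: float = 0.1) -> None:
        self.interval = interval  # seconds between metric refreshes
        self.lag = 0.0
        self._last_update = float("-inf")

    def tick(
        self,
        audio_received_s: float,
        audio_processed_s: float,
        now: Optional[float] = None,
    ) -> bool:
        """Refresh the lag if `interval` has elapsed; return True when refreshed."""
        now = time.monotonic() if now is None else now
        if now - self._last_update < self.interval:
            return False
        self.lag = audio_received_s - audio_processed_s
        self._last_update = now
        return True
```

Because tick is driven by a clock rather than by new tokens, the reported lag keeps moving even while the model emits nothing.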

Frontend Improvements

  • Enhanced silence indicators - Now displays three distinct types of silences:
    • Model-detected silences ([BLANK_AUDIO])
    • Token emission gaps
    • End-of-transcription silences
      38b4ebe
  • Dark theme support - Added dark mode 4e56130
  • Improved UX during transcription by @davidgumberg
    • Screen no longer goes to sleep while transcribing 7f93c4b
    • Auto-scroll to latest transcription text 3b96fb8

0.2.4

02 Aug 12:14

Bug Fixes

  • Diarization Queue Audio Overlap: Fixed a bug where diarization_queue was sent the entire self.pcm_buffer on every iteration, instead of just the latest chunk. PR by @choomegan (commit)

  • License Display Error: Fixed the license warning being displayed twice when using the simulstreaming backend. 46efbdf
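
The diarization queue fix above comes down to queuing only new audio. The following is an illustrative sketch, not the patched code: the two function names are hypothetical, and a deque stands in for diarization_queue.

```python
from collections import deque


def enqueue_buggy(queue: deque, pcm_buffer: bytearray, chunk: bytes) -> None:
    """Pre-fix behaviour as described: the whole buffer is queued every time."""
    pcm_buffer.extend(chunk)
    queue.append(bytes(pcm_buffer))  # earlier audio is sent again and again


def enqueue_fixed(queue: deque, pcm_buffer: bytearray, chunk: bytes) -> None:
    """Post-fix behaviour: only the newly received chunk is queued."""
    pcm_buffer.extend(chunk)
    queue.append(chunk)
```

With the buggy variant the consumer sees overlapping audio whose volume grows with every iteration; with the fixed variant the queued chunks concatenate back to exactly the received stream.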

Enhancements

  • Improved Punctuation Splitting for Diarization: Enhanced the use_punctuation_split logic to improve diarization results. Commits: 3ad3683, 5b9977c, 56114d3

  • Deployment Guide Update: Fixed and clarified the Deployment Guide in the README. PR by @luisla-rivas (commit)

  • Architecture Update e40b5a3

  • Dockerfile Improvements: Updated the Dockerfile to install build-essential and update the PyTorch version. Idea from @callumgarven (commit)

Core Updates

0.2.2

04 Jul 15:10

New:

  • Replace ffmpeg-python with raw ffmpeg calls:

    • Fixes systematic crashes after 9 minutes on some machines
    • Improves reboot and restart handling
    • Allows ffmpeg to restart without crashing the server on conversion errors
  • Update to latest SimulWhisper:

  • Prevent buffer from growing indefinitely when no tokens are created

  • Fix Hugging Face token file handling in Docker

  • Remove default 8000 port in WebSocket when no port is provided
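
Calling the ffmpeg binary directly, rather than through ffmpeg-python, might look roughly like the sketch below. It is an assumption-laden illustration, not the project's code: the helper names, the 16 kHz mono PCM output format and the restart policy are all hypothetical.

```python
import shutil
import subprocess
from typing import Optional


def build_ffmpeg_cmd(sample_rate: int = 16000, channels: int = 1) -> list:
    """Command that reads any input from stdin and writes raw 16-bit PCM to stdout."""
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-i", "pipe:0",                         # input arrives on stdin
        "-f", "s16le", "-acodec", "pcm_s16le",  # raw signed 16-bit little-endian
        "-ac", str(channels), "-ar", str(sample_rate),
        "pipe:1",                               # output goes to stdout
    ]


def start_ffmpeg() -> subprocess.Popen:
    """Start ffmpeg as a child process; raise a clear error if it is missing."""
    if shutil.which("ffmpeg") is None:
        raise FileNotFoundError("ffmpeg executable not found on PATH")
    return subprocess.Popen(
        build_ffmpeg_cmd(),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # avoid blocking on an undrained stderr pipe
    )


def ensure_running(proc: Optional[subprocess.Popen]) -> subprocess.Popen:
    """Restart ffmpeg if it has exited, so the server itself keeps running."""
    if proc is None or proc.poll() is not None:
        return start_ffmpeg()
    return proc
```

Keeping ffmpeg in a child process means a conversion error ends only that process, which the caller can replace through ensure_running.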

0.2.1

27 Jun 12:16

New SimulStreaming backend for transcription. Associated preprint: https://arxiv.org/abs/2506.17077

0.1.9

19 Jun 14:44

Faster Diarization, Smarter Speaker Splitting

  • Faster diarization, with buffering logic and fixed-size audio chunks, now aligned with --min-chunk-size for improved real-time performance
  • Punctuation-based speaker splitting (beta): enables more natural transitions using --punctuation-split
  • Custom diarization models: use --segmentation-model and --embedding-model to specify alternate backends. See here to get a list of available models

0.1.8

16 Jun 14:58

Changed

  • TranscriptionEngine (formerly WhisperLiveKit) can now be initialized with parameters passed directly to its constructor, e.g. TranscriptionEngine(backend="faster-whisper", model="small"). This gives more flexibility for programmatic use, in addition to command-line argument parsing.
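
The general pattern of letting constructor keyword arguments override command-line defaults can be sketched as follows. This is not the TranscriptionEngine implementation: EngineSketch, parse_cli_defaults and the default values shown are hypothetical and serve only to illustrate the idea.

```python
import argparse


def parse_cli_defaults(argv=None) -> argparse.Namespace:
    """Parse command-line style options; the defaults here are placeholders."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--backend", default="faster-whisper")
    parser.add_argument("--model", default="tiny")
    return parser.parse_args(argv if argv is not None else [])


class EngineSketch:
    """Hypothetical stand-in for an engine configurable from code or CLI."""

    def __init__(self, **kwargs) -> None:
        config = vars(parse_cli_defaults([]))  # start from the defaults
        unknown = set(kwargs) - set(config)
        if unknown:
            raise TypeError(f"unexpected parameters: {sorted(unknown)}")
        config.update(kwargs)  # constructor arguments take precedence
        self.args = argparse.Namespace(**config)


engine = EngineSketch(backend="faster-whisper", model="small")
```

Rejecting unknown keyword arguments early is a design choice of this sketch; it surfaces typos in parameter names instead of silently ignoring them.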

Moved

  • New module whisperlivekit.parse_args for handling command-line argument parsing.
  • New module whisperlivekit.web.web_interface for serving the web interface HTML.

0.1.7 -> 0.1.8: 993a835

0.1.7

28 May 11:36

Changelog - v0.1.7

Bug Fixes

  • Fixed #127: Transcription with VAC is functional again

Enhancements

  • Backend logs now update the lag value even when no audio is detected
  • Added explicit error message when ffmpeg is not found
  • Frontend now indicates when no audio is detected