CapsLockX 2.0: Rust rewrite with voice, brainstorm, TTS by snomiao · Pull Request #123 · snolab/CapsLockX

snomiao · 2026-03-21T11:34:58Z

Summary

Ground-up Rust rewrite targeting cross-platform (macOS first)
SenseVoice local STT (95%+ accuracy) + Gemini cloud STT + MLX local LLM correction
Brainstorm agent (Space+B) with 4 LLM providers, 10+ tools, persistent chat history
TTS with 5-tier fallback chain (ElevenLabs → Gemini → OpenAI → msedge → native)
Sandboxed JS engine (rquickjs) + Wolfram math engine (Woxi)
Voice overlay with waveform, subtitles, drag/resize/close
Window cycling via CGWindowID (stable across arrange/minimize)
Mouse clamp to screen bounds (multi-monitor safe)
Launch at login via LaunchAgent
Browser WASM adapter with SenseVoice voice input

Test plan

Build: cd rs && cargo build -p capslockx-macos --release
Voice input: Space+V hold to record, release to transcribe
Brainstorm: Space+B with selected text, Enter to send
Window cycling: Space+Z forward, Space+C arrange
Mouse: Space+WASD movement clamped to screen edges
TTS: Agent auto-speaks translations

🤖 Generated with Claude Code

- Add brainstorm_voice() with MCI waveaudio recording (V to record, V to send, ESC to cancel)
- Auto-send after 60s timeout to prevent super long recordings
- Show default mic device name in tooltip while recording
- Attach recorded WAV as audio field in form data (same pattern as image)
- Add "both" capture mode (clipboard text + window screenshot)
- Refactor: merge brainstorm_prompt/brainstorm_quick_capture into unified brainstorm_ask()
- Extract brainstorm_heatup(), remove dead/commented-out code
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Implements the CapsLockX macOS platform adapter with:
- CGEventTap hook for intercepting keyboard events at HID level
- Raw FFI callback to properly suppress events (returns NULL)
- Works around core-graphics crate bug where None still passes events
- CGEventPost output for injecting keyboard, mouse, and scroll events
- Self-injection detection via EVENT_SOURCE_USER_DATA tagging
- Full macOS virtual keycode ↔ KeyCode bidirectional mapping
- FlagsChanged event handling for modifier key press/release detection
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Mouse cursor now stops at screen edges (union of all display bounds) instead of disappearing off-screen. Also adjusts macOS scroll speed to account for LINE scroll units being much larger than on Windows. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
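A minimal sketch of the clamping idea described above, assuming displays are reported as axis-aligned rectangles in one global coordinate space. `Rect`, `union_bounds`, and `clamp_cursor` are hypothetical names for illustration, not the adapter's actual API.

```rust
/// Axis-aligned display rectangle in global screen coordinates (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x: f64,
    pub y: f64,
    pub w: f64,
    pub h: f64,
}

/// Bounding box that covers every display, or None when the list is empty.
pub fn union_bounds(displays: &[Rect]) -> Option<Rect> {
    let first = displays.first()?;
    let (mut min_x, mut min_y) = (first.x, first.y);
    let (mut max_x, mut max_y) = (first.x + first.w, first.y + first.h);
    for d in &displays[1..] {
        min_x = min_x.min(d.x);
        min_y = min_y.min(d.y);
        max_x = max_x.max(d.x + d.w);
        max_y = max_y.max(d.y + d.h);
    }
    Some(Rect { x: min_x, y: min_y, w: max_x - min_x, h: max_y - min_y })
}

/// Clamp a cursor position to the last pixel inside the union of all displays.
pub fn clamp_cursor(x: f64, y: f64, displays: &[Rect]) -> (f64, f64) {
    match union_bounds(displays) {
        Some(b) => {
            // Guard the upper bound so degenerate (sub-pixel) bounds cannot panic in clamp.
            let max_x = (b.x + b.w - 1.0).max(b.x);
            let max_y = (b.y + b.h - 1.0).max(b.y);
            (x.clamp(b.x, max_x), y.clamp(b.y, max_y))
        }
        // No display information: leave the position untouched.
        None => (x, y),
    }
}
```

Note that a bounding-box union still includes dead zones when displays of different sizes sit side by side; it only prevents the cursor from leaving the overall desktop area.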

- Switch scroll from LINE to PIXEL units for smooth 1px-granularity trackpad-like scrolling
- Implement real window cycling (Space+Z) using CGWindowListCopyWindowInfo + NSRunningApplication instead of Cmd+Tab: directly activates the next/prev app window like Alt+Tab on Windows
- Add arrange_windows: Ctrl+Cmd+F for fullscreen, Ctrl+Up for Mission Control
- Reset scroll speed to default 720 (appropriate for pixel units)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…elds) macOS disables CGEventTaps during secure input (password dialogs, FileVault, etc.). The tap was staying disabled afterwards, making CapsLockX unresponsive. Now detects TapDisabledByTimeout and TapDisabledByUserInput events and automatically re-enables the tap. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Window cycling (Space+Z) now cycles individual windows (not just apps) using Accessibility API (AXUIElement) to raise specific windows
- Window tiling (Space+C) uses AX API to set position/size within the visible work area (excluding menu bar and Dock):
  - Plain: cascade with 48px offset
  - Shift: sqrt-based grid layout
- Both cycle and arrange use stable (pid, title) ordering
- Ctrl+Space+N/P sends Ctrl+Tab / Ctrl+Shift+Tab (switch browser tabs) via new key_tap_ctrl_shifted platform method with proper CGEvent flags
- Cmd+Space now passes through to macOS Spotlight (Win+Space bypasses on Windows too for input language switcher)
- Added drag bug to TODO.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- ci-rust.yml: add build-macos job (macos-latest, aarch64-apple-darwin) with check, clippy, test, and release build
- release-rust.yml: add build-macos job that builds capslockx-macos and uploads capslockx-macos-arm64 binary to GitHub Release
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- bin/capslockx.mjs: detects OS/arch and runs the correct Rust binary (Windows, macOS arm64/x64, Linux x64)
- Looks for binary in: repo root, local cargo build, then auto-downloads from latest GitHub Release as fallback
- package.json: bin field points to the cross-platform launcher
- package.json: files field includes all platform binaries
- release-rust.yml: commit macOS binary to main for npm inclusion
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The AHK launcher is not the Rust binary — falling back to it would run the wrong thing. If clx-rust.exe isn't found locally, the auto-download from GitHub Releases handles it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Previously the bypass path suppressed the original event and re-injected Space via key_tap, which stripped modifier flags — macOS saw naked Space instead of Cmd+Space. Now the bypass returns PassThrough so the original event reaches the OS with all modifier flags intact. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- White X icon when inactive, blue X when CapsLockX mode is active
- Menu with "Quit CapsLockX" item
- NSApplication initialized as Accessory (no dock icon)
- Icon updates dispatched to main queue for AppKit thread safety
- Uses raw Objective-C FFI, no additional dependencies
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Preferences UI with same Catppuccin Mocha theme as Windows
- 5 trigger key checkboxes + 3 speed sliders
- WKWebView with JS bridge (webkit.messageHandlers) instead of Tauri
- Custom ObjC classes created at runtime for WKScriptMessageHandler and menu item action target
- Config persisted to ~/.config/CapsLockX/config.json
- Tray menu: "Preferences…" (Cmd+,) + separator + "Quit CapsLockX"
- ~300 lines of raw ObjC FFI, no framework dependency, ~0 binary overhead
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Three bugs prevented Cmd+Space from working:
1. Fast key combos caused Cmd's FlagsChanged-up to arrive before Space-down, removing LWin from held_keys. Fix: sync CGEvent modifier flags into held_keys on every KeyDown event.
2. Space key-up was always suppressed for trigger keys, so macOS never saw the complete down+up cycle. Fix: track bypass state with trigger_bypassed AtomicBool and pass through both events.
3. FlagsChanged events for modifiers could be suppressed, preventing macOS from tracking modifier state. Fix: always pass through FlagsChanged on macOS.
Added docs/dev/modifier-bypass.md documenting the full solution.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Modifier keys (Shift, Ctrl, Option, Cmd) are now passed through as-is to the output key; no platform-specific word_modifier/doc_modifier mapping needed. Users press their native modifier combos.
- Removed word_modifier()/doc_modifier() from Platform trait
- Renamed AccModel phase strings from Chinese to English: 横中键→H_MIDKEY, 纵中键→V_MIDKEY, 移动→MOVE, 止动→STOP
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Design doc and TODO for the voice input feature:
- Toggle mode (click V) and hold mode (hold V)
- VAD-based audio segmentation with 25s max chunks
- 3-stage transcription pipeline: local→server Whisper→LLM typo-fix
- Each stage replaces previous text in-place at cursor
- Architecture: cpal audio + webrtc-vad + HTTP to brainstorm server
See plan/voice-input/README.md for full architecture and plan/voice-input/TODO.md for implementation checklist.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Client (Rust):
- VoiceModule: V key state machine (toggle click / hold release). Wired into Modules dispatcher for on_key_down/on_key_up/stop_all
- AudioCapture: cpal-based cross-platform mic capture (16kHz mono f32) with ring buffer, start/stop/take_samples API
Server (brainstorm):
- POST /api/voice-transcribe: streaming NDJSON endpoint
  - Stage 1: Whisper transcription → {stage:"transcribed"}
  - Stage 2: gpt-4o-mini typo-fix → {stage:"polished", is_final:true}
- CORS enabled for cross-origin Rust client
Next: wire audio capture into VoiceModule, add VAD, HTTP client
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Full voice pipeline wired together:
- Energy-based VAD: 20ms frames, RMS threshold, 500ms silence = chunk end
- Force-splits at 25s for Whisper's 30s limit
- WAV encoder (16-bit PCM mono) for server upload
- HTTP POST via ureq to brainstorm voice-transcribe endpoint
- Parses streaming NDJSON response, types final text at cursor
- AudioCapture created on background thread (cpal Stream is !Send)
- Platform::type_text() default impl maps ASCII to key_tap calls
- Server URL configurable via CLX_VOICE_SERVER env var
Usage: hold Space+V to dictate, release to send. Or tap to toggle.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
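The energy-based VAD in this commit can be sketched roughly as follows, assuming 16kHz input so that a 20ms frame is 320 samples and 500ms of silence is 25 frames. `EnergyVad`, `VadEvent`, and the threshold passed to `new` are hypothetical; the commit does not state the actual threshold or type names.

```rust
/// 20ms frame at 16kHz (assumed sample rate).
pub const FRAME_SAMPLES: usize = 320;
/// 500ms of silence = 25 consecutive quiet 20ms frames.
pub const SILENCE_FRAMES_TO_END: usize = 25;

/// Root-mean-square energy of one frame; 0.0 for an empty frame.
pub fn rms(frame: &[f32]) -> f32 {
    if frame.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = frame.iter().map(|s| s * s).sum();
    (sum_sq / frame.len() as f32).sqrt()
}

#[derive(Debug, PartialEq)]
pub enum VadEvent {
    None,
    SpeechStart,
    ChunkEnd,
}

/// Hypothetical energy VAD: speech starts on the first loud frame and the
/// chunk ends after SILENCE_FRAMES_TO_END consecutive quiet frames.
pub struct EnergyVad {
    threshold: f32,
    in_speech: bool,
    silent_frames: usize,
}

impl EnergyVad {
    pub fn new(threshold: f32) -> Self {
        EnergyVad { threshold, in_speech: false, silent_frames: 0 }
    }

    pub fn push_frame(&mut self, frame: &[f32]) -> VadEvent {
        if rms(frame) >= self.threshold {
            self.silent_frames = 0;
            if !self.in_speech {
                self.in_speech = true;
                return VadEvent::SpeechStart;
            }
        } else if self.in_speech {
            self.silent_frames += 1;
            if self.silent_frames >= SILENCE_FRAMES_TO_END {
                self.in_speech = false;
                self.silent_frames = 0;
                return VadEvent::ChunkEnd;
            }
        }
        VadEvent::None
    }
}
```

A pure RMS gate like this cannot tell speech from keyboard clicks or music, which is the limitation a later commit addresses by switching to a neural VAD.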

Two-phase transcription pipeline:
1. Local Whisper (whisper.cpp via whisper-rs): instant rough draft typed at cursor (~200ms on Apple Silicon)
2. Server Whisper + LLM: polished text replaces rough via Backspace
- Model: ggml-tiny.en.bin at ~/.cache/capslockx/ (~75MB)
- Download instructions are printed if the model is missing
- Graceful fallback to server-only if model unavailable
- Metal GPU acceleration auto-detected on macOS
- Platform::type_text() for typing transcription at cursor
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Resample 48kHz mic audio to 16kHz before Whisper (fixes hallucinations)
- Skip chunks shorter than 1 second
- Preload Whisper model at startup (first Space+V is instant)
- Persistent bg thread (model stays loaded between sessions)
- Server: skip empty transcriptions, install nodemailer dep
- macOS: key_tap_with_mods embeds modifier flags on CGEvent atomically (fixes Shift+HJKL text selection)
- Quiet verbose logs (window snapshots, VAD events, audio capture)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
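Because 48kHz is exactly three times 16kHz, the resampling step can be illustrated with a simple 3:1 averaging decimator. This is only a sketch under that assumption; the commit does not say which resampling method is used, and `downsample_48k_to_16k` is a hypothetical name.

```rust
/// Downsample 48kHz mono audio to 16kHz by averaging each group of three
/// samples (a crude box low-pass filter followed by 3:1 decimation).
/// Any trailing samples that do not fill a group of three are dropped.
pub fn downsample_48k_to_16k(input: &[f32]) -> Vec<f32> {
    input
        .chunks_exact(3)
        .map(|c| (c[0] + c[1] + c[2]) / 3.0)
        .collect()
}
```

A production resampler would normally use a better anti-aliasing filter; the averaging here only shows why feeding un-resampled 48kHz audio to a 16kHz model yields audio at one third of the expected speed.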

- macOS type_text: CGEventKeyboardSetUnicodeString for full Unicode (Chinese, Japanese, emoji; not just ASCII)
- Whisper auto-scaling: budget-based (wall clock vs inference time), non-blocking background model loading (old model keeps working), persists tier across restarts via whisper-tier.txt
- Noise filter: skip bracketed annotations, hallucinations, <3 chars
- Voice server URL reads from ~/.config/CapsLockX/config.json
- VAD events logged again (speech start/end)
- rs/dev-watch.sh: cargo-watch auto-rebuild + restart on file change
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Replace RMS energy VAD with TEN VAD neural network (308KB ONNX model): reliably distinguishes human speech from keyboard clicks, music, ambient noise. 10ms frames, 0.16ms inference on M-series.
- Native Core Graphics waveform overlay: transparent floating window at bottom-center, green when speaking, gray when silent. Custom NSView with drawRect via raw ObjC FFI, ~20fps.
- Resample 48kHz→16kHz before VAD (not after) for consistency.
- Platform trait: show/hide/update_voice_overlay methods.
- Whisper auto-scaling: non-blocking background model loading, budget-based (wall clock), persists tier across restarts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Replace RMS energy VAD with TEN VAD neural network (308KB model). Distinguishes human speech from keyboard clicks, music, ambient noise
- Space+V = mic only (as before)
- Space+Shift+V = mic + system audio (ScreenCaptureKit). Captures meeting audio, YouTube, etc. mixed with your voice
- SystemAudioStream trait + MacPlatform override
- ScreenCaptureKit stub with CMSampleBuffer handler ready (full async SCShareableContent flow TODO)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Instead of buffering all continuous speech until silence (up to 25s), emit partial chunks every 3 seconds so text appears incrementally as the user speaks — like an input method editor. Also removed corrupted ggml-small.bin (incomplete download). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Emit partial transcription every 1s of continuous speech instead of 3s. Use rolling buffer with VAD speech detection: transcribe on fixed interval while speech is active, flush remainder on speech end.
- Base model: 1s audio → ~180ms = 1.2s total latency
- Small model: 1s audio → ~575ms = 1.6s total latency
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Instead of typing the full transcription each time, compute the diff between what was already typed and the new transcription:
- Find common prefix
- Backspace the diverging suffix
- Type only the new suffix
This makes continuous speech feel like an IME: text flows in incrementally without repeating. Server polishes the full utterance on speech end by replacing all typed text.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
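The three steps above (common prefix, backspace, type the new suffix) might look like the sketch below. `TypingDiff` and `typing_diff` are hypothetical names; the comparison is done per `char` on the assumption that one Backspace removes one Unicode scalar value, which may not hold for multi-codepoint emoji.

```rust
/// Edit needed to turn the already-typed text into the new transcription.
#[derive(Debug, PartialEq)]
pub struct TypingDiff {
    /// Number of characters to delete with Backspace.
    pub backspaces: usize,
    /// Text to type after the deletions.
    pub suffix: String,
}

/// Compute the minimal "backspace then type" edit based on the common prefix.
pub fn typing_diff(typed: &str, new_text: &str) -> TypingDiff {
    // Compare by char rather than by byte so CJK text is never split mid-character.
    let common = typed
        .chars()
        .zip(new_text.chars())
        .take_while(|(a, b)| a == b)
        .count();
    TypingDiff {
        backspaces: typed.chars().count() - common,
        suffix: new_text.chars().skip(common).collect(),
    }
}
```

For example, going from "hello word" to "hello world" shares the nine-character prefix "hello wor", so the edit is one Backspace followed by typing "ld".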

- Transcribe every 0.5s of new speech (was 1s): Whisper inference is ~constant time regardless of audio length, so faster intervals are essentially free
- Server polishing now runs in a background thread so it doesn't block the streaming transcription loop
- Forced base model for lower latency (~450ms vs ~1300ms small)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Instead of re-transcribing the entire growing buffer (causing massive diffs when Whisper changes its mind about old text), use a committed prefix approach:
- Only the last ~5s of audio (pending buffer) gets re-transcribed
- After 5s, text is frozen/committed and never changed
- Diffs are small and local; no more -400 char backspace storms
- Session log at /tmp/clx-voice.log shows full text state
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Upgrade threshold 1.25x → 5x (prevent unnecessary model switches)
- Samples before scaling 3 → 10 (more evidence needed)
- Commit window 5s → 3s (freeze text faster, less Whisper instability)
- Force base model for consistent low-latency streaming
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Instead of committing at a fixed 3s interval (which causes large diffs right before commit), wait until the transcription is stable: same text for 2 consecutive inference cycles. This means:
- Commits at natural phrase boundaries where Whisper has settled
- No more large pre-commit backspace storms
- Force-commit at 5s as safety net for very long utterances
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
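One way to express the stability rule is a small gate that counts consecutive identical transcriptions and also tracks elapsed time for the 5s force-commit. `StabilityGate` and `observe` are hypothetical names, and passing the elapsed milliseconds per inference cycle is an assumption about how time is tracked.

```rust
/// Hypothetical commit gate: commit when the same text is seen in two
/// consecutive inference cycles, or when 5s have passed without a commit.
pub struct StabilityGate {
    last_text: String,
    stable_cycles: u32,
    pending_ms: u64,
}

impl StabilityGate {
    pub const REQUIRED_STABLE_CYCLES: u32 = 2;
    pub const FORCE_COMMIT_MS: u64 = 5_000;

    pub fn new() -> Self {
        StabilityGate { last_text: String::new(), stable_cycles: 0, pending_ms: 0 }
    }

    /// Feed one inference result plus the time since the previous one.
    /// Returns true when the pending text should be committed (frozen).
    pub fn observe(&mut self, text: &str, elapsed_ms: u64) -> bool {
        self.pending_ms += elapsed_ms;
        if text == self.last_text {
            self.stable_cycles += 1;
        } else {
            // The transcription changed: restart the count at this new text.
            self.last_text = text.to_string();
            self.stable_cycles = 1;
        }
        let stable = !text.is_empty() && self.stable_cycles >= Self::REQUIRED_STABLE_CYCLES;
        let forced = self.pending_ms >= Self::FORCE_COMMIT_MS;
        if stable || forced {
            // Reset so the next phrase starts from a clean state.
            self.last_text.clear();
            self.stable_cycles = 0;
            self.pending_ms = 0;
            return true;
        }
        false
    }
}
```

With 0.5s inference cycles, a phrase that stops changing is committed one cycle after it settles, while text that keeps changing is frozen at the tenth cycle.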

VoiceProcessingIO ducks (lowers volume of) other system audio by default. Added kAUVoiceIOProperty_OtherAudioDuckingConfiguration (property 2108) with minimum ducking level. Speakers should now stay audible during echo-cancelled mic capture. Re-enabled VoiceProcessingIO for Shift+V (system audio capture mode). Normal Space+V still uses cpal (no ducking at all). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Speakers audible (ducking minimized) ✓
- VoiceProcessingIO starts successfully ✓
- Format query fails (-10877); assuming 48kHz mono f32
- AEC not effective yet: both tracks still show same content
- TODO: fix format to get actual echo-cancelled audio from VPIO
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Root cause: VPIO on macOS reports 9ch non-interleaved format but AudioUnitRender with 1-buffer mono succeeds and returns very quiet echo-cancelled audio (rms 0.001-0.036 vs normal 0.05-0.3). The AEC IS working — speaker bleed is cancelled. The signal is just extremely quiet. Applied 30x gain amplification to bring levels back to normal for VAD/Whisper processing. Also added test-vpio standalone binary for debugging VPIO independently. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

test-vpio + Whisper test with English YouTube on speakers and Japanese audio into mic shows 7/8 transcriptions as Japanese. English speaker bleed is successfully cancelled. VPIO (30x gain) → Whisper correctly separates: 🎤 VPIO: "はい、オートです。" (Japanese from mic) 🎤 VPIO: "さぁ、ちょっと、開いたいです。" (Japanese from mic) vs speakers playing English YouTube → cancelled Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ary) The 9-buffer non-interleaved approach caused render status=-50. Reverted to simple 1-buffer mono render which the test binary proved works correctly. AEC effectiveness: 91% (10/11 non-English, 1 leak). Standalone test confirmed: Japanese mic audio separated from English speakers with VoiceProcessingIO + 30x gain. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Previously truncated the combined string, cutting off the 🎤 line and emoji when text was long. Now each line (mic/sys) is truncated independently at 80 chars, preserving both emoji labels. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Also fix subtitle overlay: split lines on \n before parsing emoji tags, add transparent-bg newline between lines for proper NSTextField rendering. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Pure Rust NLMS (Normalized Least Mean Squares) adaptive filter:
- 4800 taps (300ms at 16kHz) learns speaker→mic acoustic path
- Subtracts predicted echo from mic signal
- Cross-platform: works on macOS/Windows/Linux
- Stacks with VoiceProcessingIO: VPIO removes ~91%, NLMS cleans rest
- Also added noise gate (0.002 threshold) in VPIO voice_capture
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
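For readers unfamiliar with NLMS, the sketch below shows the standard textbook update: predict the echo as a weighted sum of recent speaker samples, subtract it from the mic sample, then nudge the weights in proportion to the error, normalized by the reference energy. The `Nlms` type, the step size `mu`, and the regularizer `eps` are assumptions for illustration, not values taken from the PR.

```rust
/// Textbook NLMS echo canceller (hypothetical type, not the PR's code).
pub struct Nlms {
    weights: Vec<f32>,
    history: Vec<f32>, // ring buffer of recent reference (speaker) samples
    pos: usize,        // index of the newest sample in `history`
    mu: f32,           // step size, assumed to be in (0, 2)
    eps: f32,          // avoids division by zero when the reference is silent
}

impl Nlms {
    pub fn new(taps: usize, mu: f32) -> Self {
        assert!(taps > 0, "filter needs at least one tap");
        Nlms {
            weights: vec![0.0; taps],
            history: vec![0.0; taps],
            pos: 0,
            mu,
            eps: 1e-6,
        }
    }

    /// Process one sample pair and return the echo-cancelled mic sample.
    pub fn process(&mut self, reference: f32, mic: f32) -> f32 {
        let n = self.weights.len();
        self.history[self.pos] = reference;

        // Predicted echo = sum over k of w[k] * reference delayed by k samples.
        let mut predicted = 0.0f32;
        let mut energy = 0.0f32;
        for k in 0..n {
            let x = self.history[(self.pos + n - k) % n];
            predicted += self.weights[k] * x;
            energy += x * x;
        }

        // The residual is both the output and the adaptation error.
        let error = mic - predicted;
        let step = self.mu * error / (energy + self.eps);
        for k in 0..n {
            let x = self.history[(self.pos + n - k) % n];
            self.weights[k] += step * x;
        }

        self.pos = (self.pos + 1) % n;
        error
    }
}
```

The per-sample cost is proportional to the tap count, which is why reducing the number of taps directly reduces CPU use at the price of a shorter modeled echo tail.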

Moved gain+noise gate from voice_capture.rs callback to voice.rs AFTER the NLMS filter. NLMS now operates on raw VPIO signal (pre-amplification) where echo residual is tiny. After NLMS cleans it, 40x gain amplifies clean voice only. Result: 0 English leaks in both standalone test and CapsLockX session. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Truncation now keeps first 2 chars (emoji + space) as prefix, then "..." + last 74 chars. Previously took last 77 chars from the end, cutting off the emoji label. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
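A sketch of the truncation rule as described, counting Unicode scalar values rather than bytes so the emoji label is never split. `truncate_subtitle` is a hypothetical name, and the 80-character limit is taken from the earlier per-line truncation commit.

```rust
/// Truncate one subtitle line: keep the 2-char label (emoji + space),
/// then "...", then the last 74 characters. Lines of 80 chars or fewer
/// are returned unchanged.
pub fn truncate_subtitle(line: &str) -> String {
    const MAX_CHARS: usize = 80;
    const PREFIX_CHARS: usize = 2;
    const TAIL_CHARS: usize = 74;

    // Work on chars, not bytes: slicing a &str by byte index can panic
    // in the middle of a multi-byte character.
    let chars: Vec<char> = line.chars().collect();
    if chars.len() <= MAX_CHARS {
        return line.to_string();
    }
    let prefix: String = chars[..PREFIX_CHARS].iter().collect();
    let tail: String = chars[chars.len() - TAIL_CHARS..].iter().collect();
    format!("{}...{}", prefix, tail)
}
```

The result is 2 + 3 + 74 = 79 characters, so it always fits within the 80-character line limit.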

Bottleneck: two Whisper instances running sequentially (~400ms/cycle).
Fixes:
- Sys Whisper now runs 3x less often (larger streaming interval)
- Mic Whisper gets priority for faster response
- NLMS filter reduced 4800→1600 taps (3x less CPU, 100ms still enough)
Net effect: mic transcription ~200ms/cycle instead of ~400ms.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Mic-only is faster (no AEC/NLMS/system audio overhead). Dual capture with echo cancellation only when explicitly requested. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Previously used stale 2s snapshot, causing index drift when user manually clicked a window between Z presses. Now:
- Always takes fresh window snapshot (handles open/close)
- Detects frontmost app via NSWorkspace.frontmostApplication
- Starts cycling from the actually-focused window
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…esses
- Removed (pid, title) sort; keep natural z-order from CGWindowList (front-to-back, most recently used first)
- Fresh snapshot on first press or after 2s pause
- Reuse snapshot during rapid cycling (prevents ping-pong)
- Index 0 = frontmost, so first Z goes to 2nd most recent window
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ling
Major features:
- SenseVoice (sherpa-onnx) as default STT engine with Whisper fallback
- LLM-based STT error correction via Gemini/OpenAI/Anthropic
- Brainstorm agent (Space+B) with web_search, fetch_url, js_eval (Boa), math_eval (Woxi)
- Non-modal brainstorm prompt panel (doesn't block voice overlay)
- Voice overlay: auto-resize, drag handle on hover, hidden from screen share
- Window cycling: CGWindowID-based stable ordering, frontmost detection
- Browser voice: SenseVoice WASM with server-mode streaming
- STT benchmark suite (SenseVoice vs Whisper tiny/base/small/medium/large-v3)
- Dual-track STT worker architecture (non-blocking audio loop)
- Preferences UI: STT engine, LLM API key, model selection
- Config persistence across restarts (overlay positions, settings)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add voice-standalone binary: tests Space+V pipeline without full CLX (VoiceProcessingIO AEC mic + ScreenCaptureKit sys audio + overlay)
- Fix NSSize/NSRect ABI mismatch in voice_overlay auto_resize (ARM64)
- Fix byte-slice panics on Chinese UTF-8 in subtitle debug logs
- Fix sys track subtitle never updating (add sys_subtitle_dirty path)
- Fix STT channel saturation: pre-accumulate before sending (MIC 200ms, SYS 500ms; reduces production from ~40/s to ~7/s)
- Fix mic VAD false-hold: cap mic_pending_buf to last 2s on nospeech so growing-silent-buffer doesn't waste CPU on ever-growing chunks
- Fix unbounded sys_committed buffer: cap at 200 chars after each commit
- Raise SPEECH_START_FRAMES 4→8 (128ms) to filter ambient noise
- Add 10s force-finalize in STT worker for runaway mic pending buf
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

process_mic_streaming now returns Option<bool>:
- Some(true) = committed → clear pending_buf
- Some(false) = speech in-progress → keep accumulating (up to 5s cap)
- None = nospeech → trim to 1s to avoid noise contamination
Previously, every non-commit call trimmed the mic buffer to 1s regardless of whether real speech was detected. This meant SenseVoice always saw ≤1s of audio and mic_stable never reached 2, so committed text was stale. Now the buffer grows naturally during speech, giving multi-phrase context.
Also applies last_n_chars(200) cap to sys_subtitle_dirty path and process_mic_streaming subtitle build.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
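The three-way return value maps onto a buffer policy along these lines. This is a sketch, assuming 16kHz mono samples and assuming the 5s cap is enforced by keeping only the newest 5s; `apply_streaming_result` and `keep_last` are hypothetical helpers, not functions named in the PR.

```rust
/// Assumed capture rate: 16kHz mono, so one second is 16_000 samples.
pub const SAMPLE_RATE: usize = 16_000;

/// Drop the oldest samples so that at most `n` of the newest remain.
fn keep_last(buf: &mut Vec<f32>, n: usize) {
    if buf.len() > n {
        let cut = buf.len() - n;
        buf.drain(..cut);
    }
}

/// Apply the buffer policy implied by the streaming result:
/// Some(true)  = committed          -> clear the pending buffer
/// Some(false) = speech in progress -> keep accumulating, capped at 5s
/// None        = no speech          -> trim to the last 1s
pub fn apply_streaming_result(pending: &mut Vec<f32>, result: Option<bool>) {
    match result {
        Some(true) => pending.clear(),
        Some(false) => keep_last(pending, 5 * SAMPLE_RATE),
        None => keep_last(pending, SAMPLE_RATE),
    }
}
```

Distinguishing `Some(false)` from `None` is what lets the buffer grow during real speech instead of being trimmed to 1s on every non-commit call.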

Agent:
- Added tools: js_eval (rquickjs, 8x faster than boa), math_eval (woxi), speak (TTS queue), wait, read_file_range, task management (background tasks with timeout), read_screen, read_clipboard, screenshot
- Deduplicate consecutive identical tool calls
- Context compaction when history exceeds 60K chars
- Large tool outputs saved to file, agent reads via read_file_range
- Speech queue: serial playback, no overlap, fire-and-forget
LLM:
- Added Ollama provider (local, OpenAI-compatible API)
- Auto-discover best model from provider APIs (Gemini, OpenAI, Ollama)
- Anthropic uses claude-opus-4-latest alias
- Fallback chain: Gemini → GPT-4o → Claude Opus → Ollama local
TTS:
- Fallback chain: ElevenLabs → Gemini → OpenAI → msedge-tts → macOS say
- Speech queue thread for serial playback
Brainstorm:
- Keep/read histories checkbox (persists across restarts)
- Prompt format: clipboard\n---\n\n===
- Result overlay non-focusable (like voice overlay)
- Selected text via AX API (no clipboard pollution)
Refactor:
- Renamed key_tap_ctrl → key_tap_cmd_or_ctrl (cross-platform clarity)
- Removed key_tap_cmd_or_ctrl_shifted (use key_tap_with_mods instead)
- rquickjs for native JS eval, boa_engine kept as WASM fallback
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
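Both the LLM and TTS fallback chains follow the same pattern: try each provider in order and return the first success. The sketch below shows that pattern generically; `Provider`, `speak_with_fallback`, and the two stand-in providers are hypothetical and do not call any real service.

```rust
/// A provider takes the text and returns either a result or an error message.
pub type Provider = fn(&str) -> Result<String, String>;

/// Try each provider in order. Returns the name and output of the first one
/// that succeeds, or every error message if the whole chain fails.
pub fn speak_with_fallback(
    chain: &[(&'static str, Provider)],
    text: &str,
) -> Result<(&'static str, String), Vec<String>> {
    let mut errors = Vec::new();
    for (name, provider) in chain {
        match provider(text) {
            Ok(out) => return Ok((*name, out)),
            Err(e) => errors.push(format!("{}: {}", name, e)),
        }
    }
    Err(errors)
}

/// Stand-in for a cloud tier that is unavailable (for example, no API key).
pub fn offline_provider(_text: &str) -> Result<String, String> {
    Err("no API key".to_string())
}

/// Stand-in for the last-resort native tier, which always succeeds here.
pub fn native_provider(text: &str) -> Result<String, String> {
    Ok(format!("say {}", text))
}
```

Collecting the per-tier errors rather than discarding them makes it easier to see why an earlier tier was skipped when the chain ends up on the native voice.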

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Voice pipeline now uses Gemini cloud STT for final transcription when GEMINI_API_KEY is available. SenseVoice still handles streaming (instant feedback), but at speech end the full utterance is re-transcribed by Gemini for higher accuracy (100% JA vs 94.7% local, 96.1% EN vs 95.6%). Fallback chain: Gemini cloud → SenseVoice + LLM correction → SenseVoice raw. Added: - cloud_stt.rs: Gemini generateContent with base64 WAV audio - stt-compare binary: benchmark SenseVoice vs Gemini on test audio Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

After each commit, mic_pending_buf keeps 1s of context. Two inferences later (200ms), the context re-transcribes to the same text → stability fires → immediate duplicate commit. This repeated indefinitely. Fix: track mic_new_samples_since_commit (resets on commit/SpeechEnd). Stability gate now requires mic_new_samples > 16_000 (1s of genuinely new audio) before a stability commit can fire. Also: tail-based stability comparison (last 30 chars) so appending new words at sentence end doesn't reset the stable counter. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- MLX server auto-detected at :8321 as local LLM fallback (Ollama broken on M5)
- Voice overlay toolbar: horizontal bar with ⠿ Move, model info, ✕ Close
- Resize grip ⇲ at bottom-right corner (drag to resize overlay)
- Shift+R/F = horizontal scroll (R/F = vertical, matching AHK)
- Brainstorm: ESC closes dialog, AX selected text captured on event tap thread
- Brainstorm: clipboard save/restore when falling back to Cmd+C
- Brainstorm result overlay non-focusable (doesn't steal focus)
- Speech queue: serial playback, dedup consecutive identical tool calls
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…st-byte
- Fixed crash: removed resize grip setFrame from background thread
- Subtitle shows line-per-commit (no more horizontal scrolling)
- VAD end threshold raised 0.4→0.5 (drops out of speech faster)
- MIC_SEND_THRESHOLD halved 3200→1600 (first-byte ~180ms, was ~280ms)
- STT polishing cascade: MLX local (~160ms) → Gemini cloud → LLM corrector
- Debug logging for speech buffer and subtitle updates
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Complete status table of all implemented features, benchmarks, architecture diagram, and next steps for v2.0/2.1/2.2. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR introduces CapsLockX 2.0 as a Rust rewrite (macOS-first) and adds major AI/voice features (STT, TTS, brainstorm agent), plus new platform adapters, tooling, docs, and CI/release support.

Changes:

Adds new Rust core capabilities: voice pipeline (local+cloud STT, correction), TTS fallback chain, agent/tools, background task manager, and audio capture.
Adds macOS adapter implementation (CGEventTap hook, tray + prefs UI, voice overlay/capture/system audio).
Updates packaging/CI: npm launcher downloads per-platform binaries; builds macOS in CI/release; expands documentation and plans.

Reviewed changes

Copilot reviewed 72 out of 93 changed files in this pull request and generated 1 comment.

File	Description
tmp/capslockx-mac	Adds submodule pointer for macOS adapter snapshot
tmp/brainstorm	Adds submodule pointer for brainstorm snapshot
rs/test-results.txt	Adds recorded manual test results artifact
rs/test-manual.ahk	Adds manual QA helper for Shift+HJKL selection
rs/dev-watch.sh	Adds cargo-watch script for auto rebuild/restart
rs/core/src/tts.rs	Introduces multi-tier TTS fallback implementation
rs/core/src/task_manager.rs	Adds background task manager with timeouts
rs/core/src/stt_corrector.rs	Adds incremental LLM-based STT correction
rs/core/src/state.rs	Extends config with STT/brainstorm/LLM settings
rs/core/src/platform.rs	Extends platform trait (audio, text input, mods)
rs/core/src/modules/voice_player.html	Adds HTML utility to play voice notes with subtitles
rs/core/src/modules/mouse.rs	Updates mouse/scroll phases & adds shift-scroll behavior
rs/core/src/modules/mod.rs	Adds brainstorm + voice modules and wiring
rs/core/src/modules/edit.rs	Adds atomic modifier-aware key tapping for macOS
rs/core/src/local_whisper.rs	Adds local whisper.cpp wrapper with autoscaling
rs/core/src/local_sherpa.rs	Adds local SenseVoice wrapper with auto-download
rs/core/src/lib.rs	Exposes new core modules (agent, STT, TTS, etc.)
rs/core/src/engine.rs	Improves trigger bypass logic + held-key syncing
rs/core/src/cloud_stt.rs	Adds Gemini cloud STT transcription helper
rs/core/src/bin/test-llm.rs	Adds quick LLM client test binary
rs/core/src/bin/test-agent.rs	Adds agent tool test runner binary
rs/core/src/bin/stt-server.rs	Adds persistent local STT server binary
rs/core/src/bin/stt-quick.rs	Adds one-shot STT helper binary
rs/core/src/bin/stt-compare.rs	Adds local vs cloud STT benchmark/compare tool
rs/core/src/bin/stt-bench.rs	Adds SenseVoice vs Whisper benchmark with WER/CER
rs/core/src/bin/sherpa-test.rs	Adds standalone mic capture + SenseVoice test binary
rs/core/src/bin/clx-agent.rs	Adds standalone CLI agent chat tool
rs/core/src/audio_capture.rs	Adds cross-platform mic capture via cpal
rs/core/src/acc_model.rs	Adds option to drive ticks externally (hook thread)
rs/core/Cargo.toml	Adds dependencies for audio/STT/TTS/JS/math engines
rs/adapters/windows/src/output.rs	Updates close_tab to cmd-or-ctrl helper
rs/adapters/windows/src/hook.rs	Drives AccModel ticks via SetTimer on hook thread
rs/adapters/macos/src/tray.rs	Adds macOS tray icon via ObjC FFI
rs/adapters/macos/src/prefs_html.html	Adds macOS preferences UI (WKWebView content)
rs/adapters/macos/src/main.rs	Adds macOS adapter entry point
rs/adapters/macos/src/key_map.rs	Adds macOS keycode mapping
rs/adapters/macos/src/hook.rs	Adds CGEventTap hook w/ pass-through/suppress logic
rs/adapters/macos/src/config_store.rs	Adds persistent config store for macOS
rs/adapters/macos/src/bin/voice-standalone.rs	Adds standalone voice pipeline binary
rs/adapters/macos/src/bin/test-vpio.rs	Adds VoiceProcessingIO AEC test binary
rs/adapters/macos/src/bin/test-cycle.rs	Adds direct window cycling stability test
rs/adapters/macos/build.rs	Links required macOS frameworks
rs/adapters/macos/Cargo.toml	Adds macOS adapter crate
rs/adapters/linux/src/output.rs	Updates close_tab to cmd-or-ctrl helper
rs/adapters/browser/www/vite.config.js	Adds dev server config for browser adapter
rs/adapters/browser/www/sherpa/sherpa-onnx-vad.js	Adds sherpa VAD JS helper
rs/adapters/browser/www/sherpa/.gitignore	Ignores wasm/data model artifacts
rs/adapters/browser/www/package.json	Adds browser adapter web package config
rs/adapters/browser/www/.gitignore	Ignores node_modules
rs/adapters/browser/src/platform.rs	Updates close_tab to cmd-or-ctrl helper
rs/Cargo.toml	Adds macOS adapter to workspace
plan/voice-input/voice-modes.md	Adds voice modes design notes
plan/voice-input/TODO.md	Adds voice implementation plan checklist
plan/voice-input/README.md	Adds voice feature spec
package.json	Switches npm bin to node launcher & includes new artifacts
bin/capslockx.mjs	Adds cross-platform launcher/downloader script
TODO.md	Adds voice feature plan reference + macOS drag issue note
.playwright-cli/page-2026-03-18T17-55-02-733Z.yml	Adds Playwright snapshot artifact
.playwright-cli/page-2026-03-18T17-53-48-281Z.yml	Adds Playwright snapshot artifact
.playwright-cli/page-2026-03-18T17-51-22-570Z.yml	Adds Playwright snapshot artifact
.playwright-cli/page-2026-03-18T17-50-03-292Z.yml	Adds Playwright snapshot artifact
.playwright-cli/page-2026-03-18T17-49-54-727Z.yml	Adds Playwright snapshot artifact
.playwright-cli/page-2026-03-18T17-49-44-295Z.yml	Adds Playwright snapshot artifact
.github/workflows/release-rust.yml	Adds macOS release build + alters artifact matching behavior
.github/workflows/ci-rust.yml	Adds macOS CI build/test/clippy
docs/dev/window-cycle-stability.md	Documents window cycling stability fixes
docs/dev/modifier-bypass.md	Documents modifier+Space bypass design
docs/dev/dual-track-stt.md	Documents dual-track STT architecture
docs/dev/agent-test-matrix.md	Documents agent tool test matrix
docs/Roadmap.md	Rewrites roadmap for Rust v2.0 scope/plans

Files not reviewed (1)

rs/adapters/browser/www/package-lock.json: Language not supported

Comments suppressed due to low confidence (15)

rs/core/src/modules/mouse.rs:1

Horizontal scrolling is applied multiple times: dx scrolls inside both branches and then again unconditionally on line 139. This will double-scroll whenever dx != 0 (and also double-scroll in Shift mode). Remove the unconditional if dx != 0 { p.scroll_h(dx * 3); } or restructure so each axis is emitted exactly once per call.
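One possible shape for the "each axis is emitted exactly once per call" restructuring is sketched below. Only `scroll_h(dx * 3)` comes from the comment above; the `ScrollSink` trait, `emit_scroll`, `Recorder`, the `scroll_v` method, and the assumption that Shift mode redirects the vertical delta onto the horizontal axis are hypothetical stand-ins, not the code in mouse.rs.

```rust
/// Hypothetical stand-in for the platform handle `p` in the review comment.
pub trait ScrollSink {
    fn scroll_h(&mut self, amount: i32);
    fn scroll_v(&mut self, amount: i32);
}

/// Resolves the final per-axis deltas first, then emits each axis at most once.
/// Assumption: in Shift mode the vertical delta is folded into the horizontal axis.
pub fn emit_scroll<P: ScrollSink>(p: &mut P, dx: i32, dy: i32, shift: bool) {
    let (h, v) = if shift { (dx + dy, 0) } else { (dx, dy) };
    if h != 0 {
        p.scroll_h(h * 3);
    }
    if v != 0 {
        p.scroll_v(v * 3);
    }
}

/// Test double that records every emitted event so double-scrolling is visible.
#[derive(Default)]
pub struct Recorder {
    pub h: Vec<i32>,
    pub v: Vec<i32>,
}

impl ScrollSink for Recorder {
    fn scroll_h(&mut self, amount: i32) {
        self.h.push(amount);
    }
    fn scroll_v(&mut self, amount: i32) {
        self.v.push(amount);
    }
}
```

Because the branch only computes deltas and the emission happens in one place afterwards, there is no second path that could scroll the same axis again.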
rs/core/src/platform.rs:1
The implementation always uses LCtrl, but the docstring says it should use Cmd on macOS. This will break common macOS shortcuts (e.g., Cmd+W for close tab). Make the modifier conditional on target_os (use KeyCode::LWin on macOS; KeyCode::LCtrl elsewhere) or provide a platform override for macOS and keep the default consistent with its behavior.
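A minimal sketch of the conditional modifier suggested above, assuming a simplified `KeyCode` enum. `LCtrl` and `LWin` are named in the comment; the `W` variant and the helpers `cmd_or_ctrl_for`, `cmd_or_ctrl`, and `close_tab_chord` are hypothetical and may not match the names in platform.rs.

```rust
/// Simplified stand-in for the project's KeyCode enum (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KeyCode {
    LCtrl,
    LWin, // the comment suggests this variant represents Cmd on macOS
    W,
}

/// Pure helper so both branches can be exercised on any host OS.
pub fn cmd_or_ctrl_for(is_macos: bool) -> KeyCode {
    if is_macos {
        KeyCode::LWin
    } else {
        KeyCode::LCtrl
    }
}

/// Picks the modifier for the OS this binary was compiled for.
pub fn cmd_or_ctrl() -> KeyCode {
    cmd_or_ctrl_for(cfg!(target_os = "macos"))
}

/// Keys to press, in order, for "close tab" (Cmd+W on macOS, Ctrl+W elsewhere).
pub fn close_tab_chord() -> [KeyCode; 2] {
    [cmd_or_ctrl(), KeyCode::W]
}
```

Splitting out `cmd_or_ctrl_for` keeps the `cfg!` check in one place and lets a test cover the macOS branch even when CI runs on another platform.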
rs/core/src/task_manager.rs:1
task_kill sets a flag and updates status, but the running task never observes kill, so nothing is actually stopped. To make kill functional, run tasks cooperatively by passing the Arc<AtomicBool> (or a cancellation token/channel) into func, and ensure the task checks it periodically and exits early when set; alternatively, remove the public kill API and clearly indicate tasks are non-cancellable.
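The cooperative approach described above could look roughly like the following. The names `TaskHandle`, `task_spawn`, `wait`, and `count_until_killed` are illustrative assumptions; only `task_kill`, `func`, and the `Arc<AtomicBool>` flag come from the comment, and the real task manager may track status differently.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Handle to a running task; holds the shared kill flag and the worker thread.
pub struct TaskHandle {
    kill: Arc<AtomicBool>,
    join: JoinHandle<u32>,
}

/// Spawns `func` on a thread and hands it a clone of the kill flag to poll.
pub fn task_spawn<F>(func: F) -> TaskHandle
where
    F: FnOnce(Arc<AtomicBool>) -> u32 + Send + 'static,
{
    let kill = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&kill);
    let join = thread::spawn(move || func(flag));
    TaskHandle { kill, join }
}

impl TaskHandle {
    /// Requests cancellation; the task stops at its next flag check.
    pub fn task_kill(&self) {
        self.kill.store(true, Ordering::SeqCst);
    }

    /// Blocks until the task has exited and returns its result.
    pub fn wait(self) -> u32 {
        self.join.join().expect("task panicked")
    }
}

/// Example task body: does small units of work and checks the flag between them.
pub fn count_until_killed(kill: Arc<AtomicBool>) -> u32 {
    let mut steps = 0u32;
    while !kill.load(Ordering::SeqCst) {
        steps = steps.wrapping_add(1);
        thread::sleep(Duration::from_millis(1));
    }
    steps
}
```

With this shape, `task_spawn(count_until_killed)` followed by `task_kill()` and `wait()` returns promptly, whereas a task body that never reads the flag would keep running regardless of the status the manager reports.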
rs/adapters/browser/www/vite.config.js:1
The dev server binds to all interfaces (0.0.0.0) and requires TLS keys from hard-coded /tmp paths. This is easy to misconfigure and can unintentionally expose a dev server on a LAN. Prefer defaulting host to 127.0.0.1 and sourcing key/cert paths (or HTTPS enablement) from environment variables with safe fallbacks.
rs/core/src/local_whisper.rs:1
The module-level docs say upgrade happens when inference is >1.25x realtime, but UPGRADE_THRESHOLD is 5.0 (and the code uses a 'budget ratio' derived from wall-available time). Update the documentation to match the actual scaling heuristic to avoid misleading tuning/expectations.
rs/core/src/tts.rs:1
Invalid base64 characters are silently treated as 0 via unwrap_or(0), which can corrupt audio output without surfacing an error. Return an error when a character is not found in the alphabet (except = padding), or use a well-tested base64 crate to decode and validate input.
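As a sketch of the first option (returning an error instead of substituting 0), a strict decoder could be written as below. `decode_base64_strict` and `InvalidBase64` are hypothetical names, the standard base64 alphabet is assumed, and this version stops at the first `=` without validating padding length or trailing data; a well-tested base64 crate remains the more robust choice.

```rust
/// Standard base64 alphabet (assumed to be the one tts.rs decodes).
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Reports where decoding failed and which byte was rejected.
#[derive(Debug, PartialEq, Eq)]
pub struct InvalidBase64 {
    pub index: usize,
    pub byte: u8,
}

/// Decodes base64, failing on any byte outside the alphabet instead of
/// treating it as 0. Decoding stops at the first '=' padding byte.
pub fn decode_base64_strict(input: &str) -> Result<Vec<u8>, InvalidBase64> {
    let mut out = Vec::with_capacity(input.len() / 4 * 3);
    let mut acc: u32 = 0; // bit accumulator
    let mut bits: u32 = 0; // number of valid bits currently in `acc`
    for (index, &byte) in input.as_bytes().iter().enumerate() {
        if byte == b'=' {
            break;
        }
        let value = ALPHABET
            .iter()
            .position(|&c| c == byte)
            .ok_or(InvalidBase64 { index, byte })? as u32;
        acc = (acc << 6) | value;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
            acc &= (1 << bits) - 1; // keep only the bits not yet emitted
        }
    }
    Ok(out)
}
```

The caller can then surface the error (or fall through to the next TTS tier) rather than playing corrupted audio.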
rs/core/src/tts.rs:1
Audio is written to fixed filenames in /tmp. Concurrent or overlapping speak() calls can race and overwrite each other's output, producing wrong audio playback. Use unique temp paths (e.g., include PID + timestamp/random suffix) and consider deleting the temp file after playback completes.
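One way to build the unique temp paths mentioned above, using only the standard library. The helper `unique_audio_path`, the `capslockx-tts-` file prefix, and the `TempAudio` cleanup guard are illustrative assumptions, not names from tts.rs.

```rust
use std::path::PathBuf;
use std::process;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

/// Per-process counter so two calls in the same nanosecond still differ.
static NEXT_ID: AtomicU64 = AtomicU64::new(0);

/// Builds a path in the system temp dir from PID + timestamp + counter.
pub fn unique_audio_path(ext: &str) -> PathBuf {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0);
    let seq = NEXT_ID.fetch_add(1, Ordering::Relaxed);
    std::env::temp_dir().join(format!(
        "capslockx-tts-{}-{}-{}.{}",
        process::id(),
        nanos,
        seq,
        ext
    ))
}

/// Guard that deletes the file when dropped, e.g. after playback completes.
pub struct TempAudio(pub PathBuf);

impl Drop for TempAudio {
    fn drop(&mut self) {
        // Best effort: the file may already be gone.
        let _ = std::fs::remove_file(&self.0);
    }
}
```

Each `speak()` call would then own its own `TempAudio`, so overlapping calls cannot overwrite each other's output and the file is removed once the guard goes out of scope.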
rs/core/src/platform.rs:1
The default type_text silently drops many common characters (e.g., '-', '=', ',', '/', ';', '''), which can corrupt typed STT output or user prompts. Either extend the mapping to cover a complete US-ANSI punctuation set, or change the fallback behavior to something lossless (e.g., clipboard paste) when encountering an unmapped character.
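The lossless-fallback option could be sketched as a planning step that decides up front whether the whole string can be typed. Everything here is hypothetical: `TypePlan`, `plan_type_text`, and especially `is_mapped`, whose rule (ASCII alphanumerics and space) is only a placeholder for whatever key mapping the default `type_text` actually has.

```rust
/// How a piece of text should be delivered to the focused application.
#[derive(Debug, PartialEq, Eq)]
pub enum TypePlan {
    /// Every character has a key mapping; send them as keystrokes.
    Keystrokes(Vec<char>),
    /// At least one character is unmapped; paste the whole text instead.
    ClipboardPaste(String),
}

/// Placeholder for the real key-mapping lookup (assumption).
pub fn is_mapped(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == ' '
}

/// Chooses keystrokes only when nothing would be dropped.
pub fn plan_type_text(text: &str) -> TypePlan {
    if text.chars().all(is_mapped) {
        TypePlan::Keystrokes(text.chars().collect())
    } else {
        TypePlan::ClipboardPaste(text.to_owned())
    }
}
```

Deciding per string rather than per character avoids interleaving keystrokes and pastes, at the cost of touching the clipboard whenever a single unmapped character appears.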
rs/dev-watch.sh:1
pkill -f "target/release/capslockx" can match and kill unrelated processes whose command lines happen to contain that substring. Consider using a PID file, pkill -x with an exact process name when possible, or filtering by the full path to the spawned binary to avoid collateral termination.
rs/test-results.txt:1
The PR description marks the test plan items as completed, but the committed rs/test-results.txt shows failing cases (5/8 passed). If this file is meant as an authoritative test artifact, it should reflect a passing run (or be excluded from the PR) to avoid conflicting signals about readiness.

Copilot · 2026-03-21T11:37:35Z

+      # Commit binary to main so npm package includes it.
+      - name: Commit macOS binary to main
+        run: |
+          git config user.name "github-actions[bot]"
+          git config user.email "github-actions[bot]@users.noreply.github.com"
+          git pull origin main --rebase
+          git add capslockx-macos-arm64
+          git diff --cached --quiet || git commit -m "chore: update macOS binary for ${{ github.ref_name }}"
+          git push origin main
+


Pushing release-built binaries back to main from a release workflow is risky (mutates default branch, can create unexpected CI loops, merge conflicts, and provenance issues). Prefer attaching artifacts only to GitHub Releases and publishing npm packages from the release artifacts (or a dedicated distribution branch), rather than committing binaries into source control.

Suggested change

      # Commit binary to main so npm package includes it.
      - name: Commit macOS binary to main
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git pull origin main --rebase
          git add capslockx-macos-arm64
          git diff --cached --quiet || git commit -m "chore: update macOS binary for ${{ github.ref_name }}"
          git push origin main

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

snomiao and others added 30 commits March 14, 2026 20:05

'stage'

9df947d

snomiao and others added 24 commits March 21, 2026 00:58

swap: Space+V = dual capture (default), Shift+Space+V = mic only

9aaebb5

Also fix subtitle overlay: split lines on \n before parsing emoji tags, add transparent-bg newline between lines for proper NSTextField rendering. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

swap back: Space+V = mic only (default), Shift+Space+V = dual capture

587717e

Mic-only is faster (no AEC/NLMS/system audio overhead). Dual capture with echo cancellation only when explicitly requested. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chore: tidy repo — remove voice2.html, gitignore node_modules

e5b9fab

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

docs: update roadmap for CapsLockX 2.0 (Rust rewrite)

8beaeb9

Complete status table of all implemented features, benchmarks, architecture diagram, and next steps for v2.0/2.1/2.2. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings March 21, 2026 11:35

Copilot AI reviewed Mar 21, 2026

snomiao enabled auto-merge March 21, 2026 11:42

merge: reconcile main into beta (keep beta's rebased state)

67a6aff

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

snomiao merged commit 1249420 into main Mar 21, 2026
3 of 6 checks passed

snomiao deleted the beta branch March 21, 2026 11:43

snomiao merged 78 commits into main from beta
2 participants