- Add brainstorm_voice() with MCI waveaudio recording (V to record, V to send, ESC to cancel)
- Auto-send after 60s timeout to prevent super long recordings
- Show default mic device name in tooltip while recording
- Attach recorded WAV as audio field in form data (same pattern as image)
- Add "both" capture mode (clipboard text + window screenshot)
- Refactor: merge brainstorm_prompt/brainstorm_quick_capture into unified brainstorm_ask()
- Extract brainstorm_heatup(), remove dead/commented-out code
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the CapsLockX macOS platform adapter with:
- CGEventTap hook for intercepting keyboard events at HID level
- Raw FFI callback to properly suppress events (returns NULL)
- Works around core-graphics crate bug where None still passes events
- CGEventPost output for injecting keyboard, mouse, and scroll events
- Self-injection detection via EVENT_SOURCE_USER_DATA tagging
- Full macOS virtual keycode ↔ KeyCode bidirectional mapping
- FlagsChanged event handling for modifier key press/release detection
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mouse cursor now stops at screen edges (union of all display bounds) instead of disappearing off-screen. Also adjusts macOS scroll speed to account for LINE scroll units being much larger than on Windows. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
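The edge clamping described above can be sketched as follows. This is a minimal illustration, not the adapter's code: `Bounds`, `union_bounds`, and `clamp_cursor` are hypothetical names, and the sketch assumes every display is at least one pixel wide and tall.

```rust
/// Axis-aligned display rectangle in global screen coordinates (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Bounds {
    pub min_x: f64,
    pub min_y: f64,
    pub max_x: f64,
    pub max_y: f64,
}

/// Bounding box that covers every display, or None if the list is empty.
pub fn union_bounds(displays: &[Bounds]) -> Option<Bounds> {
    let first = *displays.first()?;
    Some(displays.iter().skip(1).fold(first, |acc, d| Bounds {
        min_x: acc.min_x.min(d.min_x),
        min_y: acc.min_y.min(d.min_y),
        max_x: acc.max_x.max(d.max_x),
        max_y: acc.max_y.max(d.max_y),
    }))
}

/// Keep the cursor inside the union; the last addressable pixel is max - 1.
pub fn clamp_cursor(x: f64, y: f64, b: Bounds) -> (f64, f64) {
    (
        x.clamp(b.min_x, b.max_x - 1.0),
        y.clamp(b.min_y, b.max_y - 1.0),
    )
}
```

A bounding box is the simplest reading of "union of all display bounds"; with displays of different sizes it still contains regions that belong to no display, which a stricter implementation would need to handle.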
- Switch scroll from LINE to PIXEL units for smooth 1px-granularity trackpad-like scrolling - Implement real window cycling (Space+Z) using CGWindowListCopyWindowInfo + NSRunningApplication instead of Cmd+Tab — directly activates the next/prev app window like Alt+Tab on Windows - Add arrange_windows: Ctrl+Cmd+F for fullscreen, Ctrl+Up for Mission Control - Reset scroll speed to default 720 (appropriate for pixel units) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…elds) macOS disables CGEventTaps during secure input (password dialogs, FileVault, etc.). The tap was staying disabled afterwards, making CapsLockX unresponsive. Now detects TapDisabledByTimeout and TapDisabledByUserInput events and automatically re-enables the tap. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Window cycling (Space+Z) now cycles individual windows (not just apps) using Accessibility API (AXUIElement) to raise specific windows
- Window tiling (Space+C) uses AX API to set position/size within the visible work area (excluding menu bar and Dock):
  - Plain: cascade with 48px offset
  - Shift: sqrt-based grid layout
- Both cycle and arrange use stable (pid, title) ordering
- Ctrl+Space+N/P sends Ctrl+Tab / Ctrl+Shift+Tab (switch browser tabs) via new key_tap_ctrl_shifted platform method with proper CGEvent flags
- Cmd+Space now passes through to macOS Spotlight (Win+Space bypasses on Windows too for input language switcher)
- Added drag bug to TODO.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
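The two tiling modes could look roughly like this. The commit does not show the actual geometry, so everything here is an assumption: `Rect`, `grid_layout`, and `cascade_layout` are hypothetical, the grid uses `ceil(sqrt(n))` columns, and the cascade shrinks each window so the last one still fits in the work area.

```rust
/// Window frame inside the visible work area (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x: f64,
    pub y: f64,
    pub w: f64,
    pub h: f64,
}

/// Sqrt-based grid: ceil(sqrt(n)) columns, as many rows as needed.
pub fn grid_layout(n: usize, area: Rect) -> Vec<Rect> {
    if n == 0 {
        return Vec::new();
    }
    let cols = (n as f64).sqrt().ceil() as usize;
    let rows = (n + cols - 1) / cols;
    let (cw, ch) = (area.w / cols as f64, area.h / rows as f64);
    (0..n)
        .map(|i| Rect {
            x: area.x + (i % cols) as f64 * cw,
            y: area.y + (i / cols) as f64 * ch,
            w: cw,
            h: ch,
        })
        .collect()
}

/// Cascade: each window is shifted by `offset` pixels down and to the right.
pub fn cascade_layout(n: usize, area: Rect, offset: f64) -> Vec<Rect> {
    let shrink = offset * n.saturating_sub(1) as f64;
    (0..n)
        .map(|i| Rect {
            x: area.x + i as f64 * offset,
            y: area.y + i as f64 * offset,
            w: (area.w - shrink).max(0.0),
            h: (area.h - shrink).max(0.0),
        })
        .collect()
}
```

With five windows the grid is 3 columns by 2 rows, leaving the last cell empty; calling `cascade_layout(n, area, 48.0)` reproduces the 48px offset mentioned in the commit.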
- ci-rust.yml: add build-macos job (macos-latest, aarch64-apple-darwin) with check, clippy, test, and release build - release-rust.yml: add build-macos job that builds capslockx-macos and uploads capslockx-macos-arm64 binary to GitHub Release Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- bin/capslockx.mjs: detects OS/arch and runs the correct Rust binary (Windows, macOS arm64/x64, Linux x64) - Looks for binary in: repo root, local cargo build, then auto-downloads from latest GitHub Release as fallback - package.json: bin field points to the cross-platform launcher - package.json: files field includes all platform binaries - release-rust.yml: commit macOS binary to main for npm inclusion Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The AHK launcher is not the Rust binary — falling back to it would run the wrong thing. If clx-rust.exe isn't found locally, the auto-download from GitHub Releases handles it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously the bypass path suppressed the original event and re-injected Space via key_tap, which stripped modifier flags — macOS saw naked Space instead of Cmd+Space. Now the bypass returns PassThrough so the original event reaches the OS with all modifier flags intact. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- White X icon when inactive, blue X when CapsLockX mode is active - Menu with "Quit CapsLockX" item - NSApplication initialized as Accessory (no dock icon) - Icon updates dispatched to main queue for AppKit thread safety - Uses raw Objective-C FFI, no additional dependencies Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Preferences UI with same Catppuccin Mocha theme as Windows - 5 trigger key checkboxes + 3 speed sliders - WKWebView with JS bridge (webkit.messageHandlers) instead of Tauri - Custom ObjC classes created at runtime for WKScriptMessageHandler and menu item action target - Config persisted to ~/.config/CapsLockX/config.json - Tray menu: "Preferences…" (Cmd+,) + separator + "Quit CapsLockX" - ~300 lines of raw ObjC FFI, no framework dependency, ~0 binary overhead Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three bugs prevented Cmd+Space from working:
1. Fast key combos caused Cmd's FlagsChanged-up to arrive before Space-down, removing LWin from held_keys. Fix: sync CGEvent modifier flags into held_keys on every KeyDown event.
2. Space key-up was always suppressed for trigger keys, so macOS never saw the complete down+up cycle. Fix: track bypass state with trigger_bypassed AtomicBool and pass through both events.
3. FlagsChanged events for modifiers could be suppressed, preventing macOS from tracking modifier state. Fix: always pass through FlagsChanged on macOS.
Added docs/dev/modifier-bypass.md documenting the full solution.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Modifier keys (Shift, Ctrl, Option, Cmd) are now passed through as-is to the output key — no platform-specific word_modifier/doc_modifier mapping needed. Users press their native modifier combos. - Removed word_modifier()/doc_modifier() from Platform trait - Renamed AccModel phase strings from Chinese to English: 横中键→H_MIDKEY, 纵中键→V_MIDKEY, 移动→MOVE, 止动→STOP Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Design doc and TODO for the voice input feature: - Toggle mode (click V) and hold mode (hold V) - VAD-based audio segmentation with 25s max chunks - 3-stage transcription pipeline: local→server Whisper→LLM typo-fix - Each stage replaces previous text in-place at cursor - Architecture: cpal audio + webrtc-vad + HTTP to brainstorm server See plan/voice-input/README.md for full architecture and plan/voice-input/TODO.md for implementation checklist. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Client (Rust):
- VoiceModule: V key state machine (toggle click / hold release)
Wired into Modules dispatcher for on_key_down/on_key_up/stop_all
- AudioCapture: cpal-based cross-platform mic capture (16kHz mono f32)
with ring buffer, start/stop/take_samples API
Server (brainstorm):
- POST /api/voice-transcribe: streaming NDJSON endpoint
Stage 1: Whisper transcription → {stage:"transcribed"}
Stage 2: gpt-4o-mini typo-fix → {stage:"polished", is_final:true}
CORS enabled for cross-origin Rust client
Next: wire audio capture into VoiceModule, add VAD, HTTP client
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full voice pipeline wired together:
- Energy-based VAD: 20ms frames, RMS threshold, 500ms silence = chunk end
- Force-splits at 25s for Whisper's 30s limit
- WAV encoder (16-bit PCM mono) for server upload
- HTTP POST via ureq to brainstorm voice-transcribe endpoint
- Parses streaming NDJSON response, types final text at cursor
- AudioCapture created on background thread (cpal Stream is !Send)
- Platform::type_text() default impl maps ASCII to key_tap calls
- Server URL configurable via CLX_VOICE_SERVER env var
Usage: hold Space+V to dictate, release to send. Or tap to toggle.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
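As an illustration of the energy-based VAD bullet, here is a minimal sketch that assumes 16kHz mono input. `EnergyVad`, `VadEvent`, and the threshold value are hypothetical, and the 25s force-split is left out.

```rust
pub const SAMPLE_RATE: usize = 16_000;
/// 20 ms of audio at 16 kHz.
pub const FRAME_LEN: usize = SAMPLE_RATE / 50;
/// 500 ms of silence, counted in 20 ms frames.
pub const SILENCE_FRAMES_TO_END: usize = 25;

/// Root-mean-square energy of one frame.
pub fn rms(frame: &[f32]) -> f32 {
    if frame.is_empty() {
        return 0.0;
    }
    (frame.iter().map(|s| s * s).sum::<f32>() / frame.len() as f32).sqrt()
}

#[derive(Debug, PartialEq)]
pub enum VadEvent {
    None,
    SpeechStart,
    ChunkEnd,
}

pub struct EnergyVad {
    threshold: f32,
    in_speech: bool,
    silent_frames: usize,
}

impl EnergyVad {
    pub fn new(threshold: f32) -> Self {
        Self { threshold, in_speech: false, silent_frames: 0 }
    }

    /// Feed one 20 ms frame; reports when speech starts and when a chunk ends.
    pub fn push_frame(&mut self, frame: &[f32]) -> VadEvent {
        if rms(frame) >= self.threshold {
            self.silent_frames = 0;
            if !self.in_speech {
                self.in_speech = true;
                return VadEvent::SpeechStart;
            }
        } else if self.in_speech {
            self.silent_frames += 1;
            if self.silent_frames >= SILENCE_FRAMES_TO_END {
                self.in_speech = false;
                self.silent_frames = 0;
                return VadEvent::ChunkEnd;
            }
        }
        VadEvent::None
    }
}
```

A caller would buffer samples between `SpeechStart` and `ChunkEnd` and hand the chunk to the WAV encoder.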
Two-phase transcription pipeline:
1. Local Whisper (whisper.cpp via whisper-rs): instant rough draft typed at cursor (~200ms on Apple Silicon)
2. Server Whisper + LLM: polished text replaces rough via Backspace
- Model: ggml-tiny.en.bin at ~/.cache/capslockx/ (~75MB)
- Download instructions are printed if the model is missing
- Graceful fallback to server-only if model unavailable
- Metal GPU acceleration auto-detected on macOS
- Platform::type_text() for typing transcription at cursor
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Resample 48kHz mic audio to 16kHz before Whisper (fixes hallucinations) - Skip chunks shorter than 1 second - Preload Whisper model at startup (first Space+V is instant) - Persistent bg thread (model stays loaded between sessions) - Server: skip empty transcriptions, install nodemailer dep - macOS: key_tap_with_mods embeds modifier flags on CGEvent atomically (fixes Shift+HJKL text selection) - Quiet verbose logs (window snapshots, VAD events, audio capture) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
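For the 48kHz to 16kHz step, the crudest possible resampler averages each group of three input samples. This is only a sketch of the idea: `downsample_48k_to_16k` is a hypothetical name, and the commit does not say which filter the real code uses (a proper anti-aliasing low-pass would be better than this box average).

```rust
/// Downsample 48 kHz mono audio to 16 kHz by averaging every 3 samples.
/// A trailing partial group (1 or 2 samples) is dropped.
pub fn downsample_48k_to_16k(input: &[f32]) -> Vec<f32> {
    input
        .chunks_exact(3)
        .map(|c| (c[0] + c[1] + c[2]) / 3.0)
        .collect()
}
```

One second of 48kHz input (48,000 samples) yields 16,000 output samples, which is the rate Whisper expects.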
- macOS type_text: CGEventKeyboardSetUnicodeString for full Unicode (Chinese, Japanese, emoji — not just ASCII) - Whisper auto-scaling: budget-based (wall clock vs inference time), non-blocking background model loading (old model keeps working), persists tier across restarts via whisper-tier.txt - Noise filter: skip bracketed annotations, hallucinations, <3 chars - Voice server URL reads from ~/.config/CapsLockX/config.json - VAD events logged again (speech start/end) - rs/dev-watch.sh: cargo-watch auto-rebuild + restart on file change Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace RMS energy VAD with TEN VAD neural network (308KB ONNX model) — reliably distinguishes human speech from keyboard clicks, music, ambient noise. 10ms frames, 0.16ms inference on M-series. - Native Core Graphics waveform overlay: transparent floating window at bottom-center, green when speaking, gray when silent. Custom NSView with drawRect via raw ObjC FFI, ~20fps. - Resample 48kHz→16kHz before VAD (not after) for consistency. - Platform trait: show/hide/update_voice_overlay methods. - Whisper auto-scaling: non-blocking background model loading, budget-based (wall clock), persists tier across restarts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace RMS energy VAD with TEN VAD neural network (308KB model) Distinguishes human speech from keyboard clicks, music, ambient noise - Space+V = mic only (as before) - Space+Shift+V = mic + system audio (ScreenCaptureKit) Captures meeting audio, YouTube, etc. mixed with your voice - SystemAudioStream trait + MacPlatform override - ScreenCaptureKit stub with CMSampleBuffer handler ready (full async SCShareableContent flow TODO) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of buffering all continuous speech until silence (up to 25s), emit partial chunks every 3 seconds so text appears incrementally as the user speaks — like an input method editor. Also removed corrupted ggml-small.bin (incomplete download). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Emit partial transcription every 1s of continuous speech instead of 3s. Use rolling buffer with VAD speech detection — transcribe on fixed interval while speech is active, flush remainder on speech end. Base model: 1s audio → ~180ms = 1.2s total latency Small model: 1s audio → ~575ms = 1.6s total latency Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of typing the full transcription each time, compute the diff between what was already typed and the new transcription:
- Find common prefix
- Backspace the diverging suffix
- Type only the new suffix
This makes continuous speech feel like an IME: text flows in incrementally without repeating. Server polishes the full utterance on speech end by replacing all typed text.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
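The three steps above can be sketched in a few lines. `TypingEdit` and `diff_typed` are hypothetical names; the sketch counts Unicode scalar values, on the assumption that one Backspace deletes one `char`, which does not hold for multi-codepoint grapheme clusters.

```rust
/// Keystrokes needed to turn the already-typed text into the new transcription.
#[derive(Debug, PartialEq)]
pub struct TypingEdit {
    /// Number of Backspace presses (counted in chars, not bytes).
    pub backspaces: usize,
    /// Text to type after the backspaces.
    pub insert: String,
}

pub fn diff_typed(typed: &str, new_text: &str) -> TypingEdit {
    // Length of the common prefix, in chars.
    let common = typed
        .chars()
        .zip(new_text.chars())
        .take_while(|(a, b)| a == b)
        .count();
    TypingEdit {
        // Everything typed after the common prefix must be erased.
        backspaces: typed.chars().count() - common,
        // Only the part of the new text after the common prefix is typed.
        insert: new_text.chars().skip(common).collect(),
    }
}
```

When Whisper only appends words, `backspaces` is 0 and just the new suffix is typed; a revision near the end of the sentence costs a few backspaces rather than a full retype.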
- Transcribe every 0.5s of new speech (was 1s) — Whisper inference is ~constant time regardless of audio length, so faster intervals are essentially free - Server polishing now runs in a background thread so it doesn't block the streaming transcription loop - Forced base model for lower latency (~450ms vs ~1300ms small) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of re-transcribing the entire growing buffer (causing massive diffs when Whisper changes its mind about old text), use a committed prefix approach: - Only the last ~5s of audio (pending buffer) gets re-transcribed - After 5s, text is frozen/committed and never changed - Diffs are small and local — no more -400 char backspace storms - Session log at /tmp/clx-voice.log shows full text state Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Upgrade threshold 1.25x → 5x (prevent unnecessary model switches) - Samples before scaling 3 → 10 (more evidence needed) - Commit window 5s → 3s (freeze text faster, less Whisper instability) - Force base model for consistent low-latency streaming Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of committing at a fixed 3s interval (which causes large diffs right before commit), wait until the transcription is stable: same text for 2 consecutive inference cycles. This means:
- Commits at natural phrase boundaries where Whisper has settled
- No more large pre-commit backspace storms
- Force-commit at 5s as safety net for very long utterances
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
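A stability-based commit rule like the one described could be modeled as a small state machine. `StabilityCommitter` and its fields are hypothetical, and the sketch assumes the caller reports how many seconds of audio each inference cycle added.

```rust
pub const STABLE_CYCLES_TO_COMMIT: u32 = 2;
pub const FORCE_COMMIT_SECS: f32 = 5.0;

pub struct StabilityCommitter {
    last_text: String,
    stable_cycles: u32,
    pending_secs: f32,
}

impl StabilityCommitter {
    pub fn new() -> Self {
        Self { last_text: String::new(), stable_cycles: 0, pending_secs: 0.0 }
    }

    /// Feed the transcription from one inference cycle.
    /// Returns Some(text) when the pending text should be committed (frozen).
    pub fn observe(&mut self, text: &str, cycle_secs: f32) -> Option<String> {
        self.pending_secs += cycle_secs;
        if !text.is_empty() && text == self.last_text {
            self.stable_cycles += 1;
        } else {
            // Text changed: this is the first cycle that produced it.
            self.stable_cycles = 1;
            self.last_text = text.to_string();
        }
        let stable = !text.is_empty() && self.stable_cycles >= STABLE_CYCLES_TO_COMMIT;
        let forced = self.pending_secs >= FORCE_COMMIT_SECS;
        if stable || forced {
            let out = std::mem::take(&mut self.last_text);
            self.stable_cycles = 0;
            self.pending_secs = 0.0;
            return if out.is_empty() { None } else { Some(out) };
        }
        None
    }
}
```

The forced path is the 5s safety net: if Whisper keeps revising its output, the latest text is committed anyway once enough audio has accumulated.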
VoiceProcessingIO ducks (lowers volume of) other system audio by default. Added kAUVoiceIOProperty_OtherAudioDuckingConfiguration (property 2108) with minimum ducking level. Speakers should now stay audible during echo-cancelled mic capture. Re-enabled VoiceProcessingIO for Shift+V (system audio capture mode). Normal Space+V still uses cpal (no ducking at all). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Speakers audible (ducking minimized) ✓ - VoiceProcessingIO starts successfully ✓ - Format query fails (-10877) — assuming 48kHz mono f32 - AEC not effective yet — both tracks still show same content - TODO: fix format to get actual echo-cancelled audio from VPIO Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: VPIO on macOS reports 9ch non-interleaved format but AudioUnitRender with 1-buffer mono succeeds and returns very quiet echo-cancelled audio (rms 0.001-0.036 vs normal 0.05-0.3). The AEC IS working — speaker bleed is cancelled. The signal is just extremely quiet. Applied 30x gain amplification to bring levels back to normal for VAD/Whisper processing. Also added test-vpio standalone binary for debugging VPIO independently. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
test-vpio + Whisper test with English YouTube on speakers and Japanese audio into mic shows 7/8 transcriptions as Japanese. English speaker bleed is successfully cancelled. VPIO (30x gain) → Whisper correctly separates: 🎤 VPIO: "はい、オートです。" (Japanese from mic) 🎤 VPIO: "さぁ、ちょっと、開いたいです。" (Japanese from mic) vs speakers playing English YouTube → cancelled Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ary) The 9-buffer non-interleaved approach caused render status=-50. Reverted to simple 1-buffer mono render which the test binary proved works correctly. AEC effectiveness: 91% (10/11 non-English, 1 leak). Standalone test confirmed: Japanese mic audio separated from English speakers with VoiceProcessingIO + 30x gain. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously truncated the combined string, cutting off the 🎤 line and emoji when text was long. Now each line (mic/sys) is truncated independently at 80 chars, preserving both emoji labels. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Also fix subtitle overlay: split lines on \n before parsing emoji tags, add transparent-bg newline between lines for proper NSTextField rendering. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pure Rust NLMS (Normalized Least Mean Squares) adaptive filter: - 4800 taps (300ms at 16kHz) learns speaker→mic acoustic path - Subtracts predicted echo from mic signal - Cross-platform: works on macOS/Windows/Linux - Stacks with VoiceProcessingIO: VPIO removes ~91%, NLMS cleans rest - Also added noise gate (0.002 threshold) in VPIO voice_capture Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
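For readers unfamiliar with NLMS, a minimal per-sample version is sketched below. It is not the filter from this commit: `Nlms`, the step size `mu`, and the regularization constant are illustrative assumptions, and the real code would process blocks with 4800 taps rather than the tiny filter a test can afford.

```rust
/// Normalized least-mean-squares echo canceller (illustrative sketch).
pub struct Nlms {
    weights: Vec<f32>, // estimated speaker-to-mic impulse response
    history: Vec<f32>, // ring buffer of recent reference (speaker) samples
    pos: usize,        // index of the newest sample in `history`
    mu: f32,           // step size, typically in (0, 1]
    eps: f32,          // avoids division by zero on silent input
}

impl Nlms {
    pub fn new(taps: usize, mu: f32) -> Self {
        assert!(taps > 0, "filter needs at least one tap");
        Self { weights: vec![0.0; taps], history: vec![0.0; taps], pos: 0, mu, eps: 1e-6 }
    }

    /// Takes one speaker sample and one mic sample; returns the mic sample
    /// with the predicted echo subtracted.
    pub fn process(&mut self, reference: f32, mic: f32) -> f32 {
        let n = self.weights.len();
        self.history[self.pos] = reference;
        // Predicted echo: sum over k of w[k] * x[t - k], plus input energy.
        let (mut echo, mut energy) = (0.0f32, 0.0f32);
        for k in 0..n {
            let x = self.history[(self.pos + n - k) % n];
            echo += self.weights[k] * x;
            energy += x * x;
        }
        let err = mic - echo;
        // Normalized gradient step toward the true acoustic path.
        let step = self.mu * err / (energy + self.eps);
        for k in 0..n {
            let x = self.history[(self.pos + n - k) % n];
            self.weights[k] += step * x;
        }
        self.pos = (self.pos + 1) % n;
        err
    }
}
```

Each call costs two passes over the taps, which is why a shorter filter translates directly into lower CPU use.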
Moved gain+noise gate from voice_capture.rs callback to voice.rs AFTER the NLMS filter. NLMS now operates on raw VPIO signal (pre-amplification) where echo residual is tiny. After NLMS cleans it, 40x gain amplifies clean voice only. Result: 0 English leaks in both standalone test and CapsLockX session. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
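The post-NLMS stage might look like the sketch below. The order of operations is an assumption (gate on the raw level first, then amplify), as are the function name `gate_then_gain` and the clamp to the valid f32 sample range.

```rust
/// Zero out samples below the gate threshold, amplify the rest, and clamp
/// to [-1.0, 1.0] so the gain cannot push samples out of range.
pub fn gate_then_gain(samples: &mut [f32], gate: f32, gain: f32) {
    for s in samples.iter_mut() {
        *s = if s.abs() < gate {
            0.0
        } else {
            (*s * gain).clamp(-1.0, 1.0)
        };
    }
}
```

With the values from the commits this would be called as `gate_then_gain(&mut buf, 0.002, 40.0)` on the filter's output.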
Truncation now keeps first 2 chars (emoji + space) as prefix, then "..." + last 74 chars. Previously took last 77 chars from the end, cutting off the emoji label. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
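The truncation rule can be sketched as below, assuming the 80-character per-line limit from the earlier commit and a label that is exactly one `char` plus a space. `truncate_subtitle_line` is a hypothetical name; working on `char`s instead of bytes also avoids slicing through multi-byte UTF-8 sequences.

```rust
pub const MAX_CHARS: usize = 80;
pub const PREFIX_CHARS: usize = 2; // emoji label + space
pub const TAIL_CHARS: usize = 74;

/// Keep the emoji label, then "..." and the most recent 74 chars.
pub fn truncate_subtitle_line(line: &str) -> String {
    let chars: Vec<char> = line.chars().collect();
    if chars.len() <= MAX_CHARS {
        return line.to_string();
    }
    let prefix: String = chars[..PREFIX_CHARS].iter().collect();
    let tail: String = chars[chars.len() - TAIL_CHARS..].iter().collect();
    format!("{}...{}", prefix, tail)
}
```

The result is 2 + 3 + 74 = 79 chars, just under the limit. Emoji written with variation selectors or ZWJ sequences span several `char`s and would need a grapheme-aware prefix.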
Bottleneck: two Whisper instances running sequentially (~400ms/cycle). Fixes: - Sys Whisper now runs 3x less often (larger streaming interval) - Mic Whisper gets priority for faster response - NLMS filter reduced 4800→1600 taps (3x less CPU, 100ms still enough) Net effect: mic transcription ~200ms/cycle instead of ~400ms. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mic-only is faster (no AEC/NLMS/system audio overhead). Dual capture with echo cancellation only when explicitly requested. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously used stale 2s snapshot, causing index drift when user manually clicked a window between Z presses. Now: - Always takes fresh window snapshot (handles open/close) - Detects frontmost app via NSWorkspace.frontmostApplication - Starts cycling from the actually-focused window Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…esses - Removed (pid, title) sort — keep natural z-order from CGWindowList (front-to-back, most recently used first) - Fresh snapshot on first press or after 2s pause - Reuse snapshot during rapid cycling (prevents ping-pong) - Index 0 = frontmost, so first Z goes to 2nd most recent window Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ling Major features: - SenseVoice (sherpa-onnx) as default STT engine with Whisper fallback - LLM-based STT error correction via Gemini/OpenAI/Anthropic - Brainstorm agent (Space+B) with web_search, fetch_url, js_eval (Boa), math_eval (Woxi) - Non-modal brainstorm prompt panel (doesn't block voice overlay) - Voice overlay: auto-resize, drag handle on hover, hidden from screen share - Window cycling: CGWindowID-based stable ordering, frontmost detection - Browser voice: SenseVoice WASM with server-mode streaming - STT benchmark suite (SenseVoice vs Whisper tiny/base/small/medium/large-v3) - Dual-track STT worker architecture (non-blocking audio loop) - Preferences UI: STT engine, LLM API key, model selection - Config persistence across restarts (overlay positions, settings) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add voice-standalone binary: tests Space+V pipeline without full CLX (VoiceProcessingIO AEC mic + ScreenCaptureKit sys audio + overlay) - Fix NSSize/NSRect ABI mismatch in voice_overlay auto_resize (ARM64) - Fix byte-slice panics on Chinese UTF-8 in subtitle debug logs - Fix sys track subtitle never updating (add sys_subtitle_dirty path) - Fix STT channel saturation: pre-accumulate before sending (MIC 200ms, SYS 500ms — reduces production from ~40/s to ~7/s) - Fix mic VAD false-hold: cap mic_pending_buf to last 2s on nospeech so growing-silent-buffer doesn't waste CPU on ever-growing chunks - Fix unbounded sys_committed buffer: cap at 200 chars after each commit - Raise SPEECH_START_FRAMES 4→8 (128ms) to filter ambient noise - Add 10s force-finalize in STT worker for runaway mic pending buf Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
process_mic_streaming now returns Option<bool>:
- Some(true) = committed → clear pending_buf
- Some(false) = speech in-progress → keep accumulating (up to 5s cap)
- None = nospeech → trim to 1s to avoid noise contamination
Previously, every non-commit call trimmed the mic buffer to 1s regardless of whether real speech was detected. This meant SenseVoice always saw ≤1s of audio and mic_stable never reached 2, so committed text was stale. Now the buffer grows naturally during speech, giving multi-phrase context.
Also applies last_n_chars(200) cap to sys_subtitle_dirty path and process_mic_streaming subtitle build.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
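On the caller side, the three return values map onto buffer handling roughly as follows. `update_pending` and `keep_last` are hypothetical helpers, and the sketch assumes a 16kHz mono `f32` buffer.

```rust
pub const SAMPLE_RATE: usize = 16_000;

/// Drop the oldest samples so that at most `max_len` remain.
fn keep_last(buf: &mut Vec<f32>, max_len: usize) {
    if buf.len() > max_len {
        let excess = buf.len() - max_len;
        buf.drain(..excess);
    }
}

/// Apply the result of one streaming pass to the pending mic buffer.
pub fn update_pending(pending: &mut Vec<f32>, result: Option<bool>) {
    match result {
        // Committed: the text is frozen, start over with an empty buffer.
        Some(true) => pending.clear(),
        // Speech in progress: keep accumulating, capped at 5 s.
        Some(false) => keep_last(pending, 5 * SAMPLE_RATE),
        // No speech: keep only the last 1 s to limit noise contamination.
        None => keep_last(pending, SAMPLE_RATE),
    }
}
```

Using `Option<bool>` keeps the signature small; an enum with three named variants would make the call sites easier to read.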
Agent:
- Added tools: js_eval (rquickjs, 8x faster than boa), math_eval (woxi), speak (TTS queue), wait, read_file_range, task management (background tasks with timeout), read_screen, read_clipboard, screenshot
- Deduplicate consecutive identical tool calls
- Context compaction when history exceeds 60K chars
- Large tool outputs saved to file, agent reads via read_file_range
- Speech queue: serial playback, no overlap, fire-and-forget
LLM:
- Added Ollama provider (local, OpenAI-compatible API)
- Auto-discover best model from provider APIs (Gemini, OpenAI, Ollama)
- Anthropic uses claude-opus-4-latest alias
- Fallback chain: Gemini → GPT-4o → Claude Opus → Ollama local
TTS:
- Fallback chain: ElevenLabs → Gemini → OpenAI → msedge-tts → macOS say
- Speech queue thread for serial playback
Brainstorm:
- Keep/read histories checkbox (persists across restarts)
- Prompt format: clipboard\n---\n\n===
- Result overlay non-focusable (like voice overlay)
- Selected text via AX API (no clipboard pollution)
Refactor:
- Renamed key_tap_ctrl → key_tap_cmd_or_ctrl (cross-platform clarity)
- Removed key_tap_cmd_or_ctrl_shifted (use key_tap_with_mods instead)
- rquickjs for native JS eval, boa_engine kept as WASM fallback
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Voice pipeline now uses Gemini cloud STT for final transcription when GEMINI_API_KEY is available. SenseVoice still handles streaming (instant feedback), but at speech end the full utterance is re-transcribed by Gemini for higher accuracy (100% JA vs 94.7% local, 96.1% EN vs 95.6%). Fallback chain: Gemini cloud → SenseVoice + LLM correction → SenseVoice raw. Added: - cloud_stt.rs: Gemini generateContent with base64 WAV audio - stt-compare binary: benchmark SenseVoice vs Gemini on test audio Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After each commit, mic_pending_buf keeps 1s of context. Two inferences later (200ms), the context re-transcribes to the same text → stability fires → immediate duplicate commit. This repeated indefinitely. Fix: track mic_new_samples_since_commit (resets on commit/SpeechEnd). Stability gate now requires mic_new_samples > 16_000 (1s of genuinely new audio) before a stability commit can fire. Also: tail-based stability comparison (last 30 chars) so appending new words at sentence end doesn't reset the stable counter. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- MLX server auto-detected at :8321 as local LLM fallback (Ollama broken on M5) - Voice overlay toolbar: horizontal bar with ⠿ Move, model info, ✕ Close - Resize grip ⇲ at bottom-right corner (drag to resize overlay) - Shift+R/F = horizontal scroll (R/F = vertical, matching AHK) - Brainstorm: ESC closes dialog, AX selected text captured on event tap thread - Brainstorm: clipboard save/restore when falling back to Cmd+C - Brainstorm result overlay non-focusable (doesn't steal focus) - Speech queue: serial playback, dedup consecutive identical tool calls Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…st-byte - Fixed crash: removed resize grip setFrame from background thread - Subtitle shows line-per-commit (no more horizontal scrolling) - VAD end threshold raised 0.4→0.5 (drops out of speech faster) - MIC_SEND_THRESHOLD halved 3200→1600 (first-byte ~180ms, was ~280ms) - STT polishing cascade: MLX local (~160ms) → Gemini cloud → LLM corrector - Debug logging for speech buffer and subtitle updates Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete status table of all implemented features, benchmarks, architecture diagram, and next steps for v2.0/2.1/2.2. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
This PR introduces CapsLockX 2.0 as a Rust rewrite (macOS-first) and adds major AI/voice features (STT, TTS, brainstorm agent), plus new platform adapters, tooling, docs, and CI/release support.
Changes:
- Adds new Rust core capabilities: voice pipeline (local+cloud STT, correction), TTS fallback chain, agent/tools, background task manager, and audio capture.
- Adds macOS adapter implementation (CGEventTap hook, tray + prefs UI, voice overlay/capture/system audio).
- Updates packaging/CI: npm launcher downloads per-platform binaries; builds macOS in CI/release; expands documentation and plans.
Reviewed changes
Copilot reviewed 72 out of 93 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tmp/capslockx-mac | Adds submodule pointer for macOS adapter snapshot |
| tmp/brainstorm | Adds submodule pointer for brainstorm snapshot |
| rs/test-results.txt | Adds recorded manual test results artifact |
| rs/test-manual.ahk | Adds manual QA helper for Shift+HJKL selection |
| rs/dev-watch.sh | Adds cargo-watch script for auto rebuild/restart |
| rs/core/src/tts.rs | Introduces multi-tier TTS fallback implementation |
| rs/core/src/task_manager.rs | Adds background task manager with timeouts |
| rs/core/src/stt_corrector.rs | Adds incremental LLM-based STT correction |
| rs/core/src/state.rs | Extends config with STT/brainstorm/LLM settings |
| rs/core/src/platform.rs | Extends platform trait (audio, text input, mods) |
| rs/core/src/modules/voice_player.html | Adds HTML utility to play voice notes with subtitles |
| rs/core/src/modules/mouse.rs | Updates mouse/scroll phases & adds shift-scroll behavior |
| rs/core/src/modules/mod.rs | Adds brainstorm + voice modules and wiring |
| rs/core/src/modules/edit.rs | Adds atomic modifier-aware key tapping for macOS |
| rs/core/src/local_whisper.rs | Adds local whisper.cpp wrapper with autoscaling |
| rs/core/src/local_sherpa.rs | Adds local SenseVoice wrapper with auto-download |
| rs/core/src/lib.rs | Exposes new core modules (agent, STT, TTS, etc.) |
| rs/core/src/engine.rs | Improves trigger bypass logic + held-key syncing |
| rs/core/src/cloud_stt.rs | Adds Gemini cloud STT transcription helper |
| rs/core/src/bin/test-llm.rs | Adds quick LLM client test binary |
| rs/core/src/bin/test-agent.rs | Adds agent tool test runner binary |
| rs/core/src/bin/stt-server.rs | Adds persistent local STT server binary |
| rs/core/src/bin/stt-quick.rs | Adds one-shot STT helper binary |
| rs/core/src/bin/stt-compare.rs | Adds local vs cloud STT benchmark/compare tool |
| rs/core/src/bin/stt-bench.rs | Adds SenseVoice vs Whisper benchmark with WER/CER |
| rs/core/src/bin/sherpa-test.rs | Adds standalone mic capture + SenseVoice test binary |
| rs/core/src/bin/clx-agent.rs | Adds standalone CLI agent chat tool |
| rs/core/src/audio_capture.rs | Adds cross-platform mic capture via cpal |
| rs/core/src/acc_model.rs | Adds option to drive ticks externally (hook thread) |
| rs/core/Cargo.toml | Adds dependencies for audio/STT/TTS/JS/math engines |
| rs/adapters/windows/src/output.rs | Updates close_tab to cmd-or-ctrl helper |
| rs/adapters/windows/src/hook.rs | Drives AccModel ticks via SetTimer on hook thread |
| rs/adapters/macos/src/tray.rs | Adds macOS tray icon via ObjC FFI |
| rs/adapters/macos/src/prefs_html.html | Adds macOS preferences UI (WKWebView content) |
| rs/adapters/macos/src/main.rs | Adds macOS adapter entry point |
| rs/adapters/macos/src/key_map.rs | Adds macOS keycode mapping |
| rs/adapters/macos/src/hook.rs | Adds CGEventTap hook w/ pass-through/suppress logic |
| rs/adapters/macos/src/config_store.rs | Adds persistent config store for macOS |
| rs/adapters/macos/src/bin/voice-standalone.rs | Adds standalone voice pipeline binary |
| rs/adapters/macos/src/bin/test-vpio.rs | Adds VoiceProcessingIO AEC test binary |
| rs/adapters/macos/src/bin/test-cycle.rs | Adds direct window cycling stability test |
| rs/adapters/macos/build.rs | Links required macOS frameworks |
| rs/adapters/macos/Cargo.toml | Adds macOS adapter crate |
| rs/adapters/linux/src/output.rs | Updates close_tab to cmd-or-ctrl helper |
| rs/adapters/browser/www/vite.config.js | Adds dev server config for browser adapter |
| rs/adapters/browser/www/sherpa/sherpa-onnx-vad.js | Adds sherpa VAD JS helper |
| rs/adapters/browser/www/sherpa/.gitignore | Ignores wasm/data model artifacts |
| rs/adapters/browser/www/package.json | Adds browser adapter web package config |
| rs/adapters/browser/www/.gitignore | Ignores node_modules |
| rs/adapters/browser/src/platform.rs | Updates close_tab to cmd-or-ctrl helper |
| rs/Cargo.toml | Adds macOS adapter to workspace |
| plan/voice-input/voice-modes.md | Adds voice modes design notes |
| plan/voice-input/TODO.md | Adds voice implementation plan checklist |
| plan/voice-input/README.md | Adds voice feature spec |
| package.json | Switches npm bin to node launcher & includes new artifacts |
| bin/capslockx.mjs | Adds cross-platform launcher/downloader script |
| TODO.md | Adds voice feature plan reference + macOS drag issue note |
| .playwright-cli/page-2026-03-18T17-55-02-733Z.yml | Adds Playwright snapshot artifact |
| .playwright-cli/page-2026-03-18T17-53-48-281Z.yml | Adds Playwright snapshot artifact |
| .playwright-cli/page-2026-03-18T17-51-22-570Z.yml | Adds Playwright snapshot artifact |
| .playwright-cli/page-2026-03-18T17-50-03-292Z.yml | Adds Playwright snapshot artifact |
| .playwright-cli/page-2026-03-18T17-49-54-727Z.yml | Adds Playwright snapshot artifact |
| .playwright-cli/page-2026-03-18T17-49-44-295Z.yml | Adds Playwright snapshot artifact |
| .github/workflows/release-rust.yml | Adds macOS release build + alters artifact matching behavior |
| .github/workflows/ci-rust.yml | Adds macOS CI build/test/clippy |
| docs/dev/window-cycle-stability.md | Documents window cycling stability fixes |
| docs/dev/modifier-bypass.md | Documents modifier+Space bypass design |
| docs/dev/dual-track-stt.md | Documents dual-track STT architecture |
| docs/dev/agent-test-matrix.md | Documents agent tool test matrix |
| docs/Roadmap.md | Rewrites roadmap for Rust v2.0 scope/plans |
Files not reviewed (1)
- rs/adapters/browser/www/package-lock.json: Language not supported
Comments suppressed due to low confidence (15)
rs/core/src/modules/mouse.rs:1
- Horizontal scrolling is applied multiple times: `dx` scrolls inside both branches and then again unconditionally on line 139. This will double-scroll whenever `dx != 0` (and also double-scroll in Shift mode). Remove the unconditional `if dx != 0 { p.scroll_h(dx * 3); }` or restructure so each axis is emitted exactly once per call.
rs/core/src/platform.rs:1
- The implementation always uses `LCtrl`, but the docstring says it should use Cmd on macOS. This will break common macOS shortcuts (e.g., Cmd+W for close tab). Make the modifier conditional on `target_os` (use `KeyCode::LWin` on macOS; `KeyCode::LCtrl` elsewhere) or provide a platform override for macOS and keep the default consistent with its behavior.
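One way to make the modifier conditional is the `cfg!` macro, sketched below. The `KeyCode` enum here is a two-variant stand-in for the crate's real type, and `cmd_or_ctrl` is a hypothetical helper name.

```rust
/// Stand-in for the project's KeyCode type; only the two variants needed here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KeyCode {
    LCtrl,
    LWin, // maps to Cmd on macOS
}

/// Primary shortcut modifier for the platform the binary was compiled for.
pub fn cmd_or_ctrl() -> KeyCode {
    if cfg!(target_os = "macos") {
        KeyCode::LWin
    } else {
        KeyCode::LCtrl
    }
}
```

`cfg!` is evaluated at compile time, so the unused branch is optimized away; a `#[cfg(target_os = "macos")]` attribute on two separate functions would work equally well.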
rs/core/src/task_manager.rs:1
- `task_kill` sets a flag and updates status, but the running task never observes `kill`, so nothing is actually stopped. To make kill functional, run tasks cooperatively by passing the `Arc<AtomicBool>` (or a cancellation token/channel) into `func`, and ensure the task checks it periodically and exits early when set; alternatively, remove the public kill API and clearly indicate tasks are non-cancellable.
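The cooperative variant could look roughly like this. `TaskHandle`, `spawn_task`, and `counting_task` are hypothetical names for illustration; the real `task_manager.rs` API and its status bookkeeping are not shown in this review, so treat this as a sketch of the pattern only.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical handle pairing a worker thread with its kill flag.
pub struct TaskHandle {
    kill: Arc<AtomicBool>,
    thread: JoinHandle<u32>,
}

/// Spawn `func` and hand it a clone of the kill flag so it can observe it.
pub fn spawn_task<F>(func: F) -> TaskHandle
where
    F: FnOnce(Arc<AtomicBool>) -> u32 + Send + 'static,
{
    let kill = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&kill);
    let thread = thread::spawn(move || func(flag));
    TaskHandle { kill, thread }
}

impl TaskHandle {
    /// Request cancellation; the task stops at its next flag check.
    pub fn kill(&self) {
        self.kill.store(true, Ordering::SeqCst);
    }

    /// Wait for the task and return the number of steps it completed.
    pub fn join(self) -> u32 {
        self.thread.join().expect("task panicked")
    }
}

/// Example task: does up to 10_000 small steps, checking the flag each time.
pub fn counting_task(kill: Arc<AtomicBool>) -> u32 {
    let mut steps = 0;
    while !kill.load(Ordering::SeqCst) && steps < 10_000 {
        thread::sleep(Duration::from_millis(1));
        steps += 1;
    }
    steps
}
```

Cancellation here is only as prompt as the task's polling interval, so long blocking calls inside `func` would still need their own timeout.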
rs/adapters/browser/www/vite.config.js:1
- The dev server binds to all interfaces (`0.0.0.0`) and requires TLS keys from hard-coded `/tmp` paths. This is easy to misconfigure and can unintentionally expose a dev server on a LAN. Prefer defaulting `host` to `127.0.0.1` and sourcing key/cert paths (or HTTPS enablement) from environment variables with safe fallbacks.
rs/core/src/local_whisper.rs:1
- The module-level docs say upgrade happens when inference is >1.25x realtime, but `UPGRADE_THRESHOLD` is 5.0 (and the code uses a 'budget ratio' derived from wall-available time). Update the documentation to match the actual scaling heuristic to avoid misleading tuning/expectations.
rs/core/src/tts.rs:1
- Invalid base64 characters are silently treated as 0 via `unwrap_or(0)`, which can corrupt audio output without surfacing an error. Return an error when a character is not found in the alphabet (except `=` padding), or use a well-tested base64 crate to decode and validate input.
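A strict decoder that surfaces bad input instead of substituting 0 might look like the sketch below. It assumes the standard base64 alphabet and only trailing `=` padding; `decode_base64` and `DecodeError` are hypothetical names, not the functions currently in `tts.rs`.

```rust
/// Standard base64 alphabet (assumed; a URL-safe variant would differ).
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

#[derive(Debug, PartialEq, Eq)]
pub enum DecodeError {
    /// A byte outside the alphabet (including `=` in a non-trailing position).
    InvalidChar { index: usize, byte: u8 },
    /// The unpadded length cannot correspond to whole bytes.
    InvalidLength,
}

/// Decode base64, returning an error on the first invalid character.
pub fn decode_base64(input: &str) -> Result<Vec<u8>, DecodeError> {
    let bytes = input.as_bytes();
    // Only trailing '=' is accepted as padding.
    let end = bytes.iter().rposition(|&b| b != b'=').map_or(0, |i| i + 1);
    let data = &bytes[..end];
    if data.len() % 4 == 1 {
        return Err(DecodeError::InvalidLength);
    }

    let mut out = Vec::with_capacity(data.len() * 3 / 4);
    let mut acc: u32 = 0; // bit accumulator
    let mut bits: u32 = 0; // number of valid bits in `acc`
    for (index, &byte) in data.iter().enumerate() {
        let value = ALPHABET
            .iter()
            .position(|&c| c == byte)
            .ok_or(DecodeError::InvalidChar { index, byte })? as u32;
        acc = (acc << 6) | value;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
            acc &= (1 << bits) - 1; // keep only the leftover low bits
        }
    }
    Ok(out)
}
```

The linear alphabet lookup is kept for clarity; a 256-entry reverse table or an established base64 crate would be the more typical choice for production decoding.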
rs/core/src/tts.rs:1
- Audio is written to fixed filenames in `/tmp`. Concurrent or overlapping `speak()` calls can race and overwrite each other's output, producing wrong audio playback. Use unique temp paths (e.g., include PID + timestamp/random suffix) and consider deleting the temp file after playback completes.
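The unique-path-plus-cleanup idea can be sketched with the standard library alone. `unique_audio_path`, `TempAudio`, and the `capslockx-tts-` file prefix are hypothetical; the sketch assumes playback finishes before the guard is dropped.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

/// Per-process counter so two calls in the same nanosecond still differ.
static COUNTER: AtomicU64 = AtomicU64::new(0);

/// Build a path from PID + timestamp + sequence number (hypothetical naming).
pub fn unique_audio_path(ext: &str) -> PathBuf {
    let pid = std::process::id();
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0);
    let seq = COUNTER.fetch_add(1, Ordering::Relaxed);
    std::env::temp_dir().join(format!("capslockx-tts-{}-{}-{}.{}", pid, nanos, seq, ext))
}

/// Guard that deletes its file when dropped, i.e. after playback is done.
pub struct TempAudio {
    path: PathBuf,
}

impl TempAudio {
    /// Write `bytes` to a fresh unique path.
    pub fn create(ext: &str, bytes: &[u8]) -> io::Result<Self> {
        let path = unique_audio_path(ext);
        fs::write(&path, bytes)?;
        Ok(Self { path })
    }

    pub fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for TempAudio {
    fn drop(&mut self) {
        // Best-effort cleanup; a missing file is not an error here.
        let _ = fs::remove_file(&self.path);
    }
}
```

Each `speak()` call would own its own `TempAudio`, so overlapping calls never share a filename.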
rs/core/src/platform.rs:1
- The default `type_text` silently drops many common characters (e.g., '-', '=', ',', '/', ';', '''), which can corrupt typed STT output or user prompts. Either extend the mapping to cover a complete US-ANSI punctuation set, or change the fallback behavior to something lossless (e.g., clipboard paste) when encountering an unmapped character.
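The lossless-fallback option can be expressed as a planning step that decides, before any key is sent, whether the whole string is typeable. Everything here is hypothetical: `TypePlan`, `plan_typing`, and the deliberately tiny `is_mapped` set stand in for the real key mapping in `platform.rs`.

```rust
/// Outcome of planning how to deliver a string to the focused window.
#[derive(Debug, PartialEq, Eq)]
pub enum TypePlan {
    /// Every character has a key mapping; send them as keystrokes.
    Keys(Vec<char>),
    /// At least one character is unmapped; paste the whole text instead.
    Paste(String),
}

/// Placeholder for the real mapping table (assumed: alphanumerics and space).
fn is_mapped(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == ' '
}

/// Choose keystrokes only when nothing would be dropped.
pub fn plan_typing(text: &str) -> TypePlan {
    if text.chars().all(is_mapped) {
        TypePlan::Keys(text.chars().collect())
    } else {
        TypePlan::Paste(text.to_owned())
    }
}
```

Deciding per string rather than per character avoids interleaving keystrokes and pastes, at the cost of pasting some text that was mostly typeable.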
rs/dev-watch.sh:1
- `pkill -f "target/release/capslockx"` can match and kill unrelated processes whose command lines happen to contain that substring. Consider using a PID file, `pkill -x` with an exact process name when possible, or filtering by the full path to the spawned binary to avoid collateral termination.
rs/test-results.txt:1
- The PR description marks the test plan items as completed, but the committed `rs/test-results.txt` shows failing cases (5/8 passed). If this file is meant as an authoritative test artifact, it should reflect a passing run (or be excluded from the PR) to avoid conflicting signals about readiness.
    # Commit binary to main so npm package includes it.
    - name: Commit macOS binary to main
      run: |
        git config user.name "github-actions[bot]"
        git config user.email "github-actions[bot]@users.noreply.github.com"
        git pull origin main --rebase
        git add capslockx-macos-arm64
        git diff --cached --quiet || git commit -m "chore: update macOS binary for ${{ github.ref_name }}"
        git push origin main
Pushing release-built binaries back to main from a release workflow is risky: it mutates the default branch and can cause unexpected CI loops, merge conflicts, and provenance issues. Prefer attaching artifacts only to GitHub Releases and publishing npm packages from the release artifacts (or a dedicated distribution branch), rather than committing binaries into source control.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Test plan
- `cd rs && cargo build -p capslockx-macos --release`

🤖 Generated with Claude Code