CapsLockX 2.0: Rust rewrite with voice, brainstorm, TTS #123

Merged: snomiao merged 78 commits into main from beta on Mar 21, 2026

Conversation

@snomiao
Member

@snomiao snomiao commented Mar 21, 2026

Summary

  • Ground-up Rust rewrite targeting cross-platform (macOS first)
  • SenseVoice local STT (95%+ accuracy) + Gemini cloud STT + MLX local LLM correction
  • Brainstorm agent (Space+B) with 4 LLM providers, 10+ tools, persistent chat history
  • TTS with 5-tier fallback chain (ElevenLabs → Gemini → OpenAI → msedge → native)
  • Sandboxed JS engine (rquickjs) + Wolfram math engine (Woxi)
  • Voice overlay with waveform, subtitles, drag/resize/close
  • Window cycling via CGWindowID (stable across arrange/minimize)
  • Mouse clamp to screen bounds (multi-monitor safe)
  • Launch at login via LaunchAgent
  • Browser WASM adapter with SenseVoice voice input

Test plan

  • Build: cd rs && cargo build -p capslockx-macos --release
  • Voice input: Space+V hold to record, release to transcribe
  • Brainstorm: Space+B with selected text, Enter to send
  • Window cycling: Space+Z forward, Space+C arrange
  • Mouse: Space+WASD movement clamped to screen edges
  • TTS: Agent auto-speaks translations

🤖 Generated with Claude Code

snomiao and others added 30 commits March 14, 2026 20:05
- Add brainstorm_voice() with MCI waveaudio recording (V to record, V to send, ESC to cancel)
- Auto-send after 60s timeout to prevent super long recordings
- Show default mic device name in tooltip while recording
- Attach recorded WAV as audio field in form data (same pattern as image)
- Add "both" capture mode (clipboard text + window screenshot)
- Refactor: merge brainstorm_prompt/brainstorm_quick_capture into unified brainstorm_ask()
- Extract brainstorm_heatup(), remove dead/commented-out code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the CapsLockX macOS platform adapter with:
- CGEventTap hook for intercepting keyboard events at HID level
- Raw FFI callback to properly suppress events (returns NULL)
  - Works around core-graphics crate bug where None still passes events
- CGEventPost output for injecting keyboard, mouse, and scroll events
- Self-injection detection via EVENT_SOURCE_USER_DATA tagging
- Full macOS virtual keycode ↔ KeyCode bidirectional mapping
- FlagsChanged event handling for modifier key press/release detection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mouse cursor now stops at screen edges (union of all display bounds)
instead of disappearing off-screen. Also adjusts macOS scroll speed
to account for LINE scroll units being much larger than on Windows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
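The edge clamp described in the commit above could look roughly like the following. This is a minimal sketch, not the adapter's actual code: `Rect`, `union_bounds`, and `clamp_to_bounds` are hypothetical names, and treating the "union of all display bounds" as one bounding box is an assumption.

```rust
/// Hypothetical display rectangle in global coordinates (not a CapsLockX type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x: f64,
    pub y: f64,
    pub w: f64,
    pub h: f64,
}

/// Bounding box that covers every display; `None` when the list is empty.
pub fn union_bounds(displays: &[Rect]) -> Option<Rect> {
    let first = *displays.first()?;
    let (mut min_x, mut min_y) = (first.x, first.y);
    let (mut max_x, mut max_y) = (first.x + first.w, first.y + first.h);
    for d in &displays[1..] {
        min_x = min_x.min(d.x);
        min_y = min_y.min(d.y);
        max_x = max_x.max(d.x + d.w);
        max_y = max_y.max(d.y + d.h);
    }
    Some(Rect { x: min_x, y: min_y, w: max_x - min_x, h: max_y - min_y })
}

/// Clamp a cursor position so it stays on the last addressable pixel of `b`.
pub fn clamp_to_bounds(x: f64, y: f64, b: Rect) -> (f64, f64) {
    // Guard against degenerate rectangles so `clamp` never sees min > max.
    let max_x = (b.x + b.w - 1.0).max(b.x);
    let max_y = (b.y + b.h - 1.0).max(b.y);
    (x.clamp(b.x, max_x), y.clamp(b.y, max_y))
}
```

A bounding box is the simplest reading of "union": with displays of different heights the cursor can still enter the uncovered corner, so a per-display check would be needed if that matters.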
- Switch scroll from LINE to PIXEL units for smooth 1px-granularity
  trackpad-like scrolling
- Implement real window cycling (Space+Z) using CGWindowListCopyWindowInfo
  + NSRunningApplication instead of Cmd+Tab — directly activates the
  next/prev app window like Alt+Tab on Windows
- Add arrange_windows: Ctrl+Cmd+F for fullscreen, Ctrl+Up for
  Mission Control
- Reset scroll speed to default 720 (appropriate for pixel units)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…elds)

macOS disables CGEventTaps during secure input (password dialogs,
FileVault, etc.). The tap was staying disabled afterwards, making
CapsLockX unresponsive. Now detects TapDisabledByTimeout and
TapDisabledByUserInput events and automatically re-enables the tap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Window cycling (Space+Z) now cycles individual windows (not just apps)
  using Accessibility API (AXUIElement) to raise specific windows
- Window tiling (Space+C) uses AX API to set position/size within the
  visible work area (excluding menu bar and Dock):
  - Plain: cascade with 48px offset
  - Shift: sqrt-based grid layout
- Both cycle and arrange use stable (pid, title) ordering
- Ctrl+Space+N/P sends Ctrl+Tab / Ctrl+Shift+Tab (switch browser tabs)
  via new key_tap_ctrl_shifted platform method with proper CGEvent flags
- Cmd+Space now passes through to macOS Spotlight (Win+Space is likewise
  bypassed on Windows for the input language switcher)
- Added drag bug to TODO.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
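For the Shift variant of Space+C in the commit above, a sqrt-based grid can be sketched as follows. The commit does not spell out the formula, so the choice of `ceil(sqrt(n))` columns is an assumption, and `grid_dims` and `grid_cell` are hypothetical helper names.

```rust
/// Columns and rows for `n` windows: ceil(sqrt(n)) columns, then enough rows.
pub fn grid_dims(n: usize) -> (usize, usize) {
    if n == 0 {
        return (0, 0);
    }
    let cols = (n as f64).sqrt().ceil() as usize;
    let rows = (n + cols - 1) / cols; // integer ceiling of n / cols
    (cols, rows)
}

/// Frame `(x, y, w, h)` of window `i` out of `n` inside the visible work area.
pub fn grid_cell(
    i: usize,
    n: usize,
    area: (f64, f64, f64, f64),
) -> Option<(f64, f64, f64, f64)> {
    if i >= n {
        return None;
    }
    let (cols, rows) = grid_dims(n);
    let (ax, ay, aw, ah) = area;
    let (cw, ch) = (aw / cols as f64, ah / rows as f64);
    let (col, row) = (i % cols, i / cols);
    Some((ax + col as f64 * cw, ay + row as f64 * ch, cw, ch))
}
```

Five windows in a 900x600 work area give a 3x2 grid of 300x300 cells, with the last cell of the second row left empty.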
- ci-rust.yml: add build-macos job (macos-latest, aarch64-apple-darwin)
  with check, clippy, test, and release build
- release-rust.yml: add build-macos job that builds capslockx-macos and
  uploads capslockx-macos-arm64 binary to GitHub Release

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- bin/capslockx.mjs: detects OS/arch and runs the correct Rust binary
  (Windows, macOS arm64/x64, Linux x64)
- Looks for binary in: repo root, local cargo build, then auto-downloads
  from latest GitHub Release as fallback
- package.json: bin field points to the cross-platform launcher
- package.json: files field includes all platform binaries
- release-rust.yml: commit macOS binary to main for npm inclusion

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The AHK launcher is not the Rust binary — falling back to it would
run the wrong thing. If clx-rust.exe isn't found locally, the
auto-download from GitHub Releases handles it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously the bypass path suppressed the original event and re-injected
Space via key_tap, which stripped modifier flags — macOS saw naked Space
instead of Cmd+Space. Now the bypass returns PassThrough so the original
event reaches the OS with all modifier flags intact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- White X icon when inactive, blue X when CapsLockX mode is active
- Menu with "Quit CapsLockX" item
- NSApplication initialized as Accessory (no dock icon)
- Icon updates dispatched to main queue for AppKit thread safety
- Uses raw Objective-C FFI, no additional dependencies

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Preferences UI with same Catppuccin Mocha theme as Windows
- 5 trigger key checkboxes + 3 speed sliders
- WKWebView with JS bridge (webkit.messageHandlers) instead of Tauri
- Custom ObjC classes created at runtime for WKScriptMessageHandler
  and menu item action target
- Config persisted to ~/.config/CapsLockX/config.json
- Tray menu: "Preferences…" (Cmd+,) + separator + "Quit CapsLockX"
- ~300 lines of raw ObjC FFI, no framework dependency, ~0 binary overhead

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three bugs prevented Cmd+Space from working:

1. Fast key combos caused Cmd's FlagsChanged-up to arrive before
   Space-down, removing LWin from held_keys. Fix: sync CGEvent
   modifier flags into held_keys on every KeyDown event.

2. Space key-up was always suppressed for trigger keys, so macOS
   never saw the complete down+up cycle. Fix: track bypass state
   with trigger_bypassed AtomicBool and pass through both events.

3. FlagsChanged events for modifiers could be suppressed, preventing
   macOS from tracking modifier state. Fix: always pass through
   FlagsChanged on macOS.

Added docs/dev/modifier-bypass.md documenting the full solution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Modifier keys (Shift, Ctrl, Option, Cmd) are now passed through as-is
  to the output key — no platform-specific word_modifier/doc_modifier
  mapping needed. Users press their native modifier combos.
- Removed word_modifier()/doc_modifier() from Platform trait
- Renamed AccModel phase strings from Chinese to English:
  横中键→H_MIDKEY, 纵中键→V_MIDKEY, 移动→MOVE, 止动→STOP

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Design doc and TODO for the voice input feature:
- Toggle mode (click V) and hold mode (hold V)
- VAD-based audio segmentation with 25s max chunks
- 3-stage transcription pipeline: local→server Whisper→LLM typo-fix
- Each stage replaces previous text in-place at cursor
- Architecture: cpal audio + webrtc-vad + HTTP to brainstorm server

See plan/voice-input/README.md for full architecture and
plan/voice-input/TODO.md for implementation checklist.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Client (Rust):
- VoiceModule: V key state machine (toggle click / hold release)
  Wired into Modules dispatcher for on_key_down/on_key_up/stop_all
- AudioCapture: cpal-based cross-platform mic capture (16kHz mono f32)
  with ring buffer, start/stop/take_samples API

Server (brainstorm):
- POST /api/voice-transcribe: streaming NDJSON endpoint
  Stage 1: Whisper transcription → {stage:"transcribed"}
  Stage 2: gpt-4o-mini typo-fix → {stage:"polished", is_final:true}
  CORS enabled for cross-origin Rust client

Next: wire audio capture into VoiceModule, add VAD, HTTP client

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full voice pipeline wired together:
- Energy-based VAD: 20ms frames, RMS threshold, 500ms silence = chunk end
- Force-splits at 25s for Whisper's 30s limit
- WAV encoder (16-bit PCM mono) for server upload
- HTTP POST via ureq to brainstorm voice-transcribe endpoint
- Parses streaming NDJSON response, types final text at cursor
- AudioCapture created on background thread (cpal Stream is !Send)
- Platform::type_text() default impl maps ASCII to key_tap calls
- Server URL configurable via CLX_VOICE_SERVER env var

Usage: hold Space+V to dictate, release to send. Or tap to toggle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
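The energy-based VAD from the commit above (20ms frames, RMS threshold, 500ms of silence ends a chunk) might be sketched like this. The type and function names are hypothetical, the 16 kHz rate comes from the earlier capture commit, and the threshold value is left to the caller because the commit does not state one.

```rust
pub const SAMPLE_RATE: usize = 16_000;
/// 20 ms of audio at 16 kHz.
pub const FRAME_LEN: usize = SAMPLE_RATE * 20 / 1000;
/// 500 ms of silence expressed in 20 ms frames.
pub const SILENCE_FRAMES_TO_END: u32 = 500 / 20;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VadEvent {
    SpeechStart,
    SpeechEnd,
}

/// Root-mean-square energy of one frame.
pub fn rms(frame: &[f32]) -> f32 {
    if frame.is_empty() {
        return 0.0;
    }
    (frame.iter().map(|s| s * s).sum::<f32>() / frame.len() as f32).sqrt()
}

pub struct EnergyVad {
    threshold: f32,
    in_speech: bool,
    silent_frames: u32,
}

impl EnergyVad {
    pub fn new(threshold: f32) -> Self {
        Self { threshold, in_speech: false, silent_frames: 0 }
    }

    /// Feed one frame; returns an event only when the speech state flips.
    pub fn push_frame(&mut self, frame: &[f32]) -> Option<VadEvent> {
        if rms(frame) >= self.threshold {
            self.silent_frames = 0;
            if !self.in_speech {
                self.in_speech = true;
                return Some(VadEvent::SpeechStart);
            }
        } else if self.in_speech {
            self.silent_frames += 1;
            if self.silent_frames >= SILENCE_FRAMES_TO_END {
                self.in_speech = false;
                self.silent_frames = 0;
                return Some(VadEvent::SpeechEnd);
            }
        }
        None
    }
}
```

The 25s force-split mentioned in the same commit would sit on top of this, as a separate counter on the chunk length.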
Two-phase transcription pipeline:
1. Local Whisper (whisper.cpp via whisper-rs): instant rough draft
   typed at cursor (~200ms on Apple Silicon)
2. Server Whisper + LLM: polished text replaces rough via Backspace

- Model: ggml-tiny.en.bin at ~/.cache/capslockx/ (~75MB)
- Prints model download instructions if the model is missing
- Graceful fallback to server-only if model unavailable
- Metal GPU acceleration auto-detected on macOS
- Platform::type_text() for typing transcription at cursor

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Resample 48kHz mic audio to 16kHz before Whisper (fixes hallucinations)
- Skip chunks shorter than 1 second
- Preload Whisper model at startup (first Space+V is instant)
- Persistent bg thread (model stays loaded between sessions)
- Server: skip empty transcriptions, install nodemailer dep
- macOS: key_tap_with_mods embeds modifier flags on CGEvent atomically
  (fixes Shift+HJKL text selection)
- Quiet verbose logs (window snapshots, VAD events, audio capture)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- macOS type_text: CGEventKeyboardSetUnicodeString for full Unicode
  (Chinese, Japanese, emoji — not just ASCII)
- Whisper auto-scaling: budget-based (wall clock vs inference time),
  non-blocking background model loading (old model keeps working),
  persists tier across restarts via whisper-tier.txt
- Noise filter: skip bracketed annotations, hallucinations, <3 chars
- Voice server URL reads from ~/.config/CapsLockX/config.json
- VAD events logged again (speech start/end)
- rs/dev-watch.sh: cargo-watch auto-rebuild + restart on file change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace RMS energy VAD with TEN VAD neural network (308KB ONNX model)
  — reliably distinguishes human speech from keyboard clicks, music,
  ambient noise. 10ms frames, 0.16ms inference on M-series.
- Native Core Graphics waveform overlay: transparent floating window
  at bottom-center, green when speaking, gray when silent.
  Custom NSView with drawRect via raw ObjC FFI, ~20fps.
- Resample 48kHz→16kHz before VAD (not after) for consistency.
- Platform trait: show/hide/update_voice_overlay methods.
- Whisper auto-scaling: non-blocking background model loading,
  budget-based (wall clock), persists tier across restarts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace RMS energy VAD with TEN VAD neural network (308KB model)
  Distinguishes human speech from keyboard clicks, music, ambient noise
- Space+V = mic only (as before)
- Space+Shift+V = mic + system audio (ScreenCaptureKit)
  Captures meeting audio, YouTube, etc. mixed with your voice
- SystemAudioStream trait + MacPlatform override
- ScreenCaptureKit stub with CMSampleBuffer handler ready
  (full async SCShareableContent flow TODO)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of buffering all continuous speech until silence (up to 25s),
emit partial chunks every 3 seconds so text appears incrementally
as the user speaks — like an input method editor.

Also removed corrupted ggml-small.bin (incomplete download).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Emit partial transcription every 1s of continuous speech instead of 3s.
Use rolling buffer with VAD speech detection — transcribe on fixed
interval while speech is active, flush remainder on speech end.

Base model: 1s audio + ~180ms inference ≈ 1.2s total latency
Small model: 1s audio + ~575ms inference ≈ 1.6s total latency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of typing the full transcription each time, compute the diff
between what was already typed and the new transcription:
- Find common prefix
- Backspace the diverging suffix
- Type only the new suffix

This makes continuous speech feel like an IME — text flows in
incrementally without repeating. Server polishes the full utterance
on speech end by replacing all typed text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
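The three diff steps in the commit above (common prefix, backspace, type the new suffix) reduce to a small function. This is a sketch with a hypothetical name, `diff_edit`; it counts Unicode scalar values, which is an assumption and can miscount for multi-codepoint grapheme clusters such as some emoji.

```rust
/// Returns `(backspaces, text_to_type)` that turns `typed` into `new_text`.
pub fn diff_edit(typed: &str, new_text: &str) -> (usize, String) {
    // Length of the shared prefix, measured in chars rather than bytes.
    let common = typed
        .chars()
        .zip(new_text.chars())
        .take_while(|(a, b)| a == b)
        .count();
    // Everything already typed after the shared prefix must be erased.
    let backspaces = typed.chars().count() - common;
    // Only the part of the new transcription after the prefix is typed.
    let suffix: String = new_text.chars().skip(common).collect();
    (backspaces, suffix)
}
```

For example, going from "hello word" to "hello world" costs one Backspace followed by typing "ld", instead of retyping the whole line.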
- Transcribe every 0.5s of new speech (was 1s) — Whisper inference
  is ~constant time regardless of audio length, so faster intervals
  are essentially free
- Server polishing now runs in a background thread so it doesn't
  block the streaming transcription loop
- Forced base model for lower latency (~450ms vs ~1300ms small)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of re-transcribing the entire growing buffer (causing massive
diffs when Whisper changes its mind about old text), use a committed
prefix approach:
- Only the last ~5s of audio (pending buffer) gets re-transcribed
- After 5s, text is frozen/committed and never changed
- Diffs are small and local — no more -400 char backspace storms
- Session log at /tmp/clx-voice.log shows full text state

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Upgrade threshold 1.25x → 5x (prevent unnecessary model switches)
- Samples before scaling 3 → 10 (more evidence needed)
- Commit window 5s → 3s (freeze text faster, less Whisper instability)
- Force base model for consistent low-latency streaming

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of committing at a fixed 3s interval (which causes large diffs
right before commit), wait until the transcription is stable — same
text for 2 consecutive inference cycles. This means:
- Commits at natural phrase boundaries where Whisper has settled
- No more large pre-commit backspace storms
- Force-commit at 5s as safety net for very long utterances

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
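The stability-based commit rule from the commit above can be sketched as a small state machine. `StabilityGate` is a hypothetical type, and counting the first sighting of a text as cycle 1 is an assumption about what "2 consecutive inference cycles" means.

```rust
pub struct StabilityGate {
    last: String,
    stable_cycles: u32,
    pending_secs: f32,
}

impl StabilityGate {
    /// Identical results needed in a row before committing.
    pub const STABLE_CYCLES: u32 = 2;
    /// Safety net for very long utterances.
    pub const FORCE_COMMIT_SECS: f32 = 5.0;

    pub fn new() -> Self {
        Self { last: String::new(), stable_cycles: 0, pending_secs: 0.0 }
    }

    /// Record one inference result; returns `true` when the text should be committed.
    pub fn observe(&mut self, text: &str, new_audio_secs: f32) -> bool {
        self.pending_secs += new_audio_secs;
        if text == self.last {
            self.stable_cycles += 1;
        } else {
            self.last = text.to_string();
            self.stable_cycles = 1;
        }
        let stable = !text.is_empty() && self.stable_cycles >= Self::STABLE_CYCLES;
        let forced = self.pending_secs >= Self::FORCE_COMMIT_SECS;
        if stable || forced {
            // Start fresh for the next phrase.
            self.last.clear();
            self.stable_cycles = 0;
            self.pending_secs = 0.0;
            return true;
        }
        false
    }
}
```

With this shape, a transcription that keeps changing never commits until the 5s limit, while one that settles commits on its second identical result.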
snomiao and others added 24 commits March 21, 2026 00:58
VoiceProcessingIO ducks (lowers volume of) other system audio by default.
Added kAUVoiceIOProperty_OtherAudioDuckingConfiguration (property 2108)
with minimum ducking level. Speakers should now stay audible during
echo-cancelled mic capture.

Re-enabled VoiceProcessingIO for Shift+V (system audio capture mode).
Normal Space+V still uses cpal (no ducking at all).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Speakers audible (ducking minimized) ✓
- VoiceProcessingIO starts successfully ✓
- Format query fails (-10877) — assuming 48kHz mono f32
- AEC not effective yet — both tracks still show same content
- TODO: fix format to get actual echo-cancelled audio from VPIO

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: VPIO on macOS reports 9ch non-interleaved format but
AudioUnitRender with 1-buffer mono succeeds and returns very quiet
echo-cancelled audio (rms 0.001-0.036 vs normal 0.05-0.3).

The AEC IS working — speaker bleed is cancelled. The signal is just
extremely quiet. Applied 30x gain amplification to bring levels
back to normal for VAD/Whisper processing.

Also added test-vpio standalone binary for debugging VPIO independently.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
test-vpio + Whisper test with English YouTube on speakers and
Japanese audio into mic shows 7/8 transcriptions as Japanese.
English speaker bleed is successfully cancelled.

VPIO (30x gain) → Whisper correctly separates:
  🎤 VPIO: "はい、オートです。" (Japanese from mic)
  🎤 VPIO: "さぁ、ちょっと、開いたいです。" (Japanese from mic)
  vs speakers playing English YouTube → cancelled

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ary)

The 9-buffer non-interleaved approach caused render status=-50.
Reverted to simple 1-buffer mono render which the test binary proved
works correctly. AEC effectiveness: 91% (10/11 non-English, 1 leak).

Standalone test confirmed: Japanese mic audio separated from English
speakers with VoiceProcessingIO + 30x gain.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously truncated the combined string, cutting off the 🎤 line
and emoji when text was long. Now each line (mic/sys) is truncated
independently at 80 chars, preserving both emoji labels.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Also fix subtitle overlay: split lines on \n before parsing emoji tags,
add transparent-bg newline between lines for proper NSTextField rendering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pure Rust NLMS (Normalized Least Mean Squares) adaptive filter:
- 4800 taps (300ms at 16kHz) learns speaker→mic acoustic path
- Subtracts predicted echo from mic signal
- Cross-platform: works on macOS/Windows/Linux
- Stacks with VoiceProcessingIO: VPIO removes ~91%, NLMS cleans rest
- Also added noise gate (0.002 threshold) in VPIO voice_capture

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
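For readers unfamiliar with NLMS, the filter named in the commit above follows a standard update rule: predict the echo from recent speaker samples, subtract it from the mic sample, then nudge the weights by the normalized error. The sketch below is a textbook version with a hypothetical `Nlms` type; the step size, the regularization constant, and the `rotate_right` history (simple but O(taps) per sample) are assumptions, not the project's implementation.

```rust
pub struct Nlms {
    weights: Vec<f32>,
    history: Vec<f32>, // most recent reference (speaker) sample first
    mu: f32,           // step size, typically in (0, 1]
    eps: f32,          // avoids division by zero on silent input
}

impl Nlms {
    pub fn new(taps: usize, mu: f32) -> Self {
        assert!(taps > 0, "filter needs at least one tap");
        Self { weights: vec![0.0; taps], history: vec![0.0; taps], mu, eps: 1e-6 }
    }

    /// Feed one speaker sample and one mic sample; returns the echo-reduced mic sample.
    pub fn process(&mut self, reference: f32, mic: f32) -> f32 {
        self.history.rotate_right(1);
        self.history[0] = reference;

        // Predicted echo = weights dotted with recent reference samples.
        let predicted: f32 = self.weights.iter().zip(&self.history).map(|(w, x)| w * x).sum();
        let error = mic - predicted;

        // Normalized LMS update: w += mu * e * x / (x.x + eps).
        let energy: f32 = self.history.iter().map(|x| x * x).sum();
        let step = self.mu * error / (energy + self.eps);
        for (w, x) in self.weights.iter_mut().zip(&self.history) {
            *w += step * x;
        }
        error
    }
}
```

At 16 kHz, 4800 taps cover the 300ms path length quoted in the commit; a production filter would likely use a ring buffer or block processing rather than rotating the history on every sample.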
Moved gain+noise gate from voice_capture.rs callback to voice.rs
AFTER the NLMS filter. NLMS now operates on raw VPIO signal
(pre-amplification) where echo residual is tiny. After NLMS
cleans it, 40x gain amplifies clean voice only.

Result: 0 English leaks in both standalone test and CapsLockX session.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Truncation now keeps first 2 chars (emoji + space) as prefix,
then "..." + last 74 chars. Previously took last 77 chars from
the end, cutting off the emoji label.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
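The truncation rule in the commit above (2-char label, then "..." and the last 74 chars) might look like this. `truncate_line` is a hypothetical name, the 80-char limit is taken from the earlier per-line truncation commit, and the sketch works on `char`s rather than bytes so multi-byte text cannot be split mid-character.

```rust
/// Lines at or under this many chars are returned unchanged.
pub const MAX_CHARS: usize = 80;
const PREFIX_CHARS: usize = 2; // emoji label + space
const TAIL_CHARS: usize = 74;

/// Keep the emoji label and the most recent text, dropping the middle.
pub fn truncate_line(line: &str) -> String {
    let chars: Vec<char> = line.chars().collect();
    if chars.len() <= MAX_CHARS {
        return line.to_string();
    }
    // Safe to slice: the line has more than 80 chars at this point.
    let prefix: String = chars[..PREFIX_CHARS].iter().collect();
    let tail: String = chars[chars.len() - TAIL_CHARS..].iter().collect();
    format!("{prefix}...{tail}")
}
```

The result is 79 chars long (2 + 3 + 74), so it always fits under the 80-char limit.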
Bottleneck: two Whisper instances running sequentially (~400ms/cycle).
Fixes:
- Sys Whisper now runs 3x less often (larger streaming interval)
- Mic Whisper gets priority for faster response
- NLMS filter reduced 4800→1600 taps (3x less CPU, 100ms still enough)

Net effect: mic transcription ~200ms/cycle instead of ~400ms.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mic-only is faster (no AEC/NLMS/system audio overhead).
Dual capture with echo cancellation only when explicitly requested.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously used stale 2s snapshot, causing index drift when user
manually clicked a window between Z presses. Now:
- Always takes fresh window snapshot (handles open/close)
- Detects frontmost app via NSWorkspace.frontmostApplication
- Starts cycling from the actually-focused window

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…esses

- Removed (pid, title) sort — keep natural z-order from CGWindowList
  (front-to-back, most recently used first)
- Fresh snapshot on first press or after 2s pause
- Reuse snapshot during rapid cycling (prevents ping-pong)
- Index 0 = frontmost, so first Z goes to 2nd most recent window

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ling

Major features:
- SenseVoice (sherpa-onnx) as default STT engine with Whisper fallback
- LLM-based STT error correction via Gemini/OpenAI/Anthropic
- Brainstorm agent (Space+B) with web_search, fetch_url, js_eval (Boa), math_eval (Woxi)
- Non-modal brainstorm prompt panel (doesn't block voice overlay)
- Voice overlay: auto-resize, drag handle on hover, hidden from screen share
- Window cycling: CGWindowID-based stable ordering, frontmost detection
- Browser voice: SenseVoice WASM with server-mode streaming
- STT benchmark suite (SenseVoice vs Whisper tiny/base/small/medium/large-v3)
- Dual-track STT worker architecture (non-blocking audio loop)
- Preferences UI: STT engine, LLM API key, model selection
- Config persistence across restarts (overlay positions, settings)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add voice-standalone binary: tests Space+V pipeline without full CLX
  (VoiceProcessingIO AEC mic + ScreenCaptureKit sys audio + overlay)
- Fix NSSize/NSRect ABI mismatch in voice_overlay auto_resize (ARM64)
- Fix byte-slice panics on Chinese UTF-8 in subtitle debug logs
- Fix sys track subtitle never updating (add sys_subtitle_dirty path)
- Fix STT channel saturation: pre-accumulate before sending
  (MIC 200ms, SYS 500ms — reduces production from ~40/s to ~7/s)
- Fix mic VAD false-hold: cap mic_pending_buf to last 2s on nospeech
  so growing-silent-buffer doesn't waste CPU on ever-growing chunks
- Fix unbounded sys_committed buffer: cap at 200 chars after each commit
- Raise SPEECH_START_FRAMES 4→8 (128ms) to filter ambient noise
- Add 10s force-finalize in STT worker for runaway mic pending buf

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
process_mic_streaming now returns Option<bool>:
- Some(true)  = committed → clear pending_buf
- Some(false) = speech in-progress → keep accumulating (up to 5s cap)
- None        = nospeech → trim to 1s to avoid noise contamination

Previously, every non-commit call trimmed the mic buffer to 1s regardless
of whether real speech was detected. This meant SenseVoice always saw ≤1s
of audio and mic_stable never reached 2, so committed text was stale.
Now the buffer grows naturally during speech, giving multi-phrase context.

Also applies last_n_chars(200) cap to sys_subtitle_dirty path and
process_mic_streaming subtitle build.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
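The three return values of `process_mic_streaming` in the commit above map onto three buffer policies. A minimal sketch of the caller side, assuming 16 kHz samples and using hypothetical names (`apply_streaming_result`, `keep_last`), could be:

```rust
pub const SAMPLE_RATE: usize = 16_000;
/// Cap while speech is in progress (5 s).
pub const SPEECH_CAP_SAMPLES: usize = 5 * SAMPLE_RATE;
/// Context kept when no speech was detected (1 s).
pub const NOSPEECH_KEEP_SAMPLES: usize = SAMPLE_RATE;

/// Drop the oldest samples so that at most `n` remain.
fn keep_last(buf: &mut Vec<f32>, n: usize) {
    if buf.len() > n {
        let cut = buf.len() - n;
        buf.drain(..cut);
    }
}

/// Apply the buffer policy for one streaming result.
pub fn apply_streaming_result(pending_buf: &mut Vec<f32>, result: Option<bool>) {
    match result {
        // Committed: the text is final, start over.
        Some(true) => pending_buf.clear(),
        // Speech in progress: keep accumulating, up to the cap.
        Some(false) => keep_last(pending_buf, SPEECH_CAP_SAMPLES),
        // No speech: keep only a short tail to avoid noise contamination.
        None => keep_last(pending_buf, NOSPEECH_KEEP_SAMPLES),
    }
}
```

Separating "speech in progress" from "no speech" is what lets the buffer grow past 1s during real speech, which the previous single trimming path prevented.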
Agent:
- Added tools: js_eval (rquickjs, 8x faster than boa), math_eval (woxi),
  speak (TTS queue), wait, read_file_range, task management (background tasks
  with timeout), read_screen, read_clipboard, screenshot
- Deduplicate consecutive identical tool calls
- Context compaction when history exceeds 60K chars
- Large tool outputs saved to file, agent reads via read_file_range
- Speech queue: serial playback, no overlap, fire-and-forget

LLM:
- Added Ollama provider (local, OpenAI-compatible API)
- Auto-discover best model from provider APIs (Gemini, OpenAI, Ollama)
- Anthropic uses claude-opus-4-latest alias
- Fallback chain: Gemini → GPT-4o → Claude Opus → Ollama local

TTS:
- Fallback chain: ElevenLabs → Gemini → OpenAI → msedge-tts → macOS say
- Speech queue thread for serial playback

Brainstorm:
- Keep/read histories checkbox (persists across restarts)
- Prompt format: clipboard\n---\n\n===
- Result overlay non-focusable (like voice overlay)
- Selected text via AX API (no clipboard pollution)

Refactor:
- Renamed key_tap_ctrl → key_tap_cmd_or_ctrl (cross-platform clarity)
- Removed key_tap_cmd_or_ctrl_shifted (use key_tap_with_mods instead)
- rquickjs for native JS eval, boa_engine kept as WASM fallback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Voice pipeline now uses Gemini cloud STT for final transcription when
GEMINI_API_KEY is available. SenseVoice still handles streaming (instant
feedback), but at speech end the full utterance is re-transcribed by
Gemini for higher accuracy (100% JA vs 94.7% local, 96.1% EN vs 95.6%).

Fallback chain: Gemini cloud → SenseVoice + LLM correction → SenseVoice raw.

Added:
- cloud_stt.rs: Gemini generateContent with base64 WAV audio
- stt-compare binary: benchmark SenseVoice vs Gemini on test audio

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
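The fallback chain in the commit above (Gemini cloud, then SenseVoice with LLM correction, then raw SenseVoice) follows a common pattern: try providers in order and return the first usable result. The sketch below is illustrative only; `SttFn` and `transcribe_with_fallback` are hypothetical names, and treating an empty transcription as a failure is an assumption.

```rust
/// Hypothetical provider signature: audio samples in, text or an error message out.
pub type SttFn = fn(&[f32]) -> Result<String, String>;

/// Try each provider in order; return the name and text of the first one that succeeds.
pub fn transcribe_with_fallback<'a>(
    providers: &[(&'a str, SttFn)],
    audio: &[f32],
) -> Option<(&'a str, String)> {
    for (name, transcribe) in providers {
        match transcribe(audio) {
            // Assumption: an empty transcription counts as a miss.
            Ok(text) if !text.trim().is_empty() => return Some((*name, text)),
            // Error or empty text: fall through to the next provider.
            _ => continue,
        }
    }
    None
}
```

Returning the provider name alongside the text makes it easy to log which tier answered. The TTS chain listed in the PR summary has the same shape, with audio output in place of text.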
After each commit, mic_pending_buf keeps 1s of context. Two inferences
later (200ms), the context re-transcribes to the same text → stability
fires → immediate duplicate commit. This repeated indefinitely.

Fix: track mic_new_samples_since_commit (resets on commit/SpeechEnd).
Stability gate now requires mic_new_samples > 16_000 (1s of genuinely
new audio) before a stability commit can fire.

Also: tail-based stability comparison (last 30 chars) so appending new
words at sentence end doesn't reset the stable counter.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
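The duplicate-commit fix above combines two conditions: the transcription tail must be stable, and more than 1s (16,000 samples) of new audio must have arrived since the last commit. A rough sketch follows; `CommitGate` and `tail_chars` are hypothetical names, and resetting the stable counter to 0 on any tail change is an assumption about the real logic.

```rust
/// Last `n` chars of `s`, sliced on a char boundary.
pub fn tail_chars(s: &str, n: usize) -> &str {
    if n == 0 {
        return "";
    }
    match s.char_indices().rev().nth(n - 1) {
        Some((idx, _)) => &s[idx..],
        None => s, // fewer than n chars: the whole string is the tail
    }
}

pub struct CommitGate {
    last: String,
    stable_cycles: u32,
    new_samples: usize,
}

impl CommitGate {
    pub const TAIL_CHARS: usize = 30;
    pub const STABLE_CYCLES: u32 = 2;
    pub const MIN_NEW_SAMPLES: usize = 16_000;

    pub fn new() -> Self {
        Self { last: String::new(), stable_cycles: 0, new_samples: 0 }
    }

    /// Count audio that arrived since the last commit.
    pub fn on_audio(&mut self, samples: usize) {
        self.new_samples += samples;
    }

    /// Record one inference result; returns `true` when a stability commit may fire.
    pub fn on_inference(&mut self, text: &str) -> bool {
        let unchanged =
            tail_chars(text, Self::TAIL_CHARS) == tail_chars(&self.last, Self::TAIL_CHARS);
        self.stable_cycles = if unchanged { self.stable_cycles + 1 } else { 0 };
        self.last = text.to_string();
        let commit = self.stable_cycles >= Self::STABLE_CYCLES
            && self.new_samples > Self::MIN_NEW_SAMPLES;
        if commit {
            // Reset on commit so leftover context cannot trigger a duplicate.
            self.stable_cycles = 0;
            self.new_samples = 0;
        }
        commit
    }
}
```

After a commit, re-transcribing the retained 1s of context yields the same text again, but the gate stays closed because no new samples have been counted.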
- MLX server auto-detected at :8321 as local LLM fallback (Ollama broken on M5)
- Voice overlay toolbar: horizontal bar with ⠿ Move, model info, ✕ Close
- Resize grip ⇲ at bottom-right corner (drag to resize overlay)
- Shift+R/F = horizontal scroll (R/F = vertical, matching AHK)
- Brainstorm: ESC closes dialog, AX selected text captured on event tap thread
- Brainstorm: clipboard save/restore when falling back to Cmd+C
- Brainstorm result overlay non-focusable (doesn't steal focus)
- Speech queue: serial playback, dedup consecutive identical tool calls

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…st-byte

- Fixed crash: removed resize grip setFrame from background thread
- Subtitle shows line-per-commit (no more horizontal scrolling)
- VAD end threshold raised 0.4→0.5 (drops out of speech faster)
- MIC_SEND_THRESHOLD halved 3200→1600 (first-byte ~180ms, was ~280ms)
- STT polishing cascade: MLX local (~160ms) → Gemini cloud → LLM corrector
- Debug logging for speech buffer and subtitle updates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete status table of all implemented features, benchmarks,
architecture diagram, and next steps for v2.0/2.1/2.2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 21, 2026 11:35

Copilot AI left a comment


Pull request overview

This PR introduces CapsLockX 2.0 as a Rust rewrite (macOS-first) and adds major AI/voice features (STT, TTS, brainstorm agent), plus new platform adapters, tooling, docs, and CI/release support.

Changes:

  • Adds new Rust core capabilities: voice pipeline (local+cloud STT, correction), TTS fallback chain, agent/tools, background task manager, and audio capture.
  • Adds macOS adapter implementation (CGEventTap hook, tray + prefs UI, voice overlay/capture/system audio).
  • Updates packaging/CI: npm launcher downloads per-platform binaries; builds macOS in CI/release; expands documentation and plans.

Reviewed changes

Copilot reviewed 72 out of 93 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tmp/capslockx-mac | Adds submodule pointer for macOS adapter snapshot |
| tmp/brainstorm | Adds submodule pointer for brainstorm snapshot |
| rs/test-results.txt | Adds recorded manual test results artifact |
| rs/test-manual.ahk | Adds manual QA helper for Shift+HJKL selection |
| rs/dev-watch.sh | Adds cargo-watch script for auto rebuild/restart |
| rs/core/src/tts.rs | Introduces multi-tier TTS fallback implementation |
| rs/core/src/task_manager.rs | Adds background task manager with timeouts |
| rs/core/src/stt_corrector.rs | Adds incremental LLM-based STT correction |
| rs/core/src/state.rs | Extends config with STT/brainstorm/LLM settings |
| rs/core/src/platform.rs | Extends platform trait (audio, text input, mods) |
| rs/core/src/modules/voice_player.html | Adds HTML utility to play voice notes with subtitles |
| rs/core/src/modules/mouse.rs | Updates mouse/scroll phases & adds shift-scroll behavior |
| rs/core/src/modules/mod.rs | Adds brainstorm + voice modules and wiring |
| rs/core/src/modules/edit.rs | Adds atomic modifier-aware key tapping for macOS |
| rs/core/src/local_whisper.rs | Adds local whisper.cpp wrapper with autoscaling |
| rs/core/src/local_sherpa.rs | Adds local SenseVoice wrapper with auto-download |
| rs/core/src/lib.rs | Exposes new core modules (agent, STT, TTS, etc.) |
| rs/core/src/engine.rs | Improves trigger bypass logic + held-key syncing |
| rs/core/src/cloud_stt.rs | Adds Gemini cloud STT transcription helper |
| rs/core/src/bin/test-llm.rs | Adds quick LLM client test binary |
| rs/core/src/bin/test-agent.rs | Adds agent tool test runner binary |
| rs/core/src/bin/stt-server.rs | Adds persistent local STT server binary |
| rs/core/src/bin/stt-quick.rs | Adds one-shot STT helper binary |
| rs/core/src/bin/stt-compare.rs | Adds local vs cloud STT benchmark/compare tool |
| rs/core/src/bin/stt-bench.rs | Adds SenseVoice vs Whisper benchmark with WER/CER |
| rs/core/src/bin/sherpa-test.rs | Adds standalone mic capture + SenseVoice test binary |
| rs/core/src/bin/clx-agent.rs | Adds standalone CLI agent chat tool |
| rs/core/src/audio_capture.rs | Adds cross-platform mic capture via cpal |
| rs/core/src/acc_model.rs | Adds option to drive ticks externally (hook thread) |
| rs/core/Cargo.toml | Adds dependencies for audio/STT/TTS/JS/math engines |
| rs/adapters/windows/src/output.rs | Updates close_tab to cmd-or-ctrl helper |
| rs/adapters/windows/src/hook.rs | Drives AccModel ticks via SetTimer on hook thread |
| rs/adapters/macos/src/tray.rs | Adds macOS tray icon via ObjC FFI |
| rs/adapters/macos/src/prefs_html.html | Adds macOS preferences UI (WKWebView content) |
| rs/adapters/macos/src/main.rs | Adds macOS adapter entry point |
| rs/adapters/macos/src/key_map.rs | Adds macOS keycode mapping |
| rs/adapters/macos/src/hook.rs | Adds CGEventTap hook w/ pass-through/suppress logic |
| rs/adapters/macos/src/config_store.rs | Adds persistent config store for macOS |
| rs/adapters/macos/src/bin/voice-standalone.rs | Adds standalone voice pipeline binary |
| rs/adapters/macos/src/bin/test-vpio.rs | Adds VoiceProcessingIO AEC test binary |
| rs/adapters/macos/src/bin/test-cycle.rs | Adds direct window cycling stability test |
| rs/adapters/macos/build.rs | Links required macOS frameworks |
| rs/adapters/macos/Cargo.toml | Adds macOS adapter crate |
| rs/adapters/linux/src/output.rs | Updates close_tab to cmd-or-ctrl helper |
| rs/adapters/browser/www/vite.config.js | Adds dev server config for browser adapter |
| rs/adapters/browser/www/sherpa/sherpa-onnx-vad.js | Adds sherpa VAD JS helper |
| rs/adapters/browser/www/sherpa/.gitignore | Ignores wasm/data model artifacts |
| rs/adapters/browser/www/package.json | Adds browser adapter web package config |
| rs/adapters/browser/www/.gitignore | Ignores node_modules |
| rs/adapters/browser/src/platform.rs | Updates close_tab to cmd-or-ctrl helper |
| rs/Cargo.toml | Adds macOS adapter to workspace |
| plan/voice-input/voice-modes.md | Adds voice modes design notes |
| plan/voice-input/TODO.md | Adds voice implementation plan checklist |
| plan/voice-input/README.md | Adds voice feature spec |
| package.json | Switches npm bin to node launcher & includes new artifacts |
| bin/capslockx.mjs | Adds cross-platform launcher/downloader script |
| TODO.md | Adds voice feature plan reference + macOS drag issue note |
| .playwright-cli/page-2026-03-18T17-55-02-733Z.yml | Adds Playwright snapshot artifact |
| .playwright-cli/page-2026-03-18T17-53-48-281Z.yml | Adds Playwright snapshot artifact |
| .playwright-cli/page-2026-03-18T17-51-22-570Z.yml | Adds Playwright snapshot artifact |
| .playwright-cli/page-2026-03-18T17-50-03-292Z.yml | Adds Playwright snapshot artifact |
| .playwright-cli/page-2026-03-18T17-49-54-727Z.yml | Adds Playwright snapshot artifact |
| .playwright-cli/page-2026-03-18T17-49-44-295Z.yml | Adds Playwright snapshot artifact |
| .github/workflows/release-rust.yml | Adds macOS release build + alters artifact matching behavior |
| .github/workflows/ci-rust.yml | Adds macOS CI build/test/clippy |
| docs/dev/window-cycle-stability.md | Documents window cycling stability fixes |
| docs/dev/modifier-bypass.md | Documents modifier+Space bypass design |
| docs/dev/dual-track-stt.md | Documents dual-track STT architecture |
| docs/dev/agent-test-matrix.md | Documents agent tool test matrix |
| docs/Roadmap.md | Rewrites roadmap for Rust v2.0 scope/plans |

Files not reviewed (1)
  • rs/adapters/browser/www/package-lock.json: Language not supported
Comments suppressed due to low confidence (15)

rs/core/src/modules/mouse.rs:1

  • Horizontal scrolling is applied multiple times: dx scrolls inside both branches and then again unconditionally on line 139. This will double-scroll whenever dx != 0 (and also double-scroll in Shift mode). Remove the unconditional if dx != 0 { p.scroll_h(dx * 3); } or restructure so each axis is emitted exactly once per call.
    rs/core/src/platform.rs:1
  • The implementation always uses LCtrl, but the docstring says it should use Cmd on macOS. This will break common macOS shortcuts (e.g., Cmd+W for close tab). Make the modifier conditional on target_os (use KeyCode::LWin on macOS; KeyCode::LCtrl elsewhere) or provide a platform override for macOS and keep the default consistent with its behavior.
    rs/core/src/task_manager.rs:1
  • task_kill sets a flag and updates status, but the running task never observes kill, so nothing is actually stopped. To make kill functional, run tasks cooperatively by passing the Arc<AtomicBool> (or a cancellation token/channel) into func, and ensure the task checks it periodically and exits early when set; alternatively, remove the public kill API and clearly indicate tasks are non-cancellable.
    rs/adapters/browser/www/vite.config.js:1
  • The dev server binds to all interfaces (0.0.0.0) and requires TLS keys from hard-coded /tmp paths. This is easy to misconfigure and can unintentionally expose a dev server on a LAN. Prefer defaulting host to 127.0.0.1 and sourcing key/cert paths (or HTTPS enablement) from environment variables with safe fallbacks.
    rs/core/src/local_whisper.rs:1
  • The module-level docs say upgrade happens when inference is >1.25x realtime, but UPGRADE_THRESHOLD is 5.0 (and the code uses a 'budget ratio' derived from wall-available time). Update the documentation to match the actual scaling heuristic to avoid misleading tuning/expectations.
    rs/core/src/tts.rs:1
  • Invalid base64 characters are silently treated as 0 via unwrap_or(0), which can corrupt audio output without surfacing an error. Return an error when a character is not found in the alphabet (except = padding), or use a well-tested base64 crate to decode and validate input.
    rs/core/src/tts.rs:1
  • Audio is written to fixed filenames in /tmp. Concurrent or overlapping speak() calls can race and overwrite each other's output, producing wrong audio playback. Use unique temp paths (e.g., include PID + timestamp/random suffix) and consider deleting the temp file after playback completes.
    rs/core/src/platform.rs:1
  • The default type_text silently drops many common characters (e.g., '-', '=', ',', '/', ';', '''), which can corrupt typed STT output or user prompts. Either extend the mapping to cover a complete US-ANSI punctuation set, or change the fallback behavior to something lossless (e.g., clipboard paste) when encountering an unmapped character.
    rs/dev-watch.sh:1
  • pkill -f "target/release/capslockx" can match and kill unrelated processes whose command lines happen to contain that substring. Consider using a PID file, pkill -x with an exact process name when possible, or filtering by the full path to the spawned binary to avoid collateral termination.
    rs/test-results.txt:1
  • The PR description marks the test plan items as completed, but the committed rs/test-results.txt shows failing cases (5/8 passed). If this file is meant as an authoritative test artifact, it should reflect a passing run (or be excluded from the PR) to avoid conflicting signals about readiness.
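
The task_manager.rs comment above asks for cooperative cancellation. A minimal sketch of that idea follows, assuming a thread-per-task model. `TaskHandle`, `spawn_task`, `kill_and_join`, and `counting_task` are hypothetical names for illustration and are not taken from the PR.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical handle; the real task_manager types are not shown in the review.
pub struct TaskHandle {
    kill: Arc<AtomicBool>,
    join: JoinHandle<u32>,
}

/// Spawn a task whose closure receives the shared kill flag,
/// so the task itself can observe a kill request.
pub fn spawn_task<F>(func: F) -> TaskHandle
where
    F: FnOnce(Arc<AtomicBool>) -> u32 + Send + 'static,
{
    let kill = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&kill);
    let join = thread::spawn(move || func(flag));
    TaskHandle { kill, join }
}

impl TaskHandle {
    /// Set the kill flag, then wait for the task to exit and return its result.
    pub fn kill_and_join(self) -> u32 {
        self.kill.store(true, Ordering::SeqCst);
        self.join.join().expect("task panicked")
    }
}

/// Example task: counts 1 ms steps until the flag is set
/// (bounded at 10_000 steps so it always terminates).
pub fn counting_task(kill: Arc<AtomicBool>) -> u32 {
    let mut steps = 0;
    while !kill.load(Ordering::SeqCst) && steps < 10_000 {
        steps += 1;
        thread::sleep(Duration::from_millis(1));
    }
    steps
}
```

Calling `spawn_task(counting_task)` and later `kill_and_join()` on the returned handle stops the loop at its next flag check. A task that blocks for long periods without polling the flag would still not be interruptible under this scheme.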

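The two tts.rs comments above (invalid base64 input silently decoded as 0, and fixed /tmp filenames) can be sketched as follows. This is an illustration using only the standard library, not the code in the PR. `decode_base64_strict`, `unique_temp_audio_path`, and the `clx-tts` filename prefix are hypothetical.

```rust
use std::path::PathBuf;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Decode standard base64, returning an error for any byte outside the
/// alphabet instead of treating it as 0. Only trailing '=' padding is accepted.
pub fn decode_base64_strict(input: &str) -> Result<Vec<u8>, String> {
    let body = input.trim_end_matches('=');
    if input.len() - body.len() > 2 {
        return Err("too much padding".to_string());
    }
    if body.len() % 4 == 1 {
        return Err("truncated input".to_string());
    }
    let mut out = Vec::with_capacity(body.len() * 3 / 4);
    let mut acc: u32 = 0; // bit accumulator
    let mut bits = 0; // number of pending bits in `acc`
    for (i, b) in body.bytes().enumerate() {
        let v = ALPHABET
            .iter()
            .position(|&c| c == b)
            .ok_or_else(|| format!("invalid base64 byte {:#04x} at index {}", b, i))?;
        acc = (acc << 6) | v as u32;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
            acc &= (1 << bits) - 1; // keep only the bits not yet emitted
        }
    }
    Ok(out)
}

/// Per-process counter so two calls in the same nanosecond still differ.
static COUNTER: AtomicU64 = AtomicU64::new(0);

/// Build a temp path from PID + timestamp + counter, so overlapping
/// speak() calls would not write to the same file.
pub fn unique_temp_audio_path(ext: &str) -> PathBuf {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0);
    let seq = COUNTER.fetch_add(1, Ordering::Relaxed);
    std::env::temp_dir().join(format!(
        "clx-tts-{}-{}-{}.{}",
        std::process::id(),
        nanos,
        seq,
        ext
    ))
}
```

The caller would still need to delete the file after playback completes, as the comment suggests. A maintained base64 crate would be the alternative to a hand-rolled decoder.
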

Comment on lines +96 to +105
# Commit binary to main so npm package includes it.
- name: Commit macOS binary to main
  run: |
    git config user.name "github-actions[bot]"
    git config user.email "github-actions[bot]@users.noreply.github.com"
    git pull origin main --rebase
    git add capslockx-macos-arm64
    git diff --cached --quiet || git commit -m "chore: update macOS binary for ${{ github.ref_name }}"
    git push origin main

Copilot AI Mar 21, 2026

Pushing release-built binaries back to main from a release workflow is risky (mutates default branch, can create unexpected CI loops, merge conflicts, and provenance issues). Prefer attaching artifacts only to GitHub Releases and publishing npm packages from the release artifacts (or a dedicated distribution branch), rather than committing binaries into source control.

Suggested change
# Commit binary to main so npm package includes it.
- name: Commit macOS binary to main
  run: |
    git config user.name "github-actions[bot]"
    git config user.email "github-actions[bot]@users.noreply.github.com"
    git pull origin main --rebase
    git add capslockx-macos-arm64
    git diff --cached --quiet || git commit -m "chore: update macOS binary for ${{ github.ref_name }}"
    git push origin main

@snomiao snomiao enabled auto-merge March 21, 2026 11:42
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@snomiao snomiao merged commit 1249420 into main Mar 21, 2026
3 of 6 checks passed
@snomiao snomiao deleted the beta branch March 21, 2026 11:43