TerminallyLazy
Contributor

🎤 Kokoro TTS Voice Selection Enhancement

This PR adds a voice selection system to Agent Zero's Kokoro TTS integration, replacing the hardcoded voice configuration with user-selectable voices, blending, and speed settings in a styled settings interface.

✨ Key Features

🌍 54+ High-Quality Voices

  • American English: 20 voices (af_heart, af_alloy, am_adam, am_onyx, etc.)
  • British English: 8 voices (bf_emma, bm_george, etc.)
  • Japanese: 5 voices (jf_rei, jm_kaito, etc.)
  • Chinese (Mandarin): 8 voices (cf_qianyun, cm_yunyang, etc.)
  • Spanish: 3 voices (sf_maria, sm_diego, etc.)
  • French, Hindi, Italian, Portuguese: Additional language support

🎛️ Advanced Voice Controls

  • Voice Blending: Mix two voices with adjustable ratios (0.0-1.0)
  • Speed Control: Adjust speech rate from 0.5x to 2.0x
  • Quality Grades: A/A-/B+/B/B- ratings for voice selection guidance
  • Rich Metadata: Gender, language, and descriptive information for each voice
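
One way to read the blend ratio is as a linear interpolation weight between two voice embeddings. The sketch below assumes that interpretation; `blend_voices` and the list-of-floats representation are hypothetical and are not taken from this PR or from Kokoro's API.

```python
from typing import List, Sequence


def blend_voices(primary: Sequence[float], secondary: Sequence[float], ratio: float) -> List[float]:
    """Linearly interpolate two equal-length voice embeddings.

    Assumption: ratio 0.0 keeps the primary voice unchanged and
    ratio 1.0 yields the secondary voice.
    """
    if len(primary) != len(secondary):
        raise ValueError("voice embeddings must have the same length")
    # Clamp to the 0.0-1.0 range offered by the blending control
    ratio = min(max(ratio, 0.0), 1.0)
    return [(1.0 - ratio) * p + ratio * s for p, s in zip(primary, secondary)]
```

With this reading, a ratio of 0.25 keeps three quarters of the primary voice and mixes in one quarter of the secondary.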

🎨 Professional UI/UX

  • Hover Tooltips: Detailed voice information with quality grades and descriptions
  • Agent Zero Design: Consistent styling with existing theme system
  • Cross-Browser Support: Enhanced dropdown styling for Chrome, Firefox, Safari
  • Light/Dark Mode: Full theming support with proper contrast
  • Responsive Design: Mobile-friendly voice selection interface

🔧 Technical Implementation

Backend Changes

  • settings.py:
    • Added comprehensive voice database with HuggingFace metadata
    • Enhanced TypedDict schema with new voice configuration fields
    • Implemented rich dropdown options with tooltip data
  • kokoro_tts.py:
    • Dynamic voice configuration from settings
    • Robust voice validation and error handling
    • Voice blending logic with fallback to single voice
    • Fixed "No blending" placeholder issue that caused 404 errors
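
As a rough illustration of the settings.py changes, voice metadata can be held in a `TypedDict` and turned into dropdown options. This is a minimal sketch: the names `VoiceInfo`, `SelectOption`, `VOICES`, and `build_voice_options` are hypothetical, and the gender values are assumed from the `f`/`m` letter in each voice ID.

```python
from typing import List, TypedDict


class VoiceInfo(TypedDict):
    id: str        # Kokoro voice identifier, e.g. "af_heart"
    language: str
    gender: str    # assumed from the second letter of the id
    grade: str     # quality grade: "A", "A-", "B+", "B", or "B-"


class SelectOption(TypedDict):
    value: str     # what is stored in settings
    label: str     # what the user sees in the dropdown


# Two illustrative entries; the PR's real database is not reproduced here.
VOICES: List[VoiceInfo] = [
    {"id": "af_heart", "language": "American English", "gender": "female", "grade": "A"},
    {"id": "bm_george", "language": "British English", "gender": "male", "grade": "A"},
]


def build_voice_options(voices: List[VoiceInfo]) -> List[SelectOption]:
    """Turn voice metadata into value/label pairs for a select field."""
    return [
        {
            "value": v["id"],
            "label": f'{v["id"]} ({v["language"]}, {v["gender"]}, grade {v["grade"]})',
        }
        for v in voices
    ]
```

Keeping the metadata separate from the option list means the same records can feed both the dropdown and the tooltips.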

Frontend Changes

  • speech-store.js: Added new voice settings to speech store model
  • settings.css: Professional tooltip system with animations and theming
  • index.html: Enhanced select templates with Alpine.js voice selection
  • index.css: Global dropdown consistency across browsers

🛡️ Error Handling & Validation

Voice Validation

# Prevent invalid voices from reaching the API
if not primary_voice or primary_voice in ["", "No blending"]:
    primary_voice = "af_alloy"  # Safe fallback

# Only blend when a valid secondary voice is selected
if secondary_voice and secondary_voice not in ["", "No blending"]:
    voice_string = f"{primary_voice},{secondary_voice}"

API Protection

  • Validates voice parameters before API calls
  • Graceful fallback to default voice on errors
  • Speed bounds checking (0.1x - 5.0x range, wider than the 0.5x to 2.0x range listed under Speed Control)
  • Prevents "No blending" from being passed to Kokoro API
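
The protections listed above could be combined into one helper that runs before every API call. This is a sketch, not the code in kokoro_tts.py: `resolve_voice_and_speed` is a hypothetical name, and the 1.0 fallback speed is an assumption.

```python
import math
from typing import Any, Optional, Tuple

DEFAULT_VOICE = "af_alloy"          # fallback voice named in this PR
PLACEHOLDERS = {"", "No blending"}  # UI values that must never reach the API
MIN_SPEED, MAX_SPEED = 0.1, 5.0     # bounds listed under API Protection
DEFAULT_SPEED = 1.0                 # assumption: not stated in the PR


def resolve_voice_and_speed(
    primary: Optional[str], secondary: Optional[str], speed: Any
) -> Tuple[str, float]:
    """Return a safe (voice_string, speed) pair for the TTS call."""
    # Replace a missing or placeholder primary voice with the default
    if not primary or primary in PLACEHOLDERS:
        primary = DEFAULT_VOICE

    # Blend only when a real secondary voice was chosen
    voice = primary
    if secondary and secondary not in PLACEHOLDERS:
        voice = f"{primary},{secondary}"

    # Coerce the speed to a float, falling back on unparseable or NaN input
    try:
        speed_value = float(speed)
    except (TypeError, ValueError):
        speed_value = DEFAULT_SPEED
    if math.isnan(speed_value):
        speed_value = DEFAULT_SPEED

    # Clamp into the allowed range
    return voice, min(max(speed_value, MIN_SPEED), MAX_SPEED)
```

For example, `resolve_voice_and_speed("", "No blending", 1.1)` falls back to the single default voice instead of passing the placeholder through.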

🎯 User Experience Improvements

Before: Hardcoded voices, no user control

_voice = "am_puck,am_onyx" # Fixed configuration
_speed = 1.1 # No user adjustment

After: Rich voice selection with metadata

  • Dropdown with 54+ voices and quality information
  • Professional tooltips: "Grade: A | High-quality female American voice | Optimal: 100-200 tokens"
  • Voice blending with visual ratio control
  • Speed adjustment with real-time feedback
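
The tooltip text quoted above follows a simple pipe-separated pattern. A minimal sketch of a formatter for it follows; `format_voice_tooltip` and its parameters are hypothetical names, and treating the token range as optional is an assumption.

```python
from typing import Optional


def format_voice_tooltip(grade: str, description: str, optimal_tokens: Optional[str] = None) -> str:
    """Build a tooltip such as 'Grade: A | <description> | Optimal: 100-200 tokens'."""
    parts = [f"Grade: {grade}", description]
    # The token range is appended only when one is supplied
    if optimal_tokens:
        parts.append(f"Optimal: {optimal_tokens} tokens")
    return " | ".join(parts)
```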

📱 Cross-Platform Compatibility

Browser Support

  • Chrome/Edge: Enhanced dropdown styling with custom arrows
  • Firefox: Specialized CSS for Mozilla rendering engine
  • Safari: WebKit-specific optimizations
  • Mobile: Touch-friendly controls and responsive layout

Theme Integration

  • Seamless light/dark mode transitions
  • Agent Zero color palette consistency
  • CSS custom properties for maintainable theming
  • Professional hover states and animations

🚀 Performance & Quality

Voice Quality Grades

  • Grade A: Premium voices (af_heart, am_adam, bm_george)
  • Grade A-: High-quality voices (af_alloy, am_onyx, bf_emma)
  • Grade B+: Good quality voices (af_bella, am_eric, cf_qianyun)
  • Grade B/B-: Standard quality voices for broader language support

Optimization

  • Lazy-loaded voice metadata
  • Efficient tooltip rendering
  • Minimal DOM manipulation
  • Settings persistence and migration

🔄 Backward Compatibility

  • Existing voice settings are preserved during migration
  • Default fallbacks ensure system continues working
  • Graceful handling of invalid or missing voice configurations
  • No breaking changes to existing TTS functionality
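
Preserving existing settings while adding new fields usually comes down to layering stored values over defaults. The sketch below shows that idea only. The key names, the default values other than `af_alloy`, and `migrate_settings` itself are hypothetical, since the PR does not list the real settings keys.

```python
from typing import Any, Dict

# Hypothetical setting names and defaults, for illustration only
DEFAULTS: Dict[str, Any] = {
    "tts_voice": "af_alloy",
    "tts_voice_secondary": "",
    "tts_blend_ratio": 0.5,
    "tts_speed": 1.0,
}


def migrate_settings(stored: Dict[str, Any]) -> Dict[str, Any]:
    """Fill in missing voice settings without overwriting stored values."""
    migrated = dict(DEFAULTS)  # copy so DEFAULTS is never mutated
    for key, value in stored.items():
        # Keep every stored value (including unrelated keys); skip explicit None
        if value is not None:
            migrated[key] = value
    return migrated
```

A settings file saved before this change would gain the new keys with their defaults, while anything the user had already set stays as it was.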

🧪 Testing Considerations

  • Voice selection dropdown functionality
  • Tooltip display and content accuracy
  • Voice blending with different combinations
  • Speed control validation and limits
  • Cross-browser dropdown styling
  • Light/dark mode theme switching
  • Settings persistence across sessions

This enhancement replaces Agent Zero's hardcoded TTS configuration with a user-facing voice selection interface, while keeping the system open source and customizable.

• Added 54+ Kokoro voices with rich metadata and quality grades
• Implemented voice selection dropdown with professional hover tooltips
• Added voice blending and speed control features
• Enhanced settings UI with Agent Zero design consistency
• Fixed voice validation to prevent API errors with invalid voices
• Updated TypedDict schema and speech store with new voice settings
• Added cross-browser dropdown styling with proper theming support