## Description
### Scope check

- [x] This is core LLM communication (not application logic)
- [x] This benefits most users (not just my use case)
- [x] This can't be solved in application code with current RubyLLM
- [x] I read the Contributing Guide

### Due diligence

- [x] I searched existing issues
- [x] I checked the documentation
## What problem does this solve?
RubyLLM provides a beautiful, unified interface for LLM capabilities — `chat`, `paint`, `embed`, `transcribe`. But audio is only half the story: we can turn speech into text (`transcribe`), but not text into speech.
Developers building voice-enabled apps, accessibility features, or content pipelines currently have to drop out of RubyLLM's DSL to wire up TTS manually — choosing an HTTP client, handling binary responses, managing provider auth separately. This breaks the "one gem, consistent interface" experience that makes RubyLLM great.
This issue proposes two related features:

1. `RubyLLM.speak` — core TTS API (primary focus)
2. SSML Builder DSL — a Ruby DSL for building SSML documents (future phase, inspired by `RubyLLM::Schema`)
## Proposed solution

### API
Simple usage:

```ruby
speech = RubyLLM.speak("Hello, world!")
speech.save("hello.mp3")
```

With options:

```ruby
speech = RubyLLM.speak("Hello, world!", model: "tts-1-hd", format: "wav")
speech.save("hello.wav")
```

Per-call context:

```ruby
context = RubyLLM.context { |c| c.openai_api_key = "sk-..." }
speech = context.speak("Hello!")
```

### Response Object: `RubyLLM::Speech`
Following the pattern of `Transcription`, `Image`, and `Embedding`:

```ruby
class RubyLLM::Speech
  attr_reader :data   # Audio bytes (binary string)
  attr_reader :model  # Model ID used
  attr_reader :format # Output format (mp3, wav, aac, flac, pcm)

  def save(path)  # Write audio to file
  end

  def to_blob     # Raw binary data
  end

  def mime_type   # e.g. "audio/mpeg"
  end
end
```

### Provider Examples
OpenAI (`/v1/audio/speech`):

```ruby
# lib/ruby_llm/providers/openai/speech.rb
module RubyLLM::Providers::OpenAI::Speech
  def speak(text, model:, format: "mp3", **options)
    response = connection.post("/v1/audio/speech", {
      model: model,
      input: text,
      voice: "alloy", # Single default voice for now
      response_format: format
    })
    { audio_data: response.body, format: format }
  end
end
```

Azure (`/cognitiveservices/v1`):
```ruby
# lib/ruby_llm/providers/azure/speech.rb
module RubyLLM::Providers::Azure::Speech
  def speak(input, model:, format: "mp3", **options)
    ssml = ssml?(input) ? input : wrap_in_ssml(input, voice: "en-US-AvaMultilingualNeural")
    response = connection.post("cognitiveservices/v1") do |req|
      req.headers["Content-Type"] = "application/ssml+xml"
      req.headers["X-Microsoft-OutputFormat"] = audio_format(format)
      req.headers["Ocp-Apim-Subscription-Key"] = config.azure_speech_api_key
      req.body = ssml
    end
    { audio_data: response.body, format: format }
  end

  private

  def ssml?(input)
    input.strip.start_with?("<speak")
  end

  def wrap_in_ssml(text, voice:)
    <<~SSML
      <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
        <voice name="#{voice}">#{text}</voice>
      </speak>
    SSML
  end
end
```

### Configuration
```ruby
RubyLLM.configure do |config|
  config.default_speech_model = "tts-1" # New config attribute
end
```

### Files to Add/Modify

| File | Change |
|---|---|
| `lib/ruby_llm.rb` | Add `self.speak` method |
| `lib/ruby_llm/speech.rb` | New `Speech` class |
| `lib/ruby_llm/configuration.rb` | Add `default_speech_model` |
| `lib/ruby_llm/providers/openai.rb` | Include `Speech` module |
| `lib/ruby_llm/providers/openai/speech.rb` | New provider implementation |
| `lib/ruby_llm/providers/openai/capabilities.rb` | Add `speech: true` |
| `lib/ruby_llm/providers/azure/speech.rb` | New provider implementation |
| Model registry | Register TTS models (`tts-1`, `tts-1-hd`) |
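To make the pieces above concrete, here is a minimal, self-contained sketch of how the top-level entry point could tie the default-model configuration, provider dispatch, and the `Speech` object together. Everything below is illustrative: `provider_speak`, the `MIME_TYPES` map, and the hard-coded default model are stand-ins for this sketch, not RubyLLM internals.

```ruby
# Illustrative sketch only -- provider_speak and MIME_TYPES are
# assumptions for demonstration, not actual RubyLLM source.
module RubyLLM
  class Speech
    MIME_TYPES = {
      "mp3"  => "audio/mpeg",
      "wav"  => "audio/wav",
      "aac"  => "audio/aac",
      "flac" => "audio/flac",
      "pcm"  => "application/octet-stream"
    }.freeze

    attr_reader :data, :model, :format

    def initialize(data:, model:, format:)
      @data = data
      @model = model
      @format = format
    end

    # Write the audio bytes to disk in binary mode; returns the path.
    def save(path)
      File.binwrite(path, data)
      path
    end

    def to_blob
      data
    end

    def mime_type
      MIME_TYPES.fetch(format, "application/octet-stream")
    end
  end

  # Hypothetical top-level entry point: fall back to the configured default
  # model, delegate to the provider layer, wrap the result in Speech.
  def self.speak(text, model: nil, format: "mp3", **options)
    model ||= "tts-1" # stand-in for config.default_speech_model
    result = provider_speak(text, model: model, format: format, **options)
    Speech.new(data: result[:audio_data], model: model, format: result[:format])
  end

  # Stub standing in for the provider modules sketched above.
  def self.provider_speak(text, model:, format:, **_options)
    { audio_data: "ID3#{text}".b, format: format }
  end
end
```

With the stub in place, `RubyLLM.speak("Hello").mime_type # => "audio/mpeg"`, matching the `format`-to-MIME mapping the `Speech` class advertises.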
## Why this belongs in RubyLLM

TTS isn't a simple API wrapper you'd write in application code. It requires:

- Model resolution across providers — the same model name might map to different endpoints on OpenAI vs Azure vs Google. RubyLLM's `Models.resolve` already handles this.
- Provider abstraction — each TTS provider has different auth mechanisms, endpoints, request/response formats, and audio output options. App code shouldn't know these details.
- Configuration management — API keys, defaults, per-call overrides. RubyLLM's `Configuration` and `Context` system already solves this.
- Binary response handling — TTS returns audio bytes, not JSON. This needs different connection/parsing logic that belongs in the provider layer.
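The binary-handling point is worth making concrete. This short plain-Ruby illustration (an assumption-free demo, not RubyLLM code) shows why audio responses must bypass text/JSON middleware and be written in binary mode:

```ruby
require "tempfile"

# MP3-like payload: the 0xFF sync byte can never appear in valid UTF-8,
# so treating the response body as text would flag or corrupt it.
fake_mp3 = [0xFF, 0xFB, 0x90, 0x64].pack("C*") + "payload"

fake_mp3.encoding # => Encoding::ASCII_8BIT -- raw bytes, not text

# Reinterpreting the bytes as UTF-8 yields an invalid string, which is
# exactly what a JSON-parsing response pipeline would choke on.
utf8_view = fake_mp3.dup.force_encoding(Encoding::UTF_8)
utf8_view.valid_encoding? # => false

# Persisting must use binary mode so the bytes round-trip untouched.
Tempfile.create(["speech", ".mp3"]) do |f|
  f.binmode
  f.write(fake_mp3)
  f.flush
  File.binread(f.path) == fake_mp3 # => true
end
```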
Most importantly: `transcribe` (audio → text) is already in RubyLLM. `speak` (text → audio) is its natural counterpart. Leaving it out means developers use RubyLLM for 90% of their LLM needs but have to roll their own for this one capability — exactly the fragmentation RubyLLM was built to eliminate.
## Non-Goals (for initial PR)
- Multiple voices / voice selection — each provider uses a single sensible default voice for now
- SSML support — separate issue
- Streaming audio — can layer on later
- Other providers beyond OpenAI + Azure — can be added incrementally
## Related
- Originated from discussion: Any plans for Text-to-Speech (TTS) support? #637
- SSML Builder DSL: [will open as separate issue]