[FEATURE] Text-to-Speech support (RubyLLM.speak) #651

@salidux

Description

Scope check

  • This is core LLM communication (not application logic)
  • This benefits most users (not just my use case)
  • This can't be solved in application code with current RubyLLM
  • I read the Contributing Guide

Due diligence

  • I searched existing issues
  • I checked the documentation

What problem does this solve?

RubyLLM provides a beautiful, unified interface for LLM capabilities — chat, paint, embed, transcribe. But audio is only half the story: we can turn speech into text (transcribe), but not text into speech.

Developers building voice-enabled apps, accessibility features, or content pipelines currently have to drop out of RubyLLM's DSL to wire up TTS manually — choosing an HTTP client, handling binary responses, managing provider auth separately. This breaks the "one gem, consistent interface" experience that makes RubyLLM great.

This issue proposes two related features:

  1. RubyLLM.speak — core TTS API (primary focus)
  2. SSML Builder DSL — Ruby DSL for building SSML documents (future phase, inspired by RubyLLM::Schema)

Proposed solution

API

Simple usage:

speech = RubyLLM.speak("Hello, world!")
speech.save("hello.mp3")

With options:

speech = RubyLLM.speak("Hello, world!", model: "tts-1-hd", format: "wav")
speech.save("hello.wav")

Per-call context:

context = RubyLLM.context { |c| c.openai_api_key = "sk-..." }
speech = context.speak("Hello!")

Response Object: RubyLLM::Speech

Following the pattern of Transcription, Image, Embedding:

class RubyLLM::Speech
  attr_reader :data      # Audio bytes (binary string)
  attr_reader :model     # Model ID used
  attr_reader :format    # Output format (mp3, wav, aac, flac, pcm)

  def save(path)         # Write audio to file
  def to_blob            # Raw binary data
  def mime_type          # e.g. "audio/mpeg"
end
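
A fuller sketch of how that class could be implemented (the MIME mapping and the initializer signature below are illustrative assumptions, not a settled API):

```ruby
# Sketch of a possible RubyLLM::Speech implementation.
# The format-to-MIME mapping is an assumption for illustration.
module RubyLLM
  class Speech
    MIME_TYPES = {
      "mp3"  => "audio/mpeg",
      "wav"  => "audio/wav",
      "aac"  => "audio/aac",
      "flac" => "audio/flac",
      "pcm"  => "audio/pcm"
    }.freeze

    attr_reader :data, :model, :format

    def initialize(data:, model:, format:)
      @data = data
      @model = model
      @format = format
    end

    # Binary write, so audio bytes aren't mangled by encoding conversion
    def save(path)
      File.binwrite(path, data)
      path
    end

    def to_blob
      data
    end

    def mime_type
      MIME_TYPES.fetch(format, "application/octet-stream")
    end
  end
end
```

Using `File.binwrite` (rather than `File.write`) matters here: the payload is raw audio bytes, and a text-mode write could corrupt them on some platforms.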

Provider Examples

OpenAI (/v1/audio/speech):

# lib/ruby_llm/providers/openai/speech.rb
module RubyLLM::Providers::OpenAI::Speech
  def speak(text, model:, format: "mp3", **options)
    response = connection.post("/v1/audio/speech", {
      model: model,
      input: text,
      voice: "alloy",          # Single default voice for now
      response_format: format
    })

    { audio_data: response.body, format: format }
  end
end

Azure (/cognitiveservices/v1):

# lib/ruby_llm/providers/azure/speech.rb
module RubyLLM::Providers::Azure::Speech
  def speak(input, model:, format: "mp3", **options)
    ssml = ssml?(input) ? input : wrap_in_ssml(input, voice: "en-US-AvaMultilingualNeural")

    response = connection.post("cognitiveservices/v1") do |req|
      req.headers["Content-Type"] = "application/ssml+xml"
      req.headers["X-Microsoft-OutputFormat"] = audio_format(format)
      req.headers["Ocp-Apim-Subscription-Key"] = config.azure_speech_api_key
      req.body = ssml
    end

    { audio_data: response.body, format: format }
  end

  private

  def ssml?(input)
    input.strip.start_with?("<speak")
  end

  def wrap_in_ssml(text, voice:)
    <<~SSML
      <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
        <voice name="#{voice}">#{text}</voice>
      </speak>
    SSML
  end
end

Configuration

RubyLLM.configure do |config|
  config.default_speech_model = "tts-1"  # New config attribute
end
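
The configuration change itself could be as small as the following (sketched against a simplified `Configuration`; the real class has many more attributes):

```ruby
# Hypothetical addition to RubyLLM::Configuration: one new attribute
# with a sensible default, following the existing default_*_model pattern.
module RubyLLM
  class Configuration
    attr_accessor :default_speech_model

    def initialize
      @default_speech_model = "tts-1"
    end
  end
end
```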

Files to Add/Modify

| File | Change |
|------|--------|
| lib/ruby_llm.rb | Add self.speak method |
| lib/ruby_llm/speech.rb | New Speech class |
| lib/ruby_llm/configuration.rb | Add default_speech_model |
| lib/ruby_llm/providers/openai.rb | Include Speech module |
| lib/ruby_llm/providers/openai/speech.rb | New provider implementation |
| lib/ruby_llm/providers/openai/capabilities.rb | Add speech: true |
| lib/ruby_llm/providers/azure/speech.rb | New provider implementation |
| Model registry | Register TTS models (tts-1, tts-1-hd) |

Why this belongs in RubyLLM

TTS isn't a simple API wrapper you'd write in application code. It requires:

  • Model resolution across providers — the same model name might map to different endpoints on OpenAI vs Azure vs Google. RubyLLM's Models.resolve already handles this.
  • Provider abstraction — each TTS provider has different auth mechanisms, endpoints, request/response formats, and audio output options. App code shouldn't know these details.
  • Configuration management — API keys, defaults, per-call overrides. RubyLLM's Configuration and Context system already solves this.
  • Binary response handling — TTS returns audio bytes, not JSON. This needs different connection/parsing logic that belongs in the provider layer.
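
The flow those four points describe can be sketched end to end with stubs (nothing below is RubyLLM's real internals — the registry hash stands in for Models.resolve, and the provider is a fake that returns canned bytes):

```ruby
# Self-contained sketch: model lookup -> provider call -> Speech wrapping.
# StubProvider mimics the return shape of the proposed OpenAI provider module.
module TTSSketch
  Speech = Struct.new(:data, :model, :format, keyword_init: true) do
    def save(path) = File.binwrite(path, data)
  end

  class StubProvider
    def speak(text, model:, format: "mp3", **_options)
      { audio_data: "FAKEBYTES:#{text}", format: format }
    end
  end

  # Stand-in for Models.resolve: maps a model ID to a provider instance
  PROVIDERS = { "tts-1" => StubProvider.new }.freeze

  def self.speak(text, model: "tts-1", format: "mp3", **options)
    provider = PROVIDERS.fetch(model)
    payload = provider.speak(text, model: model, format: format, **options)
    Speech.new(data: payload[:audio_data], model: model, format: format)
  end
end
```

The point of the sketch: app code only ever sees `speak(text) -> Speech`; everything provider-specific (auth, endpoints, binary parsing) stays behind the registry lookup.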

Most importantly: transcribe (audio → text) is already in RubyLLM. speak (text → audio) is its natural counterpart. Leaving it out means developers use RubyLLM for 90% of their LLM needs but have to roll their own for this one capability — exactly the fragmentation RubyLLM was built to eliminate.

Non-Goals (for initial PR)

  • Multiple voices / voice selection — each provider uses a single sensible default voice for now
  • SSML support — separate issue
  • Streaming audio — can layer on later
  • Other providers beyond OpenAI + Azure — can be added incrementally
