
Python SDK azure-cognitiveservices-speech: SpeechSynthesizer fails with 401 when using EntraID and custom subdomain (identical SpeechConfig succeeds with SpeechRecognizer) #2890

@epopisces

Description


Referred here from Azure SDK for Python issue #42206 as the correct place to raise this issue.

  • Package Name: azure-cognitiveservices-speech
  • Package Version: 1.45.0
  • Operating System: Windows 11
  • Python Version: 3.10.8

Describe the bug
When using the Python Speech SDK with:

  • speech services of an Azure AI Services resource
  • a custom subdomain
  • Entra ID authentication via DefaultAzureCredential() after az login; the same results are reproducible with AzureCliCredential(), which is used in the reproduction code below for specificity
  • an account that has the Cognitive Services User role on the Azure AI Services resource

Speech to text operations with SpeechRecognizer work as expected with SpeechConfig's token_credential and endpoint arguments populated; specifically tested with recognize_once_async().

Text to Speech operations with SpeechSynthesizer and an identical SpeechConfig fail for each method attempted. The error received in result.cancellation_details.error_details is:

WebSocket upgrade failed: Authentication error (401). Please check subscription information and region name. USP state: Sending. Received audio size: 0 bytes.

To Reproduce
Steps to reproduce the behavior:

Given an Azure AI Services resource with Entra ID authentication enabled and a custom subdomain named 'this':

  1. Authenticate with az login: az login --scope https://cognitiveservices.azure.com/.default
  2. Run the following code:
import azure.cognitiveservices.speech as speechsdk
from azure.identity import AzureCliCredential

credential = AzureCliCredential()

subdomain = "this"
endpoint = f'https://{subdomain}.cognitiveservices.azure.com'
text = "Hello, this is a test"

speech_config = speechsdk.SpeechConfig(token_credential=credential, endpoint=endpoint)
speech_config.speech_synthesis_language = "en-US"
speech_config.speech_synthesis_voice_name = 'en-US-AriaNeural'

audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)

speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

result = speech_synthesizer.speak_text_async(text).get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print(f"Speech synthesized for text {text}")
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print(f"Speech synthesis canceled: {cancellation_details.reason}")
    if cancellation_details.reason == speechsdk.CancellationReason.Error:
        if cancellation_details.error_details:
            print(f"Error details: {cancellation_details.error_details}")

The resulting error is:

WebSocket upgrade failed: Authentication error (401). Please check subscription information and region name. USP state: Sending. Received audio size: 0 bytes.

The same error occurs with every method I tried, including:

  • get_voices_async()
  • speak_text()
  • speak_text_async()
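As a possible interim workaround (an assumption on my part, based on the documented `aad#<resource ID>#<access token>` authorization-token format for Entra ID with a custom subdomain), the synthesizer could be handed a pre-built authorization token via `speech_config.authorization_token` instead of the credential. The resource ID and access token below are placeholders; in practice the token would come from the same credential via `AzureCliCredential().get_token("https://cognitiveservices.azure.com/.default").token`:

```python
# Hedged sketch of the documented "aad#" authorization-token format for
# Entra ID with a custom subdomain. All values below are placeholders.
def build_speech_authorization_token(resource_id: str, aad_access_token: str) -> str:
    # The Speech service expects "aad#<Azure resource ID>#<Entra ID access token>".
    return f"aad#{resource_id}#{aad_access_token}"

# Placeholder inputs. In practice:
#   resource_id is the full Azure resource ID of the AI Services resource, and
#   aad_access_token comes from AzureCliCredential().get_token(
#       "https://cognitiveservices.azure.com/.default").token
resource_id = "<resource-id>"
aad_access_token = "<access-token>"

auth_token = build_speech_authorization_token(resource_id, aad_access_token)
# speech_config.authorization_token = auth_token  # instead of token_credential
```

Whether this path hits the same 401 on the WebSocket upgrade would tell you if the bug is in the SDK's token_credential handling or in the service itself.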

Expected behavior
The following works as expected, using the same az login session, the same host running the script, and the same Azure AI Services resource:

  1. Authenticate with az login: az login --scope https://cognitiveservices.azure.com/.default
  2. Run the following code:
import azure.cognitiveservices.speech as speechsdk
from azure.identity import AzureCliCredential

credential = AzureCliCredential()

subdomain = "this"
endpoint = f'https://{subdomain}.cognitiveservices.azure.com'
filename = "Sample.wav"

speech_config = speechsdk.SpeechConfig(token_credential=credential, endpoint=endpoint)
speech_config.speech_recognition_language = "en-US"

audio_config = speechsdk.audio.AudioConfig(filename=filename)

speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

result = speech_recognizer.recognize_once_async().get()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(f"Recognized: Text={result.text}")
elif result.reason == speechsdk.ResultReason.NoMatch:
    print(f"No speech could be recognized: {result.no_match_details}")
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print(f"Speech Recognition canceled: {cancellation_details.reason}")
    if cancellation_details.reason == speechsdk.CancellationReason.Error:
        print(f"Error details: {cancellation_details.error_details}")

This results in:
Recognized: Text=My sample text.
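For anyone debugging this, one low-tech way to confirm both paths are handed the same valid credential is to inspect the (unverified) claims of the access token, e.g. the `aud` claim, which should be `https://cognitiveservices.azure.com`. A stdlib-only sketch; the real token acquisition is shown as a comment and assumed to come from azure.identity, while the token below is an illustrative fake:

```python
import base64
import json

def jwt_claims(token: str) -> dict:
    # Decode the middle (payload) segment of a JWT without verifying the
    # signature -- enough to eyeball claims like "aud" when debugging auth.
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))

# In practice the token would come from the same credential used above:
#   token = AzureCliCredential().get_token(
#       "https://cognitiveservices.azure.com/.default").token
# Illustrative fake token (header.payload.signature) for this sketch:
fake_payload = base64.urlsafe_b64encode(
    json.dumps({"aud": "https://cognitiveservices.azure.com"}).encode()
).decode().rstrip("=")
token = f"eyJhbGciOiJub25lIn0.{fake_payload}.sig"

print(jwt_claims(token)["aud"])  # → https://cognitiveservices.azure.com
```

If the real token decodes with the expected audience and expiry, that points the finger at how SpeechSynthesizer passes the credential on the WebSocket upgrade rather than at the token itself.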

Additional context

This is blocking use of Entra ID with text to speech services. Entra ID authentication is recommended by Microsoft (despite being grossly underrepresented in documentation and examples), and it would be great if it could be used consistently across services.

Labels

accepted (issue moved to product team backlog; will be closed when addressed), service-side issue, text-to-speech
