OpenAI Realtime API compatibility? #245


Open
vvolhejn opened this issue Apr 3, 2025 · 4 comments

@vvolhejn (Contributor) commented Apr 3, 2025

When used as an API server with websockets, FastRTC provides similar functionality to OpenAI's Realtime API.

How about designing the websocket protocol to match OpenAI's, so that FastRTC can be used as a drop-in replacement? This would make adoption very easy for people who are already using the Realtime API.

A similar strategy has worked well for vLLM in the text LLM space: https://github.com/vllm-project/vllm

Perhaps this could work by just taking a StreamHandler/AsyncStreamHandler and running a FastAPI server that formats the messages appropriately. Extra client messages could be passed as AdditionalOutputs; I'm not sure about the extra server messages, though.

@freddyaboulton (Collaborator) commented:

This is a great suggestion! I think changing the input audio messages to match input_audio_buffer.append and changing the output audio messages to match response.audio.delta is straightforward.

The tricky thing will be mapping the AdditionalOutputs of a handler to either the response.audio.transcript or input.audio.transcript events. Perhaps we just assume that if the output is {'role': "user", "content": ...} it corresponds to input.audio.transcript and same for the response.audio.transcript if the role is "assistant".
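That role-based convention could be sketched as a small mapping function. The function name is invented for illustration, and the event names are taken verbatim from this discussion rather than from a confirmed spec:

```python
def transcript_event_for(output: dict) -> dict:
    """Map an AdditionalOutputs-style chat message to a transcript event."""
    event_types = {
        "user": "input.audio.transcript",
        "assistant": "response.audio.transcript",
    }
    role = output.get("role")
    if role not in event_types:
        raise ValueError(f"cannot map role {role!r} to a transcript event")
    return {"type": event_types[role], "transcript": output["content"]}
```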

I'm not sure if we can support some events (like conversation.item.create) in the general sense. Some handlers may not even have a concept of a conversation.

@vvolhejn (Contributor, Author) commented Apr 3, 2025

> The tricky thing will be mapping the AdditionalOutputs of a handler to either the response.audio.transcript or input.audio.transcript events. Perhaps we just assume that if the output is {'role': "user", "content": ...} it corresponds to input.audio.transcript and same for the response.audio.transcript if the role is "assistant".

I think a better approach would be to have some subclass of AdditionalOutputs that's more structured and allows you to specify that information, perhaps even something like OpenAIRealtimeApiAdditionalOutput (a bit too long, though) that would literally allow you to send specific OpenAI events. Since no semantics are defined for how AdditionalOutputs should be interpreted, this seems easy to do.

It should then be pretty easy to create a function, to be used as additional_outputs_handler, that updates a Chatbot element based on a received OpenAIRealtimeApiAdditionalOutput.
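A minimal, self-contained sketch of what that could look like. All names here are invented for illustration, and the class stands alone rather than subclassing the real fastrtc.AdditionalOutputs, so nothing below should be read as FastRTC's actual API:

```python
from dataclasses import dataclass


@dataclass
class OpenAIRealtimeApiAdditionalOutput:
    """Carries a literal OpenAI Realtime event emitted by a handler."""
    event: dict  # e.g. {"type": "input.audio.transcript", "transcript": ...}


def update_chat_history(history: list, output: OpenAIRealtimeApiAdditionalOutput) -> list:
    """The kind of function one could pass as additional_outputs_handler:
    fold a transcript event into a messages-style chat history."""
    role_for = {
        "input.audio.transcript": "user",
        "response.audio.transcript": "assistant",
    }
    role = role_for.get(output.event.get("type"))
    if role is None:
        return history  # not a transcript event; leave the chat untouched
    return history + [{"role": role, "content": output.event["transcript"]}]
```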

> I'm not sure if we can support some events (like conversation.item.create) in the general sense. Some handlers may not even have a concept of a conversation.

That's a good point. I don't think it'd be possible to make every handler work as a drop-in OpenAI replacement, but it should be easy to create one if you want to. Ideally we would have some loud error if the client tries to send a message that's not supported, but we would need to somehow know what's supported and what isn't.
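One simple way to get that loud error, assuming each wrapper can declare an explicit allow-list of event types (the function name and the allow-list idea are assumptions, not anything FastRTC provides):

```python
def check_supported(event: dict, supported: frozenset) -> None:
    """Reject any client event whose type the wrapped handler doesn't support."""
    etype = event.get("type")
    if etype not in supported:
        raise ValueError(
            f"event type {etype!r} is not supported by this handler; "
            f"supported types: {sorted(supported)}"
        )
```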

A good test of whether it works would be using the server as a replacement in the "OpenAI Realtime Console" demo (specifically, the websocket version): https://github.com/openai/openai-realtime-console?tab=readme-ov-file

I'll play around with this and try to create some standalone code that would wrap an AsyncStreamHandler and then we can see if it could be integrated into FastRTC?

@freddyaboulton (Collaborator) commented:

> I think a better approach would be to have some subclass of AdditionalOutput that's more structured

Yes I agree. Perhaps it can be called RealtimeMessage. And you're right that we can map an instance of RealtimeMessage to a chatbot UI update in a straightforward manner.

> but it should be easy to create one if you want to. Ideally we would have some loud error if the client tries to send a message that's not supported, but we would need to somehow know what's supported and what isn't.

I think we can sidestep this for now.

> I'll play around with this and try to create some standalone code that would wrap an AsyncStreamHandler and then we can see if it could be integrated into FastRTC?

Awesome, really looking forward to seeing this!

@marcusvaltonen (Contributor) commented Apr 8, 2025

I would also be interested in having a custom version (possibly a subclass) of AdditionalOutputs to which you can add some custom logic, e.g. an AdditionalJSONOutputs where I can call json.dumps() directly on the object and send the data in a stream to the client.
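One way to sketch that idea: making the output a dict subclass means json.dumps() accepts the object directly, with no custom encoder. The class name comes from this comment, but the implementation below is just an illustration, not anything FastRTC ships:

```python
import json


class AdditionalJSONOutputs(dict):
    """A dict subclass: json.dumps() accepts the object directly, so a
    server loop can serialize and stream it to the client unchanged."""


output = AdditionalJSONOutputs(type="status", detail="warming up")
wire = json.dumps(output)  # no custom encoder needed
```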
