-
Notifications
You must be signed in to change notification settings - Fork 5k
Description
Library name and version
SemanticKernel 1.56.0
Describe the bug
We are short circuiting long running calls (past 5 seconds, used to be past 10) to our model deployment. A majority of the calls will land within 0-2 seconds, however some never come back. Strangely, they fit mostly within a certain request length bucket, of 4000-5000. It feels like this might be some weird hot path / sharding strategy based on token size or request length behind the scenes, but I'm not sure if the SDK is doing something strange with the requests or retrying behind the scenes as 2000~ tokens on a 4o-mini PTU instance should be really, really quick.
The SemanticKernel team pointed me here also. A support ticket has been raised async.
Expected behavior
Responses within 0-2 seconds.
Actual behavior
Responses exceeding 10 seconds.
Reproduction Steps
Stand up a regional PTU for 4o-mini and hit it with a request of a length that coincides with the bucket outlined above.
Environment
- .NET 9