ci: Add support for max_inflight_requests parameter to prevent unbounded memory growth in ensemble models
#8458
Conversation
Pull Request Overview
This PR adds support for a new `max_ensemble_inflight_responses` parameter for ensemble models, to prevent unbounded memory growth in scenarios with decoupled models and slow consumers.
- Implements backpressure mechanism to limit concurrent responses in ensemble pipelines
- Adds comprehensive test coverage including valid/invalid parameter validation
- Creates new test models for decoupled producer and slow consumer scenarios
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| qa/L0_simple_ensemble/test.sh | Adds backpressure testing logic and invalid parameter validation |
| qa/L0_simple_ensemble/models/slow_consumer/config.pbtxt | Configures Python backend model with intentional processing delay |
| qa/L0_simple_ensemble/models/slow_consumer/1/model.py | Implements model that adds 200ms delay per request to simulate slow processing |
| qa/L0_simple_ensemble/models/ensemble_enabled_max_inflight_responses/config.pbtxt | Ensemble configuration with backpressure parameter set to 4 |
| qa/L0_simple_ensemble/models/ensemble_disabled_max_inflight_responses/config.pbtxt | Baseline ensemble configuration without backpressure parameter |
| qa/L0_simple_ensemble/models/decoupled_producer/config.pbtxt | Configures decoupled Python model for multiple response generation |
| qa/L0_simple_ensemble/models/decoupled_producer/1/model.py | Implements decoupled model that produces N responses based on input value |
| qa/L0_simple_ensemble/ensemble_backpressure_test.py | Comprehensive test suite for backpressure functionality |
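For orientation, here is a minimal sketch of what the "enabled" ensemble configuration described in the table above might look like. This is an assumption pieced together from the file descriptions, not the PR's actual file: the parameter key follows the Copilot summary (`max_ensemble_inflight_responses`, later renamed), and the tensor names, dims, and step wiring are illustrative.

```
# Hypothetical config.pbtxt for the "enabled" ensemble; names and dims are
# illustrative, and the parameter key follows the summary above.
name: "ensemble_enabled_max_inflight_responses"
platform: "ensemble"
max_batch_size: 0
# Likely needed since the first step is decoupled.
model_transaction_policy { decoupled: true }
input [ { name: "IN", data_type: TYPE_INT32, dims: [ 1 ] } ]
output [ { name: "OUT", data_type: TYPE_INT32, dims: [ 1 ] } ]
parameters [
  { key: "max_ensemble_inflight_responses", value: { string_value: "4" } }
]
ensemble_scheduling {
  step [
    {
      model_name: "decoupled_producer"
      model_version: -1
      input_map { key: "IN", value: "IN" }
      output_map { key: "OUT", value: "produced" }
    },
    {
      model_name: "slow_consumer"
      model_version: -1
      input_map { key: "IN", value: "produced" }
      output_map { key: "OUT", value: "OUT" }
    }
  ]
}
```

With this layout, the producer can stream many responses per request into the slow consumer, which is exactly the situation the in-flight cap of 4 is meant to bound.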
Pull Request Overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Resolved review threads:
- qa/L0_simple_ensemble/backpressure_test_models/decoupled_producer/1/model.py
- ...imple_ensemble/backpressure_test_models/ensemble_disabled_max_inflight_requests/config.pbtxt
Excerpt from the documentation diff under review:

> each time with a new response. You can take a look at [grpc_server.cc](https://github.com/triton-inference-server/server/blob/main/src/grpc/grpc_server.cc)
>
> ### Knowing When a Decoupled Inference Request is Complete
>
> ### Using Decoupled Models in Ensembles
Is this only happening with decoupled models, or with any models that have a big processing-speed difference?
Based on my understanding, since a normal (non-decoupled) model step produces only one response, a slow processing step will automatically block the request. In that case, memory usage grows at a normal rate (only one response per step accumulates in memory, unlike with a decoupled model) and may not need any additional backpressure at the step level. To manage overall memory usage, a rate limiter could be sufficient.
Maybe I am thinking of a different scenario, or I am misunderstanding. Consider a simple two-step ensemble where the first step processes at a much faster speed. The client sends requests at a constant rate (e.g. one request per millisecond). Based on https://github.com/triton-inference-server/core/blob/b354d4dc13c5855b50a36eeec0d4d3aa443a01f3/src/ensemble_scheduler/ensemble_scheduler.cc#L1422, ScheduleSteps is non-blocking, so the queue size increases at a constant rate.
In your change, since ScheduleSteps is called both in EnsembleScheduler::Enqueue (the same thread as the HTTP/gRPC handler) and in ResponseComplete, in the above example it will also block the gRPC/HTTP handler and limit the total memory growth.
Thank you, you are correct. I have updated the code to apply backpressure only to the downstream steps (excluding step 0). This ensures that we do not block the EnsembleScheduler::Enqueue path (requests arriving on the HTTP/gRPC handler thread); backpressure is only applied in the ResponseComplete path.
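Conceptually, the behavior agreed on above resembles a counting semaphore that step 0 bypasses. The following is a toy Python sketch of that pattern only; all names are illustrative, and this is not the actual C++ scheduler code.

```python
import threading


class InflightLimiter:
    """Toy illustration of the backpressure behavior discussed above:
    a counting semaphore that caps in-flight responses between steps,
    bypassed for step 0 so the request handler is never blocked."""

    def __init__(self, max_inflight: int):
        self._sem = threading.Semaphore(max_inflight)

    def enqueue(self, schedule) -> None:
        # Step 0 is scheduled from EnsembleScheduler::Enqueue on the
        # HTTP/gRPC handler thread, so it bypasses the semaphore.
        schedule(0)

    def on_response_complete(self, next_step: int, schedule) -> None:
        # Downstream steps are scheduled from the ResponseComplete path:
        # block here until a slot frees up, so a fast decoupled producer
        # cannot pile up unbounded responses ahead of a slow consumer.
        self._sem.acquire()
        schedule(next_step)

    def on_step_done(self) -> None:
        # Release the slot once the downstream step has consumed its input.
        self._sem.release()
```

With the limit set to 4, at most four producer responses wait on the slow consumer at any time; the fifth ResponseComplete call blocks until one drains.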
Another excerpt from the documentation diff:

> ### When to Use This Feature
>
> Use `max_inflight_responses` when your ensemble includes:
> * **Decoupled models** that produce multiple responses per request
Same here
Co-authored-by: Yingge He <157551214+yinggeh@users.noreply.github.com>
Changed the title: "max_inflight_responses parameter to prevent unbounded memory growth in ensemble models" → "max_inflight_requests parameter to prevent unbounded memory growth in ensemble models"
What does the PR do?
This PR adds testing for a new parameter, `max_inflight_requests`. Added `ensemble_backpressure_test.py` with custom decoupled producer and slow consumer models to validate the new feature (a hedged sketch of such a producer follows the checklist below).
Checklist
PR title reflects the change and is of the format `<commit_type>: <Title>`.
Commit Type: check the conventional commit type box here and add the label to the GitHub PR.
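For context, here is a hedged sketch of what the decoupled producer test model might look like, based only on the file descriptions in the review summary above ("produces N responses based on input value"); the tensor names and dtypes are assumptions.

```python
# Hypothetical decoupled_producer/1/model.py; tensor names/dtypes assumed.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            # Input value N controls how many responses are streamed back.
            n = int(
                pb_utils.get_input_tensor_by_name(request, "IN").as_numpy()[0]
            )
            sender = request.get_response_sender()
            for i in range(n):
                out = pb_utils.Tensor("OUT", np.array([i], dtype=np.int32))
                sender.send(pb_utils.InferenceResponse(output_tensors=[out]))
            # Mark the response stream complete for this request.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        # Decoupled models return None; responses go through the sender.
        return None
```

The matching slow_consumer model would simply sleep roughly 200 ms per request (per the first review summary) before echoing its input, which is what makes the in-flight cap observable in the test.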
Related PRs: triton-inference-server/core#455, triton-inference-server/common#141
Where should the reviewer start?
Test plan:
Caveats:
Background
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)