
Conversation


@pskiran1 pskiran1 commented Oct 13, 2025

What does the PR do?

This PR adds tests for the new max_inflight_requests parameter.
It adds ensemble_backpressure_test.py with custom decoupled-producer and slow-consumer models to validate the new feature.
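For context, a decoupled producer in such a test emits N responses for a single request based on an input value. A minimal, framework-free sketch of that behavior (hypothetical names; the PR's actual model.py uses the Triton Python backend API and an InferenceResponseSender, which is not reproduced here):

```python
# Sketch of a decoupled producer's behavior: given an input value N,
# emit N separate responses. The real model.py would call the Python
# backend's response sender once per response; this framework-free
# version simply yields each response as a dict.
def produce_responses(n: int):
    for i in range(n):
        yield {"OUTPUT0": i}  # one response per iteration, like send()

responses = list(produce_responses(4))
```

A slow downstream consumer then has to drain these responses one by one, which is what makes the in-flight response count grow without a cap.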

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs: triton-inference-server/core#455, triton-inference-server/common#141

Where should the reviewer start?

Test plan:

  • CI Pipeline ID: 37193671

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

@pskiran1 pskiran1 requested a review from Copilot October 13, 2025 17:24

Copilot AI left a comment


Pull Request Overview

This PR adds support for a new max_ensemble_inflight_responses parameter on ensemble models to prevent unbounded memory growth in scenarios with decoupled models and slow consumers.

  • Implements backpressure mechanism to limit concurrent responses in ensemble pipelines
  • Adds comprehensive test coverage including valid/invalid parameter validation
  • Creates new test models for decoupled producer and slow consumer scenarios
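For illustration, an ensemble-level parameter like this would plausibly be set in the ensemble's config.pbtxt via the standard model-config parameters block. A sketch based on the PR description (which mentions the backpressure parameter set to 4); note the key name evolved during review (max_ensemble_inflight_responses → max_inflight_responses → max_inflight_requests per the title changes), so treat the key below as illustrative, not authoritative:

```protobuf
# Hypothetical ensemble config fragment; key name and value are
# illustrative, matching the test setup described in this PR.
parameters {
  key: "max_inflight_responses"
  value: { string_value: "4" }
}
```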

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| qa/L0_simple_ensemble/test.sh | Adds backpressure testing logic and invalid parameter validation |
| qa/L0_simple_ensemble/models/slow_consumer/config.pbtxt | Configures Python backend model with intentional processing delay |
| qa/L0_simple_ensemble/models/slow_consumer/1/model.py | Implements model that adds 200ms delay per request to simulate slow processing |
| qa/L0_simple_ensemble/models/ensemble_enabled_max_inflight_responses/config.pbtxt | Ensemble configuration with backpressure parameter set to 4 |
| qa/L0_simple_ensemble/models/ensemble_disabled_max_inflight_responses/config.pbtxt | Baseline ensemble configuration without backpressure parameter |
| qa/L0_simple_ensemble/models/decoupled_producer/config.pbtxt | Configures decoupled Python model for multiple response generation |
| qa/L0_simple_ensemble/models/decoupled_producer/1/model.py | Implements decoupled model that produces N responses based on input value |
| qa/L0_simple_ensemble/ensemble_backpressure_test.py | Comprehensive test suite for backpressure functionality |
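The slow consumer's core behavior, a 200 ms artificial delay per request, can be sketched without the Triton Python backend to make the timing intent concrete (hypothetical, framework-free; the real model.py receives and returns Triton tensors):

```python
import time

# Framework-free sketch of the slow consumer: each "request" is
# delayed by 200 ms before the input is passed through unchanged,
# simulating a consumer that cannot keep up with a fast producer.
DELAY_S = 0.2

def slow_consume(value):
    time.sleep(DELAY_S)  # intentional processing delay
    return value  # identity pass-through
```

With a decoupled producer emitting responses far faster than one per 200 ms, responses pile up between the two steps unless something caps them.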


@pskiran1 pskiran1 added the PR: ci Changes to our CI configuration files and scripts label Oct 13, 2025
@pskiran1 pskiran1 requested a review from Copilot October 13, 2025 18:04

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.



@pskiran1 pskiran1 changed the title ci: Add support for max_ensemble_inflight_responses parameter to prevent unbounded memory growth in ensemble models ci: Add support for max_inflight_responses parameter to prevent unbounded memory growth in ensemble models Oct 17, 2025
@pskiran1 pskiran1 requested a review from yinggeh October 19, 2025 17:04
each time with a new response. You can take a look at [grpc_server.cc](https://github.com/triton-inference-server/server/blob/main/src/grpc/grpc_server.cc)

### Knowing When a Decoupled Inference Request is Complete
### Using Decoupled Models in Ensembles
Contributor


Is this only happening with decoupled models, or with any models that have a large processing-speed difference?

Member Author

@pskiran1 pskiran1 Oct 23, 2025


Based on my understanding, since a normal (non-decoupled) model step produces only one response, a slow processing step will automatically block the request. In that case memory usage grows at a normal rate (only one response per step accumulates in memory, unlike with a decoupled model), so additional backpressure at the step level may not be needed. To manage overall memory usage, a rate limiter could be sufficient.

Contributor


Maybe I am thinking of a different scenario, or I am misunderstanding. Consider a simple two-step ensemble where the first step processes at a much faster speed. The client sends requests at a constant rate (e.g., one request per millisecond). Based on https://github.com/triton-inference-server/core/blob/b354d4dc13c5855b50a36eeec0d4d3aa443a01f3/src/ensemble_scheduler/ensemble_scheduler.cc#L1422, ScheduleSteps is non-blocking, so the queue size increases at a constant rate.

In your change, since ScheduleSteps is called both in EnsembleScheduler::Enqueue (the same thread as the http/grpc handler thread) and in ResponseComplete, in the above example it will also block the grpc/http handler and limit total memory growth.

Member Author

@pskiran1 pskiran1 Oct 24, 2025


Thank you, you are correct. I have updated the code to apply backpressure only to the downstream steps (excluding step 0). This ensures that we do not block the EnsembleScheduler::Enqueue path (requests submitted by the HTTP/gRPC handler thread); backpressure is only applied in the ResponseComplete path.
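The fix described above, capping in-flight responses for downstream steps while leaving the initial Enqueue path unblocked, can be sketched with a counting semaphore. This is a hypothetical simplification in Python (the real change lives in the C++ ensemble scheduler), just to show why the peak in-flight count stays bounded:

```python
import threading

# Sketch: a producer must acquire a slot before forwarding a response
# to a downstream step; the consumer releases the slot when the
# response finishes. Peak in-flight responses can never exceed `limit`.
class InflightLimiter:
    def __init__(self, limit: int):
        self._slots = threading.Semaphore(limit)
        self._lock = threading.Lock()
        self.inflight = 0
        self.peak = 0

    def acquire(self):
        # Called on the response-complete path before forwarding a
        # response downstream (step 0 / Enqueue is never gated).
        self._slots.acquire()
        with self._lock:
            self.inflight += 1
            self.peak = max(self.peak, self.inflight)

    def release(self):
        with self._lock:
            self.inflight -= 1
        self._slots.release()

def run(total_responses: int, limit: int) -> int:
    limiter = InflightLimiter(limit)

    def consumer(done):
        limiter.release()  # response fully consumed; free the slot
        done.set()

    events = []
    for _ in range(total_responses):
        limiter.acquire()  # blocks once `limit` responses are in flight
        done = threading.Event()
        threading.Timer(0.01, consumer, args=(done,)).start()
        events.append(done)
    for e in events:
        e.wait()
    return limiter.peak
```

Running `run(12, 4)` shows the producer stalling whenever four responses are outstanding, which is the bounded-memory behavior the parameter is meant to provide.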

### When to Use This Feature

Use `max_inflight_responses` when your ensemble includes:
* **Decoupled models** that produce multiple responses per request
Contributor


Same here

pskiran1 and others added 3 commits October 23, 2025 12:21
Co-authored-by: Yingge He <157551214+yinggeh@users.noreply.github.com>
Co-authored-by: Yingge He <157551214+yinggeh@users.noreply.github.com>
Co-authored-by: Yingge He <157551214+yinggeh@users.noreply.github.com>
@pskiran1 pskiran1 requested a review from yinggeh October 23, 2025 08:29
whoisj
whoisj previously approved these changes Oct 24, 2025
@pskiran1 pskiran1 requested a review from yinggeh October 24, 2025 15:58
@pskiran1 pskiran1 changed the title ci: Add support for max_inflight_responses parameter to prevent unbounded memory growth in ensemble models ci: Add support for max_inflight_requests parameter to prevent unbounded memory growth in ensemble models Oct 24, 2025