
[ML] Prevent the trained model deployment memory estimation from double-counting allocations. #131918


Open
wants to merge 10 commits into main

Conversation

@valeriy42 (Contributor) commented Jul 25, 2025

We observed that updating the number of a deployment's allocations led to double-accounting in StartTrainedModelDeploymentAction.estimateMemoryUsageBytes. Further investigation showed that this bug was introduced in #104260.

This PR reverts the changes from #104260 and refactors the memory calculation in ML inference assignment planning to improve testability and prevent memory accounting issues.
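To make the symptom concrete (a hypothetical illustration; the figures are not from this PR): if estimateMemoryUsageBytes(4) reports 1 GB for a deployment scaled from 3 to 4 allocations, a planner that also re-charges the memory of the 3 existing allocations would reserve roughly 1 GB plus estimateMemoryUsageBytes(3) on the node, overstating the deployment's footprint and potentially blocking assignments that would otherwise fit.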

Main Changes

  • Refactored memory calculation in AssignmentPlan.Builder (see the sketch after this list):
    • Added assignModelToNodeAndAccountForCurrentAllocations() in AssignmentPlan.java, which handles both new and current allocations in a single call
    • Moved memory accounting into the Builder class to reduce code duplication and the potential for bugs
    • Used by ZoneAwareAssignmentPlanner and TrainedModelAssignmentRebalancer for consistent memory handling
  • Added dependency injection in Deployment for the memory estimation function
  • Added a test verifying correct memory accounting when scaling allocations from 3 to 4, confirming that memory is not double-counted during allocation scaling
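As a rough sketch of the consolidated helper (based on the diff hunk quoted in the review below; the signatures of Deployment, Node, getCurrentAllocations, accountMemory, and assignModelToNode are assumed from the PR text, and this is illustrative rather than the exact implementation):

    // Sketch only: assigns new allocations and accounts for existing ones in a single call.
    Builder assignModelToNodeAndAccountForCurrentAllocations(Deployment deployment, Node node, int newAllocations) {
        int currentAllocations = getCurrentAllocations(deployment, node);
        if (currentAllocations > 0) {
            // Charge memory once for allocations the deployment already holds on this node.
            long memoryForCurrentAllocations = deployment.estimateMemoryUsageBytes(currentAllocations);
            accountMemory(deployment, node, memoryForCurrentAllocations);
        }
        // assignModelToNode is assumed to account only for the new allocations,
        // so existing allocations are never charged twice.
        return assignModelToNode(deployment, node, newAllocations);
    }

Keeping both accounting paths in one Builder method means every planner goes through the same charging logic, which is what makes the 3-to-4 scaling test meaningful.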

@valeriy42 valeriy42 added >bug :ml Machine learning labels Jul 25, 2025
@valeriy42 valeriy42 self-assigned this Jul 25, 2025
@elasticsearchmachine (Collaborator) commented

Hi @valeriy42, I've created a changelog YAML for you.

@valeriy42 valeriy42 changed the title [ML] Fix the bug with double allocation accouting [ML] Prevent double allocation accounting in memory estimation of trained model deployments Jul 25, 2025
@valeriy42 valeriy42 changed the title [ML] Prevent double allocation accounting in memory estimation of trained model deployments [ML] Prevent the trained model deployment memory estimation from double-counting allocations. Jul 25, 2025
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Jul 25, 2025
@elasticsearchmachine (Collaborator) commented

Pinging @elastic/ml-core (Team:ML)

@valeriy42 valeriy42 added v8.18.5 auto-backport Automatically create backport pull requests when merged labels Jul 25, 2025
int currentAllocations = getCurrentAllocations(deployment, node);
if (currentAllocations > 0) {
    // Account once for the memory of allocations the deployment already holds on this node.
    long memoryForCurrentAllocations = deployment.estimateMemoryUsageBytes(currentAllocations);
    accountMemory(deployment, node, memoryForCurrentAllocations);
}
@jan-elastic (Contributor) commented Jul 25, 2025
I don't understand this. Why call:

  • assignModelToNode for the new allocations; and
  • accountMemory for the old ones?

I guess I also don't really understand what exactly the state of AssignmentPlan contains.

Isn't the old already accounted for? And what about the cores?

Contributor

Furthermore, it feels like this shouldn't be doing that much, so I don't get why it's 500+ lines of similar confusing methods...

Labels
auto-backport Automatically create backport pull requests when merged >bug :ml Machine learning Team:ML Meta label for the ML team v8.18.5 v8.19.1 v9.0.5 v9.1.1 v9.2.0
Projects
None yet

3 participants