[jaccl] Fix race on local_staging in MeshImpl::all_reduce by kernelpool · Pull Request #3451 · ml-explore/mlx

kernelpool · 2026-04-25T23:30:35Z

This fixes a bug introduced in #3412. I observed it while running distributed tensor-parallel inference (MiniMax-M2.7-8bit on 2x M3 Ultra over jaccl, using mlx 0.31.2 / mlx-lm 0.31.3 / macOS 26.4): generation degenerates into sentence loops (or token-level garbage) for prompts above ~170 tokens.

The SEND-completion handler refilled local_staging(buff) the moment all peers ACK'd the previous send, regardless of whether that chunk had been consumed by the own-rank reduction step. RDMA timing made this non-deterministic, producing wrong sums for messages spanning multiple PIPELINE chunks.

Decouple the two: SEND completion only refills send_buffer (free once on the wire) and posts the next send. local_staging(b) is refilled in the reduce loop right after the own-rank reduction reads it, which also bumps recv_end[rank_] to gate the next step.
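To make the ordering problem concrete, here is a minimal single-threaded sketch. It is a toy model and not the jaccl source: `all_reduce_model`, `kPipeline`, and the one-int-per-chunk layout are hypothetical, there is a single peer, and the `recv_end[rank_]` gating is left out. The model forces the adverse ordering, where the SEND completion for a chunk is observed before that chunk's own-rank reduction, which in the real code only happens some of the time depending on RDMA timing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model, not the jaccl source: each "chunk" is one int and
// there is a single peer. kPipeline staging slots are reused round-robin.
constexpr std::size_t kPipeline = 2;

// Reduce `mine` with `peer` chunk by chunk, forcing the adverse ordering in
// which the SEND completion for a chunk is seen before that chunk's
// own-rank reduction runs.
std::vector<int> all_reduce_model(const std::vector<int>& mine,
                                  const std::vector<int>& peer,
                                  bool refill_staging_on_send_completion) {
  assert(mine.size() == peer.size());
  const std::size_t n = mine.size();
  std::vector<int> local_staging(kPipeline, 0);
  std::vector<int> send_buffer(kPipeline, 0);
  std::vector<int> out(n, 0);

  // Prime the pipeline with the first kPipeline chunks.
  for (std::size_t c = 0; c < kPipeline && c < n; ++c) {
    local_staging[c] = mine[c];
    send_buffer[c] = mine[c];
  }

  for (std::size_t c = 0; c < n; ++c) {
    const std::size_t b = c % kPipeline;
    const std::size_t next = c + kPipeline;

    // SEND completion for chunk c: send_buffer[b] is free once on the wire.
    if (next < n) {
      send_buffer[b] = mine[next];
      if (refill_staging_on_send_completion) {
        local_staging[b] = mine[next];  // old behavior: clobbers chunk c
      }
    }

    // Own-rank reduction for chunk c reads local_staging[b].
    out[c] = local_staging[b] + peer[c];

    // New behavior: refill only after the reduction has consumed the slot.
    if (!refill_staging_on_send_completion && next < n) {
      local_staging[b] = mine[next];
    }
  }
  return out;
}
```

With `mine = {1, 2, 3, 4}` and `peer = {10, 20, 30, 40}`, passing `false` (refill after the reduction) yields `{11, 22, 33, 44}`, while passing `true` (refill on SEND completion) yields `{13, 24, 33, 44}`: the first two sums pick up chunks 2 and 3 instead of chunks 0 and 1. An input that fits within `kPipeline` chunks gives the same result either way, since no slot is ever reused.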

Checklist

Put an x in the boxes that apply.

I have read the CONTRIBUTING document
I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
I have added tests that prove my fix is effective or that my feature works
I have updated the necessary documentation (if needed)


kernelpool · 2026-04-26T07:21:11Z

I'm not sure this completely addresses the issue. I still seem to get repetitive behavior with Kimi-K2.6 over longer contexts (something I didn't observe when testing this model over several days previously using mlx 0.31.1 / mlx-lm 0.31.2). I'll do some more digging and see if I can reproduce it more reliably.

kernelpool wants to merge 1 commit into ml-explore:main from kernelpool:fix-jaccl-mesh-allreduce-staging-race
