Conversation


@ryanhankins ryanhankins commented Sep 23, 2025

Summary

This implements an RDMA write + read flush mechanism in the NCCL OFI plugin to support hardware requirements found on AMD GPUs. Data consistency cannot always be guaranteed with the existing RDMA read (fi_read) flush alone; a preceding RDMA write (fi_write) flush ensures that all memory operations are synchronized. This series of commits modularizes the flush logic, adds configurable support for an RDMA write flush, and manages the necessary buffer resources.

Motivation

AMD GPUs (and potentially some other platforms) require an explicit fi_write flush to guarantee data visibility and correctness before the PCIe fi_read flush is performed. The existing read-only flush mechanism is not sufficient on these platforms.

Description of Changes

  • Modularized Flush Logic:
    The flush operation for receive communicators has been refactored. The RDMA read flush is now handled by a helper function that accepts explicit local and remote buffer pointers and memory registration handles, allowing for greater flexibility and clarity in buffer management. A parallel helper for RDMA write flush is added.

  • Configurable RDMA Write Flush:
    A new environment variable, OFI_NCCL_ENABLE_FLUSH_RDMA_WRITE, controls whether an RDMA write flush precedes the RDMA read flush. When enabled, the flush sequence issues an fi_write to a GPU buffer on the receiver, followed by the usual fi_read (see the sketch after this list). This sequence is required on AMD GPUs but can be disabled on platforms that do not need it, avoiding unnecessary overhead. For now, the write flush is opt-in via this environment variable.

  • Resource Management for Flush Buffers:
    The flush buffer metadata structure is extended to track both host and GPU buffer pointers and their associated memory registration handles. Dedicated allocation and deallocation routines are implemented for these resources, including ROCm (AMD) support via hipExtMallocWithFlags and hipFree.

  • Completion Handling for RDMA Write Flushes:
    The request completion logic introduces a new direction, NCCL_OFI_SENDRECV_RECV_IGNORE, to identify write flush requests. Upon completion, these requests are immediately freed rather than tracked or propagated, since the subsequent read flush guarantees overall flush completion and provides that completion to RCCL.
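
For concreteness, the sketch below (referenced from the write-flush item above) shows what the write-then-read sequence could look like with libfabric calls. Only fi_write, fi_read, and the ordering come from this PR; the struct, field, and function names, the one-byte transfer length, and the address/key plumbing are illustrative assumptions rather than the plugin's actual code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_rma.h>

/* Illustrative flush-buffer metadata; the plugin's real struct differs. */
struct flush_buffers {
    void          *host_buff;        /* local host buffer for the read flush   */
    struct fid_mr *host_mr;
    void          *gpu_buff;         /* GPU buffer targeted by the write flush */
    struct fid_mr *gpu_mr;
    uint64_t       gpu_remote_addr;  /* RMA address of the GPU buffer          */
    uint64_t       gpu_remote_key;   /* RMA key of the GPU buffer              */
};

/* Hypothetical flush sequence: optional fi_write into the GPU flush buffer,
 * then the usual fi_read.  Only the read's completion is reported to RCCL. */
static int issue_flush(struct fid_ep *ep, fi_addr_t self_addr,
                       struct flush_buffers *fb, bool enable_write_flush,
                       void *write_ctx, void *read_ctx)
{
    ssize_t rc;

    if (enable_write_flush) {
        /* Write flush: push a dummy payload into the GPU buffer so that
         * earlier device-bound writes are ordered before the read flush. */
        rc = fi_write(ep, fb->host_buff, 1, fi_mr_desc(fb->host_mr),
                      self_addr, fb->gpu_remote_addr, fb->gpu_remote_key,
                      write_ctx);
        if (rc != 0)
            return (int)rc;
    }

    /* Read flush: reading back from the GPU buffer into host memory forces
     * pending PCIe writes to complete; this completion is what RCCL sees. */
    rc = fi_read(ep, fb->host_buff, 1, fi_mr_desc(fb->host_mr),
                 self_addr, fb->gpu_remote_addr, fb->gpu_remote_key,
                 read_ctx);
    return (int)rc;
}
```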

Commits

  • Refactor flush logic to modularize RDMA read, generalize buffer handling, and add optional RDMA write flush support.
  • Add request direction for write flushes and free their requests immediately on completion.
  • Factor out flush buffer cleanup, split allocation for host/GPU buffers, and ensure proper deallocation.
  • Extend flush buffer struct for host/GPU resources and implement ROCm GPU buffer management routines.

@ryanhankins ryanhankins force-pushed the flush3 branch 2 times, most recently from ea389d6 to b42850f on September 27, 2025 20:36
@ryanhankins ryanhankins force-pushed the flush3 branch 5 times, most recently from edd4c77 to bdddaeb on October 5, 2025 21:06
@ryanhankins ryanhankins changed the title from "Add RCCL write + read flush. Draft PR -- do not merge." to "Add RCCL write + read flush" on Oct 6, 2025
Extend flush buffer struct for host/GPU resources and implement ROCm GPU buffer management routines.

This commit expands the flush buffer metadata structure to track both a host
buffer and a GPU buffer, along with their respective memory registration
handles. It implements platform-specific allocation and deallocation routines
for GPU memory, supporting ROCm via hipExtMallocWithFlags and hipFree. The code
is updated to use these routines for handling GPU flush buffers, and struct
field names are clarified to distinguish between host and GPU resources. These
changes lay the foundation for supporting multiple flush strategies that rely on
distinct buffer types.
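
As a rough sketch of the ROCm routines this commit describes: hipExtMallocWithFlags and hipFree are named above, while the helper names, the hipDeviceMallocFinegrained flag choice, and the error handling are assumptions.

```c
#include <stdio.h>
#include <hip/hip_runtime.h>

/* Hypothetical GPU flush-buffer allocation for ROCm.  The commit names
 * hipExtMallocWithFlags/hipFree; the flag below is an illustrative choice. */
static int alloc_gpu_flush_buff(void **gpu_buff, size_t size)
{
    hipError_t err = hipExtMallocWithFlags(gpu_buff, size,
                                           hipDeviceMallocFinegrained);
    if (err != hipSuccess) {
        fprintf(stderr, "hipExtMallocWithFlags failed: %s\n",
                hipGetErrorString(err));
        return -1;
    }
    return 0;
}

static int free_gpu_flush_buff(void *gpu_buff)
{
    hipError_t err = hipFree(gpu_buff);
    if (err != hipSuccess) {
        fprintf(stderr, "hipFree failed: %s\n", hipGetErrorString(err));
        return -1;
    }
    return 0;
}
```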
Factor out flush buffer cleanup, split allocation for host/GPU buffers, and ensure proper deallocation.

This commit introduces a dedicated helper function for deregistering and
deallocating flush buffers, consolidating cleanup logic for both host and GPU
flush buffers. The allocation logic for flush buffers is refactored to
separately handle host (read) and GPU (write) buffers, with new functions for
each type. A combined allocation function calls both, ensuring that both buffers
are set up as needed. The communicator cleanup function is updated to properly
free and deregister both types of buffers, issuing warnings if any deallocation
fails. This modular approach improves maintainability and robustness,
particularly when managing limited GPU memory resources.
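
A minimal illustration of the consolidated cleanup described above; fi_close is the standard libfabric call for deregistering a memory region, but the helper name, the free-callback shape, and the warning messages are assumptions.

```c
#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Hypothetical consolidated cleanup: deregister the MR, free the buffer,
 * and warn (rather than fail hard) if either step goes wrong. */
static int dereg_and_free_flush_buff(struct fid_mr *mr, void *buff,
                                     int (*free_fn)(void *))
{
    int ret = 0;

    if (mr != NULL && fi_close(&mr->fid) != 0) {
        fprintf(stderr, "WARN: failed to deregister flush buffer MR\n");
        ret = -1;
    }
    if (buff != NULL && free_fn(buff) != 0) {
        fprintf(stderr, "WARN: failed to free flush buffer\n");
        ret = -1;
    }
    return ret;
}
```

The communicator cleanup path would call this once for the host buffer (with the plugin's host free routine) and once for the GPU buffer (with a hipFree wrapper such as the one sketched above).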
Add request direction for write flushes and free their requests immediately on completion.

This commit adds a new enumeration value, NCCL_OFI_SENDRECV_RECV_IGNORE,
to distinguish RDMA write flush requests from regular receive requests.
The completion queue processing logic is updated so that when a
completion for a write flush is detected, the request is immediately
freed instead of being tracked or reported, since the completion of the
subsequent RDMA read is sufficient to indicate flush completion. This
prevents unnecessary handling of write flush completions and simplifies
resource management.
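
A sketch of the completion-path change; only NCCL_OFI_SENDRECV_RECV_IGNORE comes from this commit, while the other enum members, the request struct, and the use of plain free() stand in for the plugin's real request machinery.

```c
#include <stdlib.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>

/* Only NCCL_OFI_SENDRECV_RECV_IGNORE comes from this commit; the other
 * members and types here are illustrative stand-ins. */
enum nccl_ofi_sendrecv_direction {
    NCCL_OFI_SENDRECV_SEND,
    NCCL_OFI_SENDRECV_RECV,
    NCCL_OFI_SENDRECV_RECV_IGNORE,  /* RDMA write flush request */
};

struct flush_req {
    enum nccl_ofi_sendrecv_direction direction;
    /* ... other request bookkeeping ... */
};

static void handle_cq_entry(struct fi_cq_tagged_entry *entry)
{
    struct flush_req *req = (struct flush_req *)entry->op_context;

    if (req->direction == NCCL_OFI_SENDRECV_RECV_IGNORE) {
        /* Write-flush completion: free immediately and report nothing; the
         * later fi_read completion is what RCCL sees as the flush finishing. */
        free(req);
        return;
    }

    /* ... normal send/recv completion handling continues here ... */
}
```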
Refactor flush logic to modularize RDMA read, generalize buffer handling, and add optional RDMA write flush support.

This commit restructures the receive communicator flush logic by
extracting the RDMA read flush operation into its own helper function,
making the code more modular. It generalizes the flush helper to accept
explicit local and remote buffer pointers and memory registration
handles, enhancing flexibility for buffer management. Additionally, it
introduces a runtime parameter, ENABLE_FLUSH_RDMA_WRITE, to optionally
enable an RDMA write flush before the read flush. A new RDMA write flush
helper function is provided, and the main flush logic is updated to
perform both write and read flushes sequentially if the parameter is
enabled. This allows the flush mechanism to support platforms requiring
explicit write flushes to maintain data consistency, while keeping the
code organized and extensible for future flushing strategies.
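
As an illustration of the runtime gate, a getenv-based check might look like the following; the plugin has its own parameter machinery, so only the OFI_NCCL_ENABLE_FLUSH_RDMA_WRITE name is taken from this PR.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative runtime gate: treat any value other than "0" (or unset) as
 * enabled.  The plugin's actual parameter handling differs. */
static bool flush_rdma_write_enabled(void)
{
    const char *val = getenv("OFI_NCCL_ENABLE_FLUSH_RDMA_WRITE");
    return val != NULL && strcmp(val, "0") != 0;
}
```

With this gate, the receive-communicator flush calls the write-flush helper first when enabled and then always issues the read flush, matching the sequence described under Description of Changes.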