Conversation


@ryanhankins ryanhankins commented Sep 23, 2025

Summary

This implements an RDMA write + read flush mechanism in the NCCL OFI plugin to support hardware requirements found on AMD GPUs. Data consistency cannot always be guaranteed with the existing RDMA read (fi_read) flush alone; a preceding RDMA write (fi_write) flush ensures that all memory operations are synchronized. This series of commits modularizes the flush logic, adds configurable support for an RDMA write flush, and manages the necessary buffer resources.

Motivation

AMD GPUs (and potentially some other platforms) require an explicit fi_write flush to guarantee data visibility and correctness before the PCIe fi_read flush is performed. The existing read-only flush mechanism is not sufficient on these platforms.

Description of Changes

  • Modularized Flush Logic:
    The flush operation for receive communicators has been refactored. The RDMA read flush is now handled by a helper function that accepts explicit local and remote buffer pointers and memory registration handles, allowing for greater flexibility and clarity in buffer management. A parallel helper for RDMA write flush is added.

  • Configurable RDMA Write Flush:
    A new environment variable, OFI_NCCL_ENABLE_FLUSH_RDMA_WRITE, controls whether an RDMA write flush precedes the RDMA read flush. When enabled, the flush sequence issues an fi_write to a GPU buffer on the receiver, followed by the usual fi_read (see the sketch after this list). This sequence is required on AMD GPUs but can be disabled on platforms that do not need it, avoiding unnecessary overhead. For now, the write flush is opt-in via this environment variable.

  • Resource Management for Flush Buffers:
    The flush buffer metadata structure is extended to track both host and GPU buffer pointers and their associated memory registration handles. Dedicated allocation and deallocation routines are implemented for these resources, including ROCm (AMD) support via hipExtMallocWithFlags and hipFree.

  • Completion Handling for RDMA Write Flushes:
    The request completion logic introduces a new direction, NCCL_OFI_SENDRECV_RECV_IGNORE, to identify write flush requests. Upon completion, these requests are immediately freed rather than tracked or propagated, since the subsequent read flush guarantees overall flush completion and provides that completion to RCCL.
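
For concreteness, the sketch below (referenced from the write-flush item above) shows what the write-then-read sequence could look like with libfabric calls. Only fi_write, fi_read, and the ordering come from this PR; the struct, field, and function names, the one-byte transfer length, and the address/key plumbing are illustrative assumptions rather than the plugin's actual code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_rma.h>

/* Illustrative flush-buffer metadata; the plugin's real struct differs. */
struct flush_buffers {
    void          *host_buff;        /* local host buffer for the read flush   */
    struct fid_mr *host_mr;
    void          *gpu_buff;         /* GPU buffer targeted by the write flush */
    struct fid_mr *gpu_mr;
    uint64_t       gpu_remote_addr;  /* RMA address of the GPU buffer          */
    uint64_t       gpu_remote_key;   /* RMA key of the GPU buffer              */
};

/* Hypothetical flush sequence: optional fi_write into the GPU flush buffer,
 * then the usual fi_read.  Only the read's completion is reported to RCCL. */
static int issue_flush(struct fid_ep *ep, fi_addr_t self_addr,
                       struct flush_buffers *fb, bool enable_write_flush,
                       void *write_ctx, void *read_ctx)
{
    ssize_t rc;

    if (enable_write_flush) {
        /* Write flush: push a dummy payload into the GPU buffer so that
         * earlier device-bound writes are ordered before the read flush. */
        rc = fi_write(ep, fb->host_buff, 1, fi_mr_desc(fb->host_mr),
                      self_addr, fb->gpu_remote_addr, fb->gpu_remote_key,
                      write_ctx);
        if (rc != 0)
            return (int)rc;
    }

    /* Read flush: reading back from the GPU buffer into host memory forces
     * pending PCIe writes to complete; this completion is what RCCL sees. */
    rc = fi_read(ep, fb->host_buff, 1, fi_mr_desc(fb->host_mr),
                 self_addr, fb->gpu_remote_addr, fb->gpu_remote_key,
                 read_ctx);
    return (int)rc;
}
```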

Commits

  • Refactor flush logic to modularize RDMA read, generalize buffer handling, and add optional RDMA write flush support.
  • Add request direction for write flushes and free their requests immediately on completion.
  • Factor out flush buffer cleanup, split allocation for host/GPU buffers, and ensure proper deallocation.
  • Extend flush buffer struct for host/GPU resources and implement ROCm GPU buffer management routines.

@ryanhankins ryanhankins force-pushed the flush3 branch 2 times, most recently from ea389d6 to b42850f on September 27, 2025 20:36
@ryanhankins ryanhankins force-pushed the flush3 branch 5 times, most recently from edd4c77 to bdddaeb on October 5, 2025 21:06
@ryanhankins ryanhankins changed the title from "Add RCCL write + read flush. Draft PR -- do not merge." to "Add RCCL write + read flush" on Oct 6, 2025
Extend flush buffer struct for host/GPU resources and implement ROCm GPU buffer management routines.

This commit expands the flush buffer metadata structure to track both a host
buffer and a GPU buffer, along with their respective memory registration
handles. It implements platform-specific allocation and deallocation routines
for GPU memory, supporting ROCm via hipExtMallocWithFlags and hipFree. The code
is updated to use these routines for handling GPU flush buffers, and struct
field names are clarified to distinguish between host and GPU resources. These
changes lay the foundation for supporting multiple flush strategies that rely on
distinct buffer types.
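
As a rough sketch of the ROCm routines this commit describes: hipExtMallocWithFlags and hipFree are named above, while the helper names, the hipDeviceMallocFinegrained flag choice, and the error handling are assumptions.

```c
#include <stdio.h>
#include <hip/hip_runtime.h>

/* Hypothetical GPU flush-buffer allocation for ROCm.  The commit names
 * hipExtMallocWithFlags/hipFree; the flag below is an illustrative choice. */
static int alloc_gpu_flush_buff(void **gpu_buff, size_t size)
{
    hipError_t err = hipExtMallocWithFlags(gpu_buff, size,
                                           hipDeviceMallocFinegrained);
    if (err != hipSuccess) {
        fprintf(stderr, "hipExtMallocWithFlags failed: %s\n",
                hipGetErrorString(err));
        return -1;
    }
    return 0;
}

static int free_gpu_flush_buff(void *gpu_buff)
{
    hipError_t err = hipFree(gpu_buff);
    if (err != hipSuccess) {
        fprintf(stderr, "hipFree failed: %s\n", hipGetErrorString(err));
        return -1;
    }
    return 0;
}
```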
Factor out flush buffer cleanup, split allocation for host/GPU buffers, and ensure proper deallocation.

This commit introduces a dedicated helper function for deregistering and
deallocating flush buffers, consolidating cleanup logic for both host and GPU
flush buffers. The allocation logic for flush buffers is refactored to
separately handle host (read) and GPU (write) buffers, with new functions for
each type. A combined allocation function calls both, ensuring that both buffers
are set up as needed. The communicator cleanup function is updated to properly
free and deregister both types of buffers, issuing warnings if any deallocation
fails. This modular approach improves maintainability and robustness,
particularly when managing limited GPU memory resources.
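
A minimal illustration of the consolidated cleanup described above; fi_close is the standard libfabric call for deregistering a memory region, but the helper name, the free-callback shape, and the warning messages are assumptions.

```c
#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Hypothetical consolidated cleanup: deregister the MR, free the buffer,
 * and warn (rather than fail hard) if either step goes wrong. */
static int dereg_and_free_flush_buff(struct fid_mr *mr, void *buff,
                                     int (*free_fn)(void *))
{
    int ret = 0;

    if (mr != NULL && fi_close(&mr->fid) != 0) {
        fprintf(stderr, "WARN: failed to deregister flush buffer MR\n");
        ret = -1;
    }
    if (buff != NULL && free_fn(buff) != 0) {
        fprintf(stderr, "WARN: failed to free flush buffer\n");
        ret = -1;
    }
    return ret;
}
```

The communicator cleanup path would call this once for the host buffer (with the plugin's host free routine) and once for the GPU buffer (with a hipFree wrapper such as the one sketched above).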
Add request direction for write flushes and free their requests immediately on completion.

This commit adds a new enumeration value, NCCL_OFI_SENDRECV_RECV_IGNORE,
to distinguish RDMA write flush requests from regular receive requests.
The completion queue processing logic is updated so that when a
completion for a write flush is detected, the request is immediately
freed instead of being tracked or reported, since the completion of the
subsequent RDMA read is sufficient to indicate flush completion. This
prevents unnecessary handling of write flush completions and simplifies
resource management.
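
A sketch of the completion-path change; only NCCL_OFI_SENDRECV_RECV_IGNORE comes from this commit, while the other enum members, the request struct, and the use of plain free() stand in for the plugin's real request machinery.

```c
#include <stdlib.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>

/* Only NCCL_OFI_SENDRECV_RECV_IGNORE comes from this commit; the other
 * members and types here are illustrative stand-ins. */
enum nccl_ofi_sendrecv_direction {
    NCCL_OFI_SENDRECV_SEND,
    NCCL_OFI_SENDRECV_RECV,
    NCCL_OFI_SENDRECV_RECV_IGNORE,  /* RDMA write flush request */
};

struct flush_req {
    enum nccl_ofi_sendrecv_direction direction;
    /* ... other request bookkeeping ... */
};

static void handle_cq_entry(struct fi_cq_tagged_entry *entry)
{
    struct flush_req *req = (struct flush_req *)entry->op_context;

    if (req->direction == NCCL_OFI_SENDRECV_RECV_IGNORE) {
        /* Write-flush completion: free immediately and report nothing; the
         * later fi_read completion is what RCCL sees as the flush finishing. */
        free(req);
        return;
    }

    /* ... normal send/recv completion handling continues here ... */
}
```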
Refactor flush logic to modularize RDMA read, generalize buffer handling, and add optional RDMA write flush support.

This commit restructures the receive communicator flush logic by
extracting the RDMA read flush operation into its own helper function,
making the code more modular. It generalizes the flush helper to accept
explicit local and remote buffer pointers and memory registration
handles, enhancing flexibility for buffer management. Additionally, it
introduces a runtime parameter, ENABLE_FLUSH_RDMA_WRITE, to optionally
enable an RDMA write flush before the read flush. A new RDMA write flush
helper function is provided, and the main flush logic is updated to
perform both write and read flushes sequentially if the parameter is
enabled. This allows the flush mechanism to support platforms requiring
explicit write flushes to maintain data consistency, while keeping the
code organized and extensible for future flushing strategies.
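
As an illustration of the runtime gate, a getenv-based check might look like the following; the plugin has its own parameter machinery, so only the OFI_NCCL_ENABLE_FLUSH_RDMA_WRITE name is taken from this PR.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative runtime gate: treat any value other than "0" (or unset) as
 * enabled.  The plugin's actual parameter handling differs. */
static bool flush_rdma_write_enabled(void)
{
    const char *val = getenv("OFI_NCCL_ENABLE_FLUSH_RDMA_WRITE");
    return val != NULL && strcmp(val, "0") != 0;
}
```

With this gate, the receive-communicator flush calls the write-flush helper first when enabled and then always issues the read flush, matching the sequence described under Description of Changes.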