Add RCCL write + read flush #10
                
     Open
            
            
          
Summary
This PR implements an RDMA read-write flush mechanism in the NCCL OFI plugin to support hardware requirements found on AMD GPUs. The existing RDMA read (fi_read) flush cannot always guarantee data consistency on its own; a preceding RDMA write (fi_write) flush ensures that all memory operations are synchronized. This series of commits modularizes the flush logic, adds configurable support for an RDMA write flush, and manages the necessary buffer resources.
Motivation
AMD GPUs (and potentially some other platforms) require an explicit fi_write flush to guarantee data visibility and correctness before performing the PCI fi_read flush. The existing flush mechanism was not sufficient for these requirements.
Description of Changes
Modularized Flush Logic:
The flush operation for receive communicators has been refactored. The RDMA read flush is now handled by a helper function that accepts explicit local and remote buffer pointers and memory registration handles, allowing for greater flexibility and clarity in buffer management. A parallel helper for RDMA write flush is added.
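As a rough illustration of the helper split described above, the sketch below models the two flush helpers and the optional write-before-read ordering. It is not the plugin's code: every name here (`mr_handle_t`, `flush_log_t`, `flush_rdma_write`, `flush_rdma_read`, `recv_flush`) is hypothetical, the memory registration handle is a placeholder, and `memcpy` stands in for the actual fi_write and fi_read postings so the example runs without libfabric or a GPU.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical placeholder for a memory registration handle. */
typedef struct { int key; } mr_handle_t;

/* Hypothetical log of issued operations, used only to inspect ordering. */
enum flush_op { FLUSH_OP_WRITE, FLUSH_OP_READ };

typedef struct {
    enum flush_op ops[4];
    size_t count;
} flush_log_t;

/* Record one operation; returns -1 if the log is full. */
static int log_op(flush_log_t *log, enum flush_op op)
{
    if (log->count >= sizeof(log->ops) / sizeof(log->ops[0]))
        return -1;
    log->ops[log->count++] = op;
    return 0;
}

/* Stand-in for the write flush helper: copies local -> remote.
 * A real helper would post fi_write instead of calling memcpy. */
static int flush_rdma_write(flush_log_t *log,
                            const void *local_buf, const mr_handle_t *local_mr,
                            void *remote_buf, const mr_handle_t *remote_mr,
                            size_t len)
{
    if (!log || !local_buf || !local_mr || !remote_buf || !remote_mr)
        return -1;
    memcpy(remote_buf, local_buf, len);
    return log_op(log, FLUSH_OP_WRITE);
}

/* Stand-in for the read flush helper: copies remote -> local.
 * A real helper would post fi_read instead of calling memcpy. */
static int flush_rdma_read(flush_log_t *log,
                           void *local_buf, const mr_handle_t *local_mr,
                           const void *remote_buf, const mr_handle_t *remote_mr,
                           size_t len)
{
    if (!log || !local_buf || !local_mr || !remote_buf || !remote_mr)
        return -1;
    memcpy(local_buf, remote_buf, len);
    return log_op(log, FLUSH_OP_READ);
}

/* Flush entry point: optionally issue the write flush, then always the read. */
int recv_flush(flush_log_t *log, bool write_first,
               void *host_buf, const mr_handle_t *host_mr,
               void *gpu_buf, const mr_handle_t *gpu_mr,
               size_t len)
{
    if (write_first) {
        int rc = flush_rdma_write(log, host_buf, host_mr, gpu_buf, gpu_mr, len);
        if (rc != 0)
            return rc;
    }
    return flush_rdma_read(log, host_buf, host_mr, gpu_buf, gpu_mr, len);
}
```

Passing buffers and handles explicitly, rather than reading them from communicator state inside the helper, is what would let the same two helpers serve both the read-only and the write-then-read sequences.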
Configurable RDMA Write Flush:
A new environment variable, OFI_NCCL_ENABLE_FLUSH_RDMA_WRITE, controls whether an RDMA write flush should precede the RDMA read flush. When enabled, the flush sequence issues an fi_write to a GPU buffer (on the receiver), followed by the usual fi_read. This sequence is critical for AMD GPUs but can be disabled on platforms that do not require it to avoid unnecessary overhead. For now, the write flush is enabled by this environment variable.
Resource Management for Flush Buffers:
The flush buffer metadata structure is extended to track both host and GPU buffer pointers and their associated memory registration handles. Dedicated allocation and deallocation routines are implemented for these resources, including ROCm (AMD) support via hipExtMallocWithFlags and hipFree.
Completion Handling for RDMA Write Flushes:
The request completion logic introduces a new direction, NCCL_OFI_SENDRECV_RECV_IGNORE, to identify write flush requests. Upon completion, these requests are immediately freed rather than tracked or propagated, since the subsequent read flush guarantees overall flush completion and provides that completion to RCCL.
Commits