@MittalMakwana
Summary:

Problem

The torchcomms build in the xlformers_llama4_flagship_conda environment is failing with:

error: too many arguments to function 'c10::cuda::CUDACachingAllocator::SnapshotInfo c10::cuda::CUDACachingAllocator::snapshot()'

D86907322 introduced a new method, registerMemPreHook(), that calls c10::cuda::CUDACachingAllocator::snapshot({device, 0}), passing an argument that the snapshot() in the conda build does not accept.

The failure stems from a breaking PyTorch API change in c10::cuda::CUDACachingAllocator::snapshot():

  • Old API (internal builds): snapshot({device, pool_id})
  • New API (conda builds): snapshot(), which takes no arguments

Root Cause

The torchcomms code was written against the older PyTorch API, in which snapshot() accepted arguments, but the conda environments use a newer PyTorch version with the updated API.

Solution

Added conditional compilation to handle both API versions:

  1. Defined TORCHCOMMS_CONDA_BUILD macro in CMakeLists.txt files for conda builds
  2. Updated affected C++ files to use the correct API based on build type:
    • comms/torchcomms/ncclx/TorchCommNCCLXCCA.cpp
    • comms/torchcomms/nccl/TorchCommNCCLCCA.cpp

This follows the same pattern used in ProcessGroupNCCLX.cpp with the NCCLX_CONDA_BUILD macro.
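
The conditional-compilation approach described above can be sketched roughly as follows. This is a minimal, self-contained illustration rather than the actual change: the `mock_cca` namespace, the `MempoolId` and `SnapshotInfo` structs, and the `takeSnapshot()` helper are hypothetical stand-ins for the real c10::cuda::CUDACachingAllocator API, so the example compiles without PyTorch. Only the TORCHCOMMS_CONDA_BUILD macro name comes from this PR; it is assumed here to reach the compiler as a preprocessor definition (for example `-DTORCHCOMMS_CONDA_BUILD`).

```cpp
#include <cassert>

// Hypothetical stand-ins for the PyTorch allocator API. The real types live in
// c10::cuda::CUDACachingAllocator and are not reproduced here.
namespace mock_cca {

struct MempoolId {
  int device;
  int pool;
};

struct SnapshotInfo {
  // Records which signature was called, purely for illustration.
  bool usedPoolArgument;
};

#ifdef TORCHCOMMS_CONDA_BUILD
// Conda-style signature: snapshot() takes no arguments.
inline SnapshotInfo snapshot() {
  return SnapshotInfo{false};
}
#else
// Internal-style signature: snapshot() takes a {device, pool_id} argument.
inline SnapshotInfo snapshot(MempoolId id) {
  (void)id; // unused in this mock
  return SnapshotInfo{true};
}
#endif

} // namespace mock_cca

// Hypothetical call site: selects the matching call form at compile time,
// so only the branch that fits the available signature is ever compiled.
inline mock_cca::SnapshotInfo takeSnapshot(int device) {
  assert(device >= 0);
#ifdef TORCHCOMMS_CONDA_BUILD
  (void)device; // the no-argument API cannot be scoped to a device
  return mock_cca::snapshot();
#else
  return mock_cca::snapshot({device, 0});
#endif
}
```

Because the preprocessor discards the non-matching branch before compilation, the "too many arguments" error cannot occur in either build type, provided the macro is defined consistently with the PyTorch version actually in use.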

Differential Revision: D87413411

@meta-cla bot added the "CLA Signed" label on Nov 19, 2025
meta-codesync bot commented Nov 19, 2025

@MittalMakwana has exported this pull request. If you are a Meta employee, you can view the originating Diff in D87413411.
