Fix PyTorch CUDACachingAllocator::snapshot() API compatibility for conda builds #65
+14 −1
Summary:

Problem

The torchcomms build in the xlformers_llama4_flagship_conda environment is failing to compile. D86907322 introduced a new method registerMemPreHook() that calls c10::cuda::CUDACachingAllocator::snapshot({device, 0}) with arguments. The failure is caused by a breaking change to the PyTorch API for c10::cuda::CUDACachingAllocator::snapshot():

- Old API: snapshot({device, pool_id})
- New API: snapshot() - no arguments
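For illustration, a minimal sketch of the mismatch; the wrapper function and variable names are hypothetical, not the actual torchcomms code, and as written it compiles only against the older API:

```cpp
#include <c10/cuda/CUDACachingAllocator.h>

void takeSnapshot(int device) {
  // Written against the older API, where snapshot() accepts a
  // {device, pool_id}-style argument, mirroring the call in
  // registerMemPreHook() described above.
  auto snap = c10::cuda::CUDACachingAllocator::snapshot({device, 0});

  // Under the newer, no-argument API this call no longer compiles,
  // which is the conda build failure; there it must be written as:
  //   auto snap = c10::cuda::CUDACachingAllocator::snapshot();
  (void)snap;  // the returned snapshot would normally be inspected
}
```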
Root Cause

Torchcomms code was written for the older PyTorch API, where snapshot() accepted arguments, but conda environments use a newer PyTorch version with the updated API.
Solution

Added conditional compilation to handle both API versions:

- Added a TORCHCOMMS_CONDA_BUILD macro in the CMakeLists.txt files for conda builds
- Updated comms/torchcomms/ncclx/TorchCommNCCLXCCA.cpp
- Updated comms/torchcomms/nccl/TorchCommNCCLCCA.cpp

This follows the same pattern used in ProcessGroupNCCLX.cpp with the NCCLX_CONDA_BUILD macro.
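A minimal sketch of that conditional-compilation pattern, assuming TORCHCOMMS_CONDA_BUILD is defined only for conda builds (e.g. via a target_compile_definitions() entry in the CMakeLists.txt files mentioned above); the wrapper function is hypothetical, not the actual torchcomms code:

```cpp
#include <c10/cuda/CUDACachingAllocator.h>

void takeSnapshot(int device) {
#ifdef TORCHCOMMS_CONDA_BUILD
  // Conda builds ship the newer PyTorch, where snapshot() takes no arguments.
  (void)device;  // unused under the no-argument API
  auto snap = c10::cuda::CUDACachingAllocator::snapshot();
#else
  // Internal builds use the older PyTorch, where snapshot() takes a
  // {device, pool_id} argument.
  auto snap = c10::cuda::CUDACachingAllocator::snapshot({device, 0});
#endif
  (void)snap;  // the snapshot would normally be consumed by the mem hook
}
```

Branching on a build-system macro rather than on a PyTorch version number keeps both build flavors compiling from a single source tree, the same trade-off the existing NCCLX_CONDA_BUILD handling makes in ProcessGroupNCCLX.cpp.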
Differential Revision: D87413411