NCCL backend segfaults on new_comm when built against PyTorch source #2406

Description

@thisisatharva-rh

When torchcomms is built from source against a local PyTorch source build, any call to new_comm("nccl", ...) immediately segfaults inside ncclCommInitRankConfig.

Repro -

import faulthandler, torch, os
faulthandler.enable()

rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"cuda:{local_rank}")

torch.cuda.set_device(local_rank)
torch.cuda.init()
torch.zeros(1, device=device)

from torchcomms import new_comm
comm = new_comm("nccl", device, name="test_comm")  # SIGSEGV here

# Run with: torchrun --nproc_per_node=2 repro.py

Environment -

  • OS: Fedora 41 (Container Image)
  • Python: 3.13.9
  • PyTorch: 2.12.0a0+git54d8d2a (built from source)
  • CUDA toolkit: 12.8
  • CUDA driver: 580.82.07 (CUDA 13.0)
  • NCCL: 2.28.9
  • GPU: 2x NVIDIA H200
  • torchcomms: 0.2.0 (built with USE_NCCLX=OFF USE_TRANSPORT=OFF)

According to Claude, the issue is that the PyTorch-source-build path in the NCCL CMakeLists links libnccl_static.a, which bundles hidden-visibility stubs for cudaGetDriverEntryPoint that shadow the real implementation in libcudart.so at link time. As a result, NCCL's CUDA driver function pointers are never resolved, and ncclCommInitRankConfig segfaults when it calls cuCtxGetCurrent through a NULL pointer. Switching to libnccl.so (as the other build paths in CMakeLists already do) fixes it.

I have verified the above in my environment and am happy to submit the fix.
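
For anyone who wants to check which NCCL flavor their own build ended up with, here is a minimal sketch. It assumes a Linux host with `ldd` available and that you pass the path to the built torchcomms NCCL backend shared object yourself (the exact file name depends on the build and is not stated in this issue). The helper names `links_shared_nccl` and `check_library` are hypothetical. It is only a heuristic: a dynamically linked libnccl.so shows up in `ldd` output, while a statically linked libnccl_static.a does not.

```python
import subprocess
import sys


def links_shared_nccl(ldd_output: str) -> bool:
    """Return True if any line of `ldd` output names a libnccl shared object."""
    return any(
        line.strip().startswith("libnccl.so")
        for line in ldd_output.splitlines()
    )


def check_library(path: str) -> bool:
    """Run `ldd` on the given shared object and look for libnccl.so."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=True
    )
    return links_shared_nccl(result.stdout)


# Usage (path is an assumption; point it at your own build output):
#   python check_nccl_link.py /path/to/torchcomms/nccl/backend.so
if __name__ == "__main__" and len(sys.argv) == 2:
    if check_library(sys.argv[1]):
        print("libnccl.so is linked dynamically")
    else:
        print("no libnccl.so dependency found; NCCL may be linked statically")
```

If no libnccl.so dependency is reported for a build made via the PyTorch-source-build path, that is consistent with the static-link explanation above, though it does not prove it on its own.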

cc: @d4l3k
