When building torchcomms from source against a local PyTorch source build, any call to new_comm("nccl", ..) immediately segfaults inside ncclCommInitRankConfig.
Repro -
import faulthandler, torch, os
faulthandler.enable()
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"cuda:{local_rank}")
torch.cuda.set_device(local_rank)
torch.cuda.init()
torch.zeros(1, device=device)
from torchcomms import new_comm
comm = new_comm("nccl", device, name="test_comm") # SIGSEGV here
# torchrun --nproc_per_node=2 repro.py
Environment -
- OS: Fedora 41 (Container Image)
- Python: 3.13.9
- PyTorch: 2.12.0a0+git54d8d2a (built from source)
- CUDA toolkit: 12.8
- CUDA driver: 580.82.07 (CUDA 13.0)
- NCCL: 2.28.9
- GPU: 2x NVIDIA H200
- torchcomms: 0.2.0 (built with USE_NCCLX=OFF USE_TRANSPORT=OFF)
According to Claude, the issue is that the PyTorch-source-build path in the NCCL CMakeLists links libnccl_static.a, which bundles hidden-visibility stubs for cudaGetDriverEntryPoint that shadow the real implementation in libcudart.so at link time. As a result, NCCL's CUDA driver function pointers are never resolved, and ncclCommInitRankConfig segfaults when it calls cuCtxGetCurrent through a NULL pointer. Switching to libnccl.so (as the other build paths in CMakeLists already do) fixes it.
I have verified the above in my environment and am happy to submit the fix.
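For anyone who wants to check which NCCL and CUDA runtime libraries their process actually loaded, here is a minimal diagnostic sketch. It is Linux-only, the helper name mapped_libraries is hypothetical (not part of torchcomms or PyTorch), and the result is only a heuristic: a missing libnccl.so suggests NCCL was linked statically into some extension, but it does not prove which one.

```python
import re
from pathlib import Path


def mapped_libraries(maps_text: str, pattern: str = r"libnccl|libcudart") -> list[str]:
    """Return sorted, de-duplicated file paths from /proc/<pid>/maps text that match pattern."""
    paths = set()
    for line in maps_text.splitlines():
        # Format: address perms offset dev inode [pathname]
        fields = line.split(maxsplit=5)
        # Anonymous mappings have only five fields; file-backed ones carry a path last.
        if len(fields) == 6 and re.search(pattern, fields[5]):
            paths.add(fields[5])
    return sorted(paths)


if __name__ == "__main__":
    # Run this after importing torch and torchcomms so their shared libraries are mapped.
    maps_file = Path("/proc/self/maps")
    if maps_file.exists():
        for path in mapped_libraries(maps_file.read_text()):
            print(path)
```

In the repro, this could be called right after the torchcomms import and before new_comm, since the process does not survive the segfault.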
cc: @d4l3k