WIP: migrate base image to NGC PyTorch (Blackwell/CUDA-13) — separate from DR-1170 fix by TimPietruskyRunPod · Pull Request #226 · runpod-workers/worker-comfyui

TimPietruskyRunPod · 2026-06-10T11:32:30Z

Summary

Fixes DR-1170 — recurring ComfyUI server (127.0.0.1:8188) not reachable after multiple retries on serverless workers — and moves all images onto the NVIDIA NGC PyTorch base so they run across the entire GPU fleet, including Blackwell (CUDA 13).

Root cause

The "server not reachable" error was a misleading symptom, not the cause. comfy-cli installs ComfyUI into its own workspace venv (/comfyui/.venv), but start.sh launches ComfyUI with /opt/venv's python. The launch venv was therefore missing ComfyUI's runtime deps — most recently sqlalchemy (pulled in by ComfyUI's new asset DB, imported at startup). ComfyUI crashed instantly on boot, the handler saw the process die, and reported it as "not reachable" (~100 ms execution time).
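The venv mismatch described above can be checked by asking the launch interpreter itself whether it can resolve a module. The following is a minimal sketch, not code from this PR: the helper name `missing_modules` is hypothetical, and it only assumes that the interpreter path passed in is the one start.sh uses.

```python
import json
import subprocess
import sys


def missing_modules(modules: list[str], python_exe: str = sys.executable) -> list[str]:
    """Return the modules that `python_exe` cannot resolve.

    The probe runs importlib.util.find_spec inside the target interpreter,
    so the answer reflects that interpreter's site-packages, not ours.
    """
    probe = (
        "import importlib.util, json, sys;"
        "print(json.dumps([m for m in sys.argv[1:]"
        " if importlib.util.find_spec(m) is None]))"
    )
    result = subprocess.run(
        [python_exe, "-c", probe, *modules],
        capture_output=True,
        text=True,
        check=True,  # a broken interpreter should raise, not look "complete"
    )
    return json.loads(result.stdout)
```

Pointing it at the launch venv, for example `missing_modules(["sqlalchemy"], "/opt/venv/bin/python")` (the `bin/python` suffix is an assumption), would be expected to return `["sqlalchemy"]` on an image with this bug and `[]` once the dependencies are mirrored.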

Two secondary landmines were also fixed. First, transformers and huggingface-hub had no upper bounds, so fresh installs pulled the breaking 5.x and 1.x releases respectively. Second, the GPU pre-flight only made driver calls, so a kernel/arch mismatch slipped through.

Changes

Dockerfile
- Mirror ComfyUI's full dependency set (core + custom-node requirements.txt) into /opt/venv so the launch venv is complete — the root-cause fix.
- Pin transformers>=4.50.3,<5 and huggingface-hub<1.0.
- Add a build-time smoke test (main.py --quick-test-for-ci --cpu) that fails the build if ComfyUI can't start — permanent guard against this class of silent breakage.
- Add BASE_PROVIDES_TORCH: when the base ships a tuned torch (NGC), build the venv with --system-site-packages and run comfy-cli with --skip-torch-or-directml so the bundled torch is reused, not clobbered.
- Default base image → nvcr.io/nvidia/pytorch:26.05-py3.
src/start.sh: pre-flight now launches a real CUDA kernel and prints sm_/torch/cuda versions, so an arch/kernel mismatch fails loudly at boot instead of as "server not reachable".
handler.py: optional REPORT_HOST_CUDA flag adds a host block (nvidia-smi CUDA, driver, GPU, torch) to job output for diagnostics. Off by default.
docker-bake.hcl: NGC is the default base for all targets (base, sdxl, sd3, flux1-*, z-image-turbo). Removed the cuda12.8.1 and experimental targets.
CI: dropped base-cuda12-8-1 from release.yml / manual-build-all.yml; removed the experimental dev-cuda-bases.yml workflow.
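The build-time smoke test listed under Dockerfile amounts to "run the startup command and fail on a non-zero exit". A rough sketch of that idea as a reusable helper follows; the helper name and the timeout are assumptions, and the actual Dockerfile step may simply invoke the command directly.

```python
import subprocess
import sys


def smoke_test(command: list[str], timeout_s: float = 300.0) -> bool:
    """Run a startup command and report whether it exited cleanly in time."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        print(f"smoke test timed out after {timeout_s}s", file=sys.stderr)
        return False
    if result.returncode != 0:
        # Show the tail of stderr so the build log names the real failure
        # (for example a missing import) instead of a generic error.
        print(result.stderr[-2000:], file=sys.stderr)
        return False
    return True
```

A call such as `smoke_test(["/opt/venv/bin/python", "/comfyui/main.py", "--quick-test-for-ci", "--cpu"])` would mirror the check described above; the two flags come from this PR, while the exact interpreter and `main.py` paths are assumptions.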

Validation

End-to-end SDXL txt2img on the NGC image across 15/15 serverless-available GPUs, all generating real 1024×1024 images:

Blackwell: B200, RTX 5090, RTX PRO 6000 ✅
Hopper: H100 SXM, H100 NVL, H200 ✅
Ada: RTX 4090, RTX 2000 Ada, L4, L40S ✅
Ampere: A100 SXM, A100 PCIe, A40, A6000, RTX 3090 ✅
B300 not offered as a serverless GPU (untested)

Tradeoff

Every image now inherits the ~20 GB NGC base, so all variants are larger and cold-start more slowly than the previous cu126 images. This is accepted intentionally in exchange for fleet-wide CUDA-13/Blackwell support.

Investigating DR-1170 "ComfyUI server unreachable" on CUDA 13 / Blackwell hosts. Root cause hypothesis: the shipped cu126 torch has no sm_120 kernels, so the worker boots but dies on the first GPU op, surfacing as the misleading "server not reachable" handler error.

- Dockerfile: add BASE_PROVIDES_TORCH so a base image that already ships a tuned torch (NGC nvcr.io/nvidia/pytorch) is reused instead of clobbered: the venv is built with --system-site-packages and comfy-cli is run with --skip-torch-or-directml.
- docker-bake.hcl: add base-ngc target (nvcr.io/nvidia/pytorch:26.05-py3, CUDA 13.2, Blackwell kernels + cuda-compat forward-compat libs).
- start.sh: pre-flight now launches a real CUDA kernel and prints sm_/torch/cuda versions, so an arch/kernel mismatch fails loudly at boot with a clear cause.
- CI: add dev-cuda-bases workflow to build base-ngc and base-cuda12-8-1 on push so we can A/B test the images on real GPUs.
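A pre-flight of the kind described in the start.sh item could look roughly like the sketch below. It is an illustration in Python with PyTorch, not the script from this PR, and the function name is hypothetical. It returns a status instead of raising, so it also runs on machines without torch or a GPU.

```python
def gpu_preflight() -> tuple[bool, str]:
    """Launch a tiny CUDA kernel and report versions; returns (ok, details)."""
    try:
        import torch
    except (ImportError, OSError) as exc:
        return False, f"torch not importable: {exc}"

    if not torch.cuda.is_available():
        return False, f"torch {torch.__version__}: CUDA not available"

    major, minor = torch.cuda.get_device_capability(0)
    info = (
        f"device=sm_{major}{minor} torch={torch.__version__} "
        f"cuda={torch.version.cuda} built_for={torch.cuda.get_arch_list()}"
    )
    try:
        # Driver-level calls such as is_available() can succeed even when the
        # torch build has no kernels for this architecture; doing real
        # arithmetic on the device forces an actual kernel launch.
        x = torch.ones(8, device="cuda")
        total = (x + x).sum().item()
        torch.cuda.synchronize()
    except RuntimeError as exc:
        return False, f"{info} kernel launch failed: {exc}"
    return total == 16.0, info
```

A caller would print the details and exit non-zero when `ok` is false, so the arch/kernel mismatch shows up in the boot log rather than as a later handler error.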

…sting SDXL is public (no HF token). Bakes the checkpoint into the image so a serverless worker can run txt2img with no network volume — needed to validate end-to-end image generation across GPUs/CUDA versions for DR-1170.

…ke test

Root cause of DR-1170 'ComfyUI server not reachable': ComfyUI's requirements have no upper bound on transformers/huggingface-hub, so fresh installs pulled transformers 5.x + hf-hub 1.x, whose breaking changes crash ComfyUI at startup on EVERY GPU (confirmed: identical instant-crash on both NGC and cu128 images across Ampere/Ada/Hopper/Blackwell). Pin to last good majors and add a --quick-test-for-ci smoke test so startup breakage fails the build, not the worker.

comfy-cli installs ComfyUI into its own /comfyui/.venv, but start.sh launches ComfyUI with /opt/venv's python. The launch venv was missing ComfyUI's runtime deps (sqlalchemy from the new asset DB, etc.), so ComfyUI crashed at startup — surfacing as 'ComfyUI server not reachable'. Mirror ComfyUI core + custom-node requirements into /opt/venv. Caught by the new build-time smoke test.

…UDA)

When REPORT_HOST_CUDA=true, job results include a 'host' block with the host's actual CUDA version (nvidia-smi), driver, GPU, and torch build CUDA. Lets us validate the exact CUDA version a serverless worker landed on; the endpoint API only exposes the min-cuda floor. Off by default; cached after first call.
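As a rough illustration of the opt-in, cached host block, the sketch below gathers the nvidia-smi fields with the standard library only. The function names and dictionary keys are assumptions rather than the handler's actual code, and the torch build CUDA field is left out to keep the example dependency-free.

```python
import functools
import os
import re
import subprocess
from typing import Optional


def parse_cuda_version(nvidia_smi_output: str) -> Optional[str]:
    """Extract the 'CUDA Version: X.Y' value from the nvidia-smi banner."""
    match = re.search(r"CUDA Version:\s*([\d.]+)", nvidia_smi_output)
    return match.group(1) if match else None


@functools.lru_cache(maxsize=1)
def _probe_host() -> dict:
    """Query nvidia-smi once; later calls reuse the cached result."""
    try:
        banner = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=10
        ).stdout
        query = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10,
        ).stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        return {"error": str(exc)}
    first = query.strip().splitlines()[0] if query.strip() else ""
    name, _, driver = (part.strip() for part in first.partition(","))
    return {
        "cuda": parse_cuda_version(banner),
        "gpu": name or None,
        "driver": driver or None,
    }


def host_block() -> Optional[dict]:
    """Return the host block only when REPORT_HOST_CUDA is set to 'true'."""
    if os.environ.get("REPORT_HOST_CUDA", "").lower() != "true":
        return None
    return _probe_host()
```

The handler would attach `host_block()` to the job output only when it is not `None`, which keeps the default output unchanged.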

- docker-bake.hcl: default BASE_IMAGE = nvcr.io/nvidia/pytorch:26.05-py3 with BASE_PROVIDES_TORCH=true, applied to all model targets (base, sdxl, sd3, flux1-*, z-image-turbo). Removed the experimental base-ngc/sdxl-ngc/sdxl-cuda128 and the cuda12.8.1 targets.
- Dockerfile: default ARG BASE_IMAGE -> NGC, BASE_PROVIDES_TORCH default -> true.
- release.yml + manual-build-all.yml: drop base-cuda12-8-1 from build matrix.
- remove experimental dev-cuda-bases.yml workflow.

NGC base validated end-to-end on all 15 serverless GPUs incl. all Blackwell (DR-1170).

TimPietruskyRunPod · 2026-06-10T13:57:44Z

Re-scoped. The DR-1170 startup-crash fix (the urgent part) is split out into #227 against the current cu126 base — no base-image changes, ships independently.

This PR now tracks only the NGC PyTorch base migration (Blackwell/CUDA-13). Parked because keeping NVIDIA's tuned torch while adding torchaudio is non-trivial: NGC ships torch+torchvision but not torchaudio, and PyPI torchaudio is ABI-incompatible with NGC's custom 2.12.0a0 torch. Needs a decision (build torchaudio from source vs drop it for image-only workflows vs stock-torch stack) before it's mergeable.

ComfyUI's runtime deps are now installed into the venv that start.sh launches it with, fixing the startup crash that surfaced as 'server not reachable'. Pins transformers<5 / huggingface-hub<1, adds a build-time smoke test, and hardens the GPU pre-flight. Base image unchanged (cu126); NGC migration tracked separately in #226.

TimPietrusky added 6 commits June 9, 2026 15:34

TimPietruskyRunPod mentioned this pull request Jun 10, 2026

fix(DR-1170): fix ComfyUI startup crash ("server not reachable") #227

Merged

TimPietruskyRunPod changed the title ~~fix(DR-1170): fix ComfyUI startup crash + move to NGC PyTorch base for all GPUs~~ WIP: migrate base image to NGC PyTorch (Blackwell/CUDA-13) — separate from DR-1170 fix Jun 10, 2026

TimPietruskyRunPod marked this pull request as draft June 10, 2026 13:57

ssube-runpod approved these changes Jun 10, 2026

TimPietruskyRunPod wants to merge 6 commits into main from fix/dr-1170-ngc-base-blackwell
