WIP: migrate base image to NGC PyTorch (Blackwell/CUDA-13), separate from DR-1170 fix #226
Draft
TimPietruskyRunPod wants to merge 6 commits into
Investigating DR-1170 "ComfyUI server unreachable" on CUDA 13 / Blackwell hosts. Root cause hypothesis: the shipped cu126 torch has no sm_120 kernels, so the worker boots but dies on the first GPU op, surfacing as the misleading "server not reachable" handler error.
- Dockerfile: add BASE_PROVIDES_TORCH so a base image that already ships a tuned torch (NGC nvcr.io/nvidia/pytorch) is reused instead of clobbered. The venv is built with --system-site-packages and comfy-cli is run with --skip-torch-or-directml.
- docker-bake.hcl: add base-ngc target (nvcr.io/nvidia/pytorch:26.05-py3, CUDA 13.2, Blackwell kernels + cuda-compat forward-compat libs).
- start.sh: pre-flight now launches a real CUDA kernel and prints sm_/torch/cuda versions, so an arch/kernel mismatch fails loudly at boot with a clear cause.
- CI: add dev-cuda-bases workflow to build base-ngc and base-cuda12-8-1 on push so we can A/B test the images on real GPUs.
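The pre-flight idea in the start.sh bullet can be sketched in Python as follows. This is a minimal illustration, not the code in start.sh: the function name `gpu_preflight` and the message format are assumptions, and only standard `torch.cuda` calls are used. The point is that `torch.cuda.is_available()` only talks to the driver, while a small tensor op forces a real kernel launch for the device's architecture.

```python
def gpu_preflight():
    """Return (ok, message). Hypothetical helper, not the project's start.sh code."""
    try:
        import torch  # imported lazily so a broken install is reported, not raised
    except Exception as exc:
        return False, f"torch not importable: {exc}"

    if not torch.cuda.is_available():
        return False, (
            f"CUDA not available (torch {torch.__version__}, "
            f"built for CUDA {torch.version.cuda})"
        )

    major, minor = torch.cuda.get_device_capability(0)
    info = (
        f"gpu={torch.cuda.get_device_name(0)} sm_{major}{minor} "
        f"torch={torch.__version__} cuda={torch.version.cuda} "
        f"arch_list={torch.cuda.get_arch_list()}"
    )
    try:
        # Driver queries above can succeed even when no kernels exist for this
        # architecture; an actual op plus .item() forces a launch and a sync.
        x = torch.ones(1024, device="cuda")
        total = (x + x).sum().item()
    except RuntimeError as exc:
        return False, f"CUDA kernel launch failed ({info}): {exc}"
    if total != 2048.0:
        return False, f"unexpected kernel result {total} ({info})"
    return True, info
```

A boot script could call `gpu_preflight()`, print the message, and exit non-zero when `ok` is false, so an arch mismatch is reported before ComfyUI is started.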
…sting SDXL is public (no HF token). Bakes the checkpoint into the image so a serverless worker can run txt2img with no network volume — needed to validate end-to-end image generation across GPUs/CUDA versions for DR-1170.
…ke test Root cause of DR-1170 'ComfyUI server not reachable': ComfyUI's requirements have no upper bound on transformers/huggingface-hub, so fresh installs pulled transformers 5.x + hf-hub 1.x, whose breaking changes crash ComfyUI at startup on EVERY GPU (confirmed: identical instant-crash on both NGC and cu128 images across Ampere/Ada/Hopper/Blackwell). Pin to last good majors and add a --quick-test-for-ci smoke test so startup breakage fails the build, not the worker.
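As a complement to the pins described above, a runtime guard on major versions could look like the sketch below. It is an assumption-laden illustration: the helper names are hypothetical, the bounds are copied from the commit message (transformers below 5, huggingface-hub below 1), and the version strings in the usage note are made-up inputs.

```python
import re
from importlib import metadata

# Exclusive upper bounds on the major version, taken from the pins above.
MAX_MAJOR = {"transformers": 5, "huggingface-hub": 1}


def major_version(version):
    """Leading integer of a version string, e.g. '4.50.3' -> 4."""
    match = re.match(r"\d+", version)
    return int(match.group(0)) if match else 0


def installed_versions(names):
    """Map each installed package name to its version; skip missing ones."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found


def pin_violations(installed, max_major=MAX_MAJOR):
    """Return one message per package whose major version reaches its bound."""
    problems = []
    for name, bound in max_major.items():
        version = installed.get(name)
        if version is not None and major_version(version) >= bound:
            problems.append(f"{name}=={version} violates <{bound}")
    return problems
```

For example, `pin_violations({"transformers": "5.0.0", "huggingface-hub": "0.36.0"})` reports only the transformers entry, and `pin_violations(installed_versions(MAX_MAJOR))` checks the current environment.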
comfy-cli installs ComfyUI into its own /comfyui/.venv, but start.sh launches ComfyUI with /opt/venv's python. The launch venv was missing ComfyUI's runtime deps (sqlalchemy from the new asset DB, etc.), so ComfyUI crashed at startup — surfacing as 'ComfyUI server not reachable'. Mirror ComfyUI core + custom-node requirements into /opt/venv. Caught by the new build-time smoke test.
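The venv mismatch described above can be detected by asking the launch interpreter itself whether it can import what ComfyUI needs. The sketch below is illustrative only: `missing_modules` is a hypothetical helper, and the interpreter path in the usage note is an assumption about where /opt/venv keeps its python.

```python
import subprocess
import sys


def missing_modules(python_exe, modules):
    """Return the module names that `python_exe` fails to import."""
    missing = []
    for name in modules:
        # Run the import in the target interpreter, not the current one,
        # because the two may resolve packages from different venvs.
        result = subprocess.run(
            [python_exe, "-c", f"import {name}"],
            capture_output=True,
        )
        if result.returncode != 0:
            missing.append(name)
    return missing


# Demonstration against the current interpreter with a deliberately bogus name.
print(missing_modules(sys.executable, ["json", "no_such_module_for_demo"]))
```

In the image, the same check would be pointed at the launch venv, for example `missing_modules("/opt/venv/bin/python", ["sqlalchemy"])` (path assumed), and a non-empty result would fail the build.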
…UDA) When REPORT_HOST_CUDA=true, job results include a 'host' block with the host's actual CUDA version (nvidia-smi), driver, GPU, and torch build CUDA. Lets us validate the exact CUDA version a serverless worker landed on — the endpoint API only exposes the min-cuda floor. Off by default; cached after first call.
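A rough sketch of how such an opt-in, cached host block might be built is shown below. It is not the handler.py implementation: the function names are hypothetical, only the driver and CUDA fields are covered (the GPU name and torch build CUDA mentioned above are omitted), and it assumes the banner of plain `nvidia-smi` output contains "Driver Version:" and "CUDA Version:" fields.

```python
import functools
import os
import re
import subprocess


def parse_nvidia_smi_header(text):
    """Extract driver and CUDA versions from plain `nvidia-smi` output."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    return {
        "driver": driver.group(1) if driver else None,
        "cuda": cuda.group(1) if cuda else None,
    }


@functools.lru_cache(maxsize=1)
def host_info():
    """Query the host once per process; later calls return the cached dict."""
    try:
        out = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=10
        ).stdout
    except (OSError, subprocess.SubprocessError):
        out = ""  # no nvidia-smi on this host: report None values
    return parse_nvidia_smi_header(out)


def maybe_add_host(result):
    """Attach a 'host' block only when REPORT_HOST_CUDA=true (off by default)."""
    if os.environ.get("REPORT_HOST_CUDA", "false").lower() == "true":
        return {**result, "host": host_info()}
    return result
```

Caching matters here because spawning `nvidia-smi` on every job would add latency for information that does not change during a worker's lifetime.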
- docker-bake.hcl: default BASE_IMAGE = nvcr.io/nvidia/pytorch:26.05-py3 with BASE_PROVIDES_TORCH=true, applied to all model targets (base, sdxl, sd3, flux1-*, z-image-turbo). Removed the experimental base-ngc/sdxl-ngc/sdxl-cuda128 and the cuda12.8.1 targets.
- Dockerfile: default ARG BASE_IMAGE -> NGC, BASE_PROVIDES_TORCH default -> true.
- release.yml + manual-build-all.yml: drop base-cuda12-8-1 from build matrix.
- Remove experimental dev-cuda-bases.yml workflow.
NGC base validated end-to-end on all 15 serverless GPUs incl. all Blackwell (DR-1170).
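What BASE_PROVIDES_TORCH=true implies for the venv can be illustrated with Python's standard `venv` module. This is only a sketch of the mechanism: the Dockerfile may create its venv with a different tool, and `create_launch_venv` and `sees_system_packages` are hypothetical names. A venv created with system site packages enabled can import whatever the base image already installed, which is how a bundled torch is reused instead of reinstalled.

```python
import pathlib
import tempfile
import venv


def create_launch_venv(path, base_provides_torch):
    """Create a venv; expose system site-packages only if the base ships torch."""
    builder = venv.EnvBuilder(
        system_site_packages=base_provides_torch,
        with_pip=False,  # keep the sketch fast; a real build would install pip
    )
    builder.create(path)
    return pathlib.Path(path) / "pyvenv.cfg"


def sees_system_packages(cfg_path):
    """Read the include-system-site-packages flag from pyvenv.cfg."""
    for line in pathlib.Path(cfg_path).read_text().splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "include-system-site-packages":
            return value.strip().lower() == "true"
    return False


with tempfile.TemporaryDirectory() as tmp:
    cfg = create_launch_venv(pathlib.Path(tmp) / "venv", base_provides_torch=True)
    print(sees_system_packages(cfg))
```

The tradeoff of this design is that the venv is no longer isolated: anything else in the base image's site-packages becomes visible too, which is why the torch install step has to be skipped explicitly rather than left to the resolver.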
Re-scoped. The DR-1170 startup-crash fix (the urgent part) is split out into #227 against the current cu126 base — no base-image changes, ships independently. This PR now tracks only the NGC PyTorch base migration (Blackwell/CUDA-13). Parked because keeping NVIDIA's tuned torch while adding torchaudio is non-trivial: NGC ships torch+torchvision but not torchaudio, and PyPI torchaudio is ABI-incompatible with NGC's custom |
ssube-runpod approved these changes Jun 10, 2026
TimPietruskyRunPod added a commit that referenced this pull request Jun 17, 2026
ComfyUI's runtime deps are now installed into the venv that start.sh launches it with, fixing the startup crash that surfaced as 'server not reachable'. Pins transformers<5 / huggingface-hub<1, adds a build-time smoke test, and hardens the GPU pre-flight. Base image unchanged (cu126); NGC migration tracked separately in #226.
Summary
Fixes DR-1170, the recurring "ComfyUI server (127.0.0.1:8188) not reachable after multiple retries" error on serverless workers, and moves all images onto the NVIDIA NGC PyTorch base so they run across the entire GPU fleet, including Blackwell (CUDA 13).
Root cause
The "server not reachable" error was a misleading symptom, not the cause. comfy-cli installs ComfyUI into its own workspace venv (/comfyui/.venv), but start.sh launches ComfyUI with /opt/venv's python. The launch venv was therefore missing ComfyUI's runtime deps, most recently sqlalchemy (pulled in by ComfyUI's new asset DB, imported at startup). ComfyUI crashed instantly on boot, the handler saw the process die, and reported it as "not reachable" (~100 ms execution time).
Two secondary landmines were also fixed: unbounded transformers/huggingface-hub requirements pulling in the breaking 5.x / 1.x releases, and a GPU pre-flight that only ran driver calls (so a kernel/arch mismatch slipped through).
Changes
- Dockerfile: mirror ComfyUI's core + custom-node requirements (requirements.txt) into /opt/venv so the launch venv is complete. This is the root-cause fix.
- Dockerfile: pin transformers>=4.50.3,<5 and huggingface-hub<1.0.
- Dockerfile: add a build-time smoke test (main.py --quick-test-for-ci --cpu) that fails the build if ComfyUI can't start, as a permanent guard against this class of silent breakage.
- Dockerfile: add BASE_PROVIDES_TORCH. When the base ships a tuned torch (NGC), build the venv with --system-site-packages and run comfy-cli with --skip-torch-or-directml so the bundled torch is reused, not clobbered.
- Dockerfile: the default base image is now nvcr.io/nvidia/pytorch:26.05-py3.
- src/start.sh: pre-flight now launches a real CUDA kernel and prints sm_/torch/cuda versions, so an arch/kernel mismatch fails loudly at boot instead of as "server not reachable".
- handler.py: optional REPORT_HOST_CUDA flag adds a host block (nvidia-smi CUDA, driver, GPU, torch) to job output for diagnostics. Off by default.
- docker-bake.hcl: NGC is the default base for all targets (base, sdxl, sd3, flux1-*, z-image-turbo). Removed the cuda12.8.1 and experimental targets.
- CI: dropped base-cuda12-8-1 from release.yml / manual-build-all.yml; removed the experimental dev-cuda-bases.yml workflow.
Validation
End-to-end SDXL txt2img on the NGC image across 15/15 serverless-available GPUs, all generating real 1024×1024 images:
Tradeoff
Every image now inherits the ~20 GB NGC base, so all variants are larger and cold-start slower than the previous cu126 images. Accepted intentionally for fleet-wide CUDA-13/Blackwell support.