
WIP: migrate base image to NGC PyTorch (Blackwell/CUDA-13) — separate from DR-1170 fix #226

Draft
TimPietruskyRunPod wants to merge 6 commits into main from fix/dr-1170-ngc-base-blackwell


@TimPietruskyRunPod

Contributor

Summary

Fixes DR-1170, the recurring "ComfyUI server (127.0.0.1:8188) not reachable after multiple retries" error on serverless workers, and moves all images onto the NVIDIA NGC PyTorch base so they run across the entire GPU fleet, including Blackwell (CUDA 13).

Root cause

The "server not reachable" error was a misleading symptom, not the cause. comfy-cli installs ComfyUI into its own workspace venv (/comfyui/.venv), but start.sh launches ComfyUI with /opt/venv's python. The launch venv was therefore missing ComfyUI's runtime deps — most recently sqlalchemy (pulled in by ComfyUI's new asset DB, imported at startup). ComfyUI crashed instantly on boot, the handler saw the process die, and reported it as "not reachable" (~100 ms execution time).
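The venv mismatch described above can be checked directly by asking a specific interpreter which modules it can locate. The following is a minimal sketch, not code from this PR: the helper name `missing_modules` is hypothetical, and it only handles top-level module names.

```python
import subprocess
import sys


def missing_modules(python_exe, modules):
    """Return the subset of `modules` that `python_exe` cannot locate.

    Hypothetical helper for illustration. Only top-level module names are
    supported; dotted names with a missing parent would raise in the child.
    """
    # The probe runs inside the target interpreter, so it sees that
    # interpreter's site-packages rather than the caller's.
    probe = (
        "import importlib.util, sys\n"
        "missing = [m for m in sys.argv[1:] if importlib.util.find_spec(m) is None]\n"
        "print('\\n'.join(missing))\n"
    )
    result = subprocess.run(
        [python_exe, "-c", probe, *modules],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in result.stdout.splitlines() if line]


# Example (paths taken from the description above):
#   missing_modules("/opt/venv/bin/python", ["sqlalchemy"])
# A non-empty result would mean the launch venv lacks that dependency.
```

Running the same check against both `/opt/venv` and `/comfyui/.venv` interpreters would show whether a dependency exists in one venv but not the other.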

Two secondary landmines were also fixed: unbounded transformers / huggingface-hub requirements, which let fresh installs pull in the breaking 5.x / 1.x releases, and a GPU pre-flight that only made driver calls (so a kernel/arch mismatch slipped through).
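To illustrate the pinning concern, here is a small sketch that flags installed packages whose major version has reached a cap. It is an assumption-laden illustration rather than anything in the PR: the names `MAJOR_CAPS`, `major_of`, `cap_violations`, and `installed_versions` are hypothetical, only the upper bounds are checked (not the `>=4.50.3` floor), and the version parsing is deliberately simplistic.

```python
from importlib import metadata

# Upper bounds mirroring the pins described in this PR:
# transformers <5 and huggingface-hub <1.0.
MAJOR_CAPS = {"transformers": 5, "huggingface-hub": 1}


def major_of(version):
    """Leading integer of a plain version string, e.g. '4.50.3' -> 4."""
    return int(version.split(".", 1)[0])


def cap_violations(installed, caps=MAJOR_CAPS):
    """Return {package: version} for packages at or above their major cap."""
    return {
        name: version
        for name, version in installed.items()
        if name in caps and major_of(version) >= caps[name]
    }


def installed_versions(names):
    """Look up installed versions, skipping packages that are not installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return found


# Example: cap_violations(installed_versions(MAJOR_CAPS)) returns an empty
# dict when every capped package is below its limit (or not installed).
```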

Changes

  • Dockerfile
    • Mirror ComfyUI's full dependency set (core + custom-node requirements.txt) into /opt/venv so the launch venv is complete — the root-cause fix.
    • Pin transformers>=4.50.3,<5 and huggingface-hub<1.0.
    • Add a build-time smoke test (main.py --quick-test-for-ci --cpu) that fails the build if ComfyUI can't start — permanent guard against this class of silent breakage.
    • Add BASE_PROVIDES_TORCH: when the base ships a tuned torch (NGC), build the venv with --system-site-packages and run comfy-cli with --skip-torch-or-directml so the bundled torch is reused, not clobbered.
    • Default base image → nvcr.io/nvidia/pytorch:26.05-py3.
  • src/start.sh: pre-flight now launches a real CUDA kernel and prints sm_/torch/cuda versions, so an arch/kernel mismatch fails loudly at boot instead of as "server not reachable".
  • handler.py: optional REPORT_HOST_CUDA flag adds a host block (nvidia-smi CUDA, driver, GPU, torch) to job output for diagnostics. Off by default.
  • docker-bake.hcl: NGC is the default base for all targets (base, sdxl, sd3, flux1-*, z-image-turbo). Removed the cuda12.8.1 and experimental targets.
  • CI: dropped base-cuda12-8-1 from release.yml / manual-build-all.yml; removed the experimental dev-cuda-bases.yml workflow.
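The pre-flight idea from the src/start.sh item can be sketched in Python, which a shell script could invoke before launching the server. This is an illustrative sketch and not the PR's actual check: the function name `gpu_preflight` and its return shape are assumptions. The key point is that it runs a real computation on the device instead of only querying the driver.

```python
def gpu_preflight():
    """Return (ok, message); ok is True only if a real CUDA kernel ran.

    Hypothetical helper for illustration.
    """
    try:
        import torch
    except (ImportError, OSError) as exc:
        return False, f"torch not importable: {exc}"

    versions = f"torch {torch.__version__} cuda {torch.version.cuda}"
    if not torch.cuda.is_available():
        return False, f"CUDA not available ({versions})"

    try:
        major, minor = torch.cuda.get_device_capability(0)
        # Allocating on the device and adding tensors launches actual
        # kernels, which fails if the torch build lacks code for this
        # GPU architecture.
        x = torch.ones(8, device="cuda")
        total = (x + x).sum().item()
        torch.cuda.synchronize()
    except RuntimeError as exc:
        return False, f"CUDA kernel launch failed ({versions}): {exc}"

    if total != 16.0:
        return False, f"unexpected kernel result {total} ({versions})"
    return True, f"sm_{major}{minor} {versions}"
```

A caller could print the message and exit non-zero when `ok` is false, so the failure is attributed to the GPU stack at boot rather than to an unreachable server later.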

Validation

End-to-end SDXL txt2img on the NGC image across 15/15 serverless-available GPUs, all generating real 1024×1024 images:

  • Blackwell: B200, RTX 5090, RTX PRO 6000 ✅
  • Hopper: H100 SXM, H100 NVL, H200 ✅
  • Ada: RTX 4090, RTX 2000 Ada, L4, L40S ✅
  • Ampere: A100 SXM, A100 PCIe, A40, A6000, RTX 3090 ✅
  • B300 not offered as a serverless GPU (untested)

Tradeoff

Every image now inherits the ~20 GB NGC base, so every variant is larger and cold-starts more slowly than the previous cu126 images. This is accepted intentionally in exchange for fleet-wide CUDA-13/Blackwell support.

Investigating DR-1170 "ComfyUI server unreachable" on CUDA 13 / Blackwell
hosts. Root cause hypothesis: the shipped cu126 torch has no sm_120 kernels,
so the worker boots but dies on the first GPU op, surfacing as the misleading
"server not reachable" handler error.

- Dockerfile: add BASE_PROVIDES_TORCH so a base image that already ships a tuned
  torch (NGC nvcr.io/nvidia/pytorch) is reused instead of clobbered — venv built
  with --system-site-packages and comfy-cli run with --skip-torch-or-directml.
- docker-bake.hcl: add base-ngc target (nvcr.io/nvidia/pytorch:26.05-py3, CUDA
  13.2, Blackwell kernels + cuda-compat forward-compat libs).
- start.sh: pre-flight now launches a real CUDA kernel and prints sm_/torch/cuda
  versions, so an arch/kernel mismatch fails loudly at boot with a clear cause.
- CI: add dev-cuda-bases workflow to build base-ngc and base-cuda12-8-1 on push
  so we can A/B test the images on real GPUs.
…sting

SDXL is public (no HF token). Bakes the checkpoint into the image so a
serverless worker can run txt2img with no network volume — needed to validate
end-to-end image generation across GPUs/CUDA versions for DR-1170.
…ke test

Root cause of DR-1170 'ComfyUI server not reachable': ComfyUI's requirements
have no upper bound on transformers/huggingface-hub, so fresh installs pulled
transformers 5.x + hf-hub 1.x, whose breaking changes crash ComfyUI at startup
on EVERY GPU (confirmed: identical instant-crash on both NGC and cu128 images
across Ampere/Ada/Hopper/Blackwell). Pin to last good majors and add a
--quick-test-for-ci smoke test so startup breakage fails the build, not the worker.

comfy-cli installs ComfyUI into its own /comfyui/.venv, but start.sh launches
ComfyUI with /opt/venv's python. The launch venv was missing ComfyUI's runtime
deps (sqlalchemy from the new asset DB, etc.), so ComfyUI crashed at startup —
surfacing as 'ComfyUI server not reachable'. Mirror ComfyUI core + custom-node
requirements into /opt/venv. Caught by the new build-time smoke test.
…UDA)

When REPORT_HOST_CUDA=true, job results include a 'host' block with the host's
actual CUDA version (nvidia-smi), driver, GPU, and torch build CUDA. Lets us
validate the exact CUDA version a serverless worker landed on — the endpoint
API only exposes the min-cuda floor. Off by default; cached after first call.
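
A rough sketch of how such a cached host block might be assembled follows. It is not the handler's actual code: the names `parse_cuda_version`, `host_info`, and `attach_host` are hypothetical, the torch build CUDA field is omitted for brevity, and it assumes `nvidia-smi` prints a `CUDA Version: X.Y` field in its banner and supports the `--query-gpu` CSV output.

```python
import functools
import os
import re
import subprocess


def parse_cuda_version(smi_text):
    """Extract 'X.Y' from the 'CUDA Version: X.Y' field of nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*([0-9]+\.[0-9]+)", smi_text)
    return match.group(1) if match else None


@functools.lru_cache(maxsize=1)
def host_info():
    """Collect host details once; later calls reuse the cached result."""
    info = {"cuda": None, "driver": None, "gpu": None}
    try:
        banner = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=10
        ).stdout
        info["cuda"] = parse_cuda_version(banner)
        rows = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10,
        ).stdout.strip().splitlines()
        if rows:
            # The driver version is the last CSV field of the first GPU row.
            parts = [part.strip() for part in rows[0].rsplit(",", 1)]
            if len(parts) == 2:
                info["gpu"], info["driver"] = parts
    except (OSError, subprocess.SubprocessError):
        pass  # No nvidia-smi or it timed out: leave the fields as None.
    return info


def attach_host(result, env=os.environ):
    """Add a 'host' block to a job result only when REPORT_HOST_CUDA=true."""
    if env.get("REPORT_HOST_CUDA", "").lower() == "true":
        return {**result, "host": dict(host_info())}
    return result
```

Keeping the flag check in one place means the default path returns the job result untouched and never shells out.
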
- docker-bake.hcl: default BASE_IMAGE = nvcr.io/nvidia/pytorch:26.05-py3 with
  BASE_PROVIDES_TORCH=true, applied to all model targets (base, sdxl, sd3,
  flux1-*, z-image-turbo). Removed the experimental base-ngc/sdxl-ngc/
  sdxl-cuda128 and the cuda12.8.1 targets.
- Dockerfile: default ARG BASE_IMAGE -> NGC, BASE_PROVIDES_TORCH default -> true.
- release.yml + manual-build-all.yml: drop base-cuda12-8-1 from build matrix.
- remove experimental dev-cuda-bases.yml workflow.

NGC base validated end-to-end on all 15 serverless GPUs incl. all Blackwell (DR-1170).
@TimPietruskyRunPod TimPietruskyRunPod changed the title from "fix(DR-1170): fix ComfyUI startup crash + move to NGC PyTorch base for all GPUs" to "WIP: migrate base image to NGC PyTorch (Blackwell/CUDA-13) — separate from DR-1170 fix" on Jun 10, 2026
@TimPietruskyRunPod TimPietruskyRunPod marked this pull request as draft June 10, 2026 13:57
@TimPietruskyRunPod

Contributor Author

Re-scoped. The DR-1170 startup-crash fix (the urgent part) is split out into #227 against the current cu126 base — no base-image changes, ships independently.

This PR now tracks only the NGC PyTorch base migration (Blackwell/CUDA-13). Parked because keeping NVIDIA's tuned torch while adding torchaudio is non-trivial: NGC ships torch+torchvision but not torchaudio, and PyPI torchaudio is ABI-incompatible with NGC's custom 2.12.0a0 torch. Needs a decision (build torchaudio from source vs drop it for image-only workflows vs stock-torch stack) before it's mergeable.

TimPietruskyRunPod added a commit that referenced this pull request Jun 17, 2026
ComfyUI's runtime deps are now installed into the venv that start.sh launches it with, fixing the startup crash that surfaced as 'server not reachable'. Pins transformers<5 / huggingface-hub<1, adds a build-time smoke test, and hardens the GPU pre-flight. Base image unchanged (cu126); NGC migration tracked separately in #226.