fix(DR-1170): fix ComfyUI startup crash ("server not reachable")#227

Merged: TimPietruskyRunPod merged 3 commits into main from fix/dr-1170-startup-crash on Jun 17, 2026

@TimPietruskyRunPod

Summary

Fixes DR-1170: the recurring "ComfyUI server (127.0.0.1:8188) not reachable after multiple retries" error on serverless workers. Scoped intentionally to the crash fix only, on the current cu126 base, with no base-image changes. (Moving to the NGC base is tracked separately in #226.)

Root cause

Nothing in this repo changed — the breakage rode in via COMFYUI_VERSION=latest.

comfy-cli installs ComfyUI into its own workspace venv (/comfyui/.venv), but start.sh launches ComfyUI with /opt/venv's python. That left the launch venv missing ComfyUI's runtime deps — most recently sqlalchemy, newly imported at startup by ComfyUI's asset DB. ComfyUI crashed instantly on boot (~100 ms), the handler saw the process die, and reported it as "server not reachable."

Because we install the latest ComfyUI on every build, the same Dockerfile started producing broken images once upstream added that import — already-built images kept working, fresh builds broke.
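To confirm this kind of venv drift, one option is to ask the launch interpreter which modules it cannot locate. The following is a minimal sketch, not code from this PR: `find_missing` is a hypothetical helper, and the module list passed to it is an assumption (the description above only names sqlalchemy).

```python
import importlib.util


def find_missing(modules):
    """Return the top-level module names this interpreter cannot locate.

    Run it with the interpreter that actually launches the server, since
    the result depends on that interpreter's site-packages.
    """
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in modules if importlib.util.find_spec(name) is None]
```

Running `find_missing(["sqlalchemy"])` under the launch venv's python and again under the workspace venv's python would show whether the two environments disagree.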

Changes (Dockerfile + start.sh only)

  • Mirror ComfyUI's full dependency set (core + custom-node requirements.txt) into /opt/venv so the launch venv is complete.
  • Pin transformers>=4.50.3,<5 and huggingface-hub<1.0 (both unbounded upstream; 5.x / 1.x also break startup).
  • Build-time smoke test (main.py --quick-test-for-ci --cpu): a startup-breaking dependency now fails the build instead of a live worker — permanent guard against this class of latest regression.
  • start.sh pre-flight launches a real CUDA kernel and prints sm_/torch/cuda, so a kernel/arch mismatch fails loudly at boot instead of as "not reachable."
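The pre-flight idea in the last bullet can be sketched roughly as below. This is an illustration under assumptions, not the actual start.sh contents: `gpu_preflight` and the shape of the returned report are hypothetical, and only standard PyTorch calls are used.

```python
def gpu_preflight():
    """Launch a tiny CUDA kernel and report sm_/torch/cuda details.

    Returns a dict with at least "ok" (bool) and "reason" (str).
    """
    try:
        import torch
    except Exception as exc:  # ImportError, or a broken native library load
        return {"ok": False, "reason": f"torch not importable: {exc}"}

    report = {"torch": torch.__version__, "cuda": torch.version.cuda}
    if not torch.cuda.is_available():
        return {**report, "ok": False, "reason": "CUDA not available"}

    major, minor = torch.cuda.get_device_capability(0)
    report["sm"] = f"sm_{major}{minor}"
    try:
        # A real kernel launch: fails if the wheel has no code for this arch.
        x = torch.ones(1024, device="cuda")
        total = (x + x).sum().item()
        torch.cuda.synchronize()
    except RuntimeError as exc:
        return {**report, "ok": False, "reason": f"kernel launch failed: {exc}"}

    ok = total == 2048.0
    return {**report, "ok": ok, "reason": "kernel ran" if ok else "unexpected result"}
```

A startup script could print the report and exit non-zero when `ok` is false, so the failure appears in the boot log rather than as a later connection error.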

Scope

ComfyUI is installed by comfy-cli into its own /comfyui/.venv, but start.sh
launches it with /opt/venv's python. The launch venv was missing ComfyUI's
runtime deps (sqlalchemy from the new asset DB, etc.), so ComfyUI crashed at
startup — surfacing as the misleading 'ComfyUI server not reachable' error.
Nothing in this repo changed; the breakage rode in via COMFYUI_VERSION=latest.

- Mirror ComfyUI's full dependency set (core + custom-node requirements) into
  /opt/venv so the launch venv is complete.
- Pin transformers<5 / huggingface-hub<1 (both unbounded upstream).
- Add a build-time smoke test (main.py --quick-test-for-ci --cpu) so a
  startup-breaking dep fails the build instead of a live worker.
- start.sh pre-flight now launches a real kernel + prints sm/torch/cuda, so a
  GPU/kernel mismatch fails loudly at boot instead of as 'not reachable'.

Base-image stays cu126; NGC base migration is a separate branch/PR.
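For illustration, the build-time smoke test amounts to "run the startup path once and fail on a non-zero exit". A minimal sketch, assuming the flags quoted above; `run_smoke_test` and `comfyui_smoke_cmd` are hypothetical helpers, and the interpreter path and working directory are left to the caller because the PR text does not spell them out.

```python
import subprocess
import sys


def comfyui_smoke_cmd(python=sys.executable):
    """Build the smoke-test command using the flags named in this PR."""
    return [python, "main.py", "--quick-test-for-ci", "--cpu"]


def run_smoke_test(cmd, cwd=None, timeout=300):
    """Return True only if `cmd` exits with status 0 within `timeout` seconds."""
    try:
        result = subprocess.run(cmd, cwd=cwd, timeout=timeout)
    except subprocess.TimeoutExpired:
        # A hang at startup counts as a failure, same as a crash.
        return False
    return result.returncode == 0
```

In a build step, a false return would be turned into a non-zero exit so the image build stops before a broken image can be published.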
Comment thread: Dockerfile (outdated)
# so a fresh install can pull transformers 5.x / huggingface-hub 1.x whose
# breaking API changes also crash ComfyUI at startup. Keep them on the last
# known-good major.
RUN uv pip install "transformers>=4.50.3,<5" "huggingface-hub<1.0"

If you can run this in the same step as the previous uv pip install, it will delete the old versions from the layer entirely, saving some space

Contributor Author

Good call — done in 1462e01. Folded the transformers<5 / huggingface-hub<1 pin into the same RUN as the requirements install, so the downgrade happens in one layer and the 5.x/1.x versions don't linger. Re-running the build to revalidate the smoke test.

Per review (PR #227): downgrade transformers<5 / huggingface-hub<1 in the same
RUN as the requirements install so the unwanted 5.x/1.x versions don't linger in
a lower layer.
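To check after a build that the pins actually held, the installed major versions can be read back from package metadata. A rough sketch, not part of this PR: `major_of` and `check_major_ceiling` are hypothetical helpers, and they only check the upper bound (the `>=4.50.3` floor on transformers is not verified here).

```python
from importlib import metadata


def major_of(version):
    """Return the leading integer of a dotted version string like '4.50.3'."""
    return int(version.split(".")[0])


def check_major_ceiling(package, ceiling):
    """Describe whether the installed `package` has a major version below `ceiling`."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed"
    status = "ok" if major_of(installed) < ceiling else f"violates <{ceiling}"
    return f"{package} {installed}: {status}"
```

For the pins in this PR, the calls would be `check_major_ceiling("transformers", 5)` and `check_major_ceiling("huggingface-hub", 1)`.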

@Madiator2011Work Madiator2011Work left a comment

LGTM

@TimPietruskyRunPod TimPietruskyRunPod merged commit 402e5ed into main Jun 17, 2026
2 checks passed
@TimPietruskyRunPod TimPietruskyRunPod deleted the fix/dr-1170-startup-crash branch June 17, 2026 08:09