fix(DR-1170): fix ComfyUI startup crash ("server not reachable") #227
Merged
ComfyUI is installed by comfy-cli into its own /comfyui/.venv, but start.sh launches it with /opt/venv's python. The launch venv was missing ComfyUI's runtime deps (sqlalchemy from the new asset DB, etc.), so ComfyUI crashed at startup, surfacing as the misleading 'ComfyUI server not reachable' error. Nothing in this repo changed; the breakage rode in via COMFYUI_VERSION=latest.

- Mirror ComfyUI's full dependency set (core + custom-node requirements) into /opt/venv so the launch venv is complete.
- Pin transformers<5 / huggingface-hub<1 (both unbounded upstream).
- Add a build-time smoke test (main.py --quick-test-for-ci --cpu) so a startup-breaking dep fails the build instead of a live worker.
- start.sh pre-flight now launches a real kernel + prints sm/torch/cuda, so a GPU/kernel mismatch fails loudly at boot instead of as 'not reachable'.

Base-image stays cu126; NGC base migration is a separate branch/PR.
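As a rough illustration of the pre-flight idea in the last bullet above (launch a real kernel and print sm/torch/cuda), a Python check along these lines could be invoked from a startup script. This is a sketch under assumptions, not the code in this PR's start.sh: `format_report` and `preflight` are hypothetical names, and `preflight` assumes a PyTorch build with CUDA support is installed in the interpreter that runs it.

```python
def format_report(torch_version, cuda_version, capability):
    """Build the one-line diagnostic: torch version, CUDA version, and sm_XY arch tag."""
    major, minor = capability
    return f"torch={torch_version} cuda={cuda_version} sm_{major}{minor}"


def preflight():
    """Run a tiny CUDA kernel so an arch/kernel mismatch raises here, at boot."""
    # Imported lazily so format_report stays usable on machines without torch.
    import torch

    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available to torch")

    capability = torch.cuda.get_device_capability(0)
    print(format_report(torch.__version__, torch.version.cuda, capability))

    # A real kernel launch: allocate on the GPU, add, and read the result back.
    x = torch.ones(8, device="cuda")
    total = (x + x).sum().item()
    torch.cuda.synchronize()
    if total != 16.0:
        raise RuntimeError(f"unexpected kernel result: {total}")
```

Calling `preflight()` before launching the server means a torch build that cannot run kernels on the attached GPU fails with a traceback and the printed versions, rather than the server dying later.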
# so a fresh install can pull transformers 5.x / huggingface-hub 1.x whose
# breaking API changes also crash ComfyUI at startup. Keep them on the last
# known-good major.
RUN uv pip install "transformers>=4.50.3,<5" "huggingface-hub<1.0"
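One way to confirm that a built image actually honors these upper bounds is to inspect the installed versions from the launch interpreter. The sketch below is an assumption-laden illustration, not part of this PR: `UPPER_BOUNDS`, `major_version`, `pin_violations`, and `installed_versions` are hypothetical names, and only the major-version ceilings are checked (the `>=4.50.3` lower bound is not).

```python
from importlib import metadata

# Exclusive major-version ceilings, taken from the pin above:
# transformers<5 and huggingface-hub<1.
UPPER_BOUNDS = {"transformers": 5, "huggingface-hub": 1}


def major_version(version):
    """Return the leading numeric component of a version string like '4.50.3'."""
    return int(version.split(".")[0])


def pin_violations(installed, bounds=UPPER_BOUNDS):
    """List 'name==version' for every package at or above its major ceiling."""
    return [
        f"{name}=={version}"
        for name, version in installed.items()
        if name in bounds and major_version(version) >= bounds[name]
    ]


def installed_versions(names):
    """Look up installed versions in the current interpreter, skipping absent packages."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found
```

For example, `pin_violations(installed_versions(UPPER_BOUNDS))` returns an empty list when both packages are below their ceilings (or not installed at all).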
If you can run this in the same step as the previous uv pip install, it will delete the old versions from the layer entirely, saving some space.
Contributor
Author
Good call — done in 1462e01. Folded the transformers<5 / huggingface-hub<1 pin into the same RUN as the requirements install, so the downgrade happens in one layer and the 5.x/1.x versions don't linger. Re-running the build to revalidate the smoke test.
Per review (PR #227): downgrade transformers<5 / huggingface-hub<1 in the same RUN as the requirements install so the unwanted 5.x/1.x versions don't linger in a lower layer.
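The space argument follows from how image layers add up: each RUN step stores the files it wrote, and replacing them in a later step does not shrink the earlier layer. A toy model makes the arithmetic concrete. The package sizes below are invented purely for illustration, and `image_size` is a hypothetical helper, not a Docker API.

```python
def image_size(layers):
    """Total stored size: every layer keeps all files it wrote, even if replaced later."""
    return sum(sum(files.values()) for files in layers)


# Hypothetical sizes in MB, for illustration only.
# Separate RUN steps: the 5.x / 1.x installs stay in the first layer.
separate_steps = [
    {"transformers-5.x": 40, "huggingface-hub-1.x": 2},
    {"transformers-4.x": 38, "huggingface-hub-0.x": 2},
]

# Single RUN step: only the final, downgraded versions are written to the layer.
single_step = [
    {"transformers-4.x": 38, "huggingface-hub-0.x": 2},
]

print(image_size(separate_steps))  # 82
print(image_size(single_step))     # 40
```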
Summary
Fixes DR-1170, the recurring `ComfyUI server (127.0.0.1:8188) not reachable after multiple retries` error on serverless workers. Scoped intentionally to the crash fix only, on the current cu126 base, with no base-image changes. (Moving to the NGC base is tracked separately in #226.)

Root cause
Nothing in this repo changed: the breakage rode in via `COMFYUI_VERSION=latest`.

comfy-cli installs ComfyUI into its own workspace venv (`/comfyui/.venv`), but `start.sh` launches ComfyUI with `/opt/venv`'s python. That left the launch venv missing ComfyUI's runtime deps, most recently `sqlalchemy`, newly imported at startup by ComfyUI's asset DB. ComfyUI crashed instantly on boot (~100 ms), the handler saw the process die, and reported it as "server not reachable."

Because we install the latest ComfyUI on every build, the same Dockerfile started producing broken images once upstream added that import: already-built images kept working, fresh builds broke.
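The failure mode described here, a module imported at startup being absent from the interpreter that launches ComfyUI, can be probed directly from that interpreter. The following is a minimal sketch, not something this PR ships: `missing_modules` is a hypothetical helper, and the list of module names to check is up to the caller.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names the current interpreter cannot locate."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Treat lookup errors the same as "not found".
            spec = None
        if spec is None:
            missing.append(name)
    return missing
```

Run with `/opt/venv`'s python, `missing_modules(["sqlalchemy"])` would return `["sqlalchemy"]` in an image affected by this bug and an empty list once the launch venv is complete. Because it only locates modules without importing them, it is cheap, but it will not catch a package that is present and fails during import.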
Changes (Dockerfile + start.sh only)
- Mirror ComfyUI's full dependency set (core + custom-node `requirements.txt`) into `/opt/venv` so the launch venv is complete.
- Pin `transformers>=4.50.3,<5` and `huggingface-hub<1.0` (both unbounded upstream; 5.x / 1.x also break startup).
- Add a build-time smoke test (`main.py --quick-test-for-ci --cpu`): a startup-breaking dependency now fails the build instead of a live worker, a permanent guard against this class of `latest` regression.
- `start.sh` pre-flight now launches a real kernel and prints `sm_`/torch/cuda, so a kernel/arch mismatch fails loudly at boot instead of as "not reachable."

Scope