Fix ipfs-cluster wedge: blox-ai bind-mounts identity.json as a directory#76
Merged
…file

Root cause: the blox-ai plugin bind-mounted the cluster identity FILE directly (/uniondrive/ipfs-cluster/identity.json:...:ro). When the host file is absent at container-create time, the Docker daemon auto-creates the source as a root-owned DIRECTORY. That permanently wedges ipfs-cluster (its init loops forever waiting for a regular-file identity.json) and makes go-fula initipfscluster panic (WriteFile over a directory); go-fula.sh has no set -e so it marks .ipfscluster_setup done anyway, masking it.

Neither readiness-check nor the blox-ai "not earning" tree caught it: the wedged container is still running, and the wait-loop emits no daemon-log error the readiness checks match.

Proven end-to-end on a lab device, including deterministic peerID regeneration so earnings are preserved.

Fix 1 (prevent recurrence) - plugins/blox-ai/docker-compose.yml: mount the parent directory /uniondrive/ipfs-cluster instead of the file, and point BLOX_AI_CLUSTER_IDENTITY_PATH at identity.json inside it. Docker can then only ever auto-create the harmless directory, never the file-as-directory. identity_health honors the env var, so no blox-ai image rebuild is needed.

Fix 2 (recover already-wedged devices) - readiness-check.py check_and_fix_ipfs_cluster(): detect identity.json being a directory (the existing log-pattern checks never match this failure), stop blox-ai, stop fula, rm the bogus dir, restart fula so go-fula regenerates the deterministic identity, then restart blox-ai once the file is back. All subprocess calls are timeout-bounded so a D-state container cannot hang the watchdog.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
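The recovery order described for Fix 2 can be sketched roughly as below. This is a minimal illustration, not the code in readiness-check.py: the unit names `blox-ai.service` and `fula.service`, the use of `systemctl`, the timeout values, the status strings, and the helper names `run_bounded`, `wait_for_file`, and `heal_directory_identity` are all assumptions made for the example.

```python
import os
import shutil
import subprocess
import time

# Hypothetical unit names and timeout; the real values live in readiness-check.py.
BLOX_AI_UNIT = "blox-ai.service"
FULA_UNIT = "fula.service"
CMD_TIMEOUT_S = 60.0


def run_bounded(cmd, timeout=CMD_TIMEOUT_S):
    """Run cmd with a hard timeout; False on failure, timeout, or missing binary."""
    try:
        return subprocess.run(cmd, timeout=timeout, check=False).returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False


def wait_for_file(path, wait_s, poll_s=0.1):
    """Poll until path is a regular file or wait_s seconds have elapsed."""
    deadline = time.monotonic() + wait_s
    while not os.path.isfile(path):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True


def heal_directory_identity(identity_path, run=run_bounded, wait_s=120.0):
    """Recover from a directory-shaped identity.json; returns a status string."""
    if not os.path.isdir(identity_path):
        return "not_wedged"
    # Order follows the commit message: stop blox-ai, stop fula, remove the dir.
    run(["systemctl", "stop", BLOX_AI_UNIT])
    run(["systemctl", "stop", FULA_UNIT])
    shutil.rmtree(identity_path, ignore_errors=True)
    if os.path.isdir(identity_path):
        return "remove_failed"
    # Restarting fula lets go-fula regenerate the deterministic identity.
    run(["systemctl", "restart", FULA_UNIT])
    # Only bring blox-ai back once identity.json is a regular file again.
    if not wait_for_file(identity_path, wait_s):
        return "identity_not_regenerated"
    run(["systemctl", "restart", BLOX_AI_UNIT])
    return "healed"
```

Passing the command runner in as a parameter keeps the sequence testable without touching real services; a watchdog would call `heal_directory_identity("/uniondrive/ipfs-cluster/identity.json")` with the default runner.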
Fix 3a (go-fula.sh): remove a directory-shaped identity.json before initipfscluster so the writer does not panic over it. Defense-in-depth alongside the Go-side guard in go-fula initipfscluster (separate PR).

Fix 4a (not-earning.yaml): branch on identity_health's new cluster_identity_is_directory reason so the "not earning" tree gives a concrete "cluster identity is a folder" verdict + restart_fula recommendation instead of the generic "indeterminate". Pairs with the blox-ai identity_health.py change that emits the reason (separate PR); the branch is inert until that ships.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
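The two ideas in this commit, classifying a directory-shaped identity and clearing it before the writer runs, can be illustrated with a short Python sketch. It is not the go-fula.sh or identity_health.py code: the function names `identity_reason` and `write_identity` are hypothetical, and of the reason strings only `cluster_identity_is_directory` comes from the commit message; `ok` and `cluster_identity_missing` are placeholders invented for the example.

```python
import json
import os
import shutil


def identity_reason(path):
    """Classify the identity path. Only the directory reason is from the PR."""
    if os.path.isdir(path):
        return "cluster_identity_is_directory"
    if os.path.isfile(path):
        return "ok"  # placeholder reason, not from the PR
    return "cluster_identity_missing"  # placeholder reason, not from the PR


def write_identity(path, payload):
    """Write identity JSON, first removing a directory that occupies the path."""
    if os.path.isdir(path):
        # Without this guard, opening the path for writing fails with an OSError.
        shutil.rmtree(path)
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(payload, f)
    # Atomic rename so readers never observe a half-written identity.
    os.replace(tmp, path)
```

Writing to a temporary file and renaming it is a design choice of the sketch, not something the PR states; the point being illustrated is only the directory check that runs before the write.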
Problem
On a blox, `/uniondrive/ipfs-cluster/identity.json` was an empty directory instead of a file. `ipfs_cluster` was stuck forever printing `Waiting for /internal and /uniondrive to become available and writable...` and the device could not earn. Neither `fula-readiness-check` nor the blox-ai "not earning" path detected or repaired it.

Root cause (proven on a lab device)
The blox-ai plugin bind-mounts the identity FILE directly in `plugins/blox-ai/docker-compose.yml` (`/uniondrive/ipfs-cluster/identity.json:...:ro`).

When the Docker daemon sets up a bind mount whose host source path is absent, it auto-creates the source as a root-owned directory. blox-ai runs as its own systemd service with no ordering dependency on the identity existing, so whenever `identity.json` is absent at blox-ai container-create time (fresh or reformatted `/uniondrive`, or before go-fula writes it), Docker poisons the path. That permanently wedges the cluster (`[ -f identity.json ]` fails forever) and makes go-fula's `initipfscluster` panic (`WriteFile` over a directory); `go-fula.sh` has no `set -e`, so it marks `.ipfscluster_setup` done anyway, masking it.

Reproduced byte-for-byte on the lab device: a `docker run` with that exact mount + absent source created `identity.json` as `drwxr-xr-x root root`. The regenerated cluster peerID is deterministic (derived from `/internal/config.yaml`), so recovery preserves the on-chain registration -> earnings resume without re-registration.

Why the safety nets missed it
- `fula-readiness-check`: `identity.json` was treated as "legitimately absent."
- blox-ai "not earning" tree: keyed on `oom_killed` or `state != running`; the wedged container is `running`, so it fell through to a chain check that couldn't read the peerID -> "indeterminate."

Changes (this PR, fula-ota)
- Fix 1, `plugins/blox-ai/docker-compose.yml`: mount the parent dir `/uniondrive/ipfs-cluster` instead of the file, and point `BLOX_AI_CLUSTER_IDENTITY_PATH` at `identity.json` inside it. Docker can then only ever auto-create the harmless directory, never the file-as-dir. `identity_health` honors the env var, so no blox-ai image rebuild is needed.
- Fix 2, `readiness-check.py` `check_and_fix_ipfs_cluster()`: detect `os.path.isdir(identity.json)` -> stop blox-ai -> stop fula -> `rm -rf` the dir -> restart fula (go-fula regenerates the deterministic identity) -> restart blox-ai once the file is back. All subprocess calls are timeout-bounded so a D-state container cannot hang the watchdog.
- Fix 3a, `docker/go-fula/go-fula.sh`: remove a directory-shaped `identity.json` before `initipfscluster`.
- Fix 4a, `plugins/blox-ai/trees/not-earning.yaml`: branch on the new `cluster_identity_is_directory` reason -> concrete "cluster identity is a folder" verdict + `restart_fula` recommendation instead of "indeterminate."

Companion PRs
- go-fula: `initipfscluster` self-heals a directory-shaped `identity.json` (Fix 3b).
- blox-ai: `identity_health` reports `cluster_identity_is_directory` (Fix 4b) -- Fix 4a's tree branch is inert until this ships.

Validation (lab device, fxblox-rk1)
- `** IPFS Cluster is READY **`, rejoined the pool, same peerID.
- `docker cp` -> `/usr/bin/fula` -> `install.sh` refreshes the active compose on boot.

Notes
No fleet deploy happens on merge -- CI builds `fxsupport:release` only when a GitHub Release is published. Ship Fix 1 + Fix 2 together (self-heal alone could re-wedge without the mount fix).

🤖 Generated with Claude Code