Fix ipfs-cluster wedge: blox-ai bind-mounts identity.json as a directory #76

Merged
ehsan6sha merged 2 commits into main from fix/ipfs-cluster-identity-dir-wedge on Jun 18, 2026

Conversation

@ehsan6sha

Member

Problem

On a blox, /uniondrive/ipfs-cluster/identity.json was an empty directory instead of a file. ipfs_cluster was stuck forever printing Waiting for /internal and /uniondrive to become available and writable... and the device could not earn. Neither fula-readiness-check nor the blox-ai "not earning" path detected or repaired it.

Root cause (proven on a lab device)

The blox-ai plugin bind-mounts the identity FILE directly (plugins/blox-ai/docker-compose.yml):

- /uniondrive/ipfs-cluster/identity.json:/etc/fula/host-ipfs-cluster-identity.json:ro

When the Docker daemon sets up a bind mount whose host source path is absent, it auto-creates the source as a root-owned directory. blox-ai runs as its own systemd service with no ordering dependency on the identity existing. So whenever identity.json is absent at blox-ai container-create time (fresh/reformatted /uniondrive, or before go-fula writes it), Docker creates the path as a directory. That has two effects: it permanently wedges the cluster, because [ -f identity.json ] fails forever, and it makes go-fula's initipfscluster panic (WriteFile over a directory). go-fula.sh has no set -e, so it marks .ipfscluster_setup done anyway, which masks the failure.

Reproduced byte-for-byte on the lab device: a docker run with that exact mount and an absent source created identity.json as drwxr-xr-x root root. The regenerated cluster peerID is deterministic (derived from /internal/config.yaml), so recovery preserves the on-chain registration and earnings resume without re-registration.
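The failure mode can be illustrated without Docker. The sketch below is not code from this PR: it uses a temporary directory and a plain `os.mkdir` as a stand-in for the daemon auto-creating the bind source, and the helper names (`classify_identity_path`, `try_write_identity`, `demo`) are hypothetical. It shows why a regular-file test never passes and why a writer fails once the path is a directory.

```python
import os
import tempfile


def classify_identity_path(path):
    """Return 'directory', 'file', or 'missing' for the given path."""
    if os.path.isdir(path):
        return "directory"
    if os.path.isfile(path):
        return "file"
    return "missing"


def try_write_identity(path, payload):
    """Try to write payload to path; return None on success, else the error name."""
    try:
        with open(path, "w") as fh:
            fh.write(payload)
        return None
    except OSError as exc:
        return type(exc).__name__


def demo():
    """Simulate the wedge in a scratch directory and report what happens."""
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "identity.json")
        before = classify_identity_path(path)
        # Stand-in (assumption) for Docker auto-creating the absent bind source.
        os.mkdir(path)
        after = classify_identity_path(path)
        # Analogous to the WriteFile-over-a-directory failure in the writer.
        error = try_write_identity(path, "{}")
        return before, after, error


if __name__ == "__main__":
    print(demo())
```

On Linux the write attempt fails with `IsADirectoryError`, the Python counterpart of the WriteFile failure described above.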

Why the safety nets missed it

  • readiness-check only matched specific daemon-error log strings; the wait-loop matches none, and a directory identity.json was treated as "legitimately absent."
  • not-earning tree only flagged the cluster if oom_killed or state != running; the wedged container is running, so it fell through to a chain check that couldn't read the peerID -> "indeterminate."
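To make the gap concrete, here is a simplified model of the two decision paths. It is only a sketch: the real tree lives in not-earning.yaml, and the function names, parameters, and verdict strings other than `cluster_identity_is_directory`, `restart_fula`, and "indeterminate" are assumptions made for illustration.

```python
def old_verdict(state, oom_killed, peer_id):
    """Model of the previous logic: only a dead or OOM-killed cluster is flagged."""
    if oom_killed or state != "running":
        return "cluster_down"          # hypothetical verdict label
    if peer_id is None:
        # The wedged container is running but has no readable peerID.
        return "indeterminate"
    return "check_chain"               # hypothetical verdict label


def new_verdict(state, oom_killed, peer_id, identity_reason=None):
    """Model of the added branch: a directory-shaped identity gets its own verdict."""
    if identity_reason == "cluster_identity_is_directory":
        return "cluster_identity_is_folder: recommend restart_fula"
    return old_verdict(state, oom_killed, peer_id)


if __name__ == "__main__":
    # A wedged device: container running, not OOM-killed, peerID unreadable.
    print(old_verdict("running", False, None))
    print(new_verdict("running", False, None, "cluster_identity_is_directory"))
```

In this model the wedged device falls through to "indeterminate" under the old logic and gets a concrete recommendation under the new one, provided the health check supplies the reason.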

Changes (this PR, fula-ota)

  • Fix 1 (prevent recurrence) -- plugins/blox-ai/docker-compose.yml: mount the parent dir /uniondrive/ipfs-cluster instead of the file, and point BLOX_AI_CLUSTER_IDENTITY_PATH at identity.json inside it. Docker can then only ever auto-create the harmless directory, never the file-as-dir. identity_health honors the env var, so no blox-ai image rebuild is needed.
  • Fix 2 (recover already-wedged devices) -- readiness-check.py check_and_fix_ipfs_cluster(): detect os.path.isdir(identity.json) -> stop blox-ai -> stop fula -> rm -rf the dir -> restart fula (go-fula regenerates the deterministic identity) -> restart blox-ai once the file is back. All subprocess calls are timeout-bounded so a D-state container cannot hang the watchdog.
  • Fix 3a (defense) -- docker/go-fula/go-fula.sh: remove a directory-shaped identity.json before initipfscluster.
  • Fix 4a (detection) -- plugins/blox-ai/trees/not-earning.yaml: branch on the new cluster_identity_is_directory reason -> concrete "cluster identity is a folder" verdict + restart_fula recommendation instead of "indeterminate."
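The recovery sequence in Fix 2 could look roughly like the sketch below. This is not the code in readiness-check.py: the function names, the stop/start commands (`docker stop blox-ai`, `systemctl stop fula`, and so on), the timeouts, and the return strings are all assumptions. What it illustrates is the ordering of the steps and the use of `subprocess.run(..., timeout=...)` so that a hung command is abandoned rather than blocking the caller.

```python
import os
import shutil
import subprocess
import time


def run_bounded(cmd, timeout=30):
    """Run cmd with a timeout; a timeout or a missing binary counts as failure."""
    try:
        done = subprocess.run(cmd, timeout=timeout, capture_output=True)
        return done.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False


def recover_identity(path, run=run_bounded, wait_s=60, poll_s=1.0):
    """Remove a directory-shaped identity and wait for the file to be regenerated."""
    if not os.path.isdir(path):
        return "not_wedged"
    # Command lines below are hypothetical; the real service names may differ.
    run(["docker", "stop", "blox-ai"])
    run(["systemctl", "stop", "fula"])
    shutil.rmtree(path, ignore_errors=True)
    run(["systemctl", "start", "fula"])
    # Only bring blox-ai back once the identity exists as a regular file.
    deadline = time.monotonic() + wait_s
    while time.monotonic() < deadline:
        if os.path.isfile(path):
            run(["docker", "start", "blox-ai"])
            return "recovered"
        time.sleep(poll_s)
    return "identity_not_regenerated"
```

Passing the runner in as a parameter keeps the sequence testable without touching real services. Waiting for the regular file before restarting blox-ai matters under the old file mount, where an early restart could recreate the directory.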

Companion PRs

  • go-fula: initipfscluster self-heals a directory-shaped identity.json (Fix 3b).
  • blox-ai: identity_health reports cluster_identity_is_directory (Fix 4b) -- Fix 4a's tree branch is inert until this ships.
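A health check that honors the environment variable and reports the new reason might be shaped like this. It is a sketch under stated assumptions, not the identity_health implementation: the function name, the returned dictionary layout, the `cluster_identity_missing` reason, and the choice of the old mount target as the default path are all hypothetical. Only `BLOX_AI_CLUSTER_IDENTITY_PATH` and `cluster_identity_is_directory` come from the description above.

```python
import os

# Assumed default: the container path used by the original file mount.
DEFAULT_IDENTITY_PATH = "/etc/fula/host-ipfs-cluster-identity.json"


def cluster_identity_status(env=None):
    """Resolve the identity path from the environment and classify it."""
    env = os.environ if env is None else env
    path = env.get("BLOX_AI_CLUSTER_IDENTITY_PATH", DEFAULT_IDENTITY_PATH)
    if os.path.isdir(path):
        # The wedge: the identity exists, but as a directory.
        return {"ok": False, "reason": "cluster_identity_is_directory", "path": path}
    if not os.path.isfile(path):
        # Hypothetical reason name for a path that does not exist at all.
        return {"ok": False, "reason": "cluster_identity_missing", "path": path}
    return {"ok": True, "reason": None, "path": path}
```

Because the path is read from the environment at call time, switching the compose file to the parent-directory mount only requires changing the variable, which is consistent with the note that no image rebuild is needed.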

Validation (lab device, fxblox-rk1)

  • Reproduced the mechanism (Docker creates the dir on real mergerfs).
  • Proved deterministic peerID regeneration (file == cluster API == regenerated).
  • Full wedge->recover cycle: reproduced the exact wedge, recovery restored ** IPFS Cluster is READY **, rejoined the pool, same peerID.
  • Applied Fix 1 on-device: blox-ai reads identity via the new path, healthy.
  • Verified propagation: fxsupport image -> docker cp -> /usr/bin/fula -> install.sh refreshes the active compose on boot.

Notes

No fleet deploy happens on merge -- CI builds fxsupport:release only when a GitHub Release is published. Ship Fix 1 + Fix 2 together (self-heal alone could re-wedge without the mount fix).

🤖 Generated with Claude Code

ehsan6sha and others added 2 commits June 18, 2026 13:15
…file

Root cause: the blox-ai plugin bind-mounted the cluster identity FILE
directly (/uniondrive/ipfs-cluster/identity.json:...:ro). When the host file
is absent at container-create time, the Docker daemon auto-creates the source
as a root-owned DIRECTORY. That permanently wedges ipfs-cluster (its init
loops forever waiting for a regular-file identity.json) and makes go-fula
initipfscluster panic (WriteFile over a directory); go-fula.sh has no set -e
so it marks .ipfscluster_setup done anyway, masking it. Neither
readiness-check nor the blox-ai "not earning" tree caught it: the wedged
container is still running, and the wait-loop emits no daemon-log error the
readiness checks match. Proven end-to-end on a lab device, including
deterministic peerID regeneration so earnings are preserved.

Fix 1 (prevent recurrence) - plugins/blox-ai/docker-compose.yml: mount the
parent directory /uniondrive/ipfs-cluster instead of the file, and point
BLOX_AI_CLUSTER_IDENTITY_PATH at identity.json inside it. Docker can then only
ever auto-create the harmless directory, never the file-as-directory.
identity_health honors the env var, so no blox-ai image rebuild is needed.

Fix 2 (recover already-wedged devices) - readiness-check.py
check_and_fix_ipfs_cluster(): detect identity.json being a directory (the
existing log-pattern checks never match this failure), stop blox-ai, stop
fula, rm the bogus dir, restart fula so go-fula regenerates the deterministic
identity, then restart blox-ai once the file is back. All subprocess calls are
timeout-bounded so a D-state container cannot hang the watchdog.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Fix 3a (go-fula.sh): remove a directory-shaped identity.json before
initipfscluster so the writer does not panic over it. Defense-in-depth
alongside the Go-side guard in go-fula initipfscluster (separate PR).

Fix 4a (not-earning.yaml): branch on identity_health's new
cluster_identity_is_directory reason so the "not earning" tree gives a
concrete "cluster identity is a folder" verdict + restart_fula recommendation
instead of the generic "indeterminate". Pairs with the blox-ai
identity_health.py change that emits the reason (separate PR); the branch is
inert until that ships.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ehsan6sha merged commit 6099258 into main on Jun 18, 2026
4 checks passed
@ehsan6sha deleted the fix/ipfs-cluster-identity-dir-wedge branch on June 18, 2026 20:48