feat(installer): WireGuard self-heal sudoers + boot persistence + engine-aware supervision (#5) by dhnpmp-tech · Pull Request #450 · dhnpmp-tech/dcp-platform

dhnpmp-tech · 2026-05-30T10:27:51Z

Foolproofing #5 — make the provider daemon self-heal and survive reboot

From `docs/ops/dcp-foolproofing-roadmap.md`. Installer-only; `dcp_daemon.py` is unchanged. Its `_self_heal_wg` already shells out to `sudo -n wg-quick down/up <iface>`; the only missing piece was the sudoers grant. Without that grant, `sudo -n` refuses to prompt for a password and exits non-zero, so under a non-root run-user the heal silently failed (this is part of why Node 2 stayed dark for 3+ days).

⚠️ Cannot be exercised on macOS; needs a Linux systemd rig before deploy. `bash -n` passes on both scripts and the generated sudoers grammar validates under `visudo -cf`, but the linger setup, the `wg-quick@<iface>` enable, and the `is-enabled` asserts must be validated on a real provider box.
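The parse-only checks mentioned above can be sketched as follows. This is a minimal illustration, not the PR's test harness: the helper names `syntax_ok` and `sudoers_grammar_ok` are hypothetical, and the return code 2 for "visudo not installed" is an assumption made here so a caller can tell a skipped check from a failed one.

```shell
#!/usr/bin/env bash
# Sketch: static checks that run on any OS, including macOS.
# Helper names are hypothetical and do not come from the PR.

# syntax_ok FILE...: returns 0 only if every file parses under `bash -n`.
syntax_ok() {
  local f rc=0
  for f in "$@"; do
    if ! bash -n "$f" 2>/dev/null; then
      echo "syntax error: $f" >&2
      rc=1
    fi
  done
  return "$rc"
}

# sudoers_grammar_ok FILE: validates a candidate drop-in with `visudo -cf`.
# Returns 2 when visudo is missing (assumption: callers treat 2 as "skipped").
sudoers_grammar_ok() {
  command -v visudo >/dev/null 2>&1 || return 2
  visudo -cf "$1" >/dev/null 2>&1
}
```

Neither check touches systemd, which is why they pass on macOS while the linger and `is-enabled` behavior still needs a Linux rig.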

`dcp-setup-unix.sh`

- Self-heal sudoers: installs `/etc/sudoers.d/dcp-wg` (0440 root:root) granting the run-user passwordless sudo for exactly `wg-quick up/down` on wg0+wg1: absolute binary path, literal verbs/interfaces, no wildcards, no other binaries, no arbitrary `.conf` paths. Validated with `visudo -cf` before install; on failure it warns and skips (never breaks system-wide sudo). Skipped entirely when the run-user is root.
- Boot persistence: `loginctl enable-linger`; `systemctl enable wg-quick@<iface>` for whichever `wg{0,1}.conf` exists; enables the detected engine unit (not just start).
- Engine detection: probes :11434/:8000/:8080/MLX, records the engine kind/unit so assertions check the unit actually in play (not a hardcoded gguf assumption).
- Fail-loud post-install assert: dc1-provider + engine unit + `wg-quick@<iface>` all `is-enabled`, else exit 1 with a banner (install-time fail-loud is correct; asserts auto-skip when the relevant config/unit isn't present yet, so a pre-config install isn't falsely failed).
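The boot-persistence assert described in the list above could look roughly like this. It is a sketch under stated assumptions: the function names, the banner text, and the overridable `IS_ENABLED_CMD` variable are hypothetical (the override exists only so the logic can be tried without systemd), and `/etc/wireguard` is assumed as the config directory because that is where `wg-quick` looks by default.

```shell
#!/usr/bin/env bash
# Sketch only: names below are hypothetical, not taken from dcp-setup-unix.sh.

# Checker used per unit; overridable so the logic can run without systemd.
IS_ENABLED_CMD=${IS_ENABLED_CMD:-"systemctl is-enabled --quiet"}

# wg_units_present [DIR]: prints wg-quick@<iface> for each of wg0/wg1
# that has a config file, so absent interfaces are skipped, not failed.
wg_units_present() {
  local dir=${1:-/etc/wireguard} iface
  for iface in wg0 wg1; do
    if [ -f "$dir/$iface.conf" ]; then
      printf 'wg-quick@%s\n' "$iface"
    fi
  done
}

# missing_units UNIT...: prints every unit the checker reports as not enabled.
missing_units() {
  local unit
  for unit in "$@"; do
    # Unquoted on purpose: the checker may be a command plus flags.
    $IS_ENABLED_CMD "$unit" >/dev/null 2>&1 || printf '%s\n' "$unit"
  done
}

# assert_boot_persistence UNIT...: fail loud if any unit is not enabled.
assert_boot_persistence() {
  local missing
  missing=$(missing_units "$@")
  if [ -n "$missing" ]; then
    printf '!!! INSTALL INCOMPLETE: units not enabled at boot !!!\n%s\n' "$missing" >&2
    return 1
  fi
}
```

An installer would end with something like `assert_boot_persistence dc1-provider $(wg_units_present) || exit 1`, adding the detected engine unit to the argument list when one was recorded.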

`setup-inference-supervisors.sh`

Engine-aware: cleanly skips (exit 0) instead of hard-failing when a non-llama.cpp engine is running and llama.cpp prereqs are absent; the hard requirements remain for the llama.cpp path.
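The skip-versus-fail decision can be pictured as a small pure function. This is only a sketch: `supervisor_action`, the engine-kind strings (`llamacpp`, `ollama`, `none`), and the yes/no prereq flag are hypothetical names chosen for illustration, and treating "no engine detected" as the llama.cpp path is an assumption, not something the PR states.

```shell
#!/usr/bin/env bash
# Sketch only: function name and engine-kind strings are hypothetical.

# supervisor_action ENGINE_KIND HAVE_PREREQS
#   ENGINE_KIND  : detected engine, e.g. llamacpp, ollama, or none
#   HAVE_PREREQS : "yes" if llama.cpp prerequisites are present, else "no"
# Prints one of: skip (exit 0 path), fail (hard requirement), configure.
supervisor_action() {
  local engine=$1 prereqs=$2
  if [ "$engine" != "llamacpp" ] && [ "$engine" != "none" ] && [ "$prereqs" = "no" ]; then
    # A different engine is serving and llama.cpp is not installed: nothing to supervise.
    echo skip
  elif [ "$prereqs" = "no" ]; then
    # llama.cpp path (or, by assumption, no engine at all): requirements stay hard.
    echo fail
  else
    echo configure
  fi
}
```

For example, `supervisor_action ollama no` prints `skip`, while `supervisor_action llamacpp no` prints `fail`.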

🔍 Reviewer must check (security)

The exact `/etc/sudoers.d/dcp-wg` content (heredoc in the FIX #5a block): `Cmnd_Alias DCP_WGQUICK = <abs-wg-quick> up wg0, <abs-wg-quick> down wg0, <abs-wg-quick> up wg1, <abs-wg-quick> down wg1`, then `<user> ALL=(root) NOPASSWD: DCP_WGQUICK`. Confirm: (1) `visudo -cf` validated before install; (2) scoped to `wg-quick` only, with no wildcards or other binaries; (3) the rule uses an absolute path, which is required because the daemon invokes bare `wg-quick` and sudo resolves it via `secure_path` to a full path before matching it against the rule.
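For reviewers who want to see the shape of the drop-in end to end, here is a minimal sketch of a render, validate, then install sequence. It is not the PR's heredoc: the function names are hypothetical, the example user `dcp` and path `/usr/bin/wg-quick` in the usage note are placeholders, and `install_dcp_wg_sudoers` assumes it is run as root on a host with `visudo` available.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not the PR's FIX #5a block.

# render_dcp_wg_sudoers USER WGQUICK_ABS_PATH: prints the two-line drop-in.
render_dcp_wg_sudoers() {
  local user=$1 wgq=$2
  case "$wgq" in
    /*) ;;  # absolute path: OK
    *) echo "wg-quick path must be absolute: $wgq" >&2; return 1 ;;
  esac
  printf 'Cmnd_Alias DCP_WGQUICK = %s up wg0, %s down wg0, %s up wg1, %s down wg1\n' \
    "$wgq" "$wgq" "$wgq" "$wgq"
  printf '%s ALL=(root) NOPASSWD: DCP_WGQUICK\n' "$user"
}

# install_dcp_wg_sudoers USER WGQUICK_ABS_PATH
# Validates a temp copy with visudo first; on failure warns and skips,
# so a bad drop-in never lands in /etc/sudoers.d.
install_dcp_wg_sudoers() {
  local user=$1 wgq=$2 tmp
  if [ "$user" = "root" ]; then
    return 0  # root needs no grant
  fi
  tmp=$(mktemp) || return 1
  if render_dcp_wg_sudoers "$user" "$wgq" > "$tmp" && visudo -cf "$tmp" >/dev/null 2>&1; then
    install -m 0440 -o root -g root "$tmp" /etc/sudoers.d/dcp-wg
  else
    echo "WARN: sudoers drop-in failed validation; skipping" >&2
  fi
  rm -f "$tmp"
}
```

Calling `render_dcp_wg_sudoers dcp /usr/bin/wg-quick` prints the alias line followed by the grant line, which makes the "no wildcards, four literal commands" property easy to eyeball in review.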

Rollout

Backend/installer deploys manually. Validate on a Linux rig (ideally Node 2 once its WG is back) before rolling into the live installer.

🤖 Generated with Claude Code

…ine-aware supervision (#5)

Foolproofing roadmap #5: make the provider daemon actually recover and survive reboot. Installer-only; dcp_daemon.py unchanged (its _self_heal_wg already shells `sudo -n wg-quick`). Cannot be exercised on macOS; needs a Linux systemd rig before deploy.

dcp-setup-unix.sh:
- Installs /etc/sudoers.d/dcp-wg (0440 root:root) granting the run-user passwordless sudo for EXACTLY wg-quick up/down on wg0+wg1 (absolute path, literal verbs/ifaces, no wildcards). Validated with `visudo -cf` BEFORE install; on failure it warns + skips (never breaks sudo). Skipped when run-user is root. This is the missing piece that made _self_heal_wg silently fail.
- Boot persistence: loginctl enable-linger; systemctl enable wg-quick@<iface> for the present wg{0,1}.conf; enable the detected engine unit.
- Engine detection: probes :11434/:8000/:8080/mlx and records the engine kind/unit so the assertion checks the unit actually in play (not a hardcoded gguf assumption).
- Post-install assert: dc1-provider + engine unit + wg-quick@<iface> all `is-enabled`, else exit 1 with a loud banner (install-time fail-loud is correct; assertions auto-skip when the config/unit isn't present yet so a pre-config install isn't falsely failed).

setup-inference-supervisors.sh:
- Engine-aware: cleanly skips (exit 0) instead of hard-failing when a non-llama.cpp engine is running and llama.cpp prereqs are absent; the hard requirements remain for the llama.cpp path.

Tests: bash -n passes on both scripts; dcp_daemon.py py_compiles (unmodified). visudo grammar of the generated drop-in validated. Full install behavior needs a Linux rig.

Docs: docs/CHANGELOG.md [Unreleased].


vercel · 2026-05-30T10:27:58Z


Project	Deployment	Actions	Updated (UTC)
dc1-platform	Ready	Preview, Comment	May 30, 2026 10:28am


dhnpmp-tech wants to merge 1 commit into main from feat/installer-selfheal
