feat(autostart): rebind wrap dependents on F-45 SA recreate
Closes #10. Auto-fix the dead-netns cascade that the v1.0.2 image-
drift recreate (and the issue-#5 stale-netns recreate) triggers as a
side effect: when anchord removes its managed SA and spawns a new
one, every container declaring `network_mode: service:<sa>` was
pinned to the OLD container ID at compose create-time and ends up
in a destroyed netns. Such a container looks running to Docker but
has no network interface.
Production incident (2026-05-23) hit 21 such victims across 9 SAs
during a v1.0.1 → v1.0.2 rollout — Frigate end-to-end broken because
authentik_server itself was orphaned. v1.1.0 added detection-only
(WARN logs + doctor CLI). v1.2.0 closes the loop for the case anchord
itself caused, where the recreate scope is unambiguous.
Design
The fix-scope is intentionally narrow: anchord only auto-rebinds
dependents whose netmode points at THE SA IT JUST REMOVED. Anchord
knows the old SA's container ID at the moment of removal, so there's
no scope-discovery problem, no race with operator-driven changes,
no need for healthy-observation state-tracking, no multi-SA
ambiguity. v1.1.0's general dependents watcher keeps WARNing about
the broader case (operator-driven SA recreates, OOM-killed SAs);
those still need manual recovery.
Mechanism
internal/autostart/autostart.go centralises the recreate cycle in
a new helper `recreateSAWithOrphanRebind`, used at both points where
SA recreate happens:
- maybeManage's stale-netns branch (issue #5)
- maybeManage's image-drift branch (issue #8 follow-up, was the
separate maybeRecreateStaleImageSA; now folded back in)
The helper:
1. Before Remove: enumerates wrap dependents via
findOrphanCandidates(all, oldSA) — matches NetworkMode against
the SA's long ID, short ID (>=12), and names.
2. Removes the old SA. Failure aborts so we don't end up with
no SA at all.
3. Calls createAndStartManaged (now returns the new ID).
4. For each enumerated orphan: ops.RecreateWithNetworkMode(id,
"container:<newSAID>"). Failures are warn-logged per-dep and
the loop continues — one broken dep doesn't justify leaving
the others stranded.
Inspect-then-recreate preserves Config + HostConfig from the
dependent's running state and only patches NetworkMode. No compose
files needed (compose's network_mode resolution is what we're
replicating directly). No image bloat — pure SDK.
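A rough sketch of the inspect-then-recreate idea, under stated assumptions:
the `Inspect`, `Config` and `HostConfig` types below are hypothetical,
trimmed-down mirrors of an inspect result (not the Docker SDK types), and
`patchNetworkMode` is an illustrative name rather than a function from this
commit.

```go
package main

// Config is a hypothetical subset of a container's Config.
type Config struct {
	Image string
	Env   []string
}

// HostConfig is a hypothetical subset of a container's HostConfig.
type HostConfig struct {
	NetworkMode string
	Binds       []string
}

// Inspect is a hypothetical subset of an inspect result.
type Inspect struct {
	Name       string
	Config     Config
	HostConfig HostConfig
}

// patchNetworkMode returns a copy of the inspected state in which only
// HostConfig.NetworkMode differs. Slices are copied so later edits to the
// returned spec cannot alias the inspected value.
func patchNetworkMode(in Inspect, mode string) Inspect {
	out := in
	out.Config.Env = append([]string(nil), in.Config.Env...)
	out.HostConfig.Binds = append([]string(nil), in.HostConfig.Binds...)
	out.HostConfig.NetworkMode = mode
	return out
}
```

The returned value would then serve as the create spec for the replacement
container, with everything except the netmode carried over unchanged.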
The image-drift check moves from maybeRecreateStaleImageSA into
maybeManage proper as an additional gate alongside stale-netns; the
old method is deleted. backfill is the only caller that supplies
the SelfInfo pointer for the drift check (per-event paths pass nil
so noisy event sources don't trigger churning recreates).
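The gating described here might look roughly like the sketch below. It rests
on assumptions: `SelfInfo` is reduced to a single hypothetical `Image` field,
"drift" is taken to mean that the SA's image differs from anchord's own, and
`shouldRecreate` is an illustrative name, not a function from this commit.

```go
package main

// SelfInfo is a hypothetical stand-in for what anchord knows about its own
// container; only the image reference matters for this sketch.
type SelfInfo struct {
	Image string
}

// shouldRecreate reports whether the SA should be recreated and why.
// A nil self (the per-event paths) disables the image-drift gate, so only a
// stale netns can trigger a recreate there.
func shouldRecreate(staleNetns bool, saImage string, self *SelfInfo) (bool, string) {
	if staleNetns {
		return true, "stale-netns"
	}
	if self != nil && saImage != self.Image {
		return true, "image-drift"
	}
	return false, ""
}
```

With this shape, only a caller that passes a non-nil `SelfInfo` can trigger
a drift recreate, which is the property the paragraph above relies on.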
Config + flag
New ANCHORD_AUTOFIX_DEAD_NETNS env knob (config.LoadNetworkAnchor),
default true, same shape as ANCHORD_AUTOSTART_SIBLINGS. Setting
false reverts to v1.1.0 behaviour: SA gets recreated as before,
dependents are left orphaned for the v1.1.0 dependents watcher's
WARN log to flag.
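A default-true boolean knob of this kind can be parsed as sketched below.
The function name `loadAutoFixDeadNetns` and the injected `lookup` parameter
are hypothetical, and the use of `strconv.ParseBool` is an assumption; the
commit does not say which spellings config.LoadNetworkAnchor accepts.

```go
package main

import (
	"fmt"
	"strconv"
)

// loadAutoFixDeadNetns returns the flag value, defaulting to true when the
// variable is unset or empty. lookup has the same shape as os.LookupEnv so
// the real environment or a test map can be plugged in.
func loadAutoFixDeadNetns(lookup func(string) (string, bool)) (bool, error) {
	raw, ok := lookup("ANCHORD_AUTOFIX_DEAD_NETNS")
	if !ok || raw == "" {
		return true, nil
	}
	v, err := strconv.ParseBool(raw)
	if err != nil {
		return false, fmt.Errorf("ANCHORD_AUTOFIX_DEAD_NETNS: %w", err)
	}
	return v, nil
}
```

In a real program the call would be `loadAutoFixDeadNetns(os.LookupEnv)`.
Returning an error on an unparseable value, instead of silently falling back
to the default, is a design choice of this sketch.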
Docker-socket-proxy permission delta: requires DELETE + container
create/start on the dependent containers. Same surface anchord
already uses for its own SA recreate — no new endpoints, just
applied to wrap-dep containers in our project scope. README
documents the flag and the implications.
Tests
7 new tests (315 unit total, +14 over v1.1.0):
- TestFindOrphanCandidates_ByAllRefForms — predicate matches long
ID / short ID / name, ignores SA itself + unrelated netmodes
- TestBackfill_F45_ReboundDependentsOnImageDrift — image-drift +
3 orphans → SA recreated, all 3 rebound to new SA
- TestRun_F45_ReboundDependentsOnStaleNetns — issue #5 path also
triggers the rebind on event
- TestBackfill_F45_NoRebindWhenAutoFixDisabled — opt-out
invariant: SA still recreated, no dep rebinds
- TestBackfill_F45_NoRebindOnAbsentSA — create-fresh path doesn't
sweep unrelated orphans
- TestRun_F45_RebindContinuesAfterPerDepFailure — one bad dep
doesn't abort the loop
- TestLoad_AutoFixDeadNetns — config parsing happy/sad paths
Verified: 315/315 unit + 74/74 e2e green, race -count=3 clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
README.md: 45 additions & 25 deletions
@@ -338,6 +338,7 @@ All via environment variables.
 | `ANCHORD_PROJECT` | yes¹ | `$COMPOSE_PROJECT_NAME` | Scope of containers anchord manages. Required unless `ANCHORD_LABEL_SELECTOR` is set. Ignored (with a WARN log) when both are set |
 | `ANCHORD_LABEL_SELECTOR` | no | | F-42: replaces the project filter with an operator-defined label set, comma-separated `key=value` AND-joined (e.g. `anchord.role=ldap-outpost,env=prod`). Use when multiple anchords share a project, or when target containers are spawned outside Compose and carry no project label (e.g. authentik outposts) |
 | `ANCHORD_AUTOSTART_SIBLINGS` | no | `true` | F-43: watch Docker for `container start` events and bootstrap any sibling container in `Created` state whose `network_mode: container:<X>` matches the just-started target. Needed for service-anchors whose target is spawned at runtime (e.g. authentik outposts via the Docker API). Requires `POST=1` on the docker-socket-proxy. Set to `false` to disable |
+| `ANCHORD_AUTOFIX_DEAD_NETNS` | no | `true` | Issue #10: when the network-anchor recreates its F-45-managed service-anchor (stale-netns or image-drift path), also re-create every wrap dependent that was pinned to the old SA's container ID. Without this, the dependents end up in a destroyed netns and look running to Docker while being invisible to the outside. Requires `DELETE=1` + container create/start on the docker-socket-proxy (same set F-45's existing SA recreate already needs). Set to `false` to keep v1.1.0 behaviour (detection-only via the dependents watcher's WARN log) |
 | `ANCHORD_MANAGED_SA_TARGET` | no | | F-45: stable name of a runtime-spawned target container. When set, the network-anchor not only auto-starts existing Created-state siblings (F-43) but CREATES a service-anchor on demand when this target appears. Needed when Compose cannot declare the service-anchor (target doesn't yet exist at compose-up time and Compose halts on create-then-cant-start). Empty = pure F-43 behaviour |
 | `ANCHORD_MANAGED_SA_NAME` | no | `<TARGET>-service-anchor` | F-45: name of the container the network-anchor creates. Only consulted when `ANCHORD_MANAGED_SA_TARGET` is set |
 | `ANCHORD_MANAGED_SA_IMAGE` | no | (anchord's own image) | F-45: image for the managed service-anchor. Default resolved at runtime from the network-anchor's own container inspect — keeps both containers on the same image version |
@@ -481,17 +482,22 @@ not running, a different host, post-mortem analysis).
 Default is `anchord`, which matches the canonical service name in the
 example compose. If you rename the network-anchor service, set
 `ANCHORD_GATEWAY_HOSTNAME` on each service-anchor to match.
-- **Recreating a service-anchor orphans its wrap dependents.** Any
+- **Recreating a service-anchor orphans its wrap dependents** —
+but in the common case anchord now repairs them itself. Any
 container declaring `network_mode: service:fe-anchor-X` is pinned
 to fe-anchor-X's container ID at create-time and stays pinned
-across recreate. After
-`docker compose up -d --no-deps --force-recreate fe-anchor-X` (or
-any `docker rm` of the SA), recreate every dependent in the same
-stack — typically the per-stack Traefik plus any acme/wrap services
-netns-mode'd to the same fe-anchor. anchord detects the situation
-and emits a `WARN dependent in dead netns ...` log line per victim,
-but it does not (yet) auto-recreate. Use
-`anchord doctor stale-netns` for a cluster-wide one-shot scan.
+across recreate. When **anchord** recreates its F-45-managed
+service-anchor (stale-netns or image-drift path), it enumerates
+every wrap dependent of the old SA and re-creates each against
+the new SA's ID before returning — see the `ANCHORD_AUTOFIX_DEAD_NETNS`
+flag (default on). When **the operator** recreates a service-anchor
+manually (`docker rm`, `compose up --force-recreate`), anchord's
+v1.1.0 dependents watcher still emits a `WARN dependent in dead netns
+...` log line per victim with the exact recovery command — auto-fix
+only fires for SA recreates anchord caused itself, because there
+the scope is unambiguous (no race with operator, no scope
+discovery). Use `anchord doctor stale-netns` for a cluster-wide
+one-shot scan after a manual incident.
 - **One network-anchor per backend identity.** Default discovery
 scope is the Compose project; two anchords filtering the same set
 of backends will fight over their DNAT entries. With
@@ -517,21 +523,21 @@ here. The release pipeline rejects any tag whose recorded hash does
 not match the current source, so this block is the project's