
feat(anc): wire check-hotfix into node wrapper behind ENABLE_PROVISIONING_HOTFIX #8715

Draft
Devinwong wants to merge 5 commits into devinwong/anc-check-hotfix-configmap from devinwong/anc-wire-check-hotfix-wrapper

@Devinwong Devinwong commented Jun 15, 2026

Collaborator

2.1c - Wire check-hotfix into the node wrapper (shell only)

POC / M1 draft. Shell-only wiring for the Provisioning-Hotfix flow. No Go changes.

Enablement (where this sits in the rollout chain)

This env gate is the on-node terminal of the design's region-staged opt-in:
EnableProvisioningHotfix aks-rp toggle (AKS Toggles-as-code, per region) -> absvc
respects toggle -> ANC respects toggle. This PR implements only the last hop ("ANC
respects toggle"). The env var name mirrors the toggle/contract name to match the
existing contract->env convention (e.g. EnableIMDSRestriction -> ENABLE_IMDS_RESTRICTION),
keeping the chain traceable. Wiring absvc to render this var from a contract field is a
separate follow-up PR; the aks-rp toggle + toggle YAML live in the aks-rp repo. Until
those land, the var renders unset everywhere, so this change is inert (default-off).

Note: 2.1d (#8717) relaxes this env gate, moving the on/off decision into the Go binary
via the enable_provisioning_hotfix contract field (single source of truth). This PR
intentionally ADDS the gate; #8717 relaxes it, so each PR stays reviewable on its own.

What this does

Adds one call to the check-hotfix subcommand (added in 2.1b) inside
aks-node-controller-wrapper.sh, gated behind a new env flag ENABLE_PROVISIONING_HOTFIX
that is OFF by default. check-hotfix reads the hotfix pointer from the LPS endpoint
(IMDS-attested) and refreshes $HOTFIX_JSON, which the existing download-hotfix block
consumes, so it must run first. The call is fail-open (the command always exits 0) and
is additionally wrapped defensively so it can never block provisioning.
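
A minimal sketch of the gated, fail-open call described above, not the actual wrapper diff. ENABLE_PROVISIONING_HOTFIX, BIN_PATH, check-hotfix, and the "continuing (fail-open)" log wording come from this description; the BIN_PATH default, the log helper body, and the function name maybe_check_hotfix are assumptions made for illustration.

```sh
#!/bin/sh
# Sketch only. The BIN_PATH default below is a placeholder, not the real
# on-node path, and log() stands in for the wrapper's own logging helper.
BIN_PATH="${BIN_PATH:-./aks-node-controller}"

log() {
    printf '%s\n' "$*"
}

# maybe_check_hotfix is a hypothetical name for the gated block.
maybe_check_hotfix() {
    # Only the literal string "true" enables the call.
    if [ "${ENABLE_PROVISIONING_HOTFIX:-}" = "true" ]; then
        # The if/else absorbs any non-zero exit so provisioning continues.
        if "$BIN_PATH" check-hotfix; then
            log "check-hotfix completed"
        else
            log "check-hotfix failed; continuing (fail-open)"
        fi
    fi
    return 0
}
```

Placing the call in the condition of an if keeps it safe even if the wrapper were ever run under set -e, since a failing command in an if condition does not abort the script.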

Default-off / fail-open guarantee

When ENABLE_PROVISIONING_HOTFIX is unset, empty, or any value other than the literal
string true, the wrapper behaves EXACTLY as it does today. This preserves the
6-month VHD backward-compatibility window: older VHDs running newer CSE, and newer
VHDs running older CSE, are unaffected unless the flag is explicitly turned on.
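
To make the "literal string true" rule concrete, the following sketch prints how a strict POSIX string comparison treats a few candidate values. hotfix_gate is a hypothetical helper used only for this illustration; it assumes the wrapper compares with [ ... = "true" ] as the Notes section suggests.

```sh
#!/bin/sh
# Sketch: which ENABLE_PROVISIONING_HOTFIX values turn the gate on.
# hotfix_gate is a hypothetical helper, not part of the wrapper.
hotfix_gate() {
    # $1 is the candidate flag value; the match is exact and case-sensitive.
    if [ "${1:-}" = "true" ]; then
        echo on
    else
        echo off
    fi
}

# Only the last value in this list yields "on".
for value in "" "1" "yes" "TRUE" "True" " true" "true"; do
    printf '[%s] -> %s\n' "$value" "$(hotfix_gate "$value")"
done
```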

Known-safe: old VHD + flag on

If ENABLE_PROVISIONING_HOTFIX=true ever reaches a node whose VHD-baked ANC binary predates
2.1b, "$BIN_PATH" check-hotfix is an unknown subcommand and exits non-zero. The
if ... else log "...continuing (fail-open)" fi wrapper swallows that error, so
provisioning still proceeds unchanged. This path is covered by shellspec case 4 below
(check-hotfix exits non-zero -> wrapper still provisions), which models the missing
subcommand. This matters for the 6-month VHD support window.
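
The old-VHD case can be modelled with a stand-in binary that rejects check-hotfix, roughly as sketched here. The stub, its messages, the function names, and the provision subcommand name are assumptions for illustration; only the fail-open if/else shape is taken from the description above.

```sh
#!/bin/sh
# Sketch: a stand-in for a VHD-baked binary that predates 2.1b.
# make_old_binary and run_wrapper_with are hypothetical names.
make_old_binary() {
    # $1: path where the stub script is written
    cat > "$1" <<'EOF'
#!/bin/sh
case "$1" in
    provision) echo "provision ok"; exit 0 ;;
    *) echo "unknown command: $1" >&2; exit 1 ;;
esac
EOF
    chmod +x "$1"
}

run_wrapper_with() {
    # $1: binary path. Mirrors the flag-on path: check-hotfix, then provision.
    if [ "${ENABLE_PROVISIONING_HOTFIX:-}" = "true" ]; then
        if ! "$1" check-hotfix 2>/dev/null; then
            echo "check-hotfix failed; continuing (fail-open)"
        fi
    fi
    "$1" provision
}

stub_dir=$(mktemp -d)
make_old_binary "$stub_dir/old-anc"
ENABLE_PROVISIONING_HOTFIX=true
run_wrapper_with "$stub_dir/old-anc"
rm -rf "$stub_dir"
```

With the flag on, the stub's non-zero exit for the unknown subcommand is logged and the provision step still runs.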

Before / after flow

Flag off (default - unchanged):

guard config/nbc -> [download-hotfix if $HOTFIX_JSON] -> select binary -> provision

Flag on (ENABLE_PROVISIONING_HOTFIX=true):

guard config/nbc -> check-hotfix (refresh pointer) -> [download-hotfix if $HOTFIX_JSON] -> select binary -> provision

Notes

  • check-hotfix takes no flags/args; it reads the AKSNodeConfig from its default
    on-node path internally and uses it for the LPS endpoint (IMDS-attested) read,
    so the wrapper passes nothing.
  • HOTFIX_JSON is parameterized as ${HOTFIX_JSON:-<default>} to match the existing
    BIN_PATH / CONFIG_PATH / NBC_CMD_PATH pattern and to allow shellspec to exercise
    the download-hotfix branch. Production default path is unchanged.
  • Write/read handoff verified: check-hotfix writes the pointer to defaultHotfixVersionPath
    (/opt/azure/containers/aks-node-controller-hotfix.json, hotfix.go) and download-hotfix
    reads the same constant. The wrapper's HOTFIX_JSON default is byte-identical, and the
    Go hotfixVersionPath override exists only for tests (no env/production override and
    check-hotfix takes no path flag), so the two never diverge on a node.
  • POSIX compliant ([ ], =, ${VAR:-}); passes shellcheck generic + POSIX (SC3010/SC3014)
    and the wrapper shellspec suite (8 examples, 0 failures).
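
A small sketch of the ${HOTFIX_JSON:-<default>} parameterization mentioned in the notes above. The default path is the one quoted there; the helper names are hypothetical, and treating "pointer file exists" as the condition for the download-hotfix branch is an assumption, since the exact test is not shown in this description.

```sh
#!/bin/sh
# Sketch of the ${VAR:-default} path pattern; helper names are hypothetical.
DEFAULT_HOTFIX_JSON="/opt/azure/containers/aks-node-controller-hotfix.json"

resolve_hotfix_json() {
    # Falls back to the production default when HOTFIX_JSON is unset or empty.
    printf '%s\n' "${HOTFIX_JSON:-$DEFAULT_HOTFIX_JSON}"
}

should_download_hotfix() {
    # Assumption: the download-hotfix branch runs when the pointer file exists.
    [ -f "$(resolve_hotfix_json)" ]
}
```

A test can then export HOTFIX_JSON pointing at a temporary file to exercise the download-hotfix branch, while nodes that never set the variable keep the production path.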

Tests

New shellspec cases in aks_node_controller_wrapper_spec.sh:

  1. flag unset -> check-hotfix NOT called (no behavior change)
  2. non-true value (e.g. "1") -> treated as disabled
  3. flag = true -> check-hotfix runs BEFORE download-hotfix, provision last
  4. check-hotfix exits non-zero -> wrapper still proceeds to provision (fail-open)
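
The ordering asserted by case 3 can be illustrated in plain shell with a stub that records each subcommand it receives. This is a sketch under stated assumptions, not the shellspec code in this PR: make_recording_stub and run_flow are hypothetical names, and the provision subcommand name and the file-exists condition for download-hotfix are assumed.

```sh
#!/bin/sh
# Sketch: plain-shell stand-in for the ordering check; the real suite uses shellspec.
make_recording_stub() {
    # $1: stub path, $2: file that receives one subcommand name per call
    cat > "$1" <<EOF
#!/bin/sh
echo "\$1" >> "$2"
EOF
    chmod +x "$1"
}

run_flow() {
    # $1: stub binary, $2: hotfix pointer path
    if [ "${ENABLE_PROVISIONING_HOTFIX:-}" = "true" ]; then
        if ! "$1" check-hotfix; then
            echo "check-hotfix failed; continuing (fail-open)"
        fi
    fi
    # Assumption: download-hotfix runs only when the pointer file exists.
    if [ -f "$2" ]; then
        "$1" download-hotfix
    fi
    "$1" provision
}

work=$(mktemp -d)
make_recording_stub "$work/anc" "$work/calls"
: > "$work/hotfix.json"
ENABLE_PROVISIONING_HOTFIX=true
run_flow "$work/anc" "$work/hotfix.json"
cat "$work/calls"
rm -rf "$work"
```

With the flag set to true and a pointer file present, the recorded calls appear in the order check-hotfix, download-hotfix, provision.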

Stack

main
 \- #8694  2.1a  base->version hotfix map (Go)
     \- #8696  2.1b  check-hotfix LPS endpoint reader (Go)
         \- #8715  2.1c  wire check-hotfix into wrapper (shell)   <- this PR
             \- #8717  2.1d  enable_provisioning_hotfix contract field + Go self-gate

Base is set to the 2.1b branch so the diff shows only the wrapper + shellspec changes.
Will retarget to main as the stack merges down.

This unblocks the on-node e2e PoC tests (fail-open and multi-base) since check-hotfix
is otherwise never invoked at boot.

@Devinwong Devinwong changed the title from "feat(anc): wire check-hotfix into node wrapper behind ANC_HOTFIX_ENABLED" to "feat(anc): wire check-hotfix into node wrapper behind ENABLE_PROVISIONING_HOTFIX" Jun 16, 2026
@Devinwong Devinwong marked this pull request as ready for review June 16, 2026 02:27
@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch from ede050a to 0c90761 Compare June 16, 2026 17:37
…2.1b)

Add a fail-open 'check-hotfix' CLI subcommand that reads the
kube-system/anc-hotfix-version ConfigMap published by the
live-patching-controller and stages the resolved {hotfixes:{...}} pointer
to the path download-hotfix already reads. download-hotfix keeps its
unchanged patch-only, strictly-higher gating; check-hotfix only fetches and
writes the pointer.

- Raw net/http HTTPS GET (no client-go); creds from AKSNodeConfig bootstrap
  token + apiserver FQDN (primary) or on-node kubeconfigs (secondary).
- Shares the 2.1a hotfixConfig parser/data contract with download-hotfix.
- Always exits 0; emits CheckHotfix telemetry (configMapRead,
  noHotfixForBase, customDataFallback, failed).
- PoC cold-start fallback reads a lenient top-level hotfixes object from the
  node config when the ConfigMap read fails (TODO: typed absvc contract).
- Injectable App fields (checkHotfixConfigMapFetcher, nodeConfigPath) for
  network-free unit tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch from 0c90761 to b33ec66 Compare June 16, 2026 18:08
Devin Wong and others added 3 commits June 16, 2026 11:10
Add a default-off ANC_HOTFIX_ENABLED-gated call to the 2.1b check-hotfix
subcommand in aks-node-controller-wrapper.sh, placed before the existing
download-hotfix block since check-hotfix refreshes the hotfix pointer that
block consumes. The call is fail-open and wrapped defensively so it can never
block provisioning. When the flag is unset/non-true the wrapper behaves exactly
as before (6-month VHD backward compat). Parameterize HOTFIX_JSON to match the
existing path-var pattern and enable shellspec coverage of the download-hotfix
branch. Add shellspec tests for flag off, flag on ordering, fail-open, and
non-true value handling.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Clarify that the check-hotfix non-zero (fail-open) case also models a node whose
VHD-baked binary predates 2.1b, where check-hotfix is an unknown subcommand.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Match the design's EnableProvisioningHotfix aks-rp region toggle and AgentBaker's
contract->env naming convention (EnableIMDSRestriction -> ENABLE_IMDS_RESTRICTION),
so the toggle -> absvc -> ANC opt-in chain stays traceable. No behavior change;
still default-off and fail-open.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Devinwong Devinwong force-pushed the devinwong/anc-wire-check-hotfix-wrapper branch from f842590 to 3ebabf0 Compare June 16, 2026 18:11
The hotfix pointer read channel moved from the kube-system ConfigMap (apiserver +
bootstrap token) to the LPS endpoint (IMDS-attested); the fetch/auth rewrite lives in
2.1b. The wrapper's check-hotfix -> download-hotfix call contract, the
ENABLE_PROVISIONING_HOTFIX gate, and the fail-open semantics are unchanged - only the
explanatory comment is updated to name the new read channel accurately.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Devinwong Devinwong requested a review from xuexu6666 as a code owner June 19, 2026 20:59
@Devinwong

Collaborator Author

Read-channel pivot: the hotfix-pointer read moves from Option 2 (kube-system anc-hotfix-version ConfigMap via apiserver + bootstrap token) to Option 4 (LPS endpoint, IMDS-attested), validated by e2e showing the node can reach LPS pre-kubelet. The fetch/auth rewrite lives in #8696 (2.1b). This wrapper wiring is channel-agnostic: the check-hotfix -> download-hotfix call sequence, the ENABLE_PROVISIONING_HOTFIX gate (relaxed by 2.1d via the enable_provisioning_hotfix contract field), and the fail-open semantics are all unchanged. Only comments/wording were updated to name the new read channel.

@github-actions

Contributor

Changes cached containers or packages on windows VHDs

Please get a Windows SIG member to approve.

The following diff file shows any additions or deletions from what will be cached on Windows VHDs, organised by VHD type.

  • Additions are new things cached.
  • Deletions are things no longer cached.
diff --git a/vhd_files/2022-containerd-gen2.txt b/vhd_files/2022-containerd-gen2.txt
index 7039bac..c51a47f 100644
--- a/vhd_files/2022-containerd-gen2.txt
+++ b/vhd_files/2022-containerd-gen2.txt
@@ -122,0 +123 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -124 +124,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
diff --git a/vhd_files/2022-containerd.txt b/vhd_files/2022-containerd.txt
index 5915cf1..7312c49 100644
--- a/vhd_files/2022-containerd.txt
+++ b/vhd_files/2022-containerd.txt
@@ -122,0 +123 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -124 +124,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
diff --git a/vhd_files/2025-gen2.txt b/vhd_files/2025-gen2.txt
index 37d9326..36e3641 100644
--- a/vhd_files/2025-gen2.txt
+++ b/vhd_files/2025-gen2.txt
@@ -52,0 +53 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -54 +54,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
diff --git a/vhd_files/2025.txt b/vhd_files/2025.txt
index 5b08280..b8873d5 100644
--- a/vhd_files/2025.txt
+++ b/vhd_files/2025.txt
@@ -52,0 +53 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -54 +54,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp

@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch 2 times, most recently from 07b497b to 0d6f945 Compare June 20, 2026 00:25
@Devinwong Devinwong marked this pull request as draft June 20, 2026 01:20