feat(anc): add check-hotfix subcommand to read hotfix pointer from LPS #8696

Draft
Devinwong wants to merge 1 commit into devinwong/laughing-pancake from devinwong/anc-check-hotfix-configmap

@Devinwong Devinwong commented Jun 12, 2026

Collaborator

Add a check-hotfix subcommand that reads the hotfix pointer from the live-patching-service

This adds a new fail-open check-hotfix subcommand to aks-node-controller. It reads a base-to-hotfix version pointer map from the live-patching-service (LPS) over an IMDS-attested HTTPS path that is reachable before kubelet is up, and writes that pointer to the file download-hotfix already consumes. download-hotfix then re-resolves the pointer and keeps its existing patch-only, strictly-higher gating. check-hotfix only fetches and stages the pointer - it never installs anything and never blocks provisioning.

Stacking

This branch is stacked on the base-to-version hotfix map change (PR #8694). The PR base is set to that branch so the diff shows only this change (app.go wiring + checkhotfix.go + checkhotfix_test.go). It must merge after #8694; once #8694 has merged, retarget this PR to main.

main
 \- #8694  2.1a  base->version hotfix map (Go)
     \- #8696  2.1b  check-hotfix LPS endpoint reader (Go)        <- this PR
         \- #8715  2.1c  wire check-hotfix into wrapper (shell)
             \- #8717  2.1d  enable_provisioning_hotfix contract field + Go self-gate

This PR's check-hotfix command is always-on by itself - it has no feature gate. The gating arrives later in 2.1d (#8717), which adds an enable_provisioning_hotfix contract field so check-hotfix self-gates on it.

What it does

  1. Reads the hotfix pointer from the LPS with a raw net/http HTTPS GET (no client-go dependency). The transport mirrors the proven pre-kubelet connectivity path:
    • TLS ServerName is pinned to the LPS SNI host, while a custom dialer forces the TCP connection to the apiserver FQDN on port 443 (the curl --resolve trick). The apiserver front routes the pinned SNI to the LPS backend.
    • Authorization is the IMDS attested-data document signature (http://169.254.169.254/metadata/attested/document). IMDS is reachable pre-kubelet, so this works before any kube credential exists. The token fetch is injectable for tests.
    • The server certificate is verified against the cluster CA carried in the provision config (base64 PEM); the CA source used is logged. check-hotfix runs before the provisioning scripts, so the decoded on-node CA file (/etc/kubernetes/certs/ca.crt) does not exist yet - the provision config is the only credential source guaranteed to be present at this point.
    • Short-timeout (~10s) client so provisioning is never delayed.
  2. Parses the response: the body is the {"hotfixes":{...}} JSON object, which unmarshals directly into the same config type download-hotfix uses, so both commands share one parser and data contract.
  3. Writes the pointer to /opt/azure/containers/aks-node-controller-hotfix.json in the same {"hotfixes":{...}} shape (atomic temp-file + rename), so download-hotfix re-resolves it and applies its unchanged gating.
  4. Fail-open: the command always exits 0 so provisioning is never blocked. Any timeout / connection error / 5xx / parse failure is logged, emitted as telemetry, and swallowed.
  5. Not-enrolled is benign: a reachable LPS that has nothing for this node yet (HTTP 401 / 403 / 404) is a clean no-op, not a failure. The LPS authorizes in two stages (attestation, then agent-pool authorization), so a node whose pool is not yet enrolled in live-patching gets a 401 - the expected steady state on a freshly enabled cluster. These responses stage no overlay and are reported as notEnrolled.
  6. Cold-start fallback: only when the LPS read fails at the transport level or with a 5xx, it reads a lenient top-level hotfixes object embedded in the provision config and uses that. (Marked with a TODO to switch to a typed config field once that contract exists.)
  7. Telemetry: guest-agent events under task name CheckHotfix with outcomes lpsRead, noHotfixForBase, notEnrolled, customDataFallback, failed.

Open dependency (placeholder route)

The LPS route and response schema for the base-to-hotfix pointer are a planned-maintenance deliverable that is not finalized yet; the connectivity prototype only proved reachability of the LPS read path. The route is a clearly-named placeholder constant (lpsHotfixPath = "/v1/anc-hotfix") with a TODO, and the expected response body is documented as {"hotfixes":{"<YYYYMM.DD base>":"<YYYYMM.DD.PATCH>"}}. The IMDS/LPS client helpers mirror the connectivity prototype and are flagged in-code to be de-duplicated into a shared LPS client when that lands.
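
As an illustration of the documented body shape, the sketch below unmarshals {"hotfixes":{...}} and checks each entry against the YYYYMM.DD and YYYYMM.DD.PATCH formats. The type name hotfixConfig comes from the commit message, but its field layout, the parsePointer helper, and the format validation are assumptions made for this example; the real parser may be more lenient or stricter.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// hotfixConfig mirrors the documented {"hotfixes":{...}} body.
// The field layout is an assumption for this sketch.
type hotfixConfig struct {
	Hotfixes map[string]string `json:"hotfixes"`
}

var (
	baseRe    = regexp.MustCompile(`^\d{6}\.\d{2}$`)      // YYYYMM.DD
	versionRe = regexp.MustCompile(`^\d{6}\.\d{2}\.\d+$`) // YYYYMM.DD.PATCH
)

// parsePointer decodes the body and rejects entries that do not match the
// documented key and value formats (the validation step is an assumption).
func parsePointer(body []byte) (hotfixConfig, error) {
	var cfg hotfixConfig
	if err := json.Unmarshal(body, &cfg); err != nil {
		return hotfixConfig{}, fmt.Errorf("parse pointer: %w", err)
	}
	for base, version := range cfg.Hotfixes {
		if !baseRe.MatchString(base) || !versionRe.MatchString(version) {
			return hotfixConfig{}, fmt.Errorf("malformed entry %q -> %q", base, version)
		}
	}
	return cfg, nil
}

func main() {
	body := []byte(`{"hotfixes":{"202604.01":"202604.01.1","202605.01":"202605.01.2"}}`)
	cfg, err := parsePointer(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(cfg.Hotfixes), cfg.Hotfixes["202604.01"])
}
```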

Net effect (examples)

LPS serves the pointer body:

{"hotfixes":{"202604.01":"202604.01.1","202605.01":"202605.01.2"}}

check-hotfix stages /opt/azure/containers/aks-node-controller-hotfix.json:

{"hotfixes":{"202604.01":"202604.01.1","202605.01":"202605.01.2"}}

| Node baked ANC version | LPS read | check-hotfix outcome | download-hotfix then does |
| --- | --- | --- | --- |
| 202604.01.0 | OK | lpsRead | base 202604.01 -> 202604.01.1, patch 1 > 0, upgrades |
| 202605.01.2 | OK | lpsRead | base 202605.01 -> 202605.01.2, patch not higher, no-op |
| 202607.15.0 | OK (no matching base) | noHotfixForBase | no pointer for this base, no-op |
| any | 401 / 403 / 404 (pool not enrolled) | notEnrolled | no overlay staged, no-op |
| 202604.01.0 | unreachable (timeout / 5xx), provision config has embedded hotfixes | customDataFallback | reads staged fallback pointer, resolves as above |
| 202604.01.0 | unreachable, no fallback present | failed (still exit 0) | nothing staged, no-op |
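
The first three cases can be reproduced with a small sketch of patch-only, strictly-higher resolution. This is not download-hotfix's actual code: splitVersion and resolve are hypothetical helpers that assume a version is always YYYYMM.DD.PATCH with an integer patch, and they return an empty string to mean no-op.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitVersion splits "YYYYMM.DD.PATCH" into its base "YYYYMM.DD" and patch.
func splitVersion(v string) (string, int, error) {
	i := strings.LastIndex(v, ".")
	if i < 0 {
		return "", 0, fmt.Errorf("version %q has no patch component", v)
	}
	patch, err := strconv.Atoi(v[i+1:])
	if err != nil {
		return "", 0, fmt.Errorf("version %q: %w", v, err)
	}
	return v[:i], patch, nil
}

// resolve returns the hotfix version to install for a baked version, or ""
// when there is nothing to do. Only a pointer with the same base and a
// strictly higher patch qualifies.
func resolve(hotfixes map[string]string, baked string) (string, error) {
	base, patch, err := splitVersion(baked)
	if err != nil {
		return "", err
	}
	target, ok := hotfixes[base]
	if !ok {
		return "", nil // no pointer for this base
	}
	targetBase, targetPatch, err := splitVersion(target)
	if err != nil {
		return "", err
	}
	if targetBase != base || targetPatch <= patch {
		return "", nil // not patch-only, or not strictly higher
	}
	return target, nil
}

func main() {
	hotfixes := map[string]string{"202604.01": "202604.01.1", "202605.01": "202605.01.2"}
	for _, baked := range []string{"202604.01.0", "202605.01.2", "202607.15.0"} {
		target, err := resolve(hotfixes, baked)
		fmt.Printf("%s -> %q (err=%v)\n", baked, target, err)
	}
}
```

With the example pointer body, only 202604.01.0 resolves to a target (202604.01.1); the other two baked versions resolve to the empty string.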

Tests

New network-free unit tests (LPS fetcher and attested-token injected, no real networking): success read+write, benign not-enrolled 401/403/404 no-op (no overlay), timeout/connection/5xx fail-open, invalid pointer JSON fail-open, noHotfixForBase, cold-start fallback (and no-pointer failure), telemetry outcomes and always-exit-0 wiring, shared-parser equivalence with download-hotfix, attested-token injection, provision-config endpoint/CA resolution, and the SNI-pinned TLS client (provision-config trust vs insecure fallback).

All new tests pass. The full go test ./... run shows no new failures versus the base branch. The only failures are pre-existing Windows-only environmental ones (they need /etc/os-release, bash, and unix file perms) that pass in Linux CI.

Note: wiring this command into the provisioning wrapper script is intentionally out of scope for this PR and will land separately behind a feature flag.

@Devinwong Devinwong changed the title from "feat(anc): provisioning-hotfix M1 - check-hotfix ConfigMap reader (2.1b)" to "feat(anc): add check-hotfix subcommand to read hotfix pointer from ConfigMap" Jun 12, 2026
@Devinwong Devinwong force-pushed the devinwong/laughing-pancake branch from 5fff98d to 061ba60 Compare June 15, 2026 21:53
@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch from 64e782d to ede050a Compare June 15, 2026 21:56
@Devinwong Devinwong marked this pull request as ready for review June 16, 2026 02:27
@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch 2 times, most recently from 0c90761 to b33ec66 Compare June 16, 2026 18:08
@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch from b33ec66 to 07b497b Compare June 19, 2026 21:06
@Devinwong Devinwong requested a review from xuexu6666 as a code owner June 19, 2026 21:06
@Devinwong Devinwong changed the title from "feat(anc): add check-hotfix subcommand to read hotfix pointer from ConfigMap" to "feat(anc): add check-hotfix subcommand to read hotfix pointer from LPS" Jun 19, 2026
Add a fail-open 'check-hotfix' CLI subcommand that reads the base->hotfix
pointer map from the live-patching-service (LPS) over the IMDS-attested SNI
path that is reachable pre-kubelet, and stages the resolved {hotfixes:{...}}
pointer to the path download-hotfix already reads. download-hotfix keeps its
unchanged patch-only, strictly-higher gating; check-hotfix only fetches and
writes the pointer.

- Raw net/http HTTPS GET (no client-go). TLS ServerName pinned to the LPS
  SNI host while the TCP dial is forced to the apiserver FQDN (curl --resolve
  trick); Authorization is the IMDS attested-data signature; the server cert
  is verified against the cluster CA from the provision-config.
- FQDN + cluster CA come from the AKSNodeConfig ANC already parses (the only
  credential source present pre-provisioning); caSource is logged.
- Shares the hotfixConfig parser/data contract with download-hotfix.
- Always exits 0; emits CheckHotfix telemetry (lpsRead, noHotfixForBase,
  notEnrolled, customDataFallback, failed).
- A reachable LPS with nothing for this node (HTTP 401 pool-not-enrolled,
  403, 404) is a benign no-op (notEnrolled): no overlay is staged and it is
  never classified as a failure. Only transport/5xx failures fall back.
- PoC cold-start fallback reads a lenient top-level hotfixes object from the
  node config when the LPS read fails (TODO: typed contract field).
- Injectable App fields (checkHotfixFetcher, fetchAttestedToken,
  nodeConfigPath) for network-free unit tests.
- The LPS route + response schema are a planned-maintenance deliverable that
  is not finalized; lpsHotfixPath is a clearly-marked placeholder with a TODO.
  The IMDS/LPS client helpers mirror the connectivity prototype and should be
  de-duplicated into a shared LPS client when that lands.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch from 07b497b to 0d6f945 Compare June 20, 2026 00:25
@github-actions

Contributor

Changes cached containers or packages on Windows VHDs

Please get a Windows SIG member to approve.

The following diff file shows any additions or deletions from what will be cached on Windows VHDs, organised by VHD type.

  • Additions are new things cached.
  • Deletions are things no longer cached.
diff --git a/vhd_files/2022-containerd-gen2.txt b/vhd_files/2022-containerd-gen2.txt
index 7039bac..c51a47f 100644
--- a/vhd_files/2022-containerd-gen2.txt
+++ b/vhd_files/2022-containerd-gen2.txt
@@ -122,0 +123 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -124 +124,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
diff --git a/vhd_files/2022-containerd.txt b/vhd_files/2022-containerd.txt
index 5915cf1..7312c49 100644
--- a/vhd_files/2022-containerd.txt
+++ b/vhd_files/2022-containerd.txt
@@ -122,0 +123 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -124 +124,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
diff --git a/vhd_files/2025-gen2.txt b/vhd_files/2025-gen2.txt
index 37d9326..36e3641 100644
--- a/vhd_files/2025-gen2.txt
+++ b/vhd_files/2025-gen2.txt
@@ -52,0 +53 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -54 +54,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
diff --git a/vhd_files/2025.txt b/vhd_files/2025.txt
index 5b08280..b8873d5 100644
--- a/vhd_files/2025.txt
+++ b/vhd_files/2025.txt
@@ -52,0 +53 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -54 +54,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp

@Devinwong Devinwong marked this pull request as draft June 20, 2026 00:40