[CORE-16324] Auto finalize real upgrade test#30814
Conversation
Add ManualFinalizationUpgradeTest, a real-upgrade counterpart to the synthetic ManualFinalizationTest. Where the synthetic test fakes the version bump with __REDPANDA_LATEST_LOGICAL_VERSION, this drives an actual binary upgrade from the latest prior-feature-line release up to HEAD, exercising the operator flow where features_auto_finalization is opted out on the old binary before the upgrade. Covers three scenarios end to end: auto-finalization disabled blocks the post-upgrade advance, an explicit FinalizeUpgrade drives it, and GetUpgradeStatus reports the lifecycle. The old release lacks the admin v2 RPCs, so status and finalize are only invoked once every node is on HEAD; old and new logical versions are discovered at runtime rather than hard-coded. The synthetic test is kept as-is for fast, hermetic coverage. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add test_config_passes_through_multi_hop_upgrade to ManualFinalizationUpgradeTest. features_auto_finalization was backported to both v25.3.15 and v26.1.9, so an operator can set the opt-out once on the oldest release and have it carry through a v25.3 -> v26.1 -> HEAD upgrade with the flag set only once. The test sets the flag once on the oldest release, upgrades through the latest v26.1 patch (which has the knob but not the gating logic, so it advances the active version normally), then to HEAD (which consumes the knob). It asserts the active version is held after the final hop and that GetUpgradeStatus reports auto-finalization disabled -- proving the flag survived every controller-log hop -- before an explicit finalize drives the advance. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ea45877 to
b3076ec
Compare
Lay the groundwork for exercising the core promise of the unfinalized-upgrade feature: while an upgrade is not finalized the cluster can be rolled back to the prior release (giving up post-finalization features), and once finalized it cannot. test_downgrade_before_finalize boots v26.1.x, disables auto-finalization, then repeatedly upgrades the binaries to HEAD and rolls them back without finalizing -- the active version is held at the old version, so the old binary runs on the existing data -- before finalizing on the last attempt. test_downgrade_rejected_after_finalize confirms the flip side: after finalization the old binary aborts startup with "Incompatible downgrade detected". The per-state perturbation hook (_perturb) is intentionally a no-op for now; richness -- workloads, topic/config churn, leadership moves -- is added in a later commit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Flesh out _perturb (previously a no-op) into the shared baseline run in every upgrade/downgrade state: create a fixed set of topics, seed them once, read all records back, restart the whole cluster in place, wait for it to return healthy, and read the records again. Because the topics persist across the transitions, the read-backs double as a feature-agnostic check that data produced on one binary survives both a roll to the other and an in-place restart. Structure the per-upgrade-step perturbations as 1 + N: the common work above plus one function per supported unfinalized-upgrade step. _perturb_v26_1_to_v26_2 is added as the N=1 placeholder -- intentionally empty -- to be filled with the v26.2-gated feature exercises as v26.2 development proceeds. A note documents how this grows when v26.3 work begins: a new per-step function plus a harness extension to chain unfinalized upgrades (v26.1 -> v26.2 -> v26.3), retained per the support window. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| "shadow_link_sr_api_sync", | ||
| "batch_mirror_topic_status", |
There was a problem hiding this comment.
These are left out right now because I need to sync with authors of these features to understand how best to introduce test.
There was a problem hiding this comment.
Both were introduced by me, actually:
shadow_link_sr_api_sync: the key behaviour we could verify is that you can't use the new config fields mid-upgrade, similar to the test in a7ebcd9batch_mirror_topic_status: this one is trickier to test because the feature here switches internal behaviour (non-observable, intended to improve performance at high topic counts), so the best idea I have here would be having some trace-log-based assertions to confirm that the new implementation path is not being used before the feature activation, only after.
There was a problem hiding this comment.
Pull request overview
Adds a new “real upgrade” test suite for manual upgrade finalization/unfinalized upgrade behavior, and introduces a reusable perturb/rollback harness to validate downgrade safety and post-finalization feature enablement across real released binaries (v25.3/v26.1) to HEAD.
Changes:
- Introduces
ManualFinalizationUpgradeTestto exercise manual finalization behavior using real binary upgrades/downgrades (including multi-hop scenarios). - Adds a perturbation framework (baseline data-plane integrity checks + per-step feature-gate exercises) and uses it to validate one v26.2-gated feature (
tiered_cloud_topics). - Adds downgrade/rollback cycle coverage before finalize, and a negative test that downgrade is rejected after finalize.
CI test resultstest results on build#85843test results on build#85874
test results on build#86055
|
…tion Fill the v26.1 -> v26.2 per-step perturbation (previously an empty placeholder) with the tiered_cloud_topics feature gate. While the upgrade is unfinalized the active version is held at v26.1, so the feature is unavailable and creating a topic with storage mode tiered_cloud is refused -- the gate that keeps the feature out of reach until finalization. The perturbation drives that validator path in each upgraded-unfinalized phase (no topic is created, so it cannot block the downgrade), and _verify_v26_2_features_working confirms after the final finalize that the gate opens: the feature moves unavailable -> available and can be enabled to active. Add the per-feature scaffolding: _perturb_v26_1_to_v26_2 branches on phase so it only runs on the v26.2 binary, _verify_v26_2_features_working runs after finalize, and a _feature_state helper. The other two v26.2-gated features are not exercised by this single-cluster perturbation: shadow_link_sr_api_sync and batch_mirror_topic_status are both cluster-linking features (batch_mirror_topic_status additionally needs a second cluster). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cross-check the perturbation's per-feature coverage against the live feature table so a newly gated feature cannot be added -- in any major -- without either an exercise or an explicit acknowledgement. The check is major-agnostic and never names a release. While an upgrade is unfinalized the active version is held at the old release, so the features gated by the upgrade under test are exactly those the features endpoint reports "unavailable": they require the not-yet-active version of the binary we upgraded to, and a binary never carries features for a major beyond its own (so features more than one major out cannot appear). _assert_feature_coverage asserts that set is a subset of PERTURB_EXERCISED_FEATURES | PERTURB_ACKNOWLEDGED_FEATURES and fails with an actionable message otherwise. The registries are cumulative across upgrade steps, so reaching a new major only means adding the new features' names -- the guard itself never changes and no per-major pruning is required. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
f8fed2b to
eaf3a6e
Compare
| # instead of a single old -> HEAD hop. | ||
| # Keep older step functions and their harness coverage for as long as the |
There was a problem hiding this comment.
Do you think we should also eventually start deprecating old upgrade paths (e.g. the 26.1 -> 26.2 upgrade path once 26.2 goes end of support)? I see pros/cons for both, but wondering if you've thought about that already.
There was a problem hiding this comment.
I think we can deprecate what we don't officially support. If we don't see meaningful increase in testing cost, we may not even notice.
| assert finalized.version_after_finalization == self.new_logical | ||
|
|
||
| @cluster(num_nodes=3, log_allow_list=RESTART_LOG_ALLOW_LIST) | ||
| def test_get_upgrade_status_reports_lifecycle(self): |
There was a problem hiding this comment.
nit: test_manual_request_triggers_advance and test_get_upgrade_status_reports_lifecycle seem identical except for the assertions they make, I'd be tempted to merge them to simplify
There was a problem hiding this comment.
Good catch. the lifecycle test is essentially a superset; the only thing the advance test adds is the v1 get_features()["cluster_version"] check. I'll merge them and keep that assertion so we still cover both the v1 advance and the v2 status reporting.
| # final upgrade is finalized. Each cycle simulates discovering a problem after | ||
| # upgrading the binaries and rolling back to the prior release without | ||
| # finalizing -- the core capability the unfinalized-upgrade feature provides. | ||
| DOWNGRADE_ROLLBACK_CYCLES = 2 |
There was a problem hiding this comment.
Do you have any scenarios in mind, by any chance, where doing 2 cycles of this could lead to more coverage than just 1 scenario? I am just asking because this new test is fairly expensive (4 minutes 19.347 seconds), so I was trying to think through if there are any knobs we have to speed it up without losing any meaningful coverage.
There was a problem hiding this comment.
┌───────────────────────────────────────────────────────────────────────────┬───────────────────────────┬──────────────┐
│ Phase │ Wall time │ Per-cycle? │
├───────────────────────────────────────────────────────────────────────────┼───────────────────────────┼──────────────┤
│ dt overhead (bazel build fully cached ~6s + docker compose up/down) │ ~26s │ fixed │
├───────────────────────────────────────────────────────────────────────────┼───────────────────────────┼──────────────┤
│ Boot cluster on old release (v26.1.10) │ ~5s │ fixed │
├───────────────────────────────────────────────────────────────────────────┼───────────────────────────┼──────────────┤
│ Cycle 0 — incl. one-time topic seeding ~31s │ 46.3s │ mostly fixed │
├───────────────────────────────────────────────────────────────────────────┼───────────────────────────┼──────────────┤
│ Cycle 1 — steady state (4 cluster restarts + verifies + feature exercise) │ 15.1s │ ← marginal │
├───────────────────────────────────────────────────────────────────────────┼───────────────────────────┼──────────────┤
│ Final upgrade + finalize + verify │ ~4.6s │ fixed │
├───────────────────────────────────────────────────────────────────────────┼───────────────────────────┼──────────────┤
│ Teardown │ ~3.1s │ fixed │
├───────────────────────────────────────────────────────────────────────────┼───────────────────────────┼──────────────┤
│ Total dt wall │ ~101s (ducktape body 75s) │ │
└───────────────────────────────────────────────────────────────────────────┴───────────────────────────┴──────────────┘
There was a problem hiding this comment.
I don't think the above experiment is very scientific, but claude broke down the phases, and it seems like an extra cycle is a marginal increase i cost.
I don't have any specific scenario in mind, but I think if we notice this being too long we can absolutely just change it to 1 cycle (though according to that table we might not expect to gain much reduction in runtime).
| "shadow_link_sr_api_sync", | ||
| "batch_mirror_topic_status", |
There was a problem hiding this comment.
Both were introduced by me, actually:
shadow_link_sr_api_sync: the key behaviour we could verify is that you can't use the new config fields mid-upgrade, similar to the test in a7ebcd9batch_mirror_topic_status: this one is trickier to test because the feature here switches internal behaviour (non-observable, intended to improve performance at high topic counts), so the best idea I have here would be having some trace-log-based assertions to confirm that the new implementation path is not being used before the feature activation, only after.
| assert all(m.version_known and m.alive for m in status.members) | ||
| assert all(m.logical_version == self.new_logical for m in status.members) | ||
| # release_version is plumbed through from the per-node health report. | ||
| assert all(m.release_version for m in status.members) |
There was a problem hiding this comment.
yes, i suppose it could but waiting for status above hopefully ensures things converge. did you have specific scenario in mind?
There was a problem hiding this comment.
Nothing specific in mind, was just exploring and wondering if it was possible
In ManualFinalizationUpgradeTest, test_manual_request_triggers_advance and test_get_upgrade_status_reports_lifecycle ran the same real upgrade (install old release, boot, roll to HEAD, finalize) and differed only in their assertions; the lifecycle test was a near-superset. Fold the one unique assertion -- the v1 features-API view that cluster_version is still held at the old version at READY_TO_FINALIZE -- into the lifecycle test and drop the advance test. The real upgrade is the expensive part of these tests, so this drops a full cluster lifecycle from the suite with no loss of coverage: the surviving test still proves the manual finalize drives the advance, checked against both the v1 and v2 APIs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A new feature, iceberg_extended_mode_config, was added gated at v26_2_1, so it joins the set the v26.1 -> v26.2 perturbation coverage guard reports as unavailable while the upgrade is unfinalized. That guard fails unless every gated feature is exercised or acknowledged. The feature gates Iceberg extended-mode topic-config validation; perturbing it would need Iceberg topic setup orthogonal to the finalization behavior under test, so acknowledge it rather than exercise it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| With `features_auto_finalization=false` set on the old release before | ||
| the upgrade, a completed rolling upgrade to HEAD must not auto-advance | ||
| cluster_version. The opt-out, written on the old binary, has to survive |
nguyen-andrew
left a comment
There was a problem hiding this comment.
Nice tests, just a few dedup nits
| def _disable_auto_finalization(self): | ||
| self.redpanda.set_cluster_config({"features_auto_finalization": False}) | ||
|
|
||
| def _call_with_leader_retry(self, call, timeout_sec=30): | ||
| """Retry a controller-leader-routed admin v2 call through the transient | ||
| UNAVAILABLE window after restarts/leadership changes. See the identical | ||
| helper in ManualFinalizationTest for the rationale.""" | ||
| deadline = time.time() + timeout_sec | ||
| while True: | ||
| try: | ||
| return call() | ||
| except ConnectError as e: | ||
| if e.code != ConnectErrorCode.UNAVAILABLE or time.time() >= deadline: | ||
| raise | ||
| time.sleep(1) | ||
|
|
||
| def _finalize(self): | ||
| return self._call_with_leader_retry( | ||
| lambda: self.admin_v2.features().finalize_upgrade( | ||
| features_pb2.FinalizeUpgradeRequest() | ||
| ) | ||
| ) | ||
|
|
||
| def _get_upgrade_status(self): | ||
| return self._call_with_leader_retry( | ||
| lambda: self.admin_v2.features().get_upgrade_status( | ||
| features_pb2.GetUpgradeStatusRequest() | ||
| ) | ||
| ) | ||
|
|
||
| def _wait_for_status_state(self, state, timeout_sec=30): | ||
| """Wait until GetUpgradeStatus reports `state`; return that status.""" | ||
| wait_until( | ||
| lambda: self._get_upgrade_status().state == state, | ||
| timeout_sec=timeout_sec, | ||
| backoff_sec=1, | ||
| ) | ||
| return self._get_upgrade_status() |
There was a problem hiding this comment.
nit: could we deduplicate these (_disable_auto_finalization, _call_with_leader_retry, _finalize, _get_upgrade_status, _wait_for_status_state) with their counterparts in ManualFinalizationTest?
|
|
||
| def _wait_for_status_state(self, state, timeout_sec=30): | ||
| """Wait until GetUpgradeStatus reports `state`; return that status.""" | ||
| wait_until( |
There was a problem hiding this comment.
nit: can probably use wait_until_result
| self._wait_for_cluster_settled() | ||
| return self._node_latest_logical_version(self.redpanda.nodes[0]) | ||
|
|
||
| def _wait_for_cluster_version(self, target, timeout_sec=60): |
There was a problem hiding this comment.
nit: Could we reuse/merge with _wait_for_version_everywhere instead?
| f"Old release {old_release} reports logical version {self.old_logical}" | ||
| ) | ||
|
|
||
| def _restart_at_new(self, nodes): |
There was a problem hiding this comment.
nit: could we aggregate the install / restart / settle pattern into shared helpers, something like:
def _restart_and_settle(self): # in-place restart of the whole cluster
self.redpanda.restart_nodes(self.redpanda.nodes)
self._wait_for_cluster_settled()
def _roll_all_to(self, version): # in-place upgrade/downgrade
self.installer.install(self.redpanda.nodes, version)
self._restart_and_settle()
return self._node_latest_logical_version(self.redpanda.nodes[0])
def _boot_all_on(self, version): # cold boot (start(), not restart)
self.installer.install(self.redpanda.nodes, version)
self.redpanda.start()
def _upgrade_all_to(self, version):
return self._roll_all_to(version)
def _downgrade_all_to(self, release):
self._roll_all_to(release)
def _restart_at_new(self):
self.new_logical = self._roll_all_to(RedpandaInstaller.HEAD)
assert self.new_logical > self.old_logical, (...)
This PR introduces two kinds of tests. First, it reproduces some of the synthetic tests for unfinalized upgrades which used a special environment variable to "fake" the upgrades, and the reproductions use real 25.3 and 26.1 minor releases that contain the correct configuration. So, "real" unfinalized upgrade test.
The second kind of test is more of a framework. The framework is for building a test which upgrades a cluster without finalizing, perturbs the system (X), downgrades, then repeats the process, but this time finalizes the upgrade and verifies the system looks like it is in a good state.
X: what goes into the perturbation step? Ideally it is some set of tests that exercise the features that are intended to be feature gated for a release. This PR adds the framework, and a very simple test for one 26.1 feature. More to come later
Fixes: CORE-16324
Fixes: CORE-16325
Backports Required
Release Notes