Skip to content

[CORE-16324] Auto finalize real upgrade test#30814

Open
dotnwat wants to merge 8 commits into
redpanda-data:devfrom
dotnwat:auto-finalize-real-upgrade-test
Open

[CORE-16324] Auto finalize real upgrade test#30814
dotnwat wants to merge 8 commits into
redpanda-data:devfrom
dotnwat:auto-finalize-real-upgrade-test

Conversation

@dotnwat

@dotnwat dotnwat commented Jun 15, 2026

Copy link
Copy Markdown
Member

This PR introduces two kinds of tests. First, it reproduces some of the synthetic tests for unfinalized upgrades which used a special environment variable to "fake" the upgrades, and the reproductions use real 25.3 and 26.1 minor releases that contain the correct configuration. So, "real" unfinalized upgrade test.

The second kind of test is more of a framework. The framework is for building a test which upgrades a cluster without finalizing, perturbs the system (X), downgrades, then repeats the process, but this time finalizes the upgrade and verifies the system looks like it is in a good state.

X: what goes into the perturbation step? Ideally it is some set of tests that exercise the features that are intended to be feature gated for a release. This PR adds the framework, and a very simple test for one 26.1 feature. More to come later

Fixes: CORE-16324
Fixes: CORE-16325

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

Comment thread tests/rptest/tests/cluster_features_test.py Outdated
Comment thread tests/rptest/tests/cluster_features_test.py
dotnwat and others added 2 commits June 15, 2026 12:26
Add ManualFinalizationUpgradeTest, a real-upgrade counterpart to the
synthetic ManualFinalizationTest. Where the synthetic test fakes the
version bump with __REDPANDA_LATEST_LOGICAL_VERSION, this drives an
actual binary upgrade from the latest prior-feature-line release up to
HEAD, exercising the operator flow where features_auto_finalization is
opted out on the old binary before the upgrade.

Covers three scenarios end to end: auto-finalization disabled blocks the
post-upgrade advance, an explicit FinalizeUpgrade drives it, and
GetUpgradeStatus reports the lifecycle. The old release lacks the admin
v2 RPCs, so status and finalize are only invoked once every node is on
HEAD; old and new logical versions are discovered at runtime rather than
hard-coded. The synthetic test is kept as-is for fast, hermetic coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add test_config_passes_through_multi_hop_upgrade to
ManualFinalizationUpgradeTest. features_auto_finalization was backported
to both v25.3.15 and v26.1.9, so an operator can set the opt-out once on
the oldest release and have it carry through a v25.3 -> v26.1 -> HEAD
upgrade with the flag set only once.

The test sets the flag once on the oldest release, upgrades through the
latest v26.1 patch (which has the knob but not the gating logic, so it
advances the active version normally), then to HEAD (which consumes the
knob). It asserts the active version is held after the final hop and that
GetUpgradeStatus reports auto-finalization disabled -- proving the flag
survived every controller-log hop -- before an explicit finalize drives
the advance.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dotnwat dotnwat force-pushed the auto-finalize-real-upgrade-test branch from ea45877 to b3076ec Compare June 15, 2026 19:35
Lay the groundwork for exercising the core promise of the
unfinalized-upgrade feature: while an upgrade is not finalized the cluster
can be rolled back to the prior release (giving up post-finalization
features), and once finalized it cannot.

test_downgrade_before_finalize boots v26.1.x, disables auto-finalization,
then repeatedly upgrades the binaries to HEAD and rolls them back without
finalizing -- the active version is held at the old version, so the old
binary runs on the existing data -- before finalizing on the last attempt.
test_downgrade_rejected_after_finalize confirms the flip side: after
finalization the old binary aborts startup with "Incompatible downgrade
detected".

The per-state perturbation hook (_perturb) is intentionally a no-op for
now; richness -- workloads, topic/config churn, leadership moves -- is
added in a later commit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dotnwat dotnwat changed the title Auto finalize real upgrade test [CORE-16324] Auto finalize real upgrade test Jun 15, 2026
Flesh out _perturb (previously a no-op) into the shared baseline run in
every upgrade/downgrade state: create a fixed set of topics, seed them
once, read all records back, restart the whole cluster in place, wait for
it to return healthy, and read the records again. Because the topics
persist across the transitions, the read-backs double as a
feature-agnostic check that data produced on one binary survives both a
roll to the other and an in-place restart.

Structure the per-upgrade-step perturbations as 1 + N: the common work
above plus one function per supported unfinalized-upgrade step.
_perturb_v26_1_to_v26_2 is added as the N=1 placeholder -- intentionally
empty -- to be filled with the v26.2-gated feature exercises as v26.2
development proceeds. A note documents how this grows when v26.3 work
begins: a new per-step function plus a harness extension to chain
unfinalized upgrades (v26.1 -> v26.2 -> v26.3), retained per the support
window.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment on lines +1212 to +1213
"shadow_link_sr_api_sync",
"batch_mirror_topic_status",

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are left out right now because I need to sync with authors of these features to understand how best to introduce test.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both were introduced by me, actually:

  • shadow_link_sr_api_sync: the key behaviour we could verify is that you can't use the new config fields mid-upgrade, similar to the test in a7ebcd9
  • batch_mirror_topic_status: this one is trickier to test because the feature here switches internal behaviour (non-observable, intended to improve performance at high topic counts), so the best idea I have here would be having some trace-log-based assertions to confirm that the new implementation path is not being used before the feature activation, only after.

@dotnwat dotnwat marked this pull request as ready for review June 16, 2026 05:10

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new “real upgrade” test suite for manual upgrade finalization/unfinalized upgrade behavior, and introduces a reusable perturb/rollback harness to validate downgrade safety and post-finalization feature enablement across real released binaries (v25.3/v26.1) to HEAD.

Changes:

  • Introduces ManualFinalizationUpgradeTest to exercise manual finalization behavior using real binary upgrades/downgrades (including multi-hop scenarios).
  • Adds a perturbation framework (baseline data-plane integrity checks + per-step feature-gate exercises) and uses it to validate one v26.2-gated feature (tiered_cloud_topics).
  • Adds downgrade/rollback cycle coverage before finalize, and a negative test that downgrade is rejected after finalize.

Comment thread tests/rptest/tests/cluster_features_test.py
Comment thread tests/rptest/tests/cluster_features_test.py Outdated
@vbotbuildovich

vbotbuildovich commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

CI test results

test results on build#85843
test_status test_class test_method test_arguments test_kind job_url passed reason test_history
FLAKY(PASS) CloudTopicsSizeReportingTest test_l0_size_reporting null integration https://buildkite.com/redpanda/redpanda/builds/85843#019ecedd-c0d3-467f-a772-b43023605080 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudTopicsSizeReportingTest&test_method=test_l0_size_reporting
FLAKY(PASS) DatalakeTableNameTest test_topic_name_dot_replacement {"catalog_type": "rest_hadoop", "cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/85843#019ecedd-c0d0-43cf-9563-af36ea091bd0 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0065, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeTableNameTest&test_method=test_topic_name_dot_replacement
FLAKY(PASS) NodeWiseRecoveryTest test_node_wise_recovery {"dead_node_count": 1} integration https://buildkite.com/redpanda/redpanda/builds/85843#019ecede-9c4e-403a-a555-27696245fbfe 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0288, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodeWiseRecoveryTest&test_method=test_node_wise_recovery
FLAKY(PASS) NodeWiseRecoveryTest test_recovery_local_data_missing {"wait_for_final_manifest_uploads": false} integration https://buildkite.com/redpanda/redpanda/builds/85843#019ecede-9c4f-4d7f-a94f-0cd41f03bd32 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0257, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodeWiseRecoveryTest&test_method=test_recovery_local_data_missing
test results on build#85874
test_status test_class test_method test_arguments test_kind job_url passed reason test_history
FLAKY(PASS) ShadowLinkingReplicationTests test_auto_prefix_trimming {"source_cluster_spec": {"cluster_type": "redpanda"}, "storage_mode": "cloud", "with_failures": false} integration https://buildkite.com/redpanda/redpanda/builds/85874#019ed224-e8b5-47d5-92b2-6011189dce91 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0257, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_auto_prefix_trimming
FLAKY(PASS) DatalakeTableNameTest test_dlq_table_name_dot_replacement {"catalog_type": "rest_hadoop", "cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/85874#019ed226-938d-429c-8a4c-81c01dc25c76 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0194, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeTableNameTest&test_method=test_dlq_table_name_dot_replacement
test results on build#86055
test_status test_class test_method test_arguments test_kind job_url passed reason test_history
FLAKY(PASS) ShadowLinkingReplicationTests test_auto_prefix_trimming {"source_cluster_spec": {"cluster_type": "redpanda"}, "storage_mode": "cloud", "with_failures": true} integration https://buildkite.com/redpanda/redpanda/builds/86055#019ee1b5-b87b-4532-a4b3-b8feee158759 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0353, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1021, p1=0.3405, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_auto_prefix_trimming
FLAKY(PASS) NodeWiseRecoveryTest test_recovery_local_data_missing {"wait_for_final_manifest_uploads": false} integration https://buildkite.com/redpanda/redpanda/builds/86055#019ee1b6-3847-4608-ad03-0d560da6c03c 19/21 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0512, p0=0.6505, reject_threshold=0.0100. adj_baseline=0.1459, p1=0.1885, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodeWiseRecoveryTest&test_method=test_recovery_local_data_missing

dotnwat and others added 2 commits June 16, 2026 12:05
…tion

Fill the v26.1 -> v26.2 per-step perturbation (previously an empty
placeholder) with the tiered_cloud_topics feature gate. While the upgrade is
unfinalized the active version is held at v26.1, so the feature is unavailable
and creating a topic with storage mode tiered_cloud is refused -- the gate that
keeps the feature out of reach until finalization. The perturbation drives that
validator path in each upgraded-unfinalized phase (no topic is created, so it
cannot block the downgrade), and _verify_v26_2_features_working confirms after
the final finalize that the gate opens: the feature moves unavailable ->
available and can be enabled to active.

Add the per-feature scaffolding: _perturb_v26_1_to_v26_2 branches on phase so it
only runs on the v26.2 binary, _verify_v26_2_features_working runs after
finalize, and a _feature_state helper.

The other two v26.2-gated features are not exercised by this single-cluster
perturbation: shadow_link_sr_api_sync and batch_mirror_topic_status are both
cluster-linking features (batch_mirror_topic_status additionally needs a second
cluster).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cross-check the perturbation's per-feature coverage against the live feature
table so a newly gated feature cannot be added -- in any major -- without
either an exercise or an explicit acknowledgement.

The check is major-agnostic and never names a release. While an upgrade is
unfinalized the active version is held at the old release, so the features
gated by the upgrade under test are exactly those the features endpoint reports
"unavailable": they require the not-yet-active version of the binary we upgraded
to, and a binary never carries features for a major beyond its own (so features
more than one major out cannot appear). _assert_feature_coverage asserts that
set is a subset of PERTURB_EXERCISED_FEATURES | PERTURB_ACKNOWLEDGED_FEATURES and
fails with an actionable message otherwise. The registries are cumulative across
upgrade steps, so reaching a new major only means adding the new features' names
-- the guard itself never changes and no per-major pruning is required.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dotnwat dotnwat force-pushed the auto-finalize-real-upgrade-test branch from f8fed2b to eaf3a6e Compare June 16, 2026 20:27
pgellert
pgellert previously approved these changes Jun 18, 2026

@pgellert pgellert left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Comment on lines +1396 to +1397
# instead of a single old -> HEAD hop.
# Keep older step functions and their harness coverage for as long as the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should also eventually start deprecating old upgrade paths (e.g. the 26.1 -> 26.2 upgrade path once 26.2 goes end of support)? I see pros/cons for both, but wondering if you've thought about that already.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can deprecate what we don't officially support. If we don't see meaningful increase in testing cost, we may not even notice.

assert finalized.version_after_finalization == self.new_logical

@cluster(num_nodes=3, log_allow_list=RESTART_LOG_ALLOW_LIST)
def test_get_upgrade_status_reports_lifecycle(self):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: test_manual_request_triggers_advance and test_get_upgrade_status_reports_lifecycle seem identical except for the assertions they make, I'd be tempted to merge them to simplify

@dotnwat dotnwat Jun 19, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. the lifecycle test is essentially a superset; the only thing the advance test adds is the v1 get_features()["cluster_version"] check. I'll merge them and keep that assertion so we still cover both the v1 advance and the v2 status reporting.

# final upgrade is finalized. Each cycle simulates discovering a problem after
# upgrading the binaries and rolling back to the prior release without
# finalizing -- the core capability the unfinalized-upgrade feature provides.
DOWNGRADE_ROLLBACK_CYCLES = 2

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any scenarios in mind, by any chance, where doing 2 cycles of this could lead to more coverage than just 1 scenario? I am just asking because this new test is fairly expensive (4 minutes 19.347 seconds), so I was trying to think through if there are any knobs we have to speed it up without losing any meaningful coverage.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

 ┌───────────────────────────────────────────────────────────────────────────┬───────────────────────────┬──────────────┐
  │                                   Phase                                   │         Wall time         │  Per-cycle?  │
  ├───────────────────────────────────────────────────────────────────────────┼───────────────────────────┼──────────────┤
  │ dt overhead (bazel build fully cached ~6s + docker compose up/down)       │ ~26s                      │ fixed        │
  ├───────────────────────────────────────────────────────────────────────────┼───────────────────────────┼──────────────┤
  │ Boot cluster on old release (v26.1.10)                                    │ ~5s                       │ fixed        │
  ├───────────────────────────────────────────────────────────────────────────┼───────────────────────────┼──────────────┤
  │ Cycle 0 — incl. one-time topic seeding ~31s                               │ 46.3s                     │ mostly fixed │
  ├───────────────────────────────────────────────────────────────────────────┼───────────────────────────┼──────────────┤
  │ Cycle 1 — steady state (4 cluster restarts + verifies + feature exercise) │ 15.1s                     │ ← marginal   │
  ├───────────────────────────────────────────────────────────────────────────┼───────────────────────────┼──────────────┤
  │ Final upgrade + finalize + verify                                         │ ~4.6s                     │ fixed        │
  ├───────────────────────────────────────────────────────────────────────────┼───────────────────────────┼──────────────┤
  │ Teardown                                                                  │ ~3.1s                     │ fixed        │
  ├───────────────────────────────────────────────────────────────────────────┼───────────────────────────┼──────────────┤
  │ Total dt wall                                                             │ ~101s (ducktape body 75s) │              │
  └───────────────────────────────────────────────────────────────────────────┴───────────────────────────┴──────────────┘

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the above experiment is very scientific, but claude broke down the phases, and it seems like an extra cycle is a marginal increase i cost.

I don't have any specific scenario in mind, but I think if we notice this being too long we can absolutely just change it to 1 cycle (though according to that table we might not expect to gain much reduction in runtime).

Comment on lines +1212 to +1213
"shadow_link_sr_api_sync",
"batch_mirror_topic_status",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both were introduced by me, actually:

  • shadow_link_sr_api_sync: the key behaviour we could verify is that you can't use the new config fields mid-upgrade, similar to the test in a7ebcd9
  • batch_mirror_topic_status: this one is trickier to test because the feature here switches internal behaviour (non-observable, intended to improve performance at high topic counts), so the best idea I have here would be having some trace-log-based assertions to confirm that the new implementation path is not being used before the feature activation, only after.

assert all(m.version_known and m.alive for m in status.members)
assert all(m.logical_version == self.new_logical for m in status.members)
# release_version is plumbed through from the per-node health report.
assert all(m.release_version for m in status.members)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this flake based on this comment?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, i suppose it could but waiting for status above hopefully ensures things converge. did you have specific scenario in mind?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing specific in mind, was just exploring and wondering if it was possible

dotnwat and others added 2 commits June 19, 2026 13:43
In ManualFinalizationUpgradeTest, test_manual_request_triggers_advance
and test_get_upgrade_status_reports_lifecycle ran the same real upgrade
(install old release, boot, roll to HEAD, finalize) and differed only in
their assertions; the lifecycle test was a near-superset.

Fold the one unique assertion -- the v1 features-API view that
cluster_version is still held at the old version at READY_TO_FINALIZE --
into the lifecycle test and drop the advance test. The real upgrade is
the expensive part of these tests, so this drops a full cluster lifecycle
from the suite with no loss of coverage: the surviving test still proves
the manual finalize drives the advance, checked against both the v1 and
v2 APIs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A new feature, iceberg_extended_mode_config, was added gated at v26_2_1,
so it joins the set the v26.1 -> v26.2 perturbation coverage guard
reports as unavailable while the upgrade is unfinalized. That guard
fails unless every gated feature is exercised or acknowledged.

The feature gates Iceberg extended-mode topic-config validation;
perturbing it would need Iceberg topic setup orthogonal to the
finalization behavior under test, so acknowledge it rather than exercise
it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment on lines +1583 to +1585
With `features_auto_finalization=false` set on the old release before
the upgrade, a completed rolling upgrade to HEAD must not auto-advance
cluster_version. The opt-out, written on the old binary, has to survive

@nguyen-andrew nguyen-andrew left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice tests, just a few dedup nits

Comment on lines +1538 to +1575
def _disable_auto_finalization(self):
self.redpanda.set_cluster_config({"features_auto_finalization": False})

def _call_with_leader_retry(self, call, timeout_sec=30):
"""Retry a controller-leader-routed admin v2 call through the transient
UNAVAILABLE window after restarts/leadership changes. See the identical
helper in ManualFinalizationTest for the rationale."""
deadline = time.time() + timeout_sec
while True:
try:
return call()
except ConnectError as e:
if e.code != ConnectErrorCode.UNAVAILABLE or time.time() >= deadline:
raise
time.sleep(1)

def _finalize(self):
return self._call_with_leader_retry(
lambda: self.admin_v2.features().finalize_upgrade(
features_pb2.FinalizeUpgradeRequest()
)
)

def _get_upgrade_status(self):
return self._call_with_leader_retry(
lambda: self.admin_v2.features().get_upgrade_status(
features_pb2.GetUpgradeStatusRequest()
)
)

def _wait_for_status_state(self, state, timeout_sec=30):
"""Wait until GetUpgradeStatus reports `state`; return that status."""
wait_until(
lambda: self._get_upgrade_status().state == state,
timeout_sec=timeout_sec,
backoff_sec=1,
)
return self._get_upgrade_status()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could we deduplicate these (_disable_auto_finalization, _call_with_leader_retry, _finalize, _get_upgrade_status, _wait_for_status_state) with their counterparts in ManualFinalizationTest?


def _wait_for_status_state(self, state, timeout_sec=30):
"""Wait until GetUpgradeStatus reports `state`; return that status."""
wait_until(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can probably use wait_until_result

self._wait_for_cluster_settled()
return self._node_latest_logical_version(self.redpanda.nodes[0])

def _wait_for_cluster_version(self, target, timeout_sec=60):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could we reuse/merge with _wait_for_version_everywhere instead?

f"Old release {old_release} reports logical version {self.old_logical}"
)

def _restart_at_new(self, nodes):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could we aggregate the install / restart / settle pattern into shared helpers, something like:

def _restart_and_settle(self):                  # in-place restart of the whole cluster
    self.redpanda.restart_nodes(self.redpanda.nodes)
    self._wait_for_cluster_settled()

def _roll_all_to(self, version):                # in-place upgrade/downgrade
    self.installer.install(self.redpanda.nodes, version)
    self._restart_and_settle()
    return self._node_latest_logical_version(self.redpanda.nodes[0])

def _boot_all_on(self, version):                # cold boot (start(), not restart)
    self.installer.install(self.redpanda.nodes, version)
    self.redpanda.start()

def _upgrade_all_to(self, version):
    return self._roll_all_to(version)

def _downgrade_all_to(self, release):
    self._roll_all_to(release)

def _restart_at_new(self):
    self.new_logical = self._roll_all_to(RedpandaInstaller.HEAD)
    assert self.new_logical > self.old_logical, (...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants