
[v26.1.x] kafka/cluster_link: fix deadlocks with kafka::cluster #30788

Merged
WillemKauf merged 6 commits into redpanda-data:v26.1.x from vbotbuildovich:backport-pr-30784-v26.1.x-799 on Jun 16, 2026


@vbotbuildovich

Collaborator

Backport of PR #30784

Allocating a `config` on the stack blows up this test because of the
size of a `configuration` object.

Use the exposed `config::make_config()` helper to allocate it on
the heap instead.

(cherry picked from commit 04dcdcb)
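
The stack-versus-heap point in this commit can be illustrated with a minimal sketch. The failure is presumably caused by the object exceeding the available stack space. `big_config`, its 4 MiB size, and `make_big_config()` are hypothetical stand-ins; the real `configuration` type and the body of `config::make_config()` are not shown in this PR.

```cpp
#include <array>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for a very large configuration object.
// The size is chosen only for illustration.
struct big_config {
    std::array<std::byte, 4 * 1024 * 1024> storage;
    int retention_ms = 0;
};

// Hypothetical factory in the spirit of `config::make_config()`:
// the object lives on the heap, and only a pointer sits on the stack.
std::unique_ptr<big_config> make_big_config() {
    return std::make_unique<big_config>();
}

// What a test body might do with the heap-allocated object.
int use_config_in_test() {
    auto cfg = make_big_config();
    cfg->retention_ms = 1000;
    return cfg->retention_ms;
}
```

Declaring `big_config cfg;` as a local variable instead would place all 4 MiB in the caller's stack frame, which is the pattern the commit moves away from.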
Demonstrates deadlocks present in `client::stop()`. A wedged `client` that
calls `stop()` while waiting for an in-flight schema registry request to finish
is currently only woken by the `cluster`'s `abort_source`. That `abort_source`
is in turn only aborted when `cluster::stop()` is called, and that happens only
_after_ the `client` manages to close its gate.

In summary, the client's gate waits on the hung request, the request waits on
the abort source, and the abort source waits on the gate being closed, leading
to a deadlock.

(cherry picked from commit fe75dbb)
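
The three-way wait summarized in this commit can be modelled as a wait-for graph, where a cycle means no participant can ever make progress. The sketch below is illustrative only: the node names are labels taken from the commit message, and `wait_graph`, `has_deadlock()`, and `client_stop_waits()` are hypothetical helpers, not Redpanda code.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// An entry {X, {Y}} reads "X waits on Y".
using wait_graph = std::map<std::string, std::vector<std::string>>;

// Depth-first search that reports whether a cycle is reachable from `node`.
bool reaches_cycle(const wait_graph& g, const std::string& node,
                   std::set<std::string>& on_path,
                   std::set<std::string>& done) {
    if (on_path.count(node)) return true;  // back edge: found a cycle
    if (done.count(node)) return false;    // already fully explored
    on_path.insert(node);
    auto it = g.find(node);
    if (it != g.end()) {
        for (const auto& next : it->second) {
            if (reaches_cycle(g, next, on_path, done)) return true;
        }
    }
    on_path.erase(node);
    done.insert(node);
    return false;
}

// A wait-for graph with any cycle describes a deadlock.
bool has_deadlock(const wait_graph& g) {
    std::set<std::string> on_path, done;
    for (const auto& entry : g) {
        if (reaches_cycle(g, entry.first, on_path, done)) return true;
    }
    return false;
}

// The three waits described in the commit message.
wait_graph client_stop_waits() {
    return {
        {"client gate", {"hung request"}},
        {"hung request", {"cluster abort_source"}},
        {"cluster abort_source", {"client gate"}},
    };
}
```

Removing the edge from the abort source back to the gate breaks the cycle in this model.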
Fixes the deadlocks in `client::stop()` demonstrated by the previous commit.
Splitting `cluster::stop()` into two functions fixes the case where a hung
request waits for the `cluster`'s `abort_source` to fire while the abort is
itself stuck waiting for the gate of the client that issued the request to
close.

(cherry picked from commit 6e297d2)
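
A single-threaded toy model can show why the ordering matters. This is a sketch under stated assumptions: every `toy_` type is hypothetical, `try_close()` returns `false` where a real gate close would still be waiting, and only the name `shutdown_input()` is taken from this PR (it appears in a later commit message). The actual split of `cluster::stop()` may look different.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Toy abort source: runs subscribed callbacks once when aborted.
class toy_abort_source {
public:
    void subscribe(std::function<void()> cb) {
        if (_aborted) { cb(); return; }
        _subs.push_back(std::move(cb));
    }
    void request_abort() {
        if (_aborted) return;
        _aborted = true;
        for (auto& cb : _subs) cb();
        _subs.clear();
    }
private:
    bool _aborted = false;
    std::vector<std::function<void()>> _subs;
};

// Toy gate: counts in-flight work; closing completes only at zero.
class toy_gate {
public:
    void enter() { ++_holders; }
    void leave() { --_holders; }
    // Returns false where a real close() would still be waiting.
    bool try_close() const { return _holders == 0; }
private:
    int _holders = 0;
};

struct toy_cluster {
    toy_abort_source as;
    // First half of the split: wake everything blocked on the abort source.
    void shutdown_input() { as.request_abort(); }
};

struct toy_client {
    toy_cluster& cluster;
    toy_gate gate;
    // A request that holds the gate until the cluster aborts it.
    void start_hung_request() {
        gate.enter();
        cluster.as.subscribe([this] { gate.leave(); });
    }
    // Old order: close the gate and rely on an abort that comes later.
    bool stop_without_shutdown_input() { return gate.try_close(); }
    // New order: fire the abort first, then close the gate.
    bool stop_with_shutdown_input() {
        cluster.shutdown_input();
        return gate.try_close();
    }
};
```

In this model the old order reports that the gate cannot close while the request is in flight, and the new order succeeds because the abort releases the request before the gate is checked.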
Demonstrates deadlocks present in `link::stop()`. A wedged `link` that calls
`stop()` while waiting for an in-flight schema registry request to finish is
currently only woken by the `cluster`'s `abort_source`. That `abort_source` is
in turn only aborted when `cluster::stop()` is called, and that happens only
_after_ the `link` manages to close its gate.

In summary, the link's gate waits on the hung request, the request waits on
the abort source, and the abort source waits on the gate being closed, leading
to a deadlock.

(cherry picked from commit bbcc4f5)
This is the same deadlock as in `kafka::client::stop()`: we need to call
`cluster::shutdown_input()` before attempting to close any gates, since
a hung task would otherwise leave us stuck forever.

(cherry picked from commit 5b7cc61)
@vbotbuildovich added this to the v26.1.x-next milestone on Jun 12, 2026
@vbotbuildovich added the kind/backport (PRs targeting a stable branch) label on Jun 12, 2026
@vbotbuildovich

Collaborator Author

Retry command for Build#85733

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/audit_log_test.py::AuditLogTestsAppLifecycle.test_recovery_mode@{"audit_transport_mode":"kclient"}

@vbotbuildovich

vbotbuildovich commented Jun 12, 2026

Collaborator Author

CI test results

test results on build#85733

| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| FLAKY(FAIL) | AuditLogTestsAppLifecycle | test_recovery_mode | {"audit_transport_mode": "kclient"} | integration | https://buildkite.com/redpanda/redpanda/builds/85733#019ebd0d-fe4d-4d80-a84f-8fd0bb13e5cc | 7/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestsAppLifecycle&test_method=test_recovery_mode |

test results on build#85762

| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| FLAKY(PASS) | JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "snappy"} | integration | https://buildkite.com/redpanda/redpanda/builds/85762#019ebdc5-3c2a-4e18-a7eb-4bf23074bd2d | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |

We were handling each of these cases individually when, in reality,
all shutdown errors probably require nothing more than a `DEBUG` log
line and an early return. Commit 6e297d2
caused a new `broken_named_semaphore` exception to propagate
through this path, producing `ERROR` logs because the exception fell
through to the unknown-exception branch.

Future-proof this by checking `ssx::is_shutdown_exception()`, then
logging and returning early.

(cherry picked from commit 33b107f)
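
The idea in this commit, one predicate for all shutdown errors instead of one branch per exception type, can be sketched as follows. All names here are hypothetical: the `toy_` exception types and `toy_is_shutdown_exception()` only imitate the role of `ssx::is_shutdown_exception()`, whose real signature and covered types are not shown in this PR.

```cpp
#include <exception>
#include <stdexcept>

// Hypothetical shutdown-related exception types.
struct toy_gate_closed : std::runtime_error {
    using std::runtime_error::runtime_error;
};
struct toy_abort_requested : std::runtime_error {
    using std::runtime_error::runtime_error;
};
struct toy_broken_semaphore : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Stand-in predicate: true for any exception that signals shutdown.
bool toy_is_shutdown_exception(const std::exception_ptr& e) {
    if (!e) return false;  // rethrowing a null exception_ptr is undefined
    try {
        std::rethrow_exception(e);
    } catch (const toy_gate_closed&) {
        return true;
    } catch (const toy_abort_requested&) {
        return true;
    } catch (const toy_broken_semaphore&) {
        return true;
    } catch (...) {
        return false;
    }
}

enum class log_level { debug, error };

// A single check replaces the per-type branches in the error handler.
log_level classify_failure(const std::exception_ptr& e) {
    if (toy_is_shutdown_exception(e)) {
        return log_level::debug;  // log quietly and return early
    }
    return log_level::error;      // genuinely unexpected failure
}
```

With this shape, a newly introduced shutdown exception only needs to be added to the predicate, and every caller picks up the quieter handling.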
@WillemKauf merged commit 10ac573 into redpanda-data:v26.1.x on Jun 16, 2026
18 checks passed
@WillemKauf

Contributor

Fix has been in dev for a few days with no obvious new failures popping up in CI - feels safe to merge this backport.


Labels

area/build area/redpanda kind/backport PRs targeting a stable branch


3 participants