Core 15812/ct scheduler test#30801

Draft

oleiman wants to merge 18 commits into

redpanda-data:devfrom

oleiman:core-15812/ct-scheduler-test

oleiman commented Jun 12, 2026

Member

.

Backports Required

Release Notes

none

oleiman added 3 commits

June 12, 2026 16:43


          cloud_topics: add single_flight coordinator type

5ff6aa9

Dedup concurrent work keyed by a path. run(key, as, work) runs the
work once on the first caller (the leader) and merges later callers
onto its outcome. Bounded per shard; at capacity callers run
uncoordinated.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>


          cloud_topics: per-shard in-flight L1 cache dedup

78278af

Concurrent reads missing the cloud cache on the same extent each
issue their own S3 GET. Route read_object's cold-miss download
through single_flight so one GET serves all waiters on a shard.

A gate drains in-flight reads before file_io destruction.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>


          cloud_topics: add file_io_probe with L1 dedup counters

e38fe37

Per-shard counters to gauge dedup: reads, cache_misses, and
concurrent_read_merges. Plumbed through app and read-replica
refreshers.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>

oleiman self-assigned this

github-actions Bot added area/build area/redpanda labels

oleiman commented Jun 12, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics


          ct/scale: cloud_io lane sampler + disable CT leveling in MPT

1e027a9

Add a per-lane cloud_io scheduler sampler to the many_partitions
cloud-topics consume: every 20s it logs in_flight/waiters for
consumer_fetch, producer_upload, and default_group from the public
metrics endpoint (internal metrics are disabled at this scale), so
read-lane starvation behind the cloud_topics consumer hang is visible
in the artifacts. Runs in a daemon thread joined in a finally;
self-disables when the scheduler metrics are absent.

Also disable cloud-topic leveling and compaction for the cloud-topics
run. Leveling runs in the cloud_io default_group lane and does heavy
extent-merge I/O during the consume, contending with consumer_fetch
for the shared client pool; compaction is near-idle on these
delete-policy topics. Disabling both isolates whether freeing
default_group I/O lets the consumer drain; reconciliation stays on as
the L0->L1 data path.

oleiman force-pushed the core-15812/ct-scheduler-test branch from 13306b5 to 1e027a9 Compare

June 13, 2026 05:39

oleiman commented Jun 13, 2026

Member Author

/cdt
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics


          k/fetch: don't map client-disconnect to not_leader_for_partition

b1d76ec

When a client connection's input is shut down, the request's abort
source is aborted with std::errc::connection_aborted, which surfaces in
the per-partition read as a std::system_error. fetch_ntps's
handle_exception mapped every read failure -- including this one -- to
not_leader_for_partition to force a client retry. For a client that is
going away that is wrong: it spams the log (one WARN per partition) and,
on a half-closed connection whose response is still delivered, tells the
client the partition moved, so it refreshes metadata and reconnects. At
high cloud-topic partition counts (ManyPartitionsTest cloud_topics) this
compounds into a connection-churn / not_leader storm that adds load and
keeps the consumer from making progress.

Re-raise client-disconnect errors instead, using the same
net::is_reconnect_error predicate the fetch worker's handle_exceptions()
already uses to recognize a disconnect; it drops the error quietly
rather than producing per-partition not_leader results for a response
nobody will act on.

oleiman commented Jun 13, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

1 similar comment

oleiman commented Jun 14, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

oleiman added 2 commits

June 14, 2026 14:37


          cloud_topics: expose l1_reader_cache hit/miss metrics

5d6c565

Register a public metrics group (cloud_topics_l1_reader_cache) for the
per-shard L1 reader cache: cached_readers, in_use_readers, hits, misses,
readers_evicted. Make them public so they are scrapeable when internal
metrics are disabled (as at high partition counts).

A near-zero hit rate with steady evictions is the signature of the cache
thrashing -- the active reader set exceeding
cloud_topics_l1_reader_cache_max_size -- which forces every fetch to
rebuild the reader and re-query the L1 metastore.


          ct/scale: size L1 reader cache + sample its hit rate in MPT

0392fd2

At ManyPartitions cloud_topics scale (~17.8k partitions) a consumer is
behind on ~500 partitions per shard, but the L1 reader cache defaults to
128/shard, so the LRU evicts the reader the consumer is about to
re-fetch. Every fetch rebuilds the reader and re-RPCs the metastore,
saturating the kafka scheduling group and stalling the consumer while
the object-store pool sits idle.

Bump cloud_topics_l1_reader_cache_max_size to 1024 and the eviction
timeout to 600s so the cache holds the active set across the consume,
and sample the new l1_reader_cache hit/miss/eviction counters alongside
the cloud_io lane occupancy so the run shows the hit rate directly.

Leveling and compaction stay disabled (ruled out as the cause earlier;
kept off to isolate the reader cache).

oleiman force-pushed the core-15812/ct-scheduler-test branch from 9dde7c3 to 0392fd2 Compare

June 14, 2026 21:37

oleiman commented Jun 14, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics


          ct/scale: capture a targeted CPU profile mid-consume in MPT

7c11db3

Add a one-shot CPU-profile capture to the many_partitions cloud_topics
consume: 60s in (once the kafka scheduling group is saturated), sample
one broker for 30s via the admin cpu_profile endpoint and log the top
stacks by occurrence. The wait_ms override enables the profiler for just
that window, so it stays off the rest of the run. Targeted to one broker
and one window to keep the cost down.

The backtraces are raw addresses; the logged top stacks are symbolized
offline against the build binary to find what saturates the kafka group.

oleiman commented Jun 15, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

1 similar comment

oleiman commented Jun 15, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics


          Revert "ct: Add early max_bytes check to L1 reader"

a05044a

a0bbe35 made the L1 reader return end-of-stream + no data when
strict_max_bytes && max_bytes==0. In a fetch over many partitions the
plan hands the budget-exhausted tail partitions max_bytes==0 with
strict_max_bytes set (everything after the obligatory first read), so
those partitions now return nothing and never advance -- the consumer
stalls (ManyPartitionsTest cloud_topics times out, a regression since
late March). Before the check they got a trickle and drained.

Removing the check leaves strict_max_bytes unused in the L1 reader;
update the max_bytes_zero_behavior test accordingly (strict max_bytes=0
now returns one batch, like the non-strict case).

Reverting to verify the regression via CDT; a proper fix that keeps the
no-wasteful-read intent without starving the tail follows.

oleiman force-pushed the core-15812/ct-scheduler-test branch from c548631 to a05044a Compare

June 15, 2026 07:22

oleiman commented Jun 15, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics


          ct/l1: decouple reader IO from client connection abort

75196dc

The L1 reader passed the fetch's client connection abort source into its
object-store and metastore reads. A transient consumer disconnect then
aborted the in-flight read, dropping the reader's open stream and making
it non-reusable, so the l1_reader_cache evicted it. The reconnect
re-fetched the same offset, missed the cache, and paid a fresh metastore
RPC + object GET -- a thrash that never lets the cache warm under a
disconnect-heavy workload.

Bind object-store and metastore IO to a reader-owned abort source so a
client disconnect no longer tears down the warm reader; the reconnect
reuses it. Each IO stays bounded by its own retry/timeout policy.

Add a cache test: a reader survives a mid-read client disconnect.

oleiman commented Jun 15, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics


          ct/l1: instrument reader-cache miss reasons

77edc72

Split why the l1_reader_cache fails to reuse readers, to tell over-read
(a reader is cached for the partition but at an offset the next fetch
doesn't request) from frontier-EOS (the reader is returned having read
to the end of available data and is disposed). Surface the breakdown via
the cloud_topics_l1_reader_cache metric group and the ManyPartitions
sampler.

oleiman commented Jun 15, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics


          ct/l1: fix reader-cache miss-reason metric name collision

29b885e

The miss-reason counter was named with a "misses" prefix, which the
public-metrics substring matcher confused with the existing misses
counter, so the ManyPartitions sampler threw "more than one metric
matched" every iteration and logged no breakdown. Rename it to
offset_mismatch so each query pattern matches exactly one metric.

oleiman commented Jun 16, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

oleiman commented Jun 16, 2026

Member Author

/cdt
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics


          ct: expose reconciliation + batch-cache progress as public metrics

7b01d67

The ManyPartitions stall (CORE-15812) shows the consumer stuck while the
fetch-path cloud_io pool sits idle -- data isn't reaching readable
objects. The reconciler and batch-cache probes that would show why are
internal-only, and internal metrics are disabled at scale. Expose
reconciliation progress (objects/batches/bytes_reconciled) and batch-cache
hit/miss as public metrics and sample them in the ManyPartitions test, so
a CDT run shows whether reconciliation is advancing and whether the batch
cache is going cold.

oleiman commented Jun 17, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

oleiman added 2 commits

June 17, 2026 08:24


          ct/reconciler: don't skip a source when the LSO is unavailable

be23801

has_pending_data() returned false when last_stable_offset() was
unavailable, dropping the source before reconciliation. But make_reader
syncs the stm (via sync_effective_start), so skipping kept the LSO
unavailable and the partition never reconciled -- a self-perpetuating
skip that stalls reconciliation and the consumers reading those
partitions. Treat an unavailable LSO as "maybe pending" and keep the
source so the next pass syncs the stm and re-evaluates.


          ct/reconciler: expose LSO-HWM gap + unavailable-LSO count

721b704

Diagnostic for CORE-15812. pending_offset_lag reads 0 both when caught up
and when the LSO is unavailable, so it can't distinguish a reconciliation
stall from genuine idleness. Add per-shard public gauges: lso_hwm_gap
(sum of HWM-LSO across sources -- persistently positive means the LSO is
frozen below the HWM) and lso_unavailable_sources, fed by a per-iteration
sweep in the reconciliation loop, and sample them in ManyPartitions.

oleiman commented Jun 17, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics


          cloud_topics: instrument no-object EOS and reconciler corrections

22cc371

Localize the cloud-topics consumer stall by splitting the read- and
reconciler-side signals of an lro/metastore divergence:

- level_one_reader flags EOS caused by the metastore returning no
  object for the next offset (a reconciled offset with no registered
  extent), distinct from a normal end-of-data EOS.
- l1_reader_cache exposes that subset as public "no_object_eos".
- reconciler_probe exposes offset_corrections (metastore dropped
  misaligned extents) and partitions_reconciled.
- many_partitions_test samples the new metrics.

oleiman commented Jun 17, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics


          cloud_topics: instrument the reconciliation gap (read + metastore)

409861a

Pin where the lro/metastore divergence (CORE-15812) originates:

- level_one_reader/l1_reader_cache: split out gap_eos, the subset of
  no-object EOS where the reader delivered 0 bytes -- a reconciled
  offset (start_offset <= lro) with no registered extent, the direct
  gap signal. Sample it in many_partitions_test and log a rate-limited
  warn naming the partition/offset.
- lsm/stm: a skipped apply still advances the apply offset, so write
  reports success while the extent was dropped; log it at warn instead
  of debug so any silent drop is visible.
- lsm/replicated_db: on open, log the recovery boundaries
  (max_persisted_seqno, volatile_buffer seqno range, replayed vs
  skipped) to catch a manifest seqno advanced past durable data.

oleiman commented Jun 18, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics


          ct/scale: bump fetch_max_read_concurrency in MPT cloud_topics

3790c9c

Each cloud-topic partition read issues a metastore extent-lookup RPC;
at the default concurrency of 1 a fetch serializes one ~100ms round-trip
per partition while the broker stays resource-idle. Raise it to 16 for
the cloud_topics MPT variant to test whether that per-partition
serialization is the throughput cap (CORE-15812).

oleiman commented Jun 18, 2026

Member Author

/cdt
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/build area/redpanda