
Core 15812/ct scheduler test #30801

Draft
oleiman wants to merge 18 commits into
redpanda-data:dev from
oleiman:core-15812/ct-scheduler-test

Conversation

@oleiman

@oleiman oleiman commented Jun 12, 2026

Member

.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

oleiman added 3 commits June 12, 2026 16:43
Dedup concurrent work keyed by a path. run(key, as, work) runs the
work once on the first caller (the leader) and merges later callers
onto its outcome. Bounded per shard; at capacity callers run
uncoordinated.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
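
The leader/merge behaviour described in this commit can be sketched in Python with asyncio. This is only an illustration of the single-flight idea, not the PR's C++ code: the `SingleFlight` class, its `capacity` default, and the `merges` counter are assumptions made for the example, and the abort-source argument (`as`) is left out.

```python
import asyncio


class SingleFlight:
    """Illustrative per-key dedup; not the PR's C++ single_flight."""

    def __init__(self, capacity=1024):
        self._capacity = capacity
        self._in_flight = {}  # key -> future owned by the leader
        self.merges = 0       # callers merged onto a leader

    async def run(self, key, work):
        leader = self._in_flight.get(key)
        if leader is not None:
            self.merges += 1
            # shield: a cancelled follower must not cancel the shared future
            return await asyncio.shield(leader)
        if len(self._in_flight) >= self._capacity:
            return await work()  # at capacity: run uncoordinated
        fut = asyncio.get_running_loop().create_future()
        self._in_flight[key] = fut
        try:
            result = await work()
        except Exception as exc:
            fut.set_exception(exc)
            fut.exception()  # mark retrieved so asyncio does not warn
            raise
        except BaseException:
            fut.cancel()  # leader cancelled: wake the followers too
            raise
        else:
            fut.set_result(result)
            return result
        finally:
            del self._in_flight[key]


async def _demo():
    sf = SingleFlight()
    calls = 0

    async def download():
        nonlocal calls
        calls += 1  # counts how many times the "GET" really runs
        await asyncio.sleep(0.01)
        return b"extent-bytes"

    results = await asyncio.gather(
        *(sf.run("bucket/extent-0", download) for _ in range(5))
    )
    return calls, sf.merges, results


calls, merges, results = asyncio.run(_demo())
print(calls, merges)  # one download, four merged callers
```

Five concurrent callers on the same key trigger one download; the other four wait on the leader's future and receive the same bytes.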
Concurrent reads missing the cloud cache on the same extent each
issue their own S3 GET. Route read_object's cold-miss download
through single_flight so one GET serves all waiters on a shard.

A gate drains in-flight reads before file_io destruction.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Per-shard counters to gauge dedup: reads, cache_misses, and
concurrent_read_merges. Plumbed through app and read-replica
refreshers.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
@oleiman

oleiman commented Jun 12, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

Add a per-lane cloud_io scheduler sampler to the many_partitions
cloud-topics consume: every 20s it logs in_flight/waiters for
consumer_fetch, producer_upload, and default_group from the public
metrics endpoint (internal metrics are disabled at this scale), so
read-lane starvation behind the cloud_topics consumer hang is visible
in the artifacts. Runs in a daemon thread joined in a finally;
self-disables when the scheduler metrics are absent.

Also disable cloud-topic leveling and compaction for the cloud-topics
run. Leveling runs in the cloud_io default_group lane and does heavy
extent-merge I/O during the consume, contending with consumer_fetch
for the shared client pool; compaction is near-idle on these
delete-policy topics. Disabling both isolates whether freeing
default_group I/O lets the consumer drain; reconciliation stays on as
the L0->L1 data path.
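
A minimal sketch of the sampler lifecycle described above (daemon thread, joined in a finally, self-disabling when the metrics are missing). The `run_with_sampler` helper, the `sample` callable, and the use of `KeyError` to signal absent metrics are assumptions for illustration; the real test scrapes the public metrics endpoint.

```python
import threading


def run_with_sampler(sample, body, interval_s=20.0, log=print):
    """Run body() while a background thread logs sample() every interval_s."""
    stop = threading.Event()

    def loop():
        while True:
            try:
                values = sample()
            except KeyError:
                # Assumed signal that the scheduler metrics are not exported.
                log("scheduler metrics absent; sampler disabled")
                return
            log(f"cloud_io lanes: {values}")
            # wait() returns True once stop is set, which ends the loop
            if stop.wait(interval_s):
                return

    sampler = threading.Thread(target=loop, daemon=True)
    sampler.start()
    try:
        return body()
    finally:
        stop.set()
        sampler.join()
```

Using `Event.wait` as the sleep means the join in the finally returns promptly instead of waiting out a full 20s interval.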
@oleiman oleiman force-pushed the core-15812/ct-scheduler-test branch from 13306b5 to 1e027a9 Compare June 13, 2026 05:39
@oleiman

oleiman commented Jun 13, 2026

Member Author

/cdt
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

When a client connection's input is shut down, the request's abort
source is aborted with std::errc::connection_aborted, which surfaces in
the per-partition read as a std::system_error. fetch_ntps's
handle_exception mapped every read failure -- including this one -- to
not_leader_for_partition to force a client retry. For a client that is
going away that is wrong: it spams the log (one WARN per partition) and,
on a half-closed connection whose response is still delivered, tells the
client the partition moved, so it refreshes metadata and reconnects. At
high cloud-topic partition counts (ManyPartitionsTest cloud_topics) this
compounds into a connection-churn / not_leader storm that adds load and
keeps the consumer from making progress.

Re-raise client-disconnect errors instead, using the same
net::is_reconnect_error predicate the fetch worker's handle_exceptions()
already uses to recognize a disconnect; it drops the error quietly
rather than producing per-partition not_leader results for a response
nobody will act on.
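
The change in error mapping can be sketched roughly as follows. The errno set below is an assumption standing in for `net::is_reconnect_error`, whose actual membership is not shown here, and `handle_read_failure` is a hypothetical name for the per-partition handler.

```python
import errno

# Assumed errno set; the real net::is_reconnect_error may cover more or less.
_RECONNECT_ERRNOS = {errno.ECONNABORTED, errno.ECONNRESET, errno.EPIPE}


def is_reconnect_error(exc):
    """True when the failure looks like the client going away."""
    return isinstance(exc, OSError) and exc.errno in _RECONNECT_ERRNOS


def handle_read_failure(exc):
    """Map a per-partition read failure to a fetch result."""
    if is_reconnect_error(exc):
        # Client disconnect: re-raise so the caller drops it quietly.
        raise exc
    # Any other failure still forces a client retry.
    return "not_leader_for_partition"
```

Only failures that are not disconnects are turned into a per-partition `not_leader_for_partition` result.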
@oleiman

oleiman commented Jun 13, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

1 similar comment
@oleiman

oleiman commented Jun 14, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

oleiman added 2 commits June 14, 2026 14:37
Register a public metrics group (cloud_topics_l1_reader_cache) for the
per-shard L1 reader cache: cached_readers, in_use_readers, hits, misses,
readers_evicted. Make them public so they are scrapeable when internal
metrics are disabled (as at high partition counts).

A near-zero hit rate with steady evictions is the signature of the cache
thrashing -- the active reader set exceeding
cloud_topics_l1_reader_cache_max_size -- which forces every fetch to
rebuild the reader and re-query the L1 metastore.
At ManyPartitions cloud_topics scale (~17.8k partitions) a consumer is
behind on ~500 partitions per shard, but the L1 reader cache defaults to
128/shard, so the LRU evicts the reader the consumer is about to
re-fetch. Every fetch rebuilds the reader and re-RPCs the metastore,
saturating the kafka scheduling group and stalling the consumer while
the object-store pool sits idle.

Bump cloud_topics_l1_reader_cache_max_size to 1024 and the eviction
timeout to 600s so the cache holds the active set across the consume,
and sample the new l1_reader_cache hit/miss/eviction counters alongside
the cloud_io lane occupancy so the run shows the hit rate directly.

Leveling and compaction stay disabled (ruled out as the cause earlier;
kept off to isolate the reader cache).
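
The thrash is easy to reproduce with a toy LRU. This sketch assumes a strictly round-robin access pattern over the active partitions, which is a simplification of the real consumer; `simulate` is a hypothetical helper, not part of the test.

```python
from collections import OrderedDict


def simulate(active, capacity, rounds):
    """Cycle over `active` keys through an LRU of size `capacity`."""
    cache = OrderedDict()
    hits = misses = evictions = 0
    for _ in range(rounds):
        for partition in range(active):
            if partition in cache:
                hits += 1
                cache.move_to_end(partition)  # mark most recently used
            else:
                misses += 1
                cache[partition] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
                    evictions += 1
    return hits, misses, evictions


print(simulate(500, 128, 10))   # (0, 5000, 4872)
print(simulate(500, 1024, 10))  # (4500, 500, 0)
```

With 500 active partitions and 128 slots every access misses, because each reader is evicted before its partition comes around again. Once the capacity exceeds the active set, only the first round misses.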
@oleiman oleiman force-pushed the core-15812/ct-scheduler-test branch from 9dde7c3 to 0392fd2 Compare June 14, 2026 21:37
@oleiman

oleiman commented Jun 14, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

Add a one-shot CPU-profile capture to the many_partitions cloud_topics
consume: 60s in (once the kafka scheduling group is saturated), sample
one broker for 30s via the admin cpu_profile endpoint and log the top
stacks by occurrence. The wait_ms override enables the profiler for just
that window, so it stays off the rest of the run. Targeted to one broker
and one window to keep the cost down.

The backtraces are raw addresses; the logged top stacks are symbolized
offline against the build binary to find what saturates the kafka group.
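
Ranking stacks by occurrence could look like the sketch below. The sample shape (a list of dicts with `addrs` and `occurrences`) is an assumption for illustration and is not taken from the admin endpoint's actual response format.

```python
from collections import Counter


def top_stacks(samples, n=3):
    """Return the n most frequent backtraces as (addresses, count) pairs."""
    counts = Counter()
    for sample in samples:
        # Assumed fields: raw return addresses plus an occurrence count.
        counts[tuple(sample["addrs"])] += sample.get("occurrences", 1)
    return counts.most_common(n)


samples = [
    {"addrs": ["0x1a2b", "0x3c4d"], "occurrences": 7},
    {"addrs": ["0x5e6f"], "occurrences": 2},
    {"addrs": ["0x1a2b", "0x3c4d"], "occurrences": 5},
]
for addrs, count in top_stacks(samples, n=2):
    print(count, " ".join(addrs))
```

The addresses stay raw here; symbolizing them against the build binary is a separate offline step, as noted above.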
@oleiman

oleiman commented Jun 15, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

1 similar comment
@oleiman

oleiman commented Jun 15, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

a0bbe35 made the L1 reader return end-of-stream + no data when
strict_max_bytes && max_bytes==0. In a fetch over many partitions the
plan hands the budget-exhausted tail partitions max_bytes==0 with
strict_max_bytes set (everything after the obligatory first read), so
those partitions now return nothing and never advance -- the consumer
stalls (ManyPartitionsTest cloud_topics times out, a regression since
late March). Before the check they got a trickle and drained.

Removing the check leaves strict_max_bytes unused in the L1 reader;
update the max_bytes_zero_behavior test accordingly (strict max_bytes=0
now returns one batch, like the non-strict case).

Reverting to verify the regression via CDT; a proper fix that keeps the
no-wasteful-read intent without starving the tail follows.
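
A toy model of one fetch round shows the effect on the tail. The budget-splitting rule in `plan_max_bytes` and the reader loop in `read_partition` are assumptions for illustration; only the `strict_max_bytes && max_bytes==0` check mirrors the commit message.

```python
def plan_max_bytes(n_partitions, fetch_budget, per_partition_max):
    """Hand out the fetch budget in order; later partitions may get 0."""
    plan = []
    remaining = fetch_budget
    for i in range(n_partitions):
        grant = min(per_partition_max, remaining)
        remaining -= grant
        # Everything after the obligatory first read is strict.
        plan.append({"max_bytes": grant, "strict": i > 0})
    return plan


def read_partition(cfg, batch_bytes, zero_check):
    """Return bytes read for one partition that has data available."""
    if zero_check and cfg["strict"] and cfg["max_bytes"] == 0:
        return 0  # the check under test: EOS with no data
    read = batch_bytes  # otherwise at least one batch is returned
    while read + batch_bytes <= cfg["max_bytes"]:
        read += batch_bytes
    return read


plan = plan_max_bytes(4, fetch_budget=2048, per_partition_max=1024)
with_check = [read_partition(c, 512, zero_check=True) for c in plan]
without_check = [read_partition(c, 512, zero_check=False) for c in plan]
print(with_check)     # [1024, 1024, 0, 0]
print(without_check)  # [1024, 1024, 512, 512]
```

With the check, the budget-exhausted partitions return nothing in this round; without it they each return one batch, which is the trickle the revert restores.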
@oleiman oleiman force-pushed the core-15812/ct-scheduler-test branch from c548631 to a05044a Compare June 15, 2026 07:22
@oleiman

oleiman commented Jun 15, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

The L1 reader passed the fetch's client connection abort source into its
object-store and metastore reads. A transient consumer disconnect then
aborted the in-flight read, dropping the reader's open stream and making
it non-reusable, so the l1_reader_cache evicted it. The reconnect
re-fetched the same offset, missed the cache, and paid a fresh metastore
RPC + object GET -- a thrash that never lets the cache warm under a
disconnect-heavy workload.

Bind object-store and metastore IO to a reader-owned abort source so a
client disconnect no longer tears down the warm reader; the reconnect
reuses it. Each IO stays bounded by its own retry/timeout policy.

Add a cache test: a reader survives a mid-read client disconnect.
@oleiman

oleiman commented Jun 15, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

Split why the l1_reader_cache fails to reuse readers, to tell over-read
(a reader is cached for the partition but at an offset the next fetch
doesn't request) from frontier-EOS (the reader is returned having read
to the end of available data and is disposed). Surface the breakdown via
the cloud_topics_l1_reader_cache metric group and the ManyPartitions
sampler.
@oleiman

oleiman commented Jun 15, 2026

Copy link
Copy Markdown
Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

The miss-reason counter was named with a "misses" prefix, which the
public-metrics substring matcher confused with the existing misses
counter, so the ManyPartitions sampler threw "more than one metric
matched" every iteration and logged no breakdown. Rename it to
offset_mismatch so each query pattern matches exactly one metric.
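
The collision can be reproduced with a small substring matcher. `find_metric` is a hypothetical stand-in for the test's matcher, and the old metric name used below (`misses_offset_mismatch`) is invented for the example, since the original name is not quoted here.

```python
def find_metric(pattern, names):
    """Return the single metric whose name contains pattern, else None."""
    matches = [name for name in names if pattern in name]
    if len(matches) > 1:
        raise ValueError(f"more than one metric matched {pattern!r}: {matches}")
    return matches[0] if matches else None


PREFIX = "cloud_topics_l1_reader_cache_"
# Hypothetical old name carrying the "misses" prefix.
before = [PREFIX + "misses", PREFIX + "misses_offset_mismatch"]
after = [PREFIX + "misses", PREFIX + "offset_mismatch"]

try:
    find_metric("misses", before)
except ValueError as err:
    print(err)

print(find_metric("misses", after))
print(find_metric("offset_mismatch", after))
```

After the rename, "misses" and "offset_mismatch" each match exactly one metric name.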
@oleiman

oleiman commented Jun 16, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

@oleiman

oleiman commented Jun 16, 2026

Member Author

/cdt
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

The ManyPartitions stall (CORE-15812) shows the consumer stuck while the
fetch-path cloud_io pool sits idle -- data isn't reaching readable
objects. The reconciler and batch-cache probes that would show why are
internal-only, and internal metrics are disabled at scale. Expose
reconciliation progress (objects/batches/bytes_reconciled) and batch-cache
hit/miss as public metrics and sample them in the ManyPartitions test, so
a CDT run shows whether reconciliation is advancing and whether the batch
cache is going cold.
@oleiman

oleiman commented Jun 17, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

oleiman added 2 commits June 17, 2026 08:24
has_pending_data() returned false when last_stable_offset() was
unavailable, dropping the source before reconciliation. But make_reader
syncs the stm (via sync_effective_start), so skipping kept the LSO
unavailable and the partition never reconciled -- a self-perpetuating
skip that stalls reconciliation and the consumers reading those
partitions. Treat an unavailable LSO as "maybe pending" and keep the
source so the next pass syncs the stm and re-evaluates.
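
The self-perpetuating skip can be modelled in a few lines. The `Source` class and `reconcile_pass` are hypothetical; the only behaviour taken from the commit message is that the LSO becomes available once `make_reader` has synced the stm.

```python
class Source:
    """Toy partition source whose LSO is unknown until the stm is synced."""

    def __init__(self, lso):
        self._lso = lso
        self._synced = False
        self.reconciled_to = 0

    def last_stable_offset(self):
        return self._lso if self._synced else None

    def make_reader(self):
        self._synced = True  # stands in for sync_effective_start
        return self._lso


def has_pending_data(src, treat_unknown_as_pending):
    lso = src.last_stable_offset()
    if lso is None:
        # Old behaviour: False. Fixed behaviour: "maybe pending".
        return treat_unknown_as_pending
    return lso > src.reconciled_to


def reconcile_pass(src, treat_unknown_as_pending):
    """Return True when the pass reconciled the source."""
    if not has_pending_data(src, treat_unknown_as_pending):
        return False  # source dropped before make_reader can sync
    src.reconciled_to = src.make_reader()
    return True


old = Source(lso=100)
print([reconcile_pass(old, False) for _ in range(3)])  # [False, False, False]
new = Source(lso=100)
print(reconcile_pass(new, True), new.reconciled_to)    # True 100
```

Under the old rule the source is skipped on every pass because nothing ever syncs it; treating an unknown LSO as pending lets the first pass sync and reconcile.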
Diagnostic for CORE-15812. pending_offset_lag reads 0 both when caught up
and when the LSO is unavailable, so it can't distinguish a reconciliation
stall from genuine idleness. Add per-shard public gauges: lso_hwm_gap
(sum of HWM-LSO across sources -- persistently positive means the LSO is
frozen below the HWM) and lso_unavailable_sources, fed by a per-iteration
sweep in the reconciliation loop, and sample them in ManyPartitions.
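
A sketch of the per-iteration sweep that could feed the two gauges. Representing each source as an `(hwm, lso)` pair and leaving unavailable sources out of the gap sum are assumptions for illustration.

```python
def sweep(sources):
    """Compute the two gauges from (hwm, lso) pairs; lso is None if unavailable."""
    gap = 0
    unavailable = 0
    for hwm, lso in sources:
        if lso is None:
            unavailable += 1  # assumed: not counted toward the gap
        else:
            gap += hwm - lso
    return {"lso_hwm_gap": gap, "lso_unavailable_sources": unavailable}


print(sweep([(100, 100), (250, 200), (80, None)]))
# {'lso_hwm_gap': 50, 'lso_unavailable_sources': 1}
```

A caught-up shard reads zero on both gauges, while a stalled one shows a positive gap, a nonzero unavailable count, or both.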
@oleiman

oleiman commented Jun 17, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

Localize the cloud-topics consumer stall by splitting the read- and
reconciler-side signals of an lro/metastore divergence:

- level_one_reader flags EOS caused by the metastore returning no
  object for the next offset (a reconciled offset with no registered
  extent), distinct from a normal end-of-data EOS.
- l1_reader_cache exposes that subset as public "no_object_eos".
- reconciler_probe exposes offset_corrections (metastore dropped
  misaligned extents) and partitions_reconciled.
- many_partitions_test samples the new metrics.
@oleiman

oleiman commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

Pin where the lro/metastore divergence (CORE-15812) originates:

- level_one_reader/l1_reader_cache: split out gap_eos, the subset of
  no-object EOS where the reader delivered 0 bytes -- a reconciled
  offset (start_offset <= lro) with no registered extent, the direct
  gap signal. Sample it in many_partitions_test and log a rate-limited
  warn naming the partition/offset.
- lsm/stm: a skipped apply still advances the apply offset, so write
  reports success while the extent was dropped; log it at warn instead
  of debug so any silent drop is visible.
- lsm/replicated_db: on open, log the recovery boundaries
  (max_persisted_seqno, volatile_buffer seqno range, replayed vs
  skipped) to catch a manifest seqno advanced past durable data.
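
The relationship between the EOS counters can be sketched as below. `record_eos` and its arguments are hypothetical; the sketch only encodes that gap_eos is the zero-byte subset of no_object_eos, and that a normal end-of-data EOS bumps neither.

```python
from collections import Counter


def record_eos(counters, metastore_returned_object, bytes_delivered):
    """Bump the EOS counters for one reader that hit end-of-stream."""
    if metastore_returned_object:
        return  # normal end-of-data EOS
    counters["no_object_eos"] += 1
    if bytes_delivered == 0:
        # Nothing was delivered before the lookup came back empty.
        counters["gap_eos"] += 1


counters = Counter()
record_eos(counters, metastore_returned_object=True, bytes_delivered=4096)
record_eos(counters, metastore_returned_object=False, bytes_delivered=4096)
record_eos(counters, metastore_returned_object=False, bytes_delivered=0)
print(dict(counters))  # {'no_object_eos': 2, 'gap_eos': 1}
```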
@oleiman

oleiman commented Jun 18, 2026

Member Author

/cdt
rp_version=build
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

Each cloud-topic partition read issues a metastore extent-lookup RPC;
at the default concurrency of 1 a fetch serializes one ~100ms round-trip
per partition while the broker stays resource-idle. Raise it to 16 for
the cloud_topics MPT variant to test whether that per-partition
serialization is the throughput cap (CORE-15812).
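
A back-of-the-envelope model of the serialization cost, assuming a flat 100ms round-trip per lookup and no other overhead. The 500-partition figure is borrowed from the earlier per-shard estimate in this PR and is used here only as an example input.

```python
import math


def lookup_time_ms(partitions, rtt_ms=100, concurrency=1):
    """Wall time to issue one metastore lookup per partition."""
    waves = math.ceil(partitions / concurrency)  # lookups run in waves
    return waves * rtt_ms


print(lookup_time_ms(500, concurrency=1))   # 50000
print(lookup_time_ms(500, concurrency=16))  # 3200
```

Under these assumptions a fetch touching 500 partitions spends about 50s on lookups at concurrency 1 and about 3.2s at 16, which is the gap the CDT run is meant to confirm or rule out.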
@oleiman

oleiman commented Jun 18, 2026

Member Author

/cdt
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics
