Core 15812/ct scheduler test #30801
Dedup concurrent work keyed by a path. run(key, as, work) runs the work once on the first caller (the leader) and merges later callers onto its outcome. Bounded per shard; at capacity callers run uncoordinated.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
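A minimal sketch of the leader/follower merge described above, written in Python with asyncio for illustration only. The class name `SingleFlight`, its `capacity` parameter, and the demo names are hypothetical; the abort-source argument (`as`) of the real `run(key, as, work)` is omitted here.

```python
import asyncio


class SingleFlight:
    """Hypothetical helper: one leader per key, bounded number of keys."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._inflight = {}  # key -> Future carrying the leader's outcome

    async def run(self, key, work):
        fut = self._inflight.get(key)
        if fut is not None:
            # Follower: wait for the leader's outcome instead of redoing work.
            return await asyncio.shield(fut)
        if len(self._inflight) >= self._capacity:
            # At capacity: run uncoordinated rather than queue behind others.
            return await work()
        fut = asyncio.get_running_loop().create_future()
        self._inflight[key] = fut
        try:
            result = await work()
        except Exception as exc:
            fut.set_exception(exc)
            fut.exception()  # mark as retrieved in case nobody is waiting
            raise
        else:
            fut.set_result(result)
            return result
        finally:
            if not fut.done():
                fut.cancel()  # leader was cancelled: release followers
            del self._inflight[key]


async def demo():
    calls = 0

    async def fetch():
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # stands in for the remote read
        return "extent-bytes"

    sf = SingleFlight(capacity=8)
    results = await asyncio.gather(*(sf.run("obj/0", fetch) for _ in range(5)))
    return calls, results
```

In this sketch five concurrent callers on the same key trigger a single `fetch`, and all five receive the leader's result.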
Concurrent reads missing the cloud cache on the same extent each issue their own S3 GET. Route read_object's cold-miss download through single_flight so one GET serves all waiters on a shard. A gate drains in-flight reads before file_io destruction.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Per-shard counters to gauge dedup: reads, cache_misses, and concurrent_read_merges. Plumbed through app and read-replica refreshers.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
/cdt
Add a per-lane cloud_io scheduler sampler to the many_partitions cloud-topics consume: every 20s it logs in_flight/waiters for consumer_fetch, producer_upload, and default_group from the public metrics endpoint (internal metrics are disabled at this scale), so read-lane starvation behind the cloud_topics consumer hang is visible in the artifacts. Runs in a daemon thread joined in a finally; self-disables when the scheduler metrics are absent.

Also disable cloud-topic leveling and compaction for the cloud-topics run. Leveling runs in the cloud_io default_group lane and does heavy extent-merge I/O during the consume, contending with consumer_fetch for the shared client pool; compaction is near-idle on these delete-policy topics. Disabling both isolates whether freeing default_group I/O lets the consumer drain; reconciliation stays on as the L0->L1 data path.
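The sampler lifecycle (daemon thread, joined in a finally, self-disabling when metrics are absent) could look roughly like the sketch below. `run_with_sampler`, `sample_fn`, and `body` are hypothetical names, and the convention that `sample_fn` returns `None` when the scheduler metrics are missing is an assumption of this sketch, not the test's actual code.

```python
import threading


def run_with_sampler(sample_fn, body, interval_s=20.0):
    """Run body() while a background thread samples metrics periodically."""
    stop = threading.Event()
    samples = []

    def loop():
        while not stop.is_set():
            snapshot = sample_fn()
            if snapshot is None:
                return  # metrics absent: the sampler disables itself
            samples.append(snapshot)
            stop.wait(interval_s)  # sleeps, but wakes early on stop

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    try:
        result = body()
    finally:
        # Always stop and join, even if body() raised.
        stop.set()
        thread.join(timeout=interval_s + 5)
    return result, samples
```

Waiting on the event instead of calling `time.sleep` lets the join return promptly once the consume finishes.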
13306b5 to 1e027a9
When a client connection's input is shut down, the request's abort source is aborted with std::errc::connection_aborted, which surfaces in the per-partition read as a std::system_error. fetch_ntps's handle_exception mapped every read failure -- including this one -- to not_leader_for_partition to force a client retry. For a client that is going away that is wrong: it spams the log (one WARN per partition) and, on a half-closed connection whose response is still delivered, tells the client the partition moved, so it refreshes metadata and reconnects. At high cloud-topic partition counts (ManyPartitionsTest cloud_topics) this compounds into a connection-churn / not_leader storm that adds load and keeps the consumer from making progress. Re-raise client-disconnect errors instead, using the same net::is_reconnect_error predicate the fetch worker's handle_exceptions() already uses to recognize a disconnect; it drops the error quietly rather than producing per-partition not_leader results for a response nobody will act on.
Register a public metrics group (cloud_topics_l1_reader_cache) for the per-shard L1 reader cache: cached_readers, in_use_readers, hits, misses, readers_evicted. Make them public so they are scrapeable when internal metrics are disabled (as at high partition counts). A near-zero hit rate with steady evictions is the signature of the cache thrashing -- the active reader set exceeding cloud_topics_l1_reader_cache_max_size -- which forces every fetch to rebuild the reader and re-query the L1 metastore.
At ManyPartitions cloud_topics scale (~17.8k partitions) a consumer is behind on ~500 partitions per shard, but the L1 reader cache defaults to 128/shard, so the LRU evicts the reader the consumer is about to re-fetch. Every fetch rebuilds the reader and re-RPCs the metastore, saturating the kafka scheduling group and stalling the consumer while the object-store pool sits idle. Bump cloud_topics_l1_reader_cache_max_size to 1024 and the eviction timeout to 600s so the cache holds the active set across the consume, and sample the new l1_reader_cache hit/miss/eviction counters alongside the cloud_io lane occupancy so the run shows the hit rate directly. Leveling and compaction stay disabled (ruled out as the cause earlier; kept off to isolate the reader cache).
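To see why an active set of ~500 partitions defeats a 128-entry LRU, here is a small simulation. It assumes the consumer revisits partitions in a fixed round-robin order, which is the worst case for LRU; the real fetch order is not specified above, so treat the exact numbers as illustrative.

```python
from collections import OrderedDict


def lru_hit_rate(capacity, active_partitions, rounds):
    """Hit rate of an LRU cache under repeated round-robin access."""
    cache = OrderedDict()
    hits = misses = 0
    for _ in range(rounds):
        for partition in range(active_partitions):
            if partition in cache:
                hits += 1
                cache.move_to_end(partition)  # mark most recently used
            else:
                misses += 1
                cache[partition] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return hits / (hits + misses)


print(lru_hit_rate(128, 500, 10))   # 0.0: every reader is evicted before reuse
print(lru_hit_rate(1024, 500, 10))  # 0.9: only the first round misses
```

Under this access pattern any capacity below the active-set size yields no hits at all, which matches the thrashing signature of near-zero hit rate with steady evictions.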
9dde7c3 to 0392fd2
Add a one-shot CPU-profile capture to the many_partitions cloud_topics consume: 60s in (once the kafka scheduling group is saturated), sample one broker for 30s via the admin cpu_profile endpoint and log the top stacks by occurrence. The wait_ms override enables the profiler for just that window, so it stays off the rest of the run. Targeted to one broker and one window to keep the cost down. The backtraces are raw addresses; the logged top stacks are symbolized offline against the build binary to find what saturates the kafka group.
a0bbe35 made the L1 reader return end-of-stream + no data when strict_max_bytes && max_bytes==0. In a fetch over many partitions the plan hands the budget-exhausted tail partitions max_bytes==0 with strict_max_bytes set (everything after the obligatory first read), so those partitions now return nothing and never advance -- the consumer stalls (ManyPartitionsTest cloud_topics times out, a regression since late March). Before the check they got a trickle and drained. Removing the check leaves strict_max_bytes unused in the L1 reader; update the max_bytes_zero_behavior test accordingly (strict max_bytes=0 now returns one batch, like the non-strict case). Reverting to verify the regression via CDT; a proper fix that keeps the no-wasteful-read intent without starving the tail follows.
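A toy model of the behavior change, assuming the reader otherwise always returns at least one batch regardless of budget (the "trickle" mentioned above). The function `read` and its `zero_budget_check` flag are hypothetical stand-ins for the L1 reader with and without the a0bbe35 check.

```python
def read(batches, max_bytes, strict_max_bytes, zero_budget_check):
    """Return the batches a single read would deliver under a byte budget."""
    if zero_budget_check and strict_max_bytes and max_bytes == 0:
        return []  # the check: end-of-stream with no data
    out, used = [], 0
    for batch in batches:
        # The first batch is always delivered; later ones must fit the budget.
        if out and used + len(batch) > max_bytes:
            break
        out.append(batch)
        used += len(batch)
    return out


tail = [b"a" * 10, b"b" * 10]
with_check = read(tail, 0, True, zero_budget_check=True)      # nothing returned
without_check = read(tail, 0, True, zero_budget_check=False)  # one batch
```

With the check in place a tail partition that is handed `max_bytes == 0` on every fetch never advances; without it the partition drains one batch per fetch.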
c548631 to a05044a
The L1 reader passed the fetch's client connection abort source into its object-store and metastore reads. A transient consumer disconnect then aborted the in-flight read, dropping the reader's open stream and making it non-reusable, so the l1_reader_cache evicted it. The reconnect re-fetched the same offset, missed the cache, and paid a fresh metastore RPC + object GET -- a thrash that never lets the cache warm under a disconnect-heavy workload. Bind object-store and metastore IO to a reader-owned abort source so a client disconnect no longer tears down the warm reader; the reconnect reuses it. Each IO stays bounded by its own retry/timeout policy. Add a cache test: a reader survives a mid-read client disconnect.
Break down why the l1_reader_cache fails to reuse readers, distinguishing over-read (a reader is cached for the partition, but at an offset the next fetch doesn't request) from frontier-EOS (the reader is returned having read to the end of available data and is disposed). Surface the breakdown via the cloud_topics_l1_reader_cache metric group and the ManyPartitions sampler.
The miss-reason counter was named with a "misses" prefix, which the public-metrics substring matcher confused with the existing misses counter, so the ManyPartitions sampler threw "more than one metric matched" every iteration and logged no breakdown. Rename it to offset_mismatch so each query pattern matches exactly one metric.
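The collision can be reproduced with a small substring matcher. `find_one` is a hypothetical stand-in for the test's metric lookup, and the full metric names below (including the old `misses_offset_mismatch` spelling) are assumed for illustration; only the "misses" prefix and the new `offset_mismatch` name come from the description above.

```python
def find_one(metric_names, pattern):
    """Return the single metric whose name contains pattern, else raise."""
    matches = [name for name in metric_names if pattern in name]
    if len(matches) != 1:
        raise ValueError(f"expected one metric for {pattern!r}, got {matches}")
    return matches[0]


# Assumed names: the old counter shares the "misses" substring.
before = [
    "cloud_topics_l1_reader_cache_misses",
    "cloud_topics_l1_reader_cache_misses_offset_mismatch",
]
# After the rename each pattern matches exactly one metric.
after = [
    "cloud_topics_l1_reader_cache_misses",
    "cloud_topics_l1_reader_cache_offset_mismatch",
]

try:
    find_one(before, "l1_reader_cache_misses")
    ambiguous = False
except ValueError:
    ambiguous = True

resolved = find_one(after, "l1_reader_cache_misses")
```

Any name that contains another metric's name as a substring will hit the same problem, so new counters in a sampled group are safest with names that no existing query pattern can match.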
The ManyPartitions stall (CORE-15812) shows the consumer stuck while the fetch-path cloud_io pool sits idle -- data isn't reaching readable objects. The reconciler and batch-cache probes that would show why are internal-only, and internal metrics are disabled at scale. Expose reconciliation progress (objects/batches/bytes_reconciled) and batch-cache hit/miss as public metrics and sample them in the ManyPartitions test, so a CDT run shows whether reconciliation is advancing and whether the batch cache is going cold.
has_pending_data() returned false when last_stable_offset() was unavailable, dropping the source before reconciliation. But make_reader syncs the stm (via sync_effective_start), so skipping kept the LSO unavailable and the partition never reconciled -- a self-perpetuating skip that stalls reconciliation and the consumers reading those partitions. Treat an unavailable LSO as "maybe pending" and keep the source so the next pass syncs the stm and re-evaluates.
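The fix amounts to treating "unknown" as "maybe". A sketch under stated assumptions: the signature and the comparison against a `last_reconciled` offset are hypothetical, since the real predicate's inputs are not shown here; only the handling of an unavailable LSO reflects the change described.

```python
from typing import Optional


def has_pending_data(lso: Optional[int], last_reconciled: int) -> bool:
    """Decide whether a source should stay in the reconciliation pass."""
    if lso is None:
        # LSO unavailable: keep the source so the next pass builds a reader,
        # which syncs the stm and lets the LSO become available.
        return True
    return lso > last_reconciled
```

Returning `False` on an unavailable LSO is what made the skip self-perpetuating: the skipped source never reached the step that would have made its LSO available.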
Diagnostic for CORE-15812. pending_offset_lag reads 0 both when caught up and when the LSO is unavailable, so it can't distinguish a reconciliation stall from genuine idleness. Add per-shard public gauges: lso_hwm_gap (sum of HWM-LSO across sources -- persistently positive means the LSO is frozen below the HWM) and lso_unavailable_sources, fed by a per-iteration sweep in the reconciliation loop, and sample them in ManyPartitions.
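One way the per-iteration sweep could compute the two gauges, as a sketch. The `(hwm, lso)` tuple representation is hypothetical, and excluding LSO-unavailable sources from the gap sum is an assumption of this sketch rather than confirmed behavior.

```python
def sweep(sources):
    """Compute per-shard gauges from (hwm, lso) pairs; lso may be None."""
    gap = 0
    unavailable = 0
    for hwm, lso in sources:
        if lso is None:
            unavailable += 1  # cannot compute a gap without an LSO
        else:
            gap += max(0, hwm - lso)
    return {"lso_hwm_gap": gap, "lso_unavailable_sources": unavailable}


gauges = sweep([(100, 100), (120, 90), (50, None)])
# gauges == {"lso_hwm_gap": 30, "lso_unavailable_sources": 1}
```

Reporting the unavailable count separately is what removes the ambiguity of a single lag value that reads 0 in both the caught-up and the LSO-unavailable case.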
Localize the cloud-topics consumer stall by splitting the read- and reconciler-side signals of an lro/metastore divergence:
- level_one_reader flags EOS caused by the metastore returning no object for the next offset (a reconciled offset with no registered extent), distinct from a normal end-of-data EOS.
- l1_reader_cache exposes that subset as public "no_object_eos".
- reconciler_probe exposes offset_corrections (metastore dropped misaligned extents) and partitions_reconciled.
- many_partitions_test samples the new metrics.
Pin where the lro/metastore divergence (CORE-15812) originates:
- level_one_reader/l1_reader_cache: split out gap_eos, the subset of no-object EOS where the reader delivered 0 bytes -- a reconciled offset (start_offset <= lro) with no registered extent, the direct gap signal. Sample it in many_partitions_test and log a rate-limited warn naming the partition/offset.
- lsm/stm: a skipped apply still advances the apply offset, so write reports success while the extent was dropped; log it at warn instead of debug so any silent drop is visible.
- lsm/replicated_db: on open, log the recovery boundaries (max_persisted_seqno, volatile_buffer seqno range, replayed vs skipped) to catch a manifest seqno advanced past durable data.
Each cloud-topic partition read issues a metastore extent-lookup RPC; at the default concurrency of 1 a fetch serializes one ~100ms round-trip per partition while the broker stays resource-idle. Raise it to 16 for the cloud_topics MPT variant to test whether that per-partition serialization is the throughput cap (CORE-15812).
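A back-of-the-envelope model of the serialization cost, assuming every lookup takes the same ~100ms and lookups run in full waves of `concurrency`. The partition count per fetch (160 below) is an illustrative number, not one taken from the test.

```python
import math


def lookup_latency_ms(partitions, concurrency, rtt_ms=100):
    """Total metastore lookup time for one fetch, in waves of `concurrency`."""
    return math.ceil(partitions / concurrency) * rtt_ms


serial = lookup_latency_ms(160, 1)     # 16000 ms: one round-trip per partition
parallel = lookup_latency_ms(160, 16)  # 1000 ms: ten waves of sixteen
```

If the per-partition serialization is the throughput cap, the run at concurrency 16 should show fetch latency dropping by roughly this factor; if it does not, the bottleneck lies elsewhere.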
Backports Required
Release Notes