
[v25.3.x] iceberg: Push Parquet column stats to Iceberg manifests #30776

Open
vbotbuildovich wants to merge 2 commits into redpanda-data:v25.3.x from vbotbuildovich:ai-backport-pr-30704-v25.3.x-1781219157

@vbotbuildovich

Collaborator

Backport of PR #30704

  • Command: git cherry-pick -x 7f83353 b4e8232
  • Commits backported: 2
  • Conflicts resolved: 1
  • Commits skipped (already on target): 0
  • Backport branch: ai-backport-pr-30704-v25.3.x-1781219157

Conflict details

  • 7f83353 (src/v/datalake/base_types.h): include block conflicted only on context; accepted the incoming includes (base/format_to.h, bytes/bytes.h, container/chunked_vector.h, serde/envelope.h, serde/rw/bytes.h) needed by the new per_column_stats struct.
  • 7f83353 (src/v/datalake/BUILD): accepted the incoming base_types library deps (//src/v/base, //src/v/bytes, //src/v/container:chunked_vector, //src/v/serde, //src/v/serde:bytes).
  • 7f83353 (src/v/datalake/coordinator/BUILD): kept the target's //src/v/base dep and added the incoming //src/v/bytes dep for iceberg_file_committer.
  • 7f83353 (src/v/serde/parquet/writer.cc): kept the two new file-level accumulators (file_value_count, file_column_size_bytes); the surrounding bloom-filter block in the incoming context does not exist on v25.3.x and was omitted (bloom filters are not present on the target branch).
  • 7f83353 (src/v/serde/parquet/column_writer.cc): the target branch lacks the stats-truncation feature, so the commit's build_statistics helper was adapted to the target's non-truncated form (encode_for_stats(...), is_exact=true). Replaced per-page record_value() calls with the commit's _flushed_stats.merge(_current_page_stats), added the new _file_stats collector and file_column_stats() accumulation, and omitted the _bloom_filter member (bloom filters absent on target).

The second commit (b4e8232, ducktape test) applied cleanly.

Thread per-column stats (min/max bounds, null counts, value counts,
column sizes) through to the Iceberg data_file manifest entry, where
query engines use them for column-level predicate pushdown.
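
As a rough illustration of why the bounds matter, the sketch below shows how a reader could decide to skip a data file from its per-column lower/upper bounds alone. The `column_bounds` struct and both helper functions are hypothetical names invented for this example; they are not Redpanda or Iceberg APIs, and the sketch only covers 64-bit integer columns.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical per-file bounds for one column, similar in spirit to what a
// manifest entry carries. Names are illustrative only.
struct column_bounds {
    std::optional<int64_t> lower;
    std::optional<int64_t> upper;
};

// True when no row in the file can satisfy `col > threshold`.
// A missing upper bound means the file cannot be ruled out.
inline bool can_skip_for_greater_than(const column_bounds& b, int64_t threshold) {
    return b.upper.has_value() && *b.upper <= threshold;
}

// True when no row in the file can satisfy `col == value`.
inline bool can_skip_for_equals(const column_bounds& b, int64_t value) {
    if (b.lower.has_value() && value < *b.lower) {
        return true;
    }
    if (b.upper.has_value() && value > *b.upper) {
        return true;
    }
    return false;
}
```

Files without bounds are always scanned in this sketch, which is the conservative choice: pruning may only drop files that provably contain no matching rows.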

Maintain a file-level column_stats_collector in buffered_column_writer
that accumulates stats by merging after each flush_pages() call, then
read the result once the file is complete to get file-level stats.
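
The merge-on-flush pattern described above can be sketched roughly as follows. This is a simplified model, not the code in the PR: `column_stats`, `column_writer_sketch`, and their members are hypothetical stand-ins, values are restricted to nullable 64-bit integers, and the value count is assumed to include nulls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for a per-column stats collector.
struct column_stats {
    std::optional<int64_t> min;
    std::optional<int64_t> max;
    int64_t null_count = 0;
    int64_t value_count = 0; // assumption: counts nulls as well

    // Record one value; std::nullopt models a null.
    void record(std::optional<int64_t> v) {
        ++value_count;
        if (!v.has_value()) {
            ++null_count;
            return;
        }
        min = min.has_value() ? std::min(*min, *v) : *v;
        max = max.has_value() ? std::max(*max, *v) : *v;
    }

    // Fold another collector's stats into this one.
    void merge(const column_stats& other) {
        if (other.min.has_value()) {
            min = min.has_value() ? std::min(*min, *other.min) : *other.min;
        }
        if (other.max.has_value()) {
            max = max.has_value() ? std::max(*max, *other.max) : *other.max;
        }
        null_count += other.null_count;
        value_count += other.value_count;
    }
};

// Hypothetical writer: page-level stats are merged into file-level stats on
// every flush, so the file-level result is ready once the file is done.
struct column_writer_sketch {
    column_stats page_stats;
    column_stats file_stats;

    void add(std::optional<int64_t> v) { page_stats.record(v); }

    void flush_page() {
        file_stats.merge(page_stats);
        page_stats = column_stats{};
    }

    const column_stats& file_column_stats() const {
        // Unflushed values would be missing from the file-level stats.
        assert(page_stats.value_count == 0 && "flush before reading");
        return file_stats;
    }
};
```

Merging at flush time keeps per-value work confined to the current page's collector, and the file-level collector is touched only once per flush.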

(cherry picked from commit 7f83353)

vbotbuildovich added this to the v25.3.x-next milestone on Jun 11, 2026
vbotbuildovich added the kind/backport label (PRs targeting a stable branch) on Jun 11, 2026

Labels

area/build, area/redpanda, kind/backport (PRs targeting a stable branch)

2 participants