Skip to content

feat: format-agnostic serde hook for PhysicalExpr (Column + NotExpr)#21966

Draft
adriangb wants to merge 5 commits intoapache:mainfrom
pydantic:serde-on-expr
Draft

feat: format-agnostic serde hook for PhysicalExpr (Column + NotExpr)#21966
adriangb wants to merge 5 commits intoapache:mainfrom
pydantic:serde-on-expr

Conversation

@adriangb
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Related to #21835 (no Closes — this PR only does the PhysicalExpr half and only registers two built-ins; full closure is multi-PR).

Rationale for this change

DataFusion has no in-tree serialization on the PhysicalExpr trait. The proto crate handles serialization with a downcast chain inside datafusion-proto/src/physical_plan/{from_proto,to_proto}.rs, which forces every expression author to expose private state via pub accessors so the proto crate can poke at it (#21835).

PR #21929's draft branch added a PhysicalExpr::to_proto hook to fix that — but the hook is protobuf-specific. If we ever want JSON, bincode, or anything else, we'd need a parallel to_json, to_bincode, etc., one trait method per format.

This PR introduces a single serde-style hook on PhysicalExpr that works with any serde format (JSON for debugging, bincode for binary, a serde adapter on top of protobuf, etc.). The shape is intentionally minimal so it can be reviewed in isolation:

  • PhysicalExpr::serde_tag(&self) -> &'static str — default "" means "not serializable".
  • PhysicalExpr::erased_serialize(&self) -> Box<dyn erased_serde::Serialize + '_> — default returns an error sentinel.
  • PhysicalExprDeserialize trait + PhysicalExprRegistry builder + DeserializeContext over &mut dyn erased_serde::Deserializer<'de>.

The deserialization side runs through any serde::Deserializer (Registry::deserialize) with deserialize_json as a convenience entry point. Trait-object children recurse through registry.expr_seed() inside hand-rolled visitors.

Wire stability across DataFusion versions is not a goal here — the datafusion-proto crate stays the path for that. The serde hook is for in-process tooling, debugging, and intra-cluster ephemeral serialization. A future PR can write a serde adapter on top of prost so proto becomes another consumer of erased_serialize, at which point the proto crate's downcast chain can collapse.

What changes are included in this PR?

  • Trait additions on PhysicalExpr (physical-expr-common): serde_tag and erased_serialize, both with default impls so this is non-breaking for existing in-tree and out-of-tree implementers.
  • impl Serialize for dyn PhysicalExpr producing a {tag, data} envelope. Arc<dyn PhysicalExpr> is automatically Serialize via serde's rc feature.
  • PhysicalExprDeserialize trait + PhysicalExprRegistry (physical-expr-common). Builder-style (empty().with::<T>().with::<U>()), modeled on the optimizer's rule list. No globals, no inventory crate (which has wasm/FFI gotchas DataFusion cares about).
    • with::<T>() panics on duplicate tag; try_with::<T>() returns Result for the runtime-plugin case.
  • Two built-ins wired up: Column (leaf) and NotExpr (single trait-object child, exercises the DeserializeContext::deserialize_unary helper).
  • Runnable example at datafusion-examples/examples/physical_expr_serde — defines a custom PhysicalExpr with a trait-object child, registers it on top of default_registry(), and round-trips through JSON.
  • Library-user-guide page walking through the trait surface, child handling, tag uniqueness, format choice, and how this differs from datafusion-proto.

The remaining built-ins follow the same shape and will land in follow-up PRs in batches grouped by what they need from the rest of the codebase (Operator first, then ScalarValue/arrow::DataType for Literal/Cast*/Like/InList/Case).

Are these changes tested?

Yes:

  • 6 unit tests in datafusion-physical-expr exercise: Column round-trip, NotExpr round-trip, NOT(NOT(a)) nested round-trip, unknown-tag error path, try_with duplicate-tag error, with duplicate-tag panic.
  • The example is runnable end-to-end (cargo run --example physical_expr_serde) and prints the JSON, the round-tripped expression, and the expected error when the registry doesn't know about the custom tag.
  • cargo check --workspace and cargo clippy --all-targets --all-features -- -D warnings (on touched crates) clean.

Are there any user-facing changes?

Yes, but additive:

  • Two new optional methods on the PhysicalExpr trait, both with defaults. Existing implementers (in-tree and out-of-tree) compile unchanged; they just won't be serializable through the new path until they opt in.
  • New public types: PhysicalExprRegistry, PhysicalExprDeserialize, DeserializeContext, ExprSeed, plus default_registry() in datafusion-physical-expr.
  • New library-user-guide page documenting the surface.

🤖 Generated with Claude Code

adriangb and others added 5 commits April 30, 2026 16:26
Introduces a format-agnostic serialization hook on PhysicalExpr:

- `serde_tag(&self) -> &'static str` (default empty = "not serializable")
- `erased_serialize(&self) -> Box<dyn erased_serde::Serialize + '_>`
  (default returns an error sentinel)

Plus an `impl Serialize for dyn PhysicalExpr` that wraps the two into a
`{tag, data}` envelope. `Arc<dyn PhysicalExpr>` becomes serializable
automatically via serde's `rc` feature.

This is the producer half of an in-tree, format-agnostic serialization
story for PhysicalExpr (JSON for debugging, proto via a serde adapter
in a follow-up). No deserialization yet — split for review.

The trait method returns a Box<dyn erased_serde::Serialize> rather than
taking a &mut dyn erased_serde::Serializer because erased_serde 0.4
sealed its Serialize trait, so the obvious bridge (custom impl) is no
longer possible. Returning an erased Serialize value sidesteps that.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rRegistry

Adds the deserialization half of the format-agnostic PhysicalExpr serde
story:

- `PhysicalExprDeserialize` trait — `const TAG: &'static str` plus a
  `deserialize(ctx, data: serde_json::Value)` constructor. Implementers
  recurse on trait-object children via `ctx.registry().deserialize_value`.
- `DeserializeContext` — currently just bundles the registry; carries
  whatever child-decoding helpers grow over time.
- `PhysicalExprRegistry` — builder-style (`empty().with::<T>()...`) map
  from tag to constructor, modeled on the optimizer's rule list. No
  globals, no inventory crate — explicit per-deployment construction.
  Duplicate-tag registration panics; the tag-collision behavior is
  finalized in a later commit.
- `Registry::deserialize_json(&str)` — convenience entry point, parses
  to `serde_json::Value` and dispatches.

The trait method takes `serde_json::Value` for the body, not a
streaming serde Deserializer. erased_serde 0.4 sealed both Serialize
and Deserializer, so the streaming bridge isn't reachable from outside
the crate; using a buffered Value keeps things simple. A future PR can
swap to a streaming/format-agnostic shape — the surface (Registry,
trait, ctx) is designed to absorb that change.

No built-in expressions are registered yet — that's C3+.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… serde

Replaces the JSON-specific deserialize trait method with a format-agnostic
one. The previous shape pinned `serde_json::Value` into the trait
signature, which would have forced parallel methods for every other
format. This commit reworks the surface so the trait is independent of
the wire format:

- `PhysicalExprDeserialize::deserialize(ctx: &mut DeserializeContext<'_, '_>)`
  — `ctx` exposes a `&mut dyn erased_serde::Deserializer<'de>` plus the
  registry. Leaf types call `ctx.deserialize::<Self>()`; expressions with
  trait-object children write a custom Visitor and use
  `ctx.registry().expr_seed()` for `next_value_seed`.
- `PhysicalExprRegistry::deserialize` for any `serde::Deserializer`,
  with `deserialize_json` retained as a convenience.
- The dispatch (read tag, look up in registry, drive constructor) lives
  in a `DeserializeSeed` chain so it composes with serde streaming.

Adds the first round-trippable expression — `Column` — to validate the
shape end-to-end, plus a `default_registry()` in `physical-expr` and
unit tests for round-trip and unknown-tag errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the second built-in expression to the new format-agnostic serde
hook (Column was the first, in the previous commit):

- `NotExpr` opts in via `serde_tag` + `erased_serialize` and a
  `PhysicalExprDeserialize` impl that uses a small new helper,
  `DeserializeContext::deserialize_unary("arg")`. The helper covers
  the unary-wrapper shape (a single `Arc<dyn PhysicalExpr>` field)
  that's common across the built-ins, so this is the boilerplate
  factored out for reuse later.
- `default_registry()` now ships with `Column` and `NotExpr`.
- Tag-collision finalization (open question from the plan): `with::<T>`
  panics on duplicate registration; new `try_with::<T>` returns a
  `Result` for the runtime-plugin case. Tests cover both paths.

Round-trip tests cover Column, NotExpr, and a nested NOT(NOT(a)) tree.
The remaining built-ins follow the same shape and will land in
follow-ups, in batches grouped by what they need from the rest of the
codebase (Operator, ScalarValue, arrow::DataType, etc., all need their
own serde impls before the expressions that hold them can be
registered).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New example `datafusion-examples/examples/physical_expr_serde` —
  defines a custom `PhysicalExpr` (with a trait-object child),
  registers it on top of the default registry, and round-trips it
  through JSON. Covers the case that needs a hand-rolled Visitor.
- Library-user-guide page `physical-expr-serde.md` walks through the
  trait surface, the `{tag, data}` envelope, child handling via
  `expr_seed`, tag uniqueness, format choice (and how this differs
  from `datafusion-proto`'s wire-stable path), and what's currently
  registered. Linked from `index.rst`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added documentation Improvements or additions to documentation physical-expr Changes to the physical-expr crates labels Apr 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation physical-expr Changes to the physical-expr crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant