Skip to content

Split proto serialization to encapsulate private state (#21835)#21929

Draft
adriangb wants to merge 8 commits intoapache:mainfrom
pydantic:split-proto-serialization
Draft

Split proto serialization to encapsulate private state (#21835)#21929
adriangb wants to merge 8 commits intoapache:mainfrom
pydantic:split-proto-serialization

Conversation

@adriangb
Copy link
Copy Markdown
Contributor

@adriangb adriangb commented Apr 29, 2026

Which issue does this PR close?

Rationale for this change

datafusion-proto serializes every built-in PhysicalExpr through a single ~300-line downcast_ref chain in serialize_physical_expr_with_converter, with a symmetric match on ExprType in from_proto.rs. Because the serializer lives outside the crate where each expression is defined, every piece of internal state an expression wants to round-trip has to be pub. The motivating example was #21807, which had to expose pub struct Inner, pub fn inner(), pub fn from_parts(), pub fn original_children(), pub fn remapped_children() on DynamicFilterPhysicalExpr — all "warning: not stable; proto-only".

This PR puts the infrastructure in place so future stateful expressions don't hit that wall:

  1. The prost-generated types move to a new lightweight datafusion-proto-models crate.
  2. PhysicalExpr gains an opt-in to_proto method (feature-gated).
  3. Per-expression try_from_proto constructors land via a parallel PhysicalExprDecodeCtx.
  4. Column is migrated end-to-end as a working demo of the new pattern.

The remaining built-in expressions stay on the existing downcast/match path; they migrate one-by-one in follow-ups (each migration is independent and keeps behavior unchanged).

What changes are included in this PR?

Five stacked commits, each independently mergeable / splittable later:

  1. Extract datafusion-proto-models crate. Pure refactor — moves .proto, gen, src/generated/* into a new lightweight crate (mirror of datafusion-proto-common's split). datafusion-proto re-exports the proto types via pub mod protobuf, so external consumers (datafusion-ffi, datafusion-examples, benchmarks) need no changes.

    Because the prost types are now foreign to datafusion-proto, the orphan rule forced two pieces of trait surgery:

    • Inherent impl protobuf::PhysicalPlanNode { ... } → new PhysicalPlanNodeExt trait. Callers that used protobuf::PhysicalPlanNode::try_from_physical_plan_with_converter(...) need a use datafusion_proto::physical_plan::PhysicalPlanNodeExt; (small API break, all in-tree call sites updated).
    • impl From<&protobuf::X> for Y (and TryFrom) for foreign-foreign pairs → new FromProto/TryFromProto traits in datafusion_proto::convert. Callers go from (&p).into() to Y::from_proto(&p). This is a workaround, not the desired end state — see Follow-up architectural cleanup below.
  2. Add PhysicalExpr::to_proto hook (feature-gated). Adds an opt-in proto feature on datafusion-physical-expr-common:

    #[cfg(feature = \"proto\")]
    fn to_proto(
        &self,
        ctx: &PhysicalExprEncodeCtx<'_>,
    ) -> Result<Option<PhysicalExprNode>> { Ok(None) }

    Ok(None) (the default) means "fall through" — datafusion-proto's downcast chain handles the expression. Ok(Some(node)) lets the expression serialize itself, so types with private state (DynamicFilterPhysicalExpr's RwLock-wrapped inner, etc.) won't need pub-for-proto accessors when they migrate.

  3. Migrate Column to PhysicalExpr::to_proto. First expression to use the hook. Removes its arm from the downcast chain.

  4. Wrap encoder in PhysicalExprEncodeCtx struct. The bare PhysicalExprEncoder trait moves behind a concrete PhysicalExprEncodeCtx struct so expression authors don't see &dyn in their signatures. Mirrors how PhysicalPlanDecodeContext is shaped today. Gives us a stable place to add helpers (UDF/UDAF/UDWF encoding, future registry hooks) without expanding a public trait surface.

  5. Add PhysicalExprDecodeCtx and migrate Column decode. Decode-side counterpart with a single-method API:

    pub struct PhysicalExprDecodeCtx<'a> { ... }
    impl PhysicalExprDecodeCtx<'_> {
        pub fn schema(&self) -> &Schema;
        pub fn decode(&self, node: &PhysicalExprNode) -> Result<Arc<dyn PhysicalExpr>>;
    }
    
    impl Column {
        pub fn try_from_proto(
            node: &PhysicalColumn,
            ctx: &PhysicalExprDecodeCtx<'_>,
        ) -> Result<Arc<Self>>;
    }

    Whatever "decode" needs to do (central match for built-ins, codec/registry for extensions) is hidden behind one decode method on the ctx — same shape regardless of expression origin.

Are these changes tested?

The existing roundtrip_physical_plan and roundtrip_physical_expr tests cover Column round-tripping in both directions, and they exercise the new to_proto / try_from_proto path. All 159 proto tests pass. (6 pre-existing failures require the parquet-testing git submodule, unrelated to this change.)

No new tests were added because no new behavior was introduced — the migrated Column produces and consumes the same wire format as before.

Are there any user-facing changes?

Yes, small API breaks in datafusion-proto:

  • protobuf::PhysicalPlanNode::try_from_physical_plan_with_converter and try_into_physical_plan_with_converter are now methods on the PhysicalPlanNodeExt trait. External callers need use datafusion_proto::physical_plan::PhysicalPlanNodeExt; to continue calling them with the same syntax. (Two examples in this repo updated.)
  • impl From<...> for protobuf::X / impl From<&protobuf::X> for ... between proto types and types from datafusion-common / datafusion-expr / datafusion-datasource are replaced by FromProto/TryFromProto traits in datafusion_proto::convert. Callers replace Y::from(...) / (&x).into() with Y::from_proto(...). This is a workaround for an orphan-rule + dep-cycle constraint — see Follow-up architectural cleanup for the long-term plan that returns these to standard From/TryFrom syntax.

The new proto feature on datafusion-physical-expr-common and datafusion-physical-expr is off by default; crates that don't serialize plans pay nothing. datafusion-proto flips it on.

Notes for reviewers

  • Each commit is independently mergeable and could be split into its own PR — happy to do that if reviewers prefer. Kept stacked here for higher-level review of the design as a whole.
  • Only Column is migrated to the new pattern. Other built-ins (Literal, BinaryExpr, etc.) and DynamicFilterPhysicalExpr are intentionally left on the existing downcast/match path; their migration is purely additive in follow-up PRs and doesn't change wire format.
  • The decode side uses a single-method decode(node) API on the ctx, matching the encode side's to_proto(&ctx) shape — both built-in and third-party expressions will look the same to recurse into children. Third-parties still need PhysicalExtensionCodec for the registry / dispatch role; an inventory-style register::<T>() builder is feasible later without further public-API churn (the ctx hides the dispatch trait, so the implementation behind it can swap freely).

Follow-up architectural cleanup

The FromProto / TryFromProto traits introduced in this PR are a workaround for an orphan-rule + dep-cycle constraint, not the intended end state. The cleaner architecture — collapse proto-common into proto-models, push impls down to target-type crates so callers use standard From / TryFrom — is a separate refactor that touches 8+ crates (including retiring datafusion-proto-common) and doesn't fit cleanly into this PR's review surface.

Full dep-graph analysis, the constraint that blocks doing it inline, the target architecture, and a step-by-step sequencing for the follow-up PR are written up at #21835 (comment). That comment is meant as a hand-off doc — anyone picking it up should have everything they need (cycle analysis, ~hour-by-hour work breakdown, gotchas around pbjson regeneration and parallel types) without rediscovering it.

The to_proto hook delivered in this PR (the actual goal of #21835) is independent of that cleanup and works regardless.

🤖 Generated with Claude Code

adriangb and others added 5 commits April 29, 2026 09:53
Mirror the existing datafusion-proto-common split for the physical/logical
plan schemas. The new crate contains only the .proto file and the prost-
generated Rust types, with no datafusion deps beyond datafusion-proto-common.

datafusion-proto re-exports the proto types via its existing protobuf module,
so downstream consumers (datafusion-ffi, datafusion-examples, benchmarks)
continue to work without Cargo.toml changes.

Because the prost-generated types are now foreign to datafusion-proto, the
orphan rule forbids:
  - inherent impl blocks on protobuf::PhysicalPlanNode
  - From/TryFrom impls between proto types and types from datafusion-common,
    datafusion-expr, datafusion-datasource (foreign + foreign)

To work around this:
  - Inherent impls on protobuf::PhysicalPlanNode are converted to a new
    PhysicalPlanNodeExt trait (callers must `use` it to call the methods).
  - Foreign-foreign From/TryFrom impls are converted to FromProto/TryFromProto
    traits defined in a new datafusion_proto::convert module.

This is groundwork for a follow-up that adds PhysicalExpr::to_proto so
expressions can serialize private state without exposing pub-for-proto
scaffolding (issue apache#21835).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-in `proto` feature on `datafusion-physical-expr-common` that
exposes `PhysicalExpr::to_proto`:

```rust
#[cfg(feature = "proto")]
fn to_proto(
    &self,
    ctx: &dyn PhysicalExprEncoder,
) -> Result<Option<PhysicalExprNode>> { Ok(None) }
```

`Ok(None)` (the default) means "fall through" — `datafusion-proto`'s
existing downcast chain handles the expression. Returning `Ok(Some(node))`
lets the expression serialize itself, so types with private state
(`DynamicFilterPhysicalExpr`'s `RwLock`-wrapped inner, etc.) no longer
need pub-for-proto accessors.

`PhysicalExprEncoder` is a thin trait in `physical-expr-common` that
exposes `encode_child` for recursing through the proto converter
(preserving dedup). `datafusion-proto` provides the concrete encoder via
`TraitEncoder` in `to_proto.rs`, and `serialize_physical_expr_with_converter`
calls `expr.to_proto(...)` first, falling back to the existing chain on
`Ok(None)`. No expression has migrated yet, so behavior is unchanged.

The feature is off by default on `physical-expr-common`; `datafusion-proto`
flips it on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First expression to use the new `to_proto` hook, demonstrating the migration
pattern. The downcast arm in `serialize_physical_expr_with_converter` is
removed; the trait dispatch path now produces the same `PhysicalExprNode`.

`datafusion-physical-expr` gains a passthrough `proto` feature that turns on
`datafusion-physical-expr-common/proto` and pulls in `datafusion-proto-models`
for the proto types. `datafusion-proto` enables it.

Other built-in expressions (`UnKnownColumn`, `Literal`, `BinaryExpr`,
`DynamicFilterPhysicalExpr`, etc.) still go through the downcast chain and
will migrate one at a time in follow-ups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bare `PhysicalExprEncoder` trait now lives behind a concrete
`PhysicalExprEncodeCtx` struct. Expression authors no longer see `&dyn`
types in their `to_proto` signature; they receive a stable struct that
internally dispatches to a sealed `PhysicalExprEncode` implementor.

Mirrors how `PhysicalPlanDecodeContext` is shaped today (struct holding
a `&dyn PhysicalExtensionCodec`), and gives us a stable place to add
helpers (UDF/UDAF/UDWF encoding, future registry hooks) without
expanding a public trait surface.

Renames the encoder dispatch trait to `PhysicalExprEncode` and the
in-`datafusion-proto` impl to `ConverterEncoder` for clarity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the decode-side counterpart to `PhysicalExprEncodeCtx`. The public
surface is a single `decode(node)` method on the context — the central
match (and, in future, third-party registry lookups) live behind it.

Per-expression contract:

```rust
impl Column {
    pub fn try_from_proto(
        node: &PhysicalColumn,
        ctx: &PhysicalExprDecodeCtx,
    ) -> Result<Arc<Self>>;
}
```

`Column` is the first expression to migrate; the central match in
`parse_physical_expr_with_converter` now dispatches its arm via
`Column::try_from_proto`. Other variants stay inline; they migrate in
follow-ups, same shape, same trick (one expression per commit, central
match keeps shrinking).

Removes the temporary `FromProto<&PhysicalColumn> for Column` impl
introduced in the proto-models extraction — `Column` now owns both
directions of its own serialization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added physical-expr Changes to the physical-expr crates proto Related to proto crate labels Apr 29, 2026
- licenserc.toml: exclude `datafusion/proto-models/src/generated/` from
  hawkeye license-header check (generated files; matches the existing
  exclusion for `datafusion/proto/src/generated/`).
- rat_exclude_files.txt: same exclusion for the release Apache RAT scan.
- .github/workflows/rust.yml: drop the `echo '' > .../generated/datafusion.rs`
  workaround in the cargo fmt job — the file (and its directory) was moved
  to `datafusion/proto-models/`.
- Trim now-unused optional deps in `datafusion-proto` and
  `datafusion-proto-models`. `datafusion-proto`'s `json` feature only needs
  `serde_json` directly (the pbjson-generated code lives in proto-models),
  and proto-models doesn't use `serde_json` itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the development-process Related to development process of DataFusion label Apr 29, 2026
Module-level doc comments referenced items inside the modules without
qualifying them, so cargo doc with `-D warnings` (CI default) failed
with `unresolved link` warnings. Qualify the links with the
`proto_encode::` / `proto_decode::` module prefix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb adriangb force-pushed the split-proto-serialization branch from 9302fa9 to 5360d7d Compare April 29, 2026 16:33
The proto-models extraction moved the prost-generated types from
`datafusion_proto::generated::*` into `datafusion_proto_models::generated::*`.
cargo-semver-checks flagged ~200 paths as removed, even though my
`pub use datafusion_proto_models::protobuf::*` wildcard re-export should
have covered them — turns out wildcard-of-wildcard re-exports aren't
followed by every static-analysis tool.

Two fixes:

1. `pub mod protobuf` now re-exports directly from
   `datafusion_proto_models::generated::datafusion::*` (the leaf module
   where the types are originally declared) instead of going through
   `datafusion_proto_models::protobuf::*` (itself a `pub use ::*`). One
   level of indirection avoided.

2. New `pub mod generated` re-exports the legacy `generated::datafusion::*`
   and `generated::datafusion_common` paths, with `#[deprecated]` pointing
   downstream callers at `datafusion_proto::protobuf` /
   `datafusion_proto_common::protobuf_common`.

Together: every `datafusion_proto::generated::*` and
`datafusion_proto::protobuf::*` path that compiled on `main` continues
to compile, with deprecation warnings on the legacy `generated::*`
paths.

The other API breaks (`PhysicalPlanNodeExt` trait import for inherent-
looking methods, `from_proto`/`try_from_proto` over `From`/`TryFrom` for
foreign-foreign pairs) can't be made backwards-compatible without
re-introducing the orphan rule problem; they're flagged in the PR
description.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -604,7 +601,6 @@ jobs:
rust-version: stable
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems sus

@@ -0,0 +1,212 @@

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could split this into 2 PRs or stack the commits so that the diff is not polluted by 800 lines of boilerplate


impl From<&CsvOptionsProto> for CsvOptions {
fn from(proto: &CsvOptionsProto) -> Self {
impl FromProto<&CsvOptionsProto> for CsvOptions {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to double check this change. It doesn't seem obviously linked to the other changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

development-process Related to development process of DataFusion physical-expr Changes to the physical-expr crates proto Related to proto crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Split proto serialization to encapsulate private state

1 participant