fix: unify schema across batches in JSONStreamDatasource to handle null → concrete type evolution #972
Open
fengrui-z wants to merge 5 commits into datajuicer:main
…ll → concrete type evolution

The previous implementation locked the schema from the first batch and reused it for all subsequent batches via `Table.from_batches([batch], schema=schema)`. When an early batch inferred a nested field as `null` (e.g. `meta.url = null`) and a later batch introduced a concrete type (e.g. `string`), the cast from `string` to `null` would fail with ArrowInvalid.

This fix removes the first-batch schema lock and instead uses `pyarrow.unify_schemas` to merge schemas across batches, allowing `null` types to be promoted to concrete types as new data is read.

Fixes datajuicer#936

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Code Review
This pull request updates the _read_stream method in ray_dataset.py to support schema unification when reading batches from a stream. This allows the system to handle batches with varying but compatible schemas. A review comment suggests refactoring the implementation to reduce code duplication by consolidating the pyarrow.Table creation after the final schema has been determined.
Summary
Fixes #936
`JSONStreamDatasource._read_stream` locks the schema from the first batch and reuses it for all subsequent batches. When an early batch infers a nested field as `null` (e.g. `meta.url = null`) and a later batch introduces a concrete type (e.g. `string`), the forced cast from `string` to `null` fails with `ArrowInvalid`.

This is a correctness bug in DJ's custom JSON streaming ingestion path. Ray's native `ray.data.read_json` handles the same input correctly.

Root Cause
Fix
unify_schemas internally delegates to Arrow C++ UnifyTypes, which promotes null to the concrete type and recursively handles nested structs.
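The promotion rule described above can be illustrated with a small pure-Python model. This is only a sketch and not pyarrow's implementation: schemas are represented as plain dicts, type names are plain strings, and the helper names `unify_types` and `unify_schemas` here are hypothetical stand-ins chosen for the example.

```python
# Toy model of schema unification; NOT pyarrow's implementation.
# A schema is a dict mapping field name -> type. A type is either a string
# such as "null" or "string", or a nested dict standing in for a struct.

def unify_types(left, right):
    """Merge two toy types, promoting "null" to whatever the other side is."""
    if left == right:
        return left
    if left == "null":
        return right
    if right == "null":
        return left
    if isinstance(left, dict) and isinstance(right, dict):
        # Structs are merged field by field, so nested nulls are promoted too.
        return unify_schemas([left, right])
    raise TypeError(f"cannot unify {left!r} with {right!r}")


def unify_schemas(schemas):
    """Fold a list of toy schemas into one, keeping first-seen field order."""
    merged = {}
    for schema in schemas:
        for name, typ in schema.items():
            merged[name] = unify_types(merged[name], typ) if name in merged else typ
    return merged


# The first batch only saw meta.url = null; a later batch saw a real string.
first_batch = {"text": "string", "meta": {"url": "null"}}
later_batch = {"text": "string", "meta": {"url": "string"}}

unified = unify_schemas([first_batch, later_batch])
print(unified)  # {'text': 'string', 'meta': {'url': 'string'}}
```

In this model, two different concrete types (for example `"string"` and `"int64"`) still raise an error; only `null` is promoted. Whether a given pyarrow version behaves the same way for every nested case is worth confirming against the reproduction in #936.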
Test Plan
null -> concrete type evolution #936 passes with this fix

See #936 for the minimal reproduction script.