
fix: unify schema across batches in JSONStreamDatasource to handle null → concrete type evolution#972

Open
fengrui-z wants to merge 5 commits into datajuicer:main from fengrui-z:fix/json-stream-schema-lock

Conversation


@fengrui-z fengrui-z commented Apr 29, 2026

Summary

Fixes #936

JSONStreamDatasource._read_stream locks the schema from the first batch and reuses it for all subsequent batches. When an
early batch infers a nested field as null (e.g. meta.url = null) and a later batch introduces a concrete type (e.g.
string), the forced cast from string to null fails with ArrowInvalid.

This is a correctness bug in DJ's custom JSON streaming ingestion path. Ray's native ray.data.read_json handles the same input correctly.

Root Cause

```python
# Before: the first batch locks the schema; every later batch is forced to it
table = pyarrow.Table.from_batches([batch], schema=schema)
if schema is None:
    schema = table.schema  # locked forever
```

Fix

1. Remove the first-batch schema lock: create the table without a forced schema.
2. Use `pyarrow.unify_schemas` to merge schemas across batches, allowing null → concrete type promotion.
3. After unification, cast the batch to the unified schema for consistency.

```python
# After: the schema evolves across batches
table = pyarrow.Table.from_batches([batch])
if schema is None:
    schema = table.schema
else:
    unified = pyarrow.unify_schemas([schema, table.schema])
    if not unified.equals(schema):
        schema = unified
    table = pyarrow.Table.from_batches([batch], schema=schema)
```

`unify_schemas` delegates to Arrow's C++ schema unification, which by default promotes `null` to the concrete type and recursively handles nested structs.

Test Plan

See #936 for the minimal reproduction script.

…ll → concrete type evolution

The previous implementation locked the schema from the first batch and
reused it for all subsequent batches via `Table.from_batches([batch],
schema=schema)`. When an early batch inferred a nested field as `null`
(e.g. `meta.url = null`) and a later batch introduced a concrete type
(e.g. `string`), the cast from `string` to `null` would fail with
ArrowInvalid.

This fix removes the first-batch schema lock and instead uses
`pyarrow.unify_schemas` to merge schemas across batches, allowing
`null` types to be promoted to concrete types as new data is read.

Fixes datajuicer#936

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@gemini-code-assist (bot) left a comment


Code Review

This pull request updates the _read_stream method in ray_dataset.py to support schema unification when reading batches from a stream. This allows the system to handle batches with varying but compatible schemas. A review comment suggests refactoring the implementation to reduce code duplication by consolidating the pyarrow.Table creation after the final schema has been determined.

Comment thread on data_juicer/core/data/ray_dataset.py (outdated)
@fengrui-z fengrui-z marked this pull request as ready for review April 29, 2026 09:24


Development

Successfully merging this pull request may close these issues.

[Bug] JSONStreamDatasource locks first-batch schema and fails on later null -> concrete type evolution
