Skip to content

Align xarray dimension handling in to_xarray with earthkit-data #110

Description

@andreas-grafberger

Is your feature request related to a problem? Please describe.

earthkit-data tries to establish conventions for the handling of xarray dataset dimensions that are followed by many data sources (or rather, readers). Those can be found here: https://earthkit-data.readthedocs.io/en/latest/guide/xarray/dim.html#xr-dim

Ideally, the earthkit-data polytope source (which wraps covjsonkit) would also align with these conventions.

This would mean that we could e.g. load reanalysis data into a dataframe with a single valid_time dimension, or load forecast data to have forecast_reference_time and step dimensions.

UPDATE (June 23rd, 2026): Additional notes

Covjsonkit().decode(covjson).to_xarray() currently returns
Union[xr.Dataset, list[xr.Dataset]], producing a list whenever the collection has more than
one unique domain: one dataset per spatial point, and per hdate for reanalysis retrievals. The
earthkit-data polytope source delegates directly to this (readers/covjson/reader.py), so
from_source("polytope", ...).to_xarray() inherits the list. This generalises
ecmwf/covjsonkit#109..

The current decoder reconstructs the datacube heuristically: it deduplicates coverages by
(x, y, z, tuple(t)) and locates values by matching every coverage against hard-coded
metadata keys:

# covjsonkit/decoder/TimeSeries.py
def _covers_domain(self, coverage, num, date, x, y, z):
    return (
        coverage["mars:metadata"]["number"] == num
        and coverage["mars:metadata"]["Forecast date"] == date
        and coverage["domain"]["axes"][self.x_name]["values"] == x
        ...

Each feature request defines exactly one logical datacube, so a single Dataset should always
be possible. For a multi-point reanalysis timeseries, something like:

<xarray.Dataset>
Dimensions:    (points: 2, valid_time: 1096)
Coordinates:
  * valid_time (valid_time) datetime64[ns] ...
  * points     (points) int64 0 1
    latitude   (points) float64 35.0 38.0
    longitude  (points) float64 -20.0 -22.0
    id         (points) <U9 'station_1' 'station_2'
Data variables:
    dis06      (points, valid_time) float64 ...
Attributes: (constant mars:metadata: class, stream, expver, ...)

Ideally, decoding would map domain axes to dimensions directly (composite -> points / labels / x/y (TBD),
t -> valid_time), promote any mars:metadata key that varies across coverages (number,
forecast_reference_time) to a dimension/coordinate, and keep constant keys as Dataset attrs.
Coverages with differing t-axes could be outer-joined with NaN fill, though ideally the encoder
avoids producing these in the first place.

Forecasts with multiple forecast reference times

The layout above only works when valid times are unique (reanalysis, or a single forecast).
For retrievals spanning multiple forecast reference times, valid times from consecutive
forecasts overlap, so valid_time cannot be an index dimension, only an auxiliary coordinate.
The dimensions are then forecast_reference_time and step, matching earthkit-data's default
xarray layout (time_dims=["forecast_reference_time", "step"], with add_valid_time_coord
injecting valid_time as a 2D auxiliary coordinate over the active time dimensions):

<xarray.Dataset>
Dimensions:                  (forecast_reference_time: 30, step: 8, points: 2)
Coordinates:
  * forecast_reference_time  (forecast_reference_time) datetime64[ns] ...
  * step                     (step) timedelta64[ns] 06:00 12:00 ... 48:00
  * points                   (points) int64 0 1
    latitude                 (points) float64 35.0 38.0
    longitude                (points) float64 -20.0 -22.0
    id                       (points) <U9 'station_1' 'station_2'
    valid_time               (forecast_reference_time, step) datetime64[ns] ...
Data variables:
    dis06                    (forecast_reference_time, step, points) float64 ...

On the CovJSON side, one coverage per forecast reference time remains the correct encoding
here: the spec requires axis values with a natural ordering to be monotonic, so overlapping
valid times cannot be packed into a single t axis. The decoder would instead lift the
per-coverage forecast_reference_time metadata into a dimension, derive
step = t - forecast_reference_time, and compute valid_time as the 2D coordinate.

Ideally, to_xarray would expose the same control over the time dimensions as other
earthkit-data sources, i.e. a time_dims kwarg with the same role names:

  • time_dims=["forecast_reference_time", "step"] (default for forecast data)
  • time_dims=["valid_time"] (default for reanalysis; for forecast data only valid when a
    single reference time is present, otherwise an error)
    with the kwarg forwarded from the earthkit-data covjson reader (which currently accepts
    **kwargs but drops them). from_source("polytope", ...).to_xarray(time_dims=...) would then
    behave consistently with the GRIB-based sources.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions