Is your feature request related to a problem? Please describe.
earthkit-data tries to establish conventions for the handling of xarray dataset dimensions that are followed by many data sources (or rather, readers). Those can be found here: https://earthkit-data.readthedocs.io/en/latest/guide/xarray/dim.html#xr-dim
Ideally, the earthkit-data polytope source (which wraps covjsonkit) would also align with these conventions.
This would mean that we could e.g. load reanalysis data into a dataframe with a single valid_time dimension, or load forecast data to have forecast_reference_time and step dimensions.
UPDATE (June 23rd, 2026): Additional notes
Covjsonkit().decode(covjson).to_xarray() currently returns
Union[xr.Dataset, list[xr.Dataset]], producing a list whenever the collection has more than
one unique domain: one dataset per spatial point, and per hdate for reanalysis retrievals. The
earthkit-data polytope source delegates directly to this (readers/covjson/reader.py), so
from_source("polytope", ...).to_xarray() inherits the list. This generalises
ecmwf/covjsonkit#109..
The current decoder reconstructs the datacube heuristically: it deduplicates coverages by
(x, y, z, tuple(t)) and locates values by matching every coverage against hard-coded
metadata keys:
# covjsonkit/decoder/TimeSeries.py
def _covers_domain(self, coverage, num, date, x, y, z):
return (
coverage["mars:metadata"]["number"] == num
and coverage["mars:metadata"]["Forecast date"] == date
and coverage["domain"]["axes"][self.x_name]["values"] == x
...
Each feature request defines exactly one logical datacube, so a single Dataset should always
be possible. For a multi-point reanalysis timeseries, something like:
<xarray.Dataset>
Dimensions: (points: 2, valid_time: 1096)
Coordinates:
* valid_time (valid_time) datetime64[ns] ...
* points (points) int64 0 1
latitude (points) float64 35.0 38.0
longitude (points) float64 -20.0 -22.0
id (points) <U9 'station_1' 'station_2'
Data variables:
dis06 (points, valid_time) float64 ...
Attributes: (constant mars:metadata: class, stream, expver, ...)
Ideally, decoding would map domain axes to dimensions directly (composite -> points / labels / x/y (TBD),
t -> valid_time), promote any mars:metadata key that varies across coverages (number,
forecast_reference_time) to a dimension/coordinate, and keep constant keys as Dataset attrs.
Coverages with differing t-axes could be outer-joined with NaN fill, though ideally the encoder
avoids producing these in the first place.
Forecasts with multiple forecast reference times
The layout above only works when valid times are unique (reanalysis, or a single forecast).
For retrievals spanning multiple forecast reference times, valid times from consecutive
forecasts overlap, so valid_time cannot be an index dimension, only an auxiliary coordinate.
The dimensions are then forecast_reference_time and step, matching earthkit-data's default
xarray layout (time_dims=["forecast_reference_time", "step"], with add_valid_time_coord
injecting valid_time as a 2D auxiliary coordinate over the active time dimensions):
<xarray.Dataset>
Dimensions: (forecast_reference_time: 30, step: 8, points: 2)
Coordinates:
* forecast_reference_time (forecast_reference_time) datetime64[ns] ...
* step (step) timedelta64[ns] 06:00 12:00 ... 48:00
* points (points) int64 0 1
latitude (points) float64 35.0 38.0
longitude (points) float64 -20.0 -22.0
id (points) <U9 'station_1' 'station_2'
valid_time (forecast_reference_time, step) datetime64[ns] ...
Data variables:
dis06 (forecast_reference_time, step, points) float64 ...
On the CovJSON side, one coverage per forecast reference time remains the correct encoding
here: the spec requires axis values with a natural ordering to be monotonic, so overlapping
valid times cannot be packed into a single t axis. The decoder would instead lift the
per-coverage forecast_reference_time metadata into a dimension, derive
step = t - forecast_reference_time, and compute valid_time as the 2D coordinate.
Ideally, to_xarray would expose the same control over the time dimensions as other
earthkit-data sources, i.e. a time_dims kwarg with the same role names:
time_dims=["forecast_reference_time", "step"] (default for forecast data)
time_dims=["valid_time"] (default for reanalysis; for forecast data only valid when a
single reference time is present, otherwise an error)
with the kwarg forwarded from the earthkit-data covjson reader (which currently accepts
**kwargs but drops them). from_source("polytope", ...).to_xarray(time_dims=...) would then
behave consistently with the GRIB-based sources.
Is your feature request related to a problem? Please describe.
earthkit-data tries to establish conventions for the handling of xarray dataset dimensions that are followed by many data sources (or rather, readers). Those can be found here: https://earthkit-data.readthedocs.io/en/latest/guide/xarray/dim.html#xr-dim
Ideally, the earthkit-data polytope source (which wraps
covjsonkit) would also align with these conventions.This would mean that we could e.g. load reanalysis data into a dataframe with a single
valid_timedimension, or load forecast data to haveforecast_reference_timeandstepdimensions.UPDATE (June 23rd, 2026): Additional notes
Covjsonkit().decode(covjson).to_xarray()currently returnsUnion[xr.Dataset, list[xr.Dataset]], producing a list whenever the collection has more thanone unique domain: one dataset per spatial point, and per hdate for reanalysis retrievals. The
earthkit-data polytope source delegates directly to this (
readers/covjson/reader.py), sofrom_source("polytope", ...).to_xarray()inherits the list. This generalisesecmwf/covjsonkit#109..
The current decoder reconstructs the datacube heuristically: it deduplicates coverages by
(x, y, z, tuple(t))and locates values by matching every coverage against hard-codedmetadata keys:
Each feature request defines exactly one logical datacube, so a single Dataset should always
be possible. For a multi-point reanalysis timeseries, something like:
Ideally, decoding would map domain axes to dimensions directly (
composite->points/labels/x/y(TBD),t->valid_time), promote anymars:metadatakey that varies across coverages (number,forecast_reference_time) to a dimension/coordinate, and keep constant keys as Dataset attrs.Coverages with differing t-axes could be outer-joined with NaN fill, though ideally the encoder
avoids producing these in the first place.
Forecasts with multiple forecast reference times
The layout above only works when valid times are unique (reanalysis, or a single forecast).
For retrievals spanning multiple forecast reference times, valid times from consecutive
forecasts overlap, so
valid_timecannot be an index dimension, only an auxiliary coordinate.The dimensions are then
forecast_reference_timeandstep, matching earthkit-data's defaultxarray layout (
time_dims=["forecast_reference_time", "step"], withadd_valid_time_coordinjecting
valid_timeas a 2D auxiliary coordinate over the active time dimensions):On the CovJSON side, one coverage per forecast reference time remains the correct encoding
here: the spec requires axis values with a natural ordering to be monotonic, so overlapping
valid times cannot be packed into a single
taxis. The decoder would instead lift theper-coverage
forecast_reference_timemetadata into a dimension, derivestep = t - forecast_reference_time, and computevalid_timeas the 2D coordinate.Ideally,
to_xarraywould expose the same control over the time dimensions as otherearthkit-data sources, i.e. a
time_dimskwarg with the same role names:time_dims=["forecast_reference_time", "step"](default for forecast data)time_dims=["valid_time"](default for reanalysis; for forecast data only valid when asingle reference time is present, otherwise an error)
with the kwarg forwarded from the earthkit-data covjson reader (which currently accepts
**kwargsbut drops them).from_source("polytope", ...).to_xarray(time_dims=...)would thenbehave consistently with the GRIB-based sources.