bug: table.cache() alters table's schema #11973

@90degs2infty

What happened?

Hi everyone!

I noticed that calls to table.cache() interfere with the schema reported by table.schema(), depending on the order of calls. For example, in the following code, the call to .cache() changes the dtype of centroid from point:geometry to geospatial:geometry:

from pathlib import Path
from urllib.request import urlretrieve

import ibis

# Download upstream example geoparquet file
url = "https://github.com/opengeospatial/geoparquet/raw/refs/tags/v1.1.0+p1/examples/example.parquet"
parquet_path = Path("opengeospatial-example.parquet")

if not parquet_path.exists():
    urlretrieve(url, parquet_path)

# Ensure the spatial extension has been loaded (duckdb backend)
con = ibis.get_backend()
con.load_extension("spatial")

# Read data and compute centroids
data = ibis.read_parquet(parquet_path)  # raises KeyError with duckdb==1.5.0, see below
data_with_centroids = data.mutate(centroid=ibis._.geometry.centroid())
data_with_centroids_cached = data_with_centroids.cache()

# Compare schemas
print(
    "data schema:",
    data.schema(),
    "\ndata with centroids schema:",
    data_with_centroids.schema(),
    "\ndata with centroids cached schema:",
    data_with_centroids_cached.schema(),
    "\nschemas equivalent:",
    data_with_centroids.schema().equals(data_with_centroids_cached.schema()),
    sep="\n",
)

Output:

data schema:
ibis.Schema {
  pop_est     float64
  continent   string
  name        string
  iso_a3      string
  gdp_md_est  int64
  geometry    geospatial:geometry
  bbox        struct<xmax: float64, xmin: float64, ymax: float64, ymin: float64>
}

data with centroids schema:
ibis.Schema {
  pop_est     float64
  continent   string
  name        string
  iso_a3      string
  gdp_md_est  int64
  geometry    geospatial:geometry
  bbox        struct<xmax: float64, xmin: float64, ymax: float64, ymin: float64>
  centroid    point:geometry
}

data with centroids cached schema:
ibis.Schema {
  pop_est     float64
  continent   string
  name        string
  iso_a3      string
  gdp_md_est  int64
  geometry    geospatial:geometry
  bbox        struct<xmax: float64, xmin: float64, ymax: float64, ymin: float64>
  centroid    geospatial:geometry
}

schemas equivalent:
False

Note how the two schemas are not considered equal. I would expect the reported schema to remain invariant under calls to table.cache().
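
To pinpoint which columns change, the two schemas can be compared column by column. The following is a minimal sketch, not part of ibis: `diff_schemas` is a hypothetical helper that only assumes its arguments behave like mappings from column name to dtype (which I believe holds for `ibis.Schema`, but plain dicts work too). The demo uses plain dicts with the dtype strings copied from the output above, so it runs without ibis or DuckDB installed.

```python
def diff_schemas(before, after):
    """Return {column: (dtype_before, dtype_after)} for every column that differs.

    Hypothetical helper: both arguments only need an .items() method yielding
    (column name, dtype) pairs. A column missing on one side is reported as None.
    """
    before = dict(before.items())
    after = dict(after.items())
    # Preserve column order: columns of `before` first, then any new ones.
    names = list(before) + [name for name in after if name not in before]
    diffs = {}
    for name in names:
        old, new = before.get(name), after.get(name)
        if old != new:
            diffs[name] = (
                None if old is None else str(old),
                None if new is None else str(new),
            )
    return diffs


# Dtype strings copied from the schemas printed above (bbox omitted for brevity).
before = {
    "pop_est": "float64",
    "continent": "string",
    "name": "string",
    "iso_a3": "string",
    "gdp_md_est": "int64",
    "geometry": "geospatial:geometry",
    "centroid": "point:geometry",
}
after = dict(before, centroid="geospatial:geometry")

print(diff_schemas(before, after))
# prints {'centroid': ('point:geometry', 'geospatial:geometry')}
```

With the tables from the reproduction above, the same helper could presumably be called as `diff_schemas(data_with_centroids.schema(), data_with_centroids_cached.schema())`; only the centroid column should show up.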

What version of ibis are you using?

ibis-framework[duckdb,geospatial]==12.0.0

What backend(s) are you using, if any?

DuckDB, i.e. duckdb==1.4.4 (the current LTS version as of now; I'm not using the latest version, 1.5.0, because it gives me a KeyError: 'OGC:CRS84' when reading the parquet file)
