Trino dbt source: resolve_trino_modified_type crashes with KeyError on uppercase data_type strings #17078

@kzajaczkowski

Describe the bug

DataHub's dbt source crashes with an uncaught KeyError when the Trino type parser encounters an uppercase data_type string
from manifest.json. resolve_trino_modified_type() in metadata-ingestion/src/datahub/ingestion/source/sql/sql_types.py
regex-extracts the base type name and looks it up in TRINO_SQL_TYPES_MAP, whose keys are all lowercase (varchar, bigint,
timestamp, …). No normalisation is applied on the Trino path — even though the Snowflake path in the same resolve_sql_type()
function does call column_type.upper().
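
In miniature (an abbreviated, illustrative map — the real TRINO_SQL_TYPES_MAP lives in sql_types.py, but all of its keys are likewise lowercase):

import re

# Abbreviated stand-in for the real TRINO_SQL_TYPES_MAP; keys are lowercase.
TRINO_SQL_TYPES_MAP = {"timestamp": "TimeTypeClass", "varchar": "StringTypeClass"}

type_string = "TIMESTAMP(6) WITH TIME ZONE"
match = re.match(r"([a-zA-Z]+)(.+)", type_string)
base = match.group(1) if match else type_string  # 'TIMESTAMP' -- case preserved
TRINO_SQL_TYPES_MAP[base]  # KeyError: 'TIMESTAMP'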

The crash only surfaces on the manifest-fallback path, i.e. when a node is missing from catalog.json (new model not yet
materialised, or a dropped table). Catalogued nodes avoid it because their types come from Trino already lowercase. A single
unresolved column type aborts the entire ingestion pipeline.

To Reproduce

Minimal (no dbt/Airflow needed):

  1. pip install 'acryl-datahub[dbt]==1.4.0.9'
  2. Run:
    from datahub.ingestion.source.sql.sql_types import resolve_sql_type
    resolve_sql_type("TIMESTAMP(6) WITH TIME ZONE", platform="trino")
  3. Observe KeyError: 'TIMESTAMP'.
  4. Compare with the lowercase form, which returns TimeTypeClass() correctly:
    resolve_sql_type("timestamp(6) with time zone", platform="trino")
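
Combined into one runnable snippet (same two calls as above):

from datahub.ingestion.source.sql.sql_types import resolve_sql_type

# Lowercase form resolves correctly:
print(resolve_sql_type("timestamp(6) with time zone", platform="trino"))
# Uppercase form raises KeyError: 'TIMESTAMP':
print(resolve_sql_type("TIMESTAMP(6) WITH TIME ZONE", platform="trino"))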

End-to-end (Airflow + dbt on Trino):

  1. Add a new dbt model with at least one uppercase column type in schema.yml, e.g. data_type: TIMESTAMP(6) WITH TIME ZONE.
  2. Run dbt compile so the new node lands in manifest.json.
  3. Do not run dbt build on it — the table is absent from the warehouse and therefore from catalog.json (you can also reproduce by
    dropping an existing table).
  4. Run the dbt DataHub ingestion against the resulting artifacts.
  5. Observe the warning Node missing from catalog ... => model.., followed immediately by:
    File ".../datahub/ingestion/source/dbt/dbt_common.py", line 1350, in get_column_type
      TypeClass = resolve_sql_type(column_type, dbt_adapter)
    File ".../datahub/ingestion/source/sql/sql_types.py", line 621, in resolve_sql_type
      TypeClass = resolve_trino_modified_type(column_type)
    File ".../datahub/ingestion/source/sql/sql_types.py", line 250, in resolve_trino_modified_type
      return TRINO_SQL_TYPES_MAP[modified_type_base]
    KeyError: 'TIMESTAMP'

Expected behavior

resolve_trino_modified_type() should normalise case before the dict lookup (mirroring the Snowflake branch in resolve_sql_type()
that already calls .upper()). Trino itself is case-insensitive for type names, and users do write VARCHAR, BIGINT, TIMESTAMP(6)
WITH TIME ZONE, etc. in dbt contracts — DataHub should accept them.

Additionally, a single unresolved column type should not terminate the whole ingestion pipeline. Downgrading to a warning (and
proceeding with the column typed as NullTypeClass / unknown) would make the ingestion resilient to the long tail of Trino type
strings that the map does not explicitly cover.

Proposed patch:

def resolve_trino_modified_type(type_string: str) -> Any:
    # Normalise case first; all TRINO_SQL_TYPES_MAP keys are lowercase.
    type_string = type_string.lower()
    match = re.match(r"([a-zA-Z]+)(.+)", type_string)
    if match:
        return TRINO_SQL_TYPES_MAP[match.group(1)]
    return TRINO_SQL_TYPES_MAP[type_string]
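
And a hedged sketch of the warning-downgrade suggestion on top of that (NullTypeClass comes from datahub.metadata.schema_classes; the logging here is illustrative, not the exact reporting mechanism DataHub would use):

import logging
import re
from typing import Any

from datahub.metadata.schema_classes import NullTypeClass

logger = logging.getLogger(__name__)

def resolve_trino_modified_type(type_string: str) -> Any:
    # TRINO_SQL_TYPES_MAP as defined in sql_types.py.
    type_string = type_string.lower()
    match = re.match(r"([a-zA-Z]+)(.+)", type_string)
    base = match.group(1) if match else type_string
    try:
        return TRINO_SQL_TYPES_MAP[base]
    except KeyError:
        # Unknown Trino type string: warn and treat the column as unknown
        # instead of aborting the whole ingestion run.
        logger.warning("Could not resolve Trino type %r; treating as unknown", type_string)
        return NullTypeClass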

Screenshots

N/A — CLI / Airflow task failure; the full stack trace is in the "To Reproduce" section.

Desktop (please complete the following information):

  • OS: Linux (Airflow worker on Linux-5.14.0-x86_64-with-glibc2.36); also reproducible on macOS with a local venv.
  • Python: 3.12.12
  • acryl-datahub: 1.4.0.9 (latest 1.4.x on PyPI at time of filing; code path unchanged on master)
  • DataHub GMS: v1.5.0.1
  • dbt-core: 1.11.8
  • Adapter: dbt-trino against Trino

Additional context

The failure mode is easy to hit whenever a dbt project's manifest lists a model that hasn't been materialised yet in the
warehouse (and is therefore absent from catalog.json). DataHub's dbt source falls back to the raw data_type strings from the
manifest, and if any of those are uppercase, the whole ingestion pipeline dies on the first one.

A workaround is to lowercase every data_type value in schema.yml, but this is an unexpected foot-gun — contracts using the same
case convention as Trino DDL itself (uppercase, as in CREATE TABLE ... (col TIMESTAMP(6) WITH TIME ZONE)) shouldn't crash
ingestion.
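
For anyone stuck on an affected version, a throwaway sketch of that workaround (assumes PyYAML and a single schema.yml; it rewrites the file in place, so keep a backup):

import yaml  # PyYAML

SCHEMA_PATH = "models/schema.yml"  # adjust to your dbt project layout

with open(SCHEMA_PATH) as f:
    doc = yaml.safe_load(f)

# Lowercase every column's data_type so the Trino type map lookup matches.
for model in doc.get("models", []):
    for col in model.get("columns", []):
        if "data_type" in col:
            col["data_type"] = col["data_type"].lower()

with open(SCHEMA_PATH, "w") as f:
    yaml.safe_dump(doc, f, sort_keys=False)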

TRINO_SQL_TYPES_MAP keys for reference:
boolean, tinyint, smallint, int, integer, bigint, real, double, decimal, varchar, char, varbinary, date, time, timestamp, row,
map, array, json.
