Describe the bug
DataHub's dbt source crashes with an uncaught KeyError when the Trino type parser encounters an uppercase data_type string
from manifest.json. resolve_trino_modified_type() in metadata-ingestion/src/datahub/ingestion/source/sql/sql_types.py
regex-extracts the base type name and looks it up in TRINO_SQL_TYPES_MAP, whose keys are all lowercase (varchar, bigint,
timestamp, …). No normalisation is applied on the Trino path — even though the Snowflake path in the same resolve_sql_type()
function does call column_type.upper().
The crash only surfaces on the manifest-fallback path, i.e. when a node is missing from catalog.json (new model not yet
materialised, or a dropped table). Catalogued nodes avoid it because their types come from Trino already lowercase. A single
unresolved column type aborts the entire ingestion pipeline.
To Reproduce
Minimal (no dbt/Airflow needed):
- Install: pip install 'acryl-datahub[dbt]==1.4.0.9'
- Run:
  from datahub.ingestion.source.sql.sql_types import resolve_sql_type
  resolve_sql_type("TIMESTAMP(6) WITH TIME ZONE", platform="trino")
- Observe KeyError: 'TIMESTAMP'.
- Compare with the lowercase form, which returns TimeTypeClass() correctly:
  resolve_sql_type("timestamp(6) with time zone", platform="trino")
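The mismatch can also be demonstrated without installing acryl-datahub at all. The map and regex below are simplified stand-ins for the ones in sql_types.py (the real map holds DataHub type classes; the strings here are placeholders), but they reproduce the same lookup behaviour:

```python
import re

# Simplified stand-in for DataHub's TRINO_SQL_TYPES_MAP: all keys lowercase,
# matching the key list quoted at the end of this report.
TRINO_SQL_TYPES_MAP = {"varchar": "String", "bigint": "Number", "timestamp": "Time"}

def resolve_trino_modified_type(type_string):
    # Mirrors the current behaviour: no case normalisation before the lookup.
    match = re.match(r"([a-zA-Z]+)(.+)", type_string)
    if match:
        return TRINO_SQL_TYPES_MAP[match.group(1)]  # KeyError for 'TIMESTAMP'
    return TRINO_SQL_TYPES_MAP[type_string]

print(resolve_trino_modified_type("timestamp(6) with time zone"))  # lowercase resolves
try:
    resolve_trino_modified_type("TIMESTAMP(6) WITH TIME ZONE")
except KeyError as e:
    print("KeyError:", e)  # uppercase crashes
```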
End-to-end (Airflow + dbt on Trino):
- Add a new dbt model with at least one uppercase column type in schema.yml, e.g. data_type: TIMESTAMP(6) WITH TIME ZONE.
- Run dbt compile so the new node lands in manifest.json.
- Do not run dbt build on it — the table is absent from the warehouse and therefore from catalog.json (you can also reproduce by
dropping an existing table).
- Run the dbt DataHub ingestion against the resulting artifacts.
- Observe the warning Node missing from catalog ... => model.. followed immediately by:
File ".../datahub/ingestion/source/dbt/dbt_common.py", line 1350, in get_column_type
TypeClass = resolve_sql_type(column_type, dbt_adapter)
File ".../datahub/ingestion/source/sql/sql_types.py", line 621, in resolve_sql_type
TypeClass = resolve_trino_modified_type(column_type)
File ".../datahub/ingestion/source/sql/sql_types.py", line 250, in resolve_trino_modified_type
return TRINO_SQL_TYPES_MAP[modified_type_base]
KeyError: 'TIMESTAMP'
Expected behavior
resolve_trino_modified_type() should normalise case before the dict lookup (mirroring the Snowflake branch in resolve_sql_type()
that already calls .upper()). Trino itself is case-insensitive for type names, and users do write VARCHAR, BIGINT, TIMESTAMP(6)
WITH TIME ZONE, etc. in dbt contracts — DataHub should accept them.
Additionally, a single unresolved column type should not terminate the whole ingestion pipeline. Downgrading to a warning (and
proceeding with the column typed as NullTypeClass / unknown) would make the ingestion resilient to the long tail of Trino type
strings that the map does not explicitly cover.
Proposed patch:
def resolve_trino_modified_type(type_string: str) -> Any:
    type_string = type_string.lower()
    # Match only the leading letters of the base type name. A two-group
    # pattern like ([a-zA-Z]+)(.+) would mis-split bare names such as
    # "varchar" (base "varcha"), since the second group must consume a char.
    match = re.match(r"[a-zA-Z]+", type_string)
    if match:
        return TRINO_SQL_TYPES_MAP[match.group(0)]
    return TRINO_SQL_TYPES_MAP[type_string]
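For the resilience point above, the lookup could also warn and fall back instead of raising. This is a sketch only: TRINO_SQL_TYPES_MAP, the NullType placeholder, and the logger below are illustrative stand-ins, not DataHub's actual objects (in DataHub the fallback would presumably be NullTypeClass):

```python
import logging
import re

logger = logging.getLogger(__name__)

# Illustrative stand-ins for DataHub's real map and unknown-type class.
TRINO_SQL_TYPES_MAP = {"varchar": "String", "bigint": "Number", "timestamp": "Time"}
NULL_TYPE = "NullType"

def resolve_trino_modified_type(type_string):
    type_string = type_string.lower()
    # Extract the leading base type name, e.g. "timestamp" from "timestamp(6) ...".
    match = re.match(r"[a-zA-Z]+", type_string)
    base = match.group(0) if match else type_string
    type_class = TRINO_SQL_TYPES_MAP.get(base)
    if type_class is None:
        # Unknown type: log and continue instead of aborting the whole pipeline.
        logger.warning("Unresolved Trino type %r; treating as unknown", type_string)
        return NULL_TYPE
    return type_class
```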
Screenshots
N/A — CLI / Airflow task failure; the full stack trace is in the "To Reproduce" section.
Desktop (please complete the following information):
- OS: Linux (Airflow worker on Linux-5.14.0-x86_64-with-glibc2.36); also reproducible on macOS with a local venv.
- Python: 3.12.12
- acryl-datahub: 1.4.0.9 (latest 1.4.x on PyPI at time of filing; code path unchanged on master)
- DataHub GMS: v1.5.0.1
- dbt-core: 1.11.8
- Adapter: dbt-trino against Trino
Additional context
The failure mode is easy to hit whenever a dbt project's manifest lists a model that hasn't been materialised yet in the
warehouse (and is therefore absent from catalog.json). DataHub's dbt source falls back to the raw data_type strings from the
manifest, and if any of those are uppercase, the whole ingestion pipeline dies on the first one.
A workaround is to lowercase every data_type value in schema.yml, but this is an unexpected foot-gun — contracts using the same
case convention as Trino DDL itself (uppercase, as in CREATE TABLE ... (col TIMESTAMP(6) WITH TIME ZONE)) shouldn't crash
ingestion.
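Until a fix lands, the lowercasing workaround can be automated against the artifacts instead of every schema.yml. This sketch assumes the usual dbt manifest layout (nodes → columns → data_type); whether sources or other sections also need the same treatment is not verified here:

```python
import json

def lowercase_manifest_types(in_path, out_path):
    # Rewrite a dbt manifest.json with every column data_type lowercased,
    # so the Trino type lookup finds its all-lowercase map keys.
    with open(in_path) as f:
        manifest = json.load(f)
    for node in manifest.get("nodes", {}).values():
        for column in node.get("columns", {}).values():
            if isinstance(column.get("data_type"), str):
                column["data_type"] = column["data_type"].lower()
    with open(out_path, "w") as f:
        json.dump(manifest, f)
```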
TRINO_SQL_TYPES_MAP keys for reference:
boolean, tinyint, smallint, int, integer, bigint, real, double, decimal, varchar, char, varbinary, date, time, timestamp, row,
map, array, json.