Trino dbt source: resolve_trino_modified_type crashes with KeyError on uppercase data_type strings #17078

@kzajaczkowski

Describe the bug

DataHub's dbt source crashes with an uncaught KeyError when the Trino type parser encounters an uppercase data_type string
from manifest.json. resolve_trino_modified_type() in metadata-ingestion/src/datahub/ingestion/source/sql/sql_types.py
regex-extracts the base type name and looks it up in TRINO_SQL_TYPES_MAP, whose keys are all lowercase (varchar, bigint,
timestamp, …). No normalisation is applied on the Trino path — even though the Snowflake path in the same resolve_sql_type()
function does call column_type.upper().
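
In miniature (an abbreviated, illustrative map — the real TRINO_SQL_TYPES_MAP lives in sql_types.py, but all of its keys are likewise lowercase):

import re

# Abbreviated stand-in for the real TRINO_SQL_TYPES_MAP; keys are lowercase.
TRINO_SQL_TYPES_MAP = {"timestamp": "TimeTypeClass", "varchar": "StringTypeClass"}

type_string = "TIMESTAMP(6) WITH TIME ZONE"
match = re.match(r"([a-zA-Z]+)(.+)", type_string)
base = match.group(1) if match else type_string  # 'TIMESTAMP' -- case preserved
TRINO_SQL_TYPES_MAP[base]  # KeyError: 'TIMESTAMP'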

The crash only surfaces on the manifest-fallback path, i.e. when a node is missing from catalog.json (new model not yet
materialised, or a dropped table). Catalogued nodes avoid it because their types come from Trino already lowercase. A single
unresolved column type aborts the entire ingestion pipeline.

To Reproduce

Minimal (no dbt/Airflow needed):

  1. pip install 'acryl-datahub[dbt]==1.4.0.9'
  2. Run:
    from datahub.ingestion.source.sql.sql_types import resolve_sql_type
    resolve_sql_type("TIMESTAMP(6) WITH TIME ZONE", platform="trino")
  3. Observe KeyError: 'TIMESTAMP'.
  4. Compare with the lowercase form, which returns TimeTypeClass() correctly:
    resolve_sql_type("timestamp(6) with time zone", platform="trino")
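
Combined into one runnable snippet (same two calls as above):

from datahub.ingestion.source.sql.sql_types import resolve_sql_type

# Lowercase form resolves correctly:
print(resolve_sql_type("timestamp(6) with time zone", platform="trino"))
# Uppercase form raises KeyError: 'TIMESTAMP':
print(resolve_sql_type("TIMESTAMP(6) WITH TIME ZONE", platform="trino"))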

End-to-end (Airflow + dbt on Trino):

  1. Add a new dbt model with at least one uppercase column type in schema.yml, e.g. data_type: TIMESTAMP(6) WITH TIME ZONE.
  2. Run dbt compile so the new node lands in manifest.json.
  3. Do not run dbt build on it — the table is absent from the warehouse and therefore from catalog.json (you can also reproduce by
    dropping an existing table).
  4. Run the dbt DataHub ingestion against the resulting artifacts.
  5. Observe the warning Node missing from catalog ... => model.., followed immediately by:
    File ".../datahub/ingestion/source/dbt/dbt_common.py", line 1350, in get_column_type
      TypeClass = resolve_sql_type(column_type, dbt_adapter)
    File ".../datahub/ingestion/source/sql/sql_types.py", line 621, in resolve_sql_type
      TypeClass = resolve_trino_modified_type(column_type)
    File ".../datahub/ingestion/source/sql/sql_types.py", line 250, in resolve_trino_modified_type
      return TRINO_SQL_TYPES_MAP[modified_type_base]
    KeyError: 'TIMESTAMP'

Expected behavior

resolve_trino_modified_type() should normalise case before the dict lookup (mirroring the Snowflake branch in resolve_sql_type()
that already calls .upper()). Trino itself is case-insensitive for type names, and users do write VARCHAR, BIGINT, TIMESTAMP(6)
WITH TIME ZONE, etc. in dbt contracts — DataHub should accept them.

Additionally, a single unresolved column type should not terminate the whole ingestion pipeline. Downgrading to a warning (and
proceeding with the column typed as NullTypeClass / unknown) would make the ingestion resilient to the long tail of Trino type
strings that the map does not explicitly cover.

Proposed patch:

def resolve_trino_modified_type(type_string: str) -> Any:
    # Normalise case first; all TRINO_SQL_TYPES_MAP keys are lowercase.
    type_string = type_string.lower()
    match = re.match(r"([a-zA-Z]+)(.+)", type_string)
    if match:
        return TRINO_SQL_TYPES_MAP[match.group(1)]
    return TRINO_SQL_TYPES_MAP[type_string]
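
And a hedged sketch of the warning-downgrade suggestion on top of that (NullTypeClass comes from datahub.metadata.schema_classes; the logging here is illustrative, not the exact reporting mechanism DataHub would use):

import logging
import re
from typing import Any

from datahub.metadata.schema_classes import NullTypeClass

logger = logging.getLogger(__name__)

def resolve_trino_modified_type(type_string: str) -> Any:
    # TRINO_SQL_TYPES_MAP as defined in sql_types.py.
    type_string = type_string.lower()
    match = re.match(r"([a-zA-Z]+)(.+)", type_string)
    base = match.group(1) if match else type_string
    try:
        return TRINO_SQL_TYPES_MAP[base]
    except KeyError:
        # Unknown Trino type string: warn and treat the column as unknown
        # instead of aborting the whole ingestion run.
        logger.warning("Could not resolve Trino type %r; treating as unknown", type_string)
        return NullTypeClass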

Screenshots

N/A — CLI / Airflow task failure; the full stack trace is in the "To Reproduce" section.

Desktop (please complete the following information):

  • OS: Linux (Airflow worker on Linux-5.14.0-x86_64-with-glibc2.36); also reproducible on macOS with a local venv.
  • Python: 3.12.12
  • acryl-datahub: 1.4.0.9 (latest 1.4.x on PyPI at time of filing; code path unchanged on master)
  • DataHub GMS: v1.5.0.1
  • dbt-core: 1.11.8
  • Adapter: dbt-trino against Trino

Additional context

The failure mode is easy to hit whenever a dbt project's manifest lists a model that hasn't been materialised yet in the
warehouse (and is therefore absent from catalog.json). DataHub's dbt source falls back to the raw data_type strings from the
manifest, and if any of those are uppercase, the whole ingestion pipeline dies on the first one.

A workaround is to lowercase every data_type value in schema.yml, but this is an unexpected foot-gun — contracts using the same
case convention as Trino DDL itself (uppercase, as in CREATE TABLE ... (col TIMESTAMP(6) WITH TIME ZONE)) shouldn't crash
ingestion.
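
For anyone stuck on an affected version, a throwaway sketch of that workaround (assumes PyYAML and a single schema.yml; it rewrites the file in place, so keep a backup):

import yaml  # PyYAML

SCHEMA_PATH = "models/schema.yml"  # adjust to your dbt project layout

with open(SCHEMA_PATH) as f:
    doc = yaml.safe_load(f)

# Lowercase every column's data_type so the Trino type map lookup matches.
for model in doc.get("models", []):
    for col in model.get("columns", []):
        if "data_type" in col:
            col["data_type"] = col["data_type"].lower()

with open(SCHEMA_PATH, "w") as f:
    yaml.safe_dump(doc, f, sort_keys=False)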

TRINO_SQL_TYPES_MAP keys for reference:
boolean, tinyint, smallint, int, integer, bigint, real, double, decimal, varchar, char, varbinary, date, time, timestamp, row,
map, array, json.
