[gfx1150] RT-DETR fp16 model: wrong GPU output (label flips) on develop; non-MLIR path aborts with INVALID_ISA #4996

Description

@ycastill2-amd

File at: https://github.com/ROCm/AMDMIGraphX/issues/new

Summary

On gfx1150, an fp16 RT-DETR-style object-detection model produces incorrect results with the GPU target: migraphx-driver verify --gpu fails on every output, and the integer class-label output flips to wrong classes. MLIR is the only code path that runs to completion on gfx1150; disabling MLIR aborts at runtime with HSA_STATUS_ERROR_INVALID_ISA, so the divergence cannot be A/B-isolated against the non-MLIR path on this device.

Environment

MIGraphX: develop @ f54ca35 (2.16.0)
GPU: gfx1150 (AMD Ryzen AI 9 HX PRO 370 / Radeon 890M, Strix Point iGPU)
ROCm: 6.4.4-129
OS: Ubuntu 24.04.3 LTS, kernel 6.14.0-29-generic
Model: RT-DETR-style object detector, fp16, 512x512 input (3 outputs: scores {1,150}, boxes {1,150,4}, labels {1,150})

Repro

migraphx-driver verify object-detector-fast-512p-fp16.onnx --onnx --gpu

(Prerequisite: develop currently aborts earlier in simplify_reshapes for this model; see the note at the bottom. The results below were obtained with that compile-time abort worked around so that the model reaches execution.)

Actual result (MLIR enabled, default)

[ERROR] verify_args.cpp:56 FAILED: object-detector-fast-512p-fp16.onnx
[ERROR] verify_args.cpp:57 RMS Error: 0.336279
[ERROR] verify_args.cpp:68 Max diff: 74
[ERROR] verify_args.cpp:73 Mismatch at 2: 11 != 2          # labels output: GPU=11, ref=2

[ERROR] verify_args.cpp:56 FAILED: object-detector-fast-512p-fp16.onnx
[ERROR] verify_args.cpp:57 RMS Error: 0.312404
[ERROR] verify_args.cpp:68 Max diff: 0.480432
[ERROR] verify_args.cpp:73 Mismatch at 0: -0.229004 != -0.225708   # float output

The float outputs are mostly close (max abs diff ~0.48), but the deviations are large enough to flip the class of low-confidence detections in the integer label output.
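To illustrate how a small float deviation can change an integer label, here is a minimal sketch. It is not the model's actual post-processing: it assumes the label is an argmax over per-class scores, and the score values and the perturbation are made up for illustration.

```python
def argmax(scores):
    """Return the index of the largest score (first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Hypothetical per-class scores for one low-confidence detection (not from the model).
ref_scores = [0.10, 0.12, 0.31, 0.30]

# Same scores with a small error on the last class, far below the
# reported max abs diff of ~0.48.
gpu_scores = [0.10, 0.12, 0.31, 0.30 + 0.02]

ref_label = argmax(ref_scores)
gpu_label = argmax(gpu_scores)
print("ref label:", ref_label, "gpu label:", gpu_label)
```

When the top two classes are nearly tied, an error much smaller than the observed max diff is enough to change the winner, which would be consistent with the label mismatches reported by verify.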

What was ruled out

  1. adjust_allocation "output buffer doesn't match" warnings are benign. During compile these fire for several gpu::precompile_op instructions (pooling, MLIR convolutions, channelwise_conv). Instrumenting src/adjust_allocation.cpp shows that in every case the alias is a slice with identical lens, identical meaningful strides, and identical byte size; the two shapes differ only in the stride of a size-1 outer dimension, e.g.

    alias_shape = half_type, {1,16,256,256}, {2097152,65536,256,1}
    ins_shape   = half_type, {1,16,256,256}, {1048576,65536,256,1}
    alias_bytes = 2097152   ins_bytes = 2097152
    

    These are legitimate "write into a concat slice" aliases; the pass correctly leaves them alone. They are not the cause of the wrong output. (Minor: the shape-equality check / warning could ignore size-1-dim strides to avoid the false positive.)

  2. Non-MLIR path is unusable on gfx1150. With MIGRAPHX_DISABLE_MLIR=1 the program aborts at runtime:

    :0:rocdevice.cpp :2992: Callback: Queue aborting with error : HSA_STATUS_ERROR_INVALID_ISA: The instruction set architecture is invalid. code: 0x100f
    

    So on gfx1150 the divergence cannot be compared against the non-MLIR fallback.
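For item 1 above, the claim that the two shapes address the same memory can be checked with a short sketch. The helper names `element_space` and `same_layout` are hypothetical and not part of MIGraphX; the byte size is computed as the span of a strided tensor times 2 bytes per half, which is an assumption about how the instrumented values were obtained, although it reproduces them.

```python
def element_space(lens, strides):
    """Elements spanned by a strided shape: 1 + sum((len - 1) * stride)."""
    return 1 + sum((l - 1) * s for l, s in zip(lens, strides))

def same_layout(lens_a, strides_a, lens_b, strides_b):
    """Compare two shapes, ignoring the stride of any dimension of length 1."""
    if list(lens_a) != list(lens_b):
        return False
    return all(
        sa == sb
        for l, sa, sb in zip(lens_a, strides_a, strides_b)
        if l != 1  # a size-1 dimension is never stepped over, so its stride is irrelevant
    )

HALF_BYTES = 2  # sizeof(half)
lens = [1, 16, 256, 256]
alias_strides = [2097152, 65536, 256, 1]
ins_strides = [1048576, 65536, 256, 1]

alias_bytes = element_space(lens, alias_strides) * HALF_BYTES
ins_bytes = element_space(lens, ins_strides) * HALF_BYTES
print(alias_bytes, ins_bytes, same_layout(lens, alias_strides, lens, ins_strides))
```

Both shapes span 2097152 bytes and compare equal once the size-1 dimension is ignored, which is the kind of relaxed comparison the "Minor" remark in item 1 suggests for the warning.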

Likely area

The remaining divergence is consistent with an fp16 codegen/accuracy issue in the MLIR-generated convolutions on gfx1150 (rather than a buffer/aliasing or shape bug in MIGraphX core). Pointers for triage: confirm whether the gfx1150 non-MLIR kernels are expected to be built (the INVALID_ISA suggests they are not), and check rocMLIR fp16 conv accuracy on gfx1150 vs gfx1100/gfx1101.
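On a device where both paths run (for example the gfx1100/gfx1101 comparison suggested above), the A/B check could be scripted roughly as follows. This is a sketch, not a tested harness: the command and the MIGRAPHX_DISABLE_MLIR variable are taken from this report, while `build_env` and `run_ab` are hypothetical helper names. Whether the driver's exit code reflects a verification failure is not confirmed here, so the captured output should be inspected as well.

```python
import os
import shutil
import subprocess

MODEL = "object-detector-fast-512p-fp16.onnx"
CMD = ["migraphx-driver", "verify", MODEL, "--onnx", "--gpu"]

def build_env(disable_mlir):
    """Copy the current environment, setting or clearing MIGRAPHX_DISABLE_MLIR."""
    env = dict(os.environ)
    if disable_mlir:
        env["MIGRAPHX_DISABLE_MLIR"] = "1"
    else:
        env.pop("MIGRAPHX_DISABLE_MLIR", None)
    return env

def run_ab():
    """Run verify with and without MLIR; return {name: (exit code, output)}."""
    results = {}
    for name, disable in (("mlir", False), ("no_mlir", True)):
        proc = subprocess.run(
            CMD, env=build_env(disable), capture_output=True, text=True
        )
        results[name] = (proc.returncode, proc.stdout + proc.stderr)
    return results

if __name__ == "__main__":
    if shutil.which("migraphx-driver") and os.path.exists(MODEL):
        for name, (code, output) in run_ab().items():
            print(name, "exit code:", code)
            print(output)
    else:
        print("migraphx-driver or model not found; nothing run")
```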

Note: separate simplify_reshapes compile-time abort

This model also hits a separate simplify_reshapes abort on develop (find_reshape_dot builds an invalid reshape: "Reshape: Wrong number of elements ... 524288 ... 32768"). A fix is proposed in branch ycastill2-amd:fix-reshape-dot-element-count; it is required just to reach execution for this report.
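The invariant behind that error message is that a reshape must preserve the total element count. A minimal sketch of the check follows; the dimensions are hypothetical, chosen only to reproduce the two element counts quoted in the error, and are not the model's actual shapes.

```python
from math import prod

def can_reshape(src_dims, dst_dims):
    """A reshape is only valid if both shapes hold the same number of elements."""
    return prod(src_dims) == prod(dst_dims)

# Hypothetical dims that give the element counts from the error message.
src = [2, 512, 512]  # 524288 elements
dst = [2, 128, 128]  # 32768 elements

print(prod(src), prod(dst), can_reshape(src, dst))
```

The two counts differ by a factor of 16, so the reshape built by find_reshape_dot drops or miscounts at least one dimension rather than being off by a small amount.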
