
bfloat16 GEMM is noticeably slower than float16 #2033

@BBC-Esq

Description

On an RTX 4090 (compute capability 8.9) running bert-base-uncased for masked-LM scoring with the prebuilt ctranslate2==4.7.1 wheel, bf16 inference is consistently slower than fp16 on the same inputs:

  • fp16, batch 64: ~34 ms per 30-sentence call
  • bf16, batch 64: ~42 ms per 30-sentence call (~24% slower)

The same workload run through PyTorch's BertForMaskedLM directly shows fp16 and bf16 within rounding error of each other (~32 ms each), so the gap looks specific to how CT2 invokes the bf16 path rather than anything inherent to bf16 on Ada.

Where it seems to come from

Reading src/cuda/primitives.cu, there's a noticeable asymmetry between the two GEMM specializations:

  • fp16 GEMM (around line 522) defaults to CUBLAS_COMPUTE_16F accumulation, with an opt-out via cuda::use_true_fp16_gemm() / the CT2_CUDA_TRUE_FP16_GEMM env var.
  • bf16 GEMM (around line 565) hardcodes CUBLAS_COMPUTE_32F, with no equivalent flag for CUBLAS_COMPUTE_16BF.

So bf16 never gets the half-precision Tensor Core accumulator path that fp16 has by default.

On top of that, both paths call cublasGemmEx with CUBLAS_GEMM_DEFAULT rather than going through cuBLASLt (cublasLtMatmul + cublasLtMatmulAlgoGetHeuristic). PyTorch's default on the bf16 path is bgemm_internal_cublaslt in aten/src/ATen/cuda/CUDABlas.cpp. Worth noting that PyTorch also uses CUBLAS_COMPUTE_32F for both fp16 and bf16 by default, so the parity they get isn't coming from a different accumulator — it appears to come from cuBLASLt's algorithm selection plus reduction-scheme tuning (CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK), which the legacy CUBLAS_GEMM_DEFAULT picker doesn't apply as aggressively for bf16.
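To make the asymmetry concrete, here's a sketch of the two `cublasGemmEx` call shapes described above. This is not the literal `primitives.cu` source; variable names and the helper follow CT2's naming for illustration, and the alpha/beta scale-type switch that `CUBLAS_COMPUTE_16F` requires is elided:

```cpp
// fp16 path: accumulator is selectable, defaulting to 16-bit.
// (With CUBLAS_COMPUTE_16F, the alpha/beta scale type must also be
// CUDA_R_16F; that bookkeeping is elided from this sketch.)
cublasComputeType_t fp16_compute =
    cuda::use_true_fp16_gemm() ? CUBLAS_COMPUTE_16F : CUBLAS_COMPUTE_32F;
cublasGemmEx(handle, transa, transb, m, n, k,
             alpha, a, CUDA_R_16F, lda,
             b, CUDA_R_16F, ldb,
             beta, c, CUDA_R_16F, ldc,
             fp16_compute, CUBLAS_GEMM_DEFAULT);

// bf16 path: accumulator is hardcoded to 32-bit, with no
// CUBLAS_COMPUTE_16BF opt-in, and the same legacy CUBLAS_GEMM_DEFAULT
// algorithm picker rather than a cuBLASLt heuristic query.
cublasGemmEx(handle, transa, transb, m, n, k,
             alpha, a, CUDA_R_16BF, lda,
             b, CUDA_R_16BF, ldb,
             beta, c, CUDA_R_16BF, ldc,
             CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
```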

What might help

A couple of options I can see, ranked by effort:

Smaller fix — add a CT2_CUDA_TRUE_BF16_GEMM env flag

  • Mirrors the existing fp16 toggle
  • Switches the bf16 path to CUBLAS_COMPUTE_16BF accumulation
  • A few-line change in primitives.cu plus the toggle wiring
  • Default off, since bf16 accumulators can be numerically risky for some transformer shapes
  • Roughly doubles bf16 Tensor Core throughput on Ampere+ when enabled

Larger fix — migrate the bf16 GEMM (or all of them) to cuBLASLt

  • Pulls bf16 perf in line with what PyTorch users see, even at the default CUBLAS_COMPUTE_32F
  • Requires the usual cuBLASLt plumbing: matmul / layout / preference descriptors, algorithm-heuristic caching, an additional dlopen target alongside cublas_stub.cc
  • Workspace allocation could route through the existing Allocator interface that src/ops/conv1d_cudnn_gpu.cu already uses for cuDNN, so that part wouldn't be net-new infrastructure
  • Bigger refactor than the env flag, but it's the change with the broadest benefit, since it would also improve fp16 and fp32 algorithm selection
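For reference, the cuBLASLt call sequence the migration would need looks roughly like the sketch below, for a bf16 GEMM with fp32 accumulation. Error handling, descriptor/heuristic caching, and the reduction-scheme preference are elided, and `workspace`/`stream` are assumed to come from the caller (e.g. via the Allocator route mentioned above):

```cpp
#include <cublasLt.h>

// Hedged sketch, not a drop-in implementation: column-major layouts,
// no transposes, and a single heuristic candidate for simplicity.
void bf16_matmul_lt(cublasLtHandle_t lt, cudaStream_t stream,
                    int m, int n, int k,
                    const void* a, int lda, const void* b, int ldb,
                    void* c, int ldc,
                    void* workspace, size_t workspace_size) {
  const float alpha = 1.f, beta = 0.f;

  cublasLtMatmulDesc_t op;
  cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);

  cublasLtMatrixLayout_t a_desc, b_desc, c_desc;
  cublasLtMatrixLayoutCreate(&a_desc, CUDA_R_16BF, m, k, lda);
  cublasLtMatrixLayoutCreate(&b_desc, CUDA_R_16BF, k, n, ldb);
  cublasLtMatrixLayoutCreate(&c_desc, CUDA_R_16BF, m, n, ldc);

  // This is the piece the legacy cublasGemmEx path lacks: an explicit
  // heuristic query constrained by workspace, where a
  // CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK preference could also go.
  cublasLtMatmulPreference_t pref;
  cublasLtMatmulPreferenceCreate(&pref);
  cublasLtMatmulPreferenceSetAttribute(
      pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
      &workspace_size, sizeof(workspace_size));

  cublasLtMatmulHeuristicResult_t heuristic;
  int found = 0;
  cublasLtMatmulAlgoGetHeuristic(lt, op, a_desc, b_desc, c_desc, c_desc,
                                 pref, /*requestedAlgoCount=*/1,
                                 &heuristic, &found);

  cublasLtMatmul(lt, op, &alpha, a, a_desc, b, b_desc,
                 &beta, c, c_desc, c, c_desc,
                 &heuristic.algo, workspace, workspace_size, stream);

  cublasLtMatmulPreferenceDestroy(pref);
  cublasLtMatrixLayoutDestroy(c_desc);
  cublasLtMatrixLayoutDestroy(b_desc);
  cublasLtMatrixLayoutDestroy(a_desc);
  cublasLtMatmulDescDestroy(op);
}
```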

Happy to share the benchmark harness used to produce the numbers above if it'd help with reproduction.
