On an RTX 4090 (compute capability 8.9) running bert-base-uncased for masked-LM scoring with the prebuilt ctranslate2==4.7.1 wheel, bf16 inference is consistently slower than fp16 on the same inputs:
- fp16, batch 64: ~34 ms per 30-sentence call
- bf16, batch 64: ~42 ms per 30-sentence call (~24% slower)
The same workload run through PyTorch's BertForMaskedLM directly shows fp16 and bf16 within rounding error of each other (~32 ms each), so the gap looks specific to how CT2 invokes the bf16 path rather than anything inherent to bf16 on Ada.
Where it seems to come from
Reading src/cuda/primitives.cu, there's a noticeable asymmetry between the two GEMM specializations:
- fp16 GEMM (around line 522) defaults to CUBLAS_COMPUTE_16F accumulation, with an opt-out via cuda::use_true_fp16_gemm() / the CT2_CUDA_TRUE_FP16_GEMM env var.
- bf16 GEMM (around line 565) hardcodes CUBLAS_COMPUTE_32F, with no equivalent flag for CUBLAS_COMPUTE_16BF.
So bf16 never gets the half-precision Tensor Core accumulator path that fp16 has by default.
On top of that, both paths call cublasGemmEx with CUBLAS_GEMM_DEFAULT rather than going through cuBLASLt (cublasLtMatmul + cublasLtMatmulAlgoGetHeuristic). PyTorch's default on the bf16 path is bgemm_internal_cublaslt in aten/src/ATen/cuda/CUDABlas.cpp. Worth noting that PyTorch also uses CUBLAS_COMPUTE_32F for both fp16 and bf16 by default, so the parity they get isn't coming from a different accumulator — it appears to come from cuBLASLt's algorithm selection plus reduction-scheme tuning (CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK), which the legacy CUBLAS_GEMM_DEFAULT picker doesn't apply as aggressively for bf16.
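For concreteness, this is roughly the call shape being described on the bf16 side (a hedged paraphrase, not the actual primitives.cu code): bf16 operands, the hardcoded CUBLAS_COMPUTE_32F accumulator, and the legacy CUBLAS_GEMM_DEFAULT picker. The function name, operand order, and leading dimensions below are illustrative only.

```cpp
// Rough paraphrase of the bf16 GEMM call shape described above; not CTranslate2
// source. Column-major C = A * B with unpadded leading dimensions.
#include <cublas_v2.h>
#include <cuda_bf16.h>

cublasStatus_t bf16_gemm_legacy(cublasHandle_t handle,
                                const __nv_bfloat16* a, const __nv_bfloat16* b,
                                __nv_bfloat16* c,
                                int m, int n, int k) {
  const float alpha = 1.f;
  const float beta = 0.f;
  return cublasGemmEx(handle,
                      CUBLAS_OP_N, CUBLAS_OP_N,
                      m, n, k,
                      &alpha,
                      a, CUDA_R_16BF, m,
                      b, CUDA_R_16BF, k,
                      &beta,
                      c, CUDA_R_16BF, m,
                      CUBLAS_COMPUTE_32F,    // hardcoded fp32 accumulation
                      CUBLAS_GEMM_DEFAULT);  // legacy algorithm pick, no Lt heuristic
}
```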
What might help
A couple of options I can see, ranked by effort:
Smaller fix — add a CT2_CUDA_TRUE_BF16_GEMM env flag
- Mirrors the existing fp16 toggle
- Switches the bf16 path to CUBLAS_COMPUTE_16BF accumulation
- A few-line change in primitives.cu plus the toggle wiring (rough sketch after this list)
- Default off, since bf16 accumulators can be numerically risky for some transformer shapes
- Roughly doubles bf16 Tensor Core throughput on Ampere+ when enabled
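A rough sketch of what the toggle wiring could look like, mirroring how the existing fp16 flag is read. The helper name, namespace, and placement are assumptions rather than existing CTranslate2 identifiers; the bf16 GEMM specialization would then consult this flag when choosing its cublasComputeType_t instead of hardcoding CUBLAS_COMPUTE_32F.

```cpp
// Hypothetical toggle mirroring CT2_CUDA_TRUE_FP16_GEMM; names are assumptions,
// not existing CTranslate2 API.
#include <cstdlib>
#include <cstring>

namespace ctranslate2 {
namespace cuda {

bool use_true_bf16_gemm() {
  // Read CT2_CUDA_TRUE_BF16_GEMM once and cache the result. Default is off,
  // keeping the numerically safer fp32 accumulation unless explicitly enabled.
  static const bool enabled = [] {
    const char* value = std::getenv("CT2_CUDA_TRUE_BF16_GEMM");
    return value != nullptr && std::strcmp(value, "0") != 0;
  }();
  return enabled;
}

}  // namespace cuda
}  // namespace ctranslate2
```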
Larger fix — migrate the bf16 GEMM (or all of them) to cuBLASLt
- Pulls bf16 perf in line with what PyTorch users see, even at the default CUBLAS_COMPUTE_32F
- Requires the usual cuBLASLt plumbing: matmul / layout / preference descriptors, algorithm-heuristic caching, and an additional dlopen target alongside cublas_stub.cc (see the sketch after this list)
- Workspace allocation could route through the existing Allocator interface that src/ops/conv1d_cudnn_gpu.cu already uses for cuDNN, so that part wouldn't be net-new infrastructure
- Bigger refactor than the env flag, but the change with the broadest benefit since it would also help fp16 and fp32 algorithm selection
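To make the scope of the larger change concrete, here is a minimal sketch of the cuBLASLt path for a single unbatched bf16 GEMM with fp32 accumulation, assuming column-major C = A * B. None of the names below are existing CTranslate2 identifiers; descriptor/heuristic caching, transposes, batching, error handling, and the Allocator-backed workspace are all omitted, with the workspace fixed at zero bytes for brevity.

```cpp
// Minimal cuBLASLt sketch: bf16 inputs/outputs, CUBLAS_COMPUTE_32F accumulation,
// heuristic-driven algorithm selection. Illustrative only, not CTranslate2 code.
#include <cublasLt.h>
#include <cuda_bf16.h>

void bf16_gemm_lt(cublasLtHandle_t handle,
                  const __nv_bfloat16* a,   // m x k, column-major
                  const __nv_bfloat16* b,   // k x n, column-major
                  __nv_bfloat16* c,         // m x n, column-major
                  int m, int n, int k,
                  cudaStream_t stream) {
  const float alpha = 1.f;
  const float beta = 0.f;

  // Operation descriptor: fp32 accumulation and fp32 alpha/beta scaling.
  cublasLtMatmulDesc_t op_desc;
  cublasLtMatmulDescCreate(&op_desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);

  // Matrix layouts: bf16 inputs and output, unpadded leading dimensions.
  cublasLtMatrixLayout_t a_desc, b_desc, c_desc;
  cublasLtMatrixLayoutCreate(&a_desc, CUDA_R_16BF, m, k, m);
  cublasLtMatrixLayoutCreate(&b_desc, CUDA_R_16BF, k, n, k);
  cublasLtMatrixLayoutCreate(&c_desc, CUDA_R_16BF, m, n, m);

  // Preference: cap the workspace (0 bytes here). A real integration could also
  // set CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK to steer reduction schemes.
  cublasLtMatmulPreference_t pref;
  cublasLtMatmulPreferenceCreate(&pref);
  const uint64_t workspace_size = 0;
  cublasLtMatmulPreferenceSetAttribute(pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
                                       &workspace_size, sizeof(workspace_size));

  // Ask the heuristic for the best algorithm for this problem shape.
  cublasLtMatmulHeuristicResult_t heuristic;
  int returned = 0;
  cublasLtMatmulAlgoGetHeuristic(handle, op_desc, a_desc, b_desc, c_desc, c_desc,
                                 pref, 1, &heuristic, &returned);

  if (returned > 0)
    cublasLtMatmul(handle, op_desc, &alpha, a, a_desc, b, b_desc, &beta,
                   c, c_desc, c, c_desc, &heuristic.algo,
                   /*workspace=*/nullptr, workspace_size, stream);

  cublasLtMatmulPreferenceDestroy(pref);
  cublasLtMatrixLayoutDestroy(c_desc);
  cublasLtMatrixLayoutDestroy(b_desc);
  cublasLtMatrixLayoutDestroy(a_desc);
  cublasLtMatmulDescDestroy(op_desc);
}
```

In a real integration the descriptors and heuristic results would be cached per problem shape, and the workspace buffer would come from the Allocator as noted above.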
Happy to share the benchmark harness used to produce the numbers above if it'd help with reproduction.