
bfloat16 GEMM is noticeably slower than float16 #2033

@BBC-Esq

Description

On an RTX 4090 (compute capability 8.9) running bert-base-uncased for masked-LM scoring with the prebuilt ctranslate2==4.7.1 wheel, bf16 inference is consistently slower than fp16 on the same inputs:

  • fp16, batch 64: ~34 ms per 30-sentence call
  • bf16, batch 64: ~42 ms per 30-sentence call (~24% slower)

The same workload run through PyTorch's BertForMaskedLM directly shows fp16 and bf16 within rounding error of each other (~32 ms each), so the gap looks specific to how CT2 invokes the bf16 path rather than anything inherent to bf16 on Ada.

Where it seems to come from

Reading src/cuda/primitives.cu, there's a noticeable asymmetry between the two GEMM specializations:

  • fp16 GEMM (around line 522) defaults to CUBLAS_COMPUTE_16F accumulation, with an opt-out via cuda::use_true_fp16_gemm() / the CT2_CUDA_TRUE_FP16_GEMM env var.
  • bf16 GEMM (around line 565) hardcodes CUBLAS_COMPUTE_32F, with no equivalent flag for CUBLAS_COMPUTE_16BF.

So bf16 never gets the half-precision Tensor Core accumulator path that fp16 has by default.

On top of that, both paths call cublasGemmEx with CUBLAS_GEMM_DEFAULT rather than going through cuBLASLt (cublasLtMatmul + cublasLtMatmulAlgoGetHeuristic). PyTorch's default on the bf16 path is bgemm_internal_cublaslt in aten/src/ATen/cuda/CUDABlas.cpp. Worth noting that PyTorch also uses CUBLAS_COMPUTE_32F for both fp16 and bf16 by default, so the parity they get isn't coming from a different accumulator — it appears to come from cuBLASLt's algorithm selection plus reduction-scheme tuning (CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK), which the legacy CUBLAS_GEMM_DEFAULT picker doesn't apply as aggressively for bf16.
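To make the asymmetry concrete, here's a sketch of the two `cublasGemmEx` call shapes described above. This is not the literal `primitives.cu` source; variable names and the helper follow CT2's naming for illustration, and the alpha/beta scale-type switch that `CUBLAS_COMPUTE_16F` requires is elided:

```cpp
// fp16 path: accumulator is selectable, defaulting to 16-bit.
// (With CUBLAS_COMPUTE_16F, the alpha/beta scale type must also be
// CUDA_R_16F; that bookkeeping is elided from this sketch.)
cublasComputeType_t fp16_compute =
    cuda::use_true_fp16_gemm() ? CUBLAS_COMPUTE_16F : CUBLAS_COMPUTE_32F;
cublasGemmEx(handle, transa, transb, m, n, k,
             alpha, a, CUDA_R_16F, lda,
             b, CUDA_R_16F, ldb,
             beta, c, CUDA_R_16F, ldc,
             fp16_compute, CUBLAS_GEMM_DEFAULT);

// bf16 path: accumulator is hardcoded to 32-bit, with no
// CUBLAS_COMPUTE_16BF opt-in, and the same legacy CUBLAS_GEMM_DEFAULT
// algorithm picker rather than a cuBLASLt heuristic query.
cublasGemmEx(handle, transa, transb, m, n, k,
             alpha, a, CUDA_R_16BF, lda,
             b, CUDA_R_16BF, ldb,
             beta, c, CUDA_R_16BF, ldc,
             CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
```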

What might help

A couple of options I can see, ranked by effort:

Smaller fix — add a CT2_CUDA_TRUE_BF16_GEMM env flag

  • Mirrors the existing fp16 toggle
  • Switches the bf16 path to CUBLAS_COMPUTE_16BF accumulation
  • A few-line change in primitives.cu plus the toggle wiring
  • Default off, since bf16 accumulators can be numerically risky for some transformer shapes
  • Roughly doubles bf16 Tensor Core throughput on Ampere+ when enabled

Larger fix — migrate the bf16 GEMM (or all of them) to cuBLASLt

  • Pulls bf16 perf in line with what PyTorch users see, even at the default CUBLAS_COMPUTE_32F
  • Requires the usual cuBLASLt plumbing: matmul / layout / preference descriptors, algorithm-heuristic caching, an additional dlopen target alongside cublas_stub.cc
  • Workspace allocation could route through the existing Allocator interface that src/ops/conv1d_cudnn_gpu.cu already uses for cuDNN, so that part wouldn't be net-new infrastructure
  • Bigger refactor than the env flag, but it's the change with the broadest benefit, since it would also improve fp16 and fp32 algorithm selection
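For reference, the cuBLASLt call sequence the migration would need looks roughly like the sketch below, for a bf16 GEMM with fp32 accumulation. Error handling, descriptor/heuristic caching, and the reduction-scheme preference are elided, and `workspace`/`stream` are assumed to come from the caller (e.g. via the Allocator route mentioned above):

```cpp
#include <cublasLt.h>

// Hedged sketch, not a drop-in implementation: column-major layouts,
// no transposes, and a single heuristic candidate for simplicity.
void bf16_matmul_lt(cublasLtHandle_t lt, cudaStream_t stream,
                    int m, int n, int k,
                    const void* a, int lda, const void* b, int ldb,
                    void* c, int ldc,
                    void* workspace, size_t workspace_size) {
  const float alpha = 1.f, beta = 0.f;

  cublasLtMatmulDesc_t op;
  cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);

  cublasLtMatrixLayout_t a_desc, b_desc, c_desc;
  cublasLtMatrixLayoutCreate(&a_desc, CUDA_R_16BF, m, k, lda);
  cublasLtMatrixLayoutCreate(&b_desc, CUDA_R_16BF, k, n, ldb);
  cublasLtMatrixLayoutCreate(&c_desc, CUDA_R_16BF, m, n, ldc);

  // This is the piece the legacy cublasGemmEx path lacks: an explicit
  // heuristic query constrained by workspace, where a
  // CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK preference could also go.
  cublasLtMatmulPreference_t pref;
  cublasLtMatmulPreferenceCreate(&pref);
  cublasLtMatmulPreferenceSetAttribute(
      pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
      &workspace_size, sizeof(workspace_size));

  cublasLtMatmulHeuristicResult_t heuristic;
  int found = 0;
  cublasLtMatmulAlgoGetHeuristic(lt, op, a_desc, b_desc, c_desc, c_desc,
                                 pref, /*requestedAlgoCount=*/1,
                                 &heuristic, &found);

  cublasLtMatmul(lt, op, &alpha, a, a_desc, b, b_desc,
                 &beta, c, c_desc, c, c_desc,
                 &heuristic.algo, workspace, workspace_size, stream);

  cublasLtMatmulPreferenceDestroy(pref);
  cublasLtMatrixLayoutDestroy(c_desc);
  cublasLtMatrixLayoutDestroy(b_desc);
  cublasLtMatrixLayoutDestroy(a_desc);
  cublasLtMatmulDescDestroy(op);
}
```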

Happy to share the benchmark harness used to produce the numbers above if it'd help with reproduction.
