Memory Copies from Device Significantly Slower than CUDA #420

Thank you for providing this library. I have found it immensely helpful when using Vulkan compute shaders.

I am profiling my application and found that copying data back to the CPU takes significantly more time than in CUDA. Specifically, this line accounts for 95% of my run time:
```c++
std::vector<std::pair<uint32_t, uint32_t>> candidates = 
    t_candidates->vector<std::pair<uint32_t, uint32_t>>();
```
where `t_candidates` is created as
```c++
auto t_candidates = mgr->tensor(
    candidates_size, 2 * sizeof(uint32_t),
    kp::Memory::DataTypes::eCustom, kp::Memory::MemoryTypes::eDevice);
```

I tried to reproduce the effect in a minimal example using the Python bindings for Kompute and PyCUDA, and I have attached a PDF of the notebook. Copying 51 MiB from the GPU to the CPU takes `7.3 ms ± 399 µs` with CUDA but `292 ms ± 160 µs` with Kompute.
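To put those timings in terms of bandwidth: they work out to roughly 6.8 GiB/s for CUDA versus roughly 0.17 GiB/s for Kompute, a factor of about 40. The arithmetic is sketched below. The helper names `throughput_gib_per_s` and `slowdown` are made up for illustration and are not part of Kompute or CUDA, and the inputs are only the mean timings quoted above.

```c++
// Hypothetical helpers (not part of Kompute or CUDA) that convert a
// measured copy time into an effective transfer rate.

// Effective throughput in GiB/s for a copy of `mebibytes` MiB
// that took `milliseconds` ms.
constexpr double throughput_gib_per_s(double mebibytes, double milliseconds)
{
    // MiB -> GiB is a division by 1024; ms -> s is a division by 1000.
    return (mebibytes / 1024.0) / (milliseconds / 1000.0);
}

// How many times slower one copy is than another of the same size.
constexpr double slowdown(double slow_ms, double fast_ms)
{
    return slow_ms / fast_ms;
}

// Mean timings from the notebook: 51 MiB copied device-to-host.
constexpr double kCudaGiBps    = throughput_gib_per_s(51.0, 7.3);
constexpr double kKomputeGiBps = throughput_gib_per_s(51.0, 292.0);
constexpr double kSlowdown     = slowdown(292.0, 7.3);
```

If the Kompute figure is representative, it is far below what I would expect the bus to sustain, which is why I suspect the copy path rather than the hardware.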

Is this an inherent limitation of Vulkan, or is there a way to speed up this copy?

---

Here are my GPU specs:
```
Device: NVIDIA GeForce RTX 5080
Compute Capability: (12, 0)
Total Memory: 15817 MiB
Max threads per block: 1024
Total number of SMs: 84
```

---
Attachments:
[cuda_vs_kompute.pdf](https://github.com/user-attachments/files/20093487/cuda_vs_kompute.pdf)
