Memory Copies from Device Significantly Slower than CUDA #420

@zfergus

Thank you for providing this library. I have found it immensely helpful when using Vulkan compute shaders.

While profiling my application, I found that copying data back to the CPU takes significantly more time than it does in CUDA. Specifically, this line accounts for 95% of my run time:

std::vector<std::pair<uint32_t, uint32_t>> candidates = 
    t_candidates->vector<std::pair<uint32_t, uint32_t>>();

where t_candidates is created as

auto t_candidates = mgr->tensor(
    candidates_size, 2 * sizeof(uint32_t),
    kp::Memory::DataTypes::eCustom, kp::Memory::MemoryTypes::eDevice);

I tried to reproduce the effect in a minimal example using the Python bindings for Kompute and PyCUDA, and I have attached a PDF of the notebook. Copying 51 MiB from the GPU to the CPU takes 7.3 ms ± 399 µs with CUDA but 292 ms ± 160 µs with Kompute.
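To put those two timings on a common scale, here is a small sketch that converts them into effective transfer rates. It uses only the figures quoted above (51 MiB, 7.3 ms, 292 ms); the helper name `throughput_gib_per_s` is made up for illustration and is not part of Kompute or PyCUDA.

```python
def throughput_gib_per_s(size_mib: float, elapsed_ms: float) -> float:
    """Effective transfer rate in GiB/s for size_mib MiB copied in elapsed_ms ms.

    Hypothetical helper for illustration only; not a Kompute or PyCUDA function.
    """
    size_gib = size_mib / 1024.0   # MiB -> GiB
    seconds = elapsed_ms / 1000.0  # ms -> s
    return size_gib / seconds


# Mean timings reported above for a 51 MiB device-to-host copy.
cuda_rate = throughput_gib_per_s(51, 7.3)
kompute_rate = throughput_gib_per_s(51, 292)

print(f"CUDA:     {cuda_rate:.2f} GiB/s")     # CUDA:     6.82 GiB/s
print(f"Kompute:  {kompute_rate:.2f} GiB/s")  # Kompute:  0.17 GiB/s
print(f"Slowdown: {292 / 7.3:.0f}x")          # Slowdown: 40x
```

In other words, the Kompute read-back is about 40 times slower than the CUDA copy on the same data size.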

Is this an inherent limitation of Vulkan, or is there a way to speed up this copy?


Here are my GPU specs:

Device: NVIDIA GeForce RTX 5080
Compute Capability: (12, 0)
Total Memory: 15817 MiB
Max threads per block: 1024
Total number of SMs: 84

Attachments:
cuda_vs_kompute.pdf
