Conversation
I am concerned that people will use this without knowing that it's doing expensive div/mod calculations and see that it performs more poorly than a native CUDA/HIP 2D or 3D kernel. Would it make sense to have policies that allow mapping to 2D or 3D kernels?
I think that can be addressed in the RAJA cookbook or examples explaining when this would be performant. In the case of @tomstitt and me, it comes up when the threads per block are not high enough to saturate the GPU, and using `RAJA::forall` + mods + divs allows us to increase the threads per block, which ends up being more performant.
I think my ideal is an interface where there is a choice of policy. We have 2D/3D kernels on 1D iteration spaces, using mod/div as Arturo said, to expose more parallelism. When we switch some of those to using our "true" 2D/3D grid launcher, we lose performance because our block shape (16x16, 8x8x8) doesn't map well onto the grid (we just idle threads). Of course, as Jason said, it's not always true that our div/mod approach will be better; if we had an easy way to pick, we could put both behind our abstraction and dispatch correctly.
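(To make the idling concrete with hypothetical extents: a 20x20 iteration space tiled by 16x16 blocks launches a 2x2 grid, i.e. 4 x 256 = 1024 scheduled threads for only 400 useful iterations, while the flattened 1D version with 256-thread blocks launches just 2 blocks and idles at most 112 threads.)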
@tomstitt, okay -- let's get that imagination soaring and cook up some ideas! #RAJA!
```cpp
RAJA::launch_nd(res, policy, RAJA::segments(cells, comps),
  [=] RAJA_HOST_DEVICE(int cell, int comp) {
    const int idx = comp + num_comp * cell;
    values_ptr[idx] = 1000 * cell + comp;
  });
```
@tomstitt, @MrBurmark I invite you to take a look at this example; I think this may be what we are looking for.
You certainly can use launch or teams to get some level of parallelism and then take those indices and do your own calculations with them. If we had the multiloop abstractions with a variety of policies, that could take us a fair amount of the way.
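For reference, here is a minimal sketch of that "launch plus your own index calculations" pattern using the existing `RAJA::launch` interface (not the API proposed in this PR); the policy aliases, block size, and the names `values_ptr`, `num_cells`, and `num_comp` are illustrative assumptions:

```cpp
#include "RAJA/RAJA.hpp"

// Illustrative policy aliases; CUDA is shown as one example backend.
using launch_pol = RAJA::LaunchPolicy<RAJA::seq_launch_t, RAJA::cuda_launch_t<false>>;
using team_pol   = RAJA::LoopPolicy<RAJA::seq_exec, RAJA::cuda_block_x_direct>;
using thread_pol = RAJA::LoopPolicy<RAJA::seq_exec, RAJA::cuda_thread_x_direct>;

void init_launch(double* values_ptr, int num_cells, int num_comp)
{
  const int block_size = 256;                              // assumed threads per team
  const int n          = num_cells * num_comp;             // flattened extent
  const int num_teams  = (n + block_size - 1) / block_size;

  RAJA::launch<launch_pol>(
      RAJA::LaunchParams(RAJA::Teams(num_teams), RAJA::Threads(block_size)),
      [=] RAJA_HOST_DEVICE(RAJA::LaunchContext ctx) {
        RAJA::loop<team_pol>(ctx, RAJA::RangeSegment(0, num_teams), [&](int team) {
          RAJA::loop<thread_pol>(ctx, RAJA::RangeSegment(0, block_size), [&](int thr) {
            // Take the launch indices and do the index calculations ourselves.
            const int idx = team * block_size + thr;
            if (idx < n) {
              const int cell = idx / num_comp;
              const int comp = idx % num_comp;
              values_ptr[idx] = 1000.0 * cell + comp;
            }
          });
        });
      });
}
```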
I, along with @tomstitt, work on an application where we have been exploring using a 1D GPU index for multi-level loops and have found that it may be more performant than using hierarchical parallelism. This PR uses concepts from RAJA to create the `RAJA::forall_nd` convenience function.
Cases in which this approach is more performant:

- It comes up when the threads per block are not high enough to saturate the GPU; using `RAJA::forall` + mods + divs allows us to increase the threads per block, which ends up being more performant (see the sketch below).
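For comparison with the launch example above, here is a minimal sketch of that `RAJA::forall` + mods + divs flattening; the execution policy and the names `values_ptr`, `num_cells`, and `num_comp` are illustrative assumptions:

```cpp
#include "RAJA/RAJA.hpp"

// Illustrative: flatten the (cells x components) nest onto one 1D range so the
// launch size is set by the whole problem, not by the inner extent alone.
void init_flat(double* values_ptr, int num_cells, int num_comp)
{
  RAJA::forall<RAJA::cuda_exec<256>>(          // assumed CUDA policy, 256 threads/block
      RAJA::RangeSegment(0, num_cells * num_comp),
      [=] RAJA_DEVICE(int idx) {
        const int cell = idx / num_comp;       // the "div"
        const int comp = idx % num_comp;       // the "mod"
        values_ptr[idx] = 1000.0 * cell + comp;
      });
}
```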