Conversation
I am concerned that people will use this without knowing that it's doing expensive div/mod calculations and see that it performs more poorly than a native CUDA/HIP 2D or 3D kernel. Would it make sense to have policies that allow mapping to 2D or 3D kernels?
I think that can be addressed in the RAJA cookbook or examples explaining when this would be performant. In the case of @tomstitt and me, it comes up when the threads per block are not high enough to saturate the GPU, and using `RAJA::forall` + mods + divs allows us to increase the threads per block, which ends up being more performant.
I think my ideal is an interface where there is a choice of policy. We have 2D/3D kernels on 1D iteration spaces, using mod/div as Arturo said, to expose more parallelism. When we switch some of those to using our "true" 2D/3D grid launcher, we lose performance because our block shape (16x16, 8x8x8) doesn't map well onto the grid (we just idle threads). Of course, as Jason said, it's not always true that our div/mod approach will be better; if we had an easy way to pick, we could put both behind our abstraction and dispatch correctly.
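(To make the idling concrete with hypothetical extents: a 20x20 iteration space tiled by 16x16 blocks launches a 2x2 grid, i.e. 4 x 256 = 1024 scheduled threads for only 400 useful iterations, while the flattened 1D version with 256-thread blocks launches just 2 blocks and idles at most 112 threads.)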
@tomstitt, okay -- let's get that imagination soaring and cook up some ideas! #RAJA!
```cpp
RAJA::launch_nd(res, policy, RAJA::segments(cells, comps),
  [=] RAJA_HOST_DEVICE(int cell, int comp) {
    const int idx = comp + num_comp * cell;
    values_ptr[idx] = 1000 * cell + comp;
  });
```
@tomstitt, @MrBurmark I invite you to take a look at this example; I think this may be what we are looking for.
You certainly can use launch or teams to get some level of parallelism and then take those indices and do your own calculations with them. If we had the multiloop abstractions with a variety of policies, that could take us a fair amount of the way.
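For reference, here is a minimal sketch of that "launch plus your own index calculations" pattern using the existing `RAJA::launch` interface (not the API proposed in this PR); the policy aliases, block size, and the names `values_ptr`, `num_cells`, and `num_comp` are illustrative assumptions:

```cpp
#include "RAJA/RAJA.hpp"

// Illustrative policy aliases; CUDA is shown as one example backend.
using launch_pol = RAJA::LaunchPolicy<RAJA::seq_launch_t, RAJA::cuda_launch_t<false>>;
using team_pol   = RAJA::LoopPolicy<RAJA::seq_exec, RAJA::cuda_block_x_direct>;
using thread_pol = RAJA::LoopPolicy<RAJA::seq_exec, RAJA::cuda_thread_x_direct>;

void init_launch(double* values_ptr, int num_cells, int num_comp)
{
  const int block_size = 256;                              // assumed threads per team
  const int n          = num_cells * num_comp;             // flattened extent
  const int num_teams  = (n + block_size - 1) / block_size;

  RAJA::launch<launch_pol>(
      RAJA::LaunchParams(RAJA::Teams(num_teams), RAJA::Threads(block_size)),
      [=] RAJA_HOST_DEVICE(RAJA::LaunchContext ctx) {
        RAJA::loop<team_pol>(ctx, RAJA::RangeSegment(0, num_teams), [&](int team) {
          RAJA::loop<thread_pol>(ctx, RAJA::RangeSegment(0, block_size), [&](int thr) {
            // Take the launch indices and do the index calculations ourselves.
            const int idx = team * block_size + thr;
            if (idx < n) {
              const int cell = idx / num_comp;
              const int comp = idx % num_comp;
              values_ptr[idx] = 1000.0 * cell + comp;
            }
          });
        });
      });
}
```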
I, along with @tomstitt, work on an application where we have been exploring using a 1D GPU index for multi-level loops and have found that it may be more performant than using hierarchical parallelism. This PR uses concepts from RAJA to create the `RAJA::forall_nd` convenience function.
Cases in which this approach is more performant:

- It comes up when the threads per block are not high enough to saturate the GPU; using `RAJA::forall` + mods + divs allows us to increase the threads per block, which ends up being more performant (see the sketch below).
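For comparison with the launch example above, here is a minimal sketch of that `RAJA::forall` + mods + divs flattening; the execution policy and the names `values_ptr`, `num_cells`, and `num_comp` are illustrative assumptions:

```cpp
#include "RAJA/RAJA.hpp"

// Illustrative: flatten the (cells x components) nest onto one 1D range so the
// launch size is set by the whole problem, not by the inner extent alone.
void init_flat(double* values_ptr, int num_cells, int num_comp)
{
  RAJA::forall<RAJA::cuda_exec<256>>(          // assumed CUDA policy, 256 threads/block
      RAJA::RangeSegment(0, num_cells * num_comp),
      [=] RAJA_DEVICE(int idx) {
        const int cell = idx / num_comp;       // the "div"
        const int comp = idx % num_comp;       // the "mod"
        values_ptr[idx] = 1000.0 * cell + comp;
      });
}
```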