(Closes #3409) reproducible reduction performance improvements#3452
(Closes #3409) reproducible reduction performance improvements#3452LonelyCat124 wants to merge 13 commits into
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #3452 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 393 393
Lines 55067 55097 +30
=========================================
+ Hits 55067 55097 +30 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
@sergisiso I think this is ready for a first look now. |
|
@LonelyCat124 Today LFRic meeting discussion has reminded me that we still don't test reproducible reductions in our integration tests and we should because this is what the MetOffice always use. Can you update: |
|
A second step would be to update the: We could regenerate the KGO by doing manual run of There is another issue for this #3207, so I am not sure if it should be here or a separate PR. But it will be great to be confident that this produced the exact expected results (and continues doing so going forward). |
|
The LFRic colour tiling IT fails - I'll investigate. |
|
@sergisiso So I setup running this and its an issue due to This means we can't have This gives 2 options as I see, first is to make the transformation do something different if |
|
The standard has this reproducible order-modifier https://www.openmp.org/spec-html/5.2/openmpse59.html, Is this something we can use / the compilers implement? |
|
Any other suggestions @markb-epcc ? |
|
@sergisiso AFAIK, the reproducible order-modifier applies to the mapping of loop iterations to threads, not to the order of reduction operations. The spec says "The location in the OpenMP program at which values are combined and the order in which values are combined are unspecified. Thus, when comparing sequential and parallel executions, or when comparing one parallel execution to another (even if the number of threads used is the same), bitwise-identical results are not guaranteed." I would have thought that padding the temporary array would solve the problem, but that is not done in https://github.com/MetOffice/lfric_core/blob/bb10f897f7ecc4a931906b8d54bf1692f4a66b7f/components/science/source/psy/sci_psykal_light_mod.f90 where I noticed the false sharing problem coming from. If you want to implement this with !$omp parallel do then would you not need to split the parallel and do in any case, so that omp_get_thread_num() is only called once and not in every loop iteration? |
|
Yeah, I'm not sure what the best solution is here - essentially how should PSyclone do the splitting (its a mostly a bit of an inheritance nightmare). I could simply skip the reproducible reductions being created if the node is a OMPParallelDoDirective, but weirdly that requires checking the type in OMPParallelDirective which is a mess. The entire LFRicOMPParallelLoopTrans infrastructure currently never checks to see if reproducible reductions are enabled, it just (now? Maybe it did before and we never tested but the old version just worked) magically happens at lowering time, when it hits OMPParallelDirective.lower_to_language_level which performs the new reproducible reductions setup, and results in : This is a good sanity error checking test at least, but the only thing I can thing to do is when lowering the OMPParallelDoDirective, to do the following: I'm not sure how you feel about changing the tree like this during lowering @sergisiso ? This is a rough problem to solve in another way though - its possible I could do it at transformation application (or just in general, if someone requests LFRICParallelLoopTrans with reprod set to true in the config it applies the 2 transformations separately instead regardless of whether reductions exist), but I don't know if we know whether there are reductions required until after transformation time. Let me know if you have a preference between the options. |
|
I decided to implement the transformation approach, and have a test in PSyclone now that works (we never tested that functionality anywhere before...). I'm hoping the ITs pass now but waiting on them for now. |
|
@sergisiso This is ready for a look now, ITs all green. |
|
So I can restrict it to But I have some questions from reading the Kern documentation which I suspect isn't true: This claims that ancestor.reprod is always False for |
|
Ooh, that doesn't look like it's right anymore. Can we remove the |
I'm not sure, its tricky, in theory we need to check the ancestor's |
Assuming tests are all successful here, then ready for a first look. This does the changes we discussed in the issue, there are some rogue newline statements that seem to commen from adding preceding_comments, though I could also be tempted to remove the preceding comments if they seems a bit over the top.