SumBoundary reimplementation #4658

Conversation
In the old approach, we make a temporary copy of the data, zero out the original data, and then perform a parallel add. In the new approach, which should be more efficient, we no longer need a temporary copy of the entire data. We have also added an optional argument that can be used to make SumBoundary deterministic. Note that being deterministic does not mean the data are consistent on shared nodes. For that, one still needs to call OverrideSync or FillBoundaryAndSync.
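For readers following along, here is a minimal usage sketch of the pattern described above. The `deterministic` argument name is an assumption for illustration, not the PR's actual API; the PR diff has the real signature.

```cpp
#include <AMReX_Geometry.H>
#include <AMReX_MultiFab.H>

// Minimal sketch, assuming a MultiFab mf whose ghost cells were
// already filled by a deposition kernel.
void sum_ghost_cells (amrex::MultiFab& mf, amrex::Geometry const& geom)
{
    // Add values from overlapping ghost cells back into valid cells.
    mf.SumBoundary(geom.periodicity());

    // Hypothetical deterministic variant: same result every run, but
    // shared nodes may still disagree in the last bits until
    // OverrideSync or FillBoundaryAndSync is called.
    // mf.SumBoundary(geom.periodicity(), /*deterministic=*/true);
}
```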
Could you try this PR? To make testing easy, I have made two branches on my fork. Instead of using this PR branch, you can test https://github.com/WeiqunZhang/amrex/tree/sum_boundary_nondeter for the non-deterministic SumBoundary and https://github.com/WeiqunZhang/amrex/tree/sum_boundary_deter for the deterministic SumBoundary. You don't need to make any changes in your code. Could you also test the current development branch so that we can compare the performance of the three approaches? It would not be surprising if the deterministic approach is actually faster, because it does not use atomics. The new non-deterministic approach is also expected to be faster because it does not make a temporary copy of the entire data.
Just realized there is a small flaw: I used pointers for sorting, but that's not stable unless they all come from a single Arena allocation. I will update it soon.
Sorting is fixed.
@AlexanderSinn Could you also give this a try?
I am on vacation this week. Hopefully ChongChong can test it.
This worked! This PR made particle-to-mesh deposition reproducible. Just tested.
I'm going to run more tests and report later, but a few simple tests with 100,000 particles and 64 tiles show that on a single GPU, the SumBoundary function is now deterministic.
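For what it's worth, a minimal sketch of this kind of determinism check, with `deposit_particles` as a hypothetical stand-in for the application's particle-to-mesh kernel:

```cpp
#include <AMReX_Geometry.H>
#include <AMReX_MultiFab.H>

void deposit_particles (amrex::MultiFab& mf); // hypothetical app kernel

// Run the same deposition + SumBoundary twice and check that the
// results agree to the last bit.
bool is_deterministic (amrex::MultiFab& a, amrex::MultiFab& b,
                       amrex::Geometry const& geom)
{
    deposit_particles(a);
    a.SumBoundary(geom.periodicity());

    deposit_particles(b); // same particles, same boxes as for a
    b.SumBoundary(geom.periodicity());

    // A max-norm of exactly zero means bitwise-identical results.
    amrex::MultiFab::Subtract(a, b, 0, 0, a.nComp(), 0);
    return a.norm0() == 0.0;
}
```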
@WeiqunZhang What are the differences between the three branches, and how do you plan to merge them into the main branch? Will there be an amrex runtime parameter to toggle this, or will you keep
There are three branches. Which one will be merged depends on performance results.
@WeiqunZhang Here are the log files from the main branch run and the sum_boundary_deter run. They seem to suggest that the performance is comparable, although the tests are very basic: 128^3 cells with 64 tiles, running with 1 AMD GPU for 20 steps. In this new branch, the SumBoundary logger is missing, so it's not possible to compare the metrics of SumBoundary.

log-4cubed-main.log

In this test, SumBoundary takes at most a tiny fraction of the total compute time. In a different test where the overall performance is much better (15 Mupdates/s) and SumBoundary might be relatively more important, the results from these two branches are very similar as well.
Thanks for testing. As you said, the time spent in SumBoundary was tiny, so it did not show up in profiling.
I tested it with hipace, so single rank/single box, MI250X, 4095*4095*1 cells with ncomp=3 and (2,2,0) ghost cells, executed 2000 times. I enclosed the SumBoundary call in a TinyProfiler region with stream sync to isolate its time. For the last test I did the same for the kernel in deterministic_fab_to_fab. There were 8 tags with 1 block launched.
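For reference, the isolation pattern described here might look something like the sketch below; the region name is arbitrary.

```cpp
#include <AMReX_BLProfiler.H>
#include <AMReX_Geometry.H>
#include <AMReX_GpuDevice.H>
#include <AMReX_MultiFab.H>

// Sketch of timing SumBoundary in isolation: synchronize the GPU
// stream on both sides of the call so the scoped TinyProfiler timer
// measures only this operation.
void timed_sum_boundary (amrex::MultiFab& mf, amrex::Geometry const& geom)
{
    amrex::Gpu::streamSynchronize(); // drain earlier async work
    {
        BL_PROFILE("SumBoundary");   // scoped TinyProfiler timer
        mf.SumBoundary(geom.periodicity());
        amrex::Gpu::streamSynchronize(); // wait for the kernels
    }
}
```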
I wonder if using a larger block size would help deterministic_fab_to_fab. It increases the cost of __syncthreads(), but it can also use more threads.
We could also try to cache that kernel metadata. But let's leave that for the future.
With 1024 threads, the kernel takes 0.9387 seconds. Caching the tags would save at most 0.12 seconds. It might be good to use BoxIndexer instead of the divisions; this should be done in general for TagParallelFor. If I understand it correctly, in my test there are 16 cells that overlap 3x each and about 32k that don't overlap. So the tags could be split into overlapping and non-overlapping ones, with all the non-overlapping cells executed in parallel.
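To illustrate the cost being discussed: decoding a flat cell index with plain divisions looks like the sketch below, while (as I understand it) BoxIndexer replaces the divisions with precomputed fast-divmod arithmetic. The struct here is illustrative, not the AMReX implementation.

```cpp
// Illustrative only: mapping a flat cell index to (i,j,k) with plain
// integer divisions. On GPUs the 64-bit divisions are expensive; a
// precomputed indexer avoids them with multiply-based divmod.
struct FlatToIJK
{
    int lox, loy, loz; // lower corner of the box
    int nx, ny;        // box extents in x and y

    void operator() (long icell, int& i, int& j, int& k) const
    {
        long nxy = long(nx) * long(ny);
        k = loz + int(icell / nxy);              // division 1
        long r = icell - (icell / nxy) * nxy;
        j = loy + int(r / nx);                   // division 2
        i = lox + int(r - (r / nx) * long(nx));
    }
};
```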
We could tile the destination fab box and put a block on each tile. That will allow us to launch more blocks.
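One way to realize this, sketched under the assumption that chopping the box on the host is acceptable:

```cpp
#include <AMReX_BoxArray.H>

// Chop the destination fab's box into fixed-size tiles; one GPU block
// could then be launched per tile. The tile size of 32 is arbitrary.
amrex::BoxArray make_tiles (amrex::Box const& dest_box)
{
    amrex::BoxArray tiles(dest_box);
    tiles.maxSize(32); // at most 32 cells per direction per tile
    return tiles;
}
```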
I did some tests on Perlmutter using the latest version.
I am also convinced that the new implementation is correct. I used WarpX for testing: it calls both the old and new SumBoundary functions and discards the new values. The relative difference can be as big as 1.e-10 in the test. In this WarpX test, for example, we have
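A sketch of this comparison, with `old_sum_boundary` and `new_sum_boundary` as hypothetical stand-ins for the two code paths:

```cpp
#include <AMReX_Geometry.H>
#include <AMReX_MultiFab.H>

// Hypothetical stand-ins for the old and new implementations.
void old_sum_boundary (amrex::MultiFab& mf, amrex::Geometry const& geom);
void new_sum_boundary (amrex::MultiFab& mf, amrex::Geometry const& geom);

// Run both implementations on copies of the same data and report the
// largest relative difference between the results.
amrex::Real max_rel_diff (amrex::MultiFab const& mf,
                          amrex::Geometry const& geom)
{
    amrex::MultiFab a(mf.boxArray(), mf.DistributionMap(),
                      mf.nComp(), mf.nGrowVect());
    amrex::MultiFab b(mf.boxArray(), mf.DistributionMap(),
                      mf.nComp(), mf.nGrowVect());
    amrex::MultiFab::Copy(a, mf, 0, 0, mf.nComp(), mf.nGrowVect());
    amrex::MultiFab::Copy(b, mf, 0, 0, mf.nComp(), mf.nGrowVect());

    old_sum_boundary(a, geom);
    new_sum_boundary(b, geom);

    amrex::Real ref = a.norm0();
    amrex::MultiFab::Subtract(a, b, 0, 0, mf.nComp(), 0);
    return (ref > 0.0) ? a.norm0() / ref : amrex::Real(0.0);
}
```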
I think we should probably make the data consistent when SumBoundary finishes, regardless of whether it is deterministic or not.
On the other hand, a lot of codes have already been using OverrideSync and FillBoundaryAndSync for syncing the data, so this might be redundant.
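For completeness, the existing pattern referred to here, as a minimal sketch:

```cpp
#include <AMReX_Geometry.H>
#include <AMReX_MultiFab.H>

// Sum ghost-cell contributions, then pick a single owner's value for
// every shared node so all copies agree exactly. Mainly relevant for
// nodal data, where multiple boxes own the same node.
void sum_then_sync (amrex::MultiFab& nodal_mf, amrex::Geometry const& geom)
{
    nodal_mf.SumBoundary(geom.periodicity());
    nodal_mf.OverrideSync(geom.periodicity());
}
```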