
Conversation

WeiqunZhang
Member

In the old approach, we make a temporary copy of the data, zero out the original data, and then perform a parallel add. In the new approach, which should be more efficient, we no longer need a temporary copy of the entire data.

We have also added an optional argument that can be used to make SumBoundary deterministic. Note that being deterministic does not mean the data are consistent on shared nodes. For that, one still needs to call OverrideSync or FillBoundaryAndSync.
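
For illustration, a minimal sketch of the old approach as described above (illustrative only, not the actual internal code; it uses the public MultiFab API to mimic the copy/zero/parallel-add sequence):

```cpp
#include <AMReX_MultiFab.H>
#include <AMReX_Periodicity.H>

// Illustrative sketch of the old approach: copy everything (including ghost
// cells), zero out the original, then parallel-add the copy's ghost regions
// back onto the valid regions of the original.
void old_style_sum_boundary (amrex::MultiFab& mf, const amrex::Periodicity& period)
{
    amrex::MultiFab tmp(mf.boxArray(), mf.DistributionMap(), mf.nComp(), mf.nGrowVect());
    amrex::MultiFab::Copy(tmp, mf, 0, 0, mf.nComp(), mf.nGrowVect()); // temporary copy of the entire data
    mf.setVal(0.0);                                                   // zero out the original data
    mf.ParallelAdd(tmp, 0, 0, mf.nComp(),
                   tmp.nGrowVect(), amrex::IntVect(0), period);       // add the contributions back
}
```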

@WeiqunZhang WeiqunZhang force-pushed the sum_boundary branch 2 times, most recently from f4c63a0 to 5aefc2f on September 16, 2025 04:10

@WeiqunZhang
Member Author

@BenWibking @chongchonghe

Could you try this PR? To make testing easy, I have made two branches on my fork. Instead of using this PR branch, you can test https://github.com/WeiqunZhang/amrex/tree/sum_boundary_nondeter for non-deterministic SumBoundary and https://github.com/WeiqunZhang/amrex/tree/sum_boundary_deter for deterministic SumBoundary. You don't need to make any changes to your code. Could you also test the current development branch so that we can compare the performance of the three approaches? It would not be surprising if the deterministic approach is actually faster, because it does not use atomics. The new non-deterministic approach is also expected to be faster, because it does not make a temporary copy of the entire data.

@WeiqunZhang
Member Author

Just realized there is a small flaw. I used pointers for sorting, but that's not stable unless they all come from a single Arena allocation. I will update it soon.
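
To illustrate the sorting issue in a simplified way (hypothetical tag type and fields, not the actual AMReX code): ordering by raw pointer value depends on where each allocation landed in a particular run, while ordering by an explicit, run-independent key is reproducible.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical tag type for illustration only.
struct Tag {
    double* dst;    // destination data pointer (its address varies between runs)
    int     grid;   // destination grid index
    long    offset; // offset of the destination region within that grid
};

void sort_tags (std::vector<Tag>& tags)
{
    // Not reproducible in general: pointer values depend on the allocator,
    // unless they all come from a single Arena allocation.
    // std::sort(tags.begin(), tags.end(),
    //           [] (const Tag& a, const Tag& b) { return a.dst < b.dst; });

    // Reproducible: the (grid, offset) key is the same in every run.
    std::sort(tags.begin(), tags.end(),
              [] (const Tag& a, const Tag& b) {
                  return (a.grid != b.grid) ? (a.grid < b.grid) : (a.offset < b.offset);
              });
}
```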

@WeiqunZhang
Member Author

Sorting is fixed.

@WeiqunZhang
Member Author

@BenWibking @chongchonghe

Could you try this PR? To make testing easy, I have made two branches on my fork. Instead of using this PR branch, you can test https://github.com/WeiqunZhang/amrex/tree/sum_boundary_nondeter for non-deterministic SumBoundary and https://github.com/WeiqunZhang/amrex/tree/sum_boundary_deter for deterministic SumBoundary. You don't need to make any changes to your code. Could you also test the current development branch so that we can compare the performance of the three approaches? It would not be surprising if the deterministic approach is actually faster, because it does not use atomics. The new non-deterministic approach is also expected to be faster, because it does not make a temporary copy of the entire data.

@AlexanderSinn Could you also give this a try?

@BenWibking
Contributor

I am on vacation this week. Hopefully ChongChong can test it.

@chongchonghe

This worked! I just tested it, and this PR makes particle-to-mesh deposition reproducible.

@chongchonghe

chongchonghe commented Sep 16, 2025

I'm going to run more tests and report later, but a few simple tests with 100,000 particles and 64 tiles show that, on a single GPU, the SumBoundary function is now deterministic.

@chongchonghe

@WeiqunZhang What are the differences between the three branches, and how do you plan to merge them into the main branch? Will there be an amrex runtime parameter to toggle this, or will you keep deterministic as a parameter for SumBoundary so that users can choose whichever behavior they want?

@WeiqunZhang
Member Author

WeiqunZhang commented Sep 16, 2025

There are three branches. Which one will be merged depends on performance results.

  • sum_boundary: This changes the API by adding an optional argument bool deterministic to SumBoundary. The approach in this branch is different from that in the development branch, and it may result in roundoff differences. (A usage sketch follows this list.)
  • sum_boundary_deter: This does not change the API. It's always deterministic and it does not use atomics. It's based on the sum_boundary branch.
  • sum_boundary_nondeter: This does not change the API. It's non-deterministic because it uses atomics. It's also based on the sum_boundary branch.
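
For the sum_boundary branch, a call site might look roughly like the sketch below. The name and position of the new optional argument are inferred from the description above and may not match the final API in the PR diff, so the second call is shown commented out.

```cpp
#include <AMReX_MultiFab.H>
#include <AMReX_Geometry.H>

void sum_ghost_contributions (amrex::MultiFab& mf, const amrex::Geometry& geom)
{
    // Existing API, unchanged by sum_boundary_deter and sum_boundary_nondeter:
    mf.SumBoundary(geom.periodicity());

    // sum_boundary branch (hypothetical signature based on the description above):
    // mf.SumBoundary(geom.periodicity(), /*deterministic=*/true);
}
```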

@chongchonghe

chongchonghe commented Sep 16, 2025

@WeiqunZhang Here are the log files from the main branch run and the sum_boundary_deter run. It seems to suggest that the performance is comparable, although the tests are very basic: 128^3 cells with 64 tiles, running with 1 AMD GPU for 20 steps. In this new branch, the SumBoundary logger is missing, so it's not possible to compare the metrics of SumBoundary.

log-4cubed-main.log
log-4cubed-sum_boundary_deter.log

In this test, SumBoundary takes at most a tiny fraction of the total compute time. In a different test where the overall performance is much better (15 Mupdates/s) and SumBoundary might be relatively more important, the results from these two branches are very similar as well.

@WeiqunZhang
Member Author

Thanks for testing. As you said, the time spent in SumBoundary was tiny, so it did not show up in profiling.

@AlexanderSinn
Member

AlexanderSinn commented Sep 16, 2025

I tested it with hipace, so single rank/single box, MI250X, 4095*4095*1 cells with ncomp=3 and (2,2,0) ghost cells, executed 2000 times. I enclosed the sum boundary call in a TinyProfiler region with stream sync to isolate its time. For the last test I did the same for the kernel in deterministic_fab_to_fab. There were 8 tags with 1 block launched.

| Configuration                                      | Time (s) |
|----------------------------------------------------+----------|
| dev                                                |    4.976 |
| dev but using FillBoundary instead of SumBoundary  |  0.05021 |
| sum_boundary_nondeter                              |   0.1478 |
| sum_boundary_deter                                 |    1.354 |
| time of launch in deterministic_fab_to_fab         |     1.24 |
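
A minimal sketch of the timing setup described above, assuming AMReX's TinyProfiler macro and explicit GPU stream synchronization (the actual hipace instrumentation may differ):

```cpp
#include <AMReX_BLProfiler.H>
#include <AMReX_GpuDevice.H>
#include <AMReX_MultiFab.H>
#include <AMReX_Periodicity.H>

void timed_sum_boundary (amrex::MultiFab& mf, const amrex::Periodicity& period)
{
    amrex::Gpu::streamSynchronize();      // don't attribute earlier GPU work to this region
    {
        BL_PROFILE("SumBoundary_isolated");
        mf.SumBoundary(period);
        amrex::Gpu::streamSynchronize();  // include the full GPU time in the region
    }
}
```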

@WeiqunZhang
Member Author

I wonder if using a larger block size would help deterministic_fab_to_fab. It increases the cost of syncthreads, but it can also use more threads.

@WeiqunZhang
Member Author

We could also try to cache that kernel metadata, but let's leave that for the future.

@AlexanderSinn
Member

With 1024 threads, the kernel takes 0.9387 seconds. Caching the tags would save at most 0.12 seconds. It might be good to use BoxIndexer instead of the divisions; this should be done in general for TagParallelFor. If I understand it correctly, in my test there are 16 cells that overlap 3x each and about 32k that don't overlap. So the tags could be split into overlapping and non-overlapping ones, with all the non-overlapping cells executed in parallel.
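
A rough sketch of the splitting idea (illustrative only; the real tags in TagParallelFor carry more information than just a destination Box):

```cpp
#include <AMReX_Box.H>
#include <vector>

// Hypothetical minimal tag for illustration: only the destination region.
struct DstTag { amrex::Box dstbox; };

// Partition tags into those whose destination regions overlap another tag's
// destination (their sums must be ordered to stay deterministic) and those
// that don't (safe to update fully in parallel). O(n^2), fine for a few tags.
void split_tags (const std::vector<DstTag>& tags,
                 std::vector<DstTag>& overlapping,
                 std::vector<DstTag>& independent)
{
    for (std::size_t i = 0; i < tags.size(); ++i) {
        bool hit = false;
        for (std::size_t j = 0; j < tags.size(); ++j) {
            if (i != j && tags[i].dstbox.intersects(tags[j].dstbox)) { hit = true; break; }
        }
        (hit ? overlapping : independent).push_back(tags[i]);
    }
}
```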

@WeiqunZhang
Member Author

We could tile the destination fab box and put a block on each tile. That would allow us to launch more blocks.

@WeiqunZhang
Member Author

I did some tests on perlmutter using the latest version.

* perlmutter, 1000 calls, nghost = 3, max_grid_size = 128
| Function      | # GPU | n_cell |   Time |
|---------------+-------+--------+--------|
| FillBoundary  |     1 |  256^3 | 0.0835 |
| old SumBndry  |       |        | 0.8567 |
| new SumBndry  |       |        | 0.2300 |
| deterministic |       |        | 0.4440 |
|---------------+-------+--------+--------|
| FillBoundary  |     4 |  512^3 | 0.3760 |
| old SumBndry  |       |        | 2.1269 |
| new SumBndry  |       |        | 0.6480 |
| deterministic |       |        | 1.0067 |
|---------------+-------+--------+--------|
| FillBoundary  |    32 | 1024^3 | 0.9380 |
| old SumBndry  |       |        | 2.1102 |
| new SumBndry  |       |        | 1.1455 |
| deterministic |       |        | 1.3571 |

@WeiqunZhang
Member Author

I am also convinced that the new implementation is correct. I used warpx for testing. It calls both the old and new SumBoundary functions and discards the new values. The relative difference can be as big as 1.e-10 in the test. In this warpx test, for example, we have

In [2]: a = 54189501671.266693

In [3]: b = -33492988872.169563

In [4]: c = -312998371924.91077

In [5]: d = 292301782490.75726

In [11]: ((a+b+c+d) - (a+c+d+b)) / (a+b+c+d)
Out[11]: 2.9866466692911565e-10

@WeiqunZhang WeiqunZhang marked this pull request as ready for review September 18, 2025 02:37
@WeiqunZhang
Member Author

I think we probably should make the data consistent when SumBoundary finishes, regardless of whether it is deterministic or not.

@WeiqunZhang
Member Author

> I think we probably should make the data consistent when SumBoundary finishes, regardless of whether it is deterministic or not.

On the other hand, a lot of codes are already using OverrideSync and FillBoundaryAndSync for syncing the data, so this might be redundant.
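
For reference, the existing pattern mentioned above looks roughly like this, assuming a MultiFab mf with shared nodes and a Geometry geom:

```cpp
#include <AMReX_MultiFab.H>
#include <AMReX_Geometry.H>

void sum_and_sync (amrex::MultiFab& mf, const amrex::Geometry& geom)
{
    // Accumulate contributions that were added into overlapping ghost regions.
    mf.SumBoundary(geom.periodicity());

    // Make values on shared nodes consistent by letting a single owner win.
    // FillBoundaryAndSync could be used instead when ghost cells also need to
    // be refilled afterwards.
    mf.OverrideSync(geom.periodicity());
}
```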

@WeiqunZhang WeiqunZhang merged commit 2ee8ae2 into AMReX-Codes:development Oct 1, 2025
74 of 75 checks passed
@WeiqunZhang WeiqunZhang deleted the sum_boundary branch October 1, 2025 15:31
Successfully merging this pull request may close these issues:

  • Deterministic SumBoundary
  • SumBoundary is much less efficient than FillBoundary