SumBoundary reimplementation #4658

Conversation
In the old approach, we make a temporary copy of the data, zero out the original data, and then perform a parallel add. In the new approach, which should be more efficient, we no longer need a temporary copy of the entire data. We have also added an optional argument that can be used to make SumBoundary deterministic. Note that being deterministic does not mean the data are consistent on shared nodes. For that, one still needs to call OverrideSync or FillBoundaryAndSync.
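For readers following along, here is a minimal usage sketch of the pattern described above. The `deterministic` argument name is an assumption for illustration, not the PR's actual API; the PR diff has the real signature.

```cpp
#include <AMReX_Geometry.H>
#include <AMReX_MultiFab.H>

// Minimal sketch, assuming a MultiFab mf whose ghost cells were
// already filled by a deposition kernel.
void sum_ghost_cells (amrex::MultiFab& mf, amrex::Geometry const& geom)
{
    // Add values from overlapping ghost cells back into valid cells.
    mf.SumBoundary(geom.periodicity());

    // Hypothetical deterministic variant: same result every run, but
    // shared nodes may still disagree in the last bits until
    // OverrideSync or FillBoundaryAndSync is called.
    // mf.SumBoundary(geom.periodicity(), /*deterministic=*/true);
}
```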
Could you try this PR? To make testing easy, I have made two branches on my fork. Instead of using this PR branch, you can test https://github.com/WeiqunZhang/amrex/tree/sum_boundary_nondeter for the non-deterministic SumBoundary and https://github.com/WeiqunZhang/amrex/tree/sum_boundary_deter for the deterministic SumBoundary. You don't need to make any changes in your code. Could you also test the current development branch so that we can compare the performance of the three approaches? It would not be surprising if the deterministic approach is actually faster, because it does not use atomics. The new non-deterministic approach is also expected to be faster because it does not make a temporary copy of the entire data.
Just realized there is a small flaw: I used pointers for sorting, but that's not stable unless they all come from a single Arena allocation. I will update it soon.
Sorting is fixed.
@AlexanderSinn Could you also give this a try?
I am on vacation this week. Hopefully ChongChong can test it.
This worked! This PR made particle-to-mesh deposition reproducible. Just tested.
I'm going to run more tests and report later, but a few simple tests with 100,000 particles and 64 tiles show that on a single GPU, the SumBoundary function is now deterministic.
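For what it's worth, a minimal sketch of this kind of determinism check, with `deposit_particles` as a hypothetical stand-in for the application's particle-to-mesh kernel:

```cpp
#include <AMReX_Geometry.H>
#include <AMReX_MultiFab.H>

void deposit_particles (amrex::MultiFab& mf); // hypothetical app kernel

// Run the same deposition + SumBoundary twice and check that the
// results agree to the last bit.
bool is_deterministic (amrex::MultiFab& a, amrex::MultiFab& b,
                       amrex::Geometry const& geom)
{
    deposit_particles(a);
    a.SumBoundary(geom.periodicity());

    deposit_particles(b); // same particles, same boxes as for a
    b.SumBoundary(geom.periodicity());

    // A max-norm of exactly zero means bitwise-identical results.
    amrex::MultiFab::Subtract(a, b, 0, 0, a.nComp(), 0);
    return a.norm0() == 0.0;
}
```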
@WeiqunZhang What are the differences between the three branches, and how do you plan to merge them into the main branch? Will there be an amrex runtime parameter to toggle this, or will you keep
There are three branches. Which one will be merged depends on performance results.
@WeiqunZhang Here are the log files from the main branch run and the sum_boundary_deter run. They seem to suggest that the performance is comparable, although the tests are very basic: 128^3 cells with 64 tiles, running with 1 AMD GPU for 20 steps. In this new branch, the SumBoundary logger is missing, so it's not possible to compare the metrics of SumBoundary.

log-4cubed-main.log

In this test, SumBoundary takes at most a tiny fraction of the total compute time. In a different test where the overall performance is much better (15 Mupdates/s) and SumBoundary might be relatively more important, the results from these two branches are very similar as well.
Thanks for testing. As you said, the time spent in SumBoundary was tiny, so it did not show up in profiling.
I tested it with hipace, so single rank/single box, MI250X, 4095*4095*1 cells with ncomp=3 and (2,2,0) ghost cells, executed 2000 times. I enclosed the SumBoundary call in a TinyProfiler region with stream sync to isolate its time. For the last test I did the same for the kernel in deterministic_fab_to_fab. There were 8 tags with 1 block launched.
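For reference, the isolation pattern described here might look something like the sketch below; the region name is arbitrary.

```cpp
#include <AMReX_BLProfiler.H>
#include <AMReX_Geometry.H>
#include <AMReX_GpuDevice.H>
#include <AMReX_MultiFab.H>

// Sketch of timing SumBoundary in isolation: synchronize the GPU
// stream on both sides of the call so the scoped TinyProfiler timer
// measures only this operation.
void timed_sum_boundary (amrex::MultiFab& mf, amrex::Geometry const& geom)
{
    amrex::Gpu::streamSynchronize(); // drain earlier async work
    {
        BL_PROFILE("SumBoundary");   // scoped TinyProfiler timer
        mf.SumBoundary(geom.periodicity());
        amrex::Gpu::streamSynchronize(); // wait for the kernels
    }
}
```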
I wonder if using a larger block size would help deterministic_fab_to_fab. It increases the cost of __syncthreads(), but it can also use more threads.
We could also try to cache that kernel metadata. But let's leave that for the future.
With 1024 threads, the kernel takes 0.9387 seconds. Caching the tags would save at most 0.12 seconds. It might be good to use BoxIndexer instead of the divisions; this should be done in general for TagParallelFor. If I understand it correctly, in my test there are 16 cells that overlap 3x each and about 32k that don't overlap. So the tags could be split into overlapping and non-overlapping ones, with all the non-overlapping cells executed in parallel.
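To illustrate the cost being discussed: decoding a flat cell index with plain divisions looks like the sketch below, while (as I understand it) BoxIndexer replaces the divisions with precomputed fast-divmod arithmetic. The struct here is illustrative, not the AMReX implementation.

```cpp
// Illustrative only: mapping a flat cell index to (i,j,k) with plain
// integer divisions. On GPUs the 64-bit divisions are expensive; a
// precomputed indexer avoids them with multiply-based divmod.
struct FlatToIJK
{
    int lox, loy, loz; // lower corner of the box
    int nx, ny;        // box extents in x and y

    void operator() (long icell, int& i, int& j, int& k) const
    {
        long nxy = long(nx) * long(ny);
        k = loz + int(icell / nxy);              // division 1
        long r = icell - (icell / nxy) * nxy;
        j = loy + int(r / nx);                   // division 2
        i = lox + int(r - (r / nx) * long(nx));
    }
};
```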
We could tile the destination fab box and put a block on each tile. That will allow us to launch more blocks.
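One way to realize this, sketched under the assumption that chopping the box on the host is acceptable:

```cpp
#include <AMReX_BoxArray.H>

// Chop the destination fab's box into fixed-size tiles; one GPU block
// could then be launched per tile. The tile size of 32 is arbitrary.
amrex::BoxArray make_tiles (amrex::Box const& dest_box)
{
    amrex::BoxArray tiles(dest_box);
    tiles.maxSize(32); // at most 32 cells per direction per tile
    return tiles;
}
```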
I did some tests on Perlmutter using the latest version.
I am also convinced that the new implementation is correct. I used WarpX for testing: it calls both the old and new SumBoundary functions and discards the new values. The relative difference can be as big as 1.e-10 in the test. In this WarpX test, for example, we have
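A sketch of this comparison, with `old_sum_boundary` and `new_sum_boundary` as hypothetical stand-ins for the two code paths:

```cpp
#include <AMReX_Geometry.H>
#include <AMReX_MultiFab.H>

// Hypothetical stand-ins for the old and new implementations.
void old_sum_boundary (amrex::MultiFab& mf, amrex::Geometry const& geom);
void new_sum_boundary (amrex::MultiFab& mf, amrex::Geometry const& geom);

// Run both implementations on copies of the same data and report the
// largest relative difference between the results.
amrex::Real max_rel_diff (amrex::MultiFab const& mf,
                          amrex::Geometry const& geom)
{
    amrex::MultiFab a(mf.boxArray(), mf.DistributionMap(),
                      mf.nComp(), mf.nGrowVect());
    amrex::MultiFab b(mf.boxArray(), mf.DistributionMap(),
                      mf.nComp(), mf.nGrowVect());
    amrex::MultiFab::Copy(a, mf, 0, 0, mf.nComp(), mf.nGrowVect());
    amrex::MultiFab::Copy(b, mf, 0, 0, mf.nComp(), mf.nGrowVect());

    old_sum_boundary(a, geom);
    new_sum_boundary(b, geom);

    amrex::Real ref = a.norm0();
    amrex::MultiFab::Subtract(a, b, 0, 0, mf.nComp(), 0);
    return (ref > 0.0) ? a.norm0() / ref : amrex::Real(0.0);
}
```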
I think we should probably make the data consistent when SumBoundary finishes, regardless of whether it is deterministic or not.
On the other hand, a lot of codes have already been using OverrideSync and FillBoundaryAndSync for syncing the data, so this might be redundant.
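For completeness, the existing pattern referred to here, as a minimal sketch:

```cpp
#include <AMReX_Geometry.H>
#include <AMReX_MultiFab.H>

// Sum ghost-cell contributions, then pick a single owner's value for
// every shared node so all copies agree exactly. Mainly relevant for
// nodal data, where multiple boxes own the same node.
void sum_then_sync (amrex::MultiFab& nodal_mf, amrex::Geometry const& geom)
{
    nodal_mf.SumBoundary(geom.periodicity());
    nodal_mf.OverrideSync(geom.periodicity());
}
```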