Cache avoidance experiment #9
Conversation
Uhm, that's very weird. I'd expect cache avoidance to increase latency, not to reduce it, as it appears to do in the first couple of plots. If I understand the IMB code correctly, they create vectors larger than the cache size and then each iteration uses a different chunk of the vector, which is the strategy I tried to implement (but didn't test, because it was clearly not good for 0-element vectors; thanks for catching that). I guess using a random vector should be fine, but I don't have many other ideas.
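For reference, a minimal sketch of that IMB-style chunk-rotation strategy, assuming MPI.jl's v0.20-style `Reduce!` keyword API and a placeholder cache size; this is an illustration, not the code in this PR:

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD

msg_len  = 1024                              # elements per reduction (placeholder)
cache_sz = 32 * 2^20                         # hypothetical 32 MiB last-level cache
nchunks  = cld(2 * cache_sz, msg_len * sizeof(Float32))
sendbuf  = rand(Float32, nchunks * msg_len)  # buffer ~2x larger than the cache
recvbuf  = similar(sendbuf)

for i in 0:999
    # Touch a different chunk on every iteration, so the data used by the
    # reduction is never resident in the cache from a previous iteration.
    off = (i % nchunks) * msg_len
    MPI.Reduce!(@view(sendbuf[off+1:off+msg_len]),
                @view(recvbuf[off+1:off+msg_len]),
                +, comm; root=0)
end

MPI.Finalize()
```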
Codecov Report

```
@@             Coverage Diff             @@
##    mg/reduce-cache-avoidance   #9   +/-   ##
===============================================
  Coverage             ?       99.70%
===============================================
  Files                ?           14
  Lines                ?          344
  Branches             ?            0
===============================================
  Hits                 ?          343
  Misses               ?            1
  Partials             ?            0
```

☔ View full report at Codecov.
Experiment 2
Configuration 1 (2 nodes, 1 rank each)
Configuration 2 (4 nodes, 1 rank each)
Hi Dr. @giordano, after running experiment 2, I think cache avoidance is working. The problem with the first experiment (#9 (comment)) was a Slurm configuration issue. Also, in the first experiment I was using 1 node with 4 MPI ranks, so there was a high chance that the MPI ranks communicated through shared memory. For the second experiment, instead of using a single node, I used 2 different nodes with one rank each. This makes sure that there is no communication through a shared-memory channel; everything goes over the network. Furthermore, we can pass the off_cache parameter so that each iteration operates on data outside the cache.

Example
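Something along these lines, using the call quoted later in this thread (the off_cache value is a size in bytes; 1 << 16 = 64 KiB, larger than the L1 cache):

```julia
# Run the IMB reduce benchmark with cache avoidance enabled; off_cache makes
# each iteration use buffers outside a 64 KiB working set.
benchmark(IMBReduce(Float32; off_cache=1<<16))
```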
This is awesome, thanks a lot! Ok, I should get around to running the reduce benchmark on Fugaku again in the next weeks, to see if this solves the mysterious lower latency. If you don't mind, I'd put this PR on hold for the time being.
There's quite a bit of code duplication, but that isn't really your fault here; there has always been some. Still, this PR is a further example that things should be rationalised to avoid copying the same blocks of code everywhere.
Co-authored-by: Mosè Giordano <giordano@users.noreply.github.com>
Ok, sorry for the very late follow-up, but I finally got around to running the benchmarks on Fugaku with this branch, using benchmark(IMBReduce(Float32; off_cache=1<<16)), but sadly this doesn't seem to close the gap 😞 At this point I have very little clue about what explains the large difference in https://github.com/giordano/julia-on-fugaku/blob/91b07995bd2b63761edb57e6376a7b6b77c59728/benchmarks/mpi-collective/reduce-latency.pdf, and the fact that it vanishes at 64 kiB, exactly the size of the L1 cache of the A64FX, was a good indication this was related to the cache 😞
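For context, the off_cache value lines up exactly with the A64FX L1d size:

```julia
julia> 1 << 16             # off_cache value, in bytes
65536

julia> (1 << 16) ÷ 1024    # = 64 KiB, the per-core L1 data cache of the A64FX
64
```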
Hi,
In order to test the cache avoidance I ran osu_reduce.jl on Noctua 1 multiple times, but it seems that this strategy is not working.
First I tried setting the buffer size greater than the L1 cache, then greater than the L3 cache. The results below were generated by setting the send buffer greater than the L3 cache of Noctua 1. Also, I ran it on one node with 4 MPI ranks.
The graphs below show that the run with cache avoidance is actually faster.
I also tried other strategies for cache avoidance, for example using random numbers, or setting new send and receive buffers in each iteration (see the sketch below), but none of these strategies worked either. Do you have any other ideas on how to solve this problem?
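A minimal sketch of that fresh-buffer variant, again assuming MPI.jl's v0.20-style keyword API; buffer size and iteration count are placeholders:

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD

msg_len = 1024                           # elements per reduction (placeholder)
for _ in 1:1000
    # Allocate fresh, randomly initialised buffers on every iteration, in the
    # hope that the reduction never operates on cache-resident data.
    sendbuf = rand(Float32, msg_len)
    recvbuf = zeros(Float32, msg_len)
    MPI.Reduce!(sendbuf, recvbuf, +, comm; root=0)
end

MPI.Finalize()
```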