
Cache avoidance experiment #9


Open · wants to merge 9 commits into base: mg/reduce-cache-avoidance

Conversation


@binhudakhalid binhudakhalid commented Nov 18, 2022

Hi,

In order to test the cache avoidance, I ran 'osu_reduce.jl' on Noctua 1 multiple times, but it seems that this strategy is not working.
First I tried setting the buffer size to be larger than the L1 cache, then larger than the L3 cache. The results below were generated with the send_buffer larger than the L3 cache of Noctua 1. Also, I ran it on one node with 4 MPI ranks.

The graphs below show that the runs with cache avoidance are actually faster.

Attempt #1: [Run 1, Run 2, and Run 3 latency plots]

I also tried other cache-avoidance strategies, for example filling the buffers with random numbers, or allocating new send and receive buffers in each iteration, but none of these worked either. Do you have any other ideas for how to solve this problem? A sketch of the two strategies follows below.
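
For reference, a minimal sketch of those two strategies with plain MPI.jl (the buffer size and the loop are illustrative assumptions, not the actual benchmark code):

using MPI

MPI.Init()
comm = MPI.COMM_WORLD

n = 1 << 23  # hypothetical element count: 32 MiB of Float32, chosen to exceed the L3 cache

# Strategy 1: fill the send buffer with random numbers once.
send_buffer = rand(Float32, n)
recv_buffer = zeros(Float32, n)
MPI.Reduce!(send_buffer, recv_buffer, +, comm; root=0)

# Strategy 2: allocate fresh send/receive buffers in every iteration,
# so each iteration should start with cold caches.
for _ in 1:10
    local_send = rand(Float32, n)
    local_recv = zeros(Float32, n)
    MPI.Reduce!(local_send, local_recv, +, comm; root=0)
end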

@giordano (Member)

Uhm, that's very weird. I'd expect cache avoidance to increase latency, not to reduce it, as it appears to do in the first couple of plots. If I understand the IMB code correctly, they create vectors larger than the cache size and then each iteration uses a different chunk of the vector, which is the strategy I tried to implement (but didn't test, because it was clearly not good for 0-element vectors, thanks for catching that). I guess using a random vector should be fine, but I don't have many other ideas.
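
For concreteness, a sketch of that rotation strategy as I understand it (the cache size and message length here are assumptions, not the actual IMB or package code):

# Allocate a send buffer much larger than the last-level cache, then
# advance through it so every iteration touches a chunk that is
# unlikely to still be resident in cache.
cache_bytes = 1 << 25   # assumed last-level cache size in bytes
msg_len     = 1024      # message length in elements
T           = Float32
nchunks     = cld(cache_bytes, msg_len * sizeof(T)) + 1
big_buffer  = rand(T, nchunks * msg_len)

for i in 0:99
    offset = (i % nchunks) * msg_len
    chunk  = view(big_buffer, offset + 1:offset + msg_len)
    # ... use `chunk` as the send buffer of the collective here ...
end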


codecov-commenter commented Dec 11, 2022

Codecov Report

❗ No coverage uploaded for pull request base (mg/reduce-cache-avoidance@dfbb92b).
Patch has no changes to coverable lines.

Additional details and impacted files
@@                     Coverage Diff                      @@
##             mg/reduce-cache-avoidance       #9   +/-   ##
============================================================
  Coverage                             ?   99.70%           
============================================================
  Files                                ?       14           
  Lines                                ?      344           
  Branches                             ?        0           
============================================================
  Hits                                 ?      343           
  Misses                               ?        1           
  Partials                             ?        0           




binhudakhalid commented Dec 11, 2022

Experiment 2

Configuration 1 (2 nodes, 1 rank each): [Attempt #1: Run 1 and Run 2 latency plots]

Configuration 2 (4 nodes, 1 rank each): [Attempt #1: Run 1 and Run 2 latency plots]

Hi Dr. @giordano,

After running experiment 2, I think cache avoidance is working. The problem with the first experiment (#9 (comment)) was due to a Slurm configuration issue.

Also, in the first experiment I was using 1 node with 4 MPI ranks, so there was a high chance that the MPI ranks were communicating through the CPU-to-CPU interconnect. That is why the benchmarks with cache avoidance had lower latency.

For the second experiment, instead of using a single node, I used 2 different nodes with one rank each. This makes sure that there is no communication through a CPU-to-CPU interconnect. I have also made sure that the nodes are connected to the same switch, which guarantees reliable measurements between runs.


Furthermore, cache avoidance happens only if we pass the off_cache parameter, which takes a value in bytes.

Example

benchmark(IMBAllreduce(; verbose, filename, off_cache=28835))  # with cache avoidance (value in bytes)
benchmark(IMBAllreduce(; verbose, filename))                   # without cache avoidance
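
To illustrate what a value in bytes implies, here is a hypothetical helper (not the package's actual internals) computing how many distinct send-buffer windows are needed to cover a cache of that size:

# How many rotating windows of `msg_len` elements of type `T` are
# needed to cover `off_cache` bytes, plus one for safety.
function cache_windows(off_cache::Integer, msg_len::Integer, ::Type{T}) where {T}
    return cld(off_cache, msg_len * sizeof(T)) + 1
end

cache_windows(28835, 1024, Float32)  # 9 windows for the example above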


giordano commented Dec 11, 2022

This is awesome, thanks a lot! Ok, I should get around to running the reduce benchmark on Fugaku again in the next weeks, to see if this solves the mysterious lower latency. If you don't mind, I'd put this PR on hold for the time being.

@giordano (Member) left a comment

There's quite a bit of code duplication, but this isn't really your fault: there has always been some, and this PR is a further example that things should be rationalised to avoid copying the same blocks of code everywhere.

binhudakhalid and others added 2 commits December 11, 2022 23:41
Co-authored-by: Mosè Giordano <giordano@users.noreply.github.com>
@giordano (Member)

Ok, sorry for the very late follow-up, but I finally got around to running the benchmarks on Fugaku with this branch with

benchmark(IMBReduce(Float32; off_cache=1<<16))

but sadly this doesn't seem to fill the gap 😞 At this point I have very little clue to explain why there's such a large difference in https://github.com/giordano/julia-on-fugaku/blob/91b07995bd2b63761edb57e6376a7b6b77c59728/benchmarks/mpi-collective/reduce-latency.pdf; the fact that it vanishes at 64 KiB, exactly the size of the L1 cache of the A64FX, was a good indication this was related to the cache 😞
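
For reference, the arithmetic behind that off_cache value, as a quick REPL check:

julia> 1 << 16           # off_cache in bytes
65536

julia> (1 << 16) ÷ 1024  # 64 KiB, the per-core L1 data cache of the A64FX
64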
