
Cache avoidance experiment #9


Open · wants to merge 9 commits into base: mg/reduce-cache-avoidance

Conversation


@binhudakhalid binhudakhalid commented Nov 18, 2022

Hi,

In order to test the cache avoidance, I ran 'osu_reduce.jl' on Noctua 1 multiple times, but it seems that this strategy is not working.
First I tried setting the buffer size to be larger than the L1 cache, then larger than the L3 cache. The results below were generated with the send_buffer larger than the L3 cache of Noctua 1. Also, I ran it on one node with 4 MPI ranks.

The graphs below show that the runs with cache avoidance are actually faster.

Attempt #1: [Run 1, Run 2, and Run 3 latency plots]

I also tried other cache-avoidance strategies, for example filling the buffers with random numbers, or allocating new send and receive buffers in each iteration, but none of these worked either. Do you have any other ideas for how to solve this problem? A sketch of the two strategies follows below.
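
For reference, a minimal sketch of those two strategies with plain MPI.jl (the buffer size and the loop are illustrative assumptions, not the actual benchmark code):

using MPI

MPI.Init()
comm = MPI.COMM_WORLD

n = 1 << 23  # hypothetical element count: 32 MiB of Float32, chosen to exceed the L3 cache

# Strategy 1: fill the send buffer with random numbers once.
send_buffer = rand(Float32, n)
recv_buffer = zeros(Float32, n)
MPI.Reduce!(send_buffer, recv_buffer, +, comm; root=0)

# Strategy 2: allocate fresh send/receive buffers in every iteration,
# so each iteration should start with cold caches.
for _ in 1:10
    local_send = rand(Float32, n)
    local_recv = zeros(Float32, n)
    MPI.Reduce!(local_send, local_recv, +, comm; root=0)
end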

@giordano (Member)

Uhm, that's very weird. I'd expect cache avoidance to increase latency, not to reduce it, as it appears to do in the first couple of plots. If I understand the IMB code correctly, they create vectors larger than the cache size and then each iteration uses a different chunk of the vector, which is the strategy I tried to implement (but didn't test, because it was clearly not good for 0-element vectors, thanks for catching that). I guess using a random vector should be fine, but I don't have many other ideas.
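
For concreteness, a sketch of that rotation strategy as I understand it (the cache size and message length here are assumptions, not the actual IMB or package code):

# Allocate a send buffer much larger than the last-level cache, then
# advance through it so every iteration touches a chunk that is
# unlikely to still be resident in cache.
cache_bytes = 1 << 25   # assumed last-level cache size in bytes
msg_len     = 1024      # message length in elements
T           = Float32
nchunks     = cld(cache_bytes, msg_len * sizeof(T)) + 1
big_buffer  = rand(T, nchunks * msg_len)

for i in 0:99
    offset = (i % nchunks) * msg_len
    chunk  = view(big_buffer, offset + 1:offset + msg_len)
    # ... use `chunk` as the send buffer of the collective here ...
end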


codecov-commenter commented Dec 11, 2022

Codecov Report

❗ No coverage uploaded for pull request base (mg/reduce-cache-avoidance@dfbb92b).
Patch has no changes to coverable lines.

Additional details and impacted files
@@                     Coverage Diff                      @@
##             mg/reduce-cache-avoidance       #9   +/-   ##
============================================================
  Coverage                             ?   99.70%           
============================================================
  Files                                ?       14           
  Lines                                ?      344           
  Branches                             ?        0           
============================================================
  Hits                                 ?      343           
  Misses                               ?        1           
  Partials                             ?        0           




binhudakhalid commented Dec 11, 2022

Experiment 2

Configuration 1 (2 nodes, 1 rank each): [Attempt #1: Run 1 and Run 2 latency plots]

Configuration 2 (4 nodes, 1 rank each): [Attempt #1: Run 1 and Run 2 latency plots]

Hi Dr. @giordano,

After running experiment 2, I think cache avoidance is working. The problem with the first experiment (#9 (comment)) was due to a Slurm configuration issue.

Also, in the first experiment I was using 1 node with 4 MPI ranks, so there was a high chance that the MPI ranks were communicating through the CPU-to-CPU interconnect. That is why the benchmarks with cache avoidance had lower latency.

For the second experiment, instead of using a single node, I used 2 different nodes with one rank each. This makes sure that there is no communication through a CPU-to-CPU interconnect. I have also made sure that the nodes are connected to the same switch, which guarantees reliable measurements between runs.


Furthermore, cache avoidance happens only if we pass the off_cache parameter, which takes a value in bytes.

Example

benchmark(IMBAllreduce(; verbose, filename, off_cache=28835))  # with cache avoidance (value in bytes)
benchmark(IMBAllreduce(; verbose, filename))                   # without cache avoidance
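
To illustrate what a value in bytes implies, here is a hypothetical helper (not the package's actual internals) computing how many distinct send-buffer windows are needed to cover a cache of that size:

# How many rotating windows of `msg_len` elements of type `T` are
# needed to cover `off_cache` bytes, plus one for safety.
function cache_windows(off_cache::Integer, msg_len::Integer, ::Type{T}) where {T}
    return cld(off_cache, msg_len * sizeof(T)) + 1
end

cache_windows(28835, 1024, Float32)  # 9 windows for the example above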


giordano commented Dec 11, 2022

This is awesome, thanks a lot! Ok, I should get around to running the reduce benchmark on Fugaku again in the next weeks, to see if this solves the mysterious lower latency. If you don't mind, I'd put this PR on hold for the time being.

@giordano (Member) left a comment

There's quite a bit of code duplication, but this isn't really your fault: there has always been some, and this PR is a further example that things should be rationalised to avoid copying the same blocks of code everywhere.

binhudakhalid and others added 2 commits December 11, 2022 23:41
Co-authored-by: Mosè Giordano <giordano@users.noreply.github.com>
@giordano (Member)

Ok, sorry for the very late follow-up, but I finally got around to running the benchmarks on Fugaku with this branch with

benchmark(IMBReduce(Float32; off_cache=1<<16))

but sadly this doesn't seem to fill the gap 😞 At this point I have very little clue to explain why there's such a large difference in https://github.com/giordano/julia-on-fugaku/blob/91b07995bd2b63761edb57e6376a7b6b77c59728/benchmarks/mpi-collective/reduce-latency.pdf; the fact that it vanishes at 64 KiB, exactly the size of the L1 cache of the A64FX, was a good indication this was related to the cache 😞
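
For reference, the arithmetic behind that off_cache value, as a quick REPL check:

julia> 1 << 16           # off_cache in bytes
65536

julia> (1 << 16) ÷ 1024  # 64 KiB, the per-core L1 data cache of the A64FX
64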
