Feat: restructuring of communication methods (and buffered communication for CUDA-Aware MPI) #167
base: main
Conversation
A new `DistributedMixIn` class is created with the aim of simplifying and unifying all communication calls in both `DistributedArray` and the operators (further hiding away all implementation details).
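For illustration, a minimal sketch of what such a unified helper could look like. The `mpi_allgather` name and signature follow the call visible in the diff snippets later in this thread; the dispatch logic shown here is only an assumption, not the PR's exact implementation:

```python
from mpi4py import MPI
import numpy as np


def mpi_allgather(base_comm, send_buf, recv_buf=None, engine="numpy"):
    """Gather send_buf from all ranks, hiding buffered vs object communication."""
    # `engine` ("numpy"/"cupy") is what the real helper would use to pick the
    # array module; it is not needed in this simplified CPU-only sketch
    if recv_buf is not None:
        # buffered path: contiguous (possibly device) arrays, plays well with
        # CUDA-aware MPI since no pickling/host copies are involved
        base_comm.Allgather(send_buf, recv_buf)
        return recv_buf
    # object path: pickle-based, always available but slower
    return base_comm.allgather(send_buf)


if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    send = np.full(3, comm.Get_rank(), dtype=np.float64)
    recv = np.empty(3 * comm.Get_size(), dtype=np.float64)
    print(mpi_allgather(comm, send, recv))
```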
@tharittk great start! Regarding the setup, I completely agree with the need to change the installation process for CUDA-Aware MPI; I have personally so far mostly relied on conda to install mpi4py. Regarding the code, as I briefly mentioned offline, whilst I think this is the right way to go:
I am starting to feel that the number of branches in the code is growing and it is about time to put it all in one place... What I am mostly concerned about is that this kind of branching will not only be present in...
@astroC86 we have also talked a bit about this in the context of your...
@tharittk I worked a bit more on this, but there is still quite a bit to do (added to the TODO list in the PR main comment)... Also I am not really sure why some tests are failing on some specific combinations of python/mpi/nranks but not on others... have not investigated yet...
@tharittk I left a few comments. `DistributedMixIn` is partially unfinished, as we need to make sure all methods get passed the same inputs and don't rely on self.* so that they can be used by both `DistributedArray` and the operators.
Also, running tests locally I pass NumPy+MPI and CuPy+MPI, but for CuPy+NCCL I get a segfault at tests_nccl/test_solver_nccl.py::test_cgls_broadcastdata_nccl[par0] (Fatal Python error: Segmentation fault). Same for you?
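As a rough illustration of the convention being asked for here (class and method bodies are hypothetical; only `DistributedMixIn` and the `_send` signature come from this PR), the communication helpers would receive the communicator and engine as arguments so that both `DistributedArray` and the operators can reuse them:

```python
class DistributedMixIn:
    # illustrative sketch: the helper takes everything it needs as arguments
    # and never reads self.base_comm / self.engine directly
    def _send(self, base_comm, send_buf, dest, count=None, tag=0, engine="numpy"):
        # object-based send kept deliberately simple for this sketch;
        # count is kept only to mirror the signature in the diff and is unused here
        base_comm.send(send_buf, dest=dest, tag=tag)


class DistributedArrayLike(DistributedMixIn):
    """Hypothetical stand-in for DistributedArray."""
    def __init__(self, base_comm, engine="numpy"):
        self.base_comm, self.engine = base_comm, engine

    def push(self, buf, dest):
        # attributes are forwarded explicitly, so an operator holding only a
        # sub-communicator could call the very same mixin method
        self._send(self.base_comm, buf, dest, engine=self.engine)
```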
return base_comm.allgather(send_buf)
return mpi_allgather(base_comm, send_buf, recv_buf, engine)

def _allgather_subcomm(self, send_buf, recv_buf=None):
This still needs to be modified like _allgather to avoid using self inside.
else:
    return mpi_allgather(self.sub_comm, send_buf, recv_buf, self.engine)

def _bcast(self, local_array, index, value):
Same here
mpi_bcast(self.base_comm, self.rank, self.local_array, index, value,
          engine=self.engine)

def _send(self, send_buf, dest, count=None, tag=0):
Same here
send_buf, dest, count, tag=tag,
engine=self.engine)

def _recv(self, recv_buf=None, source=0, count=None, tag=0):
Same here
from pylops.utils.backend import get_module


# TODO: return type annotation for both cupy and numpy
This needs to be handled and removed.
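For reference, one common way to write such an annotation (illustrative only; the actual alias or typing module eventually used in pylops/pylops-mpi may differ), importing cupy solely for type checking so CPU-only environments are unaffected:

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Union

import numpy as np

if TYPE_CHECKING:
    import cupy as cp

# alias covering both backends; the forward references are only resolved
# by static type checkers, so cupy never needs to be installed at runtime
NDArray = Union["np.ndarray", "cp.ndarray"]


def _example_zeros(n: int) -> NDArray:
    # hypothetical function, only here to show the annotation in use
    return np.zeros(n)
```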
@rohanbabbar04 I remember we discussed this a long time ago and you were actually the first to suggest using mixins... feel free to take a look and provide any feedback 😄
I don't have the problem with CuPy + NCCL - I still got 309 tests passed. This is my sequence of commands: I switch to...
Mmh interesting... I installed... and I fixed that 😄 So I can now run locally the following with success: ...
It seems that the installation I thought had CUDA-aware MPI actually did not, and things worked because I had `PYLOPS_MPI_CUDA_AWARE=0` set... Since you have CUDA-aware MPI, can you please test the entire suite?
Apart from this (which we definitely need to try to put on a CI, as it is just too many things to do locally now...), once you handle my code comments above we should be almost ready to merge 🚀
Objective
This PR has two main goals:
- Move all communication-related methods of `DistributedArray` and various linear operators to a single common place, namely the `DistributedMixIn` class, whose methods are used by both `DistributedArray` and the linear operators.
- A new environment variable `PYLOPS_MPI_CUDA_AWARE` is introduced (defaults to 1 but can be set to 0 to force object communications).
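A rough sketch of how the flag could be consumed; the variable name comes from this PR, while where and how it is parsed is an assumption:

```python
import os

# 1 (default): use buffered, CUDA-aware communication where possible
# 0: force object (pickle-based) communication
cuda_aware_mpi = int(os.environ.get("PYLOPS_MPI_CUDA_AWARE", "1")) == 1
```

In practice this means a user on a cluster without a CUDA-aware MPI can simply export `PYLOPS_MPI_CUDA_AWARE=0` before running.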
CUDA-Aware MPI
In order to have a CUDA-aware mpi4py installation, mpi4py must be built against a CUDA-aware MPI installation. Since conda installations of mpi4py do not ship with a CUDA-aware MPI, it is therefore required to use pip to install mpi4py. In the case of NCSA Delta, I create a new conda environment and do

`module load openmpi/5.0.5+cuda`

then

`MPICC=/path/to/mpicc pip install --no-cache-dir --force-reinstall mpi4py`

(where `--force-reinstall` is only needed because we already install `mpi4py` as part of the conda environment creation). And to run the tests (assuming you are already on the compute node):
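The exact test command was not captured above; a plausible invocation, assuming the pytest-based suites are launched under MPI (paths and rank counts are illustrative), would be something like:

```sh
# hypothetical example: run the MPI suite and the NCCL suite on 2 ranks each
mpiexec -n 2 python -m pytest tests/
mpiexec -n 2 python -m pytest tests_nccl/
```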
To Do
- The `mpi_allgather` method uses the `_prepare_allgather_inputs` method to prepare inputs such that they are all of the same size (via zero-padding). Whilst this is strictly needed for NCCL, we should instead consider leveraging MPI `AllGatherv` to avoid extra padding and cutting of arrays (a rough sketch is given below) - Use `AllGatherv` in `mpi_allgather` #169
- Modify `MatrixMult` to remove any direct call to mpi4py communication methods - Use new unified communication methods in `MatrixMult` #170
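For context, a rough sketch of the `AllGatherv`-based alternative mentioned in the first item. Assumptions: plain NumPy buffers and mpi4py's `Allgatherv`; integration with the `engine`/NCCL machinery is left out:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# each rank owns a chunk of a different length (the case that currently
# requires zero-padding in _prepare_allgather_inputs)
send_buf = np.arange(rank + 1, dtype=np.float64)

# exchange local lengths so every rank knows the receive counts/offsets
counts = np.array(comm.allgather(send_buf.size))
displs = np.concatenate(([0], np.cumsum(counts)[:-1]))

recv_buf = np.empty(counts.sum(), dtype=send_buf.dtype)
comm.Allgatherv(send_buf, [recv_buf, counts, displs, MPI.DOUBLE])
```

This avoids both padding the send buffers and trimming the gathered result afterwards.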