Commits (29)
1df4f21
Buffered Send/Recv
tharittk Aug 17, 2025
647ce65
Buffered Allreduce
tharittk Aug 17, 2025
31068f9
minor clean up
tharittk Aug 28, 2025
ca558fd
feat: WIP DistributedMix
mrava87 Sep 7, 2025
64854bb
feat: added _mpi file with actual mpi comm. implementations
mrava87 Sep 7, 2025
838ed0b
feat: moved _send to Distributed
mrava87 Sep 7, 2025
ab97e3d
mpi_recv for MixIn
tharittk Sep 12, 2025
dbe1f30
MixIn for allgather.
tharittk Sep 12, 2025
a08924b
fix flake8
tharittk Sep 12, 2025
b8bcd29
feat: added _bcast to DistributedMixIn and added comms as input for a…
mrava87SW Sep 23, 2025
f362436
feat: adapted all comm calls in DistributedArray to new method signat…
mrava87SW Sep 23, 2025
693f078
feat: adapted all comm calls in VStack to new method signatures
mrava87SW Sep 23, 2025
c852fc4
feat: adapted all comm calls in Fredholm1 to new method signatures
mrava87SW Sep 23, 2025
78d7538
feat: moved methods shared by _mpi and _nccl to _common
mrava87SW Sep 23, 2025
0138e3a
fix env flag precedence bug
tharittk Oct 9, 2025
ec88371
fix flake8
tharittk Oct 9, 2025
02ba45b
doc: added details about cuda-aware mpi in doc
mrava87SW Oct 15, 2025
a317a88
Merge branch 'main' into cuda-aware
tharittk Oct 16, 2025
473cd97
doc: finalized gpu doc
mrava87SW Oct 19, 2025
2cdb8f7
doc: added some docstrings to Distributed
mrava87SW Oct 19, 2025
1511744
Merge branch 'cuda-aware' of https://github.com/tharittk/pylops-mpi i…
mrava87SW Oct 20, 2025
563db16
minor: fix flake8
mrava87SW Oct 20, 2025
50c5bd2
bug: fix import of methods in test_ncclutils_nccl
mrava87SW Oct 20, 2025
a80f00e
bug: added engine to x array i test_matrixmult
mrava87SW Oct 20, 2025
2c67755
feat: finalized passing parameters to all methods in Distributed
mrava87SW Oct 25, 2025
e33e8f1
doc: added documentation and type hints to _mpi
mrava87SW Oct 25, 2025
e0fd716
minor: added TEST_CUPY_PYLOPS=0 in tests target
mrava87SW Oct 25, 2025
83f7a8b
minor: fix flake8
mrava87SW Oct 25, 2025
02efdbb
minor: fix more flake8
mrava87SW Oct 25, 2025
10 changes: 9 additions & 1 deletion docs/source/gpu.rst
@@ -11,7 +11,7 @@ This library must be installed *before* PyLops-mpi is installed.

.. note::

Set environment variable ``CUPY_PYLOPS=0`` to force PyLops to ignore the ``cupy`` backend.
Set the environment variable ``CUPY_PYLOPS=0`` to force PyLops to ignore the ``cupy`` backend.
This can also be used if a previous (or faulty) version of ``cupy`` is installed in your system,
otherwise you will get an error when importing PyLops.

@@ -22,6 +22,14 @@ can handle both scenarios. Note that, since most operators in PyLops-mpi are thi
some of the operators in PyLops that lack a GPU implementation also cannot be used in PyLops-mpi when working with
cupy arrays.

.. note::

By default when using ``cupy`` arrays, PyLops-MPI will try to use methods in MPI4Py that communicate memory buffers.
However, this requires a CUDA-Aware MPI installation. If your MPI installation is not CUDA-Aware, set the
environment variable ``PYLOPS_MPI_CUDA_AWARE=0`` to force PyLops-MPI to use methods in MPI4Py that communicate
general Python objects (this will incur a loss of performance!).
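
A minimal sketch of how this flag can be set from Python (assuming it is read when ``pylops_mpi``
is first imported; alternatively, export it in the shell before launching ``mpiexec``):

.. code-block:: python

   import os

   # force object-based (non-buffered) MPI communication for CuPy arrays;
   # only needed when the MPI installation is not CUDA-Aware
   os.environ["PYLOPS_MPI_CUDA_AWARE"] = "0"

   import pylops_mpi  # noqa: E402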


Moreover, PyLops-MPI also supports Nvidia's Collective Communication Library (NCCL) for highly-optimized
collective operations, such as AllReduce, AllGather, etc. This allows PyLops-MPI users to leverage
proprietary technologies like NVLink that might be available in their infrastructure for fast data communication.
44 changes: 38 additions & 6 deletions docs/source/installation.rst
@@ -15,7 +15,13 @@ The minimal set of dependencies for the PyLops-MPI project is:
* `MPI4py <https://mpi4py.readthedocs.io/en/stable/>`_
* `PyLops <https://pylops.readthedocs.io/en/stable/>`_

Additionally, to use the NCCL engine, the following additional
Additionally, to use the CUDA-aware MPI engine, the following additional
dependencies are required:

* `CuPy <https://cupy.dev/>`_
* CUDA-aware MPI

Similarly, to use the NCCL engine, the following additional
dependencies are required:

* `CuPy <https://cupy.dev/>`_
@@ -27,12 +33,18 @@ if this is not possible, some of the dependencies must be installed prior to ins

Download and Install MPI
========================
Visit the official MPI website to download an appropriate MPI implementation for your system.
Follow the installation instructions provided by the MPI vendor.
Visit the official website of your MPI vendor of choice to download an appropriate MPI
implementation for your system:

* `Open MPI <https://docs.open-mpi.org/>`_
* `MPICH <https://www.mpich.org/>`_
* `Intel MPI <https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library.html>`_
* ...

* `Open MPI <https://www.open-mpi.org/software/ompi/v1.10/>`_
* `MPICH <https://www.mpich.org/downloads/>`_
* `Intel MPI <https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library.html#gs.10j8fx>`_
Alternatively, the conda-forge community provides ready-to-use binary packages for four MPI implementations
(see the `MPI4Py documentation <https://mpi4py.readthedocs.io/en/stable/install.html#conda-packages>`_ for more
details). In this case, you can defer the MPI installation to the creation of the conda environment for your
project (see below).

Verify MPI Installation
=======================
@@ -42,6 +54,17 @@ After installing MPI, verify its installation by opening a terminal and running

>> mpiexec --version

Install CUDA-Aware MPI (optional)
=================================
To achieve the best performance when using PyLops-MPI with CuPy arrays, a CUDA-Aware version of
MPI must be installed.

For `Open MPI`, the conda-forge package has built-in CUDA support, as long as a pre-installed CUDA is detected.
Run the following `commands <https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/cuda.html#how-do-i-verify-that-open-mpi-has-been-built-with-cuda-support>`_
for diagnostics.
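
For instance, a quick check can be scripted as follows (a sketch assuming Open MPI is installed and
``ompi_info`` is available on the ``PATH``; the parameter name comes from the Open MPI documentation
linked above):

.. code-block:: python

   import subprocess

   # query Open MPI's build parameters and keep only the CUDA-support flag
   out = subprocess.run(["ompi_info", "--parsable", "--all"],
                        capture_output=True, text=True, check=True).stdout
   print([line for line in out.splitlines()
          if "mpi_built_with_cuda_support:value" in line])
   # a CUDA-Aware build reports ...:value:true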

For the other MPI implementations, refer to their specific documentation.

Install NCCL (optional)
=======================
To obtain highly-optimized performance on GPU clusters, PyLops-MPI also supports Nvidia's collective communication calls
@@ -103,6 +126,15 @@ For a ``conda`` environment, run
This will create and activate an environment called ``pylops_mpi``, with all
required and optional dependencies.

If you also want to install MPI as part of the creation of the conda environment,
modify the ``environment-dev.yml`` file by adding one of ``openmpi``, ``mpich``, ``impi_rt``, or ``msmpi``
just above ``mpi4py``. Note that only ``openmpi`` provides a CUDA-Aware MPI installation.

If you want to leverage CUDA-Aware MPI but prefer to use another MPI installation, you must
either switch to a `Pip`-based installation (see below), or move ``mpi4py`` into the ``pip``
section of the ``environment-dev.yml`` file and set the ``MPICC`` environment variable to point
to the ``mpicc`` compiler wrapper of your CUDA-Aware MPI installation.

If you want to enable `NCCL <https://developer.nvidia.com/nccl>`_ in PyLops-MPI, run this instead

.. code-block:: bash
114 changes: 114 additions & 0 deletions pylops_mpi/Distributed.py
@@ -0,0 +1,114 @@
from mpi4py import MPI
from pylops.utils import deps as pylops_deps  # avoid namespace clashes with pylops_mpi.utils
from pylops_mpi.utils._mpi import mpi_allreduce, mpi_allgather, mpi_bcast, mpi_send, mpi_recv, _prepare_allgather_inputs, _unroll_allgather_recv
from pylops_mpi.utils import deps

cupy_message = pylops_deps.cupy_import("the DistributedArray module")
nccl_message = deps.nccl_import("the DistributedArray module")

if nccl_message is None and cupy_message is None:
    from pylops_mpi.utils._nccl import (
        nccl_allgather, nccl_allreduce, nccl_bcast, nccl_send, nccl_recv
    )


class DistributedMixIn:
    r"""Distributed MixIn class

    This class implements all methods associated with communication primitives
    from MPI and NCCL. It is mostly in charge of identifying which communicator
    to use and whether buffered or object MPI primitives should be used
    (the former with NumPy arrays, or with CuPy arrays when a CUDA-Aware
    MPI installation is available; the latter with CuPy arrays when a CUDA-Aware
    MPI installation is not available).
    """
    def _allreduce(self, base_comm, base_comm_nccl,
                   send_buf, recv_buf=None, op: MPI.Op = MPI.SUM,
                   engine="numpy"):
        """Allreduce operation
        """
        if deps.nccl_enabled and base_comm_nccl is not None:
            return nccl_allreduce(base_comm_nccl, send_buf, recv_buf, op)
        else:
            return mpi_allreduce(base_comm, send_buf,
                                 recv_buf, engine, op)

    def _allreduce_subcomm(self, sub_comm, base_comm_nccl,
                           send_buf, recv_buf=None, op: MPI.Op = MPI.SUM,
                           engine="numpy"):
        """Allreduce operation with subcommunicator
        """
        if deps.nccl_enabled and base_comm_nccl is not None:
            return nccl_allreduce(sub_comm, send_buf, recv_buf, op)
        else:
            return mpi_allreduce(sub_comm, send_buf,
                                 recv_buf, engine, op)

    def _allgather(self, base_comm, base_comm_nccl,
                   send_buf, recv_buf=None,
                   engine="numpy"):
        """Allgather operation
        """
        if deps.nccl_enabled and base_comm_nccl is not None:
            if isinstance(send_buf, (tuple, list, int)):
                return nccl_allgather(base_comm_nccl, send_buf, recv_buf)
            else:
                send_shapes = base_comm.allgather(send_buf.shape)
                (padded_send, padded_recv) = _prepare_allgather_inputs(send_buf, send_shapes, engine="cupy")
                raw_recv = nccl_allgather(base_comm_nccl, padded_send, recv_buf if recv_buf is not None else padded_recv)
                return _unroll_allgather_recv(raw_recv, padded_send.shape, send_shapes)
        else:
            if isinstance(send_buf, (tuple, list, int)):
                return base_comm.allgather(send_buf)
            return mpi_allgather(base_comm, send_buf, recv_buf, engine)

    def _allgather_subcomm(self, send_buf, recv_buf=None):
        """Allgather operation with subcommunicator
        """
        if deps.nccl_enabled and getattr(self, "base_comm_nccl"):
            if isinstance(send_buf, (tuple, list, int)):
                return nccl_allgather(self.sub_comm, send_buf, recv_buf)
            else:
                send_shapes = self._allgather_subcomm(send_buf.shape)
                (padded_send, padded_recv) = _prepare_allgather_inputs(send_buf, send_shapes, engine="cupy")
                raw_recv = nccl_allgather(self.sub_comm, padded_send, recv_buf if recv_buf is not None else padded_recv)
                return _unroll_allgather_recv(raw_recv, padded_send.shape, send_shapes)
        else:
            return mpi_allgather(self.sub_comm, send_buf, recv_buf, self.engine)

    def _bcast(self, local_array, index, value):
        """BCast operation
        """
        if deps.nccl_enabled and getattr(self, "base_comm_nccl"):
            nccl_bcast(self.base_comm_nccl, local_array, index, value)
        else:
            # self.local_array[index] = self.base_comm.bcast(value)
            mpi_bcast(self.base_comm, self.rank, self.local_array, index, value,
                      engine=self.engine)

    def _send(self, send_buf, dest, count=None, tag=0):
        """Send operation
        """
        if deps.nccl_enabled and self.base_comm_nccl:
            if count is None:
                count = send_buf.size
            nccl_send(self.base_comm_nccl, send_buf, dest, count)
        else:
            mpi_send(self.base_comm,
                     send_buf, dest, count, tag=tag,
                     engine=self.engine)

    def _recv(self, recv_buf=None, source=0, count=None, tag=0):
        """Receive operation
        """
        if deps.nccl_enabled and self.base_comm_nccl:
            if recv_buf is None:
                raise ValueError("recv_buf must be supplied when using NCCL")
            if count is None:
                count = recv_buf.size
            nccl_recv(self.base_comm_nccl, recv_buf, source, count)
            return recv_buf
        else:
            return mpi_recv(self.base_comm,
                            recv_buf, source, count, tag=tag,
                            engine=self.engine)
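
For orientation, here is a hypothetical usage sketch (not part of this PR) showing how a subclass
such as ``DistributedArray`` is expected to use ``DistributedMixIn``: the subclass stores its MPI/NCCL
communicators and array engine, and the mixin dispatches each call to NCCL, buffered MPI, or
object-based MPI accordingly. The class and file names below are illustrative only.

# toy_sum.py -- hypothetical sketch, not part of the library
import numpy as np
from mpi4py import MPI

from pylops_mpi.Distributed import DistributedMixIn


class ToySum(DistributedMixIn):
    """Toy class that sums per-rank arrays via the mixin's ``_allreduce``."""
    def __init__(self, base_comm=MPI.COMM_WORLD, base_comm_nccl=None, engine="numpy"):
        self.base_comm = base_comm
        self.base_comm_nccl = base_comm_nccl
        self.engine = engine

    def global_sum(self, local_array):
        # every rank contributes its local array; the reduced result is
        # identical on all ranks
        return self._allreduce(self.base_comm, self.base_comm_nccl,
                               local_array, op=MPI.SUM, engine=self.engine)


# run with, e.g.: mpiexec -n 4 python toy_sum.py
toy = ToySum()
print(toy.global_sum(np.ones(3)))  # expected: [4. 4. 4.] with 4 ranks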