
Commit 5362ed2

Merge pull request #167 from tharittk/cuda-aware
Feat: restructuring of communication methods (and buffered communication for CUDA-Aware MPI)
2 parents 489dfa6 + 02efdbb commit 5362ed2

File tree

13 files changed: +799, -232 lines changed

Makefile

Lines changed: 1 addition & 1 deletion
@@ -47,7 +47,7 @@ lint:
 	flake8 pylops_mpi/ tests/ examples/ tutorials/
 
 tests:
-	mpiexec -n $(NUM_PROCESSES) pytest tests/ --with-mpi
+	export TEST_CUPY_PYLOPS=0 && mpiexec -n $(NUM_PROCESSES) pytest tests/ --with-mpi
 
 # assuming NUM_PROCESSES <= number of gpus available
 tests_gpu:
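The change above pins ``TEST_CUPY_PYLOPS=0`` so the plain ``tests`` target never picks up the CuPy backend, while ``tests_gpu`` presumably leaves it enabled. As a rough, hypothetical sketch of the switch this variable drives (the actual test-suite logic may differ), a test module could select its array backend like this:

    # Hypothetical sketch of a backend switch driven by TEST_CUPY_PYLOPS;
    # the real test-suite mechanism may differ from this.
    import os

    if int(os.environ.get("TEST_CUPY_PYLOPS", 0)) == 1:
        import cupy as xp  # GPU test run (tests_gpu target)
    else:
        import numpy as xp  # CPU test run (tests target)

    print("testing with backend:", xp.__name__)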

docs/source/gpu.rst

Lines changed: 36 additions & 4 deletions
@@ -11,7 +11,7 @@ This library must be installed *before* PyLops-mpi is installed.
 
 .. note::
 
-   Set environment variable ``CUPY_PYLOPS=0`` to force PyLops to ignore the ``cupy`` backend.
+   Set the environment variable ``CUPY_PYLOPS=0`` to force PyLops to ignore the ``cupy`` backend.
    This can be also used if a previous (or faulty) version of ``cupy`` is installed in your system,
    otherwise you will get an error when importing PyLops.
 
@@ -22,6 +22,14 @@ can handle both scenarios. Note that, since most operators in PyLops-mpi are thi
 some of the operators in PyLops that lack a GPU implementation cannot be used also in PyLops-mpi when working with
 cupy arrays.
 
+.. note::
+
+   By default when using ``cupy`` arrays, PyLops-MPI will try to use methods in MPI4Py that communicate memory buffers.
+   However, this requires a CUDA-Aware MPI installation. If your MPI installation is not CUDA-Aware, set the
+   environment variable ``PYLOPS_MPI_CUDA_AWARE=0`` to force PyLops-MPI to use methods in MPI4Py that communicate
+   general Python objects (this will incur a loss of performance!).
+
+
 Moreover, PyLops-MPI also supports the Nvidia's Collective Communication Library (NCCL) for highly-optimized
 collective operations, such as AllReduce, AllGather, etc. This allows PyLops-MPI users to leverage the
 proprietary technology like NVLink that might be available in their infrastructure for fast data communication.
@@ -30,13 +38,35 @@ proprietary technology like NVLink that might be available in their infrastructu
 
 Set environment variable ``NCCL_PYLOPS_MPI=0`` to explicitly force PyLops-MPI to ignore the ``NCCL`` backend.
 However, this is optional as users may opt-out for NCCL by skip passing `cupy.cuda.nccl.NcclCommunicator` to
-the :class:`pylops_mpi.DistributedArray`
+the :class:`pylops_mpi.DistributedArray`.
+
+In summary:
+
+.. list-table::
+   :widths: 50 25 25
+   :header-rows: 1
+
+   * - Operation model
+     - Enabled with
+     - Disabled with
+   * - NumPy + MPI
+     - Default
+     - Cannot be disabled
+   * - CuPy + MPI
+     - ``PYLOPS_MPI_CUDA_AWARE=0``
+     - ``PYLOPS_MPI_CUDA_AWARE=1`` (default)
+   * - CuPy + CUDA-Aware MPI
+     - ``PYLOPS_MPI_CUDA_AWARE=1`` (default)
+     - ``PYLOPS_MPI_CUDA_AWARE=0``
+   * - CuPy + NCCL
+     - ``NCCL_PYLOPS_MPI=1`` (default)
+     - ``NCCL_PYLOPS_MPI=0``
 
 Example
 -------
 
 Finally, let's briefly look at an example. First we write a code snippet using
-``numpy`` arrays which PyLops-mpi will run on your CPU:
+``numpy`` arrays which PyLops-MPI will run on your CPU:
 
 .. code-block:: python
 
@@ -157,6 +187,8 @@ GPU+MPI, and GPU+NCCL):
     - ✅
     - ✅
     - ✅
+    - ✅
+    - ✅
   * - :class:`pylops_mpi.basicoperators.MPISecondDerivative`
     - ✅
     - ✅
@@ -184,4 +216,4 @@ GPU+MPI, and GPU+NCCL):
   * - :class:`pylops_mpi.optimization.basic.cgls`
     - ✅
     - ✅
-    - ✅
+    - ✅
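The note added above describes an import-time switch. As a minimal sketch, assuming ``PYLOPS_MPI_CUDA_AWARE`` is read when ``pylops_mpi`` is imported (as ``CUPY_PYLOPS`` is in PyLops), a script forced onto object-based MPI communication could look like:

    # Minimal sketch: disable buffered (CUDA-Aware) MPI communication.
    # Assumes the variable is read at import time, so it must be set
    # before pylops_mpi is imported (or exported in the shell).
    import os

    os.environ["PYLOPS_MPI_CUDA_AWARE"] = "0"

    import pylops_mpi  # falls back to communicating general Python objects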

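The last row of the summary table hinges on whether an NCCL communicator is passed to :class:`pylops_mpi.DistributedArray`. A hedged sketch of the two paths follows; the ``initialize_nccl_comm`` helper and the ``base_comm_nccl`` argument are taken from the PyLops-MPI NCCL tutorials and may differ across versions:

    # Hedged sketch: opting in to (or out of) NCCL by passing, or not
    # passing, an NCCL communicator to DistributedArray. Helper/argument
    # names follow the PyLops-MPI NCCL tutorials and may vary by version.
    import pylops_mpi

    # CuPy + MPI (or CUDA-Aware MPI): simply do not pass an NCCL communicator
    x = pylops_mpi.DistributedArray(global_shape=128, engine="cupy")

    # CuPy + NCCL: pass an NCCL communicator explicitly
    from pylops_mpi.utils._nccl import initialize_nccl_comm

    nccl_comm = initialize_nccl_comm()
    y = pylops_mpi.DistributedArray(global_shape=128,
                                    base_comm_nccl=nccl_comm,
                                    engine="cupy")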
docs/source/installation.rst

Lines changed: 38 additions & 6 deletions
@@ -15,7 +15,13 @@ The minimal set of dependencies for the PyLops-MPI project is:
 * `MPI4py <https://mpi4py.readthedocs.io/en/stable/>`_
 * `PyLops <https://pylops.readthedocs.io/en/stable/>`_
 
-Additionally, to use the NCCL engine, the following additional
+Additionally, to use the CUDA-aware MPI engine, the following additional
+dependencies are required:
+
+* `CuPy <https://cupy.dev/>`_
+* CUDA-aware MPI
+
+Similarly, to use the NCCL engine, the following additional
 dependencies are required:
 
 * `CuPy <https://cupy.dev/>`_
@@ -27,12 +33,18 @@ if this is not possible, some of the dependencies must be installed prior to ins
 
 Download and Install MPI
 ========================
-Visit the official MPI website to download an appropriate MPI implementation for your system.
-Follow the installation instructions provided by the MPI vendor.
+Visit the official website of your MPI vendor of choice to download an appropriate MPI
+implementation for your system:
+
+* `Open MPI <https://docs.open-mpi.org/>`_
+* `MPICH <https://www.mpich.org/>`_
+* `Intel MPI <https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library.html>`_
+* ...
 
-* `Open MPI <https://www.open-mpi.org/software/ompi/v1.10/>`_
-* `MPICH <https://www.mpich.org/downloads/>`_
-* `Intel MPI <https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library.html#gs.10j8fx>`_
+Alternatively, the conda-forge community provides ready-to-use binary packages for four MPI implementations
+(see `MPI4Py documentation <https://mpi4py.readthedocs.io/en/stable/install.html#conda-packages>`_ for more
+details). In this case, you can defer the installation to the stage when the conda environment for your project
+is created - see below for more details.
 
 Verify MPI Installation
 =======================
@@ -42,6 +54,17 @@ After installing MPI, verify its installation by opening a terminal and running
 
 >> mpiexec --version
 
+Install CUDA-Aware MPI (optional)
+=================================
+To be able to achieve the best performance when using PyLops-MPI with CuPy arrays, a CUDA-Aware version of
+MPI must be installed.
+
+For `Open MPI`, the conda-forge package has built-in CUDA support, as long as a pre-installed CUDA is detected.
+Run the following `commands <https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/cuda.html#how-do-i-verify-that-open-mpi-has-been-built-with-cuda-support>`_
+for diagnostics.
+
+For the other MPI implementations, refer to their specific documentation.
+
 Install NCCL (optional)
 =======================
 To obtain highly-optimized performance on GPU clusters, PyLops-MPI also supports the Nvidia's collective communication calls
@@ -103,6 +126,15 @@ For a ``conda`` environment, run
 This will create and activate an environment called ``pylops_mpi``, with all
 required and optional dependencies.
 
+If you want to also install MPI as part of the creation process of the conda environment,
+modify the ``environment-dev.yml`` file by adding ``openmpi``\``mpich``\``impi_rt``\``msmpi``
+just above ``mpi4py``. Note that only ``openmpi`` provides a CUDA-Aware MPI installation.
+
+If you want to leverage CUDA-Aware MPI but prefer to use another MPI installation, you must
+either switch to a `Pip`-based installation (see below), or move ``mpi4py`` into the ``pip``
+section of the ``environment-dev.yml`` file and export the variable ``MPICC`` pointing to
+the path of your CUDA-Aware MPI installation.
+
 If you want to enable `NCCL <https://developer.nvidia.com/nccl>`_ in PyLops-MPI, run this instead
 
 .. code-block:: bash
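The "Install CUDA-Aware MPI" section added above points to the Open MPI diagnostics page, which checks the ``mpi_built_with_cuda_support`` MCA value via ``ompi_info``. A small Python wrapper around that documented check (assuming ``ompi_info`` is on the ``PATH``) might be:

    # Sketch: verify that Open MPI was built with CUDA support, following
    # the Open MPI documentation linked in the diff above.
    import subprocess

    info = subprocess.run(
        ["ompi_info", "--parsable", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout

    for line in info.splitlines():
        if "mpi_built_with_cuda_support:value" in line:
            print(line)  # expect '...:value:true' on a CUDA-Aware build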

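Similarly, the ``MPICC`` route described for pip-based installs could be scripted as follows; the compiler-wrapper path is hypothetical, and mpi4py's source build honors ``MPICC`` per its installation documentation (``--no-binary`` forces a build from source rather than a prebuilt wheel):

    # Hypothetical sketch: build mpi4py from source against a specific
    # CUDA-Aware MPI by pointing MPICC at its compiler wrapper.
    import os
    import subprocess
    import sys

    env = dict(os.environ, MPICC="/opt/cuda-aware-mpi/bin/mpicc")  # hypothetical path
    subprocess.run(
        [sys.executable, "-m", "pip", "install",
         "--no-cache-dir", "--no-binary=mpi4py", "mpi4py"],
        env=env, check=True,
    )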