(Still) Excessive memory usage #118

@fstein93

Dear authors,

I am one of the CP2K developers and I am working on our quartically-scaling SOS-MP2 and RPA implementations. Marko Kabic used energy-only calculations with RPA to benchmark COSMA (test system: 128 water molecules).
I am currently implementing gradients for these methods. I know that my gradient implementation (available in the CP2K master trunk) requires roughly 3-4 times the memory of an energy-only calculation. I am testing the code on the GPU section of Daint. The code runs well with ScaLAPACK (libsci_acc). With COSMA, I can run it on a smaller system (up to 64 water molecules) and see a decent speedup of the PDGEMM calls compared to ScaLAPACK. Unfortunately, I cannot run larger systems (such as 128 water molecules), even on 1000 nodes.

A gradient calculation consists of a set of two calls to PDGEMM with the following global sizes in the case of 128 H2O molecules:

  1. n=m=17,408 and k=3,473,408 (also in case of energy-only calculations)
  2. n=3,473,408 and m=k=17,408 (not required in case of energy-only calculations).
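
For orientation, the raw storage behind these two calls can be estimated with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the usual GEMM shapes (A is m x k, B is k x n, C is m x n), 8 bytes per element because PDGEMM is double precision, and a perfectly even distribution over nodes. The helper names are made up for illustration, and none of COSMA's internal communication or device buffers are modeled.

```python
BYTES_PER_DOUBLE = 8  # PDGEMM operates on double-precision reals


def gemm_elements(m, n, k):
    """Element counts of A (m x k), B (k x n) and C (m x n) in C = A * B."""
    return {"A": m * k, "B": k * n, "C": m * n}


def gemm_bytes(m, n, k):
    """Bytes needed to hold A, B and C once, without any work buffers."""
    return sum(gemm_elements(m, n, k).values()) * BYTES_PER_DOUBLE


def per_node_gib(m, n, k, nodes):
    """Lower bound per node in GiB, assuming a perfectly even distribution."""
    return gemm_bytes(m, n, k) / nodes / 2**30


# Global sizes for 128 H2O molecules, taken from the two calls listed above.
calls = {
    "call 1": dict(m=17_408, n=17_408, k=3_473_408),
    "call 2": dict(m=17_408, n=3_473_408, k=17_408),
}

if __name__ == "__main__":
    for name, dims in calls.items():
        total_gib = gemm_bytes(**dims) / 2**30
        print(f"{name}: {total_gib:.1f} GiB in total")
        # Node counts mentioned in this issue.
        for nodes in (64, 128, 1000, 2048):
            gib = per_node_gib(**dims, nodes=nodes)
            print(f"  {nodes:5d} nodes: {gib:.2f} GiB per node")
```

Both calls involve the same three matrix shapes in different roles, so each one needs roughly 900 GiB for the matrices alone. Anything COSMA allocates on top of this lower bound is not captured here.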

I observe out-of-memory events on both the GPU and the CPU, depending on the setup, when COSMA is called.

My questions are:

  1. What are COSMA's memory requirements, or at least what scaling behavior should I expect?
  2. Could you add a message reporting the actual amount of missing memory in cases where COSMA is able to catch the OOM event?
  3. Could you provide a function that asks COSMA to release its buffers, so that the resources COSMA holds while idle can be used for other operations?

EDIT:
I can run energy-only calculations with 128 water molecules (just PDGEMM step 1) on 64 nodes. I can run the calculations on 2048 Daint nodes. Nevertheless, the memory requirements are extremely high, and it is very frustrating (and a waste of resources) to find a suitable number of nodes for a given calculation.

EDIT2:
The calculation with COSMA on 2048 nodes requires 3 times the resources of the same calculation with ScaLAPACK on 128 nodes.
