Dear authors,
I am one of the CP2K developers and I am working on our quartically-scaling SOS-MP2 and RPA implementations. Marko Kabic used energy-only calculations with RPA to benchmark COSMA (test system: 128 water molecules).
I am currently implementing gradients for these methods. My gradient implementation (available in the CP2K master trunk) requires roughly 3-4 times the memory of an energy-only calculation. I am testing the code on the GPU partition of Daint. The code runs well with ScaLAPACK (libsci_acc). With COSMA, I can run smaller systems (up to 64 water molecules) and see a decent speedup of the PDGEMM calls compared to ScaLAPACK. Unfortunately, I cannot run larger systems (such as 128 water molecules) even on 1000 nodes.
A gradient calculation consists of two PDGEMM calls with the following global sizes for 128 H2O molecules:
- m = n = 17,408 and k = 3,473,408 (also required for energy-only calculations)
- n = 3,473,408 and m = k = 17,408 (not required for energy-only calculations)
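For context, the minimum global footprint of the operands alone can be estimated with a short sketch (double precision assumed; COSMA's internal communication buffers come on top of this and are not modelled here):

```python
# Back-of-the-envelope estimate of the global memory needed just to hold
# the PDGEMM operands A (m x k), B (k x n) and C (m x n) in double precision.

def gemm_footprint_gb(m, n, k, bytes_per_elem=8):
    """Combined size of A, B and C in GB (decimal)."""
    return (m * k + k * n + m * n) * bytes_per_elem / 1e9

# Step 1: m = n = 17,408, k = 3,473,408 (energy-only and gradients)
step1 = gemm_footprint_gb(17408, 17408, 3473408)
# Step 2: n = 3,473,408, m = k = 17,408 (gradients only)
step2 = gemm_footprint_gb(17408, 3473408, 17408)

print(f"step 1: {step1:.0f} GB, step 2: {step2:.0f} GB")  # ~970 GB each
```

Each step therefore needs on the order of 1 TB globally for the matrices themselves, so the question is how much extra working memory COSMA allocates on top of that and how it scales with the node count.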
Depending on the setup, I observe out-of-memory events on both the GPU and the CPU when COSMA is called.
My questions are:
- What are COSMA's memory requirements, or at least what scaling behavior should I expect?
- Could you print a hint with the actual amount of missing memory whenever COSMA is able to catch the OOM event?
- Could you provide a function that asks COSMA to release its buffers, so that COSMA's idle resources can be used for other operations?
EDIT:
I can run energy-only calculations with 128 water molecules (just the first PDGEMM) on 64 nodes, and the gradient calculations on 2048 Daint nodes. Nevertheless, the memory requirements are extremely high, and it is very frustrating (and a waste of resources) to find a suitable number of nodes for a given calculation.
EDIT2:
The calculation with COSMA on 2048 nodes requires 3 times the resources of the ScaLAPACK calculation on 128 nodes.