[WIP] added option to run as native MPI #998
Conversation
Thank you for your contribution, Kenneth! Please note that we are planning on merging a PR that will make changes to some of the same code lines that you have made changes to (#871). We will try to get those merged onto `master` soon. After we have merged those changes, would you please mind rebasing your commit on the latest version of the `master` branch?
I will note that we take changes to the MPI API seriously, so this PR may end up taking a good bit of time and effort, particularly for testing.
Finally, can I ask, what OS are you using? Are you primarily planning on using this from HPC environments? Thanks,
Hey Kenneth, we have merged a different PR's changes to the same files that you have made changes to. Please rebase your PR onto the current `master` branch. Alternatively, if you would like me to do it, I can do the rebasing; just let me know.
Hi,
Maybe you could try rebasing, then if you run into any issues, let me know.
Kenneth
Force-pushed from 4fa1d5b to 38e8ea5. Co-authored-by: Austin E. Soplata <me@asoplata.com>
Apologies for the multiple force-pushes: the first one had all the code changes, but I realized it also changed the authorship, so I had to push a few more times to make it clear that you were the primary author of the commits.
I have rebased and rewritten the history of your `nativempi` branch. If you have any uncommitted or unpushed work, please make a backup before you pull from your `origin` remote! Once you've done that, if you pull then everyone should be on the same page.
This PR is going to require a lot of testing and a fair number of changes. The first order of business is: Are you satisfied with the rebase? I.e., do the changes now reflect all the changes that you wanted to make?
I will also ask: Can you please provide a script illustrating this use-case? Especially how to set up a "native MPI" "host" thing (I don't know the right word for it), such that the new code is utilized? I have never used MPI without using `mpiexec`, but we also need this knowledge for the documentation of how to use this functionality (this will not be merged without adding documentation explaining how to use it). Thank you
Austin,
The benchmark code would be invoked with `mpiexec` and the `--alreadympi` flag:
`mpiexec -np 26 nrniv -python -mpi -nobanner 'benchmark_hnn_core_runtime.py' '--alreadympi'`
I think I attached a modified `benchmark_hnn_core_runtime.py` to one of the emails.
Kenneth
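For reference, a minimal sketch of how a driver script launched this way might branch on the flag, assuming the `ALREADYMPI` value of `mpi_cmd` described at the end of this thread. The model setup, simulation length, and the way the flag is wired to the backend are illustrative, not taken from the attached benchmark script:

```python
# sketch.py -- illustrative only; launched as
#   mpiexec -np 26 nrniv -python -mpi -nobanner sketch.py --alreadympi
# The '--alreadympi' flag and the 'ALREADYMPI' mpi_cmd value come from this
# PR; everything else (model, tstop, process count) is a placeholder.
import sys
from hnn_core import MPIBackend, jones_2009_model, simulate_dipole

net = jones_2009_model()

if '--alreadympi' in sys.argv:
    # Every rank is already running inside the external MPI launch, so the
    # backend should not spawn its own mpiexec child job.
    with MPIBackend(mpi_cmd='ALREADYMPI'):
        dpls = simulate_dipole(net, tstop=170.0, n_trials=1)
else:
    # Pre-existing behavior: the backend launches nrniv itself via mpiexec.
    with MPIBackend(n_procs=26, mpi_cmd='mpiexec'):
        dpls = simulate_dipole(net, tstop=170.0, n_trials=1)
```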
Austin,
Sorry, I haven't used GitHub collaboratively. Is the new version of the `nativempi` branch in your repo? Should I pull that into my fork?
Kenneth
I will send you an email so we can separately discuss how to handle the git changes; I am happy to provide assistance with this.
Okay, thanks, knowing how to properly call the script you gave via email is super helpful. I will try to test it on our HPC soon.
To be clear, part of the reason I was asking for an example use-case is that for us to add a feature like this, the changes need to include documentation for both how and why you would want to use this feature (in addition to tests of the code changes). At minimum, we would want additions to our code-documentation website (i.e. https://jonescompneurolab.github.io/hnn-core/stable/index.html), such as on https://github.com/jonescompneurolab/hnn-core/blob/master/doc/parallel.md, and example code at https://github.com/jonescompneurolab/hnn-core/tree/master/examples. I'm happy to help convert the use-cases you provide into documentation for how to use the feature.
However, I don't yet fully grasp the motivation behind these changes; I have some, but limited, experience with MPI in HPC environments. Would you mind expounding on the benefits of this feature? For example, is it faster than using child `mpiexec` calls (I'm sure the benchmark script is intended for that question), does it make it easier/possible to run HNN across multiple nodes simultaneously, etc.? Any description of the benefits you provide will be very helpful, and we will want to incorporate it into our documentation itself.
Austin,
I'll try to write up something on Monday.
In terms of justification, the reason this mode is useful on HPC clusters is that it's more portable between MPI environments. (It was about 10 seconds faster for a three-minute run, but that's not much of a difference.)
One issue is the command to start the processes in the backend. It could be `mpiexec`, but it might also be `srun` or `mpirun`, depending on how the environment is configured. Another example would be the use of Singularity containers. For Neuroscience Gateway, we like to run apps in Singularity containers, so the parallel backend startup command would need to look something like `mpiexec -n 26 singularity singularity.img nrniv ...`.
A lot of MPI code is written to be MPI environment-agnostic, so the code doesn't have to worry about environment-specific details, like the invocation command or whether it's running in a Singularity container. For me, that style is easier to manage and work with.
Kenneth
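The portability concern above is the usual reason launch commands are made site-configurable. A rough sketch of that idea (not hnn-core's actual implementation; all template strings, names, and the Singularity image are illustrative):

```python
# Rough sketch of a site-configurable launcher; not hnn-core's actual code.
import shlex

LAUNCH_TEMPLATES = {
    'mpiexec': 'mpiexec -np {n_procs}',
    'mpirun': 'mpirun -np {n_procs}',
    'srun': 'srun -n {n_procs}',
    # Neuroscience Gateway-style wrapper around a Singularity container
    'singularity': 'mpiexec -n {n_procs} singularity exec singularity.img',
}

def build_launch_cmd(launcher, n_procs, script):
    """Assemble the command used to start the parallel simulation workers."""
    prefix = LAUNCH_TEMPLATES[launcher].format(n_procs=n_procs)
    return shlex.split(prefix) + ['nrniv', '-python', '-mpi', '-nobanner', script]

print(build_launch_cmd('srun', 26, 'benchmark_hnn_core_runtime.py'))
```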
Hello Kenneth,
I spoke with my HNN colleagues yesterday, and in addition to the motivations you explained, they also informed me that the current way we spawn child processes may actually be disallowed in some HPC environments due to security policy. Indeed, this is part of the justification for an older, un-merged PR by @rythorpe here: #506 (we may want to merge at least some of the code from that PR into this one). Enabling parallel HNN-core to run on a wider range of HPC environments, and become easier to use in custom MPI environments, is certainly enough motivation to work on merging this PR.
To be frank, I'm swamped right now and don't have enough attention-bandwidth to give this issue the focus it deserves -- however, this will change in a month, after we finish preparing for and run our online HNN Workshop in early April. I should be able to dive into this after that.
A very incomplete list of the work left to do for this PR includes:
- At minimum, any code changes need to pass a local run of `make test` on an installed development version of hnn-core. See our [Contributing Page](https://jonescompneurolab.github.io/hnn-core/stable/contributing.html) for how to set that up, and [this section](https://jonescompneurolab.github.io/hnn-core/stable/contributing.html#running-tests) for running the tests.
- Currently, the existing pytest tests pass, but this PR fails our `ruff` checks for style. Once you have installed the development version and packages for hnn-core, you can run `ruff check hnn_core` from the repository's top-level directory to see what errors are causing the problems. These need to be fixed, since they will cause our automated GitHub testing to fail as well.
- The use of the term "alreadympi" should be replaced with something more descriptive. Similarly, we could use a good term for the non-alreadympi case, which will help us communicate about these two distinct code paths.
- The commented-out uses of `kill_proc_name('nrniv')` should be re-enabled, BUT only in the non-alreadympi case. We still want to use that functionality in certain cases, or at least have a way to control it (maybe a `no_kill_processes` flag or something similar?).
- All new functions like `mpi_child.py::runit` need to have comprehensive docstring documentation. The name should also be improved.
- Tests! If we're going to support invoking hnn-core's MPI usage this way, then there need to be tests created inside our `hnn_core/tests` directory that explicitly run from inside such an external MPI invocation. I'm not sure how to do this, but maybe the easiest way is to start an MPI "host-thing" that executes some Python where the Python calls pytest directly, like https://docs.pytest.org/en/6.2.x/usage.html#calling-pytest-from-python-code (see the sketch after this list).
- Documentation: as I said in an earlier comment, once the API for this PR is settled, we need to add guidance to our documentation on how to use this functionality, including why you would want to use one type of MPI invocation vs our pre-existing `mpiexec` use.
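Regarding the testing bullet above, one possible shape for such a test driver (purely a sketch; the file name and test selection are assumptions, not part of this PR) is a small script that an external MPI launcher starts and that then invokes pytest from Python:

```python
# run_tests_under_mpi.py -- hypothetical sketch, not part of this PR.
# Launch with an external MPI command, e.g.:
#   mpiexec -np 2 python run_tests_under_mpi.py
# so that pytest itself runs inside the already-started MPI job.
import sys

import pytest

if __name__ == "__main__":
    # Run only the parallel-backend tests; the exit code propagates to mpiexec.
    sys.exit(pytest.main(["-x", "hnn_core/tests/test_parallel_backends.py"]))
```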
Austin,
There's no rush to get these in, from an NSG perspective. We hacked the `parallel_backends.py` file to use our MPI startup command-line options and made a special Tool on NSG to start up the user script in serial.
Also, I retired yesterday, so I'm not sure how much time I will be able to donate for cleanup.
One note on `kill_proc`: a cleaner way to kill off leftover processes on shared nodes is to use the process group id to identify processes started from a given process and only kill those. That should work for both cases.
Kenneth
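For context, a minimal sketch of the process-group approach Kenneth describes (an illustration, not code from this PR): start the workers in their own process group and signal only that group during cleanup.

```python
# Illustrative sketch of process-group-based cleanup; not this PR's code.
import os
import signal
import subprocess

# Start the MPI workers in their own session, which gives them a process
# group separate from other jobs on a shared node.
proc = subprocess.Popen(
    ['mpiexec', '-np', '4', 'nrniv', '-python', '-mpi', 'mpi_child.py'],
    start_new_session=True,
)
try:
    proc.wait()
finally:
    if proc.poll() is None:
        # Signal only the group we created, rather than killing every
        # 'nrniv' process on the node by name.
        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
```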
Congratulations on the retirement, Kenneth!!
Modifications to `parallel_backends.py` and `mpi_child.py` allow running as native MPI.
It will still run as before if `mpi_cmd` is not `ALREADYMPI`.
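To illustrate what running "as native MPI" means on the worker side (a sketch under the assumption that the script is launched with `mpiexec ... nrniv -python -mpi`; not code from this PR): every rank is already part of the external MPI job, so NEURON's ParallelContext can be queried directly instead of the backend spawning child processes.

```python
# Illustrative check of native-MPI execution; assumes launch via
#   mpiexec -np <N> nrniv -python -mpi -nobanner this_script.py
from mpi4py import MPI   # attaches to the MPI runtime that nrniv initialized
from neuron import h

pc = h.ParallelContext()
print(f"rank {int(pc.id())} of {int(pc.nhost())} "
      f"(MPI world size {MPI.COMM_WORLD.Get_size()})")
```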