Fixing non-blocking functions to be actually non-blocking #1192

avincigu · 2025-03-10T13:36:39Z

No description provided.

avincigu · 2025-03-10T19:37:09Z

Sorry for the amount of commits. There was a lot of experimentation. I will squash merge if approved.

markbrown314

Sorry for the amount of commits. There was a lot of experimentation. I will squash merge if approved.

Hi @avincigu please do the following:

Make sure you pull the latest on top.
Squash up to latest main commit.
Resubmit with rebase -i.

Do you have any idea why this code was blocking in the first place? Was it a workaround for some bug in the field? I think we should investigate that more closely. Do we have any tests to make sure this is working as advertised? Have those tests been run on a large fabric?

avincigu · 2025-03-11T15:46:58Z

Sorry for the amount of commits. There was a lot of experimentation. I will squash merge if approved.

Hi @avincigu please do the following:

Make sure you pull the latest on top.

Squash up to the tip of tree.

Resubmit with rebase -i

Do you have any idea why this code was blocking in the first place? Was it a workaround for some bug in the field? I think we should investigate that more closely. Do we have any tests to make sure this is working as advertised? Have those tests been run on a large fabric?

I think it was a miss. The bug was filed by Wasi when he discovered it for CXI, but I was able to find another instance in portals code. I looked at the other providers and confirmed that they are non-blocking to the best of my knowledge.

lstewart · 2025-03-11T17:51:34Z

Why submit to the main branch at all until there is a finished thing? This is what PRs are for. -Larry/confused

markbrown314 · 2025-03-11T20:28:04Z

@lstewart I did not say to merge into main (which the system would not let you do without a review). I meant resubmit the PR for review as a single commit rather than multiple commits so that the change could be reviewed.

lstewart · 2025-03-11T23:41:27Z

Makes sense! Thanks

avincigu · 2025-03-12T03:01:40Z

I did a little more investigating in the Portals code, and it seems that one was correct after all: the queue handler was already set at initialization. I have confirmed that the non-blocking fetch is now effectively non-blocking for all providers.

davidozog · 2025-03-18T13:27:21Z

src/atomic_nbi_c.c4

+        TYPE tmp = 0;                                                   \
+        shmem_internal_atomic_fetch_nbi(ctx, (void *) source, &tmp,     \
+                                        fetch, sizeof(TYPE), pe,        \
+                                        SHM_INTERNAL_SUM, ITYPE);       \


I'm uncertain why we set the op here... Did this come up while you were looking into this @avincigu?

My hunch is this should be FI_ATOMIC_READ in the OFI transport layer, so do we need something like?

#define SHM_INTERNAL_ATOMIC_READ FI_ATOMIC_READ

I do see some funny notes about FI_ATOMIC_READ and CXI/MR_ENDPOINT... Do they preclude this?

In other words, I'm unsure if it's always correct to pass FI_SUM to an fi_fetch_atomicmsg operation when no sum is meant to be performed. Are we sure about this?

It does look like FI_ATOMIC_READ is not supported in CXI. Not sure what to do here

One solution that comes to mind is to rely on ENABLE_MR_ENDPOINT to support the workaround (which we do throughout the OFI transport already), this time in shmem_transport_fetch_atomic_nbi. It might look like:

#ifdef ENABLE_MR_ENDPOINT
    op = FI_SUM;
#endif

Have you tested FI_ATOMIC_READ lately with newer CXI software @avincigu? I can't imagine how/why FI_ATOMIC_READ is not a supported op for fi_fetch_atomicmsg... but FI_SUM is... it's strange, right?
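The equivalence this workaround leans on (a fetch-and-add of zero returns the current value and leaves the target unchanged) can be illustrated locally with C11 atomics. This is only an analogy: it says nothing about what libfabric or the CXI provider actually support, and `fetch_via_sum` and `fetch_via_read` are hypothetical names that do not exist in SOS.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical helper: fetch by adding zero, analogous to passing
 * FI_SUM with a zero operand to a fetching atomic. */
static long fetch_via_sum(atomic_long *target)
{
    assert(target != NULL);
    return atomic_fetch_add(target, 0L);
}

/* Hypothetical helper: plain atomic read, analogous to FI_ATOMIC_READ. */
static long fetch_via_read(atomic_long *target)
{
    assert(target != NULL);
    return atomic_load(target);
}
```

Both helpers return the same value and neither modifies the target, which is why the sum-of-zero form works even though it is conceptually not a read.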

Hey @davidozog , I'm having trouble determining where these changes need to happen. What you are suggesting covers only OFI but breaks every other transport. Do I have to define a new SHM_INTERNAL_ATOMIC_READ on every transport (portals, ucx, shared, none)?

@davidozog I believe you are referencing this code:

in transport_ofi.h

static inline
void shmem_transport_atomic_fetch(shmem_transport_ctx_t* ctx, void *target,
                                  const void *source, size_t len, int pe,
                                  int datatype)
{
#ifdef ENABLE_MR_ENDPOINT
    /* CXI provider currently does not support fetch atomics with FI_DELIVERY_COMPLETE
     * That is why non-blocking API is used which uses FI_INJECT. FI_ATOMIC_READ is
     * also not supported currently */
    long long dummy = 0;
    shmem_transport_fetch_atomic_nbi(ctx, (void *) source, (const void *) &dummy,
                                     target, len, pe, FI_SUM, datatype);
#else
    shmem_transport_fetch_atomic_nbi(ctx, (void *) source, (const void *) NULL,
                                     target, len, pe, FI_ATOMIC_READ, datatype);
#endif
}

I think there should be a separate ticket to address if the FI_SUM and FI_DELIVERY_COMPLETE workarounds are still needed. (I will make a new ticket for this).

In this particular fix we are just cleaning up the problem: shmem_atomic_fetch_nbi() calls blocking fetch.

#define SHMEM_DEF_FETCH_NBI(STYPE,TYPE,ITYPE)                           \
    void SHMEM_FUNCTION_ATTRIBUTES                                      \
    SHMEM_FUNC_PROTOTYPE(STYPE, fetch_nbi, TYPE *fetch,                 \
                         const TYPE *source, int pe)                    \
        SHMEM_ERR_CHECK_INITIALIZED();                                  \
        SHMEM_ERR_CHECK_PE(pe);                                         \
        SHMEM_ERR_CHECK_CTX(ctx);                                       \
        SHMEM_ERR_CHECK_SYMMETRIC(source, sizeof(TYPE));                \
        shmem_internal_atomic_fetch(ctx, fetch, (void *) source,        \
                                    sizeof(TYPE), pe, ITYPE);           \
    }

We should fix that first.

I think the problem @avincigu is running into is that there is no function shmem_internal_atomic_fetch_nbi() like there is for blocking fetch (see shmem_internal_atomic_fetch()).

Note: There are two functions called shmem_internal_fetch_atomic_nbi() and shmem_internal_fetch_atomic() which require the op parameter (very confusing). This is very similar syntactically to the function name shmem_internal_atomic_fetch(). This should probably be refactored in the future. 😤

@avincigu you just need to implement the function shmem_internal_atomic_fetch_nbi(), which does not take the op parameter (see shmem_internal_atomic_fetch() in shmem_comm.h), and then define shmem_transport_atomic_fetch_nbi() for the various transports.

This is cleaner because it would not impose a potentially unnecessary sum operation at the higher level (read operations are likely faster than sum operations).

Hope that helps clarify. 🤔
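A minimal sketch of the layering suggested above, assuming mock types in place of the real SOS ones: the internal wrapper takes no op, and the transport decides which opcode to post. Everything prefixed `mock_` or `MOCK_` is hypothetical, and the real shmem_transport_atomic_fetch_nbi() signatures may differ.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { MOCK_OP_NONE, MOCK_OP_READ, MOCK_OP_SUM } mock_opcode_t;

typedef struct {
    mock_opcode_t last_op;  /* opcode the transport chose for the last fetch */
    int posted;             /* number of operations posted so far */
} mock_ctx_t;

/* Stand-in for a transport-level shmem_transport_atomic_fetch_nbi().
 * The transport picks the opcode; a real one would post to the NIC here. */
static void mock_transport_atomic_fetch_nbi(mock_ctx_t *ctx, void *target,
                                            const void *source, size_t len,
                                            int pe)
{
    (void) target; (void) source; (void) len; (void) pe;
#ifdef MOCK_ENABLE_MR_ENDPOINT
    ctx->last_op = MOCK_OP_SUM;   /* sum with zero, as in the CXI workaround */
#else
    ctx->last_op = MOCK_OP_READ;  /* plain atomic read */
#endif
    ctx->posted++;
}

/* Op-free internal wrapper, mirroring the shape of
 * shmem_internal_atomic_fetch() without the trailing wait. */
static void mock_internal_atomic_fetch_nbi(mock_ctx_t *ctx, void *target,
                                           const void *source, size_t len,
                                           int pe)
{
    assert(len > 0);
    mock_transport_atomic_fetch_nbi(ctx, target, source, len, pe);
    /* Deliberately no wait: completion is left to a later quiet or wait. */
}
```

Keeping the opcode out of the upper layer means the API-level macro never has to know whether a given transport reads natively or emulates the read.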

Hey @markbrown314 Thanks for chiming in. I did think about that, the trouble is that for some transports the fetch does require an op, check transport_portals4.h and transport_ucx.h for examples. In fact if you look at the UCX code you will notice that they are also using the ADD operation for the blocking fetch. This issue just keeps getting bigger the longer you look at it. My question is, if the code is working with the add operation, though we know it is conceptually wrong, do we really want to go through a big code rewrite? My proposal would be to file a separate issue where we would use the correct op codes for fetch, similar to what @davidozog is proposing but make sure it is developed for all transports and all fetches (blocking and non-blocking). I would say since the code is working the priority would be lower.

Sorry if I missed it. But, even though the name does not imply, the implementation is non-blocking IIUC. The blocking version of the same function has an shmem_internal_get_wait() which makes sure the operation is complete before returning.
The issue with FI_SUM being used instead of FI_ATOMIC_READ is just for CXI. And, the code is weirdly handling it for now. If that issue is not existent any more, the changes related to the work-around can be reverted.
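The relationship described here (the blocking fetch is the non-blocking issue followed by a wait) can be sketched with a mock that defers completion. This is an illustration under stated assumptions, not SOS code: `mock_pending_t`, `mock_fetch_nbi`, `mock_get_wait`, and `mock_fetch` are hypothetical, with `mock_get_wait` standing in for shmem_internal_get_wait().

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    long *dest;        /* where the fetched value will land */
    const long *src;   /* stand-in for the remote address */
    int pending;       /* 1 while the operation is outstanding */
} mock_pending_t;

/* Non-blocking: record the request and return immediately. */
static void mock_fetch_nbi(mock_pending_t *op, long *dest, const long *src)
{
    assert(op != NULL && dest != NULL && src != NULL);
    op->dest = dest;
    op->src = src;
    op->pending = 1;
}

/* Stand-in for shmem_internal_get_wait(): complete outstanding work. */
static void mock_get_wait(mock_pending_t *op)
{
    if (op->pending) {
        *op->dest = *op->src;
        op->pending = 0;
    }
}

/* Blocking fetch = non-blocking issue + wait. */
static long mock_fetch(mock_pending_t *op, const long *src)
{
    long val = 0;
    mock_fetch_nbi(op, &val, src);
    mock_get_wait(op);
    return val;
}
```

In the mock, the destination is untouched until the wait runs, which is the property the non-blocking API is expected to have: the caller may not read the result before a later completion call.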

@avincigu yeah it is a bit of a lift, but I think correctness takes higher priority. I will reach out to you directly to collaborate on the patch set.

Thanks @markbrown314, @avincigu, and @wrrobin - agreed.

To summarize, shmem_internal_fetch_atomic_nbi and shmem_transport_fetch_atomic_nbi don't really need to take an op. But "fetch_atomic" in all remote transports take an op, so the transport layer should pass one that makes sense.

In Portals, see v4.3 Table 3-4. I'd go with PTL_SUM and a value of zero maybe. Or something like that..

In UCX, same idea... maybe UCP_ATOMIC_FETCH_OP_FADD with a value of 0? See: the UCP docs

For OFI, probably FI_ATOMIC_READ (except for whatever CXI workaround).

(Alternatively, maybe SHM_ATOMIC_READ could be defined accordingly in each transport and passed through from the top layer, i.e. shmem_internal_fetch_atomic_nbi... but I think I prefer Mark's idea.)
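The per-transport choices summarized above could be captured in one small lookup, sketched here with local stand-in enums rather than the real PTL_SUM, UCP_ATOMIC_FETCH_OP_FADD, FI_ATOMIC_READ, or FI_SUM constants. All names in the block (`xport_kind_t`, `wire_op_t`, `fetch_plan_t`, `plan_atomic_read`) are hypothetical, and the mapping simply restates the suggestions in this thread rather than verified provider behavior.

```c
#include <assert.h>

typedef enum { XPORT_OFI, XPORT_OFI_CXI, XPORT_PORTALS4, XPORT_UCX } xport_kind_t;
typedef enum { WIRE_READ, WIRE_SUM } wire_op_t;

typedef struct {
    wire_op_t op;       /* opcode handed to the network API */
    long long operand;  /* operand sent with it; zero makes a sum act as a read */
    int uses_operand;   /* 0 when the opcode needs no operand */
} fetch_plan_t;

/* Choose how a transport would express "atomic read". */
static fetch_plan_t plan_atomic_read(xport_kind_t kind)
{
    fetch_plan_t plan = { WIRE_SUM, 0, 1 };  /* default: sum of zero */

    assert(kind >= XPORT_OFI && kind <= XPORT_UCX);
    if (kind == XPORT_OFI) {
        plan.op = WIRE_READ;   /* native read, as suggested for plain OFI */
        plan.uses_operand = 0;
    }
    return plan;
}
```

With a lookup of this shape, the CXI workaround becomes one more row in the table instead of an #ifdef in the caller.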

markbrown314 · 2025-03-19T15:52:45Z

Created issue #1196 to address CXI atomic support concerns

davidozog

2afa333 doesn't look quite right to me in Portals... @avincigu doesn't this re-route the atomic to the "get" implementation? (shmem_transport_atomic_fetch does a get whereas shmem_transport_fetch_atomic does PtlFetchAtomic).

I think the two separate atomic_fetch vs. fetch_atomic paths is making me go a little cross-eyed and might lead to mistakes like the above. I'm sorry to backpedal a bit, but wouldn't passing a new opcode, SHM_INTERNAL_ATOMIC_READ (or similar) to a "unified" shmem_transport_fetch_atomic_nbi (that takes an op) be more straightforward?

avincigu · 2025-03-19T16:45:28Z

2afa333 doesn't look quite right to me in Portals... @avincigu doesn't this re-route the atomic to the "get" implementation? (shmem_transport_atomic_fetch does a get whereas shmem_transport_fetch_atomic does PtlFetchAtomic).

I think the two separate atomic_fetch vs. fetch_atomic paths is making me go a little cross-eyed and might lead to mistakes like the above. I'm sorry to backpedal a bit, but wouldn't passing a new opcode, SHM_INTERNAL_ATOMIC_READ (or similar) to a "unified" shmem_transport_fetch_atomic_nbi (that takes an op) be more straightforward?

I did notice that. But I feel like I keep inheriting previous issues on this fix. I'm really just calling what the blocking fetch already calls. I am not sure why the code has a get.

markbrown314 · 2025-03-19T17:27:10Z

@davidozog correct me if I am wrong, but I believe PtlFetchAtomic() is non-blocking (along with most Portals API functions). Perhaps @avincigu can use the same technique here as in CXI and OFI: fetch as the sum of the current value and zero. We can flag the atomic that uses a get call for future investigation.

davidozog · 2025-03-19T17:37:24Z

@davidozog correct me if I am wrong, but I believe PtlFetchAtomic() is non-blocking (along with most Portals API function).

Yeah, it's non-blocking. My concern was that this function is not being called after 2afa333, it's the PtlGet path... But I think @avincigu is saying that the "get" version was already in place for SOS non-blocking atomics? That looks to be the case...

Either way... the shmem_*_atomic_fetch vs shmem_*_fetch_atomic is difficult to follow, and 2afa333 leans into it a bit... maybe we can clean this up now is all I'm suggesting. It's also fine to defer it.

src/transport_ucx.h

markbrown314

This is coming along well. Please do the following:

You need to release the operation and check status. (I missed that one when we discussed it earlier). See comment in transport_portals4.h.

Change the commit message to state that you are addressing bug #1066.
And mention as a sub-item the following:
- Adding new definitions for shmem_transport_atomic_fetch_nbi
Squash all commits into a single commit.

markbrown314 · 2025-03-24T15:25:43Z

src/transport_ofi.h

+void shmem_transport_atomic_fetch_nbi(shmem_transport_ctx_t* ctx, void *target,
+                                  const void *source, size_t len, int pe,
+                                  int datatype)
+{


Once we validate that FI_DELIVERY_COMPLETE is supported in CXI this shmem_transport_atomic_fetch may go back to being blocking. I will make a note of that in issue #1196.

markbrown314

Looks good to me. However, this branch has diverged from main. Could you please rebase this on top of main and make sure this is just a single commit, and I will give final approval? (assuming there are no other objections).

avincigu · 2025-03-24T17:17:07Z

Looks good to me. However, this branch has diverged from main. Could you please rebase this on top of main and make sure this is just a single commit, and I will give final approval? (assuming there are no other objections).

Done

davidozog · 2025-03-25T15:25:52Z

LGTM too. I just recommend unifying the transport_atomic_fetch vs transport_fetch_atomic paths at some point - it looks possible to do, albeit painful.

avincigu mentioned this pull request Mar 10, 2025

Non-blocking atomic fetch is not non-blocking #1066

Closed

avincigu assigned davidozog, wrrobin, markbrown314 and bcmIntc and unassigned davidozog, wrrobin, markbrown314 and bcmIntc Mar 10, 2025

avincigu requested review from bcmIntc, davidozog, markbrown314 and wrrobin March 10, 2025 19:35

avincigu requested a review from philipmarshall21 March 10, 2025 19:37

markbrown314 requested changes Mar 11, 2025

avincigu force-pushed the nbi branch from 1cf3c86 to 5ac145f Compare March 11, 2025 15:52

avincigu requested a review from markbrown314 March 11, 2025 16:56

avincigu force-pushed the nbi branch from 81df6e5 to b942a02 Compare March 12, 2025 02:59

avincigu requested review from abrooks98 and removed request for philipmarshall21 March 13, 2025 15:47

davidozog reviewed Mar 18, 2025

avincigu requested a review from davidozog March 19, 2025 15:39

davidozog reviewed Mar 19, 2025

markbrown314 reviewed Mar 20, 2025

markbrown314 requested changes Mar 20, 2025

avincigu requested a review from markbrown314 March 21, 2025 18:07

markbrown314 mentioned this pull request Mar 24, 2025

Do We Still Need OFI CXI (FI_DELIVERY_COMPLETE and FI_ATOMIC_READ) Work Arounds? #1196

Open

markbrown314 reviewed Mar 24, 2025

Bug Fix for Non-blocking atomic fetch is not non-blocking Sandia-Open…

684bffd

…SHMEM#1066 Adding new definitions for shmem_transport_atomic_fetch_nbi

avincigu force-pushed the nbi branch from 1a7bb10 to 684bffd Compare March 24, 2025 17:16

avincigu requested a review from markbrown314 March 24, 2025 17:17

markbrown314 approved these changes Mar 25, 2025

avincigu merged commit 757a33e into Sandia-OpenSHMEM:main Mar 25, 2025
36 checks passed

avincigu deleted the nbi branch March 25, 2025 15:07

markbrown314 mentioned this pull request Mar 25, 2025

Rename Confusing Internal Transport Fetch Atomic APIs #1199

Open
