Skip to content

Conversation

avincigu
Copy link
Collaborator

No description provided.

@avincigu
Copy link
Collaborator Author

avincigu commented Mar 10, 2025

Sorry for the amount of commits. There was a lot of experimentation. I will squash merge if approved.

Copy link
Collaborator

@markbrown314 markbrown314 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the amount of commits. There was a lot of experimentation. I will squash merge if approved.

Hi @avincigu please do the following:

  1. Make sure you pull the latest on top.
  2. Squash up to latest main commit.
  3. Resubmit with rebase -i.

Do you have any idea why this code was blocking in the first place? Was it to work around for some bug in the field? I think we should investigate that closer. Do we have any tests to make sure this working as advertised. Have those tests been run on a large fabric?

@avincigu
Copy link
Collaborator Author

Sorry for the amount of commits. There was a lot of experimentation. I will squash merge if approved.

Hi @avincigu please do the following:

  1. Make sure you pull the latest on top.
  2. Squash up to the tip of tree.
  3. Resubmit with rebase -i

Do you have any idea why this code was blocking in the first place? Was it to work around for some bug in the field? I think we should investigate that closer. Do we have any tests to make sure this working as advertised. Have those tests been run on a large fabric?

I think it was a miss. The bug was filed by Wasi when he discovered it for CXI, but I was able to find another instance in portals code. I looked at the other providers and confirmed that they are non-blocking to the best of my knowledge.

@lstewart
Copy link
Collaborator

lstewart commented Mar 11, 2025 via email

@markbrown314
Copy link
Collaborator

@lstewart I did not say to merge into main (which the system would not let you do without a review). I meant resubmit the PR for review as a single commit rather than multiple commits so that the change could be reviewed.

@lstewart
Copy link
Collaborator

lstewart commented Mar 11, 2025 via email

@avincigu
Copy link
Collaborator Author

avincigu commented Mar 12, 2025

I did a little more investigating ion the portals code and it seems like that one was correct after all, the queue handler was already set at initialization. I have confirmed that non-blocking fetch for all providers are now effectively, non-blocking.

@avincigu avincigu requested review from abrooks98 and removed request for philipmarshall21 March 13, 2025 15:47
TYPE tmp = 0; \
shmem_internal_atomic_fetch_nbi(ctx, (void *) source, &tmp, \
fetch, sizeof(TYPE), pe, \
SHM_INTERNAL_SUM, ITYPE); \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm uncertain why we set the op here... Did this come up while you were looking into this @avincigu?

My hunch is this should be FI_ATOMIC_READ in the OFI transport layer, so do we need something like?

#define SHM_INTERNAL_ATOMIC_READ            FI_ATOMIC_READ

I do see some funny notes about FI_ATOMIC_READ and CXI/MR_ENDPOINT... Do they preclude this?

In other words, I'm unsure if it's always correct to pass FI_SUM to an fi_fetch_atomicmsg operation when no sum is meant to be performed. Are we sure about this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does look like FI_ATOMIC_READ is not supported in CXI. Not sure what to do here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One solution that comes to mind is to rely on ENABLE_MR_ENDPOINT to support the workaround (which we do throughout the OFI transport already), this time in shmem_transport_fetch_atomic_nbi. It might look like:

#ifdef ENABLE_MR_ENDPOINT
op = FI_SUM
#endif

Have you tested FI_ATOMIC_READ lately with newer CXI software @avincigu? I can't imagine how/why FI_ATOMIC_READ is not a supported op for fi_fetch_atomicmsg... but FI_SUM is... it's strange, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @davidozog , I'm having trouble determining where these changes need to happen. What you are suggesting covers only OFI but breaks every other transport. Do I have to define a new SHM_INTERNAL_ATOMIC_READ on every transport (portals, ucx, shared, none)?

Copy link
Collaborator

@markbrown314 markbrown314 Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidozog I believe you are referencing this code:

in transport_ofi.h

static inline
void shmem_transport_atomic_fetch(shmem_transport_ctx_t* ctx, void *target,
                                  const void *source, size_t len, int pe,
                                  int datatype)
{
#ifdef ENABLE_MR_ENDPOINT
    /* CXI provider currently does not support fetch atomics with FI_DELIVERY_COMPLETE
     * That is why non-blocking API is used which uses FI_INJECT. FI_ATOMIC_READ is
     * also not supported currently */
    long long dummy = 0;
    shmem_transport_fetch_atomic_nbi(ctx, (void *) source, (const void *) &dummy,
                                     target, len, pe, FI_SUM, datatype);
#else
    shmem_transport_fetch_atomic_nbi(ctx, (void *) source, (const void *) NULL,
                                     target, len, pe, FI_ATOMIC_READ, datatype);
#endif
}

I think there should be a separate ticket to address if the FI_SUM and FI_DELIVERY_COMPLETE workarounds are still needed. (I will make a new ticket for this).

In this particular fix we are just cleaning up the problem: shmem_atomic_fetch_nbi() calls blocking fetch.

#define SHMEM_DEF_FETCH_NBI(STYPE,TYPE,ITYPE)                           \
    void SHMEM_FUNCTION_ATTRIBUTES                                      \
    SHMEM_FUNC_PROTOTYPE(STYPE, fetch_nbi, TYPE *fetch,                 \
                         const TYPE *source, int pe)                    \
        SHMEM_ERR_CHECK_INITIALIZED();                                  \
        SHMEM_ERR_CHECK_PE(pe);                                         \
        SHMEM_ERR_CHECK_CTX(ctx);                                       \
        SHMEM_ERR_CHECK_SYMMETRIC(source, sizeof(TYPE));                \
        shmem_internal_atomic_fetch(ctx, fetch, (void *) source,        \
                                    sizeof(TYPE), pe, ITYPE);           \
    }

We should fix that first.

I think the problem @avincigu is running into is that there is no function shmem_internal_atomic_fetch_nbi() like there is for blocking fetch (see shmem_internal_atomic_fetch()).

Note: There are two functions called shmem_internal_fetch_atomic_nbi() and shmem_internal_fetch_atomic() which require the op parameter (very confusing). This is very similar syntactically to the function name shmem_internal_atomic_fetch(). This should probably be refactored in the future. 😤

@avincigu you need to just implement a the function shmem_internal_atomic_fetch_nbi() which does not take in the op parameter see: shmem_internal_atomic_fetch() in shmem_comm.h and then define shmem_transport_atomic_fetch_nbi() for the various transports.

This is cleaner because it would not impose a potentially unnecessary sum operation at the higher level (read operations are likely faster than sum operations).

Hope that helps clarify. 🤔

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @markbrown314 Thank for chiming in. I did think about that, the trouble is that for some transports the fetch does require an op, check transport_portals4.h and transport_ucx.h for examples. In fact if you look at the UCX code you will notice that they are also using the ADD operation for the blocking fetch. This issue just keeps getting bigger the longer you look at it. My question is, if the code is working with the add operation, though we know it is conceptually wrong, do we really want to go through a big code rewrite? My proposal would be to file a separate issue where we would use the correct op codes for fetch, similar to what @davidozog is proposing but make sure it is developed for all transports and all fetches (blocking and non-blocking). I would say since the code is working the priority would be lower.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I missed it. But, even though the name does not imply, the implementation is non-blocking IIUC. The blocking version of the same function has an shmem_internal_get_wait() which makes sure the operation is complete before returning.
The issue with FI_SUM being used instead of FI_ATOMIC_READ is just for CXI. And, the code is weirdly handling it for now. If that issue is not existent any more, the changes related to the work-around can be reverted.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@avincigu yeah it is a bit of a lift, but I think correctness takes higher priority. I will reach out to you directly collaborate on the patch set with you.

Copy link
Member

@davidozog davidozog Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @markbrown314, @avincigu, and @wrrobin - agreed.

To summarize, shmem_internal_fetch_atomic_nbi and shmem_transport_fetch_atomic_nbi don't really need to take an op. But "fetch_atomic" in all remote transports take an op, so the transport layer should pass one that makes sense.

In Portals, see v4.3 Table 3-4. I'd go with PTL_SUM and a value of zero maybe. Or something like that..

In UCX, same idea... maybe UCP_ATOMIC_FETCH_OP_FADD a value of 0? See: the UCP docs

For OFI, probably FI_ATOMIC_READ (except for whatever CXI workaround).

(Alternatively, maybe SHM_ATOMIC_READ could be defined accordingly in each transport and passed through from the top layer, i.e. shmem_internal_fetch_atomic_nbi... but I think I prefer Mark's idea.)

@avincigu avincigu requested a review from davidozog March 19, 2025 15:39
@markbrown314
Copy link
Collaborator

Created issue #1196 to address CXI atomic support concerns

Copy link
Member

@davidozog davidozog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2afa333 doesn't look quite right to me in Portals... @avincigu doesn't this re-route the atomic to the "get" implementation? (shmem_transport_atomic_fetch does a get whereas shmem_transport_fetch_atomic does PtlFetchAtomic).

I think the two separate atomic_fetch vs. fetch_atomic paths is making me go a little cross-eyed and might lead to mistakes like the above. I'm sorry to backpedal a bit, but wouldn't passing a new opcode, SHM_INTERNAL_ATOMIC_READ (or similar) to a "unified" shmem_transport_fetch_atomic_nbi (that takes an op) be more straightforward?

@avincigu
Copy link
Collaborator Author

2afa333 doesn't look quite right to me in Portals... @avincigu doesn't this re-route the atomic to the "get" implementation? (shmem_transport_atomic_fetch does a get whereas shmem_transport_fetch_atomic does PtlFetchAtomic).

I think the two separate atomic_fetch vs. fetch_atomic paths is making me go a little cross-eyed and might lead to mistakes like the above. I'm sorry to backpedal a bit, but wouldn't passing a new opcode, SHM_INTERNAL_ATOMIC_READ (or similar) to a "unified" shmem_transport_fetch_atomic_nbi (that takes an op) be more straightforward?

I did notice that. But I feel like I keep inheriting previous issues on this fix. I'm really just calling what the blocking fetch already calls. I am not sure why the code has a get.

@markbrown314
Copy link
Collaborator

markbrown314 commented Mar 19, 2025

@davidozog correct me if I am wrong, but I believe PtlFetchAtomic() is non-blocking (along with most Portals API functions). Perhaps @avincigu can do fetch as sum of current value and zero technique here as in CXI and OFI. We can flag the atomic with get call for future investigation.

@davidozog
Copy link
Member

@davidozog correct me if I am wrong, but I believe PtlFetchAtomic() is non-blocking (along with most Portals API function).

Yeah, it's non-blocking. My concern was that this function is not being called after 2afa333, it's the PtlGet path... But I think @avincigu is saying that the "get" version was already in place for SOS non-blocking atomics? That looks to be the case...

Either way... the shmem_*_atomic_fetch vs shmem_*_fetch_atomic is difficult to follow, and 2afa333 leans into it a bit... maybe we can clean this up now is all I'm suggesting. It's also fine to defer it.

Copy link
Collaborator

@markbrown314 markbrown314 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is coming along well. Please do the following:

You need to release the operation and check status. (I missed that one when we discussed it earlier). See comment in transport_portals4.h.

  • Change the commit message to state that you addressing a bug #1066

  • And mention as a sub item the following:

    • Adding new definitions for shmem_transport_atomic_fetch_nbi
  • Squash all commits in to a single

void shmem_transport_atomic_fetch_nbi(shmem_transport_ctx_t* ctx, void *target,
const void *source, size_t len, int pe,
int datatype)
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we validate that FI_DELIVERY_COMPLETE is supported in CXI this shmem_transport_atomic_fetch may go back to being blocking. I will make a note of that in issue #1196.

Copy link
Collaborator

@markbrown314 markbrown314 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. However, this branch has diverged from main. Could you please rebase this on top of main and make sure this is just a single commit, and I will give final approval? (assuming there are no other objections).

…SHMEM#1066

Adding new definitions for shmem_transport_atomic_fetch_nbi
@avincigu
Copy link
Collaborator Author

Looks good to me. However, this branch has diverged from main. Could you please rebase this on top of main and make sure this is just a single commit, and I will give final approval? (assuming there are no other objections).

Done

@avincigu avincigu requested a review from markbrown314 March 24, 2025 17:17
@avincigu avincigu merged commit 757a33e into Sandia-OpenSHMEM:main Mar 25, 2025
36 checks passed
@avincigu avincigu deleted the nbi branch March 25, 2025 15:07
@davidozog
Copy link
Member

LGTM too. I just recommend unifying the transport_atomic_fetch vs transport_fetch_atomic paths at some point - it looks possible to do, albeit painful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants