v5.0.x: Fix Libfabric MR caching issues #13327 #13332

Open: 4 commits to merge into base v5.0.x


bwbarrett (Member)

Backport of #13327.

This was an optimization around a bug in the EFA provider.  The EFA
provider shouldn't be caching explicit registrations anyway, so
avoiding the double cache is silly (and breaks when EFA fixes the
explicit registration cache bug).

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
(cherry picked from commit c3e22c6)

The OFI MTL exports a memory monitor to Libfabric (so that OMPI's
patcher wins), but in cases where OB1 is directly selected, that
code won't run.  So make sure to also configure Libfabric so that
it won't try to use a suboptimal memory monitor in the case that
only the OFI BTL is used.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
(cherry picked from commit 937429f)

Rather than use the CXI provider name to disable explicit hmem
registration, use the FI_MR_HMEM flag.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
(cherry picked from commit edf3634)
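
For illustration, the capability-based check described in this commit could look roughly like the sketch below. It is not the code from the commit: `sketch_domain_attr`, `SKETCH_MR_HMEM`, and `sketch_needs_hmem_registration` are hypothetical stand-ins so the example compiles without Libfabric headers. In real code the bit would be `FI_MR_HMEM`, tested against the `mr_mode` field of the provider's domain attributes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the FI_MR_HMEM mode bit. The real constant
 * comes from the Libfabric headers; this value is only a placeholder. */
#define SKETCH_MR_HMEM (UINT64_C(1) << 10)

/* Hypothetical, trimmed-down stand-in for the provider's domain attributes. */
struct sketch_domain_attr {
    const char *prov_name; /* kept only to show that it is NOT consulted */
    uint64_t mr_mode;      /* mode bits advertised by the provider */
};

/* Decide from the advertised mode bits, not from the provider's name:
 * explicit HMEM registration is needed only when the bit is set. */
static bool sketch_needs_hmem_registration(const struct sketch_domain_attr *attr)
{
    return (attr->mr_mode & SKETCH_MR_HMEM) != 0;
}
```

With this shape, a provider that does not set the bit skips explicit registration without the caller ever comparing provider name strings, and any other provider with the same behavior gets the same treatment.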

The OFI MTL was creating a registration for every operation that used
HMEM when FI_MR_HMEM is required.  This is very inefficient, since
creating registrations is expensive.  So stick an rcache in front of
the registrations.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
(cherry picked from commit 72a7e0e)
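
As a rough picture of what "an rcache in front of the registrations" buys, here is a minimal sketch of a registration cache. It is not OMPI's rcache and not the code from this commit: every name (`sketch_rcache`, `sketch_rcache_get`, `sketch_expensive_register`) is hypothetical, the "registration" is just a counter, and eviction is plain round-robin.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHE_SLOTS 8

/* Hypothetical cached registration covering [addr, addr + len). */
struct sketch_reg {
    uintptr_t addr;
    size_t len;
    uint64_t key;  /* stands in for a provider MR handle/key */
    int valid;
};

struct sketch_rcache {
    struct sketch_reg slots[SKETCH_CACHE_SLOTS];
    unsigned next_victim;        /* round-robin eviction cursor */
    unsigned long registrations; /* how many expensive calls were made */
};

/* Hypothetical stand-in for the expensive provider registration call. */
static uint64_t sketch_expensive_register(struct sketch_rcache *c,
                                          uintptr_t addr, size_t len)
{
    (void)addr;
    (void)len;
    return ++c->registrations; /* use the call count as a fake key */
}

/* Return a registration covering the buffer, registering only on a miss. */
static const struct sketch_reg *sketch_rcache_get(struct sketch_rcache *c,
                                                  const void *buf, size_t len)
{
    uintptr_t start = (uintptr_t)buf;

    for (unsigned i = 0; i < SKETCH_CACHE_SLOTS; i++) {
        const struct sketch_reg *r = &c->slots[i];
        /* Hit: the requested range lies entirely inside a cached one. */
        if (r->valid && start >= r->addr &&
            start - r->addr <= r->len &&
            len <= r->len - (start - r->addr)) {
            return r;
        }
    }

    /* Miss: overwrite the next slot. A real cache would deregister
     * the evicted entry with the provider before reusing the slot. */
    struct sketch_reg *victim = &c->slots[c->next_victim];
    c->next_victim = (c->next_victim + 1) % SKETCH_CACHE_SLOTS;
    victim->addr = start;
    victim->len = len;
    victim->key = sketch_expensive_register(c, start, len);
    victim->valid = 1;
    return victim;
}
```

Repeated operations on the same buffer, or on a sub-range of it, reuse the cached entry, so the expensive call happens once per buffer rather than once per operation. A real cache also has to invalidate entries when the underlying memory is freed or unmapped, which this sketch leaves out.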

bwbarrett requested review from hppritcha and a team on July 14, 2025 at 20:02
github-actions bot added this to the v5.0.8 milestone on July 14, 2025