Conversation

markbrown314
Collaborator

  • XPMEM build failure due to use of deprecated paging macros (issue: XPMEM incompatible with latest Ubuntu 24.04 Kernel #1206)
    • pud_large() and pmd_large() are not supported in newer Linux kernels
    • Stubbed out pud_large() and pmd_large() referenced in xpmem_vaddr_to_pte_offset() via CI build script
    • xpmem_vaddr_to_pte_offset() not used by SOS, so tests are not impacted
  • Added timeout for SOS tests
  • Added core dump artifact uploading and logging support
  • Updated OFI testing to version 2.1.x
  • Disabled failing Portals 4 tests (issue: Portals 4 Tests Fail on Ubuntu 24.04 #1208)

exit 2
fi

sudo bash -c 'echo '"${CORE_DIR}"'/%E.%p.core > /proc/sys/kernel/core_pattern'
Collaborator

Just checking - Any issues with sudo being used in the CI script?

Collaborator Author

sudo is required to modify /proc/sys/kernel/core_pattern (a procfs file that only root can write). Sudo is okay on GitHub runners.
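
For context on the pattern in the snippet above: per the Linux core(5) man page, `%E` expands to the executable's path with each `/` replaced by `!`, and `%p` expands to the PID of the dumping process. The following is a minimal sketch of how a script might build that pattern and map a core file name back to its executable. The function names and the `/tmp/cores` directory are hypothetical and do not come from the PR.

```shell
# Hypothetical helpers; the names are illustrative, not from the PR.

# Print the core_pattern value for a given dump directory.
core_pattern_for() {
    printf '%s/%%E.%%p.core\n' "$1"
}

# Recover the executable path from a core file named by that pattern.
# %E stores the path with '/' replaced by '!', so reverse that here.
exe_from_core() {
    name=$(basename "$1")
    name=${name%.core}   # drop the ".core" suffix
    name=${name%.*}      # drop the ".<pid>" suffix
    printf '%s\n' "$name" | tr '!' '/'
}

core_pattern_for /tmp/cores
exe_from_core '/tmp/cores/!usr!bin!hello.1234.core'
```

Knowing the executable path is what lets a later step hand both the binary and the core file to a debugger.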

echo "notice: no core dump files found"
exit 0
fi

Collaborator

Should this check be done before the loop?

Collaborator Author

@markbrown314 markbrown314 Jun 25, 2025

There is no functional difference: the loop simply does nothing if core_list is an empty string. Still, good idea! Moving it up looks cleaner.

Collaborator

Your call. I would move it to make it clear.
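
The ordering discussed in this thread can be sketched as follows: collect the core files, handle the empty case up front, and only then loop. This is an illustrative sketch, not the PR's script. `core_list` mirrors the name used in the thread, while `report_cores` and the `*.core` glob are assumptions based on the core pattern shown earlier.

```shell
# Hypothetical sketch; report_cores is an illustrative name.
report_cores() {
    core_dir=$1
    core_list=$(find "$core_dir" -maxdepth 1 -type f -name '*.core' 2>/dev/null | sort)

    # Handle the empty case before the loop so the intent is obvious.
    if [ -z "$core_list" ]; then
        echo "notice: no core dump files found"
        return 0
    fi

    printf '%s\n' "$core_list" | while IFS= read -r core; do
        echo "core dump: $core"
        # A real script could invoke a debugger here to log a backtrace.
    done
}
```

A CI step could then call `report_cores "${CORE_DIR}"` after the tests finish.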

 - config_name: Without Non-fetch AMO
 sos_config: --disable-nonfetch-amo --enable-pmi-simple
-libfabric_version: v1.13.x
+libfabric_version: v2.1.x
Collaborator

Can a GitHub variable or secret be used for the version info instead?

Collaborator Author

Perhaps. I do plan to refactor this to be array-based, and if there is a way to configure this externally as an override, that would be good.
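
One way such an external override could look on the script side is an environment variable with a pinned default. This is only a sketch under assumptions: `LIBFABRIC_VERSION_OVERRIDE` and `resolve_libfabric_version` are hypothetical names, and the workflow would still need to map a repository variable onto that environment variable.

```shell
# Hypothetical: LIBFABRIC_VERSION_OVERRIDE is an illustrative name, not
# a variable defined by the PR or by GitHub.
resolve_libfabric_version() {
    # Fall back to the version pinned in this PR when no override is set.
    printf '%s\n' "${LIBFABRIC_VERSION_OVERRIDE:-v2.1.x}"
}

resolve_libfabric_version
```

Keeping the default in the script means the CI keeps working unchanged when no override is configured.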

-SHMEM_DEBUG=1 SHMEM_INFO=1 SHMEM_OFI_PROVIDER=sockets make VERBOSE=1 TEST_RUNNER="${SOS_PM} -np 2" check
+ulimit -c unlimited
+ulimit -a
+SHMEM_DEBUG=1 SHMEM_INFO=1 SHMEM_OFI_PROVIDER=sockets make VERBOSE=1 TEST_RUNNER="${SOS_PM} -np 2 timeout --signal=ABRT 15m" check
Collaborator

You might want to add FI_PROVIDER=sockets also.
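
The `timeout --signal=ABRT 15m` wrapper in the diff above sends SIGABRT when the limit expires, which gives a hung test the chance to dump core (subject to `ulimit -c`). With GNU coreutils, `timeout` itself exits with status 124 in that case. The following minimal sketch shows how a script might surface that condition; the function name is hypothetical and GNU `timeout` is assumed.

```shell
# Hypothetical wrapper; the function name is illustrative.
# Assumes GNU coreutils `timeout`.
run_with_abort_timeout() {
    limit=$1
    shift
    timeout --signal=ABRT "$limit" "$@"
    status=$?
    # GNU timeout reports an expired limit with exit status 124.
    if [ "$status" -eq 124 ]; then
        echo "error: '$*' exceeded ${limit} and was sent SIGABRT" >&2
    fi
    return "$status"
}
```

For example, `run_with_abort_timeout 15m ./some_test` would behave like the bare command unless the limit is hit.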

 - name: Install dependencies
 run: |
-sudo apt-get install -y gfortran rpm mpich libmpich-dev libhwloc-dev
+sudo apt-get install -y gfortran rpm mpich libmpich-dev libhwloc-dev gdb
Collaborator

Comment: Given the rampant use of sudo in this file, I guess it is assumed that CI has the necessary privileges. You can ignore my first comment in that case.

@markbrown314
Collaborator Author

There was a failure in the XPMEM_Only test with mt_membar. It is not caused by this change; it is being tracked in #1191.

Collaborator

@bcmIntc bcmIntc left a comment

Looks good

sudo apt-get install "linux-headers-$(uname -r)"
./autogen.sh
sed -i 's/^#define pmd_is_huge(p) pmd_large(p)/#define pmd_is_huge(p) (0)/' kernel/xpmem_pfn.c
sed -i 's/^#define pud_is_huge(p) pud_large(p)/#define pud_is_huge(p) (0)/' kernel/xpmem_pfn.c
Member

I'm not sure I love this... have you considered using github.com:openucx/xpmem? It looks like they've forked and seem to do better w.r.t. pmd_is_huge...

Collaborator Author

@markbrown314 markbrown314 Jun 25, 2025

I will check it out. The function that references these macros is not used by SOS, but it would be nice to use a better-maintained XPMEM.
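
One fragility of the `sed` stubs discussed in this thread is that `sed -i` succeeds silently when its pattern no longer matches, for instance after an upstream change to the macro definition. The sketch below guards the substitution so the build fails loudly instead. It is illustrative only: `stub_macro` is a hypothetical name, GNU `sed` is assumed as in the CI snippet, and this PR ultimately dropped the stubs in favor of the OpenUCX fork.

```shell
# Hypothetical guard; the function name is illustrative.
# Assumes GNU sed (for -i), as in the CI snippet above.
stub_macro() {
    file=$1 macro=$2 old=$3
    line="#define ${macro}(p) ${old}(p)"
    # grep -F matches the text literally; -x requires a whole-line match.
    if ! grep -qxF "$line" "$file"; then
        echo "error: '${line}' not found in ${file}" >&2
        return 1
    fi
    # Parentheses are literal in sed's basic regular expressions.
    sed -i "s/^#define ${macro}(p) ${old}(p)\$/#define ${macro}(p) (0)/" "$file"
}
```

A call such as `stub_macro kernel/xpmem_pfn.c pmd_is_huge pmd_large` would then either apply the stub or stop the job with a clear message.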

* XPMEM build failure due to use of deprecated paging macros Sandia-OpenSHMEM#1206
  * pud_large() and pmd_large() are not supported in newer Linux kernels
  * Moved to OpenUCX version of XPMEM which is better supported
* Added timeout for SOS tests
* Added core dump artifact uploading and logging support
* Updated OFI testing to version 2.1.x
* Disabled failing Portals 4 tests Sandia-OpenSHMEM#1208

Signed-off-by: Mark F. Brown <mark.f.brown@intel.com>
Collaborator Author

@markbrown314 markbrown314 left a comment

Fixed the commit message and switched to using the OpenUCX XPMEM fork.

@markbrown314 markbrown314 merged commit 26dc340 into Sandia-OpenSHMEM:main Jul 1, 2025
31 checks passed
4 participants