-
Notifications
You must be signed in to change notification settings - Fork 65
CI: Various Improvements to CI #1209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CI: Various Improvements to CI #1209
Conversation
markbrown314
commented
Jun 24, 2025
- XPMEM build failure due to use of deprecated paging macros XPMEM incompatible with latest Ubuntu 24.04 Kernel #1206
- pud_large() and pmd_large() are not supported in newer Linux kernels
- Stubbed out pud_large() and pmd_large() referenced in xpmem_vaddr_to_pte_offset() via CI build script
- xpmem_vaddr_to_pte_offset() not used by SOS, so tests are not impacted
- Added timeout for SOS tests
- Added core dump artifact uploading and logging support
- Updated OFI testing to version 2.1.x
- Disabled failing Portals 4 tests Portals 4 Tests Fail on Ubuntu 24.04 #1208
exit 2 | ||
fi | ||
|
||
sudo bash -c 'echo '"${CORE_DIR}"'/%E.%p.core > /proc/sys/kernel/core_pattern' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking - Any issues with sudo being used in the CI script?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sudo is required to modify sysfs. Sudo is okay in github runners
.github/scripts/scan_core.sh
Outdated
echo "notice: no core dump files found" | ||
exit 0 | ||
fi | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this check be done before the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no difference the loop will just break out if core_list is an empty string. Good idea! Moving it up looks cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your call. I would move it to make it clear.
- config_name: Without Non-fetch AMO | ||
sos_config: --disable-nonfetch-amo --enable-pmi-simple | ||
libfabric_version: v1.13.x | ||
libfabric_version: v2.1.x |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can a github variable or secret be used for version info instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps. I do plan to refactor this to be array based and if there is a way we configure this externally as an override. That would be good.
.github/workflows/ci.yml
Outdated
SHMEM_DEBUG=1 SHMEM_INFO=1 SHMEM_OFI_PROVIDER=sockets make VERBOSE=1 TEST_RUNNER="${SOS_PM} -np 2" check | ||
ulimit -c unlimited | ||
ulimit -a | ||
SHMEM_DEBUG=1 SHMEM_INFO=1 SHMEM_OFI_PROVIDER=sockets make VERBOSE=1 TEST_RUNNER="${SOS_PM} -np 2 timeout --signal=ABRT 15m" check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to add FI_PROVIDER=sockets also.
- name: Install dependencies | ||
run: | | ||
sudo apt-get install -y gfortran rpm mpich libmpich-dev libhwloc-dev | ||
sudo apt-get install -y gfortran rpm mpich libmpich-dev libhwloc-dev gdb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment: Given the rampant use of sudo in this file, I guess it is assumed that CI has privs. Can ignore my first comment in that case.
6654e7d
to
6f23dee
Compare
There was a failure in the XPMEM_Only test with mt_membar. It is not caused by this change it is being monitored here #1191. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
.github/workflows/ci.yml
Outdated
sudo apt-get install linux-headers-`uname -r` | ||
./autogen.sh | ||
sed -i 's/^#define pmd_is_huge(p) pmd_large(p)/#define pmd_is_huge(p) (0)/' kernel/xpmem_pfn.c | ||
sed -i 's/^#define pud_is_huge(p) pud_large(p)/#define pud_is_huge(p) (0)/' kernel/xpmem_pfn.c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I love this... have you considered using github.com:openucx/xpmem? It looks like they've forked and seem to do better w.r.t. pmd_is_huge
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will check it out. The function that references these macros is not used by SOS, but it would be nice to use a better maintained xpmem.
6f23dee
to
feaa6a5
Compare
* XPMEM build failure due to use of deprecated paging macros Sandia-OpenSHMEM#1206 * pud_large() and pmd_large() are not supported in newer Linux kernels * Moved to OpenUCX version of XPMEM which is better supported * Added timeout for SOS tests * Added core dump artifact uploading and logging support * Updated OFI testing to version 2.1.x * Disabled failing Portals 4 tests Sandia-OpenSHMEM#1208 Signed-off-by: Mark F. Brown <mark.f.brown@intel.com>
feaa6a5
to
97c6627
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed commit message and switch to using OpenUCX XPMEM fork