Problem Description
When running the hipSOLVER gtest CI tests, roughly 1-6 tests fail intermittently. The exact number varies from run to run, but all of the affected tests share the following traits:
- They are in the checkin_lapack/POTRF*** suites (POTRF, i.e. Cholesky factorization of a matrix)
- They operate on single-precision data, both float and float_complex (double-precision data is fine)
- They operate on input matrices of size 50x50 or 70x70 (smaller ones do not seem to fail intermittently)
- They all use potf2_kernel_small to compute POTRF
- They usually fail on gfx1152/gfx1153, with a max error well above the tolerance (CPU vs GPU result comparison)
- They pass consistently on a gfx1151 (Strix Halo) machine with the same OS and ROCm build
When a test fails, the max error of any matrix element tends to be around 0.2-0.3%, compared to a tolerance of n*get_epsilon<T>(), which is roughly 0.0006-0.0008% for these sizes. One possible cause is floating-point imprecision, perhaps from lossy math optimizations that are enabled on some GFX targets but not others. This is only a guess, though.
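To make the threshold concrete, here is a minimal sketch of the tolerance arithmetic. It assumes that get_epsilon<T>() for float is the usual single-precision machine epsilon (2^-23); that assumption is consistent with the n*get_epsilon<T>() values printed in the test output, but it has not been checked against the hipSOLVER client source. The helper names are hypothetical.

```python
# Sketch: reproduce the bound the failing assertion compares max_error against.
# Assumption: get_epsilon<float>() equals the float32 machine epsilon, 2**-23.
FLOAT32_EPS = 2.0 ** -23


def potrf_tolerance(n: int, eps: float = FLOAT32_EPS) -> float:
    """Return n * eps, the right-hand side of the gtest assertion."""
    return n * eps


def as_percent(x: float) -> float:
    """Express a fraction as a percentage, as done in the description."""
    return x * 100.0


if __name__ == "__main__":
    for n in (50, 70):
        tol = potrf_tolerance(n)
        print(f"n={n}: tolerance={tol:.6e} ({as_percent(tol):.5f}%)")
```

Under this assumption the bound grows linearly with n, so the 70x70 cases get a slightly looser tolerance than the 50x50 ones, yet both are still far below the observed errors.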
Examples from test output:
[ RUN ] checkin_lapack/POTRF.batched__float/11
/therock/src/rocm-libraries/projects/hipsolver/clients/gtest/../include/testing_potrf.hpp:486: Failure
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 0.0022823080812497975 vs 8.3446502685546875e-06
[ FAILED ] checkin_lapack/POTRF.batched__float/11, where GetParam() = ({ 70, 80 }, 'U' (85, 0x55)) (2 ms)
[ RUN ] checkin_lapack/POTRF_FORTRAN.batched__float/8
/therock/src/rocm-libraries/projects/hipsolver/clients/gtest/../include/testing_potrf.hpp:486: Failure
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 0.0023806761494299983 vs 5.9604644775390625e-06
[ FAILED ] checkin_lapack/POTRF_FORTRAN.batched__float/8, where GetParam() = ({ 50, 50 }, 'L' (76, 0x4C)) (2 ms)
[ RUN ] checkin_lapack/POTRF_FORTRAN.batched__float_complex/11
/therock/src/rocm-libraries/projects/hipsolver/clients/gtest/../include/testing_potrf.hpp:486: Failure
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 0.0019181809698055414 vs 8.3446502685546875e-06
[ FAILED ] checkin_lapack/POTRF_FORTRAN.batched__float_complex/11, where GetParam() = ({ 70, 80 }, 'U' (85, 0x55)) (3 ms)
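Since the number of failures changes from run to run, it can help to summarize each log automatically. The following is a rough sketch, not part of the hipSOLVER tooling, that pulls the failing test name, the observed max error, and the tolerance out of output shaped like the excerpt above, and reports how far each error overshoots the tolerance. The function name and regular expressions are hypothetical and only matched against the three failures quoted here.

```python
import re

# Matches the assertion line: "... actual: <max_error> vs <tolerance>"
ACTUAL_RE = re.compile(r"actual:\s*([0-9.eE+-]+)\s+vs\s+([0-9.eE+-]+)")
# Matches the summary line: "[ FAILED ] <test name>, where GetParam() = ..."
FAILED_RE = re.compile(r"\[\s*FAILED\s*\]\s+(\S+?),")


def summarize_failures(log: str):
    """Return (test_name, max_error, tolerance, ratio) for each failure."""
    results = []
    pending = None  # (max_error, tolerance) waiting for its FAILED line
    for line in log.splitlines():
        m = ACTUAL_RE.search(line)
        if m:
            pending = (float(m.group(1)), float(m.group(2)))
            continue
        m = FAILED_RE.search(line)
        if m and pending is not None:
            err, tol = pending
            results.append((m.group(1), err, tol, err / tol))
            pending = None
    return results


SAMPLE_LOG = """\
[ RUN ] checkin_lapack/POTRF.batched__float/11
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 0.0022823080812497975 vs 8.3446502685546875e-06
[ FAILED ] checkin_lapack/POTRF.batched__float/11, where GetParam() = ({ 70, 80 }, 'U' (85, 0x55)) (2 ms)
[ RUN ] checkin_lapack/POTRF_FORTRAN.batched__float/8
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 0.0023806761494299983 vs 5.9604644775390625e-06
[ FAILED ] checkin_lapack/POTRF_FORTRAN.batched__float/8, where GetParam() = ({ 50, 50 }, 'L' (76, 0x4C)) (2 ms)
[ RUN ] checkin_lapack/POTRF_FORTRAN.batched__float_complex/11
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 0.0019181809698055414 vs 8.3446502685546875e-06
[ FAILED ] checkin_lapack/POTRF_FORTRAN.batched__float_complex/11, where GetParam() = ({ 70, 80 }, 'U' (85, 0x55)) (3 ms)
"""

if __name__ == "__main__":
    for name, err, tol, ratio in summarize_failures(SAMPLE_LOG):
        print(f"{name}: max_error={err:.3e} tol={tol:.3e} ratio={ratio:.0f}x")
```

For the three failures quoted above, the max error exceeds the tolerance by a factor of a few hundred, which is much larger than ordinary rounding noise would normally explain.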
Operating System
Ubuntu 24.04 LTS
CPU
AMD Ryzen AI 5 PRO 340 (gfx1152) + AMD Ryzen AI 5 330 (gfx1153)
GPU
iGPU: Radeon 840M (gfx1152) + Radeon 820M (gfx1153)
ROCm Version
Latest/nightly source build
ROCm Component
hipSOLVER
Steps to Reproduce
- Build TheRock with -DTHEROCK_AMDGPU_FAMILIES=gfx115X-igpu
- Obtain a board with a gfx1152/gfx1153 GPU
- On this board, run the hipSOLVER tests through the github_actions script, or directly:
  hipsolver-test --gtest_filter=checkin_lapack/POTRF.batched__float* (and similarly for POTRF_FORTRAN and POTRF_COMPAT)
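Because the failures are intermittent, a single pass may not trigger them. Below is a small sketch of a repeat-run wrapper around the command above. It assumes hipsolver-test is on PATH and relies on GoogleTest's standard --gtest_repeat flag; the helper names and the repeat count of 20 are arbitrary choices for illustration. Passing the filter as an argument list also keeps the shell from expanding the trailing `*`.

```python
import shutil
import subprocess

# The three suites named in the steps above.
SUITES = ("POTRF", "POTRF_FORTRAN", "POTRF_COMPAT")


def build_command(suite: str, repeat: int = 20, binary: str = "hipsolver-test"):
    """Build the argv list for one suite, repeated `repeat` times."""
    return [
        binary,
        f"--gtest_filter=checkin_lapack/{suite}.batched__float*",
        f"--gtest_repeat={repeat}",
    ]


def main() -> None:
    for suite in SUITES:
        cmd = build_command(suite)
        if shutil.which(cmd[0]) is None:
            # Assumption: the test binary is on PATH; if not, just show the command.
            print("binary not found, would run:", " ".join(cmd))
            continue
        proc = subprocess.run(cmd, capture_output=True, text=True)
        # gtest exits with a nonzero status if any test failed in any iteration.
        print(f"{suite}: exit code {proc.returncode}")


if __name__ == "__main__":
    main()
```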
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response