[hipSOLVER] Intermittent POTRF test failures on gfx1152/gfx1153 #3380

### Problem Description

When running the hipSOLVER gtest CI tests, roughly 1-6 of them fail. The exact number varies from run to run, but all of the affected tests:
- Are in the checkin_lapack/POTRF*** suites (POTRF, i.e. Cholesky factorization of a matrix)
- Operate on single-precision data, both `float` and `float_complex` (double-precision data is fine)
- Operate on input matrices of size 50x50 or 70x70 (smaller ones do not seem to show intermittent failures)
- Use the [potf2_kernel_small](https://github.com/ROCm/rocm-libraries/blob/f4cfa91fc0807ee1733c52d70ea4770ab5264575/projects/rocsolver/library/src/specialized/roclapack_potf2_specialized_kernels.hpp#L279) to calculate POTRF
- Usually fail on gfx1152/gfx1153 because the max error is well above the tolerance (CPU vs GPU calculation comparison)
- Pass consistently on a gfx1151 (Strix Halo) machine with the same OS and ROCm build

When a test fails, the max error of any matrix element tends to be around 0.2-0.3%, against a threshold of roughly 0.0006-0.0008% (`n * get_epsilon<T>()`, depending on the matrix size). One possible cause is floating-point imprecision, perhaps from lossy math optimizations that are enabled on some GFX targets but not others. This is only a guess, though.

Examples from test output:
```
[ RUN      ] checkin_lapack/POTRF.batched__float/11
/therock/src/rocm-libraries/projects/hipsolver/clients/gtest/../include/testing_potrf.hpp:486: Failure
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 0.0022823080812497975 vs 8.3446502685546875e-06
[  FAILED  ] checkin_lapack/POTRF.batched__float/11, where GetParam() = ({ 70, 80 }, 'U' (85, 0x55)) (2 ms)

[ RUN      ] checkin_lapack/POTRF_FORTRAN.batched__float/8
/therock/src/rocm-libraries/projects/hipsolver/clients/gtest/../include/testing_potrf.hpp:486: Failure
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 0.0023806761494299983 vs 5.9604644775390625e-06
[  FAILED  ] checkin_lapack/POTRF_FORTRAN.batched__float/8, where GetParam() = ({ 50, 50 }, 'L' (76, 0x4C)) (2 ms)

[ RUN      ] checkin_lapack/POTRF_FORTRAN.batched__float_complex/11
/therock/src/rocm-libraries/projects/hipsolver/clients/gtest/../include/testing_potrf.hpp:486: Failure
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 0.0019181809698055414 vs 8.3446502685546875e-06
[  FAILED  ] checkin_lapack/POTRF_FORTRAN.batched__float_complex/11, where GetParam() = ({ 70, 80 }, 'U' (85, 0x55)) (3 ms)
```
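The figures in the log above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes `get_epsilon<float>()` returns the float32 machine epsilon `2**-23` (which is consistent with the thresholds printed in the log), and the helper names `tolerance` and `excess_factor` are made up for this example.

```python
# Assumption: get_epsilon<float>() is the float32 machine epsilon, 2**-23.
# The thresholds printed in the log (n * eps) are consistent with this value.
FLOAT32_EPS = 2.0 ** -23


def tolerance(n: int) -> float:
    """Threshold used by the test assertion: n * epsilon."""
    return n * FLOAT32_EPS


def excess_factor(n: int, max_error: float) -> float:
    """How many times the observed max error exceeds the tolerance."""
    return max_error / tolerance(n)


# (n, max_error) pairs copied from the three failures in the log above.
observed = [
    (70, 0.0022823080812497975),
    (50, 0.0023806761494299983),
    (70, 0.0019181809698055414),
]

for n, err in observed:
    print(f"n={n}: tolerance={tolerance(n):.3e}, "
          f"max_error={err:.3e}, ratio={excess_factor(n, err):.0f}x")
```

For these three samples the error exceeds the tolerance by a factor of a few hundred, which is far more than ordinary rounding noise would be expected to produce.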

### Operating System

Ubuntu 24.04 LTS

### CPU

AMD Ryzen AI 5 PRO 340 (gfx1152) + AMD Ryzen AI 5 330 (gfx1153)

### GPU

iGPU: Radeon 840M (gfx1152) + Radeon 820M (gfx1153)

### ROCm Version

Latest/nightly source build

### ROCm Component

hipSOLVER

### Steps to Reproduce

1. Build TheRock with `-DTHEROCK_AMDGPU_FAMILIES=gfx115X-igpu`
2. Obtain a board with a gfx1152/gfx1153 GPU
3. On this board, run hipSOLVER tests through the [github_actions script](https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/test_executable_scripts/test_hipsolver.py), or directly: `hipsolver-test --gtest_filter=checkin_lapack/POTRF.batched__float*` (similar for `POTRF_FORTRAN` and `POTRF_COMPAT`)
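Because the failures are intermittent, it can help to run the filtered tests several times and tally which ones fail. The following is a rough sketch, not part of the CI scripts: the function names `failed_tests` and `tally_failures` are hypothetical, and it assumes `hipsolver-test` is on `PATH` and prints standard gtest `[  FAILED  ]` lines like those shown in the log.

```python
import re
import subprocess
from collections import Counter

# Matches gtest failure lines such as
# "[  FAILED  ] checkin_lapack/POTRF.batched__float/11, where GetParam() = ..."
# but not the summary line "[  FAILED  ] 3 tests, listed below:".
FAILED_RE = re.compile(r"^\[  FAILED  \] ([A-Za-z_][^\s,]*)", re.MULTILINE)


def failed_tests(gtest_output: str) -> set[str]:
    """Return the unique names of failed tests found in gtest output."""
    return set(FAILED_RE.findall(gtest_output))


def tally_failures(binary: str, gtest_filter: str, runs: int = 20) -> Counter:
    """Run the test binary `runs` times and count failures per test name."""
    counts: Counter = Counter()
    for _ in range(runs):
        result = subprocess.run(
            [binary, f"--gtest_filter={gtest_filter}"],
            capture_output=True,
            text=True,
            check=False,  # a failing test run exits non-zero; that is expected
        )
        counts.update(failed_tests(result.stdout))
    return counts
```

For example, `tally_failures("hipsolver-test", "checkin_lapack/POTRF.batched__float*", runs=20)` would return a counter mapping each failing test name to the number of runs in which it failed.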

### (Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

_No response_

### Additional Information

_No response_
