[hipSOLVER] Intermittent POTRF test failures on gfx1152/gfx1153 #3380

@amd-callumm

Problem Description

When running the hipSOLVER gtest CI tests, between 1 and 6 of them fail. The exact number varies from run to run, but all of the affected tests share these traits:

  • They are in the checkin_lapack/POTRF*** suites (POTRF/Cholesky factorization of a matrix)
  • They use single-precision data, real or complex (double-precision data is fine)
  • They operate on 50x50 or 70x70 input matrices (smaller ones do not seem to fail intermittently)
  • They all use the potf2_kernel_small to calculate POTRF
  • They usually fail on gfx1152/gfx1153 because the max error is well above the tolerance (CPU vs GPU calculation comparison)
  • They pass consistently on a gfx1151 (Strix Halo) machine with the same OS and ROCm build

When a test fails, the max error of any matrix element tends to be around 0.2-0.3%, compared to a threshold of ~0.0006% (n = 50) to ~0.0008% (n = 70). One possible cause is floating-point imprecision, perhaps from lossy math optimizations that are applied on some GFX targets but not others. This is only a guess, though.
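For reference, the threshold in the test assertion is n * get_epsilon<T>(). The sketch below recomputes it in Python, assuming get_epsilon<float>() is the usual single-precision machine epsilon (2^-23); that assumption is consistent with the thresholds printed in the test output. The function name potrf_tolerance is made up for illustration and is not part of hipSOLVER.

```python
# Assumption: get_epsilon<float>() equals single-precision machine epsilon, 2**-23.
FLT_EPSILON = 2.0 ** -23


def potrf_tolerance(n: int, eps: float = FLT_EPSILON) -> float:
    """Return n * eps, the right-hand side of the assertion in the test output."""
    return n * eps


if __name__ == "__main__":
    # Matrix sizes from the failing test parameters in this issue.
    for n in (50, 70):
        tol = potrf_tolerance(n)
        print(f"n={n}: tolerance={tol:.6e} ({tol * 100:.5f}%)")
```

With this definition, n = 50 and n = 70 give exactly the two thresholds that appear in the failure messages.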

Examples from test output:

[ RUN      ] checkin_lapack/POTRF.batched__float/11
/therock/src/rocm-libraries/projects/hipsolver/clients/gtest/../include/testing_potrf.hpp:486: Failure
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 0.0022823080812497975 vs 8.3446502685546875e-06
[  FAILED  ] checkin_lapack/POTRF.batched__float/11, where GetParam() = ({ 70, 80 }, 'U' (85, 0x55)) (2 ms)

[ RUN      ] checkin_lapack/POTRF_FORTRAN.batched__float/8
/therock/src/rocm-libraries/projects/hipsolver/clients/gtest/../include/testing_potrf.hpp:486: Failure
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 0.0023806761494299983 vs 5.9604644775390625e-06
[  FAILED  ] checkin_lapack/POTRF_FORTRAN.batched__float/8, where GetParam() = ({ 50, 50 }, 'L' (76, 0x4C)) (2 ms)

[ RUN      ] checkin_lapack/POTRF_FORTRAN.batched__float_complex/11
/therock/src/rocm-libraries/projects/hipsolver/clients/gtest/../include/testing_potrf.hpp:486: Failure
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 0.0019181809698055414 vs 8.3446502685546875e-06
[  FAILED  ] checkin_lapack/POTRF_FORTRAN.batched__float_complex/11, where GetParam() = ({ 70, 80 }, 'U' (85, 0x55)) (3 ms)
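To see how far over tolerance each failure is, the excerpts above can be summarized with a small script. This is a rough sketch that assumes the log keeps the format shown here (an "actual: X vs Y" line followed by a "[  FAILED  ]" line); summarize_failures is a hypothetical helper, not part of the hipSOLVER test clients.

```python
import re

# Matches the assertion line, e.g. "... actual: 0.00228 vs 8.34e-06".
FAILURE_RE = re.compile(r"actual:\s*(?P<actual>\S+)\s+vs\s+(?P<limit>\S+)")
# Matches the per-test failure line and captures the test name before the comma.
NAME_RE = re.compile(r"\[\s*FAILED\s*\]\s+(?P<name>[^,\s]+),")


def summarize_failures(log: str):
    """Return (test_name, max_error, tolerance, ratio) for each failure in a gtest log."""
    results = []
    pending = None  # (max_error, tolerance) waiting for its test name
    for line in log.splitlines():
        m = FAILURE_RE.search(line)
        if m:
            pending = (float(m["actual"]), float(m["limit"]))
            continue
        m = NAME_RE.search(line)
        if m and pending is not None:
            actual, limit = pending
            results.append((m["name"], actual, limit, actual / limit))
            pending = None
    return results


if __name__ == "__main__":
    sample = """\
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 0.0022823080812497975 vs 8.3446502685546875e-06
[  FAILED  ] checkin_lapack/POTRF.batched__float/11, where GetParam() = ({ 70, 80 }, 'U' (85, 0x55)) (2 ms)
"""
    for name, err, tol, ratio in summarize_failures(sample):
        print(f"{name}: max_error={err:.3e} tolerance={tol:.3e} ratio={ratio:.0f}x")
```

For the three failures listed here, the max error is a few hundred times the tolerance, which suggests something larger than ordinary rounding noise.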

Operating System

Ubuntu 24.04 LTS

CPU

AMD Ryzen AI 5 PRO 340 (gfx1152) + AMD Ryzen AI 5 330 (gfx1153)

GPU

iGPU: Radeon 840M (gfx1152) + Radeon 820M (gfx1153)

ROCm Version

Latest/nightly source build

ROCm Component

hipSOLVER

Steps to Reproduce

  1. Build TheRock with -DTHEROCK_AMDGPU_FAMILIES=gfx115X-igpu
  2. Obtain a board with a gfx1152/gfx1153 GPU
  3. On this board, run hipSOLVER tests through the github_actions script, or directly: hipsolver-test --gtest_filter=checkin_lapack/POTRF.batched__float* (similar for POTRF_FORTRAN and POTRF_COMPAT)
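Since the failures are intermittent, repeating the run can help estimate how often they occur. The sketch below only builds the command lines; it uses GoogleTest's standard --gtest_repeat flag and assumes the filter pattern from step 3 carries over unchanged to the POTRF_FORTRAN and POTRF_COMPAT suites. repro_command and the repeat count of 20 are illustrative choices, not part of the CI scripts.

```python
import shlex


def repro_command(suite="POTRF", repeat=20, binary="hipsolver-test"):
    """Build an argument list that repeats one suite's single-precision batched tests."""
    # Assumption: the other suites use the same "<suite>.batched__float*" naming.
    gtest_filter = f"checkin_lapack/{suite}.batched__float*"
    return [binary, f"--gtest_filter={gtest_filter}", f"--gtest_repeat={repeat}"]


if __name__ == "__main__":
    for suite in ("POTRF", "POTRF_FORTRAN", "POTRF_COMPAT"):
        # shlex.join quotes the '*' so the line can be pasted into a shell.
        print(shlex.join(repro_command(suite)))
```

The returned list can also be passed directly to subprocess.run on the target board.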

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
