xshseqr and xdhseqr fail with FPE if run in parallel #69

@drhpc

In current master, two tests fail with SIGFPE when run in parallel (mpiexec -n 2):

69/70 Testing: xshseqr
69/70 Test: xshseqr
Command: "/sw/env/gcc-10.3.0/openmpi/4.1.1/bin/mpiexec" "-n" "2" "./xshseqr"
Directory: /home/rrztest/src/scalapack/TESTING
"xshseqr" start time: Jul 25 20:04 CEST
Output:
----------------------------------------------------------

 ScaLAPACK Test for PSHSEQR

 epsilon   =    5.96046448E-08
 threshold =    30.0000000    

 Residual and Orthogonality Residual computed by:

 Residual      =  || T - Q^T*A*Q ||_F / ( ||A||_F * eps * sqrt(N) )

 Orthogonality =  MAX( || I - Q^T*Q ||_F, || I - Q*Q^T ||_F ) /  (eps * N)

 Test passes if both residuals are less then threshold

    N  NB    P    Q  QR Time  CHECK
----- --- ---- ---- -------- ------

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x151fa27c93ff in ???
#1  0x151fa455124f in pstrord_
        at /home/rrztest/src/scalapack/SRC/pstrord.f:1087
#2  0x151fa457a300 in pslaqr3_
        at /home/rrztest/src/scalapack/SRC/pslaqr3.f:880
#3  0x151fa4565178 in pslaqr0_
        at /home/rrztest/src/scalapack/SRC/pslaqr0.f:598
#4  0x151fa456209d in pshseqr_
        at /home/rrztest/src/scalapack/SRC/pshseqr.f:441
#5  0x4036cf in pshseqrdriver
        at /home/rrztest/src/scalapack/TESTING/EIG/pshseqrdriver.f:413
#6  0x404427 in main
        at /home/rrztest/src/scalapack/TESTING/EIG/pshseqrdriver.f:565
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node node002 exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------
<end of output>
Test time =   2.91 sec
----------------------------------------------------------
Test Failed.
"xshseqr" end time: Jul 25 20:04 CEST
"xshseqr" time elapsed: 00:00:02
----------------------------------------------------------

70/70 Testing: xdhseqr
70/70 Test: xdhseqr
Command: "/sw/env/gcc-10.3.0/openmpi/4.1.1/bin/mpiexec" "-n" "2" "./xdhseqr"
Directory: /home/rrztest/src/scalapack/TESTING
"xdhseqr" start time: Jul 25 20:04 CEST
Output:
----------------------------------------------------------

 ScaLAPACK Test for PDHSEQR

 epsilon   =    1.1102230246251565E-016
 threshold =    30.000000000000000     

 Residual and Orthogonality Residual computed by:

 Residual      =  || T - Q^T*A*Q ||_F / ( ||A||_F * eps * sqrt(N) )

 Orthogonality =  MAX( || I - Q^T*Q ||_F, || I - Q*Q^T ||_F ) /  (eps * N)

 Test passes if both residuals are less then threshold

    N  NB    P    Q  QR Time  CHECK
----- --- ---- ---- -------- ------

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x1488be0113ff in ???
#1  0x1488bff4ebae in pdtrord_
        at /home/rrztest/src/scalapack/SRC/pdtrord.f:1087
#2  0x1488bff77f2f in pdlaqr3_
        at /home/rrztest/src/scalapack/SRC/pdlaqr3.f:878
#3  0x1488bff62d2b in pdlaqr0_
        at /home/rrztest/src/scalapack/SRC/pdlaqr0.f:598
#4  0x1488bff5fc1d in pdhseqr_
        at /home/rrztest/src/scalapack/SRC/pdhseqr.f:441
#5  0x4036e2 in pdhseqrdriver
        at /home/rrztest/src/scalapack/TESTING/EIG/pdhseqrdriver.f:412
#6  0x404445 in main
        at /home/rrztest/src/scalapack/TESTING/EIG/pdhseqrdriver.f:564
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node node002 exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------
<end of output>
Test time =   2.70 sec
----------------------------------------------------------
Test Failed.
"xdhseqr" end time: Jul 25 20:04 CEST
"xdhseqr" time elapsed: 00:00:02
----------------------------------------------------------

End testing: Jul 25 20:04 CEST

Both tests pass with -n 1. I tested on two machines with different compilers and MPI versions (4.1.1 and 1.10.7).
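For anyone who wants to compare the two cases side by side, here is a minimal sketch of a reproduction loop. It only wraps the mpiexec commands shown in the log above. The TESTDIR and MPIEXEC variables and the run_case helper are my own placeholders, not part of the ScaLAPACK test suite, and the paths need adjusting for your build tree.

```shell
#!/bin/sh
# Sketch: run each affected test with 1 and 2 MPI ranks and report the exit status.
# TESTDIR and MPIEXEC are assumptions; point them at your own build tree and MPI launcher.
TESTDIR="${TESTDIR:-TESTING}"
MPIEXEC="${MPIEXEC:-mpiexec}"

run_case() {
    # $1 = test binary name, $2 = number of MPI ranks
    if ! command -v "$MPIEXEC" >/dev/null 2>&1 || [ ! -x "$TESTDIR/$1" ]; then
        echo "$1 -n $2: skipped (mpiexec or binary not found)"
        return 0
    fi
    # Run inside the test directory and keep the full output in a per-case log file.
    (cd "$TESTDIR" && "$MPIEXEC" -n "$2" "./$1" >"$1.n$2.log" 2>&1)
    echo "$1 -n $2: exit status $?"
}

for t in xshseqr xdhseqr; do
    for n in 1 2; do
        run_case "$t" "$n"
    done
done
```

If the behaviour matches this report, the -n 1 runs should exit with status 0 and the -n 2 runs with a non-zero status, with the backtrace in the corresponding .log file.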

Separately, I see unusually long runtimes (hundreds of seconds) for some 2.2.0 tests when they run inside the pkgsrc build framework, but those tests do succeed eventually. The FPEs reported here, by contrast, are definite failures.

Labels: bug