Vanadis exits with "Assertion `isNaN_boxed( src_1 )' failed." #2317

Open
plafratt opened this issue Feb 8, 2024 · 2 comments

plafratt commented Feb 8, 2024

New Issue for sst-elements

1 - Detailed description of problem or enhancement

Vanadis exits with FATAL.

sst: ./inst/vfpmul.h:119: void SST::Vanadis::VanadisFPMultiplyInstruction<fp_format>::execute(SST::Output*, SST::Vanadis::VanadisRegisterFile*) [with fp_format = float]: Assertion `isNaN_boxed( src_1 )' failed.

2 - Describe how to reproduce

Note that I've run this program natively on a RISC-V machine, and it ran successfully to completion.

Download attached dgemm executable (dgemm.zip). From src/sst/elements/vanadis/tests/, run

$> VANADIS_EXE=./dgemm VANADIS_EXE_ARGS="16 1" sst ./basic_vanadis.py

(I zipped the executable, because GitHub wouldn't let me upload the file with no extension or with an "exe" extension.)

3 - What Operating system(s) and versions

Rocky Linux 8.9

4 - What version of external libraries (Boost, MPI)

5 - Provide sha1 of all relevant sst repositories (sst-core, sst-elements, etc)

sst-core e952a81bc
sst-elements 7e67f8f

6 - Fill out Labels, Milestones, and Assignee fields as best possible

It doesn't appear to me that GitHub will allow me to edit these fields. But maybe I am missing it.


plafratt commented Mar 28, 2024

I have looked at the ROB (see below) and the instruction trace at the point where this assertion fails. The floating-point instruction that triggers the assertion is near the back of the ROB, at address 0x5f8e0.

What confuses me is where this instruction sits in the executable: it comes after the final instruction (a return) of a different function. That is, the return instruction (JR at 0x5f8da) belongs to function X, while 0x5f8e0 belongs to function Y. Because of this, I think the instruction at 0x5f8e0 is a throw-away instruction: once the return is eventually executed, the pipeline should recognize that it should never have been executed at all.

Strictly speaking, I don't see any functional-correctness problem up to this point. The problem is that the assertion depends on the contents of the floating-point registers this instruction reads (the assertion is at ./inst/vfpmul.h:119, as shown in the error message). Because this is a throw-away instruction in a function that the actual control flow will not reach, I don't see how anything can be guaranteed about what its source registers contain. Is there any reason to believe they cannot contain arbitrary garbage? If not, I don't see how we can assert anything about their contents. Can someone please correct me if I'm thinking about this situation incorrectly?

[node0,Core:    0/1623572047]: 0: -- ROB: 45 out of 64 entries:
[node0,Core:    0/1623572047]: 0: ----> ROB[44]: ins: 0x000000000005f8fc /      FPCMP / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[43]: ins: 0x000000000005f8f8 /    FP32ADD / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[42]: ins: 0x000000000005f8f4 /    FP32SUB / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[41]: ins: 0x000000000005f8f0 /   FPSIGN32 / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[40]: ins: 0x000000000005f8ec /    FP32MUL / error:  no / issued: yes / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[39]: ins: 0x000000000005f8e8 /   FPSIGN32 / error:  no / issued: yes / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[38]: ins: 0x000000000005f8e4 /    FP32MUL / error:  no / issued: yes / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[37]: ins: 0x000000000005f8e0 /    FP32MUL / error:  no / issued: yes / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[36]: ins: 0x000000000005f8dc /    FP32MUL / error:  no / issued: yes / spec:  no / rob-front:  no / exe: yes
[node0,Core:    0/1623572047]: 0: ----> ROB[35]: ins: 0x000000000005f8da /         JR / error:  no / issued:  no / spec: yes / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[34]: ins: 0x000000000005f8d8 /     ADDI64 / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[33]: ins: 0x000000000005f8d6 /       LOAD / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[32]: ins: 0x000000000005f8d4 /       LOAD / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[31]: ins: 0x000000000005f8d2 /       LOAD / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[30]: ins: 0x000000000005f8d0 /       LOAD / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[29]: ins: 0x000000000005f8ce /       LOAD / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[28]: ins: 0x000000000005f8cc /        JMP / error:  no / issued: yes / spec: yes / rob-front:  no / exe: yes
[node0,Core:    0/1623572047]: 0: ----> ROB[27]: ins: 0x000000000005f8ca /     SETREG / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[26]: ins: 0x000000000005f8c8 /      BCMPI / error:  no / issued:  no / spec: yes / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[25]: ins: 0x000000000005f8c6 /     ADDI64 / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[24]: ins: 0x000000000005f8c4 /     ADDI64 / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[23]: ins: 0x000000000005f8c2 /      BCMPI / error:  no / issued:  no / spec: yes / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[22]: ins: 0x000000000005f8be /         JL / error:  no / issued:  no / spec: yes / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[21]: ins: 0x000000000005f8bc /      ADD64 / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[20]: ins: 0x000000000005f8b8 /       LOAD / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[19]: ins: 0x000000000005f8b6 /     ADDI64 / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[18]: ins: 0x000000000005f8b4 /       LOAD / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[17]: ins: 0x000000000005f8b2 /      ADD64 / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[16]: ins: 0x000000000005f8ae /   BCMP_GTE / error:  no / issued:  no / spec: yes / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[15]: ins: 0x000000000005f8ac /       LOAD / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[14]: ins: 0x000000000005f8aa /      STORE / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[13]: ins: 0x000000000005f8a8 /      STORE / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[12]: ins: 0x000000000005f8a6 /      STORE / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[11]: ins: 0x000000000005f8a4 /      STORE / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[10]: ins: 0x000000000005f8a2 /     ADDI64 / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[ 9]: ins: 0x000000000005f8a0 /         JR / error:  no / issued:  no / spec: yes / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[ 8]: ins: 0x000000000005f89e /     ADDI64 / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[ 7]: ins: 0x000000000005f89c /       LOAD / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[ 6]: ins: 0x000000000005f89a /       LOAD / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[ 5]: ins: 0x000000000005f896 /      STORE / error:  no / issued: yes / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[ 4]: ins: 0x000000000005f892 /      STORE / error:  no / issued: yes / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[ 3]: ins: 0x000000000005f88e /         JL / error:  no / issued:  no / spec: yes / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[ 2]: ins: 0x000000000005f88c /      ADD64 / error:  no / issued:  no / spec:  no / rob-front:  no / exe:  no
[node0,Core:    0/1623572047]: 0: ----> ROB[ 1]: ins: 0x000000000005f888 /         JL / error:  no / issued: yes / spec: yes / rob-front:  no / exe: yes
[node0,Core:    0/1623572047]: 0: ----> ROB[ 0]: ins: 0x000000000005f886 /       LOAD / error:  no / issued: yes / spec:  no / rob-front: yes / exe:  no

@Anunalla
Contributor

Can this be handled with the trap flag instead of an assertion? That way, if the offending instruction is eventually thrown away by a pipeline clear before it reaches the retire stage, the trap error would be discarded, and execution could continue along the new path.
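
A rough sketch of that idea, under stated assumptions: the execute step records a per-instruction error flag instead of asserting, a pipeline clear drops younger entries together with their flags, and only the retire step acts on a flag. All types and function names here (`SketchInstruction`, `execute_fp32`, `squash_after`, `retire_front`) are hypothetical and do not reflect the actual Vanadis classes.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical ROB entry; not a Vanadis type.
struct SketchInstruction {
    std::uint64_t address = 0;
    bool trap_error = false;  // set at execute, examined only at retire
    bool executed = false;
};

// Execute: record a badly boxed source instead of aborting the simulation.
void execute_fp32(SketchInstruction& ins, std::uint64_t src_bits) {
    const bool boxed = (src_bits >> 32) == 0xFFFFFFFFu;
    if (!boxed) {
        ins.trap_error = true;
    }
    ins.executed = true;
}

// Pipeline clear: drop every entry younger than the branch at branch_index.
// Any trap flags on those wrong-path entries disappear with them.
void squash_after(std::deque<SketchInstruction>& rob, std::size_t branch_index) {
    rob.erase(rob.begin() + static_cast<std::ptrdiff_t>(branch_index) + 1,
              rob.end());
}

// Retire: returns false only if the oldest instruction carries a trap.
bool retire_front(std::deque<SketchInstruction>& rob) {
    if (rob.empty()) {
        return true;
    }
    const bool ok = !rob.front().trap_error;
    rob.pop_front();
    return ok;
}
```

In this sketch, a wrong-path multiply with garbage sources sets `trap_error`, but if the preceding JR resolves and `squash_after` removes it, `retire_front` never sees the flag and the simulation continues.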
