Realm: Bad data in HDF5 file #1450

Closed
Tracked by #1032
mariodirenzo opened this issue Mar 31, 2023 · 1 comment

@mariodirenzo

Some of my single-node regression tests are failing because HTR non-deterministically writes wrong data to one of its output files.
The data that is wrong in the file is also the input to many calculations in the solver, and those calculations produce correct results at the end of the run. This suggests that the data is computed correctly and that the problem is in the output step.
Moreover, the issue goes away if the -lg:inorder flag is passed.

Legion Spy's logical and physical analyses both validate the task graph, even for runs in which bad data is written.

After a preliminary investigation on Sapling, the error seems to be related to the dimension order of the source instance used for the copy to the HDF file.
In particular, the HDF file has dimension order [0, 1, 2]. If the source instance for the copy has the same dim_order, the data looks good.
If the mapper instead picks a source instance with a different order, for example [1, 0, 2] as in the copy below,

[0 - 7f540bbae840]    1.174715 {2}{xplan}: created: plan=0x7f540419f410 domain=IS:<16,0,0>..<31,15,31>,dense srcs=1 dsts=1
[0 - 7f540bbae840]    1.174724 {1}{xplan}: created: plan=0x7f540419f410 srcs[0]=field(101, inst=4000000001000002, size=24)
[0 - 7f540bbae840]    1.174735 {1}{xplan}: created: plan=0x7f540419f410 dsts[0]=field(101, inst=4000000002c00001, size=24)
[0 - 7f540bbae840]    1.175153 {1}{xplan}: analysis: plan=0x7f540419f410 dim_order=[1, 0, 2] xds=2 ibs=1
[0 - 7f540bbae840]    1.175164 {1}{xplan}: analysis: plan=0x7f540419f410 xds[0]: target=0 inputs=[inst(4000000001000002,0+1)] outputs=[edge(0)]
[0 - 7f540bbae840]    1.175169 {1}{xplan}: analysis: plan=0x7f540419f410 xds[1]: target=0 inputs=[edge(0)] outputs=[inst(4000000002c00001,0+1)]
[0 - 7f540bbae840]    1.175174 {1}{xplan}: analysis: plan=0x7f540419f410 ibs[0]: memory=1a00000000000004 size=196608
[0 - 7f540bbae840]    1.175178 {1}{xplan}: analysis: plan=0x7f540419f410 ib_alloc=[0]
[0 - 7f540bbae840]    1.175201 {2}{dma}: dma request 0x7f540419f630 created - plan=0x7f540419f410 before=800000002750000b after=800000000c000020
[0 - 7f540bbae840]    1.175211 {2}{dma}: dma request 0x7f540419f630 ready - plan=0x7f540419f410 before=800000002750000b after=800000000c000020
[0 - 7f540bbae840]    1.175228 {2}{dma}: dma request 0x7f540419f630 started - plan=0x7f540419f410 before=800000002750000b after=800000000c000020
[0 - 7f54185da840]    1.234420 {2}{dma}: dma request 0x7f540419f630 completed - plan=0x7f540419f410 before=800000002750000b after=800000000c000020
[0 - 7f54185da840]    1.234458 {2}{xplan}: destroyed: plan=0x7f540419f410

the data in the output file looks bad.
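
To illustrate why a layout mismatch of this kind would corrupt the file, here is a minimal pure-Python sketch. It is not Realm code and does not claim to reproduce the actual bug. It assumes that the first entry of dim_order is the fastest-varying dimension (the runtime's real convention may differ), and the helper names, the small block shape, and the value encoding are all made up for the illustration.

```python
from itertools import product

def points_in_layout_order(shape, dim_order):
    """Yield every point of a dense block in memory order.

    Assumption: dim_order[0] is the fastest-varying dimension.
    """
    slow_to_fast = list(reversed(dim_order))
    # product() varies its last argument fastest, so pass dimensions slowest first
    for idx in product(*(range(shape[d]) for d in slow_to_fast)):
        point = [0] * len(shape)
        for d, i in zip(slow_to_fast, idx):
            point[d] = i
        yield tuple(point)

def linearize(shape, dim_order, value_at):
    """Flatten a block into a buffer using the given dimension order."""
    return [value_at(p) for p in points_in_layout_order(shape, dim_order)]

def read_back(buf, shape, dim_order):
    """Interpret a flat buffer with the given dimension order: {point: value}."""
    return dict(zip(points_in_layout_order(shape, dim_order), buf))

def value(p):
    # Hypothetical payload: encodes the coordinates so misplaced data is obvious
    return 100 * p[0] + 10 * p[1] + p[2]

shape = (2, 3, 2)  # small stand-in for the real 16x16x32 block
src = linearize(shape, [1, 0, 2], value)   # source instance layout
good = read_back(src, shape, [1, 0, 2])    # copy that honors the source layout
bad = read_back(src, shape, [0, 1, 2])     # copy that assumes the file's layout
wrong = sorted(p for p in bad if bad[p] != value(p))
print(len(wrong), "of", len(bad), "points are wrong")
```

A copy that converts between the two orders has to remap every point. If the bytes are instead moved as if both sides shared one order, values land at the wrong coordinates even though each value was computed correctly, which matches the symptom described above.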

To reproduce the issue on Sapling, one needs to

  • load my bash environment
  • go into /home/mariodr/htr/solverTests/3DPeriodic/
  • execute rm -rf slurm-2* sample0/; ../../prometeo.sh -i base.json -level dma=2,xplan=1,inst=1 -logfile spy_%.log

This command submits a run of the code to one of the GPU nodes of Sapling. The run lasts about 3 seconds.
As many runtime flags as needed can be added to the command.
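
To find which copies used a permuted source layout, the xplan analysis lines in the resulting logs can be scanned. The following is a rough sketch: the regular expression is written against the log lines quoted above, and the spy_*.log glob is an assumption about how the % in the -logfile argument is expanded. It flags every copy plan with a non-identity dim_order, not only the copies that target the HDF file.

```python
import glob
import re

# Matches lines such as:
#   ... {1}{xplan}: analysis: plan=0x7f540419f410 dim_order=[1, 0, 2] xds=2 ibs=1
PLAN_RE = re.compile(
    r"\{xplan\}: analysis: plan=(0x[0-9a-f]+) dim_order=\[([0-9, ]+)\]"
)

def permuted_plans(lines):
    """Return {plan: dim_order} for plans whose dim_order is not [0, 1, 2, ...]."""
    found = {}
    for line in lines:
        m = PLAN_RE.search(line)
        if not m:
            continue
        order = [int(tok) for tok in m.group(2).split(",")]
        if order != list(range(len(order))):
            found[m.group(1)] = order
    return found

if __name__ == "__main__":
    # Assumed file name pattern; adjust to whatever the run actually produced
    for path in glob.glob("spy_*.log"):
        with open(path) as f:
            for plan, order in permuted_plans(f).items():
                print(path, plan, order)
```

The plan address printed for each hit can then be matched against the srcs[0] and dsts[0] lines of the same plan to see which instances were involved.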

The bad data is produced in ./sample0/cellCenter_grid/0,0,0-31,31,31.hdf.
To check whether the data is bad, run the following command: ../../scripts/compare_hdf.py sample0/cellCenter_grid/0,0,0-31,31,31.hdf ../referenceData/Cartesian/3DPeriodic/cpu_ref.hdf
This command will print a list of all the wrong points.
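
For readers without access to that script, the kind of check it performs can be sketched as follows. This is not the contents of compare_hdf.py, which are not shown here. The function name and tolerances are assumptions, and loading the datasets from the two files (for example with an HDF5 reader) is left out, so the inputs are plain flat sequences in row-major order.

```python
import math
from itertools import product

def wrong_points(actual, reference, shape, rel_tol=1e-12, abs_tol=0.0):
    """Return the index tuples where two flat, row-major arrays disagree.

    The tolerances are illustrative defaults, not the ones the real script uses.
    """
    n = math.prod(shape)
    if len(actual) != n or len(reference) != n:
        raise ValueError("array length does not match shape")
    bad = []
    # product() varies the last index fastest, which is row-major order
    for offset, point in enumerate(product(*(range(e) for e in shape))):
        if not math.isclose(actual[offset], reference[offset],
                            rel_tol=rel_tol, abs_tol=abs_tol):
            bad.append(point)
    return bad
```

Reporting the coordinates rather than only a pass/fail result makes layout problems easier to recognize, because misplaced data tends to show up as a regular pattern of bad points.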

Adding @streichler and @lightsighter for visibility.

@elliottslaughter, could you please add this issue to the Realm section of #1032 with top priority?

@mariodirenzo
Author

This issue has been fixed by 3e2c20e.
