Realm: Bad data in HDF5 file #1450

mariodirenzo · 2023-03-31T16:29:18Z

Some of my single-node regression tests are failing because one of the output files produced by HTR non-deterministically contains wrong data.
The data that is wrong in the file is also the input to many calculations in the solver, and those calculations turn out to be correct at the end of the run. This suggests that the data is computed correctly and that the issue is in the output path.
Moreover, the issue goes away if the -lg:inorder flag is passed.

Legion Spy's logical and physical analyses both validate the task graph, even for runs in which bad data is written.

After a preliminary investigation on Sapling, the error seems to be related to the dimension order of the source instance used for the copy to the HDF5 file.
In particular, the HDF5 file has dimension order [0, 1, 2]. If the source instance for the copy has the same dim_order, the data looks good.
If the mapper picks a source instance with a different order, for example [1, 0, 2] as in the copy below,

[0 - 7f540bbae840]    1.174715 {2}{xplan}: created: plan=0x7f540419f410 domain=IS:<16,0,0>..<31,15,31>,dense srcs=1 dsts=1
[0 - 7f540bbae840]    1.174724 {1}{xplan}: created: plan=0x7f540419f410 srcs[0]=field(101, inst=4000000001000002, size=24)
[0 - 7f540bbae840]    1.174735 {1}{xplan}: created: plan=0x7f540419f410 dsts[0]=field(101, inst=4000000002c00001, size=24)
[0 - 7f540bbae840]    1.175153 {1}{xplan}: analysis: plan=0x7f540419f410 dim_order=[1, 0, 2] xds=2 ibs=1
[0 - 7f540bbae840]    1.175164 {1}{xplan}: analysis: plan=0x7f540419f410 xds[0]: target=0 inputs=[inst(4000000001000002,0+1)] outputs=[edge(0)]
[0 - 7f540bbae840]    1.175169 {1}{xplan}: analysis: plan=0x7f540419f410 xds[1]: target=0 inputs=[edge(0)] outputs=[inst(4000000002c00001,0+1)]
[0 - 7f540bbae840]    1.175174 {1}{xplan}: analysis: plan=0x7f540419f410 ibs[0]: memory=1a00000000000004 size=196608
[0 - 7f540bbae840]    1.175178 {1}{xplan}: analysis: plan=0x7f540419f410 ib_alloc=[0]
[0 - 7f540bbae840]    1.175201 {2}{dma}: dma request 0x7f540419f630 created - plan=0x7f540419f410 before=800000002750000b after=800000000c000020
[0 - 7f540bbae840]    1.175211 {2}{dma}: dma request 0x7f540419f630 ready - plan=0x7f540419f410 before=800000002750000b after=800000000c000020
[0 - 7f540bbae840]    1.175228 {2}{dma}: dma request 0x7f540419f630 started - plan=0x7f540419f410 before=800000002750000b after=800000000c000020
[0 - 7f54185da840]    1.234420 {2}{dma}: dma request 0x7f540419f630 completed - plan=0x7f540419f410 before=800000002750000b after=800000000c000020
[0 - 7f54185da840]    1.234458 {2}{xplan}: destroyed: plan=0x7f540419f410

the data in the output file looks bad.
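
To illustrate why a dim_order mismatch matters, here is a minimal pure-Python sketch. It is not Realm code and does not claim to reproduce the actual bug. It assumes that dim_order lists dimensions from fastest-varying to slowest-varying, and the helper names (linearize, make_buffer, mismatched_points) are hypothetical. It shows that reading a buffer laid out as [1, 0, 2] as if it were laid out as [0, 1, 2], without any transposition, yields wrong values at every point whose first two coordinates differ.

```python
from itertools import product


def linearize(point, extents, dim_order):
    """Offset of `point` in a dense buffer.

    Assumption: `dim_order` lists dimensions from fastest- to slowest-varying.
    """
    offset, stride = 0, 1
    for d in dim_order:
        offset += point[d] * stride
        stride *= extents[d]
    return offset


def make_buffer(extents, dim_order):
    """Dense buffer in which every slot stores the coordinates of its own point."""
    buf = [None] * (extents[0] * extents[1] * extents[2])
    for p in product(*(range(e) for e in extents)):
        buf[linearize(p, extents, dim_order)] = p
    return buf


def mismatched_points(extents, src_order, dst_order):
    """Points that receive a wrong value when the source buffer is read
    with the destination layout, i.e. copied without transposition."""
    src = make_buffer(extents, src_order)
    bad = []
    for p in product(*(range(e) for e in extents)):
        if src[linearize(p, extents, dst_order)] != p:
            bad.append(p)
    return bad


extents = (4, 4, 2)  # small hypothetical grid, not the one from the log
same = mismatched_points(extents, (0, 1, 2), (0, 1, 2))
swapped = mismatched_points(extents, (1, 0, 2), (0, 1, 2))
print(len(same), len(swapped))
```

In this toy case the matching layout produces no wrong points, while the swapped layout corrupts every point off the x == y diagonal, which is the kind of scattered damage a point-wise comparison against reference data would report.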

To reproduce the issue on Sapling, one needs to

1. load my bash environment
2. go into /home/mariodr/htr/solverTests/3DPeriodic/
3. execute rm -rf slurm-2* sample0/; ../../prometeo.sh -i base.json -level dma=2,xplan=1,inst=1 -logfile spy_%.log

This command submits a run of the code to one of the GPU nodes of Sapling. The run takes about 3 seconds.
Any additional runtime flags can be appended to the command.

The bad data is written to ./sample0/cellCenter_grid/0,0,0-31,31,31.hdf.
To check whether the data is bad, run the following command:
../../scripts/compare_hdf.py sample0/cellCenter_grid/0,0,0-31,31,31.hdf ../referenceData/Cartesian/3DPeriodic/cpu_ref.hdf
This command will print a list of all the wrong points.
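
For readers without access to the HTR tree, the kind of point-wise check described above can be sketched as follows. This is not the actual compare_hdf.py, whose contents are not shown here. The function name wrong_points and the tolerances are hypothetical, and the sketch assumes both data sets have already been loaded into nested lists indexed as [i][j][k].

```python
import math


def wrong_points(result, reference, rel_tol=1e-9, abs_tol=0.0):
    """Return (index, value, expected) for every point of `result`
    that differs from `reference` beyond the given tolerances."""
    bad = []
    for i, plane in enumerate(reference):
        for j, row in enumerate(plane):
            for k, ref in enumerate(row):
                val = result[i][j][k]
                if not math.isclose(val, ref, rel_tol=rel_tol, abs_tol=abs_tol):
                    bad.append(((i, j, k), val, ref))
    return bad


# Tiny made-up 2x2x2 data set with one deliberately corrupted point.
reference = [[[float(i + 2 * j + 4 * k) for k in range(2)]
              for j in range(2)] for i in range(2)]
result = [[list(row) for row in plane] for plane in reference]
result[1][0][1] = -1.0

for point, val, ref in wrong_points(result, reference):
    print(point, val, ref)
```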

Adding @streichler and @lightsighter for visibility.

@elliottslaughter could you please add this issue to the Realm section of #1032 with top priority?


mariodirenzo · 2023-04-09T10:56:58Z

This issue has been fixed by 3e2c20e.

streichler self-assigned this Mar 31, 2023

elliottslaughter mentioned this issue Mar 31, 2023

Prioritized list of Regent features for HTR (PSAAP) #1032

mariodirenzo closed this as completed Apr 9, 2023
