Some of my one-node regression tests are failing because the data produced by HTR is non-deterministically wrong in one of its output files.
The data that is found wrong in the file is the input for many calculations in the solver, which turn out to be correct at the end of the run. This suggests that the data is correctly computed and the issue is in the output.
Moreover, the issue goes away if the `-lg:inorder` flag is passed.

Legion Spy physical and logical analysis validates the task graph analysis even when bad data is written.
After a preliminary investigation on Sapling, the error seems to be related to the order of the dimensions of the source instance for the copy to the HDF file.
In particular, the HDF file has the dimension order `[0, 1, 2]`. If the source instance for the copy to the HDF file has the same `dim_order`, the data looks good.

If the mapper picks a source instance that has, for instance, the order `[1, 0, 2]`, as in the copy below, the data in the output file looks bad.
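To illustrate the kind of corruption a dimension-order mismatch would produce, here is a minimal sketch in plain Python. It is not Realm or HTR code: the helper names, the small `2x3x4` grid, and the convention that `dim_order[0]` is the fastest-varying dimension are all assumptions made for this illustration. It writes a field into a flat buffer using one order, reads it back assuming another, and lists the points that come back wrong.

```python
from itertools import product

SHAPE = (2, 3, 4)  # hypothetical small grid; the real output file covers 32x32x32


def strides_for(shape, dim_order):
    # Assumption of this sketch: dim_order[0] is the fastest-varying dimension.
    strides = [0] * len(shape)
    step = 1
    for dim in dim_order:
        strides[dim] = step
        step *= shape[dim]
    return strides


def offset(index, strides):
    # Flat buffer position of a multi-dimensional index.
    return sum(i * s for i, s in zip(index, strides))


def all_indices(shape):
    return product(*(range(n) for n in shape))


def write_buffer(values, shape, dim_order):
    # Lay the field out in memory following dim_order (the "source instance").
    strides = strides_for(shape, dim_order)
    size = 1
    for n in shape:
        size *= n
    buf = [None] * size
    for index, value in values.items():
        buf[offset(index, strides)] = value
    return buf


def read_buffer(buf, shape, dim_order):
    # Interpret the same flat buffer as if it had been laid out with dim_order.
    strides = strides_for(shape, dim_order)
    return {index: buf[offset(index, strides)] for index in all_indices(shape)}


def wrong_points(shape, src_order, dst_order):
    # Each point gets a distinct value so that any misplacement is detectable.
    values = {idx: 100 * idx[0] + 10 * idx[1] + idx[2] for idx in all_indices(shape)}
    buf = write_buffer(values, shape, src_order)
    seen = read_buffer(buf, shape, dst_order)
    return sorted(idx for idx in values if seen[idx] != values[idx])


if __name__ == "__main__":
    print("matching orders:  ", len(wrong_points(SHAPE, [0, 1, 2], [0, 1, 2])), "wrong points")
    print("mismatched orders:", len(wrong_points(SHAPE, [1, 0, 2], [0, 1, 2])), "wrong points")
```

With matching orders every point round-trips. With `[1, 0, 2]` as the source and `[0, 1, 2]` as the destination, only the points whose flat offsets happen to coincide in both layouts survive, which would be consistent with an output file where some points are right and others are wrong.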
To reproduce the issue on Sapling, one needs to:

- go to `/home/mariodr/htr/solverTests/3DPeriodic/`
- run `rm -rf slurm-2* sample0/; ../../prometeo.sh -i base.json -level dma=2,xplan=1,inst=1 -logfile spy_%.log`

This command will submit an execution of the code to one of the gpu nodes of Sapling. The execution lasts about 3 seconds.
As many runtime flags as needed can be added to the command.
The bad data is produced in `./sample0/cellCenter_grid/0,0,0-31,31,31.hdf`.

To check if the data is bad, you can execute the following command:

`../../scripts/compare_hdf.py sample0/cellCenter_grid/0,0,0-31,31,31.hdf ../referenceData/Cartesian/3DPeriodic/cpu_ref.hdf`

This command will print a list of all the wrong points.
Adding @streichler and @lightsighter for visibility.
@elliottslaughter could you please add this issue to the Realm section of #1032 with top priority?