[XRT] ERROR trying to run DGEMM build on Xilinx U280. #32
Comments
Additionally, during 16384 runs I'm now getting soft lock-up warnings on the CPU once it reaches kernel execution:
Again, this does not occur with smaller matrix sizes.
Hi again, we have made further progress on the issue. Having built SGEMM for a U250 card, we encounter the same XRT error when running 16k matrices. Here is the system configuration as given by
We noticed that both our U250 and U280 cards fail test 7 when using
Could this possibly be the source of the issue?
Hey! Since this only occurs with large matrix sizes and throws an I/O error, it could be related to the size of the memory transfer. If my math is right, transferring three 16384x16384 double-precision matrices amounts to about 6.4 GB, which I suppose could be an issue for the virtual HBM channels on the U280 (I believe the individual virtual channels have smaller capacity than this), but should work fine in DDR 🤔 Are you completely sure the issue you see is identical between the U280 and the U250, or is there any chance that they are separate issues?
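As a quick sanity check on that estimate, here is a minimal sketch of the footprint arithmetic. It assumes three dense square matrices (A, B and C), 8 bytes per element for DGEMM and 4 for SGEMM, and no padding or extra buffers. The helper name `gemm_footprint_bytes` is made up for illustration and is not part of gemm_hls or XRT.

```python
def gemm_footprint_bytes(n, bytes_per_element, num_matrices=3):
    """Bytes needed to hold num_matrices dense n x n matrices (no padding assumed)."""
    return num_matrices * n * n * bytes_per_element


# Matrix sizes from the test range mentioned in this thread (4k, 8k, 12k, 16k).
for n in (4096, 8192, 12288, 16384):
    dgemm = gemm_footprint_bytes(n, 8)  # double precision: 8 bytes per element
    sgemm = gemm_footprint_bytes(n, 4)  # single precision: 4 bytes per element
    print(f"{n:>6}: DGEMM {dgemm / 1e9:5.2f} GB, SGEMM {sgemm / 1e9:5.2f} GB")
```

For n = 16384 in double precision this comes to 6,442,450,944 bytes, which matches the roughly 6.4 GB figure above. The same size in single precision needs half of that. Whether a given buffer fits still depends on how the build maps A, B and C to memory banks, which this sketch does not model.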
Hi, thanks for the reply. AMD/Xilinx also suggested that this is a memory issue, and we're in the process of checking the usage. The issue is not completely identical: SGEMM works on the U280 but fails on the U250, with the same error that DGEMM gives on the U280. I've checked the Config.h in the build directories, and SGEMM was built with the same parameters on both cards, so why the U250 gives the same XRT error is a mystery at the moment.
Any news @A-Kibats?
Hi,
Running a DGEMM build with a matrix size of 16384 returns the following errors:
This seems to occur only when the card already has the bitstream loaded: resetting the card with `xbutil reset` and then running for the first time does not give the same error. Smaller matrices seem to work fine, with 12288 being the largest size that worked reliably (the test range was 4k, 8k, 12k and 16k).
CMakeLists.txt was kept at mostly default settings, except that the card string was changed to "xilinx_u280_xdma_201920_3" and the build was configured for DGEMM following the README.md found within gemm_hls:
Here is the system configuration as given by `xbutil examine`:
Any help would be much appreciated.
Cheers, Andrew.