PyTorch FC training hangs on ROCm-4.5.0 with gfx803 #1218

Closed
xuhuisheng opened this issue Nov 7, 2021 · 19 comments

@xuhuisheng

xuhuisheng commented Nov 7, 2021

@cgmb
Please help take a look at this bug. Thank you very much.
If you don't have enough time, please give me some clues. I am really not familiar with GCN assembly. (T_T)

What is the expected behavior

  • train the FC layer properly

What actually happens

  • hangs with GPU utilization at 100%

How to reproduce

Environment

Hardware description
GPU: RX 580
GPU chip ID: 0x67df
CPU: E5-2620 v3
Software version
ROCK: v4.5.0
ROCR: v4.5.0
HCC: v4.5.0
Library: v4.5.0

Workaround

If I delete library/src/blas3/Tensile/Logic/asm_full/r9nano_*.yaml and rebuild rocBLAS, the problem is resolved.
Here is my patched rocBLAS: https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm450/rocblas_2.41.0-337552f0.dirty_amd64.deb
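For anyone following along, the workaround amounts to roughly this, run from the rocBLAS source tree (a sketch; the install.sh flags assume rocBLAS's usual build script and can differ between releases):

rm library/src/blas3/Tensile/Logic/asm_full/r9nano_*.yaml
./install.sh -d -c -i   # rebuild rocBLAS plus its clients and install the package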

Etc.

Here is my personal issue tracking this problem on ROCm-4.0.1:
xuhuisheng/rocm-build#4

Here is my build documentation for gfx803. Glad to see there is only one issue left on ROCm-4.5.0 with gfx803.
https://github.com/xuhuisheng/rocm-build/tree/master/gfx803

I tried some other scripts:

The MNIST example on TensorFlow with gfx803 shows NaN loss.
The text classification example on TensorFlow shows Invalid argument: indices[5,284] = 997212422 is not in [0, 5001).
#1172

I guess this problem is caused by the new assembly code in Tensile; we can see that after I delete the r9nano-related Tensile yaml files, the plain C GEMM implementations run properly.
I don't know how to reproduce it with plain C code yet, but it affects PyTorch and TensorFlow.

@cgmb
Contributor

cgmb commented Nov 10, 2021

This appears to be caused by out-of-bounds writes in the gemm batched routines for gfx803. By deleting the Tensile yaml files for that architecture, you fall back to source kernels rather than using the assembly kernels. In my tests, the source kernels appear to have fewer failures, but I'm not sure that this entirely solves the problem. I still see failures when running rocblas-test --gtest_filter='*checkin*-*known_bug*' on my RX 570.

That's about all the investigation I have time for, but I'll organize and post my logs here later.

@cgmb
Contributor

cgmb commented Nov 13, 2021

You can see which kernels are run using the TENSILE_DB environment variable. I used that to determine exactly which Tensile kernels your example calls.
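For example (the bitmask is parsed by Tensile's host library and its bit meanings vary between versions, so treat the exact value as an assumption; setting all bits simply turns on all debug output):

TENSILE_DB=0xFFFF ./rocblas-test --gtest_filter='*gemm*' 2>&1 | grep 'Running kernel'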

Using rocBLAS built with the r9nano yaml files:

Running kernel: Cijk_Ailk_Bljk_SB_MT16x16x16_SN_1LDSB0_APM1_AF0EM1_AF1EM1_AMAS3_ASAE01_ASCE01_ASEM1_BL1_DTL0_DVO0_EPS1_FL0_GRVW2_GSU8_ISA803_IU1_K1_KLA_LBSPP0_LPA0_LPB0_LDL1_LRVW2_MAC_MDA2_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SRVW0_SVW4_SNLL0_TT2_2_TLDS0_USFGROn1_VAW1_VSn1_VW2_WSGRA0_WSGRB0_WS64_WG8_8_4_WGM1
Running kernel: Cijk_Ailk_Bljk_SB_MT32x8x64_SN_1LDSB0_APM1_AF0EM1_AF1EM1_AMAS3_ASAE01_ASCE01_ASEM1_BL1_DTL0_DVO0_EPS0_FL0_GRVW2_GSU16_ISA803_IU1_K1_KLA_LBSPP0_LPA0_LPB0_LDL1_LRVW2_MAC_MDA2_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SRVW0_SVW4_SNLL0_TT4_2_TLDS0_USFGROn1_VAW1_VSn1_VW2_WSGRA0_WSGRB0_WS64_WG8_4_8_WGM1

Using rocBLAS built without the r9nano yaml files:

Running kernel: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_AMAS0_BL0_GRVW1_GSU1_K1_LRVW1_NLCA1_NLCB1_PGR0_PLR0_SU32_SUS256_SVW4_TT8_4_USFGRO0_VAW1_VS1_VW1_WG16_16_1_WGM8

@cgmb
Contributor

cgmb commented Nov 14, 2021

These are my logs running rocblas-test on the RX 570 with and without the r9nano yaml files:
gemm_test_logs.zip

Unfortunately, in both cases you see failures like these:

/root/rocBLAS/clients/gtest/../include/d_vector.hpp:118: Failure
Expected equality of these values:
  memcmp(host, guard, sizeof(guard))
    Which is: -112
  0
[  FAILED  ] pre_checkin/gemm.blas3_tensile/gemm_medium_f16_r_NN_4_4_4_5_4_4_0_4, where GetParam() = { M: 4, N: 4, K: 4, KL: 128, KU: 128, lda: 4, ldb: 4, ldc: 4, ldd: 4, a_type: f16_r, b_type: f16_r, c_type: f16_r, d_type: f16_r, compute_type: f16_r, incx: 0, incy: 0, incd: 0, incb: 0, alpha: 5.0, alphai: 0.0, beta: 0.0, betai: 0.0, transA: 'N', transB: 'N', side: '*', uplo: '*', diag: '*', batch_count: 1, HMM: false, threads: 0, streams: 0, devices: 0, stride_a: 16, stride_b: 16, stride_c: 16, stride_d: 16, stride_x: 0, stride_y: 0, fortran: false, norm_check: 0, unit_check: 1, timing: 0, iters: 10, cold_iters: 2, algo: 0, solution_index: 0, flags: 0, function: "gemm", name: "gemm_medium", category: "pre_checkin", initialization: "rand_int", known_bug_platforms: "", c_noalias_d: false, atomics_mode: atomics_allowed, user_allocated_workspace: 0 }

That failed check in d_vector.hpp is looking at the contents of the padding which is added to device allocations in the test code. If the library code changes the padding by writing beyond the boundaries of its requested allocations, that's the check where it will be caught.
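For anyone unfamiliar with the technique, here is a minimal sketch of such a guard-byte check (a simplified illustration of the idea, not the actual d_vector.hpp code; it assumes only the HIP runtime API):

#include <cstdio>
#include <cstring>
#include <hip/hip_runtime.h>

int main()
{
    constexpr size_t n   = 16; // elements the routine under test asked for
    constexpr size_t pad = 64; // guard elements on each side

    float guard[pad];
    memset(guard, 0xF0, sizeof(guard)); // known byte pattern

    // Allocate pad + n + pad elements and copy the pattern into both pads.
    float* d = nullptr;
    hipMalloc(&d, (n + 2 * pad) * sizeof(float));
    hipMemcpy(d, guard, sizeof(guard), hipMemcpyHostToDevice);
    hipMemcpy(d + pad + n, guard, sizeof(guard), hipMemcpyHostToDevice);

    float* data = d + pad; // the pointer handed to the routine under test
    (void)data;            // ... call the rocBLAS routine on `data` here ...

    // Read the trailing pad back; any change means an out-of-bounds write.
    float host[pad];
    hipMemcpy(host, d + pad + n, sizeof(host), hipMemcpyDeviceToHost);
    if (memcmp(host, guard, sizeof(guard)) != 0)
        printf("out-of-bounds write detected past the end of the buffer\n");

    hipFree(d);
    return 0;
}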

@bragadeesh
Contributor

Please note that gfx803 is not officially supported on ROCm.
The best we can do at this point is to provide you with information (such as the above) to help enable you to resolve these issues yourself.

@xuhuisheng
Author

Thanks for your help, @cgmb. I will try to test in this direction.

Thanks for the notice, @bragadeesh. I really appreciate your help.

@xuhuisheng
Author

@cgmb Sorry for the delayed response.
I ran rocblas-test as you suggested, on ROCm-4.5.2 and ROCm-5.1.3. There are FAILED tests with the asm code, but when I deleted the r9nano yaml files, there were no FAILED tests. As mentioned before, my environment is an RX 580 8GB on ubuntu-20.04.4.

So I think deleting the r9nano-related yaml files may resolve the rocBLAS gfx803 issues. I will find time to dig into the asm problems.
Thank you.

@cgmb
Contributor

cgmb commented Jun 7, 2022

I'm surprised. That's different from the results I got. I'd honestly given up after trying ROCm 5.0 and discovering that my gfx803 card didn't seem to work at all anymore. It was just today that I learned you need to set ROC_ENABLE_PRE_VEGA=true to enable gfx803 in ROCclr now.

In any case, my conclusion from this issue was that the complexity of Tensile limits the ability of the community to understand and debug rocBLAS. I've been asking about potentially having fallback implementations for rocBLAS functions when building without Tensile. That's something that's been desired by a number of different people for different reasons (e.g., spack/spack#28846), and I think it would be helpful for cases like this.

@xuhuisheng
Author

xuhuisheng commented Jun 11, 2022

@cgmb
I did a few more tests using the rocBLAS gtest. Here is what I found.

My environment:

  • ubuntu-20.04.4
  • ROCm-5.1.3
  • RX580

Although rocblas-test said the rocBLAS version is 2.43.0.f0273f26-dirty, it actually used the system rocBLAS installed in /opt/rocm. Even if I rebuilt rocBLAS-clients, the tests would always throw errors unless I installed the patched rocBLAS using sudo dpkg -i rocblas*.

I think this is why our test results are different. Maybe you didn't re-install rocBLAS to /opt/rocm, so you were running the asm code on gfx803 all the time.
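One quick way to confirm which library the test binary actually resolves at load time is plain Linux tooling, e.g.:

ldd ./rocblas-test | grep rocblas

If that prints a path under /opt/rocm, the test is using the system package rather than the freshly built library.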

The next point is that the errors are not stable. I ran rocblas-test --gtest_filter='*checkin*gemm_medium_f*_r_NN_*', and the errors were not the same. I tried 5 times; sometimes 8_8_8_5_8_8_0_8 reported errors, sometimes 10_10_10_5_10_10_0_10 did. The error message also varied: sometimes from unit.hpp, sometimes from d_vector.hpp.

And if I run one test at a time, it runs properly: rocblas-test --gtest_filter='_/gemm.blas3_tensile/pre_checkin_gemm_medium_f16_r_NN_8_8_8_5_8_8_0_8'

It looks like some context is not cleared, and one test may affect the others when the asm code is used. Do you have any clue about this? Thank you very much.

The good news is that I can reproduce this issue with plain C code, without having to install TensorFlow. \^_^/

rocBLAS version: 2.43.0.f0273f26-dirty

Query device success: there are 1 devices
-------------------------------------------------------------------------------
Device ID 0 : Radeon RX 580 Series gfx803
with 8.6 GB memory, max. SCLK 1340 MHz, max. MCLK 2000 MHz, compute capability 8.0
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 64
-------------------------------------------------------------------------------
info: parsing of test data may take a couple minutes before any test output appears...

Note: Google Test filter = _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from _/gemm
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_5_3_3_0_3
[       OK ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_5_3_3_0_3 (13923 ms)
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_0_3_3_3_3
[       OK ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_0_3_3_3_3 (7 ms)
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3
/home/work/ROCm/rocBLAS/clients/gtest/../include/unit.hpp:130: Failure
Expected equality of these values:
  hCPU[i + j * size_t(lda) + k * 0]
    Which is: 79
  hGPU[i + j * size_t(lda) + k * 0]
    Which is: 72
/home/work/ROCm/rocBLAS/clients/gtest/../include/unit.hpp:130: Failure
Expected equality of these values:
  hCPU[i + j * size_t(lda) + k * 0]
    Which is: 79
  hGPU[i + j * size_t(lda) + k * 0]
    Which is: 72
[  FAILED  ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3, where GetParam() = { function: "gemm", name: "gemm_medium", category: "pre_checkin", known_bug_platforms: "", alpha: 1.0, alphai: 0.0, beta: 3.0, betai: 0.0, stride_a: 9, stride_b: 9, stride_c: 9, stride_d: 9, stride_x: 0, stride_y: 0, user_allocated_workspace: 0, M: 3, N: 3, K: 3, KL: 128, KU: 128, lda: 3, ldb: 3, ldc: 3, ldd: 3, incx: 0, incy: 0, incd: 0, incb: 0, batch_count: 1, iters: 10, cold_iters: 2, algo: 0, solution_index: 0, flags: none, a_type: f32_r, b_type: f32_r, c_type: f32_r, d_type: f32_r, compute_type: f32_r, initialization: "rand_int", atomics_mode: atomics_allowed, pad: 4096, threads: 0, streams: 0, devices:  (0 ms)
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_1_3
[       OK ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_1_3 (0 ms)
[----------] 4 tests from _/gemm (14045 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (14201 ms total)
[  PASSED  ] 3 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3, where GetParam() = { function: "gemm", name: "gemm_medium", category: "pre_checkin", known_bug_platforms: "", alpha: 1.0, alphai: 0.0, beta: 3.0, betai: 0.0, stride_a: 9, stride_b: 9, stride_c: 9, stride_d: 9, stride_x: 0, stride_y: 0, user_allocated_workspace: 0, M: 3, N: 3, K: 3, KL: 128, KU: 128, lda: 3, ldb: 3, ldc: 3, ldd: 3, incx: 0, incy: 0, incd: 0, incb: 0, batch_count: 1, iters: 10, cold_iters: 2, algo: 0, solution_index: 0, flags: none, a_type: f32_r, b_type: f32_r, c_type: f32_r, d_type: f32_r, compute_type: f32_r, initialization: "rand_int", atomics_mode: atomics_allowed, pad: 4096, threads: 0, streams: 0, devices:

 1 FAILED TEST
rocBLAS version: 2.43.0.f0273f26-dirty

command line: /home/work/ROCm/rocBLAS/build/release/clients/staging/rocblas-test --gtest_filter=_/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_*
work@6d56e6a8963e:~/ROCm/rocBLAS$
work@6d56e6a8963e:~/ROCm/rocBLAS$

@LamEnder

LamEnder commented Jun 16, 2022

@xuhuisheng wrote: So I think deleting the r9nano-related yaml files may resolve the rocBLAS gfx803 issues. […]

Cool! But is it possible to get a working TensorFlow or PyTorch installation from the current ROCm 5.1.3 toolchain?
I once spent a whole day just getting TensorFlow compiled, and it failed spectacularly; that was quite some time ago, so I cannot provide the log.

I actually have a gfx803 in my gaming laptop, which further complicates the process of getting ROCm to run. If I remember correctly, the different memory management models of the iGPU and the dGPU cause a segmentation fault if ROCm tries to access one GPU with the other's memory management model (e.g., accessing the dGPU in the iGPU's memory management mode).

I would be grateful if @xuhuisheng could provide me some clues to get this working.

Currently I have a Manjaro installation, so I have to build all the ROCm components from source if I want to install them.

@LamEnder

@xuhuisheng wrote: I did a few more tests using the rocBLAS gtest. […] The good news is that I can reproduce this issue with plain C code, without having to install TensorFlow. \^_^/

Can you provide me a guide to build tensorflow from source for gfx803?

@xuhuisheng
Author

@LamEnder
Building tensorflow-rocm on ROCm is out of scope for this topic.
I created a discussion on my repository; we can talk about TF there:
xuhuisheng/rocm-build#29

@xuhuisheng
Author

xuhuisheng commented Jun 18, 2022

@cgmb
Sorry to bother you, but I found an interesting thing.

This is the simplest way I have found to reproduce the issue:

/home/work/ROCm/rocBLAS/build/release/clients/staging/rocblas-test --gtest_filter=_/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_*

It matches 4 tests, and the third test always fails.
But if I run the third test individually, it runs properly.

work@6d56e6a8963e:~/ROCm/rocBLAS$ /home/work/ROCm/rocBLAS/build/release/clients/staging/rocblas-test --gtest_filter='_/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3*'
rocBLAS version: 2.43.0.f0273f26-dirty

Query device success: there are 1 devices
-------------------------------------------------------------------------------
Device ID 0 : Radeon RX 580 Series gfx803
with 8.6 GB memory, max. SCLK 1340 MHz, max. MCLK 2000 MHz, compute capability 8.0
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 64
-------------------------------------------------------------------------------
info: parsing of test data may take a couple minutes before any test output appears...

Note: Google Test filter = _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from _/gemm
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_5_3_3_0_3
transA  : N
transB  : N
M       : 3
N       : 3
K       : 3
h_alpha : 5
lda     : 3
ldb     : 3
h_beta  : 0
ldc     : 3
7 10 1 7 6 3 7 6 7
-9 6 -7 4 -3 9 -5 9 -1
10 2 9 10 4 7 6 3 7
-350 -480 -200 350 380 290 105 -10 75
-350 -480 -200 350 380 290 105 -10 75
[       OK ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_5_3_3_0_3 (14142 ms)
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_0_3_3_3_3
transA  : N
transB  : N
M       : 3
N       : 3
K       : 3
h_alpha : 0
lda     : 3
ldb     : 3
h_beta  : 3
ldc     : 3
7 10 1 7 6 3 7 6 7
-9 6 -7 9 -7 10 -4 10 -5
10 2 9 10 4 7 6 3 7
30 6 27 30 12 21 18 9 21
30 6 27 30 12 21 18 9 21
[       OK ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_0_3_3_3_3 (3 ms)
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3
transA  : N
transB  : N
M       : 3
N       : 3
K       : 3
h_alpha : 1
lda     : 3
ldb     : 3
h_beta  : 3
ldc     : 3
7 10 1 7 6 3 7 6 7
-9 6 -7 9 -10 1 -8 4 -10
10 2 9 10 4 7 6 3 7
-40 -90 -13 100 88 79 39 7 36
-40 -90 -13 30 48 7 -80 -107 -45
/home/work/ROCm/rocBLAS/clients/gtest/../include/unit.hpp:130: Failure
Expected equality of these values:
  hCPU[i + j * size_t(lda) + k * 0]
    Which is: 30
  hGPU[i + j * size_t(lda) + k * 0]
    Which is: 100
/home/work/ROCm/rocBLAS/clients/gtest/../include/unit.hpp:130: Failure
Expected equality of these values:
  hCPU[i + j * size_t(lda) + k * 0]
    Which is: 30
  hGPU[i + j * size_t(lda) + k * 0]
    Which is: 100
[  FAILED  ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3, where GetParam() = { function: "gemm", name: "gemm_medium", category: "pre_checkin", known_bug_platforms: "", alpha: 1.0, alphai: 0.0, beta: 3.0, betai: 0.0, stride_a: 9, stride_b: 9, stride_c: 9, stride_d: 9, stride_x: 0, stride_y: 0, user_allocated_workspace: 0, M: 3, N: 3, K: 3, KL: 128, KU: 128, lda: 3, ldb: 3, ldc: 3, ldd: 3, incx: 0, incy: 0, incd: 0, incb: 0, batch_count: 1, iters: 10, cold_iters: 2, algo: 0, solution_index: 0, flags: none, a_type: f32_r, b_type: f32_r, c_type: f32_r, d_type: f32_r, compute_type: f32_r, initialization: "rand_int", atomics_mode: atomics_allowed, pad: 4096, threads: 0, streams: 0, devices:  (1 ms)
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_1_3
transA  : N
transB  : N
M       : 3
N       : 3
K       : 3
h_alpha : 1
lda     : 3
ldb     : 3
h_beta  : 1
ldc     : 3
7 10 1 7 6 3 7 6 7
-9 6 -7 6 -7 3 -2 2 -5
10 2 9 10 4 7 6 3 7
-60 -94 -31 24 40 13 -29 -35 -24
-60 -94 -31 24 40 13 -29 -35 -24
[       OK ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_1_3 (1 ms)
[----------] 4 tests from _/gemm (14351 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (14494 ms total)
[  PASSED  ] 3 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3, where GetParam() = { function: "gemm", name: "gemm_medium", category: "pre_checkin", known_bug_platforms: "", alpha: 1.0, alphai: 0.0, beta: 3.0, betai: 0.0, stride_a: 9, stride_b: 9, stride_c: 9, stride_d: 9, stride_x: 0, stride_y: 0, user_allocated_workspace: 0, M: 3, N: 3, K: 3, KL: 128, KU: 128, lda: 3, ldb: 3, ldc: 3, ldd: 3, incx: 0, incy: 0, incd: 0, incb: 0, batch_count: 1, iters: 10, cold_iters: 2, algo: 0, solution_index: 0, flags: none, a_type: f32_r, b_type: f32_r, c_type: f32_r, d_type: f32_r, compute_type: f32_r, initialization: "rand_int", atomics_mode: atomics_allowed, pad: 4096, threads: 0, streams: 0, devices:

 1 FAILED TEST
rocBLAS version: 2.43.0.f0273f26-dirty

command line: /home/work/ROCm/rocBLAS/build/release/clients/staging/rocblas-test --gtest_filter=_/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3*

It looks stable: gtest always fails at the third test, pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3. I tried to calculate it by hand, and I don't know where the 100 comes from. Using the logged column-major data, that element should be:

1 * (7 * 9 + 7 * -10 + 7 * 1) + 3 * 10 = 30

And I tried to copy the gtest code into a single cpp source file, since gtest needs to parse the test data first and that takes time. But then the problem disappeared. I just don't know what is different between gtest and my cpp. Since the log displays the tests one by one, I guess threads are not involved.

Here is my cpp test code:
https://github.com/xuhuisheng/rocm-build/blob/master/check/src/test_rocblas3.cpp
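For reference, a standalone check along those lines can be quite small. The sketch below is not the linked test code; it assumes the rocblas_sgemm C API and reuses the column-major 3x3 inputs from the failing log above, comparing against a naive CPU GEMM:

#include <cstdio>
#include <hip/hip_runtime.h>
#include <rocblas.h> // newer ROCm releases install this as <rocblas/rocblas.h>

int main()
{
    // Column-major 3x3 inputs copied from the failing log above.
    float hA[9] = {7, 10, 1, 7, 6, 3, 7, 6, 7};
    float hB[9] = {-9, 6, -7, 9, -10, 1, -8, 4, -10};
    float hC[9] = {10, 2, 9, 10, 4, 7, 6, 3, 7};
    float alpha = 1.0f, beta = 3.0f;
    int   n = 3;

    // CPU reference: C = alpha*A*B + beta*C, all column-major.
    float ref[9];
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k)
                acc += hA[i + k * n] * hB[k + j * n];
            ref[i + j * n] = alpha * acc + beta * hC[i + j * n];
        }

    float *dA, *dB, *dC;
    hipMalloc(&dA, sizeof(hA));
    hipMalloc(&dB, sizeof(hB));
    hipMalloc(&dC, sizeof(hC));
    hipMemcpy(dA, hA, sizeof(hA), hipMemcpyHostToDevice);
    hipMemcpy(dB, hB, sizeof(hB), hipMemcpyHostToDevice);
    hipMemcpy(dC, hC, sizeof(hC), hipMemcpyHostToDevice);

    rocblas_handle handle;
    rocblas_create_handle(&handle);
    rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none,
                  n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
    hipMemcpy(hC, dC, sizeof(hC), hipMemcpyDeviceToHost);

    // rand_int-style data keeps every product exact in f32, so any
    // difference is a real kernel bug rather than rounding error.
    for (int i = 0; i < 9; ++i)
        if (hC[i] != ref[i])
            printf("mismatch at %d: GPU %g vs CPU %g\n", i, hC[i], ref[i]);

    rocblas_destroy_handle(handle);
    hipFree(dA); hipFree(dB); hipFree(dC);
    return 0;
}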

@cgmb
Contributor

cgmb commented Jun 18, 2022

Thanks for the information. It will be a while before I can get to it, but I will take another look.

@xuhuisheng
Author

@cgmb
Sorry to bother you again.
It is interesting that if I use device_vector to manage db, as in device_vector<float> db(9, 1, false), I can reproduce the issue.
If I use a plain float* db, the issue disappears.

with device_vector:
https://github.com/xuhuisheng/rocm-build/blob/master/check/run-rocblas3.sh

without device_vector:
https://github.com/xuhuisheng/rocm-build/blob/master/check/run-rocblas4.sh

Keep digging.

@LamEnder

LamEnder commented Nov 24, 2022

@xuhuisheng @cgmb Since rocBLAS uses Tensile as its backend, I suggest we generate new kernels by running the Tensile benchmarks on gfx803.

Because the newly generated kernels would be built against the new Tensile codebase, I think that might fix some of the rocBLAS test failures.

What do you think?

This wiki link would be useful: https://github.com/ROCmSoftwarePlatform/Tensile/wiki/
P.S.: This is just speculation based on my own digging through both the rocBLAS and Tensile codebases; please correct me if I am wrong.
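For reference, the benchmark entry point is invoked roughly like this (a sketch based on that wiki; my_gfx803_gemm_config.yaml is a hypothetical benchmark config that would have to be written for the GEMM sizes of interest):

git clone https://github.com/ROCmSoftwarePlatform/Tensile.git
cd Tensile
Tensile/bin/Tensile my_gfx803_gemm_config.yaml benchmark_output/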

@cgmb
Contributor

cgmb commented Nov 24, 2022

That might help or it might not. I'm not sure.

To address this issue (and for a wide variety of other reasons), I've asked rocBLAS to provide fallback implementations that can work without Tensile. If nothing else, that would mean that rocBLAS will at least be functional on all platforms that have HIP support in the driver, compiler and runtime.

As a rocSOLVER developer, I would also appreciate the ability to build and test rocSOLVER without having to build the (incredibly slow to compile) Tensile kernels, to build and test rocSOLVER on platforms that Tensile has not yet added support for, and to debug by comparing results with and without Tensile.

So, it will be a useful feature for a number of reasons. I'm not sure when it fits into the schedule, but I put in the feature request at the start of June. It's in the backlog.

@LamEnder

LamEnder commented Nov 24, 2022

Yes, and by having a generic, architecture-agnostic implementation, we could also eliminate the pain of dealing with rocBLAS's large binary size.

The Tensile kernels could be distributed separately and loaded dynamically at runtime when available, with the fallback implementation used when they are not.

I'm looking forward to it, btw.

Anyway, I don't see your feature request anywhere in the backlog; is it just me, or is it hidden elsewhere?

@cgmb
Contributor

cgmb commented Nov 24, 2022

The Tensile kernels could be distributed separately and loaded dynamically at runtime when available, with the fallback implementation used when they are not.

That would be nice. I had been considering a similar strategy for rocSOLVER's size-specialized kernels, but it hadn't occurred to me that it would be useful for rocBLAS too.

I don't see your feature request anywhere in the backlog; is it just me, or is it hidden elsewhere?

The development backlog is not publicly accessible. Unfortunately, it contains a mix of information that could reasonably be made public and information that must remain private.
