PyTorch FC training hangs on ROCm-4.5.0 with gfx803 #1218

Closed
xuhuisheng opened this issue Nov 7, 2021 · 19 comments

@xuhuisheng

xuhuisheng commented Nov 7, 2021

@cgmb
Please help take a look at this bug. Thank you very much.
If you don't have enough time, please give me some clues. I am really not familiar with GCN assembly. (T_T)

What is the expected behavior

  • train the FC layer properly

What actually happens

  • hangs with GPU utilization at 100%

How to reproduce

Environment

Hardware description
GPU: RX 580
GPU chip ID: 0x67df
CPU: E5-2620 v3
Software version
ROCK: v4.5.0
ROCR: v4.5.0
HCC: v4.5.0
Library: v4.5.0

Workaround

If I delete library/src/blas3/Tensile/Logic/asm_full/r9nano_*.yaml and rebuild rocBLAS, the problem is resolved.
Here is my patched rocBLAS: https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm450/rocblas_2.41.0-337552f0.dirty_amd64.deb
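For anyone following along, the workaround amounts to roughly this, run from the rocBLAS source tree (a sketch; the install.sh flags assume rocBLAS's usual build script and can differ between releases):

rm library/src/blas3/Tensile/Logic/asm_full/r9nano_*.yaml
./install.sh -d -c -i   # rebuild rocBLAS plus its clients and install the package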

Etc.

Here is my personal issue tracking this problem on ROCm-4.0.1:
xuhuisheng/rocm-build#4

Here is my build documentation for gfx803. Glad to see there is only one issue left on ROCm-4.5.0 with gfx803.
https://github.com/xuhuisheng/rocm-build/tree/master/gfx803

I tried some other scripts:

The MNIST example on TensorFlow with gfx803 shows NaN loss.
The text classification example on TensorFlow shows Invalid argument: indices[5,284] = 997212422 is not in [0, 5001).
#1172

I guess this problem is caused by the new assembly code in Tensile; we can see that after I delete the r9nano-related Tensile yaml files, the plain C GEMM implementations run properly.
I don't know how to reproduce it with plain C code yet, but it affects PyTorch and TensorFlow.

@cgmb
Contributor

cgmb commented Nov 10, 2021

This appears to be caused by out-of-bounds writes in the gemm batched routines for gfx803. By deleting the Tensile yaml files for that architecture, you fall back to source kernels rather than using the assembly kernels. In my tests, the source kernels appear to have fewer failures, but I'm not sure that this entirely solves the problem. I still see failures when running rocblas-test --gtest_filter='*checkin*-*known_bug*' on my RX 570.

That's about all the investigation I have time for, but I'll organize and post my logs here later.

@cgmb
Contributor

cgmb commented Nov 13, 2021

You can see which kernels are run using the TENSILE_DB environment variable. I used that to determine exactly which Tensile kernels your example calls.
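For example (the bitmask is parsed by Tensile's host library and its bit meanings vary between versions, so treat the exact value as an assumption; setting all bits simply turns on all debug output):

TENSILE_DB=0xFFFF ./rocblas-test --gtest_filter='*gemm*' 2>&1 | grep 'Running kernel'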

Using rocBLAS built with the r9nano yaml files:

Running kernel: Cijk_Ailk_Bljk_SB_MT16x16x16_SN_1LDSB0_APM1_AF0EM1_AF1EM1_AMAS3_ASAE01_ASCE01_ASEM1_BL1_DTL0_DVO0_EPS1_FL0_GRVW2_GSU8_ISA803_IU1_K1_KLA_LBSPP0_LPA0_LPB0_LDL1_LRVW2_MAC_MDA2_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SRVW0_SVW4_SNLL0_TT2_2_TLDS0_USFGROn1_VAW1_VSn1_VW2_WSGRA0_WSGRB0_WS64_WG8_8_4_WGM1
Running kernel: Cijk_Ailk_Bljk_SB_MT32x8x64_SN_1LDSB0_APM1_AF0EM1_AF1EM1_AMAS3_ASAE01_ASCE01_ASEM1_BL1_DTL0_DVO0_EPS0_FL0_GRVW2_GSU16_ISA803_IU1_K1_KLA_LBSPP0_LPA0_LPB0_LDL1_LRVW2_MAC_MDA2_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SRVW0_SVW4_SNLL0_TT4_2_TLDS0_USFGROn1_VAW1_VSn1_VW2_WSGRA0_WSGRB0_WS64_WG8_4_8_WGM1

Using rocBLAS built without the r9nano yaml files:

Running kernel: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_AMAS0_BL0_GRVW1_GSU1_K1_LRVW1_NLCA1_NLCB1_PGR0_PLR0_SU32_SUS256_SVW4_TT8_4_USFGRO0_VAW1_VS1_VW1_WG16_16_1_WGM8

@cgmb
Contributor

cgmb commented Nov 14, 2021

These are my logs running rocblas-test on the RX 570 with and without the r9nano yaml files:
gemm_test_logs.zip

Unfortunately, in both cases you see failures like these:

/root/rocBLAS/clients/gtest/../include/d_vector.hpp:118: Failure
Expected equality of these values:
  memcmp(host, guard, sizeof(guard))
    Which is: -112
  0
[  FAILED  ] pre_checkin/gemm.blas3_tensile/gemm_medium_f16_r_NN_4_4_4_5_4_4_0_4, where GetParam() = { M: 4, N: 4, K: 4, KL: 128, KU: 128, lda: 4, ldb: 4, ldc: 4, ldd: 4, a_type: f16_r, b_type: f16_r, c_type: f16_r, d_type: f16_r, compute_type: f16_r, incx: 0, incy: 0, incd: 0, incb: 0, alpha: 5.0, alphai: 0.0, beta: 0.0, betai: 0.0, transA: 'N', transB: 'N', side: '*', uplo: '*', diag: '*', batch_count: 1, HMM: false, threads: 0, streams: 0, devices: 0, stride_a: 16, stride_b: 16, stride_c: 16, stride_d: 16, stride_x: 0, stride_y: 0, fortran: false, norm_check: 0, unit_check: 1, timing: 0, iters: 10, cold_iters: 2, algo: 0, solution_index: 0, flags: 0, function: "gemm", name: "gemm_medium", category: "pre_checkin", initialization: "rand_int", known_bug_platforms: "", c_noalias_d: false, atomics_mode: atomics_allowed, user_allocated_workspace: 0 }

That failed check in d_vector.hpp is looking at the contents of the padding which is added to device allocations in the test code. If the library code changes the padding by writing beyond the boundaries of its requested allocations, that's the check where it will be caught.
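For anyone unfamiliar with the technique, here is a minimal sketch of such a guard-byte check (a simplified illustration of the idea, not the actual d_vector.hpp code; it assumes only the HIP runtime API):

#include <cstdio>
#include <cstring>
#include <hip/hip_runtime.h>

int main()
{
    constexpr size_t n   = 16; // elements the routine under test asked for
    constexpr size_t pad = 64; // guard elements on each side

    float guard[pad];
    memset(guard, 0xF0, sizeof(guard)); // known byte pattern

    // Allocate pad + n + pad elements and copy the pattern into both pads.
    float* d = nullptr;
    hipMalloc(&d, (n + 2 * pad) * sizeof(float));
    hipMemcpy(d, guard, sizeof(guard), hipMemcpyHostToDevice);
    hipMemcpy(d + pad + n, guard, sizeof(guard), hipMemcpyHostToDevice);

    float* data = d + pad; // the pointer handed to the routine under test
    (void)data;            // ... call the rocBLAS routine on `data` here ...

    // Read the trailing pad back; any change means an out-of-bounds write.
    float host[pad];
    hipMemcpy(host, d + pad + n, sizeof(host), hipMemcpyDeviceToHost);
    if (memcmp(host, guard, sizeof(guard)) != 0)
        printf("out-of-bounds write detected past the end of the buffer\n");

    hipFree(d);
    return 0;
}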

@bragadeesh
Contributor

Please note that gfx803 is not officially supported on ROCm.
The best we can do at this point is to provide you with information (such as the above) to help enable you to resolve these issues yourself.

@xuhuisheng
Author

Thanks for your help, @cgmb. I will try to test in this direction.

Thanks for the notice, @bragadeesh. I really appreciate your help.

@xuhuisheng
Author

@cgmb Sorry for the delayed response.
I ran rocblas-test as you suggested, on ROCm-4.5.2 and ROCm-5.1.3. There are FAILED tests with the asm code, but when I deleted the r9nano yaml files, there were no FAILED tests. As mentioned before, my environment is an RX 580 8GB on ubuntu-20.04.4.

So I think deleting the r9nano-related yaml files may resolve the rocBLAS gfx803 issues. I will find time to dig into the asm problems.
Thank you.

@cgmb
Contributor

cgmb commented Jun 7, 2022

I'm surprised. That's different from the results I got. I'd honestly given up after trying ROCm 5.0 and discovering that my gfx803 card didn't seem to work at all anymore. It was just today that I learned you need to set ROC_ENABLE_PRE_VEGA=true to enable gfx803 in ROCclr now.

In any case, my conclusion from this issue was that the complexity of Tensile limits the ability of the community to understand and debug rocBLAS. I've been asking about potentially having fallback implementations for rocBLAS functions when building without Tensile. That's something that's been desired by a number of different people for different reasons (e.g., spack/spack#28846), and I think it would be helpful for cases like this.

@xuhuisheng
Author

xuhuisheng commented Jun 11, 2022

@cgmb
I did a few more tests using the rocBLAS gtest. Here is what I found.

My environment:

  • ubuntu-20.04.4
  • ROCm-5.1.3
  • RX580

Although rocblas-test said the rocBLAS version is 2.43.0.f0273f26-dirty, it actually used the system rocBLAS installed in /opt/rocm. Even if I rebuilt rocBLAS-clients, the tests would always throw errors unless I installed the patched rocBLAS using sudo dpkg -i rocblas*.

I think this is why our test results are different. Maybe you didn't re-install rocBLAS to /opt/rocm, so you were running the asm code on gfx803 all the time.
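One quick way to confirm which library the test binary actually resolves at load time is plain Linux tooling, e.g.:

ldd ./rocblas-test | grep rocblas

If that prints a path under /opt/rocm, the test is using the system package rather than the freshly built library.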

The next point is that the errors are not stable. I ran rocblas-test --gtest_filter='*checkin*gemm_medium_f*_r_NN_*', and the errors were not the same. I tried 5 times; sometimes 8_8_8_5_8_8_0_8 reported errors, sometimes 10_10_10_5_10_10_0_10 did. The error message also varied: sometimes from unit.hpp, sometimes from d_vector.hpp.

And if I run one test at a time, it runs properly: rocblas-test --gtest_filter='_/gemm.blas3_tensile/pre_checkin_gemm_medium_f16_r_NN_8_8_8_5_8_8_0_8'

It looks like some context is not cleared, and one test may affect the others when the asm code is used. Do you have any clue about this? Thank you very much.

The good news is that I can reproduce this issue with plain C code, without having to install TensorFlow. \^_^/

rocBLAS version: 2.43.0.f0273f26-dirty

Query device success: there are 1 devices
-------------------------------------------------------------------------------
Device ID 0 : Radeon RX 580 Series gfx803
with 8.6 GB memory, max. SCLK 1340 MHz, max. MCLK 2000 MHz, compute capability 8.0
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 64
-------------------------------------------------------------------------------
info: parsing of test data may take a couple minutes before any test output appears...

Note: Google Test filter = _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from _/gemm
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_5_3_3_0_3
[       OK ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_5_3_3_0_3 (13923 ms)
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_0_3_3_3_3
[       OK ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_0_3_3_3_3 (7 ms)
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3
/home/work/ROCm/rocBLAS/clients/gtest/../include/unit.hpp:130: Failure
Expected equality of these values:
  hCPU[i + j * size_t(lda) + k * 0]
    Which is: 79
  hGPU[i + j * size_t(lda) + k * 0]
    Which is: 72
/home/work/ROCm/rocBLAS/clients/gtest/../include/unit.hpp:130: Failure
Expected equality of these values:
  hCPU[i + j * size_t(lda) + k * 0]
    Which is: 79
  hGPU[i + j * size_t(lda) + k * 0]
    Which is: 72
[  FAILED  ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3, where GetParam() = { function: "gemm", name: "gemm_medium", category: "pre_checkin", known_bug_platforms: "", alpha: 1.0, alphai: 0.0, beta: 3.0, betai: 0.0, stride_a: 9, stride_b: 9, stride_c: 9, stride_d: 9, stride_x: 0, stride_y: 0, user_allocated_workspace: 0, M: 3, N: 3, K: 3, KL: 128, KU: 128, lda: 3, ldb: 3, ldc: 3, ldd: 3, incx: 0, incy: 0, incd: 0, incb: 0, batch_count: 1, iters: 10, cold_iters: 2, algo: 0, solution_index: 0, flags: none, a_type: f32_r, b_type: f32_r, c_type: f32_r, d_type: f32_r, compute_type: f32_r, initialization: "rand_int", atomics_mode: atomics_allowed, pad: 4096, threads: 0, streams: 0, devices:  (0 ms)
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_1_3
[       OK ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_1_3 (0 ms)
[----------] 4 tests from _/gemm (14045 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (14201 ms total)
[  PASSED  ] 3 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3, where GetParam() = { function: "gemm", name: "gemm_medium", category: "pre_checkin", known_bug_platforms: "", alpha: 1.0, alphai: 0.0, beta: 3.0, betai: 0.0, stride_a: 9, stride_b: 9, stride_c: 9, stride_d: 9, stride_x: 0, stride_y: 0, user_allocated_workspace: 0, M: 3, N: 3, K: 3, KL: 128, KU: 128, lda: 3, ldb: 3, ldc: 3, ldd: 3, incx: 0, incy: 0, incd: 0, incb: 0, batch_count: 1, iters: 10, cold_iters: 2, algo: 0, solution_index: 0, flags: none, a_type: f32_r, b_type: f32_r, c_type: f32_r, d_type: f32_r, compute_type: f32_r, initialization: "rand_int", atomics_mode: atomics_allowed, pad: 4096, threads: 0, streams: 0, devices:

 1 FAILED TEST
rocBLAS version: 2.43.0.f0273f26-dirty

command line: /home/work/ROCm/rocBLAS/build/release/clients/staging/rocblas-test --gtest_filter=_/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_*
work@6d56e6a8963e:~/ROCm/rocBLAS$
work@6d56e6a8963e:~/ROCm/rocBLAS$

@LamEnder

LamEnder commented Jun 16, 2022

@xuhuisheng wrote: So I think deleting the r9nano-related yaml files may resolve the rocBLAS gfx803 issues. […]

Cool! But is it possible to get a working TensorFlow or PyTorch installation from the current ROCm 5.1.3 toolchain?
I once spent a whole day just getting TensorFlow compiled, and it failed spectacularly; that was quite some time ago, so I cannot provide the log.

I actually have a gfx803 in my gaming laptop, which further complicates the process of getting ROCm to run. If I remember correctly, the different memory management models of the iGPU and the dGPU cause a segmentation fault if ROCm tries to access one GPU with the other's memory management model (e.g., accessing the dGPU in the iGPU's memory management mode).

I would be grateful if @xuhuisheng could provide me some clues to get this working.

Currently I have a Manjaro installation, so I have to build all the ROCm components from source if I want to install them.

@LamEnder

@xuhuisheng wrote: I did a few more tests using the rocBLAS gtest. […] The good news is that I can reproduce this issue with plain C code, without having to install TensorFlow. \^_^/

Can you provide me a guide to build tensorflow from source for gfx803?

@xuhuisheng
Author

@LamEnder
Building tensorflow-rocm on ROCm is out of scope for this topic.
I created a discussion on my repository; we can talk about TF there:
xuhuisheng/rocm-build#29

@xuhuisheng
Author

xuhuisheng commented Jun 18, 2022

@cgmb
Sorry to bother you, but I found an interesting thing.

This is the simplest way I have found to reproduce the issue:

/home/work/ROCm/rocBLAS/build/release/clients/staging/rocblas-test --gtest_filter=_/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_*

It matches 4 tests, and the third test always fails.
But if I run the third test individually, it runs properly.

work@6d56e6a8963e:~/ROCm/rocBLAS$ /home/work/ROCm/rocBLAS/build/release/clients/staging/rocblas-test --gtest_filter='_/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3*'
rocBLAS version: 2.43.0.f0273f26-dirty

Query device success: there are 1 devices
-------------------------------------------------------------------------------
Device ID 0 : Radeon RX 580 Series gfx803
with 8.6 GB memory, max. SCLK 1340 MHz, max. MCLK 2000 MHz, compute capability 8.0
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 64
-------------------------------------------------------------------------------
info: parsing of test data may take a couple minutes before any test output appears...

Note: Google Test filter = _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from _/gemm
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_5_3_3_0_3
transA  : N
transB  : N
M       : 3
N       : 3
K       : 3
h_alpha : 5
lda     : 3
ldb     : 3
h_beta  : 0
ldc     : 3
7 10 1 7 6 3 7 6 7
-9 6 -7 4 -3 9 -5 9 -1
10 2 9 10 4 7 6 3 7
-350 -480 -200 350 380 290 105 -10 75
-350 -480 -200 350 380 290 105 -10 75
[       OK ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_5_3_3_0_3 (14142 ms)
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_0_3_3_3_3
transA  : N
transB  : N
M       : 3
N       : 3
K       : 3
h_alpha : 0
lda     : 3
ldb     : 3
h_beta  : 3
ldc     : 3
7 10 1 7 6 3 7 6 7
-9 6 -7 9 -7 10 -4 10 -5
10 2 9 10 4 7 6 3 7
30 6 27 30 12 21 18 9 21
30 6 27 30 12 21 18 9 21
[       OK ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_0_3_3_3_3 (3 ms)
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3
transA  : N
transB  : N
M       : 3
N       : 3
K       : 3
h_alpha : 1
lda     : 3
ldb     : 3
h_beta  : 3
ldc     : 3
7 10 1 7 6 3 7 6 7
-9 6 -7 9 -10 1 -8 4 -10
10 2 9 10 4 7 6 3 7
-40 -90 -13 100 88 79 39 7 36
-40 -90 -13 30 48 7 -80 -107 -45
/home/work/ROCm/rocBLAS/clients/gtest/../include/unit.hpp:130: Failure
Expected equality of these values:
  hCPU[i + j * size_t(lda) + k * 0]
    Which is: 30
  hGPU[i + j * size_t(lda) + k * 0]
    Which is: 100
/home/work/ROCm/rocBLAS/clients/gtest/../include/unit.hpp:130: Failure
Expected equality of these values:
  hCPU[i + j * size_t(lda) + k * 0]
    Which is: 30
  hGPU[i + j * size_t(lda) + k * 0]
    Which is: 100
[  FAILED  ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3, where GetParam() = { function: "gemm", name: "gemm_medium", category: "pre_checkin", known_bug_platforms: "", alpha: 1.0, alphai: 0.0, beta: 3.0, betai: 0.0, stride_a: 9, stride_b: 9, stride_c: 9, stride_d: 9, stride_x: 0, stride_y: 0, user_allocated_workspace: 0, M: 3, N: 3, K: 3, KL: 128, KU: 128, lda: 3, ldb: 3, ldc: 3, ldd: 3, incx: 0, incy: 0, incd: 0, incb: 0, batch_count: 1, iters: 10, cold_iters: 2, algo: 0, solution_index: 0, flags: none, a_type: f32_r, b_type: f32_r, c_type: f32_r, d_type: f32_r, compute_type: f32_r, initialization: "rand_int", atomics_mode: atomics_allowed, pad: 4096, threads: 0, streams: 0, devices:  (1 ms)
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_1_3
transA  : N
transB  : N
M       : 3
N       : 3
K       : 3
h_alpha : 1
lda     : 3
ldb     : 3
h_beta  : 1
ldc     : 3
7 10 1 7 6 3 7 6 7
-9 6 -7 6 -7 3 -2 2 -5
10 2 9 10 4 7 6 3 7
-60 -94 -31 24 40 13 -29 -35 -24
-60 -94 -31 24 40 13 -29 -35 -24
[       OK ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_1_3 (1 ms)
[----------] 4 tests from _/gemm (14351 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (14494 ms total)
[  PASSED  ] 3 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3, where GetParam() = { function: "gemm", name: "gemm_medium", category: "pre_checkin", known_bug_platforms: "", alpha: 1.0, alphai: 0.0, beta: 3.0, betai: 0.0, stride_a: 9, stride_b: 9, stride_c: 9, stride_d: 9, stride_x: 0, stride_y: 0, user_allocated_workspace: 0, M: 3, N: 3, K: 3, KL: 128, KU: 128, lda: 3, ldb: 3, ldc: 3, ldd: 3, incx: 0, incy: 0, incd: 0, incb: 0, batch_count: 1, iters: 10, cold_iters: 2, algo: 0, solution_index: 0, flags: none, a_type: f32_r, b_type: f32_r, c_type: f32_r, d_type: f32_r, compute_type: f32_r, initialization: "rand_int", atomics_mode: atomics_allowed, pad: 4096, threads: 0, streams: 0, devices:

 1 FAILED TEST
rocBLAS version: 2.43.0.f0273f26-dirty

command line: /home/work/ROCm/rocBLAS/build/release/clients/staging/rocblas-test --gtest_filter=_/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3*

It looks stable: gtest always fails at the third test, pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3. I tried to calculate it by hand, and I don't know where the 100 comes from. Using the logged column-major data, that element should be:

1 * (7 * 9 + 7 * -10 + 7 * 1) + 3 * 10 = 30

And I tried to copy the gtest code into a single cpp source file, since gtest needs to parse the test data first and that takes time. But then the problem disappeared. I just don't know what is different between gtest and my cpp. Since the log displays the tests one by one, I guess threads are not involved.

Here is my cpp test code:
https://github.com/xuhuisheng/rocm-build/blob/master/check/src/test_rocblas3.cpp
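For reference, a standalone check along those lines can be quite small. The sketch below is not the linked test code; it assumes the rocblas_sgemm C API and reuses the column-major 3x3 inputs from the failing log above, comparing against a naive CPU GEMM:

#include <cstdio>
#include <hip/hip_runtime.h>
#include <rocblas.h> // newer ROCm releases install this as <rocblas/rocblas.h>

int main()
{
    // Column-major 3x3 inputs copied from the failing log above.
    float hA[9] = {7, 10, 1, 7, 6, 3, 7, 6, 7};
    float hB[9] = {-9, 6, -7, 9, -10, 1, -8, 4, -10};
    float hC[9] = {10, 2, 9, 10, 4, 7, 6, 3, 7};
    float alpha = 1.0f, beta = 3.0f;
    int   n = 3;

    // CPU reference: C = alpha*A*B + beta*C, all column-major.
    float ref[9];
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k)
                acc += hA[i + k * n] * hB[k + j * n];
            ref[i + j * n] = alpha * acc + beta * hC[i + j * n];
        }

    float *dA, *dB, *dC;
    hipMalloc(&dA, sizeof(hA));
    hipMalloc(&dB, sizeof(hB));
    hipMalloc(&dC, sizeof(hC));
    hipMemcpy(dA, hA, sizeof(hA), hipMemcpyHostToDevice);
    hipMemcpy(dB, hB, sizeof(hB), hipMemcpyHostToDevice);
    hipMemcpy(dC, hC, sizeof(hC), hipMemcpyHostToDevice);

    rocblas_handle handle;
    rocblas_create_handle(&handle);
    rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none,
                  n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
    hipMemcpy(hC, dC, sizeof(hC), hipMemcpyDeviceToHost);

    // rand_int-style data keeps every product exact in f32, so any
    // difference is a real kernel bug rather than rounding error.
    for (int i = 0; i < 9; ++i)
        if (hC[i] != ref[i])
            printf("mismatch at %d: GPU %g vs CPU %g\n", i, hC[i], ref[i]);

    rocblas_destroy_handle(handle);
    hipFree(dA); hipFree(dB); hipFree(dC);
    return 0;
}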

@cgmb
Contributor

cgmb commented Jun 18, 2022

Thanks for the information. It will be a while before I can get to it, but I will take another look.

@xuhuisheng
Author

@cgmb
Sorry to bother you again.
It is interesting that if I use device_vector to manage db, as in device_vector<float> db(9, 1, false), I can reproduce the issue.
If I use a plain float* db, the issue disappears.

with device_vector:
https://github.com/xuhuisheng/rocm-build/blob/master/check/run-rocblas3.sh

without device_vector:
https://github.com/xuhuisheng/rocm-build/blob/master/check/run-rocblas4.sh

Keep digging.

@LamEnder

LamEnder commented Nov 24, 2022

@xuhuisheng @cgmb Since rocBLAS uses Tensile as its backend, I suggest we generate new kernels by running the Tensile benchmarks on gfx803.

Because the newly generated kernels would be built against the new Tensile codebase, I think that might fix some of the rocBLAS test failures.

What do you think?

This wiki link would be useful: https://github.com/ROCmSoftwarePlatform/Tensile/wiki/
P.S.: This is just speculation based on my own digging through both the rocBLAS and Tensile codebases; please correct me if I am wrong.
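For reference, the benchmark entry point is invoked roughly like this (a sketch based on that wiki; my_gfx803_gemm_config.yaml is a hypothetical benchmark config that would have to be written for the GEMM sizes of interest):

git clone https://github.com/ROCmSoftwarePlatform/Tensile.git
cd Tensile
Tensile/bin/Tensile my_gfx803_gemm_config.yaml benchmark_output/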

@cgmb
Contributor

cgmb commented Nov 24, 2022

That might help or it might not. I'm not sure.

To address this issue (and for a wide variety of other reasons), I've asked rocBLAS to provide fallback implementations that can work without Tensile. If nothing else, that would mean that rocBLAS will at least be functional on all platforms that have HIP support in the driver, compiler and runtime.

As a rocSOLVER developer, I would also appreciate the ability to build and test rocSOLVER without having to build the (incredibly slow to compile) Tensile kernels, to build and test rocSOLVER on platforms that Tensile has not yet added support for, and to debug by comparing results with and without Tensile.

So, it will be a useful feature for a number of reasons. I'm not sure when it fits into the schedule, but I put in the feature request at the start of June. It's in the backlog.

@LamEnder

LamEnder commented Nov 24, 2022

Yes, and by having a generic, architecture-agnostic implementation, we could also eliminate the pain of dealing with rocBLAS's large binary size.

The Tensile kernels could be distributed separately and loaded dynamically at runtime when available, with the fallback implementation used when they are not.

I'm looking forward to it, btw.

Anyway, I don't see your feature request anywhere in the backlog; is it just me, or is it hidden elsewhere?

@cgmb
Contributor

cgmb commented Nov 24, 2022

The Tensile kernels could be distributed separately and loaded dynamically at runtime when available, with the fallback implementation used when they are not.

That would be nice. I had been considering a similar strategy for rocSOLVER's size-specialized kernels, but it hadn't occurred to me that it would be useful for rocBLAS too.

I don't see your feature request anywhere in the backlog; is it just me, or is it hidden elsewhere?

The development backlog is not publicly accessible. Unfortunately, it contains a mix of information that could reasonably be made public and information that must remain private.
