
[Issue]: __builtin_amdgcn_workgroup_size_x incorrectly returns 0 on Code Object Model 5 #119

Closed
deepankarsharma opened this issue Jul 16, 2024 · 7 comments

Comments


deepankarsharma commented Jul 16, 2024

Problem Description

Machine Details:
NAME="Ubuntu"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
CPU:
model name : AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
GPU:
Name: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
Marketing Name: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
Name: gfx1103
Marketing Name: AMD Radeon Graphics
Name: amdgcn-amd-amdhsa--gfx1103

Operating System

Ubuntu 22.04.4 LTS (Jammy Jellyfish)

CPU

AMD Ryzen 7 7840HS w/ Radeon 780M Graphics

GPU

AMD Radeon RX 7900 XTX

ROCm Version

ROCm 6.1.0

ROCm Component

clang-ocl

Steps to Reproduce

In the following kernel, __builtin_amdgcn_workgroup_size_x always returns zero on the APU. This makes the indexing calculation incorrect. If I replace that call with a hardcoded value of 64 (which is what the code in main.cpp sets it to), the calculation comes out correct.

Trace from the running program -

Using agent: gfx1103
Engine init: OK
Kernel arg size: 280
Workgroup sizes: 64 1 1
Grid sizes: 100000 1 1
Setup dispatch: OK
Dispatch: OK
Wait: OK
We expected the sum to be :1409965408. Calculated sum is 4032

// Kernel under test: __builtin_amdgcn_workgroup_size_x() returns 0 here, so the index is wrong.
__attribute__((visibility("default"), amdgpu_kernel)) void add_arrays(int* input_a, int* input_b, int* output)
{
    int index = __builtin_amdgcn_workgroup_id_x() * __builtin_amdgcn_workgroup_size_x() + __builtin_amdgcn_workitem_id_x();
    output[index] = input_a[index] + input_b[index];
}
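
For reference, a minimal sketch of the hardcoded workaround mentioned above (assuming the 64-thread workgroup size that main.cpp requests; a diagnostic variant, not a fix):

// Workaround sketch: hardcode the workgroup size the host dispatches with,
// instead of calling __builtin_amdgcn_workgroup_size_x() (which returns 0 here).
__attribute__((visibility("default"), amdgpu_kernel)) void add_arrays_hardcoded(int* input_a, int* input_b, int* output)
{
    const int workgroup_size_x = 64;  // must match the workgroup size set in main.cpp
    int index = __builtin_amdgcn_workgroup_id_x() * workgroup_size_x + __builtin_amdgcn_workitem_id_x();
    output[index] = input_a[index] + input_b[index];
}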

You can run the above kernel using the following steps

  1. git clone https://github.com/deepankarsharma/hansa.git
  2. cd hansa
  3. mkdir build
  4. cd build
  5. cmake ..
  6. make
  7. ./hansa

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module version 6.7.0 is loaded

HSA System Attributes

Runtime Version: 1.13
Runtime Ext Version: 1.4
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

==========
HSA Agents


Agent 1


Name: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
Uuid: CPU-XX
Marketing Name: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 5137
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 15509504(0xeca800) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 15509504(0xeca800) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 15509504(0xeca800) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 2


Name: gfx1103
Uuid: GPU-XX
Marketing Name: AMD Radeon Graphics
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 2048(0x800) KB
Chip ID: 5567(0x15bf)
ASIC Revision: 7(0x7)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2700
BDFID: 1024
Internal Node ID: 1
Compute Unit: 12
SIMDs per CU: 2
Shader Engines: 1
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 35
SDMA engine uCode:: 17
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 524288(0x80000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 524288(0x80000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1103
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***

Additional Information


deepankarsharma commented Jul 18, 2024

I made some progress in tracking this down, and I think I see why __builtin_amdgcn_workgroup_size_x is returning the wrong results.

Looking at AMDGPULowerKernelAttributes.cpp in LLVM, I noticed that __builtin_amdgcn_implicitarg_ptr points to a region that is not being populated.

If I instead use the dispatch ptr, which points to an instance of hsa_kernel_dispatch_packet_t, I can get things to work.

The in-memory representation of that packet looks like:

typedef struct hsa_signal_s {
  uint64_t handle;
} hsa_signal_t;

typedef struct hsa_kernel_dispatch_packet_s {
  uint16_t header;
  uint16_t setup;
  uint16_t workgroup_size_x;
  uint16_t workgroup_size_y;
  uint16_t workgroup_size_z;
  uint16_t reserved0;
  uint32_t grid_size_x;
  uint32_t grid_size_y;
  uint32_t grid_size_z;
  uint32_t private_segment_size;
  uint32_t group_segment_size;
  uint64_t kernel_object;
  void* kernarg_address;
  uint64_t reserved2;
  hsa_signal_t completion_signal;
} hsa_kernel_dispatch_packet_t;

I was able to verify this by printing it out for every workitem:

(screenshot: in_memory___builtin_amdgcn_implicitarg_ptr)

To confirm the hypothesis, I used the following snippet, which reads the workgroup size at the right offset from the dispatch packet; with it, the kernel calculates correctly:

hsa_kernel_dispatch_packet_t* ptr = (hsa_kernel_dispatch_packet_t*)__builtin_amdgcn_dispatch_ptr();

// Read the uint16_t workgroup_size_x at offset 4 bytes from the dispatch packet
uint16_t workgroup_size_x;
memcpy(&workgroup_size_x, (uint8_t*)ptr + 4, sizeof(uint16_t));
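
Putting it together, a minimal sketch (assuming the hsa_kernel_dispatch_packet_t definition above and memcpy/stdint declarations are available in the kernel translation unit) of the add_arrays kernel computing its index from the dispatch packet instead of __builtin_amdgcn_workgroup_size_x():

// Sketch: compute the index using the workgroup size read from the dispatch packet.
__attribute__((visibility("default"), amdgpu_kernel)) void add_arrays_dispatch(int* input_a, int* input_b, int* output)
{
    hsa_kernel_dispatch_packet_t* ptr = (hsa_kernel_dispatch_packet_t*)__builtin_amdgcn_dispatch_ptr();

    // workgroup_size_x sits at offset 4 in the packet (after the header and setup fields)
    uint16_t workgroup_size_x;
    memcpy(&workgroup_size_x, (uint8_t*)ptr + 4, sizeof(uint16_t));

    int index = __builtin_amdgcn_workgroup_id_x() * workgroup_size_x + __builtin_amdgcn_workitem_id_x();
    output[index] = input_a[index] + input_b[index];
}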

@deepankarsharma
Author

I think I was able to track this down further. Code Object Version 5 answers __builtin_amdgcn_workgroup_size_x through the implicitarg ptr, which is not being populated.

Code Object Version 4 uses the dispatch pointer to answer __builtin_amdgcn_workgroup_size_x and hence works. To get this working I had to pass -mcode-object-version=4 to clang.

Based on MetadataStreamerMsgPackV5::emitHiddenKernelArgs, the struct pointed to by the implicitarg ptr should probably look like this:

#pragma pack(push, 1)
struct ImplicitArg {
    uint32_t hidden_block_count_x;
    uint32_t hidden_block_count_y;
    uint32_t hidden_block_count_z;

    uint16_t hidden_group_size_x;
    uint16_t hidden_group_size_y;
    uint16_t hidden_group_size_z;

    uint16_t hidden_remainder_x;
    uint16_t hidden_remainder_y;
    uint16_t hidden_remainder_z;

    uint64_t hidden_tool_correlation_id;
    uint64_t hidden_reserved_1;

    uint64_t hidden_global_offset_x;
    uint64_t hidden_global_offset_y;
    uint64_t hidden_global_offset_z;

    uint16_t hidden_grid_dims;

    uint16_t hidden_reserved_2;
    uint16_t hidden_reserved_3;
    uint16_t hidden_reserved_4;

    uint64_t hidden_printf_buffer;
    uint64_t hidden_hostcall_buffer;
    uint64_t hidden_multigrid_sync_arg;
    uint64_t hidden_heap_v1;
    uint64_t hidden_default_queue;
    uint64_t hidden_completion_action;
    uint32_t hidden_dynamic_lds_size;
    uint32_t hidden_private_base;
    uint32_t hidden_shared_base;
};
#pragma pack(pop)

I verified that this struct is zeroed in all cases for all my workitems:

(screenshot: implicitarg)
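
For illustration, a minimal device-side sketch (assuming the ImplicitArg layout above and that memcpy/stdint declarations are available) of reading hidden_group_size_x directly from the implicit kernarg region; on this setup it reads back 0:

// Sketch: read hidden_group_size_x (offset 12 in the layout above) from the
// implicit kernarg region. With Code Object Version 5 on this runtime it is 0.
const uint8_t* implicit = (const uint8_t*)__builtin_amdgcn_implicitarg_ptr();
uint16_t hidden_group_size_x;
memcpy(&hidden_group_size_x, implicit + 12, sizeof(uint16_t));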

Open questions:

  1. Is Code Object Version 5 ready to be used?
  2. What is causing the implicit structs to not be populated for Code Object Version 5?
  3. I see an attribute called amdgpu-implicitarg-num-bytes, but it doesn't seem to be completely implemented. clang/lib/CodeGen/Targets/AMDGPU.cpp does not appear to set it, and I don't see it being set in other parts of the LLVM codebase either.

@deepankarsharma deepankarsharma changed the title [Issue]: __builtin_amdgcn_workgroup_size_x incorrectly returns 0 on APUs [Issue]: __builtin_amdgcn_workgroup_size_x incorrectly returns 0 on Code Object Model 5 Jul 20, 2024
@lamb-j
Collaborator

lamb-j commented Jul 23, 2024

@kzhuravl @yxsamliu @b-sumner

@deepankarsharma
Author

After studying llvm/libc/utils/gpu/loader/amdgpu/Loader.cpp, I was able to make this work even on Code Object Version 5. I will blog about it with a simple standalone example over the next day or two.

@b-sumner

@deepankarsharma I have not looked at "hansa". I don't understand why the kernel above is calling builtins directly. The compiler expects the runtime that launches the kernels it builds to obey a contract, which includes setting up the implicit kernel arguments as dictated by the code object metadata. I assume "hansa" has not done so? The metadata is described at https://llvm.org/docs/AMDGPUUsage.html#code-object-v5-metadata.

@deepankarsharma
Author

hansa is an experimental effort on my part to see if one can use upstream clang to do GPGPU out of the box on Linux distros with AMD GPUs/APUs. The issue with implicit kernel arguments and other setup was exactly what I ran into. Fortunately I now have a vector add example working out of the box.
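
For illustration, a minimal host-side sketch of what populating those hidden arguments can look like (populate_hidden_args, explicit_args_size, and the hardcoded offsets are illustrative assumptions based on the ImplicitArg layout above, not hansa's or any runtime's actual code):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: 'kernarg' is the kernarg buffer allocated from the
 * KERNARG pool, and 'explicit_args_size' is where the hidden args begin
 * (in practice the offsets come from the code object's kernarg metadata). */
static void populate_hidden_args(uint8_t* kernarg, size_t explicit_args_size,
                                 uint16_t wg_x, uint16_t wg_y, uint16_t wg_z,
                                 uint32_t grid_x, uint32_t grid_y, uint32_t grid_z)
{
    uint8_t* hidden = kernarg + explicit_args_size;

    /* hidden_block_count_{x,y,z}: number of workgroups in each dimension */
    uint32_t block_count[3] = { grid_x / wg_x, grid_y / wg_y, grid_z / wg_z };
    memcpy(hidden + 0, block_count, sizeof(block_count));

    /* hidden_group_size_{x,y,z}: the workgroup size used by the dispatch */
    uint16_t group_size[3] = { wg_x, wg_y, wg_z };
    memcpy(hidden + 12, group_size, sizeof(group_size));

    /* hidden_remainder_{x,y,z}: grid size modulo workgroup size */
    uint16_t remainder[3] = { (uint16_t)(grid_x % wg_x),
                              (uint16_t)(grid_y % wg_y),
                              (uint16_t)(grid_z % wg_z) };
    memcpy(hidden + 18, remainder, sizeof(remainder));
}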

@deepankarsharma
Author

https://github.com/deepankarsharma/hansa/blob/main/main.cpp contains a working example of launching a simple kernel that works out of the box with clang on both a cutting-edge distro (Arch Linux with clang 18.1.8, which uses code object v5) and an older distro (Ubuntu 22.04 with clang 15). Closing the issue.
