AMDGPU.jl doesn't seem to work with 7900 series GPUs #371

Closed
gbaraldi opened this issue Feb 1, 2023 · 22 comments
Labels: bug (Something isn't working), hsa

Comments

@gbaraldi
Contributor

gbaraldi commented Feb 1, 2023

There are two failures here. The first is that it doesn't recognize the new GPU, but setting HSA_OVERRIDE_GFX_VERSION=10.3.0 seems to work around that.
I then get a separate error when trying to create a ROCArray.

a_d = ROCArray(rand(5))
HSA exception: Queue create failed at hsaKmtCreateQueue

ERROR: HSA error (code #4104, HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.)

I imagine this has to do with needing more recent versions of some libraries, because I can run HIP programs locally.

I'm specifically using an RX 7900 XTX (gfx1100).
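
For completeness, here is roughly how I'm hitting it (a minimal sketch, assuming a system ROCm install; the override has to be in the environment before the HSA runtime initializes, so it is set when launching Julia):

# launched as: HSA_OVERRIDE_GFX_VERSION=10.3.0 julia --project
using AMDGPU
a_d = ROCArray(rand(5))  # fails here with HSA_STATUS_ERROR_OUT_OF_RESOURCES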

@jpsamaroo added the bug, hsa labels Feb 1, 2023
@pxl-th
Member

pxl-th commented Feb 1, 2023

I think you need at least ROCm 5.4 for Navi 3 to work (judging from rocBLAS).
But that requires LLVM 15, while Julia 1.9 uses LLVM 14.

@mzy2240

mzy2240 commented May 5, 2023

ROCm 5.5 was recently released, and it is the first version that officially supports the 7900 XT and 7900 XTX.

@gbaraldi
Contributor Author

gbaraldi commented May 5, 2023

Does it require LLVM 16, or is LLVM 15 enough?

@pxl-th
Member

pxl-th commented May 5, 2023

Just tried ROCm 5.5; it complains about opaque pointers: producer LLVM 16, consumer LLVM 14.
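
(Side note: to see which LLVM version a given Julia build consumes, and compare it against the ROCm bitcode producer, the standard Base constant is enough; nothing AMDGPU-specific:)

julia> Base.libllvm_version   # the LLVM version this Julia build consumes
v"14.0.6"                     # e.g. on Julia 1.9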

@pepijndevos

Julia master is now on LLVM 15, I think. Does that help here?

@Krastanov

Not completely sure, but it seems #508 (ROCm mixed mode) does not yet provide 7900 XTX support (at least on Arch).

The failure is:

julia: /usr/src/debug/hip-runtime-amd/clr-rocm-5.6.1/rocclr/os/os_posix.cpp:310: static void amd::Os::currentStackInfo(unsigned char**, size_t*): Assertion `Os::currentStackPtr() >= *base - *size && Os::currentStackPtr() < *base && "just checking"' failed.

PyTorch with ROCm 5.6 works fine.

Here is what I tried (on nightly from Oct 14th 2023, and on 1.10-beta3):

# sudo pacman -S rocm-ml-sdk # installing rocm 5.6
# HSA_OVERRIDE_GFX_VERSION=11.0.0 julia +nightly --project=. -tauto
julia> ]add [email protected]
julia> ]add AMDGPU#master
julia> using AMDGPU
julia> AMDGPU.ROCmDiscovery.use_devlibs_jll!()
# restart julia
julia> using AMDGPU
julia> ROCArray(rand(5))
julia: /usr/src/debug/hip-runtime-amd/clr-rocm-5.6.1/rocclr/os/os_posix.cpp:310: static void amd::Os::currentStackInfo(unsigned char**, size_t*): Assertion `Os::currentStackPtr() >= *base - *size && Os::currentStackPtr() < *base && "just checking"' failed.

@pxl-th
Member

pxl-th commented Oct 15, 2023

I've seen this error, but I don't think it is related to Julia.
Maybe @jpsamaroo has an idea.

I'm using Ubuntu 22.04 with ROCm 5.6, and we do support the 7900 XTX; I'm actually using it with Julia 1.10-beta3.

julia> using AMDGPU

julia> AMDGPU.versioninfo()
ROCm provided by: system
[+] HSA Runtime v1.1.0
    @ /opt/rocm-5.6.1/lib/libhsa-runtime64.so
[+] ld.lld
    @ /opt/rocm/llvm/bin/ld.lld
[+] ROCm-Device-Libs
    @ /home/pxl-th/.julia/artifacts/5ad5ecb46e3c334821f54c1feecc6c152b7b6a45/amdgcn/bitcode
[+] HIP Runtime v5.6.31062
    @ /opt/rocm-5.6.1/lib/libamdhip64.so
[+] rocBLAS v3.0.0
    @ /opt/rocm-5.6.1/lib/librocblas.so
[+] rocSOLVER v3.22.0
    @ /opt/rocm-5.6.1/lib/librocsolver.so
[+] rocALUTION
    @ /opt/rocm-5.6.1/lib/librocalution.so
[+] rocSPARSE
    @ /opt/rocm-5.6.1/lib/librocsparse.so.0
[+] rocRAND v2.10.5
    @ /opt/rocm-5.6.1/lib/librocrand.so
[+] rocFFT v1.0.21
    @ /opt/rocm-5.6.1/lib/librocfft.so
[+] MIOpen v2.20.0
    @ /opt/rocm-5.6.1/lib/libMIOpen.so

HIP Devices [1]
    1. HIPDevice(name="Radeon RX 7900 XTX", id=1)

julia> sum(AMDGPU.ones(Float32, 16))
16.0f0

You also don't need HSA_OVERRIDE_GFX_VERSION=11.0.0, since there are device libraries for this model.
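
A quick way to double-check that (a sketch, assuming a recent AMDGPU.jl where AMDGPU.devices() is available):

julia> using AMDGPU           # launched without HSA_OVERRIDE_GFX_VERSION set

julia> AMDGPU.devices()       # the card should show up as a gfx1100 HIPDevice

julia> sum(AMDGPU.ones(Float32, 16))
16.0f0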

@pxl-th
Member

pxl-th commented Oct 15, 2023

However, there are other issues with Navi 3: #518.
That one shows up mostly during testing (or with --check-bounds=yes).
Other than that, everything seems to work normally.
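
(For anyone reproducing #518: bounds checking can be forced on for the test suite with standard Pkg options, e.g.:)

julia> using Pkg

julia> Pkg.test("AMDGPU"; julia_args=["--check-bounds=yes"])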

@Krastanov

Weirdly, whether or not I use HSA_OVERRIDE_GFX_VERSION changes where I get the segfault (on Julia 1.10.0-beta3):

❯ julia +beta --project=. -tauto
julia> using AMDGPU; AMDGPU.versioninfo()
ROCm provided by: system
[+] HSA Runtime v1.1.0
    @ /opt/rocm/hsa/lib/libhsa-runtime64.so
[+] ld.lld
    @ /opt/rocm/llvm/bin/ld.lld
[+] ROCm-Device-Libs
    @ /home/stefan/.julia/artifacts/5ad5ecb46e3c334821f54c1feecc6c152b7b6a45/amdgcn/bitcode
[+] HIP Runtime v5.6.31062
    @ /opt/rocm/hip/lib/libamdhip64.so
julia: /usr/src/debug/hip-runtime-amd/clr-rocm-5.6.1/hipamd/src/hip_code_object.cpp:754: hip::FatBinaryInfo** hip::StatCO::addFatBinary(const void*, bool): Assertion `err == hipSuccess' failed.

[887239] signal (6.-6): Aborted
in expression starting at REPL[1]:1
unknown function (ip: 0x7fa4ae6ba83c)
raise at /usr/lib/libc.so.6 (unknown line)
abort at /usr/lib/libc.so.6 (unknown line)
unknown function (ip: 0x7fa4ae6523db)
__assert_fail at /usr/lib/libc.so.6 (unknown line)
__hipRegisterFatBinary at /opt/rocm/hip/lib/libamdhip64.so (unknown line)
❯ HSA_OVERRIDE_GFX_VERSION=11.0.0 HCC_AMDGPU_TARGET=gfx1100 julia +beta --project=. -tauto
julia> using AMDGPU; AMDGPU.versioninfo()
ROCm provided by: system
[+] HSA Runtime v1.1.0
    @ /opt/rocm/hsa/lib/libhsa-runtime64.so
[+] ld.lld
    @ /opt/rocm/llvm/bin/ld.lld
[+] ROCm-Device-Libs
    @ /home/stefan/.julia/artifacts/5ad5ecb46e3c334821f54c1feecc6c152b7b6a45/amdgcn/bitcode
[+] HIP Runtime v5.6.31062
    @ /opt/rocm/hip/lib/libamdhip64.so
[+] rocBLAS v3.0.0
    @ /opt/rocm/lib/librocblas.so
[+] rocSOLVER v3.22.0
    @ /opt/rocm/lib/librocsolver.so
[+] rocALUTION
julia: /usr/src/debug/hip-runtime-amd/clr-rocm-5.6.1/hipamd/src/hip_code_object.cpp:754: hip::FatBinaryInfo** hip::StatCO::addFatBinary(const void*, bool): Assertion `err == hipSuccess' failed.

[887666] signal (6.-6): Aborted
in expression starting at REPL[1]:1
unknown function (ip: 0x7feed9b6183c)
raise at /usr/lib/libc.so.6 (unknown line)
abort at /usr/lib/libc.so.6 (unknown line)
unknown function (ip: 0x7feed9af93db)
__assert_fail at /usr/lib/libc.so.6 (unknown line)
__hipRegisterFatBinary at /opt/rocm/hip/lib/libamdhip64.so (unknown line)

@pxl-th
Member

pxl-th commented Oct 15, 2023

Do the rocminfo & clinfo utilities from ROCm work correctly for you?

@Krastanov

Krastanov commented Oct 15, 2023

Yes, and PyTorch works fine in a normal Python virtualenv (directly on the same host, no containers or VMs).

click to see output
  ❯ /opt/rocm/bin/rocminfo
  ROCk module is loaded
  =====================    
  HSA System Attributes    
  =====================    
  Runtime Version:         1.1
  System Timestamp Freq.:  1000.000000MHz
  Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
  Machine Model:           LARGE                              
  System Endianness:       LITTLE                             
  
  ==========               
  HSA Agents               
  ==========               
  *******                  
  Agent 1                  
  *******                  
    Name:                    AMD Ryzen 9 7950X 16-Core Processor
    Uuid:                    CPU-XX                             
    Marketing Name:          AMD Ryzen 9 7950X 16-Core Processor
    Vendor Name:             CPU                                
    Feature:                 None specified                     
    Profile:                 FULL_PROFILE                       
    Float Round Mode:        NEAR                               
    Max Queue Number:        0(0x0)                             
    Queue Min Size:          0(0x0)                             
    Queue Max Size:          0(0x0)                             
    Queue Type:              MULTI                              
    Node:                    0                                  
    Device Type:             CPU                                
    Cache Info:              
      L1:                      32768(0x8000) KB                   
    Chip ID:                 0(0x0)                             
    ASIC Revision:           0(0x0)                             
    Cacheline Size:          64(0x40)                           
    Max Clock Freq. (MHz):   5881                               
    BDFID:                   0                                  
    Internal Node ID:        0                                  
    Compute Unit:            32                                 
    SIMDs per CU:            0                                  
    Shader Engines:          0                                  
    Shader Arrs. per Eng.:   0                                  
    WatchPts on Addr. Ranges:1                                  
    Features:                None
    Pool Info:               
      Pool 1                   
        Segment:                 GLOBAL; FLAGS: FINE GRAINED        
        Size:                    64963448(0x3df4378) KB             
        Allocatable:             TRUE                               
        Alloc Granule:           4KB                                
        Alloc Alignment:         4KB                                
        Accessible by all:       TRUE                               
      Pool 2                   
        Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
        Size:                    64963448(0x3df4378) KB             
        Allocatable:             TRUE                               
        Alloc Granule:           4KB                                
        Alloc Alignment:         4KB                                
        Accessible by all:       TRUE                               
      Pool 3                   
        Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
        Size:                    64963448(0x3df4378) KB             
        Allocatable:             TRUE                               
        Alloc Granule:           4KB                                
        Alloc Alignment:         4KB                                
        Accessible by all:       TRUE                               
    ISA Info:                
  *******                  
  Agent 2                  
  *******                  
    Name:                    gfx1100                            
    Uuid:                    GPU-e2d46e94db0446fd               
    Marketing Name:          AMD Radeon RX 7900 XTX             
    Vendor Name:             AMD                                
    Feature:                 KERNEL_DISPATCH                    
    Profile:                 BASE_PROFILE                       
    Float Round Mode:        NEAR                               
    Max Queue Number:        128(0x80)                          
    Queue Min Size:          64(0x40)                           
    Queue Max Size:          131072(0x20000)                    
    Queue Type:              MULTI                              
    Node:                    1                                  
    Device Type:             GPU                                
    Cache Info:              
      L1:                      32(0x20) KB                        
      L2:                      6144(0x1800) KB                    
      L3:                      98304(0x18000) KB                  
    Chip ID:                 29772(0x744c)                      
    ASIC Revision:           0(0x0)                             
    Cacheline Size:          64(0x40)                           
    Max Clock Freq. (MHz):   2371                               
    BDFID:                   768                                
    Internal Node ID:        1                                  
    Compute Unit:            96                                 
    SIMDs per CU:            2                                  
    Shader Engines:          6                                  
    Shader Arrs. per Eng.:   2                                  
    WatchPts on Addr. Ranges:4                                  
    Features:                KERNEL_DISPATCH 
    Fast F16 Operation:      TRUE                               
    Wavefront Size:          32(0x20)                           
    Workgroup Max Size:      1024(0x400)                        
    Workgroup Max Size per Dimension:
      x                        1024(0x400)                        
      y                        1024(0x400)                        
      z                        1024(0x400)                        
    Max Waves Per CU:        32(0x20)                           
    Max Work-item Per CU:    1024(0x400)                        
    Grid Max Size:           4294967295(0xffffffff)             
    Grid Max Size per Dimension:
      x                        4294967295(0xffffffff)             
      y                        4294967295(0xffffffff)             
      z                        4294967295(0xffffffff)             
    Max fbarriers/Workgrp:   32                                 
    Pool Info:               
      Pool 1                   
        Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
        Size:                    25149440(0x17fc000) KB             
        Allocatable:             TRUE                               
        Alloc Granule:           4KB                                
        Alloc Alignment:         4KB                                
        Accessible by all:       FALSE                              
      Pool 2                   
        Segment:                 GROUP                              
        Size:                    64(0x40) KB                        
        Allocatable:             FALSE                              
        Alloc Granule:           0KB                                
        Alloc Alignment:         0KB                                
        Accessible by all:       FALSE                              
    ISA Info:                
      ISA 1                    
        Name:                    amdgcn-amd-amdhsa--gfx1100         
        Machine Models:          HSA_MACHINE_MODEL_LARGE            
        Profiles:                HSA_PROFILE_BASE                   
        Default Rounding Mode:   NEAR                               
        Default Rounding Mode:   NEAR                               
        Fast f16:                TRUE                               
        Workgroup Max Size:      1024(0x400)                        
        Workgroup Max Size per Dimension:
          x                        1024(0x400)                        
          y                        1024(0x400)                        
          z                        1024(0x400)                        
        Grid Max Size:           4294967295(0xffffffff)             
        Grid Max Size per Dimension:
          x                        4294967295(0xffffffff)             
          y                        4294967295(0xffffffff)             
          z                        4294967295(0xffffffff)             
        FBarrier Max Size:       32                                 
  *******                  
  Agent 3                  
  *******                  
    Name:                    gfx1036                            
    Uuid:                    GPU-XX                             
    Marketing Name:          AMD Radeon Graphics                
    Vendor Name:             AMD                                
    Feature:                 KERNEL_DISPATCH                    
    Profile:                 BASE_PROFILE                       
    Float Round Mode:        NEAR                               
    Max Queue Number:        128(0x80)                          
    Queue Min Size:          64(0x40)                           
    Queue Max Size:          131072(0x20000)                    
    Queue Type:              MULTI                              
    Node:                    2                                  
    Device Type:             GPU                                
    Cache Info:              
      L1:                      16(0x10) KB                        
      L2:                      256(0x100) KB                      
    Chip ID:                 5710(0x164e)                       
    ASIC Revision:           1(0x1)                             
    Cacheline Size:          64(0x40)                           
    Max Clock Freq. (MHz):   2200                               
    BDFID:                   4608                               
    Internal Node ID:        2                                  
    Compute Unit:            2                                  
    SIMDs per CU:            2                                  
    Shader Engines:          1                                  
    Shader Arrs. per Eng.:   1                                  
    WatchPts on Addr. Ranges:4                                  
    Features:                KERNEL_DISPATCH 
    Fast F16 Operation:      TRUE                               
    Wavefront Size:          32(0x20)                           
    Workgroup Max Size:      1024(0x400)                        
    Workgroup Max Size per Dimension:
      x                        1024(0x400)                        
      y                        1024(0x400)                        
      z                        1024(0x400)                        
    Max Waves Per CU:        32(0x20)                           
    Max Work-item Per CU:    1024(0x400)                        
    Grid Max Size:           4294967295(0xffffffff)             
    Grid Max Size per Dimension:
      x                        4294967295(0xffffffff)             
      y                        4294967295(0xffffffff)             
      z                        4294967295(0xffffffff)             
    Max fbarriers/Workgrp:   32                                 
    Pool Info:               
      Pool 1                   
        Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
        Size:                    524288(0x80000) KB                 
        Allocatable:             TRUE                               
        Alloc Granule:           4KB                                
        Alloc Alignment:         4KB                                
        Accessible by all:       FALSE                              
      Pool 2                   
        Segment:                 GROUP                              
        Size:                    64(0x40) KB                        
        Allocatable:             FALSE                              
        Alloc Granule:           0KB                                
        Alloc Alignment:         0KB                                
        Accessible by all:       FALSE                              
    ISA Info:                
      ISA 1                    
        Name:                    amdgcn-amd-amdhsa--gfx1036         
        Machine Models:          HSA_MACHINE_MODEL_LARGE            
        Profiles:                HSA_PROFILE_BASE                   
        Default Rounding Mode:   NEAR                               
        Default Rounding Mode:   NEAR                               
        Fast f16:                TRUE                               
        Workgroup Max Size:      1024(0x400)                        
        Workgroup Max Size per Dimension:
          x                        1024(0x400)                        
          y                        1024(0x400)                        
          z                        1024(0x400)                        
        Grid Max Size:           4294967295(0xffffffff)             
        Grid Max Size per Dimension:
          x                        4294967295(0xffffffff)             
          y                        4294967295(0xffffffff)             
          z                        4294967295(0xffffffff)             
        FBarrier Max Size:       32                                 
  *** Done ***             
  
  ~ 
  ❯ /opt/rocm/bin/clinfo
  Number of platforms:				 1
    Platform Profile:				 FULL_PROFILE
    Platform Version:				 OpenCL 2.1 AMD-APP.dbg (3570.0)
    Platform Name:				 AMD Accelerated Parallel Processing
    Platform Vendor:				 Advanced Micro Devices, Inc.
    Platform Extensions:				 cl_khr_icd cl_amd_event_callback 
  
  
    Platform Name:				 AMD Accelerated Parallel Processing
  Number of devices:				 2
    Device Type:					 CL_DEVICE_TYPE_GPU
    Vendor ID:					 1002h
    Board name:					 AMD Radeon RX 7900 XTX
    Device Topology:				 PCI[ B#3, D#0, F#0 ]
    Max compute units:				 48
    Max work items dimensions:			 3
      Max work items[0]:				 1024
      Max work items[1]:				 1024
      Max work items[2]:				 1024
    Max work group size:				 256
    Preferred vector width char:			 4
    Preferred vector width short:			 2
    Preferred vector width int:			 1
    Preferred vector width long:			 1
    Preferred vector width float:			 1
    Preferred vector width double:		 1
    Native vector width char:			 4
    Native vector width short:			 2
    Native vector width int:			 1
    Native vector width long:			 1
    Native vector width float:			 1
    Native vector width double:			 1
    Max clock frequency:				 2371Mhz
    Address bits:					 64
    Max memory allocation:			 21890072576
    Image support:				 Yes
    Max number of images read arguments:		 128
    Max number of images write arguments:		 8
    Max image 2D width:				 16384
    Max image 2D height:				 16384
    Max image 3D width:				 16384
    Max image 3D height:				 16384
    Max image 3D depth:				 8192
    Max samplers within kernel:			 29772
    Max size of kernel argument:			 1024
    Alignment (bits) of base address:		 1024
    Minimum alignment (bytes) for any datatype:	 128
    Single precision floating point capability
      Denorms:					 Yes
      Quiet NaNs:					 Yes
      Round to nearest even:			 Yes
      Round to zero:				 Yes
      Round to +ve and infinity:			 Yes
      IEEE754-2008 fused multiply-add:		 Yes
    Cache type:					 Read/Write
    Cache line size:				 64
    Cache size:					 32768
    Global memory size:				 25753026560
    Constant buffer size:				 21890072576
    Max number of constant args:			 8
    Local memory type:				 Scratchpad
    Local memory size:				 65536
    Max pipe arguments:				 16
    Max pipe active reservations:			 16
    Max pipe packet size:				 415236096
    Max global variable size:			 21890072576
    Max global variable preferred total size:	 25753026560
    Max read/write image args:			 64
    Max on device events:				 1024
    Queue on device max size:			 8388608
    Max on device queues:				 1
    Queue on device preferred size:		 262144
    SVM capabilities:				 
      Coarse grain buffer:			 Yes
      Fine grain buffer:				 Yes
      Fine grain system:				 No
      Atomics:					 No
    Preferred platform atomic alignment:		 0
    Preferred global atomic alignment:		 0
    Preferred local atomic alignment:		 0
    Kernel Preferred work group size multiple:	 32
    Error correction support:			 0
    Unified memory for Host and Device:		 0
    Profiling timer resolution:			 1
    Device endianess:				 Little
    Available:					 Yes
    Compiler available:				 Yes
    Execution capabilities:				 
      Execute OpenCL kernels:			 Yes
      Execute native function:			 No
    Queue on Host properties:				 
      Out-of-Order:				 No
      Profiling :					 Yes
    Queue on Device properties:				 
      Out-of-Order:				 Yes
      Profiling :					 Yes
    Platform ID:					 0x7fa0a5322010
    Name:						 gfx1100
    Vendor:					 Advanced Micro Devices, Inc.
    Device OpenCL C version:			 OpenCL C 2.0 
    Driver version:				 3570.0 (HSA1.1,LC)
    Profile:					 FULL_PROFILE
    Version:					 OpenCL 2.0 
    Extensions:					 cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 
  
  
    Device Type:					 CL_DEVICE_TYPE_GPU
    Vendor ID:					 1002h
    Board name:					 AMD Radeon Graphics
    Device Topology:				 PCI[ B#18, D#0, F#0 ]
    Max compute units:				 1
    Max work items dimensions:			 3
      Max work items[0]:				 1024
      Max work items[1]:				 1024
      Max work items[2]:				 1024
    Max work group size:				 256
    Preferred vector width char:			 4
    Preferred vector width short:			 2
    Preferred vector width int:			 1
    Preferred vector width long:			 1
    Preferred vector width float:			 1
    Preferred vector width double:		 1
    Native vector width char:			 4
    Native vector width short:			 2
    Native vector width int:			 1
    Native vector width long:			 1
    Native vector width float:			 1
    Native vector width double:			 1
    Max clock frequency:				 2200Mhz
    Address bits:					 64
    Max memory allocation:			 456340272
    Image support:				 Yes
    Max number of images read arguments:		 128
    Max number of images write arguments:		 8
    Max image 2D width:				 16384
    Max image 2D height:				 16384
    Max image 3D width:				 16384
    Max image 3D height:				 16384
    Max image 3D depth:				 8192
    Max samplers within kernel:			 5710
    Max size of kernel argument:			 1024
    Alignment (bits) of base address:		 1024
    Minimum alignment (bytes) for any datatype:	 128
    Single precision floating point capability
      Denorms:					 Yes
      Quiet NaNs:					 Yes
      Round to nearest even:			 Yes
      Round to zero:				 Yes
      Round to +ve and infinity:			 Yes
      IEEE754-2008 fused multiply-add:		 Yes
    Cache type:					 Read/Write
    Cache line size:				 64
    Cache size:					 16384
    Global memory size:				 536870912
    Constant buffer size:				 456340272
    Max number of constant args:			 8
    Local memory type:				 Scratchpad
    Local memory size:				 65536
    Max pipe arguments:				 16
    Max pipe active reservations:			 16
    Max pipe packet size:				 456340272
    Max global variable size:			 456340272
    Max global variable preferred total size:	 536870912
    Max read/write image args:			 64
    Max on device events:				 1024
    Queue on device max size:			 8388608
    Max on device queues:				 1
    Queue on device preferred size:		 262144
    SVM capabilities:				 
      Coarse grain buffer:			 Yes
      Fine grain buffer:				 Yes
      Fine grain system:				 No
      Atomics:					 No
    Preferred platform atomic alignment:		 0
    Preferred global atomic alignment:		 0
    Preferred local atomic alignment:		 0
    Kernel Preferred work group size multiple:	 32
    Error correction support:			 0
    Unified memory for Host and Device:		 0
    Profiling timer resolution:			 1
    Device endianess:				 Little
    Available:					 Yes
    Compiler available:				 Yes
    Execution capabilities:				 
      Execute OpenCL kernels:			 Yes
      Execute native function:			 No
    Queue on Host properties:				 
      Out-of-Order:				 No
      Profiling :					 Yes
    Queue on Device properties:				 
      Out-of-Order:				 Yes
      Profiling :					 Yes
    Platform ID:					 0x7fa0a5322010
    Name:						 gfx1036
    Vendor:					 Advanced Micro Devices, Inc.
    Device OpenCL C version:			 OpenCL C 2.0 
    Driver version:				 3570.0 (HSA1.1,LC)
    Profile:					 FULL_PROFILE
    Version:					 OpenCL 2.0 
    Extensions:					 cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 

@pxl-th
Member

pxl-th commented Nov 24, 2023

Not completely sure, but it seems #508 (ROCm mixed mode) does not yet provide 7900 XTX support (at least on Arch)

The failure is:

julia: /usr/src/debug/hip-runtime-amd/clr-rocm-5.6.1/rocclr/os/os_posix.cpp:310: static void amd::Os::currentStackInfo(unsigned char**, size_t*): Assertion `Os::currentStackPtr() >= *base - *size && Os::currentStackPtr() < *base && "just checking"' failed.

This particular issue comes from a debug HIP build and can be ignored (for now).
So if you can, use a release build, where this assert is disabled.
See ROCm/clr#36 & #549.
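
(To check which HIP install a session actually picked up, one option is to look at the libamdhip64 path loaded after using AMDGPU; this is plain Libdl, nothing AMDGPU-specific:)

julia> using Libdl

julia> filter(contains("amdhip64"), Libdl.dllist())  # path of the loaded HIP runtime, e.g. /opt/rocm/lib/libamdhip64.so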

@pxl-th
Member

pxl-th commented Nov 24, 2023

As for the 7900 series, we already support them.

@pxl-th closed this as completed Nov 24, 2023
@aksuhton

aksuhton commented Nov 25, 2023

I'm having some trouble with an error upon using AMDGPU. I first tried today with ROCm version 5.7.1, and have since downgraded to version 5.6.1. I'm on Arch Linux attempting to use a 7900 XTX.

The leading line of the error is as follows:

julia: /usr/src/debug/hip-runtime-amd/clr-rocm-5.6.1/hipamd/src/hip_code_object.cpp:754: hip::FatBinaryInfo** hip::StatCO::addFatBinary(const void*, bool): Assertion `err == hipSuccess' failed

Full error:

Details

julia +1.10.0-beta3
julia> ]activate --temp
julia> ]add [email protected]
julia> ]add AMDGPU#master
julia> ]status
Status `/tmp/jl_eQU7XV/Project.toml`
  [21141c5a] AMDGPU v0.8.0 `https://github.com/JuliaGPU/AMDGPU.jl.git#master`
  [873c0968] ROCmDeviceLibs_jll v5.6.1+1
julia> using AMDGPU
julia: /usr/src/debug/hip-runtime-amd/clr-rocm-5.6.1/hipamd/src/hip_code_object.cpp:754: hip::FatBinaryInfo** hip::StatCO::addFatBinary(const void*, bool): Assertion `err == hipSuccess' failed.
[2658] signal (6.-6): Aborted
in expression starting at REPL[5]:1
unknown function (ip: 0x7fc80ce3283c)
raise at /usr/lib/libc.so.6 (unknown line)
abort at /usr/lib/libc.so.6 (unknown line)
unknown function (ip: 0x7fc80cdca3db)
__assert_fail at /usr/lib/libc.so.6 (unknown line)
__hipRegisterFatBinary at /opt/rocm/hip/lib/libamdhip64.so (unknown line)
unknown function (ip: 0x7fc5f76d363c)
unknown function (ip: 0x7fc80cfc5eed)
unknown function (ip: 0x7fc80cfc5fdb)
_dl_catch_exception at /lib64/ld-linux-x86-64.so.2 (unknown line)
unknown function (ip: 0x7fc80cfcc875)
_dl_catch_exception at /lib64/ld-linux-x86-64.so.2 (unknown line)
unknown function (ip: 0x7fc80cfccbeb)
unknown function (ip: 0x7fc80ce2c9eb)
_dl_catch_exception at /lib64/ld-linux-x86-64.so.2 (unknown line)
unknown function (ip: 0x7fc80cfc2602)
unknown function (ip: 0x7fc80ce2c4f6)
dlopen at /usr/lib/libc.so.6 (unknown line)
ijl_load_dynamic_library at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/dlload.c:365
#dlopen#3 at ./libdl.jl:117
dlopen at ./libdl.jl:116 [inlined]
find_library at ./libdl.jl:206
find_library at ./libdl.jl:214 [inlined]
find_library at ./libdl.jl:214 [inlined]
find_rocm_library at /home/aksuhton/.julia/packages/AMDGPU/9od2y/src/discovery/utils.jl:107
#get_library#4 at /home/aksuhton/.julia/packages/AMDGPU/9od2y/src/discovery/discovery.jl:57 [inlined]
get_library at /home/aksuhton/.julia/packages/AMDGPU/9od2y/src/discovery/discovery.jl:48 [inlined]
__init__ at /home/aksuhton/.julia/packages/AMDGPU/9od2y/src/discovery/discovery.jl:151
jfptr___init___5427 at /home/aksuhton/.julia/compiled/v1.10/AMDGPU/arpZD_Zqr9o.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:2892 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:3074
jl_apply at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/julia.h:1976 [inlined]
jl_module_run_initializer at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/toplevel.c:76
run_module_init at ./loading.jl:1128
register_restored_modules at ./loading.jl:1116
_include_from_serialized at ./loading.jl:1061
_require_search_from_serialized at ./loading.jl:1575
_require at ./loading.jl:1932
__require_prelocked at ./loading.jl:1806
jfptr___require_prelocked_80657.1 at /home/aksuhton/.julia/juliaup/julia-1.10.0-beta3+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:2892 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:3074
jl_apply at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/julia.h:1976 [inlined]
jl_f__call_in_world at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/builtins.c:831
#invoke_in_world#3 at ./essentials.jl:921 [inlined]
invoke_in_world at ./essentials.jl:918 [inlined]
_require_prelocked at ./loading.jl:1797
macro expansion at ./loading.jl:1784 [inlined]
macro expansion at ./lock.jl:267 [inlined]
__require at ./loading.jl:1747
jfptr___require_80622.1 at /home/aksuhton/.julia/juliaup/julia-1.10.0-beta3+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:2892 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:3074
jl_apply at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/julia.h:1976 [inlined]
jl_f__call_in_world at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/builtins.c:831
#invoke_in_world#3 at ./essentials.jl:921 [inlined]
invoke_in_world at ./essentials.jl:918 [inlined]
require at ./loading.jl:1740
jfptr_require_80619.1 at /home/aksuhton/.julia/juliaup/julia-1.10.0-beta3+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:2892 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:3074
jl_apply at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/julia.h:1976 [inlined]
call_require at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/toplevel.c:481 [inlined]
eval_import_path at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/toplevel.c:518
jl_toplevel_eval_flex at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/toplevel.c:752
jl_toplevel_eval_flex at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/toplevel.c:877
ijl_toplevel_eval_in at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/toplevel.c:985
eval at ./boot.jl:383 [inlined]
eval_user_input at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:150
repl_backend_loop at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:246
#start_repl_backend#46 at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:231
start_repl_backend at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:228
_jl_invoke at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:2892 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:3074
#run_repl#59 at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:389
run_repl at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:375
jfptr_run_repl_91573.1 at /home/aksuhton/.julia/juliaup/julia-1.10.0-beta3+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:2892 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:3074
#1013 at ./client.jl:432
jfptr_YY.1013_82609.1 at /home/aksuhton/.julia/juliaup/julia-1.10.0-beta3+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:2892 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:3074
jl_apply at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/julia.h:1976 [inlined]
jl_f__call_latest at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/builtins.c:812
#invokelatest#2 at ./essentials.jl:887 [inlined]
invokelatest at ./essentials.jl:884 [inlined]
run_main_repl at ./client.jl:416
exec_options at ./client.jl:333
_start at ./client.jl:552
jfptr__start_82635.1 at /home/aksuhton/.julia/juliaup/julia-1.10.0-beta3+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:2892 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:3074
jl_apply at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/julia.h:1976 [inlined]
true_main at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/jlapi.c:582
jl_repl_entrypoint at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/jlapi.c:731
main at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/cli/loader_exe.c:58
unknown function (ip: 0x7fc80cdcbccf)
__libc_start_main at /usr/lib/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 4064768 (Pool: 4061266; Big: 3502); GC: 6
[1]    2656 IOT instruction (core dumped)  julia +1.10.0-beta3

rocminfo:

Details

bin /opt/rocm/bin/rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen 9 7950X3D 16-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen 9 7950X3D 16-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
	L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   5759
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            32
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
	Pool 1
	  Segment:                 GLOBAL; FLAGS: FINE GRAINED
	  Size:                    31950900(0x1e78834) KB
	  Allocatable:             TRUE
	  Alloc Granule:           4KB
	  Alloc Alignment:         4KB
	  Accessible by all:       TRUE
	Pool 2
	  Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
	  Size:                    31950900(0x1e78834) KB
	  Allocatable:             TRUE
	  Alloc Granule:           4KB
	  Alloc Alignment:         4KB
	  Accessible by all:       TRUE
	Pool 3
	  Segment:                 GLOBAL; FLAGS: COARSE GRAINED
	  Size:                    31950900(0x1e78834) KB
	  Allocatable:             TRUE
	  Alloc Granule:           4KB
	  Alloc Alignment:         4KB
	  Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1100
  Uuid:                    GPU-8dc302efb16ccf0b
  Marketing Name:          AMD Radeon RX 7900 XTX
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
	L1:                      32(0x20) KB
	L2:                      6144(0x1800) KB
	L3:                      98304(0x18000) KB
  Chip ID:                 29772(0x744c)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2526
  BDFID:                   768
  Internal Node ID:        1
  Compute Unit:            96
  SIMDs per CU:            2
  Shader Engines:          6
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
	x                        1024(0x400)
	y                        1024(0x400)
	z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
	x                        4294967295(0xffffffff)
	y                        4294967295(0xffffffff)
	z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
	Pool 1
	  Segment:                 GLOBAL; FLAGS: COARSE GRAINED
	  Size:                    25149440(0x17fc000) KB
	  Allocatable:             TRUE
	  Alloc Granule:           4KB
	  Alloc Alignment:         4KB
	  Accessible by all:       FALSE
	Pool 2
	  Segment:                 GROUP
	  Size:                    64(0x40) KB
	  Allocatable:             FALSE
	  Alloc Granule:           0KB
	  Alloc Alignment:         0KB
	  Accessible by all:       FALSE
  ISA Info:
	ISA 1
	  Name:                    amdgcn-amd-amdhsa--gfx1100
	  Machine Models:          HSA_MACHINE_MODEL_LARGE
	  Profiles:                HSA_PROFILE_BASE
	  Default Rounding Mode:   NEAR
	  Default Rounding Mode:   NEAR
	  Fast f16:                TRUE
	  Workgroup Max Size:      1024(0x400)
	  Workgroup Max Size per Dimension:
		x                        1024(0x400)
		y                        1024(0x400)
		z                        1024(0x400)
	  Grid Max Size:           4294967295(0xffffffff)
	  Grid Max Size per Dimension:
		x                        4294967295(0xffffffff)
		y                        4294967295(0xffffffff)
		z                        4294967295(0xffffffff)
	  FBarrier Max Size:       32
*******
Agent 3
*******
  Name:                    gfx1036
  Uuid:                    GPU-XX
  Marketing Name:          AMD Radeon Graphics
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    2
  Device Type:             GPU
  Cache Info:
	L1:                      16(0x10) KB
	L2:                      256(0x100) KB
  Chip ID:                 5710(0x164e)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2200
  BDFID:                   20480
  Internal Node ID:        2
  Compute Unit:            2
  SIMDs per CU:            2
  Shader Engines:          1
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
	x                        1024(0x400)
	y                        1024(0x400)
	z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
	x                        4294967295(0xffffffff)
	y                        4294967295(0xffffffff)
	z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
	Pool 1
	  Segment:                 GLOBAL; FLAGS: COARSE GRAINED
	  Size:                    524288(0x80000) KB
	  Allocatable:             TRUE
	  Alloc Granule:           4KB
	  Alloc Alignment:         4KB
	  Accessible by all:       FALSE
	Pool 2
	  Segment:                 GROUP
	  Size:                    64(0x40) KB
	  Allocatable:             FALSE
	  Alloc Granule:           0KB
	  Alloc Alignment:         0KB
	  Accessible by all:       FALSE
  ISA Info:
	ISA 1
	  Name:                    amdgcn-amd-amdhsa--gfx1036
	  Machine Models:          HSA_MACHINE_MODEL_LARGE
	  Profiles:                HSA_PROFILE_BASE
	  Default Rounding Mode:   NEAR
	  Default Rounding Mode:   NEAR
	  Fast f16:                TRUE
	  Workgroup Max Size:      1024(0x400)
	  Workgroup Max Size per Dimension:
		x                        1024(0x400)
		y                        1024(0x400)
		z                        1024(0x400)
	  Grid Max Size:           4294967295(0xffffffff)
	  Grid Max Size per Dimension:
		x                        4294967295(0xffffffff)
		y                        4294967295(0xffffffff)
		z                        4294967295(0xffffffff)
	  FBarrier Max Size:       32
*** Done ***

clinfo:

Details

bin /opt/rocm/bin/clinfo
Number of platforms:				 1
  Platform Profile:				 FULL_PROFILE
  Platform Version:				 OpenCL 2.1 AMD-APP.dbg (3570.0)
  Platform Name:				 AMD Accelerated Parallel Processing
  Platform Vendor:				 Advanced Micro Devices, Inc.
  Platform Extensions:				 cl_khr_icd cl_amd_event_callback


  Platform Name:				 AMD Accelerated Parallel Processing
Number of devices:				 2
  Device Type:					 CL_DEVICE_TYPE_GPU
  Vendor ID:					 1002h
  Board name:					 AMD Radeon RX 7900 XTX
  Device Topology:				 PCI[ B#3, D#0, F#0 ]
  Max compute units:				 48
  Max work items dimensions:			 3
	Max work items[0]:				 1024
	Max work items[1]:				 1024
	Max work items[2]:				 1024
  Max work group size:				 256
  Preferred vector width char:			 4
  Preferred vector width short:			 2
  Preferred vector width int:			 1
  Preferred vector width long:			 1
  Preferred vector width float:			 1
  Preferred vector width double:		 1
  Native vector width char:			 4
  Native vector width short:			 2
  Native vector width int:			 1
  Native vector width long:			 1
  Native vector width float:			 1
  Native vector width double:			 1
  Max clock frequency:				 2526Mhz
  Address bits:					 64
  Max memory allocation:			 21890072576
  Image support:				 Yes
  Max number of images read arguments:		 128
  Max number of images write arguments:		 8
  Max image 2D width:				 16384
  Max image 2D height:				 16384
  Max image 3D width:				 16384
  Max image 3D height:				 16384
  Max image 3D depth:				 8192
  Max samplers within kernel:			 29772
  Max size of kernel argument:			 1024
  Alignment (bits) of base address:		 1024
  Minimum alignment (bytes) for any datatype:	 128
  Single precision floating point capability
	Denorms:					 Yes
	Quiet NaNs:					 Yes
	Round to nearest even:			 Yes
	Round to zero:				 Yes
	Round to +ve and infinity:			 Yes
	IEEE754-2008 fused multiply-add:		 Yes
  Cache type:					 Read/Write
  Cache line size:				 64
  Cache size:					 32768
  Global memory size:				 25753026560
  Constant buffer size:				 21890072576
  Max number of constant args:			 8
  Local memory type:				 Scratchpad
  Local memory size:				 65536
  Max pipe arguments:				 16
  Max pipe active reservations:			 16
  Max pipe packet size:				 415236096
  Max global variable size:			 21890072576
  Max global variable preferred total size:	 25753026560
  Max read/write image args:			 64
  Max on device events:				 1024
  Queue on device max size:			 8388608
  Max on device queues:				 1
  Queue on device preferred size:		 262144
  SVM capabilities:				
	Coarse grain buffer:			 Yes
	Fine grain buffer:				 Yes
	Fine grain system:				 No
	Atomics:					 No
  Preferred platform atomic alignment:		 0
  Preferred global atomic alignment:		 0
  Preferred local atomic alignment:		 0
  Kernel Preferred work group size multiple:	 32
  Error correction support:			 0
  Unified memory for Host and Device:		 0
  Profiling timer resolution:			 1
  Device endianess:				 Little
  Available:					 Yes
  Compiler available:				 Yes
  Execution capabilities:				
	Execute OpenCL kernels:			 Yes
	Execute native function:			 No
  Queue on Host properties:				
	Out-of-Order:				 No
	Profiling :					 Yes
  Queue on Device properties:				
	Out-of-Order:				 Yes
	Profiling :					 Yes
  Platform ID:					 0x7f0c4860f010
  Name:						 gfx1100
  Vendor:					 Advanced Micro Devices, Inc.
  Device OpenCL C version:			 OpenCL C 2.0
  Driver version:				 3570.0 (HSA1.1,LC)
  Profile:					 FULL_PROFILE
  Version:					 OpenCL 2.0
  Extensions:					 cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program


  Device Type:					 CL_DEVICE_TYPE_GPU
  Vendor ID:					 1002h
  Board name:					 AMD Radeon Graphics
  Device Topology:				 PCI[ B#80, D#0, F#0 ]
  Max compute units:				 1
  Max work items dimensions:			 3
	Max work items[0]:				 1024
	Max work items[1]:				 1024
	Max work items[2]:				 1024
  Max work group size:				 256
  Preferred vector width char:			 4
  Preferred vector width short:			 2
  Preferred vector width int:			 1
  Preferred vector width long:			 1
  Preferred vector width float:			 1
  Preferred vector width double:		 1
  Native vector width char:			 4
  Native vector width short:			 2
  Native vector width int:			 1
  Native vector width long:			 1
  Native vector width float:			 1
  Native vector width double:			 1
  Max clock frequency:				 2200Mhz
  Address bits:					 64
  Max memory allocation:			 456340272
  Image support:				 Yes
  Max number of images read arguments:		 128
  Max number of images write arguments:		 8
  Max image 2D width:				 16384
  Max image 2D height:				 16384
  Max image 3D width:				 16384
  Max image 3D height:				 16384
  Max image 3D depth:				 8192
  Max samplers within kernel:			 5710
  Max size of kernel argument:			 1024
  Alignment (bits) of base address:		 1024
  Minimum alignment (bytes) for any datatype:	 128
  Single precision floating point capability
	Denorms:					 Yes
	Quiet NaNs:					 Yes
	Round to nearest even:			 Yes
	Round to zero:				 Yes
	Round to +ve and infinity:			 Yes
	IEEE754-2008 fused multiply-add:		 Yes
  Cache type:					 Read/Write
  Cache line size:				 64
  Cache size:					 16384
  Global memory size:				 536870912
  Constant buffer size:				 456340272
  Max number of constant args:			 8
  Local memory type:				 Scratchpad
  Local memory size:				 65536
  Max pipe arguments:				 16
  Max pipe active reservations:			 16
  Max pipe packet size:				 456340272
  Max global variable size:			 456340272
  Max global variable preferred total size:	 536870912
  Max read/write image args:			 64
  Max on device events:				 1024
  Queue on device max size:			 8388608
  Max on device queues:				 1
  Queue on device preferred size:		 262144
  SVM capabilities:				
	Coarse grain buffer:			 Yes
	Fine grain buffer:				 Yes
	Fine grain system:				 No
	Atomics:					 No
  Preferred platform atomic alignment:		 0
  Preferred global atomic alignment:		 0
  Preferred local atomic alignment:		 0
  Kernel Preferred work group size multiple:	 32
  Error correction support:			 0
  Unified memory for Host and Device:		 0
  Profiling timer resolution:			 1
  Device endianess:				 Little
  Available:					 Yes
  Compiler available:				 Yes
  Execution capabilities:				
	Execute OpenCL kernels:			 Yes
	Execute native function:			 No
  Queue on Host properties:				
	Out-of-Order:				 No
	Profiling :					 Yes
  Queue on Device properties:				
	Out-of-Order:				 Yes
	Profiling :					 Yes
  Platform ID:					 0x7f0c4860f010
  Name:						 gfx1036
  Vendor:					 Advanced Micro Devices, Inc.
  Device OpenCL C version:			 OpenCL C 2.0
  Driver version:				 3570.0 (HSA1.1,LC)
  Profile:					 FULL_PROFILE
  Version:					 OpenCL 2.0
  Extensions:					 cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program

If instead I use a LocalPreferences.toml with use_artifacts = true, then on using AMDGPU I see:

julia: /workspace/srcdir/ROCR-Runtime/src/core/runtime/amd_gpu_agent.cpp:339: void rocr::AMD::GpuAgent::AssembleShader(const char *, rocr::AMD::GpuAgent::AssembleTarget, void *&, size_t &) const: Assertion `code_buf != __null && "Code buffer allocation failed"' failed.
Details

julia: /workspace/srcdir/ROCR-Runtime/src/core/runtime/amd_gpu_agent.cpp:339: void rocr::AMD::GpuAgent::AssembleShader(const char *, rocr::AMD::GpuAgent::AssembleTarget, void *&, size_t &) const: Assertion `code_buf != __null && "Code buffer allocation failed"' failed.
  
  [2199] signal (6.-6): Aborted
  in expression starting at REPL[3]:1
  unknown function (ip: 0x7f3bc216d83c)
  raise at /usr/lib/libc.so.6 (unknown line)
  abort at /usr/lib/libc.so.6 (unknown line)
  unknown function (ip: 0x7f3bc21053db)
  __assert_fail at /usr/lib/libc.so.6 (unknown line)
  _ZNK4rocr3AMD8GpuAgent14AssembleShaderEPKcNS1_14AssembleTargetERPvRm at /home/aksuhton/.julia/artifacts/4df816456579cea2c03cac08a6b82fa87abe2b38/lib/libhsa-runtime64.so (unknown line)
  _ZN4rocr3AMD8GpuAgent15BindTrapHandlerEv at /home/aksuhton/.julia/artifacts/4df816456579cea2c03cac08a6b82fa87abe2b38/lib/libhsa-runtime64.so (unknown line)
  _ZN4rocr3AMD8GpuAgent13PostToolsInitEv at /home/aksuhton/.julia/artifacts/4df816456579cea2c03cac08a6b82fa87abe2b38/lib/libhsa-runtime64.so (unknown line)
  _ZN4rocr4core7Runtime4LoadEv at /home/aksuhton/.julia/artifacts/4df816456579cea2c03cac08a6b82fa87abe2b38/lib/libhsa-runtime64.so (unknown line)
  _ZN4rocr4core7Runtime7AcquireEv at /home/aksuhton/.julia/artifacts/4df816456579cea2c03cac08a6b82fa87abe2b38/lib/libhsa-runtime64.so (unknown line)
  _ZN4rocr3HSA8hsa_initEv at /home/aksuhton/.julia/artifacts/4df816456579cea2c03cac08a6b82fa87abe2b38/lib/libhsa-runtime64.so (unknown line)
  hsa_init at /home/aksuhton/.julia/artifacts/4df816456579cea2c03cac08a6b82fa87abe2b38/lib/libhsa-runtime64.so (unknown line)
  unknown function (ip: 0x7f3b0ee6b4de)
  unknown function (ip: 0x7f3b0edcaebb)
  unknown function (ip: 0x7f3b0ee62c95)
  unknown function (ip: 0x7f3b0eb2fca9)
  hipRuntimeGetVersion at /home/aksuhton/.julia/artifacts/3e4a5c18581a48180ab1525d3d490a2e2552616f/hip/lib/libamdhip64.so (unknown line)
  _hip_runtime_version at /home/aksuhton/.julia/packages/AMDGPU/9od2y/src/discovery/discovery.jl:87
  __init__ at /home/aksuhton/.julia/packages/AMDGPU/9od2y/src/discovery/discovery.jl:144
  jfptr___init___5571 at /home/aksuhton/.julia/compiled/v1.10/AMDGPU/arpZD_iwQWC.so (unknown line)
  _jl_invoke at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:2892 [inlined]
  ijl_apply_generic at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:3074
  jl_apply at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/julia.h:1976 [inlined]
  jl_module_run_initializer at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/toplevel.c:76
  run_module_init at ./loading.jl:1128
  register_restored_modules at ./loading.jl:1116
  _include_from_serialized at ./loading.jl:1061
  _require_search_from_serialized at ./loading.jl:1575
  _require at ./loading.jl:1932
  __require_prelocked at ./loading.jl:1806
  jfptr___require_prelocked_80657.1 at /home/aksuhton/.julia/juliaup/julia-1.10.0-beta3+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
  _jl_invoke at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:2892 [inlined]
  ijl_apply_generic at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:3074
  jl_apply at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/julia.h:1976 [inlined]
  jl_f__call_in_world at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/builtins.c:831
  #invoke_in_world#3 at ./essentials.jl:921 [inlined]
  invoke_in_world at ./essentials.jl:918 [inlined]
  _require_prelocked at ./loading.jl:1797
  macro expansion at ./loading.jl:1784 [inlined]
  macro expansion at ./lock.jl:267 [inlined]
  __require at ./loading.jl:1747
  jfptr___require_80622.1 at /home/aksuhton/.julia/juliaup/julia-1.10.0-beta3+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
  _jl_invoke at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:2892 [inlined]
  ijl_apply_generic at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:3074
  jl_apply at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/julia.h:1976 [inlined]
  jl_f__call_in_world at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/builtins.c:831
  #invoke_in_world#3 at ./essentials.jl:921 [inlined]
  invoke_in_world at ./essentials.jl:918 [inlined]
  require at ./loading.jl:1740
  jfptr_require_80619.1 at /home/aksuhton/.julia/juliaup/julia-1.10.0-beta3+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
  _jl_invoke at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:2892 [inlined]
  ijl_apply_generic at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:3074
  jl_apply at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/julia.h:1976 [inlined]
  call_require at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/toplevel.c:481 [inlined]
  eval_import_path at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/toplevel.c:518
  jl_toplevel_eval_flex at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/toplevel.c:752
  jl_toplevel_eval_flex at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/toplevel.c:877
  ijl_toplevel_eval_in at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/toplevel.c:985
  eval at ./boot.jl:383 [inlined]
  eval_user_input at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:150
  repl_backend_loop at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:246
  #start_repl_backend#46 at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:231
  start_repl_backend at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:228
  _jl_invoke at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:2892 [inlined]
  ijl_apply_generic at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:3074
  #run_repl#59 at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:389
  run_repl at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:375
  jfptr_run_repl_91573.1 at /home/aksuhton/.julia/juliaup/julia-1.10.0-beta3+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
  _jl_invoke at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:2892 [inlined]
  ijl_apply_generic at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:3074
  #1013 at ./client.jl:432
  jfptr_YY.1013_82609.1 at /home/aksuhton/.julia/juliaup/julia-1.10.0-beta3+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
  _jl_invoke at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:2892 [inlined]
  ijl_apply_generic at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:3074
  jl_apply at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/julia.h:1976 [inlined]
  jl_f__call_latest at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/builtins.c:812
  #invokelatest#2 at ./essentials.jl:887 [inlined]
  invokelatest at ./essentials.jl:884 [inlined]
  run_main_repl at ./client.jl:416
  exec_options at ./client.jl:333
  _start at ./client.jl:552
  jfptr__start_82635.1 at /home/aksuhton/.julia/juliaup/julia-1.10.0-beta3+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
  _jl_invoke at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:2892 [inlined]
  ijl_apply_generic at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/gf.c:3074
  jl_apply at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/julia.h:1976 [inlined]
  true_main at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/jlapi.c:582
  jl_repl_entrypoint at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/src/jlapi.c:731
  main at /cache/build/default-amdci5-7/julialang/julia-release-1-dot-10/cli/loader_exe.c:58
  unknown function (ip: 0x7f3bc2106ccf)
  __libc_start_main at /usr/lib/libc.so.6 (unknown line)
  unknown function (ip: 0x4010b8)
  Allocations: 4048475 (Pool: 4044616; Big: 3859); GC: 6
  [1]    2197 IOT instruction (core dumped)  julia +1.10.0-beta3

Lastly, here is a list of what I think are the relevant packages. I've tested with everything uniformly at either version 5.7 or 5.6, except composable-kernel, which I could only find as either version 5.5 or 5.7.

comgr 
composable-kernel 
hip-runtime-amd 
hipblas 
hipcub 
hipfft 
hipsolver 
hipsparse
hsa-rocr 
hsakmt-roct 
miopen-hip 
rccl 
rocalution 
rocblas 
rocfft 
rocm-clang-ocl 
rocm-cmake 
rocm-core 
rocm-dbgapi 
rocm-device-libs
rocm-hip-libraries 
rocm-hip-runtime 
rocm-hip-sdk 
rocm-language-runtime 
rocm-llvm 
rocm-ml-libraries 
rocm-ml-sdk 
rocm-opencl-runtime 
rocm-opencl-sdk 
rocm-smi-lib 
rocminfo 
rocprim 
rocrand 
rocsolver 
rocsparse
rocthrust 
roctracer 

Apologies if the solution is documented above and I have not understood how to put it together. Thank you for all of your hard work.

@pxl-th
Member

pxl-th commented Nov 25, 2023

Can you get ROCm in release mode? HIP on your system is a debug build.
I don't think this is an issue with Julia in particular, e.g. ROCm/HIP#1333

@pxl-th
Member

pxl-th commented Nov 25, 2023

Also, there are other issues with the debug build that are not present with a release build:
ROCm/clr#36

@aksuhton

Gotcha, thank you for your time. I am not sure how easy/difficult it is for me to get ROCm in release mode, but I will report back with any progress.

@pxl-th
Member

pxl-th commented Nov 25, 2023

Not sure if Arch ships HIP in release mode in its packages, but as an alternative you can build it from source, which is quite easy.
See the docs: https://rocm.docs.amd.com/projects/HIP/en/latest/developer_guide/build.html

But you then need to make it visible to Libdl.dlopen with something like DL_LOAD_PATH, otherwise AMDGPU will attempt to load the one you have installed via the package manager.
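
A minimal sketch of that approach, assuming a locally built HIP was installed under /opt/hip-release (the path is purely illustrative):

using Libdl

# Hypothetical install prefix of a locally built HIP; adjust to your setup.
hip_lib_dir = "/opt/hip-release/lib"

# Make Libdl.dlopen search this directory before the system-wide ROCm libraries.
pushfirst!(Libdl.DL_LOAD_PATH, hip_lib_dir)

using AMDGPU  # should now resolve libamdhip64.so from hip_lib_dir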

@matinraayai
Contributor

@fraksuh @pxl-th What I'm seeing is that:

  1. If you use the Arch-based ROCm, HIP fails to load the HIP Fat binaries into its static code object map. It looks like a host-side runtime issue.
  2. If you use artifacts, the HSA runtime (think CUDA driver layer but for ROCm) fails to load the pre-compiled shaders onto the device. It can be either the trap handler or the BLIT kernels.

This all seems like you don't have the AMDGPU DKMS modules installed on your base system. An easy way to check is to see if you have /dev/kfd available on your system. I don't think the X.org driver works with ROCm.
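
For what it's worth, that check can also be done from the Julia REPL. A minimal sketch; it only verifies that the KFD character device exists and can be opened read-write:

kfd = "/dev/kfd"
if ispath(kfd)
    try
        close(open(kfd, "r+"))  # ROCm opens this device read-write
        println("$kfd is present and accessible")
    catch err
        println("$kfd is present but could not be opened: ", err)
    end
else
    println("$kfd not found; the amdgpu/KFD driver is probably not loaded")
end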

AMDGPU DKMS drivers don't seem to be provided in Arch's extra repository, and the AUR package named amdgpu-dkms is no longer maintained. There's this that might provide it, but I'm not sure whether it works with ROCm.

I'm curious what the correct way to install the ROCm AMDGPU drivers on Arch actually is.

Ubuntu and CentOS both have a dedicated package repository for this that matches the ROCm version you want to run, so it's easier for them. You're just paying the Arch penalty.

If you get the AMDGPU DKMS part working, you can just use containers instead of building ROCm yourself. AMD provides Ubuntu 18.04/20.04/22.04 images you can pull from Docker Hub. You can run one with the following command, although you may be able to drop some of these permissions:

docker run -it --rm --device=/dev/kfd --device=/dev/dri --ipc=host --group-add=video --shm-size=16G --cap-add=SYS_PTRACE --security-opt seccomp=unconfined rocm-terminal /bin/bash

@aksuhton

Hey thank you so much for taking the time to help.

This all seems like you don't have the AMDGPU DKMS modules installed on your base system. An easy way to check is to see if you have /dev/kfd available on your system. I don't think the X.org driver works with ROCm.

I do at least have /dev/kfd on my system, though I don't currently know where it came from; I was able to "make sure your user is in the same group as /dev/kfd":

ls -l /dev/kfd
crw-rw-rw- 1 root render 237, 0 Nov 25 11:10 /dev/kfd

groups aksuhton
sys network power lp wheel optical scanner rfkill video storage audio users autologin render aksuhton

Earlier today I reached out to the maintainer of the relevant Arch packages, and they are kindly offering their brainpower; I plan to follow this lead first before attempting to either build ROCm myself or use containers (thank you for the suggestion!).

@AntonReinhard
Contributor

I'm having the exact same problem; did you figure anything out?
The Docker solution seems very cumbersome...

@aksuhton

aksuhton commented Dec 8, 2023

@AntonReinhard

I've had some further correspondence with the package maintainer, but no resolution. Ideally I'd open an issue on their GitLab, but they seem to have suspended account creation.

Here are the salient aspects of our conversation so far; quoted lines are the maintainer's words, unquoted lines are mine:

"Packages are not built in debug mode but with Arch specific flags. Release mode is discouraged for all binary packages in the Arch Linux repositories."

"The /usr/src/debug doesn't mean that the package is compiled in debug mode. It's the default prefix for debug symbols,

https://gitlab.archlinux.org/pacman/pacman/-/blob/master/scripts/libmakepkg/buildenv/debugflags.sh.in#L34"

"You can try to run the simple test.sh in the hip-runtime-amd repo on gitlab or run the HIP-Examples repo from AMD on github. If both pass, it's likely an issue with julia."

hip-runtime-amd-main> chmod +x test.sh
hip-runtime-amd-main> ./test.sh
Agent AMD Radeon RX 7900 XTX
System version 11.0
TESTS PASSED!

And I get the same passing output if I extract the HIP-Examples repo from AMD into /opt/rocm/hip and run their test.sh there.
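
In case it helps narrow down which HIP runtime Julia is actually loading, here is a small diagnostic sketch using only Libdl (the soname libamdhip64 is the standard one; which file it resolves to depends on your setup):

using Libdl

# Resolve the HIP runtime via dlopen and print which shared object it came from.
handle = Libdl.dlopen("libamdhip64"; throw_error=false)
if handle === nothing
    println("libamdhip64 was not found on the loader search path")
else
    println("HIP runtime resolved to: ", Libdl.dlpath(handle))
    Libdl.dlclose(handle)
end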
