You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am trying to use the new "AMD Modern Fortran Compiler" described here: https://github.com/amd/InfinityHub-CI/tree/main/fortran
on my code that uses "do concurrent" for GPU-offload with optional OpenMP Target data movement (for GPUs/compiler that do not support unified memory).
The code is "HipFT" located publicly here: github.com/predsci/hipft
The code works on NVIDIA GPUs with nvfortran and HPE, and on Intel GPUs with ifx.
It also compiles and runs on AMD server GPUs with HPE's CCE compiler (see https://arxiv.org/pdf/2408.07843)
I have compiled HDF5 1.14.3 (with a configure fix) and OpenMPI 5.0.6 with the amdflang and amdclang compiler to link to the code.
When I try to compiler with: -O3 -fopenmp -fdo-concurrent-parallel=device --offload-arch=gfx906
I get:
If I try to compile without any OpenMP or Do Concurrent flags, the code compiles fine and runs correctly on 1 CPU core.
If I try to compile with just openmp turned on, and "do concurrent" set to host I get a lot of serialization warnings: warning: loc("/home/caplanr/hipft/git_amd/src/hipft.f90":7683:7): Some do concurrent loops are not perfectly-nested. These will be serialzied.
These concern me since if I cannot use DC with index ranges like "2:N-1" than I doubt the code will parallelize at all on either the GPU or CPU since a LOT of the loops are like that.
Note I also had to use: -L/opt/amdfort/llvm/lib -lomptarget in this case otherwise it cannot find the OpenMP target data movement symbols (although they should not be being used in this case....).
Any help would be appreciated as I plan to present the code at SIAM's CSE meeting in a few months and would really like to have some AMD results.
Replace the HDF5 paths with the ones to a HDF5 library compiled with amdflang.
The mpif90 should also be associated with an MPI library compiled with amdflang.
Replace the gfx906 with the correct GPU arch you are using.
Now, try to run the build script in the top level directory of the repo.
You should see:
./build_amd_gpu.sh
=== STARTING HIPFT BUILD ===
==> Entering src directory...
==> Removing old Makefile...
==> Generating Makefile from Makefile.template...
==> Compiling code...
!!> ERROR! hipft executable not found. Build most likely failed.
Contents of src/build.err:
LLVM ERROR: aborting
make: *** [Makefile:25: hipft.o] Error 1
You can go into the src folder and try to edit the Makefile and recompile as needed.
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
$ /opt/rocm/bin/rocminfo --support
ROCk module is loaded
HSA System Attributes
Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
Agent 1
Name: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
Uuid: CPU-XX
Marketing Name: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 4000
BDFID: 0
Internal Node ID: 0
Compute Unit: 6
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 32508640(0x1f00ae0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 32508640(0x1f00ae0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32508640(0x1f00ae0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32508640(0x1f00ae0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
Agent 2
Name: gfx906
Uuid: GPU-b86490a172da5ee9
Marketing Name: AMD Radeon VII
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 26287(0x66af)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1801
BDFID: 3584
Internal Node ID: 1
Compute Unit: 60
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 472
SDMA engine uCode:: 145
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 1676083(0xffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 1676083(0xffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
ISA 2
Name: amdgcn-amd-amdhsa--gfx9-generic:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
Problem Description
I am trying to use the new "AMD Modern Fortran Compiler" described here:
https://github.com/amd/InfinityHub-CI/tree/main/fortran
on my code that uses "do concurrent" for GPU-offload with optional OpenMP Target data movement (for GPUs/compiler that do not support unified memory).
The code is "HipFT" located publicly here:
github.com/predsci/hipft
The code works on NVIDIA GPUs with nvfortran and HPE, and on Intel GPUs with ifx.
It also compiles and runs on AMD server GPUs with HPE's CCE compiler (see https://arxiv.org/pdf/2408.07843)
I have compiled HDF5 1.14.3 (with a configure fix) and OpenMPI 5.0.6 with the amdflang and amdclang compiler to link to the code.
When I try to compiler with:
-O3 -fopenmp -fdo-concurrent-parallel=device --offload-arch=gfx906
I get:
I am using 'mpif90' to compile the code which is using the amdflang:
If I try to compile without any OpenMP or Do Concurrent flags, the code compiles fine and runs correctly on 1 CPU core.
If I try to compile with just openmp turned on, and "do concurrent" set to host I get a lot of serialization warnings:
warning: loc("/home/caplanr/hipft/git_amd/src/hipft.f90":7683:7): Some
do concurrentloops are not perfectly-nested. These will be serialzied.
These concern me since if I cannot use DC with index ranges like "2:N-1" than I doubt the code will parallelize at all on either the GPU or CPU since a LOT of the loops are like that.
Note I also had to use:
-L/opt/amdfort/llvm/lib -lomptarget
in this case otherwise it cannot find the OpenMP target data movement symbols (although they should not be being used in this case....).Any help would be appreciated as I plan to present the code at SIAM's CSE meeting in a few months and would really like to have some AMD results.
-- Ron
Operating System
Rocky Linux 9.5 (Blue Onyx)
CPU
Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
GPU
AMD Radeon VII, gfx906, amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-, , amdgcn-amd-amdhsa--gfx9-generic:sramecc+:xnack-
ROCm Version
ROCm 6.2.3
ROCm Component
flang
Steps to Reproduce
My rocm is actually 6.2.4, but that is not on the list.
My linux kernel is:
edge 6.10.6-1.el9.elrepo.x86_64
To reproduce, install the new AMD flang compiler from:
https://github.com/amd/InfinityHub-CI/tree/main/fortran
Next, clone the repo:
git clone https://github.com/predsci/hipft
Then, copy one of the build scripts from the
build_examples
folder and edit the top portion to resemble this:But:
Now, try to run the build script in the top level directory of the repo.
You should see:
You can go into the
src
folder and try to edit the Makefile and recompile as needed.(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
$ /opt/rocm/bin/rocminfo --support
ROCk module is loaded
HSA System Attributes
Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
Agent 1
Name: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
Uuid: CPU-XX
Marketing Name: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 4000
BDFID: 0
Internal Node ID: 0
Compute Unit: 6
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 32508640(0x1f00ae0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 32508640(0x1f00ae0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32508640(0x1f00ae0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32508640(0x1f00ae0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
Agent 2
Name: gfx906
Uuid: GPU-b86490a172da5ee9
Marketing Name: AMD Radeon VII
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 26287(0x66af)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1801
BDFID: 3584
Internal Node ID: 1
Compute Unit: 60
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 472
SDMA engine uCode:: 145
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 1676083(0xffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 1676083(0xffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
ISA 2
Name: amdgcn-amd-amdhsa--gfx9-generic:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
Additional Information
Here are my installed amd and rocm packages:
The text was updated successfully, but these errors were encountered: