GPU task switching causes computation errors for asteroids@home when using 2 or more different models of gpu of the same type #5743

sr1gh · 2024-08-09T04:40:13Z

Describe the bug
If GPU computation is suspended during use or when an exclusive application is running, when computation resumes, BOINC sometimes swaps which task is on which GPU. This causes a computation error for asteroids@home tasks when using multiple AMD GPUs, for example, an RX 7600 XT and an RX 6600. This might be an application specific issue, but it might be a good idea to have an option to not switch tasks between GPUs if possible, unless, for example, one GPU is removed, in which case all the tasks would have to run on the remaining GPU.

Steps To Reproduce

Start the asteroids@home period search application on a system with 2 different AMD gpus
Suspend computation partway through the computation, observing which task is on which gpu, the resume computation
Repeat if necessary until BOINC switches tasks between gpus resulting in a computation error.

Expected behavior
I would expect the task to stay on the GPU it started on if that is necessary for the task to finish. An option to disable gpu task switching is a potential solution, or tasks could specify weather or not they can be switched.

Screenshots

System Information

OS: Windows 10 (Latest)
BOINC Version: 8.0.4

Additional context

AenBleidd · 2024-08-09T08:09:23Z

I'm pretty much sure this is an issue of the project application, because for every task we assign on start-up ID (0, 1, 2, etc) of the GPU to be used.
If the project application doesn't use it but instead relies on some other mechanism - then there might be a collision.
@sr1gh, may I ask you for a favor?
Could you please go to the %BOINCDATA%\slots%N%
where

%BOINDATA% is the folder where your BOINC data is located (usually C:\ProgramData\BOINC)
%N% number of the slot
locate there two running tasks in the %N% folders (two different folders), open their init_data.xml and check <gpu_device_num> value.
These numbers should be different for different tasks, but should stay consistent after task is suspended and run again.
If these numbers stay the same but the application crashes - then this is definitely an issue with the project application, and it should be reported to their admins.
Anyway please report these results back to us, and we will check that there is no issue on our side.

davidpanderson · 2024-08-09T08:35:06Z

AFAIK the client doesn't have a mechanism for pinning a job to a GPU.
I need to verify that it rewrites init_data.xml before restarting a job;
otherwise a collision could happen.

RichardHaselgrove · 2024-08-09T08:49:38Z

The same issue is a significant problem at GPUGrid.

init_data contains the correct <gpu_device_num> for a running task. But if BOINC is stopped and restarted, there is no guarantee that the same GPU will be assigned by BOINC. If the new GPU is identical to the previous run, the task restarts normally.

But if it is not identical, the task crashes, potentially losing several hours of work. The crash is initiated by the project application, but could be prevented by the BOINC client remembering and reusing the device allocation at startup.

NB consider respecting previous OpenCL device numbers too, although I've only seen the problem for cuda apps.

davidpanderson · 2024-08-09T09:03:30Z

The issue is whether the task crashes because it runs an a different GPU than where it started,
or because 2 tasks are trying to use the same GPU.

The former seems odd - why would a checkpoint file be specific to a GPU instance?

RichardHaselgrove · 2024-08-09T09:13:28Z

I'm looking through my recent errors for an example of the specific failure case, but I haven't found one yet.

From memory, the problem occurs from the 'just in time' GPU code compiler. At GPUGrid, this produces code which is specific to the individual GPU type used in the first run, If the second GPU is different, the by now pre-compiled code is incompatible with the hardware.

RichardHaselgrove · 2024-08-09T09:30:08Z

Can't find an error on my own machines - I know from bitter experience that I have to avoid shutdowns when GPUGrid work is running.

But see https://www.gpugrid.net/forum_thread.php?id=5461 for a report/response on their message board.

AenBleidd · 2024-08-09T10:39:57Z

The former seems odd - why would a checkpoint file be specific to a GPU instance?

@davidpanderson, as @RichardHaselgrove already mentioned, it's very important that the task that started to run on particular GPU will stick to it forever, otherwise it's not guaranteed that the computation could be continued even from the checkpoint.

sr1gh · 2024-08-09T14:54:48Z

Yes, it appears that some GPU apps generate the code for the specific hardware used. Here is the error output from a failed task from asteroids at home.

<stderr_txt>
BOINC client version 8.0.4
BOINC GPU type 'ATI', deviceId=1, slot=0
Application: period_search_10220_windows_x86_64__opencl_102_amd_win.exe
Version: 102.20.0.0
Platform name: AMD Accelerated Parallel Processing
Platform vendor: Advanced Micro Devices, Inc.
OpenCL device C version: OpenCL C 2.0 | OpenCL 2.0 AMD-APP (3617.0)
OpenCL device Id: 1
OpenCL device name: AMD Radeon RX 6600 7GB
Device driver version: 3617.0 (PAL,LC)
Multiprocessors: 14
Max Samplers: 16
Max work item dimensions: 3
Resident blocks per multiprocessor: 16
Grid dim: 448 = 2 * 14 * 16
Block dim: 128
Binary build log for AMD Radeon RX 6600:
OK (0)
Program build log for AMD Radeon RX 6600:
OK (0)
Prefered kernel work group size multiple: 32
Setting Grid Dim to 256
Platform name: AMD Accelerated Parallel Processing
Platform vendor: Advanced Micro Devices, Inc.
OpenCL device C version: OpenCL C 2.0 | OpenCL 2.0 AMD-APP (3617.0)
OpenCL device Id: 0
OpenCL device name: AMD Radeon RX 7600 XT 15GB
Device driver version: 3617.0 (PAL,LC)
Multiprocessors: 16
Max Samplers: 16
Max work item dimensions: 3
Resident blocks per multiprocessor: 16
Grid dim: 512 = 2 * 16 * 16
Block dim: 128
Build log: AMD Accelerated Parallel Processing | AMD Radeon RX 7600 XT:
Error: The program ISA amdgcn-amd-amdhsa--gfx1032 is not compatible with the device ISA amdgcn-amd-amdhsa--gfx1102
Error: create kernel metadata map using COMgr
Error: Cannot Find Global Var Sizes
Error: Cannot create kernels.

Error creating queue: build program failure(-11)

</stderr_txt>

sr1gh · 2024-08-09T15:23:14Z

The <gpu_device_num> values appeared the same after resuming computation, but in BOINC manager, the task that said "device 0" likely said "device 1" before the error, but the error happens immediately after resuming, so it is hard to tell, although I have seen this swap occur with other applications from other projects. And the following error from the above post would indicate that the tasks are sometimes swapping GPUs:

Error: The program ISA amdgcn-amd-amdhsa--gfx1032 is not compatible with the device ISA amdgcn-amd-amdhsa--gfx1102

gfx1032 is RX 6600
gfx1102 is RX 7600 XT

davidpanderson · 2024-08-10T03:31:10Z

One option would be for the app to compile its kernels each time it starts.

davidpanderson · 2024-08-10T03:45:58Z

If we pin each GPU job to a GPU instance, the following could happen:
jobs A and B are running on GPUs 0 and 1 respectively.
Job C arrives, with an early deadline, so it preempts job A and starts running on GPU 0.
Job B finishes.

We now have 2 jobs pinned to GPU 0; GPU 1 is idle.
The work fetch logic (which doesn't know about GPU assignments)
thinks that both GPUs are busy, so it doesn't fetch more jobs.

To avoid this, we'd have to extend the simulation done by the work fetch logic
to model GPU assignments (in addition to per-project GPU exclusions,
max concurrency, etc.). This would be quite difficult.
It would be better if apps could recompile their kernels on startup.

AenBleidd added C: Client P: Undetermined T: Defect labels Aug 9, 2024

AenBleidd added this to Client/Manager Aug 9, 2024

github-project-automation bot moved this to Backlog in Client/Manager Aug 9, 2024

AenBleidd added this to the Client/Manager milestone Aug 9, 2024

davidpanderson mentioned this issue Aug 13, 2024

address issue with max concurrent and work fetch #5755

Merged

AenBleidd added the R: fixed label Aug 13, 2024

AenBleidd modified the milestones: Client/Manager, Client/Manager 8.0.5 Aug 13, 2024

AenBleidd removed the R: fixed label Aug 13, 2024

AenBleidd modified the milestones: Client/Manager 8.0.5, Client/Manager Aug 13, 2024

AenBleidd removed the T: Defect label Oct 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU task switching causes computation errors for asteroids@home when using 2 or more different models of gpu of the same type #5743

GPU task switching causes computation errors for asteroids@home when using 2 or more different models of gpu of the same type #5743

sr1gh commented Aug 9, 2024

AenBleidd commented Aug 9, 2024

davidpanderson commented Aug 9, 2024

RichardHaselgrove commented Aug 9, 2024

davidpanderson commented Aug 9, 2024

RichardHaselgrove commented Aug 9, 2024

RichardHaselgrove commented Aug 9, 2024

AenBleidd commented Aug 9, 2024

sr1gh commented Aug 9, 2024

sr1gh commented Aug 9, 2024

davidpanderson commented Aug 10, 2024

davidpanderson commented Aug 10, 2024

GPU task switching causes computation errors for asteroids@home when using 2 or more different models of gpu of the same type #5743

GPU task switching causes computation errors for asteroids@home when using 2 or more different models of gpu of the same type #5743

Comments

sr1gh commented Aug 9, 2024

AenBleidd commented Aug 9, 2024

davidpanderson commented Aug 9, 2024

RichardHaselgrove commented Aug 9, 2024

davidpanderson commented Aug 9, 2024

RichardHaselgrove commented Aug 9, 2024

RichardHaselgrove commented Aug 9, 2024

AenBleidd commented Aug 9, 2024

sr1gh commented Aug 9, 2024

sr1gh commented Aug 9, 2024

davidpanderson commented Aug 10, 2024

davidpanderson commented Aug 10, 2024