Copying HostBuffer is broken #458

Open
maleadt opened this issue Aug 12, 2024 · 2 comments
Labels
bug (Something isn't working), needs info (Further information is requested), upstream (Out of our hands.)

Comments

@maleadt
Member

maleadt commented Aug 12, 2024

Device buffers:

julia> c = oneVector{Int}([1])
1-element oneArray{Int64, 1, oneAPI.oneL0.DeviceBuffer}:
 1

julia> d = oneVector{Int}([1])
1-element oneArray{Int64, 1, oneAPI.oneL0.DeviceBuffer}:
 1

julia> Array(c .+ d)
1-element Vector{Int64}:
 2

Host buffers:

julia> c = oneVector{Int,oneL0.HostBuffer}([1])
1-element oneArray{Int64, 1, oneAPI.oneL0.HostBuffer}:
 1

julia> d = oneVector{Int,oneL0.HostBuffer}([1])
1-element oneArray{Int64, 1, oneAPI.oneL0.HostBuffer}:
 1

julia> Array(c .+ d)
1-element Vector{Int64}:
 0

Mixed buffers:

julia> c = oneVector{Int,oneL0.HostBuffer}([1])
1-element oneArray{Int64, 1, oneAPI.oneL0.HostBuffer}:
 1

julia> d = oneVector{Int,oneL0.DeviceBuffer}([1])
1-element oneArray{Int64, 1, oneAPI.oneL0.DeviceBuffer}:
 1

julia> Array(c .+ d)
1-element Vector{Int64}:
 1

On PVC hardware, this triggers an UNKNOWN_ERROR in subsequent command list submissions:

julia> c = oneVector{Int,oneL0.HostBuffer}([1])
1-element oneArray{Int64, 1, oneAPI.oneL0.HostBuffer}:
 1

julia> d = oneVector{Int,oneL0.DeviceBuffer}([1])
1-element oneArray{Int64, 1, oneAPI.oneL0.DeviceBuffer}:
 1

julia> Array(c .+ d)
Array(c .+ d) = [1]
1-element Vector{Int64}:
 1

# synchronize doesn't trap this error
julia> oneAPI.synchronize()

# a new command list submission does
julia> a = oneVector([1])
ERROR: ZeError: unknown or internal error (code 2147483646, ZE_RESULT_ERROR_UNKNOWN)
Stacktrace:
  [1] throw_api_error(res::oneAPI.oneL0._ze_result_t)
    @ oneAPI.oneL0 ~/oneAPI/lib/level-zero/libze.jl:8
  [2] check
    @ ~/oneAPI/lib/level-zero/libze.jl:19 [inlined]
  [3] zeCommandQueueExecuteCommandLists
    @ ~/oneAPI/lib/utils/call.jl:24 [inlined]

It's idiotic that the driver doesn't report this earlier...

@maleadt maleadt added the bug Something isn't working label Aug 12, 2024
@maleadt maleadt changed the title HostBuffer broadcast is broken HostBuffer use in kernels is broken Aug 13, 2024
@maleadt maleadt changed the title HostBuffer use in kernels is broken Copying HostBuffer is broken Aug 13, 2024
@maleadt
Member Author

maleadt commented Aug 13, 2024

This apparently extends to all uses of host buffers:

using oneAPI

a = oneVector{Int,oneL0.HostBuffer}([1])
function memset(ptr, val)
    unsafe_store!(ptr, val)
    return
end
@oneapi memset(pointer(a), 42)

@show Array(a)
Array(a) = [1]

The API trace looks correct though:

zeDriverGet: pCount = 0x7fffccef7000 (Count = 0) phDrivers = 0
zeDriverGet: pCount = 0x7fffccef7000 (Count = 1) phDrivers = 0x7fffceeddae0 (hDrivers = 0x2)
zeDeviceGet: hDriver = 0x1fc4040 pCount = 0x7fffccef7e30 (Count = 0) phDevices = 0
zeDeviceGet: hDriver = 0x1fc4040 pCount = 0x7fffccef7e30 (Count = 8) phDevices = 0x7fffccbace00 (hDevices = 0)
zeContextCreate: hDriver = 0x1fc4040 desc = 0x7fffcefa6610 {ZE_STRUCTURE_TYPE_CONTEXT_DESC(0xd) 0 0} phContext = 0x7fffcce31950 (hContext = 0x7fffe22ce3c0)
zeMemAllocHost: hContext = 0x4f95f70 host_desc = 0x7fffcefc5b10 {ZE_STRUCTURE_TYPE_HOST_MEM_ALLOC_DESC(0x16) 0 0} size = 8 alignment = 8 pptr = 0x7fffcce337c0 (ptr = 0x7fffced76450)
zeCommandQueueCreate: hContext = 0x4f95f70 hDevice = 0x1fc4380 desc = 0x7fffcee69510 {ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC(0xe) 0 0 0 0 0 0} phCommandQueue = 0x7fffccd86ef0 (hCommandQueue = 0x7fffc4cbe5a0)
zeCommandQueueSynchronize: hCommandQueue = 0x504d0e0 timeout = 18446744073709551615
zeModuleCreate: hContext = 0x4f95f70 hDevice = 0x1fc4380 desc = 0x7fffcf4f3350 {ZE_STRUCTURE_TYPE_MODULE_DESC(0x1b) 0 0 516 0x7ffc54563b98  0x7fffffffb120} phModule = 0x7ffc542121a0 (hModule = 0x7ffc9490cb40) phBuildLog = 0x7ffc54212190 (hBuildLog = 0x7fffe2306490)
zeKernelCreate: hModule = 0x59dd380 desc = 0x7fffce9239d0 {ZE_STRUCTURE_TYPE_KERNEL_DESC(0x1d) 0 0 "_Z6memsetP5Int64S_"} phKernel = 0x7ffc542121d0 (hKernel = 0x7fffe234cc60)
zeKernelSetArgumentValue: hKernel = 0x5b368a0 argIndex = 0 argSize = 8 pArgValue = 0x7fffcc4ec670
zeKernelSetArgumentValue: hKernel = 0x5b368a0 argIndex = 1 argSize = 8 pArgValue = 0x7fffcc4ec680
zeKernelSetGroupSize: hKernel = 0x5b368a0 groupSizeX = 1 groupSizeY = 1 groupSizeZ = 1
zeCommandListCreate: hContext = 0x4f95f70 hDevice = 0x1fc4380 desc = 0x7ffc5374f250 {ZE_STRUCTURE_TYPE_COMMAND_LIST_DESC(0xf) 0 0 0} phCommandList = 0x7fffcc386620 (hCommandList = 0x396)
zeCommandListAppendLaunchKernel: hCommandList = 0x6780860 hKernel = 0x5b368a0 (_Z6memsetP5Int64S_) pLaunchFuncArgs = 0x7ffc5374f270 {1, 1, 1} hSignalEvent = 0 numWaitEvents = 0 phWaitEvents = 0x7ffc5374f2c0 (hWaitEvents = 0)
zeCommandListClose: hCommandList = 0x6780860
zeCommandQueueExecuteCommandLists: hCommandQueue = 0x504d0e0 numCommandLists = 1 phCommandLists = 0x7ffc547144d0 (hCommandLists = 0x6780860) hFence = 0
zeCommandQueueSynchronize: hCommandQueue = 0x504d0e0 timeout = 18446744073709551615

Note that the trace contains no explicit API memcpy: because this memory is readable on the CPU, we should be able to read it back with a plain CPU-based memcpy. That assumption appears not to hold, since removing our memcpy optimization and always using an explicit zeCommandListAppendMemoryCopy with host memory does return the correct result.

C reproducer:

#include <level_zero/ze_api.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHECK_ERROR(x)                                                         \
  {                                                                            \
    /* Evaluate x once so a failing call is not issued a second time. */       \
    ze_result_t error = (x);                                                   \
    if (error != ZE_RESULT_SUCCESS) {                                          \
      const char *errorString = NULL;                                          \
      zeDriverGetLastErrorDescription(NULL, &errorString);                     \
      fprintf(stderr, "Error: %d - %s\n", error,                               \
              errorString ? errorString : "(no description)");                 \
      abort();                                                                 \
    }                                                                          \
  }

char *read_binary_file(const char *filename, size_t *size) {
  FILE *file = fopen(filename, "rb");
  if (!file) {
    fprintf(stderr, "Failed to open file: %s\n", filename);
    return NULL;
  }

  fseek(file, 0, SEEK_END);
  *size = ftell(file);
  fseek(file, 0, SEEK_SET);

  char *buffer = (char *)malloc(*size);
  if (!buffer) {
    fclose(file);
    return NULL;
  }

  fread(buffer, 1, *size, file);
  fclose(file);
  return buffer;
}

int main() {
  // Initialize the driver
  CHECK_ERROR(zeInit(0));
  uint32_t driverCount = 0;
  CHECK_ERROR(zeDriverGet(&driverCount, NULL));
  // Request a single handle so the one-element destination cannot overflow.
  driverCount = 1;
  ze_driver_handle_t driver;
  CHECK_ERROR(zeDriverGet(&driverCount, &driver));

  // Create a context
  ze_context_desc_t contextDesc = {ZE_STRUCTURE_TYPE_CONTEXT_DESC, NULL, 0};
  ze_context_handle_t context;
  CHECK_ERROR(zeContextCreate(driver, &contextDesc, &context));

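  // Get the device. Request a single handle so the one-element destination
  // cannot overflow on systems that expose several devices.
  uint32_t deviceCount = 0;
  CHECK_ERROR(zeDeviceGet(driver, &deviceCount, NULL));
  deviceCount = 1;
  ze_device_handle_t device;
  CHECK_ERROR(zeDeviceGet(driver, &deviceCount, &device));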

  // Create a command queue
  ze_command_queue_desc_t queueDesc = {ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC,
                                       NULL,
                                       0,
                                       0,
                                       0,
                                       ZE_COMMAND_QUEUE_MODE_DEFAULT,
                                       ZE_COMMAND_QUEUE_PRIORITY_NORMAL};
  ze_command_queue_handle_t queue;
  CHECK_ERROR(zeCommandQueueCreate(context, device, &queueDesc, &queue));

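  // Allocate host memory. The kernel stores an Int64, so reserve 8 bytes
  // with 8-byte alignment, matching the API trace (size = 8 alignment = 8).
  const size_t size = sizeof(int64_t);
  ze_host_mem_alloc_desc_t hostDesc = {ZE_STRUCTURE_TYPE_HOST_MEM_ALLOC_DESC,
                                       NULL, 0};
  void *hostPtr;
  CHECK_ERROR(zeMemAllocHost(context, &hostDesc, size, 8, &hostPtr));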

  // Create a module from SPIR-V
  size_t binary_size;
  char *binary = read_binary_file("kernel.spv", &binary_size);
  if (!binary) {
    fprintf(stderr, "Failed to read SPIR-V binary\n");
    return 1;
  }
  ze_module_desc_t moduleDesc = {ZE_STRUCTURE_TYPE_MODULE_DESC,
                                 NULL,
                                 ZE_MODULE_FORMAT_IL_SPIRV,
                                 binary_size,
                                 binary,
                                 NULL,
                                 NULL};
  ze_module_handle_t module;
  ze_module_build_log_handle_t buildLog;
  CHECK_ERROR(zeModuleCreate(context, device, &moduleDesc, &module, &buildLog));

  // Prepare the kernel
  ze_kernel_desc_t kernelDesc = {ZE_STRUCTURE_TYPE_KERNEL_DESC, NULL, 0,
                                 "_Z6memsetP5Int64S_"};
  ze_kernel_handle_t kernel;
  CHECK_ERROR(zeKernelCreate(module, &kernelDesc, &kernel));
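  // The mangled name decodes to memset(Int64*, Int64), so pass an 8-byte
  // value (the API trace shows argSize = 8 for both arguments).
  int64_t value = 42;
  CHECK_ERROR(zeKernelSetArgumentValue(kernel, 0, sizeof(void *), &hostPtr));
  CHECK_ERROR(zeKernelSetArgumentValue(kernel, 1, sizeof(value), &value));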
  CHECK_ERROR(zeKernelSetGroupSize(kernel, 1, 1, 1));

  // Launch the kernel using a command list
  ze_command_list_desc_t cmdListDesc = {ZE_STRUCTURE_TYPE_COMMAND_LIST_DESC,
                                        NULL, 0, 0};
  ze_command_list_handle_t cmdList;
  CHECK_ERROR(zeCommandListCreate(context, device, &cmdListDesc, &cmdList));
  ze_group_count_t launchArgs = {1, 1, 1};
  CHECK_ERROR(zeCommandListAppendLaunchKernel(cmdList, kernel, &launchArgs,
                                              NULL, 0, NULL));
  CHECK_ERROR(zeCommandListClose(cmdList));
  CHECK_ERROR(zeCommandQueueExecuteCommandLists(queue, 1, &cmdList, NULL));

  // Synchronize
  CHECK_ERROR(zeCommandQueueSynchronize(queue, UINT64_MAX));

  // Print result
  printf("Result: %d\n", *(int *)hostPtr);

  // Cleanup
  zeMemFree(context, hostPtr);
  zeCommandListDestroy(cmdList);
  zeKernelDestroy(kernel);
  zeModuleDestroy(module);
  zeCommandQueueDestroy(queue);
  zeContextDestroy(context);
  free(binary);

  return 0;
}

This reads kernel.spv, attached here: kernel.zip

@kballeda @pengtu Can you explain what the issue is here? Shouldn't it be possible to read zeMemAllocHost-allocated memory using a simple CPU-based memcpy after having synchronized the command queue we issued a kernel on?

Also note that on PVC we hit a command list submission failure before getting that far, but let's debug the memcpy problem first.

@maleadt maleadt added needs info Further information is requested upstream Out of our hands. labels Aug 13, 2024
@kballeda
Contributor

@maleadt Please file a ticket against the Intel compute runtime for this. We will follow up on it internally.
