Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PageLockScope issue #691

Closed
yuryGotham opened this issue Nov 15, 2021 · 6 comments · Fixed by #697
Closed

PageLockScope issue #691

yuryGotham opened this issue Nov 15, 2021 · 6 comments · Fixed by #697

Comments

@yuryGotham
Copy link

Not sure if it is a bug or a missing feature, but it appears to be impossible for multiple GPUs to use the same pinned memory on host.

Code:

            var startTime = DateTime.Now;
            double timeSpent;
            void LogTime(string s) { timeSpent = DateTime.Now.Subtract(startTime).TotalSeconds; Console.WriteLine($"{DateTime.Now}\tTime to {s}: {timeSpent:0.0000}s"); startTime = DateTime.Now; }

            var context = Context.Create().AllAccelerators().EnableAlgorithms().ToContext();
            LogTime("create context with " + context.GetCudaDevices().Count + " gpus");

            var accelerators = Enumerable.Range(0, context.GetCudaDevices().Count).Select(i => context.CreateCudaAccelerator(i)).ToArray();

            int size = 1024;
            var cpuShared = new float[size];
            var intPtr = GCHandle.Alloc(cpuShared, GCHandleType.Pinned).AddrOfPinnedObject();
            var scopes = new PageLockScope<float>[accelerators.Length];
            LogTime("create cpu array");

            for(int accId = 0; accId < accelerators.Length; accId++) {
                scopes[accId] = accelerators[accId].CreatePageLockFromPinned<float>(cpuShared);
                LogTime("create scope for accelerator " + accId);
            }

Output:

Unhandled exception. ILGPU.Runtime.Cuda.CudaException: part or all of the requested memory range is already mapped
   at ILGPU.Runtime.CudaPageLockScope`1..ctor(CudaAccelerator accelerator, IntPtr hostPtr, Int64 numElements)
   at ILGPU.Runtime.Cuda.CudaAccelerator.CreatePageLockFromPinnedInternal[T](IntPtr pinned, Int64 numElements)
   at ILGPU.Runtime.Accelerator.CreatePageLockFromPinned[T](IntPtr pinned, Int64 numElements)
   at ILGPU.Runtime.Accelerator.CreatePageLockFromPinned[T](T[] pinned)

I suspect that CreatePageLockFromPinned() is trying call cudaHostRegister a second time when it does not need to - it only needs to call cudaHostGetDevicePointer for subsequent accelerators.

@MoFtZ
Copy link
Collaborator

MoFtZ commented Nov 15, 2021

hi @yuryGotham.

ILGPU does not keep track of the memory range that is being locked - it is the responsibility of the calling application to manage this, and ensure that the same memory range is not locked twice.

Please note that ILGPU is using the cudaHostRegisterPortable flag when making the call to cudaHostRegister. This means that the device pointer "returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation". Therefore, it is not necessary to call page lock on each accelerator.

@yuryGotham
Copy link
Author

Hi @MoFtZ,

It looks like I'm able to get the pinned memory to work for the most part. One thing that I'm still unable to figure out is how to do a partial copy between MemoryBuffer and PageLockScope.

In my scenario PageLockScopes are created ahead of time and used many times (creating a scope is slow). The amount of memory that I need to copy from PageLockScope to MemoryBuffer varies during runtime. I can dynamically point to a region of MemoryBuffer by using View.SubView(), but I cannot point to a region of PageLockScope and creating a new PageLockScope each time is VERY slow.

Is it possible to add the functionality to point to a part of PageLockScope? Even being able to point to first X bytes of PageLockScope would solve my specific problem.

@MoFtZ
Copy link
Collaborator

MoFtZ commented Nov 24, 2021

@yuryGotham @m4rs-mt
I have raised a PR to allow copying to/from a subset of a page locked array. Instead of adding lots of new methods, I have simplified the API to use the existing CopyFrom and CopyTo ArrayView methods.

The PageLockedScope class now has an ArrayView property to allow it to be used with the existing copy methods.

@MoFtZ
Copy link
Collaborator

MoFtZ commented Dec 3, 2021

@yuryGotham now that #697 has been postponed, you can use this workaround in ILGPU v1.0.

using var memoryBuffer = CPUMemoryBuffer.Create(
    accelerators[accId],
    scopes[accId].AddrOfLockedObject,
    scopes[accId].Length,
    Interop.SizeOf<float>());

var arrayView = memoryBuffer.AsArrayView<float>(0L, memoryBuffer.Length);

This creates a memory buffer over the page locked array. And from the memory buffer, we get an array view. This array view can then be sliced and used in copy operations.

@yuryGotham
Copy link
Author

@MoFtZ how do I copy the arrayView to gpu using locked/async method? I'm currently calling gpuBuffer.View.CopyToPageLockedAsync(stream, pageLockedScope). CopyToPageLockedAsync does not accept ArrayView as an argument - is there a way to cast ArrayView to PageLockedArray?

@MoFtZ
Copy link
Collaborator

MoFtZ commented Dec 21, 2021

@yuryGotham once the CPU buffer is page-locked, you can just copy to/from the buffer as normal.

Here is a sample program:

static void Main()
{
    using var context = Context.CreateDefault();
    using var accelerator = context.CreateCudaAccelerator(0);

    // Create a pinned CPU buffer.
    int size = 1024;
    var cpuShared = new float[size];
    var gcHandle = GCHandle.Alloc(cpuShared, GCHandleType.Pinned);
    try
    {
        // Page lock the pinned CPU buffer.
        using var pageLockScope = accelerator.CreatePageLockFromPinned(cpuShared);

        // Wrap CPU buffer in an array view.
        using var cpuMemoryBuffer = CPUMemoryBuffer.Create(
            accelerator,
            pageLockScope.AddrOfLockedObject,
            pageLockScope.Length,
            Interop.SizeOf<float>());
        var cpuArrayView = cpuMemoryBuffer.AsArrayView<float>(0, cpuMemoryBuffer.Length);

        // Allocate a GPU buffer.
        using var gpuBuffer = accelerator.Allocate1D<float>(256);

        // Copy a subset of the pinned/page-locked CPU view to the GPU.
        var sourceView = cpuArrayView.SubView(32, 256);
        var destinationView = gpuBuffer.View;

        // NOTE: CopyFrom does not have an implicit call to stream.Synchronize.
        destinationView.CopyFrom(sourceView);
        accelerator.Synchronize();
    }
    finally
    {
        gcHandle.Free();
    }
}

@m4rs-mt m4rs-mt added this to the v2.0 milestone Jun 9, 2022
@MoFtZ MoFtZ closed this as completed in #697 Jan 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants