
add GPUCompiler precompilation caching #425

Draft · wants to merge 16 commits into master

Conversation

@collinwarner (Contributor) commented Apr 9, 2023

Adds the ability to precompile code into GPUCompiler.GLOBAL_CI_CACHES. Taps into the non-GPU caching of global constants to write out the current instance of the global cache and reload it on initialization. Requires the user to declare, initialize, and snapshot a local cache; the user then calls GPUCompiler.precompile_gpucompiler. Mainly this adds an API for downstream packages such as Enzyme and CUDA to cache instances of their functions. A sample SimpleGPU and Example.jl illustrate usage.

@maleadt (Member) commented Apr 9, 2023

You forgot to commit precompile_native.jl.

@codecov (bot) commented Apr 9, 2023

Codecov Report

Patch coverage has no change; project coverage changes by -10.21% ⚠️

Comparison is base (d5086fb) 87.08% compared to head (cc34d21) 76.87%.

❗ Current head cc34d21 differs from the pull request's most recent head 1951087. Consider uploading reports for commit 1951087 to get more accurate results.

Additional details and impacted files
@@             Coverage Diff             @@
##           master     #425       +/-   ##
===========================================
- Coverage   87.08%   76.87%   -10.21%     
===========================================
  Files          24       25        +1     
  Lines        2943     2993       +50     
===========================================
- Hits         2563     2301      -262     
- Misses        380      692      +312     
Impacted Files                 Coverage Δ
src/GPUCompiler.jl             100.00% <ø> (ø)
src/jlgen.jl                   66.86% <0.00%> (-16.57%) ⬇️
src/precompilation_cache.jl    0.00% <0.00%> (ø)

... and 14 files with indirect coverage changes


@maleadt (Member) commented Apr 11, 2023

Could you explain what the purpose/design of this PR is? It's not at all clear to me, and looking downstream, lots of functionality is entirely unused (e.g. reinit_cache).

I'm not sure why this needs anything in GPUCompiler.jl at all. Shouldn't it be sufficient for downstream packages to trigger a compilation to cache whatever they need, e.g., the way JET.jl does it: https://github.com/aviatesk/JET.jl/blob/b688eda6eb50a18e9e218d32650d2de23f085d50/src/JET.jl#L1382-L1396
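For context, the JET.jl pattern linked above boils down to executing the workload while the package itself is being precompiled, so the results land in caches that are serialized with it. A minimal sketch of that idea (all names hypothetical, and this only exercises Julia's native cache, not GPUCompiler's):

module DownstreamPackage

using GPUCompiler

# hypothetical workload we want to be fast after loading the package
kernel(x) = x + one(x)

# running precompile (or the workload itself) at top level executes
# during package precompilation, so anything cached as a side effect
# ends up in this package's .ji file
precompile(kernel, (Int,))

end # module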

@maleadt marked this pull request as draft on April 11, 2023 07:44
@collinwarner (Contributor, author) commented:

> Could you explain what the purpose/design of this PR is? It's not at all clear to me, and looking downstream, lots of functionality is entirely unused (e.g. reinit_cache).
>
> I'm not sure why this needs anything in GPUCompiler.jl at all. Shouldn't it be sufficient for downstream packages to trigger a compilation to cache whatever they need, e.g., the way JET.jl does it: https://github.com/aviatesk/JET.jl/blob/b688eda6eb50a18e9e218d32650d2de23f085d50/src/JET.jl#L1382-L1396

Updated initial comment and added some example code. Hope this clears some things up!

@maleadt (Member) commented Apr 11, 2023

Not really, sorry. Could you describe what problem you want to solve, why it doesn't work with current precompilation tools, and why you opted for the design you did? Those global, undocumented macros (doing questionable things) are a very non-Julian API.

@collinwarner (Contributor, author) commented:

> Not really, sorry. Could you describe what problem you want to solve, why it doesn't work with current precompilation tools, and why you opted for the design you did? Those global, undocumented macros (doing questionable things) are a very non-Julian API.

The main issue is that GPUCompiler's GLOBAL_CI_CACHES are not persistent across reruns. This commit fixes that, though it requires some user input. It improves time-to-first-x for anything built on GPUCompiler. A pull request for Enzyme is in the works as one downstream use case; another is in CUDA.

I've been working with @vchuravy on this; an eventual extension would be to cache binary code between runs, not just type-inference results.

The reason for so much user involvement and the use of macros is that this was the simplest way forward. We use macros to create a local cache, outside of the user's control, with a unique id that does not conflict with the user's code; we want a unique cache to eliminate duplicate entries. We also tried making all of this run at init time, but that was too late: the caches had already been serialized by that point, so we needed user involvement.

We definitely want to try to reduce this, but it is a first polished attempt at the matter.
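To illustrate the "unique local cache" point: a purely hypothetical sketch of what such a macro could expand to (not the PR's actual implementation) is

# Hypothetical sketch only: declare a module-local cache under a
# gensym'd name, so it cannot collide with anything in the user's code.
macro declare_cache()
    name = gensym(:LOCAL_GPUCOMPILER_CACHE)
    esc(:(const $name = Base.IdDict()))
end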

@maleadt (Member) commented Apr 12, 2023

> The main issue is that GPUCompiler's GLOBAL_CI_CACHES are not persistent across reruns.

Why not? It's just a global dict, why doesn't it get serialized in the .ji file?

@collinwarner (Contributor, author) commented:

>> The main issue is that GPUCompiler's GLOBAL_CI_CACHES are not persistent across reruns.
>
> Why not? It's just a global dict, why doesn't it get serialized in the .ji file?

It is serialized; it just happens too early in the process. By the time the dependent packages have inserted into the cache, it is too late for the global. Additionally, multiple downstream packages can now mutate the cache and still each see cache improvements.

@maleadt (Member) commented Apr 12, 2023

> It is serialized; it just happens too early in the process.

Repeating my comment from Slack: Is this because the global is serialized as part of the GPUCompiler.ji, and isn't part of, e.g., CUDA.jl's precompilation image? In that case, you could override ci_cache and use a Dict that's serialized as part of the downstream package, in order to avoid this complexity.

If that turns out to be the way to do it, we could even remove the global CI cache here to force users to bring their own (and thus get proper precompilation of GPUCompiler-inferred CIs).
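A rough sketch of what that alternative might look like, assuming GPUCompiler.ci_cache can be overridden for a package's own compiler parameters (MyParams and the cache constant below are hypothetical, and the CompilerJob type parameters are an assumption about the current layout):

using GPUCompiler

struct MyParams <: GPUCompiler.AbstractCompilerParams end

# a CodeCache owned by (and serialized with) the downstream package
const LOCAL_CI_CACHE = GPUCompiler.CodeCache()

# route cache lookups for our jobs to the package-local cache instead
# of GPUCompiler.GLOBAL_CI_CACHES
GPUCompiler.ci_cache(::GPUCompiler.CompilerJob{<:Any,MyParams}) = LOCAL_CI_CACHE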

@vchuravy (Member) commented:

So the overarching design consideration is:

- Users of Enzyme.jl/CUDA.jl/AMDGPU.jl should be able to "precompile" their code.
- Where can we store these precompilation results, while also ensuring that they get invalidated properly?

Each user package will need to declare an "anchor"/cache that will be serialized into the .ji of this package. So the workflow is something like:

module ParallelStencil

using CUDA
using Enzyme
import GPUCompiler

GPUCompiler.@declare_cache() # anchor

f(x) = 1
CUDA.precompile_cuda(f, (Int, ))
Enzyme.precompile_fwd(f, (Int, ))

function __init__()
    GPUCompiler.@reinit_cache()
end

GPUCompiler.@snapshot_cache()

end # module

So it is not the downstream packages of GPUCompiler that need to bring their own cache, but the user of those packages.

We use the cache file of ParallelStencil to save the cache entries that were discovered during precompilation of PS, and we then need to re-insert those cache entries into the global cache.

That's at least the high-level design Collin and I came up with.
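A toy model of that snapshot/delta/re-insert flow on plain dictionaries (illustrative only; the real entries are MethodInstance => CodeInstance mappings):

# stand-in for GPUCompiler's global caches
global_cache = Dict{Symbol,Int}(:f => 1)

snapshot = copy(global_cache)   # taken before the precompile workload
global_cache[:g] = 2            # entry discovered while precompiling PS

# the delta is what gets serialized into ParallelStencil's .ji file
delta = Dict(k => v for (k, v) in global_cache if !haskey(snapshot, k))

# at __init__ time, re-insert the saved entries into the fresh cache
merge!(global_cache, delta)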

@vchuravy (Member) left a review comment:


The new content in examples should likely go into test.

(Three review threads on src/precompile_native.jl, marked outdated and resolved.)
CodeCache(cache::CodeCache) = new(GPUCompiler.copyAndFilter(cache.dict))
end

function copyAndFilter(dict::IdDict)
Member: What is this needed for?

Author: That is used in https://github.com/collinwarner/GPUCompiler.jl/blob/3dbe9d5b7c7c5f56f18553f0e4d4bd9c2bdcaca5/src/precompile_native.jl#L102

It creates a CodeCache that contains only unbounded entries. Used when snapshotting.

Member: Can we just write this as filter(validate_codecache, cache.dict), where validate_codecache is:

function validate_codecache(cache)
    for ci in cache
        if ci.max_world < typemax(typeof(ci.max_world))
            return false
        end
        return true  # NB: returns after inspecting only the first entry
    end
end

But that seems overeager: are we guaranteed just one entry? Or do we want to remove all CIs that don't have an unbounded max_world?
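If the intent is the latter, a per-CodeInstance filter might look like this (a sketch; the semantics are an assumption about what "remove all CIs" means here):

# drop the individual CodeInstances whose validity window is finite,
# keeping only the ones valid for all future world ages
keep_unbounded(cis) = filter(ci -> ci.max_world == typemax(typeof(ci.max_world)), cis)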

@maleadt (Member) commented Apr 14, 2023

Why does the API consist of macros? Why doesn't something like this work:

module DownstreamPackage

using GPUCompiler, CUDA

const cache_snapshot = GPUCompiler.ci_cache_snapshot()
include("precompile.jl")
const cache = GPUCompiler.ci_cache_delta(cache_snapshot)

__init__() = GPUCompiler.ci_cache_insert(cache)

end

@collinwarner (Contributor, author) commented Apr 16, 2023

> Why does the API consist of macros? Why doesn't something like this work:
>
> module DownstreamPackage
>
> using GPUCompiler, CUDA
>
> const cache_snapshot = GPUCompiler.ci_cache_snapshot()
> include("precompile.jl")
> const cache = GPUCompiler.ci_cache_delta(cache_snapshot)
>
> __init__() = GPUCompiler.ci_cache_insert(cache)
>
> end

That would seem to work; updating now.

@maleadt (Member) commented Apr 17, 2023

Downstream packages probably should not serialize the entire cache snapshot, but rather do something like:

module DownstreamPackage

using GPUCompiler, CUDA

const cache = let
    cache_snapshot = GPUCompiler.ci_cache_snapshot()
    include("precompile.jl")
    GPUCompiler.ci_cache_delta(cache_snapshot)
end


__init__() = GPUCompiler.ci_cache_insert(cache)

end

But that doesn't change the actual API.

@collinwarner (Contributor, author) commented Apr 23, 2023

Changed the API to follow @maleadt's advice, which leads to a cleaner interface. Added an example kernel with caching at test/ExamplePersistentCache/GPUKernel.jl. Using this you get a persistent cache, which reduces recompilation time on consecutive calls of using Package when restarting Julia.

@collinwarner (Contributor, author) commented:

Remaining work is to test integration with downstream packages such as Enzyme, Oceananigans, CUDA, AMDGPU, etc. Additionally, there are potentially some algorithmic improvements to the merge algorithm, to bring precompile times with and without this feature more in line.

@collinwarner (Contributor, author) commented:

[screenshot attached]

"""
Reloads Global Cache from global variable which stores the previous
cached results
"""
function reinit_cache(LOCAL_CACHE)
Member: What is this used for?

Author: Oops, remnant code.

Author: Dead code, removed.

@maleadt (Member) commented Apr 23, 2023

cc @aviatesk, this may be relevant to DAECompiler (as a workaround, until we have the ability to update another module's globals, i.e., a ci cache).

@collinwarner (Contributor, author) commented Apr 23, 2023

Seeing a greater performance improvement when used during Enzyme.jl's precompilation phase (EnzymeAD/Enzyme.jl#760):

[screenshot attached]

export ci_cache_snapshot, ci_cache_delta, ci_cache_insert, precompile_gpucompiler

function ci_cache_snapshot()
    cleaned_cache_to_save = IdDict()
Member: Is this just copy(GPUCompiler.GLOBAL_CI_CACHES)?

Author: There is an additional pass when constructing the CodeCache that removes CodeInstances with finite ranges. I could potentially split that process into two phases, copying then filtering, but since we were already doing one pass over the data I thought we could add the filtering in directly.

@collinwarner (Contributor, author) commented May 2, 2023

Improves downstream CUDA code. The workload creates two CuArrays and does a vectorized add:

[screenshot attached]
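For reference, the workload being timed is presumably along these lines (a sketch, not the exact benchmark script):

using CUDA

# two CuArrays and a vectorized (broadcast) add; the broadcast kernel is
# compiled through GPUCompiler, so it benefits from the persistent cache
a = CUDA.ones(Float32, 1024)
b = CUDA.ones(Float32, 1024)
c = a .+ b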
