
[NDTensors] JLArrays Extension #1508

Merged: 44 commits, Sep 24, 2024

Conversation

@kmp5VT (Collaborator) commented Jun 21, 2024

Description

Introduce the JLArrays extension to test GPU code paths for ITensors on the CPU.

Checklist:

  • Introduce JLArrays and make all NDTensors tests pass with the JLArray backend
  • Potentially move functions from GPU-array-specific libraries to GPUArraysCoreExt
  • Update the ITensors test suite to run with JLArray, testing a GPU-adjacent array type on the CPU (see the sketch below)
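
For context, JLArrays.jl provides JLArray, a reference GPU-style array type that runs on the CPU, so GPU code paths can be exercised without GPU hardware. A minimal sketch of the idea (assuming JLArrays' exported jl adaptor):

using JLArrays
a = jl(randn(3, 3))  # adapt a CPU Array to a JLArray
b = a * a'           # GPU-style array operations, executed on the CPU
collect(b)           # bring the result back to a plain Array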

@kmp5VT (Collaborator, Author) commented Jun 21, 2024

There is an issue in JLArrays: the resize! function is missing (JuliaGPU/GPUArrays.jl#541).
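
A minimal reproduction of the missing method (a hypothetical snippet, assuming JLArrays' exported JLArray constructor):

using JLArrays
v = JLArray(randn(4))
resize!(v, 8)  # MethodError on affected versions: no resize!(::JLArray, ::Int)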

[Review thread on NDTensors/Project.toml — outdated, resolved]
Comment on lines 26 to 29
if "jlarrays" in ARGS || "all" in ARGS
Pkg.add("JLArrays")
using JLArrays
end
Member:
I don't think there is a reason not to run the tests for JLArrays since they run on the CPU anyway. So I think it should just be added as a normal test dependency and inserted into the device list by default.

@kmp5VT (Collaborator, Author) commented Jul 2, 2024

So I am seeing an issue with the GPU tests when JLArrays is in the Project.toml. To fix the problem, I moved JLArrays to [extras] and add it to the project if isempty(ARGS) || "base" in ARGS. However, this would run into a problem if the test args are ["base","metal"]. Right now we aren't testing in that way, though if you would prefer, I can only include JLArrays if no GPUs are being tested, i.e. only allow JLArrays if isempty(ARGS). Let me know, thanks!
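
A sketch of the arrangement described above (hypothetical; the actual runtests logic may differ):

using Pkg
# JLArrays lives in [extras]; add it only for the default/"base" test set,
# so GPU-only runs (e.g. ARGS = ["metal"]) don't pull it in.
if isempty(ARGS) || "base" in ARGS
  Pkg.add("JLArrays")
  using JLArrays
end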

Member:

What's the issue you see? It would be best to try to address it and have JLArrays as a dependency in Project.toml, run by default in the tests as we planned, so I would prefer to aim for that rather than work around the issue with a more convoluted solution.

Collaborator (Author):

The issue is that GPUCompiler (a dependency of GPUArrays) fails to precompile during Pkg.test:

ERROR: LoadError: UndefVarError: `SimplifyCFGPassOptions` not defined
Stacktrace:
 [1] top-level scope
   @ ~/.julia/packages/GPUCompiler/nWT2N/src/optim.jl:57
 [2] include(mod::Module, _path::String)
   @ Base ./Base.jl:495
 [3] include(x::String)
   @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/GPUCompiler.jl:1
 [4] top-level scope
   @ ~/.julia/packages/GPUCompiler/nWT2N/src/GPUCompiler.jl:37
 [5] include
   @ Base ./Base.jl:495 [inlined]
 [6] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt128}}, source::String)
   @ Base ./loading.jl:2216
 [7] top-level scope
   @ stdin:3
in expression starting at /home/jenkins/workspace/ITensor_ITensors.jl_PR-1508@tmp/.julia/packages/GPUCompiler/nWT2N/src/optim.jl:56
in expression starting at /home/jenkins/workspace/ITensor_ITensors.jl_PR-1508@tmp/.julia/packages/GPUCompiler/nWT2N/src/GPUCompiler.jl:1
in expression starting at stdin:3
ERROR: LoadError: Failed to precompile GPUCompiler [61eb1bfa-7361-4325-ad38-22787b887f55] to "/home/jenkins/workspace/ITensor_ITensors.jl_PR-1508@tmp/.julia/compiled/v1.10/GPUCompiler/jl_xlhZZG".

However, I think I was able to fix the problem by moving using JLArrays to after the GPU using statements.
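
Roughly, the fix reorders the loading (a hypothetical sketch of the test preamble):

using Pkg
# Load real GPU backends first; loading JLArrays before them triggered the
# GPUCompiler precompile failure shown above.
if "cuda" in ARGS
  Pkg.add("CUDA")
  using CUDA
end
if isempty(ARGS) || "base" in ARGS
  Pkg.add("JLArrays")
  using JLArrays
end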

@codecov-commenter commented Jun 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.38%. Comparing base (82cfd76) to head (c0ebacb).
Report is 23 commits behind head on main.

Current head c0ebacb differs from pull request most recent head d6e675d

Please upload reports for the commit d6e675d to get more accurate results.


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1508      +/-   ##
==========================================
- Coverage   78.05%   77.38%   -0.68%     
==========================================
  Files         148      140       -8     
  Lines        9679     9103     -576     
==========================================
- Hits         7555     7044     -511     
+ Misses       2124     2059      -65     


@mtfishman marked this pull request as ready for review on September 5, 2024
@kmp5VT (Collaborator, Author) commented Sep 5, 2024

@mtfishman It looks like modifying the SparseArrays compat like you suggested does work. Unfortunately, the Julia 1.8 tests still fail for resize! with JLArrays because there is a compat restriction between GPUArrays and JLArrays which prevents this code from being used. Here are the compat restrictions for the JLArrays version:

(@v1.8) pkg> add JLArrays@0.1.5
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package Adapt [79e6a3ab]:
 Adapt [79e6a3ab] log:
 ├─possible versions are: 0.3.0-4.0.4 or uninstalled
 ├─restricted to versions 3.7.0-4 by NDTensors [23ae76d9], leaving only versions 3.7.0-4.0.4
 │ └─NDTensors [23ae76d9] log:
 │   ├─possible versions are: 0.3.42 or uninstalled
 │   └─NDTensors [23ae76d9] is fixed to version 0.3.42
 ├─restricted by compatibility requirements with GPUArrays [0c68f7d7] to versions: 4.0.0-4.0.4
 │ └─GPUArrays [0c68f7d7] log:
 │   ├─possible versions are: 0.3.0-10.3.1 or uninstalled
 │   ├─restricted by compatibility requirements with JLArrays [27aeb0d3] to versions: 10.0.0-10.3.1
 │   │ └─JLArrays [27aeb0d3] log:
 │   │   ├─possible versions are: 0.1.0-0.1.5 or uninstalled
 │   │   └─restricted to versions 0.1.5 by an explicit requirement, leaving only versions 0.1.5
 │   ├─restricted by julia compatibility requirements to versions: 0.3.0-10.2.3 or uninstalled, leaving only versions: 10.0.0-10.2.3
 │   └─restricted by compatibility requirements with CUDA [052768ef] to versions: [8.6.0-9.1.0, 10.0.1-10.3.1], leaving only versions: 10.0.1-10.2.3
 │     └─CUDA [052768ef] log:
 │       ├─possible versions are: 0.1.0-5.4.3 or uninstalled
 │       ├─restricted by compatibility requirements with StridedViews [4db3bf67] to versions: 4.0.0-5.4.3
 │       │ └─StridedViews [4db3bf67] log:
 │       │   ├─possible versions are: 0.1.0-0.3.1 or uninstalled
 │       │   └─restricted to versions 0.2.2-0.3 by NDTensors [23ae76d9], leaving only versions 0.2.2-0.3.1
 │       │     └─NDTensors [23ae76d9] log: see above
 │       └─restricted by compatibility requirements with GPUArrays [0c68f7d7] to versions: 5.2.0-5.4.3 or uninstalled, leaving only versions: 5.2.0-5.4.3
 │         └─GPUArrays [0c68f7d7] log: see above
 └─restricted by compatibility requirements with ArrayInterface [4fba245c] to versions: 3.0.0-3.7.2 — no versions left
   └─ArrayInterface [4fba245c] log:
     ├─possible versions are: 0.0.1-7.16.0 or uninstalled
     ├─restricted by compatibility requirements with LinearAlgebra [37e2e46d] to versions: 0.0.1-7.5.1 or uninstalled
     │ └─LinearAlgebra [37e2e46d] log:
     │   ├─possible versions are: 1.8.5 or uninstalled (package in sysimage!)
     │   └─restricted to versions 1.6.0-1 by NDTensors [23ae76d9], leaving only versions 1.8.5
     │     └─NDTensors [23ae76d9] log: see above
     ├─restricted by compatibility requirements with Compat [34da2185] to versions: [0.0.1-3.1.32, 6.0.2-7.16.0] or uninstalled, leaving only versions: [0.0.1-3.1.32, 6.0.2-7.5.1] or uninstalled
     │ └─Compat [34da2185] log:
     │   ├─possible versions are: 1.0.0-4.16.0 or uninstalled
     │   └─restricted to versions 4.9.0-4 by NDTensors [23ae76d9], leaving only versions 4.9.0-4.16.0
     │     └─NDTensors [23ae76d9] log: see above
     └─restricted by compatibility requirements with StaticArrayInterface [0d7ed370] to versions: 7.0.0-7.16.0, leaving only versions: 7.0.0-7.5.1
       └─StaticArrayInterface [0d7ed370] log:
         ├─possible versions are: 1.0.0-1.8.0 or uninstalled
         └─restricted by julia compatibility requirements to versions: 1.0.0-1.6.0 or uninstalled, leaving only versions: 1.0.0-1.6.0

and here are the packages I currently have:

(@v1.8) pkg> st
Status `~/.julia/environments/v1.8/Project.toml`
⌅ [46192b85] GPUArraysCore v0.1.5
⌃ [27aeb0d3] JLArrays v0.1.2
  [23ae76d9] NDTensors v0.3.42 `../../dev/ITensors/NDTensors`
Info Packages marked with ⌃ and ⌅ have new versions available, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated`

[Two review threads on NDTensors/Project.toml — outdated, resolved]
@mtfishman (Member):

Gotcha, thanks for checking.

I made similar changes to the compat entries of LinearAlgebra and Random, since those are also standard libraries and the CI log showed a similar issue for LinearAlgebra.

If after those changes it still shows that error you are seeing with compatibility of JLArrays, let's just stop testing the GPU backends (including JLArrays.jl) against Julia 1.8. That issue may be caused by the fact that GPUArrays.jl v10 requires Julia 1.10: https://github.com/JuliaGPU/GPUArrays.jl/blob/v10.3.1/Project.toml#L26, and JLArrays.jl v0.1.5 requires GPUArrays v10: https://github.com/JuliaGPU/GPUArrays.jl/blob/v10.3.1/lib/JLArrays/Project.toml#L13.

@mtfishman (Member):

@kmp5VT I'm not sure what's going on with the CI in Julia 1.8; it's showing the same compat issue with LinearAlgebra even though I changed the compat bound to allow LinearAlgebra v0. Let's just stop testing against Julia 1.8 in the Jenkins GPU CI workflow.

Beyond that, are there issues that need to be addressed before this can be merged? How is the JLArrays extension being tested, as part of the GitHub Actions CPU tests or as a Jenkins test? I don't see it run in Jenkins, so I assume it is the former, but if you could give a summary of the current status of this PR, including how the tests are working, that would be helpful since I've forgotten at this point.

[Review thread on jenkins/Jenkinsfile — outdated, resolved]
@kmp5VT (Collaborator, Author) commented Sep 23, 2024

@mtfishman All of the JLArrays code seems to work now! Two questions for this PR:

  1. Should I spend time making an NDTensorsGPUArraysCoreExt module and generalize functions which are duplicated across the GPU extension libraries?
  2. Should I try to introduce JLArrays to the ITensor tests in this PR or should we merge this and I start a new one for that process?

@mtfishman (Member):

> 1. Should I spend time making an `NDTensorsGPUArraysCoreExt` module and generalize functions which are duplicated across the GPU extension libraries?

I already see: https://github.com/ITensor/ITensors.jl/tree/main/NDTensors/ext/NDTensorsGPUArraysCoreExt. I guess you mean adding more functionality to that extension? Let's hold off on that and focus on wrapping up this PR and keeping it more self-contained.

> 2. Should I try to introduce JLArrays to the ITensor tests in this PR or should we merge this and I start a new one for that process?

Isn't it already running as part of the tests? I.e. the JLArrays.jl adaptor is added to the device list here: https://github.com/kmp5VT/ITensors.jl/blob/14922b1063676c02d8876d3403aaae8d7fa022f2/NDTensors/test/NDTensorsTestUtils/device_list.jl#L36.
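
For reference, a sketch of what such a device list can look like (hypothetical; the real logic is in the linked device_list.jl):

function devices_list(test_args)
  devs = []
  if isempty(test_args) || "base" in test_args
    push!(devs, identity)     # plain CPU Arrays
    push!(devs, JLArrays.jl)  # the jl adaptor: GPU-style arrays on the CPU
  end
  "cuda" in test_args && push!(devs, CUDA.cu)
  "metal" in test_args && push!(devs, Metal.mtl)
  return devs
end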

@kmp5VT (Collaborator, Author) commented Sep 24, 2024

@mtfishman Oh, I had forgotten that I created a GPUArraysCore extension. But yes, this is what I mean: adding functionality to NDTensorsGPUArraysCoreExt to replace overlapping function definitions in the other GPU extensions. But I understand your response, and leaving that for a future PR was going to be my suggestion.

Regarding testing: yes, JLArrays is being used in the NDTensors test suite, but I was wondering if we want to modify the ITensors test suite to also test with JLArrays.jl on the CPU. This idea comes from the understanding that the ITensors test suite is currently more far-reaching than the NDTensors tests, so we may catch GPU issues that we don't see in the NDTensors tests.

Thanks!

@mtfishman (Member):

> @mtfishman Oh, I had forgotten that I created a GPUArraysCore extension. But yes, this is what I mean: adding functionality to NDTensorsGPUArraysCoreExt to replace overlapping function definitions in the other GPU extensions. But I understand your response, and leaving that for a future PR was going to be my suggestion.

Sounds good, sounds like we are in agreement there. One comment about the philosophy there is that even if there are overlapping definitions across extensions, it doesn't necessarily mean we should always turn them into generic definitions in NDTensorsGPUArraysCoreExt. For example, multiple GPU extensions might have a certain Exposed workaround for an operation that isn't working, but I think it is better to have repeated definitions in the different backends in that case since that makes it easier to customize the extensions as the backends change (say, if one GPU backend fixes an operation but it remains broken in the other backends). So it is a judgement call about what operations we should turn into generic definitions.
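
As an illustration of that trade-off, a hypothetical operation could be defined generically against GPUArraysCore's AbstractGPUArray, with a backend-specific override (a sketch assuming NDTensors' Expose submodule; some_op and its helpers are made-up names):

using NDTensors.Expose: Exposed, unexpose
using GPUArraysCore: AbstractGPUArray
using CUDA: CuArray

# Generic definition shared by all GPU-like backends; would live in
# NDTensorsGPUArraysCoreExt.
some_op(E::Exposed{<:AbstractGPUArray}) = some_op_fallback(unexpose(E))

# Backend-specific workaround; would live in NDTensorsCUDAExt and is easy
# to delete once the backend fixes the underlying operation.
some_op(E::Exposed{<:CuArray}) = some_op_cuda(unexpose(E))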

> Regarding testing: yes, JLArrays is being used in the NDTensors test suite, but I was wondering if we want to modify the ITensors test suite to also test with JLArrays.jl on the CPU. This idea comes from the understanding that the ITensors test suite is currently more far-reaching than the NDTensors tests, so we may catch GPU issues that we don't see in the NDTensors tests.

I see, thanks for clarifying. Let's hold off on incorporating it into the ITensors test suite; I think those tests already take too long to run, and I would prefer to add more tests at a lower level (i.e. at the NDTensors level, and more specifically at the submodule level) going forward anyway, since that is a better strategy for more systematic and efficient test coverage.

@mtfishman (Member):

Thanks! This will be helpful for testing the GPU backends.

@mtfishman merged commit 29cce1c into ITensor:main on Sep 24, 2024
15 checks passed
@emstoudenmire (Collaborator):

Great to see this
