Releases: JuliaGPU/CUDA.jl
v5.5.2
v5.5.1
What's Changed
- Update wrappers for CUDA v12.6.1 by @amontoison in #2499
- Enzyme: adapt to pending version breaking update by @wsmoses in #2490
Full Changelog: v5.5.0...v5.5.1
v5.5.0
CUDA v5.5.0
Merged pull requests:
- Add support for arbitrary group sizes in `gemm_grouped_batched!` (#2334) (@lpawela)
- Add kernel compilation requirements to docs (#2416) (@termi-official)
- Enzyme: reverse mode kernels (#2422) (@wsmoses)
- CUFFT: Support Float16 (#2430) (@eschnett); see the example after this list
- Updated compute-sanitizer documentation (#2440) (@alexp616)
- Add troubleshooting section for NSight Compute (#2442) (@efaulhaber)
- Correct typo in documentation (#2445) (@eschnett)
- Bump minimal Julia requirement to v1.10. (#2447) (@maleadt)
- fix compute-sanitizer typo (#2448) (@alexp616)
- Address a corner case when establishing p2p access (#2457) (@findmyway)
- Implementation of spdiagm for CUSPARSE (#2458) (@walexaindre)
- Update to CUDA 12.6. (#2461) (@maleadt)
- CompatHelper: bump compat for GPUCompiler to 0.27, (keep existing compat) (#2462) (@github-actions[bot])
- Bump CUDA driver JLL. (#2463) (@maleadt)
- CUSOLVER (dense): cache workspace in fat handle (#2465) (@bjarthur)
- Revert "Run full GC when under very high memory pressure." (#2469) (@maleadt)
- Fix a method deprecation. (#2470) (@maleadt)
- Add Enzyme sum derivatives (#2471) (@wsmoses)
- Re-use pre-converted kernel arguments when launching kernels. (#2472) (@maleadt)
- Bump LLVM compat (#2473) (@maleadt)
- Bump subpackage compat. (#2475) (@maleadt)
- Enzyme: Reversemode cudaconvert (#2476) (@wsmoses)
- Ignore Enzyme.jl CI failures (#2479) (@maleadt)
- Re-enable enzyme testing (#2480) (@wsmoses)
- Add missing GC.@preserves. (#2487) (@maleadt)
- [CUSPARSE] Implement a sparse GEMV for CuSparseMatrixCSC * CuSparseVector (#2488) (@amontoison)
- [CUSPARSE] Add conversions between CuSparseVector and CuSparseMatrices (#2489) (@amontoison)
- Update to LLVM 9.1. (#2491) (@maleadt)
- Use at-consistent_overlay for 1.11 compatibility. (#2492) (@maleadt)
- Rework NNlib CI. (#2493) (@maleadt)
- CUSPARSE: Fix sparse constructor with duplicate elements. (#2495) (@maleadt)
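To illustrate the half-precision FFT support added in #2430, here is a minimal, untested sketch. It assumes the AbstractFFTs-style entry points re-exported by `CUDA.CUFFT`, and that half-precision transforms inherit cuFFT's power-of-two length restriction; neither detail is spelled out in the PR title above.

```julia
using CUDA
using CUDA.CUFFT        # brings the AbstractFFTs API (fft, ifft, plan_fft, ...) into scope

# Assumption: half-precision transforms follow cuFFT's usual restriction to
# power-of-two lengths.
x = CuArray(rand(ComplexF16, 1024))
X = fft(x)              # forward transform computed in half precision
xr = ifft(X)            # ifft is normalized, so xr ≈ x up to Float16 rounding
```

For repeated transforms of the same size, `plan_fft(x)` amortizes the plan-creation cost.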
Closed issues:
- `LinearAlgebra.norm(x)` falls back to generic implementation for `x::Transpose` and `x::Adjoint` (#1782)
- dlclose'ing the compatibility driver can fail (#1848)
- Creating a sparse diagonal matrix of CuArray(u) (#1857)
- Support for Julia 1.11 (#2241)
- CUDA 12.4 Update 1: CUPTI does not trace kernels anymore (#2328)
- Adding CUDA to a PackageCompiler sysimage causes segfault (#2428)
- Error using CUDA on Julia 1.10: `Number of threads per block exceeds kernel limit` (#2438)
- Error when I load my model (#2439)
- Driver JLL improvements (#2446)
- Deadlock when calling CUDA.jl in an adopted thread while blocking the main thread (#2449)
- CUDA.Mem.unregister fails with CUDA.jl 5.4 (not with 5.3) (#2452)
- Segmentation Fault on Loading CUDA (#2453)
- `Invalid instruction` error when `using CUDA` (#2454)
- Missing `adapt` for sparse and `CUDABackend` (#2459)
- CUDA precompile cannot find/load "cupti64_2024.2.1.dll" during precompilation (juliaup 1.10.4, Windows 11) (#2466)
- Request: Option to disable the "full GC when under very high memory pressure". (#2467)
- copyto! ambiguous (#2477)
- NeuralODE training failed on GPU with Enzyme (#2478)
- issue with atomic - when running standard test, @atomic modify expression missing field access (#2483)
- Support for creating a CuSparseMatrixCSC from a CuSparseVector (#2484); see the sketch after this list
- Issue with compiling CUDA and cuTENSOR using local libraries (#2486)
- Memory Access error in sparse array constructor (#2494)
- Forwards-compatible driver breaks CURAND (#2496)
- CUDA 12.6 Update 1 (#2497)
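The sparse-array additions in this release (spdiagm for CUSPARSE in #2458, the sparse GEMV in #2488, and the vector/matrix conversions in #2489, which relate to the closed issues #1857 and #2484 above) can be sketched as follows. This is an untested sketch based only on the titles: whether `spdiagm` is extended directly for `CuVector` diagonals, and whether the sparse matrix-vector product returns a dense or sparse result, are assumptions.

```julia
using CUDA, CUDA.CUSPARSE
using SparseArrays, LinearAlgebra

# Sparse diagonal matrix from a GPU vector (#2458); the assumption is that
# SparseArrays.spdiagm accepts CuVector diagonals and returns a CUSPARSE matrix.
d = CuVector(Float32[1, 2, 3, 4])
D = spdiagm(0 => d)

# Sparse matrix times sparse vector (#2488).
A = CuSparseMatrixCSC(sprand(Float32, 100, 100, 0.05))
x = CuSparseVector(sprand(Float32, 100, 0.1))
y = A * x                        # result type (dense vs. sparse) is an assumption

# Conversions between sparse vectors and single-column sparse matrices (#2489).
M = CuSparseMatrixCSC(x)
v = CuSparseVector(M)
```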
v5.4.3
CUDA v5.4.3
Merged pull requests:
- add cublasgetrsBatched (#2385) (@bjarthur)
- add two quirks for rationals (#2403) (@lanceXwq)
- Bump cuDNN (#2404) (@maleadt)
- Add convert method for ScaledPlan (#2409) (@david-macmahon)
- Conditionalize a quirk. (#2411) (@maleadt)
- Relax signature of generic matvecmul! (#2414) (@dkarrasch)
- Fix kron launch configuration. (#2418) (@maleadt); see the example after this list
- Run full GC when under very high memory pressure. (#2421) (@maleadt)
- Enzyme: Fix cuarray return type (#2425) (@wsmoses)
- CompatHelper: bump compat for LLVM to 8, (keep existing compat) (#2426) (@github-actions[bot])
- pre-allocated pivot and info buffers for getrf_batched (#2431) (@bjarthur)
- Profiler tweaks. (#2432) (@maleadt)
- Update the Julia wrappers for CUDA v12.5.1 (#2436) (@amontoison)
- Correct workspace handling (#2437) (@maleadt)
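The `kron` fix in #2418 (and the grid-size issue #2410 below) concerns the GPU Kronecker product. A minimal usage sketch:

```julia
using CUDA, LinearAlgebra

A = CUDA.rand(Float32, 64, 64)
B = CUDA.rand(Float32, 8, 8)
C = kron(A, B)            # 512×512 Kronecker product, computed on the device
```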
Closed issues:
- Legacy cuIpc* APIs incompatible with stream-ordered allocator (#1053)
- Broadcasted multiplication with a rational doesn't work (#1926)
- Incorrect grid size in `kron` (#2410)
- GEMM of non-contiguous inputs should dispatch to fallback implementation (#2412)
- Failure of Eigenvalue Decomposition for Large Matrices. (#2413)
- CUDA_Driver_jll's lazy artifacts cause a precompilation-time warning (#2415)
- Recurrence of integer overflow bug (#1880) for a large matrix (#2427)
- CUDA kernel crash very occasionally when MPI.jl is just loaded. (#2429)
- CUDA_Runtime_Discovery Did not find cupti on Arm system with nvhpc (#2433)
- CUDA.jl won't install/run on Jetson Orin NX (#2435)
v5.4.2
CUDA v5.4.2
Merged pull requests:
v5.4.1
CUDA v5.4.1
Merged pull requests:
v5.4.0
CUDA v5.4.0
Merged pull requests:
- Support CUDA 12.5 (#2392) (@maleadt); see the example after this list
- Mark cuarray as noalias (#2395) (@wsmoses)
- Update Julia wrappers for CUDA v12.5 (#2396) (@amontoison)
- Enable correct pool access for cublasXt. (#2398) (@maleadt)
- More fine-grained CUPTI version checks. (#2399) (@maleadt)
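With CUDA 12.5 now supported (#2392), the toolkit in use can be inspected, and optionally pinned, from Julia. A small sketch; note that `set_runtime_version!` writes a preference and only takes effect after restarting Julia:

```julia
using CUDA

CUDA.versioninfo()                  # report the driver, runtime, and library versions in use

# Optionally pin the artifact-provided toolkit to the newly supported release.
# This writes a LocalPreferences.toml entry and requires a Julia restart.
CUDA.set_runtime_version!(v"12.5")
```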
Closed issues:
v5.3.5
CUDA v5.3.5
Merged pull requests:
- Avoid constructing `MulAddMul`s on Julia v1.12+ (#2277) (@dkarrasch)
- CompatHelper: bump compat for LLVM to 7, (keep existing compat) (#2365) (@github-actions[bot])
- Enzyme: allocation functions (#2386) (@wsmoses)
- Tweaks to prevent context construction on some operations (#2387) (@maleadt)
- Fixes for Julia 1.12 / LLVM 17 (#2390) (@maleadt)
- CUBLAS: Make sure CUBLASLt wrappers use the correct library. (#2391) (@maleadt)
- Backport: Enzyme allocation fns (#2393) (@wsmoses)
Closed issues:
v5.3.4
CUDA v5.3.4
Merged pull requests:
- Add Enzyme Forward mode custom rule (#1869) (@wsmoses)
- Handle cache improvements (#2352) (@maleadt)
- Fix cuTensorNet compat (#2354) (@maleadt)
- Optimize array allocation. (#2355) (@maleadt)
- Change type restrictions in cuTENSOR operations (#2356) (@lkdvos)
- Bump julia-actions/setup-julia from 1 to 2 (#2357) (@dependabot[bot])
- Suggest use of 32 bit types over 64 instead of just Float32 over Float64 [skip ci] (#2358) (@Zentrik); see the example after this list
- Make generic_trimatmul more specific (#2359) (@tgymnich)
- Return the correct memory type when wrapping system memory. (#2363) (@maleadt)
- Mark cublas version/handle as non-differentiable (#2368) (@wsmoses)
- Enzyme: Forward mode sync (#2369) (@wsmoses)
- Enzyme: support fill (#2371) (@wsmoses)
- unsafe_wrap: unconditionally use the memory type provided by the user. (#2372) (@maleadt)
- Remove external_gvars. (#2373) (@maleadt)
- Tegra support with artifacts (#2374) (@maleadt)
- Backport Enzyme extension (#2375) (@wsmoses)
- Add note about --check-bounds=yes (#2378) (@Zinoex)
- Test Enzyme in a separate CI job. (#2379) (@maleadt)
- Fix tests for Tegra. (#2381) (@maleadt)
- Update Project.toml [remove EnzymeCore unconditional dep] (#2382) (@wsmoses)
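The documentation change in #2358 broadens the performance advice from "prefer Float32 over Float64" to "prefer 32-bit types in general". The kernel below is an illustrative sketch of that advice (not taken from the docs): Int32 index arithmetic and Float32 data avoid 64-bit operations on the device.

```julia
using CUDA

# Illustrative SAXPY-style kernel using 32-bit index arithmetic and Float32 data.
function axpy_kernel!(y, x, a)
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] += a * x[i]
    end
    return
end

x = CUDA.rand(Float32, 1024)
y = CUDA.zeros(Float32, 1024)
@cuda threads=256 blocks=cld(length(y), 256) axpy_kernel!(y, x, 2.0f0)
```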
Closed issues:
- Native Softmax (#175)
- CUSOLVER: support eigendecomposition (#173)
- backslash with gpu matrices crashes julia (#161)
- at-benchmark captures GPU arrays (#156)
- Support kernels returning Union{} (#62)
- mul! falls back to generic implementation (#148)
- \ on qr factorization objects gives a method error (#138)
- Compiler failure if dependent module only contains a `japi1` function (#49)
- `copy!(dst, src)` and `copyto!(dst, src)` are significantly slower and allocate more memory than `copyto!(dest, do, src, so[, N])` (#126)
- Calling Flux.gpu on a view dumps core (#125)
- Creating `CuArray{Tracker.TrackedReal{Float64},1}` a few times causes segfaults (#121)
- Guard against exceeding maximum kernel parameter size (#32)
- Detect common API misuse in error handlers (#31)
- `rand` and friends default to `Float64` (#108)
- \ does not work for least squares (#104)
- ERROR_ILLEGAL_ADDRESS when broadcasting modular arithmetic (#94)
- CuIterator assumes batches to consist of multiple arrays (#86)
- Algebra with UniformScaling Uses Generic Fallback Scalar Indexing (#85)
- Document (un)supported language features for kernel programming (#13)
- Missing dispatch for indexing of reshaped arrays (#556)
- Track array ownership to avoid illegal memory accesses (#763)
- NVPTX i128 support broken on LLVM 11 / Julia 1.6 (#793)
- Support for `sm_80` `cp.async`: asynchronous on-device copies (#850)
- Profiling Julia with Nsight Systems on Windows results in blank window (#862)
- sort! and partialsort! are considerably slower than CPU versions (#937)
- mul! does not dispatch on Adjoint (#1363)
- Cross-device copy of wrapped arrays fails (#1377)
- Memory allocation becomes very slow when reserved bytes is large (#1540)
- Cannot reclaim GPU Memory; CUDA.reclaim() (#1562)
- Add eigen for general purpose computation of eigenvectors/eigenvalues (#1572)
- device_reset! does not seem to work anymore (#1579)
- device-side rand() are not random between successive kernel launches (#1633)
- Add EnzymeRules support for CUDA.jl (for forward mode here) (#1811)
- `cusparseSetStream_v2` not defined (#1820)
- Feature request: Integrating the latest CUDA library "cuLitho" into CUDA.jl (#1821)
- KernelAbstractions.jl-related issues (#1838)
- lock failing in multithreaded plan_fft() (#1921)
- CUSolver finalizer tries to take ReentrantLock (#1923)
- Testsuite could be more careful about parallel testing (#2192)
- Opportunistic GC collection (#2303)
- Unable to use local CUDA runtime toolkit (#2367)
- Enzyme prevents testing on 1.11 (#2376)
v5.3.3
CUDA v5.3.3
Merged pull requests:
- Rework context handling (#2346) (@maleadt)
- fix kernel launch logic (#2353) (@xaellison)
Closed issues: