Releases · JuliaGPU/CUDA.jl
v3.6.4
CUDA v3.6.4
Closed issues:
- Artifacts.toml has bad git-tree-sha1 values (#1309)
Merged pull requests:
v3.6.3
CUDA v3.6.3
Closed issues:
- CUDA.@atomic deadlocks when overwriting NaN (#1299)
- Unreasonably slow copy kernel (#1301)
- Passing a LogicalIndex(::CuArray) fails (#1304)
Merged pull requests:
- Allow sorting of tuples of numbers (#1196) (@mcabbott)
- Use === for generic atomic updates with compare-and-swap (#1300) (@guyvdbroeck)
- Update manifest (#1302) (@github-actions[bot])
- Store the array length next to its dimensions. (#1303) (@maleadt)
- Disallow calling CUDA device array intrinsics on the host. (#1305) (@maleadt)
- Support logical indexing with CPU sources. (#1306) (@maleadt)
- Activate a context when calling device!. (#1307) (@maleadt)
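The === change in #1300 matters because generic CUDA.@atomic updates are lowered to a compare-and-swap loop, and comparing the old value with == never succeeds once the slot holds NaN, which is the deadlock reported in #1299. A minimal sketch of the kind of update involved (the kernel and names are illustrative, not code from the release):

```julia
using CUDA

# Illustrative only: an atomic floating-point maximum. Float32 max is
# typically emulated with a compare-and-swap loop, the code path where
# NaN comparisons previously spun forever (#1299/#1300).
function atomic_max_kernel!(out, xs)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(xs)
        CUDA.@atomic out[1] = max(out[1], xs[i])
    end
    return
end

xs  = CUDA.rand(Float32, 1024)
out = CUDA.fill(-Inf32, 1)
@cuda threads=256 blocks=4 atomic_max_kernel!(out, xs)
```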
v3.6.2
v3.6.1
v3.6.0
CUDA v3.6.0
Closed issues:
- Conversion issue (#157)
- Extend new RNG to Complex numbers & normal distributions (#726)
- Fatal errors during sorting tests (#916)
- deepcopy failing (#1202)
- Kernel compilation fails when specifying shared memory array size as a tuple consisting of block dimension and kernel argument (#1205)
- ERROR: LoadError: The artifact at C:\Users\name\.julia\artifacts\58bd87695e9ccdb508cb38be1ab717315ecc9152 is empty. (#1209)
- InvalidIRError when displaying a model which is on the GPU (#1212)
- CUDA.jl tries to load CUDA compat loaded via jll even though system package is installed (#1216)
- Synchronizing over blocks (#1220)
- assignment changes random seed (#1226)
- accumulate gives wrong answer when init != 0 (#1227)
- Generic dot kernel: use multiple kernels instead of atomics (#1244)
- integer division error creating CuVector of missing and nothing (#1251)
- unsupported dynamic function invocation with union type of more than 2 elements (#1252)
- three CUDA.@atomic in a row result in out-of-bounds error (#1254)
- Float16 CAS cannot use atom.cas.b16.global on sm_61 (#1258)
- cu(::SVector) gives SVector, cu(::MVector) gives CuArray (#1262)
- Get back unsafe_copyto! methods for unified<->unified and unified<->device (#1263)
- Passing and using a FFT plan in a CUDA kernel seems impossible (#1266)
- Inplace Complex FFT and Threads (#1268)
- sort returns nothing (#1270)
- Release a new version (#1276)
- __init_driver__ not called in 3.5 (#1280)
- Shared memory does not support isbits unions. (#1281)
- NVIDIA Nsight Systems and CUDA.@profile error (#1282)
- nvprof with using CUDA crashes julia (#1283)
Merged pull requests:
- Addition over CuSparseMatrix (#1195) (@yuehhua)
- [CUSOLVER] Add ordering functions (#1198) (@amontoison)
- Correctly handle multi-GPU instances with NVML. (#1199) (@maleadt)
- CI improvements. (#1200) (@maleadt)
- fix FFT workarea typo leading to memory corruption (#1204) (@marius311)
- Update manifest (#1206) (@github-actions[bot])
- Minor improvements for library wrappers (#1207) (@maleadt)
- Various small improvements (#1210) (@maleadt)
- Extend CuDeviceArray ctors for mixed-int indices. (#1211) (@maleadt)
- Deprecate non-blocking sync, and always call the synchronization API. (#1213) (@maleadt)
- Generic CUSPARSE: use the index arguments. (#1214) (@maleadt)
- Add bitonic sort implementation (#1217) (@xaellison)
- Update manifest (#1218) (@github-actions[bot])
- Reverted deepcopy, added test (#1221) (@birkmichael)
- Use broadcast instead of copies to initialize mapreduce buffers. (#1223) (@maleadt)
- Remove some unneeded Base module prefixes. (#1224) (@maleadt)
- Update manifest (#1225) (@github-actions[bot])
- Cherry-picked improvements (#1228) (@maleadt)
- Update introduction.jl (#1232) (@aramirezreyes)
- Update manifest (#1233) (@github-actions[bot])
- Fix SpMV for CUDA 11.5 (#1234) (@amontoison)
- Add support for randn and randexp. (#1236) (@maleadt)
- Avoid double-initializing partial accumulate results. (#1237) (@maleadt)
- Fix cuTENSOR contractions not working for FP16 inputs (#1238) (@thomasfaingnaert)
- Bump CUTENSOR and fix on CUDA 11.5 (#1239) (@maleadt)
- Support dot product on GPU between CuArrays with inconsistent eltypes (#1240) (@findmyway)
- Update manifest (#1241) (@github-actions[bot])
- Optimize CUTENSOR contraction. (#1243) (@maleadt)
- Don't use nondeterministic atomics in dot when requested. (#1245) (@maleadt)
- Remove CUBLAS decomposition tests without pivoting. (#1246) (@maleadt)
- Update manifest (#1247) (@github-actions[bot])
- wrap CUBLAS spmv and spr (#1248) (@bjarthur)
- CompatHelper: bump compat for "SpecialFunctions" to "2" (#1249) (@github-actions[bot])
- Update manifest (#1250) (@github-actions[bot])
- Store array offset as elements to fix all-singleton case. (#1255) (@maleadt)
- Update CUDA to 11.5 Update 1. (#1256) (@maleadt)
- Use Base functionality for iteration Union type components. (#1257) (@maleadt)
- Bump CI to Julia 1.7. (#1260) (@maleadt)
- Update manifest (#1261) (@github-actions[bot])
- Use CUDA APIs for unoptimized copies. (#1265) (@maleadt)
- Bump CUDNN to 8.3.1, enable CUDA 11.5 by default. (#1267) (@maleadt)
- Adding stream update for inplace complex FFT (#1269) (@ovanvincq)
- Fix sort! return type. (#1272) (@maleadt)
- Add const keyword to type aliases declarations. (#1273) (@eliascarv)
- Update manifest (#1274) (@github-actions[bot])
- Avoid eager expansion of CUDA_compat artifact string. (#1275) (@maleadt)
- Allow copies between unified arrays in different contexts. (#1277) (@maleadt)
- fix zeros and ones for user defined types (#1278) (@GiggleLiu)
- Make CUDNN depend on CUBLAS. (#1279) (@maleadt)
- Update manifest (#1286) (@github-actions[bot])
- Restore call to init_driver. (#1287) (@maleadt)
- Improvements for isbits union shared memory (#1288) (@maleadt)
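Issues #726 and #1236 extend the native RNG beyond uniform sampling. A hedged usage sketch, assuming the CUDA.randn/CUDA.randexp entry points added by that PR:

```julia
using CUDA

A = CUDA.randn(Float32, 1024)    # normally distributed samples, generated on the GPU
B = CUDA.randexp(Float32, 1024)  # exponentially distributed samples
```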
v3.5.0
CUDA v3.5.0
Closed issues:
- Illegal memory access on 3.3 (#975)
- Forward compatibility (#1071)
- ambiguous sparse constructor (#1088)
- Map reduce with float 16 (#1124)
- Allow invalid GPU pointers not allowed in unsafe_wrap (#1125)
- Scalar Indexing error in the Introduction docs (#1127)
- stackoverflow when printing a custom subtype of AbstractCuSparseMatrix (#1128)
- missing rand methods (#1138)
- Error mapreducing over a 0 dimensional array (#1141)
- seed! is not thread safe (#1158)
- Simplify Int32-based indices (#1160)
- Concatenating a scalar to a CuArray gives an Array (#1162)
- Calling byte_perm with Int32 values inserts sign checks (#1165)
- sum! does not compile for large arrays (#1169)
- Same random sequence on GPU and CPU? (#1170)
- Specifying eltype and buffer type when adapting to CuArray? (#1171)
- Inefficient lop3.lut instructions generated (#1172)
- Writing temporary PTX files can fail (#1173)
- Switching devices doesn't switch the REPL's output task (#1175)
- GC is not working for CuSparseMatrixCSR (#1178)
- sparse*dense operations shouldn't drop sparseness (#1188)
- Raises illegal memory access error randomly (#1189)
Merged pull requests:
- CI fixes (#950) (@maleadt)
- implement sparse (#1093) (@CarloLucibello)
- Use the kernel state object to pass the exception flag location. (#1110) (@maleadt)
- Update manifest (#1123) (@github-actions[bot])
- Improve show methods in sparse GPU arrays. (#1129) (@maleadt)
- Use warp intrinsics for a wider range of reductions. (#1130) (@maleadt)
- Support wrapping a host buffer with a CuArray (#1131) (@maleadt)
- support transpose CSC to CUDA CSR (#1132) (@Roger-luo)
- Small improvements to discovery of local toolkits. (#1134) (@maleadt)
- Rework device and context getters. (#1135) (@maleadt)
- Avoid memory operations during graph capture. (#1137) (@maleadt)
- Streamline the random number interface. (#1146) (@maleadt)
- Native device synchronization (#1147) (@maleadt)
- support interpret(reshape) (#1149) (@Roger-luo)
- add a gitignore (#1150) (@Roger-luo)
- Fix normalize on complex number (#1151) (@maleadt)
- Addition and multiplication over cuarray and cusparse (#1152) (@maleadt)
- Preserve Int32 hardware indices (#1153) (@maleadt)
- remove mutable to make device sparse type bitstype (#1154) (@Roger-luo)
- Update manifest (#1155) (@github-actions[bot])
- CompatHelper: bump compat for "BFloat16s" to "0.2" (#1156) (@github-actions[bot])
- Perform actual synchronization API calls when we need the memory (#1157) (@maleadt)
- Binary dependency changes (#1159) (@maleadt)
- Bump dependencies. (#1161) (@maleadt)
- Generalize Sparse Array Indices Type in Struct Def (#1163) (@Roger-luo)
- Use unchecked type conversions for byte_perm arguments (#1166) (@eschnett)
- Fix performance regressions (#1167) (@maleadt)
- Fix big mapreduce kernel for inputs without neutral element. (#1174) (@maleadt)
- Switch contexts before performing memory operations on arrays (#1176) (@maleadt)
- Improvements to stream-ordered memory management (#1177) (@maleadt)
- Update manifest (#1180) (@github-actions[bot])
- Consistently use chars instead of raw enums in CUSPARSE/CUSOLVER functions. (#1181) (@maleadt)
- Implement forward compatibility (#1182) (@maleadt)
- Bump GPUCompiler for 1.8 compat. (#1183) (@maleadt)
- Bump GPUArrays. (#1186) (@maleadt)
- Update documentation (#1187) (@maleadt)
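Several items above (#1135, #1175, #1176) concern which device and context are active when arrays are allocated or operated on. A minimal multi-GPU sketch, assuming at least two devices are visible (device indices and sizes are illustrative):

```julia
using CUDA

if length(CUDA.devices()) > 1
    CUDA.device!(1)               # make the second GPU current for this task
    b = CUDA.zeros(Float32, 100)  # allocated on device 1
    CUDA.device!(0)               # switch back; #1176 has the library switch to b's
                                  # context automatically for later memory operations
end
```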
v3.4.2
CUDA v3.4.2
Closed issues:
- Broadcasting a datatype does not work (#261)
- CUDA error: invalid argument during Zygote/Flux gradient computation (#1107)
- EXCEPTION_ACCESS_VIOLATION when using shared memory allocations. (#1116)
Merged pull requests:
- add symmetric support for mul (#217) (@Roger-luo)
- adds a device array type for CuSparseMatrixCSR to support using it in kernel functions (#1106) (@Roger-luo)
- Update manifest (#1108) (@github-actions[bot])
- Specialize Ref{<:Type} for GPU compatibility. (#1109) (@maleadt)
- Use the documented version of the enable_finalizers API. (#1111) (@maleadt)
- Don't embed the method table in the AST. (#1112) (@maleadt)
- Remove the hacky unique'ing of shmem GVs. (#1114) (@maleadt)
- Introduce a macro for marking multiple functions as device-only. (#1117) (@maleadt)
- Simplify library loading. (#1121) (@maleadt)
- Backports for 3.4.2 (#1122) (@maleadt)
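PRs #1114 and #1117, and issue #1116, touch statically allocated shared memory in device code. A short illustrative kernel (names and sizes are hypothetical) using the static shared-memory macro available in this release line:

```julia
using CUDA

# Illustrative: reverse a 256-element vector through static shared memory.
function shmem_reverse!(out, xs)
    tmp = @cuStaticSharedMem(Float32, 256)
    i = threadIdx().x
    tmp[i] = xs[i]
    sync_threads()                 # all threads must have written before reading
    out[i] = tmp[257 - i]
    return
end

xs  = CUDA.rand(Float32, 256)
out = similar(xs)
@cuda threads=256 shmem_reverse!(out, xs)
```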
v3.4.1
v3.4.0
CUDA v3.4.0
Merged pull requests: