Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize memorynew intrinsic for constant length Memory #55913

Closed

Conversation

oscardssmith
Copy link
Member

@oscardssmith oscardssmith commented Sep 28, 2024

This speeds up making new Memorys and allow the compiler to better understand what's going on, allowing for LLVM level escape analysis in some cases. There is more room to grow this (currently this only optimizes for fairly small Memory since bigger ones would require writing some more LLVM code, and we probably want a size limit on putting Memory on the stack to avoid stackoverflow. For larger ones, we could potentially inline the free so the Memory doesn't have to be swept by the GC, etc.

Benchmarks:

julia> f(T) = Memory{T}(undef, 2);

julia> @btime f(Int);
  7.879 ns (1 allocation: 48 bytes) #before
  3.959 ns (1 allocation: 48 bytes) #after 

julia> @btime f(Union{Int, Nothing})
  8.426 ns (1 allocation: 48 bytes) #before
  4.017 ns (1 allocation: 48 bytes) #after 

julia> @btime f(Any);
  8.637 ns (1 allocation: 48 bytes) #before
  3.920 ns (1 allocation: 48 bytes) #after 

julia> function g()
           m = Memory{Int}(undef, 2)
           for i in 1:2
              m[i] = i
           end
           m[1]+m[2]
       end

julia> @btime g()
  9.735 ns (1 allocation: 48 bytes) #before
  1.719 ns (0 allocations: 0 bytes) #after

src/cgutils.cpp Outdated Show resolved Hide resolved
@nsajko nsajko added the arrays [a, r, r, a, y, s] label Sep 28, 2024
@oscardssmith oscardssmith added the performance Must go faster label Sep 29, 2024
@oscardssmith
Copy link
Member Author

@gbaraldi so with LLVM assertions enabled I'm getting

julia: /home/oscardssmith/Documents/Code/julia/deps/srccache/llvm-julia-18.1.7-2/llvm/lib/IR/Instructions.cpp:685: void llvm::CallInst::init(llvm::FunctionType*, llvm::Value*, llvm::ArrayRef<llvm::Value*>, llvm::ArrayRef<llvm::OperandBundleDefT<llvm::Value*> >, const llvm::Twine&): Assertion `(i >= FTy->getNumParams() || FTy->getParamType(i) == Args[i]->getType()) && "Calling a function with a bad signature!"' failed.

[398134] signal 6 (-6): Aborted
in expression starting at REPL[3]:1
pthread_kill at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
raise at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7108b6c2871a)
__assert_fail at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_ZN4llvm8CallInst4initEPNS_12FunctionTypeEPNS_5ValueENS_8ArrayRefIS4_EENS5_INS_17OperandBundleDefTIS4_EEEERKNS_5TwineE at /home/oscardssmith/Documents/Code/julia/usr/bin/../lib/libLLVM.so.18.1jl (unknown line)
CallInst at /home/oscardssmith/Documents/Code/julia/usr/include/llvm/IR/Instructions.h:1692 [inlined]
Create at /home/oscardssmith/Documents/Code/julia/usr/include/llvm/IR/Instructions.h:1540 [inlined]
CreateCall at /home/oscardssmith/Documents/Code/julia/usr/include/llvm/IR/IRBuilder.h:2398
CreateCall at /home/oscardssmith/Documents/Code/julia/usr/include/llvm/IR/IRBuilder.h:2420 [inlined]
emit_memorynew at /home/oscardssmith/Documents/Code/julia/src/cgutils.cpp:4555 [inlined]

which is on the line that does ctx.builder.CreateCall(prepare_call(jl_alloc_genericmemory_unchecked_func), {ptls, nbytes, cg_typ});. Any ideas?

@gbaraldi
Copy link
Member

gbaraldi commented Oct 4, 2024

I'd print everyone involved here with the way I showed you yesterday thing->print(dbgs()), So print the func and the arguments

@oscardssmith
Copy link
Member Author

This now works! For simple examples like Memory{Int}(undef, 3) it's about 7ns compared to 11 previously. Also this version is almost able to be analyzed by our escape analysis to delete the Memory when it doesn't (we just need to teach the llvm-alloc-helpers about gc_loaded, and for non @inbounds accesses we need to teach LLVM that the boundschecks are dead).

@gbaraldi
Copy link
Member

gbaraldi commented Oct 4, 2024

As an example of what is possible.

Allocopt was able to go from

define i64 @julia_f_769() #0 !dbg !5 {
top:
  %pgcstack = call ptr @julia.get_pgcstack()
  %current_task1 = getelementptr inbounds i8, ptr %pgcstack, i64 -112, !dbg !14
  %memoryref_mem = call dereferenceable(40) ptr addrspace(10) @julia.gc_alloc_obj(ptr nonnull %current_task1, i64 40, ptr addrspace(10) addrspacecast (ptr @"+Core.GenericMemory#771.jit" to ptr addrspace(10))), !dbg !14
  %0 = addrspacecast ptr addrspace(10) %memoryref_mem to ptr addrspace(11), !dbg !14
  %1 = getelementptr inbounds { i64, ptr }, ptr addrspace(11) %0, i64 0, i32 1, !dbg !14
  %2 = call nonnull ptr @julia.pointer_from_objref(ptr addrspace(11) %0) #4, !dbg !14
  %3 = getelementptr inbounds i8, ptr %2, i64 16, !dbg !14
  store ptr %3, ptr addrspace(11) %1, align 8, !dbg !14
  store i64 3, ptr addrspace(11) %0, align 8, !dbg !14
  %memoryref_data4 = call ptr addrspace(13) @julia.gc_loaded(ptr addrspace(10) %memoryref_mem, ptr %3), !dbg !15
  store i64 2, ptr addrspace(13) %memoryref_data4, align 8, !dbg !15, !tbaa !20, !alias.scope !24, !noalias !27
  %memoryref_data11 = getelementptr inbounds i8, ptr addrspace(13) %memoryref_data4, i64 8, !dbg !32
  store i64 4, ptr addrspace(13) %memoryref_data11, align 8, !dbg !32, !tbaa !20, !alias.scope !24, !noalias !27
  %memoryref_data18 = getelementptr inbounds i8, ptr addrspace(13) %memoryref_data4, i64 16, !dbg !34
  store i64 5, ptr addrspace(13) %memoryref_data18, align 8, !dbg !34, !tbaa !20, !alias.scope !24, !noalias !27
  ret i64 11, !dbg !36
}

to. Removing the allocation. Which likely would allow it to just return the 11

define i64 @julia_f_769() #0 !dbg !5 {
top:
  %memoryref_mem = alloca [40 x i8], align 16
  %pgcstack = call ptr @julia.get_pgcstack()
  %current_task1 = getelementptr inbounds i8, ptr %pgcstack, i64 -112, !dbg !14
  call void @llvm.lifetime.start.p0(i64 40, ptr %memoryref_mem)
  %0 = freeze [40 x i8] undef, !dbg !14
  store [40 x i8] %0, ptr %memoryref_mem, align 1, !dbg !14
  %1 = getelementptr inbounds { i64, ptr }, ptr %memoryref_mem, i64 0, i32 1, !dbg !14
  %2 = getelementptr inbounds i8, ptr %memoryref_mem, i64 16, !dbg !14
  store ptr %2, ptr %1, align 8, !dbg !14
  store i64 3, ptr %memoryref_mem, align 8, !dbg !14
  %memoryref_data4 = call ptr addrspace(13) @julia.gc_loaded(ptr addrspace(10) null, ptr %2), !dbg !15
  store i64 2, ptr addrspace(13) %memoryref_data4, align 8, !dbg !15, !tbaa !20, !alias.scope !24, !noalias !27
  %memoryref_data11 = getelementptr inbounds i8, ptr addrspace(13) %memoryref_data4, i64 8, !dbg !32
  store i64 4, ptr addrspace(13) %memoryref_data11, align 8, !dbg !32, !tbaa !20, !alias.scope !24, !noalias !27
  %memoryref_data18 = getelementptr inbounds i8, ptr addrspace(13) %memoryref_data4, i64 16, !dbg !34
  store i64 5, ptr addrspace(13) %memoryref_data18, align 8, !dbg !34, !tbaa !20, !alias.scope !24, !noalias !27
  ret i64 11, !dbg !36
}

@oscardssmith
Copy link
Member Author

julia> @btime Memory{Int8}(undef, 3)
  7.169 ns (1 allocation: 32 bytes) # before
  3.986 ns (1 allocation: 32 bytes) # after

@giordano
Copy link
Contributor

giordano commented Oct 4, 2024

FixedSizeArrays.jl is going to be much nicer 🙂

@oscardssmith oscardssmith force-pushed the os/make-memorynew-tfunc branch from 6222082 to b65a483 Compare October 5, 2024 01:12
src/cgutils.cpp Outdated Show resolved Hide resolved
oscardssmith added a commit that referenced this pull request Oct 7, 2024
combined with #55913, the compiler is smart enough to fully remove
```
function f()
    m = Memory{Int}(undef, 3)
    @inbounds m[1] = 2
    @inbounds m[2] = 2
    @inbounds m[3] = 4
    @inbounds return m[1] + m[2] + m[3]
end
```
src/ccall.cpp Outdated Show resolved Hide resolved
oscardssmith added a commit that referenced this pull request Oct 10, 2024
combined with #55913, the
compiler is smart enough to fully remove
```
function f()
    m = Memory{Int}(undef, 3)
    @inbounds m[1] = 2
    @inbounds m[2] = 2
    @inbounds m[3] = 4
    @inbounds return m[1] + m[2] + m[3]
end
```
@oscardssmith oscardssmith force-pushed the os/make-memorynew-tfunc branch from b65a483 to 724b8c5 Compare October 11, 2024 19:51
@giordano
Copy link
Contributor

Can you please add an llvm pass test for #56030 (comment) (removing all memory for a simple case where the Memory object doesn't escape)?

@oscardssmith
Copy link
Member Author

Do you want an actual LLVM pass, or can I just write a test for 0 allocations?

@giordano
Copy link
Contributor

I think an llvm test would be more robust, but probably a simple zero-allocation test would do the job as well.

@oscardssmith
Copy link
Member Author

LOL. This test is so good it broke a doctest in performance tips. We're testing to show that you get allocations if you have "bad" code that allocates arrays, but now it doesn't allocate :laughing

@oscardssmith
Copy link
Member Author

This is now on top of #55995 (to figure out why we weren't optimizing correctly), but other than that, I think this is good to go!

@oscardssmith oscardssmith requested a review from vtjnash October 14, 2024 16:45
@giordano
Copy link
Contributor

Maybe a test of no allocations in simple cases as discussed above? 🙂

@oscardssmith oscardssmith force-pushed the os/make-memorynew-tfunc branch from dfef35a to 3e780d3 Compare December 5, 2024 19:16
@oscardssmith
Copy link
Member Author

@nsajko thanks for the example you gave! Turns out it simplifies to

function confuse_alias_analysis()
   @inbounds begin
	   mem0 = Memory{Int}(undef, 1)
	   mem1 = Memory{Int}(undef, 1)
	   mem0[1] = 3
	   for width in 1:2
			mem1[1] = mem0[1]
		    mem0 = mem1
	   end
	   mem0[1]
   end
end
confuse_alias_analysis()

(which will get added to the tests)

@oscardssmith
Copy link
Member Author

tuple test fixed (we had invalid TBAA on emit_memoryrefptr). If CI passes, I'll rerun pkgeval.

@oscardssmith
Copy link
Member Author

@nanosoldier runtests(["SimpleLooper", "FloatTracker", "HNSW", "PermutationGroups", "LMDB", "HMMGradients", "LIBLINEAR", "NearestNeighbors", "LIBSVM", "MusicManipulations", "Andes", "EinExprs", "FindMinimaxPolynomial", "BPGates", "PlantRayTracer", "JuMP", "SciMLBenchmarks", "StatsLearnModels", "TensorOperationsTBLIS", "SimpleDiffEq", "PosteriorStats", "DiffEqPhysics", "Sensemakr", "GeometryOptimization", "SensitivityRankCondition", "DynamicMovementPrimitives", "MLJ", "NeRCA", "EMpht", "EqualitySampler", "GenomicDiversity", "LineageCollapse", "DataDrivenSparse", "MathepiaModels", "TulipaPlots", "BaryPlots"], configuration = (julia_args = ["--check-bounds=auto"],), vs_configuration = (julia_args = ["--check-bounds=auto"],))

@oscardssmith
Copy link
Member Author

@maleadt @KristofferC any idea why my nanosoldier seems to be stalled?

@maleadt
Copy link
Member

maleadt commented Dec 6, 2024

any idea why my nanosoldier seems to be stalled?

It's not. There's other jobs running.

The initial run seemed a bit redundant since the reported miscompilation hadn't been fixed yet, but it's not my infrastructure so I don't really care if people overuse it

The run also served to test the new julia_args option; might as well do it on something relevant.

@nanosoldier
Copy link
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@oscardssmith
Copy link
Member Author

@nanosoldier runtests(["NearestNeighbors"], configuration = (julia_args = ["--check-bounds=auto"],), vs_configuration = (julia_args = ["--check-bounds=auto"],))

@oscardssmith oscardssmith force-pushed the os/make-memorynew-tfunc branch 2 times, most recently from 429910d to e003251 Compare December 6, 2024 19:04
@oscardssmith
Copy link
Member Author

the NearestNeighbors issue isn't reproducing locally which is a little scary.

@nanosoldier
Copy link
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@oscardssmith
Copy link
Member Author

so I've investigated 3 of these so far, and 2 were package bugs, and one doesn't reproduce locally. we definitely are getting close

@oscardssmith
Copy link
Member Author

@nanosoldier runtests(configuration = (julia_args=["--check-bounds=auto"],), vs_configuration = (julia_args=["--check-bounds=auto"],))

@nanosoldier
Copy link
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@KristofferC
Copy link
Member

I don't know if squashing the whole commit history here was great. It's hard to see how different bug fixes were added and if those have correspond tests now etc.

@oscardssmith
Copy link
Member Author

A concerning number of these are failing from within type inference which is highly unfortunate. I think we're likely constant folding code that we're miscompiling, but it's hard to know for sure.

@KristofferC
Copy link
Member

KristofferC commented Dec 8, 2024

There are also these precompile failures in some of the failed packages:

┌ Warning: The call to compilecache failed to create a usable precompiled cache file for ForestMensuration [8fa9ba82-4654-4ff3-8d17-47720d346ae3]
│   exception = Required dependency Base.PkgId(Base.UUID("21216c6a-2e73-6563-6e65-726566657250"), "Preferences") failed to load from a cache file.
└ @ Base loading.jl:2690

I wonder if there is some corruption that happens that causes the precompile process to not work properly.

@nsajko nsajko mentioned this pull request Dec 8, 2024
Co-authored-by: Jameson Nash <[email protected]>
Co-authored-by: Jeff Bezanson <[email protected]>
Co-authored-by: Gabriel Baraldi  <[email protected]>
@oscardssmith oscardssmith force-pushed the os/make-memorynew-tfunc branch from e003251 to bdb29cf Compare December 10, 2024 15:47
@oscardssmith oscardssmith changed the title make memorynew intrinsic optimize memorynew intrinsic for constant length Memory Dec 12, 2024
oscardssmith added a commit that referenced this pull request Dec 16, 2024
Attempt to split up #55913 into 2
pieces. This piece now only adds the `memorynew` intrinsic without any
of the optimizations enabled by
#55913. As such, this PR should
be ready to merge now. (and will make
#55913 smaller and simpler)

---------

Co-authored-by: gbaraldi <[email protected]>
@oscardssmith
Copy link
Member Author

closing in favor of #56847

stevengj pushed a commit that referenced this pull request Jan 2, 2025
Attempt to split up #55913 into 2
pieces. This piece now only adds the `memorynew` intrinsic without any
of the optimizations enabled by
#55913. As such, this PR should
be ready to merge now. (and will make
#55913 smaller and simpler)

---------

Co-authored-by: gbaraldi <[email protected]>
oscardssmith added a commit that referenced this pull request Jan 6, 2025
replaces #55913 (the rebase was
more annoying than starting from scratch)
This allows the compiler to better understand what's going on for
`memorynew` with compile-time constant length, allowing for LLVM level
escape analysis in some cases. There is more room to grow this
(currently this only optimizes for fairly small Memory since bigger ones
would require writing some more LLVM code, and we probably want a size
limit on putting Memory on the stack to avoid stackoverflow. For larger
ones, we could potentially inline the free so the Memory doesn't have to
be swept by the GC, etc.
```
julia> function g()
           m = Memory{Int}(undef, 2)
           for i in 1:2
              m[i] = i
           end
           m[1]+m[2]
       end

julia> @Btime g()
  9.735 ns (1 allocation: 48 bytes) #before
  1.719 ns (0 allocations: 0 bytes) #after
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s] needs pkgeval Tests for all registered packages should be run with this change performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.