2GB limit to sysimage size? #1019

Open
dgleich opened this issue Dec 5, 2024 · 12 comments

@dgleich

dgleich commented Dec 5, 2024

Is there a 2GB limit on sysimage file size?

When I compile enough packages that the sysimage grows past 2GB, it fails when I try to use it, with this error message:

Testing sysimage: /Users/dgleich/.julia/sysimages/FullJuliaSysimage.so of size 2240882280

[56623] signal 11 (1): Segmentation fault: 11
in expression starting at none:0
jfptr___init___118647 at /Users/dgleich/.julia/sysimages/FullJuliaSysimage.so (unknown line)
jl_apply at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-macmini-x64-4.0/build/default-macmini-x64-4-0/julialang/julia-master/src/./julia.h:2157 [inlined]
jl_module_run_initializer at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-macmini-x64-4.0/build/default-macmini-x64-4-0/julialang/julia-master/src/toplevel.c:76
_finish_julia_init at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-macmini-x64-4.0/build/default-macmini-x64-4-0/julialang/julia-master/src/init.c:902
julia_init at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-macmini-x64-4.0/build/default-macmini-x64-4-0/julialang/julia-master/src/init.c:843
jl_repl_entrypoint at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-macmini-x64-4.0/build/default-macmini-x64-4-0/julialang/julia-master/src/jlapi.c:1053
Allocations: 1 (Pool: 1; Big: 0); GC: 0

System info:


julia> versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin24.0.0)
  CPU: 16 × Intel(R) Xeon(R) W-2140B CPU @ 3.20GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake-avx512)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)

I'm still looking for other possible causes, but there seems to be a success/failure threshold at 2GB regardless of which packages I add.
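
As a quick sanity check on that threshold, here is a minimal sketch that compares a file size against 2^31 bytes, the span of a signed 32-bit offset. The helper name `exceeds_rel32_reach` is made up for illustration and is not part of PackageCompiler; the size is the one printed in the test output above. File size is only a proxy here, since what actually matters is the distance between an instruction and the symbol it refers to.

```julia
# Hypothetical helper (not a PackageCompiler API): is an image of this size
# larger than what a signed 32-bit offset can span?
const REL32_REACH = Int64(typemax(Int32)) + 1   # 2^31 bytes = 2 GiB

exceeds_rel32_reach(nbytes::Integer) = nbytes >= REL32_REACH

failing_size = 2240882280            # size reported for FullJuliaSysimage.so
@show exceeds_rel32_reach(failing_size)
@show failing_size - REL32_REACH     # how many bytes past the 2 GiB mark
```

For a file on disk, the same check would be `exceeds_rel32_reach(filesize(path))`.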

These sysimages were constructed with PackageCompiler.create_sysimage(pkgs; sysimage_path="FullJuliaSysimage.so"). (The list of packages is at the bottom; I don't think it is especially relevant, but it may be!)

Full list of packages I used to trigger the error:

pkgs = ["MultiFloats",  
       "Polynomials",
       "SpecialFunctions",
       "Roots", 
       "FastTransforms",
       "Interpolations", 
       "Graphs",
       "SimpleWeightedGraphs",
       "Metis",
       "Combinatorics",
       "JuMP",
       "Ipopt",
       "GLPK",
       "Clp",
       "HiGHS",
       "Convex",
       "SCS",
       "OptimTestProblems",
       "Optim",
       "NonlinearSolve",
       "LsqFit", 
       "Tulip",
       "ForwardDiff", 
       "Symbolics",
       "DifferentialEquations",
       "ParserCombinator",
       "MemoryViews",
       "ApproxFun",
       "NaNMath",
       "LineSearches",
       "Meshes",
       "Gridap",
       "GenericLinearAlgebra", 
       "GLMakie",
       "UnicodePlots",
       "PGFPlotsX",
       "Luxor",
       "AlgebraOfGraphics",  
       "ReinforcementLearning",
       "NMF",
       "RDatasets" ]
@dgleich
Author

dgleich commented Dec 5, 2024

On my system, this reproduces the issue with a sysimage right at the 2GB boundary.

pkgs = ["MultiFloats",  
       "Polynomials",
       "SpecialFunctions",
       "Roots", 
       "FastTransforms",
       "Interpolations", 
       "Graphs",
       "SimpleWeightedGraphs",
       "Metis",
       "Combinatorics",
       "JuMP",
       "Ipopt",
       "GLPK",
       "Clp",
       "HiGHS",
       "Convex",
       "SCS",
       "OptimTestProblems",
       "Optim",
       "NonlinearSolve",
       "LsqFit", 
       "Tulip",
       "ForwardDiff", 
       "Symbolics",
       "DifferentialEquations",
       "ParserCombinator",
       "MemoryViews",
       "ApproxFun",
       "NaNMath",
       "LineSearches",
       "Meshes",
       "Gridap",
       "GenericLinearAlgebra", 
       "GLMakie",
       "UnicodePlots",
       "PGFPlotsX",
       "Luxor",
       "AlgebraOfGraphics",  
       "ReinforcementLearning",
       "NMF",
       "RDatasets" ]

using Pkg
Pkg.activate(; temp=true)
Pkg.add("PackageCompiler")
Pkg.add(pkgs; preserve=Pkg.PRESERVE_ALL_INSTALLED)

sysimage1 = joinpath(dirname(Base.active_project()), "sysimage1", "FullSysimage1.so")
sysimage2 = joinpath(dirname(Base.active_project()), "sysimage2", "FullSysimage2.so")

using PackageCompiler

PackageCompiler.create_sysimage(pkgs; sysimage_path=sysimage1)
PackageCompiler.create_sysimage(pkgs[1:end-1]; sysimage_path=sysimage2)
@show filesize(sysimage1)
@show filesize(sysimage2)

run(`/Applications/Julia-1.11.app/Contents/Resources/julia/bin/julia -J$(sysimage1) -e"println(length(Base.loaded_modules))"`)
run(`/Applications/Julia-1.11.app/Contents/Resources/julia/bin/julia -J$(sysimage2) -e"println(length(Base.loaded_modules))"`)

@dgleich
Author

dgleich commented Dec 6, 2024

Okay, so here is what I think is going on in this case.

tl;dr - the maximum displacement for a position-independent (RIP-relative) offset on x86_64/amd64 is a signed 32-bit value, i.e. 31 bits of magnitude. If the target is farther away than that, the displacement is computed incorrectly. This appears to be a clang/Xcode/g++ bug.

When this crashes, here's the line it crashes on:

Process 96024 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x114ddee40)
    frame #0: 0x0000000192e1a282 FullSysimage1.so`jfptr___init___122579 + 2
FullSysimage1.so`jfptr___init___122579:
->  0x192e1a282 <+2>:  movq   -0x7e03b449(%rip), %rax
    0x192e1a289 <+9>:  movq   -0x7e03b448(%rip), %rdi
    0x192e1a290 <+16>: callq  *%rax
    0x192e1a292 <+18>: movq   %rax, %r13
Target 0: (julia) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x114ddee40)
  * frame #0: 0x0000000192e1a282 FullSysimage1.so`jfptr___init___122579 + 2
    frame #1: 0x0000000100b79b6c libjulia-internal.1.11.2.dylib`jl_module_run_initializer [inlined] jl_apply(args=0x00007ff7bfefee10, nargs=1) at julia.h:2157:12 [opt]
    frame #2: 0x0000000100b79b5d libjulia-internal.1.11.2.dylib`jl_module_run_initializer(m=0x00000001bbcb35b0) at toplevel.c:76:9 [opt]
    frame #3: 0x0000000100b643c3 libjulia-internal.1.11.2.dylib`_finish_julia_init(rel=JL_IMAGE_CWD, ptls=<unavailable>, ct=<unavailable>) at init.c:902:13 [opt]
    frame #4: 0x0000000100b63bdb libjulia-internal.1.11.2.dylib`julia_init(rel=JL_IMAGE_CWD) at init.c:843:5 [opt]
    frame #5: 0x0000000100bb096c libjulia-internal.1.11.2.dylib`jl_repl_entrypoint(argc=0, argv=0x00007ff7bfeff578) at jlapi.c:1053:5 [opt]
    frame #6: 0x0000000100003f79 julia`main + 9
    frame #7: 0x00007ff81537c345 dyld`start + 1909

Here's the function from objdump in sysimage1.so (the one that doesn't work)

000000000199b280 <_jfptr___init___122579>:
 199b280: 41 55                         pushq   %r13
 199b282: 48 8b 05 b7 4b fc 81          movq    -0x7e03b449(%rip), %rax ## 0xffffffff8395fe40 <_jl_small_typeof+0xfffffffeffffffe0>
 199b289: 48 8b 3d b8 4b fc 81          movq    -0x7e03b448(%rip), %rdi ## 0xffffffff8395fe48 <_jl_small_typeof+0xfffffffeffffffe8>
 199b290: ff d0                         callq   *%rax
 199b292: 49 89 c5                      movq    %rax, %r13
 199b295: e8 16 8d 72 fe                callq   0xc3fb0 <_julia___init___122578>
 199b29a: 48 8b 05 1f c0 aa 02          movq    0x2aac01f(%rip), %rax   ## 0x44472c0 <_sigsetjmp+0x44472c0>
 199b2a1: 48 8b 00                      movq    (%rax), %rax
 199b2a4: 41 5d                         popq    %r13
 199b2a6: c3                            retq
 199b2a7: 66 0f 1f 84 00 00 00 00 00    nopw    (%rax,%rax)

Looking around at other jfptr_init functions in sysimage2.so (the one that does work!) there are a lot of functions that look very similar...

0000000000036af0 <_jfptr_init_regex_233950>:
   36af0: 41 55                         pushq   %r13
   36af2: 48 8b 05 47 24 0a 7e          movq    0x7e0a2447(%rip), %rax  ## 0x7e0d8f40 <_jl_pgcstack_func_slot>
   36af9: 48 8b 3d 48 24 0a 7e          movq    0x7e0a2448(%rip), %rdi  ## 0x7e0d8f48 <_jl_pgcstack_key_slot>
   36b00: ff d0                         callq   *%rax
   36b02: 49 89 c5                      movq    %rax, %r13
   36b05: e8 16 49 6a 03                callq   0x36db420 <_julia_init_regex_233949>
   36b0a: 41 5d                         popq    %r13
   36b0c: c3                            retq
   36b0d: 0f 1f 00                      nopl    (%rax)

So these lines (where the segfault occurs)

 199b282: 48 8b 05 b7 4b fc 81          movq    -0x7e03b449(%rip), %rax ## 0xffffffff8395fe40 <_jl_small_typeof+0xfffffffeffffffe0>
 199b289: 48 8b 3d b8 4b fc 81          movq    -0x7e03b448(%rip), %rdi ## 0xffffffff8395fe48 <_jl_small_typeof+0xfffffffeffffffe8>

should be loading up the address of _jl_pgcstack_func_slot/key_slot.
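
To make the two disassemblies easier to compare, the sketch below decodes the four little-endian displacement bytes of each `movq` and adds them to the address of the following instruction, which is how a RIP-relative operand is resolved. `decode_rel32` and `rip_target` are names invented for this illustration; the bytes and addresses are copied from the objdump output above.

```julia
# Assemble the 4 little-endian displacement bytes into a signed 32-bit value.
decode_rel32(b::NTuple{4,UInt8}) = reinterpret(Int32,
    UInt32(b[1]) | (UInt32(b[2]) << 8) | (UInt32(b[3]) << 16) | (UInt32(b[4]) << 24))

# RIP-relative addressing: address of the next instruction plus the
# sign-extended displacement, wrapping modulo 2^64.
rip_target(next_ip::Integer, disp::Int32) = UInt64(next_ip) + (disp % UInt64)

# sysimage2 (works): 48 8b 05 47 24 0a 7e at 0x36af2, next instruction at 0x36af9
good = decode_rel32((0x47, 0x24, 0x0a, 0x7e))
# sysimage1 (crashes): 48 8b 05 b7 4b fc 81 at 0x199b282, next instruction at 0x199b289
bad = decode_rel32((0xb7, 0x4b, 0xfc, 0x81))

println(string(good, base = 16))                            # 7e0a2447
println(string(bad, base = 16))                             # -7e03b449
println(string(rip_target(0x00036af9, good), base = 16))    # 7e0d8f40
println(string(rip_target(0x0199b289, bad), base = 16))     # ffffffff8395fe40
```

Both results agree with objdump's own annotations: the working load resolves to the slot address, while the crashing one resolves to an address with the upper 32 bits set.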

Back to sysimage1.so.

% nm /var/folders/yd/vb_8j2ns7c763wxf9q184bk80000gn/T/jl_9xTNej/sysimage1/FullSysimage1.so | grep _jl_pgcstack_
000000008395fe40 s _jl_pgcstack_func_slot
000000008395fe48 s _jl_pgcstack_key_slot

So if we take 0x199b289 (the instruction pointer, i.e. the address of the next instruction) and need to get to 0x8395fe40, we need a displacement of 0x81fc4bb7 (which is bigger than 31 bits...)

Of course, since something is probably computing this with overflow, let's see what happens:

x = 0x8395fe40 - 0x0199b289      # required displacement as a UInt32: 0x81fc4bb7
UInt32(-reinterpret(Int32, x))   # read the same bits as a signed Int32 and take the magnitude

which gives 0x7e03b449, matching the -0x7e03b449 in the disassembly: the required displacement has wrapped around to a negative value.

So it seems like this is a bit of a stumbling block for using larger than 2GB sysimages.

Okay, so where does this error come from:

The object file shows this:

% objdump -dr /var/folders/yd/vb_8j2ns7c763wxf9q184bk80000gn/T/jl_SLOP2XY90L-o.a
... 
00000000000da0d0 <_jfptr___init___122579>:
   da0d0: 41 55                         pushq   %r13
   da0d2: 48 8b 05 00 00 00 00          movq    (%rip), %rax            ## 0xda0d9 <_jfptr___init___122579+0x9>
                00000000000da0d5:  X86_64_RELOC_SIGNED  _jl_pgcstack_func_slot
   da0d9: 48 8b 3d 00 00 00 00          movq    (%rip), %rdi            ## 0xda0e0 <_jfptr___init___122579+0x10>
                00000000000da0dc:  X86_64_RELOC_SIGNED  _jl_pgcstack_key_slot
   da0e0: ff d0                         callq   *%rax
   da0e2: 49 89 c5                      movq    %rax, %r13
   da0e5: e8 00 00 00 00                callq   0xda0ea <_jfptr___init___122579+0x1a>
                00000000000da0e6:  X86_64_RELOC_BRANCH  _julia___init___122578
   da0ea: 48 8b 05 00 00 00 00          movq    (%rip), %rax            ## 0xda0f1 <_jfptr___init___122579+0x21>
                00000000000da0ed:  X86_64_RELOC_GOT_LOAD        _jl_nothing@GOTPCREL
   da0f1: 48 8b 00                      movq    (%rax), %rax
   da0f4: 41 5d                         popq    %r13
   da0f6: c3                            retq
   da0f7: 66 0f 1f 84 00 00 00 00 00    nopw    (%rax,%rax)
...

So it's g++ that either doesn't compute the offset correctly or fails to throw an error saying it can't handle integrating everything into a shared library.

I'm looking into whether there is an option that would flag this error.
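
For illustration, this is roughly the kind of range check one would hope for when a signed 32-bit PC-relative relocation is resolved. It is a sketch under assumptions, not how ld or clang actually handle relocations; `Rel32Overflow` and `checked_rel32` are hypothetical names, and the addresses are the ones from the two sysimages above.

```julia
# Illustrative only: NOT the real linker logic, just the arithmetic it implies.
struct Rel32Overflow <: Exception
    symbol::String
    needed::Int64      # displacement that would be required
end

# Return the displacement from `next_ip` to `target` if it fits in a signed
# 32-bit field; otherwise throw instead of silently truncating.
function checked_rel32(symbol::AbstractString, next_ip::Integer, target::Integer)
    needed = Int64(target) - Int64(next_ip)
    typemin(Int32) <= needed <= typemax(Int32) || throw(Rel32Overflow(symbol, needed))
    return Int32(needed)
end

# sysimage2: the slot is within reach of the instruction.
ok = checked_rel32("_jl_pgcstack_func_slot", 0x00036af9, 0x7e0d8f40)
println(string(ok, base = 16))

# sysimage1: the slot is more than 2^31 - 1 bytes away, so this throws.
try
    checked_rel32("_jl_pgcstack_func_slot", 0x0199b289, 0x8395fe40)
catch err
    println("cannot encode ", err.symbol, ": needs 0x", string(err.needed, base = 16))
end
```

The design choice in the sketch is simply to do the subtraction in 64 bits before narrowing, so an out-of-range distance is detected rather than wrapped.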

@dgleich
Author

dgleich commented Dec 8, 2024

Curiously, this does work with a 3.2GB sysimage under Linux, so it seems to be something macOS-specific.

@Zentrik
Member

Zentrik commented Dec 8, 2024

I suspect this is because the small code model is used on macOS but not on Linux; see https://github.com/JuliaLang/julia/blob/c897a13c45c1222b4b16cf941348beef25f97ee0/src/aotcompile.cpp#L1889 and a previous issue with the same cause, #500 (comment).

@dgleich
Author

dgleich commented Dec 8, 2024

I was looking at that :).

Time to recompile Julia to see if that fixes it.

@dgleich
Author

dgleich commented Dec 10, 2024

I tried editing aotcompile.cpp to use the medium code model on the Mac.

This runs into a linker error when building the base Julia sysimage... (sys-o.a)

clang++ -mmacosx-version-min=10.14 -march=native -mtune=native -integrated-as -m64  -shared -fPIC -L/Volumes/Videos/julia-compile/julia/usr/lib/julia -L/Volumes/Videos/julia-compile/julia/usr/lib -L/Volumes/Videos/julia-compile/julia/usr/lib -o /Volumes/Videos/julia-compile/julia/usr/lib/julia/sys.dylib -Xlinker -all_load /Volumes/Videos/julia-compile/julia/usr/lib/julia/sys-o.a  -ljulia-internal -ljulia $([ Darwin = WINNT ] && echo '' -lopenlibm -lssp --disable-auto-import --disable-runtime-pseudo-reloc)

ld: illegal text-relocation in '_jfptr_YY.handle_matchNOT.YY.521_46438'+0xC (/Volumes/Videos/julia-compile/julia/usr/lib/julia/sys-o.a[2](text#0.o)) to '_jl_pgcstack_func_slot'

I'll often get errors on different functions (I think it's just erroring on the first one), e.g. I tried something else and got:

ld: illegal text-relocation in '_julia_Dict_44375'+0x43 (/Volumes/Videos/julia-compile/julia/usr/lib/julia/sys-o.a[2](text#0.o)) to '_SUM.CoreDOT.GenericMemoryYY.26010'

This is using a recent clang++

% clang++ --version
Apple clang version 16.0.0 (clang-1600.0.26.4)
Target: x86_64-apple-darwin23.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

@dgleich
Author

dgleich commented Dec 10, 2024

Here's the commit where this was changed for Linux...

JuliaLang/julia#53391

@topolarity
Member

Yeah, every platform supports a different set of relocations (ELF vs. COFF vs. Mach-O) and LLVM is not always good about complaining up front when you ask it to use a code model that is not fully implemented for a given platform / binary format.

You might try using the Large code model instead of Medium to see if that succeeds on macOS.

@dgleich
Author

dgleich commented Dec 11, 2024

Large also fails :(. All the entries in sys-o.a that Julia compiles are generated with pcrel=false; i.e., they aren't position independent.

I was doing these tests on a compile of 1.11.2 to keep changes minimal. But it seems like LLVM 16 just doesn't support CodeModel medium on macOS with PIC, or there is some black magic I haven't determined yet.

On the other hand, moving to Julia master (1.12-dev, LLVM 18) shows that CodeModel medium does work on macOS now. I'm checking if that'll compile the multiGB sysimage.

@topolarity
Member

topolarity commented Dec 11, 2024

cc @gbaraldi sounds like we can potentially turn this on for more platforms, which would be great news

It would be nice to find what change upstream fixed this too

@dgleich
Author

dgleich commented Dec 12, 2024

Agreed on finding out what changed in LLVM to fix it :) I never like solutions where there's a link in the chain I don't understand.

In my current test of a large sysimage on 1.12-dev, I ran into a new issue at the linking stage.

g++ -m64 -march=x86-64 -shared -L/Volumes/Videos/julia-compile/julia/usr/lib -L/Volumes/Videos/julia-compile/julia/usr/lib -o /var/folders/yd/vb_8j2ns7c763wxf9q184bk80000gn/T/jl_OZv5Qu/sysimage1/FullSysimage1.so -Wl,-all_load /var/folders/yd/vb_8j2ns7c763wxf9q184bk80000gn/T/jl_MNSFELqhQ3-o.a -ljulia -ljulia-internal -fPIC -Wl,-rpath,@executable_path -Wl,-rpath,@executable_path/julia
Undefined symbols for architecture x86_64:
  "_jl_fptr_sparam", referenced from:
      _tojlinvoke552860 in jl_MNSFELqhQ3-o.a[3](text#1.o)
      _tojlinvoke559005 in jl_MNSFELqhQ3-o.a[7](text#5.o)
      _tojlinvoke552862 in jl_MNSFELqhQ3-o.a[7](text#5.o)
ld: symbol(s) not found for architecture x86_64
clang++: error: linker command failed with exit code 1 (use -v to see invocation)

000000000089fb40 <_tojlinvoke552860>:
  89fb40: 50                            pushq   %rax
  89fb41: 48 8b 0d 00 00 00 00          movq    (%rip), %rcx            ## 0x89fb48 <_tojlinvoke552860+0x8>
                000000000089fb44:  X86_64_RELOC_SIGNED  _jl_globalYY.552861
  89fb48: e8 00 00 00 00                callq   0x89fb4d <_tojlinvoke552860+0xd>
                000000000089fb49:  X86_64_RELOC_BRANCH  _jl_fptr_sparam
  89fb4d: 59                            popq    %rcx
  89fb4e: c3                            retq
  89fb4f: 90                            nop

That function is definitely in libjulia-internal.dylib:

% nm /Volumes/Videos/julia-compile/julia/usr/lib/libjulia-internal.dylib | grep sparam
00000000003cd998 S _jl_builtin__compute_sparams
0000000000030720 T _jl_f__compute_sparams
0000000000388ba8 D _jl_f__compute_sparams_addr
000000000001a070 t _jl_fptr_sparam
0000000000371680 S _jl_fptr_sparam_addr

But it looks like it isn't exported (little t vs. big T)?
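
The little-`t`-versus-big-`T` reading can be checked mechanically. In `nm` output, an uppercase type letter marks an external (global) symbol and a lowercase one marks a local symbol. The sketch below applies that rule to the lines pasted above; `nm_symbols` is a hypothetical helper, and it assumes the simple three-column `address type name` layout shown here.

```julia
# Map each symbol name to `true` if nm marks it external (uppercase type
# letter) and `false` if local (lowercase). Assumes "address type name" lines.
function nm_symbols(output::AbstractString)
    syms = Dict{String,Bool}()
    for line in split(output, '\n')
        parts = split(line)
        length(parts) == 3 || continue      # skip blank or differently shaped lines
        kind, name = parts[2], parts[3]
        syms[String(name)] = isuppercase(first(kind))
    end
    return syms
end

nm_output = """
00000000003cd998 S _jl_builtin__compute_sparams
0000000000030720 T _jl_f__compute_sparams
0000000000388ba8 D _jl_f__compute_sparams_addr
000000000001a070 t _jl_fptr_sparam
0000000000371680 S _jl_fptr_sparam_addr
"""

syms = nm_symbols(nm_output)
for name in sort(collect(keys(syms)))
    println(rpad(name, 32), syms[name] ? "external" : "local")
end
```

On this input only `_jl_fptr_sparam` comes out as local, which is consistent with the linker being unable to resolve it from another image.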

There seem to be a number of checks for this function, so I'm guessing it shouldn't be called in the shared object?

@dgleich
Author

dgleich commented Dec 14, 2024

I'm guessing JuliaLang/julia#56817 fixed the _jl_fptr_sparam issues as I don't see them anymore.

Still checking on what happens with a 2GB+ sysimage (some of the packages seem to break at precompilation for 1.12-dev, so my initial set doesn't quite work there.)
