
1.10 release beta3 has an inference regression bug: infinite hang during inference #51603

Closed
NHDaly opened this issue Oct 5, 2023 · 20 comments
Labels
bug (Indicates an unexpected problem or unintended behavior), compiler:inference (Type inference), regression (Regression in behavior compared to a previous version)
Milestone

Comments

@NHDaly
Member

NHDaly commented Oct 5, 2023

It looks like our new julia release testing automation at RAI is working and has caught a regression! 🎉 Thank you to @nickrobinson251, @quinnj and @Drvi for driving this. ❤️

Our build started failing in between Sept 25th and Sept 26th (but we are only just getting started with the automatic notifications so it took us a few days to get to this):
(screenshot: build status history showing the failures)

The failure mode is that the job hangs during PackageCompiler's incremental system image compilation:

- PackageCompiler: compiling incremental system image

and is eventually killed by our build system.

@nickrobinson251 and @kpamnany have also observed this locally on Julia master, where just using RAICode (our package) was spinning forever.
Kiran attached lldb and saw the following Julia stack traces:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x000000011d42f494 sys.dylib`julia_already_inserted_31931 at essentials.jl:13
    frame #1: 0x000000011d28a928 sys.dylib`julia_simple_walk_31884 at passes.jl:190
    frame #2: 0x000000011d54da80 sys.dylib`julia_walk_to_defs_31913 at passes.jl:236
    frame #3: 0x000000011d3117f0 sys.dylib`julia_sroa_passNOT._32108 at passes.jl:183
    frame #4: 0x000000011d3e29ac sys.dylib`julia_run_passes_ipo_safe_32621 at optimize.jl:797
    frame #5: 0x000000011d294f80 sys.dylib`julia_optimize_32604 at optimize.jl:812
    frame #6: 0x000000011d294ffc sys.dylib`jfptr_optimize_32605 + 56
    frame #7: 0x000000010473da58 libjulia-internal.1.11.0.dylib`ijl_apply_generic at gf.c:2873:35 [opt]
    frame #8: 0x000000010473da10 libjulia-internal.1.11.0.dylib`ijl_apply_generic(F=0x000000011d908ed0, args=0x000000016bca3a38, nargs=<unavailable>) at gf.c:3074:12 [opt]
    frame #9: 0x000000011d549fa8 sys.dylib`julia__typeinf_30761 at typeinfer.jl:265
    frame #10: 0x000000011d3178a8 sys.dylib`julia_typeinf_30732 at typeinfer.jl:216
    frame #11: 0x000000011d1bcdb4 sys.dylib`julia_typeinf_edge_30856 at typeinfer.jl:863
  <snip>

I haven't run a complete git bisect yet, but looking at the 1.10 release backports branch, it seems that one of these two commits is the likely culprit (CC: @aviatesk):

(screenshot: the two candidate commits on the backports branch)
@NHDaly added the bug, regression, and compiler:inference labels Oct 5, 2023
@NHDaly
Member Author

NHDaly commented Oct 5, 2023

Note: when attempting to reproduce, I ran into #51570. The workaround there seems to have helped.

@NHDaly
Member Author

NHDaly commented Oct 6, 2023

FYI reverting ae8f9ad appears to be enough to resolve the issue.

Locally, on 1.10.0-beta3, using RAICode hangs forever for me (while precompiling one of our packages, ResourceManagement).

With ae8f9ad reverted, it succeeded very quickly.

@NHDaly
Member Author

NHDaly commented Oct 6, 2023

Revert proposal for 1.10 here: #51612

@aviatesk added this to the 1.10 milestone Oct 6, 2023
@NHDaly
Member Author

NHDaly commented Oct 6, 2023

I captured this stacktrace via Ctrl-T when my process was stuck precompiling:

obvious_subtype at /nix/store/p2ycjgmsjwjbl3k352z63dl8i60kq58s-julia-1.10/lib/julia/libjulia-internal.1.10.0.dylib (unknown line)
ijl_subtype_env at /nix/store/p2ycjgmsjwjbl3k352z63dl8i60kq58s-julia-1.10/lib/julia/libjulia-internal.1.10.0.dylib (unknown line)
uv_alloc_buf at /nix/store/p2ycjgmsjwjbl3k352z63dl8i60kq58s-julia-1.10/lib/julia/sys.dylib (unknown line)
jfptr_uv_alloc_buf_75149 at /nix/store/p2ycjgmsjwjbl3k352z63dl8i60kq58s-julia-1.10/lib/julia/sys.dylib (unknown line)
ijl_apply_generic at /nix/store/p2ycjgmsjwjbl3k352z63dl8i60kq58s-julia-1.10/lib/julia/libjulia-internal.1.10.0.dylib (unknown line)
jlcapi_uv_alloc_buf_75686 at /nix/store/p2ycjgmsjwjbl3k352z63dl8i60kq58s-julia-1.10/lib/julia/sys.dylib (unknown line)
uv__read at /nix/store/p2ycjgmsjwjbl3k352z63dl8i60kq58s-julia-1.10/lib/julia/libjulia-internal.1.10.0.dylib (unknown line)
uv__stream_io at /nix/store/p2ycjgmsjwjbl3k352z63dl8i60kq58s-julia-1.10/lib/julia/libjulia-internal.1.10.0.dylib (unknown line)
uv__io_poll at /nix/store/p2ycjgmsjwjbl3k352z63dl8i60kq58s-julia-1.10/lib/julia/libjulia-internal.1.10.0.dylib (unknown line)
uv_run at /nix/store/p2ycjgmsjwjbl3k352z63dl8i60kq58s-julia-1.10/lib/julia/libjulia-internal.1.10.0.dylib (unknown line)
ijl_task_get_next at /nix/store/p2ycjgmsjwjbl3k352z63dl8i60kq58s-julia-1.10/lib/julia/libjulia-internal.1.10.0.dylib (unknown line)
poptask at /nix/store/p2ycjgmsjwjbl3k352z63dl8i60kq58s-julia-1.10/lib/julia/sys.dylib (unknown line)
wait at /nix/store/p2ycjgmsjwjbl3k352z63dl8i60kq58s-julia-1.10/lib/julia/sys.dylib (unknown line)
YY.waitYY.645 at /nix/store/p2ycjgmsjwjbl3k352z63dl8i60kq58s-julia-1.10/lib/julia/sys.dylib (unknown line)
_trywait at /nix/store/p2ycjgmsjwjbl3k352z63dl8i60kq58s-julia-1.10/lib/julia/sys.dylib (unknown line)
YY.726 at /nix/store/p2ycjgmsjwjbl3k352z63dl8i60kq58s-julia-1.10/lib/julia/sys.dylib (unknown line)
jfptr_YY.726_75384 at /nix/store/p2ycjgmsjwjbl3k352z63dl8i60kq58s-julia-1.10/lib/julia/sys.dylib (unknown line)
ijl_apply_generic at /nix/store/p2ycjgmsjwjbl3k352z63dl8i60kq58s-julia-1.10/lib/julia/libjulia-internal.1.10.0.dylib (unknown line)
start_task at /nix/store/p2ycjgmsjwjbl3k352z63dl8i60kq58s-julia-1.10/lib/julia/libjulia-internal.1.10.0.dylib (unknown line)
unknown function (ip: 0x0)
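As an aside for anyone reproducing on Linux, where there is no Ctrl-T: the same kind of backtrace snapshot can (if I recall Julia's signal handling correctly) be requested by signalling the stuck process directly. A sketch; the exact-name process lookup is an assumption, adjust for your own process tree:

```shell
#!/usr/bin/env bash
# Ctrl-T on macOS sends SIGINFO, which Julia answers with a backtrace
# snapshot; on Linux the corresponding signal is SIGUSR1. This pokes the
# newest process named exactly "julia".
pid=$(pgrep -nx julia || true)
if [ -z "$pid" ]; then
    msg="no julia process found"
else
    case "$(uname -s)" in
        Darwin) kill -INFO "$pid" ;;   # same as pressing Ctrl-T in its terminal
        *)      kill -USR1 "$pid" ;;
    esac
    msg="sent backtrace-request signal to pid $pid"
fi
echo "$msg"
```

The snapshot appears on the stuck process's own stderr, not in the shell that sent the signal.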

@aviatesk
Sponsor Member

aviatesk commented Oct 7, 2023

It seems that commit ae8f9ad (indirectly) introduced a sporadic bug related to the subtyping algorithm?

@nickrobinson251
Contributor

nickrobinson251 commented Oct 10, 2023

You're right that the issue seems to be sporadic:

  • we've been struggling to replicate this locally on Mac with v1.10 in order to create an MRE
  • we saw the build work fine with the latest backports-release-1.10 (b3898c3456)

I was able to trigger it again using latest master (651aef941e). Backtrace: bt.txt

signal (29): Information request: 29
iterate at ./array.jl:943 [inlined]
in at ./operators.jl:1303 [inlined]
already_inserted at ./compiler/ssair/passes.jl:342
perform_lifting! at ./compiler/ssair/passes.jl:723
sroa_pass! at ./compiler/ssair/passes.jl:1170
run_passes_ipo_safe at ./compiler/optimize.jl:797
run_passes_ipo_safe at ./compiler/optimize.jl:812 [inlined]
optimize at ./compiler/optimize.jl:786
jfptr_optimize_32588 at /Users/nickr/repos/rai-julia/usr/lib/julia/sys.dylib (unknown line)
_jl_invoke at /Users/nickr/repos/rai-julia/src/gf.c:2892 [inlined]
ijl_apply_generic at /Users/nickr/repos/rai-julia/src/gf.c:3093
_typeinf at ./compiler/typeinfer.jl:265
typeinf at ./compiler/typeinfer.jl:216
typeinf_edge at ./compiler/typeinfer.jl:863
abstract_call_method at ./compiler/abstractinterpretation.jl:617
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:89
abstract_call_known at ./compiler/abstractinterpretation.jl:2080
abstract_call at ./compiler/abstractinterpretation.jl:2162
abstract_call at ./compiler/abstractinterpretation.jl:2155
abstract_call at ./compiler/abstractinterpretation.jl:2310
abstract_eval_call at ./compiler/abstractinterpretation.jl:2325
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2335
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2629
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:2925
typeinf_local at ./compiler/abstractinterpretation.jl:3122
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3206
_typeinf at ./compiler/typeinfer.jl:247
typeinf at ./compiler/typeinfer.jl:216
typeinf_edge at ./compiler/typeinfer.jl:863
...
...

But I also tested out the 1.10 revert PR #51612 and still saw a hang (in CI), so (given it's sporadic) I suspect we might have misidentified the culprit.

@kpamnany is trying to replicate it on Linux so we can capture it in rr. And I will keep trying to work on an MRE.
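For reference, capturing a run under rr so a sporadic failure can be replayed deterministically might look like the following sketch; the julia invocation and script name are placeholders, not the actual build command:

```shell
#!/usr/bin/env bash
# Sketch: wrap the suspect command in an rr recording. The recording is
# stored (by default under ~/.local/share/rr) and can be re-examined any
# number of times afterwards with `rr replay`, which opens a gdb session
# on the exact same execution, hang or crash included.
if command -v rr >/dev/null 2>&1; then
    msg="rr available; recording placeholder command"
    rr record julia --project=. build_sysimage.jl || true   # placeholder invocation
else
    msg="rr not installed; skipping capture"
fi
echo "$msg"
```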

@nickrobinson251
Contributor

@vtjnash kindly spent time on a call with me, @kpamnany, and @NHDaly today. We had the hang replicated in a local process on recent Julia master, and we were able to extract some info that should hopefully help track this down.

@NHDaly
Member Author

NHDaly commented Oct 12, 2023

Okay, we have a standalone isolated MRE for the hang on master! Sorry this took us so long.

This reliably hangs on master, but not on julia 1.10. So something slightly different is happening in 1.10, but since that is nondeterministic and only happens during the package compiler build, it's harder to track down, so we've started here.


I don't think we could have gotten this without @vtjnash walking us through some very gnarly hackery to find which function was stuck in inference (including reading some pointers off of registers and some manual pointer arithmetic and memory reads).

One of my takeaways here is that it would be nice to be able to set some kind of "debug mode" flag to have julia log which function it's inferring before it starts and stops inference / optimization, so that we could have found where it was stuck much more easily. Once we found that, isolating an MRE wasn't too hard.

@aviatesk and @vtjnash: Please try to reproduce the hang from this, and see if you can work out the issue from there!

Once you find the culprit, can you also give a time estimate to fix it? I do still think that if it will take more than a couple days, we should start by reverting the culprits so that main can be unbroken until we land the fix.

const Iterator = Any

abstract type Shape end

struct ShapeUnion <: Shape
    args::Vector{Shape}
end
shape_null(::Type{Shape}) = SHAPE_NULL
const SHAPE_NULL = ShapeUnion(Shape[])

const CANCELLED = Ref(false)
throw_if_cancelled() = if CANCELLED[] throw(ErrorException("cancelled")) end

@noinline Base.@nospecializeinfer function _union_vec_no_union(args)
    return args
end

# WEIRDLY, this also reproduces if shape_disjuncts is not defined!
function shape_disjuncts end
shape_disjuncts(s::ShapeUnion) = s.args
shape_disjuncts(s::Shape) = Shape[s]
# ^ (You can try commenting out the above three lines to produce another hang as well).

function shape_union(::Type{Shape}, args::Iterator)
    return _union(args)
end
function _union(args::Iterator)
    if any(arg -> arg isa ShapeUnion, args)
        # Call the entry point rather than `_union_vec_no_union` because we are uncertain
        # about the size and because there might be some optimization opportunity.
        return shape_union(Shape, (disj for arg in args for disj in shape_disjuncts(arg)))
    end

    if !(args isa Vector)
        args = collect(Shape, args)
        # respect _union_vec_no_union's assumption
        isempty(args) && return shape_null(Shape)
        length(args) == 1 && return only(args)
    end

    # If this is a big union, check for cancellation
    length(args) > 100 && throw_if_cancelled()
    return _union_vec_no_union(args)
end

# Reproduction:
#=
julia> code_typed(
           _union,
           (Vector{Shape},),
       )
=#

julia> VERSION
v"1.11.0-DEV.638"

julia> include("inference_hang_repro.jl")
_union (generic function with 1 method)

julia> code_typed(
           _union,
           (Vector{Shape},),
       )
# it is hanging here....
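A side note for anyone reproducing this: since the repro hangs indefinitely, it can be convenient to drive it under a time budget instead of interrupting by hand. A minimal sketch; `run_bounded` is a hypothetical helper, the budget and the example julia invocation are arbitrary, and GNU coreutils `timeout` is assumed:

```shell
#!/usr/bin/env bash
# Run a possibly-hanging command under a time budget. GNU coreutils
# `timeout` exits with status 124 when the budget expires, which we
# report as a detected hang.
run_bounded() {
    local budget=$1; shift
    timeout --kill-after=10 "$budget" "$@"
    local status=$?
    if [ "$status" -eq 124 ]; then
        echo "hang detected (killed after ${budget}s)"
    else
        echo "completed with exit status $status"
    fi
}

# Demo with a stand-in command; the real repro would be something like
#   run_bounded 60 julia --project -e 'include("inference_hang_repro.jl")'
run_bounded 2 sleep 10    # prints: hang detected (killed after 2s)
```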

@NHDaly
Member Author

NHDaly commented Oct 12, 2023

(pausing it gives this stack trace):

^CERROR: InterruptException:
Stacktrace:
   [1] perform_lifting!(compact::Core.Compiler.IncrementalCompact, visited_philikes::Vector{…}, cache_key::Any, result_t::Any, lifted_leaves::Core.Compiler.IdDict{…}, stmt_val::Any, lazydomtree::Core.Compiler.LazyGenericDomtree{…})
     @ Core.Compiler ./compiler/ssair/passes.jl:731
   [2] sroa_pass!(ir::Core.Compiler.IRCode, inlining::Core.Compiler.InliningState{Core.Compiler.NativeInterpreter})
     @ Core.Compiler ./compiler/ssair/passes.jl:1170
   [3] run_passes_ipo_safe(ci::Core.CodeInfo, sv::Core.Compiler.OptimizationState{…}, caller::Core.Compiler.InferenceResult, optimize_until::Nothing)
     @ Core.Compiler ./compiler/optimize.jl:797
   [4] run_passes_ipo_safe
     @ Core.Compiler ./compiler/optimize.jl:812 [inlined]
   [5] optimize(interp::Core.Compiler.NativeInterpreter, opt::Core.Compiler.OptimizationState{…}, caller::Core.Compiler.InferenceResult)
     @ Core.Compiler ./compiler/optimize.jl:786
   [6] _typeinf(interp::Core.Compiler.NativeInterpreter, frame::Core.Compiler.InferenceState)
     @ Core.Compiler ./compiler/typeinfer.jl:265
   [7] typeinf(interp::Core.Compiler.NativeInterpreter, frame::Core.Compiler.InferenceState)
     @ Core.Compiler ./compiler/typeinfer.jl:216
   [8] typeinf_edge
     @ Core.Compiler ./compiler/typeinfer.jl:863
   [9] abstract_call_method
     @ Core.Compiler ./compiler/abstractinterpretation.jl:617
  [10] abstract_call_gf_by_type
     @ Core.Compiler ./compiler/abstractinterpretation.jl:89
  [11] abstract_call_known
     @ Core.Compiler ./compiler/abstractinterpretation.jl:2080
  [12] abstract_call
     @ Core.Compiler ./compiler/abstractinterpretation.jl:2162
  [13] abstract_call
     @ Core.Compiler ./compiler/abstractinterpretation.jl:2155
  [14] abstract_call(interp::Core.Compiler.NativeInterpreter, arginfo::Core.Compiler.ArgInfo, sv::Core.Compiler.InferenceState)

@NHDaly
Member Author

NHDaly commented Oct 13, 2023

(I've branched the discussion about the hang on master into a different thread: #51694, since it turns out it isn't actually related to the hang on 1.10 😬)

@nickrobinson251
Contributor

nickrobinson251 commented Oct 13, 2023

Okay so the hang on master seems to be a different issue to the v1.10-beta hang

The v1.10 failure mode is a hang during PackageCompiler's incremental system image compilation:

- PackageCompiler: compiling incremental system image

after which the job is eventually killed by our build system.

Turning on println for all the precompile statements shows that they all run successfully, but create_sysimg_object_file never returns -- is it possible that it is hanging while generating the object file with --output-o?

(maybe related to JuliaLang/PackageCompiler.jl#825)

@JeffBezanson
Sponsor Member

Can we get a stacktrace from interrupting it, as in #51603 (comment)?

@KristofferC
Sponsor Member

Another question: is the 1.10 issue, where it hangs on sysimage output, what is referred to by:

Our build started failing in between Sept 25th and Sept 26th

in the original post, or is that one for master?

@NHDaly
Member Author

NHDaly commented Oct 18, 2023

It is what was referred to in the original post, but it turns out that this is a somewhat nondeterministic failure, so we cannot be sure that it started failing then. Unfortunately we only had one green build before the 25th, since we had only turned on the tests on the 24th. :/

@NHDaly
Member Author

NHDaly commented Oct 18, 2023

@kpamnany is looking into trying to capture the hang with a stack trace.

We have narrowed it down to hanging after all of the precompile statements have executed, so it's possibly happening during the .o export, or during linking the .so?

@kpamnany
Contributor

Here's a backtrace I just got:

[ Info: PackageCompiler: Done

[531018] signal (11.1): Segmentation fault
in expression starting at none:0
ijl_new_struct at /home/ubuntu/julia.rai/src/datatype.c:1404
parse_sysimg<unsigned int (&)(void const*, _jl_value_t**)> at /home/ubuntu/julia.rai/src/processor.cpp:637 [inlined]
jl_init_processor_sysimg at /home/ubuntu/julia.rai/src/processor_x86.cpp:1062
ijl_set_sysimg_so at /home/ubuntu/julia.rai/src/staticdata.c:2828
_finish_julia_init at /home/ubuntu/julia.rai/src/init.c:844
jl_repl_entrypoint at /home/ubuntu/julia.rai/src/jlapi.c:725
main at /home/ubuntu/julia.rai/cli/loader_exe.c:58
unknown function (ip: 0x7f1e7e229d8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at /home/ubuntu/julia.rai/usr/bin/julia (unknown line)
Allocations: 0 (Pool: 0; Big: 0); GC: 0
                                                
[531018] signal (11.1): Segmentation fault                                                      
in expression starting at none:0
ijl_new_struct at /home/ubuntu/julia.rai/src/datatype.c:1404
parse_sysimg<unsigned int (&)(void const*, _jl_value_t**)> at /home/ubuntu/julia.rai/src/processor.cpp:637 [inlined]
jl_init_processor_sysimg at /home/ubuntu/julia.rai/src/processor_x86.cpp:1062
ijl_set_sysimg_so at /home/ubuntu/julia.rai/src/staticdata.c:2828
_finish_julia_init at /home/ubuntu/julia.rai/src/init.c:844
jl_repl_entrypoint at /home/ubuntu/julia.rai/src/jlapi.c:725
main at /home/ubuntu/julia.rai/cli/loader_exe.c:58
unknown function (ip: 0x7f1e7e229d8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at /home/ubuntu/julia.rai/usr/bin/julia (unknown line)
Allocations: 0 (Pool: 0; Big: 0); GC: 0
✖ [00m:00s] PackageCompiler: compiling incremental system image
ERROR: LoadError: failed process: Process(`/home/ubuntu/julia.rai/usr/bin/julia --color=yes --startup-file=no --pkgimages=no '--cpu-target=generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1)' --sysimage=/home/ubuntu/julia.rai/usr/lib/julia/sys.so --project=/home/ubuntu/raicode.work --output-o=/tmp/jl_dY3GpWNq3Y.o /tmp/jl_zqZkRU1MD8`, ProcessSignaled(11)) [0]

@kpamnany
Contributor

Seems like it is the sysimage, because I can get this crash again by running /home/ubuntu/julia.rai/usr/bin/julia --color=yes --startup-file=no --pkgimages=no '--cpu-target=generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1)' --sysimage=/home/ubuntu/julia.rai/usr/lib/julia/sys.so --project=/home/ubuntu/raicode.work --output-o=/tmp/jl_dY3GpWNq3Y.o /tmp/jl_zqZkRU1MD8 again, even though /tmp/jl_zqZkRU1MD8 is gone. So I can share an rr trace if that's helpful, but presumably the problem was in the creation of the sysimage.

@d-netto
Member

d-netto commented Oct 26, 2023

Our automation system has pulled the latest commits of backports-release-1.10 and rebased our local branch on top of that. We've been unable to reproduce the hang after pulling/rebasing.

@vtjnash closed this as completed Oct 27, 2023
@vtjnash
Sponsor Member

vtjnash commented Oct 27, 2023

We'll assume this is done then, and open a new issue if you catch another case.

@NHDaly
Member Author

NHDaly commented Oct 27, 2023

Thanks @vtjnash, agreed, closing this is the right move. 👍 It's annoying that we never got to the bottom of it, but I'm glad it's resolved now. Thanks all!
