# 1.10 release beta3 has an inference regression bug: infinite hang during inference (#51603)

It looks like our new julia release testing automation at RAI is working and has caught a regression! 🎉 Thank you to @nickrobinson251, @quinnj and @Drvi for driving this. ❤️

Our build started failing between Sept 25th and Sept 26th (but we are only just getting started with the automatic notifications, so it took us a few days to get to this):

The failure mode is the job hanging while PackageCompiler is compiling the incremental system image:

and then it is eventually killed by our build system.

@nickrobinson251 and @kpamnany have also observed this locally on Julia master, where just `using RAICode` (our package) was spinning forever. Kiran attached lldb and saw the following julia stack traces:

I haven't run a complete `git bisect` yet, but looking at the 1.10 release backports branch, it seems that one of these two commits is the likely culprit (CC: @aviatesk):

---
Note: when attempting to reproduce, I ran into #51570. The workaround there seems to have helped.

---
Revert proposal for 1.10 here: #51612

---
I captured this stacktrace via ctrl-t when my process was stuck precompiling:

---
It seems that commit ae8f9ad (indirectly) introduced the sporadic bug related to the subtyping algorithm?

---
You're right that the issue seems to be sporadic:

- I was able to trigger it again using latest
- But I also tested out the 1.10 revert PR #51612 and still saw a hang (in CI), so (given it's sporadic) I suspect we might have misidentified the culprit. @kpamnany is trying to replicate it on linux so we can capture it in

---
Okay, we have a standalone isolated MRE for the hang on master! Sorry this took us so long. This reliably hangs on master, but not on julia 1.10. So something slightly different is happening in 1.10, but since that failure is nondeterministic and only happens during the package compiler build, it's harder to track down, so we've started here.

I don't think we could have gotten this without @vtjnash walking us through some very gnarly hackery to find which function was stuck in inference (including reading some pointers off of registers and some manual pointer arithmetic and memory reads). One of my takeaways here is that it would be nice to be able to set some kind of "debug mode" flag to have julia log which function it's inferring before it starts and stops inference / optimization, so that we could have found where it was stuck much more easily (a rough sketch of that idea is included after the repro below). Once we found that, isolating an MRE wasn't too hard.

@aviatesk and @vtjnash: Please try to reproduce the hang from this, and see if you can work out the issue from there! Once you find the culprit, can you also give a time estimate to fix it? I do still think that if it will take more than a couple days, we should start by reverting the culprits so that main can be unbroken until we land the fix.

```julia
const Iterator = Any
abstract type Shape end
struct ShapeUnion <: Shape
    args::Vector{Shape}
end
shape_null(::Type{Shape}) = SHAPE_NULL
const SHAPE_NULL = ShapeUnion(Shape[])
const CANCELLED = Ref(false)
throw_if_cancelled() = if CANCELLED[] throw(ErrorException("cancelled")) end
@noinline Base.@nospecializeinfer function _union_vec_no_union(args)
    return args
end
# WEIRDLY, this also reproduces if shape_disjuncts is not defined!
function shape_disjuncts end
shape_disjuncts(s::ShapeUnion) = s.args
shape_disjuncts(s::Shape) = Shape[s]
# ^ (You can try commenting out the above three lines to produce another hang as well).
function shape_union(::Type{Shape}, args::Iterator)
    return _union(args)
end
function _union(args::Iterator)
    if any(arg -> arg isa ShapeUnion, args)
        # Call the entry point rather than `_union_vec_no_union` because we are uncertain
        # about the size and because there might be some optimization opportunity.
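        # Note: this re-enters `_union` via `shape_union` with a nested generator,
        # so each level of recursion presents inference with a new argument type.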
        return shape_union(Shape, (disj for arg in args for disj in shape_disjuncts(arg)))
    end
    if !(args isa Vector)
        args = collect(Shape, args)
        # respect _union_vec_no_union's assumption
        isempty(args) && return shape_null(Shape)
        length(args) == 1 && return only(args)
    end
    # If this is a big union, check for cancellation
    length(args) > 100 && throw_if_cancelled()
    return _union_vec_no_union(args)
end
# Reproduction:
#=
julia> code_typed(
           _union,
           (Vector{Shape},),
       )
=#
```

```
julia> VERSION
v"1.11.0-DEV.638"
julia> include("inference_hang_repro.jl")
_union (generic function with 1 method)
julia> code_typed(
           _union,
           (Vector{Shape},),
       )
# it is hanging here....
```
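As a rough sketch of the "debug mode" idea mentioned above: something like it can be hacked up today against the internal, undocumented (and version-dependent) `Core.Compiler.AbstractInterpreter` interface. `LoggingInterp` below is a made-up name, not anything that exists in Base; on a hang, the last "inference start" line without a matching "inference done" points at the stuck method instance.

```julia
const CC = Core.Compiler

# A wrapper interpreter that delegates to the stock NativeInterpreter
# and only adds logging around each inference frame.
struct LoggingInterp <: CC.AbstractInterpreter
    native::CC.NativeInterpreter
end
LoggingInterp() = LoggingInterp(CC.NativeInterpreter())

# Minimal AbstractInterpreter interface, forwarded to the wrapped interpreter.
CC.InferenceParams(interp::LoggingInterp) = CC.InferenceParams(interp.native)
CC.OptimizationParams(interp::LoggingInterp) = CC.OptimizationParams(interp.native)
CC.get_world_counter(interp::LoggingInterp) = CC.get_world_counter(interp.native)
CC.get_inference_cache(interp::LoggingInterp) = CC.get_inference_cache(interp.native)
CC.code_cache(interp::LoggingInterp) = CC.code_cache(interp.native)

# Core.Compiler threads the interpreter through recursive inference of callees,
# so every frame (not just the entry point) passes through this method.
function CC.typeinf(interp::LoggingInterp, frame::CC.InferenceState)
    println("inference start: ", frame.linfo)
    ret = invoke(CC.typeinf, Tuple{CC.AbstractInterpreter, CC.InferenceState}, interp, frame)
    println("inference done:  ", frame.linfo)
    return ret
end

# Usage: code_typed(_union, (Vector{Shape},); interp=LoggingInterp())
```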
---

(pausing it gives this stack trace):

```
^CERROR: InterruptException:
Stacktrace:
  [1] perform_lifting!(compact::Core.Compiler.IncrementalCompact, visited_philikes::Vector{…}, cache_key::Any, result_t::Any, lifted_leaves::Core.Compiler.IdDict{…}, stmt_val::Any, lazydomtree::Core.Compiler.LazyGenericDomtree{…})
    @ Core.Compiler ./compiler/ssair/passes.jl:731
  [2] sroa_pass!(ir::Core.Compiler.IRCode, inlining::Core.Compiler.InliningState{Core.Compiler.NativeInterpreter})
    @ Core.Compiler ./compiler/ssair/passes.jl:1170
  [3] run_passes_ipo_safe(ci::Core.CodeInfo, sv::Core.Compiler.OptimizationState{…}, caller::Core.Compiler.InferenceResult, optimize_until::Nothing)
    @ Core.Compiler ./compiler/optimize.jl:797
  [4] run_passes_ipo_safe
    @ Core.Compiler ./compiler/optimize.jl:812 [inlined]
  [5] optimize(interp::Core.Compiler.NativeInterpreter, opt::Core.Compiler.OptimizationState{…}, caller::Core.Compiler.InferenceResult)
    @ Core.Compiler ./compiler/optimize.jl:786
  [6] _typeinf(interp::Core.Compiler.NativeInterpreter, frame::Core.Compiler.InferenceState)
    @ Core.Compiler ./compiler/typeinfer.jl:265
  [7] typeinf(interp::Core.Compiler.NativeInterpreter, frame::Core.Compiler.InferenceState)
    @ Core.Compiler ./compiler/typeinfer.jl:216
  [8]
    @ Core.Compiler ./compiler/typeinfer.jl:863
  [9]
    @ Core.Compiler ./compiler/abstractinterpretation.jl:617
 [10]
    @ Core.Compiler ./compiler/abstractinterpretation.jl:89
 [11]
    @ Core.Compiler ./compiler/abstractinterpretation.jl:2080
 [12]
    @ Core.Compiler ./compiler/abstractinterpretation.jl:2162
 [13]
    @ Core.Compiler ./compiler/abstractinterpretation.jl:2155
 [14] abstract_call(interp::Core.Compiler.NativeInterpreter, arginfo::Core.Compiler.ArgInfo, sv::Core.Compiler.InferenceState)
```

---
(I've branched out the discussion about the hang on master into a different thread: #51694, since it turns out it isn't actually related to the hang on 1.10 😬)

---
Okay, so the hang on master is now tracked separately (#51694). The v1.10 failure mode is a hang while PackageCompiler is compiling the incremental system image:

and then it is eventually killed by our build system. (Maybe related to JuliaLang/PackageCompiler.jl#825.)
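For concreteness, the failing build is doing roughly the following (a simplified sketch: `create_sysimage` is PackageCompiler's documented entry point, and the output path here is a placeholder):

```julia
using PackageCompiler

# Incremental sysimage build of our package
# (output path is a placeholder):
create_sysimage(
    [:RAICode];
    sysimage_path = "raicode.so",
    incremental = true,  # layer on top of the existing sysimage
)
```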
---

Can we get a stacktrace from interrupting it, as in #51603 (comment)?
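One option for getting that from a non-interactive process is sketched below. `Base.exit_on_sigint` is documented API, though the interrupt is only delivered at Julia safepoints, so a hang in uninterruptible native code may not produce a trace:

```julia
# Put this near the top of the build script. Afterwards, Ctrl-C raises an
# InterruptException (instead of killing the process), and its stacktrace
# shows where the process was stuck.
Base.exit_on_sigint(false)
```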
---

Another question: is the 1.10 issue, where it hangs on sysimage output, what is referred to with:

in the original post, or is that one for master?

---
It is what was referred to in the original post, but it turns out that this is a somewhat nondeterministic failure, so we cannot be sure that it started failing then. Unfortunately we only had one green build before the 25th, since we had only turned on the tests on the 24th. :/

---
@kpamnany is looking into trying to capture the hang with a stack trace. We have narrowed it down to hanging after all of the precompile statements have executed, so it's possibly happening during the `.o` export, or during linking the `.so`?

---
Here's a backtrace I just got:

---
Seems like it is the sysimage, because I can get this crash again by running

---
Our automation system has pulled the latest commits of `backports-release-1.10` and rebased our local branch on top of that. We've been unable to reproduce the hang after pulling/rebasing.

---
We'll assume this is done then, and open a new issue if you catch another case.

---
Thanks @vtjnash, agreed, closing this is the right move. 👍 It's annoying that we never got to the bottom of it, but I'm glad it's resolved now. Thanks all!