Segmentation fault in GraphPPL.jl tests on 1.11, works fine in debugger #56459

Open
bvdmitri opened this issue Nov 5, 2024 · 14 comments
Labels
regression 1.11 Regression in the 1.11 release

Comments

@bvdmitri
Contributor

bvdmitri commented Nov 5, 2024

This test in GraphPPL.jl causes a segmentation fault. The segmentation fault can be reproduced by copy-pasting the content of the test (plus necessary imports) in REPL. Interestingly, the test passes normally under the debugger. The notable thing is that this line

y = getorcreate!(model, ctx, :y, 1)

should return a fully initialized y, but on 1.11 it returns an array of #undef values.
[image]

The code in the loop uses isassigned under the hood to initialize the elements of y, and the check works correctly under the debugger and on 1.10; e.g. in the VSCode debugger view I get [image]

The fact that debugging works normally does not really allow us to narrow down the scope of the issue. It also doesn't seem to happen in real code that relies on this functionality, only in tests. Julia shouldn't really segfault so it might indicate deeper problems somewhere else.

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 11 × Apple M3 Pro
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m3)
Threads: 1 default, 0 interactive, 1 GC (on 5 virtual cores)

The code that segfaults is on the main branch

commit c97718a10bcf035cff093acf52ee9fe30f225b35 (HEAD -> main, origin/main, origin/HEAD)
Author: Wouter Nuijten <[email protected]>
Date:   Fri Oct 11 11:44:21 2024 +0200

    Update codecov action
(GraphPPL) pkg> st
Project GraphPPL v4.3.3 
Status `~/.julia/dev/GraphPPL.jl/Project.toml`
  [0f2f92aa] BitSetTuples v1.1.5
  [864edb3b] DataStructures v0.18.20
  [85a47980] Dictionaries v0.4.2
  [1914dd2f] MacroTools v0.5.13
  [fa8bd995] MetaGraphsNext v0.7.1
  [d9ec5142] NamedTupleTools v0.14.3
  [aedffcd0] Static v1.1.1
  [90137ffa] StaticArrays v1.9.8
  [9d95972d] TupleTools v1.6.0
  [9602ed7d] Unrolled v0.1.5
@bvdmitri
Contributor Author

bvdmitri commented Nov 5, 2024

The error

julia> GraphPPL.add_terminated_submodel!(model, ctx, options, hgf, (y = y,), static(1))

[54306] signal 11 (2): Segmentation fault: 11
in expression starting at REPL[28]:1
add_edge! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:1748 [inlined]
add_edge! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:1702
unknown function (ip: 0x327230123)
#93 at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:2093
foreach at ./abstractarray.jl:3187 [inlined]
materialize_factor_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:2092 [inlined]
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:2078
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:1976 [inlined]
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:1905 [inlined]
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:1901 [inlined]
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:1887 [inlined]
macro expansion at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/model_macro.jl:594 [inlined]
macro expansion at /Users/bvdmitri/.julia/dev/GraphPPL.jl/test/testutils.jl:243 [inlined]
add_terminated_submodel! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/model_macro.jl:726
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/model_macro.jl:710
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:2034 [inlined]
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:1892 [inlined]
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:1887 [inlined]
macro expansion at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/model_macro.jl:545 [inlined]
macro expansion at /Users/bvdmitri/.julia/dev/GraphPPL.jl/test/testutils.jl:248 [inlined]
add_terminated_submodel! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/model_macro.jl:726 [inlined]
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/model_macro.jl:710
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:2034 [inlined]
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:1892 [inlined]
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:1887 [inlined]
macro expansion at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/model_macro.jl:548 [inlined]
macro expansion at /Users/bvdmitri/.julia/dev/GraphPPL.jl/test/testutils.jl:270 [inlined]
add_terminated_submodel! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/model_macro.jl:726
unknown function (ip: 0x3271ed073)
jl_apply at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/./julia.h:2157 [inlined]
do_call at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/interpreter.c:126
eval_stmt_value at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/interpreter.c:174
eval_body at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/interpreter.c:663
jl_interpret_toplevel_thunk at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/interpreter.c:821
jl_toplevel_eval_flex at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/toplevel.c:943
jl_toplevel_eval_flex at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/toplevel.c:886
jl_toplevel_eval_flex at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/toplevel.c:886
jl_toplevel_eval_flex at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/toplevel.c:886
ijl_toplevel_eval at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/toplevel.c:952 [inlined]
ijl_toplevel_eval_in at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/toplevel.c:994
eval at ./boot.jl:430 [inlined]
eval_user_input at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:245
repl_backend_loop at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:342
#start_repl_backend#59 at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:327
start_repl_backend at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:324
#run_repl#72 at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:483
run_repl at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:469
jfptr_run_repl_10089 at /Users/bvdmitri/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/share/julia/compiled/v1.11/REPL/u0gqU_pEq4i.dylib (unknown line)
#1139 at ./client.jl:446
jfptr_YY.1139_14579 at /Users/bvdmitri/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/share/julia/compiled/v1.11/REPL/u0gqU_pEq4i.dylib (unknown line)
jl_apply at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/./julia.h:2157 [inlined]
jl_f__call_latest at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/builtins.c:875
#invokelatest#2 at ./essentials.jl:1055 [inlined]
invokelatest at ./essentials.jl:1052 [inlined]
run_main_repl at ./client.jl:430
repl_main at ./client.jl:567 [inlined]
_start at ./client.jl:541
jfptr__start_72559.1 at /Users/bvdmitri/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/lib/julia/sys.dylib (unknown line)
jl_apply at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/./julia.h:2157 [inlined]
true_main at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/jlapi.c:900
jl_repl_entrypoint at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/jlapi.c:1059
Allocations: 30712733 (Pool: 30710467; Big: 2266); GC: 42
[1]    54306 segmentation fault  julia

@giordano
Contributor

giordano commented Nov 5, 2024

Can we get a self-contained MWE instead of referencing code on another repository? I'm asking this also because

The segmentation fault can be reproduced by copy-pasting the content of the test (plus necessary imports) in REPL.

is very much not true, as one also needs to copy a bunch of definitions from https://github.com/ReactiveBayes/GraphPPL.jl/blob/c97718a10bcf035cff093acf52ee9fe30f225b35/test/testutils.jl, and tracking down all the missing imports (of which there are a lot) isn't fun.

Side note: starting julia with --check-bounds=yes sometimes helps track down segfaults, if they are caused by indexing arrays out of bounds.

@bvdmitri
Contributor Author

bvdmitri commented Nov 5, 2024

OK, the problem with #undef seems to be purely visual: the array is initialized, but the generic code for show prints it as #undef for whatever reason. The segmentation fault on 1.11 is still real, though.

instead of referencing code on another repository

@giordano, here is the most minimal example I could come up with; however, the issue appears in the repository and I cannot create an MWE without the package:

using GraphPPL, Distributions
import GraphPPL: @model

@model function gcv(κ, ω, z, x, y)
    log_σ := κ * z + ω
    y ~ Normal(x, exp(log_σ))
end

@model function gcv_lm(y, x_prev, x_next, z, ω, κ)
    x_next ~ gcv(x = x_prev, z = z, ω = ω, κ = κ)
    y ~ Normal(x_next, 1)
end

@model function hgf(y)

    # Specify priors

    ξ ~ Gamma(1, 1)
    ω_1 ~ Normal(0, 1)
    ω_2 ~ Normal(0, 1)
    κ_1 ~ Normal(0, 1)
    κ_2 ~ Normal(0, 1)
    x_1[1] ~ Normal(0, 1)
    x_2[1] ~ Normal(0, 1)
    x_3[1] ~ Normal(0, 1)

    # Specify generative model

    for i in 2:(length(y) + 1)
        x_3[i] ~ Normal(x_3[i - 1], ξ)
        x_2[i] ~ gcv(x = x_2[i - 1], z = x_3[i], ω = ω_2, κ = κ_2)
        x_1[i] ~ gcv_lm(x_prev = x_1[i - 1], z = x_2[i], ω = ω_1, κ = κ_1, y = y[i - 1])
    end
end


function mwe()
    model = GraphPPL.Model(identity, GraphPPL.PluginsCollection(), GraphPPL.DefaultBackend())
    ctx = GraphPPL.getcontext(model)
    y = nothing
    for i in 1:10
        y = GraphPPL.getorcreate!(model, ctx, :y, i)
    end
    GraphPPL.add_terminated_submodel!(model, ctx, GraphPPL.NodeCreationOptions(), hgf, (y = y,), GraphPPL.static(1))
    return model
end

mwe() isa GraphPPL.Model

This code segfaults in 1.11.

I also tried to manually debug it with no success. I also dev-ed all the dependencies and removed all the @inbounds from their code. It didn't help. Using --check-bounds=yes didn't help to identify the issue either. However, what I noticed is that if I change the following code in GraphPPL from

for variable_node in variable_nodes
    add_edge!(model, factor_node_id, factor_node_propeties, variable_node, interface_name, index)
    index += increase_index(variable_node)
end

to

foreach(variable_nodes) do variable_node
    add_edge!(model, factor_node_id, factor_node_propeties, variable_node, interface_name, index)
    index += increase_index(variable_node)
end

then the problem goes away and there is no segmentation fault. My CS expertise is not good enough to track down segmentation faults.

KristofferC added the "bisect wanted" and "regression 1.11" (Regression in the 1.11 release) labels Nov 5, 2024
@giordano
Contributor

giordano commented Nov 5, 2024

however, the issue appears in the repository and I cannot create an MWE without the package:

While a reproducer should preferably be as small as possible (crafting a minimal reproducer, for example by binary search if you have no other clue, is already a large chunk of the work of hunting down a bug), saying "go and copy some code from somewhere else" doesn't work very well. I tried for like 10 minutes to build the example by copying the code piece by piece from the tests but gave up out of frustration because I'm not familiar with the codebase and didn't know what to do exactly.

That said, the segfault doesn't seem to reproduce on master for me (at least not on ee09ae7; on some later versions JLD2.jl is broken. Edit: JLD2 v0.5.8 fixed the issue), so a bisection could be done to find the patch which fixed it.

@bvdmitri
Contributor Author

bvdmitri commented Nov 6, 2024

saying "go and copy some code from somewhere else" doesn't work very well. I tried for like 10 minutes

Point taken. I did think it would be easier; sorry for not preparing a better MWE. Nice to hear that it is fixed on master. I can try to run the bisection; is there a script that simplifies this process?

@giordano
Contributor

giordano commented Nov 6, 2024

is there a script that simplifies this process?

I usually use a variation of the following script with git bisect run, depending on what exactly is needed to reproduce the bug:

#!/bin/bash

make cleanall || true
make -j60 USECCACHE=1 || exit 125

./usr/bin/julia --startup-file=no my_reproducer.jl

EXIT_CODE=$?
if [[ "${EXIT_CODE}" -eq 139 ]]; then
    # For git bisect we need to return an exit status less than 128, but a
    # program that segfaults exits with status 128+11=139, so we return 11
    # instead.  Don't change all other cases.
    exit 11
else
    exit "${EXIT_CODE}"
fi
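
As a rough illustration of the exit-status mapping in the script above: a POSIX shell reports a child killed by signal N as status 128 + N, so a segfault (SIGSEGV, signal 11 on Linux and macOS) shows up as 139, while git bisect run aborts on any status above 127. The sketch below simulates a crashing reproducer with a child shell that sends SIGSEGV to itself. The helper name to_bisect_status is hypothetical and not part of the original script.

```shell
# Hypothetical helper mirroring the script above: turn 139 into 11 and
# pass every other status through unchanged.
to_bisect_status() {
    if [ "$1" -eq 139 ]; then
        echo 11
    else
        echo "$1"
    fi
}

# Simulate a crashing reproducer: the child shell kills itself with SIGSEGV.
sh -c 'kill -s SEGV $$'
raw=$?

mapped=$(to_bisect_status "$raw")
echo "raw status: $raw, status handed to git bisect: $mapped"
```

With the real script saved under some name (say bisect.sh, a placeholder), the bisection would be driven by git bisect run ./bisect.sh after marking one good and one bad revision.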

@bvdmitri
Contributor Author

bvdmitri commented Nov 7, 2024

Well, I tried for quite some time to run git bisect (a couple of hours, given the compilation time), but it says either "Some good revs are not ancestors of the bad rev." or "Bisecting: a merge base must be tested". I tried bisecting from v1.11 to master. I think v1.11 and master have diverged? I'm not sure how I'm supposed to bisect it, so any help is appreciated here. How am I supposed to identify a linear commit history to just run git bisect run?

@giordano
Contributor

giordano commented Nov 7, 2024

Releases are cut from branches, not from master. Find the first commit in the release 1.11 branch since the branching out, the parent will be in master. Also, check whether you can reproduce the bug on 1.11 alpha 0, 1 or whatever it's called; that gives you an idea of which direction to look in.

@bvdmitri
Contributor Author

bvdmitri commented Nov 7, 2024

Find the first commit in the release 1.11 branch since the branching out, the parent will be in master.

That's what I'm struggling with, I'm not sure how to do it

@giordano
Contributor

giordano commented Nov 7, 2024

From the GitHub web interface: go to https://github.com/JuliaLang/julia, choose the release-1.11 branch to get to https://github.com/JuliaLang/julia/tree/release-1.11, then click on "448 commits ahead of" to get to the master...release-1.11 comparison. The top commit (7dad444) is the first one since branching out; its parent, aecd8fd, is on master.

From the command line, you can probably do something like git log master...release-1.11, or something like that (I can't check it on the phone). Edit: you can use git log --reverse --oneline master..origin/release-1.11 to see what's the first commit.
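
The branch-point lookup described above can be tried on a throwaway repository. The sketch below is only an illustration: the branch name release stands in for release-1.11, the commit subjects and the commit helper are invented for the demo, and git merge-base is a suggested cross-check that was not mentioned in the thread.

```shell
# Build a throwaway repository with a master branch and a release branch.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/master   # pin the branch name regardless of git defaults

# Hypothetical helper: create an empty commit with a fixed demo identity.
commit() {
    git -c user.name=demo -c user.email=demo@example.invalid \
        -c commit.gpgsign=false commit -q --allow-empty -m "$1"
}

commit "A"
commit "B (branch point)"
git branch release
commit "C (master only)"
git checkout -q release
commit "R1 (first release-only commit)"
commit "R2"

# Oldest commit reachable from release but not from master.
first=$(git log --reverse --format=%H master..release | head -n 1)
first_subject=$(git log -1 --format=%s "$first")

# Its parent lies on master and coincides with the merge base.
parent_subject=$(git log -1 --format=%s "${first}^")
base_subject=$(git log -1 --format=%s "$(git merge-base master release)")

echo "first release-only commit: $first_subject"
echo "its parent:                $parent_subject"
echo "merge base:                $base_subject"
```

In the Julia repository the same two commands would be run with master and origin/release-1.11; the parent of the first release-only commit is then a revision on master that git bisect can use as an endpoint.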

@giordano
Contributor

giordano commented Nov 8, 2024

Couple of comments:

  • I'm able to reproduce the segfault on v1.11.0-alpha1
  • I'm not able to reproduce the segfault in any version of julia on x86_64-linux-gnu. When I could reproduce it the other day on v1.11.1 (and then saw that it was fixed on ee09ae7) I was using aarch64-darwin, and I see from your versioninfo that you're on the same platform.

I'd say this is the range to look into for the fix: aecd8fd...ee09ae7 (first is bad, last is good). Edit: for the record, it reproduces also on a06a801 but not 4b27a16
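
Since the range above is being searched for a fix rather than a regression, the usual good/bad meaning is inverted. One way to handle that (an assumption, not necessarily what was done here) is git bisect's custom terms, --term-old and --term-new. The sketch below runs against a throwaway repository in which a file named status, invented for the demo, flips from broken to fixed at the fourth of six commits.

```shell
# Throwaway repository: commits 1-3 are "broken", commits 4-6 are "fixed".
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/master

for i in 1 2 3 4 5 6; do
    if [ "$i" -ge 4 ]; then echo fixed > status; else echo broken > status; fi
    echo "$i" > counter
    git add status counter
    git -c user.name=demo -c user.email=demo@example.invalid \
        -c commit.gpgsign=false commit -q -m "commit $i"
done

expected=$(git rev-parse master~2)   # commit 4, where the fix landed

# Old state = broken, new state = fixed, so bisect looks for the first fixed commit.
git bisect start --term-old=broken --term-new=fixed >/dev/null
git bisect fixed master >/dev/null
git bisect broken master~5 >/dev/null

# With custom terms, exit 0 means "old" (still broken) and 1 means "new" (fixed).
git bisect run sh -c 'grep -q broken status' >/dev/null

found=$(git rev-parse refs/bisect/fixed)
found_subject=$(git log -1 --format=%s "$found")
git bisect reset >/dev/null 2>&1

echo "first fixed commit: $found_subject"
```

For a real reproducer the test command would be inverted in the same way: exit 0 while the segfault still occurs and a status between 1 and 127 (other than 125) once it is gone.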

@giordano
Contributor

giordano commented Nov 9, 2024

Good news: the segfault disappeared on 25cbe00 (merge commit of #55767). Bad news: that looks like too large a commit to backport to v1.11. CC: @vtjnash, in case he has a clue how to solve this on v1.11.

@vtjnash
Member

vtjnash commented Nov 9, 2024

That does at least give us a pretty good idea of what kind of issue it is likely to be. It is somewhat hard to be sure whether it is better to just backport that (lots of lines, but a very low-risk, internal-only change, which would only make it easier for Enzyme to support this version, even though it also breaks Enzyme) or to investigate whether a more specific fix is possible.

@giordano
Contributor

giordano commented Nov 9, 2024

Segfault first appeared in #52405 (corresponding change in our fork of llvm: JuliaLang/llvm-project#23)

e5046b4579cf571931714abbe14a3a049ca6383b is the first bad commit
commit e5046b4579cf571931714abbe14a3a049ca6383b
Author: Gabriel Baraldi <[email protected]>
Date:   Thu Dec 7 11:21:38 2023 -0500

    Bump LLVM to 15.0.7+10 to fix GC issue (#52405)

 deps/checksums/clang            | 216 ++++++++++----------
 deps/checksums/lld              | 216 ++++++++++----------
 deps/checksums/llvm             | 436 ++++++++++++++++++++--------------------
 deps/clang.version              |   2 +-
 deps/lld.version                |   2 +-
 deps/llvm-tools.version         |   4 +-
 deps/llvm.version               |   6 +-
 stdlib/LLD_jll/Project.toml     |   2 +-
 stdlib/libLLVM_jll/Project.toml |   2 +-
 9 files changed, 443 insertions(+), 443 deletions(-)

but this looks unhelpful, since it was backported to julia v1.10 (1e66ce2) and llvm 15 isn't in julia v1.11
