Overrun stack and heap OTP-26.0 #7292

Closed
vshev4enko opened this issue May 24, 2023 · 42 comments · Fixed by #7581
Labels: bug (Issue is reported as a bug), team:VM (Assigned to OTP team VM)

@vshev4enko

Describe the bug
The application goes down a few seconds after the release starts running in AWS EKS.

To Reproduce
Unfortunately I have no idea how to reproduce it.

Affected versions
26.0

Additional context
AWS EKS
erlang 26.0
alpine 3.18.0

Application logs

hend=0x00007f4f6f34c7f0
stop=0x00007f4f6f34c670
htop=0x00007f4f6f34c678
heap=0x00007f4f6f349600
beam/erl_gc.c, line 735: <0.3141.0>: Overrun stack and heap
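
For readers trying to interpret the report, the four addresses can be compared directly. The following is a minimal C++ sketch that parses the values from the log above and measures how far the heap top has moved past the stack top. The struct and function names are hypothetical, and reading `htop > stop` as the overrun condition is an assumption based on the field names, not something stated in this thread.

```cpp
#include <cstdint>
#include <string>

// Hypothetical container for the four addresses printed in the crash report.
struct HeapReport {
    std::uint64_t hend;  // end of the heap block
    std::uint64_t stop;  // stack top (assumed: the stack grows down from hend)
    std::uint64_t htop;  // heap top (assumed: the heap grows up from heap)
    std::uint64_t heap;  // start of the heap block
};

// Parse a "name=0x..." line from the report and return the address.
std::uint64_t parse_address(const std::string& line) {
    return std::stoull(line.substr(line.find('=') + 1), nullptr, 16);
}

// Bytes by which the heap top has passed the stack top, or 0 if they do not overlap.
std::uint64_t overrun_bytes(const HeapReport& r) {
    return r.htop > r.stop ? r.htop - r.stop : 0;
}

// Values copied from the application log above.
HeapReport first_report() {
    return HeapReport{
        parse_address("hend=0x00007f4f6f34c7f0"),
        parse_address("stop=0x00007f4f6f34c670"),
        parse_address("htop=0x00007f4f6f34c678"),
        parse_address("heap=0x00007f4f6f349600"),
    };
}
```

For this report `overrun_bytes(first_report())` is 8, i.e. `htop` sits one 64-bit word past `stop`.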
@vshev4enko added the bug label May 24, 2023
@vshev4enko changed the title from "Overrun stack and heap" to "Overrun stack and heap OTP-26.0" May 24, 2023
@IngelaAndin added the team:VM label May 26, 2023
@rickard-green self-assigned this May 29, 2023
@benoawfu

Same thing here, alpine 3.18.0, OTP 26.0 but on AWS ECS Fargate.

hend=0x00007f5329e030c8
stop=0x00007f5329e02db8
htop=0x00007f5329e02dc0
heap=0x00007f5329df5d28
beam/erl_gc.c, line 735: <0.190015.0>: Overrun stack and heap
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
[os_mon] memory supervisor port (memsup): Erlang has closed

@jhogberg
Contributor

Thanks for your report, could you provide a core dump together with the beam.smp binary?

@benoawfu

Sorry I can't retrieve that off ECS.

@jhogberg
Contributor

That's unfortunate :-/

Can you reproduce this with an image that isn't based on alpine (or more importantly, doesn't use musl)?

@benoawfu

It's production stuff, sorry I can't do experiments on this. I haven't had any issue on the pre-production/test servers.

@jhogberg
Contributor

jhogberg commented May 31, 2023

musl makes the JIT run in a weird configuration that isn't as well-tested as the one under GNU libc, so it would be nice if you could try it out some time. If it only happens under musl it'll be much easier for me to narrow things down.

As an added bonus you'll get 10-15% better performance on linear code. :-)

@vshev4enko
Author

vshev4enko commented Jun 1, 2023

You mean run the same application under ubuntu instead of alpine in the environment where the latter crashed?

@jhogberg
Contributor

jhogberg commented Jun 1, 2023

Yeah, that'd work. As long as the environment uses GNU libc instead of musl.

@vshev4enko
Author

vshev4enko commented Jun 1, 2023

Replaced base image with debian-bullseye-20230227-slim and it works. I can't tell about performance but no crashes.

@jhogberg
Contributor

jhogberg commented Jun 1, 2023

Thank you, then the crash may be related to that configuration. If you can provide a core dump (+beam.smp) the next time this crashes we'd be very thankful.

@wkirschbaum

@jhogberg would you recommend as a general rule that we avoid musl for running erlang in production?

I am getting the same issue on alpine:

hend=0x00007f0a7e8fd830
stop=0x00007f0a7e8fd630
htop=0x00007f0a7e8fd638
heap=0x00007f0a7e8f0490
beam/erl_gc.c, line 735: <0.2688.0>: Overrun stack and heap

And can reproduce it on a build server. I don't see an erl_crash.dump file, so if someone can maybe help me get the dump and beam.smp files I can post it.

@jhogberg
Contributor

@jhogberg would you recommend as a general rule that we avoid musl for running erlang in production?

I'd avoid it for now, this only seems to happen with this configuration.

And can reproduce it on a build server. I don't see an erl_crash.dump file, so if someone can maybe help me get the dump and beam.smp files I can post it.

That's great! It ought to have dumped core and not generated a crash dump, so that's why you're not seeing one. Please send us the core dump together with beam.smp if you can. :-)

@ziopio
Contributor

ziopio commented Jun 21, 2023

Fly.io machine
erlang 26.0
alpine 3.18.0

I want to report that this also appeared under the same conditions when I was working on a Fly.io machine in the cloud.

I was spawning 1000 docker containers all under a single supervisor, so there was an erlang process for each one of them.
Then the stdout of all containers was stored in a single erlang process. At the same time, the same process was also indirectly doing work for all containers. So that one was the trigger, I think.
It did not happen with 100 containers, for example.

@wkirschbaum

wkirschbaum commented Jun 21, 2023

@jhogberg beam.smp and core dump: https://git.sr.ht/~whk/erlang-26-crash-reports/tree

Please let me know if there is anything which will help.

Docker image: hexpm/elixir:1.15.0-erlang-26.0.1-alpine-3.18.2

@jhogberg
Contributor

That beam.smp is stripped, do you happen to have the symbols saved somewhere?

@wkirschbaum

@jhogberg I can't even pretend to understand what you mean, since I have very little erlang or gdb experience. If it's somewhere on the instance I can upload it, but I'm not sure where to look.

@wkirschbaum

I just copied what was in this container, but have a feeling it's not there: hexpm/elixir:1.15.0-erlang-26.0.1-alpine-3.18.2

cp /usr/local/lib/erlang/erts-14.0.1/bin/beam.smp build-archive

@jhogberg
Contributor

That's really annoying, digging into the images they seem based on https://hub.docker.com/_/erlang which forcibly strips all symbols without having the common decency to save them anywhere. It would've been nice if they didn't strip them in that layer but left it to the final one.

We're basically stuck until we have the symbols, I'll try to get in touch with the folks maintaining those images but I'm not sure how long that will take. In the meantime I suppose we can make our own nearly-identical images by removing the lines that strip symbols (the ones that look like this), but if you want to wait until the upstream images contain symbols I'm fine with that.

@wkirschbaum

@jhogberg thanks for having a look. I can try the suggested change on our build server sometime this week and will post the results. There is no urgency for me, but I'm sure more people will run into this soon.

@princemaple

Probably worthless information but here it is anyway: it seems to only happen on apps that use large chunks of memory. Out of 3 apps that I'm running with OTP 26 on alpine, only the one that uses large chunks of memory had issues and crashed multiple times a day. The other 2 have been running fairly smoothly.

@mudssrali

Probably worthless information but here it is anyway: it seems to only happen on apps that use large chunks of memory. Out of 3 apps that I'm running with OTP 26 on alpine, only the one that uses large chunks of memory had issues and crashed multiple times a day. The other 2 have been running fairly smoothly.

I don't think it's a memory issue -- I have a 16GB machine both in the cloud and locally -- it works fine locally, but the deployment in the cloud is crashing -- it's a fairly simple Elixir application that runs a bunch of GenServers -- with a 1MB payload. I'm using alpine -- with OTP-26 by default, I'm guessing.

FROM elixir:1.15.1-alpine

Error

hend=0x00007fdd10cc0e18
stop=0x00007fdd10cc0db8
htop=0x00007fdd10cc0dc0
heap=0x00007fdd10cc0258
beam/erl_gc.c, line 735: <0.1616.0>: Overrun stack and heap

Previously the same application was working fine on elixir:1-8-alpine

@Harrisonl

Don't have much to add to this thread, other than that it was working fine on OTP-25 and Elixir 1.14 alpine linux (ECS-Fargate) for us, and we've only seen it today for the first time on OTP-26 and Elixir 1.15. Will try the debian image tomorrow and see how that gets on.

@davidye

davidye commented Aug 2, 2023

Also started happening to me when I updated from elixir:1.14.0-alpine to elixir:1.15.4-alpine.

@Lankester

Lankester commented Aug 2, 2023

I have just got this error in a Debian docker container running off of the elixir:1.15-slim Dockerfile.

user@host:/app# cat /etc/issue
Debian GNU/Linux 11 \n \l

Error:

hend=0x00007fb33de2c288
stop=0x00007fb33de2c080
htop=0x00007fb33de2c088
heap=0x00007fb33de2b6c8
beam/erl_gc.c, line 735: <0.253.0>: Overrun stack and heap
Aborted (core dumped)

It's a relatively simple application consisting of 3 GenServers.

I have a core dump but can't share due to sensitive information within and I have been unable to replicate the issue in a test app that omits the sensitive info.

I've attached a gdb backtrace in case that's useful though and I am open to suggestions and happy to help further.

backtrace.txt

@vshev4enko
Author

vshev4enko commented Aug 3, 2023

I am not running Mint.Websocket.
Also, I stopped getting crashes after switching the image from alpine to debian.

@Lankester

It seems to crash almost every time in the same place: receiving and decoding a websocket message. Also, possibly worth pointing out that the hardware is an Intel Mac.

Mint is version 1.5.1.

Another gdb log attached from another crash.

gdb_crash_2.txt

@Lankester

Moving to OTP 25 via elixir:1.15-otp-25-slim seems to have resolved the issue for me, no further crashes so far.

@nathany-copia

nathany-copia commented Aug 3, 2023

Seeing this on OTP 26.0.2 and Alpine 3.17.4

We're using this Docker image -- hexpm/elixir:1.15.4-erlang-26.0.2-alpine-3.17.4 for the build and running under alpine:3.17.4 with Erlang bundled via mix releases.

Not sure if the core dump will have symbols for this Docker image, or where to find the core dump, but I'll follow up if we find it.

htop=0x00007fe486d4ad48
stop=0x00007fe486d4ad40
heap=0x00007fe486d47ba8
beam/erl_gc.c, line 735: <0.8138.0>: Overrun stack and heap
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
[os_mon] memory supervisor port (memsup): Erlang has closed
Aborted (core dumped)

UPDATE: I poked around but didn't find a core dump -- I don't know the PID it was running under, and I think Kubernetes restarted the containers, so the evidence is gone. For now I'm downgrading to hexpm/elixir:1.15.4-erlang-25.3.2.5-alpine-3.17.4, which is a little easier than switching to Debian or another base image.

https://hub.docker.com/r/hexpm/elixir

oestrich added a commit to nerves-hub/nerves_hub_web that referenced this issue Aug 8, 2023
Alpine uses musl which can cause an issue to overrun the stack and heap
in OTP 26 (erlang/otp#7292).
@ramyma

ramyma commented Aug 9, 2023

I'm getting a similar crash as well with erlang 26.0.2 and elixir 1.15.4-otp-26 with mint_web_socket following their genserver example (https://github.com/elixir-mint/mint_web_socket/blob/main/examples/genserver.exs):

hend=0x00007f79dba1b8c0
stop=0x00007f79dba1b7a8
htop=0x00007f79dba1b7b0
heap=0x00007f79dba186d0
beam/erl_gc.c, line 735: <0.777.0>: Overrun stack and heap
[1]    48048 IOT instruction (core dumped)  iex -S mix phx.server

The crash happens after receiving some messages (and while receiving more messages) over the websocket.

I'm using Ubuntu 22.04.3

@akoutmos

I am seeing this issue too. I am also running the hexpm/elixir:1.15.4-erlang-26.0.2-alpine-3.17.4 image and have a mix release Phoenix app running in the container. The only thing that I can correlate this to without much of a deep dive is that it happened right after our app was DoS attacked.

I will try and get more information around this, but for now will revert to hexpm/elixir:1.15.4-erlang-25.3.2.5-alpine-3.17.4.

@bernardo-martinez

I am seeing this issue too:

hend=0x00007ff8c845fb90
stop=0x00007ff8c845fa60
htop=0x00007ff8c845fa68
heap=0x00007ff8c845c9a0
beam/erl_gc.c, line 735: <0.6752.0>: Overrun stack and heap

building image myself with: elixir 1.15.3 and erlang 26.0.2
running on a gitlab alpine image

@sverker
Contributor

sverker commented Aug 17, 2023

It seems a lot of people are able to reproduce this.
It would be really helpful if someone could give detailed instructions of how to reproduce the crash. Preferably a "minimal" example that is easy and quick to run.

@ggalan87

I reached this issue after some rabbitmq server crashes.

Setup: plain LXC container, alpine 3.18 image, erlang-26.0.2, elixir-1.15.4

Log:

  Starting broker... completed with 5 plugins.

hend=0x00007fe234eb34b0
stop=0x00007fe234eb3458
htop=0x00007fe234eb3460
heap=0x00007fe234eb21a0
beam/erl_gc.c, line 735: <0.15407.0>: Overrun stack and heap
Aborted (core dumped)

@JamesLavin

JamesLavin commented Aug 17, 2023

In case my context might help anyone debug this, I hit this bug hard yesterday after building (on Apple Silicon) Ubuntu images (plural because I used a variety of base images with various versions of Elixir & Erlang, hoping that one of them might work) to run an Elixir script. Was able to build the images fine, but when I booted the container and ran my .exs file, Mix.install never managed to compile the Jason library (https://github.com/michalmuskala/jason). It always blew up with this error message during Jason library compilation.

(Compiling on Apple Silicon adds a layer of complexity, see: https://pythonspeed.com/articles/docker-build-problems-mac/, but this may be a red herring, given that others seem to have hit this issue elsewhere.)

FWIW, I wound up spinning up an EC2 instance and getting the same .exs file to run just fine there.

@ramyma

ramyma commented Aug 18, 2023

It seems a lot of people are able to reproduce this. It would be really helpful if someone could give detailed instructions of how to reproduce the crash. Preferably a "minimal" example that is easy and quick to run.

I created a simple repo where I'm able to reproduce the issue consistently: https://github.com/ramyma/otp_crash

You'll just have to run iex -S mix phx.server, and within a couple of minutes it will crash with:

iex(1)> hend=0x00007f4f237fbdb0
stop=0x00007f4f237fbc98
htop=0x00007f4f237fbca0
heap=0x00007f4f237f6ce0
beam/erl_gc.c, line 735: <0.561.0>: Overrun stack and heap
[1]    97851 IOT instruction (core dumped)  iex -S mix phx.server

Note: I'm using asdf, you'll find a .tool-versions, so you can run asdf install to get the required Erlang and Elixir versions.
Tested on Ubuntu 22.04.3

@Lankester

I've put together a minimal repo based on the elixir 1.15-slim docker image. https://github.com/Lankester/Otp26Crash

Run docker compose up and it should crash within 10 to 20 minutes.

otp-26-crash-7292  | 00:38:00.255 [info] Broadcast 73898124 bytes on WebSocket.
otp-26-crash-7292  | 00:38:01.108 [info] Received message. 73898124 bytes.
otp-26-crash-7292  | 00:38:02.749 [info] Broadcast 77805516 bytes on WebSocket.
otp-26-crash-7292  | 00:38:03.670 [info] Received message. 77805516 bytes.
otp-26-crash-7292  | 00:38:05.133 [info] Broadcast 60002824 bytes on WebSocket.
otp-26-crash-7292  | 00:38:05.703 [info] Received message. 60002824 bytes.
otp-26-crash-7292  | 00:38:07.487 [info] Broadcast 54965496 bytes on WebSocket.
otp-26-crash-7292  | 00:38:08.008 [info] Received message. 54965496 bytes.
otp-26-crash-7292  | hend=0x00007f52f487d520
otp-26-crash-7292  | stop=0x00007f52f487d318
otp-26-crash-7292  | htop=0x00007f52f487d320
otp-26-crash-7292  | heap=0x00007f52f487a330
otp-26-crash-7292  | beam/erl_gc.c, line 735: <0
otp-26-crash-7292  | .2402.0>: Overrun stack and heap

@sverker
Contributor

sverker commented Aug 21, 2023

Thanks, @ramyma and @Lankester for the "crash repos".

@ramyma Just to be clear; you ran it directly on Ubuntu without any docker image?

@ramyma

ramyma commented Aug 21, 2023

Thanks, @ramyma and @Lankester for the "crash repos".

@ramyma Just to be clear; you ran it directly on Ubuntu without any docker image?

@sverker yes, directly on Ubuntu.

@sverker
Contributor

sverker commented Aug 22, 2023

We think we found the bug. Here is a quick fix if anyone wants to try it out:

diff --git a/erts/emulator/beam/jit/x86/instr_bs.cpp b/erts/emulator/beam/jit/x86/instr_bs.cpp
index 39dfb64f8f..f8d8cebbdf 100644
--- a/erts/emulator/beam/jit/x86/instr_bs.cpp
+++ b/erts/emulator/beam/jit/x86/instr_bs.cpp
@@ -3829,7 +3829,7 @@ static std::vector<BsmSegment> opt_bsm_segments(
             }
             break;
         case BsmSegment::action::GET_BINARY:
-            heap_need += heap_bin_size((seg.size + 7) / 8);
+            heap_need += std::max(ERL_SUB_BIN_SIZE, heap_bin_size((seg.size + 7) / 8));
             break;
         case BsmSegment::action::GET_TAIL:
             heap_need += EXTRACT_SUB_BIN_HEAP_NEED;

The same fix can be done for ARM in erts/emulator/beam/jit/arm/instr_bs.cpp.
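
To make the effect of the one-line change concrete, here is a minimal standalone sketch of the arithmetic. The word counts are assumptions for a 64-bit build (a heap binary taking 2 header words plus its data rounded up to whole words, and `ERL_SUB_BIN_SIZE` being 5 words). They are not quoted from the ERTS sources, and the helper names are hypothetical stand-ins for the functions in the diff.

```cpp
#include <algorithm>
#include <cstddef>

// All sizes are in machine words. The constants below are assumptions for a
// 64-bit build and only stand in for the real ERTS definitions.
constexpr std::size_t kWordBytes = 8;
constexpr std::size_t kHeapBinHeaderWords = 2;  // assumed heap binary header
constexpr std::size_t kSubBinWords = 5;         // assumed ERL_SUB_BIN_SIZE

// Stand-in for heap_bin_size(): header plus data rounded up to whole words.
constexpr std::size_t heap_bin_words(std::size_t bytes) {
    return kHeapBinHeaderWords + (bytes + kWordBytes - 1) / kWordBytes;
}

// Reservation for a GET_BINARY segment of `bits` bits before the fix.
constexpr std::size_t need_before_fix(std::size_t bits) {
    return heap_bin_words((bits + 7) / 8);
}

// Reservation after the fix: never less than the size of a sub-binary.
constexpr std::size_t need_after_fix(std::size_t bits) {
    return std::max(kSubBinWords, heap_bin_words((bits + 7) / 8));
}
```

Under these assumptions, a 3-bit segment reserves 3 words before the change and 5 after it, while a 256-bit segment reserves 6 words either way. Only short segments are affected.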

The bug has existed since OTP 26.0. The root cause has nothing to do with elixir, alpine or musl libc.
To trigger it, you need to match small byte-unaligned bitstrings out of a larger binary, as done by Elixir.Mint.WebSocket.Frame:decode_raw/3, plus some bad luck with being almost out of process heap space.
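
As an illustration of what a byte-unaligned match looks like, the sketch below reads bit fields that do not start or end on byte boundaries. The helper name is hypothetical. The example layout (1-bit FIN, 3 reserved bits, 4-bit opcode, 1-bit MASK, 7-bit payload length) follows the WebSocket frame header from RFC 6455, and it is an assumption that these are the fields involved in decode_raw/3.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Read `count` bits (1..32) starting `bit_offset` bits into `data`, most
// significant bit first. Hypothetical helper: the caller must ensure that the
// requested range lies inside `data`.
std::uint32_t extract_bits(const std::vector<std::uint8_t>& data,
                           std::size_t bit_offset, std::size_t count) {
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t bit = bit_offset + i;
        const std::uint8_t byte = data[bit / 8];
        // Pick out one bit, counting from the most significant bit of the byte.
        const std::uint32_t b = (byte >> (7 - bit % 8)) & 1u;
        value = (value << 1) | b;
    }
    return value;
}
```

For a frame starting with the bytes `0x81 0x05`, `extract_bits(frame, 1, 3)` reads the 3 reserved bits and `extract_bits(frame, 4, 4)` reads the 4-bit opcode. The C++ version returns plain integers. The case described above is different in that the matched segment is itself a bitstring, and the diff above reserves sub-binary space for it.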

We will probably soon release a fix in OTP 26.0.3.

bjorng added a commit to bjorng/otp that referenced this issue Aug 23, 2023
The runtime system could underestimate the amount of heap space needed
for matching out short bitstrings with a size not divisible by 8. That
could lead to the runtime system terminating with an "Overrun heap and
stack" error.

Fixes erlang#7292
@bjorng linked a pull request Aug 23, 2023 that will close this issue
bjorng added a commit that referenced this issue Aug 25, 2023
…/OTP-18733

Fix heap allocation for matching out short bitstrings
@bjorng closed this as completed in a9ed6ec Aug 25, 2023
bjorng added a commit to bjorng/otp that referenced this issue Aug 30, 2023
Normally, all BEAM files created by OTP 25 and later have a "Type"
chunk that contains type information. `beam_lib:strip/1` will not
discard the "Type" chunk, but build/release scripts that do their own
custom stripping could accidentally delete the chunk.

Make sure to test that loading and executing BEAM files without type
information works. Since there was an overrun-heap-and-stack
bug (reported in erlang#7292, fixed in erlang#7581) when using the bit syntax, the
bit syntax test suites seem appropriate to clone to new BEAM files
without types.