Overrun stack and heap OTP-26.0 #7292

vshev4enko · 2023-05-24T14:06:44Z

Describe the bug
The application goes down a few seconds after the release is started in AWS EKS.

To Reproduce
Unfortunately I have no idea how to reproduce it.

Affected versions
26.0

Additional context
AWS EKS
erlang 26.0
alpine 3.18.0

Application logs

hend=0x00007f4f6f34c7f0
stop=0x00007f4f6f34c670
htop=0x00007f4f6f34c678
heap=0x00007f4f6f349600
beam/erl_gc.c, line 735: <0.3141.0>: Overrun stack and heap
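
For readers unfamiliar with these fields: an Erlang process keeps its heap and stack in one memory block. The heap grows upward from `heap` (its current top is `htop`) and the stack grows downward from `hend` (its current top is `stop`), so the two have collided once `htop` is above `stop`. Below is a minimal C++ sketch of that reading of the dump. It is not the actual check in erl_gc.c, and the names `parse_addr`, `overrun_bytes`, and `check_first_report` are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: parse an address as printed in the dump,
// e.g. "0x00007f4f6f34c670". Base 16 accepts the leading "0x".
std::uint64_t parse_addr(const std::string& text) {
    return std::stoull(text, nullptr, 16);
}

// Hypothetical helper: number of bytes by which the heap top has run
// past the stack top. Returns 0 when htop is not above stop.
std::uint64_t overrun_bytes(std::uint64_t stop, std::uint64_t htop) {
    return htop > stop ? htop - stop : 0;
}

// Values copied from the log above.
void check_first_report() {
    const std::uint64_t stop = parse_addr("0x00007f4f6f34c670");
    const std::uint64_t htop = parse_addr("0x00007f4f6f34c678");
    assert(overrun_bytes(stop, htop) == 8);  // one 64-bit word
}
```

With the values from this report, `htop` sits 8 bytes (one 64-bit word) above `stop`.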

benoawfu · 2023-05-31T08:00:55Z

Same thing here, alpine 3.18.0, OTP 26.0 but on AWS ECS Fargate.

hend=0x00007f5329e030c8
stop=0x00007f5329e02db8
htop=0x00007f5329e02dc0
heap=0x00007f5329df5d28
beam/erl_gc.c, line 735: <0.190015.0>: Overrun stack and heap
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
[os_mon] memory supervisor port (memsup): Erlang has closed

jhogberg · 2023-05-31T08:19:18Z

Thanks for your report, could you provide a core dump together with the beam.smp binary?

benoawfu · 2023-05-31T08:29:19Z

Sorry I can't retrieve that off ECS.

jhogberg · 2023-05-31T08:36:17Z

That's unfortunate :-/

Can you reproduce this with an image that isn't based on alpine (or more importantly, doesn't use musl)?

benoawfu · 2023-05-31T08:47:50Z

It's production stuff, sorry I can't do experiments on this. I haven't had any issue on the pre-production/test servers.

jhogberg · 2023-05-31T09:25:01Z

musl makes the JIT run in a weird configuration that isn't as well-tested as the one under GNU libc, so it would be nice if you could try it out some time. If it only happens under musl it'll be much easier for me to narrow things down.

As an added bonus you'll get 10-15% better performance on linear code. :-)

vshev4enko · 2023-06-01T08:49:03Z

You mean run the same application under ubuntu instead of alpine on the environment where the latter crashed?

jhogberg · 2023-06-01T11:28:14Z

Yeah, that'd work. As long as the environment uses GNU libc instead of musl.

vshev4enko · 2023-06-01T13:38:15Z

Replaced base image with debian-bullseye-20230227-slim and it works. I can't tell about performance but no crashes.

jhogberg · 2023-06-01T13:53:47Z

Thank you, then the crash may be related to that configuration. If you can provide a core dump (+beam.smp) the next time this crashes we'd be very thankful.

wkirschbaum · 2023-06-20T22:39:03Z

@jhogberg would you recommend as a general rule that we avoid musl for running erlang in production?

I am getting the same issue on alpine:

hend=0x00007f0a7e8fd830
stop=0x00007f0a7e8fd630
htop=0x00007f0a7e8fd638
heap=0x00007f0a7e8f0490
beam/erl_gc.c, line 735: <0.2688.0>: Overrun stack and heap

And I can reproduce it on a build server. I don't see an erl_crash.dump file, so if someone can help me get the dump and beam.smp files I can post them.

jhogberg · 2023-06-21T07:49:13Z

> @jhogberg would you recommend as a general rule that we avoid musl for running erlang in production?

I'd avoid it for now, this only seems to happen with this configuration.

> And I can reproduce it on a build server. I don't see an erl_crash.dump file, so if someone can help me get the dump and beam.smp files I can post them.

That's great! It ought to have dumped core and not generated a crash dump, so that's why you're not seeing one. Please send us the core dump together with beam.smp if you can. :-)

ziopio · 2023-06-21T09:56:05Z

Fly.io machine
erlang 26.0
alpine 3.18.0

I want to report that this also appeared under the same conditions when I was working on a Fly.io machine in the cloud.

I was spawning 1000 docker containers all under a single supervisor, so there was an erlang process for each one of them.
The stdout of all containers was then stored in a single Erlang process, which was also indirectly doing work for all of the containers, so I think that process was the trigger.
It did not happen with 100 containers, for example.

wkirschbaum · 2023-06-21T11:50:45Z

@jhogberg beam.smp and core dump: https://git.sr.ht/~whk/erlang-26-crash-reports/tree

Please let me know if there is anything which will help.

Docker image: hexpm/elixir:1.15.0-erlang-26.0.1-alpine-3.18.2

jhogberg · 2023-06-21T16:32:19Z

That beam.smp is stripped, do you happen to have the symbols saved somewhere?

wkirschbaum · 2023-06-21T16:35:55Z

@jhogberg I can't even pretend to understand what you mean, since I have very little erlang or gdb experience. If it's somewhere on the instance I can upload it, but I'm not sure where to look.

wkirschbaum · 2023-06-21T16:47:14Z

I just copied what was in this container, but I have a feeling it's not there: hexpm/elixir:1.15.0-erlang-26.0.1-alpine-3.18.2

cp /usr/local/lib/erlang/erts-14.0.1/bin/beam.smp build-archive

jhogberg · 2023-06-26T07:40:58Z

That's really annoying, digging into the images they seem based on https://hub.docker.com/_/erlang which forcibly strips all symbols without having the common decency to save them anywhere. It would've been nice if they didn't strip them in that layer but left it to the final one.

We're basically stuck until we have the symbols, I'll try to get in touch with the folks maintaining those images but I'm not sure how long that will take. In the meantime I suppose we can make our own nearly-identical images by removing the lines that strip symbols (the ones that look like this), but if you want to wait until the upstream images contain symbols I'm fine with that.

wkirschbaum · 2023-06-26T07:59:46Z

@jhogberg thanks for having a look. I can try the suggested change on our build server sometime this week and will post the result. There is no urgency for me, but I'm sure more people will run into this soon.

princemaple · 2023-07-06T11:18:46Z

Probably worthless information but here it is anyway: it seems to only happen on apps that use large chunks of memory. Out of 3 apps that I'm running with OTP 26 on alpine, only the one that uses large chunks of memory had issues and crashed multiple times a day. The other 2 have been running fairly smoothly.

mudssrali · 2023-07-20T12:29:43Z

> Probably worthless information but here it is anyway: it seems to only happen on apps that use large chunks of memory. Out of 3 apps that I'm running with OTP 26 on alpine, only the one that uses large chunks of memory had issues and crashed multiple times a day. The other 2 have been running fairly smoothly.

I don't think it's a memory issue. I have a 16 GB machine both in the cloud and locally: it works fine locally, but the deployment in the cloud is crashing. It's a fairly simple Elixir application that runs a bunch of GenServers with a 1 MB payload. I'm using alpine, which I'm guessing comes with OTP-26 by default.

FROM elixir:1.15.1-alpine

Error

hend=0x00007fdd10cc0e18
stop=0x00007fdd10cc0db8
htop=0x00007fdd10cc0dc0
heap=0x00007fdd10cc0258
beam/erl_gc.c, line 735: <0.1616.0>: Overrun stack and heap

Previously the same application was working fine on elixir:1-8-alpine

Harrisonl · 2023-07-31T16:24:29Z

I don't have much to add to this thread, other than that it was working fine on OTP-25 and Elixir 1.14 on Alpine Linux (ECS Fargate) for us, and we only saw it today for the first time on OTP-26 and Elixir 1.15. We will try the debian image tomorrow and see how that gets on.

davidye · 2023-08-02T18:50:22Z

This also started happening to me when I updated elixir:1.15.4-alpine to elixir:1.14.0-alpine.

Lankester · 2023-08-02T23:34:50Z

I have just got this error in a Debian docker container running off of the elixir:1.15-slim Dockerfile.

user@host:/app# cat /etc/issue
Debian GNU/Linux 11 \n \l

Error:

hend=0x00007fb33de2c288
stop=0x00007fb33de2c080
htop=0x00007fb33de2c088
heap=0x00007fb33de2b6c8
beam/erl_gc.c, line 735: <0.253.0>: Overrun stack and heap
Aborted (core dumped)

It's a relatively simple application consisting of 3 GenServers.

I have a core dump but can't share due to sensitive information within and I have been unable to replicate the issue in a test app that omits the sensitive info.

I've attached a gdb backtrace in case that's useful though and I am open to suggestions and happy to help further.

backtrace.txt

vshev4enko · 2023-08-03T11:40:57Z

I am not running Mint.Websocket.
Also, I stopped getting crashes after switching the image from alpine to debian.

Lankester · 2023-08-03T11:51:15Z

It seems to crash almost every time in the same place — receiving and decoding a websocket message. Also, possibly worth pointing out hardware is Intel Mac.

Mint is version 1.5.1.

Another gdb log attached from another crash.

gdb_crash_2.txt

Lankester · 2023-08-03T15:34:43Z

Moving to OTP 25 via elixir:1.15-otp-25-slim seems to have resolved the issue for me, no further crashes so far.

nathany-copia · 2023-08-03T21:15:23Z

Seeing this on OTP 26.0.2 and Alpine 3.17.4

We're using this Docker image -- hexpm/elixir:1.15.4-erlang-26.0.2-alpine-3.17.4 for the build and running under alpine:3.17.4 with Erlang bundled via mix releases.

Not sure if the core dump will have symbols for this Docker image, or where to find the core dump, but I'll follow up if we find it.

htop=0x00007fe486d4ad48
stop=0x00007fe486d4ad40
heap=0x00007fe486d47ba8
beam/erl_gc.c, line 735: <0.8138.0>: Overrun stack and heap
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
[os_mon] memory supervisor port (memsup): Erlang has closed
Aborted (core dumped)

UPDATE: I poked around but didn't find a core dump -- I don't know the PID it was running under, and I think Kubernetes restarted containers so the evidence is gone. For now I'm downgrading to hexpm/elixir:1.15.4-erlang-25.3.2.5-alpine-3.17.4, which is a little easier than switching to Debian or another base image.

https://hub.docker.com/r/hexpm/elixir

ramyma · 2023-08-09T15:05:43Z

I'm getting a similar crash as well with erlang 26.0.2 and elixir 1.15.4-otp-26 with mint_web_socket following their genserver example (https://github.com/elixir-mint/mint_web_socket/blob/main/examples/genserver.exs):

hend=0x00007f79dba1b8c0
stop=0x00007f79dba1b7a8
htop=0x00007f79dba1b7b0
heap=0x00007f79dba186d0
beam/erl_gc.c, line 735: <0.777.0>: Overrun stack and heap
[1]    48048 IOT instruction (core dumped)  iex -S mix phx.server

The crash happens after receiving some messages (and while receiving more messages) over the websocket.

I'm using Ubuntu 22.04.3

akoutmos · 2023-08-11T03:36:19Z

I am seeing this issue too. I am also running the hexpm/elixir:1.15.4-erlang-26.0.2-alpine-3.17.4 image and have a mix release Phoenix app running in the container. The only thing that I can correlate this to, without much of a deep dive, is that it happened right after our app was DoS attacked.

I will try and get more information around this, but for now will revert to hexpm/elixir:1.15.4-erlang-25.3.2.5-alpine-3.17.4.

bernardo-martinez · 2023-08-11T10:58:33Z

I am seeing this issue too:

hend=0x00007ff8c845fb90
stop=0x00007ff8c845fa60
htop=0x00007ff8c845fa68
heap=0x00007ff8c845c9a0
beam/erl_gc.c, line 735: <0.6752.0>: Overrun stack and heap

building image myself with: elixir 1.15.3 and erlang 26.0.2
running on a gitlab alpine image

sverker · 2023-08-17T09:23:33Z

It seems a lot of people are able to reproduce this.
It would be really helpful if someone could give detailed instructions of how to reproduce the crash. Preferably a "minimal" example that is easy and quick to run.

ggalan87 · 2023-08-17T10:28:39Z

I reached this issue after some rabbitmq server crashes.

Setup: plain LXC container, alpine 1.18 image, erlang-26.0.2, elixir-1.15.4

Log:

  Starting broker... completed with 5 plugins.

hend=0x00007fe234eb34b0
stop=0x00007fe234eb3458
htop=0x00007fe234eb3460
heap=0x00007fe234eb21a0
beam/erl_gc.c, line 735: <0.15407.0>: Overrun stack and heap
Aborted (core dumped)

JamesLavin · 2023-08-17T14:38:08Z

In case my context might help anyone debug this, I hit this bug hard yesterday after building (on Apple Silicon) Ubuntu images (plural because I used a variety of base images with various versions of Elixir & Erlang, hoping that one of them might work) to run an Elixir script. Was able to build the images fine, but when I booted the container and ran my .exs file, Mix.install never managed to compile the Jason library (https://github.com/michalmuskala/jason). It always blew up with this error message during Jason library compilation.

(Compiling on Apple Silicon adds a layer of complexity, see: https://pythonspeed.com/articles/docker-build-problems-mac/, but this may be a red herring, given that others seem to have hit this issue elsewhere.)

FWIW, I wound up spinning up an EC2 instance and getting the same .exs file to run just fine there.

ramyma · 2023-08-18T17:21:25Z

> It seems a lot of people are able to reproduce this. It would be really helpful if someone could give detailed instructions of how to reproduce the crash. Preferably a "minimal" example that is easy and quick to run.

I created a simple repo where I'm able to reproduce the issue consistently: https://github.com/ramyma/otp_crash

You'll just have to run iex -S mix phx.server, and within a couple of minutes it will crash with:

iex(1)> hend=0x00007f4f237fbdb0
stop=0x00007f4f237fbc98
htop=0x00007f4f237fbca0
heap=0x00007f4f237f6ce0
beam/erl_gc.c, line 735: <0.561.0>: Overrun stack and heap
[1]    97851 IOT instruction (core dumped)  iex -S mix phx.server

Note: I'm using asdf, you'll find a .tool-versions, so you can run asdf install to get the required Erlang and Elixir versions.
Tested on Ubuntu 22.04.3

Lankester · 2023-08-19T00:52:05Z

I've put together a minimal repo based on the elixir 1.15-slim docker image. https://github.com/Lankester/Otp26Crash

Run docker compose up and it should crash within 10 to 20 minutes.

otp-26-crash-7292  | 00:38:00.255 [info] Broadcast 73898124 bytes on WebSocket.
otp-26-crash-7292  | 00:38:01.108 [info] Received message. 73898124 bytes.
otp-26-crash-7292  | 00:38:02.749 [info] Broadcast 77805516 bytes on WebSocket.
otp-26-crash-7292  | 00:38:03.670 [info] Received message. 77805516 bytes.
otp-26-crash-7292  | 00:38:05.133 [info] Broadcast 60002824 bytes on WebSocket.
otp-26-crash-7292  | 00:38:05.703 [info] Received message. 60002824 bytes.
otp-26-crash-7292  | 00:38:07.487 [info] Broadcast 54965496 bytes on WebSocket.
otp-26-crash-7292  | 00:38:08.008 [info] Received message. 54965496 bytes.
otp-26-crash-7292  | hend=0x00007f52f487d520
otp-26-crash-7292  | stop=0x00007f52f487d318
otp-26-crash-7292  | htop=0x00007f52f487d320
otp-26-crash-7292  | heap=0x00007f52f487a330
otp-26-crash-7292  | beam/erl_gc.c, line 735: <0.2402.0>: Overrun stack and heap

sverker · 2023-08-21T11:54:15Z

Thanks, @ramyma and @Lankester for the "crash repos".

@ramyma Just to be clear; you ran it directly on Ubuntu without any docker image?

ramyma · 2023-08-21T12:12:01Z

> Thanks, @ramyma and @Lankester for the "crash repos".
>
> @ramyma Just to be clear; you ran it directly on Ubuntu without any docker image?

@sverker yes, directly on Ubuntu.

sverker · 2023-08-22T13:23:12Z

We think we found the bug. Here is a quick fix if anyone wants to try it out:

diff --git a/erts/emulator/beam/jit/x86/instr_bs.cpp b/erts/emulator/beam/jit/x86/instr_bs.cpp
index 39dfb64f8f..f8d8cebbdf 100644
--- a/erts/emulator/beam/jit/x86/instr_bs.cpp
+++ b/erts/emulator/beam/jit/x86/instr_bs.cpp
@@ -3829,7 +3829,7 @@ static std::vector<BsmSegment> opt_bsm_segments(
             }
             break;
         case BsmSegment::action::GET_BINARY:
-            heap_need += heap_bin_size((seg.size + 7) / 8);
+            heap_need += std::max(ERL_SUB_BIN_SIZE, heap_bin_size((seg.size + 7) / 8));
             break;
         case BsmSegment::action::GET_TAIL:
             heap_need += EXTRACT_SUB_BIN_HEAP_NEED;

The same fix can be done for ARM in erts/emulator/beam/jit/arm/instr_bs.cpp.

The bug has existed since OTP 26.0. The root cause has nothing to do with elixir, alpine or musl libc.
Triggering it takes matching small byte-unaligned bitstrings out of a larger binary, as done by Elixir.Mint.WebSocket.Frame:decode_raw/3, plus some bad luck with the process being almost out of heap space at that moment.
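
The arithmetic behind the one-line change can be sketched in standalone C++. This is only an illustration under stated assumptions, not code from ERTS: the constants `kWordBytes`, `kHeapBinHeaderWords`, and `kSubBinWords` are assumed values for a 64-bit build, and `heap_bin_words`, `need_before_fix`, `need_after_fix`, and `shortfall_words` are hypothetical stand-ins for `heap_bin_size` and the two versions of the `GET_BINARY` line in the diff.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Assumed values for a 64-bit build; illustrative only, not copied from ERTS headers.
constexpr std::size_t kWordBytes = 8;           // bytes per heap word
constexpr std::size_t kHeapBinHeaderWords = 2;  // assumed heap-binary header size
constexpr std::size_t kSubBinWords = 5;         // assumed stand-in for ERL_SUB_BIN_SIZE

// Hypothetical stand-in for heap_bin_size(): header plus payload rounded up to words.
constexpr std::size_t heap_bin_words(std::size_t bytes) {
    return kHeapBinHeaderWords + (bytes + kWordBytes - 1) / kWordBytes;
}

// Reservation per GET_BINARY segment before the fix (the "-" line of the diff).
constexpr std::size_t need_before_fix(std::size_t size_in_bits) {
    return heap_bin_words((size_in_bits + 7) / 8);
}

// Reservation after the fix (the "+" line): never less than a sub-binary.
constexpr std::size_t need_after_fix(std::size_t size_in_bits) {
    return std::max(kSubBinWords, need_before_fix(size_in_bits));
}

// Difference between the two reservations for a given segment size.
constexpr std::size_t shortfall_words(std::size_t size_in_bits) {
    return need_after_fix(size_in_bits) - need_before_fix(size_in_bits);
}

static_assert(need_before_fix(12) == 3, "12 bits -> 2 bytes -> 2 + 1 words");
static_assert(need_after_fix(12) == 5, "raised to the assumed sub-binary size");
static_assert(shortfall_words(12) == 2, "two words short under these assumptions");
static_assert(shortfall_words(200) == 0, "larger segments were already covered");
```

Under these assumed sizes, a 12-bit segment reserved 3 words where the fixed code reserves 5. A gap that small only shows up when the process heap is nearly full, which fits the "bad luck" part of the explanation.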

We will probably soon release a fix in OTP 26.0.3.

The runtime system could underestimate the amount of heap space needed for matching out short bitstrings with a size not divisible by 8. That could lead to the runtime system terminating with an "Overrun heap and stack" error. Fixes erlang#7292

Normally, all BEAM files created by OTP 25 and later have a "Type" chunk that contains type information. `beam_lib:strip/1` will not discard the "Type" chunk, but build/release scripts that do their own custom stripping could accidentally delete the chunk. Make sure to test that loading and executing BEAM files without type information works. Since there was an overrun-heap-and-stack bug (reported in erlang#7292, fixed in erlang#7581) when using the bit syntax, the bit syntax test suites seem appropriate to clone to new BEAM files without types.

vshev4enko added the bug Issue is reported as a bug label May 24, 2023

vshev4enko changed the title ~~Overrun stack and heap~~ Overrun stack and heap OTP-26.0 May 24, 2023

IngelaAndin added the team:VM Assigned to OTP team VM label May 26, 2023

rickard-green self-assigned this May 29, 2023

mikpe mentioned this issue Jun 25, 2023

Segmentation fault in 26.0.1 #7436

Closed

jhogberg mentioned this issue Jul 12, 2023

Segmentation Fault on String.replace/4 #7492

Closed

oestrich added a commit to nerves-hub/nerves_hub_web that referenced this issue Aug 8, 2023

Switch to debian in the Dockerfile

0bf1343

Alpine uses musl which can cause an issue to overrun the stack and heap in OTP 26 (erlang/otp#7292).

rickard-green assigned sverker and unassigned rickard-green Aug 14, 2023

bjorng mentioned this issue Aug 23, 2023

Fix heap allocation for matching out short bitstrings #7581

Merged

bjorng linked a pull request Aug 23, 2023 that will close this issue

Fix heap allocation for matching out short bitstrings #7581

Merged

bjorng added a commit that referenced this issue Aug 25, 2023

Merge pull request #7581 from bjorng/bjorn/jit/small-bitstrings/GH-7292…

8d41073

…/OTP-18733 Fix heap allocation for matching out short bitstrings

bjorng closed this as completed in a9ed6ec Aug 25, 2023

bjorng mentioned this issue Aug 30, 2023

Test BEAM files without type information #7603

Merged

nicocirio mentioned this issue Sep 22, 2023

Update to Elixir 1.15 and Erlang otp 26 Simon-Initiative/oli-torus#4242

Merged

rneswold mentioned this issue Oct 3, 2023

Erlang front-ends memory leak fermi-ad/controls#16

Closed

mcfiredrill mentioned this issue Oct 24, 2023

random crashes datafruits/fruitbot#29

Open
