Overrun stack and heap OTP-26.0 #7292
Comments
Same thing here, alpine 3.18.0, OTP 26.0 but on AWS ECS Fargate.
Thanks for your report, could you provide a core dump together with the |
Sorry I can't retrieve that off ECS.
That's unfortunate :-/ Can you reproduce this with image that isn't based on alpine (or more importantly, doesn't use |
It's production stuff, sorry I can't do experiments on this. I haven't had any issue on the pre-production/test servers.
As an added bonus you'll get 10-15% better performance on linear code. :-)
You mean run the same application under Ubuntu instead of Alpine on the environment where the latter crashed?
Yeah, that'd work. As long as the environment uses |
Replaced base image with |
Thank you, then the crash may be related to that configuration. If you can provide a core dump (+ |
@jhogberg would you recommend as a general rule that we avoid musl for running erlang in production? I am getting the same issue on alpine:
And can reproduce it on a build server. I don't see a |
I'd avoid it for now, this only seems to happen with this configuration.
That's great! It ought to have dumped core and not generated a crash dump, so that's why you're not seeing one. Please send us the core dump together with |
I want to signal that this also appeared under the same conditions when I was working on a Fly.io machine in the cloud. I was spawning 1000 docker containers all under a single supervisor, so there was an Erlang process for each one of them.
@jhogberg beam.smp and core dump: https://git.sr.ht/~whk/erlang-26-crash-reports/tree
Please let me know if there is anything else that would help.
Docker image: hexpm/elixir:1.15.0-erlang-26.0.1-alpine-3.18.2
That |
@jhogberg I can't even pretend to understand what you mean, since I have very little Erlang or gdb experience. If it's somewhere on the instance I can upload it, but I'm not sure where to look.
I just copied what was in this container, but I have a feeling it's not there: hexpm/elixir:1.15.0-erlang-26.0.1-alpine-3.18.2
That's really annoying, digging into the images they seem based on https://hub.docker.com/_/erlang which forcibly strips all symbols without having the common decency to save them anywhere. It would've been nice if they didn't strip them in that layer but left it to the final one. We're basically stuck until we have the symbols, I'll try to get in touch with the folks maintaining those images but I'm not sure how long that will take. In the meantime I suppose we can make our own nearly-identical images by removing the lines that strip symbols (the ones that look like this), but if you want to wait until the upstream images contain symbols I'm fine with that.
@jhogberg thanks for having a look. I can try the suggested change on our build server sometime this week and will post back. There is no urgency for me, but I'm sure more people will run into this soon.
Probably worthless information but here it is anyway: it seems to only happen on apps that use large chunks of memory. Out of 3 apps that I'm running with OTP 26 on alpine, only the one that uses large chunks of memory had issues and crashed multiple times a day. The other 2 have been running fairly smoothly.
I don't think it's a memory issue. I have a 16GB machine both in the cloud and locally; it works fine locally, but the deployment in the cloud is crashing. It's a fairly simple Elixir application that runs a bunch of GenServers with a 1MB payload. I'm using alpine, I'm guessing with OTP-26 by default.
Error
Previously the same application was working fine on |
Don't have much to add to this thread, other than that it was working fine on OTP-25 and Elixir 1.14 alpine linux (ECS-Fargate) for us, and we only saw it today for the first time on OTP-26 and Elixir 1.15. Will try the debian image tomorrow and see how that gets on.
Also started happening to me when I updated elixir:1.15.4-alpine to elixir:1.14.0-alpine.
I have just got this error in a Debian docker container running off of the
Error:
It's a relatively simple application consisting of 3 GenServers. I have a core dump but can't share it due to sensitive information within, and I have been unable to replicate the issue in a test app that omits the sensitive info. I've attached a gdb backtrace in case that's useful though, and I am open to suggestions and happy to help further.
I am not running |
It seems to crash almost every time in the same place — receiving and decoding a websocket message. Also, possibly worth pointing out hardware is Intel Mac.
Another gdb log attached from another crash.
Moving to OTP 25 via |
Seeing this on OTP 26.0.2 and Alpine 3.17.4. We're using this Docker image -- Not sure if the core dump will have symbols for this Docker image, or where to find the core dump, but I'll follow up if we find it.
UPDATE: I poked around but didn't find a core dump -- I don't know the PID it was running under, and I think Kubernetes restarted the containers so the evidence is gone. For now I'm downgrading to |
Alpine uses musl which can cause an issue to overrun the stack and heap in OTP 26 (erlang/otp#7292).
I'm getting a similar crash as well with
The crash happens after receiving some messages (and while receiving more messages) over the websocket. I'm using Ubuntu 22.04.3.
I am seeing this issue too. I am also running the I will try and get more information around this, but for now will revert to |
I am seeing this issue too:
building image myself with: |
It seems a lot of people are able to reproduce this.
I reached this issue after some rabbitmq server crashes.
Setup: plain LXC container, alpine 1.18 image, erlang-26.0.2, elixir-1.15.4
Log:
In case my context might help anyone debug this, I hit this bug hard yesterday after building (on Apple Silicon) Ubuntu images (plural because I used a variety of base images with various versions of Elixir & Erlang, hoping that one of them might work) to run an Elixir script. Was able to build the images fine, but when I booted the container and ran my .exs file, (Compiling on Apple Silicon adds a layer of complexity, see: https://pythonspeed.com/articles/docker-build-problems-mac/, but this may be a red herring, given that others seem to have hit this issue elsewhere.) FWIW, I wound up spinning up an EC2 instance and getting the same .exs file to run just fine there.
I created a simple repo where I'm able to reproduce the issue consistently: https://github.com/ramyma/otp_crash You'll just have to run
Note: I'm using asdf, you'll find a |
I've put together a minimal repo based on the elixir Run
Thanks, @ramyma and @Lankester for the "crash repos". @ramyma Just to be clear; you ran it directly on Ubuntu without any docker image?
@sverker yes, directly on Ubuntu.
We think we found the bug. Here is a quick fix if anyone wants to try it out:
The same fix can be done for ARM in erts/emulator/beam/jit/arm/instr_bs.cpp. The bug has existed since OTP 26.0. The root cause has nothing to do with Elixir, Alpine or musl libc. We will probably soon release a fix in OTP 26.0.3.
The runtime system could underestimate the amount of heap space needed for matching out short bitstrings with a size not divisible by 8. That could lead to the runtime system terminating with an "Overrun heap and stack" error. Fixes erlang#7292
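For readers unfamiliar with the bit syntax, the kind of match this description refers to looks roughly like the sketch below. It is an illustration only, not a confirmed reproducer of the crash; the module name `short_bits` and the 13-bit segment size are made up for the example.

```erlang
-module(short_bits).
-export([demo/0]).

%% Hypothetical illustration: match out a short bitstring whose size
%% (13 bits) is not divisible by 8. This is NOT a verified reproducer.
demo() ->
    Bin = <<1, 2, 3, 4>>,                          % 32 bits in total
    <<Head:13/bitstring, Rest/bitstring>> = Bin,   % Head gets the first 13 bits
    %% Head is a bitstring but not a binary, because 13 is not a multiple of 8.
    {bit_size(Head), is_binary(Head), bit_size(Rest)}.
```

Calling `short_bits:demo()` returns `{13, false, 19}`: the matched-out segment is 13 bits long, is not byte-aligned, and leaves 19 bits in the remainder.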
…/OTP-18733 Fix heap allocation for matching out short bitstrings
Normally, all BEAM files created by OTP 25 and later have a "Type" chunk that contains type information. `beam_lib:strip/1` will not discard the "Type" chunk, but build/release scripts that do their own custom stripping could accidentally delete the chunk. Make sure to test that loading and executing BEAM files without type information works. Since there was an overrun-heap-and-stack bug (reported in erlang#7292, fixed in erlang#7581) when using the bit syntax, the bit syntax test suites seem appropriate to clone into new BEAM files without types.
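As a rough sketch of what "custom stripping" that loses the type information could look like, the snippet below rebuilds a BEAM file without its "Type" chunk. It assumes `beam_lib:all_chunks/1` and `beam_lib:build_module/1` are used for the rebuild; the module name `strip_types` is hypothetical, and this is not necessarily how the OTP test suites produce their type-less files.

```erlang
-module(strip_types).
-export([without_types/1]).

%% Hypothetical helper: return a BEAM binary identical to the input
%% (a file name or a binary) except that the "Type" chunk is dropped.
without_types(Beam) ->
    {ok, _Mod, Chunks} = beam_lib:all_chunks(Beam),
    %% Keep every chunk except the one holding type information.
    Kept = [Chunk || {Id, _Data} = Chunk <- Chunks, Id =/= "Type"],
    {ok, Stripped} = beam_lib:build_module(Kept),
    Stripped.
```

The resulting binary could then be loaded with `code:load_binary/3` to check that the module still loads and runs without type information.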
Describe the bug
The application goes down a few seconds after the release is started in AWS EKS.
To Reproduce
Unfortunately I have no idea how to reproduce it.
Affected versions
26.0
Additional context
AWS EKS
erlang 26.0
alpine 3.18.0
Application logs