runtime: "fatal error: s.allocCount != s.nelems && freeIndex == s.nelems" during newobject on windows-amd64 #45775
Comments
Tentatively marking as release-blocker for 1.17 until we can determine whether this is a regression. (The only other occurrence of this error I could find in the logs was in 2019 on plan9, which seems likely to be unrelated.) |
/cc @bufflig |
Weekly check-in: this needs to be investigated before beta 1. |
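Here's what the situation looks like: nextFreeFast (the allocator fast path) was called for this span that was already in the mcache, and it failed. Then, nextFree was called to either replenish the span's allocCache, or go get a new span. In this case, the span appears full, so it tries to go get a new span, but before it does that, the allocator notices that allocCount doesn't line up with nelems and throws an error.
What's interesting here is that this means the program was already allocating out of this mcache in the current GC cycle, at least once, anyway. And somehow the allocator missed a free slot in the process. In general, this is very unlikely; these code paths are exercised extremely heavily. The allocator did not change in 1.17, so if there's a bug in that logic, it's not new. I've walked over this code a bunch of times now and I can't find a fault in the algorithm.
I thought maybe I saw something in a corner case, like a span that was just swept (there is some super weird stuff we do that I think we should clean up, like relying on certain pieces of span state to change then fixing them up...) but it all seems to check out.
Now, on the other hand, if we consider memory corruption a (scary but) viable alternative, then an errant zero value on the span's allocCache value would manifest as this error in many cases. |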
That check was placed there to detect a Lazarus condition where an object
is not reachable during GC cycle n but is then reachable during cycle n+1.
Broken stack maps and unsafe pointer tricks tend to be the root (pun
intended) problem.
…On Tue, May 18, 2021 at 7:01 PM Michael Knyszek ***@***.***> wrote:
Here's what the situation looks like: nextFreeFast (the allocator fast
path) was called for this span that was already in the mcache, and it
failed. Then, nextFree was called to either replenish the span's
allocCache, or go get a new span. In this case, the span *appears* full,
so it tries to go get a new span, but before it does that, the allocator
notices that allocCount doesn't line up with nelems and throws an error.
What's interesting here is that this means the program was already
allocating out of this mcache in the current GC cycle, at least once,
anyway. And somehow the allocator missed a free slot in the process. In
general, this is very unlikely; these code paths are exercised extremely
heavily. The allocator did not change in 1.17, so if there's a bug in that
logic, it's not new. I've walked over this code a bunch of times now and I
can't find a fault in the algorithm.
I thought maybe I saw something in a corner case, like a span that was
just swept (there is some *super* weird stuff we do that I think we
should clean up, like relying on certain pieces of span state to change
then fixing them up...) but it all seems to check out.
Now, on the other hand, if we consider memory corruption a (scary but)
viable alternative, then an errant zero value on the span's allocCache
value *would* manifest as this error in many cases.
|
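To make the failure mode described above concrete, here is a minimal toy model in Go. It is not the runtime's code: toySpan, newToySpan, run, and errSpanFull are hypothetical names invented for illustration, the span is limited to 64 slots so the allocCache never needs a refill, and only the field names (allocCount, nelems, freeIndex, allocCache) are borrowed from the discussion. It sketches how an errant zero in allocCache could make a partly used span look full and trip a consistency check of the kind discussed in this issue.

```go
package main

import (
	"errors"
	"fmt"
	"math/bits"
)

// toySpan is a hypothetical, heavily simplified stand-in for a runtime span.
// It supports at most 64 slots, so allocCache never needs a refill.
type toySpan struct {
	nelems     int    // total object slots in the span
	allocCount int    // slots handed out so far
	freeIndex  int    // next slot index to examine
	allocCache uint64 // bit i set: slot freeIndex+i is believed free
}

// errSpanFull stands in for "go get a new span" in this toy model.
var errSpanFull = errors.New("span full: fetch a new span")

func newToySpan(nelems int) *toySpan {
	return &toySpan{nelems: nelems, allocCache: ^uint64(0)}
}

// nextFreeFast models the fast path: take the lowest set bit of allocCache.
func (s *toySpan) nextFreeFast() (int, bool) {
	tz := bits.TrailingZeros64(s.allocCache)
	idx := s.freeIndex + tz
	if tz == 64 || idx >= s.nelems {
		return 0, false // no free slot visible in the cache
	}
	s.allocCache >>= uint(tz + 1)
	s.freeIndex = idx + 1
	s.allocCount++
	return idx, true
}

// nextFree models the slow path: when the span appears full, cross-check
// allocCount against nelems before asking for a new span.
func (s *toySpan) nextFree() (int, error) {
	if idx, ok := s.nextFreeFast(); ok {
		return idx, nil
	}
	s.freeIndex = s.nelems // the span appears full
	if s.allocCount != s.nelems {
		return -1, fmt.Errorf("s.allocCount=%d s.nelems=%d: span appears full but counts disagree",
			s.allocCount, s.nelems)
	}
	return -1, errSpanFull
}

// run allocates until nextFree reports an error. If corruptAfter >= 0,
// allocCache is zeroed once that many slots have been allocated, which
// simulates the memory-corruption hypothesis.
func run(nelems, corruptAfter int) (int, error) {
	s := newToySpan(nelems)
	for {
		if s.allocCount == corruptAfter {
			s.allocCache = 0
		}
		if _, err := s.nextFree(); err != nil {
			return s.allocCount, err
		}
	}
}

func main() {
	n, err := run(8, -1) // healthy span: fills completely
	fmt.Println(n, err)
	n, err = run(8, 3) // allocCache zeroed after 3 allocations
	fmt.Println(n, err)
}
```

In the healthy run the span fills and the counts agree, so the model simply asks for a new span. In the corrupted run the counts disagree at the moment the span appears full, which is the shape of the reported crash.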
Thanks Rick. I'm trying to wrap my head around why a revived object would be caught here, though, instead of when sweeping. Since only swept spans may be cached, we check for revived objects during sweeping, and marking never touches the allocation bits, I'm not sure how the allocation path could surface an error for a revived object, at least the way the code is currently written. Perhaps there's something I'm missing, though. |
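The reasoning above can be sketched with another toy model, again in Go and again not the runtime's code. The type sweepSpan and the method zombies are hypothetical names, and the span is limited to 64 slots. The assumption being illustrated is the one stated in the comment: marking writes only to the mark bits, so a revived ("zombie") object shows up as a slot that is marked but not allocated, and that comparison happens at sweep time, not on the allocation path.

```go
package main

import (
	"fmt"
	"math/bits"
)

// sweepSpan is a hypothetical, simplified view of a span at sweep time.
type sweepSpan struct {
	freeIndex int    // slots below this were handed out since the last sweep
	allocBits uint64 // bit i set: slot i was allocated as of the last sweep
	markBits  uint64 // bit i set: slot i was marked reachable this cycle
}

// zombies returns the slots that are marked but were never allocated.
// Slots below freeIndex are skipped: they were allocated after the last
// sweep, so allocBits does not cover them yet.
func (s *sweepSpan) zombies() []int {
	mask := ^uint64(0) << uint(s.freeIndex)
	z := (s.markBits &^ s.allocBits) & mask
	var out []int
	for z != 0 {
		out = append(out, bits.TrailingZeros64(z))
		z &= z - 1 // clear the lowest set bit
	}
	return out
}

func main() {
	// Slots 0 and 2 were allocated at the last sweep; slot 1 was allocated
	// since then (freeIndex is 2); slot 3 is free but has been marked.
	s := &sweepSpan{freeIndex: 2, allocBits: 0b0101, markBits: 0b1111}
	fmt.Println(s.zombies())
}
```

In this model nothing on the allocation side ever reads markBits, which is why a revived object would be expected to surface during sweeping and not in nextFree.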
I think this needs to be addressed before final release and ideally before RC, so ping. |
Checking in as the RC1 date is approaching. |
Just walked back all the windows/amd64 failures I could find in the logs up to this failure. I couldn't find anything else like it. EDIT: Or anything else untriaged or unresolved, for that matter. I'm really not sure what to do here. My analysis above led nowhere. |
As far as I can tell this has happened exactly once. "Once is happenstance." If we don't see any way that this could happen, I think we should close the issue until it recurs. |
If it happens again, I'm happy to throw more resources at it to find the root cause, but I really don't see how this could happen, and I don't feel particularly optimistic about reproducing it given how well reproducing FreeBSD memory corruption is going (though, that has certainly happened more than once!). |
I'd be ok with closing it as non-reproducible for now, given that we also can't identify any changes that might have caused it. |
Closing. |
I just saw this same issue in Go 1.16.10 on illumos. (See oxidecomputer/omicron#1146.) I see that someone else saw this on Linux on ARM years ago. Should this issue be reopened? |
I've been able to reproduce this and what look like other GC related errors on OmniOS (an illumos based OS) running on physical hardware and under QEMU. Here's the steps that have worked for me: On Ubuntu, install the Create an OmniOS disk:
Download the OmniOS installer:
Boot qemu, and install OmniOS:
Note that $(nproc) on my system returns 24. During the install process:
Then pick Halt. Boot the freshly installed OmniOS image without the USB stick and forwarding SSH:
There's some post-install setup required. Switch over to
after that,
Inside the VM, install the prereq packages:
Disable tmpfs by editing
Reboot (with
Then, run With this, we've seen
on some physical machines, we've seen:
and
|
Thanks for the report. Given that this issue has been closed for a year and we hadn't seen crashes on illumos previously, could you move this to a new issue? |
@mknyszek The platform is ARM, running Yogurt (Phytec Example Distribution) BSP-Yocto-i.MX6-PD18.1.2, with go1.17.13.linux-armv6l.tar.gz. This is the log:
runtime: s.allocCount= 511 s.nelems= 512
goroutine 69009 [running]:
runtime: s.allocCount= 340 s.nelems= 341
goroutine 17 [running, locked to thread]:
Can you help me? Thanks. |
This also occurred for me on this machine: go1.20.6, AMD Ryzen 9 6900HX, 32 GB RAM, Fedora Linux 38 (6.4.6-200.fc38.x86_64)
|
@ajstarks Can you please file a new issue? Also, is this reproducible? Thanks. |
@mknyszek will do, and no I have not been able to reproduce this after several attempts. |
2021-04-23T21:42:59-41e5ae4/windows-amd64-2012
CC @mknyszek