Users on headless completely freeze after some time #2079
Comments
This is a bit too vague. I've seen headless run bigger events without this necessarily occurring.
@shiftyscales
@Frooxius
For now, this has been reproduced on at least three worlds (see additional details); I will still need to conduct a test on an empty grid.
I did not record server FPS, but here are some metrics I pulled from Grafana (all times in UTC):
Baremetal (third test):
Docker (second test):
To reiterate the issue: after running for a while with enough users, the session starts having issues.
@jae1911 When this happens, can you spawn the user's debug string and see if the Delta messages & sync ticks are incrementing? It sounds to me like the data model or streams somehow stop updating properly, or something gets clogged up somewhere. We need to isolate what and where. I don't think this is really a performance/load thing though, given that everything just freezes. If it were performance, then it should still work, just be very laggy and sluggish, but a complete freeze suggests something just straight up breaks.
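To illustrate the distinction this check is after, here is a minimal sketch (plain C#, assuming a generic counter rather than Resonite's actual sync tick API): if the counter keeps advancing while users look frozen, the server is merely slow; if it stops advancing entirely, the update path itself has broken.

```csharp
using System;
using System.Threading;

// Illustrative sketch only, not Resonite code: samples a monotonically
// increasing counter twice to tell "slow but alive" (counter still advances)
// apart from "completely stalled" (counter frozen).
class StallProbe
{
    static long fakeSyncTick; // hypothetical stand-in for the sync tick in the user debug string

    static void Main()
    {
        // Simulate an update loop that keeps incrementing the counter on another thread.
        new Thread(() =>
        {
            while (true) { Interlocked.Increment(ref fakeSyncTick); Thread.Sleep(100); }
        }) { IsBackground = true }.Start();

        long first = Interlocked.Read(ref fakeSyncTick);
        Thread.Sleep(TimeSpan.FromSeconds(2));
        long second = Interlocked.Read(ref fakeSyncTick);

        Console.WriteLine(second > first
            ? $"Counter advanced by {second - first}: slow at worst, but the update loop is alive."
            : "Counter did not advance: the update path itself has stopped.");
    }
}
```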
Sure, I will organise two tests then:
Using a baremetal install.
So we ran a test using @Readun's headless running Club Crystalline on Windows 10 Pro in a Proxmox VM, with only the headless running. I recorded this video near the end, showcasing the issue well: https://youtu.be/ZVBfU90qhYo
Logs from the headless: READUNSERVER - 2024.5.24.202 - 2024-05-24 18_12_55.log
To add a bit more information: the usage suddenly spikes intensively, then there's a sudden change to a full lockup, with all 24 threads at what is posted above. The day before we had it as well, with 12-14 people: all 24 threads at ~60%, with headless FPS at 10-20. It suddenly freed itself again after 2 hours with the same user count. Edit: my guess is that the server locked up because we had 22 people this time; yesterday with 12-16 people it "survived" at 10-20 FPS, but it was horrible. Voice and QM were still fine!
Given how close that is to 50% on a processor with SMT, I'm curious if Resonite is just being CPU-bottlenecked on the headless. Do you have any stats on how the usage went up over time, @jae1911? E.g. you mentioned it being a 'sudden change': can you pin the sudden change to any particular thing that occurred in the world? E.g. a user spawning an item, switching to a particular avatar, another user joining, or some other potential source that could account for a sudden jump in CPU use? Have you tested whether the same ends up occurring on the gridspace?
From the issue you linked, I see you mention:
Does this issue also occur with a vanilla headless installation?
@shiftyscales I was eyeing the usage the entire time. We tried switching avatars and it was extremely irregular... this will be a hard one to debug, I feel x..x This time, the log posted by @jae1911 is from a vanilla headless on Windows (completely removed everything from the modloader and mods). PS: I think Smoin mentioned that Blood on the Clocktower has been having similar issues lately too.
Thanks for the new data and the video. I've renamed the issue, since this is a complete freeze rather than a hitch (a hitch to me means that it freezes for a second or a few and then resumes). With how this behaves, I'm very confident this is not a performance issue at all; rather, something breaks/explodes. Were you not able to spawn and capture the user debug string with the sync tick counter that I requested above? That would've been very helpful. For the stream in the video, were you the source of the stream? Or was that coming from another user and still functional? From the log, it seems like the session network threads have possibly exploded and stopped processing data. There's an exception when shutting down, which sounds like something else exploded earlier but wasn't caught in the log. We might need to add some additional logging.
@shiftyscales
Sadly, it seems somebody deleted it for some reason, and when I tried spawning it in the video, it wouldn't update :/ |
I... sadly deleted it a couple of minutes before, as I thought we had ruled it down to the mod from yesterday. Buuut we were "celebrating" too early :/ J4 was not the source of the stream. It was me, but I'm on LAN with the headless itself.
Let me ask it this way so it's less confusing on my end: Is the audio stream shown in the video being streamed by the user who recorded that video? Or is it streamed by another user in the session who is currently frozen (in the video)?
The video was recorded by me (j4) while the audio was streamed by @Readun.
I see mono 6.8 as one of the only concrete mono versions you're listing. Mono is currently up to 6.12.0.2xx and greater at this point. Can you confirm 100% that you're using at least that version on all configs by running the version command? I'd highly recommend updating mono for all of these configurations if it's anything less, just to be sure.
I can definitely confirm mono is on its latest available version with the Docker tests, since the image is directly pulled from DockerHub.
Small update: I was on a headless hosted by @decoybird and Avantis, and the locking issue happened. I will still conduct another test this week with an empty gridspace to see if the issue persists there, and will also put this panel around in the sessions this week on Wed and Thu to get more information.
We might have a clue on what is causing it now after yesterday's event, and it might actually be something on someone's avatar.
Update: Okay... nevermind, it is not avatar-specific. God, this is such a wonky thing to pin down DX My current observation: so far it can be triggered or "fixed" when people join/leave or change their avatars in a session.
Could you try isolating that particular case, @Readun? E.g. have the user equip the heavy avatar, have a secondary user join, and see if that causes it? I'm not sure why you ruled out it being avatar-specific if the same avatar was in use on each of the occasions it happened.
Because it also went down while they were still wearing it. I'm not sure how to pin it down further :/ It is really heavy for the headless, and we had experienced full lockups even when the headless only has a more reasonable 6-thread CPU. We haven't tried to rule out whether it's only the headless client or the normal client too, btw. Is there anything Froox can do to be able to log the issue?
Like I mentioned above, I don't think this really has anything to do with the actual performance, but rather something completely breaks part of the networking code, like throwing an exception inside of it. Typically performance problems will still keep the system chugging forward, just slowly and jittery. They will not result in a complete halt that never recovers. I need to add more instrumentation to gather more data. We had this happen during BoTC yesterday as well, even when people were joining in simple avatars.
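As an illustration of why an exception can present as a total freeze rather than lag, here is a hedged sketch in plain C# (not Resonite's actual networking code): if the task draining incoming messages faults on an unhandled exception, the queue keeps filling but is never processed again, while the rest of the process stays up.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch only, not Resonite's networking code. An exception thrown
// inside a Task faults that Task; if nothing observes the fault, the consumer
// silently stops draining the queue while the rest of the process keeps running,
// which presents as a total freeze rather than ordinary slowness.
class DeadLoopDemo
{
    static readonly BlockingCollection<string> Incoming = new BlockingCollection<string>();

    static void Main()
    {
        Task consumer = Task.Run(() =>
        {
            foreach (var message in Incoming.GetConsumingEnumerable())
            {
                if (message == "poison")
                    throw new InvalidOperationException("boom"); // faults the task; nothing reaches the log unless someone observes it
                Console.WriteLine($"processed: {message}");
            }
        });

        Incoming.Add("hello");
        Incoming.Add("poison"); // after this, nothing is ever processed again
        Incoming.Add("world");  // stays queued forever

        Thread.Sleep(1000);
        Console.WriteLine($"consumer status: {consumer.Status}, still queued: {Incoming.Count}");
    }
}
```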
New rough info we were able to observe:
I'm a bit confused on what it could be, as we were already deleting a lot of his stuff from the avatar before. It is -not- the same as in #2213, as it goes back to normal and no restart is required.
Okay, we went through further testing, but I have no idea how we can pinpoint it further...
I'm running out of ideas for what we can test further :/
I'm adding some extra diagnostics in the next build. I'm starting to build a suspicion this might be a deadlock in one of the threads. Which is a bit unfortunate, but it could be the excuse needed to rewrite this system (which would also solve a number of other issues).
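For context on the deadlock suspicion, a minimal generic sketch (not Resonite code) of the pattern being suspected: two threads each hold one lock and wait on the other's, so both stop forever, while the rest of the process and its CPU usage can look perfectly normal.

```csharp
using System;
using System.Threading;

// Illustrative sketch only: the classic lock-ordering deadlock. Thread A holds
// LockA and waits for LockB; thread B holds LockB and waits for LockA. Both
// block forever, and anything that depends on either thread freezes with them.
class DeadlockDemo
{
    static readonly object LockA = new object();
    static readonly object LockB = new object();

    static void Main()
    {
        var a = new Thread(() => { lock (LockA) { Thread.Sleep(100); lock (LockB) { } } }) { IsBackground = true };
        var b = new Thread(() => { lock (LockB) { Thread.Sleep(100); lock (LockA) { } } }) { IsBackground = true };
        a.Start();
        b.Start();

        // Join with a timeout: if both threads were healthy this would return true almost immediately.
        bool finished = a.Join(2000) && b.Join(2000);
        Console.WriteLine(finished ? "no deadlock" : "deadlocked: both threads are permanently blocked");
    }
}
```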
Okay, the diagnostic command is now in 2024.6.24.1317! If the session freezes your headless, type this command and save the output. Then type it again a few seconds later and save the output again. And post it here.
Since it didn't seem to be mentioned in the issue itself, the command is
Mmnnhhh... @Frooxius, this is slightly tricky. Does it still output the desired diagnostic, or is this only specific to a freeze? Gonna try it in the evening.
I don't know. I just need you to run the command when this issue happens and see what it outputs (or if anything else happens).
Will do!
That was my first suspicion too after reading
Would make perfect sense to explain why it only gets stuck when something else isn't done fast enough.
Thanks! I'm a bit confused by these outputs though. Is the "With the issue" screenshot after everything already froze? Meaning, are both of these after everything freezes?
The first one is with that bug occurring; the second one is after Ink removed the problematic avatar & the asset cleanup took place. Okay, another test with only 4 threads... @jae1911, we need to test this with a Linux headless. Apparently a Windows headless does not "freeze" fully. READUNSERVER - 2024.6.24.1317 - 2024-06-25 19_11_57.log I will try another quick test with just one thread to be fully sure...
Wait, I'm a bit more confused - so you're able to unfreeze the session by deleting something? And then it continues as normal? In order to diagnose this, I need two outputs of that command while everything is frozen - I need to see what's happening during everything being frozen at two distinct points in time. That's why I asked to wait a few seconds and type it again. Otherwise this won't be very useful.
Yes, described here. All of that from me is with a Windows headless. @jae1911 is booting up his Linux headless right now to do further testing, and I'm restricting mine to 1 thread.
In my last screenshot with 4 threads, I wrote "Starting now" into the command line when we triggered the bug.
This is with 1 thread. This is with the bug triggered; it does not freeze on Windows. Next up is testing the Linux headless with @jae1911!
I'm not fully clear on the sequence of events there. You mention high CPU utilization, but I'm not sure if that means that everything in the session is frozen? Similarly, when there's low CPU utilization, does that mean that everything is unfrozen? You could have low CPU utilization while the session still remains frozen. Can you explicitly specify when the session freezes and when it unfreezes, please? It's a bit ambiguous what correlates to what. From the screenshot however, it doesn't actually seem like this is a deadlock at all - the sync threads keep processing. If the session is able to unfreeze as well, that would also indicate that something completely different is at play. We'd probably need to isolate the item that needs to be deleted to unfreeze everything (e.g. by deleting it piece by piece). Actually, this brings up another question - when a new user joins after the server is already frozen, is everything frozen for them too?
In my original testing, this is what happened: everybody was frozen at spawn (see screenshot in #2079 (comment)).
I have the feeling this is a mixture of having this specific bug + having a lot of players in the session to overwhelm a specific process.
As written above, it is Ink_25's body slot and assets. vvvv
PS: We just confirmed this bug will trigger a headless freeze -IF- there are enough players as well.
Alright, as for the Linux test. Here is without the bug happening (before triggering anything):
The bug starts (all cores start being 80-98% all the time):
Everybody is frozen:
Log files: Some notes: the freeze happened after somebody joined after the bug started happening.
I think we need to separate here what we're actually trying to narrow down. High CPU usage itself isn't necessarily an issue - it can happen, but if things don't freeze, it might be lots of other things - we should ignore those unless the freeze also happens, because they might have a different cause and that can make things more confusing. When things do freeze though, that indicates something other than just high CPU usage - something gets heavily broken or overwhelmed, and that's what we need to focus on. Given the latest info, it looks like it's not a deadlock however, which means I'll have to add some more diagnostics - you don't need to send any more of the debug command outputs, these ones are sufficient, thank you! Would you be able to write a concise bullet-point list of specific steps on what exactly happens, in what order - including when things freeze and when they unfreeze - so I have a clear timeline of this? I'm a bit scrambled on the exact details, since it's spread through a bunch of messages and descriptions.
Interestingly, this time my client got affected after that freeze and crash. I'm in my local world and the client seems to stutter heavily between 200 fps and a stopped frame. But we should ignore it for now.
It is 100% an issue, in this case a combination of Ink's avatar assets & the asset cleanup event.
This is definitely a combination of multiple issues. If the issue happens with a lot of players, then it deadlocks something. PS: Oh woup, I was writing my response to the first part of your response. Doing more testing now!
It can be a symptom of an issue. Problem is, it can be a symptom of lots of different issues, or none at all, so we need to be careful we're not muddling the data - e.g. the new diagnostic command assumed a deadlock and that everything freezes completely. The cases where it does deadlock give the most useful data for getting at the "core" of this.
How exactly are you triggering it?
Alright, a single person joining does not cause the freeze. During the test, the headless crashed while I was typing the
Full logs: We're going to proceed with more tests to see about the other causes.
Only to cause the CPU load:
As long as the headless hasn't frozen into the deadlock (when we have more players in a session), we can revert the CPU load issue by:
What I mean is, how exactly are you triggering the asset cleanup?
Describe the bug?
When hosting a weekly event, after 6+ people have joined, the headless starts to have huge performance issues.
Users stop moving, voices crackle, and audio streams basically stop.
The headless does not appear to be crashing; it simply becomes unusable and has to be manually shut down.
When a user joins while the headless is in this state, they will see all the other present users at spawn, seemingly having IK issues (see screenshots provided).
To Reproduce
Expected behavior
Headless should be able to handle this load.
Screenshots
Users frozen:
Screenshot from a user that joined showing all other users stuck at spawn during the issue:
Resonite Version Number
2024.5.22.1274
What Platforms does this occur on?
Linux
What headset if any do you use?
Headless
Log Files
From last test, custom, baremetal (see additional context below).
cheesebox - 2024.5.22.1274 - 2024-05-23 19_44_21.log
Additional Context
Tested with three setups:
In all three cases:
In the case of Linode:
Mono version for baremetal:
Mono version for Docker:
mono:latest on DockerHub.
Docker image for the headless is Shadowpanther's.
Linux version:
Worlds:
resrec:///G-RetroGames/R-7d1cbf0d-0180-4392-8262-9bc9c20a9e06
resrec:///G-United-Space-Force-N/R-96728ad3-a07b-4ab6-8cf5-f4086045bd4d
Custom server is served by 10/10Gbps, Linode has 40/6Gbps.
Sample headless config (sensitive info removed): https://g.j4.lc/-/snippets/10
This issue has been happening for the past three weeks, in worlds both with and without a culling system.
Reporters
U-j4 | j4.lc (Discord)
U-Readun | @Readun
U-Hikari-Akimori
U-Ink-25