Users on headless completely freeze after some time #2079

Open · jae1911 opened this issue May 23, 2024 · 49 comments

Labels: bug (Something isn't working as intended.) · needs more information (More information is requested about this issue.) · triaged (This issue has been assessed.)

Comments

jae1911 commented May 23, 2024

Describe the bug?

When hosting a weekly event, after 6+ people join, the headless starts having huge performance issues.
Users stop moving, voices crackle, and audio streams essentially stop.

The headless does not appear to crash; it simply becomes unusable until it is manually shut down.

When a user joins while the headless is in this state, they will see all the other present users at spawn, seemingly having IK issues (see screenshots provided).

To Reproduce

  1. Launch headless
  2. Gather people
  3. Wait around 10 minutes
  4. Headless will start having performance issues until all users are frozen

Expected behavior

Headless should be able to handle this load.

Screenshots

Users frozen:
(screenshot)

Screenshot from a user that joined showing all other users stuck at spawn during the issue:

Resonite Version Number

2024.5.22.1274

What Platforms does this occur on?

Linux

What headset if any do you use?

Headless

Log Files

From the last test, on the custom baremetal server (see additional context below):
cheesebox - 2024.5.22.1274 - 2024-05-23 19_44_21.log

Additional Context

Tested with three setups:

  • Linode 16G server - Docker
  • Custom 32G server - Docker
  • Custom 32G server - Baremetal

In all three cases:

  • Headless is downloaded from scratch

In the case of Linode:

  • A fresh server is created every time

Mono version for baremetal:

Mono JIT compiler version 6.8.0.105 (Debian 6.8.0.105+dfsg-3.3~deb11u1 Sat Feb 18 21:26:37 UTC 2023)
Copyright (C) 2002-2014 Novell, Inc, Xamarin Inc and Contributors. www.mono-project.com
        TLS:           __thread
        SIGSEGV:       altstack
        Notifications: epoll
        Architecture:  amd64
        Disabled:      none
        Misc:          softdebug
        Interpreter:   yes
        LLVM:          supported, not enabled.
        Suspend:       hybrid
        GC:            sgen (concurrent by default)

Mono version for Docker: mono:latest on DockerHub
Docker image for the headless is Shadowpanther's.

Linux version:

Distributor ID: Debian
Description:    Debian GNU/Linux 11 (bullseye)
Release:        11
Codename:       bullseye

Worlds:

  • Club Crystalline - Private (basically a fancy box)
  • Blast Arena - resrec:///G-RetroGames/R-7d1cbf0d-0180-4392-8262-9bc9c20a9e06
  • AGRA JP Yellow Line - resrec:///G-United-Space-Force-N/R-96728ad3-a07b-4ab6-8cf5-f4086045bd4d

The custom server has a 10/10 Gbps connection; the Linode has 40/6 Gbps.

Sample headless config (sensitive info removed): https://g.j4.lc/-/snippets/10

This issue has been happening for the past three weeks, in worlds both with and without a culling system.

Reporters

U-j4 | j4.lc (Discord)
U-Readun | @Readun
U-Hikari-Akimori
U-Ink-25

jae1911 added the bug label on May 23, 2024
@Frooxius (Member)

This is a bit too vague. I've seen headless run bigger events without this necessarily occurring.

  • Does this happen at particular points/actions? E.g. users joining/leaving?
  • Does this happen when hosting a particular world? Or are you able to reproduce this even on empty grid?
  • Do you have any performance metrics? CPU usage, RAM usage, server FPS?

Frooxius added the needs more information label on May 23, 2024
@shiftyscales's comment was marked as off-topic.

shiftyscales removed their assignment on May 23, 2024
jae1911 (Author) commented May 23, 2024

This is also a duplicate of #1477.

@shiftyscales
It is not; this is a different issue, and the join lag is not in question.

Does this happen at particular points/actions? E.g. users joining/leaving?

@Frooxius
This happens when the session reaches around 6+ players after around 10-15 minutes. It seems to be accelerated by spawning an audio stream.

Does this happen when hosting a particular world? Or are you able to reproduce this even on empty grid?

For now, this has been reproduced on at least three worlds (see additional details); I will need to conduct a test on an empty grid.

Do you have any performance metrics? CPU usage, RAM usage, server FPS?

I did not record server FPS but here are some metrics I pulled from Grafana (all times in UTC):

Baremetal (third test):

CPU: (screenshot)

RAM: (screenshot)

Docker (second test):

CPU (large pink line): (screenshot)

RAM (large purple line): (screenshot)


To reiterate the issue: after running for a while with enough users, the session starts having issues.
Issues include:

  • Players being frozen in place indefinitely (this is not a temporary thing)
  • Voices breaking down (while still being able to somewhat hear other players)
  • Audio streams either stopping completely or cutting every few seconds
  • When a user joins while the headless is in this state, all other users will appear at spawn (see screenshots in original report)

@Frooxius (Member)

@jae1911 When this happens, can you spawn the users debug string and see if the Delta messages & sync ticks are incrementing?

It sounds to me like the data model or streams somehow stop updating properly. Or something gets clogged up somewhere. We need to isolate what and where.

I don't think this is really a performance/load thing though, given that everything just freezes. If it were performance, then it should still work, just be very laggy and sluggish, but a complete freeze suggests something just straight up breaks.

jae1911 (Author) commented May 23, 2024

When this happens, can you spawn the users debug string and see if the Delta messages & sync ticks are incrementing?

Sure, I will organise two tests then:

  • Regular world
  • Gridspace

Using a baremetal install.

jae1911 (Author) commented May 24, 2024

So we ran a test using @Readun's headless running Club Crystalline on Windows 10 Pro in a Proxmox VM, with only the headless running.
The same issue happened after around an hour and a half of running; at peak the session reached 22 people.

I recorded this video near the end, showcasing the issue well: https://youtu.be/ZVBfU90qhYo

Logs from the headless: READUNSERVER - 2024.5.24.202 - 2024-05-24 18_12_55.log

Task Manager while the issue was happening: (screenshot)

Readun commented May 24, 2024

To add a bit more information: the usage suddenly spikes intensely.
The server is capped at 45 FPS, and at 21 people there were only 3 threads at ~80%, with all other threads idling at 3-5%.
The headless ran at 36-43 FPS during that.

Then the sudden change to a full lockup -> 24 threads at what is posted above.

The day before we had it as well, with 12-14 people: all 24 threads @ 60% with the headless at 10-20 FPS. It suddenly freed itself again after 2 hours with the same user count.

Edit: My guess is that the server locked up because we had 22 people this time; yesterday with 12-16 people it "survived" at 10-20 FPS but was horrible. Voice and QM were still fine!

@shiftyscales (Collaborator)

Given how close that is to 50% on a processor with SMT, I'm curious if Resonite is just being CPU-bottlenecked on the headless. Do you have any stats on how the usage went up over time, @jae1911? E.g. you mentioned it being a 'sudden change' - can you pin the sudden change to any particular thing that occurred in the world? E.g. a user spawning an item, switching to a particular avatar, another user joining - or some other potential source that could account for a sudden jump in CPU use?

Have you tested if the same ended up occurring on the gridspace?

Frooxius changed the title from "Headless large hitches" to "Users on headless completely freeze after some time" on May 24, 2024
@shiftyscales (Collaborator)

From the issue you linked, I see you mention:

We may just found out that either the modloader or Outflow are bugging heavily out with the recents version, causing massive CPU usage on the Headless.

Does this issue also occur with a vanilla headless installation?

Readun commented May 24, 2024

@shiftyscales I was eyeing the usage the entire time.
It was a sudden change, with 3 threads @ 80% going to 24 threads @ 60% and FPS plummeting heavily.

We tried switching avatars and it was extremely irregular... this will be a hard one to debug, I feel like x..x
It was avatar-independent; equipping an avatar could trigger -and- fix the issues.
(Someone changed to a heavier avatar and it suddenly went back down.)

This time, the log posted by @jae1911 is from a vanilla headless on Windows. (Completely removed everything from the modloader and mods.)
Outflow seems to have only made it more prone to run into this issue.

PS: I think Smoin mentioned that Blood on the Clocktower has been having similar issues lately too.

@Frooxius (Member)

Thanks for the new data and the video. I've renamed the issue, since this is a complete freeze, rather than a hitch (hitch to me means that it freezes for a second or a few and then resumes).

With how this behaves, I'm very confident this is not a performance issue at all, rather something breaks/explodes.

Were you not able to spawn and capture the user debug string with the sync tick counter that I requested above? That would've been very helpful.

For the stream in the video, were you the source of the stream? Or was that coming from another user and still functional?

From the log, it seems like the session network threads have possibly exploded and stopped processing data. There's an exception when shutting down, which sounds like something else exploded earlier, but wasn't caught in the log. We might need to add some additional logging.

jae1911 (Author) commented May 24, 2024

can you pin the sudden change to any particular thing that occurred in the world? E.g. a user spawning an item, switching to a particular avatar, another user joining- or some other potential source that could account for a sudden jump in CPU use?

@shiftyscales
Sadly, nothing stands out as users were joining and leaving for the whole time.

For the stream in the video, were you the source of the stream? Or was that coming from another user and still functional?

@Frooxius
It was @Readun's.

Were you not able to spawn and capture the user debug string with the sync tick counter that I requested above? That would've been very helpful.

Sadly, it seems somebody deleted it for some reason, and when I tried spawning it in the video, it wouldn't update :/

Readun commented May 24, 2024

Were you not able to spawn and capture the user debug string with the sync tick counter that I requested above? That would've been very helpful.

I... sadly deleted it a couple of minutes before... as I thought we had pinned it on the mod from yesterday. Buuut we were "celebrating" too early :/

J4 was not the source of the stream. It was me, but I'm on LAN with the headless itself.

@Frooxius (Member)

Let me ask it this way so it's less confusing on my end: Is the audio stream shown in the video being streamed by the user who recorded that video? Or is it streamed by another user in the session who is currently frozen (in the video)?

jae1911 (Author) commented May 24, 2024

The video was recorded by me (j4) while the audio was streamed by @Readun.

BlueCyro commented May 24, 2024

I see mono 6.8 as one of the only concrete mono versions you're listing. Mono is currently up to 6.12.0.2xx and greater at this point. Can you 100% confirm that you're using at least up to that version on all configs by running the version command?

I'd highly recommend updating mono for all of these configurations if it's anything less, just to be sure.
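
For reference, the version check being asked for here is presumably the standard mono --version command (the Mono output quoted earlier in this report is in exactly that format). For the Docker setups it can be run inside the container; the container name below is only a placeholder:

mono --version
docker exec <headless-container> mono --version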

jae1911 (Author) commented May 24, 2024

Can you 100% confirm that you're using at least up to that version on all configs by running the version command?

I can definitely confirm mono is on its latest available version with the Docker tests since the image is directly pulled from DockerHub.
The image was also rebuilt between the original test 1 & 2 to remove mods, so it was definitely 100% up-to-date for the Docker test.

jae1911 (Author) commented May 27, 2024

Small update: I was on a headless hosted by @decoybird and Avantis, and the locking issue happened.
I managed to get some screenshots of the tick sync panel:

(screenshot)

I will still conduct another test during this week with an empty gridspace to see if the issue persists there, and I will also keep this panel around in the sessions this week on Wed and Thu to get more information.

Readun commented May 30, 2024

We might have a clue about what is causing it now after yesterday's event, and it might actually be something on someone's avatar.
We will investigate further with them, because that extreme change in CPU load should not happen (2 threads at 50% -> all 24 threads).
Tonight is another event, and we are gradually going to take their avatar apart every time it happens again, to pinpoint it.

Readun commented May 30, 2024

Update: Okay... never mind, it is not avatar-specific.

God, this is such a wonky thing to pin down DX
We were watching the problem continue with them having already switched to a light avatar, and it went back down when someone left.
He re-equipped the heavy avatar -> nothing happened,
until Canadian Git joined and all cores went back up.

My current observation: so far it can be triggered or "fixed" when people join/leave or change their avatars in a session.

@shiftyscales (Collaborator)

Could you try isolating that particular case, @Readun?

E.g. have the user equip the heavy avatar, have a secondary user join and see if that causes it?

I'm not sure why you ruled out it being avatar specific if the same avatar was in use for each of the occasions it happened.

Readun commented May 30, 2024

Because it also went down while they were still wearing it.
Or we had instances where they weren't even there and it reappeared today.

I'm not sure how to pinpoint it down further :/

It is really heavy for the headless, and we have experienced full lockups when the headless only has a more reasonable 6-thread CPU.
We don't need so many threads if that "bug" doesn't appear.

We haven't tried to rule out whether it's only the headless client or the normal client too, btw.

Is there anything Froox can do to be able to log the issue?

From the log, it seems like the session network threads have possibly exploded and stopped processing data. There's an exception when shutting down, which sounds like something else exploded earlier, but wasn't caught in the log. We might need to add some additional logging.

Frooxius (Member) commented Jun 2, 2024

Like I mentioned above, I don't think this really has anything to do with the actual performance, but rather something that completely breaks part of the networking code - like throwing an exception inside of it.

Typically performance problems will still keep the system chugging forward - just slowly and jittery. They will not result in a complete halt that never recovers.

I need to add more instrumentation to gather more data. We had this happen during BoTC yesterday as well, even when people were joining in simple avatars.

Readun commented Jun 14, 2024

New rough info we were able to observe:
The assumption that it is a specific avatar from a friend solidifies again. Specifically in combination with the asset cleanup!

  • He spawns in his avatar
    -> No issue, until the asset cleanup occurs! (Saw hidden avatar fluff flash pink for a frame.)
  • High CPU utilization, even when he deletes his avatar.
  • CPU goes down immediately when triggering another asset cleanup.

I'm a bit confused about what it could be, as we were already deleting a lot of his stuff from the avatar before.
Now that the asset cleanup triggers it, we'll need to take a look at his meshes and materials.

It is -not- the same as in #2213, as it goes back to normal and no restart is required.

Readun commented Jun 24, 2024

Okay, we went through further testing, but I have no idea how we can pinpoint it further...

  • It is related to an asset on his Body slot.
  • When the Body is deleted, the load stays until the next asset cleanup.
  • If the specific avatar asset folder is persistent and the avatar is deleted, the load stays until said avatar asset folder is deleted & the cleanup has done its work.
  • Deleting the Body node and undoing it fixes the issue, but breaks a couple of links from the visemes.
  • Moving every asset of the Body slot over onto the avatar does not help; it still creates that CPU load behavior when the cleanup occurs.
  • Recreated every material under the affected Body slot, to be sure it's not a problem from the past. (We did not reupload the textures though.)

I'm running out of ideas of what we can test further :/

@Frooxius (Member)

I'm adding some extra diagnostics in the next build. I'm starting to build a suspicion this might be a deadlock in one of the threads.

Which is a bit unfortunate, but it could be an excuse needed to rewrite this system (which would also solve a number of other issues).
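
For context on what a deadlock would mean here: two (or more) threads each hold a lock the other one needs, so neither ever makes progress. A minimal, generic C# sketch of the pattern (purely illustrative - this is not Resonite's actual sync code, and all names are made up):

using System.Threading;

class DeadlockSketch
{
    static readonly object LockA = new object();
    static readonly object LockB = new object();

    static void Main()
    {
        // Thread 1 takes LockA, then wants LockB.
        var t1 = new Thread(() =>
        {
            lock (LockA)
            {
                Thread.Sleep(100);   // give the other thread time to grab LockB
                lock (LockB) { }     // blocks forever once t2 holds LockB
            }
        });

        // Thread 2 takes the locks in the opposite order, so both wait forever.
        var t2 = new Thread(() =>
        {
            lock (LockB)
            {
                Thread.Sleep(100);
                lock (LockA) { }
            }
        });

        t1.Start();
        t2.Start();
        t1.Join();                   // never returns: both threads are stuck
    }
}

A classic symptom is that the stuck threads stop consuming CPU entirely, which is one way to tell a true deadlock apart from a busy loop or plain overload.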

@Frooxius (Member)

Okay the diagnostic command is now in 2024.6.24.1317!

If the session freezes your headless, type this command and save the output. Then type it again a few seconds later and save the output again. And post it here.

@shiftyscales (Collaborator)

Since it didn't seem to be mentioned in the issue itself, the command is debugWorldState @jae1911 @Readun.

Added debugWorldState command to headless, which prints out debugging info for state of all worlds (this is to help diagnose issue reported by @ Jae “awa” J4, @ Readun, @ Hikari Akimori, @ Ink_25, issue #2079)
-- If the session freezes your headless, type this command and save the output. Then type it again a few seconds later and save the output again.
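
In practice the capture looks like this on the headless console - run the command once while the freeze is happening, wait a few seconds, then run it again, and save both outputs (the world prompt below is just an example taken from later in this thread):

DN0 World 0>debugWorldState
(save the full output)
DN0 World 0>debugWorldState
(save the full output again and post both)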

Readun commented Jun 25, 2024

Mmnnhhh... @Frooxius, this is slightly tricky.
Does this work when the headless also just gets that high load?
The headless freezes with this specific bug only when the headless CPU isn't powerful enough.

Does it still output the desired diagnostic, or is this only specific to a freeze?

Gonna try it in the evening.

@Frooxius (Member)

I don't know. I just need you to run the command when this issue happens and see what it outputs (or if anything else happens).

Readun commented Jun 25, 2024

Will do!

Banane9 commented Jun 25, 2024

I'm starting to build a suspicion this might be a deadlock in one of the threads.

That was my first suspicion too after reading

The headless freezes with this specific bug only when the headless CPU isn't powerful enough.

That would make perfect sense to explain why it only gets stuck when something else isn't done fast enough.

Readun commented Jun 25, 2024

Okay, just tested it:

With the issue: (screenshot)

After the issue: (screenshot)

@Frooxius (Member)

Thanks!

I'm a bit confused by these outputs though. Is the "With the issue" screenshot after everything already froze? Meaning are both of these after everything freezes?

Readun commented Jun 25, 2024

The first one is with that bug occurring; the second one is after Ink removed the problematic avatar & the asset cleanup took place.

Okay, another test with only 4 threads..... @jae1911, we need to test this with a Linux headless. Apparently a Windows headless does not "freeze" fully.

(screenshot)

READUNSERVER - 2024.6.24.1317 - 2024-06-25 19_11_57.log

I will try another quick test with just one thread to be fully sure...

@Frooxius (Member)

Wait, I'm a bit more confused - so you're able to unfreeze the session by deleting something? And then it continues as normal?

In order to diagnose this, I need two outputs of that command while everything is frozen - I need to see what's happening during everything being frozen at two distinct points in time. That's why I asked to wait a few seconds and type it again.

Otherwise this won't be very useful.

Readun commented Jun 25, 2024

  • He spawns in his avatar
    -> No issue, until the asset cleanup occurs! (Saw hidden avatar fluff flash pink for a frame.)

  • High CPU utilization, even when he deletes his avatar.

  • CPU goes down immediately when triggering another asset cleanup.

Yes, described here. All of that from me is with a Windows headless. @jae1911 is booting up his Linux headless right now to do further testing, and I'm restricting mine to 1 thread.

I need two outputs of that command while everything is frozen

In my last screenshot with 4 threads, I wrote "Starting now" into the command line when we triggered the bug.

Readun commented Jun 25, 2024

This is with 1 thread.
The headless itself already spiked to 100% on that one core when just typing debugWorldState.

This is with the bug triggered; it does not freeze on Windows.
(screenshot)

Next up is testing the Linux headless with @jae1911!

@Frooxius (Member)

I'm not fully clear on the sequence of events there. You mention high CPU utilization, but I'm not sure if that means that everything in the session is frozen. Similarly, when there's low CPU utilization, I'm not sure if that means that everything is unfrozen - you could have low CPU utilization while the session still remains frozen.

Can you explicitly specify when the session freezes and when it unfreezes, please? It's a bit ambiguous what correlates to what.


From the screenshot however, it doesn't actually seem like this is a deadlock at all - the sync threads keep processing.

If the session is able to unfreeze as well, that would also indicate that something completely different is at play. We'd probably need to isolate the item that needs to be deleted to unfreeze everything (e.g. by deleting it piece by piece).


Actually this brings another question - when a new user joins after the server is already frozen, is everything frozen for them too?

jae1911 (Author) commented Jun 25, 2024

Actually this brings another question - when a new user joins after the server is already frozen, is everything frozen for them too?

In my original testing, this is what happened: everybody was frozen at spawn (see the screenshot in #2079 (comment)).

Readun commented Jun 25, 2024

I have the feeling this is a mixture of having this specific bug + having a lot of players in the session to overwhelm a specific process.
We are now using J4's Linux headless, with which we had the freeze issue. -> It has similar symptoms of high CPU utilization, but it doesn't freeze.
We are going to try to invite a couple of people quickly!
AAaaaand it happened, J4 will report!

Can you explicitly specify when the session freezes and when it unfreezes, please? It's a bit ambiguous what correlates to what.

As written above, it is Ink_25's Body slot and assets.
We tried to figure out which asset, but we were at it for hours x..x

vvvv

Okay, we went through further testing, but I have no idea how we can pinpoint it further...

* It is related to an asset on his Body slot.

* When the Body is deleted, the load stays until the next asset cleanup.

* If the specific avatar asset folder is persistent and the avatar is deleted, the load stays until said avatar asset folder is deleted & the cleanup has done its work.

* Deleting the Body node and undoing it fixes the issue, but breaks a couple of links from the visemes.

* Moving every asset of the Body slot over onto the avatar does not help; it still creates that CPU load behavior when the cleanup occurs.

* Recreated every material under the affected Body slot, to be sure it's not a problem from the past. (We did not reupload the textures though.)

I'm running out of ideas of what we can test further :/

PS: We just confirmed that this bug will trigger a headless freeze -IF- there are enough players as well.

jae1911 (Author) commented Jun 25, 2024

Alright, as for the Linux test.

Here is without the bug happening (before triggering anything):

DN0 World 0>debugWorldState
World: Userspace, Handle: 1
        WorldStage: RefreshBegin
        SyncTick: 1
        WorldSessionState:
        WorldSessionStopProcessing:
        WorldMessagesToProcess:
        WorldTotalProcessedMessages:
        WorldMessagesToTransmit:
        ProcessingSyncMessage:
        CurrentlyDecodingStream:
World: DN0 World 0, Handle: 2
        WorldStage: RefreshBegin
        SyncTick: 11256
        WorldSessionState: WaitingForSyncThreadEvent
        WorldSessionStopProcessing: False
        WorldMessagesToProcess: 0
        WorldTotalProcessedMessages: 24908
        WorldMessagesToTransmit: 0
        ProcessingSyncMessage:
        CurrentlyDecodingStream:
DN0 World 0>

The bug starts (all cores sit at 80-98% the whole time):

World: Userspace, Handle: 1
        WorldStage: RefreshBegin
        SyncTick: 1
        WorldSessionState:
        WorldSessionStopProcessing:
        WorldMessagesToProcess:
        WorldTotalProcessedMessages:
        WorldMessagesToTransmit:
        ProcessingSyncMessage:
        CurrentlyDecodingStream:
World: 2079 Linux Test, Handle: 2
        WorldStage: PhysicsMoved
        SyncTick: 78656
        WorldSessionState: ProcessingStopped
        WorldSessionStopProcessing: True
        WorldMessagesToProcess: 2
        WorldTotalProcessedMessages: 361268
        WorldMessagesToTransmit: 0
        ProcessingSyncMessage:
        CurrentlyDecodingStream:

Everybody is frozen:

World: Userspace, Handle: 1
        WorldStage: RefreshBegin
        SyncTick: 1
        WorldSessionState:
        WorldSessionStopProcessing:
        WorldMessagesToProcess:
        WorldTotalProcessedMessages:
        WorldMessagesToTransmit:
        ProcessingSyncMessage:
        CurrentlyDecodingStream:
World: 2079 Linux Test - Bug testing (WE NEED PLAYERS), Handle: 2
        WorldStage: ProtoFluxContinuousChanges
        SyncTick: 86297
        WorldSessionState: ProcessingStopped
        WorldSessionStopProcessing: True
        WorldMessagesToProcess: 2
        WorldTotalProcessedMessages: 484363
        WorldMessagesToTransmit: 7246
        ProcessingSyncMessage:
        CurrentlyDecodingStream:

World: Userspace, Handle: 1
        WorldStage: RefreshBegin
        SyncTick: 1
        WorldSessionState:
        WorldSessionStopProcessing:
        WorldMessagesToProcess:
        WorldTotalProcessedMessages:
        WorldMessagesToTransmit:
        ProcessingSyncMessage:
        CurrentlyDecodingStream:
World: 2079 Linux Test - Bug testing (WE NEED PLAYERS), Handle: 2
        WorldStage: ProtoFluxContinuousChanges
        SyncTick: 86452
        WorldSessionState: ProcessingStopped
        WorldSessionStopProcessing: True
        WorldMessagesToProcess: 18
        WorldTotalProcessedMessages: 492286
        WorldMessagesToTransmit: 10852
        ProcessingSyncMessage:
        CurrentlyDecodingStream:

World: Userspace, Handle: 1
        WorldStage: RefreshBegin
        SyncTick: 1
        WorldSessionState:
        WorldSessionStopProcessing:
        WorldMessagesToProcess:
        WorldTotalProcessedMessages:
        WorldMessagesToTransmit:
        ProcessingSyncMessage:
        CurrentlyDecodingStream:
World: 2079 Linux Test - Bug testing (WE NEED PLAYERS), Handle: 2
        WorldStage: ProtoFluxContinuousChanges
        SyncTick: 86554
        WorldSessionState: ProcessingStopped
        WorldSessionStopProcessing: True
        WorldMessagesToProcess: 11
        WorldTotalProcessedMessages: 495155
        WorldMessagesToTransmit: 12303
        ProcessingSyncMessage:
        CurrentlyDecodingStream:

World: Userspace, Handle: 1
        WorldStage: RefreshBegin
        SyncTick: 1
        WorldSessionState:
        WorldSessionStopProcessing:
        WorldMessagesToProcess:
        WorldTotalProcessedMessages:
        WorldMessagesToTransmit:
        ProcessingSyncMessage:
        CurrentlyDecodingStream:
World: 2079 Linux Test - Bug testing (WE NEED PLAYERS), Handle: 2
        WorldStage: ProtoFluxUpdates
        SyncTick: 86651
        WorldSessionState: ProcessingStopped
        WorldSessionStopProcessing: True
        WorldMessagesToProcess: 15
        WorldTotalProcessedMessages: 498137
        WorldMessagesToTransmit: 12590
        ProcessingSyncMessage:
        CurrentlyDecodingStream:

World: Userspace, Handle: 1
        WorldStage: RefreshBegin
        SyncTick: 1
        WorldSessionState:
        WorldSessionStopProcessing:
        WorldMessagesToProcess:
        WorldTotalProcessedMessages:
        WorldMessagesToTransmit:
        ProcessingSyncMessage:
        CurrentlyDecodingStream:
World: 2079 Linux Test - Bug testing (WE NEED PLAYERS), Handle: 2
        WorldStage: ProtoFluxUpdates
        SyncTick: 86758
        WorldSessionState: ProcessingStopped
        WorldSessionStopProcessing: True
        WorldMessagesToProcess: 15
        WorldTotalProcessedMessages: 501819
        WorldMessagesToTransmit: 12971
        ProcessingSyncMessage:
        CurrentlyDecodingStream:

Log files:
cheesebox - 2024.6.24.1317 - 2024-06-25 17_28_09.log


Some notes: the freeze happened after somebody joined after the bug started happening.

@Frooxius (Member)

I think we need to separate here what we're actually trying to narrow. High CPU usage itself isn't necessarily an issue - it can happen, but if things don't freeze, it might be lots of other things - we should ignore those unless the freeze also happens, because they might have a different cause and that can make things more confusing.

When things do freeze though, that indicates something else than just high CPU usage - something gets heavily broken or overwhelmed and that's what we need to focus on.


Given the latest info, it looks like it's not a deadlock however, which means I'll have to add some more diagnostics - you don't need to send any more of the debug command outputs, these ones are sufficient, thank you!

Would you be able to write a concise bullet-point list of the specific steps of what exactly happens, in what order - including when things freeze and when they unfreeze - so I have a clear timeline of this?

I'm a bit scrambled on the exact details, since it's through a bunch of messages and descriptions.

shiftyscales added the triaged label on Jun 25, 2024
Readun commented Jun 25, 2024

Interestingly, this time my client got affected after that freeze and crash. I'm in my local world and the client seems to stutter heavily between 200 FPS and a stopped frame. But we should ignore it for now.

High CPU usage itself isn't necessarily an issue

It is 100% an issue, in this case in combination with Ink's avatar assets & the asset cleanup event.

  • I can deactivate the asset cleanup and we never run into that issue with his avatar.
  • I can instantly trigger a high CPU load when I trigger the cleanup once.
  • It stays, even when deleting his avatar.
  • Triggering the asset cleanup again after deletion brings it back down to normal levels.

This is definitely a combination of multiple issues.
As long as we don't have many users, the headlesses are surviving.

If the issue happens with a lot of players, then it deadlocks something.

PS: Oh woup, I was writing my response to the first part of your response. Doing more testing now!

@Frooxius (Member)

It is 100% an issue, in this case in combination with Ink's avatar assets & the asset cleanup event.

It can be a symptom of an issue. The problem is, it can be a symptom of lots of different issues, or none at all, so we need to be careful we're not muddling the data - e.g. the new diagnostic command assumed a deadlock and that everything freezes completely.

The cases where it does deadlock give the most useful data to get at the "core" of this.


I can instantly trigger a high CPU load when I trigger the cleanup once.

How exactly are you triggering it?


  • Could you actually record the Background Jobs/Tasks (I forget the exact name) tab in the debug dialog while this happens? This might have some indication of things.
  • If you make changes to the world (e.g. say change material on the environment) while things are frozen, do these show up to other users?

jae1911 (Author) commented Jun 25, 2024

Alright, a single person joining does not cause the freeze.

During the test, the headless crashed while I was typing the debugWorldState command, but I'm unsure if it is related to this issue:

2079 Linux testing (NEED USERS)>worldState[ERROR] FATAL UNHANDLED EXCEPTION: System.ArgumentOutOfRangeException: Value must be positive and below the buffer height.
Parameter name: top
  at System.TermInfoDriver.SetCursorPosition (System.Int32 left, System.Int32 top) [0x0004d] in <12b418a7818c4ca0893feeaaf67f1e7f>:0
  at System.TermInfoDriver.WriteSpecialKey (System.ConsoleKeyInfo key) [0x0006e] in <12b418a7818c4ca0893feeaaf67f1e7f>:0
  at System.TermInfoDriver.Echo (System.ConsoleKeyInfo key) [0x00020] in <12b418a7818c4ca0893feeaaf67f1e7f>:0
  at System.TermInfoDriver.ReadUntilConditionInternal (System.Boolean haltOnNewLine) [0x000de] in <12b418a7818c4ca0893feeaaf67f1e7f>:0
  at System.TermInfoDriver.ReadLine () [0x00000] in <12b418a7818c4ca0893feeaaf67f1e7f>:0
  at System.ConsoleDriver.ReadLine () [0x00000] in <12b418a7818c4ca0893feeaaf67f1e7f>:0
  at System.Console.ReadLine () [0x00019] in <12b418a7818c4ca0893feeaaf67f1e7f>:0
  at FrooxEngine.Headless.CommandHandler+<>c.<ProcessCommands>b__13_1 () [0x00000] in <43eb7cd72ff54254b33cdcda802c02e3>:0
  at System.Threading.Tasks.Task`1[TResult].InnerInvoke () [0x00012] in <12b418a7818c4ca0893feeaaf67f1e7f>:0
  at System.Threading.Tasks.Task.Execute () [0x00000] in <12b418a7818c4ca0893feeaaf67f1e7f>:0
--- End of stack trace from previous location where exception was thrown ---

  at FrooxEngine.Headless.CommandHandler.ProcessCommands () [0x0012e] in <43eb7cd72ff54254b33cdcda802c02e3>:0
  at FrooxEngine.Headless.Program.Main (System.String[] args) [0x008bc] in <43eb7cd72ff54254b33cdcda802c02e3>:0
  at FrooxEngine.Headless.Program.<Main> (System.String[] args) [0x0000c] in <43eb7cd72ff54254b33cdcda802c02e3>:0

Full logs:
cheesebox - 2024.6.24.1317 - 2024-06-25 18_24_36.log

We're gonna proceed to more tests to see about the other causes.
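
As a side note on that crash: the stack trace points at Mono's console driver rather than the world itself - Console.SetCursorPosition throws exactly this ArgumentOutOfRangeException when asked to move the cursor to a row at or past the buffer height, and TermInfoDriver calls it internally while echoing console input. A minimal sketch that raises the same exception class in isolation (run in a real terminal; the class name is illustrative):

using System;

class CursorSketch
{
    static void Main()
    {
        try
        {
            // Requesting a row past the console buffer reproduces the
            // "Value must be positive and below the buffer height." error
            // seen in the headless log (parameter name: top).
            Console.SetCursorPosition(0, Console.BufferHeight + 10);
        }
        catch (ArgumentOutOfRangeException ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}

In the headless, the same exception escapes the console-reading task uncaught, which is why it is logged as a fatal unhandled exception; it may well be unrelated to the world freeze itself.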

Readun commented Jun 25, 2024

How exactly are you triggering it?

Only to cause the CPU load:

  1. Have @Ink_24's specific avatar, or simply his Body slot, in the world. (Nothing happens yet.)
  2. Start a cleanup of unused assets. (CPU gets a big load.)

As long as the headless hasn't frozen into the deadlock (when we have more players in a session), we can revert the CPU load issue by:

  • Deleting Ink_24's avatar / Body slot & running a cleanup of unused assets again

@Frooxius (Member)

What I mean is, how exactly are you triggering the asset cleanup?

Readun commented Jun 25, 2024

(screenshot)
Using it in the Session tab. I disable it first and turn it down to 5 seconds. Let him spawn the avatar, turn it on, and wait the 5 seconds (or 300), then turn it off again after.
