This repository has been archived by the owner on May 8, 2024. It is now read-only.

anti-catastrophic-lag-watchdog #12

Open
Dinhero21 opened this issue Oct 28, 2023 · 2 comments

@Dinhero21
Owner

The idea is to have a program running in a separate process or thread (via node:child_process or node:worker_threads, respectively) that monitors the main process.

The main process will send the watchdog a periodic heartbeat to signal that it is still responsive (something like an alive message every second).

If the server does not send any data for a long time (probably a minute), the watchdog will initiate the data loss mitigation protocol.
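The timeout check described above can be sketched independently of how the messages are transported. This is a minimal sketch, not the project's implementation: the HeartbeatMonitor class, its method names, and the injectable clock are all hypothetical and exist only for illustration.

```javascript
// Hypothetical helper (not from the original issue): remembers when the last
// heartbeat arrived and reports whether the main process looks stalled.
class HeartbeatMonitor {
  // timeoutMs: how long to wait without a heartbeat before declaring a stall.
  // now: clock function, injectable so the logic can be tested without waiting.
  constructor(timeoutMs, now = Date.now) {
    this.timeoutMs = timeoutMs;
    this.now = now;
    this.lastBeat = now();
  }

  // Call whenever an "alive" message arrives from the main process.
  beat() {
    this.lastBeat = this.now();
  }

  // True once no heartbeat has arrived for longer than timeoutMs.
  isStalled() {
    return this.now() - this.lastBeat > this.timeoutMs;
  }
}
```

In a node:worker_threads watchdog, one plausible wiring would be to call beat() from the parentPort message handler and to poll isStalled() from a setInterval callback, starting the mitigation protocol the first time it returns true.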

Data Loss Mitigation Protocol

The watchdog could simply kill the main process, but that would destroy progress and roll the server back.

My idea is for the watchdog to have a node:inspector instance inspecting the main process.

Upon noticing the catastrophic lag, the watchdog would send a signal to the server telling it to disconnect all clients, save the world, and shut down (to mitigate data loss). The server cannot act on that signal, however, because it is still stuck in a loop.

I have many ideas on how to get out of the loop programmatically, some dead simple and some overly complex. I will document some of them here:

  • break (probably at 1-second intervals). This should hopefully get out of while (true) {} loops, but might not be able to get out of more complicated code.
  • Give each if, while, or for a chance of having its condition negated. The chance would start at 0% and go up very slowly, to avoid catastrophic data corruption and fatal errors.
  • Keep a "stuck" counter: count each time a line is visited, and if some line has more than N visits (probably 1024 or some similarly large number), apply one (or several) of the methods above. Reset the counter when a new line gets visited. This might be useful to avoid false positives.
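The "stuck" counter from the last bullet could look roughly like the sketch below. The StuckCounter class and the line identifiers are hypothetical, and how line visits would be collected (for example by stepping through the inspector) is deliberately left out. One assumption worth calling out: "a new line" is taken to mean a line that has never been visited before, so a loop that only revisits old lines keeps accumulating counts.

```javascript
// Hypothetical sketch of the "stuck" counter idea (names are illustrative).
class StuckCounter {
  // threshold: the N from the issue; 1024 is the value suggested there.
  constructor(threshold = 1024) {
    this.threshold = threshold;
    this.seen = new Set();   // every line visited so far, never cleared
    this.counts = new Map(); // visits per line since the last reset
  }

  // Record a visit to `line` (any stable identifier, e.g. "world.js:42").
  // Returns true when that line has been visited more than `threshold` times
  // since execution last reached a line it had never visited before.
  visit(line) {
    if (!this.seen.has(line)) {
      // A never-before-seen line counts as progress: reset all counts.
      this.seen.add(line);
      this.counts.clear();
    }
    const count = (this.counts.get(line) ?? 0) + 1;
    this.counts.set(line, count);
    return count > this.threshold;
  }
}
```

Keeping the set of seen lines separate from the counts matters: if the reset also forgot which lines had been seen, a loop alternating between two lines would reset the counter on every visit and never be detected.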

A problem with all of these anti-loop solutions is that they create invalid states (e.g. a function that was supposed to return a string returns undefined because it exited the loop prematurely), which might cause data corruption.

To mitigate this, the last known-good world should be backed up. On server start, the server should try to load the latest world and, if it is corrupted, load the backup instead.
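A minimal sketch of that load order follows. It assumes, purely for illustration, that a world is a single JSON file and that "corrupted" means reading or parsing throws. The loadWorldWithFallback function and its parameters are hypothetical.

```javascript
const fs = require('node:fs');

// Hypothetical loader: tries each candidate path in order (latest world first,
// backup after it) and returns the first one that reads and parses cleanly.
// `parse` and `read` are injectable; the JSON default is only an assumption
// about the save format.
function loadWorldWithFallback(
  paths,
  parse = JSON.parse,
  read = (path) => fs.readFileSync(path, 'utf8'),
) {
  const errors = [];
  for (const path of paths) {
    try {
      return { world: parse(read(path)), source: path };
    } catch (error) {
      // Remember why this candidate failed, then move on to the next one.
      errors.push(`${path}: ${error.message}`);
    }
  }
  throw new Error(`No loadable world found:\n${errors.join('\n')}`);
}
```

Returning the source path alongside the world lets the caller log that it started from a backup, which is useful when investigating what the watchdog interrupted.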

A better solution might be a lenient data parser that, when it encounters unexpected data, tries its best not to crash.
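One way such a parser could behave is to fall back to a default for every field that is missing or has the wrong type, rather than throwing. The sketch below assumes a flat JSON save format, which the issue does not specify. The parseWorldLeniently function and the field names in the usage note are hypothetical.

```javascript
// Hypothetical lenient reader: never throws, and replaces anything
// unexpected with the corresponding value from `defaults`.
function parseWorldLeniently(raw, defaults) {
  let data;
  try {
    data = JSON.parse(raw);
  } catch {
    // Unparseable input: fall back to the defaults entirely.
    return { ...defaults };
  }
  if (data === null || typeof data !== 'object' || Array.isArray(data)) {
    return { ...defaults };
  }
  const result = {};
  for (const [key, fallback] of Object.entries(defaults)) {
    const value = data[key];
    // Keep a value only when its shape matches the default's shape.
    const sameShape =
      typeof value === typeof fallback &&
      Array.isArray(value) === Array.isArray(fallback) &&
      (value === null) === (fallback === null);
    result[key] = sameShape ? value : fallback;
  }
  return result;
}
```

For example, with defaults of { name: 'world', seed: 0, chunks: [] }, an input whose name is a number and whose chunks field is missing keeps its valid seed and takes the defaults for the other two fields.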

@Dinhero21 Dinhero21 changed the title anti-while-loop-watchdog anti-catastrophic-lag-watchdog Oct 28, 2023
@Dinhero21
Owner Author

Runtime.terminateExecution (a Chrome DevTools Protocol method) seems like a viable way of doing idea 1.

@Dinhero21
Owner Author

After a lot of searching, I finally found this, which allows you to debug Node remotely.
