When running on a large number of cores, the current process-stealing implementation starts deadlocking schedulers and exposes a few other bugs:
A process gets queued up in several schedulers, which is likely a bug in the Proc_queue or Proc_set. Once it's terminated in one scheduler, the next scheduler that tries to run it will fail, because finalized processes should never be put on a queue.
When moving timers around, a timer will sometimes get triggered on a scheduler before it's moved out of it. Moving timers to the IO scheduler helps, and can improve the reliability of the timers since the polling workload has a strict deadline, but it also means reworking the timeouts for receives and syscalls.
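One way to rule out the first failure mode is to make "put this process on a queue" a single atomic state transition, so that only one scheduler can ever win it. The following is a minimal sketch under that assumption, not the current scheduler code: the `state` type, the `proc` record, and `try_enqueue` are all hypothetical names, and the `Queue.t` here stands in for a scheduler-local queue that would still need its own synchronization if other schedulers can steal from it.

```ocaml
(* Hypothetical process states; not the project's actual types. *)
type state = Runnable | Queued | Finalized

(* Hypothetical process record with an atomically updated state. *)
type proc = { pid : int; state : state Atomic.t }

let make_proc pid = { pid; state = Atomic.make Runnable }

(* Claim a runnable process for the queue [q].
   compare_and_set succeeds for exactly one caller, so a process that is
   already Queued or Finalized can never be pushed a second time. *)
let try_enqueue (q : proc Queue.t) (p : proc) : bool =
  if Atomic.compare_and_set p.state Runnable Queued then begin
    Queue.push p q;
    true
  end
  else false

(* Mark a process as finished; later try_enqueue calls will refuse it. *)
let finalize (p : proc) : unit = Atomic.set p.state Finalized
```

With this shape, a second scheduler that tries to enqueue the same process gets `false` back instead of silently holding a duplicate, which turns the bug into something that can be counted or logged.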
I've been unable to fix these with additional safeguards (such as more restrictive locking of the process queue), but I have identified that the Proc_set is not working as intended, likely because it uses Atomics instead of a lock.
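To illustrate why a lock may behave differently from Atomics here: a membership check followed by an insert is two separate steps, and two schedulers can both pass the check before either inserts. A lock makes the check and the insert one critical section. The sketch below assumes OCaml 5 (where `Mutex` and `Domain` are in the standard library); `Locked_set`, `add_if_absent`, and `contended_claims` are hypothetical names for illustration, not the existing Proc_set API.

```ocaml
(* Hypothetical lock-based set of pids; an assumption, not the real Proc_set. *)
module Locked_set = struct
  type t = { lock : Mutex.t; items : (int, unit) Hashtbl.t }

  let create () = { lock = Mutex.create (); items = Hashtbl.create 64 }

  (* Run [f] while holding the lock, releasing it even if [f] raises. *)
  let with_lock t f =
    Mutex.lock t.lock;
    Fun.protect ~finally:(fun () -> Mutex.unlock t.lock) f

  (* Check-and-insert as one critical section.
     Returns true only for the caller that actually added [pid]. *)
  let add_if_absent t pid =
    with_lock t (fun () ->
        if Hashtbl.mem t.items pid then false
        else begin
          Hashtbl.add t.items pid ();
          true
        end)

  let remove t pid = with_lock t (fun () -> Hashtbl.remove t.items pid)
  let size t = with_lock t (fun () -> Hashtbl.length t.items)
end

(* Several domains race to claim the same pids; returns the total number of
   successful claims, which should equal [pids] if each pid is claimed once. *)
let contended_claims ~domains ~pids =
  let set = Locked_set.create () in
  let claim () =
    let wins = ref 0 in
    for pid = 0 to pids - 1 do
      if Locked_set.add_if_absent set pid then incr wins
    done;
    !wins
  in
  let workers = List.init domains (fun _ -> Domain.spawn claim) in
  List.fold_left (fun acc d -> acc + Domain.join d) 0 workers
```

A stress check like `contended_claims` is also a cheap regression test for the "queued in several schedulers" symptom: any total above the number of pids means a process was claimed twice.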
In the meantime, process stealing is disabled on main until we figure out next steps here.
This is a good time to step back and maybe rewrite the scheduler into more modular pieces that are easier to reason about and test.