Background context (long text... 😱 )

OWGE, being a competitive and cooperative game, has scenarios where many players fight on the same planet at the same time, sometimes coordinating multiple tasks to run in the same second, which makes the OWGE backend a real concurrency challenge.
Running multiple missions at the same time on the same planet means that several threads may be manipulating the same rows concurrently. This can cause a lot of problems that are hard to fix and debug: deadlocks, inconsistent data, high computation cost due to random retries after deadlocks, and so on.
My idea to avoid this problem is to use named mutex locks in the database (to allow horizontal scaling of game instances). In MySQL this is done with GET_LOCK('lock_name', timeout) and, later, RELEASE_LOCK('lock_name'). This approach has its own hard-to-debug problems: for example, the Hikari DB pool won't release locks when returning a connection to the pool, and MySQL only clears the locks for you when the MySQL connection ends or the timeout specified in GET_LOCK is reached. With the Hikari pool, connections don't end; they stay open to avoid the latency and overhead of opening new connections, so one has to be SURE that RELEASE_LOCK is invoked ALWAYS... Placing a low value for the GET_LOCK() timeout is not a good solution either, because if the server is under high load, that timeout can be triggered in unwanted situations.

🗒️ Notice: This locking strategy is implemented in class MysqlLockUtilService and wrapped for planet purposes in PlanetLockUtilService
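To illustrate the required discipline, here is a minimal sketch of that pattern (not the actual MysqlLockUtilService code; the class and method names are illustrative): acquisition and release must happen on the same pooled connection, and RELEASE_LOCK must sit in a finally block so it runs even when the action throws.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

// Minimal sketch of database-backed named locking; illustrative only.
public class NamedLockSketch {
    private final DataSource dataSource;

    public NamedLockSketch(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public void runWithLock(String lockName, int timeoutSeconds, Runnable action) {
        // One connection for the whole block: GET_LOCK is session-scoped,
        // so acquire and release MUST happen on the same connection.
        try (Connection connection = dataSource.getConnection()) {
            acquire(connection, lockName, timeoutSeconds);
            try {
                action.run();
            } finally {
                // With a pool like Hikari the connection is reused, not closed,
                // so the lock would leak forever if RELEASE_LOCK were skipped here.
                release(connection, lockName);
            }
        } catch (SQLException e) {
            throw new IllegalStateException("Named lock handling failed", e);
        }
    }

    private void acquire(Connection connection, String lockName, int timeoutSeconds) throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement("SELECT GET_LOCK(?, ?)")) {
            ps.setString(1, lockName);
            ps.setInt(2, timeoutSeconds);
            try (ResultSet rs = ps.executeQuery()) {
                // GET_LOCK returns 1 on success, 0 on timeout, NULL on error
                if (!rs.next() || rs.getInt(1) != 1) {
                    throw new SQLException("Could not acquire lock " + lockName);
                }
            }
        }
    }

    private void release(Connection connection, String lockName) throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement("SELECT RELEASE_LOCK(?)")) {
            ps.setString(1, lockName);
            try (ResultSet rs = ps.executeQuery()) {
                // Result ignored: 1 = released, 0 = not owned, NULL = unknown lock
            }
        }
    }
}
```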
This 🐛 is famous in the community

Everyone in the community calls this bug the "random bug", because from the player's perspective, missions fail for random reasons.
The old problem
The initial MySQL locking strategy was bugged: we were getting "random" deadlocks when we should not, because:
Thread A wants to run a mission on planets 4, 6 and 1, while thread B wants to run one on planets 6, 4 and 1. This deadlocks because thread A locks 4 and waits for 6, which is locked by B, while thread B waits for 4, because it's locked by thread A.
The old fix
The random bug was fixed by sorting the planets before locking, so there's a guarantee that the planets are always locked in the same order and no deadlock can happen. Threads A and B would both try to acquire locks in the order 1, 4, 6; as the first thread gets lock 1, thread B waits for 1 to be released and never locks 4 or 6, because it's still waiting for 1. That fix worked for a long time.
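A minimal sketch of that sorted acquisition (the real logic lives in PlanetLockUtilService; acquireLock/releaseLock here are hypothetical wrappers over GET_LOCK/RELEASE_LOCK as sketched earlier):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the old fix: always acquire planet locks in ascending id order,
// so two threads can never hold each other's locks in a cycle.
public class SortedPlanetLockSketch {
    public static void lockPlanets(List<Long> planetIds, Runnable action) {
        List<Long> sorted = planetIds.stream().distinct().sorted().toList();
        List<Long> acquired = new ArrayList<>();
        try {
            for (Long id : sorted) {
                acquireLock("PLANET_" + id); // blocks until the lock is free
                acquired.add(id);
            }
            action.run();
        } finally {
            // Always release exactly what was acquired, even on failure
            for (int i = acquired.size() - 1; i >= 0; i--) {
                releaseLock("PLANET_" + acquired.get(i));
            }
        }
    }

    // Hypothetical wrappers over GET_LOCK/RELEASE_LOCK (see the earlier sketch)
    private static void acquireLock(String name) { /* SELECT GET_LOCK(name, timeout) */ }
    private static void releaseLock(String name) { /* SELECT RELEASE_LOCK(name) */ }
}
```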
Why did that fix stop working?
A fix to another bug was added in commit 7351524, because TransactionUtilService.doAfterCommit() was not being triggered under strange circumstances... That bug, which is another off-topic long story, was fixed, and then came the new problem: sometimes the action triggered by doAfterCommit() also needs to lock planets, and this can break the lock-in-order feature, as this table of threads and their timed actions shows (using the same planet numbers as the example in the next section):

| Time | Thread A | Thread B |
| --- | --- | --- |
| t1 | Locks planets 1, 4, 6 (sorted) | |
| t2 | | Wants planets 2 and 4: locks 2, waits for 4 |
| t3 | Inside doAfterCommit() additionally needs planets 2 and 8: waits for 2 | |
| t4 | Deadlock: A holds 4 and waits for 2, while B holds 2 and waits for 4 | |

The fix that should work
As I should not make the actions of doAfterCommit() aware of outside locking, since that would break some good practices, the MysqlLockUtilService, when prompted to lock additional planets, unlocks all planets and then locks them again, new ones included, in the same sorted order. For example, thread A would end up locking 1, 2, 4, 6, 8 when invoked from TransactionUtilService: the first time we locked 1, 4, 6 and the second time 2, 8, 1, but since we released the previous locks first, we ensured that thread B could run its task on planets 4 and 6 with no problem; B then releases its locks and thread A gets the lock for all its required planets. It's not very common for threads to be locking the same planets, so the potentially high response time in this case, which can reach 500ms, is not worth the effort of studying other solutions.
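A sketch of that release-and-relock behavior, assuming the held planets are tracked per thread (the field and method names are illustrative, not the real OWGE code; restoring the outer lock set on exit is one possible policy):

```java
import java.util.List;
import java.util.TreeSet;

// Sketch of the new fix: track which planets the current thread already holds;
// when asked to lock more, release everything and re-acquire the whole union
// in ascending order, so the global lock order is never violated.
public class ReentrantPlanetLockSketch {
    private static final ThreadLocal<TreeSet<Long>> HELD =
            ThreadLocal.withInitial(TreeSet::new);

    public void lockPlanets(List<Long> planetIds, Runnable action) {
        TreeSet<Long> held = HELD.get();
        if (held.isEmpty()) {
            // Outer call: plain sorted acquisition
            held.addAll(planetIds);
            acquireAllSorted(held);
            try {
                action.run();
            } finally {
                releaseAll(held);
                HELD.remove();
            }
        } else {
            // Nested call (e.g. from doAfterCommit()): drop every held lock,
            // then take the union again in ascending order, giving waiting
            // threads a chance to run in between.
            TreeSet<Long> previous = new TreeSet<>(held);
            releaseAll(held);
            held.addAll(planetIds);
            acquireAllSorted(held);
            try {
                action.run();
            } finally {
                // Shrink back to the outer scope's lock set
                releaseAll(held);
                held.clear();
                held.addAll(previous);
                acquireAllSorted(held);
            }
        }
    }

    // TreeSet iterates in ascending order, preserving the global lock order
    private void acquireAllSorted(TreeSet<Long> ids) { /* GET_LOCK per id */ }
    private void releaseAll(TreeSet<Long> ids) { /* RELEASE_LOCK per id */ }
}
```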
This new fix can also fail if...
As the new fix stores the currently locked planets in a ThreadLocal, if there's logic that runs on a separate thread but blocks the current thread until that logic ends, planets locked inside that logic could still cause a deadlock. Luckily, with Java 21 one can avoid using thread pools, so an InheritableThreadLocal should do the trick. There are some places in OWGE that may still be using platform pool threads; OWGE's AsyncRunnerBo was updated to virtual threads, so it's safe to use, but the behavior of Spring's @Async annotation is unknown to me for now (Spring should not be using a thread pool when configured to use virtual threads).
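A small sketch of why InheritableThreadLocal plus per-task virtual threads should work: a child thread created per task inherits the parent's held-planets set, while pooled threads are not children of the submitting thread and inherit nothing useful. The names here are illustrative.

```java
import java.util.Set;
import java.util.TreeSet;

public class InheritanceSketch {
    private static final InheritableThreadLocal<Set<Long>> HELD_PLANETS =
            new InheritableThreadLocal<>() {
                @Override
                protected Set<Long> childValue(Set<Long> parentValue) {
                    // Copy so the child cannot mutate the parent's view
                    return parentValue == null ? null : new TreeSet<>(parentValue);
                }
            };

    public static void main(String[] args) throws InterruptedException {
        HELD_PLANETS.set(new TreeSet<>(Set.of(1L, 4L, 6L)));
        // Java 21 virtual threads are cheap enough to create per task, so no
        // pool is needed and inheritable thread locals behave as expected.
        Thread child = Thread.ofVirtual().start(
                () -> System.out.println("Child sees: " + HELD_PLANETS.get()));
        child.join();
    }
}
```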
🗒️ Note: I think it may be possible to detect that MysqlLockUtilService is run inside a thread pool by looking at the thread name
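Something along these lines could serve as that detection, though it relies on executor naming conventions ("pool-1-thread-3", "ForkJoinPool-1-worker-2" and the like), which are an assumption, not a guaranteed contract:

```java
public class PoolThreadDetector {
    // Rough heuristic: pooled platform threads usually carry executor-style
    // names, while freshly created virtual threads are typically unnamed.
    static boolean looksLikePoolThread() {
        Thread current = Thread.currentThread();
        String name = current.getName();
        return !current.isVirtual()
                && (name.startsWith("pool-") || name.contains("worker"));
    }
}
```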
Notice: Had to fix frontend dependencies going crazy when doing a clean install.
Also fixed a MySQL container error due to changes in modern MySQL versions.