Potential fix to new random bug (memory note) #545

Open
KevinGuancheDarias opened this issue Apr 26, 2024 · 0 comments
Labels: bug (Something isn't working)

KevinGuancheDarias (Owner) commented Apr 26, 2024

Background context (long text... 😱 )

OWGE, as a competitive and cooperative game, has scenarios where many players fight on the same planet at the same time, sometimes coordinating multiple tasks to run in the same second, which makes the OWGE backend a real concurrency challenge.

Running multiple missions at the same time on the same planet means that several threads may concurrently manipulate the same rows. This can cause many problems that are hard to fix and debug: deadlocks, inconsistent data, high computation cost due to random retries after deadlocks, and so on.

My idea to avoid this problem is to use named mutex locks in the database (to allow horizontal scaling of game instances). In MySQL this is done with GET_LOCK('lock_name', timeout) and, later, RELEASE_LOCK('lock_name'). This approach has its own hard-to-debug problems: for example, the Hikari DB pool won't release named locks when a connection is returned to the pool, and MySQL only clears a lock for you when the owning connection ends (the timeout in GET_LOCK only bounds how long acquisition waits). With the Hikari pool, connections don't end; they stay open to avoid the latency and overhead of opening new connections, so one has to be SURE that RELEASE_LOCK is ALWAYS invoked... Using a low timeout value in GET_LOCK() is not a good solution either, because under high server load the acquisition can time out in unwanted situations.

🗒️ Notice: This locking strategy is implemented in the class MysqlLockUtilService and wrapped for planet locking purposes in PlanetLockUtilService
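
A minimal sketch of that pattern (not the actual MysqlLockUtilService code; the helper class and its names are hypothetical), showing why acquire and release must happen on the same pooled connection and why the release must sit in a finally block:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

public class NamedLockSketch {

    private final DataSource dataSource;

    public NamedLockSketch(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    /** Runs the task while holding the MySQL named lock, releasing it even on failure. */
    public void runWithLock(String lockName, int timeoutSeconds, Runnable task) throws SQLException {
        // GET_LOCK and RELEASE_LOCK are connection-scoped, so both calls
        // must go through the very same pooled connection.
        try (Connection connection = dataSource.getConnection()) {
            acquire(connection, lockName, timeoutSeconds);
            try {
                task.run();
            } finally {
                // Without this, a pooled (never truly closed) connection
                // would keep the named lock forever.
                release(connection, lockName);
            }
        }
    }

    private void acquire(Connection connection, String lockName, int timeoutSeconds) throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement("SELECT GET_LOCK(?, ?)")) {
            ps.setString(1, lockName);
            ps.setInt(2, timeoutSeconds);
            try (ResultSet rs = ps.executeQuery()) {
                // GET_LOCK returns 1 on success, 0 on timeout, NULL on error.
                if (!rs.next() || rs.getInt(1) != 1) {
                    throw new IllegalStateException("Could not acquire lock " + lockName);
                }
            }
        }
    }

    private void release(Connection connection, String lockName) throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement("SELECT RELEASE_LOCK(?)")) {
            ps.setString(1, lockName);
            ps.executeQuery();
        }
    }
}
```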

This 🐛 is famous in the community

Everyone in the community calls this bug the "random bug", because from the perspective of the player, missions fail for seemingly random reasons.

The old problem

The initial MySQL locking strategy was bugged: we were getting "random" deadlocks when we should not, because:

Thread A wants to run a mission on planets 4, 6 and 1, while thread B wants to run one on planets 6, 4 and 1. This can deadlock: thread A locks 4 and waits for 6, which is already locked by B, while thread B waits for 4 because it's held by thread A.

The old fix

The random bug was fixed by sorting the planets before locking them, guaranteeing that planets are always locked in the same order, so no deadlock is possible. Thread A and thread B would both try to acquire the locks in the order 1, 4, 6: whichever thread gets lock 1 first makes the other wait for 1 to be released, and the waiting thread never holds 4 or 6 in the meantime. That fix worked for a long time.
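
A minimal sketch of the ordered-locking idea (hypothetical method names, not the real PlanetLockUtilService API):

```java
import java.util.List;

public class OrderedPlanetLocking {

    /**
     * Locks the given planets in ascending id order so that any two threads
     * contending for an overlapping set always acquire locks in the same
     * order, which rules out the circular wait needed for a deadlock.
     */
    public void lockPlanets(List<Long> planetIds, Runnable task) {
        // e.g. [4, 6, 1] and [6, 4, 1] both become [1, 4, 6]
        List<Long> ordered = planetIds.stream().distinct().sorted().toList();
        ordered.forEach(id -> acquireNamedLock("PLANET_" + id));
        try {
            task.run();
        } finally {
            ordered.forEach(id -> releaseNamedLock("PLANET_" + id));
        }
    }

    private void acquireNamedLock(String name) { /* GET_LOCK, as sketched above */ }

    private void releaseNamedLock(String name) { /* RELEASE_LOCK, as sketched above */ }
}
```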

Why did that fix stop working?

A fix for another bug was added in commit 7351524, because TransactionUtilService.doAfterCommit() was not being triggered under strange circumstances... that bug, which is another off-topic long story, was fixed, and then came the new problem: sometimes the action triggered by doAfterCommit() also needs to lock planets, which can break the lock-in-order guarantee, because (explained as a table of threads and their actions over time):

| Thread | Step 1 | Step 2 | Step 3 | Step 4 | Outcome |
| --- | --- | --- | --- | --- | --- |
| A | LOCKS 1,4,6 | **ACQUIRES 1,4,6** | DOES ITS JOB | doAfterCommit LOCKS 2,8,1 | **DEADLOCK** |
| B | LOCKS 2,4,6 | **ACQUIRES 2, WAITS 4,6** | KEEPS WAITING FOR 4 | KEEPS WAITING | **DEADLOCK** |
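
For reference, a doAfterCommit() helper like the one mentioned above is commonly built on Spring's TransactionSynchronizationManager; this is a generic sketch, not necessarily the exact OWGE implementation:

```java
import org.springframework.stereotype.Service;
import org.springframework.transaction.support.TransactionSynchronization;
import org.springframework.transaction.support.TransactionSynchronizationManager;

@Service
public class TransactionUtilServiceSketch {

    /** Runs the action only after the surrounding transaction has committed. */
    public void doAfterCommit(Runnable action) {
        if (TransactionSynchronizationManager.isSynchronizationActive()) {
            TransactionSynchronizationManager.registerSynchronization(new TransactionSynchronization() {
                @Override
                public void afterCommit() {
                    // Runs on the same thread, after the commit, while any
                    // named locks taken earlier may still be held — which is
                    // exactly what breaks the lock-in-order guarantee above.
                    action.run();
                }
            });
        } else {
            action.run(); // no active transaction: run immediately
        }
    }
}
```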

The fix that should work

As I should not make the actions of doAfterCommit() aware of outside locking (that would break some good practices), MysqlLockUtilService, when prompted to lock additional planets, would unlock all the planets it holds and then lock them again together with the new ones, in the same sorted order. For example, thread A would end up locking 1,2,4,6,8 when invoked from TransactionUtilService: the first time we locked 1,4,6 and the second time 2,8,1, but as we released the previous locks first, thread B gets to run its task with planets 4 and 6 with no problem, then releases its locks, and thread A acquires the locks for all its required planets. It's not very common for threads to be locking the same planets, so the potentially high response time in this case (which can reach 500 ms) is not worth the effort of studying other solutions.
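
A minimal sketch of the release-and-relock idea, assuming a hypothetical thread-local set of currently held planet ids (it ignores restoring the caller's lock scope after a nested call, which the real implementation would have to consider):

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class ReleaseAndRelockSketch {

    // Tracks which planet locks the current thread already holds.
    private static final ThreadLocal<SortedSet<Long>> HELD =
            ThreadLocal.withInitial(TreeSet::new);

    public void lockPlanets(List<Long> planetIds, Runnable task) {
        SortedSet<Long> previouslyHeld = HELD.get();
        SortedSet<Long> merged = new TreeSet<>(previouslyHeld);
        merged.addAll(planetIds);
        // If we already hold locks (e.g. we were invoked from a doAfterCommit
        // action), release them first so a thread waiting on one of them can
        // progress, then reacquire the merged set in ascending order.
        previouslyHeld.forEach(id -> releaseNamedLock("PLANET_" + id));
        merged.forEach(id -> acquireNamedLock("PLANET_" + id));
        HELD.set(merged);
        try {
            task.run();
        } finally {
            merged.forEach(id -> releaseNamedLock("PLANET_" + id));
            HELD.remove();
        }
    }

    private void acquireNamedLock(String name) { /* GET_LOCK */ }

    private void releaseNamedLock(String name) { /* RELEASE_LOCK */ }
}
```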

This new fix can also fail if...

As the new fix stores the currently locked planets in a ThreadLocal, if there is logic that runs on a separate thread while blocking the current thread until it finishes, planets locked inside that logic could still cause a deadlock. Luckily, with Java 21 one can avoid using thread pools, so an InheritableThreadLocal should do the trick. There are some places in OWGE that may still be using platform pool threads; OWGE's AsyncRunnerBo was updated to virtual threads, so it's safe to use, but the behavior of Spring's @Async annotation is unknown to me for now (Spring should not be using a thread pool when configured to use virtual threads).
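
A sketch of that detail, using a hypothetical holder class: an InheritableThreadLocal copies the parent's value into newly created threads, which works for per-task virtual threads but not for pre-created pooled platform threads:

```java
import java.util.Set;
import java.util.TreeSet;

public class HeldLocksHolder {

    // InheritableThreadLocal copies the parent's value into threads created
    // by that parent. Virtual threads are created per task, so a child sees
    // the caller's currently held locks; a pooled platform thread was created
    // long ago and inherits nothing useful.
    private static final InheritableThreadLocal<Set<Long>> HELD_PLANETS =
            new InheritableThreadLocal<>() {
                @Override
                protected Set<Long> initialValue() {
                    return new TreeSet<>();
                }

                @Override
                protected Set<Long> childValue(Set<Long> parentValue) {
                    // Give the child its own copy so mutations don't race.
                    return new TreeSet<>(parentValue);
                }
            };

    public static Set<Long> heldPlanets() {
        return HELD_PLANETS.get();
    }
}
```

For example, a task started with Thread.ofVirtual().start(...) would see the caller's held-lock set, while one submitted to a pre-existing platform thread pool would not.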

🗒️ Note: I think it may be possible to detect that MysqlLockUtilService is being run inside a thread pool by looking at the thread name
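
A rough sketch of that idea; the name prefixes are assumptions and would have to be checked against the executors actually in use:

```java
public final class ThreadPoolDetector {

    private ThreadPoolDetector() {
    }

    /**
     * Heuristic only: pooled threads usually carry their pool's name,
     * e.g. "pool-1-thread-3" or "ForkJoinPool.commonPool-worker-1".
     * The prefixes below are assumptions, not a guaranteed contract.
     */
    public static boolean looksLikePoolThread() {
        Thread current = Thread.currentThread();
        if (current.isVirtual()) {
            return false; // virtual threads are created per task, not pooled
        }
        String name = current.getName();
        return name.startsWith("pool-") || name.contains("ForkJoinPool");
    }
}
```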

@KevinGuancheDarias KevinGuancheDarias added the bug Something isn't working label Apr 26, 2024
@github-project-automation github-project-automation bot moved this to To do in v0.11 Aug 28, 2024
KevinGuancheDarias added a commit that referenced this issue Sep 7, 2024
Notice: Had to fix frontend dependencies going crazy when doing a clean install

Also fixed a mysql container error due to changes in modern mysql version
KevinGuancheDarias added a commit that referenced this issue Sep 8, 2024
Notice: Had to fix frontend dependencies going crazy when doing a clean install

Also fixed a mysql container error due to changes in modern mysql version