Possible data loss when multiple threads access the extra of a node #6628
Comments
Thanks for the report @cpignedoli. This almost certainly has the same cause as #6262. Whether it is a different thread or a different process running in parallel, the mechanism is likely the same.
There is no problem whatsoever @cpignedoli, I very much appreciate you taking the time to make a detailed report with steps to reproduce! I was just linking it to another known issue so the developers can handle them at the same time.
Hi @cpignedoli, I tried to reproduce the issue on both my laptop and on the demo server, but I could not. On my laptop I ran the test with SQLite and with PostgreSQL using the aiida-core docker image. The time sleep I set was …. My suggestion is to run the same test on your machine with ….
Hi @unkcpz, thanks a lot, I will try on the demo server as you suggest.
It is 1.5 CPU, see here.
Thanks @yakutovicha. Then it makes sense that the problem didn't show up on the demo server in my test. The D_v3 instances use hyper-threading, so 1.5 is actually 1.5 virtual CPUs. I'll test further on my workstation with hyper-threading to try to reproduce the problem.
For this specific issue, the problem may not lie on the AiiDA side. It seems that additional synchronization is required on the user side to prevent race conditions. To better illustrate the issue, I have included a sequence diagram to demonstrate what might be happening:

```mermaid
sequenceDiagram
    actor usr as User
    participant n1 as Notebook 1
    participant n2 as Notebook 2
    participant db as Database
    usr ->>+ n1: Loop 3
    usr ->>+ n2: Loop 4
    n1 ->> db: Read current state [1, 2]
    n2 ->> db: Read current state [1, 2]
    Note over n1, n2: Both notebooks read the initial list state [1, 2].
    n1 ->> n1: append 3 to the list and get [1, 2, 3]
    n2 ->> n2: append 4 to the list and get [1, 2, 4]
    Note over n1, n2: Notebook 1 appends 3, Notebook 2 appends 4.<br/>No synchronization prevents concurrent modifications.
    n1 ->>- db: Write back [1, 2, 3]
    n2 ->>- db: Write back [1, 2, 4]
    Note over db: Notebook 2 overwrites Notebook 1's updates.<br/>The final list becomes [1, 2, 4],<br/>and Notebook 1's update is lost.
    db ->> usr: Return final state [1, 2, 4]
```
Assuming the list already holds [1, 2] to begin with, the final state is [1, 2, 4] instead of the expected [1, 2, 3, 4], so Notebook 1's update is silently lost. However, issue #6262 is more likely a problem that needs to be addressed on the AiiDA side. I would recommend closing this issue and continuing to track the problem under that one.
Thanks for the nice summary @rabbull.
Since we expect users to always access the DB through the APIs we provide, would it be possible to add a lock inside AiiDA to prevent this from happening?
Hi @unkcpz, I can come up with some ideas, but I don't see a viable solution. First, I would like to demonstrate the problem. Even if all AiiDA operations are performed atomically (which, due to what was uncovered in #6262, might not be true), race conditions would still occur. This is because the critical section contains user code. To avoid unintended behavior, a large lock needs to wrap all three lines quoted below:

```python
mylist = anode.base.extras.get('test')
mylist.append(i)
anode.base.extras.set('test', mylist)
```

Until all three lines have been executed, any other thread entering the same procedure must wait.

The Lock Approach

One potential solution is to add a lock to the critical section, as I mentioned before (see the sketch below). However, there are a few drawbacks:
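For illustration, a minimal sketch of what a process-local lock on the user side could look like; note that a `threading.Lock` only protects threads within a single Python process and would not help across separate notebooks, as in the diagram above:

```python
import threading

from aiida import orm

extras_lock = threading.Lock()  # shared by every thread in this process

def append_to_extra(node: orm.Node, value):
    """Perform the read-modify-write of the 'test' extra under the lock."""
    with extras_lock:
        mylist = node.base.extras.get('test')
        mylist.append(value)
        node.base.extras.set('test', mylist)
```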
CAS Approach

Another potential approach is to provide a check-and-set (CAS) interface to replace the plain `set` call.
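To make the idea concrete, here is a purely hypothetical sketch; `set_if_equals` does not exist in AiiDA and is only meant to show the retry-on-conflict pattern such a CAS interface would enable:

```python
def cas_append(node, key, value, max_retries=10):
    """Retry the read-modify-write until no other writer interfered in between."""
    for _ in range(max_retries):
        current = node.base.extras.get(key)
        updated = current + [value]
        # Imagined primitive: atomically write `updated` only if the stored
        # value still equals `current`, returning True on success.
        if node.base.extras.set_if_equals(key, expected=current, new=updated):
            return
    raise RuntimeError(f'giving up on extra {key!r} after {max_retries} attempts')
```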
Lazy Operation

This might be the least feasible idea. AiiDA could add wrapper classes that provide an identical interface to the built-in types (for example, a list-like wrapper class). The drawbacks are obvious:
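For what it's worth, a very rough sketch of what such a deferred wrapper could look like; `LazyExtraList` is a made-up name, nothing like it exists in AiiDA, and the flush step would still need an atomic read-replay-write in the storage layer to avoid the same race:

```python
class LazyExtraList:
    """Hypothetical wrapper that records list operations instead of applying them."""

    def __init__(self, node, key):
        self._node = node
        self._key = key
        self._ops = []  # queued operations, e.g. ('append', 3)

    def append(self, value):
        # Defer the mutation; nothing is written to the database yet.
        self._ops.append(('append', value))

    def flush(self):
        # Replaying against a fresh read narrows the race window, but unless this
        # read-replay-write is itself atomic in the storage layer, the original
        # lost-update problem reappears here.
        current = self._node.base.extras.get(self._key)
        for op, value in self._ops:
            if op == 'append':
                current.append(value)
        self._node.base.extras.set(self._key, current)
        self._ops.clear()
```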
In conclusion, the best I can see is to leave all of these problems to be handled by users. This is beyond what a framework can do.
Describe the bug
If different threads try to update the extras of the same node, data loss may result.
I tested this with version 2.5.1.
Steps to reproduce
To test this I created a 'test' extra on a node. The extra contains a list, which I populated with odd integers from one thread and even integers from another thread.
At the end I checked how many odd and even numbers were missing from the list.
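The reporter's script is not reproduced in the issue; the following is a minimal sketch of the kind of test described above, assuming a configured AiiDA profile (the sleep value and node type are arbitrary choices):

```python
import threading
import time

from aiida import load_profile, orm

load_profile()

N = 200  # how many integers to append in total

node = orm.Int(0).store()          # any stored node will do
node.base.extras.set('test', [])   # start from an empty list
pk = node.pk

def fill(start):
    """Append every other integer starting from `start` (0 -> evens, 1 -> odds)."""
    mynode = orm.load_node(pk)
    for i in range(start, N, 2):
        current = mynode.base.extras.get('test')  # read
        current.append(i)                         # modify in memory
        mynode.base.extras.set('test', current)   # write back (not atomic with the read)
        time.sleep(0.001)                         # widen the race window a little

threads = [threading.Thread(target=fill, args=(start,)) for start in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final = orm.load_node(pk).base.extras.get('test')
missing = sorted(set(range(N)) - set(final))
print(f'{len(missing)} of {N} numbers were lost: {missing}')
```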
Expected behavior
Your environment
Other relevant software versions, e.g. Postgres & RabbitMQ
Additional context