
Possible data loss when multiple threads access the extra of a node #6628

Open

cpignedoli opened this issue Nov 21, 2024 · 11 comments

cpignedoli commented Nov 21, 2024

Describe the bug

If different threads try to update the extras of the same node data loss may result.
I tested it in version 2.5.1

Steps to reproduce

To test it I created a 'test' extra in a node. The extra contains a list that I populated with odd integers from one thread and even integers from another thread.
At the end I checked how many odd and even numbers were missing from the list.
[screenshot: test code and the count of missing odd/even integers]
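
A minimal sketch of a test along these lines (not the original script, which is only shown in the screenshot; it assumes a configured AiiDA profile and uses the extras API discussed later in this thread) could look like:

```python
import threading
import time

from aiida import load_profile, orm

load_profile()

# Store a node and initialise the 'test' extra with an empty list.
node = orm.Int(0).store()
node.base.extras.set('test', [])

def writer(start):
    # Append 100 integers (odd for start=1, even for start=2) via read-modify-write.
    for i in range(start, start + 200, 2):
        current = node.base.extras.get('test')   # read the stored list
        current.append(i)                        # modify the local copy
        node.base.extras.set('test', current)    # write back (may overwrite the other thread)
        time.sleep(0.02)

threads = [threading.Thread(target=writer, args=(s,)) for s in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final = set(node.base.extras.get('test'))
missing = sorted(i for i in range(1, 201) if i not in final)
print(f'{len(missing)} of 200 integers are missing: {missing}')
```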

Expected behavior

Your environment

  • Operating system: within AiiDAlab docker on ARM64
  • Python version: 3.9.13
  • aiida-core version: 2.5.1

Other relevant software versions, e.g. Postgres & RabbitMQ

Additional context

sphuber commented Nov 22, 2024

Thanks for the report @cpignedoli. This almost certainly has the same cause as #6262. Whether it is a different thread or a different process running in parallel, the underlying mechanism is likely the same.

@cpignedoli (Author)

Hi @sphuber, I apologize for the confusion. I thought I had posted this only on Slack since I wasn’t able to find my #6262 issue, so I ended up creating a new one. I’ll make an effort to improve my titles next time :)

sphuber commented Nov 23, 2024

There is no problem whatsoever @cpignedoli, I very much appreciate you taking the time to make a detailed report with steps to reproduce! I was just linking it to another known issue so the developers can handle them at the same time.

unkcpz commented Nov 23, 2024

Hi @cpignedoli, I tried to reproduce the issue on both my laptop and the demo server, but I could not.
So I guess this is also related to the operating system and machine architecture.
Can you try it on the demo server as well?

On my laptop, I tried running with SQLite and with PostgreSQL using the aiida-core docker image. The time sleep I set was 0.02.
On my laptop I have CPU hyper-threading turned off, so maybe that is related.
However, the demo server machine has hyper-threading (but we only give < 2 cores per user @yakutovicha correct me if I am wrong), so maybe that's why I didn't reproduce the issue.

My suggestion is to run the same test on your machine with verdi presto, which uses SQLite, to see if the problem comes from the DB backend.

@cpignedoli (Author)

Hi @unkcpz, thanks a lot, I will try on the demo server as you suggest.

@yakutovicha (Contributor)

but we only give < 2 cores per user @yakutovicha correct me if I am wrong

It is 1.5 CPU, see here.

unkcpz commented Nov 25, 2024

Thanks @yakutovicha. Then it makes sense that the problem didn't show up on the demo server in my test. The D_v3 instances use hyper-threading, so 1.5 CPU is actually 1.5 virtual CPUs. I'll test further on my workstation with hyper-threading to try to reproduce the problem.

rabbull commented Nov 26, 2024

For this specific issue, the problem may not lie on the AiiDA side. It seems that additional synchronization is required on the user side to prevent race conditions. To better illustrate the issue, I have included a sequence diagram to demonstrate what might be happening:

sequenceDiagram
    actor usr as User
    participant n1 as Notebook 1
    participant n2 as Notebook 2
    participant db as Database

    usr ->>+ n1: Loop 3
    usr ->>+ n2: Loop 4

    n1 ->> db: Read current state [1, 2]
    n2 ->> db: Read current state [1, 2]

    Note over n1, n2: Both notebooks read the initial list state [1, 2].

    n1 ->> n1: append 3 to the list and get [1, 2, 3]
    n2 ->> n2: append 4 to the list and get [1, 2, 4]

    Note over n1, n2: Notebook 1 appends 3, Notebook 2 appends 4.<br/>No synchronization prevents concurrent modifications.

    n1 ->>- db: Write back [1, 2, 3]
    n2 ->>- db: Write back [1, 2, 4]

    Note over db: Notebook 2 overwrites Notebook 1's updates. <br/>The final list becomes [1, 2, 4], <br/>and Notebook 1's update is lost.

    db ->> usr: Return final state [1, 2, 4]

Assuming the list already has [1, 2] in it, notebook 1 tries to append 3 while notebook 2 simultaneously tries to append 4. The thread that writes its updated list later may overwrite what was stored by the other thread.

However, for issue #6262, it is more likely a problem that needs to be addressed on the AiiDA side. I would recommend closing this issue and continuing to track the problem under that issue.

unkcpz commented Nov 26, 2024

Thanks for the nice summary @rabbull

It seems that additional synchronization is required on the user side to prevent race conditions.

Since we expect that users always access the DB through the APIs we provide, is it possible to put a lock inside AiiDA to prevent this from happening?

unkcpz changed the title from "Possible data loss when multiple threads access teh extra of a node" to "Possible data loss when multiple threads access the extra of a node" on Nov 26, 2024

rabbull commented Nov 26, 2024

Since we expect that users always access the DB through the APIs we provide, is it possible to put a lock inside AiiDA to prevent this from happening?

Hi @unkcpz,

I can come up with some ideas, but I don't see a viable solution.

First, I would like to demonstrate the problem. Even if all AiiDA operations are performed atomically (which, due to what was uncovered in #6262, might not be true), race conditions would still occur. This is because the critical section contains user code. To avoid unintended behavior, a large lock needs to wrap all three lines quoted below:

mylist = anode.base.extras.get('test')    # read the current list from the database
mylist.append(i)                          # modify the local copy
anode.base.extras.set('test', mylist)     # write the whole list back

Before all three lines of code have been executed, any other threads entering the same procedure must wait.

The Lock Approach

One potential solution is to add a lock to the critical section, as I mentioned before. However, there are a few drawbacks:

  1. Even if AiiDA provides interfaces to create and manage locks, users still need to manually include them in their code.
  2. It's tricky to share the lock across multiple notebooks, which are different processes. Different platforms may require different implementations.
  3. It's common to use Redis as a distributed lock, but this is not as reliable as OS locks and will introduce new components and increase complexity.
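
As a minimal sketch of point 1, assuming both writers are threads inside the same Python process (separate notebooks are separate processes and would need an inter-process mechanism such as a file lock instead):

```python
import threading

# Hypothetical user-managed lock; AiiDA does not currently provide this.
extras_lock = threading.Lock()

def append_to_extra(anode, key, value):
    # The whole read-modify-write sequence must run under the lock to stay atomic.
    with extras_lock:
        mylist = anode.base.extras.get(key)
        mylist.append(value)
        anode.base.extras.set(key, mylist)
```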

CAS Approach

Another potential approach is to provide a check-and-set (CAS) interface to replace set. The idea is to check if anode.base.extras has been modified since it was read from the database. If the data has not been modified, then it's good to proceed; but if it has been modified, the procedure fails immediately and lets the users retry themselves. This approach also has some significant drawbacks:

  1. This might be useful in web applications where race conditions are not that frequent, but in scientific computation (I'm not that familiar with this area, but I assume that) race conditions seem to be much more frequent, which could lead to severe performance problems.
  2. The implementation of this CAS still needs investigation. I'm not sure if PostgreSQL or SQLite databases themselves provide such an interface. If not, then locks have to be used in the AiiDA framework, which also harms performance.
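
A rough sketch of what such an interface could look like from the user's side; the cas_set method below is imagined, not an existing AiiDA API, and the atomicity of the compare + write would have to be guaranteed by the database backend (e.g. a conditional UPDATE inside a single transaction):

```python
def cas_append(anode, key, value, max_retries=10):
    """Append to a list extra using an imagined check-and-set primitive."""
    for _ in range(max_retries):
        snapshot = anode.base.extras.get(key)      # read the current value
        updated = snapshot + [value]               # modify a copy
        # Imagined API: write 'updated' only if the stored value still equals
        # 'snapshot'; returns False if another writer modified it in the meantime.
        if anode.base.extras.cas_set(key, expected=snapshot, new=updated):
            return
    raise RuntimeError(f"extra '{key}' was modified concurrently too many times")
```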

Lazy Operation

This might be the least feasible idea. AiiDA can add some wrapper classes that provide an identical interface to the built-in types. For example, a class named FakeList can be created, providing an append method. The difference between fake objects and real objects is that fake objects record user operations instead of performing them, and unless the user asks to read the real value, the AiiDA interface just returns a fake object without performing real queries. In this case, AiiDA will be able to know the user's operations and reconstruct them properly to avoid race conditions (and improve query performance).
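
A very rough sketch of the idea; FakeList here is only the illustrative name from the paragraph above, not an existing AiiDA class:

```python
class FakeList:
    """Records list operations instead of performing them immediately."""

    def __init__(self):
        self._ops = []                       # journal of recorded operations

    def append(self, item):
        self._ops.append(('append', item))   # record the operation, don't apply it

    def replay(self, stored_list):
        # The framework would re-read the freshest stored value and re-apply the
        # recorded operations onto it, avoiding the lost-update problem.
        for op, item in self._ops:
            if op == 'append':
                stored_list.append(item)
        return stored_list
```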

The drawbacks are obvious:

  1. Extremely difficult to implement.
  2. May still have cases where race conditions cannot be prevented.

In conclusion, the best option I can see is to leave these problems to be handled by users; this is beyond what a framework can do.

@cpignedoli (Author)

Thanks a lot @rabbull and @unkcpz.
The cases I have in mind would be compatible with reduced performance. I cannot comment on how much of an issue performance would be in a context such as the "extreme high-throughput calculations" that AiiDA is meant to run.
