
Possible data loss when multiple threads access the extra of a node #6628

Open

cpignedoli opened this issue Nov 21, 2024 · 11 comments

cpignedoli commented Nov 21, 2024

Describe the bug

If different threads try to update the extras of the same node data loss may result.
I tested it in version 2.5.1

Steps to reproduce

To test it I created a 'test' extra in a node. The extra contains a list that I populated with odd integers from one thread and even integers from another thread.
At the end I checked how many odd and even numbers were missing from the list.
[screenshot: test code and the count of missing odd/even integers]
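
A minimal sketch of a test along these lines (not the original script, which is only shown in the screenshot; it assumes a configured AiiDA profile and uses the extras API discussed later in this thread) could look like:

```python
import threading
import time

from aiida import load_profile, orm

load_profile()

# Store a node and initialise the 'test' extra with an empty list.
node = orm.Int(0).store()
node.base.extras.set('test', [])

def writer(start):
    # Append 100 integers (odd for start=1, even for start=2) via read-modify-write.
    for i in range(start, start + 200, 2):
        current = node.base.extras.get('test')   # read the stored list
        current.append(i)                        # modify the local copy
        node.base.extras.set('test', current)    # write back (may overwrite the other thread)
        time.sleep(0.02)

threads = [threading.Thread(target=writer, args=(s,)) for s in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final = set(node.base.extras.get('test'))
missing = sorted(i for i in range(1, 201) if i not in final)
print(f'{len(missing)} of 200 integers are missing: {missing}')
```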

Expected behavior

Your environment

  • Operating system: within AiiDAlab docker on ARM64
  • Python version: 3.9.13
  • aiida-core version: 2.5.1

Other relevant software versions, e.g. Postgres & RabbitMQ

Additional context

sphuber commented Nov 22, 2024

Thanks for the report @cpignedoli. This almost certainly has the same cause as #6262. Whether it is a different thread or a different process running in parallel, the underlying mechanism is likely the same.

@cpignedoli (Author)

Hi @sphuber, I apologize for the confusion. I thought I had posted this only on Slack since I wasn’t able to find my #6262 issue, so I ended up creating a new one. I’ll make an effort to improve my titles next time :)

sphuber commented Nov 23, 2024

There is no problem whatsoever @cpignedoli, I very much appreciate you taking the time to make a detailed report with steps to reproduce! I was just linking it to another known issue so the developers can handle them at the same time.

unkcpz commented Nov 23, 2024

Hi @cpignedoli, I tried to reproduce the issue on both my laptop and the demo server, but I could not.
So I guess this is also related to the operating system and machine architecture.
Can you try it on the demo server as well?

On my laptop, I tried running with SQLite and with PostgreSQL using the aiida-core docker image. The time sleep I set was 0.02.
On my laptop I have CPU hyper-threading turned off, so maybe that is related.
However, the demo server machine has hyper-threading (but we only give < 2 cores per user @yakutovicha correct me if I am wrong), so maybe that's why I didn't reproduce the issue.

My suggestion is to run the same test on your machine with verdi presto, which uses SQLite, to see if the problem comes from the DB backend.

@cpignedoli (Author)

Hi @unkcpz, thanks a lot, I will try on the demo server as you suggest.

@yakutovicha (Contributor)

but we only give < 2 cores per user @yakutovicha correct me if I am wrong

It is 1.5 CPU, see here.

unkcpz commented Nov 25, 2024

Thanks @yakutovicha. Then it makes sense that the problem didn't show up on the demo server in my test. The D_v3 instances use hyper-threading, so 1.5 CPU is actually 1.5 virtual CPUs. I'll test further on my workstation with hyper-threading to try to reproduce the problem.

rabbull commented Nov 26, 2024

For this specific issue, the problem may not lie on the AiiDA side. It seems that additional synchronization is required on the user side to prevent race conditions. To better illustrate the issue, I have included a sequence diagram to demonstrate what might be happening:

sequenceDiagram
    actor usr as User
    participant n1 as Notebook 1
    participant n2 as Notebook 2
    participant db as Database

    usr ->>+ n1: Loop 3
    usr ->>+ n2: Loop 4

    n1 ->> db: Read current state [1, 2]
    n2 ->> db: Read current state [1, 2]

    Note over n1, n2: Both notebooks read the initial list state [1, 2].

    n1 ->> n1: append 3 to the list and get [1, 2, 3]
    n2 ->> n2: append 4 to the list and get [1, 2, 4]

    Note over n1, n2: Notebook 1 appends 3, Notebook 2 appends 4.<br/>No synchronization prevents concurrent modifications.

    n1 ->>- db: Write back [1, 2, 3]
    n2 ->>- db: Write back [1, 2, 4]

    Note over db: Notebook 2 overwrites Notebook 1's updates. <br/>The final list becomes [1, 2, 4], <br/>and Notebook 1's update is lost.

    db ->> usr: Return final state [1, 2, 4]

Assuming the list already has [1, 2] in it, notebook 1 tries to append 3 while notebook 2 simultaneously tries to append 4. The thread that writes its updated list later may overwrite what was stored by the other thread.

However, for issue #6262, it is more likely a problem that needs to be addressed on the AiiDA side. I would recommend closing this issue and continuing to track the problem under that issue.

unkcpz commented Nov 26, 2024

Thanks for the nice summary @rabbull

It seems that additional synchronization is required on the user side to prevent race conditions.

Since we expect that users always access the DB through the APIs we provide, is it possible to put a lock inside AiiDA to prevent this from happening?

unkcpz changed the title from "Possible data loss when multiple threads access teh extra of a node" to "Possible data loss when multiple threads access the extra of a node" on Nov 26, 2024

rabbull commented Nov 26, 2024

Since we expect that users always access the DB through the APIs we provide, is it possible to put a lock inside AiiDA to prevent this from happening?

Hi @unkcpz,

I can come up with some ideas, but I don't see a viable solution.

First, I would like to demonstrate the problem. Even if all AiiDA operations are performed atomically (which, due to what was uncovered in #6262, might not be true), race conditions would still occur. This is because the critical section contains user code. To avoid unintended behavior, a large lock needs to wrap all three lines quoted below:

mylist = anode.base.extras.get('test')    # read the current list from the database
mylist.append(i)                          # modify the local copy
anode.base.extras.set('test', mylist)     # write the whole list back

Before all three lines of code have been executed, any other threads entering the same procedure must wait.

The Lock Approach

One potential solution is to add a lock to the critical section, as I mentioned before. However, there are a few drawbacks:

  1. Even if AiiDA provides interfaces to create and manage locks, users still need to manually include them in their code.
  2. It's tricky to share the lock across multiple notebooks, which are different processes. Different platforms may require different implementations.
  3. It's common to use Redis as a distributed lock, but this is not as reliable as OS locks and will introduce new components and increase complexity.
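
As a minimal sketch of point 1, assuming both writers are threads inside the same Python process (separate notebooks are separate processes and would need an inter-process mechanism such as a file lock instead):

```python
import threading

# Hypothetical user-managed lock; AiiDA does not currently provide this.
extras_lock = threading.Lock()

def append_to_extra(anode, key, value):
    # The whole read-modify-write sequence must run under the lock to stay atomic.
    with extras_lock:
        mylist = anode.base.extras.get(key)
        mylist.append(value)
        anode.base.extras.set(key, mylist)
```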

CAS Approach

Another potential approach is to provide a check-and-set (CAS) interface to replace set. The idea is to check if anode.base.extras has been modified since it was read from the database. If the data has not been modified, then it's good to proceed; but if it has been modified, the procedure fails immediately and lets the users retry themselves. This approach also has some significant drawbacks:

  1. This might be useful in web applications where race conditions are not that frequent, but in scientific computation (I'm not that familiar with this area, but I assume that) race conditions seem to be much more frequent, which could lead to severe performance problems.
  2. The implementation of this CAS still needs investigation. I'm not sure if PostgreSQL or SQLite databases themselves provide such an interface. If not, then locks have to be used in the AiiDA framework, which also harms performance.
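
A rough sketch of what such an interface could look like from the user's side; the cas_set method below is imagined, not an existing AiiDA API, and the atomicity of the compare + write would have to be guaranteed by the database backend (e.g. a conditional UPDATE inside a single transaction):

```python
def cas_append(anode, key, value, max_retries=10):
    """Append to a list extra using an imagined check-and-set primitive."""
    for _ in range(max_retries):
        snapshot = anode.base.extras.get(key)      # read the current value
        updated = snapshot + [value]               # modify a copy
        # Imagined API: write 'updated' only if the stored value still equals
        # 'snapshot'; returns False if another writer modified it in the meantime.
        if anode.base.extras.cas_set(key, expected=snapshot, new=updated):
            return
    raise RuntimeError(f"extra '{key}' was modified concurrently too many times")
```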

Lazy Operation

This might be the least feasible idea. AiiDA can add some wrapper classes that provide an identical interface to the built-in types. For example, a class named FakeList can be created, providing an append method. The difference between fake objects and real objects is that fake objects record user operations instead of performing them, and unless the user asks to read the real value, the AiiDA interface just returns a fake object without performing real queries. In this case, AiiDA will be able to know the user's operations and reconstruct them properly to avoid race conditions (and improve query performance).
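
A very rough sketch of the idea; FakeList here is only the illustrative name from the paragraph above, not an existing AiiDA class:

```python
class FakeList:
    """Records list operations instead of performing them immediately."""

    def __init__(self):
        self._ops = []                       # journal of recorded operations

    def append(self, item):
        self._ops.append(('append', item))   # record the operation, don't apply it

    def replay(self, stored_list):
        # The framework would re-read the freshest stored value and re-apply the
        # recorded operations onto it, avoiding the lost-update problem.
        for op, item in self._ops:
            if op == 'append':
                stored_list.append(item)
        return stored_list
```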

The drawbacks are obvious:

  1. Extremely difficult to implement.
  2. May still have cases where race conditions cannot be prevented.

In conclusion, the best option I can see is to leave these problems to be handled by users; this is beyond what a framework can do.

@cpignedoli (Author)

Thanks a lot @rabbull and @unkcpz.
The cases I have in mind would be compatible with reduced performance. I cannot comment on how much of an issue performance would be in a context such as the "extreme high-throughput calculations" that AiiDA is meant to run.
