Lock-based thread safety #92
Conversation
crusaderky commented Mar 25, 2023 (edited)
- Partially closes [WIP] Asynchronous SpillBuffer distributed#7686
- The whole lockless thread-safety design from Partial thread-safety #82 and Enhanced thread-safety in zict.File #90 was extremely brittle. This PR swaps it for an industry-standard RLock-based design.
- Full thread safety for all classes (except Zip and LMDB), with far fewer caveats than before (see the usage sketch below)
- Follow-up: get(), pop(), popitem(), and setdefault() are not thread-safe #99
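As a rough, hedged usage sketch of what "full thread safety" is meant to allow: several threads writing to the same mapping with no external locking. It uses zict.LRU with the (n, d) signature shown later in this PR; the capacity and thread counts are arbitrary.

import threading
import zict

lru = zict.LRU(100, {})  # n=100, d={} per the __init__ signature shown below

def writer(start: int) -> None:
    for i in range(start, start + 1_000):
        lru[i] = i  # concurrent __setitem__ calls are safe after this PR

threads = [threading.Thread(target=writer, args=(k * 1_000,)) for k in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()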
with self._lock:
    discard(self.slow, key)
    if key in self._cancel_restore:
        self._cancel_restore[key] = True
crusaderky commented:
Annoyingly, I can't just call

self.set_noevict(key, value)
self.fast.evict_until_below_capacity()

due to the bespoke exception handling in LRU.__setitem__.
I'll clean this up once distributed has switched to async mode.
def __init__(
    self,
    n: float,
    d: MutableMapping[KT, VT],
    *,
crusaderky commented:
Very small API breakage; I doubt anybody will mind
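For illustration, a hedged sketch of the breakage implied by the new `*` in the signature above: every parameter after d becomes keyword-only. The on_evict parameter is used here only as an example.

import zict

def on_evict(key, value):
    print("evicted", key)

lru = zict.LRU(100, {}, on_evict=on_evict)  # OK: keyword argument
# zict.LRU(100, {}, on_evict)               # positional form now raises TypeError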
crusaderky force-pushed the branch from 98ca2c4 to 87c8572, from 526dc10 to bf4174a, and from f9b559a to 981f796.
crusaderky commented:
@milesgranger this is ready for review.
# How many times to repeat non-deterministic stress tests.
# You may set it as high as 50 if you wish to run in CI.
REPEAT_STRESS_TESTS = 1
crusaderky commented:
I ran it 100 times just before review; all green.
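As an aside, a hypothetical sketch of how a knob like REPEAT_STRESS_TESTS can drive repeated runs of a non-deterministic test with pytest; the test name and body are illustrative, not zict's actual test suite.

import threading
import pytest

REPEAT_STRESS_TESTS = 1  # bump to ~50 for a longer soak run, e.g. in CI

@pytest.mark.parametrize("attempt", range(REPEAT_STRESS_TESTS))
def test_stress_concurrent_setitem(attempt):
    d = {}
    lock = threading.RLock()

    def writer():
        for i in range(10_000):
            with lock:
                d[i] = i

    threads = [threading.Thread(target=writer) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert len(d) == 10_000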
milesgranger commented:
Looks pretty darn good to me; just some nits and clarifications about the implementation: specifically, why we still need to account for potential state changes caused by the temporary releases of the locks, and whether releasing the locks at those points is worth it.
However, it seems well tested and thought out, so feel free to just comment and move along with approval; I have nothing that warrants holding you up here. :)
zict/common.py (Outdated)
def locked(func: Callable[P, VT]) -> Callable[P, VT]:
    """Decorator for a method of ZictBase, which wraps the whole method in a
    mapping-global rlock.
milesgranger commented:
Nit: Not a global lock, right? One lock per instance of ZictBase, it appears.
crusaderky commented:
Yes, per-instance.
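For illustration, a minimal sketch of a per-instance (not global) lock decorator in the spirit of the zict/common.py hunk above; the lock attribute name and the Example class are assumptions, not zict's actual code.

import threading
from functools import wraps

def locked(func):
    """Run the wrapped method under the instance's re-entrant lock."""
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        with self.lock:  # one threading.RLock per instance, not a module global
            return func(self, *args, **kwargs)
    return wrapper

class Example:
    def __init__(self):
        self.lock = threading.RLock()  # created per instance
        self.data = {}

    @locked
    def __setitem__(self, key, value):
        self.data[key] = value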
gen += 1
self._last_updated[key] = self._gen = gen

with self.unlock():
milesgranger commented:
Can we add a comment here on why this benefits from being unlocked? It seems like some of the second condition below could be removed if this were left locked. I assume there is a reason for the trade-off though: to allow another thread to make progress? If so, I'm curious whether it's worth the trade-off, given the added complexity of checking / correcting state (the second conditional call to discard(self.cache, key) in that same second conditional).
crusaderky commented:
It's explained in the notes at the top of every class.
These methods can take arbitrarily long to run, so two threads should be able to run in unison on the same mapping while they're running:

- Buffer.slow.__setitem__
- Buffer.slow.__getitem__
- LRU.on_evict
- Cache.data.__setitem__
- Cache.data.__getitem__
- File.__setitem__ (specifically fh.write / fh.writelines; not the other syscalls)
- File.__getitem__ (specifically fh.readinto; not the other syscalls)
- Sieve.mappings[*].__setitem__
- Sieve.mappings[*].__getitem__

All other methods are expected to be fast, meaning it's OK for a thread that must not be busy (read: the Worker's event loop) to block waiting for them to finish.
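To make the trade-off concrete, here is a condensed, self-contained sketch of the pattern being discussed in this thread: release the per-instance RLock around the slow store, then detect a lost race via a generation counter and invalidate the cache entry. It is a toy class in the spirit of the Cache hunk above, not zict's actual code, and it assumes the lock is held exactly once when unlock() is entered.

import threading
from contextlib import contextmanager

class CachedStore:
    """Toy illustration of the unlock-and-recheck pattern."""

    def __init__(self, data):
        self.data = data                  # slow underlying MutableMapping
        self.cache = {}                   # fast in-memory cache
        self.lock = threading.RLock()
        self.gen = 0
        self.last_updated = {}

    @contextmanager
    def unlock(self):
        # Temporarily release the lock held by the caller (assumes a single
        # level of acquisition) so other threads can make progress meanwhile.
        self.lock.release()
        try:
            yield
        finally:
            self.lock.acquire()

    def __setitem__(self, key, value):
        with self.lock:
            self.gen += 1
            gen = self.last_updated[key] = self.gen
            with self.unlock():
                self.data[key] = value    # arbitrarily slow (serialization, I/O)
            if self.last_updated.get(key) == gen:
                self.cache[key] = value   # no race: safe to cache our value
            else:
                # Another thread touched the same key while we were unlocked;
                # we can't tell whose write landed last in self.data, so make
                # sure the cache doesn't hold a possibly-stale value.
                self.cache.pop(key, None)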
  fn = self._safe_key(key)
- with open(os.path.join(self.directory, fn), "wb") as fh:
+ with open(os.path.join(self.directory, fn), "wb") as fh, self.unlock():
milesgranger commented:
Curious why writing files is considered 'safe' enough to release the lock, but deleting them is not (see delitem)? It seems mutation of files ought to (probably) be locked, or consistently unlocked and left to the user to worry about.
crusaderky commented:
We are not mutating files though; we're always creating a new one. If two threads call __setitem__ on the same key, they'll end up writing to two different files.
I just noticed that I forgot about a race condition there though: one of the two files would remain there indefinitely, littering the hard drive. Fixed.
It's not that deleting files is unsafe; it's that os.remove is expected to be fast (see the definition above), so I chose to keep the method simple. For the record, dict.pop (first line of __delitem__) is not thread-safe and would need to be wrapped in a lock.
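A simplified sketch of the scheme described above, using hypothetical names (it is not zict.File's actual code): each __setitem__ writes to its own fresh file outside the lock, and the bookkeeping under the lock then removes whichever file has been superseded, so nothing is left littering the disk.

import os
import threading
from itertools import count

class TinyFileStore:
    def __init__(self, directory):
        self.directory = directory
        self.lock = threading.RLock()
        self.key_to_filename = {}
        self._counter = count()       # makes every written filename unique

    def __setitem__(self, key, value: bytes):
        with self.lock:
            fn = os.path.join(self.directory, f"{key}-{next(self._counter)}")
        # Slow part runs unlocked: two threads setting the same key write to
        # two different files, so they never mutate the same file.
        with open(fn, "wb") as fh:
            fh.write(value)
        with self.lock:
            old = self.key_to_filename.get(key)
            self.key_to_filename[key] = fn
        if old is not None:
            os.remove(old)            # drop the file that was superseded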
with self.unlock():
    mapping[key] = value

if gen != self.gen and self.key_to_mapping.get(key) is not mapping:
    # Multithreaded race condition
    discard(mapping, key)
milesgranger commented:
I suppose this is a similar clarification; is the tradeoff worth potentially corrupting the expected state and trying to rectify it?
crusaderky commented:
Mappings in Sieve can be slow. Realistic example: we have appetite to change the current distributed.spill.SpillBuffer.slow from

slow = zict.Func(dumps, loads, zict.File(local_directory))

to

def selector(k, v):
    if isinstance(v, (pandas.DataFrame, pandas.Series)):
        return "pandas"
    else:
        return "generic"

slow = zict.Sieve(
    {
        "pandas": zict.ParquetFile(local_directory),
        "generic": zict.Func(dumps, loads, zict.File(local_directory)),
    },
    selector=selector,
)
milesgranger commented:
Thanks for the clarifications @crusaderky 😃