Tracking issue for no-gil/freethreaded work #4265

Open
14 tasks done
alex opened this issue Jun 20, 2024 · 64 comments

Comments

@alex
Contributor

alex commented Jun 20, 2024

We didn't have a dedicated issue for this, so now there's one.

TODO:

  • Add a cfg for no-gil, but only allowed behind an experimental feature
  • ffi-check passing with a no-GIL build
  • Adopt new owned-reference-friendly C APIs
    • PyDict_GetItemRef
    • PyList_GetItemRef
    • PyDict_Next
    • PyWeakref_GetRef
    • PyImport_AddModuleRef
  • Identify places that assume a Python<'_> indicates only a single thread is executing:
    • pyo3::sync::GILOnceCell
    • PyClassBorrowChecker
    • GILProtected
    • PyErrState::normalize
    • ...
  • A way for extensions to declare that the Py_mod_gil slot should be set
  • pyo3_ffi datetime bindings are not thread safe (?)
@ngoldbaum
Contributor

As a tiny piece of this and to try to learn the library better, I'm working on adding wrappers for the new GetItemRef C API functions in the 3.13 stable API. These are needed to be fully safe for free-threaded python and are nice to have anyway on older versions because strong references are easier to reason about.
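
To illustrate why strong references are easier to reason about, here is a rough sketch in plain Rust with no PyO3 or CPython calls. `SharedList`, `get_item_ref`, and `read_after_clear` are hypothetical names, and `Arc<String>` only stands in for a reference-counted Python object; this is a model of the idea, not the actual wrapper.

```rust
use std::sync::{Arc, Mutex};

/// Toy stand-in for a Python list shared between threads (hypothetical type).
/// `Arc<String>` plays the role of a reference-counted Python object.
struct SharedList {
    items: Mutex<Vec<Arc<String>>>,
}

impl SharedList {
    fn new(items: Vec<&str>) -> Self {
        let items = items.into_iter().map(|s| Arc::new(s.to_string())).collect();
        SharedList { items: Mutex::new(items) }
    }

    /// Analogue of an owned-reference getter: the strong count is bumped
    /// while the lock is held, so the item outlives later mutations.
    fn get_item_ref(&self, index: usize) -> Option<Arc<String>> {
        let guard = self.items.lock().unwrap();
        let item = guard.get(index).cloned();
        item
    }

    /// Drops every element, as another thread might do concurrently.
    fn clear(&self) {
        self.items.lock().unwrap().clear();
    }
}

/// Take a strong reference, clear the list, then read the item anyway.
fn read_after_clear() -> String {
    let list = SharedList::new(vec!["spam", "eggs"]);
    let item = list.get_item_ref(1).expect("index 1 exists");
    list.clear();
    // Still valid: `item` holds its own strong reference.
    (*item).clone()
}
```

With a borrowed reference, the `clear` call would have been free to destroy the object before the read; the owned reference makes that ordering harmless.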

@ngoldbaum
Contributor

ngoldbaum commented Jul 30, 2024

Just to update the current state of things: pyo3 builds against the free-threaded build if you do:

UNSAFE_PYO3_BUILD_FREE_THREADED=1 cargo build

If you use pyenv, you'll also need to locally delete or modify the .python-version file.

This very quickly crashes inside of mimalloc internals, ultimately inside of Py_InitializeEx:

  * frame #0: 0x000000010135af60 libpython3.13t.dylib`chacha_block + 448
    frame #1: 0x000000010134f7e8 libpython3.13t.dylib`_mi_os_get_aligned_hint + 172
    frame #2: 0x000000010135cd68 libpython3.13t.dylib`unix_mmap_prim + 136
    frame #3: 0x0000000101355e80 libpython3.13t.dylib`_mi_prim_alloc + 220
    frame #4: 0x000000010134fb30 libpython3.13t.dylib`mi_os_prim_alloc + 68
    frame #5: 0x0000000101348560 libpython3.13t.dylib`_mi_os_alloc_aligned + 352
    frame #6: 0x0000000101349a9c libpython3.13t.dylib`mi_reserve_os_memory_ex + 80
    frame #7: 0x0000000101347ee8 libpython3.13t.dylib`_mi_arena_alloc_aligned + 392
    frame #8: 0x000000010135bd00 libpython3.13t.dylib`mi_segment_alloc + 468
    frame #9: 0x0000000101354950 libpython3.13t.dylib`mi_segments_page_alloc + 1468
    frame #10: 0x000000010135ab94 libpython3.13t.dylib`mi_page_fresh_alloc + 56
    frame #11: 0x0000000101351f5c libpython3.13t.dylib`mi_find_page + 528
    frame #12: 0x0000000101344070 libpython3.13t.dylib`_mi_malloc_generic + 208
    frame #13: 0x000000010144e3d8 libpython3.13t.dylib`gc_alloc + 284
    frame #14: 0x000000010144e268 libpython3.13t.dylib`_PyObject_GC_New + 96
    frame #15: 0x000000010132042c libpython3.13t.dylib`PyDict_New + 84
    frame #16: 0x00000001013b3d24 libpython3.13t.dylib`_PyUnicode_InitGlobalObjects + 236
    frame #17: 0x000000010147e4dc libpython3.13t.dylib`pycore_interp_init + 72
    frame #18: 0x000000010147bb88 libpython3.13t.dylib`Py_InitializeFromConfig + 1360
    frame #19: 0x000000010147bc9c libpython3.13t.dylib`Py_InitializeEx + 144
    frame #20: 0x000000010006e5f0 pyo3-1eb544a7db3e1a47`pyo3::gil::prepare_freethreaded_python::_$u7b$$u7b$closure$u7d$$u7d$::h922d8fd5db1fd90c((null)={closure_env#0} @ 0x00000001710665c7, (null)=0x0000000171066640) at gil.rs:69:13
    frame #21: 0x0000000100047a4c pyo3-1eb544a7db3e1a47`std::sync::once::Once::call_once_force::_$u7b$$u7b$closure$u7d$$u7d$::h15baf6dd1f7316ea(p=0x0000000171066640) at once.rs:208:40
    frame #22: 0x000000010031a770 pyo3-1eb544a7db3e1a47`std::sys::sync::once::queue::Once::call::heacc08786c6d7dfa at queue.rs:183:21 [opt]
    frame #23: 0x00000001000478b4 pyo3-1eb544a7db3e1a47`std::sync::once::Once::call_once_force::h7b8eb88c3a02f292(self=0x00000001004dce70, f={closure_env#0} @ 0x000000017106671f) at once.rs:208:9
    frame #24: 0x00000001001d68b4 pyo3-1eb544a7db3e1a47`pyo3::gil::prepare_freethreaded_python::h316cd04b406e24c0 at gil.rs:66:5
    frame #25: 0x00000001001d6924 pyo3-1eb544a7db3e1a47`pyo3::gil::GILGuard::acquire::h2127069d9988a593 at gil.rs:174:21
    frame #26: 0x0000000100053000 pyo3-1eb544a7db3e1a47`pyo3::marker::Python::with_gil::h38979cd5e69873c3(f={closure_env#0} @ 0x000000017106679f) at marker.rs:403:21
    frame #27: 0x00000001001c6548 pyo3-1eb544a7db3e1a47`pyo3::conversions::std::array::tests::test_extract_non_iterable_to_array::h3c220ef1fe379cdf at array.rs:226:9
    frame #28: 0x000000010004a1b4 pyo3-1eb544a7db3e1a47`pyo3::conversions::std::array::tests::test_extract_non_iterable_to_array::_$u7b$$u7b$closure$u7d$$u7d$::h7de4fa3687a88518((null)=0x00000001710667fe) at array.rs:225:44

Just to make sure all of this is reproducible and we have some feedback on CI, I think I'm going to add a free-threaded CI job marked with continue-on-error with a test run that crashes like this.

@davidhewitt
Member

That sounds great to me, thanks!

ngoldbaum added a commit to ngoldbaum/pyo3 that referenced this issue Jul 31, 2024
ngoldbaum added a commit to ngoldbaum/pyo3 that referenced this issue Aug 1, 2024
github-merge-queue bot pushed a commit that referenced this issue Aug 1, 2024
* Update dict.get_item binding to use PyDict_GetItemRef

Refs #4265

* test: add test for dict.get_item error path

* test: add test for dict.get_item error path

* test: add test for dict.get_item error path

* fix: fix logic error in dict.get_item bindings

* update: apply david's review suggestions for dict.get_item bindings

* update: create ffi::compat to store compatibility shims

* update: move PyDict_GetItemRef bindings to spot in order from dictobject.h

* build: fix build warning with --no-default-features

* doc: expand release note fragments

* fix: fix clippy warnings

* respond to review comments

* Apply suggestion from @mejrs

* refactor so cfg is applied to functions

* properly set cfgs

* fix clippy lints

* Apply @davidhewitt's suggestion

* deal with upstream deprecation of new_bound
@alex
Contributor Author

alex commented Aug 2, 2024

I added a new checkbox for "Adopt new owned-reference-friendly C APIs". If we have a list of all the ones we need, I can make those sub-checkboxes.

@ngoldbaum
Contributor

If we have a list of all the ones we need, I can make those sub-checkboxes.

I think PyDict_GetItemRef and PyList_GetItemRef are the most important ones. There's a listing of the remaining ones in the HOWTO guide for free-threading in the CPython docs: https://docs.python.org/3.13/howto/free-threading-extensions.html#borrowed-references

I also had a chat with @davidhewitt today and in addition to GILOnceCell, he pointed to GILProtected and PyCell as spots that make strong assumptions about the GIL.

Our first idea is to make GILProtected a no-op on Py_GIL_DISABLED builds, although we'll need to see if that has major fallout on user code. As a first pass, PyCell needs atomic increments and decrements to avoid data races in the free-threaded build.
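
As a rough idea of what atomic increments and decrements for borrow tracking could look like, here is a sketch using only the standard library. `BorrowFlag`, `MUT_BORROWED`, and the method names are hypothetical and are not PyO3's actual borrow checker; the encoding (a shared-borrow count, with `usize::MAX` meaning "mutably borrowed") is an assumption made for the example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Sentinel meaning "exclusively (mutably) borrowed" (assumed encoding).
const MUT_BORROWED: usize = usize::MAX;

/// Hypothetical thread-safe borrow flag: 0 = unused, n = n shared borrows.
struct BorrowFlag(AtomicUsize);

impl BorrowFlag {
    fn new() -> Self {
        BorrowFlag(AtomicUsize::new(0))
    }

    /// Try to take a shared borrow; fails while a mutable borrow is active.
    fn try_borrow(&self) -> bool {
        let mut current = self.0.load(Ordering::Relaxed);
        loop {
            // Refuse if mutably borrowed, or if incrementing would
            // collide with the sentinel value.
            if current >= MUT_BORROWED - 1 {
                return false;
            }
            match self.0.compare_exchange_weak(
                current,
                current + 1,
                Ordering::Acquire,
                Ordering::Relaxed,
            ) {
                Ok(_) => return true,
                // Another thread changed the flag; retry with the new value.
                Err(actual) => current = actual,
            }
        }
    }

    fn release_borrow(&self) {
        self.0.fetch_sub(1, Ordering::Release);
    }

    /// Try to take the exclusive borrow; only succeeds from the unused state.
    fn try_borrow_mut(&self) -> bool {
        self.0
            .compare_exchange(0, MUT_BORROWED, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    fn release_borrow_mut(&self) {
        self.0.store(0, Ordering::Release);
    }
}
```

The compare-and-swap loop is what replaces the plain read-modify-write that is only safe when a single thread can run at a time.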

In addition we need to use pyo3_ffi_check to update the assumptions the FFI bindings make about the free-threaded ABI. Doing this should hopefully fix some of the most egregious build issues. I am planning to work on that step next week.

I looked at adding a failing CI job, but that won't work right now because if you run the tests on a free-threaded build with --no-fail-fast, the tests will eventually deadlock. As far as I can see, there's no option in cargo to automatically kill hung tests that run longer than a configurable timeout. You can do it manually pretty easily with a macro, but I'd prefer not to do that and instead hold off on adding CI until the tests are runnable without deadlocks. Hopefully that won't be too long :)
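
For reference, the manual timeout approach mentioned above might look roughly like the sketch below. `run_with_timeout` and `assert_finishes!` are hypothetical names, not part of PyO3 or cargo. Note the limitation: the helper stops waiting for a hung body, but the stuck thread is leaked rather than killed.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `f` on a helper thread and wait at most `timeout` for its result.
/// Returns `None` if the body did not finish in time (or panicked).
/// The helper thread is not killed on timeout; it is simply abandoned.
fn run_with_timeout<T, F>(timeout: Duration, f: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore send errors: the receiver may have given up already.
        let _ = tx.send(f());
    });
    rx.recv_timeout(timeout).ok()
}

/// Hypothetical convenience macro: panic if the block takes too long.
macro_rules! assert_finishes {
    ($timeout_ms:expr, $body:block) => {
        run_with_timeout(Duration::from_millis($timeout_ms), move || $body)
            .expect("test body timed out (possible deadlock)")
    };
}
```

A test body would then be wrapped as `assert_finishes!(5000, { ... })`, turning a deadlock into an ordinary test failure.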

@alex
Contributor Author

alex commented Aug 2, 2024

Ok, updated the tracking list.

PyCell no longer exists, should that be something else?

@ngoldbaum
Contributor

I'm still learning the library and it shows...

I think David meant Bound in our discussion and he just got mixed up with the old API after a long day. I'll let him clarify.

@alex
Contributor Author

alex commented Aug 2, 2024

My guess is it's a reference to PyClassBorrowChecker, which manages the various borrow flags. But I'll let David say for sure.

@davidhewitt
Member

My mistake, yes, we removed the PyCell name with the GIL Refs API 👍

@ngoldbaum
Contributor

See #4421 which updates the FFI bindings for the free-threaded build. That's enough to get the tests to pass without deadlocking, so I added a CI config as well.

@alex
Contributor Author

alex commented Aug 6, 2024

Added a checkbox for ffi-check being green.

@alex
Contributor Author

alex commented Sep 13, 2024

Defaulting to frozen, and then requiring users to pick a strategy for mutability seems good to me.

In my experience, a lot of types don't have particularly useful concurrent semantics, and while a lock can make them safe, it can't make it sensible.

FWIW my concern is much less that locks have overhead, it's that rw-locks are full of performance cliffs and priority inversion issues (https://blog.nelhage.com/post/rwlock-contention/).
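
To make "pick a strategy for mutability" concrete, here is a small sketch in plain Rust of two such strategies on types that only expose `&self` methods. `FrozenCounter`, `LockedStats`, and `hammer` are hypothetical names; nothing here uses PyO3's actual attributes or types.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Strategy 1: no lock at all, a single atomic field mutated via `&self`.
struct FrozenCounter {
    hits: AtomicU64,
}

impl FrozenCounter {
    fn record(&self) {
        self.hits.fetch_add(1, Ordering::Relaxed);
    }
    fn total(&self) -> u64 {
        self.hits.load(Ordering::Relaxed)
    }
}

/// Strategy 2: fields that must change together sit behind one plain Mutex.
struct LockedStats {
    inner: Mutex<(u64, u64)>, // (count, sum)
}

impl LockedStats {
    fn record(&self, value: u64) {
        let mut guard = self.inner.lock().unwrap();
        guard.0 += 1;
        guard.1 += value;
    }
    fn snapshot(&self) -> (u64, u64) {
        *self.inner.lock().unwrap()
    }
}

/// Update both types from several threads and return the final state.
fn hammer(threads: u64, per_thread: u64) -> (u64, (u64, u64)) {
    let counter = Arc::new(FrozenCounter { hits: AtomicU64::new(0) });
    let stats = Arc::new(LockedStats { inner: Mutex::new((0, 0)) });
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            let stats = Arc::clone(&stats);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    counter.record();
                    stats.record(2);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    (counter.total(), stats.snapshot())
}
```

Neither variant needs a reader-writer lock; the author of the type chooses the synchronization that actually matches its semantics.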

@davidhewitt davidhewitt mentioned this issue Sep 13, 2024
3 tasks
davidhewitt pushed a commit that referenced this issue Sep 15, 2024
@davidhewitt
Member

davidhewitt commented Sep 18, 2024

As I commented in #4265, I want to suggest that a realistic goal here for the 0.23 release is to transition from "unusable and unsound" to "unusable but sound for testing".

Given that we have breaking changes already landing in 0.23, I think it is much better to delay the breaking changes needed above in #4265 (comment) to 0.24 (this is primarily immutable by default with the opt-in).

Thus for 0.23 I'd like to merge the PRs which make PyO3 sound under free threading but don't change existing semantics, even if those semantics are unhelpful under free threading. This is (as far as I'm aware)

I think that would be good enough to unblock the downstream ecosystem to start testing their own projects under free threading. The focus of the 0.24 release can then be the breaking changes needed to make the semantics of PyO3 actually sensible under free threading.

@mejrs
Member

mejrs commented Sep 18, 2024

That makes sense to me, given that this is clearly documented of course. How are other libraries like Boost and Pybind11 handling this? (If they are even handling nogil at all...)

@ngoldbaum
Contributor

I've been experimenting with cargo-stress again, and started seeing a rare test failure that I don't understand:

stdout:

running 13 tests
test sequence_length ... ok
test test_generic_list_set ... ok
test test_delitem ... ok
test test_contains ... ok
test test_setitem ... ok
test test_option_list_get ... ok
test test_repeat ... ok
test test_inplace_repeat ... ok
test test_concat ... ok
test test_inplace_concat ... ok
test sequence_is_not_mapping ... ok
test test_generic_list_get ... ok
test test_getitem ... FAILED

failures:

---- test_getitem stdout ----
thread 'test_getitem' panicked at /Users/goldbaum/Documents/pyo3/src/impl_/pyclass/lazy_type_object.rs:55:13:
failed to create type object for ByteSequence
stack backtrace:
   0: rust_begin_unwind
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:665:5
   1: core::panicking::panic_fmt
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/panicking.rs:74:14
   2: pyo3::impl_::pyclass::lazy_type_object::LazyTypeObject<T>::get_or_init::{{closure}}
             at ./src/impl_/pyclass/lazy_type_object.rs:55:13
   3: core::result::Result<T,E>::unwrap_or_else
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/result.rs:1456:23
   4: pyo3::impl_::pyclass::lazy_type_object::LazyTypeObject<T>::get_or_init
             at ./src/impl_/pyclass/lazy_type_object.rs:53:9
   5: <test_sequence::ByteSequence as pyo3::type_object::PyTypeInfo>::type_object_raw
             at ./tests/test_sequence.rs:12:1
   6: pyo3::type_object::PyTypeInfo::type_object
             at ./src/type_object.rs:57:13
   7: pyo3::marker::Python::get_type
             at ./src/marker.rs:693:9
   8: test_sequence::seq_dict
             at ./tests/test_sequence.rs:110:31
   9: test_sequence::test_getitem::{{closure}}
             at ./tests/test_sequence.rs:119:17
  10: pyo3::marker::Python::with_gil
             at ./src/marker.rs:409:9
  11: test_sequence::test_getitem
             at ./tests/test_sequence.rs:118:5
  12: test_sequence::test_getitem::{{closure}}
             at ./tests/test_sequence.rs:117:18
  13: core::ops::function::FnOnce::call_once
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/ops/function.rs:250:5
  14: core::ops::function::FnOnce::call_once
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


failures:
    test_getitem

test result: FAILED. 12 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.23s


stderr:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte

The above exception was the direct cause of the following exception:

RuntimeError: An error occurred while initializing class ByteSequence
status code: 101

(here I'm using a slightly modified version of cargo-stress to get better error reporting, see danhhz/cargo-stress#6).

I don't understand how we could be getting a UTF-8 decode error while defining a class. This could be a sign of some thread-safety issue in LazyTypeObject, I guess?

@davidhewitt
Member

@ngoldbaum
Contributor

Ohhh, I get it, it's because LazyTypeObject depends on GILOnceCell. I bet if we finish #4512 this will go away.
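
For comparison, this is roughly what a one-time initializer that stays correct without a GIL looks like, sketched with the standard library's `std::sync::OnceLock` (stable since Rust 1.70). `init_once_across_threads` is a hypothetical helper, and the sketch assumes the failure above really is a race during initialization; it does not reproduce `LazyTypeObject` or `GILOnceCell`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};
use std::thread;

/// Race several threads on one cell and report how many times the
/// initializer actually ran, plus the stored value.
fn init_once_across_threads(threads: usize) -> (usize, String) {
    let cell: Arc<OnceLock<String>> = Arc::new(OnceLock::new());
    let init_calls = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let cell = Arc::clone(&cell);
            let init_calls = Arc::clone(&init_calls);
            thread::spawn(move || {
                // Every thread asks for the value; OnceLock guarantees that
                // only one initializer closure is ever executed.
                let name = cell
                    .get_or_init(|| {
                        init_calls.fetch_add(1, Ordering::SeqCst);
                        String::from("ByteSequence")
                    })
                    .clone();
                name
            })
        })
        .collect();

    for handle in handles {
        // Each thread observed the fully initialized value.
        assert_eq!(handle.join().unwrap(), "ByteSequence");
    }
    (init_calls.load(Ordering::SeqCst), cell.get().cloned().unwrap())
}
```

The key property is that threads which lose the race block until the value is fully written, so nobody can observe a half-initialized object.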

@davidhewitt
Member

Good point, I will try to fix that PR up next time I type a line of code!

@davidhewitt
Member

Ah, just realised #4584 - I've added a checkbox for PyErrState::normalize

@aniketmaurya

aniketmaurya commented Oct 8, 2024

really looking forward to this!

@davidhewitt
Member

#4298 might imply append_to_inittab! is not thread safe, though I think given this is already broken I don't mind missing that fix from 0.23.

@davidhewitt
Member

As per python/cpython#125243 (comment) I've added a bullet to the top for datetime bindings.

@alex
Contributor Author

alex commented Oct 25, 2024

Are we good for PyDict_Next to be checked off?

@ngoldbaum
Contributor

Yes, we should be. The concern about thread safety in the datetime bindings should also be fixed by #4623.

@alex
Contributor Author

alex commented Oct 25, 2024

God help us, we're close.

Has anyone here done a top to bottom perusal of pyo3 for other potential concerns?

@ngoldbaum
Contributor

Has anyone here done a top to bottom perusal of pyo3 for other potential concerns?

Not me. I did just grep the codebase for Cell and UnsafeCell uses, and I think all the remaining ones are safe? The fact that PyAny uses an UnsafeCell to wrap a PyObject * pointer is OK, right?

I'm also hoping that finishing up #4566 will elucidate any remaining issues in tests and docs. My plan is to help finish that up next week.

@alex
Contributor Author

alex commented Oct 25, 2024 via email

@gwenhe

gwenhe commented Nov 11, 2024

My ORM depends on pydantic.
This is very necessary. When can it be solved?

@davidhewitt
Member

See #4651. We will be releasing initial support very soon; some challenges at home have delayed the work. Thank you for your patience.

7 participants