WIP: Remove circular references from DataReaders #134

Open · wants to merge 1 commit into base: master
Conversation

willstott101 (Contributor) commented Aug 23, 2022

OK, so if we're currently unable to support circular references (read: correct object finalization during CPython's cyclic garbage collection, where collection order is unspecified), then at the very least we need to avoid creating reference cycles inside the library ourselves.

The only case of this I've found specifically is ReadConditions, and I wonder whether they really need to keep a reference to the DataReader. For now, in this PR I've made ReadConditions exist only within function scope, which (together with removing all circular references from my own code) fixes the segfaults and hangs I've been seeing.
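For illustration, here is a minimal sketch of the idea (not the library change itself): build the ReadCondition inside the function that needs it, so nothing keeps a long-lived back-reference to the reader. Names and signatures follow the cyclonedds.core API as I understand it; treat them as assumptions.

```python
from cyclonedds.core import InstanceState, ReadCondition, SampleState, ViewState

def read_new_samples(reader):
    # The condition only lives for the duration of this call, so no
    # DataReader <-> ReadCondition reference cycle outlives the function.
    condition = ReadCondition(
        reader, SampleState.NotRead | ViewState.Any | InstanceState.Any
    )
    return reader.read(condition=condition)
```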

Avoiding circular references is good practice anyway, but it's a shame the results are so explosive when they do exist, because they're so easy to introduce. In our case we want to expose a Python API built on cyclonedds to users, so relying on app code simply avoiding them is untenable.

I suspect it's actually a cyclonedds C bug that was causing the CI hangs, if dds_delete is meant to be idempotent and to support any destruction order.

FWIW, I found these loops quite manually: I added prints to Entity.__del__ that report when an object being destroyed is still present in the _entities dict, which reveals the classes involved in cycles.
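A rough sketch of that debugging trick (a hypothetical patch; it assumes Entity keeps a class-level _entities dict keyed by the entity's _ref handle, which may not match the library exactly):

```python
from cyclonedds.core import Entity

_original_del = Entity.__del__

def _debug_del(self):
    # If the object is still registered in _entities while being finalized,
    # it is likely being torn down as part of a reference cycle.
    if getattr(self, "_ref", None) in type(self)._entities:
        print(f"finalizing {type(self).__name__} while still in _entities")
    _original_del(self)

Entity.__del__ = _debug_del
```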

eboasson (Contributor)

@willstott101

I suspect it's actually a cyclonedds C bug that was causing the CI hangs, if dds_delete is meant to be idempotent and to support any destruction order.

I apparently wasn't paying attention carefully enough (thanks to @thijsmie doing such wonderful work), because this is something I'd like to know a little bit more about. dds_delete deletes the specified entity + all its children in the hierarchy + anything up towards the root (DDS_CYCLONEDDS_HANDLE) that happens to have been implicitly created (+ some odd complications around topics). If you call delete for all entities in some arbitrary order, you would expect some calls to return errors, but nothing worse than that. If it deadlocks, something is amiss ... with one exception ...

That exception is that listeners are called synchronously when some event occurs; the consequence is that a listener may be invoked while some locks are held, and hence some operations can deadlock when invoked from inside a listener. So if listener invocations occur in parallel, or if you call dds_delete from a listener, it can deadlock.

In my understanding, this saga started with listeners, and it seems plausible to me that you ran into this exception. But I'd like to be sure ...

willstott101 (Contributor, Author) commented Aug 24, 2022

Well, I'm a little confused, honestly. I've seen several different kinds of problems, including segfaults and hangs. My app only ever uses asyncio's call_soon_threadsafe in listeners (plus maybe a little conflation logic).
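Roughly this pattern, shown as a minimal sketch (the Listener subclass and the on_data_available signature follow the cyclonedds.core API as I understand it; treat them as assumptions):

```python
import asyncio
from cyclonedds.core import Listener

class DeferredListener(Listener):
    """Does no DDS work on the listener thread; only schedules a callback."""

    def __init__(self, loop: asyncio.AbstractEventLoop, on_data):
        super().__init__()
        self._loop = loop
        self._on_data = on_data

    def on_data_available(self, reader):
        # Hand off to the asyncio event loop instead of calling back into
        # cyclonedds from the listener thread.
        self._loop.call_soon_threadsafe(self._on_data, reader)
```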

The segfaults I've been seeing are always related to listeners and reference cycles. The hangs, on the other hand, are harder to track down, and I think they vary.

I have three small tests on a branch which produce two hangs and one segfault. All use the BuiltinDataReader for BuiltinTopicDcpsTopic, since it takes very little code to trigger events (just constructing/destructing Topic objects), but I have no particular reason to think the problems are directly related to the builtin topic. That said, the tests rely on the fact that there's a refcycle in the current implementation of BuiltinDataReader; refcycles can of course be created by users too.

See: master...willstott101:cyclonedds-python:repro/dds_delete_hang
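For context, here is a rough sketch of why BuiltinTopicDcpsTopic makes a convenient trigger (this is not the test code from that branch; module and class names follow cyclonedds-python as I understand it and assume a build with topic discovery enabled):

```python
from dataclasses import dataclass

from cyclonedds.builtin import BuiltinDataReader, BuiltinTopicDcpsTopic
from cyclonedds.domain import DomainParticipant
from cyclonedds.idl import IdlStruct
from cyclonedds.topic import Topic

@dataclass
class Dummy(IdlStruct):
    value: int

dp = DomainParticipant()
builtin_reader = BuiltinDataReader(dp, BuiltinTopicDcpsTopic)

# Creating and dropping a user Topic is enough to generate samples on the
# DCPSTopic builtin topic, so data-available events fire with very little code.
topic = Topic(dp, "DummyTopic", Dummy)
del topic

for sample in builtin_reader.take(N=10):
    print(sample)
```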

eboasson (Contributor)

@willstott101 I tried a few things: 0.10.1 (consistently?) crashes with your tests, but 0.10.1 plus this PR, with your tests combined, sometimes hangs. I haven't been paying close enough attention to be sure this is the combination I should have tried, but it is a deadlock that fits my expectations: one thread tries to invoke the Python listener on receiving data and blocks trying to obtain the GIL, and then another thread blocks trying to deliver data.

It is so easy to accidentally end up in this kind of thing that I think the Cyclone C code should add support for listener invocations on another thread, make that the default and keep the current behaviour as an opt-in for those cases where the synchronous nature of the invocation is really needed. That would fix this type of deadlock.

(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x000000019a429738 libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x000000019a461384 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
    frame #2: 0x000000019a45ecf8 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 248
    frame #3: 0x000000010726d5a0 libddsc.0.10.0.dylib`ddsrt_mutex_lock(mutex=0x000000012c907028) at sync.c:41:7
    frame #4: 0x000000010717168c libddsc.0.10.0.dylib`deliver_locally_allinsync(gv=0x000000014cb5e940, source_entity=0x000000012c906dc0, source_entity_locked=false, fastpath_rdary=0x000000012c907028, wrinfo=0x000000016b22afa8, ops=0x000000010729c900, vsourceinfo=0x000000016b22afd8) at ddsi_deliver_locally.c:259:5
    frame #5: 0x0000000107247de8 libddsc.0.10.0.dylib`deliver_locally(wr=0x000000012c906dc0, payload=0x000000014d0be2f0, tk=0x0000600002082660) at dds_write.c:194:8
    frame #6: 0x0000000107247d24 libddsc.0.10.0.dylib`dds_writecdr_local_orphan_impl(lowr=0x000000012c906dc0, d=0x000000014d0be2f0) at dds_write.c:640:3
    frame #7: 0x0000000107222ea8 libddsc.0.10.0.dylib`dds__builtin_write_topic(tpd=0x0000600002e890e0, timestamp=(v = 1661371350658530000), alive=true, vdomain=0x000000014cb5e600) at dds_builtin.c:361:3
    frame #8: 0x00000001071b0950 libddsc.0.10.0.dylib`builtintopic_write_topic(btif=0x000000014cb5e910, tpd=0x0000600002e890e0, timestamp=(v = 1661371350658530000), alive=true) at ddsi_builtin_topic_if.h:50:13
    frame #9: 0x00000001071bd928 libddsc.0.10.0.dylib`ddsi_new_topic(tp_out=0x0000600002e890d0, tpguid=0x0000600002e890bc, pp=0x000000014bf98d40, topic_name="test", sertype=0x0000600001180000, xqos=0x000000014d0bc3d0, is_builtin=false, new_topic_def=0x000000016b22b207) at ddsi_topic.c:111:5
    frame #10: 0x000000010723f4e4 libddsc.0.10.0.dylib`register_topic_type_for_discovery(gv=0x000000014cb5e940, pp=0x000000012c908ed0, ktp=0x0000600003588080, is_builtin=false, sertype=0x0000600001180000) at dds_topic.c:381:23
    frame #11: 0x000000010723ee1c libddsc.0.10.0.dylib`dds_create_topic_impl(participant=1172809280, name="test", allow_dcps=false, sertype=0x000000016b22b4a8, qos=0x0000000000000000, listener=0x0000000000000000, sedp_plist=0x0000000000000000, is_builtin=false) at dds_topic.c:576:30
    frame #12: 0x000000010723f5c0 libddsc.0.10.0.dylib`dds_create_topic_sertype(participant=1172809280, name="test", sertype=0x000000016b22b4a8, qos=0x0000000000000000, listener=0x0000000000000000, sedp_plist=0x0000000000000000) at dds_topic.c:602:10
    frame #13: 0x00000001060bbeb0 _clayer.cpython-39-darwin.so`ddspy_topic_create(self=<unavailable>, args=<unavailable>) at pysertype.c:978:11 [opt]
    frame #14: 0x000000010520241c Python`cfunction_call + 96
    frame #15: 0x00000001051b3314 Python`_PyObject_MakeTpCall + 132
    frame #16: 0x00000001052a8460 Python`_PyEval_EvalFrameDefault + 41608
    frame #17: 0x000000010529d1d8 Python`_PyEval_EvalCode + 452
...
    frame #171: 0x0000000105311c90 Python`Py_RunMain + 876
    frame #172: 0x0000000105313168 Python`Py_BytesMain + 40
    frame #173: 0x0000000104d3908c dyld`start + 520
  thread #2, name = 'gc'
    frame #0: 0x000000019a42a270 libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019a46483c libsystem_pthread.dylib`_pthread_cond_wait + 1236
    frame #2: 0x000000010529c0a4 Python`take_gil + 556
    frame #3: 0x00000001052f0cf4 Python`PyGILState_Ensure + 112
    frame #4: 0x0000000106edb448 _ctypes.cpython-39-darwin.so`_CallPythonObject + 64
    frame #5: 0x00000001a86fcf34 libffi.dylib`ffi_closure_SYSV_inner + 816
    frame #6: 0x00000001a86f41e8 libffi.dylib`ffi_closure_SYSV + 56
    frame #7: 0x000000010722397c libddsc.0.10.0.dylib`dds_reader_data_available_cb(rd=0x000000014d0bd7f0) at dds_reader.c:254:7
    frame #8: 0x000000010722be90 libddsc.0.10.0.dylib`dds_rhc_default_store(rhc_common=0x000000014d03bcd0, wrinfo=0x000000016b2bacf8, sample=0x000000014be09cf0, tk=0x0000600002082660) at dds_rhc_default.c:1755:7
    frame #9: 0x000000010716e990 libddsc.0.10.0.dylib`ddsi_rhc_store(rhc=0x000000014d03bcd0, wrinfo=0x000000016b2bacf8, sample=0x000000014be09cf0, tk=0x0000600002082660) at ddsi_rhc.h:66:10
    frame #10: 0x00000001071718b8 libddsc.0.10.0.dylib`deliver_locally_fastpath(gv=0x000000014cb5e940, source_entity=0x000000012c906dc0, source_entity_locked=false, fastpath_rdary=0x000000012c907028, wrinfo=0x000000016b2bacf8, ops=0x000000010729c900, vsourceinfo=0x000000016b2bad28) at ddsi_deliver_locally.c:238:17
    frame #11: 0x000000010717172c libddsc.0.10.0.dylib`deliver_locally_allinsync(gv=0x000000014cb5e940, source_entity=0x000000012c906dc0, source_entity_locked=false, fastpath_rdary=0x000000012c907028, wrinfo=0x000000016b2bacf8, ops=0x000000010729c900, vsourceinfo=0x000000016b2bad28) at ddsi_deliver_locally.c:264:14
    frame #12: 0x0000000107247de8 libddsc.0.10.0.dylib`deliver_locally(wr=0x000000012c906dc0, payload=0x000000014be09cf0, tk=0x0000600002082660) at dds_write.c:194:8
    frame #13: 0x0000000107247d24 libddsc.0.10.0.dylib`dds_writecdr_local_orphan_impl(lowr=0x000000012c906dc0, d=0x000000014be09cf0) at dds_write.c:640:3
    frame #14: 0x0000000107222ea8 libddsc.0.10.0.dylib`dds__builtin_write_topic(tpd=0x0000600002e847e0, timestamp=(v = 1661371350658477000), alive=false, vdomain=0x000000014cb5e600) at dds_builtin.c:361:3
    frame #15: 0x00000001071b0950 libddsc.0.10.0.dylib`builtintopic_write_topic(btif=0x000000014cb5e910, tpd=0x0000600002e847e0, timestamp=(v = 1661371350658477000), alive=false) at ddsi_builtin_topic_if.h:50:13
    frame #16: 0x00000001071bf53c libddsc.0.10.0.dylib`gc_delete_topic_definition(gcreq=0x000000014be07dc0) at ddsi_topic.c:204:3
    frame #17: 0x00000001071d5988 libddsc.0.10.0.dylib`gcreq_queue_thread(q=0x0000600001184000) at q_gc.c:183:9
    frame #18: 0x00000001071f9d2c libddsc.0.10.0.dylib`create_thread_wrapper(ptr=0x000000014c96fa80) at q_thread.c:257:24
    frame #19: 0x000000010726e7e8 libddsc.0.10.0.dylib`os_startRoutineWrapper(threadContext=0x00006000020853c0) at threads.c:175:17
    frame #20: 0x000000019a46426c libsystem_pthread.dylib`_pthread_start + 148
  thread #3, name = 'dq.builtins'
  thread #4, name = 'dq.user'
  thread #5, name = 'tev'
  thread #6, name = 'recv'
  thread #8, name = 'recvUC'
(lldb) kill
Process 57100 exited with status = 9 (0x00000009) 
(lldb) 

willstott101 (Contributor, Author)

Yes, I don't think it's currently possible to create listeners in this lib which don't take the GIL, so another solution would be great. Either queuing an event in asyncio or running listeners on another cyclonedds thread would work for me. It does make me wonder whether we could solve this by just releasing the GIL on more calls into cyclonedds, though that might be too delicate to get right.

thijsmie (Contributor)

Any ctypes-based call does release the GIL, and I do think we can prevent this specific deadlock by explicitly releasing the GIL around the dds_read/dds_take/dds_create_topic calls in pysertype.c.

I have found something in the Python C API docs called Asynchronous Notifications, which could be exactly what we need. I'll explore that after my holiday.
