WIP: Remove circular references from DataReaders #134

Open · wants to merge 1 commit into base: master
Conversation

willstott101 (Contributor) commented Aug 23, 2022

OK, so if we're currently unable to support circular references (read: correct object finalization during CPython's cyclic garbage collection, where collection order is unspecified), then at the very least we need to avoid creating reference cycles inside the library ourselves.

The only case of this I've found specifically is ReadConditions, and I wonder whether they really need to keep a reference to the DataReader. For now, in this PR I've made ReadConditions exist only within function scope, which (together with removing all circular references from my own code) fixes the segfaults and hangs I've been seeing.
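For illustration, here is a minimal sketch of the idea (not the library change itself): build the ReadCondition inside the function that needs it, so nothing keeps a long-lived back-reference to the reader. Names and signatures follow the cyclonedds.core API as I understand it; treat them as assumptions.

```python
from cyclonedds.core import InstanceState, ReadCondition, SampleState, ViewState

def read_new_samples(reader):
    # The condition only lives for the duration of this call, so no
    # DataReader <-> ReadCondition reference cycle outlives the function.
    condition = ReadCondition(
        reader, SampleState.NotRead | ViewState.Any | InstanceState.Any
    )
    return reader.read(condition=condition)
```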

Avoiding circular references is good practice anyway, but it's a shame the results are so explosive when they do exist, because they're so easy to introduce. In our case we want to expose a Python API built on cyclonedds to users, so relying on app code simply avoiding them is untenable.

I suspect it's actually a cyclonedds C bug that was causing the CI hangs, if dds_delete is meant to be idempotent and to support any destruction order.

FWIW, I found these loops quite manually: I added prints to Entity.__del__ that report when an object being destroyed is still present in the _entities dict, which reveals the classes involved in cycles.
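A rough sketch of that debugging trick (a hypothetical patch; it assumes Entity keeps a class-level _entities dict keyed by the entity's _ref handle, which may not match the library exactly):

```python
from cyclonedds.core import Entity

_original_del = Entity.__del__

def _debug_del(self):
    # If the object is still registered in _entities while being finalized,
    # it is likely being torn down as part of a reference cycle.
    if getattr(self, "_ref", None) in type(self)._entities:
        print(f"finalizing {type(self).__name__} while still in _entities")
    _original_del(self)

Entity.__del__ = _debug_del
```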

eboasson (Contributor)

@willstott101

I suspect it's actually a cyclonedds C bug that was causing the CI hangs, if dds_delete is meant to be idempotent and to support any destruction order.

I apparently wasn't paying attention carefully enough (thanks to @thijsmie doing such wonderful work), because this is something I'd like to know a little bit more about. dds_delete deletes the specified entity + all its children in the hierarchy + anything up towards the root (DDS_CYCLONEDDS_HANDLE) that happens to have been implicitly created (+ some odd complications around topics). If you call delete for all entities in some arbitrary order, you would expect some calls to return errors, but nothing worse than that. If it deadlocks, something is amiss ... with one exception ...

That exception is that listeners are called synchronously when some event occurs; the consequence is that a listener may be invoked while some locks are held, and hence some operations can deadlock when invoked from inside a listener. So if listener invocations occur in parallel, or if you call dds_delete from a listener, it can deadlock.

In my understanding, this saga started with listeners, and it seems plausible to me that you ran into this exception. But I'd like to be sure ...

willstott101 (Contributor, Author) commented Aug 24, 2022

Well, I'm a little confused, honestly. I've seen several different kinds of problems, including segfaults and hangs. My app only ever uses asyncio's call_soon_threadsafe in listeners (plus maybe a little conflation logic).
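Roughly this pattern, shown as a minimal sketch (the Listener subclass and the on_data_available signature follow the cyclonedds.core API as I understand it; treat them as assumptions):

```python
import asyncio
from cyclonedds.core import Listener

class DeferredListener(Listener):
    """Does no DDS work on the listener thread; only schedules a callback."""

    def __init__(self, loop: asyncio.AbstractEventLoop, on_data):
        super().__init__()
        self._loop = loop
        self._on_data = on_data

    def on_data_available(self, reader):
        # Hand off to the asyncio event loop instead of calling back into
        # cyclonedds from the listener thread.
        self._loop.call_soon_threadsafe(self._on_data, reader)
```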

The segfaults I've been seeing are always related to listeners and reference cycles. The hangs, on the other hand, are harder to track down, and I think they vary.

I have three small tests on a branch which produce two hangs and one segfault. All use the BuiltinDataReader for BuiltinTopicDcpsTopic, since it takes very little code to trigger events (just constructing/destructing Topic objects), but I have no particular reason to think the problems are directly related to the builtin topic. That said, the tests rely on the fact that there's a refcycle in the current implementation of BuiltinDataReader; refcycles can of course be created by users too.

See: master...willstott101:cyclonedds-python:repro/dds_delete_hang
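For context, here is a rough sketch of why BuiltinTopicDcpsTopic makes a convenient trigger (this is not the test code from that branch; module and class names follow cyclonedds-python as I understand it and assume a build with topic discovery enabled):

```python
from dataclasses import dataclass

from cyclonedds.builtin import BuiltinDataReader, BuiltinTopicDcpsTopic
from cyclonedds.domain import DomainParticipant
from cyclonedds.idl import IdlStruct
from cyclonedds.topic import Topic

@dataclass
class Dummy(IdlStruct):
    value: int

dp = DomainParticipant()
builtin_reader = BuiltinDataReader(dp, BuiltinTopicDcpsTopic)

# Creating and dropping a user Topic is enough to generate samples on the
# DCPSTopic builtin topic, so data-available events fire with very little code.
topic = Topic(dp, "DummyTopic", Dummy)
del topic

for sample in builtin_reader.take(N=10):
    print(sample)
```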

eboasson (Contributor)

@willstott101 I tried a few things: 0.10.1 (consistently?) crashes with your tests, but 0.10.1 plus this PR, with your tests combined, sometimes hangs. I haven't been paying close enough attention to be sure this is the combination I should have tried, but it is a deadlock that fits my expectations: one thread tries to invoke the Python listener on receiving data and blocks trying to obtain the GIL, and then another thread blocks trying to deliver data.

It is so easy to accidentally end up in this kind of thing that I think the Cyclone C code should add support for listener invocations on another thread, make that the default and keep the current behaviour as an opt-in for those cases where the synchronous nature of the invocation is really needed. That would fix this type of deadlock.

(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x000000019a429738 libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x000000019a461384 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
    frame #2: 0x000000019a45ecf8 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 248
    frame #3: 0x000000010726d5a0 libddsc.0.10.0.dylib`ddsrt_mutex_lock(mutex=0x000000012c907028) at sync.c:41:7
    frame #4: 0x000000010717168c libddsc.0.10.0.dylib`deliver_locally_allinsync(gv=0x000000014cb5e940, source_entity=0x000000012c906dc0, source_entity_locked=false, fastpath_rdary=0x000000012c907028, wrinfo=0x000000016b22afa8, ops=0x000000010729c900, vsourceinfo=0x000000016b22afd8) at ddsi_deliver_locally.c:259:5
    frame #5: 0x0000000107247de8 libddsc.0.10.0.dylib`deliver_locally(wr=0x000000012c906dc0, payload=0x000000014d0be2f0, tk=0x0000600002082660) at dds_write.c:194:8
    frame #6: 0x0000000107247d24 libddsc.0.10.0.dylib`dds_writecdr_local_orphan_impl(lowr=0x000000012c906dc0, d=0x000000014d0be2f0) at dds_write.c:640:3
    frame #7: 0x0000000107222ea8 libddsc.0.10.0.dylib`dds__builtin_write_topic(tpd=0x0000600002e890e0, timestamp=(v = 1661371350658530000), alive=true, vdomain=0x000000014cb5e600) at dds_builtin.c:361:3
    frame #8: 0x00000001071b0950 libddsc.0.10.0.dylib`builtintopic_write_topic(btif=0x000000014cb5e910, tpd=0x0000600002e890e0, timestamp=(v = 1661371350658530000), alive=true) at ddsi_builtin_topic_if.h:50:13
    frame #9: 0x00000001071bd928 libddsc.0.10.0.dylib`ddsi_new_topic(tp_out=0x0000600002e890d0, tpguid=0x0000600002e890bc, pp=0x000000014bf98d40, topic_name="test", sertype=0x0000600001180000, xqos=0x000000014d0bc3d0, is_builtin=false, new_topic_def=0x000000016b22b207) at ddsi_topic.c:111:5
    frame #10: 0x000000010723f4e4 libddsc.0.10.0.dylib`register_topic_type_for_discovery(gv=0x000000014cb5e940, pp=0x000000012c908ed0, ktp=0x0000600003588080, is_builtin=false, sertype=0x0000600001180000) at dds_topic.c:381:23
    frame #11: 0x000000010723ee1c libddsc.0.10.0.dylib`dds_create_topic_impl(participant=1172809280, name="test", allow_dcps=false, sertype=0x000000016b22b4a8, qos=0x0000000000000000, listener=0x0000000000000000, sedp_plist=0x0000000000000000, is_builtin=false) at dds_topic.c:576:30
    frame #12: 0x000000010723f5c0 libddsc.0.10.0.dylib`dds_create_topic_sertype(participant=1172809280, name="test", sertype=0x000000016b22b4a8, qos=0x0000000000000000, listener=0x0000000000000000, sedp_plist=0x0000000000000000) at dds_topic.c:602:10
    frame #13: 0x00000001060bbeb0 _clayer.cpython-39-darwin.so`ddspy_topic_create(self=<unavailable>, args=<unavailable>) at pysertype.c:978:11 [opt]
    frame #14: 0x000000010520241c Python`cfunction_call + 96
    frame #15: 0x00000001051b3314 Python`_PyObject_MakeTpCall + 132
    frame #16: 0x00000001052a8460 Python`_PyEval_EvalFrameDefault + 41608
    frame #17: 0x000000010529d1d8 Python`_PyEval_EvalCode + 452
...
    frame #171: 0x0000000105311c90 Python`Py_RunMain + 876
    frame #172: 0x0000000105313168 Python`Py_BytesMain + 40
    frame #173: 0x0000000104d3908c dyld`start + 520
  thread #2, name = 'gc'
    frame #0: 0x000000019a42a270 libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019a46483c libsystem_pthread.dylib`_pthread_cond_wait + 1236
    frame #2: 0x000000010529c0a4 Python`take_gil + 556
    frame #3: 0x00000001052f0cf4 Python`PyGILState_Ensure + 112
    frame #4: 0x0000000106edb448 _ctypes.cpython-39-darwin.so`_CallPythonObject + 64
    frame #5: 0x00000001a86fcf34 libffi.dylib`ffi_closure_SYSV_inner + 816
    frame #6: 0x00000001a86f41e8 libffi.dylib`ffi_closure_SYSV + 56
    frame #7: 0x000000010722397c libddsc.0.10.0.dylib`dds_reader_data_available_cb(rd=0x000000014d0bd7f0) at dds_reader.c:254:7
    frame #8: 0x000000010722be90 libddsc.0.10.0.dylib`dds_rhc_default_store(rhc_common=0x000000014d03bcd0, wrinfo=0x000000016b2bacf8, sample=0x000000014be09cf0, tk=0x0000600002082660) at dds_rhc_default.c:1755:7
    frame #9: 0x000000010716e990 libddsc.0.10.0.dylib`ddsi_rhc_store(rhc=0x000000014d03bcd0, wrinfo=0x000000016b2bacf8, sample=0x000000014be09cf0, tk=0x0000600002082660) at ddsi_rhc.h:66:10
    frame #10: 0x00000001071718b8 libddsc.0.10.0.dylib`deliver_locally_fastpath(gv=0x000000014cb5e940, source_entity=0x000000012c906dc0, source_entity_locked=false, fastpath_rdary=0x000000012c907028, wrinfo=0x000000016b2bacf8, ops=0x000000010729c900, vsourceinfo=0x000000016b2bad28) at ddsi_deliver_locally.c:238:17
    frame #11: 0x000000010717172c libddsc.0.10.0.dylib`deliver_locally_allinsync(gv=0x000000014cb5e940, source_entity=0x000000012c906dc0, source_entity_locked=false, fastpath_rdary=0x000000012c907028, wrinfo=0x000000016b2bacf8, ops=0x000000010729c900, vsourceinfo=0x000000016b2bad28) at ddsi_deliver_locally.c:264:14
    frame #12: 0x0000000107247de8 libddsc.0.10.0.dylib`deliver_locally(wr=0x000000012c906dc0, payload=0x000000014be09cf0, tk=0x0000600002082660) at dds_write.c:194:8
    frame #13: 0x0000000107247d24 libddsc.0.10.0.dylib`dds_writecdr_local_orphan_impl(lowr=0x000000012c906dc0, d=0x000000014be09cf0) at dds_write.c:640:3
    frame #14: 0x0000000107222ea8 libddsc.0.10.0.dylib`dds__builtin_write_topic(tpd=0x0000600002e847e0, timestamp=(v = 1661371350658477000), alive=false, vdomain=0x000000014cb5e600) at dds_builtin.c:361:3
    frame #15: 0x00000001071b0950 libddsc.0.10.0.dylib`builtintopic_write_topic(btif=0x000000014cb5e910, tpd=0x0000600002e847e0, timestamp=(v = 1661371350658477000), alive=false) at ddsi_builtin_topic_if.h:50:13
    frame #16: 0x00000001071bf53c libddsc.0.10.0.dylib`gc_delete_topic_definition(gcreq=0x000000014be07dc0) at ddsi_topic.c:204:3
    frame #17: 0x00000001071d5988 libddsc.0.10.0.dylib`gcreq_queue_thread(q=0x0000600001184000) at q_gc.c:183:9
    frame #18: 0x00000001071f9d2c libddsc.0.10.0.dylib`create_thread_wrapper(ptr=0x000000014c96fa80) at q_thread.c:257:24
    frame #19: 0x000000010726e7e8 libddsc.0.10.0.dylib`os_startRoutineWrapper(threadContext=0x00006000020853c0) at threads.c:175:17
    frame #20: 0x000000019a46426c libsystem_pthread.dylib`_pthread_start + 148
  thread #3, name = 'dq.builtins'
  thread #4, name = 'dq.user'
  thread #5, name = 'tev'
  thread #6, name = 'recv'
  thread #8, name = 'recvUC'
(lldb) kill
Process 57100 exited with status = 9 (0x00000009) 
(lldb) 

willstott101 (Contributor, Author)

Yes, I don't think it's currently possible to create listeners in this lib which don't take the GIL, so another solution would be great. Either queuing an event in asyncio or running listeners on another cyclonedds thread would work for me. It does make me wonder whether we could solve this by just releasing the GIL on more calls into cyclonedds, though that might be too delicate to get right.

thijsmie (Contributor)

Any ctypes-based call does release the GIL, and I do think we can prevent this specific deadlock by explicitly releasing the GIL around the dds_read/dds_take/dds_create_topic calls in pysertype.c.

I have found something in the Python C API docs called Asynchronous Notifications, which could be exactly what we need. I'll explore that after my holiday.
