[5.0] don't run EOS VM OC's monitor compile task callback when socket being dtored #1827
This may resolve #1823 & #1794 based on core dump analysis of an oc-monitor crash.

A single OC monitor process is created at launch of a parent process which plans to use OC (nodeos, a unit test application, etc). Critically, a single io_context is used in this single monitor process. As code_caches are created on the parent process, compile_monitor_sessions are created in the monitor process for each of those code_cache instances.

When a code_cache needs to compile a contract, it sends a message to its out-of-process compile_monitor_session. This will then call off to another process, the OC trampoline, but the most important action to note here is that the compile_monitor_session notes in its current_compiles a socket to receive a reply on from the trampoline (leap/libraries/chain/webassembly/runtimes/eos-vm-oc/compile_monitor.cpp, line 110 in 81a9d5c), and then goes off to do an async_wait() on that socket (leap/libraries/chain/webassembly/runtimes/eos-vm-oc/compile_monitor.cpp, line 116 in 81a9d5c).
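
The bookkeeping described above can be sketched with the standard library alone. This is only a model, not the code in compile_monitor.cpp: `session_model`, `pending_compile`, `reply_fd` and `code_id` are hypothetical names, and the use of a `std::list` is an assumption (the description only says the handler holds an iterator into current_compiles). It shows why holding an iterator works while the session is alive: `std::list` iterators stay valid when other entries are added or erased.

```cpp
#include <cassert>
#include <functional>
#include <iterator>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one in-flight compile. The real entry holds a
// socket; an int placeholder is used here to keep the sketch stdlib-only.
struct pending_compile {
    int reply_fd;
    std::string code_id;
};

struct session_model {
    std::list<pending_compile> current_compiles;  // assumed container type
    std::vector<std::string> finished;

    // Record the pending compile, then hand back the callback to run once the
    // reply arrives. The callback keeps an iterator to its own list entry.
    std::function<void()> start_compile(int reply_fd, std::string code_id) {
        current_compiles.push_back(pending_compile{reply_fd, std::move(code_id)});
        auto it = std::prev(current_compiles.end());
        return [this, it] {
            finished.push_back(it->code_id);
            current_compiles.erase(it);  // only this entry's iterator is invalidated
        };
    }
};
```

Each callback is only safe to run while the `session_model` it captured is still alive, which is exactly the assumption that breaks during teardown.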

Consider that current_compiles holds some outstanding sockets because compiles are ongoing. Now consider that code_cache is destroyed. When that happens, read_message_from_nodeos() (which is a misnomer: it's really read_message_from_codecache) will indicate a signal (leap/libraries/chain/webassembly/runtimes/eos-vm-oc/compile_monitor.cpp, lines 57 to 61 in 81a9d5c). This signal is an indication that the compile_monitor_session can be destroyed (since its code_cache pairing has been destroyed). So it is destroyed (leap/libraries/chain/webassembly/runtimes/eos-vm-oc/compile_monitor.cpp, lines 211 to 215 in 81a9d5c).

As compile_monitor_session is destroyed, its current_compiles is first to be dtored. As part of this destruction the local::datagram_protocol::socket will be destroyed, ergo it will be cancel()ed, ergo it will make a callback to the async_wait() handler with boost::asio::error::operation_aborted. That's not strictly a problem, but things fall apart quickly here because the async_wait() handler then accesses an iterator to the currently-being-destroyed current_compiles, along with potentially using the currently-being-destroyed socket.

So, bail out of async_wait()'s callback without doing anything if there is an error.

Unfortunately, while all the above makes sense, I can't really explain why the failure doesn't happen more often with just nodeos. My hunch is that a crashing oc-monitor on nodeos shutdown is simply never noticed and has no negative side effects. It's only in these new tests, which expect oc-monitor to be long lived (as designed), that something gets angry that oc-monitor has disappeared (because it crashed).
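
The bail-on-error guard can be modeled without Boost. This is a hedged sketch, not the actual patch: `toy_executor`, `toy_socket` and `toy_session` are hypothetical stand-ins for the io_context, the local::datagram_protocol::socket and compile_monitor_session, and `std::errc::operation_canceled` stands in for boost::asio::error::operation_aborted. The toy queues the aborted handler and runs it after teardown, which is an assumption about ordering; the guard is needed either way.

```cpp
#include <cassert>
#include <functional>
#include <iterator>
#include <list>
#include <system_error>
#include <utility>
#include <vector>

// Hypothetical io_context stand-in: completion handlers are queued, run later.
struct toy_executor {
    std::vector<std::function<void()>> ready;
    void run() {
        std::vector<std::function<void()>> batch;
        batch.swap(ready);
        for (auto& fn : batch) fn();
    }
};

// Hypothetical socket stand-in with at most one outstanding wait.
struct toy_socket {
    using handler = std::function<void(std::error_code)>;
    toy_executor& ex;
    handler waiter;

    explicit toy_socket(toy_executor& e) : ex(e) {}
    toy_socket(const toy_socket&) = delete;

    void async_wait(handler h) { waiter = std::move(h); }

    // Queue the outstanding handler (if any) with the given result.
    void finish(std::error_code ec) {
        if (!waiter) return;
        handler h = std::move(waiter);
        waiter = nullptr;
        ex.ready.push_back([h, ec] { h(ec); });
    }

    // Destruction cancels the wait: the handler still runs, but with an error.
    ~toy_socket() { finish(std::make_error_code(std::errc::operation_canceled)); }
};

struct toy_session {
    toy_executor& ex;
    int& replies_handled;  // lives outside the session so callers can read it
    std::list<toy_socket> current_compiles;

    toy_session(toy_executor& e, int& counter) : ex(e), replies_handled(counter) {}

    void start_compile() {
        current_compiles.emplace_back(ex);
        auto it = std::prev(current_compiles.end());
        it->async_wait([this, it](std::error_code ec) {
            if (ec) return;  // the guard: touch neither `this` nor `it` on error
            ++replies_handled;
            current_compiles.erase(it);
        });
    }
};

// Session torn down while a compile is still outstanding.
int replies_after_teardown() {
    toy_executor ex;
    int handled = 0;
    {
        toy_session s(ex, handled);
        s.start_compile();
    }          // the socket destructor queues the aborted handler here
    ex.run();  // the handler runs after the session is gone and bails at once
    return handled;
}

// Reply arrives normally while the session is alive.
int replies_after_success() {
    toy_executor ex;
    int handled = 0;
    toy_session s(ex, handled);
    s.start_compile();
    s.current_compiles.front().finish({});
    ex.run();
    return handled;
}
```

Without the `if (ec) return;` line, the teardown path in this model would dereference `this` and `it` after the session and its list are gone, which is the same class of use-after-destroy described above.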