[5.0] don't run EOS VM OC's monitor compile task callback when socket being dtored #1827

spoonincode · 2023-10-26T16:57:41Z

This may resolve #1823 & #1794 based on core dump analysis of an oc-monitor crash.

A single OC monitor process is created at launch of a parent process that plans to use OC (nodeos, a unit test application, etc). Critically, a single io_context is used in this single monitor process. As code_caches are created in the parent process, a compile_monitor_session is created in the monitor process for each of those code_cache instances.

When a code_cache needs to compile a contract, it sends a message to its out-of-process compile_monitor_session. This will then call off to another process -- the OC trampoline -- but the most important action to note here is that the compile_monitor_session records in its current_compiles a socket on which to receive a reply from the trampoline,

leap/libraries/chain/webassembly/runtimes/eos-vm-oc/compile_monitor.cpp

Line 110 in 81a9d5c

current_compiles.emplace_front(code_id, std::move(response_socket));

and then goes off to do an async_wait() on that socket,

leap/libraries/chain/webassembly/runtimes/eos-vm-oc/compile_monitor.cpp

Line 116 in 81a9d5c

    
           socket.async_wait(local::datagram_protocol::socket::wait_read, [this, current_compile_it](auto ec) {

Consider that current_compiles holds some outstanding sockets because compiles are ongoing. Now consider that the code_cache is destroyed. When that happens, read_message_from_nodeos() (which is a misnomer: it's really read_message_from_codecache) will fire a signal,

leap/libraries/chain/webassembly/runtimes/eos-vm-oc/compile_monitor.cpp

Lines 57 to 61 in 81a9d5c

    
           void read_message_from_nodeos() {
              _nodeos_instance_socket.async_wait(local::datagram_protocol::socket::wait_read, [this](auto ec) {
                 if(ec) {
                    connection_dead_signal();
                    return;

This signal is an indication that the compile_monitor_session can be destroyed (since its paired code_cache has been destroyed). So.. it's destroyed,

leap/libraries/chain/webassembly/runtimes/eos-vm-oc/compile_monitor.cpp

Lines 211 to 215 in 81a9d5c

    
           _compile_sessions.front().connection_dead_signal.connect([&, it = _compile_sessions.begin()]() {
              ctx.post([&]() {
                 _compile_sessions.erase(it);
              });
           });
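The cleanup pattern in that excerpt (capture a list iterator when the session is created, then defer the erase instead of erasing from inside the signal) can be sketched with standard-library types only. This is a rough model, not the real code: `session_model`, `monitor_model`, and the `posted` task queue are hypothetical stand-ins for compile_monitor_session, the monitor, and the io_context.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <list>
#include <string>
#include <utility>

// Hypothetical stand-in for a compile_monitor_session; only the signal matters here.
struct session_model {
   std::string name;
   std::function<void()> connection_dead_signal; // stand-in for the real signal type
};

// Hypothetical stand-in for the monitor: a task queue plays the role of io_context.
struct monitor_model {
   std::deque<std::function<void()>> posted; // tasks "posted" for later execution
   std::list<session_model> sessions;

   void add_session(std::string name) {
      sessions.push_front(session_model{std::move(name), {}});
      // std::list iterators stay valid until their own element is erased,
      // so capturing one for later use is safe.
      sessions.front().connection_dead_signal = [this, it = sessions.begin()]() {
         // Do not erase here: the signal fires from inside the session's own code.
         posted.push_back([this, it]() { sessions.erase(it); });
      };
   }

   // Run everything that was posted, like one pass of an event loop.
   void run_posted() {
      while(!posted.empty()) {
         auto task = std::move(posted.front());
         posted.pop_front();
         task();
      }
   }
};

// Returns the number of sessions left after session "a" reports a dead connection.
std::size_t demo_deferred_erase() {
   monitor_model m;
   m.add_session("a");
   m.add_session("b");
   // "a" was pushed first, so it now sits at the back of the list.
   m.sessions.back().connection_dead_signal();
   assert(m.sessions.size() == 2); // still alive: the erase was only posted
   m.run_posted();
   return m.sessions.size();
}
```

Calling `demo_deferred_erase()` leaves one session in the list. The point of the deferral is that the session is not destroyed while its own member function is still on the stack.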

As compile_monitor_session is destroyed, its current_compiles is the first member to be dtored. As part of this destruction the local::datagram_protocol::socket will be destroyed, ergo it will be cancel()ed, ergo the async_wait() handler will be called back with boost::asio::error::operation_aborted. That's not strictly a problem, but things fall apart quickly here because the async_wait() handler then accesses an iterator into the currently-being-destroyed current_compiles, along with potentially using the currently-being-destroyed socket.
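Why current_compiles goes first follows from a C++ rule: non-static data members are destroyed in reverse order of declaration. A small sketch of that rule follows. The struct layout is an assumption made for illustration (the real class has more members, and their declaration order is not shown in this PR); `noisy` and `session_layout_model` are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records its name into a shared log when destroyed.
struct noisy {
   std::string name;
   std::vector<std::string>* log;
   ~noisy() { log->push_back(name); }
};

// Hypothetical layout: the member names echo the PR text, but the
// declaration order used here is an assumption for illustration.
struct session_layout_model {
   noisy nodeos_instance_socket;
   noisy current_compiles; // declared last, so destroyed first
};

// Returns the names in the order the members were destroyed.
std::vector<std::string> demo_destruction_order() {
   std::vector<std::string> log;
   {
      session_layout_model s{{"_nodeos_instance_socket", &log}, {"current_compiles", &log}};
   } // s goes out of scope here; members are destroyed in reverse declaration order
   return log;
}
```

In this model the returned log lists `current_compiles` before `_nodeos_instance_socket`, so anything the session still needs while its sockets are being torn down may already be in a partially destroyed state.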

So, bail out of async_wait()'s callback without doing anything if there is an error.
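A minimal standard-library model of that guard is sketched below. It is not the real Asio code: `socket_model` is a hypothetical stand-in whose destructor queues the pending handler with `std::errc::operation_canceled` (standing in for boost::asio::error::operation_aborted), and `task_queue` stands in for the io_context. The `aborted_seen` counter exists only so the demo can observe the callbacks; the actual fix simply returns. Whether the real handler runs during or after the session's destruction, the rule the sketch illustrates is the same: on error, touch neither `this` nor the captured iterator.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <list>
#include <system_error>
#include <utility>

using task_queue = std::deque<std::function<void()>>;

// Hypothetical stand-in for a socket: destroying it with a wait outstanding
// queues the handler with a "cancelled" error code.
struct socket_model {
   task_queue* queue;
   std::function<void(std::error_code)> pending;

   explicit socket_model(task_queue* q) : queue(q) {}
   socket_model(const socket_model&) = delete;

   void async_wait(std::function<void(std::error_code)> h) { pending = std::move(h); }

   ~socket_model() {
      if(pending)
         queue->push_back([h = std::move(pending)]() {
            h(std::make_error_code(std::errc::operation_canceled));
         });
   }
};

// Hypothetical stand-in for compile_monitor_session.
struct session_model {
   int* aborted_seen; // demo-only counter that outlives the session
   std::list<socket_model> current_compiles;

   explicit session_model(int* seen) : aborted_seen(seen) {}

   void start_compile(task_queue& q) {
      current_compiles.emplace_front(&q);
      auto it = current_compiles.begin();
      it->async_wait([this, it, seen = aborted_seen](std::error_code ec) {
         if(ec) {
            ++*seen; // demo only: uses the copied pointer, not the session
            return;  // the session may be gone: touch neither `this` nor `it`
         }
         // Success path (not exercised in this demo): only here is it safe
         // to use the session and the iterator.
         current_compiles.erase(it);
      });
   }
};

// Returns how many cancelled callbacks were delivered after the session died.
int demo_guarded_handler() {
   task_queue queue;
   int aborted_seen = 0;
   {
      session_model s(&aborted_seen);
      s.start_compile(queue);
      s.start_compile(queue);
   } // session destroyed with two waits outstanding
   // The handlers run only now, after the session's memory is gone.
   while(!queue.empty()) {
      queue.front()();
      queue.pop_front();
   }
   return aborted_seen;
}
```

In this model `demo_guarded_handler()` reports two cancelled callbacks, both of which return before reaching the code that uses the destroyed session. Without the `if(ec)` check, the same callbacks would fall through to `current_compiles.erase(it)` on freed memory.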

Unfortunately, while all the above makes sense, I can't really explain why the failure doesn't happen more often with just nodeos. My hunch is that a crashing oc-monitor on nodeos shutdown is simply never noticed and has no negative side effects. It's only in these new tests, which expect oc-monitor to be long-lived (as designed), that something gets angry that oc-monitor has disappeared (because it crashed).

linh2931 · 2023-10-26T17:52:18Z

libraries/chain/webassembly/runtimes/eos-vm-oc/compile_monitor.cpp

+         // for now just consider any error as being due to cancellation at dtor time and completely bail out (there aren't many other
+         // potential errors for an asnyc_wait)
+         if(ec)
+            return;


Do you want to log this for potential debugging, or would that be too much noise?

No higher than debug if you do log it; don't think we need it.

Logging config won't be active in the monitor process since it's forked off before main() -- so debug won't be visible, and any error/warn/info messages won't honor the user's logging config either. So probably best to not log anything here.

don't run compile task callback when socket being dtored

62ec0b1

spoonincode linked an issue Oct 26, 2023 that may be closed by this pull request

Test Failure: failed to read response from monitor process in snapshot_scheduler_test #1794

Closed

heifner changed the base branch from main to release/5.0 October 26, 2023 17:14

heifner approved these changes Oct 26, 2023

linh2931 approved these changes Oct 26, 2023

spoonincode merged commit f371af5 into release/5.0 Oct 26, 2023
29 checks passed

spoonincode deleted the oc_monitor_skip_cb_on_dtor_5x branch October 26, 2023 18:04

spoonincode mentioned this pull request Oct 26, 2023

[5.0 -> main] don't run EOS VM OC's monitor compile task callback when socket being dtored #1830

Merged
