tl;dr: If cockroach receives SIGHUP too early during initialization, it can crash. I encountered this while running the test suite in a loop (even under nohup!). I don't expect we'll do anything about this but I wanted to record it in case anybody else hits it.
While running the omicron test suite in a loop overnight, I came back to my system and found that my VPN connection had terminated. Since I had run the loop under nohup, I didn't think much of it. When I got logged back in, I found that the test suite had failed with:
failures:
---- db::datastore::test::test_session_methods stdout ----
log file: /home/dap/omicron-cockroachdb/tmpdir/omicron_nexus-49637a4102530ffa-test_session_methods.23068.44.log
note: configured to log to "/home/dap/omicron-cockroachdb/tmpdir/omicron_nexus-49637a4102530ffa-test_session_methods.23068.44.log"
thread 'db::datastore::test::test_session_methods' panicked at 'failed to start CockroachDB: cockroach unexpectedly terminated by signal 6 (see error output above)',
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
failures:
db::datastore::test::test_session_methods
test result: FAILED. 92 passed; 1 failed; 0 ignored; 0 measured; 68 filtered out; finished in 124.97s
error: test failed, to rerun pass `-p omicron-nexus --lib`
I looked at the cockroachdb_stderr file in the tmp directory and found:
(there are more goroutine stacks but they're irrelevant)
My first clue was that this message appeared on stderr directly, not in the stderr log file: it happened before CockroachDB had switched to using its log file. I found cockroachdb/cockroach#84638, where knz explained exactly what was going on: cockroach took a SIGHUP that caused it to try to log something before it had initialized the logging subsystem.
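To make that failure mode concrete, here's a hypothetical Go sketch of my own (the names and structure are invented, not CockroachDB's actual code): a SIGHUP handler is registered early in startup, so a signal that arrives before the logger exists takes the process down.

```go
// Hypothetical sketch of the failure mode: the SIGHUP handler is wired up
// before the logging subsystem is initialized, so an early signal touches
// a logger that doesn't exist yet. The unsynchronized access to `logger`
// deliberately mirrors the init-order race being described.
package main

import (
	"os"
	"os/signal"
	"syscall"
	"time"
)

type fileLogger struct{ f *os.File }

// logf dereferences its receiver, so calling it on a nil *fileLogger panics.
func (l *fileLogger) logf(msg string) { l.f.WriteString(msg) }

var logger *fileLogger // set during initialization; nil until then

func main() {
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGHUP) // handler installed early in startup
	go func() {
		for range c {
			// nil-pointer panic if initialization hasn't finished yet
			logger.logf("received SIGHUP; reloading config\n")
		}
	}()

	time.Sleep(2 * time.Second) // stand-in for the rest of startup work
	logger = &fileLogger{f: os.Stderr}
	select {} // run forever; SIGHUP is now handled harmlessly
}
```

A SIGHUP delivered during the sleep kills the process with a nil-pointer panic; one delivered after initialization just logs, which matches the "getting lucky" behavior described at the end of this report.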
The remaining mystery was: where did the SIGHUP come from? I wouldn't normally be that interested in that, but I had been in the process of trying to verify the fixes for illumos#15254 and illumos#15367 and wanted to be sure that the panic wasn't somehow the result of an OS issue. (In particular, our understanding of both issues is that they can cause memory that should be zero'd to be non-zero, and that can absolutely produce messages of the form "this thing that should have been uninitialized was initialized", like the addtimer called with initialized timer message mentioned in #1146.)

Omicron doesn't appear to use SIGHUP at all. And I ran this under nohup. But I saw in the output of last that one of my login sessions ended during the same minute that this problem was reported. And I found that nohup only sets the disposition for SIGHUP to SIG_IGN and then exec's what you gave it. That makes its immediate child immune to SIGHUP, and children of that process too, unless any of them installs a signal handler for SIGHUP -- and the Go runtime does install a handler for SIGHUP (see the sketch below).
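Here's a minimal sketch showing that mechanism (my own illustration, not from cockroach): started under nohup, the program inherits SIGHUP ignored, but the moment it asks to be notified of SIGHUP, the Go runtime installs a real handler and the nohup protection is gone.

```go
// sighup-demo.go (hypothetical): shows why nohup's protection is shallow
// for Go programs. nohup sets SIGHUP's disposition to SIG_IGN and then
// exec's the child; that ignored disposition is inherited across
// fork/exec, but signal.Notify makes the Go runtime install its own
// SIGHUP handler, replacing the inherited SIG_IGN.
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	fmt.Println("pid:", os.Getpid())
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGHUP) // undoes nohup's SIG_IGN for this process
	<-c
	fmt.Println("got SIGHUP despite nohup")
}
```

Build it, run nohup ./sighup-demo &, and kill -HUP on that pid: the signal is delivered anyway, and psig on the process shows the handler installed -- the same thing I observed on the cockroach process.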
I reproduced what I thought happened by:
1. From a clean slate, in an ssh session over the engineering VPN, started a test run in the background (./run.sh &).
2. tail -f'd the output file. This is important -- output must be going to the ssh connection when the next step happens, or the server will never notice that there's a network problem.
3. Disconnected from the VPN.
4. Hit enter on the client and waited a few seconds for the Mac to decide the TCP connection was dead. (If you don't do this, the client may never learn that the TCP connection was broken, since I haven't configured ssh-level or tcp-level keep-alives. And if I didn't wait for this, but then reconnected to the VPN, the TCP session would have resumed intact.)
5. Reconnected to the VPN.
6. ssh'd back to the system and confirmed that everything was still running. A few seconds later, the previous TCP connection was torn down -- and in doing this, I accidentally reproduced the bug. (I had tried this a few times and only reproduced the panic once. I wasn't actually trying to reproduce the panic; I just wanted to see what signals were sent when this happened.)
Before starting this, I had traced SIGHUP signals being sent with:
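(The exact invocation didn't survive in this copy of the issue; on illumos, a DTrace proc-provider one-liner along these lines shows each SIGHUP send, with sender and recipient:)

```
dtrace -n 'proc:::signal-send /args[2] == SIGHUP/ {
    printf("%s (pid %d) -> %s (pid %d)", execname, pid,
        args[1]->pr_fname, args[1]->pr_pid);
}'
```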
That confirmed that bash sends SIGHUP to all of these processes. I had also checked with psig that the Go runtime had installed its SIGHUP handler.
So all of this behavior now makes sense: I had some VPN disconnect, my shell session ended, the login shell sent SIGHUP to cockroach, causing an early exit and a test failure.
Now the only question is: I'm pretty sure I've done this before (run ./run.sh in the background and either logged out or got disconnected from the VPN) and come back without seeing this panic. I think I must just have been getting lucky. If the SIGHUP doesn't hit cockroach before it has initialized its logs, it just triggers a spurious config file reload and everything barrels on fine. It might also have been fine whenever I wasn't tail'ing the log file, because then sshd might never have noticed the TCP problem. (In that case, some other test failure would eventually have brought it all down.)