fail faster on errors reading CockroachDB listening-url file #2091
When starting CockroachDB, either as part of the test suite or via `omicron-dev db-run`, we wait for it to write out its listening URL to a known file path. This is asynchronous, and it's how we determine when CockroachDB is up, so we just have to poll on it. Naturally, if it takes too long, we give up and fail. In the meantime, if the file doesn't exist or exists but is incomplete, we want to just keep waiting.

The code today continues waiting if it runs into any error reading the file, not just ENOENT. For example, while I was debugging #1146, I put a `cockroach` executable in my PATH that ran the requested command under `pfexec dtrace -c`, which meant I was running it as root, but the test suite was still running as my normal user. As a result of either CockroachDB's explicit choice or else the umask in effect at the time, the listen URL file was created with permissions that prevented the test suite from reading it. Instead of failing immediately and telling me that, it waited for the full timeout period (30 seconds) and then just reported a generic timeout error (saying that CockroachDB hadn't started within the timeout, which wasn't actually true).

This PR changes the error handling here in the specific case that we get an error other than ENOENT when trying to read the file. In that case, we fail immediately (rather than waiting for the full timeout) and actually print the error.
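For illustration, the intended behavior can be sketched roughly as follows. This is not the code in this PR: `wait_for_listen_url`, `WaitError`, and the "a complete file ends with a newline" check are all assumptions made for the sketch, and it uses only the Rust standard library.

```rust
use std::fs;
use std::io::ErrorKind;
use std::path::Path;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical error type for this sketch (not the project's real type).
#[derive(Debug)]
enum WaitError {
    /// The file never contained a complete URL within the timeout.
    TimedOut,
    /// Reading the file failed with something other than "not found".
    Read(std::io::Error),
}

/// Polls `path` until it contains a complete line and returns that line.
/// Assumption: the file counts as "complete" once it contains a newline.
fn wait_for_listen_url(
    path: &Path,
    timeout: Duration,
    interval: Duration,
) -> Result<String, WaitError> {
    let deadline = Instant::now() + timeout;
    loop {
        match fs::read_to_string(path) {
            // A full line is present: treat the first line as the URL.
            Ok(contents) if contents.contains('\n') => {
                let first = contents.lines().next().unwrap_or("");
                return Ok(first.trim().to_string());
            }
            // The file exists but is incomplete: keep waiting.
            Ok(_) => {}
            // The file does not exist yet (ENOENT): keep waiting.
            Err(e) if e.kind() == ErrorKind::NotFound => {}
            // Any other error (for example, permission denied):
            // fail immediately and hand the error back to the caller.
            Err(e) => return Err(WaitError::Read(e)),
        }
        if Instant::now() >= deadline {
            return Err(WaitError::TimedOut);
        }
        sleep(interval);
    }
}
```

With this shape, a caller can tell "CockroachDB never wrote the file" (`TimedOut`) apart from "the file is there but we can't read it" (`Read`), and can report the underlying I/O error in the second case.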
I couldn't think of a good way to test this automatically. I tested it by hand by replicating the problem I had above. It now fails quickly and prints the underlying error instead of a generic timeout.
Update: I added an automated test, too.