Do not tolerate exceptions while storing the serialized network #946
base: dev
Conversation
Thanks for the fix. In local testing, I confirmed that an exception thrown here by a worker makes its way to a sensible error message shown in the analysis panel of the UI. Because we were just talking about obscure Kryo serialization errors, I was wondering whether there are harmless errors that occur in production. If so, and we want to tolerate them, should we restrict the catch block to…
I think that tolerating exceptions here was intentional, as indicated in the comment "Tolerate exceptions here as we do have a network to return, we just failed to cache it." We may have been encountering exceptions when copying files to S3, for example, which could even have been due to problems with AWS that were out of our control. I think this used to happen fairly regularly for reasons we could never pin down. In such a situation, the machine would have been able to continue doing calculations but just couldn't cache the network file for reuse by other machines in the future. If we're no longer experiencing regular, inexplicable S3 transfer problems, then maybe we don't need to tolerate any exceptions here and it's better to fail fast and very visibly. I'll give it some more thought tomorrow and provide a review.
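If we do keep tolerating failures here, one way to address the question above about restricting the catch block is to tolerate only the storage failures we actually expect, along these lines (a minimal sketch with stand-in names, not the actual r5 code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Hypothetical sketch of a narrowed catch block: only I/O problems (for
 * example a failed copy to remote storage) are tolerated and logged, while
 * anything else (serialization bugs, null pointers, ...) still propagates.
 * The names below are stand-ins for illustration, not the actual r5 code.
 */
public class NarrowedCatchSketch {

    /** Build the network, then try to cache it, tolerating only I/O failures. */
    public byte[] buildAndTryToCache(Path cacheDir) {
        byte[] network = buildNetwork(); // stand-in for the expensive build step
        try {
            Files.write(cacheDir.resolve("network.dat"), network);
        } catch (IOException e) {
            // Tolerate only storage failures: we still have a network to return,
            // we just failed to cache it for reuse by other machines.
            System.err.println("Failed to cache network, continuing: " + e);
        }
        return network;
    }

    private byte[] buildNetwork() {
        return new byte[] {1, 2, 3}; // placeholder for the real network build
    }
}
```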
Looking at this again, my sense is that priorities are just different when testing a new feature vs. normal usage. In testing it is more convenient if the error is immediately visible and fails fast, aborting the calling code. In production, however, it's more convenient if arcane details like network or cloud-provider glitches do not prevent the user from making use of the TransportNetwork that was just successfully built, potentially after a long wait.

If the problem was most recently encountered in testing, then the fail-fast behavior might seem better. But do we want end users to experience that behavior? Do we have a way to establish whether S3 transfer problems have become rare or nonexistent? If so, the question can be sidestepped: we can switch to the new fail-fast behavior without imposing it on end users.

Probably the ideal is for all methods to employ a more sophisticated return type that can accumulate errors and warnings alongside any return value, so no part of the system has to choose between returning a value and failing (they can always do both at once). But that's a major change that would provide benefit only if applied broadly throughout the system.
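For illustration, a very rough sketch of the kind of return type that could carry warnings alongside a value (the class and method names here are invented, nothing like this exists in the codebase):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical "value plus accumulated warnings" return type. A method could
 * both return the freshly built network and report that the caching step
 * failed, instead of having to choose between the two. All names here are
 * invented for illustration.
 */
public class ResultWithWarnings<T> {

    public final T value; // may be null if no value could be produced
    public final List<String> warnings = new ArrayList<>();

    public ResultWithWarnings(T value) {
        this.value = value;
    }

    /** Attach a warning and return this, so warnings can be chained fluently. */
    public ResultWithWarnings<T> warn(String message) {
        warnings.add(message);
        return this;
    }

    public boolean hasWarnings() {
        return !warnings.isEmpty();
    }
}
```

A caller could then write something like new ResultWithWarnings<>(network).warn("Failed to cache network: " + e) and leave it to the HTTP layer to decide how to surface the warnings.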
Some S3 download problems may still be lurking (such as #832), but I don't recall specific instances of upload problems. I just thought of another option: tolerate the exceptions but log them to Slack.
If AWS S3 uploads are failing, that seems like a serious issue for our production system that we should be aware of, whatever the underlying cause. Logging to Slack is certainly an option, but do we handle any other exceptions or errors that way?
Remove try/catch block that did not allow exceptions to bubble up to the user during network building and saving.
Force-pushed from 8d6bb4c to 519cbb7
I agree S3 failures are serious and we need to know about them. But if I remember correctly we could never quite pin down why we saw recurring and seemingly random S3 transfer failures, so for a time we just had to provide as smooth an experience as we could until we could track down the reason. I don't think I've seen any S3 transfer failures recently, so maybe AWS has resolved whatever the underlying weakness was. I'd be comfortable with letting these exceptions through as full-fledged errors, making them visible to both us and the end user. Then if the problem arises again we can devote some energy to finding the root of it.

About logging to Slack: we currently log any ErrorEvent that does not contain the words "Token has expired" or "Authorization header". These events are produced in response to any AnalysisServerException, IOException, FileUploadException, NullPointerException, or RuntimeException that is caught by the HTTP server, and in all these cases it looks like we also return an HTTP response with a 4xx status code. So it would be a new approach to catch an exception and fire an ErrorEvent without letting the exception bubble up to the HTTP server and produce a 4xx response. But I actually think that's a good approach for this kind of failure: one that could needlessly block the end user from getting work done, but that operations staff need to know about so they can attempt to fix it.
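A sketch of that "tolerate but report" approach, with a plain Consumer standing in for firing an ErrorEvent that ends up in Slack (all names below are illustrative, not the actual classes in this repository):

```java
import java.util.function.Consumer;

/**
 * Sketch of "tolerate but report": a caching failure is sent to an error
 * reporter (standing in for firing an ErrorEvent that ends up in Slack)
 * but is not rethrown, so the freshly built network is still returned.
 * All names here are illustrative, not the actual r5 classes.
 */
public class TolerateButReportSketch {

    private final Consumer<Exception> errorReporter;

    public TolerateButReportSketch(Consumer<Exception> errorReporter) {
        this.errorReporter = errorReporter;
    }

    public byte[] buildAndCache() {
        byte[] network = new byte[] {1, 2, 3}; // placeholder for the real build
        try {
            cacheNetwork(network);
        } catch (Exception e) {
            // Report the failure (e.g. so operations staff see it in Slack),
            // but do not rethrow: the caller still gets the network.
            errorReporter.accept(e);
        }
        return network;
    }

    private void cacheNetwork(byte[] network) throws Exception {
        throw new Exception("simulated S3 upload failure"); // force the tolerated path
    }
}
```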
fileStorage.moveIntoStorage() failed recently and the reason for the failure was hidden until we logged into the server and checked the logs. This PR removes the try/catch block so that exceptions here bubble up to the user during network building and saving.
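Roughly, the change is from swallowing the exception to letting it propagate (a simplified sketch, not the literal diff; the variable names around the call are assumptions):

```java
// Before (simplified): the failure was only visible in the server logs.
try {
    fileStorage.moveIntoStorage(fileStorageKey, serializedNetworkFile);
} catch (Exception e) {
    // "Tolerate exceptions here as we do have a network to return,
    // we just failed to cache it."
    LOG.error("Failed to move serialized network into storage.", e);
}

// After (simplified): the call is no longer wrapped, so any failure
// propagates to the caller and is reported in the analysis UI.
fileStorage.moveIntoStorage(fileStorageKey, serializedNetworkFile);
```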