Do not tolerate exceptions while storing the serialized network #946
base: dev
Conversation
Thanks for the fix. In local testing, I confirmed that an exception thrown here by a worker makes its way to a sensible error message shown in the analysis panel of the UI. Because we were just talking about obscure Kryo serialization errors, I was wondering whether there are harmless errors that occur in production. If so, and we want to tolerate them, should we restrict the catch block to…
I think that tolerating exceptions here was intentional, as indicated in the comment "Tolerate exceptions here as we do have a network to return, we just failed to cache it." We may have been encountering exceptions when copying files to S3, for example, which could even have been due to problems with AWS that were out of our control. I think this used to happen fairly regularly for reasons we could never pin down. In such a situation, the machine would have been able to continue doing calculations but just couldn't cache the network file for reuse by other machines in the future. If we're no longer experiencing regular, inexplicable S3 transfer problems, then maybe we don't need to tolerate any exceptions here and it's better to fail fast and very visibly. I'll give it some more thought tomorrow and provide a review.
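If we do keep tolerating failures here, one way to address the question above about restricting the catch block is to tolerate only the storage failures we actually expect, along these lines (a minimal sketch with stand-in names, not the actual r5 code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Hypothetical sketch of a narrowed catch block: only I/O problems (for
 * example a failed copy to remote storage) are tolerated and logged, while
 * anything else (serialization bugs, null pointers, ...) still propagates.
 * The names below are stand-ins for illustration, not the actual r5 code.
 */
public class NarrowedCatchSketch {

    /** Build the network, then try to cache it, tolerating only I/O failures. */
    public byte[] buildAndTryToCache(Path cacheDir) {
        byte[] network = buildNetwork(); // stand-in for the expensive build step
        try {
            Files.write(cacheDir.resolve("network.dat"), network);
        } catch (IOException e) {
            // Tolerate only storage failures: we still have a network to return,
            // we just failed to cache it for reuse by other machines.
            System.err.println("Failed to cache network, continuing: " + e);
        }
        return network;
    }

    private byte[] buildNetwork() {
        return new byte[] {1, 2, 3}; // placeholder for the real network build
    }
}
```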
Looking at this again, my sense is that priorities are just different when testing a new feature vs. normal usage. In testing it is more convenient if the error is immediately visible and fails fast, aborting the calling code. In production, however, it's more convenient if arcane details like network or cloud-provider glitches do not prevent the user from making use of the TransportNetwork that was just successfully built, potentially after a long wait.

If the problem was most recently encountered in testing, then the fail-fast behavior might seem better. But do we want end users to experience that behavior? Do we have a way to establish whether S3 transfer problems have become rare or nonexistent? If so, the question can be sidestepped: we can switch to the new fail-fast behavior without imposing it on end users.

Probably the ideal is for all methods to employ a more sophisticated return type that can accumulate errors and warnings alongside any return value, so no part of the system has to choose between returning a value and failing (they can always do both at once). But that's a major change that would provide benefit only if applied broadly throughout the system.
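For illustration, a very rough sketch of the kind of return type that could carry warnings alongside a value (the class and method names here are invented, nothing like this exists in the codebase):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical "value plus accumulated warnings" return type. A method could
 * both return the freshly built network and report that the caching step
 * failed, instead of having to choose between the two. All names here are
 * invented for illustration.
 */
public class ResultWithWarnings<T> {

    public final T value; // may be null if no value could be produced
    public final List<String> warnings = new ArrayList<>();

    public ResultWithWarnings(T value) {
        this.value = value;
    }

    /** Attach a warning and return this, so warnings can be chained fluently. */
    public ResultWithWarnings<T> warn(String message) {
        warnings.add(message);
        return this;
    }

    public boolean hasWarnings() {
        return !warnings.isEmpty();
    }
}
```

A caller could then write something like new ResultWithWarnings<>(network).warn("Failed to cache network: " + e) and leave it to the HTTP layer to decide how to surface the warnings.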
Some S3 download problems may still be lurking (such as #832), but I don't recall specific instances of upload problems. I just thought of another option: tolerate the exceptions but log them to Slack.
If AWS S3 uploads are failing, that seems like a serious issue for our production system that we should be aware of, whatever the underlying cause. Logging to Slack is certainly an option, but do we handle any other exceptions or errors that way?
Remove try/catch block that did not allow exceptions to bubble up to the user during network building and saving.
Force-pushed from 8d6bb4c to 519cbb7
I agree S3 failures are serious and we need to know about them. But if I remember correctly we could never quite pin down why we saw recurring and seemingly random S3 transfer failures, so for a time we just had to provide as smooth an experience as we could until we could track down the reason. I don't think I've seen any S3 transfer failures recently, so maybe AWS has resolved whatever the underlying weakness was. I'd be comfortable with letting these exceptions through as full-fledged errors, making them visible to both us and the end user. Then if the problem arises again we can devote some energy to finding the root of it.

About logging to Slack: we currently log any ErrorEvent that does not contain the words "Token has expired" or "Authorization header". These events are produced in response to any AnalysisServerException, IOException, FileUploadException, NullPointerException, or RuntimeException that is caught by the HTTP server, and in all these cases it looks like we also return an HTTP response with a 4xx status code. So it would be a new approach to catch an exception and fire an ErrorEvent without letting the exception bubble up to the HTTP server and produce a 4xx response. But I actually think that's a good approach for this kind of failure: one that could needlessly block the end user from getting work done, but that operations staff need to know about so they can attempt to fix it.
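A sketch of that "tolerate but report" approach, with a plain Consumer standing in for firing an ErrorEvent that ends up in Slack (all names below are illustrative, not the actual classes in this repository):

```java
import java.util.function.Consumer;

/**
 * Sketch of "tolerate but report": a caching failure is sent to an error
 * reporter (standing in for firing an ErrorEvent that ends up in Slack)
 * but is not rethrown, so the freshly built network is still returned.
 * All names here are illustrative, not the actual r5 classes.
 */
public class TolerateButReportSketch {

    private final Consumer<Exception> errorReporter;

    public TolerateButReportSketch(Consumer<Exception> errorReporter) {
        this.errorReporter = errorReporter;
    }

    public byte[] buildAndCache() {
        byte[] network = new byte[] {1, 2, 3}; // placeholder for the real build
        try {
            cacheNetwork(network);
        } catch (Exception e) {
            // Report the failure (e.g. so operations staff see it in Slack),
            // but do not rethrow: the caller still gets the network.
            errorReporter.accept(e);
        }
        return network;
    }

    private void cacheNetwork(byte[] network) throws Exception {
        throw new Exception("simulated S3 upload failure"); // force the tolerated path
    }
}
```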
fileStorage.moveIntoStorage() failed recently and the reason for the failure was hidden until we logged into the server and checked the logs. This PR removes the try/catch block so that exceptions here bubble up to the user during network building and saving.
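Roughly, the change is from swallowing the exception to letting it propagate (a simplified sketch, not the literal diff; the variable names around the call are assumptions):

```java
// Before (simplified): the failure was only visible in the server logs.
try {
    fileStorage.moveIntoStorage(fileStorageKey, serializedNetworkFile);
} catch (Exception e) {
    // "Tolerate exceptions here as we do have a network to return,
    // we just failed to cache it."
    LOG.error("Failed to move serialized network into storage.", e);
}

// After (simplified): the call is no longer wrapped, so any failure
// propagates to the caller and is reported in the analysis UI.
fileStorage.moveIntoStorage(fileStorageKey, serializedNetworkFile);
```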