Make LoggingThread recover on all errors [RHELDST-332] #248
Conversation
Previously, LoggingThread would recover from any XML-RPC fault, but would stop when any other type of exception was encountered. That is a problem: it means the worker permanently gives up sending messages to the hub whenever any kind of temporary issue occurs (e.g. a brief network disruption between worker and hub). The task underneath may continue running for hours, with all log messages being discarded.

Given the nature of this thread, it makes more sense to attempt to recover from all kinds of errors, as we should try hard not to lose log messages from a task.

Fixes release-engineering#60

(This commit is a reimplementation of release-engineering#106)
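The "retry on any exception" behaviour described above can be sketched roughly as follows. This is a simplified illustration, not kobo's actual LoggingThread: the `hub` object, the queue handling, and the `retry_delay` parameter are hypothetical stand-ins; only `upload_task_log` and the catch-all retry come from the description above.

```python
import threading
import time


class LoggingThread(threading.Thread):
    """Sketch of a log-upload thread that retries on *any* exception."""

    def __init__(self, hub, task_id, retry_delay=5.0):
        super().__init__(daemon=True)
        self._hub = hub
        self._task_id = task_id
        self._retry_delay = retry_delay
        self._queue = []
        self._lock = threading.Lock()
        self._running = True

    def write(self, data):
        with self._lock:
            self._queue.append(data)

    def stop(self):
        self._running = False

    def run(self):
        # Keep draining the queue even after stop() so queued messages
        # are not lost on shutdown.
        while self._running or self._queue:
            with self._lock:
                pending, self._queue = self._queue, []
            if not pending:
                time.sleep(0.1)
                continue
            try:
                self._hub.upload_task_log(self._task_id, "".join(pending))
            except Exception:
                # Previously only XML-RPC faults were retried here; any
                # other error killed the thread and later log messages
                # were silently dropped.  Retrying on all exceptions
                # keeps messages queued until the hub is reachable again.
                with self._lock:
                    self._queue = pending + self._queue
                time.sleep(self._retry_delay)
```

A hub that is temporarily unreachable just delays the upload; the messages stay queued and are resent once the call succeeds.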
Unfortunately, I've found out that this change increases the probability that the logging thread never terminates. If the
@rohanpm @crungehottman Do you have any ideas on how to resolve #248 (comment) without effectively reverting the changes introduced in this PR?
In release-engineering#169, fatal LoggingThread errors were logged to the worker's local log file before exiting. In release-engineering#248, a more drastic measure was taken: all exceptions were retried indefinitely, and the ability to write exceptions to the worker's local log file was removed. That approach could prevent the LoggingThread from ever terminating when it encounters a fatal error.

This commit combines the two approaches, bailing out and exiting only after we've identified a persistent fatal error. We distinguish a fatal error from a temporary one by simply retrying the `upload_task_log` method for a defined interval (set by the LoggingThread's `_timeout` attribute). If the method still has not succeeded by the time the timeout is reached, we consider the error fatal. At that point, we attempt to write the error to the worker's local log file instead, and raise the exception.

Note that the timeout can be configured via the `KOBO_LOGGING_THREAD_TIMEOUT` environment variable.
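The retry-until-deadline logic described above could look roughly like this. It is a hedged sketch, not the actual implementation: `upload`, `fallback_log`, and `retry_delay` are hypothetical stand-ins, while `KOBO_LOGGING_THREAD_TIMEOUT` and the timeout semantics come from the commit message above.

```python
import os
import time


def upload_with_timeout(upload, data, fallback_log, timeout=None, retry_delay=1.0):
    """Retry `upload(data)` until a deadline; afterwards treat the error as fatal.

    On a persistent failure, the error is written via `fallback_log`
    (standing in for the worker's local log file) and then re-raised so
    the thread terminates instead of looping forever.
    """
    if timeout is None:
        # Default assumed here for illustration; the commit only says the
        # timeout is configurable through this environment variable.
        timeout = float(os.environ.get("KOBO_LOGGING_THREAD_TIMEOUT", 600))
    deadline = time.monotonic() + timeout
    while True:
        try:
            upload(data)
            return
        except Exception as exc:
            if time.monotonic() >= deadline:
                # Persistent fatal error: record it locally, then re-raise.
                fallback_log("fatal LoggingThread error: %r" % (exc,))
                raise
            # Temporary error: keep retrying until the deadline.
            time.sleep(retry_delay)
```

A transient outage that clears before the deadline is absorbed silently; only an error that persists for the whole interval escapes the loop.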
@lzaoral Because