
Make LoggingThread recover on all errors [RHELDST-332] #248

Conversation

crungehottman
Member

Previously, LoggingThread would recover from any XML-RPC fault, but would stop when any other type of exception was encountered. That is a problem: it means the worker permanently gives up sending messages to the hub whenever any kind of temporary issue occurs (e.g. a brief network disruption between worker and hub). The task underneath may continue running for hours, with all log messages being discarded.

Given the nature of this thread, it makes more sense to attempt to recover from all kinds of errors, as we should try hard not to lose log messages from a task.

Fixes #60

(This commit is a reimplementation of #106)
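
For illustration, here is a minimal sketch (not the actual kobo source) contrasting the old and new behavior of the send loop. The `upload_task_log` name and the unsent-data buffer come from this discussion; the exact hub call signature is an assumption.

```python
import xmlrpc.client


def send_once_old(hub, task_id, send_data):
    """Old behavior: only XML-RPC faults were tolerated; any other
    exception propagated and stopped the logging thread for good."""
    try:
        hub.upload_task_log(task_id, send_data)  # assumed signature
        return ""                                # buffer cleared on success
    except xmlrpc.client.Fault:
        return send_data                         # keep buffer, retry later


def send_once_new(hub, task_id, send_data):
    """New behavior (this PR): any exception is swallowed and the upload
    is retried on the next pass, so a temporary outage does not
    permanently stop log uploads for a running task."""
    try:
        hub.upload_task_log(task_id, send_data)  # assumed signature
        return ""
    except Exception:
        return send_data
```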

crungehottman merged commit 313d553 into release-engineering:master Feb 7, 2024
19 checks passed
@lzaoral
Contributor

lzaoral commented Mar 12, 2024

Unfortunately, I've found that this change increases the probability that the logging thread will never terminate. If the self._send_data variable is not empty and the try block always raises some Exception, the condition of the outer loop will always evaluate to True.
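
A sketch of the shape of the hazard (not the real run() code; the loop condition, hub call, and sleep interval here are assumptions):

```python
import time


def drain_loop(hub, task_id, send_data, running=lambda: False):
    # Hypothetical shutdown loop: keep going until the thread is stopped
    # *and* the buffer is empty.
    while running() or send_data:
        try:
            hub.upload_task_log(task_id, send_data)  # assumed hub call
            send_data = ""                           # cleared only on success
        except Exception:
            # With the blanket retry from this PR, a persistent failure
            # means send_data is never cleared, the condition above stays
            # True, and the loop never exits.
            time.sleep(1)
```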

@lzaoral
Contributor

lzaoral commented Mar 14, 2024

@rohanpm @crungehottman Do you have any ideas on how to resolve #248 (comment) without effectively reverting the changes introduced in this PR?

crungehottman added a commit to crungehottman/kobo that referenced this pull request Mar 14, 2024
In release-engineering#169,
fatal LoggingThread errors were logged to the worker's local
log file before exiting.

In release-engineering#248, a more
drastic measure was taken: all exceptions were indefinitely
retried and the ability to write exceptions to the worker's
local log file was removed. This approach could prevent the
LoggingThread from terminating when encountering a fatal error.

This commit combines the two approaches, backing out and
exiting only once we've identified a persistent, fatal error.

We identify a fatal (vs. temporary/non-fatal) LoggingThread
error by simply retrying the `upload_task_log` method over a
defined interval (the LoggingThread's `_timeout` attribute).

If the method continues to fail for the duration of that
interval (i.e., does not succeed by the time the timeout is
reached), we consider the error fatal. At that point, we
attempt to write the error to the worker's local log file
instead, and re-raise the exception.

Note that the timeout can be configured via the
`KOBO_LOGGING_THREAD_TIMEOUT` environment variable.
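
Roughly, the approach described above could look like the following sketch; the default timeout value, backoff interval, and local logger are assumptions, while `upload_task_log` and `KOBO_LOGGING_THREAD_TIMEOUT` come from the commit message.

```python
import logging
import os
import time

# Stand-in for the worker's local log file.
local_log = logging.getLogger("kobo.worker")


def upload_with_timeout(hub, task_id, send_data):
    """Retry upload_task_log until the timeout elapses; after that,
    treat the error as fatal: log it locally and re-raise."""
    timeout = float(os.environ.get("KOBO_LOGGING_THREAD_TIMEOUT", "600"))  # default is a guess
    deadline = time.monotonic() + timeout
    while True:
        try:
            hub.upload_task_log(task_id, send_data)  # assumed signature
            return
        except Exception as exc:
            if time.monotonic() >= deadline:
                try:
                    local_log.error("Fatal LoggingThread error: %r", exc)
                except Exception:
                    pass  # best effort; don't mask the original error
                raise
            time.sleep(5)  # arbitrary backoff between attempts
```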
@crungehottman
Member Author

> Unfortunately, I've found that this change increases the probability that the logging thread will never terminate. If the self._send_data variable is not empty and the try block always raises some Exception, the condition of the outer loop will always evaluate to True.

@lzaoral Because self._send_data is reset after a successful call to upload_task_log (in the try/except block), it would seem that the only thing preventing the self._send_data variable from being empty is persistent failure of upload_task_log. Would something like #253 help? It doesn't completely revert the changes introduced in this PR.

crungehottman added a commit to crungehottman/kobo that referenced this pull request Mar 15, 2024
crungehottman added a commit to crungehottman/kobo that referenced this pull request Mar 15, 2024
crungehottman added a commit to crungehottman/kobo that referenced this pull request Mar 18, 2024
Development

Successfully merging this pull request may close these issues: LoggingThread should recover from temporary outage on hub