
Make LoggingThread recover on all errors [RHELDST-332] #248

Conversation

crungehottman
Member

Previously, LoggingThread would recover from any XML-RPC fault, but would stop when any other type of exception was encountered. That is a problem: it means the worker permanently gives up sending messages to the hub whenever any kind of temporary issue occurs (e.g. a brief network disruption between worker and hub). The task underneath may continue running for hours, with all log messages being discarded.

Given the nature of this thread, it makes more sense to attempt to recover from all kinds of errors, as we should try hard not to lose log messages from a task.

Fixes #60

(This commit is a reimplementation of #106)
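
For illustration, here is a minimal sketch (not the actual kobo source) contrasting the old and new behavior of the send loop. The `upload_task_log` name and the unsent-data buffer come from this discussion; the exact hub call signature is an assumption.

```python
import xmlrpc.client


def send_once_old(hub, task_id, send_data):
    """Old behavior: only XML-RPC faults were tolerated; any other
    exception propagated and stopped the logging thread for good."""
    try:
        hub.upload_task_log(task_id, send_data)  # assumed signature
        return ""                                # buffer cleared on success
    except xmlrpc.client.Fault:
        return send_data                         # keep buffer, retry later


def send_once_new(hub, task_id, send_data):
    """New behavior (this PR): any exception is swallowed and the upload
    is retried on the next pass, so a temporary outage does not
    permanently stop log uploads for a running task."""
    try:
        hub.upload_task_log(task_id, send_data)  # assumed signature
        return ""
    except Exception:
        return send_data
```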

crungehottman merged commit 313d553 into release-engineering:master Feb 7, 2024
19 checks passed
@lzaoral
Contributor

lzaoral commented Mar 12, 2024

Unfortunately, I've found that this change increases the probability that the logging thread will never terminate. If the self._send_data variable is not empty and the try block always raises some Exception, the condition of the outer loop will always evaluate to True.
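
A sketch of the shape of the hazard (not the real run() code; the loop condition, hub call, and sleep interval here are assumptions):

```python
import time


def drain_loop(hub, task_id, send_data, running=lambda: False):
    # Hypothetical shutdown loop: keep going until the thread is stopped
    # *and* the buffer is empty.
    while running() or send_data:
        try:
            hub.upload_task_log(task_id, send_data)  # assumed hub call
            send_data = ""                           # cleared only on success
        except Exception:
            # With the blanket retry from this PR, a persistent failure
            # means send_data is never cleared, the condition above stays
            # True, and the loop never exits.
            time.sleep(1)
```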

@lzaoral
Contributor

lzaoral commented Mar 14, 2024

@rohanpm @crungehottman Do you have any ideas on how to resolve #248 (comment) without effectively reverting the changes introduced in this PR?

crungehottman added a commit to crungehottman/kobo that referenced this pull request Mar 14, 2024
In release-engineering#169,
fatal LoggingThread errors were logged to the worker's local
log file before exiting.

In release-engineering#248, a more
drastic measure was taken: all exceptions were indefinitely
retried and the ability to write exceptions to the worker's
local log file was removed. This approach could prevent the
LoggingThread from terminating when encountering a fatal error.

This commit combines the two approaches, backing out and
exiting only once we've identified a persistent, fatal error.

We identify a fatal (vs. temporary/non-fatal) LoggingThread
error by simply retrying the `upload_task_log` method over a
defined interval (the LoggingThread's `_timeout` attribute).

If the method continues to fail for the duration of that
interval (i.e., does not succeed by the time the timeout is
reached), we consider the error fatal. At that point, we
attempt to write the error to the worker's local log file
instead, and re-raise the exception.

Note that the timeout can be configured via the
`KOBO_LOGGING_THREAD_TIMEOUT` environment variable.
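
Roughly, the approach described above could look like the following sketch; the default timeout value, backoff interval, and local logger are assumptions, while `upload_task_log` and `KOBO_LOGGING_THREAD_TIMEOUT` come from the commit message.

```python
import logging
import os
import time

# Stand-in for the worker's local log file.
local_log = logging.getLogger("kobo.worker")


def upload_with_timeout(hub, task_id, send_data):
    """Retry upload_task_log until the timeout elapses; after that,
    treat the error as fatal: log it locally and re-raise."""
    timeout = float(os.environ.get("KOBO_LOGGING_THREAD_TIMEOUT", "600"))  # default is a guess
    deadline = time.monotonic() + timeout
    while True:
        try:
            hub.upload_task_log(task_id, send_data)  # assumed signature
            return
        except Exception as exc:
            if time.monotonic() >= deadline:
                try:
                    local_log.error("Fatal LoggingThread error: %r", exc)
                except Exception:
                    pass  # best effort; don't mask the original error
                raise
            time.sleep(5)  # arbitrary backoff between attempts
```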
@crungehottman
Member Author

> Unfortunately, I've found that this change increases the probability that the logging thread will never terminate. If the self._send_data variable is not empty and the try block always raises some Exception, the condition of the outer loop will always evaluate to True.

@lzaoral Because self._send_data is reset after a successful call to upload_task_log (in the try/except block), it would seem that the only thing preventing the self._send_data variable from being empty is persistent failure of upload_task_log. Would something like #253 help? It doesn't completely revert the changes introduced in this PR.

crungehottman added a commit to crungehottman/kobo that referenced this pull request Mar 15, 2024
crungehottman added a commit to crungehottman/kobo that referenced this pull request Mar 15, 2024
crungehottman added a commit to crungehottman/kobo that referenced this pull request Mar 18, 2024
Development

Successfully merging this pull request may close these issues: LoggingThread should recover from temporary outage on hub