
worker: fix deadlock when LoggingThread wrote into its own Queue #257

Merged

Conversation

lzaoral
Contributor

@lzaoral lzaoral commented Mar 25, 2024

If self._hub.upload_task_log() called self._queue.put(), it would cause a deadlock because:

  1. self._queue uses locks that are not reentrant.
  2. put() blocks if the Queue is already full.
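
A minimal sketch of the failure mode, assuming a bounded queue whose only consumer is the LoggingThread itself (illustrative names, not the actual kobo code):

    import queue
    import threading

    log_queue = queue.Queue(maxsize=2)      # bounded, like the LoggingThread queue

    def upload(chunk):
        # hypothetical failure path: the upload reports an error by writing
        # back into the same queue (e.g. via a print() redirected to the logger)
        log_queue.put("upload failed: %r" % chunk)   # blocks while the queue is full

    def logging_thread():
        while True:
            chunk = log_queue.get()          # the only consumer of log_queue
            upload(chunk)                    # if this put() blocks, nothing ever
                                             # drains the queue again -> deadlock

    threading.Thread(target=logging_thread, daemon=True).start()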

@siteshwar
Contributor

Such failures were seen with firefox builds on the staging instance of OpenScanHub. I do not have logs from the previous failures, but I have triggered another build to reproduce this problem. It takes around 4-5 hours to reproduce this issue.

@rohanpm
Member

rohanpm commented Mar 25, 2024

I'm a little confused about exactly what conditions this occurs under. I don't suppose you have a backtrace from the deadlock case, i.e. how exactly did we arrive at write?

I'm asking because the first thing that comes to mind is a logger attached to sys.stdout or stderr that feeds back into the LoggingThread; if such a logger is activated from within the LoggingThread, it can get stuck waiting for space in the queue forever.

But in this change, you've also added a logging statement for this case: self._logger.log_error("Error in LoggingThread: %r", data). I'm reading through the code and trying to understand exactly how this logger is initialized, and it looks like, at least in some cases, it will be attached to stderr. I believe it's the same logger which appears in this snippet from kobo/worker/main.py:

    if foreground and tm._logger is not None:
        kobo.log.add_stderr_logger(tm._logger)

But if so, will that lead to infinite recursion? It seems like that log statement can bring us back into write, which will again fall into the elif block and log the same message again, and so on.
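
Roughly the feedback loop I have in mind, as a simplified sketch (illustrative names, not the exact kobo classes):

    import logging
    import sys

    class LoggingIO:
        def __init__(self, logger):
            self._logger = logger

        def write(self, data):
            # error branch similar to the one added in this change; if the
            # logger's StreamHandler writes to sys.stderr, and sys.stderr is
            # this very object, the call re-enters write()
            self._logger.error("Error in LoggingThread: %r", data)

    logger = logging.getLogger("worker")
    sys.stderr = LoggingIO(logger)                        # stderr -> write()
    logger.addHandler(logging.StreamHandler(sys.stderr))  # handler -> LoggingIO
    # any sys.stderr.write() now recurses until Python raises RecursionError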

@kdudka
Contributor

kdudka commented Mar 26, 2024

@rohanpm Good point. If self._logger is writing to stderr, we will have a problem. In our case, the logger was writing to a separate file. The backtrace of the logging thread in the deadlock case looked like this:

Traceback (most recent call first):
  File "/usr/lib64/python3.9/threading.py", line 312, in wait
    waiter.acquire()
  File "/usr/lib64/python3.9/queue.py", line 140, in put
    self.not_full.wait()
  File "/usr/lib/python3.9/site-packages/kobo/worker/logger.py", line 96, in write
    self._queue.put(data)
  File "/usr/lib/python3.9/site-packages/kobo/worker/logger.py", line 119, in write
    self._thread.write(data)
  <built-in method print of module object at remote 0x7f8697bcfae0>
  File "/usr/lib/python3.9/site-packages/kobo/xmlrpc.py", line 487, in request
    print("XML-RPC connection to %s failed: %s, %s" % (args[0], ex.args[1:], retries), file=sys.stderr)
  File "/usr/lib64/python3.9/xmlrpc/client.py", line 1464, in __request
    response = self.__transport.request(
  File "/usr/lib64/python3.9/xmlrpc/client.py", line 1122, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib/python3.9/site-packages/kobo/client/__init__.py", line 510, in upload_task_log
    self._hub.worker.upload_task_log(task_id, remote_file_name, mode, chunk_start, chunk_len, chunk_checksum, encoded_chunk)
  File "/usr/lib/python3.9/site-packages/kobo/worker/logger.py", line 69, in run
    self._hub.upload_task_log(BytesIO(self._send_data), self._task_id, "stdout.log", append=True)
  File "/usr/lib64/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()

If self._hub.upload_task_log() called self._queue.put(), it would cause a
deadlock because:

1. self._queue uses locks that are not reentrant.
2. put() blocks if the Queue is already full.

Co-authored-by: Kamil Dudka <[email protected]>
Co-authored-by: Lukáš Zaoral <[email protected]>
@lzaoral lzaoral force-pushed the worker-fix-logger-deadlock branch from 2072322 to bc3df3e on April 3, 2024 12:32
@lzaoral
Contributor Author

lzaoral commented Apr 3, 2024

@rohanpm I've amended the commit to avoid infinite recursion when the logger has a StreamHandler attached that points to a LoggingIO instance using the same thread.

Note that this does not apply to kobo/worker/main.py, because the stderr handler is added in the main worker process, while sys.stderr is only reassigned in child processes.
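
For reference, one way such re-entrancy can be guarded against looks roughly like this (a sketch of the general idea, not necessarily the exact change merged here):

    import queue
    import threading

    class LoggingThread(threading.Thread):
        def __init__(self):
            super().__init__(daemon=True)
            self._queue = queue.Queue(maxsize=256)

        def write(self, data):
            if threading.current_thread() is self:
                # called from our own run()/upload path; putting into the
                # queue would block because nobody else drains it, so the
                # data has to be dropped or reported through a side channel
                return
            self._queue.put(data)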

@rohanpm rohanpm merged commit c08f3d9 into release-engineering:master Apr 4, 2024
19 checks passed
@lzaoral lzaoral deleted the worker-fix-logger-deadlock branch April 4, 2024 06:30