For our setup, we normally install tools (and keep them updated to new versions) through ephemeris shed-tools calls from CI.
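For context, this is roughly what such a CI step works from: an ephemeris tool list plus a shed-tools invocation. The repository below is the one from the error later in this issue; the section label and exact flags are illustrative, not our actual CI config.

```yaml
# tools.yml -- consumed by ephemeris' shed-tools
tools:
  - name: scanpy_multiplet_scrublet
    owner: ebi-gxa
    tool_panel_section_label: Single-cell    # illustrative section label
    tool_shed_url: toolshed.g2.bx.psu.edu
```

The CI job then runs something like shed-tools install -g <galaxy_url> -a <api_key> -t tools.yml, which installs or updates each listed repository through Galaxy's API.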
It seems that the process of making all the other processes, beyond the web handler, aware of tool installations is not working well. First I see authentication/timeout issues between the workflow container and AMQP:
galaxy.queue_worker INFO 2022-12-04 16:03:04,510 [pN:workflow_scheduler0,p:8,tN:Thread-6 (check)] Queuing sync task reload_toolbox for workflow_scheduler0.
galaxy.queue_worker ERROR 2022-12-04 16:03:14,519 [pN:workflow_scheduler0,p:8,tN:Thread-6 (check)] Error waiting for task: '{'task': 'reload_toolbox', 'kwargs': {}}' sent with routing key 'control.workflow_scheduler0@galaxy-dev-workflow-7b8577f98c-v4kq5'
Traceback (most recent call last):
  File "/galaxy/server/lib/galaxy/queue_worker.py", line 124, in send_task
    self.connection.drain_events(timeout=timeout)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/kombu/connection.py", line 316, in drain_events
    return self.transport.drain_events(self.connection, **kwargs)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/kombu/transport/pyamqp.py", line 169, in drain_events
    return connection.drain_events(**kwargs)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/amqp/connection.py", line 525, in drain_events
    while not self.blocking_read(timeout):
  File "/galaxy/server/.venv/lib/python3.10/site-packages/amqp/connection.py", line 530, in blocking_read
    frame = self.transport.read_frame()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/amqp/transport.py", line 294, in read_frame
    frame_header = read(7, True)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/amqp/transport.py", line 627, in _read
    s = recv(n - len(rbuf))
TimeoutError: timed out
galaxy.queue_worker INFO 2022-12-04 16:03:14,520 [pN:workflow_scheduler0,p:8,tN:Thread-6 (check)] Sending reload_toolbox control task.
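The timeout in that log excerpt matches the general pattern visible in the traceback: the process drains events on its AMQP connection waiting for replies to a control task, and a slow or unreachable broker surfaces as a TimeoutError that is logged but does not stop the task from being sent. A minimal sketch of that flow, assuming a kombu-style connection object; the names here are illustrative, not Galaxy's actual API:

```python
import logging
import socket

log = logging.getLogger(__name__)

def send_control_task(connection, publish, task, timeout=10.0):
    """Wait for outstanding replies, then publish `task`.

    `connection` needs a kombu-style drain_events(timeout=...) method.
    Returns False when draining timed out (the task is still published,
    mirroring the "Sending reload_toolbox control task" log line).
    """
    ok = True
    try:
        # kombu raises socket.timeout / TimeoutError when the broker
        # does not respond within `timeout` seconds
        connection.drain_events(timeout=timeout)
    except (socket.timeout, TimeoutError):
        log.error("Error waiting for task: %r", task)
        ok = False
    publish(task)
    return ok
```

The point of the sketch is that the timeout is tolerated and logged rather than treated as a failure, which is why the log continues normally after the traceback.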
and then when executing a workflow:
galaxy.workflow.modules WARNING 2022-12-05 09:25:33,720 [pN:workflow_scheduler0,p:8,tN:WorkflowRequestMonitor.monitor_thread] The tool 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/scanpy_multiplet_scrublet/scanpy_multiplet_scrublet/1.8.1+3+galaxy0' is missing. Cannot build workflow module.
galaxy.workflow.run ERROR 2022-12-05 09:25:33,721 [pN:workflow_scheduler0,p:8,tN:WorkflowRequestMonitor.monitor_thread] Failed to execute scheduled workflow.
Traceback (most recent call last):
  File "/galaxy/server/lib/galaxy/workflow/run.py", line 42, in __invoke
    outputs = invoker.invoke()
  File "/galaxy/server/lib/galaxy/workflow/run.py", line 142, in invoke
    remaining_steps = self.progress.remaining_steps()
  File "/galaxy/server/lib/galaxy/workflow/run.py", line 275, in remaining_steps
    self.module_injector.inject(step, step_args=self.param_map.get(step.id, {}))
  File "/galaxy/server/lib/galaxy/workflow/modules.py", line 2194, in inject
    module.add_dummy_datasets(connections=step.input_connections, steps=steps)
  File "/galaxy/server/lib/galaxy/workflow/modules.py", line 1749, in add_dummy_datasets
    raise ToolMissingException(f"Tool {self.tool_id} missing. Cannot add dummy datasets.", tool_id=self.tool_id)
galaxy.exceptions.ToolMissingException: Tool toolshed.g2.bx.psu.edu/repos/ebi-gxa/scanpy_multiplet_scrublet/scanpy_multiplet_scrublet/1.8.1+3+galaxy0 missing. Cannot add dummy datasets.
My values.yaml doesn't change any aspect of the RabbitMQ config. I suspect that on a restart the processes will pick up the tools, sorting the problem transiently. But of course what should happen is that newly installed tool versions appear on all processes (web, job and workflow handlers) without a restart.
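What should happen after a tool install is, in effect, a fanout: each Galaxy process consumes a reload_toolbox message from its own control queue and rebuilds its toolbox, so no restart is needed. A purely illustrative in-memory sketch of that intended behaviour (not Galaxy's implementation):

```python
class ControlBroker:
    """Stand-in for the AMQP broker: one control queue per process."""

    def __init__(self):
        self.queues = {}  # process name -> list of pending messages

    def register(self, process_name):
        self.queues[process_name] = []

    def fanout(self, message):
        # deliver to every registered process, like a fanout exchange
        for queue in self.queues.values():
            queue.append(message)

class GalaxyProcess:
    """Stand-in for a web, job or workflow handler process."""

    def __init__(self, name, broker):
        self.name = name
        self.toolbox_version = 0
        broker.register(name)
        self.queue = broker.queues[name]

    def drain(self):
        while self.queue:
            msg = self.queue.pop(0)
            if msg["task"] == "reload_toolbox":
                self.toolbox_version += 1  # stand-in for rebuilding the toolbox
```

In this model a single fanout({"task": "reload_toolbox"}) after an install brings every process to the same toolbox version; the symptom described above corresponds to one process never draining (or never receiving) its message.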
@pcm32 I spoke to Enis and Keith, and none of us recall experiencing this issue on k8s. If you're experiencing this regularly, maybe there is a networking issue in your k8s cluster? Anything in the kubeproxy logs or other host logs that may indicate an issue?
We have, however, seen this error on usegalaxy.au (non-Kubernetes), where a RabbitMQ restart would result in the above error, and handlers would need to be restarted to recover. That is a resilience issue on the Galaxy side, and probably needs a bug logged in the Galaxy repo.
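If the handlers reconnected and retried instead of needing an operator restart, a broker restart would be survivable. A hedged sketch of reconnect-with-backoff, assuming ConnectionError signals a dropped broker connection; kombu's Connection.ensure_connection() offers similar retry behaviour, so this is illustrative rather than the fix Galaxy would ship:

```python
import time

def run_with_reconnect(connect, work, attempts=3, base_delay=1.0):
    """Call work(connect()), reconnecting on ConnectionError.

    Retries up to `attempts` times with exponential backoff, re-raising
    the final failure so callers can alert rather than hang silently.
    """
    delay = base_delay
    for attempt in range(attempts):
        try:
            return work(connect())
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # exhausted retries: surface the failure
            time.sleep(delay)
            delay *= 2
```

The design choice worth noting is bounded retries with a final re-raise: an unbounded silent loop would hide a dead broker, while giving up immediately reproduces the restart-to-recover behaviour seen on usegalaxy.au.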
The different pods look like this:
so all healthy in my view.