Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

new jobs are not getting scheduled due to the error "InternalError galaxy.jobs.handler in __handle_waiting_jobs" #17893

Closed
sanjaysrikakulam opened this issue Apr 3, 2024 · 5 comments · Fixed by #17913
Assignees

Comments

@sanjaysrikakulam
Copy link
Contributor

We are facing the following issue in the EU (release 24.0), which is preventing all new jobs from being scheduled.

Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]: galaxy.jobs.runners.pulsar DEBUG 2024-04-03 13:21:01,118 [pN:handler_sn06_0,p:3479151,tN:PulsarJobRunner.monitor_thread] (68429601) Received status update: running
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]: galaxy.jobs.handler DEBUG 2024-04-03 13:21:01,301 [pN:handler_sn06_0,p:3479151,tN:JobHandlerQueue.monitor_thread] Grabbed Job(s): 68435701
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]: galaxy.jobs.handler ERROR 2024-04-03 13:21:01,308 [pN:handler_sn06_0,p:3479151,tN:JobHandlerQueue.monitor_thread] Exception in monitor_step
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]: Traceback (most recent call last):
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:   File "/opt/galaxy/venv/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1910, in _execute_context
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:     self.dialect.do_execute(
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:   File "/opt/galaxy/venv/lib/python3.11/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:     cursor.execute(statement, parameters)
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]: psycopg2.errors.InFailedSqlTransaction: current transaction is aborted, commands ignored until end of transaction block
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]: The above exception was the direct cause of the following exception:
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]: Traceback (most recent call last):
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:   File "/opt/galaxy/server/lib/galaxy/jobs/handler.py", line 377, in __monitor
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:     self.__monitor_step()
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:   File "/opt/galaxy/server/lib/galaxy/jobs/handler.py", line 395, in __monitor_step
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:     self.__handle_waiting_jobs()
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:   File "/opt/galaxy/server/lib/galaxy/jobs/handler.py", line 474, in __handle_waiting_jobs
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:     .all()
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:      ^^^^^
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:   File "/opt/galaxy/venv/lib/python3.11/site-packages/sqlalchemy/orm/query.py", line 2773, in all
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:     return self._iter().all()
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:            ^^^^^^^^^^^^
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:   File "/opt/galaxy/venv/lib/python3.11/site-packages/sqlalchemy/orm/query.py", line 2916, in _iter
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:     result = self.session.execute(
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:              ^^^^^^^^^^^^^^^^^^^^^
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:   File "/opt/galaxy/venv/lib/python3.11/site-packages/sqlalchemy/orm/session.py", line 1717, in execute
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:     result = conn._execute_20(statement, params or {}, execution_options)
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:   File "/opt/galaxy/venv/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1710, in _execute_20
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:     return meth(self, args_10style, kwargs_10style, execution_options)
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:   File "/opt/galaxy/venv/lib/python3.11/site-packages/sqlalchemy/sql/elements.py", line 334, in _execute_on_connection
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:     return connection._execute_clauseelement(
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:   File "/opt/galaxy/venv/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1577, in _execute_clauseelement
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:     ret = self._execute_context(
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:           ^^^^^^^^^^^^^^^^^^^^^^
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:   File "/opt/galaxy/venv/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1953, in _execute_context
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:     self._handle_dbapi_exception(
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:   File "/opt/galaxy/venv/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 2134, in _handle_dbapi_exception
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:     util.raise_(
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:   File "/opt/galaxy/venv/lib/python3.11/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:     raise exception
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:   File "/opt/galaxy/venv/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1910, in _execute_context
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:     self.dialect.do_execute(
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:   File "/opt/galaxy/venv/lib/python3.11/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]:     cursor.execute(statement, parameters)
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]: sqlalchemy.exc.InternalError: (psycopg2.errors.InFailedSqlTransaction) current transaction is aborted, commands ignored until end of transaction block
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]: [SQL: SELECT job.id AS job_id, job.create_time AS job_create_time, job.update_time AS job_update_time, job.history_id AS job_history_id, job.library_folder_id AS job_library_folder_id, job.tool_id AS job_tool_id, job.tool_version AS job_tool_version, job.galaxy_version AS job_galaxy_version, job.dynamic_tool_id AS job_dynamic_tool_id, job.state AS job_state, job.info AS job_info, job.copied_from_job_id AS job_copied_from_job_id, job.command_line AS job_command_line, job.dependencies AS job_dependencies, job.job_messages AS job_job_messages, job.param_filename AS job_param_filename, job.runner_name AS job_runner_name_1, job.job_stdout AS job_job_stdout, job.job_stderr AS job_job_stderr, job.tool_stdout AS job_tool_stdout, job.tool_stderr AS job_tool_stderr, job.exit_code AS job_exit_code, job.traceback AS job_traceback, job.session_id AS job_session_id, job.user_id AS job_user_id, job.job_runner_name AS job_job_runner_name, job.job_runner_external_id AS job_job_runner_external_id, job.destination_id AS job_destination_id, job.destination_params AS job_destination_params, job.object_store_id AS job_object_store_id, job.imported AS job_imported, job.params AS job_params, job.handler AS job_handler, job.preferred_object_store_id AS job_preferred_object_store_id, job.object_store_id_overrides AS job_object_store_id_overrides
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]: FROM job JOIN (SELECT job.id AS id, job.create_time AS create_time, job.update_time AS update_time, job.history_id AS history_id, job.library_folder_id AS library_folder_id, job.tool_id AS tool_id, job.tool_version AS tool_version, job.galaxy_version AS galaxy_version, job.dynamic_tool_id AS dynamic_tool_id, job.state AS state, job.info AS info, job.copied_from_job_id AS copied_from_job_id, job.command_line AS command_line, job.dependencies AS dependencies, job.job_messages AS job_messages, job.param_filename AS param_filename, job.runner_name AS runner_name, job.job_stdout AS job_stdout, job.job_stderr AS job_stderr, job.tool_stdout AS tool_stdout, job.tool_stderr AS tool_stderr, job.exit_code AS exit_code, job.traceback AS traceback, job.session_id AS session_id, job.user_id AS user_id, job.job_runner_name AS job_runner_name, job.job_runner_external_id AS job_runner_external_id, job.destination_id AS destination_id, job.destination_params AS destination_params, job.object_store_id AS object_store_id, job.imported AS imported, job.params AS params, job.handler AS handler, job.preferred_object_store_id AS preferred_object_store_id, job.object_store_id_overrides AS object_store_id_overrides, rank() OVER (PARTITION BY coalesce(job.user_id, job.session_id) ORDER BY job.id) AS rank
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]: FROM job LEFT OUTER JOIN galaxy_user ON galaxy_user.id = job.user_id
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]: WHERE job.state = %(state_1)s AND job.handler = %(handler_1)s AND (job.id NOT IN (SELECT anon_2.id
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]: FROM (SELECT job.id AS id
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]: FROM job JOIN job_to_input_dataset ON job.id = job_to_input_dataset.job_id JOIN history_dataset_association ON history_dataset_association.id = job_to_input_dataset.dataset_id JOIN dataset ON dataset.id = history_dataset_association.dataset_id
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]: WHERE job.state = %(state_2)s AND dataset.state IN (%(state_3_1)s, %(state_3_2)s, %(state_3_3)s, %(state_3_4)s, %(state_3_5)s)) AS anon_2)) AND (job.id NOT IN (SELECT anon_3.id
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]: FROM (SELECT job.id AS id
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]: FROM job JOIN job_to_input_library_dataset ON job.id = job_to_input_library_dataset.job_id JOIN library_dataset_dataset_association ON library_dataset_dataset_association.id = job_to_input_library_dataset.ldda_id JOIN dataset ON dataset.id = library_dataset_dataset_association.dataset_id
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]: WHERE job.state = %(state_4)s AND dataset.state IN (%(state_5_1)s, %(state_5_2)s, %(state_5_3)s, %(state_5_4)s, %(state_5_5)s)) AS anon_3)) AND (job.user_id IS NULL OR galaxy_user.active = true) ORDER BY job.id) AS anon_1 ON job.id = anon_1.id
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]: WHERE anon_1.rank <= %(rank_1)s]
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]: [parameters: {'state_1': <JobState.NEW: 'new'>, 'handler_1': 'handler_sn06_0', 'state_2': <JobState.NEW: 'new'>, 'state_4': <JobState.NEW: 'new'>, 'rank_1': 32, 'state_3_1': 'new', 'state_3_2': 'upload', 'state_3_3': 'queued', 'state_3_4': 'running', 'state_3_5': 'setting_metadata', 'state_5_1': 'new', 'state_5_2': 'upload', 'state_5_3': 'queued', 'state_5_4': 'running', 'state_5_5': 'setting_metadata'}]
Apr 03 13:21:01 sn06.galaxyproject.eu python[3479151]: (Background on this error at: https://sqlalche.me/e/14/2j85)
Apr 03 13:21:02 sn06.galaxyproject.eu python[3479151]: galaxy.jobs.runners.pulsar DEBUG 2024-04-03 13:21:02,124 [pN:handler_sn06_0,p:3479151,tN:PulsarJobRunner.monitor_thread] 

Sentry log: https://sentry.galaxyproject.org/share/issue/34f8c4da8b29432ba1e53428e617904b/

@sanjaysrikakulam
Copy link
Contributor Author

FYI: @mira-miracoli, @bgruening

@sanjaysrikakulam
Copy link
Contributor Author

sanjaysrikakulam commented Apr 3, 2024

@jdavcs jdavcs self-assigned this Apr 3, 2024
@sanjaysrikakulam
Copy link
Contributor Author

Update:

we noticed the following DB-related errors:

in the job handler logs

psycopg2.OperationalError: connection to server at servername, port 5432 failed: FATAL: remaining connection slots are reserved for non-replication superuser connections

in the postgres server

unexpected EOF on client connection with an open transaction
```

As a result, we observed that Galaxy was maxing out the `max_connections` limit. At 17:30, we increased the `max_connections` and restarted the DB systemd service. Since then, we do not see the error anymore.

It may be correlated or is a causation. 

It's been almost 1 hour since we restarted the job handlers, and I no longer see the errors.

Maybe there was something stale in the DB, and restarting the service fixed it.

@mvdbeek
Copy link
Member

mvdbeek commented Apr 8, 2024

Thanks @sanjaysrikakulam! I think this should be handled with #17913. Let us know if it happens again after updating the branch.

@mvdbeek mvdbeek closed this as completed Apr 8, 2024
@sanjaysrikakulam
Copy link
Contributor Author

Sure! Thank you @mvdbeek!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants