Outcome queue deleted before Flow completion, but no exit. #620

Open
nkemnitz opened this issue Jan 24, 2024 · 1 comment

nkemnitz commented Jan 24, 2024

exec-dazzling-rose-gecko-of-perception was at 6057/6058 completed invert_field (subchunkable) tasks this morning, and the workers failed due to the missing outcome queue.
Edit: Scheduler was frozen and did not react to Ctrl+C. Had to kill it.

What should happen in this case: the run should fail immediately when the pods keep dying. There is no need to waste resources on an unrecoverable error.
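
A minimal sketch of the fail-fast behavior described above, assuming the component supervising the workers can see why each one died. `CrashTracker`, `UnrecoverableRunError`, and `UNRECOVERABLE_ERROR_CODES` are hypothetical names for illustration and are not part of zetta_utils; only the error code string is taken from the log in this issue.

```python
from dataclasses import dataclass

# Error code copied from the traceback in this issue. Treating it as
# unrecoverable is an assumption of this sketch.
UNRECOVERABLE_ERROR_CODES = frozenset({"AWS.SimpleQueueService.NonExistentQueue"})


class UnrecoverableRunError(RuntimeError):
    """Raised when restarting workers cannot make the run succeed."""


@dataclass
class CrashTracker:
    """Hypothetical helper that decides when to abort a run."""

    max_consecutive_crashes: int = 3
    consecutive_crashes: int = 0

    def record_success(self) -> None:
        # Any completed task resets the crash streak.
        self.consecutive_crashes = 0

    def record_crash(self, error_code: str) -> None:
        self.consecutive_crashes += 1
        # A known-fatal error aborts the run on its first occurrence.
        if error_code in UNRECOVERABLE_ERROR_CODES:
            raise UnrecoverableRunError(f"unrecoverable worker error: {error_code}")
        # Otherwise abort only after repeated crashes with no progress.
        if self.consecutive_crashes >= self.max_consecutive_crashes:
            raise UnrecoverableRunError(
                f"{self.consecutive_crashes} consecutive worker crashes, "
                f"last error: {error_code}"
            )
```

The caller would catch `UnrecoverableRunError`, tear down the worker pods, and exit with a non-zero status instead of letting the pods restart indefinitely.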

What should also happen: the outcome queue should not be deleted before the last task is actually processed, but I can't replicate the early deletion.

2024-01-24T08:50:01Z  ╭───────────────────── Traceback (most recent call last) ──────────────────────╮
2024-01-24T08:50:01Z  /opt/conda/bin/zetta:8 in <module>
2024-01-24T08:50:01Z
2024-01-24T08:50:01Z      5 from zetta_utils.cli.main import cli
2024-01-24T08:50:01Z      6 if __name__ == '__main__':
2024-01-24T08:50:01Z      7     sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
2024-01-24T08:50:01Z    ❱ 8     sys.exit(cli())
2024-01-24T08:50:01Z      9
2024-01-24T08:50:01Z
2024-01-24T08:50:01Z  /opt/conda/lib/python3.10/site-packages/click/core.py:1137 in __call__
2024-01-24T08:50:01Z  /opt/conda/lib/python3.10/site-packages/click/core.py:1062 in main
2024-01-24T08:50:01Z  /opt/conda/lib/python3.10/site-packages/click/core.py:1668 in invoke
2024-01-24T08:50:01Z  /opt/conda/lib/python3.10/site-packages/click/core.py:1404 in invoke
2024-01-24T08:50:01Z  /opt/conda/lib/python3.10/site-packages/click/core.py:763 in invoke
2024-01-24T08:50:01Z  /opt/zetta_utils/zetta_utils/cli/main.py:106 in run
2024-01-24T08:50:01Z
2024-01-24T08:50:01Z      103     if parallel_builder:
2024-01-24T08:50:01Z      104         zetta_utils.builder.PARALLEL_BUILD_ALLOWED = True
2024-01-24T08:50:01Z      105
2024-01-24T08:50:01Z    ❱ 106     result = zetta_utils.builder.build(spec, parallel=parallel_builder
2024-01-24T08:50:01Z      107     logger.debug(f"Outcome: {pprint.pformat(result, indent=4)}")
2024-01-24T08:50:01Z      108     if pdb:
2024-01-24T08:50:01Z      109         breakpoint() # pylint: disable=forgotten-debug-statement # pr
2024-01-24T08:50:01Z
2024-01-24T08:50:01Z  /opt/zetta_utils/zetta_utils/builder/build.py:53 in build
2024-01-24T08:50:01Z  /opt/zetta_utils/zetta_utils/builder/build.py:62 in _build
2024-01-24T08:50:01Z  /opt/zetta_utils/zetta_utils/builder/build.py:115 in _execute_build_stages
2024-01-24T08:50:01Z  /opt/zetta_utils/zetta_utils/builder/build.py:93 in _build_object
2024-01-24T08:50:01Z  /opt/zetta_utils/zetta_utils/builder/build.py:83 in _build_object
2024-01-24T08:50:01Z  /opt/zetta_utils/zetta_utils/mazepa/worker.py:63 in run_worker
2024-01-24T08:50:01Z
2024-01-24T08:50:01Z      60                 return_value=None,
2024-01-24T08:50:01Z      61             )
2024-01-24T08:50:01Z      62             outcome_report = OutcomeReport(task_id=constants.UNKNOWN_T
2024-01-24T08:50:01Z    ❱ 63             outcome_queue.push([outcome_report])
2024-01-24T08:50:01Z      64             raise e
2024-01-24T08:50:01Z      65
2024-01-24T08:50:01Z      66         logger.info(f"Got {len(task_msgs)} tasks.")
2024-01-24T08:50:01Z
2024-01-24T08:50:01Z  /opt/zetta_utils/zetta_utils/message_queues/sqs/queue.py:64 in push
2024-01-24T08:50:01Z
2024-01-24T08:50:01Z      61             for e in payloads:
2024-01-24T08:50:01Z      62                 tq_task = TQTask(serialization.serialize(e))
2024-01-24T08:50:01Z      63                 tq_tasks.append(tq_task)
2024-01-24T08:50:01Z    ❱ 64             self._get_tq_queue().insert(tq_tasks, parallel=self.insert
2024-01-24T08:50:01Z      65
2024-01-24T08:50:01Z      66     def _extend_msg_lease(self, duration_sec: int, msg: utils.SQSRecei
2024-01-24T08:50:01Z      67         utils.change_message_visibility(
2024-01-24T08:50:01Z
2024-01-24T08:50:01Z  /opt/zetta_utils/zetta_utils/message_queues/sqs/queue.py:49 in _get_tq_queue
2024-01-24T08:50:01Z
2024-01-24T08:50:01Z      46
2024-01-24T08:50:01Z      47     def _get_tq_queue(self) -> Any:
2024-01-24T08:50:01Z      48         if self._queue is None:
2024-01-24T08:50:01Z    ❱ 49             self._queue = taskqueue.TaskQueue(
2024-01-24T08:50:01Z      50                 self.name,
2024-01-24T08:50:01Z      51                 region_name=self.region_name,
2024-01-24T08:50:01Z      52                 endpoint_url=self.endpoint_url,
2024-01-24T08:50:01Z
2024-01-24T08:50:01Z  /opt/conda/lib/python3.10/site-packages/taskqueue/taskqueue.py:69 in __init__
2024-01-24T08:50:01Z  /opt/conda/lib/python3.10/site-packages/taskqueue/taskqueue.py:90 in initialize_api
2024-01-24T08:50:01Z  /opt/conda/lib/python3.10/site-packages/taskqueue/aws_queue_api.py:58 in __init__
2024-01-24T08:50:01Z  /opt/conda/lib/python3.10/site-packages/tenacity/__init__.py:326 in wrapped_f
2024-01-24T08:50:01Z  /opt/conda/lib/python3.10/site-packages/tenacity/__init__.py:406 in __call__
2024-01-24T08:50:01Z  /opt/conda/lib/python3.10/site-packages/tenacity/__init__.py:362 in iter
2024-01-24T08:50:01Z  /opt/conda/lib/python3.10/site-packages/tenacity/__init__.py:195 in reraise
2024-01-24T08:50:01Z  /opt/conda/lib/python3.10/concurrent/futures/_base.py:451 in result
2024-01-24T08:50:01Z  /opt/conda/lib/python3.10/concurrent/futures/_base.py:403 in __get_result
2024-01-24T08:50:01Z  /opt/conda/lib/python3.10/site-packages/tenacity/__init__.py:409 in __call__
2024-01-24T08:50:01Z  /opt/conda/lib/python3.10/site-packages/taskqueue/aws_queue_api.py:65 in _get_qurl
2024-01-24T08:50:01Z  /opt/conda/lib/python3.10/site-packages/botocore/client.py:534 in _api_call
2024-01-24T08:50:01Z  /opt/conda/lib/python3.10/site-packages/botocore/client.py:976 in _make_api_call
2024-01-24T08:50:01Z  ╰──────────────────────────────────────────────────────────────────────────────╯
2024-01-24T08:50:01Z  QueueDoesNotExist: An error occurred (AWS.SimpleQueueService.NonExistentQueue)
2024-01-24T08:50:01Z  when calling the GetQueueUrl operation: The specified queue does not exist for
2024-01-24T08:50:01Z  this wsdl version.
2024-01-24T08:50:01Z  Exception occured while building "spec" (mapped to "run_worker" from module
2024-01-24T08:50:01Z  "zetta_utils.mazepa.worker")
@nkemnitz

The no-deletion part is resolved: GC was not enabled for that project. The original cause of the frozen scheduler is still unknown.
