Sisyphus GCS downloads got stuck in #24 due to a bug: missing a Google Client library. The worker would get stuck, never complete, and never shut itself down. [Why didn't the lack of a library throw an exception?]
That bug is fixed, but we ought to bound the problems that file downloads, file uploads, and Docker image pulls can cause, e.g. when the remote server is slow.
Triggers:
A request from Gaia (via Kafka) to terminate the current task.
A timer expires.
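A minimal sketch of how the two triggers could be represented so that transfer code can poll them. The class name `TransferTriggers` and its methods are hypothetical, not existing Sisyphus code, and the Kafka consumer that would call `request_terminate()` is assumed rather than shown.

```python
import threading
import time


class TransferTriggers:
    """Tracks the two stop triggers: a terminate request and a timer."""

    def __init__(self, timeout_seconds):
        # time.monotonic() is unaffected by wall-clock adjustments.
        self._deadline = time.monotonic() + timeout_seconds
        self._terminate = threading.Event()

    def request_terminate(self):
        """Assumed to be called by the Kafka consumer thread when Gaia
        asks to terminate the current task (hypothetical wiring)."""
        self._terminate.set()

    def fired(self):
        """Return 'terminate', 'timeout', or None if neither trigger fired."""
        if self._terminate.is_set():
            return "terminate"
        if time.monotonic() >= self._deadline:
            return "timeout"
        return None
```

Whether the timer covers one file or the whole task then comes down to whether a new `TransferTriggers` is created per transfer or once per task.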
Approaches from easiest to most robust:
In Sisyphus, check these triggers before each file transfer or Docker image pull. This is straightforward apart from picking the timeout duration and deciding whether it should be per file or total. It would handle most cases but not the bug behind #24 ("Sisyphus gets stuck pulling an input file"), since a check made before a transfer cannot interrupt a transfer that has already started and never returns.
In Sisyphus, do the file transfers and Docker pull in separate threads and be prepared to kill them on these triggers. The file cleanup code might need to be more careful.
Make Gaia able to delete a stuck worker node, especially once Gaia becomes responsible for starting and stopping Sisyphus workers.
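The second approach could look roughly like the sketch below. Python threads cannot be forcibly killed, so this sketch swaps in a child process instead of a thread, which fits transfers that already run as a command line. The function name `run_bounded`, its parameters, and the polling interval are assumptions for illustration, not existing Sisyphus code.

```python
import subprocess
import threading
import time


def run_bounded(command, timeout_seconds, terminate_event, poll_seconds=0.1):
    """Run `command` in a child process; kill it if a trigger fires.

    Returns (status, returncode), where status is 'done', 'timeout',
    or 'terminated'. returncode is None unless the command finished.
    """
    deadline = time.monotonic() + timeout_seconds
    process = subprocess.Popen(command)
    try:
        while True:
            try:
                # Wait briefly so both triggers are rechecked every poll_seconds.
                returncode = process.wait(timeout=poll_seconds)
                return "done", returncode
            except subprocess.TimeoutExpired:
                pass
            if terminate_event.is_set():
                return "terminated", None
            if time.monotonic() >= deadline:
                return "timeout", None
    finally:
        if process.poll() is None:
            process.kill()  # still running: a trigger fired or an error occurred
            process.wait()  # reap the child so it does not linger as a zombie
```

A call might look like `run_bounded(["docker", "pull", image], 600, terminate_event)`, where `terminate_event` is a `threading.Event` set by whatever handles the terminate request. A killed transfer can leave a partial file behind, which is why the cleanup code would need to be more careful.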