timeout Sisyphus storage ops and Docker pulls #26

Open
1fish2 opened this issue Dec 7, 2019 · 1 comment

Comments

@1fish2
Collaborator

1fish2 commented Dec 7, 2019

Sisyphus GCS downloads got stuck in #24 due to a bug: a missing Google Client library. The worker would hang, never completing the task and never shutting itself down. [Why didn't the missing library throw an exception?]

That bug is fixed, but we ought to bound the damage that file downloads, file uploads, and Docker image pulls can cause, e.g. if the remote server is slow.

Triggers:

  1. A request from Gaia (via Kafka) to terminate the current task.
  2. A timer expires.
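
A minimal sketch of how the two triggers could be folded into a single check. It is in Python for illustration only; `StopTriggers`, `request_terminate`, and `fired` are hypothetical names, and the code that would receive the terminate request from Gaia is deliberately left out.

```python
import threading
import time


class StopTriggers:
    """Combines the two triggers: a terminate request and a timer."""

    def __init__(self, timeout_seconds: float):
        # Set by whatever handles the terminate request (hypothetical hook).
        self._terminate = threading.Event()
        # A monotonic clock is unaffected by wall-clock adjustments.
        self._deadline = time.monotonic() + timeout_seconds

    def request_terminate(self) -> None:
        """Call this when a terminate request for the current task arrives."""
        self._terminate.set()

    def fired(self):
        """Return the name of the trigger that fired, or None if neither has."""
        if self._terminate.is_set():
            return "terminate"
        if time.monotonic() >= self._deadline:
            return "timeout"
        return None
```

Whether the deadline applies per file or to the whole task then comes down to whether a new `StopTriggers` is created per transfer or once per task.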

Approaches from easiest to most robust:

  1. In Sisyphus, check these triggers before each file transfer or Docker image pull. This is straightforward apart from picking the timeout duration and deciding whether it applies per file or in total. It would handle most cases but not the bug behind #24 (Sisyphus gets stuck pulling an input file), since a transfer that hangs never returns to the next check.
  2. In Sisyphus, do the file transfers and Docker pull in separate threads and be prepared to kill them on these triggers. The file cleanup code might need to be more careful.
  3. Make Gaia able to delete a stuck worker node, esp. once it becomes responsible for starting and stopping Sisyphus workers.
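
A rough sketch of the first two approaches, in Python for illustration only. `stop_reason` is a hypothetical zero-argument callable that returns `None` or the name of the trigger that fired, and `transfer` stands in for whatever moves one file. Python threads cannot be killed from outside, so the second function swaps in a child process for the separate thread; it is a variant of approach 2, not the approach as described.

```python
import subprocess
import time


def transfer_files(files, transfer, stop_reason):
    """Approach 1: check the triggers before each transfer.

    A transfer that hangs is never interrupted, because the check
    only runs between transfers.
    """
    done = []
    for name in files:
        reason = stop_reason()
        if reason is not None:
            return done, reason
        transfer(name)
        done.append(name)
    return done, None


def run_bounded(command, stop_reason, poll_seconds=0.1):
    """Approach 2 variant: run the operation as a child process, kill it on a trigger.

    Returns (exit_code, None) if the command finished on its own,
    or (None, reason) if it was killed.
    """
    process = subprocess.Popen(command)
    while True:
        code = process.poll()
        if code is not None:
            return code, None
        reason = stop_reason()
        if reason is not None:
            process.kill()
            process.wait()  # reap the child so it does not linger as a zombie
            return None, reason
        time.sleep(poll_seconds)
```

For example, `run_bounded(["docker", "pull", image], stop_reason)` would bound an image pull, where `image` is whatever image name the task needs. Partially written files would still have to be cleaned up by the caller after a kill.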
@1fish2
Collaborator Author

1fish2 commented Dec 7, 2019

BTW, I'm inclined to implement the first alternative, and not urgently.
