Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vine: temp file replication #3564

Merged
merged 12 commits into from
Dec 20, 2023

Conversation

colinthomas-z80
Copy link
Contributor

@colinthomas-z80 colinthomas-z80 commented Nov 6, 2023

Proposed changes

Related to #3563

At the moment, upon a cache update, the manager will check if a temp file was created.

If that is the case, the manager will look for up to q->temp_replica_count workers to replicate the file to.

It uses the new message, puturl_now, to tell a worker to bring a remote file to the cache without the need to run a task

Post-change actions

Put an 'x' in the boxes that describe post-change actions that you have done.
The more 'x' ticked, the faster your changes are accepted by maintainers.

  • make test Run local tests prior to pushing.
  • make format Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python)
  • make lint Run lint on source code prior to pushing.
  • Manual Update Did you update the manual to reflect your changes, if appropriate? This action should be done after your changes are approved but not merged.
  • Type Labels Select github labels for the type of this change: bug, enhancement, etc.
  • Product Labels Select github labels for the product affected: TaskVine, Makeflow, etc.
  • PR RTM Mark your PR as ready to merge.

@colinthomas-z80
Copy link
Contributor Author

Proof of concept

@dthain
Copy link
Member

dthain commented Nov 6, 2023

Please describe the changes above in the nice form provided. :)

@colinthomas-z80
Copy link
Contributor Author

image

Working in a basic case

@dthain
Copy link
Member

dthain commented Nov 6, 2023

Nice! Let me observe that "now" doesn't really mean "synchronously right away" but rather "without waiting for a dependency to require it". The transfer could still be delayed for reasons of capacity, concurrency limits at the worker, etc.

A question: Should all url requests be done "now" ? Is there an upside to having "normal" transfers wait until something requests their presence?

@BarrySlyDelgado what do you think?

@BarrySlyDelgado
Copy link
Contributor

The answer is likely that it depends. I think it is mostly beneficial if you are not worried about resource utilization. For example, with limited disk space downloading files without premeditated necessity could fill up space quickly. However, without any worries about certain resources, if a worker is free to download a specific file from a remote source, it might as well do so as it could save start up time for task executions in the future. In regard to concurrency limitations, scenarios could arise in which a worker may be instructed to download a file "now" reaching a set concurrency limit and could possibly block a task from running though it may be a limited case.

@colinthomas-z80
Copy link
Contributor Author

Had a successful large scale run where worker loss was handled without recovery tasks.

Still need to refine further as it is not distributing the files evenly and definitely overloading specific workers, both as the sender and receiver.

@dthain
Copy link
Member

dthain commented Nov 7, 2023

That's great! Balance is a tricky issue. This is not quite as simple as it first appears...

@dthain
Copy link
Member

dthain commented Nov 7, 2023

Please rebase on master to get a workaround for the OSX build.

@colinthomas-z80
Copy link
Contributor Author

oops

@btovar
Copy link
Member

btovar commented Nov 7, 2023

replicate your pr!

@colinthomas-z80
Copy link
Contributor Author

Reopening after accidental hard reset :)

taskvine/src/manager/vine_manager_put.c Outdated Show resolved Hide resolved
taskvine/src/worker/vine_worker.c Outdated Show resolved Hide resolved
Copy link
Member

@dthain dthain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is on the right track, I know you are still looking at scheduling/placement issues.

taskvine/src/worker/vine_cache.c Outdated Show resolved Hide resolved
taskvine/src/manager/vine_manager.c Show resolved Hide resolved
taskvine/src/manager/vine_manager.c Outdated Show resolved Hide resolved
@dthain
Copy link
Member

dthain commented Dec 18, 2023

Status on this one?

@colinthomas-z80
Copy link
Contributor Author

File distribution is looking good. Still need to clean up the interface as was requested

Copy link
Member

@dthain dthain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last little bit -

taskvine/src/worker/vine_cache.h Show resolved Hide resolved
@colinthomas-z80
Copy link
Contributor Author

I will go the flag route, since we have been discussing ensuring the cache on task startup and we may think of other ideas

@dthain
Copy link
Member

dthain commented Dec 19, 2023

And finally make sure the setting is documented in taskvine.h and find the right place to briefly mention how temp files are duplicated/recovered in the main manual.

@dthain dthain merged commit 14db238 into cooperative-computing-lab:master Dec 20, 2023
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants