vine: enforcing transfer limits between two workers -- how to handle large numbers of temporary input files #3959

Closed
colinthomas-z80 opened this issue Oct 16, 2024 · 9 comments

Comments

@colinthomas-z80
Contributor

colinthomas-z80 commented Oct 16, 2024

I encountered a problem; my naive solution to it is #3958.

The situation is this: we have two workers, A and B, a task which produces 100 output files, and a second task which consumes those 100 files as inputs. We schedule the first task and the output files are created at worker A. When we schedule the second task, the manager may choose to send it to worker B.

The manager will see that worker A possesses input files for the task, and will schedule peer transfers from A to B. Peer transfers are accounted for and managed in the current_transfer_table. However, the current_transfer_table is only updated in vine_put_url after the task has been scheduled. Therefore the manager will have scheduled all 100 input files to be requested from worker A, since each time it checked the current_transfer_table during scheduling it had not yet been populated with scheduled transfers from the same task.
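The effect of the late table update can be illustrated with a small model. This is only a sketch: `schedule_task_inputs`, the `Counter` standing in for current_transfer_table, and the limit of 3 are assumptions for illustration, not the manager's actual code.

```python
# Hypothetical model of the manager's bookkeeping; names are illustrative
# and do not mirror TaskVine's real data structures.
from collections import Counter

WORKER_SOURCE_MAX_TRANSFERS = 3  # assumed limit for this example


def schedule_task_inputs(input_files, source, transfer_table, update_during_scheduling):
    """Return the files scheduled for peer transfer from `source`.

    `transfer_table` counts transfers in flight per source worker.
    """
    scheduled = []
    for f in input_files:
        # The limit check only sees what the table contains right now.
        if transfer_table[source] < WORKER_SOURCE_MAX_TRANSFERS:
            scheduled.append(f)
            if update_during_scheduling:
                transfer_table[source] += 1
    if not update_during_scheduling:
        # The table is populated only after the whole task was scheduled.
        transfer_table[source] += len(scheduled)
    return scheduled


inputs = [f"out_{i}" for i in range(100)]

late = schedule_task_inputs(inputs, "A", Counter(), update_during_scheduling=False)
eager = schedule_task_inputs(inputs, "A", Counter(), update_during_scheduling=True)

print(len(late))   # 100: every check saw an empty table
print(len(eager))  # 3: the limit takes effect within the task
```

With the late update, all 100 checks pass because none of them can see the transfers already planned for the same task.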

The naive solution I implemented prevents this from happening, but it also has other implications. Until now, the worker_source_max_transfers policy has only been effective at limiting how many different workers request single files from one source worker. A single task has still been free to request any number of files at once from a single worker.

If the policy is extended to all peer transfers, then it will be impossible to move task data between workers when the number of outputs is greater than worker_source_max_transfers. The consequence would be another form of fixed_location, where tasks may run only where the data already exists.

This calls for a way to transfer more than worker_source_max_transfers files between two workers.

If we revert to the previous policy, where we were free to schedule as many transfers from a single worker as one task needs, the point of failure is the socket connect timeout between workers: the requesting worker fails to connect to the source and declares a cache_invalid. One possibility would be to increase the connect timeout from 15 seconds to something longer. However, this would make it harder to identify genuinely unreachable workers.

We may consider having workers limit their own connections. If the manager tells a worker to retrieve 100 files from a single host, then instead of forking 100 transfer processes the worker would fetch them sequentially, or limit itself to 3 or 5 transfers from one host at a given time. Alternatively, the source worker would serve only 3-5 connections at a time. However, if the worker starts queuing transfers and conserving bandwidth on its own, the manager's policy might become redundant.

Proposed solution:
I propose that I close #3958 and instead make the requesting worker partially bandwidth-aware: a receiving worker will queue transfer requests to a single source and perform only a small number of them in parallel at a time.

If the worker is made to be considerate of other hosts, then the transfers will eventually complete successfully. The manager is free to keep enforcing the same policy. From the manager's perspective, it will see that transfers occurring from a particular worker greatly exceed worker_source_max_transfers, so it will avoid scheduling any transfers from that source until they complete, which should be desirable.
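A minimal sketch of the worker-side queue described in this proposal follows. The `TransferQueue` class, its method names, and the per-source limit of 3 are hypothetical and chosen only for illustration; the real worker forks transfer processes rather than tracking sets in memory.

```python
# Hypothetical per-source transfer queue on the receiving worker.
from collections import defaultdict, deque

MAX_PER_SOURCE = 3  # assumed worker-side limit, not a real TaskVine setting


class TransferQueue:
    def __init__(self, max_per_source=MAX_PER_SOURCE):
        self.max_per_source = max_per_source
        self.pending = defaultdict(deque)  # source -> files waiting to start
        self.active = defaultdict(set)     # source -> files currently in flight

    def request(self, source, filename):
        """Record a transfer request from the manager; start it if there is room."""
        self.pending[source].append(filename)
        self._start_ready(source)

    def complete(self, source, filename):
        """Mark a transfer finished and start the next queued one, if any."""
        self.active[source].discard(filename)
        self._start_ready(source)

    def _start_ready(self, source):
        while self.pending[source] and len(self.active[source]) < self.max_per_source:
            self.active[source].add(self.pending[source].popleft())


queue = TransferQueue()
for i in range(100):
    queue.request("A", f"out_{i}")

peak = len(queue.active["A"])
done = 0
while queue.active["A"]:
    finished = next(iter(queue.active["A"]))
    queue.complete("A", finished)
    done += 1
    peak = max(peak, len(queue.active["A"]))

print(done, peak)  # 100 3
```

All 100 requests are accepted immediately, so the manager can schedule them as before, but no more than three connections to worker A are open at any moment.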

@colinthomas-z80 colinthomas-z80 changed the title vine: enforcing transfer limits between two workers -- how to handle large numbers of input files vine: enforcing transfer limits between two workers -- how to handle large numbers of temporary input files Oct 16, 2024
@dthain
Member

dthain commented Oct 16, 2024

Ok, so the main change is that the worker will limit concurrent transfers from the same source?

@colinthomas-z80
Contributor Author

That is correct

@dthain
Member

dthain commented Oct 16, 2024

@BarrySlyDelgado what do you think?

@BarrySlyDelgado
Contributor

I'm in favor of a worker limiting its own transfers. Though, it will be interesting to see the performance in conjunction with worker_source_max_transfers. If I have this right, if we limit a single worker to receiving 5 files concurrently and have worker_source_max_transfers set to 3, the maximum number of files any single worker would be sending at once would be 15. I think there is room to explore the adequate number of total transfers and the resulting ratio of max sends : max receives for any single worker.
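The arithmetic behind that figure can be written out as follows. The function and parameter names are hypothetical, and the sketch assumes that worker_source_max_transfers bounds how many requesting workers the manager admits per source, with each of them applying its own receive limit.

```python
# Hypothetical helper; assumes the manager's limit counts requesting
# workers per source and each requester caps its own concurrent receives.
def max_concurrent_sends(worker_source_max_transfers, worker_receive_limit):
    return worker_source_max_transfers * worker_receive_limit


print(max_concurrent_sends(3, 5))  # 15
```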

@colinthomas-z80
Contributor Author

#3961 is open, needs some review. I ended up limiting all transfers at the worker, rather than just from a specific source. It would be more complex to keep tabs on every individual source. Also it might be more proper to limit all transfers since the requesting worker could likely get overwhelmed at some point.

The amount I have tested with is a limit of 5 concurrent transfers. It does not appear to slow things down for me even when getting ~100 files from another worker. I think there is a benefit to only opening a few high bandwidth channels at a time.
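A worker-wide limit of this kind can be sketched with a fixed-size pool. This is an illustration under stated assumptions, not the code in #3961: `fetch` is a hypothetical stand-in for a real peer transfer, and threads are used here where the worker forks transfer processes.

```python
# Hypothetical sketch of a global cap on concurrent transfers at one worker.
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_TRANSFERS = 5  # the limit mentioned above

_lock = threading.Lock()
_in_flight = 0
peak_in_flight = 0


def fetch(source, filename):
    """Stand-in for one peer transfer; records how many run at once."""
    global _in_flight, peak_in_flight
    with _lock:
        _in_flight += 1
        peak_in_flight = max(peak_in_flight, _in_flight)
    time.sleep(0.001)  # placeholder for the actual network transfer
    with _lock:
        _in_flight -= 1
    return filename


# Requests from two sources share the same cap, regardless of origin.
transfer_requests = [("A" if i % 2 == 0 else "B", f"out_{i}") for i in range(100)]

with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_TRANSFERS) as pool:
    fetched = list(pool.map(lambda r: fetch(*r), transfer_requests))

print(len(fetched), peak_in_flight <= MAX_CONCURRENT_TRANSFERS)  # 100 True
```

Because the cap applies to the pool as a whole, there is no per-source bookkeeping, which matches the simpler design chosen here.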

@dthain
Member

dthain commented Nov 4, 2024

Solved?

@colinthomas-z80
Contributor Author

Yes I would say so

@dthain
Member

dthain commented Nov 5, 2024

Ok, be proactive about closing issues and moving things forward please

@colinthomas-z80
Contributor Author

Closed with #3961 as a resolution

3 participants