After a discussion of various disk management problems, we have identified some fundamental issues in our disk allocation strategy. Improving it should address some immediate problems and make our disk management generally more effective when disk is a constrained resource. In summary:

- Our default strategy of allocating all of the available worker disk to a task is not effective, especially with the advent of the library task, which does not "complete" and release its allocated resources the way a regular task does.
- We should give priority to the user's declared resource requirements. If a user assigns a quantity of disk to a task, we should not override that choice.
- A number of different values are maintained on the manager and at the worker to track how much disk is available, how much is used, how much is currently allocated, and the sandbox and cache sizes. Keeping these values synchronized takes effort and introduces the potential for error.
- The worker's disk.total value is calculated as cache + sandboxes + available_space. We do not want this value to change, but separate processes and/or files generated by a task outside its sandbox will cause it to shrink and lead to problems, as sketched below.
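To make that failure mode concrete, here is a minimal Python sketch. The directory layout, units, and helper names are assumptions for illustration only, not the worker's actual code:

```python
import os

MB = 1024 * 1024

def dir_size_mb(path):
    """Total size of regular files under path, in MB."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass
    return total // MB

def current_disk_total(workspace):
    """disk.total as it is effectively computed today:
    cache + sandboxes + free space reported by the filesystem."""
    stat = os.statvfs(workspace)
    available_mb = (stat.f_bavail * stat.f_frsize) // MB
    # "cache" and "sandboxes" are an illustrative layout, not the real one.
    cache_mb = dir_size_mb(os.path.join(workspace, "cache"))
    sandboxes_mb = dir_size_mb(os.path.join(workspace, "sandboxes"))
    return cache_mb + sandboxes_mb + available_mb

# If an unrelated process writes N MB anywhere on the same filesystem,
# available_mb drops by N and disk.total silently shrinks between reports
# to the manager.
```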
- Replace the default "whole disk" allocation with a heuristic: a sizeable fraction of the remaining available space, with a generous minimum size (see the sketch after this list).
- Do not override task resource specifications, even if proportional_resources thinks it's a good idea.
- Refine the data structures maintaining disk usage statistics and make the manager distinctly aware of cache space used versus individual task sandboxes.
- Consider disk.total to be immutable at the worker. If the worker finds that something is consuming unmanaged disk space, it should take this seriously and perhaps halt operation while it reports new resource quantities to the manager. We will need to address the specific scenario where the manager and worker run on the same machine and the manager logs are taking up the disk space. It may be that we should set a finite default disk value in the same way vine_factory does.
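A minimal sketch of the first two proposals, assuming sizes in MB; the fraction, the floor, and the function name are illustrative placeholders, not decided values or real API:

```python
DEFAULT_FRACTION = 0.25      # sizeable fraction of what is currently available
MINIMUM_SANDBOX_MB = 2000    # generous minimum so small workers still run tasks

def default_task_disk(disk_available_mb, user_requested_mb=None):
    """Pick the disk allocation for a task sandbox.

    A user-declared requirement is always honored as-is; otherwise allocate
    a fraction of the remaining space, never less than the minimum, and
    never more than what is actually available."""
    if user_requested_mb is not None:
        return user_requested_mb
    proposed = int(disk_available_mb * DEFAULT_FRACTION)
    return min(disk_available_mb, max(proposed, MINIMUM_SANDBOX_MB))
```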
One amendment to your previous comment: disk_total = cache_size + sum(sandbox_i) + disk_avail is a constraint to be obeyed, not an assignment statement. disk_total is constant, cache_size and sandbox_i are measured, and disk_avail is what's left over. If you prefer an assignment, then this: disk_avail = disk_total - cache_size - sum(sandbox_i)
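For illustration, a small sketch of that invariant, with field names that are assumptions rather than the worker's actual structures: disk_total is fixed at startup, cache and sandbox sizes are measured, disk_avail is derived, and comparing the derived value against what the filesystem reports exposes unmanaged usage:

```python
from dataclasses import dataclass, field

@dataclass
class WorkerDisk:
    disk_total_mb: int                      # fixed at startup, never recomputed
    cache_mb: int = 0                       # measured cache size
    sandbox_mb: dict = field(default_factory=dict)  # task id -> measured sandbox size

    @property
    def disk_avail_mb(self):
        """disk_avail = disk_total - cache_size - sum(sandbox_i)."""
        return self.disk_total_mb - self.cache_mb - sum(self.sandbox_mb.values())

    def unmanaged_usage_mb(self, fs_free_mb):
        """Positive if something outside the cache and sandboxes is eating space:
        the filesystem reports less free space than the model says is available."""
        return max(0, self.disk_avail_mb - fs_free_mb)
```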