vine: oddly long task retrieving time #4007
Comments
Adding some of my observations to what @JinZhou5042 has described above: I can reiterate, as Jin pointed out, that this is a consistent issue affecting nearly all runs. The only exception is applications that are very small in scope; otherwise, any at-scale run will face slowdowns towards the end of the run, both in task concurrency and in task completion. The issue doesn't seem to be tied to the content of the tasks; i.e., regardless of how complex or simple the calculations are, or how large or small the outputs end up being, the slowdown occurs. An explicit example is when I am doing no calculations at all: if I am simply reading a file and writing some subset of that data to a second file, I still see the slowdown, even though I've done no calculations or data manipulation. Performing my needed computations in smaller chunks helps somewhat, but there's been no real escaping it for some time. Often it is more worthwhile to end the run early, leaving ~5% of the tasks unfinished, rather than wait for them to finish (sometimes taking double the amount of time needed for the previous 95%). I have been able to confirm in the past that the remaining 5% of tasks do not correlate with larger or more complex tasks, so it seems to be almost coincidence which tasks end up taking longer.
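For concreteness, a minimal sketch of this kind of pass-through task with the TaskVine Python API might look like the following (the paths, sizes, and task count are placeholders, not the actual analysis code):

```python
# Minimal sketch of the pass-through workload described above: each task
# only reads an input file and writes a subset of it back out, with no
# real computation in between.
import ndcctools.taskvine as vine

m = vine.Manager(9123)

for i in range(100):  # placeholder task count
    infile = m.declare_file(f"inputs/chunk_{i}.dat")     # placeholder paths
    outfile = m.declare_file(f"outputs/subset_{i}.dat")

    # The "work" is just copying the first ~10 MB of the input to the output.
    t = vine.Task("head -c 10000000 chunk.in > subset.out")
    t.add_input(infile, "chunk.in")
    t.add_output(outfile, "subset.out")
    t.set_cores(1)
    m.submit(t)

while not m.empty():
    t = m.wait(5)
    if t:
        print(f"task {t.id} finished with result {t.result}")
```

Even with this trivial per-task command, the tail-end slowdown still shows up.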
That is odd. Where are the output files being stored when they come back to the manager? Local disk? AFS?
From the logs, do we know which task is stuck in retrieving files? What does the debug log say about the files being retrieved, the state of the worker, etc.?
I tried both temp and vine files, and they show the same pattern. Data is accessed via VAST. In the logs I found a number of files getting stuck, but no further helpful information can be found there. The problem remains unclear, but I am working on it and will post once I have solid evidence.
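For reference, the two file modes I compared correspond roughly to the following declarations in the Python API (a sketch; the paths and command are placeholders):

```python
import ndcctools.taskvine as vine

m = vine.Manager(9123)

t = vine.Task("head -c 10000000 chunk.in > subset.out")
t.add_input(m.declare_file("inputs/chunk_0.dat"), "chunk.in")  # placeholder path

# Variant 1: a temp file, kept at the worker until some consumer needs it.
t.add_output(m.declare_temp(), "subset.out")

# Variant 2: a regular vine file, written back through the manager to
# shared storage (VAST in our setup). Both variants showed the same stalls.
# t.add_output(m.declare_file("outputs/subset_0.dat"), "subset.out")

m.submit(t)
```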
@JinZhou5042 is this still an open problem? I seem to recall that we discussed a slowdown in scheduling that occurred when the manager had a large number of files to account for. |
This no longer seems to be a problem; switching to random worker selection solves it. But please hold off for the moment; I will run some tests after the ongoing PRs are merged to make sure everything works well :)
I think I have now pinpointed the problem that has been concerning me and @cmoore24-24 for months, which is that DV5 progresses noticeably more slowly at the end of a run than at the beginning.
In this run there are 8801 tasks, each running for less than 20 s and each producing an output file smaller than 35 MB. Below is the task execution time distribution. The task execution time is calculated as `time_worker_end - time_worker_start`, which spans from when a process actually starts to when it finishes. Here, I requested 12 workers, each with 16 cores. With my changes in #4006, resource allocations are perfect: 100% of them succeed and no time is wasted on resource allocation.
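As a sketch of how that distribution is computed, assuming microsecond timestamps and leaving out how the per-task (time_worker_start, time_worker_end) pairs are extracted from the logs:

```python
# Bucket per-task execution times into a histogram, given a list of
# (time_worker_start, time_worker_end) pairs in microseconds.
from collections import Counter

def execution_time_histogram(timestamps_us, bin_width_s=1.0):
    hist = Counter()
    for start_us, end_us in timestamps_us:
        exec_s = (end_us - start_us) / 1e6  # assuming microsecond timestamps
        hist[int(exec_s // bin_width_s)] += 1
    return dict(sorted(hist.items()))

# Example with made-up timestamps for two tasks (~12 s and ~19 s):
print(execution_time_histogram([(0, 12_300_000), (0, 18_900_000)]))
```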
The blue bars show the execution time of each task, while the orange bars show the time spent retrieving its outputs. Two main issues stand out at first glance:
During the same period, I found a number of ready tasks remaining on the manager that were not being dispatched.
However, once I asked for 80 workers with 4 cores each, the concurrency improved significantly and the task retrieval time decreased notably:
It makes me wonder whether the number of cores per worker hurts either the task retrieval time or the number of idle slots, because this happens every time I repeat the above experiment.
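For reproducibility, the two worker configurations I compared look roughly like this with the Python Factory interface (a sketch; I am assuming the Factory option names here, and the batch type depends on the site):

```python
import ndcctools.taskvine as vine

m = vine.Manager(9123)

# Configuration A: 12 workers x 16 cores (slow retrieval, idle slots).
# Configuration B: 80 workers x 4 cores (much better concurrency in my runs).
factory = vine.Factory(batch_type="condor", manager=m)  # batch type is site-specific
factory.cores = 4
factory.min_workers = 80
factory.max_workers = 80

with factory:
    # submit tasks and wait on them here
    while not m.empty():
        t = m.wait(5)
```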
I am confident there is potential for a significant performance speedup once this issue is addressed.