
vine: oddly long task retrieving time #4007

Open · JinZhou5042 opened this issue Dec 12, 2024 · 6 comments
Assignees: JinZhou5042
Labels: bug (For modifications that fix a flaw in the code.), TaskVine
JinZhou5042 (Member) commented Dec 12, 2024

I think I have now pinpointed the problem that has been concerning me and @cmoore24-24 for months: the DV5 application progresses noticeably more slowly toward the end of a run than at the beginning.

This run has 8801 tasks, each running for less than 20 s and each producing an output file smaller than 35 MB. Below is the task execution time distribution. Execution time is calculated as time_worker_end - time_worker_start, i.e., from the moment the task process actually starts on the worker to the moment it finishes.

[image: task execution time distribution]
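For reference, the distribution above is computed exactly as described; here is a minimal sketch of that arithmetic, assuming the per-task time_worker_start/time_worker_end values have already been pulled out of the logs (the parsing is not shown, and the sample values below are made up):

```python
import matplotlib.pyplot as plt

# Hypothetical (start, end) pairs in seconds; in practice these come from the
# time_worker_start / time_worker_end fields recorded for each task.
task_times = [(0.0, 12.3), (1.5, 19.8), (2.0, 17.1), (4.2, 18.9)]

# Execution time = time_worker_end - time_worker_start, per task.
execution_times = [end - start for (start, end) in task_times]

plt.hist(execution_times, bins=50)
plt.xlabel("task execution time (s)")
plt.ylabel("number of tasks")
plt.title("Task execution time distribution")
plt.savefig("task_execution_times.png")
```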

Here, I requested 12 workers with 16 cores each. With my changes in #4006, resource allocation is perfect: 100% of allocations succeed and no time is wasted on them.

The blue bars show the execution time of each task, while the orange bars show the time spent retrieving its outputs. Two issues stand out at first glance:

  1. There are some big gaps; what is causing the idle time?
  2. Task retrieval time grows over the course of the run, with one straggler blocking for 10 minutes!
[image: per-task execution (blue) and output retrieval (orange) times over the run]

During the same period, I found a number of ready tasks sitting on the manager without being dispatched.
[image: ready tasks waiting on the manager]

However, once I asked for 80 workers with 4 cores each, concurrency improved significantly and the task retrieval time dropped notably:

[image: task timing with 80 workers × 4 cores]

This makes me wonder whether the number of cores per worker hurts either the task retrieval time or the slot utilization, because the same pattern appears every time I repeat the experiments above.
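For reproducibility, the two worker configurations I am comparing look roughly like this, using the TaskVine factory from the Python API (the batch type and manager name below are placeholders for my actual setup, so treat them as assumptions):

```python
from ndcctools.taskvine import Factory

# Configuration A: 12 workers x 16 cores (slow retrieval and idle slots observed).
workers = Factory(batch_type="condor", manager_name="dv5-manager")
workers.cores = 16
workers.min_workers = 12
workers.max_workers = 12

# Configuration B: 80 workers x 4 cores (better concurrency, faster retrieval).
# workers.cores = 4
# workers.min_workers = 80
# workers.max_workers = 80

with workers:
    # Run the manager / submit tasks here while the factory keeps workers alive.
    pass
```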

I am confident there is a meaningful performance speedup to be gained once this issue is addressed.

JinZhou5042 self-assigned this on Dec 12, 2024
JinZhou5042 added the bug label (For modifications that fix a flaw in the code.) on Dec 12, 2024
cmoore24-24 commented

Adding some of my observations to what @JinZhou5042 has described above:

I can reiterate, as Jin pointed out, that this is a consistent issue affecting nearly all runs. The only exception is applications that are very small in scope; otherwise, any at-scale run slows down toward the end, both in task concurrency and in task completion.

The issue doesn't seem to be tied to the content of the tasks; i.e., regardless of how complex or simple the calculations are, or how large or small the outputs end up being, the slowdown occurs. An explicit example is when I am doing no calculations at all: if I simply read a file and write some subset of that data to a second file, I still see the slowdown, even though I've done no calculations or data manipulations (a minimal sketch of such a task is below).
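A stripped-down sketch of that no-computation case looks roughly like this (file names, sizes, and the port are placeholders rather than my actual analysis code):

```python
from ndcctools.taskvine import Manager, Task

m = Manager(port=9123)

in_file = m.declare_file("input.dat")

for i in range(100):
    out_file = m.declare_file(f"subset_{i}.dat")
    # Each task only reads the input and writes a subset of it back out;
    # there is no computation or data manipulation involved.
    t = Task("head -c 35000000 infile > outfile")
    t.add_input(in_file, "infile")
    t.add_output(out_file, "outfile")
    m.submit(t)

while not m.empty():
    t = m.wait(5)
    if t:
        print(f"task {t.id} completed")
```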

Performing my needed computations in smaller chunks helps somewhat, but there's been no real escaping it for some time. Often it is more worthwhile to end the run early, leaving ~5% of the tasks unfinished, rather than wait for them to finish (sometimes taking double the time needed for the previous 95%). I have been able to confirm in the past that the remaining 5% of tasks do not correspond to larger or more complex tasks, so it seems almost coincidental which tasks end up taking longer.

dthain (Member) commented Dec 12, 2024

That is odd. Where are the output files being stored when they come back to the manager? Local disk? AFS?

btovar (Member) commented Dec 14, 2024

From the logs, do we know which task is stuck retrieving files? What does the debug log say about the files being retrieved, the state of the worker, etc.?

JinZhou5042 (Member, Author) commented

I tried both temp files and regular vine files, and they show the same pattern. Data is accessed via VAST. In the logs I found that a bunch of files get stuck, but no further helpful information is available there.

The problem remains unclear, but I am working on it and will post once I have solid evidence.

dthain (Member) commented Jan 13, 2025

@JinZhou5042 is this still an open problem? I seem to recall that we discussed a slowdown in scheduling that occurred when the manager had a large number of files to account for.

JinZhou5042 (Member, Author) commented

This no longer seems to be a problem; switching to random worker selection resolves it. But please hold off for the moment: I will run some tests after the ongoing PRs are merged to make sure everything works well :)
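As a rough illustration of what I mean by random worker selection (the method name and value here are placeholders and not necessarily the final API, so please don't copy them verbatim):

```python
from ndcctools.taskvine import Manager, Task

m = Manager(port=9123)

t = Task("./my_command")
# Hypothetical call: ask the manager to pick a random eligible worker for this task,
# analogous to Work Queue's specify_algorithm(WORK_QUEUE_SCHEDULE_RAND).
t.set_scheduler("rand")
m.submit(t)
```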
