
vine: oddly long task retrieving time #4007

Open · JinZhou5042 opened this issue Dec 12, 2024 · 6 comments
Assignees: JinZhou5042
Labels: bug (For modifications that fix a flaw in the code.), TaskVine
JinZhou5042 (Member) commented Dec 12, 2024

I think I have now pinpointed the problem that has been concerning me and @cmoore24-24 for months: the DV5 application progresses noticeably more slowly toward the end of a run than at the beginning.

This run has 8801 tasks, each running for less than 20 s and each producing an output file smaller than 35 MB. Below is the task execution time distribution. Execution time is calculated as time_worker_end - time_worker_start, i.e., from the moment the task process actually starts on the worker to the moment it finishes.

[image: task execution time distribution]
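For reference, the distribution above is computed exactly as described; here is a minimal sketch of that arithmetic, assuming the per-task time_worker_start/time_worker_end values have already been pulled out of the logs (the parsing is not shown, and the sample values below are made up):

```python
import matplotlib.pyplot as plt

# Hypothetical (start, end) pairs in seconds; in practice these come from the
# time_worker_start / time_worker_end fields recorded for each task.
task_times = [(0.0, 12.3), (1.5, 19.8), (2.0, 17.1), (4.2, 18.9)]

# Execution time = time_worker_end - time_worker_start, per task.
execution_times = [end - start for (start, end) in task_times]

plt.hist(execution_times, bins=50)
plt.xlabel("task execution time (s)")
plt.ylabel("number of tasks")
plt.title("Task execution time distribution")
plt.savefig("task_execution_times.png")
```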

Here, I requested 12 workers with 16 cores each. With my changes in #4006, resource allocation is perfect: 100% of allocations succeed and no time is wasted on them.

The blue bars show the execution time of each task, while the orange bars show the time spent retrieving its outputs. Two issues stand out at first glance:

  1. There are some big gaps; what is causing the idle time?
  2. Task retrieval time grows over the course of the run, with one straggler blocking for 10 minutes!
[image: per-task execution (blue) and output retrieval (orange) times over the run]

During the same period, I found a number of ready tasks sitting on the manager without being dispatched.
[image: ready tasks waiting on the manager]

However, once I asked for 80 workers with 4 cores each, concurrency improved significantly and the task retrieval time dropped notably:

[image: task timing with 80 workers × 4 cores]

This makes me wonder whether the number of cores per worker hurts either the task retrieval time or the slot utilization, because the same pattern appears every time I repeat the experiments above.
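For reproducibility, the two worker configurations I am comparing look roughly like this, using the TaskVine factory from the Python API (the batch type and manager name below are placeholders for my actual setup, so treat them as assumptions):

```python
from ndcctools.taskvine import Factory

# Configuration A: 12 workers x 16 cores (slow retrieval and idle slots observed).
workers = Factory(batch_type="condor", manager_name="dv5-manager")
workers.cores = 16
workers.min_workers = 12
workers.max_workers = 12

# Configuration B: 80 workers x 4 cores (better concurrency, faster retrieval).
# workers.cores = 4
# workers.min_workers = 80
# workers.max_workers = 80

with workers:
    # Run the manager / submit tasks here while the factory keeps workers alive.
    pass
```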

I am confident there is a meaningful performance speedup to be gained once this issue is addressed.

JinZhou5042 self-assigned this on Dec 12, 2024
JinZhou5042 added the bug label (For modifications that fix a flaw in the code.) on Dec 12, 2024
cmoore24-24 commented

Adding some of my observations to what @JinZhou5042 has described above:

I can reiterate, as Jin pointed out, that this is a consistent issue affecting nearly all runs. The only exception is applications that are very small in scope; otherwise, any at-scale run slows down toward the end, both in task concurrency and in task completion.

The issue doesn't seem to be tied to the content of the tasks; i.e., regardless of how complex or simple the calculations are, or how large or small the outputs end up being, the slowdown occurs. An explicit example is when I am doing no calculations at all: if I simply read a file and write some subset of that data to a second file, I still see the slowdown, even though I've done no calculations or data manipulations (a minimal sketch of such a task is below).
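A stripped-down sketch of that no-computation case looks roughly like this (file names, sizes, and the port are placeholders rather than my actual analysis code):

```python
from ndcctools.taskvine import Manager, Task

m = Manager(port=9123)

in_file = m.declare_file("input.dat")

for i in range(100):
    out_file = m.declare_file(f"subset_{i}.dat")
    # Each task only reads the input and writes a subset of it back out;
    # there is no computation or data manipulation involved.
    t = Task("head -c 35000000 infile > outfile")
    t.add_input(in_file, "infile")
    t.add_output(out_file, "outfile")
    m.submit(t)

while not m.empty():
    t = m.wait(5)
    if t:
        print(f"task {t.id} completed")
```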

Performing my needed computations in smaller chunks helps somewhat, but there's been no real escaping it for some time. Often it is more worthwhile to end the run early, leaving ~5% of the tasks unfinished, rather than wait for them to finish (sometimes taking double the time needed for the previous 95%). I have been able to confirm in the past that the remaining 5% of tasks do not correspond to larger or more complex tasks, so it seems almost coincidental which tasks end up taking longer.

dthain (Member) commented Dec 12, 2024

That is odd. Where are the output files being stored when they come back to the manager? Local disk? AFS?

btovar (Member) commented Dec 14, 2024

From the logs, do we know which task is stuck retrieving files? What does the debug log say about the files being retrieved, the state of the worker, etc.?

JinZhou5042 (Member, Author) commented

I tried both temp files and regular vine files, and they show the same pattern. Data is accessed via VAST. In the logs I found that a bunch of files get stuck, but no further helpful information is available there.

The problem remains unclear, but I am working on it and will post once I have solid evidence.

dthain (Member) commented Jan 13, 2025

@JinZhou5042 is this still an open problem? I seem to recall that we discussed a slowdown in scheduling that occurred when the manager had a large number of files to account for.

JinZhou5042 (Member, Author) commented

This no longer seems to be a problem; switching to random worker selection resolves it. But please hold off for the moment: I will run some tests after the ongoing PRs are merged to make sure everything works well :)
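As a rough illustration of what I mean by random worker selection (the method name and value here are placeholders and not necessarily the final API, so please don't copy them verbatim):

```python
from ndcctools.taskvine import Manager, Task

m = Manager(port=9123)

t = Task("./my_command")
# Hypothetical call: ask the manager to pick a random eligible worker for this task,
# analogous to Work Queue's specify_algorithm(WORK_QUEUE_SCHEDULE_RAND).
t.set_scheduler("rand")
m.submit(t)
```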
