I wrote this very long draft over the holidays in December. I had hoped to summarize it and pull out some more actionable conclusions, but haven't had the time. I apologize that this is very long.
I don't agree with plenty of things I've written here. I think there's more to this topic I'm missing.
I'm posting this only because I think the middle part, with videos of how queuing vs. the old co-assignment traverse the graph in such different orders, is interesting. The fact that queuing follows global priority order much more closely could be an unanticipated reason why it helps so much.
Co-assignment is the idea of assigning initial tasks to the same worker if they'll be used together by a later task. It's intended to avoid future data transfers, thereby making workloads run faster and use less memory.
Co-assignment was previously achieved (without queuing) by just sorting initial tasks by priority and splitting them up contiguously between workers. With 1000 initial tasks and 4 workers, the first worker would get tasks 1-250, the next 251-500, and so on. For its simplicity, this "divvying-up" approach did a remarkably good job of putting tasks onto the same worker if they would be combined later.
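A minimal sketch of that divvying-up, assuming tasks are already sorted by priority (the names here are illustrative, not the actual scheduler code):

```python
# Illustrative sketch of the old divvy-up co-assignment (not the real scheduler code):
# split the priority-sorted initial tasks into contiguous, equal-ish chunks, one per worker.

def divvy_up(sorted_tasks, workers):
    """Give each worker a contiguous run of priority-sorted tasks."""
    chunk = -(-len(sorted_tasks) // len(workers))  # ceiling division
    return {
        worker: sorted_tasks[i * chunk : (i + 1) * chunk]
        for i, worker in enumerate(workers)
    }

# 1000 initial tasks across 4 workers: worker "a" gets tasks 0-249, "b" gets 250-499, etc.
assignments = divvy_up(list(range(1000)), ["a", "b", "c", "d"])
assert assignments["b"][0] == 250
```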
We gave this up for queuing. The whole point of queuing (and reason it massively reduces memory use) is to not pre-assign tasks to workers, so they don't have extra work—which means we can't pre-batch-up the graph as easily.
We recently tried out a graph-based heuristic for picking which tasks would make good "co-assignment groups" (#7298), but we decided not to move ahead with it. The heuristic didn't work well enough and would have been complicated to implement.
But I was a little disappointed, and I realized there are actually a number of other simple ways to implement co(ish)-assignment alongside queuing. I've tried them out over the last few days and found some interesting results that make me question whether non-structural co-assignment is even a good idea in principle.
Simple co-assignment negates queuing's benefits
I've tried a couple of ways of adding "simple" co-assignment to queuing. "Simple" meaning this divvy-up-the-graph strategy. (It turns out it's actually pretty easy to have workers divvy up the queue: worker i basically pops the task at position i/N of the way through the sorted queue; there's a rough sketch of this below.) But the memory usage of test_anom_mean—a physical science workload that's been the canonical example of the "need for memory backpressure", and whose memory use went down by 50% with queuing—goes back to its old levels once you add simple co-assignment to queuing:
Dark blue is current queuing. Yellow is prior to queuing (with co-assignment). The red and teal have queuing as well as co-assignment—yet they look just like when queuing is off!
This is surprising. When we added co-assignment a year ago, it seemed to reduce memory use, including for this particular example. Yet here, co-assignment actually makes memory use significantly worse compared to naive scheduling. How could this be? We're running the same number of root tasks at once, and we should have to copy them around less—how could that be bad?
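For concreteness, here's roughly what I mean by worker i popping at position i/N. This is a hypothetical sketch with made-up names, not the actual branch:

```python
# Hypothetical sketch of workers divvying up the shared queue (not the real implementation):
# with N workers, worker i pops the task roughly i/N of the way through the
# priority-sorted queue, so each worker chews through its own contiguous region.

def pop_for_worker(queue, worker_index, n_workers):
    """Pop the task at fraction worker_index/n_workers of the sorted queue."""
    if not queue:
        return None
    position = (worker_index * len(queue)) // n_workers
    return queue.pop(position)

queue = list(range(12))             # stand-in for a priority-sorted task queue
print(pop_for_worker(queue, 0, 4))  # worker 0 pops task 0 (front of the queue)
print(pop_for_worker(queue, 2, 4))  # worker 2 pops from about halfway in: task 6
```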
Watching the graph
To make sense of this, let's watch how the scheduler progresses through the graph in a few cases:
Current queuing, no co-assignment
anom-mean-queue.mp4
Co-assignment, no queuing
anom-mean-inf-coassign.mp4
These are really, really different patterns!
With priority-ordered queuing, there's a clear "bottom to top" wave of work through the graph:
With co-assignment, progress is fragmented. Work starts from many places (that's the point: spread the workers out so they leave each other a nice long run of contiguous tasks to process), but it takes longer for late parts of the graph to complete. (I don't see red in the fourth layer from the right until ~6sec, compared to ~3sec with queuing.) Keep in mind that data late in the graph is smaller, since it's aggregated, so trading big tasks on the left for small ones on the right as soon as possible is advantageous for memory. Just as the initial tasks on the left start fragmented, the completion of later tasks on the right is also more fragmented:
Now, here's the graph using a queuing + co-assignment implementation:
anom-mean-divvy.mp4
This feels to me a lot more like the second case above (co-assignment, no queuing) than the first.
What does co-assignment actually mean?
With co-assignment, we want workers to be more independent, so they have to talk to each other less. So we want each to be working towards a different output task, instead of all producing inputs that will feed into one output task. That means we're working towards more things at the same time—we're more parallel. And as always, increasing parallelism increases memory use. It takes more memory to do more things at the same time.
Another way to think about it is that co-assignment is only relevant when downstream tasks combine multiple input tasks: say, 4 inputs feed into one downstream. The goal with co-assignment is that each worker is responsible for a separate downstream task. So with N workers, you're working towards N downstream tasks, which means 4N input tasks at once.
Without co-assignment, queuing means you'll only work on N input tasks at once. A quarter the memory! You'll also only work towards N/4 downstream tasks at once. But because all N workers are working towards those N/4, you'll theoretically finish each batch of N/4 in a quarter the time—ignoring transfer overhead, of course. Then you can move on to the next N/4.
(Of course, you do need 4 inputs to end up on the same worker at the same time, so the peak per-worker memory use won't be lower. I'm still struggling to reconcile this fact.)
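To put rough numbers on that comparison (purely illustrative, assuming the 4-inputs-per-downstream case above and, for simplicity, one thread per worker):

```python
# Back-of-envelope arithmetic for the 4-inputs-per-downstream example above.
# Illustrative numbers only, assuming one thread per worker; not measurements.

n_workers = 8    # N
fan_in = 4       # inputs feeding each downstream task

# Co-assignment: every worker drives its own downstream task,
# so all of those tasks' inputs are in play at once.
downstream_with_coassignment = n_workers           # N = 8
inputs_with_coassignment = n_workers * fan_in      # 4N = 32

# Queuing without co-assignment: only N inputs run at once,
# all feeding a smaller set of downstream tasks.
inputs_with_queuing = n_workers                    # N = 8
downstream_with_queuing = n_workers // fan_in      # N/4 = 2

print(inputs_with_coassignment, "vs", inputs_with_queuing)  # 32 vs 8 inputs in flight
```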
Note that in all this, we're talking about cases where the "tree size" is larger than the number of threads per worker; that is, more inputs feed into a single downstream task than a worker has threads. In that case, aiming for co-assignment is certainly going to increase total cluster memory use.
This is an important realization, since it also has implications for speculative task assignment. The hope is that STA would decrease both runtime and memory usage. The idea there is to submit whole subsets of the graph to workers, including downstream tasks. This would make workers significantly more independent (and would naturally cause co-assignment).
But what we see is that more worker independence (fewer transfers) also means more responsibility per worker: there's less cooperation, so each worker has to handle more work on its own, meaning more memory use. It's reasonable to think STA could have the same effect. In either case, we'd have one downstream task, or cogroup, or stream of work—whatever we want to call it—per worker, each of which might include >1 input, meaning >N input tasks active at once.
Best-effort co-assignment?
So traversing the graph in a different order to leave room for co-assignment clearly increases memory use. Could we still traverse the graph in the same priority-order way we do right now, but pick workers in a way that reduces transfers even slightly?
Currently, we do pretty much the worst-possible thing for co-assignment. If we have 4-threaded workers, and trees of width 4, all 4 inputs will usually end up on different workers. Couldn't we at least get those inputs on the same worker most of the time?
That's easy enough for the first total_nthreads tasks, but the more important question (as discussed in #7298) is: when a worker completes a task and a thread opens, which task do we pick next for it? If we choose anything besides the next-highest-priority task on the queue, we're now traversing the graph in a different order. Fundamentally, if only one slot opens up at a time, any form of co-assignment would require not following priority order.
Okay, so could we have multiple slots open at once? Then we could pop a few consecutive tasks off the queue at a time and assign them as a batch. For that to happen without overproduction, we'd have to intentionally undersaturate the worker, wait for multiple threads to open up, and then assign a few tasks in a batch (roughly sketched below).
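A rough sketch of that batching idea, with hypothetical names and structure rather than the actual scheduler API:

```python
# Hypothetical sketch of batch assignment (not the actual distributed scheduler API):
# deliberately wait until several threads are free on a worker, then hand it a
# contiguous run of the highest-priority queued tasks, so sibling inputs land together.

def maybe_assign_batch(queue, open_threads, batch_size=4):
    """Pop a contiguous run of top-priority tasks once enough threads are free."""
    if open_threads < batch_size or len(queue) < batch_size:
        return []  # intentionally undersaturate: keep waiting for more threads to open up
    batch = queue[:batch_size]
    del queue[:batch_size]
    return batch

# A worker with 4 threads reports 3 free -> nothing assigned yet;
# once the 4th frees up, it gets the next 4 consecutive (likely sibling) tasks.
queue = list(range(16))
assert maybe_assign_batch(queue, open_threads=3) == []
assert maybe_assign_batch(queue, open_threads=4) == [0, 1, 2, 3]
```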
A quick experiment with this strategy confirms it "works": transfers are modestly decreased, but the memory pattern stays low, like current queuing. However, it's also slower in most cases, probably because of all the time we let workers sit idle. (I might try a slight policy change, as well as compare differently-sized workers, but intuitively it does seem like this will be slower.)