an example that shows the need for memory backpressure #2602

rabernat · 2019-03-18T20:03:27Z

In my work with large climate datasets, I often concoct calculations that cause my dask workers to run out of memory, start dumping to disk, and eventually grind my computation to a halt. There are many ways to mitigate this by e.g. using more workers, more memory, better disk-spilling settings, simpler jobs, etc. and these have all been tried over the years with some degree of success. But in this issue, I would like to address what I believe is the root of my problems within the dask scheduler algorithms.

The core problem is that the tasks early in my graph generate data faster than it can be consumed downstream, causing data to pile up, eventually overwhelming my workers. Here is a self-contained example:

import math

import dask.array as dsa

# create some random data
# assume chunk structure is not under my control, because it originates
# from the way the data is laid out in the underlying files
shape = (500000, 100, 500)
chunks = (100, 100, 500)
data = dsa.random.random(shape, chunks=chunks)

# now rechunk the data to permit me to do some computations along different axes
# this aggregates chunks along axis 0 and dis-aggregates along axis 1
data_rc = data.rechunk((1000, 1, 500))
FACTOR = 15


def my_custom_function(f):
    # a pretend custom function that would do a bunch of stuff along
    # axis 0 and 2 and then reduce the data heavily
    return f.ravel()[::FACTOR][None, :]

# apply that function to each chunk
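# assumption: c0 was not defined in the original post; it is taken here to be
# the number of blocks along axis 0 (500), which matches range(500) in the loop further down
c0 = data_rc.numblocks[0]
c1 = math.ceil(data_rc.ravel()[::FACTOR].size / c0)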
res = data_rc.map_blocks(my_custom_function, dtype=data.dtype,
                         drop_axis=[1, 2], new_axis=[1], chunks=(1, c1))

res.compute()

(Perhaps this could be simplified further, but I have done my best to preserve the basic structure of my real problem.)

When I watch this execute on my dashboard, I see the workers just keep generating data until they reach their memory thresholds, at which point they start writing data to disk, before my_custom_function ever gets called to relieve the memory buildup. Depending on the size of the problem and the speed of the disks where they are spilling, sometimes we can recover and manage to finish after a very long time. Usually the workers just stop working.

This fail case is frustrating, because often I can achieve a reasonable result by just doing the naive thing:

for n in range(500):
    res[n].compute()

and evaluating my computation in serial.

I wish the dask scheduler knew to stop generating new data until the downstream data could be consumed. I am not an expert, but I believe the term for this is backpressure. I see this term has come up in #641, and also in this blog post by @mrocklin regarding streaming data.

I have a hunch that resolving this problem would resolve many of the pervasive but hard-to-diagnose problems we have in the xarray / pangeo sphere. But I also suspect it is not easy and requires major changes to core algorithms.

Dask version 1.1.4


mrocklin · 2019-03-18T20:31:19Z

Thanks for the writeup and the motivation @rabernat . In general I agree with everything that you've written.

I'll try to lay out the challenge from a scheduling perspective. The worker notices that it is running low on memory. The things that it can do are:

Run one of the many tasks it has sitting around
Stop running those tasks
Write data to disk
Kick tasks back to the scheduler for rescheduling

Probably it should run some task, but it doesn't know which tasks generate data, and which tasks allow it to eventually release data. In principle we know that operations like from_hdf5 are probably bad for memory and operations like sum are probably good, but we probably can't pin these names into the worker itself.

One option that I ran into recently is that we could slowly try to learn which tasks cause memory to arrive and which tasks cause memory to be released. This learning would happen on the worker. This isn't straightforward because there are many tasks running concurrently and their results on the system will be confused (there is no way to tie a system metric like CPU time or memory use to a particular Python function). Some simple model might give a decent idea over time though.

We do something similar (though simpler) with runtime. We maintain an exponentially weighted moving average of task run time, grouped by task prefix name (like from-hdf5), and use this for scheduling heuristics.
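To make the runtime heuristic concrete, here is a minimal sketch of an exponentially weighted moving average kept per task prefix. It is not the code Dask uses: the class name, the `alpha` smoothing factor, and the sample durations are all hypothetical.

```python
class PrefixRuntimeEstimator:
    """Exponentially weighted moving average of task run time, per prefix."""

    def __init__(self, alpha=0.5):
        # alpha is a hypothetical smoothing factor, not a value taken from Dask
        self.alpha = alpha
        self.average = {}

    def update(self, prefix, duration):
        old = self.average.get(prefix)
        if old is None:
            # first observation for this prefix
            self.average[prefix] = duration
        else:
            # new samples count for alpha, history for (1 - alpha)
            self.average[prefix] = self.alpha * duration + (1 - self.alpha) * old
        return self.average[prefix]


est = PrefixRuntimeEstimator(alpha=0.5)
for seconds in (2.0, 4.0, 4.0):
    est.update("from-hdf5", seconds)
est.update("sum", 0.1)
```

After these updates the estimate for `from-hdf5` has moved from 2.0 towards 4.0 (it is 3.5), so recent tasks weigh more than old ones without storing any history.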

This approach would also be useful for other resource constraints, like network use (it'd be good to have a small number of network-heavy tasks like from-s3 running at once), and the use of accelerators like GPUs (the primary cause of my recent interest).

If someone wanted to try out the approach above my suggestion would be to ...

Create a periodic callback on the worker that checked the memory usage of the process with some frequency
Look at the tasks running (self.executing) and the memory growth since the last time and adjust some model for each of those tasks' prefixes (see key_split)
That model might be very simple, like the number of times that memory has increased while seeing that function run. Greater than 0 means that memory increased more often than decreased and vice versa.
Look at the policies in Worker.memory_monitor and maybe make a new one
- Maybe we go through self.ready and reprioritize?
- Maybe we set a flag so when we pop tasks from self.ready we only accept those that we think reduce memory use?
- ...
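The first three steps above could be sketched roughly as follows. This is a toy model, not worker code: `key_prefix` is a simplified, hypothetical stand-in for `key_split`, and `observe` stands for whatever the periodic callback would call with the current process memory and the keys in `self.executing`.

```python
from collections import defaultdict


def key_prefix(key):
    # Simplified, hypothetical stand-in for key_split:
    # "from-hdf5-a1" -> "from-hdf5"
    return key.rsplit("-", 1)[0]


class MemoryTrendModel:
    """Per prefix, count how often memory rose vs. fell while it was running."""

    def __init__(self):
        self.score = defaultdict(int)
        self.last_memory = None

    def observe(self, process_memory, executing_keys):
        if self.last_memory is not None:
            delta = process_memory - self.last_memory
            step = (delta > 0) - (delta < 0)  # +1, 0 or -1
            for prefix in {key_prefix(k) for k in executing_keys}:
                self.score[prefix] += step
        self.last_memory = process_memory


model = MemoryTrendModel()
model.observe(100, ["from-hdf5-a1"])
model.observe(150, ["from-hdf5-a2"])
model.observe(220, ["from-hdf5-a3", "sum-b1"])
model.observe(180, ["sum-b2"])
model.observe(120, ["sum-b3"])
```

With these made-up samples `from-hdf5` ends with a positive score and `sum` with a negative one. Because concurrently running tasks share the blame for each change, the signal is noisy and only becomes meaningful over many observations.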

There are likely other solutions to this whole problem. But this might be one.

martindurant · 2019-03-18T20:36:55Z

there is no way to tie a system metric like CPU time or memory use to a particular Python function

but we do measure the memory usage of the inputs and outputs of functions that have run (not the internal transient memory), and also know whether running of a function ought to free the memory used by its inputs. If measurements were done on the prefix basis mentioned, there could be a reasonable guess at the memory implications of running a given task.

rabernat · 2019-03-18T20:42:42Z

Thanks a lot for your quick response.

One quick clarification question... You say that a central challenge is that the scheduler

doesn't know which tasks generate data, and which tasks allow it to eventually release data

In my mental model, this is obvious from the input and output array shapes. If a task has an input which is 1000x1000 and an output which is 10x10, it is a net sink of memory. The initial nodes of the graph are always sources. I assumed that the graph must know about these input and output shapes, and the resulting memory footprint, even if it has no idea how long each task will take. But perhaps I misunderstand how much the scheduler knows about the tasks it is running.

Wouldn't it be easier to expose this information to the scheduler than it would be to rig up an adaptive monitoring solution?
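The mental model described here amounts to the following arithmetic. It is a sketch at the array level only, assuming the shape and element size of every input and output are known up front; the function names are hypothetical and float64 (8 bytes per element) is assumed.

```python
import math


def expected_nbytes(shape, itemsize=8):
    # itemsize=8 assumes float64 elements
    return math.prod(shape) * itemsize


def net_memory_effect(input_shape, output_shape, itemsize=8):
    # negative: the task is a memory sink; positive: a memory source
    return expected_nbytes(output_shape, itemsize) - expected_nbytes(input_shape, itemsize)


effect = net_memory_effect((1000, 1000), (10, 10))
```

For the 1000x1000 to 10x10 example, `effect` is about -8 MB, which is the sense in which such a task is a net sink.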

mrocklin · 2019-03-18T20:59:58Z

but we do measure the memory usage of the inputs and outputs of functions that have run (not the internal transient memory), and also know whether running of a function ought to free the memory used by its inputs. If measurements were done on the prefix basis mentioned, there could be a reasonable guess at the memory implications of running a given task.

That's a good point, and it's much easier to measure :)

In my mental model, this is obvious from the input and output array shapes. If a task has an input which is 1000x1000 and an output which is 10x10, it is a net sink of memory. The initial nodes of the graph are always sources. I assumed that the graph must know about these input and output shapes, and the resulting memory footprint, even if it has no idea how long each task will take. But perhaps I misunderstand how much the scheduler knows about the tasks it is running.

The scheduler doesn't know about shapes or dtypes of your arrays. It only knows that it is running a Python function, and that that function produces some outputs. You're thinking about Dask array, not Dask.

@martindurant 's suggestion is probably enough for your needs though.

martindurant · 2019-03-18T21:03:31Z

The scheduler doesn't know about shapes or dtypes of your arrays

I suppose the client does, at least for arrays, so there could be another route to pass along the expected output memory size of some tasks. In the dataframe case, the client might know enough, and in the general case, there could be a way for users to specify expected output size.

guillaumeeb · 2019-03-18T21:04:35Z

Hi everyone,

This is a subject of interest to me too. But I thought that Dask tried already to minimize memory footprint as explained here. Obviously, this is not what @rabernat is observing, and I've seen the same behavior as him on similar use cases.

Couldn't we just give an overall strategy for the Scheduler to use? Like work on the depth of the graph first?

mrocklin · 2019-03-18T21:09:05Z

But I thought that Dask tried already to minimize memory footprint as explained here.
Couldn't we just give an overall strategy for the Scheduler to use? Like work on the depth of the graph first?

The scheduler does indeed run the graph depth first to the extent possible (actually, it's a bit more complex than this, but your depth-first intuition is correct). Things get odd though if

Some tasks that we would want to run are blocked by other things, like data transfer times. Dask might move on to other tasks
Some tasks that are available we might not want to run, even if we can. We'd prefer to sacrifice parallelism and wait rather than allocate more memory
Some tasks in our graph may be much more expensive in terms of memory than others, and so depth first may not be enough of a constraint to choose the optimal path

sjperkins · 2019-03-19T07:48:22Z

These issues also impact my use cases.

One thing I've considered is rather than creating a graph purely composed of parallel reductions:

a     b  c   d   e  f    g  h
  \  /    \ /     \ /    \ /
    \     /         \    /
     \  /             \ /
       \              /
         \           /
           \        /
             \     /
               \ /

is to define some degree of parallelism (2 in this example) and then create two "linked lists", where the results of a previous operation are aggregated with those of the current.

a    e
|    |
b    f
|    |
c    g
|    |
d    h
  \ /

I haven't tried this out yet, so not sure to what extent it would actually help, but it's on my todo list to mock the above up with dummy ops.
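A plain-Python mock-up of the two shapes, with integers standing in for the chunks a..h and addition as the dummy op. The function names are hypothetical, and this says nothing about how Dask would actually schedule either graph; it only shows that both arrangements compute the same result.

```python
from functools import reduce
import operator


def tree_reduce(values, combine):
    """Pairwise reduction: combine neighbours level by level."""
    level = list(values)
    while len(level) > 1:
        nxt = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element carries over to the next level
        level = nxt
    return level[0]


def chained_reduce(values, combine, parallelism=2):
    """Split into `parallelism` sequential chains, then combine the chain results."""
    values = list(values)
    size = -(-len(values) // parallelism)  # ceiling division
    chains = [values[i:i + size] for i in range(0, len(values), size)]
    partials = [reduce(combine, chain) for chain in chains]
    return reduce(combine, partials)


chunks = [1, 2, 3, 4, 5, 6, 7, 8]  # stand-ins for a..h
tree_total = tree_reduce(chunks, operator.add)
chain_total = chained_reduce(chunks, operator.add, parallelism=2)
```

In the chained form each chain only ever needs one running aggregate plus the next chunk, which is the property that might help with memory.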

I suppose the client does, at least for arrays, so there could be another route to pass along the expected output memory size of some tasks. In the dataframe case, the client might know enough, and in the general case, there could be a way for users to specify expected output size.

In this context it's probably worth re-raising Task Annotations again: dask/dask#3783, which in a complete form would allow annotating a task with an estimate of its memory output. I started on it in #2180 but haven't found time to push it forward.

abergou · 2019-03-19T15:16:57Z

This also affects my work-flow so I would be curious to see progress on it. On our end, writing out to disk doesn't help since disk is also a finite resource. We have a few workarounds such as adding artificial dependencies into a graph or submitting a large graph in pieces.

rabernat · 2019-03-22T14:36:49Z

Thanks everyone for the helpful discussion. It sounds like @mrocklin and @martindurant have identified a concrete idea to try which could improve the situation. This issue is a pretty high priority for us in Pangeo, so we would be very happy to test any prototype implementation of this idea.

TomAugspurger · 2019-04-17T15:03:50Z

I’ll be on paternity leave for at least the next two weeks (maybe longer, depending on how things are going). If this is still open then I’ll tackle it then.

rabernat · 2019-05-22T13:52:22Z

Hi folks...I'm just pinging this issue to remind us that it remains a crucial priority for Pangeo.

TomAugspurger · 2019-05-22T13:54:14Z

👍 it's on my list for tomorrow :)

TomAugspurger · 2019-05-23T13:31:14Z

https://nbviewer.jupyter.org/gist/TomAugspurger/2df7828c22882d336ad5a0722fbec842 has a few initial thoughts. The first problem is that the rechunking from along just axis 0 to just axis 1 causes a tough, global communication pattern. The output of my_custom_function will need input from every original chunk.

So for this specific problem it may be an option to preserve the chunks along axis 0, and only add additional chunks along axis 1. This leads to a better communication pattern.

But, a few issues with that

It may not work for the real problem (@rabernat do you have a "real world" example available publicly somewhere?). We should still attempt a proper solution to the problem.
It increases the number of tasks, which may bump up against other scheduler limitations.

I'll keep looking into this, assuming that a different chunking pattern isn't feasible.

(FYI @mrocklin the HTML array repr came in handy).

martindurant · 2019-05-23T13:57:08Z

@TomAugspurger , the general problem remains interesting, though, of using a given task's expected output size, and the known set of things that could be dropped once it runs, as an additional heuristic in determining when/where to schedule that task.

TomAugspurger · 2019-05-23T14:07:45Z

Yep.

I notice now that my notebook oversimplified things. The original example still has chunks along the first and second axis before the map_blocks (which means the communication shouldn't be entirely global). I'll update things.

rabernat · 2019-05-23T15:16:26Z

This is a "real world" example that can be run from ocean.pangeo.io which I believe illustrates the core issue. It is challenging because the chunks are big, leading to lots of memory pressure.

import intake
cat = intake.Catalog('https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/ocean.yaml')
ds = cat.GODAS.to_dask()
print(ds)
salt_clim = ds.salt.groupby('time.month').mean(dim='time')
print(salt_clim)

# launch dask cluster
from dask.distributed import Client
from dask_kubernetes import KubeCluster
# using more workers might alleviate the problem, but that is not a satisfactory workaround
cluster = KubeCluster(n_workers=5)
client = Client(cluster)

# computation A
# compute just one month of the climatology
# it works fine, indicating that the computation could be done in serial
salt_clim[0].load()

# computation B
# now load the whole thing
# workers quickly run out of memory, spill to disk, and even get killed
salt_clim.load()

TomAugspurger · 2019-05-24T14:08:07Z

Thanks @rabernat. I'll try it out on the pangeo binder once I have something working locally.

I think (hope) I have a workable solution. My basic plan is

Track the cumulative output .nbytes of tasks at the prefix-level (easy)
When deciding how to schedule under load, look up / compute the relative size
of the input to the output, again at the prefix level. Prioritize tasks that look
like they'll reduce memory usage. (seems harder. I don't know if we use
prefix-level relationships anywhere else in scheduling decisions).

So if we currently have high memory usage, and we see a set of tasks like

      a1   a2  # 5 bytes, 5 bytes   |   x1  x2   # 5 bytes, 5 bytes
        \ /                         |    \ /
         b     # 1 byte             |     y      # 20 bytes

We would (ideally) schedule b before y, since it tends to have a net
decrease in memory usage. Or if we have tasks running that we know will free up
some memory, we might wait to schedule y since it tends to increase memory.
I'm sure there are things I'm missing, but this should provide a good start.
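The b-before-y decision can be sketched with a few lines of Python. The byte counts mirror the diagram above; the dictionary and function names are hypothetical, and the sketch optimistically assumes that all inputs are released once the task finishes.

```python
# Hypothetical prefix-level statistics in bytes, mirroring the diagram above.
avg_output_nbytes = {"a": 5, "b": 1, "x": 5, "y": 20}


def expected_net_nbytes(prefix, dependency_prefixes):
    """Output size minus the size of the inputs that could be released afterwards."""
    released = sum(avg_output_nbytes[d] for d in dependency_prefixes)
    return avg_output_nbytes[prefix] - released


# ready task prefix -> prefixes of its dependencies
ready = {"y": ["x", "x"], "b": ["a", "a"]}
order = sorted(ready, key=lambda p: expected_net_nbytes(p, ready[p]))
```

Here `b` has an expected net effect of -9 bytes and `y` of +10 bytes, so `b` sorts first.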

martindurant · 2019-05-24T14:11:56Z

@TomAugspurger , that's exactly how I was picturing things. Additionally, you may need to know whether calculating b, for example, actually releases a1 and a2.

TomAugspurger · 2019-05-24T14:19:47Z

you may need to know whether calculating b, for example, actually releases a1 and a2.

Yeah, I've been looking through the recommendations -> release logic now. I'm wondering if we can also keep a counter of "completing this task recommended that we release this many bytes". That seems a bit easier to track than knowing whether a1 and a2 were actually released.

rabernat · 2019-05-24T14:26:33Z

As a user of dask.array, it confuses me that the graph itself doesn't have this information in it already. In my mind, all of my dask.array operations know about the shape and dtype of their inputs and outputs. Consequently, the memory impact of a task is known a-priori.

I do understand that this information is not encoded into the dask graph--all the graph knows about is tasks and their dependencies. But, to my naive interpretation, it seems harder to try to "learn" the memory impact of tasks based on past experience than it does to pass this information through the graph itself, in the form of task annotation.

TomAugspurger · 2019-05-24T14:47:57Z

As a user of dask.array, it confuses me that the graph itself doesn't have this information in it already. In my mind, all of my dask.array operations know about the shape and dtype of their inputs and outputs. Consequently, the memory impact of a task is known a-priori.

As you say, TaskAnnotations mentioned by @sjperkins in #2602 (comment) would let us pass that information through. I think that approach should be considered, but I have two reasons for pursuing the "learning" approach within distributed for now

I suspect it's easier to get a rough implementation of this in place. Task annotations will (I think) be a larger change to dask.
There will always be cases where the user doesn't or can't provide the expected memory usage when the graph is constructed. Say as a result of a boolean filter, or when using delayed on a source that can't easily provide the memory usage information.

I will take another look at the task annotation issue though.

TomAugspurger · 2019-05-24T14:59:53Z

A promising note, the additional LOC for the tracking seems to be ~3 :) I'm just keeping a task_net_nbytes that tracks the difference between task input and output at the prefix-level. With that, we correctly see that

random sample is a net increaser of memory
rechunking has no effect on memory (again, we aren't measuring transient memory here, just final).
my_custom_function is a net reducer of memory

--------------------------------------------------------------------------------
# task_net_nbytes after all the computation is done.
{'my_custom_function': -18666656,
 'random_sample': 20000000,
 'rechunk-merge': 0,
 'rechunk-split-rechunk-merge': 0}

The scheduling changes will be harder, but this seems promising so far.
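For readers who want to see what such bookkeeping could look like, here is a guess at its shape, not the actual patch. Only the name `task_net_nbytes` comes from the comment above; `record_finished` and the byte counts are made up for illustration.

```python
from collections import defaultdict

task_net_nbytes = defaultdict(int)


def record_finished(prefix, output_nbytes, input_nbytes):
    """Accumulate (output bytes - input bytes) of a finished task, keyed by prefix."""
    task_net_nbytes[prefix] += output_nbytes - sum(input_nbytes)


# Made-up sizes: a source task, a size-preserving task, and a reducing task.
record_finished("random_sample", 4000, [])
record_finished("rechunk-merge", 4000, [4000])
record_finished("my_custom_function", 272, [4000])
```

Source tasks have no inputs and so always come out positive, size-preserving tasks hover around zero, and reducers go negative, which is the same pattern as in the output above.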

mrocklin · 2019-05-24T16:45:33Z

@TomAugspurger said

We would (ideally) schedule b before y, since it tends to have a net
decrease in memory usage

Sounds reasonable. As a warning, there might be other considerations fighting for attention here. Scheduling is an interesting balancing game.

@martindurant said

Additionally, you may need to know whether calculating b, for example, actually releases a1 and a2.

This is a nice point. One naive approach would be to divide the effect by the number of dependents of the dependency-to-be-released.
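One way to read the naive approach is sketched below: a dependency's size is credited to a task in proportion to how many dependents still need that dependency. The function name and the numbers are hypothetical.

```python
def expected_release(dep_nbytes, dep_num_dependents):
    """Bytes credited to one task for the dependencies it consumes.

    A dependency with k dependents contributes only nbytes / k, since it
    cannot be released until the other k - 1 dependents have run too.
    """
    return sum(n / k for n, k in zip(dep_nbytes, dep_num_dependents))


# b depends on a1 (5 bytes, only b uses it) and a2 (5 bytes, shared with one other task)
credit = expected_release([5, 5], [1, 2])
```

Here `b` is credited with 7.5 bytes rather than the full 10.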

mrocklin · 2019-05-24T16:45:57Z

Also, I'm very glad to have more people thinking about the internal scheduling logic!

TomAugspurger · 2019-05-24T17:00:43Z

@mrocklin do you know off hand whether the changes to the scheduling logic are likely to be on the worker, scheduler, or both? I can imagine both situations occurring, so I suspect the answer is "both".

mrocklin · 2019-05-24T19:15:28Z

I would say either. Dask has task priorities in three places:

1. As provided by the client with dask.order and user priorities
2. As modified by the scheduler, with first-in-first-out for successive compute calls
3. As modified again by the worker, loosely preferring first-in-first out, but largely deferring to the high priorities

What you're proposing seems like it plays at the first level, similar to dask.order. You're going to want to be careful not to overwhelm the user provided priorities.

I think that you're going to hit some interesting situations where bytes handling and dask.order disagree. In some ways bytes handling is smarter on a short term basis, but might bite you by being short sighted. It will be interesting to see what happens.

In terms of scheduler vs worker, the scheduler would be nicer because you can learn the bytes size across the cluster as a whole, rather than having to relearn on every worker. Learning on the worker is a bit nicer because you'll be able to update priorities more rapidly and it's always nice to move any potentially costly logic away from the scheduler. My sense is that it might not matter much where you implement the logic to start.


mrocklin · 2019-05-24T19:28:02Z

I could also imagine us not asking the client to run dask.order at all, and doing all task prioritization on the workers. This would allow us to integrate the byte sizes into the dask.order computation (and remove it from the single-threaded client-scheduler processing). This is probably something to explore as future work though.

TomAugspurger · 2019-05-28T18:26:57Z

I could also imagine us not asking the client to run dask.order at all, and doing all task prioritization on the workers.

FWIW, dask.order does correctly order this workload. The final task of each parallel branch comes before the first task of the next branch.

So this looks more like a problem with reordering / waiting to execute some tasks, if we notice that we

are at high memory
Have a set of ready tasks that are likely to reduce memory usage.
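A toy version of that decision rule is shown below. The `high` threshold and the function are hypothetical, and the net-bytes numbers are borrowed from the `task_net_nbytes` output earlier in the thread.

```python
def pick_next(ready, net_nbytes, memory_fraction, high=0.7):
    """Return the next task prefix to run, or None to wait.

    `ready` is assumed to be in normal priority order; `high` is a
    hypothetical threshold, not a Dask setting.
    """
    if not ready:
        return None
    if memory_fraction < high:
        return ready[0]  # plenty of memory: keep the usual order
    # under pressure: only accept tasks that tend to reduce memory
    reducers = [t for t in ready if net_nbytes.get(t, 0) < 0]
    return reducers[0] if reducers else None


net = {"random_sample": 20000000, "my_custom_function": -18666656}
ready = ["random_sample", "my_custom_function"]
relaxed = pick_next(ready, net, memory_fraction=0.3)
pressured = pick_next(ready, net, memory_fraction=0.9)
stalled = pick_next(["random_sample"], net, memory_fraction=0.9)
```

Returning `None` when no reducer is ready trades parallelism for memory, and a real implementation would need some escape hatch so that the worker cannot wait forever.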

JSKenyon · 2021-06-30T07:19:07Z

This does look like a huge improvement! I look forward to trying it out on my problems.

dcherian · 2021-06-30T16:20:11Z

Exciting!

This is essential, because the small slice of f we were returning was just a view of f's memory. And that meant all of f's memory had to stick around—so the memory-reducing function was not reducing memory at all!
This is unintuitive behavior, but I wonder how often it happens unknowingly with Dask, and if it plays a role in real-world cases. I wonder if we could add logic to Dask to warn you of this situation, where the memory backing an array result is, say, 2x larger than memory needed for the number of elements in that array, and suggest you make a copy.

@gjoseph92 This happens all the time! See dask/dask#3595 (especially dask/dask#3595 (comment)). @mrocklin proposed a solution here: dask/dask#3595. I use the map_blocks(np.copy) trick frequently.
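The view-versus-copy effect is easy to reproduce with NumPy alone. The sketch below uses one block of a made-up size and the same slicing as my_custom_function from the original example.

```python
import numpy as np

f = np.random.random((1000, 500))  # one made-up block, about 4 MB

# Same slicing as my_custom_function: the result is only a view into f
view = f.ravel()[::15][None, :]

# An explicit copy owns its (much smaller) buffer
copied = view.copy()

view_shares = np.shares_memory(view, f)     # view still points into f's buffer
copy_shares = np.shares_memory(copied, f)   # copy does not
```

`view.nbytes` reports only about 267 kB, yet as long as `view` is alive the whole 4 MB buffer of `f` cannot be freed. Once only `copied` is kept, `f` can be released, which is what the map_blocks(np.copy) trick achieves for each block.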

gjoseph92 · 2021-06-30T16:32:14Z

@dcherian thanks for the xref! dask/dask#3595 (comment) is exactly what I was thinking. (It wouldn't help with user code in map_blocks like this case specifically, but since that's advanced, maybe that's okay.) I might like to take this on.

mrocklin · 2021-06-30T17:20:02Z

From a Coiled resourcing perspective, +1 on spending time on the getitem fix. That seems cheap to implement and medium-high value.


mrocklin · 2021-06-30T23:59:11Z

I'm closing this out. If folks have other issues here though then I'm happy to reopen.

@rabernat , thank you for opening this originally. My apologies that it took so long to come to a good solution.

rabernat · 2021-07-30T11:12:14Z

Which dask release do users need to be running to get this feature? My impression is 2021.07.0, correct?

mrocklin · 2021-07-30T12:46:56Z

I think anything 2021.07 or greater should work. We're also releasing 2021.07.2 today if you wanted to wait a day :)


rabernat · 2021-08-02T16:19:26Z

I am working on a blog post to advertise the exciting progress we have made on this problem. As part of this, I have created an example notebook that can be run in Pangeo Binder (including with Dask Gateway) with two different versions of Dask which attempts to reproduce the issue and its resolution.

New Dask version (2021.07.1): https://binder.pangeo.io/v2/gh/pangeo-gallery/default-binder/39f7202?urlpath=git-pull%3Frepo%3Dhttps%253A%252F%252Fgist.github.com%252Frabernat%252F39d8b6a396e076d168c24167b8871c4b%26urlpath%3Dtree%252F39d8b6a396e076d168c24167b8871c4b%252Fanomaly_std.ipynb%26branch%3Dmaster

Older Dask version (2020.12.1): https://binder.pangeo.io/v2/gh/pangeo-gallery/default-binder/b8d1c53?urlpath=git-pull%3Frepo%3Dhttps%253A%252F%252Fgist.github.com%252Frabernat%252F39d8b6a396e076d168c24167b8871c4b%26urlpath%3Dtree%252F39d8b6a396e076d168c24167b8871c4b%252Fanomaly_std.ipynb%26branch%3Dmaster

The notebook uses the following cluster settings, which is the biggest cluster we can reasonably share publicly on Pangeo Binder:

nworkers = 30
worker_memory = 8
worker_cores = 1

It includes both the "canonical anomaly-mean example" with synthetic data from #2602 (comment) (referenced by @gjoseph92 in #2602 (comment)) as well as a similar real-world example that uses cloud data.

I have found that the critical performance issues are largely resolved even in Dask 2020.12.1 when I use options.environment = {"MALLOC_TRIM_THRESHOLD_": "0"}. For the synthetic data example, the older Dask version (pre #4967) is just a few seconds slower. For the real-world data example, the new version is significantly faster (2 min vs. 3 min), but nowhere close to the 6x increases reported above. But if I don't set MALLOC_TRIM_THRESHOLD_, both clusters crash. This leads me to conclude that, for these workloads, MALLOC_TRIM_THRESHOLD_ is much more important than #4967.

My questions for the group are:

Is this interpretation correct? Or am I missing something?
If it is correct that MALLOC_TRIM_THRESHOLD_ makes the biggest differences for these climatology-anomaly-style workloads, is there an alternative workflow that would better highlight the improvements of Co-assign root-ish tasks #4967 for the blog post?
Is there a downside to setting MALLOC_TRIM_THRESHOLD_=0 on all of our clusters?

(edit below)

@dougiesquire's climatic mean example from #2602 (comment)

20-worker cluster with 2-CPU, 20GiB workers

main + MALLOC_TRIM_THRESHOLD_=0: gave up after 30min and 1.25TiB spilled to disk (!!)

Co-assign root-ish tasks #4967 + MALLOC_TRIM_THRESHOLD_=0: 230s

I am now trying to run this example, which should presumably reproduce the issue better.

rabernat · 2021-08-02T17:21:28Z

Latest edit:

I ran a slightly modified version of the "dougiesquire's climatic mean example" and added it to the gist. The only real change I made was to also reduce over the ensemble dimension in order to reduce the total size of the final result--otherwise you end up with a 20GB array that can't fit into the notebook memory.

climatic mean example code

import dask.array as dsa
import numpy as np
import pandas as pd
import xarray as xr

size = (28, 237, 48, 21, 90, 144)
chunks = (1, 1, 48, 21, 90, 144)
arr = dsa.random.random(size, chunks=chunks)
arr

items = dict(
    ensemble = np.arange(size[0]),
    init_date = pd.date_range(start='1960', periods=size[1]),
    lat = np.arange(size[2]).astype(float),
    lead_time = np.arange(size[3]),
    level = np.arange(size[4]).astype(float),
    lon = np.arange(size[5]).astype(float),
)
dims, coords = zip(*list(items.items()))

array = xr.DataArray(arr, coords=coords, dims=dims)
dset = xr.Dataset({'data': array})
display(dset)

# reduce over "init_date" and "ensemble" to reduce final memory requirements.
result = dset['data'].groupby("init_date.month").mean(dim=["init_date", "ensemble"])
%time result.compute()

Using MALLOC_TRIM_THRESHOLD_=0 and comparing Dask 2020.12.0 vs. 2021.07.1, I found 6min 49s vs. 4min 13s. This is definitely an improvement, but very different from @gjoseph92's results ("main + MALLOC_TRIM_THRESHOLD_=0: gave up after 30min and 1.25TiB spilled to disk (!!)")

So I can definitely see evidence of an incremental improvement, but I feel like I'm still missing something.

RichardScottOZ · 2021-08-02T23:15:52Z

Thanks Ryan. Have not tested this on anything seriously large, but hopefully will soon.

dcherian · 2021-11-18T15:59:07Z

but I feel like I'm still missing something.

This thread is a confusing mix of many issues. The slice vs copy issue is real, but doesn't affect groupby problems since xarray indexes out groups with a list-of-ints (always a copy) [except for resampling which uses slices].

result = dset['data'].groupby("init_date.month")

This example (from @dougiesquire I think) has chunksize=1 along init_date which is daily frequency. Xarray's groupby constructs a nice graph for this case so it executes well (it'll pick out all chunks with january data and apply dask.array.mean; this is embarrassingly parallel for different groups). If it doesn't execute well, that's up to dask+distributed to schedule it properly. Note that the writeup in ocean-transport/coiled_collaboration#17 only discusses this example.

The earlier example:

cat = intake.Catalog('https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/ocean.yaml')
ds = cat.GODAS.to_dask()
print(ds)
salt_clim = ds.salt.groupby('time.month').mean(dim='time')

ds.salt has chunksize=4 along time with a monthly mean in each timestep so there are 4 months in a block

         Array                  Chunk
Bytes    11.31 GB               96.08 MB
Shape    (471, 40, 417, 360)    (4, 40, 417, 360)
Count    119 Tasks              118 Chunks
Type     float32                numpy.ndarray

So groupby("time.month") splits every block into 4 and we end up with a quadratic-ish shuffling-type workload that tends to work poorly (though I haven't tried this on latest dask)

The best way to reduce the GODAS dataset should be this strategy: https://flox.readthedocs.io/en/latest/implementation.html#method-cohorts i.e. we index out months 1-4 which always occur together and map-reduce that. Repeat for all other cohorts (months 5-8, 9-12) and then concatenate for the final result.

One could test this with flox.xarray.xarray_reduce(ds.salt, ds.time.dt.month, func="mean", method="cohorts") and potentially throw in engine="numba" for extra fun =) Who wants to try it?!
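The cohort idea can be illustrated without flox. The sketch below is not flox's algorithm. It uses a simplified stand-in for the GODAS time axis: 48 monthly steps starting in January (an assumption), chunked in blocks of 4. It groups months by the set of blocks they appear in.

```python
from collections import defaultdict

n_time, chunk = 48, 4  # simplified stand-in: 4 years of monthly data, 4 steps per block
months = [(t % 12) + 1 for t in range(n_time)]  # assumes the record starts in January

# Which blocks contain each month?
blocks_of_month = defaultdict(set)
for t, m in enumerate(months):
    blocks_of_month[m].add(t // chunk)

# Months that live in exactly the same blocks form a cohort.
cohorts = defaultdict(list)
for m in sorted(blocks_of_month):
    cohorts[frozenset(blocks_of_month[m])].append(m)

cohort_months = sorted(cohorts.values())
```

`cohort_months` comes out as months 1-4, 5-8 and 9-12. Each cohort touches only a third of the blocks, so it can be reduced on its own before the results are concatenated. A trailing partial block (as with 471 time steps) would break the exact set equality used here, which is one reason a real implementation needs something more forgiving.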

gjoseph92 · 2022-10-12T11:41:05Z

Though this issue is closed, I imagine that most of you following it are still interested in memory usage in dask, and might still be having problems with it. It's possible that closing this with #4967 was premature, but I'm hoping that #6614 actually addresses the core problem in this issue.

If you are (or aren't) having problems with workloads running out of memory, please try out the new worker-saturation config option!

Information on how to set it is here. And please comment on the discussion to share how it goes:

Share your experiences with `worker-saturation` config to reduce memory usage #7128
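For reference, a minimal sketch of setting the option from Python. The config key is `distributed.scheduler.worker-saturation`; the value 1.1 here is only illustrative, not a recommendation made in this thread. The setting has to be in place in the scheduler's process before the scheduler starts.

```python
import dask

# Set the option before creating the cluster/scheduler.
# 1.1 is an illustrative value (assumption), tune it for your workload.
dask.config.set({"distributed.scheduler.worker-saturation": 1.1})

# Read the value back to confirm it was applied.
print(dask.config.get("distributed.scheduler.worker-saturation"))
```

The same setting can be supplied through the environment variable `DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION` in the scheduler's environment, which is handy when the scheduler is launched by a deployment tool rather than from your own script.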

These are benchmarking results showing significant reductions in peak memory use:

@rabernat's original example from this issue uses 50% less memory (but is slower)
@TomNicholas's Geospatial-type workload showing two common scheduler failures at once #6571 uses 50% less memory (and is faster since it doesn't spill)
@JSKenyon's example requiring co-assignment uses 30% less memory. But since this loses co-assignment, it also is slower.

We're especially interested in hearing how the runtime-vs-memory tradeoff feels to people. So please try it out and report back!

Lastly, thank you all for your collaboration and persistence in working on these issues. It's frustrating when you need to get something done, and dask isn't working for you. The examples everyone has shared here have been invaluable in working towards a solution, so thanks to everyone who's taken their time to keep engaging on this.

RichardScottOZ · 2023-01-08T03:56:55Z

Thanks @gjoseph92

mrocklin transferred this issue from dask/dask Apr 10, 2019

mrocklin closed this as completed in fc47318 Jun 30, 2021

ncclementi mentioned this issue Aug 6, 2021

Replace use of operator.getitem() to use .copy() for small slices. dask/dask#8008

Closed

gjoseph92 mentioned this issue Aug 17, 2021

Workers run twice as many root tasks as they should, causing memory pressure #5223

Closed

mrocklin mentioned this issue Aug 22, 2021

Memory prioritization on workers #5250

Open

ncclementi mentioned this issue Sep 13, 2021

Pangeo examples summary ocean-transport/coiled_collaboration#17

Open

rabernat mentioned this issue Oct 4, 2021

workers exceeding max_mem setting pangeo-data/rechunker#100

Open

github-actions bot mentioned this issue May 2, 2022

[Discourse] Setting MALLOC_TRIM_THRESHOLD_ on a LocalCluster coiled/dask-community#936

Open

TomNicholas mentioned this issue May 17, 2022

Ease memory pressure by deprioritizing root tasks? #6360

Open

TomNicholas mentioned this issue Jun 13, 2022

Geospatial-type workload showing two common scheduler failures at once #6571

Closed

gjoseph92 mentioned this issue Jun 23, 2022

Design and prototype for root-ish task deprioritization by withholding tasks on the scheduler #6560

Closed

fjetter mentioned this issue Jun 24, 2022

Initial set of automated performance benchmarks (non-H2O) coiled/benchmarks#191

Closed

gjoseph92 mentioned this issue Aug 26, 2022

Failing test_climatic_mean coiled/benchmarks#253

Open

gjoseph92 mentioned this issue Oct 4, 2022

Less cluster memory coiled/benchmarks#338

Closed

gjoseph92 mentioned this issue Oct 28, 2022

Turn on queuing by default #7213

Closed

gjoseph92 mentioned this issue Feb 16, 2023

Thoughts on task co-assignment #7555

Open

gjoseph92 mentioned this issue Feb 24, 2023

dask.order over-prioritizes root tasks in some situations dask/dask#9995

Closed

scottstanie mentioned this issue May 10, 2023

Daskify isce-framework/tophu#12

Merged
