Concat on Disk Tutorial #18

Open · wants to merge 15 commits into main
Conversation

@selmanozleyen (Member) commented Jul 25, 2023

Hi,

This is how I started the notebook. @ivirshup @ilan-gold

@review-notebook-app (bot): Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

@selmanozleyen (Member, Author)

Hi @ivirshup, I would say the notebook is ready. However, I am planning to ask whether we should change the parameter to take a memory-limit string, as Dask does. For example:

concat_on_disk(infiles, outfile, ..., max_loaded_elem=1_000_000)

Instead of this we would have:

concat_on_disk(infiles, outfile, ..., sparse_mem_limit='600mb')

Another motivation is that I wasn't comfortable with the memory measurements at the time, and the parameter only counted the theoretical number of elements (counting them toward the limit even if they are zero). Now I can measure the actually loaded elements and their sizes, and thus can accept a size string like Dask does. I am writing this since this enhancement would also change the notebook.
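
For illustration, a minimal sketch of how such a limit string could be handled, assuming we reuse dask.utils.parse_bytes; sparse_mem_limit is only the name proposed above, not an existing parameter:

from dask.utils import parse_bytes

def resolve_mem_limit(limit) -> int:
    """Accept either a number of bytes or a Dask-style string like '600MB'."""
    if isinstance(limit, (int, float)):
        return int(limit)
    return parse_bytes(limit)  # e.g. '600MB' -> 600_000_000, '1GiB' -> 1_073_741_824

# hypothetical use inside concat_on_disk:
# mem_limit_bytes = resolve_mem_limit(sparse_mem_limit)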

@flying-sheep (Member) left a comment


Pretty nice prose; except for one spot, you explain very well why and how things are done.

There are some code style improvements possible, but nothing severe.

The only bigger change would be to use an actually maintained memory profiler instead of an unmaintained one.

@selmanozleyen (Member, Author)

Hi @flying-sheep, sorry for the delay; I was moving my development environment and macOS feels strange at the moment.

I fixed the points you mentioned. However, I think the numbers aren't correct, specifically the Dask ones. I will have a look at them and then ping you again.

@flying-sheep (Member) commented Oct 2, 2023

@selmanozleyen did you get a chance to look at them?

Regarding Dask, you should probably use the dedicated memray integration it offers.

I assume for the most accurate results, we'd need to do:

import memray
from distributed.diagnostics.memray import memray_scheduler, memray_workers  # Dask distributed's memray integration

tracer_kwargs = dict(trace_python_allocators=True, native_traces=True, follow_fork=True)

if not is_sparse:
    with (
        memray_workers(OUTDIR, report_args=False, **tracer_kwargs),
        memray_scheduler(OUTDIR, report_args=False, **tracer_kwargs),
    ):
        concat_on_disk(**concat_kwargs)
else:
    with memray.Tracker(OUTDIR / "test-profile.memray", **tracer_kwargs):
        concat_on_disk(**concat_kwargs)

max_mem = 0
for stat_file in OUTDIR.glob("*.memray"):
    with memray.FileReader(stat_file) as reader:
        max_mem += reader.metadata.peak_memory
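
A minimal sketch of reporting the summed peak in the notebook's "Peak Memory" format:

print(f"Peak Memory: {max_mem / 2**20:.0f} MiB")  # bytes -> MiB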

Edit: I run into two problems when I run the notebook on my MacBook:

  1. datasets_aligned is empty. Any idea why?
  2. dataset_max_mem(max_arg=1_000_000, datasets=datasets_unaligned, array_type="sparse") runs for minutes (or forever), while the other calls complete in <20 s. Any idea why?

@selmanozleyen (Member, Author)

@flying-sheep, sorry, when refactoring the code I saw the shape argument and assumed I was using the tuple I could have accessed through X.shape. Somehow this led to very strange but still working behavior :D. It should be fixed now.

Last time I came to the conclusion that the numbers were somewhat accurate, but I will also apply your suggestion.

@flying-sheep (Member) commented Oct 5, 2023

Well, on my MacBook, the code doesn't seem to work quite as intended (see the last paragraph of my previous comment), otherwise I could finish it up.

Regarding the numbers, maybe we should explain them.

  • datasets_aligned should use almost no memory of course, as no data has to be loaded into memory
  • for datasets_unaligned, I don’t believe that max_arg='4000MiB', ..., array_type='dense' uses almost no memory. Either that’s wrongly reported (due to not using the above dask memray APIs) or the max_arg doesn’t work intuitively.

@selmanozleyen (Member, Author) commented Oct 5, 2023

I checked the numbers and they are higher in reality. Thanks for the suggestion. I will update the notebook and change the numbers.

> Well, on my MacBook, the code doesn't seem to work quite as intended (see the last paragraph of my previous comment), otherwise I could finish it up.

I see, but because of the shape problem all the datasets were treated as if they were aligned; I thought this led to the performance degradation.

Update: When I lower the limit for Dask, my system crashes. It is very slow to debug with big files right now, so I just committed the update.

@selmanozleyen (Member, Author)

Hi @flying-sheep,

Thanks a lot for the input. I see that I made a mistake by only changing max_loaded_elems on the aligned dataset, where it isn't even used. I think it is very slow because:

  • For it to take effect we would need to set it to a very small value (since the dataset consists of many small elements).
  • Setting the ideal parameter might be hard, since I didn't compare the real size but only the total number of elements.
  • The ideal parameter here would be something that splits the data in two for each element of the list.

I updated the notebook with additional information regarding this.

To benchmark the performance of this case properly, we would need to create a special dataset with an unaligned and dissimilar list of elements, or one with large elements.
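
For illustration, a minimal sketch of how such an unaligned test dataset could be generated; the file names, shapes, and densities here are hypothetical, and join="outer" is assumed to be the concatenation mode used when variables do not align:

import pandas as pd
import anndata as ad
from scipy import sparse
from anndata.experimental import concat_on_disk


def make_unaligned(path, n_obs, var_names):
    """Write an AnnData with a sparse X and a custom (possibly disjoint) var index."""
    X = sparse.random(n_obs, len(var_names), density=0.01, format="csr", random_state=0)
    adata = ad.AnnData(X, var=pd.DataFrame(index=var_names))
    adata.write_zarr(path)


# two files whose variables only partially overlap, so concatenation must reindex
make_unaligned("unaligned_0.zarr", 10_000, [f"gene_{i}" for i in range(2_000)])
make_unaligned("unaligned_1.zarr", 10_000, [f"gene_{i}" for i in range(1_000, 3_000)])

concat_on_disk(["unaligned_0.zarr", "unaligned_1.zarr"], "merged_unaligned.zarr", join="outer")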

@flying-sheep (Member) commented Oct 6, 2023 (review comment on concat-on-disk.ipynb)
dataset_max_mem(max_arg="2000MiB", datasets=datasets_aligned, array_type="dense");
[…]
Peak Memory: 2740 MiB
[…]
Peak Memory: 3450 MiB

This uses a lot of memory; shouldn't it be almost free?



@flying-sheep (Member) commented Oct 6, 2023 (review comment on concat-on-disk.ipynb)

dataset_max_mem(max_arg="2000MiB", datasets=datasets_unaligned, array_type="dense");
[…]
Peak Memory: 2931 MiB
[…]
Peak Memory: 3152 MiB

This is about the same amount of memory.



@selmanozleyen (Member, Author)

But the total sizes are different, since we can concatenate more with unaligned arrays; the unaligned case has more files. However, I am still looking into why it uses so much memory when aligned.

@flying-sheep (Member) left a comment

OK, great, we're getting close. One more issue I see is that there seems to be no difference in memory consumption between aligned and unaligned.

My understanding was that things only have to be loaded into memory when we reindex unaligned datasets. Am I wrong or is there a bug?

@selmanozleyen (Member, Author) commented Oct 6, 2023

You mean for Dask, right? I am not really sure whether the numbers include the overhead of creating a worker or something, so I wasn't surprised by the high numbers. What you are saying is conceptually correct, but I don't think the chunk sizes align, as I didn't consider that when creating the datasets.

When the chunks don't align there is rechunking which loads the whole array into the memory. I will just make a small update to see if this is the case. If it is it would be an even better demonstration!

Update: I used the following, but the results didn't change. @ivirshup, do you have any idea why? Should we expect no loading into memory for dense arrays as well when they are aligned?

import zarr
from anndata.experimental import write_dispatched  # create_adata is defined earlier in the notebook

def write_chunked(func, store, k, elem, dataset_kwargs, iospec):
    """Write callback that chunks X and layers"""

    def set_chunks(d, chunks=None):
        """Helper function for setting dataset_kwargs. Makes a copy of d."""
        d = dict(d)
        if chunks is not None:
            d["chunks"] = chunks
        else:
            d.pop("chunks", None)
        return d

    if iospec.encoding_type == "array":
        if "layers" in k or k.endswith("X"):
            dataset_kwargs = set_chunks(dataset_kwargs, (25, elem.shape[1]))  # also tried (1000, 1000)
        else:
            dataset_kwargs = set_chunks(dataset_kwargs, None)

    func(store, k, elem, dataset_kwargs=dataset_kwargs)

def write_data_to_zarr(X, shape_type, array_name, outdir, file_id):
    outfile = outdir / f"{file_id:02d}_{shape_type}_{array_name}.zarr"
    adata = create_adata(X)
    z = zarr.open_group(outfile, mode="w")

    write_dispatched(z, "/", adata, callback=write_chunked)
    zarr.consolidate_metadata(z.store)
    return f"wrote {X.shape[0]}x{X.shape[1]}_{array_name} -> {str(outfile)}\n"

@flying-sheep (Member)

> You mean for Dask, right? I am not really sure whether the numbers include the overhead of creating a worker or something, so I wasn't surprised by the high numbers.

That shouldn’t go into the gigabytes. I would think tens of megabytes overhead or less.

@selmanozleyen (Member, Author) commented Oct 6, 2023

@flying-sheep, when I change a line in the function to the following (in addition to chunked writing):

darrays = (da.from_array(a, chunks=(1000, 1000)) for a in arrays)

the results are way better:

Dataset: dense 0
Concatenating 6 files with sizes:
['668MiB', '896MiB', '890MiB', '668MiB', '668MiB', '924MiB']
Total size: 4716MiB
Concatenation finished
Peak Memory: 362 MiB
--------------------------------------------------
Dataset: dense 1
Concatenating 6 files with sizes:
['668MiB', '902MiB', '899MiB', '668MiB', '668MiB', '907MiB']
Total size: 4714MiB
Concatenation finished
Peak Memory: 356 MiB
--------------------------------------------------

So this makes it clear that the problem is about determining the chunk sizes.
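
A related minimal sketch: instead of hard-coding chunks, the chunking could be inherited from the on-disk Zarr arrays via da.from_zarr, so inputs written with the same chunk shape need no rechunking on concatenation (zarr_paths and the "X" component name are assumptions based on the notebook's layout):

import dask.array as da

# lazily open each dense X, inheriting the chunking stored on disk
darrays = [da.from_zarr(str(path), component="X") for path in zarr_paths]
print([d.chunks for d in darrays])  # verify the chunkings actually match

combined = da.concatenate(darrays, axis=0)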

Comment on lines 656 to 668
"Concatenating 6 files with sizes:\n",
"['668MiB', '827MiB', '668MiB', '920MiB', '875MiB', '668MiB']\n",
"Total size: 4630MiB\n",
"['668MiB', '668MiB', '919MiB', '668MiB', '932MiB', '932MiB']\n",
"Total size: 4789MiB\n",
"Concatenation finished\n",
"Peak Memory: 2740 MiB\n",
"Peak Memory: 388 MiB\n",
"--------------------------------------------------\n",
"Dataset: dense 1\n",
"Concatenating 6 files with sizes:\n",
"['912MiB', '823MiB', '668MiB', '668MiB', '668MiB', '864MiB']\n",
"Total size: 4606MiB\n",
"['859MiB', '668MiB', '885MiB', '668MiB', '892MiB', '668MiB']\n",
"Total size: 4641MiB\n",
"Concatenation finished\n",
"Peak Memory: 3450 MiB\n",
"Peak Memory: 344 MiB\n",
"--------------------------------------------------\n"
@selmanozleyen (Member, Author)

These results are after doing the changes here: scverse/anndata#1169

@flying-sheep (Member)

As @ivirshup said: you also added write_chunked; are you sure the anndata changes are necessary?

@selmanozleyen (Member, Author)

Yes, I tried with only that first. If you run the notebook on the main branch of anndata, you will get the old results.
