Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Shared memory for NBLAST #87

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

Conversation

schlegelp
Copy link
Collaborator

@schlegelp schlegelp commented Apr 1, 2022

With larger NBLASTs, memory consumption becomes an issue. During the NBLAST each child process holds only the scores it works on in memory. On completion, however, that data has to be sent to the parent process which takes time and causes a memory spike (I'm pretty sure it makes a copy). Subsequently, we need to stitch the scores into one big matrix which again takes time and memory.

This PR is work in progress but the basic idea is this:

  1. Use multiprocessing.shared_memory.SharedMemory to reserve some memory
  2. The parent process wraps that reserved memory buffer as numpy array
  3. Each NBLASTing child process is given a reference to the memory buffer and writes the scores it works on directly to that buffer. Because the writes are non-overlapping, we don't have to bother with a queue.

Advantages:

  • no data sent back from the child processes
  • scores only ever exist once
  • no annoying post-NBLAST stitching required (this will also be super useful for nblast_smart)

Disadvantages:

The memory management is a bit more annoying because unlike your run-of-the-mill numpy array the SharedMemory needs to be explicitly released by calling SharedMemory.close(). If one doesn't do this before exiting the session, there is a UserWarning. For now I solved this by hooking into atexit but I'm pretty sure that if the user simply deletes the array/dataframe (i.e. something like del scores) the memory is not freed. That bit needs a bit of exploration - perhaps we can implement some custom __del__ method or something.

I have so far only implemented this for navis.nblast - all other functions in this branch still use the old approach. Some quick and dirty benchmarks show some decent speed up, in particular for large NBLASTs. Haven't done proper memory profiling yet.

@clbarnes any thoughts on this? In particular on the somewhat opaque issue with memory not being freed.

Code reviews welcome (the whole NBLAST module is becoming fairly complex)!

@clbarnes
Copy link
Collaborator

clbarnes commented Apr 1, 2022

Could we use a SharedMemoryManager to handle the cleanup? The context manager form should make it fairly convenient. Doing a single copy of the shared numpy array at the end (to get it out of the shared buffer before it's unlinked) should be a negligible overhead.

I'd probably write a wrapper class around that with a copy method.

@schlegelp
Copy link
Collaborator Author

Yes context manager + making a copy of the array would solve that issue. I was thinking mostly of scenarios where the NBLAST score matrix is almost larger than the available memory in which case a copy might tip it over the edge.

Btw: here is a benchmark of NBLASTing the 25k hemibrain neurons against each other. I did cheat by only running a small fraction of the pairwise comparisons (hence this test finished after ~5mins) but the array sizes should have been the same as in a real NBLAST.

nblast_no_shared
nblast_shared

Not sure how much I trust the profiler though: these drops in memory consumption are rather odd - are those spin dumps?

@clbarnes
Copy link
Collaborator

clbarnes commented Apr 1, 2022

If you don't have the memory to copy the score matrix in memory then you can just do whatever you want to it inside the context manager, I suppose. If you're that tight on memory then you definitely want to be able to manually free the memory (by jumping out of a context manager) rather than waiting for GC to get it in atexit, I think.

@clbarnes
Copy link
Collaborator

clbarnes commented Apr 1, 2022

This looks like it would be a good fit: https://pypi.org/project/shared-ndarray2/ , but it's py3.8+ only.

Those memory drops might be the garbage collection running, I guess?

@clbarnes
Copy link
Collaborator

clbarnes commented Apr 1, 2022

You could also avoid the copies all being done at the end by writing them in to the dataframe as they come out:

    scores = pd.DataFrame(np.zeros((len(query_dps), len(target_dps)),
                                    dtype=this.dtype),
                          index=query_dps.id, columns=target_dps.id)

    with ProcessPoolExecutor(max_workers=n_cores) as pool:
        # Each nblaster is passed to its own process
        futures = [pool.submit(this.multi_query_target,
                               q_idx=this.queries,
                               t_idx=this.targets,
                               scores=scores) for this in nblasters]
        for f in concurrent.futures.as_completed(futures):
            res = f.result()
            scores.loc[res.index, res.columns] = res.values

I have a suspicion pool.map may be a little better than a list of pool.submit, because it can start pulling results out before the last tasks have even been submitted (although I don't know if it's actually implemented that way).

@schlegelp
Copy link
Collaborator Author

schlegelp commented Apr 2, 2022

Had a look at shared-ndarray2: this would reduce the boiler plate a little but under the hood it does the same as the current code, I think. More importantly: what I hadn't realised is that, multiprocessing.shared_memory (which shared-ndarray2 is also using) is new in Python 3.8. So this PR would require us to bump the minimum Python version (or keep the current implementation) as backup.

@clbarnes
Copy link
Collaborator

clbarnes commented Apr 4, 2022

I've voiced my opinions on bumping the minimum version 😅 It doesn't break existing versions of navis, which people are welcome to keep using on their old python versions (this can be noted in the docs); it just means that new improvements and features in navis can make use of new improvements and features in python, rather than the library having to work around limitations which have already been solved in the language. Most of the scientific ecosystem does seem to follow NEP29 so it's not an outlandish stance for navis to take.

@schlegelp
Copy link
Collaborator Author

For what it's worth: the main hold up was reticulate catching up and it is now at least using 3.8 - I think they also made it easy to install newer versions with e.g. pyenv. Happy to bump to 3.8 but to be honest except for the shared memory, 3.8 doesn't really add much :D

@clbarnes
Copy link
Collaborator

clbarnes commented Apr 4, 2022

Just raised a PR on reticulate rstudio/reticulate#1185 which (I think) explicitly runs tests against python 3.8, 3.9, and 3.10; passes for me https://github.com/clbarnes/reticulate/actions/runs/2090867132 !

@schlegelp
Copy link
Collaborator Author

Nice! Wait... so what are they running their tests against now?

@clbarnes
Copy link
Collaborator

clbarnes commented Apr 4, 2022

Seems to be whatever python happens to be lying around in the CI environment. Fortunately a certain subset of the python ABI doesn't change very quickly so I wasn't expecting any issues with the newer versions, but it just didn't look like they were tested explicitly.

@clbarnes
Copy link
Collaborator

clbarnes commented Apr 6, 2022

Reticulate now explicitly tests against 3.8-10!

@schlegelp
Copy link
Collaborator Author

schlegelp commented Aug 16, 2024

Note to self: we might be better off using multiprocessing.sharedctypes.RawArray. It's basically the same thing as SharedMemory without the need to unlink/close - but also without any guard rails...

Minimal example:

>>> import os 
>>> import psutil
>>> import ctypes
>>> import numpy as np
>>> from multiprocessing.sharedctypes import RawArray
>>> from multiprocessing import Process

>>> # Initialize the RawArray
>>> ARRAY_SHAPE = (10_000, 10_000)
>>> ARRAY_SIZE = int(np.prod(ARRAY_SHAPE))
>>> shared_array = RawArray(ctypes.c_double, ARRAY_SIZE)

>>> # Wrap shared array as numpy array in parent thread
>>> shared_array_np = np.ndarray(ARRAY_SHAPE, dtype=np.float64, buffer=shared_array)

>>> def test_write(buff):
>>>    shared_array = np.ndarray(ARRAY_SHAPE, dtype=np.float64, buffer=buff)
>>>    shared_array[0, :10] = 1  # write a few values

>>> # Pass the shared array to the child process
>>> p = Process(target=test_write, args=(shared_array,))
>>> p.run()
>>> del p   # delete process to remove reference to shared_array
>>> print(shared_array_np)
array([[1., 1., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

>>> # Check memory footprint of parent process
>>> mem_info = psutil.Process(os.getpid()).memory_full_info()
>>> print(f"Resident Set Size: {mem_info.rss / 1e9:.2f}")
Resident Set Size: 0.84Gb

>>> # Importantly, we can delete the buffer variable and the numpy array is just fine 
>>> del shared_array 
>>> import gc
>>> gc.collect()
>>> print(shared_array_np)
array([[1., 1., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

>>> # Memory is also still reserved
>>> mem_info = psutil.Process(os.getpid()).memory_full_info()
>>> print(f"Resident Set Size: {mem_info.rss / 1e9:.2f}")
Resident Set Size: 0.84Gb

>>> # Only after actually deleting the `shared_array_np` is the memory released 
>>> del shared_array_np
>>> mem_info = psutil.Process(os.getpid()).memory_full_info()
>>> print(f"Resident Set Size: {mem_info.rss / 1e9:.2f}")
Resident Set Size: 0.04Gb

Note: when running the above in iPython/Jupyter, memory might not be released right away because of a reference to the array/buffer in the history (check globals()).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants