Memory leak when loading NWB2 files with default options #494

Open
gouwens opened this issue Jan 21, 2021 · 1 comment

gouwens (Collaborator) commented Jan 21, 2021

Describe the bug
Loading NWB files with the create_ephys_data_set() function or with subclasses of EphysNWBData leads to a memory leak that becomes a problem when processing many files.

To Reproduce
I'm using the muppy module from the pympler package to profile memory usage (conda install pympler).

Here's an example that illustrates the problem.

# Setup
from ipfx.stimulus import StimulusOntology
import allensdk.core.json_utilities as ju
from ipfx.dataset.mies_nwb_data import MIESNWBData
from ipfx.dataset.labnotebook import LabNotebookReaderIgorNwb
from pympler import muppy, summary

ontology = StimulusOntology(ju.read(StimulusOntology.DEFAULT_STIMULUS_ONTOLOGY_FILE))

# example nwb2 file
nwb_file = '/allen/programs/celltypes/production/mousecelltypes/prod176/Ephys_Roi_Result_628543361/nwb2_Scnn1a-Tg2-Cre;Ai14-346639.04.02.01.nwb'

# function to load & return a data set object
def load_data_set(nwb_path, ontology, load_into_memory):
    labnotebook = LabNotebookReaderIgorNwb(nwb_path)  # use the function argument, not the global nwb_file
    data_set = MIESNWBData(
        nwb_file=nwb_path,
        notebook=labnotebook,
        ontology=ontology,
        load_into_memory=load_into_memory
    )
    return data_set

When you run this with load_into_memory=True (the default), you see an accumulation of _io.BytesIO objects after each call. Note that the create_ephys_data_set() function does not even allow a user to set load_into_memory; it is always set to True.

for i in range(5):
    ds = load_data_set(nwb_file, ontology, load_into_memory=True)
    all_objects = muppy.get_objects()
    sum1 = summary.summarize(all_objects)
    # Prints out a summary of the large objects
    summary.print_(sum1)

Output is:

                         types |   # objects |   total size
============================== | =========== | ============
                          list |      103841 |     59.01 MB
                           str |      316778 |     29.51 MB
                   _io.BytesIO |           1 |     18.11 MB
                          dict |       42652 |     13.11 MB
                 numpy.ndarray |          45 |      3.52 MB
                          code |       23345 |      3.22 MB
                          type |        3240 |      2.85 MB
                         tuple |       32384 |      2.32 MB
    parso.python.tree.Operator |       21544 |      2.14 MB
                           set |        5249 |      1.66 MB
                           int |       53083 |      1.45 MB
        parso.python.tree.Name |       14451 |      1.21 MB
  parso.python.tree.PythonNode |       15229 |      1.05 MB
                          cell |       11276 |    616.66 KB
                       weakref |        6962 |    598.30 KB
                         types |   # objects |   total size
============================== | =========== | ============
                          list |      118789 |     66.32 MB
                   _io.BytesIO |           2 |     36.22 MB
                           str |      332939 |     30.69 MB
                          dict |       47406 |     14.30 MB
                 numpy.ndarray |          49 |      7.04 MB
                          code |       23345 |      3.22 MB
                          type |        3240 |      2.85 MB
                         tuple |       32981 |      2.36 MB
    parso.python.tree.Operator |       21544 |      2.14 MB
                           set |        7329 |      2.14 MB
                           int |       58487 |      1.60 MB
        parso.python.tree.Name |       14451 |      1.21 MB
  parso.python.tree.PythonNode |       15229 |      1.05 MB
                       weakref |        8225 |    706.84 KB
                          cell |       11277 |    616.71 KB
                         types |   # objects |   total size
============================== | =========== | ============
                          list |      133737 |     74.37 MB
                   _io.BytesIO |           3 |     54.32 MB
                           str |      349100 |     31.86 MB
                          dict |       52160 |     15.64 MB
                 numpy.ndarray |          53 |     10.56 MB
                          code |       23345 |      3.22 MB
                          type |        3240 |      2.85 MB
                           set |        9409 |      2.61 MB
                         tuple |       33578 |      2.39 MB
    parso.python.tree.Operator |       21544 |      2.14 MB
                           int |       63897 |      1.75 MB
        parso.python.tree.Name |       14451 |      1.21 MB
  parso.python.tree.PythonNode |       15229 |      1.05 MB
                       weakref |        9488 |    815.38 KB
                          cell |       11278 |    616.77 KB
                         types |   # objects |   total size
============================== | =========== | ============
                          list |      148685 |     82.42 MB
                   _io.BytesIO |           4 |     72.43 MB
                           str |      365261 |     33.04 MB
                          dict |       56914 |     16.82 MB
                 numpy.ndarray |          57 |     14.08 MB
                          code |       23345 |      3.22 MB
                           set |       11489 |      3.09 MB
                          type |        3240 |      2.85 MB
                         tuple |       34175 |      2.43 MB
    parso.python.tree.Operator |       21544 |      2.14 MB
                           int |       69318 |      1.91 MB
        parso.python.tree.Name |       14451 |      1.21 MB
  parso.python.tree.PythonNode |       15229 |      1.05 MB
                       weakref |       10751 |    923.91 KB
                          cell |       11279 |    616.82 KB
                               types |   # objects |   total size
==================================== | =========== | ============
                                list |      163633 |     91.29 MB
                         _io.BytesIO |           5 |     90.54 MB
                                 str |      381422 |     34.21 MB
                                dict |       61668 |     18.01 MB
                       numpy.ndarray |          61 |     17.60 MB
                                 set |       13569 |      3.57 MB
                                code |       23345 |      3.22 MB
                                type |        3240 |      2.85 MB
                               tuple |       34772 |      2.47 MB
          parso.python.tree.Operator |       21544 |      2.14 MB
                                 int |       74739 |      2.06 MB
              parso.python.tree.Name |       14451 |      1.21 MB
        parso.python.tree.PythonNode |       15229 |      1.05 MB
                             weakref |       12014 |      1.01 MB
  hdmf.build.builders.DatasetBuilder |        2590 |    667.73 KB

Note the increasing number of _io.BytesIO objects and memory usage each time through the loop.

If you instead set load_into_memory=False and run this loop immediately afterwards, you see that the _io.BytesIO count stays at 5 (left over from the earlier run) and does not increase further.

for i in range(5):
    ds = load_data_set(nwb_file, ontology, load_into_memory=False)
    all_objects = muppy.get_objects()
    sum1 = summary.summarize(all_objects)
    # Prints out a summary of the large objects
    summary.print_(sum1)

Output is:

                               types |   # objects |   total size
==================================== | =========== | ============
                                list |      178751 |    100.19 MB
                         _io.BytesIO |           5 |     90.54 MB
                                 str |      397670 |     35.46 MB
                       numpy.ndarray |          65 |     21.12 MB
                                dict |       66604 |     19.55 MB
                                 set |       15649 |      4.05 MB
                                code |       23345 |      3.22 MB
                                type |        3240 |      2.85 MB
                               tuple |       36570 |      2.59 MB
                                 int |       80181 |      2.21 MB
          parso.python.tree.Operator |       21544 |      2.14 MB
              parso.python.tree.Name |       14451 |      1.21 MB
                             weakref |       13472 |      1.13 MB
        parso.python.tree.PythonNode |       15229 |      1.05 MB
  hdmf.build.builders.DatasetBuilder |        3108 |    801.28 KB
                               types |   # objects |   total size
==================================== | =========== | ============
                                list |      193699 |    109.98 MB
                         _io.BytesIO |           5 |     90.54 MB
                                 str |      413830 |     36.70 MB
                       numpy.ndarray |          69 |     24.64 MB
                                dict |       71358 |     20.74 MB
                                 set |       17729 |      4.52 MB
                                code |       23345 |      3.22 MB
                                type |        3240 |      2.85 MB
                               tuple |       37167 |      2.63 MB
                                 int |       85627 |      2.37 MB
          parso.python.tree.Operator |       21544 |      2.14 MB
                             weakref |       14735 |      1.24 MB
              parso.python.tree.Name |       14451 |      1.21 MB
        parso.python.tree.PythonNode |       15229 |      1.05 MB
  hdmf.build.builders.DatasetBuilder |        3626 |    934.83 KB
                               types |   # objects |   total size
==================================== | =========== | ============
                                list |      208647 |    119.76 MB
                         _io.BytesIO |           5 |     90.54 MB
                                 str |      429990 |     37.94 MB
                       numpy.ndarray |          73 |     28.15 MB
                                dict |       76112 |     21.92 MB
                                 set |       19809 |      5.00 MB
                                code |       23345 |      3.22 MB
                                type |        3240 |      2.85 MB
                               tuple |       37764 |      2.67 MB
                                 int |       91073 |      2.52 MB
          parso.python.tree.Operator |       21544 |      2.14 MB
                             weakref |       15998 |      1.34 MB
              parso.python.tree.Name |       14451 |      1.21 MB
        parso.python.tree.PythonNode |       15229 |      1.05 MB
  hdmf.build.builders.DatasetBuilder |        4144 |      1.04 MB
                               types |   # objects |   total size
==================================== | =========== | ============
                                list |      223595 |    130.59 MB
                         _io.BytesIO |           5 |     90.54 MB
                                 str |      446150 |     39.19 MB
                       numpy.ndarray |          77 |     31.67 MB
                                dict |       80866 |     23.18 MB
                                 set |       21889 |      5.48 MB
                                code |       23345 |      3.22 MB
                                type |        3240 |      2.85 MB
                               tuple |       38361 |      2.70 MB
                                 int |       96519 |      2.68 MB
          parso.python.tree.Operator |       21544 |      2.14 MB
                             weakref |       17261 |      1.45 MB
              parso.python.tree.Name |       14451 |      1.21 MB
  hdmf.build.builders.DatasetBuilder |        4662 |      1.17 MB
        parso.python.tree.PythonNode |       15229 |      1.05 MB
                               types |   # objects |   total size
==================================== | =========== | ============
                                list |      238543 |    141.41 MB
                         _io.BytesIO |           5 |     90.54 MB
                                 str |      462310 |     40.43 MB
                       numpy.ndarray |          81 |     35.19 MB
                                dict |       85620 |     24.36 MB
                                 set |       23969 |      5.96 MB
                                code |       23345 |      3.22 MB
                                type |        3240 |      2.85 MB
                                 int |      101965 |      2.83 MB
                               tuple |       38958 |      2.74 MB
          parso.python.tree.Operator |       21544 |      2.14 MB
                             weakref |       18524 |      1.55 MB
  hdmf.build.builders.DatasetBuilder |        5180 |      1.30 MB
              parso.python.tree.Name |       14451 |      1.21 MB
        parso.python.tree.PythonNode |       15229 |      1.05 MB
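
As an aside, pympler's tracker module gives a more compact view of the same growth; a SummaryTracker prints only the objects allocated since the previous check, so the leaked BytesIO objects show up directly in the diff:

from pympler import tracker

tr = tracker.SummaryTracker()
for i in range(5):
    ds = load_data_set(nwb_file, ontology, load_into_memory=True)
    tr.print_diff()  # prints only objects allocated since the previous call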

I'm not sure why the load_into_memory option still exists for data set creation. In the earlier, NWB1 version of the code, there were a bunch of h5py.File() calls that slowed things way down when a string was passed as the argument. That's why the contents of the NWB1 file were optionally loaded into memory: it sped things up considerably without changing the existing h5py.File() calls, since h5py.File() can take a BytesIO object as well as a file path.
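
For context, a minimal sketch of that NWB1-era pattern (illustrative only, not the actual ipfx code): the file is read from disk once, and the in-memory buffer is then handed to h5py.File(), which accepts file-like objects as well as paths.

import io
import h5py

with open(nwb_path, 'rb') as f:   # nwb_path is a placeholder NWB1 file path
    buf = io.BytesIO(f.read())    # single disk read

h5 = h5py.File(buf, 'r')          # opening from the buffer avoids re-reading from disk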

However, the code now uses pynwb instead of h5py to access the NWB files, so I don't think creating a BytesIO object gains anything (and it appears to cause the memory leak). The code where that happens is here:

if load_into_memory:
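
Paraphrasing that branch (see the linked source for the exact code), the pattern is roughly:

import io

if load_into_memory:
    with open(nwb_file, 'rb') as f:
        nwb_file = io.BytesIO(f.read())  # whole-file in-memory copy; one of
                                         # these is retained per data set object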

I'm working around the issue at the moment by editing create_ephys_data_set() to take load_into_memory as a parameter that it passes on (see the sketch below), but I think the load-into-memory code can probably just be removed instead, in which case no other code would have to change.
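
The workaround looks roughly like this (a hypothetical sketch; the real create_ephys_data_set() in ipfx also contains logic for picking the reader class, which is omitted here):

def create_ephys_data_set(nwb_file, ontology, load_into_memory=False):
    # Hypothetical pass-through: expose load_into_memory on the factory
    # and forward it instead of hard-coding True.
    labnotebook = LabNotebookReaderIgorNwb(nwb_file)
    return MIESNWBData(
        nwb_file=nwb_file,
        notebook=labnotebook,
        ontology=ontology,
        load_into_memory=load_into_memory,
    )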

Environment (please complete the following information):

  • OS & version: CentOS 7
  • Python version: 3.7.7
  • AllenSDK version: 1.5.1
gouwens added the bug label Jan 21, 2021
tmchartrand (Collaborator) commented:
Ah, pretty sure I've experienced this too - nice work tracking it down, @gouwens.
