Memory leak when loading NWB2 files with default options #494

Open
gouwens opened this issue Jan 21, 2021 · 1 comment

gouwens (Collaborator) commented Jan 21, 2021

Describe the bug
Loading NWB files with the create_ephys_data_set() function or with subclasses of EphysNWBData leads to a memory leak that becomes a problem when processing many files.

To Reproduce
I'm using the muppy module from the pympler package to profile memory usage (conda install pympler).

Here's an example that illustrates the problem.

# Setup
from ipfx.stimulus import StimulusOntology
import allensdk.core.json_utilities as ju
from ipfx.dataset.mies_nwb_data import MIESNWBData
from ipfx.dataset.labnotebook import LabNotebookReaderIgorNwb
from pympler import muppy, summary

ontology = StimulusOntology(ju.read(StimulusOntology.DEFAULT_STIMULUS_ONTOLOGY_FILE))

# example nwb2 file
nwb_file = '/allen/programs/celltypes/production/mousecelltypes/prod176/Ephys_Roi_Result_628543361/nwb2_Scnn1a-Tg2-Cre;Ai14-346639.04.02.01.nwb'

# function to load & return a data set object
def load_data_set(nwb_path, ontology, load_into_memory):
    labnotebook = LabNotebookReaderIgorNwb(nwb_path)  # use the function argument, not the global nwb_file
    data_set = MIESNWBData(
        nwb_file=nwb_path,
        notebook=labnotebook,
        ontology=ontology,
        load_into_memory=load_into_memory
    )
    return data_set

When you run this with load_into_memory=True (the default), you see an accumulation of _io.BytesIO objects after each call. Note that the create_ephys_data_set() function does not even allow a user to set load_into_memory; it is always set to True.

for i in range(5):
    ds = load_data_set(nwb_file, ontology, load_into_memory=True)
    all_objects = muppy.get_objects()
    sum1 = summary.summarize(all_objects)
    # Prints out a summary of the large objects
    summary.print_(sum1)

Output is:

                         types |   # objects |   total size
============================== | =========== | ============
                          list |      103841 |     59.01 MB
                           str |      316778 |     29.51 MB
                   _io.BytesIO |           1 |     18.11 MB
                          dict |       42652 |     13.11 MB
                 numpy.ndarray |          45 |      3.52 MB
                          code |       23345 |      3.22 MB
                          type |        3240 |      2.85 MB
                         tuple |       32384 |      2.32 MB
    parso.python.tree.Operator |       21544 |      2.14 MB
                           set |        5249 |      1.66 MB
                           int |       53083 |      1.45 MB
        parso.python.tree.Name |       14451 |      1.21 MB
  parso.python.tree.PythonNode |       15229 |      1.05 MB
                          cell |       11276 |    616.66 KB
                       weakref |        6962 |    598.30 KB
                         types |   # objects |   total size
============================== | =========== | ============
                          list |      118789 |     66.32 MB
                   _io.BytesIO |           2 |     36.22 MB
                           str |      332939 |     30.69 MB
                          dict |       47406 |     14.30 MB
                 numpy.ndarray |          49 |      7.04 MB
                          code |       23345 |      3.22 MB
                          type |        3240 |      2.85 MB
                         tuple |       32981 |      2.36 MB
    parso.python.tree.Operator |       21544 |      2.14 MB
                           set |        7329 |      2.14 MB
                           int |       58487 |      1.60 MB
        parso.python.tree.Name |       14451 |      1.21 MB
  parso.python.tree.PythonNode |       15229 |      1.05 MB
                       weakref |        8225 |    706.84 KB
                          cell |       11277 |    616.71 KB
                         types |   # objects |   total size
============================== | =========== | ============
                          list |      133737 |     74.37 MB
                   _io.BytesIO |           3 |     54.32 MB
                           str |      349100 |     31.86 MB
                          dict |       52160 |     15.64 MB
                 numpy.ndarray |          53 |     10.56 MB
                          code |       23345 |      3.22 MB
                          type |        3240 |      2.85 MB
                           set |        9409 |      2.61 MB
                         tuple |       33578 |      2.39 MB
    parso.python.tree.Operator |       21544 |      2.14 MB
                           int |       63897 |      1.75 MB
        parso.python.tree.Name |       14451 |      1.21 MB
  parso.python.tree.PythonNode |       15229 |      1.05 MB
                       weakref |        9488 |    815.38 KB
                          cell |       11278 |    616.77 KB
                         types |   # objects |   total size
============================== | =========== | ============
                          list |      148685 |     82.42 MB
                   _io.BytesIO |           4 |     72.43 MB
                           str |      365261 |     33.04 MB
                          dict |       56914 |     16.82 MB
                 numpy.ndarray |          57 |     14.08 MB
                          code |       23345 |      3.22 MB
                           set |       11489 |      3.09 MB
                          type |        3240 |      2.85 MB
                         tuple |       34175 |      2.43 MB
    parso.python.tree.Operator |       21544 |      2.14 MB
                           int |       69318 |      1.91 MB
        parso.python.tree.Name |       14451 |      1.21 MB
  parso.python.tree.PythonNode |       15229 |      1.05 MB
                       weakref |       10751 |    923.91 KB
                          cell |       11279 |    616.82 KB
                               types |   # objects |   total size
==================================== | =========== | ============
                                list |      163633 |     91.29 MB
                         _io.BytesIO |           5 |     90.54 MB
                                 str |      381422 |     34.21 MB
                                dict |       61668 |     18.01 MB
                       numpy.ndarray |          61 |     17.60 MB
                                 set |       13569 |      3.57 MB
                                code |       23345 |      3.22 MB
                                type |        3240 |      2.85 MB
                               tuple |       34772 |      2.47 MB
          parso.python.tree.Operator |       21544 |      2.14 MB
                                 int |       74739 |      2.06 MB
              parso.python.tree.Name |       14451 |      1.21 MB
        parso.python.tree.PythonNode |       15229 |      1.05 MB
                             weakref |       12014 |      1.01 MB
  hdmf.build.builders.DatasetBuilder |        2590 |    667.73 KB

Note the increasing number of _io.BytesIO objects and memory usage each time through the loop.

If you instead set load_into_memory=False and run this loop immediately afterwards, you see that the _io.BytesIO count stays at 5 (left over from the earlier run) and does not increase further.

for i in range(5):
    ds = load_data_set(nwb_file, ontology, load_into_memory=False)
    all_objects = muppy.get_objects()
    sum1 = summary.summarize(all_objects)
    # Prints out a summary of the large objects
    summary.print_(sum1)

Output is:

                               types |   # objects |   total size
==================================== | =========== | ============
                                list |      178751 |    100.19 MB
                         _io.BytesIO |           5 |     90.54 MB
                                 str |      397670 |     35.46 MB
                       numpy.ndarray |          65 |     21.12 MB
                                dict |       66604 |     19.55 MB
                                 set |       15649 |      4.05 MB
                                code |       23345 |      3.22 MB
                                type |        3240 |      2.85 MB
                               tuple |       36570 |      2.59 MB
                                 int |       80181 |      2.21 MB
          parso.python.tree.Operator |       21544 |      2.14 MB
              parso.python.tree.Name |       14451 |      1.21 MB
                             weakref |       13472 |      1.13 MB
        parso.python.tree.PythonNode |       15229 |      1.05 MB
  hdmf.build.builders.DatasetBuilder |        3108 |    801.28 KB
                               types |   # objects |   total size
==================================== | =========== | ============
                                list |      193699 |    109.98 MB
                         _io.BytesIO |           5 |     90.54 MB
                                 str |      413830 |     36.70 MB
                       numpy.ndarray |          69 |     24.64 MB
                                dict |       71358 |     20.74 MB
                                 set |       17729 |      4.52 MB
                                code |       23345 |      3.22 MB
                                type |        3240 |      2.85 MB
                               tuple |       37167 |      2.63 MB
                                 int |       85627 |      2.37 MB
          parso.python.tree.Operator |       21544 |      2.14 MB
                             weakref |       14735 |      1.24 MB
              parso.python.tree.Name |       14451 |      1.21 MB
        parso.python.tree.PythonNode |       15229 |      1.05 MB
  hdmf.build.builders.DatasetBuilder |        3626 |    934.83 KB
                               types |   # objects |   total size
==================================== | =========== | ============
                                list |      208647 |    119.76 MB
                         _io.BytesIO |           5 |     90.54 MB
                                 str |      429990 |     37.94 MB
                       numpy.ndarray |          73 |     28.15 MB
                                dict |       76112 |     21.92 MB
                                 set |       19809 |      5.00 MB
                                code |       23345 |      3.22 MB
                                type |        3240 |      2.85 MB
                               tuple |       37764 |      2.67 MB
                                 int |       91073 |      2.52 MB
          parso.python.tree.Operator |       21544 |      2.14 MB
                             weakref |       15998 |      1.34 MB
              parso.python.tree.Name |       14451 |      1.21 MB
        parso.python.tree.PythonNode |       15229 |      1.05 MB
  hdmf.build.builders.DatasetBuilder |        4144 |      1.04 MB
                               types |   # objects |   total size
==================================== | =========== | ============
                                list |      223595 |    130.59 MB
                         _io.BytesIO |           5 |     90.54 MB
                                 str |      446150 |     39.19 MB
                       numpy.ndarray |          77 |     31.67 MB
                                dict |       80866 |     23.18 MB
                                 set |       21889 |      5.48 MB
                                code |       23345 |      3.22 MB
                                type |        3240 |      2.85 MB
                               tuple |       38361 |      2.70 MB
                                 int |       96519 |      2.68 MB
          parso.python.tree.Operator |       21544 |      2.14 MB
                             weakref |       17261 |      1.45 MB
              parso.python.tree.Name |       14451 |      1.21 MB
  hdmf.build.builders.DatasetBuilder |        4662 |      1.17 MB
        parso.python.tree.PythonNode |       15229 |      1.05 MB
                               types |   # objects |   total size
==================================== | =========== | ============
                                list |      238543 |    141.41 MB
                         _io.BytesIO |           5 |     90.54 MB
                                 str |      462310 |     40.43 MB
                       numpy.ndarray |          81 |     35.19 MB
                                dict |       85620 |     24.36 MB
                                 set |       23969 |      5.96 MB
                                code |       23345 |      3.22 MB
                                type |        3240 |      2.85 MB
                                 int |      101965 |      2.83 MB
                               tuple |       38958 |      2.74 MB
          parso.python.tree.Operator |       21544 |      2.14 MB
                             weakref |       18524 |      1.55 MB
  hdmf.build.builders.DatasetBuilder |        5180 |      1.30 MB
              parso.python.tree.Name |       14451 |      1.21 MB
        parso.python.tree.PythonNode |       15229 |      1.05 MB
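
As an aside, pympler's tracker module gives a more compact view of the same growth; a SummaryTracker prints only the objects allocated since the previous check, so the leaked BytesIO objects show up directly in the diff:

from pympler import tracker

tr = tracker.SummaryTracker()
for i in range(5):
    ds = load_data_set(nwb_file, ontology, load_into_memory=True)
    tr.print_diff()  # prints only objects allocated since the previous call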

I'm not sure why the load_into_memory option still exists for data set creation. In the earlier, NWB1 version of the code, there were a bunch of h5py.File() calls that slowed things way down when a string was passed as the argument. That's why the contents of the NWB1 file were optionally loaded into memory: it sped things up considerably without changing the existing h5py.File() calls, since h5py.File() can take a BytesIO object as well as a file path.
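
For context, a minimal sketch of that NWB1-era pattern (illustrative only, not the actual ipfx code): the file is read from disk once, and the in-memory buffer is then handed to h5py.File(), which accepts file-like objects as well as paths.

import io
import h5py

with open(nwb_path, 'rb') as f:   # nwb_path is a placeholder NWB1 file path
    buf = io.BytesIO(f.read())    # single disk read

h5 = h5py.File(buf, 'r')          # opening from the buffer avoids re-reading from disk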

However, the code now uses pynwb instead of h5py to access the NWB files, so I don't think creating a BytesIO object gains anything (and it appears to cause the memory leak). The code where that happens is here:

if load_into_memory:
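
Paraphrasing that branch (see the linked source for the exact code), the pattern is roughly:

import io

if load_into_memory:
    with open(nwb_file, 'rb') as f:
        nwb_file = io.BytesIO(f.read())  # whole-file in-memory copy; one of
                                         # these is retained per data set object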

I'm working around the issue at the moment by editing create_ephys_data_set() to take load_into_memory as a parameter that it passes on (see the sketch below), but I think the load-into-memory code can probably just be removed instead, in which case no other code would have to change.
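
The workaround looks roughly like this (a hypothetical sketch; the real create_ephys_data_set() in ipfx also contains logic for picking the reader class, which is omitted here):

def create_ephys_data_set(nwb_file, ontology, load_into_memory=False):
    # Hypothetical pass-through: expose load_into_memory on the factory
    # and forward it instead of hard-coding True.
    labnotebook = LabNotebookReaderIgorNwb(nwb_file)
    return MIESNWBData(
        nwb_file=nwb_file,
        notebook=labnotebook,
        ontology=ontology,
        load_into_memory=load_into_memory,
    )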

Environment (please complete the following information):

  • OS & version: CentOS 7
  • Python version: 3.7.7
  • AllenSDK version: 1.5.1
gouwens added the bug label Jan 21, 2021
tmchartrand (Collaborator) commented:
Ah, pretty sure I've experienced this too - nice work tracking it down, @gouwens.
