Describe the bug
Loading NWB files using the create_ephys_data_set() function or subclasses of EphysNWBData leads to a memory leak that becomes a problem when processing many files.
To Reproduce
I'm using the muppy module from the pympler package to profile memory usage (conda install pympler).
Here's an example that illustrates the problem.
# Setup
from ipfx.stimulus import StimulusOntology
import allensdk.core.json_utilities as ju
from ipfx.dataset.mies_nwb_data import MIESNWBData
from ipfx.dataset.labnotebook import LabNotebookReaderIgorNwb
from pympler import muppy, summary

ontology = StimulusOntology(ju.read(StimulusOntology.DEFAULT_STIMULUS_ONTOLOGY_FILE))

# example nwb2 file
nwb_file = '/allen/programs/celltypes/production/mousecelltypes/prod176/Ephys_Roi_Result_628543361/nwb2_Scnn1a-Tg2-Cre;Ai14-346639.04.02.01.nwb'

# function to load & return a data set object
def load_data_set(nwb_path, ontology, load_into_memory):
    labnotebook = LabNotebookReaderIgorNwb(nwb_path)
    data_set = MIESNWBData(
        nwb_file=nwb_path,
        notebook=labnotebook,
        ontology=ontology,
        load_into_memory=load_into_memory
    )
    return data_set
When you run this using load_into_memory = True (the default), you see an accumulation of _io.BytesIO objects after each call. Note that the create_ephys_data_set() function does not even allow a user to set load_into_memory - it automatically gets set to True.
for i in range(5):
    ds = load_data_set(nwb_file, ontology, load_into_memory=True)
    all_objects = muppy.get_objects()
    sum1 = summary.summarize(all_objects)
    # Prints out a summary of the large objects
    summary.print_(sum1)
Note the increasing number of _io.BytesIO objects and memory usage each time through the loop.
If you instead set load_into_memory = False and run this loop immediately afterwards, you see that the number of _io.BytesIO objects stays at 5 (left over from the earlier run) but does not increase further.
for i in range(5):
    ds = load_data_set(nwb_file, ontology, load_into_memory=False)
    all_objects = muppy.get_objects()
    sum1 = summary.summarize(all_objects)
    # Prints out a summary of the large objects
    summary.print_(sum1)
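To make the comparison more direct, the full summary can be replaced with a count of just the _io.BytesIO instances after each load. This is a minimal sketch using pympler's muppy.filter, not part of the original reproduction:

import io
from pympler import muppy

for i in range(5):
    ds = load_data_set(nwb_file, ontology, load_into_memory=True)
    # Count the live BytesIO objects after each load; with load_into_memory=True
    # this count grows by one per iteration, with load_into_memory=False it stays flat.
    bytesio_objects = muppy.filter(muppy.get_objects(), Type=io.BytesIO)
    print(i, len(bytesio_objects))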
I'm not sure why the load_into_memory option still exists for data set creation. In the earlier, NWB1 version of the code, there were a bunch of h5py.File() calls that slowed the code way down when a string was passed as the argument. That's why the contents of the NWB1 file were optionally loaded into memory - it sped things up considerably without changing all the existing h5py.File() calls, since h5py.File() accepts a BytesIO object as well as a file path.
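For context, the NWB1-era speedup relied on exactly that: h5py.File() takes a file-like object in place of a path, so the whole file can be read once and then re-opened from RAM. Something along these lines (a sketch of the idea, not the original ipfx code):

import io
import h5py

# Read the whole NWB1 file into memory once...
with open(nwb_file, 'rb') as f:
    nwb_bytes = io.BytesIO(f.read())

# ...so that later h5py.File() calls read from RAM instead of re-opening the file on disk.
with h5py.File(nwb_bytes, 'r') as h5:
    top_level_groups = list(h5.keys())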
However, the code now uses pynwb instead of h5py for accessing the NWB files, so I don't think creating a BytesIO object gains anything (and it seems to cause a memory leak). The code where that happens is here: ipfx/ipfx/dataset/ephys_nwb_data.py, line 94 (at commit 75a3ea7).
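For comparison, pynwb only needs a path; standard usage looks roughly like this (a sketch, not the ipfx code at that line), with no whole-file copy held in memory:

from pynwb import NWBHDF5IO

# Open the NWB2 file by path; pynwb/h5py read lazily, so nothing forces a
# BytesIO copy of the entire file to stay alive.
with NWBHDF5IO(nwb_file, mode='r') as io_reader:
    nwbfile = io_reader.read()
    print(nwbfile.identifier)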
I'm working around the issue at the moment by editing create_ephys_data_set() to take load_into_memory as a parameter that it passes on, but I think that code can probably just be removed instead (and then no other code would have to change).
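The workaround looks roughly like the following. This is only a sketch of the shape of the change - the real create_ephys_data_set() has a different signature and also handles NWB1 files and ontology defaults, which are omitted here - with the default chosen to preserve the current behavior:

from ipfx.dataset.labnotebook import LabNotebookReaderIgorNwb
from ipfx.dataset.mies_nwb_data import MIESNWBData

def create_ephys_data_set(nwb_file, ontology, load_into_memory=True):
    # Same as the loader above, but load_into_memory is exposed to the caller
    # instead of being hard-coded to True; passing False avoids the BytesIO
    # accumulation when many files are processed.
    labnotebook = LabNotebookReaderIgorNwb(nwb_file)
    return MIESNWBData(
        nwb_file=nwb_file,
        notebook=labnotebook,
        ontology=ontology,
        load_into_memory=load_into_memory,
    )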
Environment (please complete the following information):
OS & version: CentOS 7
Python version: 3.7.7
AllenSDK version: 1.5.1