[WIP] Fix multi worker and pip installed hdf5plugin #15

Open · k-chaney wants to merge 1 commit into main

Conversation

k-chaney

I spent some time digging into two primary problems:

  1. pip having a harder time with the hdf5 plugin. This was ultimately fixed by loading imports in the correct order and setting an environment variable that h5py looks for (a sketch of this approach follows below).
  • pip_blosc_fix.py
    • This file provides a potential fix for h5py not picking up the correct plugin directory. By default it attempts to search the hdf5 default location (which may not exist).
  2. I was seeing issues with more than one worker. I ran into this before with h5py, and it was solved by having the child processes open their own handles to the file.
  • dataset/sequence.py
    • Fixes the issues around a single file descriptor being inherited by the child processes. We wait to open the hdf5 files until we know we are in the process accessing individual items.

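A minimal sketch of the plugin-path approach from item 1 above (the actual pip_blosc_fix.py is not reproduced in this thread, so the details here, in particular the "plugins" subdirectory of the hdf5plugin package and the file path, are assumptions rather than the shipped fix):

import os
from pathlib import Path

import hdf5plugin  # importing hdf5plugin registers the Blosc filter with h5py

# HDF5_PLUGIN_PATH is the environment variable HDF5 consults when searching for
# filter plugins; pointing it at the pip-installed package avoids the missing
# /usr/local/hdf5/lib/plugin default seen in the traceback below. The "plugins"
# subdirectory name is an assumption about the package layout.
os.environ.setdefault("HDF5_PLUGIN_PATH", str(Path(hdf5plugin.__file__).parent / "plugins"))

import h5py  # imported only after the plugin path is set

with h5py.File("/path/to/events.h5", "r") as f:  # illustrative path
    ms_to_idx = f["ms_to_idx"][:10]  # a compressed-dataset read that fails without the plugin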
@magehrig
Contributor

magehrig commented May 26, 2021

Hi @k-chaney

Thanks for your contribution! Could you tell me more about those issues:

  1. How should this script be used for people having similar issues with pip? Just modify the path to the h5 file and execute it once in their pip environment? Is there a way for me to replicate this issue?
  2. I was actually expecting problems with h5 and multi-processing but have not encountered any so far. Could you tell me what exactly the problem was in your case (slow, crash, ...)?

@k-chaney
Author

For context, I have been installing packages through pip as conda is slow for my purposes.

For 1, I get this error when I install through pip and use your code as is:

ken@node-3090-3:~/research/EvDL$ python3 train.py --model_name LitUNet --gpus=1 --batch_size=2 --num_workers=2 --dataset DSEC_Subset
Num input channels: 10                                                         
GPU available: True, used: True                                                                                                                                                                                                                                                                                              
TPU available: False, using: 0 TPU cores                                       
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]                                                                                                                     
2021-05-26 11:46:16.294761: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0                                                                                                                                                                            
                                                                                                                                                              
  | Name | Type | Params                                                                                                                                                                                                                                                                                                     
------------------------------                                                                                                                                                                                                                                                                                               
0 | unet | UNet | 13.4 M                                                                                                                                                                                                                                                                                                     
------------------------------                                                                                                                                                                                                                                                                                               
13.4 M    Trainable params                                                                                                                                                                                                                                                                                                   
0         Non-trainable params                                                                                                                                
13.4 M    Total params                                                                                                                                                                                                                                                                                                       
53.453    Total estimated model params size (MB)                                
/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 128 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)                                               
Epoch 0:   0%|                                                                                                                                                        | 0/134 [00:00<?, ?it/s]Traceback (most recent call last):                                                                                             
  File "train.py", line 83, in <module>                                                                                                                       
    trainer.fit(autoencoder, train_loader)                                                                                                                                                                                                                                                                                   
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit                                                   
    self._run(model)                                                                                                                                                                                                                                                                                                         
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()                                                                                                                                                                                                                                                                                                          
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)                                                                                                                                                                                                                                                                                    
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training                                
    self.training_type_plugin.start_training(trainer)                                                                                                                                                                                                                                                                        
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training             
    self._results = trainer.run_stage()                                                                                                                                                                                                                                                                                      
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()                                                                                                                                                                                                                                                                                                  
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train                                             
    self.train_loop.run_training_epoch()                                                                                                                                                                                                                                                                                     
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 481, in run_training_epoch                              
    for batch_idx, (batch, is_last_batch) in train_dataloader:                                                                                                                                                                                                                                                               
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/profiler/profilers.py", line 112, in profile_iterable
    value = next(iterator)                                                                                                                                                                                                                                                                                                   
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 530, in prefetch_iterator
    last = next(it)                                                                                                                                                                                                                                                                                                          
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 464, in __next__
    return self.request_next_batch(self.loader_iters)                                                                                                                                                                                                                                                                        
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 478, in request_next_batch
    return apply_to_collection(loader_iters, Iterator, next)                                                                                                                                                                                                                                                                 
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 84, in apply_to_collection                               
    return function(data, *args, **kwargs)                                                                                                                                                                                                                                                                                   
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 517, in __next__                                                          
    data = self._next_data()                                                                                                                                                                                                                                                                                                 
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)                                                                                                                                                                                                                                                                                          
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1225, in _process_data                                                    
    data.reraise()                                                                                                                                                                                                                                                                                                           
  File "/usr/local/lib/python3.8/dist-packages/torch/_utils.py", line 429, in reraise                                                                         
    raise self.exc_type(msg)                                                                                                                                  
OSError: Caught OSError in DataLoader worker process 0.                                                                                                       
Original Traceback (most recent call last):                                                                                                                   
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop                                                   
    data = fetcher.fetch(index)                                                                                                                                                                                                                                                                                              
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch                                                            
    data = [self.dataset[idx] for idx in possibly_batched_index]                                                                                              
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>                                                       
    data = [self.dataset[idx] for idx in possibly_batched_index]                                                                                              
  File "/mnt/beegfs/home/ken/research/EvDL/datasets/dsec_dataset.py", line 134, in __getitem__                                                                
    self.__open_h5f()                                                                                                                                         
  File "/mnt/beegfs/home/ken/research/EvDL/datasets/dsec_dataset.py", line 90, in __open_h5f                                                                  
    self.event_slicers[location] = EventSlicer(h5f_location)                                                                                                  
  File "/mnt/beegfs/home/ken/research/EvDL/datasets/utils/eventslicer.py", line 31, in __init__                                                                
    self.ms_to_idx = np.asarray(self.h5f['ms_to_idx'], dtype='int64')                                                                                         
  File "/usr/local/lib/python3.8/dist-packages/numpy/core/_asarray.py", line 83, in asarray                                                                                                                                                                                                                                  
    return array(a, dtype, copy=False, order=order)                                                                                                           
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper                                                                                       
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper                                                                                       
  File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/dataset.py", line 772, in __array__                                                                   
    self.read_direct(arr)                                                                                                                                                                                                                                                                                                    
  File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/dataset.py", line 733, in read_direct                                                                                                                                                                                                                                
    self.id.read(mspace, fspace, dest, dxpl=self._dxpl)                         
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper         
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper         
  File "h5py/h5d.pyx", line 182, in h5py.h5d.DatasetID.read                     
  File "h5py/_proxy.pyx", line 130, in h5py._proxy.dset_rw                                                                                                                                                                                                                                                                   
  File "h5py/_proxy.pyx", line 84, in h5py._proxy.H5PY_H5Dread
OSError: Can't read data (can't open directory: /usr/local/hdf5/lib/plugin)

This led me down the rabbit hole of figuring out how hdf5 handles plugins (and the environment variables that control this). However, after more poking and prodding to reproduce the issue, it turned out my solution might have been overly complex. It looks like the minimal fix is just:

import hdf5plugin
import h5py

This could be added into your code directly (as it shouldn't have side effects). I did a quick grep of the library code, and it appears you were relying on hdf5 to automatically grab the plugin. This works inside a conda environment, but not in a pip environment. With this fixed, I moved on to the next issue.

For 2, these are the errors that I saw when I go through a pip installation and use more than one worker. Note that this doesn't happen with a conda install.

ken@node-3090-3:~/research/EvDL$ python3 train.py --model_name LitUNet --gpus=1 --batch_size=2 --num_workers=2 --dataset DSEC_Subset
Num input channels: 10
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
2021-05-26 11:38:24.671640: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0

  | Name | Type | Params
------------------------------
0 | unet | UNet | 13.4 M
------------------------------
13.4 M    Trainable params
0         Non-trainable params
13.4 M    Total params
53.453    Total estimated model params size (MB)
/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 128 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Epoch 0:   0%|                                                                                                                                                        | 0/134 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train.py", line 83, in <module>
    trainer.fit(autoencoder, train_loader)
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
    self.train_loop.run_training_epoch()
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 481, in run_training_epoch
    for batch_idx, (batch, is_last_batch) in train_dataloader:
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/profiler/profilers.py", line 112, in profile_iterable
    value = next(iterator)
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 530, in prefetch_iterator
    last = next(it)
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 464, in __next__
    return self.request_next_batch(self.loader_iters)
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 478, in request_next_batch
    return apply_to_collection(loader_iters, Iterator, next)
  File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 84, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.8/dist-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
OSError: Caught OSError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/mnt/beegfs/home/ken/research/EvDL/datasets/dsec_dataset.py", line 143, in __getitem__
    event_data = self.event_slicers[location].get_events(ts_start, ts_end)
  File "/mnt/beegfs/home/ken/research/EvDL/datasets/utils/eventslicer.py", line 67, in get_events
    time_array_conservative = np.asarray(self.events['t'][t_start_ms_idx:t_end_ms_idx])
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/dataset.py", line 573, in __getitem__
    self.id.read(mspace, fspace, arr, mtype, dxpl=self._dxpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 182, in h5py.h5d.DatasetID.read
  File "h5py/_proxy.pyx", line 130, in h5py._proxy.dset_rw
  File "h5py/_proxy.pyx", line 84, in h5py._proxy.H5PY_H5Dread
OSError: Can't read data (Blosc decompression error)

In my experience with hdf5 (I was in charge of converting MVSEC), these sorts of errors are related to the same file descriptor being shared between processes. The solution is simply to open the hdf5 files from within the child process (i.e. in the __getitem__ function).
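For reference, a minimal sketch of that pattern (the class, path, and dataset key names here are illustrative, not the actual dataset/sequence.py changes):

import h5py
import hdf5plugin  # noqa: F401 -- registers the Blosc filter, needed for pip installs
from torch.utils.data import Dataset

class LazyH5Dataset(Dataset):
    def __init__(self, h5_path: str):
        self.h5_path = h5_path
        self.h5f = None  # do NOT open here: an open handle would be inherited/pickled by workers
        with h5py.File(h5_path, "r") as f:  # open and close once, for metadata only
            self.length = f["events/t"].shape[0]  # illustrative dataset key

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Opening lazily inside __getitem__ means each DataLoader worker process
        # gets its own file descriptor instead of sharing the parent's.
        if self.h5f is None:
            self.h5f = h5py.File(self.h5_path, "r")
        return self.h5f["events/t"][idx]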

I will do more digging to see what the differences between the installations are. On the surface they seem very similar, but more digging will most likely reveal why conda works out of the box and pip does not.

@magehrig
Contributor

Very interesting, thanks.

I think it would then make sense for me to adapt the documentation for the pip installation. As for the code, I believe it is sufficient to catch the import error of hdf5plugin and inform the user that installing hdf5plugin is required for a pip installation, but not otherwise. E.g.

try:
    import hdf5plugin
except ImportError:
    print("Install the hdf5plugin if you are using pip instead of conda: https://pypi.org/project/hdf5plugin/")

@magehrig magehrig linked an issue May 26, 2021 that may be closed by this pull request
@Tobias-Fischer

Hi @k-chaney - I just came across this issue. Have you tried using https://github.com/mamba-org/mamba, which is a fast drop-in replacement for conda?

@shiba24

shiba24 commented Nov 10, 2021

Just a quick check-in to share my experience:

  • Note on the h5 file with multiprocessing

I agree with the second issue; opening the same h5 file across processes is troublesome (like this Stack Overflow question). The quick fix is to use num_workers=0, or something like what @k-chaney suggested, I guess.
(In my case, the error is TypeError: h5py objects cannot be pickled rather than OSError. Btw, in my case that error is from my Mac; in my Ubuntu + Docker environment I don't get it, which means it works with num_workers > 0 even when opening the h5 files outside __getitem__, but I'm not digging into it further.)

  • Note on pip

However, for the first issue, pip works perfectly with hdf5plugin in my environment.
(Personally speaking, I don't like conda because it messes up my environment.)

I use:

  • Python 3.9.x
  • Both working on an M1 Mac (without Docker) and Ubuntu 20.x (inside Docker, but I guess Docker does not matter for this pip/hdf5plugin issue)
  • venv
  • poetry 1.1.11 (but I think this is optional; I don't need it to run my script)

I'd recommend using venv if your pip has problems and you are using the system Python.
Hope this helps.

Shintaro

Successfully merging this pull request may close these issues.

Accessing data with blosc-hdf5-plugin in windows 10