Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tarfiles #27

Open
wants to merge 54 commits into
base: master
Choose a base branch
from
Open

Tarfiles #27

wants to merge 54 commits into from

Conversation

bganglia
Copy link
Contributor

@bganglia bganglia commented Aug 6, 2020

This pull request adds support for opening from tarfiles, zipfiles, and folders with any combination of tarfiles and/or zipfiles, whether in DICOM format or regular image formats such as PNG, and whether using a serial or parallel dataloader.

TODO:

  • Test cache
  • Tidy code
  • Use/test TarDataset with the rest of the classes
    • MIMIC_Dataset (tar)
    • NIH_Dataset (tar)
    • NLM_Dataset (zip)
    • PC_Dataset (tar)
    • RSNA_Pneumonia_Dataset (folder)
    • NIH_Google_Dataset (tar)
    • CheX_Dataset (zip)
    • Openi_Dataset (tar)
    • COVID-19 Dataset? (folder)
    • NLMTB_Dataset x 2 (zip)
  • Cache imgid -> path_in_archive mappings

@bganglia bganglia closed this Aug 6, 2020
@bganglia bganglia reopened this Aug 8, 2020
@bganglia bganglia marked this pull request as draft August 8, 2020 05:45
def __init__(self, imgpath):
if imgpath.endswith(".tar"):
self.tarred = tarfile.open(imgpath)
self.tar_paths = self.tarred.getmembers()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you said this takes a lot of time. I see how this approach is super robust. I did some tests for time and it seems like just 30 seconds for the MIMIC data. What about caching this using a dict based on the file path? It would speed things up if multiple objects are created? But it seems like a reasonable price to pay.
The alternative is to skip it and assume a file structure but that is not as nice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, using a dict, the second load is around 10x faster on my machine.

The dict could also be pickled so there is only one slow load. That option would lead to issues if someone wanted to change the tarfile, although I don't know why they would do that.

Copy link
Member

@ieee8023 ieee8023 Aug 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing hashing and caching a file could be nice but also could be annoying to debug and create more issues (like if there are no write permissions).


class TarDataset(Dataset):
def __init__(self, imgpath):
if imgpath.endswith(".tar"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about using tarfile.is_tarfile ? This will allow for compressed files which people may want to use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, thanks

#print(img_path)
img = imread(img_path)
img = self.get_image(imgid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You added this for NIH_Dataset but it doesn't extend TarDataset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, let me fix that

Copy link
Member

@ieee8023 ieee8023 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant it the other way. All the datasets should just become TarDatasets right? This feature makes it much easier to work with all these datasets.

@bganglia
Copy link
Contributor Author

bganglia commented Aug 8, 2020

Ah, ok. I agree, it was just a temporary thing because I have only tested this with MIMIC so far. I can fix it the other way and make a quick test

@bganglia
Copy link
Contributor Author

bganglia commented Aug 10, 2020

I was running into some weird multiple inheritance issues when I tried creating a ZipDataset class and then tried having datasets that can be initialized from both.

I think that it would be easier to extend to zipfiles, etc. if instead of having a TarDataset class, there was a "storage interface" object that takes a file path in the constructor and has a .get_image() method returning a numpy array. Then the dataloaders could all inherit from a class that picks the right type of storage interface, like this:
https://github.com/bganglia/torchxrayvision/blob/34daddb2551821cce5b5e5d13f7c4b7bba10b56e/torchxrayvision/datasets.py#L267

Instead of having a self.get_image() method, the datasets could just call self.interface.get_image().

@bganglia bganglia marked this pull request as ready for review August 28, 2020 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants