Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement hdf5 dataset reader op #356

Merged
merged 3 commits into from
May 31, 2024
Merged

implement hdf5 dataset reader op #356

merged 3 commits into from
May 31, 2024

Conversation

mosheraboh
Copy link
Collaborator

No description provided.

SagiPolaczek
SagiPolaczek previously approved these changes May 30, 2024
Copy link
Collaborator

@SagiPolaczek SagiPolaczek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if relevant, but just mentioning we have fuse.utils.file_io.file_io.load_hdf5() and fuse.utils.file_io.file_io.save_hdf5_safe() - maybe you'll find it relevant (tho I see the code here is cleaner)


class OpReadHDF5(OpBase):
"""
Op reading data from hd5f based dataset
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hd5f typo :)

# store input
self._data_filename = data_filename
self._columns_to_extract = columns_to_extract
self._rename_columns = rename_columns if rename_columns is not None else {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(avoiding mutable default values)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use frozen dict here as well

self._key_index = key_index
self._key_column = key_column

self._h5 = h5py.File(self._data_filename, "r")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing that I was concern that we load the entire file into memory, but seems not:
https://stackoverflow.com/questions/40449659/does-h5py-read-the-whole-file-into-memory

if self._columns_to_extract is None:
self._columns_to_extract = list(self._h5.keys())

self._num_samples = len(self._h5[self._columns_to_extract[0]])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hope this line doesn't load it into memory

Comment on lines +154 to +155
def num_samples(self) -> int:
return self._num_samples
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see where you use it

@mosheraboh mosheraboh merged commit 7ed012b into master May 31, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants