implement hdf5 dataset reader op #356

Merged
merged 3 commits into from
May 31, 2024
48 changes: 48 additions & 0 deletions fuse/data/ops/ops_read.py
@@ -20,6 +20,7 @@
from typing import Hashable, List, Optional, Dict, Union
from fuse.utils.file_io.file_io import read_dataframe
import pandas as pd
import h5py

from fuse.data import OpBase
from fuse.utils.ndict import NDict
@@ -113,3 +114,50 @@ def get_all_keys(self) -> List[Hashable]:
        :return: list of dataframe index values
        """
        return list(self.data.keys())


class OpReadHDF5(OpBase):
    """
    Op reading data from hd5f based dataset
Collaborator

hd5f typo :)

    """

    def __init__(
        self,
        data_filename: Optional[str] = None,
        columns_to_extract: Optional[List[str]] = None,
        rename_columns: Optional[Dict[str, str]] = None,
        key_index: str = "data.sample_id",
        key_column: str = "sample_id",
    ):
        """
        :param data_filename: path to hdf5 file
        :param columns_to_extract: list of columns to extract - dataset keys to extract. When None (default) all columns are extracted
        :param rename_columns: rename columns
        :param key_index: name of value in sample_dict which will be used as the key/index
        :param key_column: name of the column which use as key/index. In case of None, the original dataframe index will be used to extract the values for a single sample.
        """
        # store input
        self._data_filename = data_filename
        self._columns_to_extract = columns_to_extract
        self._rename_columns = rename_columns if rename_columns is not None else {}
Collaborator

Nice :)

Collaborator

(avoiding mutable default values)

Collaborator

I think we can use frozen dict here as well
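
The comments above concern the default value of `rename_columns`. As a minimal sketch of the pitfall and of two ways around it: the helper names `bad_rename`, `safe_rename`, and `frozen_rename` are hypothetical and not part of this PR, and `types.MappingProxyType` from the standard library is used here as a stand-in for a "frozen dict" (the reviewer may have had a different frozen-dict type in mind).

```python
from types import MappingProxyType
from typing import Dict, Mapping, Optional


def bad_rename(column: str, mapping: Dict[str, str] = {}) -> Dict[str, str]:
    # The default dict is created once, at function definition time,
    # so every call that omits `mapping` mutates the same object.
    mapping.setdefault(column, column.upper())
    return mapping


first = bad_rename("a")
second = bad_rename("b")
shared = first is second  # both calls returned the same default dict


def safe_rename(column: str, mapping: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    # None sentinel, as in the PR: a fresh dict is built on each call.
    mapping = mapping if mapping is not None else {}
    mapping.setdefault(column, column.upper())
    return mapping


# Read-only empty mapping: safe to share as a default because it cannot be mutated.
EMPTY: Mapping[str, str] = MappingProxyType({})


def frozen_rename(column: str, mapping: Mapping[str, str] = EMPTY) -> str:
    # Only reads from the mapping, which is all OpReadHDF5.__call__ needs.
    return mapping.get(column, column)
```

Either approach avoids shared state between instances; the read-only default additionally keeps the signature free of `Optional`, at the cost of an extra import.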

        self._key_index = key_index
        self._key_column = key_column

        self._h5 = h5py.File(self._data_filename, "r")
Collaborator

Seeing that, I was concerned that we load the entire file into memory, but it seems we don't:
https://stackoverflow.com/questions/40449659/does-h5py-read-the-whole-file-into-memory
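
The behaviour described in the comment above can be checked with a small sketch. It assumes `h5py` and `numpy` are installed; the file name and the dataset name `values` are made up for illustration. Opening the file and looking up a dataset returns a handle, and data is read from disk only when the handle is indexed.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.h5")  # throwaway file for the demo

    # Write a small dataset so the example is self-contained.
    with h5py.File(path, "w") as f:
        f.create_dataset("values", data=np.arange(1000, dtype=np.int64))

    with h5py.File(path, "r") as f:
        dset = f["values"]  # h5py.Dataset handle, not an in-memory array
        is_lazy_handle = isinstance(dset, h5py.Dataset) and not isinstance(dset, np.ndarray)

        one_value = dset[10]    # reads only this element
        chunk = dset[100:105]   # reads only this slice, returned as a numpy array
```

This is the same access pattern as `self._h5[column][index]` in the op, so each sample lookup should touch only the requested element of each column.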


        if self._columns_to_extract is None:
            self._columns_to_extract = self._h5.keys()

        self._num_samples = len(self._h5[self._columns_to_extract[0]])
Collaborator

Hope this line doesn't load it into memory
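
On the concern above: `len()` of an h5py dataset is derived from the dataset's shape, which is stored as metadata, so it should not require reading the data itself. A minimal sketch, assuming `h5py` and `numpy` are available; the dataset names `sample_id` and `features` are illustrative only.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.h5")  # throwaway file for the demo

    with h5py.File(path, "w") as f:
        f.create_dataset("sample_id", data=np.arange(7))
        f.create_dataset("features", data=np.zeros((7, 3), dtype=np.float32))

    with h5py.File(path, "r") as f:
        features = f["features"]
        n_from_len = len(features)         # length of the first axis
        n_from_shape = features.shape[0]   # same number, taken from the shape
        full_shape = features.shape        # shape is metadata; no data is read
        n_members = len(f)                 # number of top-level datasets/groups
```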


    def num_samples(self) -> int:
        return self._num_samples
Comment on lines +153 to +154
Collaborator

I don't see where you use it


    def __call__(self, sample_dict: NDict) -> Union[None, dict, List[dict]]:

        index = sample_dict[self._key_index]
        for column in self._columns_to_extract:
            key_to_store = self._rename_columns.get(column, column)
            sample_dict[key_to_store] = self._h5[column][index]

        return sample_dict
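
The per-sample read in `__call__` can be tried outside of fuse with a small stand-alone sketch. Everything here is an assumption made for illustration: `read_sample` is a hypothetical free function, a plain `dict` stands in for `NDict`, and the dataset names `age` and `label` are invented. The sketch also wraps `keys()` in `list()`, on the assumption that the keys view returned by h5py does not support indexing such as `[0]`.

```python
import os
import tempfile
from typing import Dict, List, Optional

import h5py
import numpy as np


def read_sample(
    h5: h5py.File,
    sample_dict: dict,
    columns: Optional[List[str]] = None,
    rename: Optional[Dict[str, str]] = None,
    key_index: str = "data.sample_id",
) -> dict:
    """Copy one row of each requested dataset into sample_dict."""
    rename = rename if rename is not None else {}
    # Materialize the keys so the result is an ordinary, indexable list.
    columns = list(h5.keys()) if columns is None else columns

    index = sample_dict[key_index]
    for column in columns:
        key_to_store = rename.get(column, column)
        sample_dict[key_to_store] = h5[column][index]  # reads a single element
    return sample_dict


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.h5")  # throwaway file for the demo
    with h5py.File(path, "w") as f:
        f.create_dataset("age", data=np.array([30, 40, 50]))
        f.create_dataset("label", data=np.array([0, 1, 0]))

    with h5py.File(path, "r") as f:
        sample = read_sample(f, {"data.sample_id": 1}, rename={"age": "data.age"})
```

Here `sample` ends up holding the original `data.sample_id` plus one value per column, with `age` stored under the renamed key `data.age`.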