This repository has been archived by the owner on Oct 19, 2024. It is now read-only.

Load features for classification #104

Merged
merged 32 commits into from
Oct 3, 2024

Conversation

moerlemans
Contributor

@moerlemans moerlemans commented Aug 8, 2024

The changes in this PR make it possible to load features (h5 or Zarr) from a database.

By changing the backend of the TiledWsiDataset, the features can be loaded in much the same way as images. For now I'm assuming a classification task only, but it should be extendable to segmentation tasks as well. (For segmentation you might want multiple "tiles" of features per WSI; for classification you only want one set of features per WSI.)

LMK what you think
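To make the classification contract concrete, here is a minimal sketch of what "one set of features per WSI" could look like as a dataset. This is an illustration of the idea, not ahcore's actual API; the class name, dict-based storage, and sample keys are all assumptions.

```python
import numpy as np


# Hypothetical sketch: a classification-style dataset yields exactly one
# feature array per WSI, unlike segmentation, where several feature "tiles"
# per WSI could be returned.
class FeatureClassificationDataset:
    def __init__(self, features_per_wsi: dict, labels: dict):
        self._ids = sorted(features_per_wsi)
        self._features = features_per_wsi
        self._labels = labels

    def __len__(self) -> int:
        return len(self._ids)

    def __getitem__(self, index: int) -> dict:
        wsi_id = self._ids[index]
        # One (num_tiles, feature_dim) array and one label per WSI.
        return {"image": self._features[wsi_id], "target": self._labels[wsi_id]}
```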

@moerlemans moerlemans self-assigned this Aug 8, 2024
@moerlemans moerlemans added the enhancement New feature or request label Aug 8, 2024
Contributor Author


Actually, for_wsi_classification should also accept a list of pre-transforms, so we can just add pre_transforms in the configs.

Contributor


That should be generalized the same way the loss factory is

@moerlemans moerlemans linked an issue Aug 12, 2024 that may be closed by this pull request
@moerlemans moerlemans linked an issue Sep 11, 2024 that may be closed by this pull request
@moerlemans moerlemans marked this pull request as ready for review September 26, 2024 12:18
@moerlemans
Contributor Author

In addition to the available test, I've also confirmed that I can extract features with this model and then use them in a classification pipeline as expected.

Contributor

@jonasteuwen jonasteuwen left a comment


Some comments to begin with.

I think we could use two functions:

read_region_at_scale and read_region_at_level. The latter can be used, for instance, for features, and the former for images/masks. The former can use the SlideImage backend.
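A minimal sketch of how these two read paths could differ, using plain numpy arrays as a stand-in for the backend. The power-of-two pyramid model and all signatures here are assumptions for illustration, not ahcore's or dlup's actual API.

```python
import numpy as np


def read_region_at_level(pyramid, location, level, size):
    """Exact crop from one stored level -- no resampling, as features require."""
    x, y = location
    w, h = size
    return pyramid[level][y:y + h, x:x + w]


def read_region_at_scale(pyramid, location, scaling, size):
    """Crop at the nearest stored level, then nearest-neighbour resample to
    the requested size -- acceptable for images/masks, not for features."""
    # Assume level i is downsampled by 2**i; pick the closest match to `scaling`.
    level = min(range(len(pyramid)), key=lambda i: abs(2.0 ** -i - scaling))
    factor = (2.0 ** -level) / scaling  # level pixels per requested pixel
    x, y = location
    w, h = size
    lx, ly = int(x * factor), int(y * factor)
    lw, lh = max(1, round(w * factor)), max(1, round(h * factor))
    crop = pyramid[level][ly:ly + lh, lx:lx + lw]
    # Nearest-neighbour index mapping back to the requested size.
    rows = np.arange(h) * crop.shape[0] // h
    cols = np.arange(w) * crop.shape[1] // w
    return crop[np.ix_(rows, cols)]
```

The split keeps interpolation entirely on the image/mask path: a feature grid read through read_region_at_level can never be silently resampled.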

Comment on lines +152 to +156
elif data_format == DataFormat.FEATURE:
size = (num_samples, 1)
mpp = 1.0
tile_size = (1, 1)
tile_overlap = (0, 0)
Contributor


I would make the mpp equal to the product of the tile size and the mpp.

Contributor Author

@moerlemans moerlemans Oct 2, 2024


I do not fully understand what you mean here, but these settings are mainly there so that the reader parses the file correctly. I can change the mpp as long as the reader also reads at that specific mpp; 1.0 was chosen for simplicity.
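One possible reading of the reviewer's suggestion: each feature "pixel" summarizes one original tile, so its effective mpp would be the source mpp scaled by the tile size. This is a sketch of that interpretation only; the function name and the interpretation itself are assumptions, not settled in the thread.

```python
def effective_feature_mpp(source_mpp: float, tile_size_px: int) -> float:
    # Hypothetical: a feature vector covering a 224-px tile extracted at
    # 0.5 mpp would have an effective resolution of 112 microns per "pixel".
    return source_mpp * tile_size_px
```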

Comment on lines 368 to 369
data_format=DataFormat.COMPRESSED_IMAGE if compression != "none" else DataFormat.IMAGE,
color_profile=color_profile,
Contributor


Doesn't 'none' need to be an Enum?

Contributor Author


Fixed. I've looked at it, and here the data format should always be IMAGE, because this compression does not make the image a compressed image.
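A minimal sketch of the Enum the reviewer asks about. The member values are illustrative; the real set of compression options in the codebase may differ.

```python
from enum import Enum


class Compression(str, Enum):
    # Subclassing str keeps comparisons with existing string configs working.
    NONE = "none"
    GZIP = "gzip"
```

With this in place, the check at the call site becomes `compression != Compression.NONE` instead of comparing against the bare string "none".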

Comment on lines +305 to +307
if self._stitching_mode != StitchingMode.CROP:
raise NotImplementedError("Stitching mode other than CROP is not supported for features.")

Contributor


You also need to check that the overlap is 0, or at least need to think about what this means for the features, and explain it somewhere.

Contributor Author


Added the check. At the moment I don't see how features with overlap should work, so we do not allow it.
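A sketch of the agreed validation as a standalone function; the real code checks attributes on the reader, so the names here are illustrative.

```python
def check_feature_settings(stitching_mode: str, tile_overlap: tuple) -> None:
    """Reject configurations that are undefined for feature-backed reads."""
    if stitching_mode != "crop":
        raise NotImplementedError(
            "Stitching mode other than CROP is not supported for features."
        )
    if any(tile_overlap):
        # Overlapping feature tiles would need a merge rule (average? max?),
        # which is not defined, so they are rejected outright.
        raise NotImplementedError("Tile overlap is not supported for features.")
```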

Comment on lines +324 to +328

def read_region(self, location: tuple[int, int], level: int, size: tuple[int, int]) -> pyvips.Image:
"""
Reads a region in the stored h5 file. This function stitches the regions as saved in the cache file. Doing this
it takes into account:
Contributor


Mention somewhere the difference between this function and the feature function (e.g., interpolation).

Contributor Author


Added comments to the function.


Comment on lines 93 to 109

@classmethod
def for_wsi_classification(
cls, data_description: DataDescription, requires_target: bool = True
) -> PreTransformTaskFactory:
transforms: list[PreTransformCallable] = []

transforms.append(SampleNFeatures(n=1000))

if not requires_target:
return cls(transforms)
return cls(transforms, data_description, requires_target)

index_map = data_description.index_map
if index_map is None:
raise ConfigurationError("`index_map` is required for classification models when the target is required.")

label_keys = data_description.label_keys
Contributor


Rewrite this in a more factory-style design; check how this is done with the losses and metrics.

Contributor Author


I've improved the function and added a docstring, but I'm not sure how I would improve the factory design further.
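For reference, registry-based factories (a common way loss/metric factories are structured) often look roughly like the sketch below. All names here are hypothetical; this is one possible shape for the refactor, not ahcore's actual loss-factory code.

```python
from typing import Callable

# Registry mapping config names to builder callables.
_PRE_TRANSFORM_REGISTRY: dict = {}


def register_pre_transform(name: str) -> Callable:
    """Decorator that registers a builder under a config-friendly name."""
    def decorator(builder: Callable) -> Callable:
        _PRE_TRANSFORM_REGISTRY[name] = builder
        return builder
    return decorator


@register_pre_transform("sample_n_features")
def _build_sample_n_features(n: int = 1000):
    # Stand-in for constructing the real SampleNFeatures transform.
    return ("SampleNFeatures", n)


def build_pre_transforms(config: list) -> list:
    """Build the transform list purely from config entries."""
    return [
        _PRE_TRANSFORM_REGISTRY[entry["name"]](**entry.get("params", {}))
        for entry in config
    ]
```

The benefit over a hand-written `for_wsi_classification` is that new pre-transforms become a registration plus a config entry, with no changes to the factory itself.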

Comment on lines +142 to +151
raise ValueError(f"Expected features to have a width dimension of 1, got {h}.")

n_random_indices = (
np.random.choice(w, self.n, replace=False) if w > self.n else np.random.choice(w, self.n, replace=True)
)

# Extract the selected columns (indices) from the image
# Create a new image from the selected indices
# todo: this can probably be done without a for-loop quicker
selected_columns = [features.crop(idx, 0, 1, h) for idx in n_random_indices]
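On the todo above: once the tile is available as a numpy array rather than a pyvips image, the per-column crop loop collapses into a single fancy-indexing step. A sketch under that assumption (function name and signature are illustrative):

```python
import numpy as np


def sample_n_features(features: np.ndarray, n: int, rng: np.random.Generator) -> np.ndarray:
    """Select n feature columns from an (h, w) array in one vectorized step."""
    h, w = features.shape
    # Without replacement when enough columns exist, with replacement otherwise.
    indices = rng.choice(w, size=n, replace=w < n)
    return features[:, indices]
```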
Contributor


You need to seed on something to make sure that each run gives reproducible values. Maybe the path is a good seed?

Contributor Author


I think the seed_everything function in the entrypoints should handle this, right?
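For completeness, the reviewer's path-as-seed idea could look like the sketch below: hashing the file path into a per-WSI generator makes the sampled columns stable per slide even if global seeding changes. This is an illustration of the suggestion, not code from the PR.

```python
import hashlib

import numpy as np


def rng_for_path(path: str) -> np.random.Generator:
    """Deterministic per-file RNG: the same path always yields the same stream."""
    # Hash the path and fold the first 8 bytes into an integer seed.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "little")
    return np.random.default_rng(seed)
```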

Comment on lines 182 to 187
class SelectSpecificLabels:
def __init__(self, keys: list[str] | str):
if isinstance(keys, str):
keys = [keys]
self._keys = keys

Contributor


Docstrings

Contributor Author


Done

Comment on lines +292 to +297
using_features = False

if tile.bands > 4:
# assuming that more than four bands/channels means that we are handling features
using_features = True
tile_ = tile
Contributor


I don't think we should use pyvips when we have features. In that case, torch.Tensor or numpy arrays should be what we are using.

Contributor Author


A dlup dataset will always return a pyvips image, and it does work as currently implemented, but we could indeed consider whether we want to change that.
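Sketching the band-count heuristic from the diff on a numpy stand-in (since RGB(A) tiles have at most four channels, anything wider is treated as features); the function names and the reshape into an (n_positions, dim) matrix are assumptions for illustration only.

```python
import numpy as np


def is_feature_tile(tile: np.ndarray) -> bool:
    """Heuristic from the diff: more than four bands cannot be RGB(A)."""
    return tile.shape[-1] > 4


def tile_to_feature_matrix(tile: np.ndarray) -> np.ndarray:
    """Flatten an (h, w, dim) feature tile into the (n, dim) matrix that a
    classification head would consume."""
    if not is_feature_tile(tile):
        raise ValueError("Looks like an image tile, not features.")
    return tile.reshape(-1, tile.shape[-1])
```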

@jonasteuwen
Contributor

Can you add a full example of how you run the extraction? Adding it here is fine; afterwards we can see how to add it to the documentation.

@jonasteuwen jonasteuwen merged commit b0248e2 into main Oct 3, 2024
Successfully merging this pull request may close these issues.

Copy masks, annotations, caches and/or features with CLI tool
Add ZarrFileFeatureWriter with chunks=False