
Feature: Masked Dataset #151

Open · wants to merge 2 commits into main

Conversation

@JakobEliasWagner (Collaborator) commented Jul 24, 2024

Feature: Masked Dataset

Description

Not all datasets are consistent in the number of sensors and evaluations. Simulations or measurements are not only performed on a multitude of different grids, but may also contain different numbers of samples in both function sets. To reflect this, this PR introduces the MaskedOperatorDataset class, which can process datasets with a varying number of sensors or evaluations.

Which issue does this PR tackle?

  • The OperatorDataset class can only handle a uniform number of sensors and evaluations.

How does it solve the problem?

  • Implements the MaskedOperatorDataset class to allow for masked sensors and evaluations.
  • Moves the transformation method to the dataset base class.

How are the changes tested?

  • Introduced 10 new unit tests.

Notes

  • The current masking strategy pads all samples to the size of the largest sample (see the sketch after this list).
    • Other strategies are possible as well: cutting down to the size of the smallest sample by random choice, or a middle ground between the two techniques.
    • In datasets where the sample sizes vary strongly, padding to the largest sample can be inefficient.
  • Currently, this only works for one-dimensional size tensors.
  • The _get_item_ method is specific to the OperatorDataset and MaskedOperatorDataset classes to keep their item retrieval separated.
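
As a rough illustration of the padding strategy described above, a minimal sketch, assuming hypothetical per-sample tensors of shape (dim, n_i) with a varying number of sensors n_i; names and shapes are illustrative and not taken from this PR:

import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical sensor values for three samples with 7, 5, and 9 sensors.
u_list = [torch.rand(3, 7), torch.rand(3, 5), torch.rand(3, 9)]

# pad_sequence pads along the first dimension, so move the sensor dimension
# to the front, pad to the largest sample, and move it back afterwards.
u_padded = pad_sequence(
    [u.transpose(0, 1) for u in u_list], batch_first=True, padding_value=0
).transpose(1, 2)  # shape: (n_samples, dim, n_max)

# Boolean mask marking the real (non-padded) sensor positions of each sample.
n_max = max(u.shape[1] for u in u_list)
mask = torch.stack([torch.arange(n_max) < u.shape[1] for u in u_list])  # (n_samples, n_max)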

Checklist for Contributors

  • Scope: This PR tackles exactly one problem.
  • Conventions: The branch follows the feature/title-slug convention.
  • Conventions: The PR title follows the Bugfix: Title convention.
  • Coding style: The code passes all pre-commit hooks.
  • Documentation: All changes are well-documented.
  • Tests: New features are tested and all tests pass successfully.
  • Changelog: Updated CHANGELOG.md for new features or breaking changes.
  • Review: A suitable reviewer has been assigned.

Checklist for Reviewers:

  • The PR solves the issue it claims to solve and only this one.
  • Changes are tested sufficiently and all tests pass.
  • Documentation is complete and well-written.
  • Changelog has been updated, if necessary.

@JakobEliasWagner self-assigned this Jul 24, 2024
@JakobEliasWagner added the enhancement label Jul 24, 2024
from continuiti.transforms import Transform
from continuiti.operators.shape import OperatorShapes, TensorShape


class OperatorDatasetBase(td.Dataset, ABC):
    """Abstract base class of a dataset for operator training."""

    shapes: OperatorShapes

    def __init__(self, shapes: OperatorShapes, n_observations: int) -> None:
Review comment (Collaborator):

Suggested change
def __init__(self, shapes: OperatorShapes, n_observations: int) -> None:
def __init__(self, shapes: OperatorShapes, n_observations: int):

"""Applies class transformations to four tensors.

Args:
src:
Review comment (Collaborator):

Suggested change
src:
src: List of tuples containing a tensor and a transformation to apply to it.

Comment on lines +40 to +41
continue
out.append(transformation(src_tensor))
Review comment (Collaborator):

Suggested change
continue
out.append(transformation(src_tensor))
else:
out.append(transformation(src_tensor))
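
The loop around these two lines is not part of the excerpt; a hypothetical reconstruction of what the suggestion amounts to (helper name and structure assumed, not taken verbatim from the PR):

def _apply_transformations(src):
    # src: list of (tensor, transformation) pairs; the transformation may be None.
    out = []
    for src_tensor, transformation in src:
        if transformation is None:
            out.append(src_tensor)  # keep the tensor unchanged
        else:
            out.append(transformation(src_tensor))
    return out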

self, x: torch.Tensor, u: torch.Tensor, y: torch.Tensor, v: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
"""Applies class transformations to four tensors.
return tensors[0], tensors[1], tensors[2], tensors[3]
Review comment (Collaborator):

Suggested change
return tensors[0], tensors[1], tensors[2], tensors[3]
return tuple(tensors)

"""A dataset for operator training containing masks in addition to tensors describing the mapping.

Data, especially described on unstructured grids, can vary in the number of evaluations or sensors. Even
measurements of phenomena do not always contain the same number of sensors and or evaluations. This dataset is able
Review comment (Collaborator):

Suggested change
measurements of phenomena do not always contain the same number of sensors and or evaluations. This dataset is able
measurements of phenomena do not always contain the same number of sensors and/or evaluations. This dataset is able

Comment on lines +280 to +282
assert not any(
    [torch.any(torch.isinf(mi)) for mi in member]
), "Expects domain to be truncated in finite space."
Review comment (Collaborator):

Why is this assertion necessary? Someone might come up with a good reason for using infs in the data, do we have to prevent that?

Review comment (Collaborator):

Ah, I see we're using inf for padding. However, does it hurt to have more infs (non-masked) in the dataset?
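
A minimal sketch of why the finiteness assertion matters when inf is the padding value (variable names are illustrative): an inf that legitimately appears in the data becomes indistinguishable from padding once the mask is derived by comparing against inf.

import torch

member = torch.tensor([1.0, float("inf"), 2.0])            # a real inf in the data
padded = torch.cat([member, torch.full((2,), torch.inf)])  # padded to length 5 with inf

mask = padded != torch.inf  # intended to mark real entries
print(mask)  # tensor([ True, False,  True, False, False]) -- the real inf is masked out too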

padding_value=torch.inf,
).transpose(1, 2)
values_padded = pad_sequence(
[vi.transpose(0, 1) for vi in values], batch_first=True, padding_value=0
Review comment (Collaborator):

Suggested change
[vi.transpose(0, 1) for vi in values], batch_first=True, padding_value=0
[vi.transpose(0, 1) for vi in values],
batch_first=True,
padding_value=0,

Review comment (Collaborator):

Why do we pad once with inf and once with 0? Seems arbitrary

Comment on lines +293 to +296
mask = member_padded != torch.inf
member_padded[
    ~mask
] = 0  # mask often applied by adding a tensor with -inf values in masked locations (e.g. in scaled dot product).
Review comment (Collaborator):

What is different here? Why not just use 0 for padding directly, as in l. 287?
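
For context on the inline comment about -inf masks: a minimal sketch of how such a boolean mask is typically consumed downstream, e.g. by PyTorch's scaled_dot_product_attention (PyTorch >= 2.0); the shapes are illustrative and not taken from continuiti:

import torch
import torch.nn.functional as F

q = torch.rand(1, 4, 10, 8)                            # (batch, heads, queries, dim)
k = v = torch.rand(1, 4, 12, 8)                        # (batch, heads, sensors, dim)
key_mask = torch.zeros(1, 1, 1, 12, dtype=torch.bool)
key_mask[..., :9] = True                               # first 9 sensor positions are real, the rest is padding

# With a boolean attn_mask, masked positions receive -inf attention scores internally,
# which is why the padded *values* can safely be set to 0 beforehand.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=key_mask)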


return sample["x"], sample["u"], sample["y"], sample["v"]
return tensors[0], tensors[1], tensors[2], tensors[3], ipt_mask, opt_mask
Review comment (Collaborator):

Suggested change
return tensors[0], tensors[1], tensors[2], tensors[3], ipt_mask, opt_mask
return *tuple(tensors), ipt_mask, opt_mask

dataloader = DataLoader(dataset, batch_size=self.batch_size)

for x, u, y, v, ipt_mask, opt_mask in dataloader:
    assert True
Review comment (Collaborator):

Can we do more here?
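
In reply to "Can we do more here?", a sketch of assertions the loop could make, assuming u and v are laid out as (batch, dim, sensors/evaluations) and the masks as (batch, sensors/evaluations); this extends the test fragment above and is not the exact test in the PR:

for x, u, y, v, ipt_mask, opt_mask in dataloader:
    # masks should be boolean and match the batch dimension
    assert ipt_mask.dtype == torch.bool and opt_mask.dtype == torch.bool
    assert ipt_mask.shape[0] == x.shape[0] and opt_mask.shape[0] == x.shape[0]
    # all tensors should be finite after padding and masking
    for tensor in (x, u, y, v):
        assert torch.all(torch.isfinite(tensor))
    # padded sensor values should be zeroed out where the input mask is False
    assert torch.all(u.transpose(1, 2)[~ipt_mask] == 0)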
