Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: Set file.checksum_md5 non optional #841

Merged
merged 1 commit into from
Oct 10, 2024
Merged

Conversation

tnatt
Copy link
Collaborator

@tnatt tnatt commented Oct 8, 2024

PR to set file.checksum_md5 non optional. It was non optional in the legacy schema, and was introduced in https://github.com/equinor/fmu-dataio/pull/512/files, probably by mistake.

Closes #835

Copy link
Collaborator

@mferrera mferrera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think making this field required is important, but I'm having second thoughts about an empty string being a valid checksum. I think there is also a place for stricter checks here.

A valid md5 checksum usually has 32 characters, 0-9a-f. We should probably make this a validation check, as an invalid checksum compromises data quality

Maybe for reminding me, and for posterity, under what circumstances are we giving an empty string checksum?

@tnatt
Copy link
Collaborator Author

tnatt commented Oct 9, 2024

I think making this field required is important, but I'm having second thoughts about an empty string being a valid checksum. I think there is also a place for stricter checks here.

A valid md5 checksum usually has 32 characters, 0-9a-f. We should probably make this a validation check, as an invalid checksum compromises data quality

Maybe for reminding me, and for posterity, under what circumstances are we giving an empty string checksum?

The empty string will only occur in the case that a user directly uses generate_metadata() and explicitly sets compute_md5=False (default is True). And then dumps this metadata themselves to disk. Most users will use this function indirectly through the export function. The export function does not have this argument, and we hardcode it to be True. Meaning all exports will have checksum_md5 in the metadata.

metadata = self.generate_metadata(obj, compute_md5=True, **kwargs)

Before we can really ensure we don't create an empty string in the metadata we need to deprecate the compute_md5 argument in generate_metadata(). ..and I don't really see any big problem to make the change to always compute a checksum, and then add a deprecation warning saying the functionality is no longer supported and that the argument will be removed in the future. I don't think any users actively want to keep the checksum out of the metadata.

What do you think 🙂 ?

@mferrera
Copy link
Collaborator

mferrera commented Oct 9, 2024

Yes, let's do it 👍 there is no good reason to not generate a checksum. And in 8 months time if we change our mind about that and land back on this PR, at least we will have some context to our decision 😄

@perolavsvendsen
Copy link
Member

Could be that this is needed when no files are actually materialized to disk? I.e. in our aggregation service (which at the moment does not use fmu-dataio but absolutely should). @equinor-ruaj may have more context?

@mferrera
Copy link
Collaborator

mferrera commented Oct 9, 2024

@tnatt pointed out on Slack that we have #552 which touches on this issue. Resolving that would also probably result in a reasonable wall-time performance boost.

@tnatt tnatt merged commit 1a1054e into equinor:main Oct 10, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

file.checksum_md5 is optional but lacks default value
3 participants