ichksum, checksum and etag #2021

Open
markoferme opened this issue Dec 21, 2021 · 3 comments

@markoferme

How is the checksum computed when the ichksum command is called on a file in a cacheless S3 resource?
Is the file downloaded to the iRODS server (or read from S3) before the checksum is computed?

Is the ETag, which is computed during the iput to the S3 resource, stored anywhere in the iCAT catalogue, and can it be retrieved in the after put hooks? Or must all of this be done manually?

When transferring large files (> 10 GB), the checksum computation can take a lot of time (a couple of minutes). Even with iput -P, the command appears to hang, because the user gets no information about what is going on.

My idea is to use the ETag generated on the S3 server as an alternative to the checksum, but it's not clear to me whether this information is returned and saved to the iCAT after a successful upload, or what the best way to retrieve it with the iRODS rule engine would be.

Any help or pointers would be appreciated!

markoferme changed the title from "ichksum, cheksum and etag" to "ichksum, checksum and etag" on Dec 21, 2021
@trel
Member

trel commented Dec 21, 2021

iRODS must currently get its checksum information from the iRODS catalog (if already provided/calculated/stored), or... by reading and calculating on the bytes of the replica being queried. So yes, a file would have to be downloaded from S3 to calculate the checksum if it had not been calculated and stored earlier.
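
To illustrate what "reading and calculating on the bytes" implies for a large object, here is a minimal Python sketch of a streaming checksum. It assumes the `sha2:` prefix followed by a base64-encoded SHA-256 digest, the form iRODS commonly reports; the function name and chunk size are illustrative and not part of iRODS.

```python
import base64
import hashlib
import io


def streaming_sha2_checksum(stream, chunk_size=4 * 1024 * 1024):
    """Read a binary stream chunk by chunk and return 'sha2:<base64 SHA-256>'."""
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of stream
        digest.update(chunk)
    return "sha2:" + base64.b64encode(digest.digest()).decode("ascii")


if __name__ == "__main__":
    # Any file-like object works; an in-memory buffer stands in for the replica.
    print(streaming_sha2_checksum(io.BytesIO(b"hello world")))
```

Memory use stays constant, but every byte still has to be read once, which is why the cost grows with object size when the bytes live in S3.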

iRODS doesn't currently have any ETag support... different S3-compatible storage vendors calculate and store ETag information differently - so there has not been an effort to generalize and provide that functionality.
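
As a rough illustration of why an ETag is not a portable checksum, the Python sketch below emulates the scheme commonly observed on AWS S3: a plain MD5 hex digest for a single PUT, and for multipart uploads the MD5 of the concatenated per-part MD5 digests followed by `-<number of parts>`. This is an assumption about one vendor's behaviour, not something iRODS implements, and it does not hold for every storage backend or encryption setting. The function name is hypothetical.

```python
import hashlib


def multipart_style_etag(data, part_size):
    """Emulate the commonly observed AWS-style ETag for a given part size."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    if len(parts) <= 1:
        # Treated here as a single PUT: the ETag is the plain MD5 hex digest.
        return hashlib.md5(data).hexdigest()
    # Multipart: MD5 over the concatenated binary MD5 digests of each part.
    combined = b"".join(hashlib.md5(part).digest() for part in parts)
    return "{}-{}".format(hashlib.md5(combined).hexdigest(), len(parts))


if __name__ == "__main__":
    payload = b"x" * 10
    # The same bytes produce different values depending on the part size.
    print(multipart_style_etag(payload, 100))
    print(multipart_style_etag(payload, 4))
```

Because the result depends on the part size used at upload time, two uploads of identical bytes can end up with different ETags.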

This request, however, is related to irods/irods#3127 - and then the implementation of how to get/calculate/provide checksums would come from each plugin technology.

Please say a bit more about your use case - perhaps there is a way to get the same result with a different or alternative mechanism or workflow.

@markoferme
Author

We are using the S3 plugin in cacheless mode for large files (10 GB and up); files over 1 TB are not uncommon. Running a checksum on such data would fail, since there is not enough storage space on the resource server to complete the file transfer.

We are using EUDAT's B2Handle rulebase to generate PIDs for uploaded files. The code computes a checksum after a successful upload and adds it to the handle metadata.

To avoid the long checksum computation while still assigning some validation metadata, the idea is to add the ETag to the handle metadata and skip computing the checksum if the storage type is S3.
Is such a thing possible? If the ETag generated by S3 were added to the iCAT, that should not be a problem. But is it?

How would we access this information?
The only other option for now is to skip adding a checksum to the handle metadata for S3-stored files.

Thank you for your reply.

@trel
Member

trel commented Feb 9, 2024

Two years later...

I think the best approach might be to store the claimed checksum/ETag value in an iRODS AVU. iRODS would not be involved in directly calculating or validating that information. But iRODS should be viewed as a trusted messenger to hold the inserted value until another tool needs to use/validate/consume the value from the AVU.
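
A minimal Python sketch of that idea, assuming the ETag has already been obtained from the S3 side by some other tool and that the iCommands are installed with an active session. The attribute name `s3_etag` and the helper names are hypothetical; the only iRODS interface used is the `imeta add -d` command.

```python
import subprocess


def normalize_etag(raw_etag):
    """Strip the surrounding double quotes that HTTP ETag headers carry."""
    return raw_etag.strip().strip('"')


def build_imeta_command(logical_path, etag, attribute="s3_etag"):
    """Build an 'imeta add -d' command that attaches the ETag as an AVU."""
    return ["imeta", "add", "-d", logical_path, attribute, normalize_etag(etag)]


def store_etag(logical_path, etag):
    """Run the command; iRODS stores the value as given and does not verify it."""
    subprocess.run(build_imeta_command(logical_path, etag), check=True)


if __name__ == "__main__":
    # Hypothetical logical path and ETag value, for illustration only.
    print(build_imeta_command("/tempZone/home/alice/big.dat", '"0123abcd-3"'))
```

A consumer such as the handle-generation rule could then read the AVU back instead of triggering a checksum, treating the value as claimed rather than verified.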

Not sure if this is still a needed use case - or if you already solved it some other way. Regardless, it would be helpful to hear if you have found a solution.
