Check blob hash #942
base: main
Conversation
This seems like an overall better approach, as I've been running into issues with the md5 approach when I switch environments (since the md5 files remain). I'd love for @tonybaloney to take a look since he wrote the original md5 approach.
I checked this out last night and did a bunch of experiments with different docs. I've done a bit of reformatting/rewording of the print() statements, which I'll push to this branch shortly.
Unfortunately, I ran into an issue with a large document (55 MB): it had no MD5. Apparently this is a known issue/feature with documents that are uploaded in chunks:
Azure/azure-storage-python#411
I see a few approaches:
- For large documents, download the file and manually compute the MD5 hash locally. That has the obvious drawback that you have to download a large document, but download speeds are at least typically better than upload speeds.
- Compute the MD5 hash before uploading the document, and manually specify it. That is what the issue suggests, but has the drawback that it will be harder for existing developers to use this code. They will get an error unless they re-upload the document.
- Fall back to a locally computed MD5 hash for those documents. That seems confusing, though.
I think approach 1 makes the most sense. If you like, I can extend the PR.
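To make approach 1 concrete, here is a minimal sketch of computing an MD5 hash over a downloaded file in chunks, so even a 55 MB document never has to be held fully in memory. The function name and chunk size are illustrative, not part of this PR, and only the standard library is used:

```python
import hashlib

def compute_md5(path: str, chunk_size: int = 4 * 1024 * 1024) -> str:
    """Compute the MD5 hash of a file by reading it in chunks,
    so large documents are never loaded fully into memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read until EOF; each chunk is fed into the running hash.
        while chunk := f.read(chunk_size):
            md5.update(chunk)
    return md5.hexdigest()
```

The same digest is produced regardless of chunk size, so the value can be compared directly against a hash computed over the whole file at once.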
I discussed this with @tonybaloney and we suggest a combination of the approaches, and also removing our reliance on the built-in MD5 given its flakiness. So that means:
That way, this approach will work for developers who have checked out the repository before, and also won't incur unneeded downloads for developers using large files in the future. That's a larger change though, so let us know if you don't have the time to take it on. Thank you so much!
Purpose
This feature eliminates the need for local MD5 hash files. Instead, the hash values of the local documents are compared with those of the documents in the remote blob storage. This also paves the way for automatic document uploads later, as no MD5 files need to be committed.
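One detail worth noting for the comparison described above: blob storage typically reports the stored hash as the raw 16-byte MD5 digest (base64-encoded in the Content-MD5 header), while a local `hexdigest()` is a hex string, so the two representations must be normalized before comparing. A minimal sketch, with illustrative names and assuming the remote value is the base64-encoded raw digest:

```python
import base64
import hashlib

def hashes_match(local_bytes: bytes, remote_content_md5_b64: str) -> bool:
    """Compare a local document's MD5 against the base64-encoded
    Content-MD5 value reported for the remote blob."""
    local_digest = hashlib.md5(local_bytes).digest()   # raw 16 bytes
    remote_digest = base64.b64decode(remote_content_md5_b64)
    return local_digest == remote_digest
```

Comparing the raw digests avoids subtle mismatches between hex and base64 encodings of the same hash.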
Does this introduce a breaking change?
Pull Request Type
What kind of change does this Pull Request introduce?
How to Test
./scripts/prepdocs.sh
or./scripts/prepdocs.ps1
once to uploade docs./scripts/prepdocs.sh
or./scripts/prepdocs.ps1
again to see that the docs will be skippedWhat to Check
Verify that the following are valid