Check blob hash #942
base: main
Conversation
This seems like an overall better approach, as I've been running into issues with the md5 approach when I switch environments (since the md5 files remain). I'd love for @tonybaloney to take a look since he wrote the original md5 approach.
I checked this out last night and did a bunch of experiments with different docs. I've done a bit of reformatting/rewording of the print() statements, which I'll push to this branch shortly.
Unfortunately, I ran into an issue with a large document (55 MB): it had no MD5. Apparently this is a known issue/feature with documents that are uploaded in chunks:
Azure/azure-storage-python#411
I see a few approaches:
- For large documents, download the file and manually compute the MD5 hash locally. That has the obvious drawback that you have to download a large document, but download speeds are at least typically better than upload speeds.
- Compute the MD5 hash before uploading the document, and manually specify it. That is what the issue suggests, but has the drawback that it will be harder for existing developers to use this code. They will get an error unless they re-upload the document.
- Fall back to a locally computed MD5 hash for those documents. That seems confusing, though.
I think approach 1 makes the most sense. If you like, I can extend the PR.
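To make approach 1 concrete, here is a minimal sketch of computing an MD5 hash over a downloaded file in chunks, so even a 55 MB document never has to be held fully in memory. The function name and chunk size are illustrative, not part of this PR, and only the standard library is used:

```python
import hashlib

def compute_md5(path: str, chunk_size: int = 4 * 1024 * 1024) -> str:
    """Compute the MD5 hash of a file by reading it in chunks,
    so large documents are never loaded fully into memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read until EOF; each chunk is fed into the running hash.
        while chunk := f.read(chunk_size):
            md5.update(chunk)
    return md5.hexdigest()
```

The same digest is produced regardless of chunk size, so the value can be compared directly against a hash computed over the whole file at once.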
I discussed this with @tonybaloney and we suggest a combination of the approaches, and also removing our reliance on the built-in MD5 given its flakiness. So that means:
That way, this approach will work for developers who have checked out the repository before, and also won't incur unneeded downloads for developers using large files in the future. That's a larger change though, so let us know if you don't have the time to take it on. Thank you so much!
Purpose
This feature eliminates the need for local MD5 hash files. Instead, the hash values of the local documents are compared with those of the documents in the remote blob storage. This also paves the way for automatic document uploads later, as no MD5 files need to be committed.
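One detail worth noting for the comparison described above: blob storage typically reports the stored hash as the raw 16-byte MD5 digest (base64-encoded in the Content-MD5 header), while a local `hexdigest()` is a hex string, so the two representations must be normalized before comparing. A minimal sketch, with illustrative names and assuming the remote value is the base64-encoded raw digest:

```python
import base64
import hashlib

def hashes_match(local_bytes: bytes, remote_content_md5_b64: str) -> bool:
    """Compare a local document's MD5 against the base64-encoded
    Content-MD5 value reported for the remote blob."""
    local_digest = hashlib.md5(local_bytes).digest()   # raw 16 bytes
    remote_digest = base64.b64decode(remote_content_md5_b64)
    return local_digest == remote_digest
```

Comparing the raw digests avoids subtle mismatches between hex and base64 encodings of the same hash.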
Does this introduce a breaking change?
Pull Request Type
What kind of change does this Pull Request introduce?
How to Test
./scripts/prepdocs.sh
or./scripts/prepdocs.ps1
once to uploade docs./scripts/prepdocs.sh
or./scripts/prepdocs.ps1
again to see that the docs will be skippedWhat to Check
Verify that the following are valid