
fix: updated file downloading to download only files that do not exis… #74

Open
wants to merge 3 commits into main
Conversation

eredzik

@eredzik eredzik commented Sep 17, 2024

Summary

Downloads only the partitions of the file that do not already exist - this helps when I am on a slow corporate VPN and trying to download a semi-big dataset.

Checklist

  • [x] You agree with our CLA
  • [x] Included tests (or is not applicable).
  • [x] Updated documentation (or is not applicable).
  • [x] Used pre-commit hooks to format and lint the code.

@CLAassistant

CLAassistant commented Sep 17, 2024

CLA assistant check
All committers have signed the CLA.

@nicornk
Contributor

nicornk commented Sep 18, 2024

Hi @eredzik, thanks for the contribution. Can you explain a little bit more about how this change helps you in your workflow?

What if a file changed in the dataset?
I get the idea of skipping downloads of files from the same transaction that are already present, but if the view is a branch this would not always be the case.

Did you explore the load_dataset function from the CachedFoundryClient?

thanks!

@eredzik
Author

eredzik commented Sep 23, 2024

Hey @nicornk,

The issue I am currently facing is that I have to work within the company VPN to access resources, which makes my network very slow (under 1 Mb/s). That makes it very problematic to download any dataset if a network error occurs - the code doesn't handle any of those errors. The changes in this PR introduce (1) resuming the download if not all partitions have been downloaded, and (2) downloading partitions to a temp folder and moving them into the cache only after the download process finishes (in the released version of the code, if any error occurs the partitions are left in a broken state and the dataset has to be deleted manually).

I checked the load_dataset function; it uses the function I modify in this PR.
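The two changes can be sketched roughly as follows (`download_partitions`, the `fetch` callable, and the file layout are hypothetical illustrations of the pattern, not the actual foundry-dev-tools API):

```python
import shutil
import tempfile
from pathlib import Path

def download_partitions(partitions, cache_dir, fetch):
    """Download missing partitions to a temp dir, then move them into the cache.

    partitions: iterable of partition file names expected in the dataset
    cache_dir:  final cache directory; partitions already present here are skipped
    fetch:      callable(name, dest_path) that writes one partition to dest_path
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        staged = []
        for name in partitions:
            target = cache_dir / name
            if target.exists():       # (1) resume: skip partitions already cached
                continue
            tmp_file = tmp_dir / name
            fetch(name, tmp_file)     # may raise; the cache is untouched so far
            staged.append((tmp_file, target))
        # (2) publish to the cache only after every missing partition downloaded,
        # so a failed run never leaves the cache in a broken state
        for tmp_file, target in staged:
            shutil.move(str(tmp_file), str(target))
```

If `fetch` raises mid-run, the temp directory is discarded and the cache still holds only complete partitions; rerunning resumes from whatever is already cached.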

@nicornk
Contributor

nicornk commented Sep 23, 2024

I think it’s a good idea to have a new function that can handle partially downloaded datasets. This function would need to compare the current files with the ones from the Foundry dataset based on path and file size. Since Foundry does not return a hash, there is a risk of keeping outdated files when the path and file size are the same.

Having said that, I do think you could achieve this today by using the S3-compatible API in combination with the aws s3 sync CLI command. aws s3 sync will only download files that are not already fully downloaded.

aws s3 sync s3://ri.foundry.main.dataset.e366880f-7aff-1234-9236-d3bd862cc809 /path/to/target --profile foundry
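The path-and-size comparison described above could look roughly like this (`files_to_download` and the `remote_files` listing are hypothetical, assuming the dataset API exposes each file's relative path and size; as noted, a file matching on both may still be outdated since no hash is available):

```python
from pathlib import Path

def files_to_download(remote_files, local_dir):
    """Return the remote files that are missing locally or differ in size.

    remote_files: dict mapping relative path -> size in bytes, as listed by
                  the dataset API (hypothetical shape)
    local_dir:    directory holding already-downloaded files
    """
    local_dir = Path(local_dir)
    missing = []
    for rel_path, size in remote_files.items():
        local = local_dir / rel_path
        # download if the file is absent or its size does not match the listing
        if not local.exists() or local.stat().st_size != size:
            missing.append(rel_path)
    return missing
```

This mirrors what aws s3 sync does by default, which compares size and modification time rather than content hashes.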

3 participants