
fix: updated file downloading to download only files that do not exis… #74

Open
wants to merge 3 commits into main
Conversation

eredzik

@eredzik eredzik commented Sep 17, 2024

Summary

Downloads only the partitions of the file that do not already exist - this helps when I am on a slow corporate VPN and trying to download a semi-big dataset.

Checklist

  • [x] You agree with our CLA
  • [x] Included tests (or is not applicable).
  • [x] Updated documentation (or is not applicable).
  • [x] Used pre-commit hooks to format and lint the code.

@CLAassistant

CLAassistant commented Sep 17, 2024

CLA assistant check
All committers have signed the CLA.

@nicornk
Contributor

nicornk commented Sep 18, 2024

Hi @eredzik, thanks for the contribution. Can you explain a little bit more about how this change helps you in your workflow?

What if a file changed in the dataset?
I get the idea of skipping downloads of files from the same transaction that are already present, but if the view is a branch this would not always be the case.

Did you explore the load_dataset function from the CachedFoundryClient?

thanks!

@eredzik
Author

eredzik commented Sep 23, 2024

Hey @nicornk,

The issue I am currently facing is that I have to work within the company VPN to access resources, which makes my network very slow (under 1 Mb/s). That makes it very problematic to download any dataset if a network error occurs - the code doesn't handle any of those errors. The changes in this PR introduce (1) resuming the download if not all partitions have been downloaded, and (2) downloading partitions to a temp folder and moving them into the cache only after the download process finishes (in the released version of the code, if any error occurs the partitions are left in a broken state and the dataset has to be deleted manually).

I checked the load_dataset function; it uses the function I modify in this PR.
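The two changes can be sketched roughly as follows (`download_partitions`, the `fetch` callable, and the file layout are hypothetical illustrations of the pattern, not the actual foundry-dev-tools API):

```python
import shutil
import tempfile
from pathlib import Path

def download_partitions(partitions, cache_dir, fetch):
    """Download missing partitions to a temp dir, then move them into the cache.

    partitions: iterable of partition file names expected in the dataset
    cache_dir:  final cache directory; partitions already present here are skipped
    fetch:      callable(name, dest_path) that writes one partition to dest_path
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        staged = []
        for name in partitions:
            target = cache_dir / name
            if target.exists():       # (1) resume: skip partitions already cached
                continue
            tmp_file = tmp_dir / name
            fetch(name, tmp_file)     # may raise; the cache is untouched so far
            staged.append((tmp_file, target))
        # (2) publish to the cache only after every missing partition downloaded,
        # so a failed run never leaves the cache in a broken state
        for tmp_file, target in staged:
            shutil.move(str(tmp_file), str(target))
```

If `fetch` raises mid-run, the temp directory is discarded and the cache still holds only complete partitions; rerunning resumes from whatever is already cached.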

@nicornk
Contributor

nicornk commented Sep 23, 2024

I think it’s a good idea to have a new function that can handle partially downloaded datasets. This function would need to compare the current files with the ones from the Foundry dataset based on path and file size. Since Foundry does not return a hash, there is a risk of keeping outdated files when the path and file size are the same.

Having said that, I do think you could achieve this today by using the S3-compatible API in combination with the aws s3 sync CLI command. aws s3 sync will only download files that are not already fully downloaded.

aws s3 sync s3://ri.foundry.main.dataset.e366880f-7aff-1234-9236-d3bd862cc809 /path/to/target --profile foundry
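The path-and-size comparison described above could look roughly like this (`files_to_download` and the `remote_files` listing are hypothetical, assuming the dataset API exposes each file's relative path and size; as noted, a file matching on both may still be outdated since no hash is available):

```python
from pathlib import Path

def files_to_download(remote_files, local_dir):
    """Return the remote files that are missing locally or differ in size.

    remote_files: dict mapping relative path -> size in bytes, as listed by
                  the dataset API (hypothetical shape)
    local_dir:    directory holding already-downloaded files
    """
    local_dir = Path(local_dir)
    missing = []
    for rel_path, size in remote_files.items():
        local = local_dir / rel_path
        # download if the file is absent or its size does not match the listing
        if not local.exists() or local.stat().st_size != size:
            missing.append(rel_path)
    return missing
```

This mirrors what aws s3 sync does by default, which compares size and modification time rather than content hashes.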

3 participants