Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use multipart upload instead of a temporary file #47

Open
gilbsgilbs opened this issue Dec 20, 2018 · 4 comments · May be fixed by #61
Open

Use multipart upload instead of a temporary file #47

gilbsgilbs opened this issue Dec 20, 2018 · 4 comments · May be fixed by #61

Comments

@gilbsgilbs
Copy link

Hi,

If I understand the current implementation correctly, it creates a temporary file on disk, writes everything to it, and just when the file is closed, uploads it to S3. This implementation is not applicable for very large files and is probably slightly suboptimal in some circumstances. For these reasons, I think it would be beneficial should s3fs could take advantage of multipart uploads.

There's a lot of gotchas and boilerplate associated with multipart uploads though. One possibly simpler implementation could be to delegate the hassle to a third-party library that already does the proxying stuff, such as smart_open (great library by the way). Needless to say that all this could be opt-in.

Thank you so much for all the great work on pyfilesystem2 and s3fs 👍

@willmcgugan
Copy link
Member

It is a good idea. I'll keep this issue up as a reminder, but I'm not sure when I'll have time to look at it just yet.

@gilbsgilbs
Copy link
Author

Thanks for your prompt response @willmcgugan, I really appreciate. If somebody or myself want to pick this and submit a PR, would you:

  • accept that it relies on third party libraries?
  • prefer it to be opt-in only? / accept that it completely replaces the tempfile implementation?

@willmcgugan
Copy link
Member

I would be delighted to accept a PR...

accept that it relies on third party libraries?

Third party libs would be okay, as long as they are of a good quality and don't have too many dependancies themselves. But I'd prefer to avoid C extensions if possible.

prefer it to be opt-in only? / accept that it completely replaces the tempfile implementation?

It's rare to have one method being better than another in every way, so I'd need to consider the pros and cons. For instance, if you write a line at a time, that will be quite efficient with the current tmpfile, but I imagine it would be very slow to do a request for every line. But if there is some buffering on top of that, it might not be an issue.

If you need it for something commercial, I am available for contract work.

@gilbsgilbs
Copy link
Author

If you need it for something commercial, I am available for contract work.

Thanks for proposing. Actually, I would have needed it for commercial stuff, but it's too late already. I did things otherwise (and it was not huge enough to spend more time on it anyways). I may contribute in my spare time for fun though, yet not sure if and not sure when, just like you 😉 . Cheers.

@mrk-its mrk-its linked a pull request Nov 15, 2019 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants