
Implement support for copying directories recursively #160

Open
3 tasks
carlspring opened this issue Dec 31, 2020 · 3 comments
Comments

@carlspring (Owner) commented on Dec 31, 2020

Task Description

We need to implement support for copying directories in the org.carlspring.cloud.storage.s3fs.S3FileSystemProvider class. Copying currently only works for regular files; the provider does not check whether a path is a directory and therefore never recurses into it.
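A rough sketch of the direction, at the plain NIO level (just an illustration of the recursion; the helper name is made up and this is not existing code in the provider):

```java
import java.io.IOException;
import java.nio.file.CopyOption;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper that would live in S3FileSystemProvider: walk the source
// tree and copy every entry, creating the corresponding "directories" first.
void copyDirectoryRecursively(Path source, Path target, CopyOption... options) throws IOException
{
    Files.walkFileTree(source, new SimpleFileVisitor<Path>()
    {
        @Override
        public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException
        {
            // Re-root the directory under the target and make sure it exists.
            Files.createDirectories(target.resolve(source.relativize(dir).toString()));
            return FileVisitResult.CONTINUE;
        }

        @Override
        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException
        {
            // Delegate to the existing single-file copy for regular files.
            Files.copy(file, target.resolve(source.relativize(file).toString()), options);
            return FileVisitResult.CONTINUE;
        }
    });
}
```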

Tasks

The following tasks will need to be carried out:

  • Study the code of the org.carlspring.cloud.storage.s3fs.S3FileSystemProvider and propose the most efficient way to do this.
  • Implement the necessary changes.
  • Implement test cases.

Help

@carlspring added the help wanted, good first issue and feature request labels on Dec 31, 2020
@carlspring changed the title from "Implement support for copying directories" to "Implement support for copying directories recursively" on Jan 1, 2021
@edmang (Contributor) commented on Feb 2, 2021

Copying an object seems to be limited to 5 GB according to the docs: https://docs.aws.amazon.com/AmazonS3/latest/dev/CopyingObjectsExamples.html
Also, there seems to be no copy operation that accepts a collection of objects; copies are done one by one.
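For reference, a single-object copy with the AWS SDK for Java v2 looks roughly like this (just a sketch, assuming a recent SDK version with the sourceBucket/sourceKey builder methods):

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CopyObjectRequest;

// Copies a single object; S3 rejects this call for sources larger than 5 GB,
// which is why bigger objects need the multipart copy API instead.
void copySingleObject(S3Client s3, String bucket, String sourceKey, String targetKey)
{
    CopyObjectRequest request = CopyObjectRequest.builder()
                                                 .sourceBucket(bucket)
                                                 .sourceKey(sourceKey)
                                                 .destinationBucket(bucket)
                                                 .destinationKey(targetKey)
                                                 .build();
    s3.copyObject(request);
}
```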

@steve-todorov (Collaborator) commented

You are pointing to their "old" documentation - the new one is here.

It is correct that you cannot recursively copy using the S3 API; however, it is possible to use batch operations for this:

To copy more than one Amazon S3 object with a single request, you can use Amazon S3 batch operations. You provide S3 Batch Operations with a list of objects to operate on. S3 Batch Operations calls the respective API to perform the specified operation. A single Batch Operations job can perform the specified operation on billions of objects containing exabytes of data.

However, I don't think this would be as easy or straightforward to implement as #163 (Delete objects recursively).

You cannot get the entire list of objects in one request, because there is a limit of 1000 objects per response (as I mentioned in #163). It looks like you could use the ListObjectsV2PaginatorsIntegrationTest as a base for paginated object listing.
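Roughly, the paginated listing boils down to something like this (a sketch along the lines of that test, assuming the SDK v2 paginator; not code from this repository):

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.S3Object;

// Lists every key under a "directory" prefix; the paginator transparently
// issues follow-up requests, so the 1000-objects-per-response limit is hidden.
void listAllKeysUnderPrefix(S3Client s3, String bucket, String prefix)
{
    ListObjectsV2Request request = ListObjectsV2Request.builder()
                                                       .bucket(bucket)
                                                       .prefix(prefix)
                                                       .build();

    for (S3Object object : s3.listObjectsV2Paginator(request).contents())
    {
        System.out.println(object.key() + " (" + object.size() + " bytes)");
    }
}
```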

Possible issues I foresee are:

In S3 you can have a virtually unlimited number of nested "objects" (files or directories) in a tree-like structure:

  • This means we should definitely use async operations to speed things up.
  • We should use Reactor, where we can apply back-pressure techniques to avoid exhausting resources and allow for automatic retries on error.
  • Since there is a 5 GB size limit per copied object, we would need two different strategies, triggered depending on the file size (see the sketch after this list):
    • SingleObjectCopyStrategy - when the file is <= 5 GB
    • MultipartObjectCopyStrategy - when the file is > 5 GB (and uses the multipart upload API as mentioned in the first link)
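A rough sketch of how the two strategies could be dispatched (the class and strategy names are only proposals; the calls assume AWS SDK for Java v2 and a fixed part size, purely for illustration):

```java
import java.util.ArrayList;
import java.util.List;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CompleteMultipartUploadRequest;
import software.amazon.awssdk.services.s3.model.CompletedMultipartUpload;
import software.amazon.awssdk.services.s3.model.CompletedPart;
import software.amazon.awssdk.services.s3.model.CopyObjectRequest;
import software.amazon.awssdk.services.s3.model.CreateMultipartUploadRequest;
import software.amazon.awssdk.services.s3.model.UploadPartCopyRequest;
import software.amazon.awssdk.services.s3.model.UploadPartCopyResponse;

class ObjectCopier
{
    private static final long MAX_SINGLE_COPY_SIZE = 5L * 1024 * 1024 * 1024; // 5 GB
    private static final long PART_SIZE = 512L * 1024 * 1024;                 // 512 MB parts

    void copy(S3Client s3, String bucket, String sourceKey, String targetKey, long size)
    {
        if (size <= MAX_SINGLE_COPY_SIZE)
        {
            // SingleObjectCopyStrategy: one CopyObject call is enough.
            s3.copyObject(CopyObjectRequest.builder()
                                           .sourceBucket(bucket)
                                           .sourceKey(sourceKey)
                                           .destinationBucket(bucket)
                                           .destinationKey(targetKey)
                                           .build());
            return;
        }

        // MultipartObjectCopyStrategy: copy the object in ranges via UploadPartCopy.
        String uploadId = s3.createMultipartUpload(CreateMultipartUploadRequest.builder()
                                                                               .bucket(bucket)
                                                                               .key(targetKey)
                                                                               .build())
                            .uploadId();

        List<CompletedPart> parts = new ArrayList<>();
        int partNumber = 1;
        for (long start = 0; start < size; start += PART_SIZE, partNumber++)
        {
            long end = Math.min(start + PART_SIZE, size) - 1;
            UploadPartCopyResponse response =
                    s3.uploadPartCopy(UploadPartCopyRequest.builder()
                                                           .sourceBucket(bucket)
                                                           .sourceKey(sourceKey)
                                                           .destinationBucket(bucket)
                                                           .destinationKey(targetKey)
                                                           .uploadId(uploadId)
                                                           .partNumber(partNumber)
                                                           .copySourceRange("bytes=" + start + "-" + end)
                                                           .build());
            parts.add(CompletedPart.builder()
                                   .partNumber(partNumber)
                                   .eTag(response.copyPartResult().eTag())
                                   .build());
        }

        s3.completeMultipartUpload(CompleteMultipartUploadRequest.builder()
                                                                 .bucket(bucket)
                                                                 .key(targetKey)
                                                                 .uploadId(uploadId)
                                                                 .multipartUpload(CompletedMultipartUpload.builder()
                                                                                                          .parts(parts)
                                                                                                          .build())
                                                                 .build());
    }
}
```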

The first two points are actually valid for #163 as well, but I guess we can create a follow-up after this task.

// cc @carlspring

@edmang (Contributor) commented on Feb 15, 2021

@steve-todorov, thank you for your suggestion!
I have a small question about the use of ListObjectsV2Request in ListObjectsV2PaginatorsIntegrationTest (let's forget about the async aspect for the moment :D).
Since S3 does not have "folders" (every object is just a /path/to/the/file key), does that mean we don't need to visit all the "folders"? We would only need an iterator over the files to copy? (In that case, the ListObjectsV2PaginatorsIntegrationTest approach would actually do the job.)

However, when I check #163, I do not only delete the files but the folders too; that is why I made my own visitAllFiles method (in order to delete the leaf files first, then their folder, and so on...).

Do you think I should keep my own visitAllFiles, or should I use ListObjectsV2Request? (In the latter case, maybe I should also reconsider #163?)
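For illustration, roughly what I have in mind with the iterator-over-keys approach (just a sketch; the zero-byte "folder" placeholder keys only exist if they were created explicitly):

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.S3Object;

// One pass over every key under the source prefix. "Folders" only show up as
// zero-byte placeholder keys ending in "/" (if created explicitly), and copying
// them is the same plain CopyObject call as for a regular file, so no separate
// "visit folders" step seems necessary when copying.
void copyPrefix(S3Client s3, String bucket, String sourcePrefix, String targetPrefix)
{
    ListObjectsV2Request request = ListObjectsV2Request.builder()
                                                       .bucket(bucket)
                                                       .prefix(sourcePrefix)
                                                       .build();

    for (S3Object object : s3.listObjectsV2Paginator(request).contents())
    {
        String targetKey = targetPrefix + object.key().substring(sourcePrefix.length());
        s3.copyObject(b -> b.sourceBucket(bucket)
                            .sourceKey(object.key())
                            .destinationBucket(bucket)
                            .destinationKey(targetKey));
    }
}
```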

thanks :D
