Recursively downloading S3 directory restored from Glacier doesn't work #4747
Comments
Hi @ludirm93 , we do not plan to support automatic Glacier restore and we have removed the glacierAutoRetrieval option. See #4511 (comment) for the discussion on this point. Instead, I recommend that you write a custom process to restore any Glacier objects all at once, to minimize the amount of time waiting for the Glacier restore.
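A bulk restore along those lines could look roughly like the sketch below (this is not from the thread; the bucket name, prefix, and Days value are placeholders):

```bash
# Sketch of a bulk restore outside Nextflow: issue a restore request for every
# GLACIER-class object under a prefix, then wait for the restores to complete
# before launching the pipeline.
bucket=my-bucket
prefix=path/salmon/xyz

aws s3api list-objects-v2 --bucket "$bucket" --prefix "$prefix" \
    --query "Contents[?StorageClass=='GLACIER'].Key" --output text \
  | tr '\t' '\n' \
  | while read -r key; do
      aws s3api restore-object --bucket "$bucket" --key "$key" \
          --restore-request '{"Days":7,"GlacierJobParameters":{"Tier":"Standard"}}'
    done
```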
Hi @bentsherman nextflow is not able to download even the pre-restored files, because the nxf_s3_download() function would require the aws cli option --force-glacier-transfer. Worst case, I would need to make a duplicate copy of my data in S3 and keep it in standard storage.
That is strange, it should be able to download the S3 object like normal. How exactly did you restore it from Glacier?
For the salmon output folder tree shown above, I individually executed the restore-object command for each file:

```bash
aws s3api restore-object --bucket bucket --key path/salmon/xyz/aux_info/ambig_info.tsv --restore-request '{"Days":17,"GlacierJobParameters":{"Tier":"Standard"}}'
```

To test whether the files are downloadable, I tried outside of Nextflow:

```bash
aws s3 cp s3://bucket/path/salmon/xyz/quant.sf ./
```

which works. However, I need to provide the entire salmon output directory xyz as an input, and

```bash
aws s3 cp s3://bucket/path/salmon/xyz/ ./ --recursive
```

does not work. The AWS CLI proposes adding the --force-glacier-transfer option for a recursive download of a prefix.

The nxf_s3_download() function used to get these restored files recognizes my input path to be a directory and thus goes with the recursive download option, which in this case won't work because it needs --force-glacier-transfer. I do not know if there is a way to pass this option to nxf_s3_download().
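As an aside (not part of the original comment), whether an individual object has actually been restored can be checked with a head-object call before attempting the download; the bucket and key below are the ones from the example above:

```bash
# A restored object still reports "StorageClass": "GLACIER", but the response
# also includes a "Restore" field with ongoing-request="false" once the
# restore has completed.
aws s3api head-object --bucket bucket --key path/salmon/xyz/quant.sf
```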
Additionally, another solution without passing --force-glacier-transfer could be to change the directory branch of the function, i.e. the `if [[ $is_dir == 1 ]]; then` case.
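That suggestion isn't spelled out above; as a rough illustration of the idea (my own sketch, not the commenter's code), the directory branch could list the keys under the prefix and copy them one at a time, so each transfer is a single-object copy that honours the restore status instead of a recursive cp that skips GLACIER-class objects:

```bash
# Sketch only: assumes keys contain no whitespace.
# Usage: download_restored_dir s3://bucket/path/salmon/xyz ./xyz
download_restored_dir() {
    local source=$1    # e.g. s3://bucket/path/salmon/xyz
    local target=$2    # local destination directory
    local bucket=$(echo "$source" | cut -d/ -f3)
    local prefix=${source#s3://$bucket/}
    # --recursive lists objects only (no "PRE" entries); copy each one individually
    aws s3 ls "$source/" --recursive | awk '{print $4}' | while read -r key; do
        dest="$target/${key#$prefix/}"
        mkdir -p "$(dirname "$dest")"
        aws s3 cp --only-show-errors "s3://$bucket/$key" "$dest"
    done
}
```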
So this problem only happens when recursively downloading a directory, not with files.
Thanks for re-opening the issue.
If the aws cli can recursively list the directory but not download it, that seems like a bug in the aws cli. In the meantime, you can work around this issue by specifying the files in the directory as inputs rather than the directory itself. It's slightly more verbose but much better in the long run.
@bentsherman - it's not quite a bug (it's how they intend it to be: https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html), but I agree it's a really strange implementation choice. Just wanted to say I'm hitting the exact same issue as @Mridul-Chaudhary, but it's not quite clear to me how to switch from using directories to listing all files in a directory and using that instead. Thanks,
There is a way, but it ain't pretty if the input directory is deeply nested. Instead of providing the directory as a single input, you can enumerate all of the files in the directory so that they are downloaded individually instead of a single recursive download.

I can understand the AWS CLI not wanting to recursively restore Glacier objects, but I don't understand why they are restricting objects that are already restored. Those objects should behave like regular objects.
Totally agree with you on the AWS CLI implementation, and I was as surprised as you are to find they implemented it this way, but it is what it is. Do you have any example code you could maybe point me to around this workaround? I'm just not sure how it would work: 1) generating a list of these files in nextflow code and 2) passing this list of files of variable length and order to a nextflow process. Thanks!
It's easiest if the directory is produced by a process, which I assume is the case. Using the OP directory structure, you would normally declare it like this:

```groovy
output:
path('xyz')
```

Instead you must declare a tuple of paths for each group of files:

```groovy
output:
tuple path('xyz/aux_info/*'), path('xyz/libParams/*'), path('xyz/logs/*'), path('xyz/*')
```

Though even that might not be specific enough. Given the wide variety of file types, you might need to declare each file individually 🤢

In general I don't recommend declaring entire directories as a single output as shown in the first example, because it doesn't tell you anything about the directory contents. I realize it's more convenient though.

We might be able to get around all of this with #4553, which would allow you to define a record type that enumerates all of those files; then you just declare the record as an input/output and the files are staged individually, which happens to solve this issue.

```groovy
record Xyz {
    XyzAuxInfo aux_info
    Path cmd_info
    Path lib_format_counts
    // ...
}

record XyzAuxInfo {
    Path ambig_info
    Path expected_bias
    // ...
}

process foo {
    input:
    Xyz xyz

    script:
    """
    cat ${xyz.aux_info.ambig_info}
    """
}
```

This is aspirational, not something that can be done today, just wanted to note it here for future reference.
Thanks @bentsherman ! That makes sense to me.

I also just want to point out to @Mridul-Chaudhary and others that may have this issue that there's another workaround that works as well. After restoring the files, you can remove them from glacier entirely using a command like the one sketched below. After running that command the files are entirely removed from glacier and put into standard storage, meaning the next time you want to download them you don't need to use the --force-glacier-transfer flag.
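The exact command wasn't quoted in the comment; one common way to do this (an assumption on my part, not necessarily the commenter's command, and it means paying for STANDARD storage going forward) is an in-place S3-to-S3 copy that rewrites the restored objects into the STANDARD storage class:

```bash
# Rewrite every restored object under the prefix into STANDARD storage so it
# behaves like a regular object on future downloads.
aws s3 cp s3://bucket/path/salmon/xyz/ s3://bucket/path/salmon/xyz/ \
    --recursive --storage-class STANDARD --force-glacier-transfer
```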
I am also not a fan of the way this was implemented in the AWS CLI (good luck getting any changes made there). If you follow the discussion on this thread (aws/aws-cli#1699 (comment)) it at least explains why they ended up implementing the --force-glacier-transfer flag.

For me, I am running demultiplexing on an entire sequencing run data directory. The sequencing data sat in S3 Glacier storage for a while and I restored it because I needed to run demux on it again. The way nxf_s3_download() is written means that I cannot download the dir and I get the Glacier error.

I think, given the design decision of the AWS CLI, the right thing to do here is to implement the --force-glacier-transfer option in Nextflow.

Thanks
My preference is that it is implemented in the AWS scope. That way it is a simple configuration change. The reason I don't think it should be the default is the same reason the --force-glacier-transfer flag was implemented to begin with; that is, it requires a HeadObject vs a GetObjects request. I like @Mridul-Chaudhary's suggestion here.
New feature
For the rnasplice pipeline, I wanted to provide my previous salmon output as an input, as suggested here.
So I prepared an input sheet like this:

```csv
sample,condition,salmon_results
xyz,case,s3://bucket/path/salmon/xyz
```
where the xyz directory is in the Glacier storage class but has been restored:
```
xyz/
├── aux_info
│   ├── ambig_info.tsv
│   ├── expected_bias.gz
│   ├── fld.gz
│   ├── meta_info.json
│   ├── observed_bias_3p.gz
│   └── observed_bias.gz
├── cmd_info.json
├── lib_format_counts.json
├── libParams
│   └── flenDist.txt
├── logs
│   └── salmon_quant.log
├── quant.genes.sf
└── quant.sf
```
Each of the files inside xyz has been individually restored and tested for download. Downloading an individual file does not require the --force-glacier-transfer option.
But my sample sheet above forces the nxf_s3_download() function to attempt a recursive download of the xyz directory, which is not possible without specifying the --force-glacier-transfer option.
Example:

`aws s3 cp s3://bucket/path/salmon/xyz ./xyz/ --recursive` does not work.

Error:

```
warning: Skipping file s3://bucket/path/salmon/xyz/aux_info/ambig_info.tsv. Object is of storage class GLACIER. Unable to perform download operations on GLACIER objects. You must restore the object to be able to perform the operation. See aws s3 download help for additional parameter options to ignore or force these transfers
```

`aws s3 cp s3://bucket/path/salmon/xyz ./xyz/ --recursive --force-glacier-transfer` works.
Usage scenario
Provide restored glacier directory paths as input
Suggest implementation
Similar to a change made here, where --quiet was changed to --only-show-errors, the --force-glacier-transfer option could be made a default in case the S3 object is not in standard storage.
Or,
Introduce an optional setting similar to aws.client.glacierAutoRetrieval, say aws.client.glacierForceTransfer, which when true will add --force-glacier-transfer to the nxf_s3_download() function.
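A rough sketch of what that could look like in the generated download helper is shown below. This is not the actual Nextflow code, and the NXF_GLACIER_FORCE_TRANSFER variable standing in for the proposed aws.client.glacierForceTransfer setting is hypothetical; the point is only the flag wiring:

```bash
# Sketch of a download helper that appends --force-glacier-transfer when the
# (hypothetical) setting is enabled.
nxf_s3_download() {
    local source=$1
    local target=$2
    local file_name=$(basename "$source")
    local opts="--only-show-errors"
    if [[ ${NXF_GLACIER_FORCE_TRANSFER:-false} == true ]]; then
        opts="$opts --force-glacier-transfer"
    fi
    # treat the source as a directory if the prefix listing shows it as one
    local is_dir=$(aws s3 ls "$source" | grep -F "PRE ${file_name}/" -c)
    if [[ $is_dir == 1 ]]; then
        aws s3 cp $opts --recursive "$source" "$target"
    else
        aws s3 cp $opts "$source" "$target"
    fi
}
```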