
Recursively downloading S3 directory restored from Glacier doesn't work #4747

Open
Mridul-Chaudhary opened this issue Feb 16, 2024 · 15 comments

Comments

@Mridul-Chaudhary commented Feb 16, 2024

New feature

For the rnasplice pipeline, I wanted to provide my previous salmon output as input, as suggested here.

So I prepared an input sheet like this:

sample,condition,salmon_results
xyz,case,s3://bucket/path/salmon/xyz

where the xyz directory is in the Glacier storage class but has been restored:

xyz/
├── aux_info
│   ├── ambig_info.tsv
│   ├── expected_bias.gz
│   ├── fld.gz
│   ├── meta_info.json
│   ├── observed_bias_3p.gz
│   └── observed_bias.gz
├── cmd_info.json
├── lib_format_counts.json
├── libParams
│   └── flenDist.txt
├── logs
│   └── salmon_quant.log
├── quant.genes.sf
└── quant.sf

Each file inside xyz has been individually restored and tested for download; individually, they do not require the --force-glacier-transfer option.

But my sample sheet above forces the nxf_s3_download() function to attempt a recursive download of the xyz directory, which does not work without the --force-glacier-transfer option.

Example:
aws s3 cp s3://bucket/path/salmon/xyz ./xyz/ --recursive does not work
Error:
warning: Skipping file s3://bucket/path/salmon/xyz/aux_info/ambig_info.tsv. Object is of storage class GLACIER. Unable to perform download operations on GLACIER objects. You must restore the object to be able to perform the operation. See aws s3 download help for additional parameter options to ignore or force these transfers

aws s3 cp s3://bucket/path/salmon/xyz ./xyz/ --recursive --force-glacier-transfer works

Usage scenario

Provide restored Glacier directory paths as input.

Suggested implementation

Similar to a change made here, where --quiet was changed to --only-show-errors, the --force-glacier-transfer option could be made a default when the S3 object is not in standard storage.

Or, introduce an optional setting similar to aws.client.glacierAutoRetrieval, say aws.client.glacierForceTransfer, which when true would add --force-glacier-transfer to the nxf_s3_download() function.
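
For illustration, a minimal sketch of what that second option might look like inside the generated download helper; the force_glacier toggle is invented here to stand in for a hypothetical aws.client.glacierForceTransfer setting:

nxf_s3_download() {
    local source=$1
    local target=$2
    local file_name=$(basename $1)
    # Hypothetical opt-in toggle, e.g. driven by an aws.client.glacierForceTransfer config option
    local opts="--only-show-errors"
    [[ "$force_glacier" == "true" ]] && opts="$opts --force-glacier-transfer"
    local is_dir=$(aws s3 ls $source | grep -F "PRE ${file_name}/" -c)
    if [[ $is_dir == 1 ]]; then
        aws s3 cp $opts --recursive "$source" "$target"
    else
        aws s3 cp $opts "$source" "$target"
    fi
}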

@bentsherman (Member)

Hi @ludirm93, we do not plan to support automatic Glacier restore, and we have removed the glacierAutoRetrieval option. See #4511 (comment) for the discussion on this point.

Instead, I recommend that you write a custom process to restore any Glacier objects all at once, to minimize the time spent waiting for the Glacier restore.
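
For example, a rough bash sketch of such a bulk-restore step (which could sit inside a custom process's script block); the bucket, prefix, restore days, and tier are placeholders, keys containing spaces would need extra handling, and waiting for the restores to complete is left out:

# Issue one restore request per object under a prefix; objects that are not
# archived (or are already being restored) will return an error, ignored here.
bucket=bucket
prefix=path/salmon/xyz/
aws s3 ls "s3://${bucket}/${prefix}" --recursive | awk '{print $4}' | while read -r key; do
    aws s3api restore-object \
        --bucket "$bucket" \
        --key "$key" \
        --restore-request '{"Days":7,"GlacierJobParameters":{"Tier":"Standard"}}' || true
done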

bentsherman closed this as not planned on Feb 16, 2024
@Mridul-Chaudhary (Author)

Hi @bentsherman
Thank you for your response!
Maybe I was not able to convey my request: the S3 files have already been restored by me manually.

Nextflow is still not able to download the pre-restored directory, because in the nxf_s3_download() function the recursive download would require the AWS CLI option --force-glacier-transfer.

Worst case, I would need to make a duplicate copy of my data in S3 and keep it in standard storage

@bentsherman (Member)

That is strange, it should be able to download the S3 object like normal. How exactly did you restore it from Glacier?

@Mridul-Chaudhary (Author)

For the salmon output folder tree shown above, I individually executed the restore-object command for each file:

aws s3api restore-object --bucket bucket --key path/salmon/xyz/aux_info/ambig_info.tsv --restore-request '{"Days":17,"GlacierJobParameters":{"Tier":"Standard"}}'

When I try to download the files outside of Nextflow, to test whether they are downloadable, I can download each of the files by using its complete key:

aws s3 cp s3://bucket/path/salmon/xyz/quant.sf ./

However, I need to provide the entire salmon output directory xyz as an input.

aws s3 cp s3://bucket/path/salmon/xyz/ ./ --recursive does not work.

AWS proposes adding the --force-glacier-transfer option for recursive downloads of a prefix.
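
One way to see why the recursive copy still balks is to inspect a restored object's metadata with HeadObject: a restored Glacier object still reports StorageClass GLACIER and only gains a Restore field, which is presumably why the plain recursive copy skips it. A sketch using the placeholder paths above:

# Check the storage class and restore status of one restored object.
# Expected output for a restored Glacier object (values are illustrative):
#   { "StorageClass": "GLACIER", "Restore": "ongoing-request=\"false\", expiry-date=\"...\"" }
aws s3api head-object \
    --bucket bucket \
    --key path/salmon/xyz/quant.sf \
    --query '{StorageClass: StorageClass, Restore: Restore}'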

The nxf_s3_download() function used to stage these restored files is shown below:

nxf_s3_download() {
    local source=$1
    local target=$2
    local file_name=$(basename $1)
    local is_dir=$(/home/ec2-user/miniconda/bin/aws --region eu-west-1 s3 ls $source | grep -F "PRE ${file_name}/" -c)
    if [[ $is_dir == 1 ]]; then
        /home/ec2-user/miniconda/bin/aws --region eu-west-1 s3 cp --only-show-errors --recursive "$source" "$target"
    else
        /home/ec2-user/miniconda/bin/aws --region eu-west-1 s3 cp --only-show-errors "$source" "$target"
    fi
}

It recognizes my input path as a directory and thus uses the recursive download option, which in this case won't work because it needs --force-glacier-transfer.

I do not know if there is a way to pass this option to nxf_s3_download().

@Mridul-Chaudhary (Author)

Additionally, listing a restored directory recursively works without --force-glacier-transfer:

aws s3 ls s3://bucket/path/salmon/xyz/ --recursive

So another solution, without passing --force-glacier-transfer, could be (see the sketch below):

if [[ $is_dir == 1 ]]; then
    # list all files
    # download each of the above
fi
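
A minimal sketch of that alternative, assuming an invented helper name (nxf_s3_download_each) and ignoring keys that contain spaces:

nxf_s3_download_each() {
    local bucket=$1      # e.g. bucket
    local prefix=$2      # e.g. path/salmon/xyz/
    local target=$3
    # List every key under the prefix, then copy each object individually,
    # which avoids the Glacier check done by `cp --recursive`.
    aws s3 ls "s3://${bucket}/${prefix}" --recursive | awk '{print $4}' | while read -r key; do
        rel=${key#"$prefix"}
        mkdir -p "${target}/$(dirname "$rel")"
        aws s3 cp --only-show-errors "s3://${bucket}/${key}" "${target}/${rel}"
    done
}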

@bentsherman (Member)

So this problem only happens when recursively downloading a directory, not with individual files.

bentsherman changed the title from "To be able to provide --force-glacier-transfer option for staging restored S3 files" to "Recursively downloading S3 directory restored from Glacier doesn't work" on Feb 23, 2024
bentsherman reopened this on Feb 23, 2024
@Mridul-Chaudhary (Author)

Thanks for re-opening the issue.
Yes, that is correct.
This would affect pipelines or modules that need an entire directory as input.
For me, it was the rnasplice pipeline with salmon results as input.

@bentsherman (Member)

It seems like if the AWS CLI can recursively list the directory but not download it, that is a bug in the AWS CLI.

In the meantime, you can work around this issue by specifying the files in the directory as inputs rather than the directory itself. It's slightly more verbose but much better in the long run.

@MrOlm commented Apr 16, 2024

@bentsherman - it's not quite a bug (it's how they intend it to be, see https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html), but I agree it's a really strange implementation choice.

Just wanted to say I'm hitting the exact same issue as @Mridul-Chaudhary, but it's not quite clear to me how to switch from using directories to listing all files in a directory and using that instead.

Thanks,
Matt

@bentsherman (Member)

There is a way, but it ain't pretty if the input directory is deeply nested. Instead of providing the directory as a single input, you can enumerate all of the files in the directory so that they are downloaded individually instead of via a single recursive download.

I can understand the AWS CLI not wanting to recursively restore Glacier objects, but I don't understand why they are restricting objects that are already restored. Those objects should behave like regular objects.

@MrOlm commented Apr 17, 2024

Totally agree with you on the AWS CLI implementation, and I was as surprised as you are to find they implemented it this way, but it is what it is.

Do you have any example code you could maybe point me to for this workaround? I'm just not sure how it would work: 1) generating a list of these files in Nextflow code, and 2) passing this list of files of variable length and order to a Nextflow process.

Thanks!
Matt

@bentsherman (Member)

It's easiest if the directory is produced by a process, which I assume is the case.

Using the OP's directory structure, you would normally declare it like this:

output:
path('xyz')

Instead, you must declare it as a tuple of paths, one for each group of files:

output:
tuple path('xyz/aux_info/*'), path('xyz/libParams/*'), path('xyz/logs/*'), path('xyz/*')

Though even that might not be specific enough. Given the wide variety of file types, you might need to declare each file individually 🤢

In general I don't recommend declaring entire directories as a single output as shown in the first example, because it doesn't tell you anything about the directory contents. I realize it's more convenient, though. We might be able to get around all of this with #4553, which would allow you to define a record type that enumerates all of those files; then you just declare the record as an input/output and the files are staged individually, which happens to solve this issue.

record Xyz {
  XyzAuxInfo aux_info
  Path cmd_info
  Path lib_format_counts
  // ...
}

record XyzAuxInfo {
  Path ambig_info
  Path expected_bias
  // ...
}

process foo {
  input:
  Xyz xyz

  script:
  """
  cat ${xyz.aux_info.ambig_info}
  """
}

This is aspirational, not something that can be done today, just wanted to note it here for future reference.

@MrOlm commented Apr 17, 2024

Thanks @bentsherman! That makes sense to me.

I also just want to point out to @Mridul-Chaudhary and others who may have this issue that there's another workaround that works as well. After restoring the files, you can move them out of Glacier entirely using a command like:

$ aws s3 cp s3://sonn-current/instrain_profile/v1/ s3://sonn-current/instrain_profile/v1/ --storage-class STANDARD --recursive --force-glacier-transfer

After running that command, the files are moved out of Glacier entirely and put into Standard storage, meaning the next time you want to download them you don't need the --force-glacier-transfer flag (which also means they'll work with Nextflow, which doesn't use that flag).

@k1sauce commented Oct 23, 2024

I am also not a fan of the way this was implemented in the AWS CLI (good luck getting any changes made there). If you follow the discussion on that thread (aws/aws-cli#1699 (comment)), it at least explains why they ended up implementing the --force-glacier-transfer flag.

In my case, I am running demultiplexing on an entire sequencing run data directory. The sequencing data sat in S3 Glacier storage for a while, and I restored it because I needed to run demux on it again. The way nxf_s3_download() is written means that I cannot download the directory, and I get the Glacier error.

I think that, given the design decision of the AWS CLI, the right thing to do here is to support the --force-glacier-transfer flag as this feature request specifies.

Thanks

@k1sauce commented Oct 23, 2024

My preference is that it is implemented in the aws config scope. That way it is a simple configuration change. The reason I don't think it should be the default is the same reason the --force-glacier-transfer flag was implemented as opt-in to begin with; that is, it requires a HeadObject request per object instead of just a GetObject.

I like @Mridul-Chaudhary's suggestion here:
"Introduce an optional setting similar to aws.client.glacierAutoRetrieval, say aws.client.glacierForceTransfer, which when true will add --force-glacier-transfer to the nxf_s3_download() function."
