S3 artifact download performance improvement #9650

Closed
xubofei1983 opened this issue Sep 21, 2022 · 1 comment
Labels
area/artifacts S3/GCP/OSS/Git/HDFS etc solution/superseded This PR or issue has been superseded by another one (slightly different from a duplicate) type/feature Feature request

Comments

@xubofei1983
Contributor

xubofei1983 commented Sep 21, 2022

Summary

We tried to use an artifact to download a fairly large S3 bucket and found the performance unsatisfactory compared with the AWS S3 CLI.

The bucket is about 60 GB in total and contains subfolders and many files.

With Argo, the download took around 26 minutes:

time="2022-09-21T03:23:41.924Z" level=info msg="Downloading artifact: my-art"
time="2022-09-21T03:23:41.924Z" level=info msg="Specified artifact path /argo/input/my-artifact overlaps with volume mount at /argo/input. Extracting to volume mount"
time="2022-09-21T03:23:41.924Z" level=info msg="S3 Load path: /mainctrfs/argo/input/my-artifact.tmp, key: 104412/align-reads/"
time="2022-09-21T03:23:41.924Z" level=info msg="Creating minio client using IAM role"
time="2022-09-21T03:23:41.924Z" level=info msg="Getting file from s3" bucket=results.bravo.kariusdx.com endpoint=s3.amazonaws.com key=104412/align-reads/ path=/mainctrfs/argo/input/my-artifact.tmp
time="2022-09-21T03:23:43.033Z" level=info msg="Getting directory from s3" bucket=results.bravo.kariusdx.com endpoint=s3.amazonaws.com key=104412/align-reads/ path=/mainctrfs/argo/input/my-artifact.tmp
time="2022-09-21T03:23:43.033Z" level=info msg="Listing directory from s3" bucket=results.bravo.kariusdx.com endpoint=s3.amazonaws.com key=104412/align-reads/
time="2022-09-21T03:49:38.528Z" level=info msg="Detecting if /mainctrfs/argo/input/my-artifact.tmp is a tarball"
time="2022-09-21T03:49:38.528Z" level=info msg="Successfully download file: /mainctrfs/argo/input/my-artifact"

With aws s3 sync, it took around 3 minutes (on the same EC2 machine).

What change needs making?

Could the download be sped up with async I/O or multiprocessing? Alternatively, could users be allowed to plug in their own download functionality?

Use Cases

We have an aggregation step in a time-sensitive pipeline that needs to download a large amount of data and aggregate it.


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

@xubofei1983 xubofei1983 added the type/feature Feature request label Sep 21, 2022
@agilgur5 agilgur5 added the area/artifacts S3/GCP/OSS/Git/HDFS etc label May 30, 2024
@agilgur5
Member

Just got pointed to this issue from a Slack thread. Sorry it never got a response from a maintainer in the past 2 years (it predates me).

As Tim wrote on Slack:
"Make sure you are right-sizing your executor. Pulling big data and then untarring it needs quite a bit of resources. The default size is pretty tiny and not artifact-friendly."

And as I wrote in addition to Tim's comment:
"The init container and wait container specifically are responsible for artifacts. It uses the S3 Go SDK, so I'd expect the main difference is resource allocation as Tim said. Note that CPU, memory, network, and FS can all impact this (not sure which is the bottleneck in your case though; often memory is)"

You can set executor resources globally in your Controller ConfigMap under executor.resources.
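
As a rough sketch of the global setting, the ConfigMap might look like the following. The `argo` namespace and all resource values are illustrative assumptions, not recommendations; size them from what you observe for your own artifacts.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo            # assumption: the Controller runs in the "argo" namespace
data:
  # Applies to the executor (init and wait) containers of every Workflow pod.
  executor: |
    resources:
      requests:
        cpu: "1"             # illustrative value
        memory: 2Gi          # illustrative value
      limits:
        memory: 4Gi          # illustrative value
```

Because this changes the executor for every Workflow, it may be worth raising the values only as far as your largest routine artifacts need.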

You can set them per template by using podSpecPatch on the template.
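
A minimal sketch of the per-template approach, assuming the executor containers are named `init` and `wait`. The template name and resource values are hypothetical.

```yaml
# Fragment of a Workflow spec; only the relevant template fields are shown.
templates:
  - name: aggregate            # hypothetical template name
    podSpecPatch: |
      initContainers:
        - name: init           # executor container that loads input artifacts
          resources:
            requests:
              cpu: "2"         # illustrative value
              memory: 4Gi      # illustrative value
            limits:
              memory: 4Gi
      containers:
        - name: wait           # executor container that saves output artifacts
          resources:
            requests:
              cpu: "1"         # illustrative value
              memory: 2Gi      # illustrative value
```

This keeps the larger executor footprint limited to the templates that actually move big artifacts.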

> with aws s3 sync, it took around 3 mins. (on same ec2 machine)

Regarding aws s3 sync, I also wrote that it could be more optimal under-the-hood, say if it uses rsync or something.

> improve the performance with async? or multi processing? or allow user to plugin download functionality?

There is a separate, specific issue to parallelize artifacts: #12442
Same for artifact plugins: #5862

Going to mark this issue as duplicative of those as they are more specific in their feature request.

@agilgur5 agilgur5 added the solution/superseded This PR or issue has been superseded by another one (slightly different from a duplicate) label May 31, 2024