RHCOS 4.10 build failure: Aliyun quota issues + pruning? #701

Closed
miabbott opened this issue Feb 2, 2022 · 10 comments
miabbott commented Feb 2, 2022

We're running into quota issues with the account(s) we are using to import images into the cloud. As it looks right now, we have a limit of 500 images that can be imported into the cloud from the account. (It's unknown whether this is a per-region limit or a global limit.) This is different from the other clouds, where our accounts appear to have unlimited capacity for pushing data/images into the cloud.

The result right now is that the RHCOS 4.10 builds are failing because they can't upload to Aliyun (due to exceeding our quota). I've reached out to the folks working with Alibaba (and DPP) about getting clarity on the quota limitation and possibly getting it increased, but that seems like just a band-aid that will eventually peel off.

There are a couple of things we could do:

  • disable Aliyun unless we need a bootimage bump (short term fix, increased pain related to bootimage bumps, easy to mess up)
  • come up with a pruning strategy for removing older builds (longer term fix, do we do this for everything or just Aliyun?, also seems easy to get wrong)

I'm inclined to disable Aliyun for now just so we are able to keep builds happening.

@miabbott miabbott changed the title RHCOS 4.10 build failure: Aliyun quota issues + pruning RHCOS 4.10 build failure: Aliyun quota issues + pruning? Feb 2, 2022
@cgwalters

> disable Aliyun unless we need a bootimage bump (short term fix, increased pain related to bootimage bumps, easy to mess up)

Wouldn't we really just disable all bootimages?

> come up with a pruning strategy for removing older builds (longer term fix, do we do this for everything or just Aliyun?, also seems easy to get wrong)

One core problem with pruning is that older installer versions can reference older images.

500 is not a small number, so if we prune fairly aggressively in development releases I think we'd be fine for a while. Still, it seems like this limit is going to pull us towards openshift/enhancements#201.
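
For illustration, a minimal sketch of what an aggressive pruning pass could look like with the Alibaba Cloud Python SDK (aliyun-python-sdk-ecs). The `rhcos-` name filter, the keep-list of image IDs still referenced by installer metadata, and the credentials/region are all assumptions for the sketch, not actual pruner logic:

```python
import json

from aliyunsdkcore.client import AcsClient
from aliyunsdkecs.request.v20140526.DescribeImagesRequest import DescribeImagesRequest
from aliyunsdkecs.request.v20140526.DeleteImageRequest import DeleteImageRequest

def prune_region(client, keep_ids):
    """Delete this account's RHCOS images, except those in keep_ids."""
    # Collect every page first so deletions don't disturb pagination.
    images, page = [], 1
    while True:
        req = DescribeImagesRequest()
        req.set_ImageOwnerAlias("self")  # only images owned by this account
        req.set_PageSize(50)
        req.set_PageNumber(page)
        resp = json.loads(client.do_action_with_exception(req))
        batch = resp["Images"]["Image"]
        if not batch:
            break
        images.extend(batch)
        page += 1
    for img in images:
        # Hypothetical policy: prune RHCOS images not referenced by any
        # installer release we still support.
        if img["ImageName"].startswith("rhcos-") and img["ImageId"] not in keep_ids:
            delete = DeleteImageRequest()
            delete.set_ImageId(img["ImageId"])
            print("pruning", img["ImageId"], img["ImageName"])
            client.do_action_with_exception(delete)

# Placeholder credentials/region; a real run would iterate over every
# region and build keep_ids from the installer's RHCOS metadata.
client = AcsClient("<access-key-id>", "<access-key-secret>", "us-east-1")
prune_region(client, keep_ids={"m-exampleimageid"})
```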

miabbott commented Feb 2, 2022

> > disable Aliyun unless we need a bootimage bump (short term fix, increased pain related to bootimage bumps, easy to mess up)
>
> Wouldn't we really just disable all bootimages?

Yeah, I guess we can get away with that.

> > come up with a pruning strategy for removing older builds (longer term fix, do we do this for everything or just Aliyun?, also seems easy to get wrong)
>
> One core problem with pruning is that older installer versions can reference older images.
>
> 500 is not a small number, so if we prune fairly aggressively in development releases I think we'd be fine for a while. Still, it seems like this limit is going to pull us towards openshift/enhancements#201.

I created a card towards a pruning tool - https://issues.redhat.com/browse/COS-1173

travier commented Feb 2, 2022

Another option is to improve the pipeline so that it still builds boot images but doesn't push them by default (storing only their sha256sums), and then have a job that can take an ostree commit and rebuild the exact same boot images (with matching sha256sums) when we need a boot image bump. This is the next step of the re-hydrator work, AFAIK.
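
For illustration, a minimal Python sketch of the record-then-verify half of that idea; the manifest path/format and artifact names are assumptions, not the pipeline's actual layout:

```python
import hashlib
import json

def sha256_of(path, bufsize=1024 * 1024):
    """Stream a file and return its sha256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

# Build time: record the digest instead of pushing the image.
manifest = {"rhcos-aliyun.qcow2": sha256_of("builds/latest/rhcos-aliyun.qcow2")}
with open("builds/latest/bootimage-checksums.json", "w") as f:
    json.dump(manifest, f)

# Bump time: rebuild from the same ostree commit, then refuse to publish
# unless the rebuilt artifact is bit-for-bit identical to what we recorded.
with open("builds/latest/bootimage-checksums.json") as f:
    recorded = json.load(f)
rebuilt = sha256_of("rebuild/rhcos-aliyun.qcow2")
assert rebuilt == recorded["rhcos-aliyun.qcow2"], "rebuild is not reproducible"
```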

@cgwalters

Interesting thought. I think that's strongly related to the re-hydrator, but different in implementation: it's saying that our image build process itself should be bit-for-bit reproducible. I think in the limit that's going to be quite hard. Others have done it for e.g. ISOs (https://wiki.debian.org/ReproducibleInstalls/LiveImages), but I think it's harder with e.g. XFS.

miabbott commented Feb 2, 2022

> Another option is to improve the pipeline so that it still builds boot images but doesn't push them by default (storing only their sha256sums), and then have a job that can take an ostree commit and rebuild the exact same boot images (with matching sha256sums) when we need a boot image bump. This is the next step of the re-hydrator work, AFAIK.

A less elegant solution than "re-hydrator" for this problem is to just store the various artifacts in S3, then have a process publish them to the different clouds when we do the bootimage bump.
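
For illustration, a minimal sketch of that stash-then-publish split with boto3; the bucket name, key layout, and function names are assumptions, not an actual implementation:

```python
import boto3

s3 = boto3.client("s3")

def stash_artifact(build_id, path, name):
    """Build time: store the raw artifact in S3 under a per-build prefix."""
    s3.upload_file(path, "rhcos-artifacts", f"{build_id}/{name}")

def publish_on_bump(build_id, name):
    """Bump time: hand the stored object to the cloud-specific import step."""
    url = f"s3://rhcos-artifacts/{build_id}/{name}"
    # The actual import would be the existing per-cloud upload tooling;
    # this just shows where it would be driven from.
    print("importing", url, "into each cloud")

stash_artifact("410.84.202202020000-0", "tmp/rhcos-aliyun.qcow2", "rhcos-aliyun.qcow2")
```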

@cgwalters

Yeah, splitting up build vs publish makes total sense; it's what FCOS is doing today.

@cgwalters

OK but, no matter what, we must prune Aliyun images now to progress, right? Is anyone looking at that?

miabbott commented Feb 4, 2022

I'm swarming on it with @ravanelli

@miabbott

Renata and I swarmed on https://github.com/miabbott/rhcos-aliyun-pruner/

I have a request in with ART to run the pruner on the prod account.

miabbott commented Mar 8, 2022

ART ran the pruner script/container and cleaned up the unused images across all the regions in the account they use for production builds. Additionally, the folks at Alibaba increased the image quota to 1000 for each region, so we have a good amount of cushion going forward.

I'd recommend we start being more conservative with our bootimage generation and only enable it when we want to update the metadata in openshift/installer. That would save storage and build time, and would help avoid problems like this in the future. (FTR, I think we should also investigate a more holistic pruning function that operates on all CoreOS builds/artifacts.)
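
For illustration, a rough sketch of gating bootimage generation behind an explicit opt-in; the flag name and the set of platforms are invented for the sketch, not the pipeline's actual configuration:

```python
import os
import subprocess

# Hypothetical opt-in flag: only generate cloud bootimages when a
# bootimage bump (i.e. an openshift/installer metadata update) is planned.
bump_requested = os.environ.get("BOOTIMAGE_BUMP", "false") == "true"

platforms = ["aliyun", "aws", "gcp"] if bump_requested else []
for platform in platforms:
    # coreos-assembler provides per-platform buildextend commands.
    subprocess.run(["cosa", f"buildextend-{platform}"], check=True)
```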

@miabbott miabbott closed this as completed Mar 8, 2022