cleanup efa installer archive before install #6870

Open: wants to merge 2 commits into base branch main

vsoch
Contributor

@vsoch vsoch commented Jul 29, 2023

Currently, the UserData scripts that run during cloud-init execute before any root volumes are expanded with growpart. The best solution would be to ensure the filesystem resize happens before these scripts run, but a quick fix for the current issue is simply to clean up the EFA installer tar.gz, which is very large. I have tested this with hpc7g on clusters of size 2 and size 8 (neither worked previously) and can confirm the devices are functioning afterward.
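To illustrate the idea, here is a minimal sketch of the "unpack, delete the archive, then install" ordering. It is not the actual change in this PR: the function name `install_from_archive` and its arguments are hypothetical, chosen only to show the pattern.

```shell
# Sketch only: install_from_archive and its arguments are hypothetical names,
# not taken from the eksctl bootstrap scripts.
install_from_archive() {
  local archive="$1"    # path to the installer .tar.gz
  local dest="$2"       # directory to extract into
  local installer="$3"  # installer script, relative to $dest
  shift 3               # remaining arguments are passed to the installer

  tar -xf "$archive" -C "$dest" || return 1
  # Remove the archive as soon as it is unpacked, so its space is free
  # before the installer starts downloading packages.
  rm -f "$archive" || return 1
  # Run in a subshell so the caller's working directory is untouched.
  ( cd "$dest/$(dirname "$installer")" && "./$(basename "$installer")" "$@" )
}
```

A call might look like `install_from_archive /tmp/aws-efa-installer-latest.tar.gz /tmp aws-efa-installer/efa_installer.sh -y`. The archive path, the extracted directory layout, and the `-y` flag in that call are assumptions about the EFA installer, not details stated in this PR.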

image

And logs for a running node (what they should look like!)

image

This will close #6869

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the userdocs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes
  • (Core team) Added labels for change area (e.g. area/nodegroup) and kind (e.g. kind/improvement)

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟

@vsoch
Contributor Author

vsoch commented Dec 3, 2023

Sorry - why was this closed?

@vsoch
Contributor Author

vsoch commented Aug 21, 2024

Hi, could you please reopen this? We just spent a ton of money (and many hours) bringing up clusters with broken EFA because a subset of nodes didn't have EFA. Why? Because of this issue:

  Error downloading packages:
    gcc-7.3.1-17.amzn2.x86_64: Insufficient space in download directory /var/cache/yum/x86_64/2/amzn2-core/packages
      * free   11 M
      * needed 22 M

  Error: Failed to install packages.
  Error: failed to install EFA packages, exiting
  /var/lib/cloud/instances/i-08e860474f7d2683a/boothooks/part-001: line 7: pop: command not found
  /usr/bin/cloud-init-per: line 63: /opt/amazon/efa/bin/fi_info: No such file or directory

The archive needs to be cleaned up. This can't keep happening. I opened this over a year ago and I don't understand why it's been ignored and closed. What do you need from me?
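As a way to make this failure visible earlier, a boot script could check free space before starting the install. The following is a rough sketch under stated assumptions: `free_mb` and `require_free_mb` are hypothetical helper names, and the threshold passed in would have to be chosen by whoever maintains the script.

```shell
# Sketch only: free_mb and require_free_mb are hypothetical helper names.

# Print the free space, in whole megabytes, on the filesystem holding $1.
free_mb() {
  # -P forces the portable one-line-per-filesystem layout; -k reports 1K blocks.
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# Succeed only if at least $2 megabytes are free on the filesystem holding $1.
require_free_mb() {
  local dir="$1" needed="$2" free
  free="$(free_mb "$dir")"
  if [ "$free" -lt "$needed" ]; then
    echo "insufficient space in $dir: free ${free} M, needed ${needed} M" >&2
    return 1
  fi
}
```

For instance, `require_free_mb /var/cache/yum 22 || exit 1` would stop before the install using the 22 M figure from the log above. That number is only what yum reported for one package on one node, so a real threshold would likely need to be larger.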

@cPu1 cPu1 added the priority/important-soon label (ideally to be resolved in time for the next release) and removed the stale label Aug 21, 2024
@cPu1 cPu1 reopened this Aug 21, 2024
@cPu1
Collaborator

cPu1 commented Aug 21, 2024

@vsoch, please give us some time, we'll prioritize reviewing this PR.

@vsoch
Contributor Author

vsoch commented Aug 21, 2024

Thank you! Much appreciated. I looked at the code and I think the change needs to be added to two additional files - I'll do that shortly.

Currently, the UserData section that runs during cloud init
happens before any root volumes are expanded with growpart.
Although the best solution would be to ensure the filesystem
resize happens before these scripts are run, a quick means
to fix the current issue is simply to cleanup the efa
installer tar.gz, which is very large. I have tested this
with hpc7g for a size 2 and size 8 cluster (previously both
not working) and can confirm the devices are functioning after.

Signed-off-by: vsoch <[email protected]>
@vsoch vsoch force-pushed the cleanup-efa-installer-archive branch from 747bf6d to d4eb786 on August 21, 2024 08:09
@vsoch
Contributor Author

vsoch commented Aug 21, 2024

Updated to include the same change in the 2023 files. I also tested this this evening and it fixed the issue I posted above: my cluster came up with all EFA nodes. I still need to rerun the experiments for the two clusters that failed tonight, but with less funding now, that will be tomorrow. Thanks for the help and the (TBA) speedy review!

Problem: the node consistently runs out of disk space when
adding efa, resulting in an unusable cluster with scattered
nodes where the installer failed.
Solution: the installer archive itself is huge, and we can
simply remove it and avoid this error.

Signed-off-by: vsoch <[email protected]>
@vsoch vsoch force-pushed the cleanup-efa-installer-archive branch from d4eb786 to a47a32a on August 21, 2024 08:22
@github-actions github-actions bot added the stale label Sep 21, 2024
@github-actions github-actions bot closed this Sep 26, 2024
@vsoch
Contributor Author

vsoch commented Sep 26, 2024

Hi @cPu1 your bot closed the PR again!

@cPu1 cPu1 reopened this Sep 26, 2024
@cPu1 cPu1 removed the stale label Sep 26, 2024
@cPu1
Collaborator

cPu1 commented Sep 26, 2024

> Hi @cPu1 your bot closed the PR again!

Sorry about that, I am not in favor of having this stale bot. We'll discuss this more.

As for the PR, we will try to get this reviewed and released by next week.

@vsoch
Contributor Author

vsoch commented Sep 26, 2024

Thank you! And no worries about stale bot - it can be very helpful. I'm subscribed to the thread and am good to ping when it needs to be reopened.

Labels: kind/improvement, priority/important-soon (ideally to be resolved in time for the next release)

Successfully merging this pull request may close these issues:

[Bug] default device is not large enough for efa installer

2 participants