[Bug] - Occasional freezing of AL2023 servers (dhcp timeout issue with slowed processor) #773

Open
john-forrest opened this issue Aug 6, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@john-forrest

Describe the bug
We are running AL2023 on several T3 EC2 instances. They replaced instances that ran CentOS 7 (now past EOL), and we also have instances running AL2 with similar setups. We have never seen this issue before. Two instances in particular regularly appear to freeze: their status is "running", but we can no longer connect to them over TCP and can only recover by rebooting. These instances run SonarQube and GitLab in Docker containers. We don't have much control over the internals, although we have followed the GitLab instructions for running with reduced resources as far as possible, thinking this might help.

Our theory is that we are seeing the same issue as described in https://gist.github.com/raggi/1f8d0b9f45c5b62e7131b03e6e2ffe68. That report is for Ubuntu, but it definitely sounds the same. We raised this with AWS Support and were told the cause was that we had used all our CPU credits and the instance had therefore been slowed down. From our viewpoint, though, the real issue is the way AL2023 handles that.
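
To make the CPU-credit explanation concrete, here is a rough sketch of the arithmetic, assuming the usual burstable-instance accounting in which one CPU credit equals one vCPU at 100% utilization for one minute. The function name `minutes_until_depleted` and the example figures are hypothetical and are not measurements from the affected instances.

```python
def minutes_until_depleted(balance, earn_per_hour, vcpus, utilization):
    """Estimate how long a credit balance lasts under sustained load.

    balance       -- current CPU credit balance
    earn_per_hour -- credits earned per hour (depends on instance size)
    vcpus         -- number of vCPUs on the instance
    utilization   -- average utilization per vCPU, from 0.0 to 1.0
    Returns minutes until the balance reaches zero, or None if the
    load is at or below the rate at which credits are earned.
    """
    spend_per_min = vcpus * utilization   # 1 credit = 1 vCPU-minute at 100%
    earn_per_min = earn_per_hour / 60.0
    net = spend_per_min - earn_per_min
    if net <= 0:
        return None                       # balance never runs out
    return balance / net


# Hypothetical figures: 60 credits banked, 30 earned per hour,
# 2 vCPUs each averaging 50% utilization.
print(minutes_until_depleted(balance=60, earn_per_hour=30, vcpus=2, utilization=0.5))  # 120.0
```

Once the balance reaches zero in standard credit mode, the instance is held to its baseline CPU share, which is the slowdown AWS Support described.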

I had hoped that, especially since there is apparently a fix for this, AL2023 would itself be fixed, but there is no sign of that. We've added monitors that reboot the instances automatically should they stop responding, but this is not ideal, and we still get job failures and the like whenever a freeze happens.
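
A minimal sketch of the kind of reachability monitor described above, assuming a plain TCP connect is a good enough liveness signal. The names `tcp_alive` and `FreezeDetector`, the timeout, and the failure threshold are hypothetical and are not taken from the actual monitoring setup. The reboot itself is left to whatever tooling is in use.

```python
import socket


def tcp_alive(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


class FreezeDetector:
    """Flag an instance as frozen after N consecutive failed probes."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def record(self, alive):
        """Record one probe result; return True when a reboot is warranted."""
        self.failures = 0 if alive else self.failures + 1
        return self.failures >= self.threshold
```

Requiring several consecutive failures before acting avoids rebooting on a single dropped probe, at the cost of a slower reaction to a real freeze.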

To Reproduce
Steps to reproduce the behavior:

  1. Run an app on the instance that has occasional peaks in CPU.
  2. At some point the system will freeze.

Expected behavior
We should never hit this scenario: the instance should remain reachable over the network even when its CPU is throttled.

Screenshots

image (2)

Additional context

We did the original analysis a few months ago, but the issue is still occurring. We are keeping up to date with AL2023 releases.

@ozbenh

ozbenh commented Aug 9, 2024

I am not convinced by the explanation about using up CPU resources. It does look like DHCP is timing out, which looks more like a networking problem to me, unless of course the CPU is slowed down so much that systemd fails to receive the DHCP responses. That sounds far-fetched to me, but the bug you linked does seem to indicate it as a possibility.

We will attempt to get to the bottom of it.

ozbenh added the bug label on Aug 9, 2024
@john-forrest

@ozbenh Any update on this?
