Fix load balancer returning 502 Bad Gateway error #211

Closed
Siegrift opened this issue Feb 1, 2024 · 4 comments
Assignees
Labels
wontfix This will not be worked on

Comments

@Siegrift
Collaborator

Siegrift commented Feb 1, 2024

One of the Airnode feed issues is a 502 Bad Gateway rejection by the AWS ELB. See the query here.

The Signed APIs seem to be underutilized based on the CPU and memory metrics. This should be investigated and fixed. Note that these rejections happen sporadically and don't seem to impact the system that much.

@Siegrift Siegrift self-assigned this Feb 2, 2024
@Siegrift Siegrift added the bug Something isn't working label Feb 2, 2024
@Siegrift
Collaborator Author

Siegrift commented Feb 2, 2024

I enabled access and connection load balancer logs on one of the Signed APIs and hope to get more clarity. The Signed API service seems to be underused (max ~15% CPU and memory according to the dashboard).
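For reference, a minimal sketch of how the access (and connection) logs can be switched on for the load balancer, assuming the AWS SDK v3 client and placeholder ARN/bucket values (I did this through the usual deployment configuration, so treat this only as an illustration):

```ts
import {
  ElasticLoadBalancingV2Client,
  ModifyLoadBalancerAttributesCommand,
} from '@aws-sdk/client-elastic-load-balancing-v2';

// Placeholder identifiers for illustration only.
const LOAD_BALANCER_ARN = 'arn:aws:elasticloadbalancing:...';
const LOGS_BUCKET = 'signed-api-elb-logs';

const client = new ElasticLoadBalancingV2Client({});

// Enable ALB access logging (and connection logging) to an S3 bucket. The
// bucket must already exist and allow the ELB log delivery service to write to it.
await client.send(
  new ModifyLoadBalancerAttributesCommand({
    LoadBalancerArn: LOAD_BALANCER_ARN,
    Attributes: [
      { Key: 'access_logs.s3.enabled', Value: 'true' },
      { Key: 'access_logs.s3.bucket', Value: LOGS_BUCKET },
      { Key: 'access_logs.s3.prefix', Value: 'signed-api' },
      { Key: 'connection_logs.s3.enabled', Value: 'true' },
      { Key: 'connection_logs.s3.bucket', Value: LOGS_BUCKET },
    ],
  })
);
```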

@Siegrift Siegrift added the blocked Blocked by something label Feb 2, 2024
@Siegrift
Collaborator Author

Siegrift commented Feb 2, 2024

I gathered some of the ELB logs and compared them with the Grafana logs. Note that I turned off the ELB logs after a while because they incur extra costs.

The ELB logs are nice and provide a lot of information. I saw that a few requests were rejected by the ELB. I was able to match the other (successful) requests with Grafana logs, but I couldn't find out why the failed ones were rejected. I observed that before each 502 rejection, Airnode feed pushed a large batch of signed data (batch size 100).
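For context, the matching was done on the raw ALB access log lines. They are space separated and the ELB status code comes right after the three processing-time fields, so filtering out the 502 entries is straightforward; a rough sketch (field positions per the AWS ALB access log documentation, file name is a placeholder):

```ts
import { readFileSync } from 'node:fs';

// Placeholder file; the real entries live as gzipped objects in the logs S3 bucket.
const lines = readFileSync('alb-access.log', 'utf8').split('\n').filter(Boolean);

for (const line of lines) {
  // ALB access logs are space separated. The quoted fields ("request",
  // "user_agent", ...) only appear after the status codes, so a naive split
  // is good enough for the leading fields.
  const fields = line.split(' ');
  const time = fields[1];
  const targetProcessingTime = fields[6];
  const elbStatusCode = fields[8];
  const targetStatusCode = fields[9];

  // The interesting entries: the ELB answered 502, usually with target status "-"
  // (the target closed the connection or never produced a response).
  if (elbStatusCode === '502') {
    console.log({ time, elbStatusCode, targetStatusCode, targetProcessingTime });
  }
}
```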

My hypothesis is that signature verification caused a very short-term CPU spike which caused the ELB to reject the other request. I dismissed this explanation before because the AWS metrics show the service is quite underused, but I am not sure how exactly AWS measures this metric.
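To illustrate the hypothesis (this is not the actual Signed API code): if the signatures of a batch of 100 items are verified synchronously, the Node.js event loop is blocked for the whole batch, and any request arriving in that window, including one the ELB is proxying, just sits in the queue. A minimal sketch with a hypothetical verifySignature helper:

```ts
// Sketch of the hypothesis only; `verifySignature` stands in for the real,
// CPU-bound ECDSA verification and is assumed to be synchronous.
declare function verifySignature(item: { payload: string; signature: string }): boolean;

function verifyBatch(batch: { payload: string; signature: string }[]): boolean[] {
  const start = Date.now();
  // All verifications run back to back on the main thread, so no other request
  // (including the one behind the eventual 502) is served until they finish.
  const results = batch.map((item) => verifySignature(item));
  console.log(`event loop blocked for ~${Date.now() - start} ms by ${batch.length} verifications`);
  return results;
}
```

Yielding to the event loop between items (e.g. `await new Promise((resolve) => setImmediate(resolve))`) or offloading verification to a worker thread would cap the stall at roughly one verification.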

Note that the failure happens infrequently, but I want us to understand the reason behind every error we see. I think #213, #212 and #214 could all help with mitigating this issue.

As an unrelated note, I noticed that some of the Grafana logs look quite strange:

{
  "log": "...."
  "partial_id": "....",
  "partial_last": "false",
  "partial_message": "true",
  "partial_ordinal": "2"
}

This is caused by the log forwarder when it needs to forward long lines (it splits them into these partial logs). We should not see this at the INFO level.
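For completeness, these fragments can be stitched back together before reading them: records that belong to one original line share a partial_id and are ordered by partial_ordinal. A rough sketch of the merge (field semantics assumed from the forwarder output above):

```ts
type PartialLog = {
  log: string;
  partial_id?: string;
  partial_last?: string; // "true" for the final fragment
  partial_message?: string; // "true" when the record is a fragment
  partial_ordinal?: string; // 1-based position within the original line
};

// Reassemble fragments produced by the log forwarder: group by partial_id,
// order by partial_ordinal and concatenate the `log` payloads.
function mergePartialLogs(records: PartialLog[]): string[] {
  const merged: string[] = [];
  const fragments = new Map<string, PartialLog[]>();

  for (const record of records) {
    if (record.partial_message !== 'true' || !record.partial_id) {
      merged.push(record.log); // ordinary, non-split record
      continue;
    }
    const group = fragments.get(record.partial_id) ?? [];
    group.push(record);
    fragments.set(record.partial_id, group);
  }

  for (const group of fragments.values()) {
    group.sort((a, b) => Number(a.partial_ordinal) - Number(b.partial_ordinal));
    merged.push(group.map((r) => r.log).join(''));
  }

  return merged;
}
```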

I'm keeping the issue open, but adding the on-hold label. I don't want to close it because the issue still exists (and I don't think we should just ACK it).

@Siegrift Siegrift added on hold We do not plan to address this at the moment and removed blocked Blocked by something labels Feb 2, 2024
@Siegrift
Collaborator Author

Siegrift commented Feb 23, 2024

#215 might fix this issue. We need to redeploy the Signed APIs and see if there are fewer errors for a few days. I don't want to close the issue before I see the errors gone.

@Siegrift
Collaborator Author

Siegrift commented Mar 7, 2024

I've checked Grafana logs after redeploying with 0.5.1. Before the mentioned fix, there were ~300 errors during a 12-hour period; after the fix it's ~90, which is a significant reduction, but not a complete fix.

I also noticed that 0.6.0 has more of these errors, but @bdrhn9 pointed out this may be because of the increased bandwidth from logging the data.

I think this is enough to warrant closing the issue. To get rid of the 502 errors completely, we may need to get rid of the ELB.

@Siegrift Siegrift closed this as completed Mar 7, 2024
@Siegrift Siegrift added wontfix This will not be worked on and removed bug Something isn't working on hold We do not plan to address this at the moment labels Mar 7, 2024