Fix load balancer returning 502 Bad Gateway error #211
Comments
I enabled access and connection load balancer logs on one of the Signed APIs, hoping to get more clarity. The Signed API service seems to be underutilized (max ~15% CPU and memory according to the dashboard).
I gathered some of the ELB logs and compared them with the Grafana logs. Note that I turned off the ELB logs after a while because they incur extra costs. The ELB logs are nice and provide a lot of information. I saw that a few requests were rejected by the ELB. I was able to match the other (successful) requests with Grafana logs, but I couldn't find why the failed ones were rejected. I observed that before each 502 rejection, Airnode feed pushed a large batch of signed data (size 100). My hypothesis is that signature verification caused a very short-term CPU spike (sketched below), which caused the ELB to reject the other request. I had dismissed this idea before because the AWS metrics show the service is quite underutilized, but I am not sure how exactly AWS measures this metric. Note that the failure happens infrequently, but I want us to understand the reason behind every error we see. I think #213, #212 and #214 could all help with mitigating this issue.

As an unrelated note, I noticed that some of the Grafana logs look quite strange:
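To make the CPU-spike hypothesis concrete, here is a minimal sketch. This is not the actual Signed API code; it assumes each pushed item carries an ECDSA signature recovered with ethers, and the type/field names are illustrative only.

```ts
// Minimal sketch, NOT the actual Signed API code: it assumes each pushed item
// carries an ECDSA signature that is recovered with ethers, and that the whole
// batch is verified synchronously on the Node.js main thread.
import { ethers } from "ethers";

// Illustrative shape of one pushed item; the real payload/signing scheme may differ.
interface SignedDatum {
  message: string;   // what was signed
  signature: string; // 65-byte ECDSA signature, hex encoded
  airnode: string;   // expected signer address
}

// Verifying ~100 signatures back to back is CPU-bound and blocks the event
// loop for the whole batch, so concurrent requests (and health checks) queue
// up behind it even though *average* CPU utilization stays low.
function verifyBatchSync(batch: SignedDatum[]): boolean {
  return batch.every(
    (d) =>
      ethers.verifyMessage(d.message, d.signature).toLowerCase() ===
      d.airnode.toLowerCase()
  );
}

// One generic way to smooth the spike: yield to the event loop between
// verifications so other requests can be served while a large batch is handled.
async function verifyBatchYielding(batch: SignedDatum[]): Promise<boolean> {
  for (const d of batch) {
    const recovered = ethers.verifyMessage(d.message, d.signature);
    if (recovered.toLowerCase() !== d.airnode.toLowerCase()) return false;
    await new Promise<void>((resolve) => setImmediate(() => resolve())); // let pending I/O run
  }
  return true;
}
```

The second function only shows one generic way to spread the work out (yielding, or moving verification to a worker thread); whether the referenced issues take this exact approach is not stated here.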
The strange-looking lines are caused by the log forwarder when it needs to forward long lines (it splits them into these partial logs). We should not see this at the INFO level. I'm keeping the issue open, but adding
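For context, here is a rough sketch of how such "partial" records could be stitched back together when reading the logs. It assumes the forwarder marks split chunks with partial-message metadata (similar to Docker's ~16 KB line split); the field names are made up for illustration and are not taken from the actual Grafana/Loki setup.

```ts
// Illustrative shape of a forwarded log record; field names are assumptions.
interface LogChunk {
  partialId?: string; // shared id for all chunks of one original line
  ordinal?: number;   // position of the chunk within the original line
  isLast?: boolean;   // true for the final chunk of a split line
  line: string;       // chunk content
}

// Reassemble split lines: pass non-partial records through, buffer partial
// chunks by id, and join them in order once the last chunk arrives.
function reassemble(chunks: LogChunk[]): string[] {
  const buffers = new Map<string, LogChunk[]>();
  const out: string[] = [];
  for (const c of chunks) {
    if (!c.partialId) {
      out.push(c.line); // not split, pass through unchanged
      continue;
    }
    const parts = buffers.get(c.partialId) ?? [];
    parts.push(c);
    buffers.set(c.partialId, parts);
    if (c.isLast) {
      parts.sort((a, b) => (a.ordinal ?? 0) - (b.ordinal ?? 0));
      out.push(parts.map((p) => p.line).join(""));
      buffers.delete(c.partialId);
    }
  }
  return out;
}
```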
#215 might fix this issue. We need to redeploy the Signed APIs and watch for a few days to see whether there are fewer errors. I don't want to close the issue before I see the errors gone.
I've checked the Grafana logs after redeploying with 0.5.1. Before the mentioned fix, there were ~300 errors during a 12-hour period; after the fix it's ~90, which is a significant reduction but not a complete fix. I also noticed that 0.6.0 has more of these errors, but @bdrhn9 pointed out this may be because of the increased bandwidth from logging the data. I think this is enough to warrant closing the issue. To get rid of the 502 errors completely, we may need to get rid of the ELB.
One of the Airnode feed issues is 502 Bad Gateway rejections by the AWS ELB. See the query here.
The Signed APIs seem to be underutilized based on the CPU and memory metrics. This should be investigated and fixed. Note that these rejections happen sporadically and don't seem to impact the system that much.