CI sometimes fails due to timeout on query to service #28
Additional logging and further investigation seem to suggest that this problem arises when the background daemon of mock_uss's atproxy_client functionality stops working. When this happens, a query to atproxy hangs for the timeout limit (currently 1 minute) and then atproxy responds with a 500 indicating that the handler client didn't handle the request. Looking at recent logs, I see these logs from mock_uss with atproxy_client when a getCapabilities request ends up timing out:
The last log line is printed from here, and a successful fulfillment would have then later printed one of the lines in this if block, indicating that this query is responsible for the daemon failure. And indeed, the container is later populated with a stacktrace indicating a connection timeout to port 8075 of host.docker.internal. Port 8075 is the canonical local port for atproxy, confirming that the mock_uss atproxy_client daemon is timing out trying to respond back to atproxy with a result.
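To make that failure mode concrete, here is a minimal sketch (not the actual mock_uss code; the endpoint path and function name are illustrative assumptions) of the kind of callback the atproxy_client daemon makes, with an explicit bounded timeout so it fails fast instead of hanging past atproxy's 1-minute limit:

```python
import requests

ATPROXY_BASE_URL = "http://host.docker.internal:8075"  # canonical local atproxy port seen in the stacktrace


def deliver_result(query_id: str, fulfillment: dict) -> bool:
    """Report a handled query's result back to atproxy.

    Sketch only: the endpoint path is hypothetical.  The point is the explicit
    (connect, read) timeout, which keeps the daemon from hanging while atproxy's
    own 1-minute limit expires and it returns a 500 to the original caller.
    """
    try:
        resp = requests.put(
            f"{ATPROXY_BASE_URL}/handler/queries/{query_id}",  # hypothetical path
            json=fulfillment,
            timeout=(5, 25),  # seconds; both well under atproxy's limit
        )
        return resp.status_code == 200
    except requests.RequestException:
        return False
```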
This failure appears to not be due to atproxy or mock_uss's atproxy_client functionality. Instead, a call to the scdsc mock_uss (port 8074) to inject a flight intent timed out. uss_qualifier indicates that the request to inject flight 8b4a497a-825b-47dc-a98a-581320dcd562 was initiated at 16:32:01.04, and the failed (timed-out) result reported at 16:33:01.10. The only other mention of 8b4a497a-825b-47dc-a98a-581320dcd562 in uss_qualifier's logs is port-8074 mock_uss's response to a later request to clear the area where it indicates flight 8b4a497a-825b-47dc-a98a-581320dcd562 was deleted. The logs of port-8074 mock_uss with scdsc functionality pause at 16:31:58.12 (
This is apparently a very difficult problem to diagnose; see my comment here for some additional diagnostic work I've done. After having this problem occur so often on my local machine that I was essentially unable to complete the
In addition to @BenjaminPelletier's investigation, we have been running the CI job periodically for the past week to gather data here. Please note that when the job fails, a tcpdump file which can be opened with Wireshark is available in the job's artifacts.
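For anyone who prefers to scan those artifacts programmatically rather than in Wireshark, here is a minimal sketch (assuming scapy is installed; the filename is illustrative) that flags likely TCP retransmissions in such a capture:

```python
from collections import Counter

from scapy.all import IP, TCP, rdpcap  # pip install scapy

packets = rdpcap("ci_failure.pcap")  # illustrative filename; use the pcap from the job's artifacts

seen = Counter()
retransmissions = 0
for pkt in packets:
    if IP in pkt and TCP in pkt:
        # A repeated (src, dst, ports, seq) tuple is a rough indicator of a retransmitted segment.
        key = (pkt[IP].src, pkt[IP].dst, pkt[TCP].sport, pkt[TCP].dport, pkt[TCP].seq)
        if seen[key]:
            retransmissions += 1
        seen[key] += 1

print(f"{retransmissions} likely retransmissions out of {len(packets)} packets")
```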
After disabling atproxy in #126, I observed this failure today:
For the record, we had the following happen very consistently today:
After a huge number of overall failures of the F3411-22a CI task on #170, I took a closer look at the logs and think this may be valuable:
These are the logs of mock_uss_scdsc_b, located (in this PR) at scdsc.uss2.localutm. Here's my narration:
Thoughts:
So, it seems like there are some low-hanging improvements:
Some additional investigations on my end. My gut feeling is that we are hitting some kind of network limit at a deeper level. Analyzing a tcpdump capture of a failed run:
Then by trying to monitor some network parameters with
This continues to be a serious issue, causing perhaps 50% of CI runs to fail recently. I am considering trying podman rather than docker to see if that might make things better. Other debugging attempts are welcome as well, for instance configuring no socket reuse.
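As an illustration of the "no socket reuse" idea (a sketch only, not how uss_qualifier currently issues queries; the URL is hypothetical), each HTTP request can be forced onto a fresh TCP connection by disabling keep-alive:

```python
import requests

# Sending "Connection: close" makes the server tear the socket down after each
# response, so no pooled/keep-alive connection is ever reused for a later query.
session = requests.Session()
session.headers["Connection"] = "close"

resp = session.get(
    "http://localhost:8074/scdsc/v1/status",  # hypothetical URL for the port-8074 mock_uss
    timeout=60,
)
print(resp.status_code)
```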
After digging into several leads, I've become convinced that the issue is caused (at least partially) by Gunicorn, very probably triggered by the high-volume traffic. Now that there is no NAT involved anymore, the network dumps are much easier to analyze; that was not the case in my previous debugging attempt, since @BenjaminPelletier has since reworked the deployed docker containers to talk directly to each other on the docker virtual network rather than going through the host network stack + NAT.

One thing I've been able to pinpoint by analyzing the dumps is that all timeouts are due to Gunicorn essentially freezing when handling a request. Sometimes we see other timeouts or errors, but whenever that happened, the root cause could be traced back to such a Gunicorn freeze. The dumps show that the client's request is sent properly to the container running Gunicorn, and the container does send an ACK back confirming it received the request (so this is NOT a lost TCP packet), but then absolutely nothing happens inside the container / on the Gunicorn side: no log, no request processing, nothing.
Additional relevant links:
With all this, all that might be needed is a change to the Gunicorn configuration. I have an open draft PR (Orbitalize#8) where I've been trying out a number of things. All those different changes are stacked on top of each other, so the Gunicorn config change might not be the only change needed. With the changes made there, so far all CI executions have been successful without any timeout, currently on attempt no. 8. I will run it a few more times to be more certain that this fixes the issue. Fingers crossed... The config change in question is about using a different worker class (
A sketch of what such a worker-class change could look like is included at the end of this comment.

Other leads followed

Network issues in GitHub Actions VMs

A lead I followed is issues with the networking in the VMs used by GitHub Actions. I had found several reported issues that could look similar to ours, but in the end I do not believe this was it:
Docker virtual networks issues

Another credible lead was those system logs in the VM that appeared many times over the execution of the tests:
Those were correlated with the numerous creations/deletions of docker containers. Reports found online hinted at issues with interactions between the VM network stack and the Docker virtual networks. Some links relevant to that:
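As referenced above, here is a minimal sketch of what a worker-class change in a Gunicorn config file could look like. The class and values actually chosen in Orbitalize#8 are not shown here, so gevent and the numbers below are only illustrative assumptions:

```python
# gunicorn.conf.py (sketch): switch from the default synchronous worker to an
# asynchronous worker class so a single stuck request cannot tie up a worker
# for every other client.  Class and counts are illustrative, not the values
# used in the draft PR.
worker_class = "gevent"        # requires the gevent package to be installed
workers = 2                    # a couple of worker processes for redundancy
worker_connections = 100       # concurrent connections each gevent worker may handle
timeout = 120                  # kill and restart a worker that stops responding
graceful_timeout = 30
keepalive = 5
```

Gunicorn would pick this file up when started with `gunicorn -c gunicorn.conf.py <app module>`.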
This is awesome and we should definitely implement any changes that have been shown to reduce the problem (or any problem, really). It seems like perhaps this problem may have multiple different root causes manifesting with the same observation (lost requests). When I was able to reproduce very reliably at one point (though I am no longer able to reproduce like this), I was able to use an unconfigured nginx container to produce the symptoms, so I don't think that instance was due to gunicorn (rather, something Docker).
That is interesting! That is a lead I had pursued initially, and as such, in the test PR in question I did try out several things which were still active when I concluded it was the Gunicorn issue:
In addition, the change that creates the container running the tests only once was also still active. So indeed, there could be different causes? I will start with the Gunicorn change and, if issues still appear, will proceed with the other changes.
With zero observed instances of this issue since @mickmis's fix, I'm going to declare this issue resolved. Huge thanks for the fix!
* Improve CI documentation
* Remove reference to #28
* Fix link
Example
Sometimes, the CI checks fail due to a timeout when uss_qualifier queries a service at port 8074; the error message is:
When the CI is immediately rerun, this error usually disappears.
The service at port 8074 in the standard local deployment (used by CI) is mock_uss providing scdsc capabilities. Unfortunately, the details of this query attempt are not currently captured.
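Until the tooling captures this automatically, here is a minimal sketch (names and the reporting approach are illustrative assumptions, not uss_qualifier's actual query machinery) of the kind of detail worth recording around such a query so a timed-out attempt can be diagnosed after the fact:

```python
import logging
import time
from typing import Optional

import requests

logger = logging.getLogger("ci_query_diagnostics")  # illustrative logger name


def query_with_diagnostics(url: str, json_body: dict, timeout_s: float = 60) -> Optional[requests.Response]:
    """Issue a query and log enough detail (URL, elapsed time, outcome) to
    diagnose a timed-out attempt like the port-8074 one afterwards."""
    start = time.monotonic()
    try:
        resp = requests.put(url, json=json_body, timeout=timeout_s)
        logger.info("PUT %s -> %d in %.2fs", url, resp.status_code, time.monotonic() - start)
        return resp
    except requests.Timeout:
        logger.error("PUT %s timed out after %.2fs", url, time.monotonic() - start)
        return None
```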