Connection closed when request is attempted #272
Comments
The idle times you're seeing there are based on monotonic time, which is why the values look so odd. What you're seeing here makes sense given the logic of the HTTP/1 pool: a connection is checked out, Finch determines that it has passed the specified idle time, executes the telemetry event, and closes the connection. I suspect the reason you were seeing more of these errors in the past is that connections were being closed by the server while Finch believed it still had an open connection when in fact it didn't. The max idle time helps avoid this because it causes Finch to close connections more often. But somewhere in the chain of events, information is being lost or not handled properly. The same thing has plagued people in the past, and as far as I know, no one has been able to reliably reproduce it or get a packet capture or similar to see exactly what's going on.
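To make these idle-time events easier to correlate with the failing requests, a telemetry handler can be attached and its output forwarded to logs. This is only a minimal sketch: the module name is made up, and the exact event names have changed between Finch versions, so they should be checked against the `Finch.Telemetry` docs for the installed release.

```elixir
defmodule MyApp.FinchIdleLogger do
  # Sketch: log Finch's idle-time telemetry so idle_time values can be
  # correlated with failing requests. The event names below are assumptions;
  # verify them against Finch.Telemetry for the version in use.
  require Logger

  @events [
    [:finch, :conn_max_idle_time_exceeded],
    [:finch, :pool_max_idle_time_exceeded]
  ]

  def attach do
    :telemetry.attach_many("finch-idle-logger", @events, &__MODULE__.handle_event/4, nil)
  end

  def handle_event(event, measurements, metadata, _config) do
    # idle_time is likely reported in native time units.
    Logger.info(
      "finch idle event=#{inspect(event)} " <>
        "idle_time=#{inspect(measurements[:idle_time])} " <>
        "metadata=#{inspect(Map.take(metadata, [:scheme, :host, :port]))}"
    )
  end
end
```

`MyApp.FinchIdleLogger.attach()` would typically be called once from the application's `start/2` callback.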
Thank you for your response and for explaining monotonic time. Well, if I'm understanding things correctly, 115 seconds is more than the `pool_max_idle_time` I configured. However, the documentation states:
Another question: can Finch actually detect that the connection is closed without attempting an HTTP request, moving to receive, and returning an error? Can Finch just check out another connection in that case, or does it monitor connections? Edit: Also, it's still not clear to me why the pool is closed in this case. It obviously wasn't idle, since the connection was checked out more than a second before the pool was closed.
In theory, yes. When the connection is checked into the pool, the pool monitors incoming messages and TCP events and updates the sockets accordingly. What seems to be happening is something like this:
It's never been entirely clear how this can happen. There's a long issue where lots of folks tried to work out the root cause, but it was never really found: #62.
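To illustrate the passive-monitoring idea described above (this is a sketch of the mechanism, not Finch's or NimblePool's actual code): a process holding an idle socket in active mode is told about a remote close via a message, but only if a FIN or RST actually reaches it. If a middlebox drops the connection silently, no message ever arrives and the socket still looks open.

```elixir
defmodule IdleConnWatcher do
  # Illustration only: how a pool-like process can learn about a remote close
  # while a connection sits idle. Not Finch's actual implementation.
  use GenServer

  def start_link(socket), do: GenServer.start_link(__MODULE__, socket)

  @impl true
  def init(socket) do
    # In active mode, closes and errors are delivered to this process as messages.
    :ok = :inet.setopts(socket, active: true)
    {:ok, %{socket: socket, open?: true}}
  end

  @impl true
  def handle_info({:tcp_closed, socket}, %{socket: socket} = state) do
    # The peer sent a FIN; the pool can discard this connection.
    {:noreply, %{state | open?: false}}
  end

  def handle_info({:tcp_error, socket, _reason}, %{socket: socket} = state) do
    # An RST or other error surfaced; also discard.
    {:noreply, %{state | open?: false}}
  end
end
```

The failure mode discussed in this thread is precisely the case where neither message arrives, so the pool's bookkeeping still says the connection is open.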
Just to add a bit more: @salex89 if there's any way you can get a packet capture of this happening, that would go a long way towards determining the root of the issue.
I talked to our ops team and we should be able to do that, at least on our staging environment, which should be enough since we happen to see these events there. Do you have any suggestion on which monitoring/capture tool to use? Maybe
@salex89 awesome! either
I had a similar issue. I'm not sure if this is exactly the same thing, but here is what happened:
What happened was:
Looks like it should be possible to detect this behavior, since AWS says an RST packet is sent before closing the connection.
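If the NAT's documented 350-second idle timeout is the cause, one mitigation is to make sure pooled connections are recycled well before that limit. This is a sketch for the application's supervision tree; `MyFinch` and the numbers are only examples, and the option names should be checked against `Finch.start_link/1` for the version in use.

```elixir
# Sketch: keep pooled connections from outliving a NAT's idle timeout
# (AWS NAT Gateway documents 350 s). Names and values are examples only.
children = [
  {Finch,
   name: MyFinch,
   pools: %{
     default: [
       size: 25,
       # recycle connections after 5 minutes of idleness, under the NAT's 350 s
       conn_max_idle_time: 300_000,
       # shut down a pool that has gone completely unused for 10 minutes
       pool_max_idle_time: 600_000
     ]
   }}
]

Supervisor.start_link(children, strategy: :one_for_one)
```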
Our setup is also behind an AWS NAT, but it looks like the |
This might be a re-hash of these and other similar issues:
#62
elixir-tesla/tesla#467
However, I indeed think something strange is happening, along with the other folks who commented on similar issues, some of them very recent and closely resembling my experience. The issue is that certain requests fail with a `{:error, "socket closed"}` error without actually being executed at all, and not due to any real connectivity issue. As a side note, the `{:error, "socket closed"}` result is just the `{:error, %Mint.TransportError{reason: :closed}}` returned by Finch, wrapped by the Tesla adapter. But that's beside the point; it only suggests the adapter is an unlikely culprit.

These issues were much more common before I set the `pool_max_idle_time` to `60_000`, down from the default `:infinity`, like 10 times more common. Yet the error still happens. I set up telemetry to try to gather more data on the issue. After capturing a few such cases, this is the pattern I usually see (telemetry events forwarded to logs):

Please notice two things. First, the `idle_time` is always a very large number when this happens, the equivalent of 3.7 years; successful requests have an expected `idle_time` of less than 60 seconds. Second, I also noticed (I believe always) that there is a "Finch telemetry: pool idle timeout" event when this error happens, essentially immediately, a millisecond before the connection closes. I didn't have telemetry from back when the `pool_max_idle_time` was at its default (`:infinity`), but the number of such errors was greater then. It's a bit hard for me to revert it back to `:infinity` and test now, because this code is deployed to an environment where it's being used, and that would potentially increase the error rate downstream. We also need some traffic to make it happen, but if it would help debugging, I could try to make it work.

If the pool shuts down/restarts because of the idle time, I don't think it should have done that. There was more than a second between the request and the pool being closed on itself, so it should have registered that it's still in use and stayed alive with its connections open. On the flip side, I don't have an explanation for the idle time, though, as it was a large number to begin with.
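As an aside on the monotonic-time point raised earlier: an idle duration is normally the difference between two monotonic readings, so a huge value like the one above suggests one of the readings is missing or stale. The following illustrates the arithmetic only and is not Finch's actual bookkeeping.

```elixir
# Illustration only: how an idle duration is normally derived from monotonic
# time. A single raw monotonic reading is meaningless as a duration; only the
# difference between two readings is.
checked_in_at = System.monotonic_time()
Process.sleep(50)
checked_out_at = System.monotonic_time()

idle_ms = System.convert_time_unit(checked_out_at - checked_in_at, :native, :millisecond)
IO.puts("connection sat idle for ~#{idle_ms} ms")
```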
We see this a few times each day, towards different hosts, maybe 2-6 events daily, on a mildly utilised system; we expect more traffic soon. Also, we used Hackney before migrating to Finch and didn't see similar issues, and looking back at it, I don't think Hackney did any retries by itself. We migrated to Finch because of the more advanced connection pool, but we had to build a lot of defensive logic around this issue which we didn't previously need. Which is kind of fine, since connections can be broken or unstable and that should be handled, but it still bothers me because these are not real connectivity issues, in my opinion :-). I would rather improve on this and keep Finch than migrate back to Hackney.
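For reference, the kind of defensive logic described above can be expressed with Tesla's retry middleware. This is only a sketch under the assumption that the failing requests are safe to retry; `MyApp.Client`, the base URL, and `MyFinch` are made-up names, and it retries solely on the transport-level closed error rather than on every failure.

```elixir
defmodule MyApp.Client do
  use Tesla

  # Sketch of defensive retry logic: retry only when the adapter reports the
  # connection as closed, i.e. the request most likely never reached the server.
  plug Tesla.Middleware.Retry,
    delay: 100,
    max_retries: 2,
    should_retry: fn
      {:error, %Mint.TransportError{reason: :closed}} -> true
      {:error, "socket closed"} -> true
      _other -> false
    end

  plug Tesla.Middleware.BaseUrl, "https://api.example.com"

  adapter Tesla.Adapter.Finch, name: MyFinch
end
```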
Please advise what additional information I can send. I'll continue looking on my side, but I'm not familiar with Finch's source code yet.
Edit: I've just seen issue #269, and it really resembles my case; we're indeed using OTP 26.1.2. But the OP there had issues with every connection to the affected host, whereas mine are haphazard.