Fix race condition on ssh client connection startup #7549
CT Test Results: 2 files, 29 suites, 30m 58s. For more details on these failures, see this check. Results for commit 7d8e6a9. This comment has been updated with latest results. To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass. See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally. (Erlang/OTP Github Action Bot)
Please change the target branch to maint. I think we would like to have this fix in OTP-26.2.
There were 2 race conditions that can happen on the startup of an SSH client connection:
- two processes race to create the ssh_system_sup for a given address, which can cause the start of the ssh_system_sup to return {error, {already_started, Pid}};
- one process races to create the ssh_system_sup while it is being brought down because all significant processes have exited.
This commit fixes these race conditions and additionally fixes a third issue that would happen when two processes race to create the ssh_system_sup: an ssh_acceptor_sup would be created for the address the client is trying to connect to.
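As a generic illustration of the first race (this is not the ssh code): an OTP supervisor refuses to start a second child whose id matches a child that is already running, and answers {error, {already_started, Pid}}. The module name, the child id some_address and the idle/0 helper are hypothetical; the sketch only assumes that the ssh_system_sup is started as a supervisor child keyed by the address, as the description above suggests.

```erlang
-module(already_started_demo).
-behaviour(supervisor).
-export([run/0, init/1, idle/0]).

%% Supervisor callback: start with no children.
init([]) ->
    {ok, {#{strategy => one_for_one}, []}}.

%% Hypothetical child: a process that just waits for a stop message.
idle() ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.

run() ->
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    %% The id plays the role of the address in the description above.
    Spec = #{id => some_address, start => {?MODULE, idle, []}},
    {ok, Pid} = supervisor:start_child(Sup, Spec),
    %% A second start with the same id is rejected, and the reply
    %% carries the pid of the child that won.
    {error, {already_started, Pid}} = supervisor:start_child(Sup, Spec),
    ok.
```

In a real race the two start_child calls come from different processes, so whichever arrives second sees the already_started reply.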
b5969be to 7d8e6a9
@u3s I think I have done it now.
BTW, since this is a bug that exists in other releases, for instance OTP 25, can it be backported there as well? Is that something you have to do, or can I help with that as well?
Yes it can be backported to OTP 25.
Not really. The way of working is that it first has to be merged to [...] Sorry for the delays on that one. I'm working on it but got disrupted last week.
In our tests the SSH server is also implemented in Erlang. This is indeed hard to reproduce unless you somehow influence the interleaving of the 2 processes. If you just want to see the error, you can use the debugger. If you'd like to have an automatic test, then the only way that comes to my mind is to use Concuerror.
Sorry for the delay. I've created a diagram of the ssh supervision tree.
I will refer to it as RACE1
I will refer to it as RACE2
I will refer to it as RACE3
Can you explain, and maybe verify, which TCP sockets are used for the connections?
I think RACE1 can only happen if ssh attempts are made for colliding sockets. This means the same local address, local port and profile (referred to as Address in the code, since it holds the address record) are used more than once.
I currently don't think re-using SysPid on the client side is a good idea. Instead, I think a new connection attempt should be made by user code (hoping that a better socket will be provided).
Some ideas for more checks:
- check SO_REUSEPORT and related OS settings
- monitor socket related stats on OS level and match them with occurrence of RACE1
- trace ssh_system_sup:start_system/3 and check if the same Address (including port) is used more than once
- run a modified ssh_system_sup so that in case of already_started more information is gathered (like netstat -na | grep Port or equivalent)
- maybe socket collisions could be monitored in your environment in some other way?
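The tracing idea from the list above could be sketched with the standard dbg module. This is only a starting point, assuming the trace is run on the node that opens the client connections; the module name trace_start_system is hypothetical.

```erlang
-module(trace_start_system).
-export([start/0, stop/0]).

%% Print every call to ssh_system_sup:start_system/3 together with its
%% return value (or exception), so that a repeated Address can be spotted.
start() ->
    _ = dbg:tracer(),      % default tracer, prints trace messages to the shell
    _ = dbg:p(all, c),     % enable call tracing in all processes
    %% tpl also covers non-exported functions; the built-in alias x
    %% adds return values and exceptions to the trace output.
    dbg:tpl(ssh_system_sup, start_system, 3, x).

%% Stop tracing once the failure has been reproduced.
stop() ->
    dbg:stop().
```

Calling trace_start_system:start() before reproducing the failure and comparing the Address argument of consecutive calls should show whether the same local address and port are used twice.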
{error, {already_started, SysPid}} ->
    %% There was other connection that created the supervisor while
    %% this process was trying to create it as well
    {ok, SysPid};
I think we might get into trouble here. Each connection on the client side (identified by socket local address and local port) is expected to have its own ssh_system_sup process.
I'm afraid this might create unexpected behavior, because more than 1 connection will be using the same ssh_system_sup. Was RACE2 observable without the adjustment for RACE1?
I totally missed that the address had the local port and not the destination port. Good catch. Given that, I think I am back to the drawing board with RACE2.
I haven't actually tried to run just the adjustments for RACE1 and RACE3 without RACE2. I can try that, but I think it will take me quite some time to get results. One thing is certain: we were getting already_started errors from the ssh subsystem and we no longer get those after these changes. And even though I don't have the logs any longer, I am quite certain that the already_started error came from start_system.
Shall I remove the changes regarding RACE2 from this PR, so we can proceed with the fixes for RACE1 and RACE3?
Can you also explain:
I am sorry for the delay in replying. My email client was not fetching emails from the address I use for GitHub, so I missed the notifications.
@@ -129,13 +131,30 @@ start_subsystem(Role, Address=#address{}, Socket, Options0) ->
            supervisor:terminate_child(SysPid, Id),
            {error, connection_start_timeout}
    end;
{error,noproc} ->
@u3s do you also agree that this change is needed to fix RACE1?
Sorry for the delay. I hope to come back to this soon.
I've created PR-8107, which won't start an acceptor for ssh when in the client role.
It was decided to postpone work related to the ssh supervision tree.
Let's close this PR. If I manage to gather more information about the remaining issues I will come back to you.
OK. In general I agree there are issues with the supervision tree that appear in some corner cases.
Replaced with #8766.