Fix race condition on ssh client connection startup #7549

alexandrejbr · 2023-08-11T05:18:50Z

There were 2 race conditions that can happen on the startup of a SSH client connection:

two process race to create the ssh_system_sup for a given address, which can cause the start of the ssh_system_sup to return {error, {already_started, Pid}};
one process race to create the ssh_system_sup when it's being brought down due to all significant processes have exited.

This commit fixes these race conditions and additionally fixes a third issue that would happen when two processes race to create the ssh_system_sup, an ssh_acceptor_sup would be created for the address the client is trying to connect.

CLAassistant · 2023-08-11T05:18:55Z

All committers have signed the CLA.

github-actions · 2023-08-11T05:19:47Z

CT Test Results

      2 files     29 suites 30m 58s ⏱️
  453 tests   446 ✔️   5 💤 2 ❌
1 667 runs 1 638 ✔️ 27 💤 2 ❌

For more details on these failures, see this check.

Results for commit 7d8e6a9.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

u3s · 2023-09-29T14:13:23Z

please change the target branch to maint. I think, we would like to have a fix in OTP-26.2.

There were 2 race conditions that can happen on the startup of a SSH client connection: - two process race to create the ssh_system_sup for a given address, which can cause the start of the ssh_system_sup to return {error, {already_started, Pid}}; - one process race to create the ssh_system_sup when it's being brought down due to all significant processes have exited. This commit fixes these race conditions and additionally fixes a third issue that would happen when two processes race to create the ssh_system_sup, an ssh_acceptor_sup would be created for the address the client is trying to connect.

alexandrejbr · 2023-10-03T09:38:17Z

@u3s I think I have done it now.

alexandrejbr · 2023-10-03T09:40:35Z

BTW since this is a bug that exists in other releases, for instance OTP 25, can it be backported there as well?

Is that something you have to do or can I help as well with that?

u3s · 2023-10-09T10:14:38Z

BTW since this is a bug that exists in other releases, for instance OTP 25, can it be backported there as well?

Yes it can be backported to OTP 25.

Is that something you have to do or can I help as well with that?

Not really. Way of working is that first it has to be merged to maint and later OTP team prepares a patch.

Sorry for delays on that one. I'm working on it but got disrupted last week.
I tried to reproduce some of those race issues but without success. Is there Erlang ssh server in your setup?

alexandrejbr · 2023-10-09T14:16:36Z

In our tests the SSH server is also implemented in Erlang.

This is indeed hard to reproduce unless you somehow influence the interleaving of the 2 processes. If you just want to see the error you can use the debugger. If you'd like to have an automatic test, then the only way that comes to my mind is to use Concuerror.

u3s · 2023-11-03T16:00:52Z

sorry for delay. I've created a diagram for ssh supervision tree
please note that on client side:

each connection has its own supervision tree starting from ssh_system_sup
ssh_system_sup is registered with id being address record (defined in ssh.hrl) - record combines: local address, local port and profile

There were 2 race conditions that can happen on the startup of a SSH client connection:

two process race to create the ssh_system_sup for a given address, which can cause the start of the ssh_system_sup to return {error, {already_started, Pid}};

I will refer to it as RACE1

one process race to create the ssh_system_sup when it's being brought down due to all significant processes have exited.

I will refer to it as RACE2

This commit fixes these race conditions and additionally fixes a third issue that would happen when two processes race to create the ssh_system_sup, an ssh_acceptor_sup would be created for the address the client is trying to connect.

I will refer to it as RACE3

u3s

Can you explain and maybe verify TCP sockets used for connections?
I think RACE1 can only happen if ssh attempts are made for colliding sockets - this means local address, local port and profile (referred as Address in code since it holds address record) is used more than once.

I currently don't think re-using SysPid on client side is a good idea. I think that instead new connection attempt should be made by user code (hoping that better socket will be provided).

Some ideas for more checks:

check SO_REUSEPORT and related OS settings
monitor socket related stats on OS level and match them with occurrence of RACE1
trace ssh_system_sup:start_system/3 and check if same Address(including port) is used more than once?
running modified ssh_system_sup so that in case of already_started more information is gathered (like netstat -na | grep Port or equivalent)
maybe socket collisions could be monitored in your environment in some other way ?

lib/ssh/src/ssh_system_sup.erl

u3s · 2023-11-03T16:35:54Z

lib/ssh/src/ssh_system_sup.erl

+                {error, {already_started, SysPid}} ->
+                    %% There was other connection that created the supervisor while
+                    %% this process was trying to create it as well
+                    {ok, SysPid};


I think we might get into troubles here. Each connection on client side (identified by socket local address and local port) is expected to have its own ssh_system_sup process.

I'm afraid this might create unexpected behaviors, because more than 1 connection will be using same ssh_system_sup. Was RACE2 observable without adjustment for RACE1?

I totally missed that address had the local port and not the destination port. Good catch. Given that I think I am back to the drawing board with the RACE2.

I haven't actually tried just to run the adjustments for RACE1 and RACE3 without RACE2. I can try that but it will take me quite some time to get results I think. One thing is for certain, we were getting already_started errors from the ssh subsystem and we no longer get those after these changes. And even though I don't have the logs any longer I am quite certain that the already_started error came from start_system.

Shall I remove the changes regarding RACE2 from this PR, so we can proceed with the fixes for RACE1 and RACE3?

u3s · 2023-11-06T14:46:06Z

Can you also explain:

is the given address meaning - there is single ssh deamon listening on single port and handling multiple connection attempts?
are sockets provided to ssh as an argument or ssh receives Host+port and establishes TCP connection internally?
is it Linux environment?
is loopback interface or 127.0.0.1 used as ssh connection target ?

alexandrejbr · 2023-12-05T14:52:13Z

Can you also explain:

is the given address meaning - there is single ssh deamon listening on single port and handling multiple connection attempts?

are sockets provided to ssh as an argument or ssh receives Host+port and establishes TCP connection internally?

is it Linux environment?

is loopback interface or 127.0.0.1 used as ssh connection target ?

I am a bit confused with the first part of your question, these errors happen when the client tries to connect to some ssh servers. Are you asking me about the ssh daemon of the servers?
Host+port and establishes the TCP connection internally;
yes, I am not sure I ever reproduced this one on my Mac, but the noproc one I did.
yes.

I am sorry for the delay replying, my email client was not fetching emails from the email address I use for GitHub so I missed the notifications.

alexandrejbr · 2023-12-05T14:53:23Z

lib/ssh/src/ssh_system_sup.erl

@@ -129,13 +131,30 @@ start_subsystem(Role, Address=#address{}, Socket, Options0) ->
                            supervisor:terminate_child(SysPid, Id),
                            {error, connection_start_timeout}
                    end;
+                {error,noproc} ->


@u3s do you also agree that this change is needed to fix RACE1?

u3s · 2024-01-03T15:45:32Z

sorry for delay. I hope to come back to this soon.

u3s · 2024-02-09T13:36:08Z

I've created PR-8107 which won't start acceptor for ssh when in client role.

u3s · 2024-02-21T10:01:22Z

It was decided to postpone work related to ssh supervision tree.
I would like to revise how supervision tree is implemented and then decide how it should be fixed.

alexandrejbr · 2024-02-22T08:35:57Z

Let's close this PR. If I manage to gather more information about the remaining issues I will come back to you.

u3s · 2024-02-22T09:58:42Z

OK. In general I agree there are issues with supervision tree appearing in some border cases.
But I'm not sure if the solutions proposed at this stage are the optimal ones ... and need more time for checking that.

u3s · 2024-09-10T12:12:21Z

replaced with #8766

alexandrejbr mentioned this pull request Aug 11, 2023

Race conditions on startup of an ssh connection with the client role #7550

Closed

rickard-green added the team:PS Assigned to OTP team PS label Aug 14, 2023

IngelaAndin added the testing currently being tested, tag is used by OTP internal CI label Aug 16, 2023

u3s self-assigned this Aug 16, 2023

u3s linked an issue Aug 22, 2023 that may be closed by this pull request

Race conditions on startup of an ssh connection with the client role #7550

Closed

IngelaAndin added the Planned Focus issue added in sprint planning label Aug 25, 2023

u3s added this to the OTP-26.2 milestone Sep 29, 2023

u3s removed the testing currently being tested, tag is used by OTP internal CI label Sep 29, 2023

alexandrejbr changed the base branch from master to maint October 3, 2023 09:14

alexandrejbr force-pushed the fix-ssh-connect-races branch from b5969be to 7d8e6a9 Compare October 3, 2023 09:37

u3s added the testing currently being tested, tag is used by OTP internal CI label Oct 9, 2023

u3s requested changes Nov 3, 2023

View reviewed changes

u3s removed the testing currently being tested, tag is used by OTP internal CI label Nov 3, 2023

alexandrejbr commented Dec 5, 2023

View reviewed changes

IngelaAndin added the stalled waiting for input by the Erlang/OTP team label Dec 12, 2023

garazdawi removed this from the OTP-26.2 milestone Dec 15, 2023

u3s removed the stalled waiting for input by the Erlang/OTP team label Feb 6, 2024

u3s added the priority:low label Feb 21, 2024

alexandrejbr closed this Feb 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix race condition on ssh client connection startup #7549

Fix race condition on ssh client connection startup #7549

alexandrejbr commented Aug 11, 2023

CLAassistant commented Aug 11, 2023 •

edited

Loading

github-actions bot commented Aug 11, 2023 •

edited

Loading

u3s commented Sep 29, 2023

alexandrejbr commented Oct 3, 2023

alexandrejbr commented Oct 3, 2023

u3s commented Oct 9, 2023 •

edited

Loading

alexandrejbr commented Oct 9, 2023

u3s commented Nov 3, 2023

u3s left a comment

u3s Nov 3, 2023

alexandrejbr Dec 5, 2023

u3s commented Nov 6, 2023 •

edited

Loading

alexandrejbr commented Dec 5, 2023

alexandrejbr Dec 5, 2023

u3s commented Jan 3, 2024

u3s commented Feb 9, 2024

u3s commented Feb 21, 2024

alexandrejbr commented Feb 22, 2024

u3s commented Feb 22, 2024 •

edited

Loading

u3s commented Sep 10, 2024

Fix race condition on ssh client connection startup #7549

Fix race condition on ssh client connection startup #7549

Conversation

alexandrejbr commented Aug 11, 2023

CLAassistant commented Aug 11, 2023 • edited Loading

github-actions bot commented Aug 11, 2023 • edited Loading

CT Test Results

Artifacts

u3s commented Sep 29, 2023

alexandrejbr commented Oct 3, 2023

alexandrejbr commented Oct 3, 2023

u3s commented Oct 9, 2023 • edited Loading

alexandrejbr commented Oct 9, 2023

u3s commented Nov 3, 2023

u3s left a comment

Choose a reason for hiding this comment

u3s Nov 3, 2023

Choose a reason for hiding this comment

alexandrejbr Dec 5, 2023

Choose a reason for hiding this comment

u3s commented Nov 6, 2023 • edited Loading

alexandrejbr commented Dec 5, 2023

alexandrejbr Dec 5, 2023

Choose a reason for hiding this comment

u3s commented Jan 3, 2024

u3s commented Feb 9, 2024

u3s commented Feb 21, 2024

alexandrejbr commented Feb 22, 2024

u3s commented Feb 22, 2024 • edited Loading

u3s commented Sep 10, 2024

CLAassistant commented Aug 11, 2023 •

edited

Loading

github-actions bot commented Aug 11, 2023 •

edited

Loading

u3s commented Oct 9, 2023 •

edited

Loading

u3s commented Nov 6, 2023 •

edited

Loading

u3s commented Feb 22, 2024 •

edited

Loading