Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition with OTP ktls implementation and inet_driver #8561

Closed
miriampena opened this issue Jun 11, 2024 · 13 comments · Fixed by #8690
Closed

Race condition with OTP ktls implementation and inet_driver #8561

miriampena opened this issue Jun 11, 2024 · 13 comments · Fixed by #8690
Assignees
Labels
bug Issue is reported as a bug team:PS Assigned to OTP team PS

Comments

@miriampena
Copy link

miriampena commented Jun 11, 2024

Describe the bug
When we test OTP ktls by performing +100 consecutive node disconnections and net_adm:ping(Node) we run into a situation in which we receive ssl raw packages in the acceptor after we do the OTP ktls handover.

What happens internally is that the inet_driver buffer receives messages from the client before we set the last ktls options on the socket, so it is unable to do the decoding of the data that is already on the buffer.

To Reproduce
Open two terminals, terminal one as server, terminal 2 as client:

<copy on terminal 1>

{ok,S}=gen_tcp:accept(element(2,gen_tcp:listen(33333,[inet6,binary,{reuseaddr,true}]))).
inet:setopts(S, [{packet, ssl_tls}, {active, 1}]).

<copy terminal 2>

{ok,S}=gen_tcp:connect({0,0,0,0,0,0,0,0},33333,[inet6,binary]).
timer:sleep(100).
inet:setopts(S, [{raw, 6, 31, <<"tls">>}]).
inet:setopts(S, [{raw, 282, 1, <<4,3,52,0,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,6,6,0,0,0,0,0,0,0,0>>}]).
inet:setopts(S, [{raw, 282, 2, <<4,3,52,0,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,0,0,0,0,0,0,0,0>>}]).
inet:setopts(S, [list, {packet, 4}]).
gen_tcp:send(S, "asd"), gen_tcp:send(S, "asd").

<copy on terminal 1>

inet:setopts(S, [{raw, 6, 31, <<"tls">>}]).
inet:setopts(S, [{raw, 282, 1, <<4,3,52,0,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,0,0,0,0,0,0,0,0>>}]).
inet:setopts(S, [{raw, 282, 2, <<4,3,52,0,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,6,6,0,0,0,0,0,0,0,1>>}]).
inet:setopts(S, [list, {packet, 4}]).
gen_tcp:recv(S, 0, 1000).
inet:setopts(S, [list, {packet, 0}]).
gen_tcp:recv(S, 0, 1000).

The first gen_tcp:recv will timeout, the second gen_tcp:recv will receive a ssl packet instead of the tcp packet. Example:

gen_tcp:recv(S, 0, 1000).
{ok,[23,3,3,0,24,160,106,135,181,246,170,232,32,70,171,184,
4,152,246,50,14,220,23,190,184,232,75|...]}

Expected behavior
We receive a tcp package.

gen_tcp:recv(S, 0, 1000).
{ok,"asd"}

Affected versions
Erlang/OTP 26 [erts-14.2.1]

Additional context

@miriampena miriampena added the bug Issue is reported as a bug label Jun 11, 2024
@IngelaAndin IngelaAndin added the not a bug Issue is determined as not a bug by OTP label Jun 11, 2024
@IngelaAndin IngelaAndin self-assigned this Jun 11, 2024
@IngelaAndin
Copy link
Contributor

IngelaAndin commented Jun 11, 2024

@miriampena This is a race condition that can not be solved in any other way than doing:
gen_tcp:accept(element(2,gen_tcp:listen(33333,[inet6,binary,{reuseaddr,true}, {active, false}]))).

{active, true} is default and the options are inherited from listen to accept. So if a message arrives before ssl process has become the controlling process it will fail. This is documented in the User Guide for upgrading TCP sockets to ssl (TLS) ones.

@IngelaAndin IngelaAndin added the team:PS Assigned to OTP team PS label Jun 11, 2024
@IngelaAndin
Copy link
Contributor

IngelaAndin commented Jun 12, 2024

@miriampena I heard that your use case is actually the erlang distribution, and then this could be a bug and only the reproduction here is not valid. It is possibly ktls or inet-driver bug. We have a problem in some specific test, will come back to you on that.

@IngelaAndin IngelaAndin removed the not a bug Issue is determined as not a bug by OTP label Jun 12, 2024
@miriampena
Copy link
Author

miriampena commented Jun 12, 2024

Hi Ingela, I was about to reach out to say that upon further inspection I was afraid that the reproduction I pasted was not triggering the right issue. It is indeed in happening in erlang distribution when we connect one node to a different node, then we call disconnect_node/1 and ping/1. If you do that ping/disconnect on a loop +50 times you get that race condition. It will show up as a timeout on a gen_tcp:recv/1 in the beginning of the handshake phase (recv_name_failed) and when we manually move the socket to {packet,0} we can actually see ssl data. Thanks for looking into it.

@IngelaAndin
Copy link
Contributor

We have a test
ssl_dist_bench_SUITE | smoketest:ssl:ktls:setup | parallel_setup

that fails on some machines in a similar way. But we have not been able to find the cause. But it seems they are new Ubuntu versions. Alas the successful test cases that we have seen seems to not been run in a while and ktls is not supported everywhere so it not easy to keep track of.

Looking at the code where I find recv_name_failed, it seems the timout is infinity! Could you share what is the value of Other? I guess it would be there that you should be getting timeout?

@miriampena
Copy link
Author

Normally the connection will just time-out in the recv_name phase as, as you say, the timeout is infinity.
In order for us to capture the bug instead of just timing out on the timer I need to modify the source code.

What I do, just for the purposes of identifying this is a bug happening, is to setup a custom f_recv in HSDATA so if a recv times-out after 2 seconds I move the Socket from {packet,4} to {packet,0} to capture what is actually in the Socket.

f_recv(Driver) ->
    fun(Socket, Length, Timeout) ->
        case Driver:recv(Socket, Length, 2000) of
            {error, timeout} ->
                inet:setopts(Socket, [{packet, 0}]),
                Driver:recv(Socket, Length, Timeout);
            Data ->
                Data
        end
    end.

This causes the Recv function in recv_name return something invalid rather than timing-out. The invalid format of the response triggers a receive_name_fail and then call the shutdown2/2 function.

Then we can capture in the shutdown2/2 function a ssl message

{recv_name_failed,{ok,[23,3,3,0,68,193, 133,254, and another few dozens of bytes ]}}

@IngelaAndin
Copy link
Contributor

That is not necessarily wrong, the bytes on the socket after the TLS handshake will always be "TLS packets". The ssl application will emulate the packet option that shall now apply to the decrypted data, so ssl:setopts might give you another result as long as the socket actually has received a whole "TLS packet".

@IngelaAndin
Copy link
Contributor

Oh yes sorry, forgot I guess it should be offloaded now to ktls that should handle the "TLS packet" and decrypt it.

@IngelaAndin
Copy link
Contributor

IngelaAndin commented Jun 13, 2024

You could try splitting option setting in inet_tls_dist.erl to see if there could be something problematic in the inet-driver option handling.

In code

inet_set_ktls(
  #{ socket := Socket, socket_options := SocketOptions } = KtlsInfo) ->
    %%
    maybe
        ok ?=
            set_ktls(
              KtlsInfo
              #{ setopt_fun => fun ?MODULE:inet_ktls_setopt/3,
                 getopt_fun => fun ?MODULE:inet_ktls_getopt/3 }),
        %%
        #socket_options{
           mode = _Mode,
           packet = Packet,
           packet_size = PacketSize,
           header = Header,
           active = Active
          } = SocketOptions,
        case
            inet:setopts(
              Socket,
              [list, {packet, Packet}, {packet_size, PacketSize},
               {header, Header}, {active, Active}])
        of
            ok ->
                ok;
            {error, SetoptError} ->
                {error, {ktls_setopt_failed, SetoptError}}
        end
    end.
 
 

Create a split

inet:setopts(
              Socket,
              [list, {packet, Packet}, {packet_size, PacketSize},
               {header, Header}]),
inets:setopts(Socket, [{active, Active}])
 

@zzydxm
Copy link
Contributor

zzydxm commented Jun 13, 2024

The bug is because even with {active, 1} inet_driver still buffers two ssl packets and translated them into two ssl_tls packets, sent one to the controlling process and wait next active to send another (so ktls option will not take effect because the data is already in inet_driver buffer instead of socket buffer)
To solve this problem, we need to change how inet_driver read socket buffer in ssl_tls mode, which we can discuss

@RaimoNiskanen
Copy link
Contributor

I created a PR that introduces a boolean flag read_ahead, per default true, that makes packet reading non-greedy. Please have a look...

@miriampena
Copy link
Author

Hi @RaimoNiskanen . I tested the PR with our code and it removes the race condition. Thanks for the fix!

RaimoNiskanen added a commit that referenced this issue Sep 2, 2024
…' into maint

* raimo/inet_drv-active-1-no-read-ahead/GH-8561/OTP-19175:
  Fix NULL pointer to memmove in tcp_deliver
  Use read_ahead in SSL for KTLS
  Fix NULL pointer to memmove in tcp_deliver
  Use read_ahead in SSL for KTLS
  Implement option for inet_drv read_ahead
  Fix possible bug for packet type change
  Avoid read-ahead for packet mode ssl_tls
garazdawi pushed a commit that referenced this issue Sep 5, 2024
…' into maint-26

* raimo/inet_drv-active-1-no-read-ahead/GH-8561/OTP-19175:
  Fix NULL pointer to memmove in tcp_deliver
  Use read_ahead in SSL for KTLS
  Fix NULL pointer to memmove in tcp_deliver
  Use read_ahead in SSL for KTLS
  Implement option for inet_drv read_ahead
  Fix possible bug for packet type change
  Avoid read-ahead for packet mode ssl_tls

# Conflicts:
#	erts/preloaded/ebin/prim_inet.beam
@RaimoNiskanen
Copy link
Contributor

Released in OTP-26.2.5.3.

There appears to be other race condigions, not as obvious as this one. If you manage to pinpoint it, please open a new Issue.

@RaimoNiskanen
Copy link
Contributor

The story continues in PR#8785

RaimoNiskanen added a commit that referenced this issue Sep 6, 2024
…maint

* raimo/net/implement-read_ahead/GH-8561/OTP-19199:
  Test read_ahead true|false performance
  Document option `read_ahead`
  Test read_ahead = false
  Don't try to recv peek the header
  Try to recv peek the header - failed (Linux bug?)
  Honour `{read_ahead,true}` and make it default again
  Implement read_ahead option for gen_tcp_socket
  Re-implement gen_tcp_socket to not read ahead
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue is reported as a bug team:PS Assigned to OTP team PS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants