Enormous CPU usage when overflowing UDP socket with backend inet #7573

Open
mickel8 opened this issue Aug 17, 2023 · 8 comments
Labels
bug Issue is reported as a bug team:VM Assigned to OTP team VM

Comments

@mickel8

mickel8 commented Aug 17, 2023

Describe the bug

When opening a UDP socket with the inet backend, we observe enormous CPU usage (~90% on every thread of a 32-core/64-thread machine) when the process that opened the socket doesn't read data fast enough.

We are implementing a TURN server where many clients can send data to the same address. As long as we have 650 different clients (each sending 50 kbps of traffic in 100-byte datagrams), everything works correctly and the average CPU usage stays around 10%. When we increase the number of clients to 750, after about 30 seconds the CPU usage rises to 90% on every thread.

This doesn't happen when using the socket backend. With the socket backend, packets are silently dropped (as they are not read fast enough) and the CPU usage stays around 25%, which is higher than with the inet backend but stable.

To Reproduce
This might be pretty hard. We can provide you with our server, which is an open-source project, and a benchmarking script if you wish.

Expected behavior
If a process is not reading data fast enough, data should be dropped without any impact on the CPU usage.

Affected versions
To our knowledge, at least 24 and 26. We didn't test on 25.

Additional context
The architecture of the TURN server is as follows:

We have one listener process, which has one UDP socket, and N allocation processes. Every allocation process has its own allocation socket and a "reference" to the listener socket.

Clients (750 of them) send data to the listener socket. The listener process reads data from the socket with gen_udp.recv and routes it to the appropriate allocation process. The allocation process forwards the data to its peer using the allocation socket. The peer echoes the data back to the allocation socket, where it is read by the allocation process and sent to the corresponding client using the reference to the listener socket.

My assumption is that after some time of not reading data fast enough from the listener socket, something breaks and impacts the sending processes, i.e. none of the allocation processes can successfully send data on the listener socket. Because we have 750 processes that can write to the same socket, we observe high CPU usage on every thread.
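
A minimal sketch of the listener side of that flow (module and message names here are made up for illustration only; the real project is linked further down in the thread):

    # Listener process: tight recv loop that dispatches each datagram to the
    # allocation process registered for the sending client (sketch only).
    defmodule Listener do
      def loop(socket, allocations) do
        case :gen_udp.recv(socket, 0) do
          {:ok, {c_ip, c_port, datagram}} ->
            case Map.fetch(allocations, {c_ip, c_port}) do
              {:ok, alloc_pid} -> send(alloc_pid, {:client_data, c_ip, c_port, datagram})
              :error -> :ok
            end

            loop(socket, allocations)

          {:error, reason} ->
            exit({:recv_failed, reason})
        end
      end
    end

Each allocation process does the reverse: it reads the peer's reply from its own socket and calls :gen_udp.send/4 on the listener socket, which is the call that 750 processes end up making concurrently.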

The output from perf top -g --sort=dso at an arbitrary moment after the problem occurs (i.e. after the CPU usage increases):

-   93.31%    72.12%  beam.smp                                                                       ▒
   - 9.98% 0x7feb99dfb965                                                                            ▒
      - 56.25% erts_schedule                                                                         ▒
           52.60% scheduler_wait                                                                     ▒
   - 1.88% erts_schedule                                                                             ▒
      + 2.04% sched_yield                                                                            ▒
   + 1.07% 0x18b                                                                                     ▒
   - 1.07% beam_jit_call_bif                                                                         ◆
      - 1.07% erts_internal_port_command_3                                                           ▒
         - 1.39% erts_suspend                                                                        ▒
            + 0.68% __lll_lock_wait                                                                  ▒
+   92.25%     0.44%  memfd:vmem (deleted)                                                           ▒
+   23.27%    23.27%  [kernel]                                                                       ▒
+   16.68%     1.34%  libpthread-2.31.so                                                             ▒
+    7.43%     0.77%  libc-2.31.so                                                                   ▒
+    2.06%     0.00%  [unknown]                                                                      ▒
+    2.03%     1.91%  perf                                                                           ▒
     0.06%     0.03%  node_exporter                                                                  ▒
     0.04%     0.04%  [vdso]                                            
-   21.34%     1.62%  libpthread-2.31.so                                                             ▒
   - 5.11% __lll_lock_wait                                                                           ▒
      - 4.89% entry_SYSCALL_64_after_hwframe                                                         ◆
         - 4.87% do_syscall_64                                                                       ▒
            - 4.85% __x64_sys_futex                                                                  ▒
               - 4.86% do_futex                                                                      ▒
                  - 5.02% futex_wait                                                                 ▒
                     - 4.29% futex_wait_setup                                                        ▒
                        - 6.80% _raw_spin_lock                                                       ▒
                             6.80% native_queued_spin_lock_slowpath                                  ▒
                     - 0.62% futex_wait_queue_me                                                     ▒
                        - 0.66% schedule                                                             ▒
                             0.70% __schedule           
@mickel8 mickel8 added the bug Issue is reported as a bug label Aug 17, 2023
@IngelaAndin IngelaAndin added the team:PS Assigned to OTP team PS label Aug 18, 2023
@RaimoNiskanen
Contributor

RaimoNiskanen commented Aug 22, 2023

Let's see if I got the situation right...

You have one UDP socket you call the "listener socket" that is the central socket to which clients (750 of them) send requests as UDP datagrams.

There is one process, the "listener process", in a tight loop over gen_udp:recv/2,3 (arguments, please?) that dispatches received datagrams from the "listener socket" to a set of "allocation processes" (how large a set?) via normal Erlang inter-process messaging.

Each "allocation process" does an UDP request + gets a reply towards a "peer" over an UDP socket that is handled only by the "allocation process".

Each "allocation process" then sends a reply with gen_udp:send/2,3 (arguments, please?) over the "listener socket".

So we have a large set of "allocation processes" that hammer the "listener socket" with gen_udp:send/2,3 calls, right?

If so, is this a duplicate of GH-6455? Then it should be fixed in OTP 26.0?

@mickel8
Author

mickel8 commented Aug 22, 2023

Let's see if I got the situation right...

@RaimoNiskanen exactly!

We open the socket like this:

    {:ok, socket} =
      :gen_udp.open(
        port,
        [
          {:ifaddr, ip},
          {:active, false},
          {:recbuf, 1024 * 1024},
          :binary
        ]
      )

receive data in this way:

:gen_udp.recv(socket, 0)

and send data with the following arguments:

:ok = :gen_udp.send(state.turn_socket, c_ip, c_port, channel_data)

The issue you linked sounds promising, but we are already on OTP 26 and Elixir 1.15:

Erlang/OTP 26 [erts-14.0.2] [source] [64-bit] [smp:64:64] [ds:64:64:10] [async-threads:1] [jit:ns]

Eshell V14.0.2 (press Ctrl+G to abort, type help(). for help)
➜  rel git:(master) ✗ asdf list erlang
  24.0
  24.0.6
  24.1.5
  24.1.6
  24.1.7
  25.0.3
  25.2.3
 *26.0.2
➜  rel git:(master) ✗ asdf list elixir
  1.12.0
  1.13.0
  1.13.4
  1.14
  1.14.3-otp-25
  1.15
 *1.15.0-otp-26
  master

Here is the code if you wish to take a look:

I linked a specific commit because we have since moved to the :socket module.

Also, we don't notice such behaviour when using the socket backend or the socket module.

I am willing to dive into this more; in particular, I could check what functions are called under the hood, but I couldn't find anything with perf or observer_cli. Is there anything else I could try?

@RaimoNiskanen
Contributor

Have a look at https://www.erlang.org/doc/apps/erts/beamasm#linux-perf-support in particular:

# Record a running instance started with `+JPperf true` for 10s
perf record --pid $BEAM_PID -- sleep 10

Also note:

perf record --call-graph=lbr may work better in those cases, but it's worse at tracking in general.

@RaimoNiskanen
Contributor

Thinking again, the optimization in OTP 26.0 would not have any effect here. That was for a process that has a large message queue when calling gen_udp:send/*, and I guess your "allocator processes" would not have large message queues.

One investigation step that can be done, mostly to rule out simple problems, is to produce an Erlang crash dump, e.g. by calling erlang:halt("Suitable Slogan"). Then investigate that crash dump to see if there is any process with a large message queue, an unreasonably large heap, or the like.
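
If attaching a shell is easier than taking a crash dump, a rough equivalent check can also be done at runtime; a sketch (the cut-off of 10 results is arbitrary):

    # List the processes with the largest message queues (nil entries are
    # processes that exited between Process.list/0 and Process.info/2).
    Process.list()
    |> Enum.map(fn pid -> {pid, Process.info(pid, :message_queue_len)} end)
    |> Enum.reject(fn {_pid, info} -> is_nil(info) end)
    |> Enum.sort_by(fn {_pid, {:message_queue_len, len}} -> len end, :desc)
    |> Enum.take(10)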

But if there are no such simple problems, the most likely culprit here may be lock contention on the (driver) port lock of the listener socket port. I have heard a war story about a customer that got around a similar problem by cloning the port (opening another socket port for the same file descriptor) and then using one port for input and the other for output, to avoid lock contention between the reader and the writers.

Lock contention can be investigated by compiling an instrumented VM. See lcnt - The Lock Profiler. Unfortunately it is not as easy as adding a start option to the normal VM...
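
For reference, once a lock-counting emulator is available (e.g. the VM is rebuilt with --enable-lock-counting, or started with that emulator flavor if one is installed), collecting the data is only a few calls; a sketch from an Elixir shell, to be adapted:

    # Sketch: snapshot lock counters while the benchmark load is running.
    :lcnt.start()        # start the lcnt server
    :lcnt.clear()        # zero the counters, then let the load run for a while
    Process.sleep(10_000)
    :lcnt.collect()      # copy the counters from the runtime into the server
    :lcnt.conflicts()    # print the most contended lock classes
    # :lcnt.inspect(:port_lock)  # drill into one class, e.g. the port lock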

Since NIFs leave it to the user to implement locking, the socket NIF implemented from the start slightly more fine-grained locking than what the port driver framework does: the socket NIFs have, per socket, one lock used by the send operations and one lock used by the recv operations. This may be why your code performs adequately over sockets from the socket module...

@mickel8
Author

mickel8 commented Aug 23, 2023

@RaimoNiskanen thanks, I will look into that.

Keep in mind that the problem starts to occur after 5-20 seconds, so at the beginning everything works correctly.

I have heard a war story about a customer that got around a similar problem by cloning the port (open another socket port for the same file descriptor) and then used one port for input and the other for output, to avoid lock contention between the reader and the writers.

By port do you mean something different from the gen_udp/tcp or socket modules? I am curious how you can open another socket for the same file descriptor. Probably using the fd option for gen_udp/tcp, or reuseport, but reuseport would load-balance incoming traffic across all sockets, at least when they are in active mode.

@RaimoNiskanen
Contributor

By "(driver) port" here I mean an Erlang port(), in this case inet backend of gen_udp, i.e inet_drv.c.

And I guess they used the fd option to gen_udp:open/* and some (maybe undocumented) way to get the file descriptor from the original socket (port()). Then two port() instances can access the same file descriptor, and if done carefully it works. The OS protocol stack has got its own locking on socket / file descriptor accesses from different threads.
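
If I read that right, in Elixir it might look roughly like this (a sketch only: :inet.getfd/1 is presumably the kind of accessor meant, 3478 is just an example port, and all option choices are guesses):

    # Open the "real" socket for receiving, then a second port() on the same
    # file descriptor that is used only for sending (sketch, not tested).
    {:ok, recv_socket} = :gen_udp.open(3478, [{:active, false}, :binary])
    {:ok, fd} = :inet.getfd(recv_socket)
    {:ok, send_socket} = :gen_udp.open(0, [{:fd, fd}, {:active, false}, :binary])

    # The listener process keeps calling :gen_udp.recv(recv_socket, 0), while
    # allocation processes call :gen_udp.send(send_socket, ip, port, data).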

@mickel8
Author

mickel8 commented Aug 26, 2023

I finally found some free time to dive into this problem a little more. Starting with perf, I ran the EVM with +JPperf true and recorded ONE OF the EVM pids (the one with the highest % of CPU usage as shown by htop) with perf record --call-graph=fp --pid 321991 -- sleep 10. I called perf record after the CPU usage increased from 15% to 90%, so after the possible bug happens:

Samples: 2M of event 'cycles', Event count (approx.): 1812813388981
  Children      Self  Command        Shared Object      Symbol
-    0.99%     0.00%  erts_sched_51  [JIT] tid 321991   [.] $global::process_main                    ◆
   - 0.99% $global::process_main                                                                     ▒
        0.99% erts_schedule                                                                          ▒
-    0.99%     0.02%  erts_sched_51  beam.smp           [.] erts_schedule                            ▒
     0.96% erts_schedule                                                                             ▒
-    0.95%     0.00%  erts_sched_50  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
   - 0.94% entry_SYSCALL_64_after_hwframe                                                            ▒
      - 0.94% do_syscall_64                                                                          ▒
         - 0.86% __x64_sys_futex                                                                     ▒
              0.86% do_futex                                                                         ▒
-    0.94%     0.00%  erts_sched_58  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
   - 0.94% entry_SYSCALL_64_after_hwframe                                                            ▒
      - 0.93% do_syscall_64                                                                          ▒
         - 0.86% __x64_sys_futex                                                                     ▒
              0.86% do_futex                                                                         ▒
-    0.94%     0.00%  erts_sched_50  [kernel.kallsyms]  [k] do_syscall_64                            ▒
   - 0.94% do_syscall_64                                                                             ▒
      - 0.86% __x64_sys_futex                                                                        ▒
           0.86% do_futex                                                                            ▒
-    0.94%     0.00%  erts_sched_49  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
   - 0.93% entry_SYSCALL_64_after_hwframe                                                            ▒
      - 0.93% do_syscall_64                                                                          ▒
         - 0.86% __x64_sys_futex                                                                     ▒
              0.86% do_futex                                                                         ▒
+    0.93%     0.00%  erts_sched_32  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
+    0.93%     0.00%  erts_sched_3   [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
+    0.93%     0.00%  erts_sched_62  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
+    0.93%     0.00%  erts_sched_58  [kernel.kallsyms]  [k] do_syscall_64                            ▒
+    0.93%     0.00%  erts_sched_41  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
+    0.93%     0.00%  erts_sched_45  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
+    0.93%     0.00%  erts_sched_49  [kernel.kallsyms]  [k] do_syscall_64                            ▒
+    0.93%     0.00%  erts_sched_20  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
+    0.93%     0.00%  erts_sched_9   [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
+    0.93%     0.01%  erts_sched_1   [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
+    0.93%     0.00%  erts_sched_7   [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
+    0.93%     0.00%  erts_sched_3   [kernel.kallsyms]  [k] do_syscall_64                            ▒
+    0.93%     0.00%  erts_sched_32  [kernel.kallsyms]  [k] do_syscall_64                            ▒
+    0.93%     0.00%  erts_sched_62  [kernel.kallsyms]  [k] do_syscall_64                            ▒
+    0.93%     0.00%  erts_sched_41  [kernel.kallsyms]  [k] do_syscall_64                            ▒
+    0.92%     0.00%  erts_sched_26  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
+    0.92%     0.00%  erts_sched_17  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
+    0.92%     0.00%  erts_sched_55  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
+    0.92%     0.00%  erts_sched_45  [kernel.kallsyms]  [k] do_syscall_64                            ▒
+    0.92%     0.00%  erts_sched_39  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
+    0.92%     0.00%  erts_sched_25  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
+    0.92%     0.00%  erts_sched_2   [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
+    0.92%     0.00%  erts_sched_46  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
+    0.92%     0.00%  erts_sched_11  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
+    0.92%     0.00%  erts_sched_20  [kernel.kallsyms]  [k] do_syscall_64                            ▒
+    0.92%     0.00%  erts_sched_30  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
+    0.92%     0.00%  erts_sched_22  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe           ▒
+    0.92%     0.00%  erts_sched_9   [kernel.kallsyms]  [k] do_syscall_64                            ▒
+    0.92%     0.00%  erts_sched_7   [kernel.kallsyms]  [k] do_syscall_64                            

It looks like there are a lot of calls to do_futex, which suggests, as you said, a lock contention problem. Let me know if that's enough or whether you would benefit from recompiling the EVM with the lcnt option.

@IngelaAndin IngelaAndin added the team:VM Assigned to OTP team VM label Aug 30, 2023
@RaimoNiskanen
Contributor

Well, yes, the VM guys would like some lcnt data to be sure it is lock contention.

@IngelaAndin IngelaAndin removed the team:PS Assigned to OTP team PS label Sep 13, 2023