
Binary leaking in h2_connection processes #120

Open
jonasrichard opened this issue Oct 31, 2017 · 15 comments

@jonasrichard

jonasrichard commented Oct 31, 2017

I use chatterbox as a dependency of the inaka/apns4erl project. After some hours of running the application I noticed ever-growing process sizes (process_info(Pid, memory)). After some investigation I found this:

process_info(pid(0,1238,0), binary).
{binary,[{139665827264464,60,1},
         {139665831210656,60,1},
         {139664600088128,60,1},
         {139665831140616,60,1},
         {139665831222376,60,1},
         {139664600244240,60,1},
         {139665827188416,60,1},
         {139664600522856,60,1},
         {139664605657712,60,1},
         {139664605954696,60,1},
         {139664605903872,60,1},
         {139664606052304,60,1},
         {139664600433752,60,1},
         {139664605649112,60,1},
         {139664531257216,60,1},
         {139665827110208,60,1},
         {139664605153864,60,1},
         {139665827411976,60,1},
         {139665827397160,60,1},
         {139665827115824,60,1},
         {139664605243344,60,1},
         {139664605133416,60,1},
         {139664605420536,60,1},
         {139664605405256,60,1},
         {139664637840368,60,...},
         {139664638044448,...},
         {...}|...]}
(myhost@my_ip)48> length(element(2, v(47))).
1851

Do you have a clue where this binary leak can occur?

@ferd

ferd commented Oct 31, 2017

If you call for a manual garbage collection of the process, do these go away? If so, the leak is probably fixable through well-placed hibernation calls more than anything else.
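
A quick way to check that from the shell, as a sketch (using the pid from the report above; if the second count is much lower, the references were only kept alive by old heap data rather than by live process state):

%% Count refc binary references before and after a forced full-sweep GC.
P = pid(0,1238,0),
{binary, Before} = erlang:process_info(P, binary),
true = erlang:garbage_collect(P),
{binary, After} = erlang:process_info(P, binary),
{length(Before), length(After)}.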

@jonasrichard
Author

jonasrichard commented Nov 1, 2017

I did an erlang:garbage_collect/0 but the binaries didn't go away. I don't know the process structure of chatterbox, but I think there are two processes passing the binaries around, and after some matching one process ends up holding refc binaries. It is hard to pin down when those binaries are retained, because at least 50,000 HTTP/2 requests were sent (far more than the number of refc binaries), so I think it only happens in edge cases.
I don't use chatterbox directly, only through apns4erl:
https://github.com/inaka/apns4erl
Tomorrow I will check whether other processes have something similar. Do you think I can see those binaries in a crash dump? Or will the GC during the crash dump eliminate them?

@ferd

ferd commented Nov 1, 2017

If you can see them in the crash dump, they were really being held in memory by the process. If they are gone, that's because the VM ran a GC before writing them to the dump.

@jonasrichard
Author

jonasrichard commented Nov 2, 2017

OK, calling garbage_collect(Pid) decreased the number of refc binaries from 1932 to 84. So we will probably call GC only on those processes whose refc binary count is above a threshold. Does that sound right (not too odd)?
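
A minimal sketch of that sweep, with an arbitrary threshold (tune it for the actual workload):

%% GC every process holding more refc binary references than Threshold.
GcHeavy = fun(Threshold) ->
    [begin erlang:garbage_collect(P), P end
     || P <- erlang:processes(),
        case erlang:process_info(P, binary) of
            {binary, Bins} when length(Bins) > Threshold -> true;
            _ -> false
        end]
end,
GcHeavy(1000).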

@ferd

ferd commented Nov 2, 2017

That can work temporarily. I think the ideal fix in these cases, for now, is to modify the risky processes to run hibernation from time to time when they get downtime, so they force this cleanup on their own.
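
A minimal sketch of that pattern with a generic gen_statem (not chatterbox's actual code): returning the hibernate action makes the process hibernate after handling the event (waking again immediately if more messages are already queued), and hibernation runs a full-sweep GC that releases refc binaries the process no longer references.

%% Sketch only, not chatterbox code.
-module(hibernate_when_idle).
-behaviour(gen_statem).

-export([start_link/0]).
-export([callback_mode/0, init/1, handle_event/4]).

start_link() ->
    gen_statem:start_link(?MODULE, [], []).

callback_mode() ->
    handle_event_function.

init([]) ->
    {ok, connected, #{}}.

handle_event(info, _Frame, connected, Data) ->
    %% ... real frame handling would go here ...
    {keep_state, Data, [hibernate]}.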

@jonasrichard
Author

It is also interesting that process_info(memory) and the heap size in the regular process_info output give me different results. And like I said, the reductions were huge. So I decided to reconnect every 10 minutes instead of relying on GC.

> process_info(P1, memory).
{memory,201303168}
> process_info(P1).
...
 {total_heap_size,17474138},
 {heap_size,2072833},
 {stack_size,10},
 {reductions,123302750526},
 {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,41}]},
...

@tsloughter
Collaborator

Looking at this again: if it makes sense, we could add a hibernate_after option to the gen_statem:start_link of h2_connection so it hibernates when idle.

Suggestions for what a good timeout would be?
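
For reference, the option goes into the gen_statem start options, roughly like this (a sketch; the 5000 ms value is only a placeholder, not a recommendation, and InitArgs stands in for whatever h2_connection currently passes):

%% Hibernate the connection process after 5 s without receiving a message.
start_link(InitArgs) ->
    gen_statem:start_link(?MODULE, InitArgs, [{hibernate_after, 5000}]).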

@sebastiw
Contributor

sebastiw commented Aug 6, 2020

Did anyone find the cause of the leaking memory? I have the same problem; the garbage_collect call mentioned above reclaimed about a third of the memory.

> lists:reverse(lists:usort([{process_info(P, memory), case process_info(P, registered_name) of {_, N} -> N; _ -> P end} || P <- processes()])).
[{{memory,73238512},<0.16417.364>},
 {{memory,73238512},<0.16368.364>},
 {{memory,73238512},<0.16350.364>},
 {{memory,73238512},<0.16343.364>},
 {{memory,70935368},<0.16419.364>},
 {{memory,70935368},<0.16372.364>},
 {{memory,70935368},<0.16325.364>},
 {{memory,69016080},<0.16440.364>},
 {{memory,69016080},<0.16429.364>},
 {{memory,69016080},<0.16422.364>},
 {{memory,69016080},<0.16405.364>},
 {{memory,69016080},<0.16384.364>},
 {{memory,69016080},<0.16338.364>},
 {{memory,69016080},<0.16337.364>},
 {{memory,69016080},<0.16319.364>},
 {{memory,61032264},<0.16391.364>},
 {{memory,4119652},code_server},
 {{memory,1803720},<0.1733.0>},
 {{memory,1803720},<0.1274.0>},
 {{memory,...},...},
 {{...},...},
 {...}|...]
> process_info(list_to_pid("<0.16417.364>")).
[{current_function,{gen_statem,loop_receive,3}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,0},
 {links,[<0.16300.364>,#Port<0.79272>]},
 {dictionary,[{'$initial_call',{h2_connection,init,1}},
              {'$ancestors',[<0.16300.364>,<0.16257.364>,
                             grpcbox_channel_sup,grpcbox_sup,<0.1063.0>]}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.1062.0>},
 {total_heap_size,8866796},
 {heap_size,1439468},
 {stack_size,11},
 {reductions,47123903084},
 {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,6023}]},
 {suspending,[]}]
> garbage_collect(list_to_pid("<0.16384.364>")).
....
> lists:reverse(lists:usort([{process_info(P, memory), case process_info(P, registered_name) of {_, N} -> N; _ -> P end} || P <- processes()])).
[{{memory,85122328},<0.16391.364>},
 {{memory,55082824},<0.16440.364>},
 {{memory,52779680},<0.16372.364>},
 {{memory,50860392},<0.16429.364>},
 {{memory,50860392},<0.16325.364>},
 {{memory,50860392},<0.16319.364>},
 {{memory,48205672},<0.16422.364>},
 {{memory,48205672},<0.16337.364>},
 {{memory,43983240},<0.16343.364>},
 {{memory,42474712},<0.16384.364>},
 {{memory,42383832},<0.16419.364>},
 {{memory,40171568},<0.16350.364>},
 {{memory,40171568},<0.16338.364>},
 {{memory,38252280},<0.16417.364>},
 {{memory,38252280},<0.16405.364>},
 {{memory,38252280},<0.16368.364>},
 {{memory,4119652},code_server},
 {{memory,1803720},<0.1733.0>},
 {{memory,1803720},<0.1274.0>},
 {{memory,...},...},
 {{...},...},
 {...}|...]

I'm using chatterbox via grpcbox.

@ferd

ferd commented Aug 6, 2020

Yeah, the post before yours does mention a possible remediation: hibernating when idle. Someone likely just needs to go and implement it.

@sebastiw
Contributor

So we have been tracing and found out that the client responses are only cleaned up from the h2_stream state if the function h2_connection:get_response/2 is called.

For instance, grpcbox_client:unary_handler does not call this function.
For chatterbox, I think there should be an API for closing/removing a stream when get_response is not used, and that function could then be called by client libraries such as grpcbox.

What do you think?
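
Purely as a hypothetical sketch of such an API (discard_response/2 does not exist in chatterbox today; only get_response/2 does), something like:

%% Hypothetical addition to h2_connection: drop a finished stream's
%% buffered response without fetching it, so callers that consume the
%% body via messages don't leave it behind in the connection state.
discard_response(ConnPid, StreamId) ->
    gen_statem:call(ConnPid, {discard_response, StreamId}).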

@tsloughter
Collaborator

The grpcbox client doesn't call it because it gets the body from messages sent to the client pid, right? If that is the case, then it sounds more like chatterbox should have one mode where it keeps the response until it is requested and another where it only sends it to some pid.

@sebastiw
Contributor

Yes, there seem to be two flows in h2: one with asynchronous bangs to a notify pid, and one with some synchronous parts. It might be a good idea to split them into two, or at least clarify which is which.

We went with an easy quick fix which I will try to test tomorrow: basically just adding h2_connection:get_response(Pid, Stream) after receiving the data in grpcbox_client:unary_handler/6.
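
In other words, roughly this (a sketch; unary_cleanup/2 and the variable names are made up here, the h2_connection:get_response/2 call is the actual fix):

%% After the response body has already been delivered as messages,
%% fetch-and-drop the buffered copy so the stream state can be cleaned up.
unary_cleanup(ConnPid, Stream) ->
    _ = h2_connection:get_response(ConnPid, Stream),
    ok.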

@tsloughter
Collaborator

Yeah, pretty sure the notify-a-pid flow was added as a quick hack at the time. It could definitely use some cleaning up.

@tsloughter
Collaborator

Looking at this again.

As far as I can tell from the code, the only time frames are added to the queue that becomes the received response body is when the callback is undefined. So I don't get how the response body is built up if that is the case.

@tsloughter
Collaborator

I'm adding a hibernate_after option to h2_connection, but I also merged a fix for terminating gRPC streams which may help, not sure: tsloughter#15
