Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(memory): Add script to debug consumer OOM issues #6585

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

nikhars
Copy link
Member

@nikhars nikhars commented Nov 20, 2024

Description

From initial investigations it has been found that rebalances could potentially be the reason for some of our consumers having high memory usage. Added a script which mimics the behavior. I can also see the increase in memory usage locally. This will make it easier to triage the problem by potentially using memory profiler locally.

The script uses a parameter -g optionally to generate spans on the topic. If you need data, then run with that option. It always resets offsets to 0 between re-runs. So if you have enough data on kafka, you do not need to pass the -g option.

Logs of running the script and seeing memory increases

On my machine, I was able to use up more than 2GB of memory easily.

./spans_oom_reproduce.sh

GROUP                          TOPIC                          PARTITION  NEW-OFFSET     
spans_group                    snuba-spans                    0          0              
Memory usage of consumer 59636: 0.05MB

Warning: Consumer group 'spans_group' is rebalancing.

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
spans_group     snuba-spans     0          8070            247110          239040          -               -               -
Memory usage of consumer 59636: 186.89MB

Warning: Consumer group 'spans_group' is rebalancing.

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
spans_group     snuba-spans     0          21613           247110          225497          -               -               -
Memory usage of consumer 59636: 236.58MB

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
spans_group     snuba-spans     0          51814           247110          195296          rdkafka-1db73671-7811-4432-98ab-0ec8a347d830 /172.18.0.1     rdkafka
Memory usage of consumer 59636: 269.22MB

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
spans_group     snuba-spans     0          65471           247110          181639          rdkafka-1db73671-7811-4432-98ab-0ec8a347d830 /172.18.0.1     rdkafka
Memory usage of consumer 59636: 277.39MB

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
spans_group     snuba-spans     0          79444           247110          167666          rdkafka-1db73671-7811-4432-98ab-0ec8a347d830 /172.18.0.1     rdkafka
Memory usage of consumer 59636: 281.97MB

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
spans_group     snuba-spans     0          94648           247110          152462          rdkafka-1db73671-7811-4432-98ab-0ec8a347d830 /172.18.0.1     rdkafka
Memory usage of consumer 59636: 285.45MB

Warning: Consumer group 'spans_group' is rebalancing.

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
spans_group     snuba-spans     0          117620          247110          129490          -               -               -
Memory usage of consumer 59636: 294.11MB

Warning: Consumer group 'spans_group' is rebalancing.

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
spans_group     snuba-spans     0          132972          247110          114138          -               -               -
Memory usage of consumer 59636: 303.69MB

Warning: Consumer group 'spans_group' is rebalancing.

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
spans_group     snuba-spans     0          146266          247110          100844          -               -               -
Memory usage of consumer 59636: 307.16MB

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
spans_group     snuba-spans     0          175804          247110          71306           rdkafka-1db73671-7811-4432-98ab-0ec8a347d830 /172.18.0.1     rdkafka
Memory usage of consumer 59636: 311.31MB

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
spans_group     snuba-spans     0          188988          247110          58122           rdkafka-1db73671-7811-4432-98ab-0ec8a347d830 /172.18.0.1     rdkafka
Memory usage of consumer 59636: 315.23MB

Warning: Consumer group 'spans_group' is rebalancing.

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
spans_group     snuba-spans     0          205754          247110          41356           -               -               -
Memory usage of consumer 59636: 315.59MB

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
spans_group     snuba-spans     0          221322          247110          25788           rdkafka-1db73671-7811-4432-98ab-0ec8a347d830 /172.18.0.1     rdkafka
Memory usage of consumer 59636: 318.47MB

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
spans_group     snuba-spans     0          244086          247110          3024            rdkafka-1308e39b-7b82-484e-9281-19daa23f512c /172.18.0.1     rdkafka
Memory usage of consumer 59636: 204.11MB

Warning: Consumer group 'spans_group' is rebalancing.

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
spans_group     snuba-spans     0          247110          247110          0               -               -               -
Memory usage of consumer 59636: 211.70MB

Warning: Consumer group 'spans_group' is rebalancing.

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
spans_group     snuba-spans     0          247110          247110          0               -               -               -
Memory usage of consumer 59636: 211.83MB

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
spans_group     snuba-spans     0          247110          247110          0               rdkafka-1db73671-7811-4432-98ab-0ec8a347d830 /172.18.0.1     rdkafka
Memory usage of consumer 59636: 211.83MB

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
spans_group     snuba-spans     0          247110          247110          0               rdkafka-1db73671-7811-4432-98ab-0ec8a347d830 /172.18.0.1     rdkafka
Memory usage of consumer 59636: 211.88MB

Warning: Consumer group 'spans_group' is rebalancing.

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
spans_group     snuba-spans     0          247110          247110          0               -               -               -
Memory usage of consumer 59636: 211.91MB

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
spans_group     snuba-spans     0          247110          247110          0               rdkafka-0665fa4a-8c0d-4943-8988-b2765116b54d /172.18.0.1     rdkafka
Memory usage of consumer 59636: 201.06MB

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
spans_group     snuba-spans     0          247110          247110          0               rdkafka-1db73671-7811-4432-98ab-0ec8a347d830 /172.18.0.1     rdkafka
Memory usage of consumer 59636: 159.34MB

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
spans_group     snuba-spans     0          247110          247110          0               rdkafka-1db73671-7811-4432-98ab-0ec8a347d830 /172.18.0.1     rdkafka
Memory usage of consumer 59636: 141.23MB

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
spans_group     snuba-spans     0          247110          247110          0               rdkafka-1db73671-7811-4432-98ab-0ec8a347d830 /172.18.0.1     rdkafka
Memory usage of consumer 59636: 132.75MB

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
spans_group     snuba-spans     0          247110          247110          0               rdkafka-1db73671-7811-4432-98ab-0ec8a347d830 /172.18.0.1     rdkafka
Memory usage of consumer 59636: 132.75MB

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
spans_group     snuba-spans     0          247110          247110          0               rdkafka-1db73671-7811-4432-98ab-0ec8a347d830 /172.18.0.1     rdkafka
Memory usage of consumer 59636: 131.02MB

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
spans_group     snuba-spans     0          247110          247110          0               rdkafka-1db73671-7811-4432-98ab-0ec8a347d830 /172.18.0.1     rdkafka
Memory usage of consumer 59636: 131.02MB
^CCleaning up...

Verfification that rebalances were being triggered

Logs of the kafka broker

[2024-11-20 19:08:22,974] INFO [GroupCoordinator 1]: Member MemberMetadata(memberId=rdkafka-63fefc81-b929-4ef8-8b0f-c281f61415bf, groupInstanceId=None, clientId=rdkafka, clientHost=/172.18.0.1, sessionTimeoutMs=30000, rebalanceTimeoutMs=30000, supportedProtocols=List(range, roundrobin)) has left group spans_group through explicit `LeaveGroup`; client reason: not provided (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:23,099] INFO [GroupCoordinator 1]: Stabilized group spans_group generation 234 (__consumer_offsets-0) with 1 members (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:23,100] INFO [GroupCoordinator 1]: Assignment received from leader rdkafka-7e893d3f-2a7c-4232-a4e3-362fcd51ec04 for group spans_group for generation 234. The group has 1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:24,989] INFO [GroupCoordinator 1]: Dynamic member with unknown member id joins group spans_group in Stable state. Created a new member id rdkafka-dcfd93e6-9df6-4235-857d-241f58fc1428 and request the member to rejoin with this id. (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:24,989] INFO [GroupCoordinator 1]: Preparing to rebalance group spans_group in state PreparingRebalance with old generation 234 (__consumer_offsets-0) (reason: Adding new member rdkafka-dcfd93e6-9df6-4235-857d-241f58fc1428 with group instance id None; client reason: not provided) (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:26,106] INFO [GroupCoordinator 1]: Stabilized group spans_group generation 235 (__consumer_offsets-0) with 2 members (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:26,108] INFO [GroupCoordinator 1]: Assignment received from leader rdkafka-7e893d3f-2a7c-4232-a4e3-362fcd51ec04 for group spans_group for generation 235. The group has 2 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:29,126] INFO [GroupCoordinator 1]: Preparing to rebalance group spans_group in state PreparingRebalance with old generation 235 (__consumer_offsets-0) (reason: Removing member rdkafka-dcfd93e6-9df6-4235-857d-241f58fc1428 on LeaveGroup; client reason: not provided) (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:29,126] INFO [GroupCoordinator 1]: Member MemberMetadata(memberId=rdkafka-dcfd93e6-9df6-4235-857d-241f58fc1428, groupInstanceId=None, clientId=rdkafka, clientHost=/172.18.0.1, sessionTimeoutMs=30000, rebalanceTimeoutMs=30000, supportedProtocols=List(range, roundrobin)) has left group spans_group through explicit `LeaveGroup`; client reason: not provided (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:31,016] INFO [GroupCoordinator 1]: Dynamic member with unknown member id joins group spans_group in PreparingRebalance state. Created a new member id rdkafka-4b82251b-00c0-4111-b544-eacd8aac6a6e and request the member to rejoin with this id. (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:32,119] INFO [GroupCoordinator 1]: Stabilized group spans_group generation 236 (__consumer_offsets-0) with 2 members (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:32,121] INFO [GroupCoordinator 1]: Assignment received from leader rdkafka-7e893d3f-2a7c-4232-a4e3-362fcd51ec04 for group spans_group for generation 236. The group has 2 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:35,021] INFO [GroupCoordinator 1]: Preparing to rebalance group spans_group in state PreparingRebalance with old generation 236 (__consumer_offsets-0) (reason: Removing member rdkafka-4b82251b-00c0-4111-b544-eacd8aac6a6e on LeaveGroup; client reason: not provided) (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:35,021] INFO [GroupCoordinator 1]: Member MemberMetadata(memberId=rdkafka-4b82251b-00c0-4111-b544-eacd8aac6a6e, groupInstanceId=None, clientId=rdkafka, clientHost=/172.18.0.1, sessionTimeoutMs=30000, rebalanceTimeoutMs=30000, supportedProtocols=List(range, roundrobin)) has left group spans_group through explicit `LeaveGroup`; client reason: not provided (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:35,330] INFO [GroupCoordinator 1]: Stabilized group spans_group generation 237 (__consumer_offsets-0) with 1 members (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:35,332] INFO [GroupCoordinator 1]: Assignment received from leader rdkafka-7e893d3f-2a7c-4232-a4e3-362fcd51ec04 for group spans_group for generation 237. The group has 1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:37,036] INFO [GroupCoordinator 1]: Dynamic member with unknown member id joins group spans_group in Stable state. Created a new member id rdkafka-d77d3679-3d51-4b97-a1b3-4e109c56644f and request the member to rejoin with this id. (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:37,037] INFO [GroupCoordinator 1]: Preparing to rebalance group spans_group in state PreparingRebalance with old generation 237 (__consumer_offsets-0) (reason: Adding new member rdkafka-d77d3679-3d51-4b97-a1b3-4e109c56644f with group instance id None; client reason: not provided) (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:38,141] INFO [GroupCoordinator 1]: Stabilized group spans_group generation 238 (__consumer_offsets-0) with 2 members (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:38,143] INFO [GroupCoordinator 1]: Assignment received from leader rdkafka-7e893d3f-2a7c-4232-a4e3-362fcd51ec04 for group spans_group for generation 238. The group has 2 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:41,166] INFO [GroupCoordinator 1]: Preparing to rebalance group spans_group in state PreparingRebalance with old generation 238 (__consumer_offsets-0) (reason: Removing member rdkafka-d77d3679-3d51-4b97-a1b3-4e109c56644f on LeaveGroup; client reason: not provided) (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:41,166] INFO [GroupCoordinator 1]: Member MemberMetadata(memberId=rdkafka-d77d3679-3d51-4b97-a1b3-4e109c56644f, groupInstanceId=None, clientId=rdkafka, clientHost=/172.18.0.1, sessionTimeoutMs=30000, rebalanceTimeoutMs=30000, supportedProtocols=List(range, roundrobin)) has left group spans_group through explicit `LeaveGroup`; client reason: not provided (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:43,078] INFO [GroupCoordinator 1]: Dynamic member with unknown member id joins group spans_group in PreparingRebalance state. Created a new member id rdkafka-2d88f5dd-6e2c-4bcd-8256-1822732f2c7b and request the member to rejoin with this id. (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:44,156] INFO [GroupCoordinator 1]: Stabilized group spans_group generation 239 (__consumer_offsets-0) with 2 members (kafka.coordinator.group.GroupCoordinator)
[2024-11-20 19:08:44,157] INFO [GroupCoordinator 1]: Assignment received from leader rdkafka-7e893d3f-2a7c-4232-a4e3-362fcd51ec04 for group spans_group for generation 239. The group has 2 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)

From initial investigations it has been found that rebalances could
potentially be the reason for some of our consumers having high
memory usage. Added a script which mimics the behavior. I can also
see the increase in memory usage locally. This will make it easier
to triage the problem by potentially using memory profiler locally.
@nikhars nikhars requested a review from a team as a code owner November 20, 2024 19:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants