pmm-client (Docker container) that monitors Postgres bloats memory, and a process in the container crashes with an OOM #2563

Open
1 task done
Yuskovich opened this issue Oct 20, 2023 · 4 comments
Labels
bug Bug report

Comments

@Yuskovich

Yuskovich commented Oct 20, 2023

Description

The pmm-client Docker container has a 32 GB RAM limit. After a random period of time, usually about half an hour, the container reaches that limit and the OOM killer kills postgres_exporter inside the container. The container runs on the same system as the monitored PostgreSQL instance.

OS (monitored system): Ubuntu 20.04.4 LTS (Focal Fossa)
Linux kernel (monitored system): Linux HOSTNAME_REMOVED 5.4.0-164-generic #181-Ubuntu SMP Fri Sep 1 13:41:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Docker image of pmm-client: percona/pmm-client:2.39.0 (same problem on 2.33, 2.36, 2.40.1)
PMM Server version: 2.39 (same problem on 2.33, 2.36, 2.40.1)
Monitored service: PostgreSQL (14.9)
Total RAM on monitored PostgreSQL server: 128 GB

Available memory on the monitored host when the limit is reached is about 59 GB:

#free -hw
              total        used        free      shared     buffers       cache   available
Mem:          125Gi        31Gi       841Mi        33Gi       387Mi        93Gi        59Gi
Swap:            0B          0B          0B

Expected Results

The pmm-client container does not reach its 32 GB RAM limit.

Actual Results

The pmm-client container reaches its memory usage limit of 32 GB.

Version

PMM Server v2.39, PMM client 2.39

Steps to reproduce

Create a new PostgreSQL cluster
Create many schemas (about 10k in our case)
Create many empty tables (about 70k in our case)
Deploy pmm-client in Docker and add a PostgreSQL service
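
The schema and table creation steps above can be scripted. The following is a minimal sketch, not the reporter's actual setup: the `repro_s…` and `t…` names, the single `id` column, and the round-robin spread of tables across schemas are all assumptions, since the issue only gives the approximate counts.

```python
from typing import Iterator


def repro_sql(n_schemas: int = 10_000, n_tables: int = 70_000) -> Iterator[str]:
    """Yield DDL statements that create n_schemas schemas and n_tables empty tables.

    All object names are hypothetical; the issue does not say how the
    tables are distributed across the schemas.
    """
    for s in range(n_schemas):
        yield f"CREATE SCHEMA IF NOT EXISTS repro_s{s};"
    for t in range(n_tables):
        # Spread the tables round-robin across the schemas (an assumption).
        yield f"CREATE TABLE IF NOT EXISTS repro_s{t % n_schemas}.t{t} (id integer);"


# To use it, print the statements and pipe them into psql, for example:
# print("\n".join(repro_sql()))
```

Running the statements outside a single transaction (psql's default autocommit behaviour) avoids holding locks on all 80k objects at once.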

Relevant logs

pmm-client Docker container logs from startup until the RAM limit is first reached:
INFO[2023-10-20T08:44:51.625+00:00] Run setup: false Sidecar mode: false          component=entrypoint
INFO[2023-10-20T08:44:51.626+00:00] Starting 'pmm-admin run'...                   component=entrypoint
INFO[2023-10-20T08:44:51.726+00:00] Loading configuration file /usr/local/percona/pmm2/config/pmm-agent.yaml.  component=main
INFO[2023-10-20T08:44:51.726+00:00] Using /usr/local/percona/pmm2/exporters/node_exporter  component=main
INFO[2023-10-20T08:44:51.726+00:00] Using /usr/local/percona/pmm2/exporters/mysqld_exporter  component=main
INFO[2023-10-20T08:44:51.726+00:00] Using /usr/local/percona/pmm2/exporters/mongodb_exporter  component=main
INFO[2023-10-20T08:44:51.726+00:00] Using /usr/local/percona/pmm2/exporters/postgres_exporter  component=main
INFO[2023-10-20T08:44:51.726+00:00] Using /usr/local/percona/pmm2/exporters/proxysql_exporter  component=main
INFO[2023-10-20T08:44:51.726+00:00] Using /usr/local/percona/pmm2/exporters/rds_exporter  component=main
INFO[2023-10-20T08:44:51.726+00:00] Using /usr/local/percona/pmm2/exporters/azure_exporter  component=main
INFO[2023-10-20T08:44:51.726+00:00] Using /usr/local/percona/pmm2/exporters/vmagent  component=main
INFO[2023-10-20T08:44:51.726+00:00] Runner capacity set to 32.                    component=runner
INFO[2023-10-20T08:44:51.726+00:00] Loading configuration file /usr/local/percona/pmm2/config/pmm-agent.yaml.  component=main
INFO[2023-10-20T08:44:51.727+00:00] Using /usr/local/percona/pmm2/exporters/node_exporter  component=main
INFO[2023-10-20T08:44:51.727+00:00] Using /usr/local/percona/pmm2/exporters/mysqld_exporter  component=main
INFO[2023-10-20T08:44:51.727+00:00] Using /usr/local/percona/pmm2/exporters/mongodb_exporter  component=main
INFO[2023-10-20T08:44:51.727+00:00] Using /usr/local/percona/pmm2/exporters/postgres_exporter  component=main
INFO[2023-10-20T08:44:51.727+00:00] Using /usr/local/percona/pmm2/exporters/proxysql_exporter  component=main
INFO[2023-10-20T08:44:51.727+00:00] Using /usr/local/percona/pmm2/exporters/rds_exporter  component=main
INFO[2023-10-20T08:44:51.727+00:00] Using /usr/local/percona/pmm2/exporters/azure_exporter  component=main
INFO[2023-10-20T08:44:51.727+00:00] Using /usr/local/percona/pmm2/exporters/vmagent  component=main
ERRO[2023-10-20T08:44:52.995+00:00] ts=2023-10-20T08:44:52.932Z caller=diskstats_linux.go:264 level=error collector=diskstats msg="Failed to open directory, disabling udev device properties" path=/run/udev/data  agentID=/agent_id/8b47212b-ec90-4c24-9a1d-b8c4cc3eaa63 component=agent-process type=node_exporter
ERRO[2023-10-20T09:18:58.502+00:00] ts=2023-10-20T09:18:58.495Z caller=postgres_exporter.go:750 level=error err="Error opening connection to database (postgres://pmm:PASSWORD_REMOVED@HOST_REMOVED:PORT_REMOVED/postgres?connect_timeout=1&sslmode=disable): driver: bad connection"  agentID=/agent_id/ef46e805-0e0a-4246-9b94-d21be2e69ba7 component=agent-process type=postgres_exporter
WARN[2023-10-20T09:38:29.330+00:00] Process: exited: signal: killed.              agentID=/agent_id/ef46e805-0e0a-4246-9b94-d21be2e69ba7 component=agent-process type=postgres_exporter

dmesg log:
[Fri Oct 20 09:38:24 2023] pmm-agent invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[Fri Oct 20 09:38:24 2023] CPU: 0 PID: 2403132 Comm: pmm-agent Not tainted 5.4.0-164-generic #181-Ubuntu
[Fri Oct 20 09:38:24 2023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
[Fri Oct 20 09:38:24 2023] Call Trace:
[Fri Oct 20 09:38:24 2023]  dump_stack+0x6d/0x8b
[Fri Oct 20 09:38:24 2023]  dump_header+0x4f/0x1eb
[Fri Oct 20 09:38:24 2023]  oom_kill_process.cold+0xb/0x10
[Fri Oct 20 09:38:24 2023]  out_of_memory+0x1cf/0x500
[Fri Oct 20 09:38:24 2023]  mem_cgroup_out_of_memory+0xbd/0xe0
[Fri Oct 20 09:38:24 2023]  try_charge+0x77c/0x810
[Fri Oct 20 09:38:24 2023]  mem_cgroup_try_charge+0x71/0x190
[Fri Oct 20 09:38:24 2023]  __add_to_page_cache_locked+0x2ff/0x3f0
[Fri Oct 20 09:38:24 2023]  ? scan_shadow_nodes+0x30/0x30
[Fri Oct 20 09:38:24 2023]  add_to_page_cache_lru+0x4d/0xd0
[Fri Oct 20 09:38:24 2023]  pagecache_get_page+0x101/0x300
[Fri Oct 20 09:38:24 2023]  filemap_fault+0x6b2/0xa50
[Fri Oct 20 09:38:24 2023]  ? unlock_page_memcg+0x12/0x20
[Fri Oct 20 09:38:24 2023]  ? page_add_file_rmap+0xff/0x1a0
[Fri Oct 20 09:38:24 2023]  ? xas_load+0xd/0x80
[Fri Oct 20 09:38:24 2023]  ? xas_find+0x17f/0x1c0
[Fri Oct 20 09:38:24 2023]  ? filemap_map_pages+0x24c/0x380
[Fri Oct 20 09:38:24 2023]  ext4_filemap_fault+0x32/0x50
[Fri Oct 20 09:38:24 2023]  __do_fault+0x3c/0x170
[Fri Oct 20 09:38:24 2023]  do_fault+0x24b/0x640
[Fri Oct 20 09:38:24 2023]  __handle_mm_fault+0x4c5/0x7a0
[Fri Oct 20 09:38:24 2023]  handle_mm_fault+0xca/0x200
[Fri Oct 20 09:38:24 2023]  do_user_addr_fault+0x1f9/0x450
[Fri Oct 20 09:38:24 2023]  __do_page_fault+0x58/0x90
[Fri Oct 20 09:38:24 2023]  do_page_fault+0x2c/0xe0
[Fri Oct 20 09:38:24 2023]  do_async_page_fault+0x39/0x70
[Fri Oct 20 09:38:24 2023]  async_page_fault+0x34/0x40
[Fri Oct 20 09:38:24 2023] RIP: 0033:0x43730f
[Fri Oct 20 09:38:24 2023] Code: Bad RIP value.
[Fri Oct 20 09:38:24 2023] RSP: 002b:00007f8f94ff84f8 EFLAGS: 00010206
[Fri Oct 20 09:38:24 2023] RAX: ffffffffffffff92 RBX: 0000000000000000 RCX: 0000000000473d63
[Fri Oct 20 09:38:24 2023] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 000000000236c030
[Fri Oct 20 09:38:24 2023] RBP: 00007f8f94ff8538 R08: 0000000000000000 R09: 0000000000000000
[Fri Oct 20 09:38:24 2023] R10: 00007f8f94ff8528 R11: 0000000000000206 R12: 00007f8f94ff8528
[Fri Oct 20 09:38:24 2023] R13: 0000000000000013 R14: 000000c0001036c0 R15: 000000c000452000
[Fri Oct 20 09:38:24 2023] memory: usage 33554432kB, limit 33554432kB, failcnt 640017
[Fri Oct 20 09:38:24 2023] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[Fri Oct 20 09:38:24 2023] kmem: usage 90612kB, limit 9007199254740988kB, failcnt 0
[Fri Oct 20 09:38:24 2023] Memory cgroup stats for /docker/cc818c9a63d41a2755ed35aa07d5d06c357b5512972fc76d74941c1983731f9a:
[Fri Oct 20 09:38:24 2023] anon 34263621632
                           file 1167360
                           kernel_stack 1216512
                           slab 20291584
                           sock 0
                           shmem 0
                           file_mapped 0
                           file_dirty 0
                           file_writeback 0
                           anon_thp 4464836608
                           inactive_anon 0
                           active_anon 34263457792
                           inactive_file 0
                           active_file 0
                           unevictable 0
                           slab_reclaimable 11001856
                           slab_unreclaimable 9289728
                           pgfault 8587986
                           pgmajfault 71247
                           workingset_refault 1611621
                           workingset_activate 115170
                           workingset_nodereclaim 0
                           pgrefill 723458
                           pgscan 8194323
                           pgsteal 1630807
                           pgactivate 480414
                           pgdeactivate 599413
                           pglazyfree 0
                           pglazyfreed 0
                           thp_fault_alloc 1617
                           thp_collapse_alloc 0
[Fri Oct 20 09:38:24 2023] Tasks state (memory values in pages):
[Fri Oct 20 09:38:24 2023] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Fri Oct 20 09:38:24 2023] [2403075]  1002 2403075   178251      210    86016        0             0 pmm-agent-entry
[Fri Oct 20 09:38:24 2023] [2403121]  1002 2403121   349731     2272   274432        0             0 pmm-agent
[Fri Oct 20 09:38:24 2023] [2403137]  1002 2403137   180996     5064   163840        0             0 vmagent
[Fri Oct 20 09:38:24 2023] [2403139]  1002 2403139   181981     2540   159744        0             0 node_exporter
[Fri Oct 20 09:38:24 2023] [2403156]  1002 2403156  8558644  8354434 67321856        0             0 postgres_export
[Fri Oct 20 09:38:24 2023] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cc818c9a63d41a2755ed35aa07d5d06c357b5512972fc76d74941c1983731f9a,mems_allowed=0,oom_memcg=/docker/cc818c9a63d41a2755ed35aa07d5d06c357b5512972fc76d74941c1983731f9a,task_memcg=/docker/cc818c9a63d41a2755ed35aa07d5d06c357b5512972fc76d74941c1983731f9a,task=postgres_export,pid=2403156,uid=1002
[Fri Oct 20 09:38:24 2023] Memory cgroup out of memory: Killed process 2403156 (postgres_export) total-vm:34234576kB, anon-rss:33417736kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:65744kB oom_score_adj:0
[Fri Oct 20 09:38:28 2023] oom_reaper: reaped process 2403156 (postgres_export), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
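
The figures in the OOM report above can be cross-checked with a little arithmetic. This sketch assumes a 4 KiB page size (the log does not state it, but it is the usual x86_64 default) and takes the `rss` value of `postgres_export` from the task table and the limit from the `memory: usage ... limit ...` line; the helper names are illustrative only.

```python
PAGE_SIZE_KIB = 4  # assumption: 4 KiB pages, not stated in the log


def pages_to_kib(pages: int) -> int:
    """Convert a page count from the 'Tasks state' table to KiB."""
    return pages * PAGE_SIZE_KIB


def kib_to_gib(kib: int) -> float:
    """Convert KiB (printed as 'kB' by the kernel) to GiB."""
    return kib / (1024 * 1024)


rss_pages = 8354434    # postgres_export rss, in pages
limit_kib = 33554432   # memory cgroup limit, in kB

rss_kib = pages_to_kib(rss_pages)
print(rss_kib)                 # 33417736, the anon-rss value in the kill line
print(kib_to_gib(limit_kib))   # 32.0, the container limit in GiB
print(rss_kib / limit_kib)     # share of the limit held by postgres_export
```

Under that assumption, postgres_exporter alone accounts for roughly 99.6% of the container's 32 GiB limit, which is consistent with the other processes in the table holding only a few thousand pages each.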

Code of Conduct

  • I agree to follow Percona Community Code of Conduct
@Yuskovich Yuskovich added the bug Bug report label Oct 20, 2023
@BupycHuk
Member

Hello @Yuskovich, we are working on fixing this problem. We released some improvements in PMM 2.40.1 and are going to provide more in PMM 2.41.0. Please upgrade to 2.40.1 and let us know whether it helps.

@Yuskovich
Author

Hello @BupycHuk, we have updated the server and agent to version 2.40.1. The issue still persists. We have now found a way to reproduce it:

  • Create a new PostgreSQL cluster
  • Create many schemas (about 10k in our case)
  • Create many empty tables (about 70k in our case)
  • Deploy pmm-client in Docker and add a PostgreSQL service

@Yuskovich Yuskovich reopened this Oct 25, 2023
@BupycHuk
Member

Got it, thank you. Please wait for 2.41.0; it should be fixed in that upcoming release.

@BupycHuk
Member

@Yuskovich what about the number of databases?
