Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

postgres exporter errors - database is closed #430

Open
cassanellicarlo opened this issue Sep 4, 2024 · 3 comments
Open

postgres exporter errors - database is closed #430

cassanellicarlo opened this issue Sep 4, 2024 · 3 comments
Labels
UNCONFIRMED Initial state. Need's triaged

Comments

@cassanellicarlo
Copy link

Describe the issue:

I'm running a PostgresCluster with pgo operator v5.5.0.
There are errors in the exporter container. Metrics cannot be scraped by prometheus - target is down.

Error running query on database - database is closed
postgres_exporter_watcher: line 24: 146 Killed postgres_exporter "${postgres_exporter_flags[@]}"

Logs

ts=2024-09-02T12:10:26.848Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_backrest_last_incr_backup sql: database is closed"
ts=2024-09-02T12:10:26.848Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_settings_pending_restart sql: database is closed"
ts=2024-09-02T12:10:26.848Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_database_size sql: database is closed"
ts=2024-09-02T12:10:26.847Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_backrest_last_diff_backup sql: database is closed"
ts=2024-09-02T12:10:26.849Z caller=postgres_exporter.go:682 level=error err="queryNamespaceMappings returned 40 errors"
ts=2024-09-02T12:10:26.848Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_pg_stat_statements_top_wal sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": pg_stat_activity sql: database is closed"
ts=2024-09-02T12:10:26.847Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_nodemx_cpustat sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_replication_lag sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_data_checksum_failure sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_backrest_last_diff_backup sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": pg_stat_archiver sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_nodemx_disk_activity sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": pg_stat_replication sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_nodemx_process sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_replication_slots sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_backrest_last_incr_backup sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_settings_pending_restart sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_table_size sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_database_size sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_pg_stat_statements_top_wal sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_connection_stats sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": pg_stat_activity sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_pg_stat_statements_top_total sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_replication_lag sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_backrest_last_full_backup sql: database is closed"
ts=2024-09-02T12:10:26.932Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_postgresql_version sql: database is closed"
ts=2024-09-02T12:10:26.932Z caller=postgres_exporter.go:682 level=error err="queryNamespaceMappings returned 46 errors"
ts=2024-09-02T12:10:26.932Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_data_checksum_failure sql: database is closed"
ts=2024-09-02T12:10:26.932Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_backrest_last_diff_backup sql: database is closed"
ts=2024-09-02T12:10:26.932Z caller=postgres_exporter.go:682 level=error err="queryNamespaceMappings returned 37 errors"
ts=2024-09-02T12:10:27.845Z caller=postgres_exporter.go:682 level=error err="Error opening connection to database (postgresql://ccp_monitoring:PASSWORD_REMOVED@localhost:5432/postgres): pq: the database system is in recovery mode"
ts=2024-09-02T12:10:27.845Z caller=server.go:74 level=info msg="Established new database connection" fingerprint=localhost:5432
ts=2024-09-02T12:10:27.856Z caller=postgres_exporter.go:613 level=info msg="Semantic version changed" server=localhost:5432 from=0.0.0 to=14.10.0
postgres_exporter_watcher: line 24:   146 Killed                  postgres_exporter "${postgres_exporter_flags[@]}"

If i try to call the metrics endpoint, i get the following logs

An error has occurred while serving metrics:

86 error(s) occurred:
* collected metric "pg_statio_user_tables_heap_blocks_read" { label:{name:"datname"  value:"postgres"}  label:{name:"relname"  value:"pgbackrest_info"}  label:{name:"schemaname"  value:"monitor"}  counter:{value:0}} was collected before with the same name and label values
* collected metric "pg_statio_user_tables_heap_blocks_hit" { label:{name:"datname"  value:"postgres"}  label:{name:"relname"  value:"pgbackrest_info"}  label:{name:"schemaname"  value:"monitor"}  counter:{value:0}} was collected before with the same name and label values

[...]

Describe the expected behavior:

pgo exporter works well and prometheus can scrape metrics.

Tell us about your environment:

  • OpenShift version: 4.14.33
  • pgMonitor version:
  • Container or non-container: container
  • Container name / image: registry.developers.crunchydata.com/crunchydata/postgres-operator:ubi8-5.5.0-0
  • PostgreSQL Version: 14
  • Exporter(s) in use (incl. version): https://registry.developers.crunchydata.com/crunchydata/crunchy-postgres-exporter:ubi8-0.15.0-0
  • Prometheus version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c896664ee140743667e9860ca07824ec7a37b5a557a2c36a7f133a1c3f1733e

What could be the issue? Do you have an idea?

Thanks.

@cassanellicarlo cassanellicarlo added the UNCONFIRMED Initial state. Need's triaged label Sep 4, 2024
@keithf4
Copy link
Collaborator

keithf4 commented Sep 4, 2024

Not quite sure what the database is closed error is.

The was collected before with the same name and label values error is mostly self explanatory in that it's saying that this metric was already collected before. Can be tricky to narrow down why tho. Usually in the case of table statistics like this, it's because it's missing the database name label when trying to collect them from multiple databases. But I do see the database name label there. Perhaps the same database name got added in multiple times to the connection configuration.

@andrewlecuyer any ideas here how these issues crept up in the container?

@andrewlecuyer
Copy link
Contributor

@cassanellicarlo what is the current status of this Postgres Pod? Is it up-and-running, and ready?

Considering the following error:

Error opening connection to database (postgresql://ccp_monitoring:PASSWORD_REMOVED@localhost:5432/postgres): pq: the database system is in recovery mode"

It looks like the exporter is simply unable to connect to the database. What do the Postgres logs show for this PG instance? My hunch here is that the issue is with Postgres itself, and the exporter is simply unable to connect.

@cassanellicarlo
Copy link
Author

cassanellicarlo commented Sep 4, 2024

@andrewlecuyer Postgres is running. The instance is 5/5 Ready and Running.

These are the logs

2024-09-04 14:54:17,796 INFO: no action. I am (pg-instance-0), the leader with the lock
2024-09-04 14:54:27,787 INFO: no action. I am (pg-instance-0), the leader with the lock
2024-09-04 14:54:37,798 INFO: no action. I am (pg-instance-0), the leader with the lock
2024-09-04 14:54:47,790 INFO: no action. I am (pg-instance-0), the leader with the lock
2024-09-04 14:54:57,822 INFO: no action. I am (pg-instance-0), the leader with the lock
2024-09-04 14:55:07,789 INFO: no action. I am (pg-instance-0), the leader with the lock
2024-09-04 14:55:17,791 INFO: no action. I am (pg-instance-0), the leader with the lock
2024-09-04 14:55:27,805 INFO: no action. I am (pg-instance-0), the leader with the lock

Previous logs:

2024-09-02 12:45:37,640 INFO: Lock owner: ; I am pg-instance-0
2024-09-02 12:45:37,641 INFO: starting as a secondary
2024-09-02 12:45:38.526 UTC [2029083] LOG:  pgaudit extension initialized
2024-09-02 12:45:38,541 INFO: postmaster pid=2029083
/tmp/postgres:5432 - no response
2024-09-02 12:45:38.639 UTC [2029083] LOG:  redirecting log output to logging collector process
2024-09-02 12:45:38.639 UTC [2029083] HINT:  Future log output will appear in directory "log".
/tmp/postgres:5432 - rejecting connections
/tmp/postgres:5432 - rejecting connections
/tmp/postgres:5432 - rejecting connections
/tmp/postgres:5432 - rejecting connections
/tmp/postgres:5432 - accepting connections
2024-09-02 12:45:42,671 INFO: establishing a new patroni connection to the postgres cluster
2024-09-02 12:45:42,721 INFO: My wal position exceeds maximum replication lag
2024-09-02 12:45:42,814 INFO: following a different leader because i am not the healthiest node
2024-09-02 12:45:52,671 INFO: My wal position exceeds maximum replication lag
2024-09-02 12:45:52,735 INFO: following a different leader because i am not the healthiest node
2024-09-02 12:46:02,672 INFO: My wal position exceeds maximum replication lag
2024-09-02 12:46:02,721 INFO: following a different leader because i am not the healthiest node
2024-09-02 12:46:13,912 INFO: Got response from <omitted> <omitted>: {"state": "starting", "role": "replica", "cluster_unlocked": true, "dcs_last_seen": 1725281167, "database_system_identifier": "<omitted>", "patroni": {"version": "3.1.1", "scope": "---}}
2024-09-02 12:46:13,912 WARNING: Request failed to pg-instance-0: GET <omitted> ('xlog')
2024-09-02 12:46:14,252 INFO: Got response from <omitted> https://<omitted>:8008/patroni: {"state": "starting", "role": "replica", "cluster_unlocked": true, "dcs_last_seen": 1725281167, "database_system_identifier": "<omitted>", "patroni": {"version": "3.1.1", "scope": "<omitted>"}}
2024-09-02 12:46:14,252 WARNING: <omitted>: GET https://<omitted>:8008/patroni ('xlog')
2024-09-02 12:46:14,329 INFO: promoted self to leader by acquiring session lock
server promoting
2024-09-02 12:46:15,410 INFO: no action. I am (pg-instance-0), the leader with the lock

Seems like the exporter got stuck when the database was not available (2024-09-02) / during a failover and then it couldn't recover anymore (postgres_exporter_watcher: line 24: 146 Killed).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
UNCONFIRMED Initial state. Need's triaged
Projects
None yet
Development

No branches or pull requests

3 participants