patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster' #3039

Closed
soroshsabz opened this issue Feb 15, 2022 · 7 comments

soroshsabz commented Feb 15, 2022

ITNOA

Overview

I am trying to create a PostgreSQL cluster from your example in https://github.com/CrunchyData/postgres-operator-examples (kustomize/postgres), but after running it with some modifications I see the error below:

ssoroosh@master:~$ kubectl get pods -n pgo
NAME                                       READY   STATUS             RESTARTS       AGE
harbor-postgres-cluster-instance1-2j4l-0   2/3     CrashLoopBackOff   5 (117s ago)   6m2s
harbor-postgres-cluster-repo-host-0        1/1     Running            0              6m1s
pgo-68db564fb5-6h2pc                       1/1     Running            1 (12h ago)    3d21h

Environment

Please provide the following details:

  • Platform: Kubernetes
  • Platform Version: 1.23.3
ssoroosh@master:~$ kubectl get nodes
NAME     STATUS   ROLES                  AGE    VERSION
host1    Ready    <none>                 113d   v1.23.3
host2    Ready    <none>                 113d   v1.23.3
host3    Ready    <none>                 113d   v1.23.3
host4    Ready    <none>                 66m    v1.23.3
master   Ready    control-plane,master   116d   v1.23.3
  • PGO Image Tag: registry.developers.crunchydata.com/crunchydata/postgres-operator:ubi8-5.0.4-0
  • Postgres Version : registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-13.5-0
  • Storage: openebs-hostpath

Steps to Reproduce

REPRO

Provide steps to get to the error condition:

  1. Run kubectl apply -k postgres/ (my kustomization.yaml is included at the end of this issue; I did not edit anything in postgres.yaml)
  2. Run kubectl get pods -n pgo

EXPECTED

I expected to see three pods with status Running.

ACTUAL

ssoroosh@master:~$ kubectl get -n pgo pods

NAME                                       READY   STATUS             RESTARTS       AGE
harbor-postgres-cluster-instance1-2j4l-0   2/3     CrashLoopBackOff   6 (4m4s ago)   11m
harbor-postgres-cluster-repo-host-0        1/1     Running            0              11m
pgo-68db564fb5-6h2pc                       1/1     Running            1 (12h ago)    3d21h

Logs

kubectl logs -n pgo harbor-postgres-cluster-instance1-nnlx-0 database
2022-02-15 19:24:14,647 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-02-15 19:24:14,649 INFO: Lock owner: None; I am harbor-postgres-cluster-instance1-nnlx-0
2022-02-15 19:24:14,741 INFO: trying to bootstrap a new cluster
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf-8".
The default text search configuration will be set to "english".

Data page checksums are enabled.

creating directory /pgdata/pg13 ... ok
creating directory /pgdata/pg13_wal ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
sh: line 1:   824 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=100 -c shared_buffers=1000 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   826 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=50 -c shared_buffers=500 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   828 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=40 -c shared_buffers=400 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   830 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=30 -c shared_buffers=300 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   832 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=200 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
selecting default max_connections ... 20
sh: line 1:   834 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=16384 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   836 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=8192 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   838 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=4096 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   840 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=3584 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   842 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=3072 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   844 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=2560 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   846 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=2048 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   848 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=1536 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   850 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=1000 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   852 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=900 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   854 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=800 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   856 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=700 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   858 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=600 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   860 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=500 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   862 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=400 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   864 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=300 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   866 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=200 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   868 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=100 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   870 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=50 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
selecting default shared_buffers ... 400kB
selecting default time zone ... UTC
creating configuration files ... ok
child process was terminated by signal 7: Bus error
initdb: removing data directory "/pgdata/pg13"
initdb: removing WAL directory "/pgdata/pg13_wal"
pg_ctl: database system initialization failed
2022-02-15 19:24:18,254 INFO: removing initialize key after failed attempt to bootstrap the cluster
Traceback (most recent call last):
  File "/usr/local/bin/patroni", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/patroni/__init__.py", line 171, in main
    return patroni_main()
  File "/usr/local/lib/python3.6/site-packages/patroni/__init__.py", line 139, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/local/lib/python3.6/site-packages/patroni/daemon.py", line 100, in abstract_main
    controller.run()
  File "/usr/local/lib/python3.6/site-packages/patroni/__init__.py", line 109, in run
    super(Patroni, self).run()
  File "/usr/local/lib/python3.6/site-packages/patroni/daemon.py", line 59, in run
    self._run_cycle()
  File "/usr/local/lib/python3.6/site-packages/patroni/__init__.py", line 112, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1471, in run_cycle
    info = self._run_cycle()
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1345, in _run_cycle
    return self.post_bootstrap()
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1238, in post_bootstrap
    self.cancel_initialization()
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1231, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'

Additional Information

My kustomization.yaml file is as follows:

# ITNOA
#
# Define the Postgresql Cluster for Harbor
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: pgo

resources:
- postgres.yaml

# For more information please see https://kubectl.docs.kubernetes.io/references/kustomize/kustomization/patches/
# TODO: Patch using Patch Strategic Merge
patches:
  # Naming patch
  - patch: |-
      - op: replace
        path: /metadata/name
        value: harbor-postgres-cluster
    target:
      kind: PostgresCluster
  # Storage class name patch
  - patch: |-
      - op: add
        path: /spec/instances/0/dataVolumeClaimSpec/storageClassName
        value: openebs-hostpath
      - op: add
        path: /spec/backups/pgbackrest/repos/0/volume/volumeClaimSpec/storageClassName
        value: openebs-hostpath
    target:
      kind: PostgresCluster

I checked my PVCs and PVs, and as far as I can tell everything looks fine:

ssoroosh@master:~/ScalableProductionReadyServiceSample/Deployment/Harbor/postgres$ kubectl get pvc -n pgo
NAME                                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS       AGE
harbor-postgres-cluster-instance1-2j4l-pgdata   Bound    pvc-23e7c073-695a-4c8a-899f-983716d9d819   1Gi        RWO            openebs-hostpath   15m
harbor-postgres-cluster-repo1                   Bound    pvc-12ac0acf-4d57-411e-9c1a-af5aa1f57e55   1Gi        RWO            openebs-hostpath   15m
ssoroosh@master:~/ScalableProductionReadyServiceSample/Deployment/Harbor/postgres$ kubectl -n pgo get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                               STORAGECLASS       REASON   AGE
pvc-12ac0acf-4d57-411e-9c1a-af5aa1f57e55   1Gi        RWO            Delete           Bound    pgo/harbor-postgres-cluster-repo1                   openebs-hostpath            15m
pvc-23e7c073-695a-4c8a-899f-983716d9d819   1Gi        RWO            Delete           Bound    pgo/harbor-postgres-cluster-instance1-2j4l-pgdata   openebs-hostpath            15m

@soroshsabz
Author

related to #3011

@jmckulk added the v5 label Feb 15, 2022
@benjaminjb
Contributor

Hello @soroshsabz, in the linked issue, the user solved the problem by using a different cluster (switching from Talos to k3s). I am also curious what platform you're running on (e.g., AWS, k3s, etc.) and if you've tried another platform? I cannot reproduce this problem on the platforms I use for testing, so I wonder if it's platform dependent.

Alternatively, in looking into this problem, I found a similar issue raised with Patroni: patroni/patroni#1393

There a user seems to have solved their error by turning huge_files off (I believe huge_files defaults to "try"); you can change that setting through the spec:

spec:
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          huge_files: "off"

I would be curious to see if that solves the error, especially since I cannot reproduce it.

@soroshsabz
Author

@benjaminjb Hi, I do not use any external platform; I created my cluster in an on-premises lab.

Thanks

@andrewlecuyer
Collaborator

andrewlecuyer commented Jun 7, 2022

@soroshsabz this is due to a known issue in Kubernetes:

kubernetes/kubernetes#71233

And as described by @benjaminjb, you should be able to work around this issue by setting huge_pages to off in your PostgreSQL configuration, e.g.:

spec:
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          huge_pages: "off"
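
For a cluster deployed through kustomize, as in the original report, the same setting could be applied as a JSON patch rather than by editing postgres.yaml. This is only a sketch: it mirrors the patch layout of the reporter's kustomization.yaml and assumes postgres.yaml does not already define spec.patroni, since an `add` on an existing key would replace that whole object.

```yaml
# Hypothetical entry for the `patches:` list in kustomization.yaml.
# Assumption: /spec/patroni is not already set in postgres.yaml.
patches:
  # Disable huge pages in the PostgreSQL configuration managed by Patroni
  - patch: |-
      - op: add
        path: /spec/patroni
        value:
          dynamicConfiguration:
            postgresql:
              parameters:
                huge_pages: "off"
    target:
      kind: PostgresCluster
```

Later comments in this thread report mixed results with this setting, so it may be worth confirming the rendered manifest with kubectl kustomize postgres/ before applying it.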

@cr1cr1

cr1cr1 commented Sep 7, 2022

Per the PostgreSQL docs, the parameter is huge_pages, not huge_files.

In any case, setting it to off did not work for me, with either huge_files or huge_pages. This seems to be an issue with PostgreSQL running under Kubernetes, not a CrunchyData operator issue or a Patroni issue.

PGO: 5.2.0

PG: 14.5

Did some investigating and tried to set limits with hugepages for the instance:

      resources:
        limits:
          memory: 500Mi
          hugepages-2Mi: 500Mi

and it seemed to work.

In my case, I needed to enable hugepage support on nodes for other software.
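
A sketch of where such limits might sit in the PostgresCluster manifest. The instance set name instance1 is an assumption inferred from the pod names in the report, the values are copied from the snippet above rather than recommended, and the nodes are assumed to have 2Mi huge pages pre-allocated.

```yaml
spec:
  instances:
    - name: instance1            # assumed name, inferred from the pod names
      resources:
        limits:
          memory: 500Mi
          hugepages-2Mi: 500Mi   # assumes the nodes expose hugepages-2Mi
```

Kubernetes treats huge pages as a resource that cannot be overcommitted, so when only a limit is given the request defaults to the same value.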

@David-Angel

The workaround of enabling hugepages isn't going to work when you are required to disable hugepages.
We have a requirement not to use hugepages, and they can't be disabled without altering the CrunchyData code.

This file needs to change to turn it off.
./usr/pgsql-14/share/postgresql.conf.sample

initdb uses that file instead of the standard config file.

David-Angel added a commit to David-Angel/postgres that referenced this issue Jan 20, 2023
When the system has huge_pages turned on initdb is using the "postgresql.conf.sample" file causing the process to crash in Kubernetes.
Turning off huge pages in this file would resolve the issue.

Here are some links for further information

Crunchydata
CrunchyData/postgres-operator#3477
CrunchyData/postgres-operator#3039
CrunchyData/postgres-operator#2258
CrunchyData/postgres-operator#3126
CrunchyData/postgres-operator#3421

Bitnami
bitnami/charts#7901
@cr1cr1

cr1cr1 commented Jun 15, 2023

Actually, setting what @andrewlecuyer suggested above works without setting hugepages-2Mi.
