patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster' #3039

soroshsabz · 2022-02-15T19:55:54Z

ITNOA

Overview

I tried to create a PostgreSQL cluster from your example in https://github.com/CrunchyData/postgres-operator-examples (kustomize/postgres), but after running it with some modifications I see the error below:

ssoroosh@master:~$ kubectl get pods -n pgo
NAME                                       READY   STATUS             RESTARTS       AGE
harbor-postgres-cluster-instance1-2j4l-0   2/3     CrashLoopBackOff   5 (117s ago)   6m2s
harbor-postgres-cluster-repo-host-0        1/1     Running            0              6m1s
pgo-68db564fb5-6h2pc                       1/1     Running            1 (12h ago)    3d21h

Environment

Please provide the following details:

Platform: Kubernetes
Platform Version: 1.23.3

ssoroosh@master:~$ kubectl get nodes
NAME     STATUS   ROLES                  AGE    VERSION
host1    Ready    <none>                 113d   v1.23.3
host2    Ready    <none>                 113d   v1.23.3
host3    Ready    <none>                 113d   v1.23.3
host4    Ready    <none>                 66m    v1.23.3
master   Ready    control-plane,master   116d   v1.23.3

PGO Image Tag: registry.developers.crunchydata.com/crunchydata/postgres-operator:ubi8-5.0.4-0
Postgres Version : registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-13.5-0
Storage: openebs-hostpath

Steps to Reproduce

REPRO

Provide steps to get to the error condition:

Run kubectl apply -k postgres/ (my kustomization.yaml is at the end of this issue; I did not edit anything in postgres.yaml)
Run kubectl get pods -n pgo

EXPECTED

I expected to see three pods with a status of Running.

ACTUAL

ssoroosh@master:~$ kubectl get -n pgo pods

NAME                                       READY   STATUS             RESTARTS       AGE
harbor-postgres-cluster-instance1-2j4l-0   2/3     CrashLoopBackOff   6 (4m4s ago)   11m
harbor-postgres-cluster-repo-host-0        1/1     Running            0              11m
pgo-68db564fb5-6h2pc                       1/1     Running            1 (12h ago)    3d21h

Logs

kubectl logs -n pgo harbor-postgres-cluster-instance1-nnlx-0 database
2022-02-15 19:24:14,647 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-02-15 19:24:14,649 INFO: Lock owner: None; I am harbor-postgres-cluster-instance1-nnlx-0
2022-02-15 19:24:14,741 INFO: trying to bootstrap a new cluster
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf-8".
The default text search configuration will be set to "english".

Data page checksums are enabled.

creating directory /pgdata/pg13 ... ok
creating directory /pgdata/pg13_wal ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
sh: line 1:   824 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=100 -c shared_buffers=1000 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   826 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=50 -c shared_buffers=500 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   828 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=40 -c shared_buffers=400 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   830 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=30 -c shared_buffers=300 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   832 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=200 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
selecting default max_connections ... 20
sh: line 1:   834 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=16384 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   836 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=8192 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   838 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=4096 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   840 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=3584 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   842 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=3072 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   844 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=2560 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   846 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=2048 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   848 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=1536 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   850 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=1000 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   852 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=900 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   854 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=800 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   856 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=700 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   858 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=600 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   860 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=500 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   862 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=400 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   864 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=300 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   866 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=200 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   868 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=100 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:   870 Bus error               (core dumped) "/usr/pgsql-13/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=50 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
selecting default shared_buffers ... 400kB
selecting default time zone ... UTC
creating configuration files ... ok
child process was terminated by signal 7: Bus error
initdb: removing data directory "/pgdata/pg13"
initdb: removing WAL directory "/pgdata/pg13_wal"
pg_ctl: database system initialization failed
2022-02-15 19:24:18,254 INFO: removing initialize key after failed attempt to bootstrap the cluster
Traceback (most recent call last):
  File "/usr/local/bin/patroni", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/patroni/__init__.py", line 171, in main
    return patroni_main()
  File "/usr/local/lib/python3.6/site-packages/patroni/__init__.py", line 139, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/local/lib/python3.6/site-packages/patroni/daemon.py", line 100, in abstract_main
    controller.run()
  File "/usr/local/lib/python3.6/site-packages/patroni/__init__.py", line 109, in run
    super(Patroni, self).run()
  File "/usr/local/lib/python3.6/site-packages/patroni/daemon.py", line 59, in run
    self._run_cycle()
  File "/usr/local/lib/python3.6/site-packages/patroni/__init__.py", line 112, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1471, in run_cycle
    info = self._run_cycle()
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1345, in _run_cycle
    return self.post_bootstrap()
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1238, in post_bootstrap
    self.cancel_initialization()
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1231, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'

Additional Information

My kustomization.yaml file is shown below:

# ITNOA
#
# Define the Postgresql Cluster for Harbor
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: pgo

resources:
- postgres.yaml

# For more information please see https://kubectl.docs.kubernetes.io/references/kustomize/kustomization/patches/
# TODO: Patch using Patch Strategic Merge
patches:
  # Naming patch
  - patch: |-
      - op: replace
        path: /metadata/name
        value: harbor-postgres-cluster
    target:
      kind: PostgresCluster
  # Storage class name patch
  - patch: |-
      - op: add
        path: /spec/instances/0/dataVolumeClaimSpec/storageClassName
        value: openebs-hostpath
      - op: add
        path: /spec/backups/pgbackrest/repos/0/volume/volumeClaimSpec/storageClassName
        value: openebs-hostpath
    target:
      kind: PostgresCluster

I checked my PVCs and PVs, and as far as I can tell everything looks fine:

ssoroosh@master:~/ScalableProductionReadyServiceSample/Deployment/Harbor/postgres$ kubectl get pvc -n pgo
NAME                                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS       AGE
harbor-postgres-cluster-instance1-2j4l-pgdata   Bound    pvc-23e7c073-695a-4c8a-899f-983716d9d819   1Gi        RWO            openebs-hostpath   15m
harbor-postgres-cluster-repo1                   Bound    pvc-12ac0acf-4d57-411e-9c1a-af5aa1f57e55   1Gi        RWO            openebs-hostpath   15m

ssoroosh@master:~/ScalableProductionReadyServiceSample/Deployment/Harbor/postgres$ kubectl -n pgo get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                               STORAGECLASS       REASON   AGE
pvc-12ac0acf-4d57-411e-9c1a-af5aa1f57e55   1Gi        RWO            Delete           Bound    pgo/harbor-postgres-cluster-repo1                   openebs-hostpath            15m
pvc-23e7c073-695a-4c8a-899f-983716d9d819   1Gi        RWO            Delete           Bound    pgo/harbor-postgres-cluster-instance1-2j4l-pgdata   openebs-hostpath            15m

soroshsabz · 2022-02-15T19:56:36Z

related to #3011

benjaminjb · 2022-03-29T20:10:49Z

Hello @soroshsabz, in the linked issue, the user solved the problem by using a different cluster (switching from Talos to k3s). I am also curious what platform you're running on (e.g., AWS, k3s, etc.) and if you've tried another platform? I cannot reproduce this problem on the platforms I use for testing, so I wonder if it's platform dependent.

Alternatively, in looking into this problem, I found a similar issue raised with Patroni: patroni/patroni#1393

There a user seems to have solved their error by turning huge_files off (I believe huge_files defaults to "try"); you can change that setting through the spec:

spec:
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          huge_files: "off"

I would be curious to see if that solves the error, especially since I cannot reproduce it.

soroshsabz · 2022-04-01T16:17:38Z

@benjaminjb Hi, I do not use any external platform; I created my cluster in an on-premises lab.

Thanks

andrewlecuyer · 2022-06-07T00:02:17Z

@soroshsabz this is due to a known issue in Kubernetes:

kubernetes/kubernetes#71233

And as described by @benjaminjb, you should be able to work around this issue by setting huge_pages to off in your PostgreSQL configuration, e.g.:

spec:
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          huge_pages: "off"
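
For anyone applying this through a kustomization.yaml like the one in the original post, the same setting could probably be expressed as a JSON patch in the style already used there. This is only a sketch under assumptions not confirmed in this thread: it assumes the base postgres.yaml has no existing spec.patroni section (an "add" at /spec/patroni would replace one if it did), and the comment text is illustrative.

```yaml
patches:
  # Hypothetical patch: set huge_pages to "off" via Patroni dynamic configuration.
  # Assumes /spec/patroni is not already defined in postgres.yaml.
  - patch: |-
      - op: add
        path: /spec/patroni
        value:
          dynamicConfiguration:
            postgresql:
              parameters:
                huge_pages: "off"
    target:
      kind: PostgresCluster
```

The value is quoted so that YAML does not read off as a boolean.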

cr1cr1 · 2022-09-07T20:15:40Z

As per the Postgres docs, the parameter is huge_pages, not ~~huge_files~~.

Anyway, setting it to off did not work for me, with either huge_files or huge_pages. This seems to be an issue with Postgres running under Kubernetes, not a CrunchyData operator issue or a Patroni issue.

PGO: 5.2.0

PG: 14.5

Did some investigating and tried to set limits with hugepages for the instance:

      resources:
        limits:
          memory: 500Mi
          hugepages-2Mi: 500Mi

and it seemed to work.

In my case, I needed to enable hugepage support on nodes for other software.
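
For context, here is a sketch of where that resources block would sit in the PostgresCluster spec. It assumes the instance set is named instance1 (inferred from the pod names in the original post) and that the node already has 2Mi huge pages pre-allocated; the sizes are simply the ones quoted above, not a recommendation.

```yaml
spec:
  instances:
    - name: instance1          # assumed name, taken from the pod name prefix
      resources:
        limits:
          memory: 500Mi        # regular memory limit for the instance pod
          hugepages-2Mi: 500Mi # 2Mi huge pages made available to the pod
      # ... other instance fields (dataVolumeClaimSpec, etc.) unchanged
```

Kubernetes defaults the huge page request to the limit when only the limit is given, so no separate requests entry should be needed here.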

More info:

David-Angel · 2023-01-06T15:33:07Z

The workaround of enabling hugepages isn't going to work when you are required to disable hugepages.
We have a requirement not to use hugepages, and they can't be disabled without altering the CrunchyData code.

This file needs to change to turn it off.
./usr/pgsql-14/share/postgresql.conf.sample

initdb uses that file instead of the standard config file.

When huge pages are turned on in the system, initdb uses the "postgresql.conf.sample" file, which causes the process to crash in Kubernetes. Turning off huge pages in this file would resolve the issue. Here are some links for further information:

CrunchyData: CrunchyData/postgres-operator#3477, CrunchyData/postgres-operator#3039, CrunchyData/postgres-operator#2258, CrunchyData/postgres-operator#3126, CrunchyData/postgres-operator#3421
Bitnami: bitnami/charts#7901

cr1cr1 · 2023-06-15T13:52:25Z

Actually, setting what @andrewlecuyer suggested above works without setting hugepages-2Mi.

jmckulk added the v5 label Feb 15, 2022

benjaminjb mentioned this issue Apr 2, 2022

'huge_pages' does not take effect #3126

Closed

2 tasks

andrewlecuyer closed this as completed Jun 7, 2022

howels mentioned this issue Nov 29, 2022

Not honoring hugepages setting during initdb causes DB crash #3477

Closed

David-Angel mentioned this issue Jan 20, 2023

Update postgresql.conf.sample to fix huge_pages in Kubernetes postgres/postgres#114

Closed
