Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error when unsealing restored raft snapshot: failed to create cipher: crypto/aes: invalid key size 0 #28880

Open
cottand opened this issue Nov 12, 2024 · 3 comments

Comments

@cottand
Copy link

cottand commented Nov 12, 2024

Describe the bug

We have a development cluster of Vault that is a few years old, with a 3-out-of-7 Shamir seal structure. After having migrated it to raft, I took a raft snapshot via the API, then proceeded to restore it in another 5-node cluster (also 3-out-of-7 Shamir seal structure). The API call to restore does succeed.

When trying to unseal the newly restored cluster, I get the following error after the last unseal key.

Error unsealing: Error making API request.

URL: PUT http://server1:8200/v1/sys/unseal
Code: 500. Errors:

* failed to create cipher: crypto/aes: invalid key size 0

I have managed to reproduce this for several snapshots of the old cluster, as well as when restoring the snapshots in a 3-node local cluster.

I have successfully restored snapshots of 'fresh' clusters with otherwise the exact same configuration, which leads me to think there is something wrong with the state of my original cluster.

To Reproduce
I struggle to give instructions to reproduce this, because it only happens with snapshots of the original development cluster,

Steps to reproduce the behavior:

  1. Start a new vault cluster (3-out-of-7 Shamir seal)
  2. Initialise
  3. Unseal
  4. valut operator raft snapshot restore -force original-cluster.backup
  5. Try to unseal

Expected behavior
Unsealing the restored cluster to just succeed

Environment:

  • Vault Server Version (retrieve with vault status): 1.17.6
  • Vault CLI Version (retrieve with vault version): 1.17.6
  • Server Operating System/Architecture: linux x86_46

Additional context
Most interestingly, I can unseal the same snapshot on 1-node raft clusters.
I also managed to restore the snapshot on a new 5-node cluster by:

  • restoring the snapshot on a 1-node cluster
  • running vault operator rekey
  • taking a new snapshot
  • restoring that snapshot on the 5-node cluster

This last bit above is what makes me think that this is a bug, as my snapshot can be restored on 1-node clusters, and rekeying it does allow me to restore it on 5-node clusters.

We would like to migrate to raft in prod too, but first we need to be confident that we will be able to restore production snapshots. Unfortunately that is much harder for us to test, as we would rather not restore the production snapshots if we can help it. Because of this we are keen to understand why snapshots from the original cluster cannot be restored easily.

@mpalmi
Copy link
Contributor

mpalmi commented Nov 12, 2024

Thanks for the report, @cottand! The error in question appears to be coming from this call returning nil, which could be happening for a number of reasons.

Would you mind sharing your current server configuration (with any PII obscured, of course)? This will be helpful in reproducing the behavior you’re seeing.

Aside from that, there should be more detailed errors in Vault’s operational logs. Would you mind checking your operational logs for ERROR-level lines that seem relevant? Sorry for not having an immediate answer, but I’m confident we can get to the bottom of this!

@cottand
Copy link
Author

cottand commented Nov 13, 2024

Could you list some of the reasons why the call could be returning nil?

In the meantime, here is the Vault config of the cluster the original snapshot is from, which is itself almost identical to the config of the cluster I am restoring into:

storage "raft" {
    path = "/vault/raft-db"
    performance_multiplier = 1

    retry_join {
        auto_join = "provider=aws region=eu-west-1 tag_key=role tag_value=_______"
        auto_join_port = 8300
        leader_ca_cert_file = "/etc/ssl/certs/root-ca.pem"
        leader_client_cert_file = "/etc/certs/vault/cert.pem"
        leader_client_key_file = "/etc/certs/vault/key.pem"
    }
}

listener "tcp" {
    address     = "0.0.0.0:8200"
    tls_disable = 1
}

listener "tcp" {
    address     = "0.0.0.0:8300"
    tls_cert_file = "/etc/certs/vault/cert.pem"
    tls_key_file  = "/etc/certs/vault/key.pem"
}


api_addr = "https://_________________:8300"
cluster_addr = "https://_________________:8301"

plugin_directory = "/opt/vault/plugins"

disable_mlock = true
ui = true

telemetry {
  statsd_address = "127.0.0.1:9125"
}

@cottand
Copy link
Author

cottand commented Nov 13, 2024

Here are some logs from a local test cluster when performing the 3rd unseal key input. Note these are interleaved journalctl logs of several servers (server1, server2, server3, on the left column).

This is an output from a scripted environment, where unseal commands are run directly in each node (which is why you see some client output too)

server1 # [   37.989086] vault[832]: 2024-11-12T09:28:12.277Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=["req_fw_sb-act_v1"]
server1 # [   38.590723] vault[832]: 2024-11-12T09:28:12.878Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=["req_fw_sb-act_v1"]
server2 # [   38.353526] vault[836]: 2024-11-12T09:28:12.637Z [INFO]  storage.raft: snapshot network transfer progress: read-bytes=52739096 percent-complete="100.00%"
server2 # [   38.378866] vault[836]: 2024-11-12T09:28:12.663Z [INFO]  storage.raft: copied to local snapshot: bytes=52739096
server2 # [   38.380883] vault[836]: 2024-11-12T09:28:12.665Z [INFO]  storage.raft.fsm: installing snapshot to FSM
server2 # [   38.384020] vault[836]: 2024-11-12T09:28:12.667Z [INFO]  storage.raft.fsm: snapshot installed
server2 # [   38.386350] vault[836]: 2024-11-12T09:28:12.671Z [INFO]  storage.raft: snapshot restore progress: id=3-5923866-1731403689926 last-index=5923866 last-term=3 size-in-bytes=0 read-bytes=0 percent-complete="NaN%"
server2 # [   38.389233] vault[836]: 2024-11-12T09:28:12.674Z [INFO]  storage.raft: Installed remote snapshot
server3 # [   38.065012] vault[819]: 2024-11-12T09:28:13.356Z [INFO]  storage.raft: snapshot network transfer progress: read-bytes=52739096 percent-complete="100.00%"
server3 # [   38.086935] vault[819]: 2024-11-12T09:28:13.378Z [INFO]  storage.raft: copied to local snapshot: bytes=52739096
server3 # [   38.087722] vault[819]: 2024-11-12T09:28:13.379Z [INFO]  storage.raft.fsm: installing snapshot to FSM
server3 # [   38.088474] vault[819]: 2024-11-12T09:28:13.380Z [INFO]  storage.raft.fsm: snapshot installed
server3 # [   38.091062] vault[819]: 2024-11-12T09:28:13.383Z [INFO]  storage.raft: snapshot restore progress: id=3-5923866-1731403690574 last-index=5923866 last-term=3 size-in-bytes=0 read-bytes=0 percent-complete="NaN%"
server3 # [   38.091761] vault[819]: 2024-11-12T09:28:13.383Z [INFO]  storage.raft: Installed remote snapshot
server2 # [   38.483306] vault[836]: 2024-11-12T09:28:12.768Z [WARN]  storage.raft: skipping application of old log: index=5923866
server1 # [   38.873730] vault[832]: 2024-11-12T09:28:13.162Z [INFO]  storage.raft: pipelining replication: peer="{Nonvoter server2 server2:8201}"
server3 # [   38.188332] vault[819]: 2024-11-12T09:28:13.480Z [WARN]  storage.raft: skipping application of old log: index=5923866
server1 # [   38.970451] vault[832]: 2024-11-12T09:28:13.258Z [INFO]  storage.raft: pipelining replication: peer="{Nonvoter server3 server3:8201}"
server1 # [   38.992499] vault[832]: 2024-11-12T09:28:13.280Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=["req_fw_sb-act_v1"]
server1 # [   39.598119] vault[832]: 2024-11-12T09:28:13.886Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=["req_fw_sb-act_v1"]
server1 # [   40.333612] vault[832]: 2024-11-12T09:28:14.621Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=["req_fw_sb-act_v1"]
server1 # [   40.880143] vault[832]: 2024-11-12T09:28:15.167Z [INFO]  core: running post snapshot restore invalidations
server1 # [   40.880789] vault[832]: 2024-11-12T09:28:15.168Z [INFO]  core: failed to perform key upgrades, sealing: error="error reloading root key: error reloading root key: failed to read root key path: no decryption key available for term 2"
server1 # [   40.882057] vault[832]: 2024-11-12T09:28:15.168Z [INFO]  core: marked as sealed
server1 # [   40.882659] vault[832]: 2024-11-12T09:28:15.168Z [INFO]  core: pre-seal teardown starting
server1 # [   40.883349] vault[832]: 2024-11-12T09:28:15.168Z [INFO]  core: stopping raft active node
server1 # [   40.884039] vault[832]: 2024-11-12T09:28:15.168Z [INFO]  core: pre-seal teardown complete
server1 # [   40.885256] vault[832]: 2024-11-12T09:28:15.173Z [ERROR] core: unlocking HA lock failed: error="cannot find peer"
server1 # [   40.886054] vault[832]: 2024-11-12T09:28:15.173Z [ERROR] storage.raft.raft-net: failed to accept connection: error="Raft RPC layer closed"
server1 # [   40.886713] vault[832]: 2024-11-12T09:28:15.173Z [INFO]  storage.raft: aborting pipeline replication: peer="{Nonvoter server3 server3:8201}"
server1 # [   40.887256] vault[832]: 2024-11-12T09:28:15.173Z [INFO]  storage.raft: aborting pipeline replication: peer="{Nonvoter server2 server2:8201}"
server1 # [   40.887788] vault[832]: 2024-11-12T09:28:15.173Z [INFO]  core: stopping cluster listeners
server1 # [   40.888524] vault[832]: 2024-11-12T09:28:15.173Z [INFO]  core.cluster-listener: forwarding rpc listeners stopped
server3 # [   40.271642] vault[819]: 2024-11-12T09:28:15.562Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server1 # [   41.296277] vault[832]: 2024-11-12T09:28:15.583Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=["req_fw_sb-act_v1"]
server1 # [   41.297096] vault[832]: 2024-11-12T09:28:15.584Z [INFO]  core.cluster-listener: rpc listeners successfully shut down
server1 # [   41.297609] vault[832]: 2024-11-12T09:28:15.584Z [INFO]  core: cluster listeners successfully shut down
server1 # [   41.300927] vault[832]: 2024-11-12T09:28:15.586Z [INFO]  core: vault is sealed
(finished: must succeed: vault operator unseal _________________________________________, in 3.65 seconds)
server1: must succeed: vault operator unseal _________________________________________
(finished: must succeed: vault operator unseal _________________________________________, in 0.11 seconds)
server1: must succeed: vault operator unseal _________________________________________
server1 # Error unsealing: Error making API request.
server1 #
server1 # URL: PUT http://server1:8200/v1/sys/unseal
server1 # Code: 500. Errors:
server1 #
server1 # * failed to create cipher: crypto/aes: invalid key size 0                  ## <<<<<<<<<<<< the API-level error
server1: output:
command `vault operator unseal ____________________________________` failed (exit code 2)
>>> server2 # [   46.613746] vault[836]: 2024-11-12T09:28:20.897Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [   47.069448] vault[819]: 2024-11-12T09:28:22.360Z [WARN]  storage.raft: not part of stable configuration, aborting election
server3 # [   50.272312] vault[819]: 2024-11-12T09:28:25.563Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [   52.214088] vault[836]: 2024-11-12T09:28:26.498Z [WARN]  storage.raft: not part of stable configuration, aborting election
server2 # [   56.615193] vault[836]: 2024-11-12T09:28:30.898Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [   60.273123] vault[819]: 2024-11-12T09:28:35.564Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [   66.615611] vault[836]: 2024-11-12T09:28:40.899Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [   70.273883] vault[819]: 2024-11-12T09:28:45.565Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [   76.615875] vault[836]: 2024-11-12T09:28:50.900Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [   80.274173] vault[819]: 2024-11-12T09:28:55.565Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [   86.617118] vault[836]: 2024-11-12T09:29:00.901Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [   90.275506] vault[819]: 2024-11-12T09:29:05.566Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [   96.617767] vault[836]: 2024-11-12T09:29:10.902Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [  100.275955] vault[819]: 2024-11-12T09:29:15.566Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [  106.618096] vault[836]: 2024-11-12T09:29:20.902Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [  110.275971] vault[819]: 2024-11-12T09:29:25.567Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [  116.618969] vault[836]: 2024-11-12T09:29:30.903Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [  120.277206] vault[819]: 2024-11-12T09:29:35.568Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [  126.619124] vault[836]: 2024-11-12T09:29:40.903Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [  130.277792] vault[819]: 2024-11-12T09:29:45.568Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
>>> server2 # [  136.619812] vault[836]: 2024-11-12T09:29:50.904Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [  140.278102] vault[819]: 2024-11-12T09:29:55.569Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [  146.619942] vault[836]: 2024-11-12T09:30:00.904Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [  150.278923] vault[819]: 2024-11-12T09:30:05.569Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants