MSC4081: Eagerly sharing fallback keys with federated servers #4081

129 changes: 129 additions & 0 deletions proposals/4081-claim-fallback-keys-on-network-failure.md
@@ -0,0 +1,129 @@
# MSC4081: Claim fallback key on network failures
Member

@ara4n's comment somehow ended up on the commit rather than on the PR, so copying here:

I'm a bit worried about this: it's (nominally) weakening security in order to work around network reliability issues. it reminds me of our misadventures in key gossiping, where we similarly weaken security to mainly work around bad retry mechanisms and network unreliability.

If our server can't talk to the other server, i wonder if we should warn the sender (e.g. a "can't contact bob.com!" warning on the message) and then retry? the sender will know to keep the app open while it retries (just as they would if they were stuck sending the message too)? This feels better than giving up and sending the message with (nominally) lower security, and could also make the app feel more responsive with appropriate UX (i.e. rather than being stuck in 'sending' state for ages while a `/keys/claim` times out, it could declare itself sent to 10 out of 11 servers, or similar).

Member

ah, thanks for rescuing it - GH mobile app doing weird things. i was about to rewrite it from scratch.

Member

@BillCarsonFr BillCarsonFr Nov 27, 2023

Can we clarify to what extent it weakens security? The use of fallback keys is, I think, mitigated by the fact that the double ratchet will do a DH step on the next message, and that will restore security.
Fallback keys exist so that communication does not break when all OTKs are exhausted (as a convenience, and it's mitigated), so why can't they also be used for transient federation connectivity problems?
Maybe they could be modified to have a TTL?
And in case of a replay attack (the same prekey message sent over and over), couldn't the client add some additional mitigations (as it could already, BTW)? Like detecting abusive use of the fallback key?

Member Author

@kegsay kegsay Nov 27, 2023

I'd like to re-emphasise the nominal reduction in security: in reality there is negligible impact, further reinforced by other secure protocols (Signal in this case) allowing OTKs to be optional in the setup phase. I think this MSC is overall net positive, as it makes the protocol more robust, and fixes concrete bugs we've seen in the wild.

Member

I guess the important thing to emphasise in the threat model is that OTKs are security theatre regardless once you introduce fallback keys, given an attacker can force use of the fallback key both by exhausting the OTK pool (which leaves an audit trail) and by simply denying the network (which doesn't leave an audit trail).

So, it feels like the only reasons we end up keeping OTKs are:
a) To enjoy their (nominal) security properties for paranoid deployments which disable fallback keys
b) To keep exercising the OTK code path, even when fallback keys are around, to help stop it regressing for deployments where fallback keys are disabled.

In which case, yes, perhaps this MSC isn't as bad as it felt at first.

Member

> I guess the important thing to emphasise in the threat model is that OTKs are security theatre regardless once you introduce fallback keys, given an attacker can force use of the fallback key both by exhausting the OTK pool (which leaves an audit trail) and by simply denying the network (which doesn't leave an audit trail).
>
> So, it feels like the only reasons we end up keeping OTKs are: a) To enjoy their (nominal) security properties for paranoid deployments which disable fallback keys b) To keep exercising the OTK code path, even when fallback keys are around, to help stop it regressing for deployments where fallback keys are disabled.

That's not quite right. As I see it, OTKs guard against a passive attacker, who has nevertheless managed to snarf the network data, and then later gets access to [the data on] Bob's device. You don't have to be paranoid and disable fallback keys to benefit from them. I've linked to https://crypto.stackexchange.com/a/52825 in the doc, as I think it really helps explain this.

So yes, an attacker with access to the network between homeservers can now force use of a fallback key where previously no communication would happen at all. But it's far easier to claim all the OTKs than it is to get access to the network to block that /claim request, so I'm not sure it's really moving the needle?


*Abstract: This MSC aims to increase the robustness of the Olm session setup protocol over federation.
With this MSC, transient network failures over federation will not cause undecryptable messages due to
failing to claim OTKs.*

In order for clients to establish secure communication channels between devices, they need to "claim" one-time keys
(OTKs) that were previously uploaded by the device they wish to talk to. One-time keys, as the name suggests, must
only be used once. However, this presents several problems:
- what happens when the device does not upload more keys and the uploaded keys are all used up? (key exhaustion)
- what happens if the OTK cannot be claimed due to transient network failures? (network failure)

[MSC2732](https://github.com/matrix-org/matrix-spec-proposals/pull/2732) introduced the concept of "fallback keys"
which can be claimed when OTKs are exhausted. Fallback keys provide weaker security properties than one-time keys,
specifically impacting forward secrecy, which protects past sessions against future compromises of keys or passwords.
The risk is that if the private part of the fallback key is exposed, an attacker may use the key to decrypt earlier
sessions. This can be mitigated by cycling the fallback key (and hence deleting the private key) once it has been
used, with some lag time to account for slow networks.
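
The cycling described above can be sketched as follows. This is only an illustration: `FallbackKeyStore`, its methods and the use of bare key IDs are hypothetical, and the one-hour lag is the value this document attributes to MSC2732 further down, not something a client is obliged to use.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: one hour, the lag this document says MSC2732 recommends.
ROTATION_LAG_SECONDS = 60 * 60


@dataclass
class FallbackKeyStore:
    """Hypothetical client-side record of which fallback private keys are held."""

    current: str                                 # ID of the published fallback key
    previous: Optional[str] = None               # old key kept for delayed messages
    previous_expires_at: Optional[float] = None  # when the old private key is deleted

    def rotate(self, new_key_id: str, now: float) -> None:
        """Publish a new fallback key and start the grace period for the old one."""
        self.previous = self.current
        self.previous_expires_at = now + ROTATION_LAG_SECONDS
        self.current = new_key_id

    def purge_expired(self, now: float) -> None:
        """Drop the old private key once the lag has elapsed."""
        if self.previous_expires_at is not None and now >= self.previous_expires_at:
            self.previous = None
            self.previous_expires_at = None

    def can_decrypt_with(self, key_id: str, now: float) -> bool:
        """True if a session set up against `key_id` could still be established."""
        self.purge_expired(now)
        return key_id in (self.current, self.previous)
```

Note that in this sketch only one previous key is kept, so rotating twice in quick succession evicts the oldest key immediately.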

## Proposal

Currently, fallback keys are _only_ claimed on key exhaustion, not due to transient network failures. This MSC
proposes to change the semantics to allow fallback keys to be returned by the `/keys/claim` endpoint if the target
device's homeserver is unreachable. In order for servers to return fallback keys during the network failure,
the fallback keys must be cached _in advance_ on the claiming user's homeserver. This MSC proposes adding a new
key `fallback_key` to the `m.device_list_update` EDU. This MSC proposes changing the spec wording (bold is new):

> Servers must send `m.device_list_update` EDUs to all the servers who share a room with a given local user, and
> must be sent whenever that user’s device list changes (i.e. for new or deleted devices, when that user joins a
> room which contains servers which are not already receiving updates for that user’s device list, or changes in
> device information such as the device’s human-readable name **or fallback key**).

Member

One potential issue that I can see with pre-sending fallback keys is something like this scenario:

  • Alice uploads fallback key A, which gets sent to Bob's server
  • Bob's server goes down for a while
  • Alice receives some olm messages that use the fallback keys, so rotates her keys, first to fallback B and then to fallback C. At this point, she has evicted private key A. Since Bob's server is down, it doesn't receive the new fallback keys.
  • Alice's server goes down for a while, and then Bob's server comes back up
  • Bob tries to establish an Olm session with Alice, receives fallback A, and sends an encrypted message to Alice
  • Alice's server comes back up, and Alice receives the message from Bob, but can't decrypt since she doesn't have private key A any more

This is worse than the situation where Bob doesn't get any OTK for Alice, since if he doesn't receive any key, he knows about the failure and can retry later. On the other hand, this is likely a very rare scenario, so may not be worth worrying about.

Member Author

This could be mitigated by either:

  • signalling to the sender that the session was undecryptable. Whilst this can be prompt, I would be worried about DoS and oracle-like attacks though. DoS because now attackers have a way to cause clients to send traffic automatically by sending keys the client doesn't have. Oracle because an attacker can send various keys and know if the client has the private key on-disk still, i.e. it exposes whether the key has been evicted or not, which is a critical part of forward secrecy.
  • treating fallback-initiated sessions as unreliable, and hence if you do not get an established session after time N, try to claim another OTK and try again? Unsure of the security implications here, as an attacker may be able to make use of the fact that there may be >1 established Olm session?

Member

I think this scenario is: "what if Bob isn't told that Alice has rotated her fallback key, and tries to use a stale cached fallback key". This is very similar to the "what if Alice restores her server from backup, and starts handing out stale OTKs" failure mode.

I think we have to consider these as wedged sessions, and keep retrying setup from the client (with your server nudging you to retry by waking you up with a push, or similar, when the server sees that the remote server has updated its device list).

Member Author

Agreed, this is basically the same as OTK reuse where the session is encrypted for a key the client has dropped. The frequency of these is quite different though.

Server backups will reliably cause OTK reuse currently, because we have no mechanism to tell clients that they need to reupload all their OTKs again (and even if we did, there would be a race condition where some user has claimed it during the reupload process). As a result, if during the bad deployment a user claimed 5 keys, upon rollback, the next 5 OTKs will be bad due to key reuse, guaranteed.

> what if Bob isn't told that Alice has rotated her fallback key

For this to happen the two servers need to be partitioned for time N where N is the time between uploading a new fallback key and deleting the old fallback key on the client. N is configurable, and X3DH provides an example interval of "once a week, or once a month". Looking at the Android source it seems they try to cycle it every 2 days, with a max time of 14 days. This feels like a long enough time for most ephemeral network partitions to resolve themselves. Therefore, the likelihood of actually seeing this is much more remote.

Member

> N is configurable, and X3DH provides an example interval of "once a week, or once a month". Looking at the Android source it seems they try to cycle it every 2 days, with a max time of 14 days.

As discussed in the other thread, there is confusion here between "how often do we create a new key" and "how long do we keep old keys, having created a new one". The X3DH spec just says "eventually" for the latter:

> After uploading a new signed prekey, Bob may keep the private key corresponding to the previous signed prekey around for some period of time, to handle messages using it that have been delayed in transit. Eventually, Bob should delete this private key for forward secrecy.

I think the Signal Android app is using 30 days for this period (https://github.com/signalapp/Signal-Android/blob/940cee0f30d6a2873ae08c65bb821c34302ccf5d/app/src/main/java/org/thoughtcrime/securesms/crypto/PreKeyUtil.java#L210-L239).

Nevertheless, I agree with the principles in this discussion: If Alice and Bob's servers manage to overlap their downness for long enough, then yes Bob will be unable to message Alice. But that's an extreme case, and I don't think it should prevent us making this incremental improvement in reliability even if we still end up bolting on retries later on.

Member

Not quite sure what to do with this thread. Maybe I should add a section to the MSC to call out the issue?

Member

I suspect that the issue should be rare enough that it's not worth trying to solve it right now. So at most, add something in the MSC that says something about it.


The following key/values are added to the `DeviceKeys` object definition (bold is new):

| Name | Type | Description |
|------------------|-------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| algorithms | [string] | Required: The encryption algorithms supported by this device. |
| device_id | string | Required: The ID of the device these keys belong to. Must match the device ID used when logging in. |
| keys | {string: string} | Required: Public identity keys. The names of the properties should be in the format `<algorithm>:<device_id>`. The keys themselves should be encoded as specified by the key algorithm. |
| signatures | Signatures | Required: Signatures for the device key object. A map from user ID, to a map from `<algorithm>:<device_id>` to the signature. The signature is calculated using the process described at Signing JSON. |
| user_id | string | Required: The ID of the user the device belongs to. Must match the user ID used when logging in. |
| **fallback_key** | **{string: KeyObject}** | **The fallback key for this device, if set. The format of this object is identical to the /keys/claim response for a single device. This replaces any previously sent fallback key.** |

An example of the new field:
```js
{
  // ...
  "fallback_key": {
    "signed_curve25519:AAAAHg": {
      "key": "zKbLg+NrIjpnagy+pIY6uPL4ZwEG2v+8F9lmgsnlZzs",
      "signatures": {
        "@alice:example.com": {
          "ed25519:JLAFKJWSCS": "FLWxXqGbwrb8SM3Y795eB6OA8bwBcoMZFXBqnTn58AYWZSqiD45tlBVcDa2L7RwdKXebW/VzDlnfVJ+9jok1Bw"
        }
      }
    }
  }
}
```

As a reminder, clients SHOULD rotate their fallback key when they realise it has been used, with some lag time
to account for federation. As per MSC2732, 1 hour is recommended. When clients change their fallback key, a new
`m.device_list_update` EDU MUST be sent.

This proposal has no client-side changes.
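
The server-side behaviour proposed in this section could look roughly like the sketch below. It is an illustration under stated assumptions, not an implementation from any homeserver: `FallbackKeyCache`, `RemoteUnreachable`, `claim_key` and the `federation_claim` callable are all hypothetical names, and the sketch assumes the new `fallback_key` field sits inside the EDU's `keys` (`DeviceKeys`) object, as the table above suggests. How a server should treat an update that omits the field is not stated in this proposal, so the sketch simply leaves the cache untouched in that case.

```python
from typing import Callable, Dict, Optional, Tuple

Device = Tuple[str, str]  # (user_id, device_id)


class RemoteUnreachable(Exception):
    """Hypothetical error raised when the remote homeserver cannot be contacted."""


class FallbackKeyCache:
    """Fallback keys learned in advance from incoming m.device_list_update EDUs."""

    def __init__(self) -> None:
        self._keys: Dict[Device, dict] = {}

    def on_device_list_update(self, content: dict) -> None:
        device = (content["user_id"], content["device_id"])
        if content.get("deleted"):
            # The device no longer exists, so its cached key is useless.
            self._keys.pop(device, None)
            return
        fallback = (content.get("keys") or {}).get("fallback_key")
        if fallback is not None:
            # A newly sent fallback key replaces any previously sent one.
            self._keys[device] = fallback

    def get(self, device: Device) -> Optional[dict]:
        return self._keys.get(device)


def claim_key(
    device: Device,
    cache: FallbackKeyCache,
    federation_claim: Callable[[Device], dict],
) -> Optional[dict]:
    """Claim over federation first; use the cached fallback key only on network failure."""
    try:
        return federation_claim(device)
    except RemoteUnreachable:
        # May be None if no fallback key was ever received for this device.
        return cache.get(device)
```

The normal path is unchanged: the cache is consulted only when the federation request itself fails, so a reachable remote server still hands out OTKs (or its own fallback key on exhaustion) as before.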

## Comparisons with X3DH (Signal)

X3DH is very similar to Matrix's key agreement protocol. Due to this similarity, it is worth researching what X3DH
does with respect to OTKs.

> To perform an X3DH key agreement with Bob, Alice contacts the server and fetches a "prekey bundle" containing the following values:
>
> - Bob's identity key IKB
> - Bob's signed prekey SPKB
> - Bob's prekey signature Sig(IKB, Encode(SPKB))
> - (Optionally) Bob's one-time prekey OPKB

https://signal.org/docs/specifications/x3dh/#sending-the-initial-message


Signal uses the term "signed prekey" for what Matrix calls a "fallback key", and "one-time prekey" for an OTK. In
X3DH, one-time prekeys are optional. If they are exhausted, the protocol simply continues without one. If one is
present, an additional DH operation is performed.

This optionality makes the protocol robust to OTK exhaustion and transient network failures (e.g. failing to reach
the database used to claim OTKs, since Signal is not federated).
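
For reference, the optional step can be written in the notation of the X3DH specification linked above, where $IK_A$ and $EK_A$ are Alice's identity and ephemeral keys. This restates that specification and is not new behaviour defined by this MSC:

```latex
\begin{aligned}
DH_1 &= \mathrm{DH}(IK_A, SPK_B) \\
DH_2 &= \mathrm{DH}(EK_A, IK_B) \\
DH_3 &= \mathrm{DH}(EK_A, SPK_B) \\
SK   &= \mathrm{KDF}(DH_1 \,\|\, DH_2 \,\|\, DH_3)
      && \text{(no one-time prekey in the bundle)} \\
DH_4 &= \mathrm{DH}(EK_A, OPK_B) \\
SK   &= \mathrm{KDF}(DH_1 \,\|\, DH_2 \,\|\, DH_3 \,\|\, DH_4)
      && \text{(one-time prekey present)}
\end{aligned}
```

When $DH_4$ is absent, the shared secret depends only on reusable keys, which is the case this MSC compares to a Matrix session set up with a fallback key.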

## Security Considerations

Ultra secure clients may be unhappy that fallback keys are being returned and not one-time keys, because they
dislike the slightly weaker security properties fallback keys provide. This could be resolved by adding a flag to
the `/keys/claim` endpoint to state whether returning a fallback key is acceptable to the client or not. If this
flag is not set/missing, fallback keys would not be returned in place of OTKs, meaning this MSC would be entirely
opt-in, and hence require client-side changes. However, a malicious server can trivially ignore this flag and
return the fallback key anyway, and the client would not be able to detect this. For this reason, it feels like
security theater to add this flag.

A malicious actor who can control network conditions can force a client to use a fallback key by temporarily
preventing two homeservers from communicating. Previously, the only way a malicious actor could force a client to
use a fallback key would be to claim all the OTKs before the client had a chance to upload more. Therefore, this
MSC increases the ways attackers can force clients to use fallback keys. Fallback keys weaken forward secrecy. It
is assumed that "most" sessions will be set up using OTKs and not the fallback key. If this assumption holds,
forcing use of a fallback key does nothing to compromise those sessions. This means this attack is only useful for
_active attacks_, where an attacker wants to compromise _sessions that have yet to be established_, and wants to
force those sessions to be set up with the fallback key.

By sending the fallback key eagerly, an attacker would have access to the public key for a longer period of time than
before. Without this MSC, the fallback key remains on the uploader's homeserver until a federated user requests it.
At that point, the client is notified via `/sync` that the fallback key has been used and hence should be rotated.
With this MSC, the client would not be notified when the fallback key is used on the remote server, because this MSC
is robust to network partitions. Instead, the user will be notified when they receive a to-device event encrypted with
the fallback key. If having access to the public part of the fallback key
_for an extended period of time_ is useful for an attacker, then this MSC decreases security. The author is not aware
of any scenario where having access to the public key for a longer period of time is a security risk. If there is a
risk, other decentralised systems such as Bitcoin, Ethereum and libp2p which all rely on long-lived public keys as
addresses would also be vulnerable. Furthermore, the user's own homeserver has access to the fallback key today. If
access to the key for an extended time is a security risk, and the user does not trust their own homeserver (not
unreasonable given this is for E2EE) then any concerns _are already present today_, just not over federation.

## Alternatives

Do nothing. In this scenario, if the remote server is unreachable when the client calls `/keys/claim`, the message
will not be encrypted for that device, and the end user will be unable to decrypt the message. What's worse, this
will persist until the client decides to retry the `/keys/claim` endpoint, which could be seconds or much longer.
As a data point, Matrix Rust SDK currently uses [15 seconds](https://github.com/matrix-org/matrix-rust-sdk/issues/2804)
and this is seen as very low.