-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Stop getting missing prev_events
after we already know their signature is invalid
#13816
Conversation
…e is invalid Related to - #13622 - #13635 - #13676 Follow-up to #13815 which tracks event signature failures. This PR aims to stop us from trying to `_get_state_ids_after_missing_prev_event` after we already know that the prev_event will fail from a previous attempt To explain an exact scenario around `/messages` -> backfill, we call `/backfill` and first check the signatures of the 100 events. We see bad signature for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` and `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` (both member events). Then we process the 98 events remaining that have valid signatures but one of the events references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event`. So we have to do the whole `_get_state_ids_after_missing_prev_event` rigmarole which pulls in those same events which fail again because the signatures are still invalid. - `backfill` - `outgoing-federation-request` `/backfill` - `_check_sigs_and_hash_and_fetch` - `_check_sigs_and_hash_and_fetch_one` for each event received over backfill - ❗ `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)` - ❗ `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)` - `_process_pulled_events` - `_process_pulled_event` for each validated event - ❗ Event `$Q0iMdqtz3IJYfZQU2Xk2WjB5NDF8Gg8cFSYYyKQgKJ0` references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event` which is missing so we try to get it - `_get_state_ids_after_missing_prev_event` - `outgoing-federation-request` `/state_ids` - ❗ `get_pdu` for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` which fails the signature check again - ❗ `get_pdu` for `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` which fails the signature check With this PR, we no longer call the burdensome `_get_state_ids_after_missing_prev_event` because the signature failure will count as an attempt before we try to run this.
See #13815 (comment) So wes can avoid things like `CancelledError` which is a valid error but probably should not count as an error
…nto madlittlemods/13622-13700-stop-getting-state-ids-after-missing-prev-event-if-we-tried-to-get-this-event-recently Conflicts: synapse/storage/databases/main/event_federation.py
…nto madlittlemods/13622-13700-stop-getting-state-ids-after-missing-prev-event-if-we-tried-to-get-this-event-recently Conflicts: synapse/storage/databases/main/event_federation.py
Because we're doing the recording in `_check_sigs_and_hash_for_pulled_events_and_fetch` (previously named `_check_sigs_and_hash_and_fetch`), this means we will track signature failures for `backfill`, `get_room_state`, `get_event_auth`, and `get_missing_events` (all pulled event scenarios). And we also record signature failures from `get_pdu`. Part of #13700 Part of #13676 and #13356 This PR will be especially important for #13816 so we can avoid the costly `_get_state_ids_after_missing_prev_event` down the line when `/messages` calls backfill.
…ate-ids-after-missing-prev-event-if-we-tried-to-get-this-event-recently
/** | ||
* Exponential back-off (up to the upper bound) so we don't try to | ||
* pull the same event over and over. ex. 2hr, 4hr, 8hr, 16hr, etc. | ||
* | ||
* We use `1 << n` as a power of 2 equivalent for compatibility | ||
* with older SQLites. The left shift equivalent only works with | ||
* powers of 2 because left shift is a binary operation (base-2). | ||
* Otherwise, we would use `power(2, n)` or the power operator, `2^n`. | ||
*/ | ||
AND ? /* current_time */ < last_attempt_ts + ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It occurs to me that these inline SQL comments get sent over the wire to postgres, which is probably not totally ideal. Might be worth moving the big ones out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, TBH I'm wondering whether for this it would be clearer to just fetch event_id, last_attempt_ts, num_attempts
and do the checks in python?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, TBH I'm wondering whether for this it would be clearer to just fetch
event_id, last_attempt_ts, num_attempts
and do the checks in python?
I think this makes a lot of sense here since we have to get a response out for each event_ids
anyway 👍
It occurs to me that these inline SQL comments get sent over the wire to postgres, which is probably not totally ideal. Might be worth moving the big ones out?
In terms of other places to worry about this: we already strip out the newlines which is why I use /* */
instead of inline --
. Is there a way we can strip the comments out so we don't lose the nice readability? The performance of find/replacing this every time doesn't sound great but our _make_sql_one_line
doesn't look particularly optimized by eye either 🤔. I'm not sure why we started doing that (introduced in #1939), maybe to get rid of all of the extra indentation spaces in multi-line strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was thinking about that too, though we'd want to make sure that we do it safely (I'm not sure if just stripping /* ... */
is safe by itself).
Note that we don't actually send the one-line SQL to the database, we use it for logging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created #14192 to continue this comments in SQL conversation ⏩
Note that we don't actually send the one-line SQL to the database, we use it for logging.
huh, I swore this was a problem before. Good to know!
synapse/api/errors.py
Outdated
@@ -274,6 +274,31 @@ def __init__(self, destination: Optional[str]): | |||
) | |||
|
|||
|
|||
class FederationPullAttemptBackoffError(SynapseError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it makes sense for this to inherit from SynapseError
. The risk is that this manages to bubble all the way to the client API, returning a 403, which seems entirely wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense 👍. The docstrings for SynapseError
and FederationError
could use some improving (would do in separate PR so we don't hang up on that).
We probably want to also update FederationDeniedError
in a separate PR as well?
Line 251 in e6e876b
class FederationDeniedError(SynapseError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…ate-ids-after-missing-prev-event-if-we-tried-to-get-this-event-recently
Also can pass multiple event_ids, see #13816 (comment)
…andle FederationPullAttemptBackoffError See #13816 (comment)
…ults for all given events anyway See #13816 (comment)
# Exponential back-off (up to the upper bound) so we don't try to | ||
# pull the same event over and over. ex. 2hr, 4hr, 8hr, 16hr, etc. | ||
if current_time | ||
< event_failed_pull_attempt["last_attempt_ts"] | ||
+ ( | ||
2 | ||
** min( | ||
event_failed_pull_attempt["num_attempts"], | ||
BACKFILL_EVENT_EXPONENTIAL_BACKOFF_MAXIMUM_DOUBLING_STEPS, | ||
) | ||
) | ||
* BACKFILL_EVENT_EXPONENTIAL_BACKOFF_STEP_MILLISECONDS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^ Sanity check appreciated here to make sure we're doing the equivalent of
synapse/synapse/storage/databases/main/event_federation.py
Lines 842 to 857 in b6baa46
/** | |
* Exponential back-off (up to the upper bound) so we don't retry the | |
* same backfill point over and over. ex. 2hr, 4hr, 8hr, 16hr, etc. | |
* | |
* We use `1 << n` as a power of 2 equivalent for compatibility | |
* with older SQLites. The left shift equivalent only works with | |
* powers of 2 because left shift is a binary operation (base-2). | |
* Otherwise, we would use `power(2, n)` or the power operator, `2^n`. | |
*/ | |
AND ( | |
failed_backfill_attempt_info.event_id IS NULL | |
OR ? /* current_time */ >= failed_backfill_attempt_info.last_attempt_ts + ( | |
(1 << {least_function}(failed_backfill_attempt_info.num_attempts, ? /* max doubling steps */)) | |
* ? /* step */ | |
) | |
) |
prev_events
after we already know their signature is invalid
Co-authored-by: Erik Johnston <[email protected]>
Thanks for the review @erikjohnston 🐫 |
Explain `SynapseError` and `FederationError` better Spawning from #13816 (comment)
Upstream changes: Synapse 1.70.1 (2022-10-28) =========================== (bugfixes) Synapse 1.70.0 (2022-10-26) =========================== Features -------- - Support for [MSC3856](matrix-org/matrix-spec-proposals#3856): threads list API. ([\#13394](matrix-org/synapse#13394), [\#14171](matrix-org/synapse#14171), [\#14175](matrix-org/synapse#14175)) - Support for thread-specific notifications & receipts ([MSC3771](matrix-org/matrix-spec-proposals#3771) and [MSC3773](matrix-org/matrix-spec-proposals#3773)). ([\#13776](matrix-org/synapse#13776), [\#13824](matrix-org/synapse#13824), [\#13877](matrix-org/synapse#13877), [\#13878](matrix-org/synapse#13878), [\#14050](matrix-org/synapse#14050), [\#14140](matrix-org/synapse#14140), [\#14159](matrix-org/synapse#14159), [\#14163](matrix-org/synapse#14163), [\#14174](matrix-org/synapse#14174), [\#14222](matrix-org/synapse#14222)) - Stop fetching missing `prev_events` after we already know their signature is invalid. ([\#13816](matrix-org/synapse#13816)) - Send application service access tokens as a header (and query parameter). Implements [MSC2832](matrix-org/matrix-spec-proposals#2832). ([\#13996](matrix-org/synapse#13996)) - Ignore server ACL changes when generating pushes. Implements [MSC3786](matrix-org/matrix-spec-proposals#3786). ([\#13997](matrix-org/synapse#13997)) - Experimental support for redirecting to an implementation of a [MSC3886](matrix-org/matrix-spec-proposals#3886) HTTP rendezvous service. ([\#14018](matrix-org/synapse#14018)) - The `/relations` endpoint can now be used on workers. ([\#14028](matrix-org/synapse#14028)) - Advertise support for Matrix 1.3 and 1.4 on `/_matrix/client/versions`. ([\#14032](matrix-org/synapse#14032), [\#14184](matrix-org/synapse#14184)) - Improve validation of request bodies for the [Device Management](https://spec.matrix.org/v1.4/client-server-api/#device-management) and [MSC2697 Device Dehyrdation](matrix-org/matrix-spec-proposals#2697) client-server API endpoints. ([\#14054](matrix-org/synapse#14054)) - Experimental support for [MSC3874](matrix-org/matrix-spec-proposals#3874): Filtering threads from the `/messages` endpoint. ([\#14148](matrix-org/synapse#14148)) - Improve the validation of the following PUT endpoints: [`/directory/room/{roomAlias}`](https://spec.matrix.org/v1.4/client-server-api/#put_matrixclientv3directoryroomroomalias), [`/directory/list/room/{roomId}`](https://spec.matrix.org/v1.4/client-server-api/#put_matrixclientv3directorylistroomroomid) and [`/directory/list/appservice/{networkId}/{roomId}`](https://spec.matrix.org/v1.4/application-service-api/#put_matrixclientv3directorylistappservicenetworkidroomid). ([\#14179](matrix-org/synapse#14179)) Deprecations and Removals ------------------------- - Remove the experimental implementation of [MSC3772](matrix-org/matrix-spec-proposals#3772). ([\#14094](matrix-org/synapse#14094)) - Remove the unstable identifier for [MSC3715](matrix-org/matrix-spec-proposals#3715). ([\#14106](matrix-org/synapse#14106), [\#14146](matrix-org/synapse#14146))
Stop getting missing
prev_events
after we already know their signature is invalid. While #13635 stops us from doing the slow thing after we've already done it once, this PR stops us from doing one of the slow things in the first place.Related to
Part of #13356
Follow-up to #13815 which tracks event signature failures.
With this PR, we avoid the call to the costly
_get_state_ids_after_missing_prev_event
because the signature failure will count as an attempt before and we filter events based on the backoff before calling_get_state_ids_after_missing_prev_event
now.For example, this will save us 156s out of the 185s total that this
matrix.org
/messages
request. If you want to see the full Jaeger trace of this, you can drag and drop thistrace.json
into your own Jaeger, https://gist.github.com/MadLittleMods/4b12d0d0afe88c2f65ffcc907306b761To explain this exact scenario around
/messages
-> backfill, we call/backfill
and first check the signatures of the 100 events. We see bad signature for$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M
and$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw
(both member events). Then we process the 98 events remaining that have valid signatures but one of the events references$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M
as aprev_event
. So we have to do the whole_get_state_ids_after_missing_prev_event
rigmarole which pulls in those same events which fail again because the signatures are still invalid.backfill
outgoing-federation-request
/backfill
_check_sigs_and_hash_and_fetch
_check_sigs_and_hash_and_fetch_one
for each event received over backfill$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M
fails withSignature on retrieved event was invalid.
:unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)
$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw
fails withSignature on retrieved event was invalid.
:unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)
_process_pulled_events
_process_pulled_event
for each validated event$Q0iMdqtz3IJYfZQU2Xk2WjB5NDF8Gg8cFSYYyKQgKJ0
references$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M
as aprev_event
which is missing so we try to get it_get_state_ids_after_missing_prev_event
outgoing-federation-request
/state_ids
get_pdu
for$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M
which fails the signature check againget_pdu
for$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw
which fails the signature checkDev notes
Sytest:
source /Users/eric/Documents/github/element/synapse/.venv/bin/activate ./run-tests.pl --synapse-directory ../synapse tests/48admin.pl
Dump all events in the room with useful debug info
Adapted from matrix-org/matrix-react-sdk#8354 for
txn
(originally from #12370)asdf_get_debug_events_in_room_ordered_by_depth_txn
Todo
Pull Request Checklist
EventStore
toEventWorkerStore
.".code blocks
.Pull request includes a sign off(run the linters)