Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dendrite 0.6.2 fails to sync/federate #2150

Closed
matrixbot opened this issue Nov 1, 2024 · 40 comments
Closed

Dendrite 0.6.2 fails to sync/federate #2150

matrixbot opened this issue Nov 1, 2024 · 40 comments
Labels
T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. urgent

Comments

@matrixbot
Copy link
Collaborator

This issue was originally created by @Undef-a at matrix-org/dendrite#2150.

Background information

Dendrite version or git SHA: v0.6.2 (last known good was 0.6.0)
Monolith or Polylith?: monolith
SQLite3 or Postgres?: postgresql
Running in Docker?: no
go version: 1.16.13 and 1.18beta1
Client used (if applicable): app.element.io, Hydrogen and Fluffychat

Description

Dendrite fails to receive new events for any room and fails to sync existing events to some clients as of version 0.6.2. In the clients, is shown by either frozen rooms and disconnection messages (element) or never finishing the initial sync (Fluffychat and Hydrogen). Rolling back to 0.6.0 resolves the issue.

The following logs may be relevant:

level=debug msg="Transaction: Failed to query room version for room!" error="context canceled" req.id= req.method=PUT req.path=/_matrix/federation/v1/send/<event 1>
level=debug msg="Transaction: Failed to parse event JSON of event {\"auth_events\":[\"\",\"\",\"\"],\"content\":{\"algorithm\":\"m.megolm.v1.aes-sha2\",\"ciphertext\":\"\",\"device_id\":\"\",\"sender_key\":\"\",\"session_id\":\"\"},\"depth\":20,\"hashes\":{\"sha256\":\"h0hdF/eV9J+\"},\"origin\":\"matrix.org\",\"origin_server_ts\":,\"prev_events\":[\"$\"],\"prev_state\":[],\"room_id\":\"!:.\",\"sender\":\"@\",\"signatures\":{\"matrix.org\":{\"ed25519:a_RXGa\":\"h2jmBBg\"}},\"type\":\"m.room.encrypted\",\"unsigned\":{\"age_ts\":}}" error="gomatrixserverlib: unsupported room version ''" req.id=KzqABiJjMlbU req.method=PUT req.path=/_matrix/federation/v1/send/<event 1>
http: superfluous response.WriteHeader call from github.com/prometheus/client_golang/prometheus/promhttp.(*responseWriterDelegator).WriteHeader (delegator.go:65)

NOTE: user data from the logs has been stripped.

@matrixbot matrixbot added T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. urgent labels Nov 1, 2024
@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @neilalexander at matrix-org/dendrite#2150 (comment).

Can you please see if this behaviour is any better as of commit a2b4777?

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @grisu48 at matrix-org/dendrite#2150 (comment).

Hi, I'm seeing the same issue. I can send messages within the instance, but anything from outside doesn't come through.

I'm seeing a lot of the following errors in my log:

time="2022-02-09T07:51:35.290943700Z" level=error msg="failed to query device keys for some users" context=missing failed=1 server_name=libre.brussels total=1 wait=1h0m0s
time="2022-02-09T07:51:45.290814047Z" level=error msg="failed to query device keys for some users" context=missing failed=1 server_name=jloa.ovh total=1 wait=1h0m0s
time="2022-02-09T07:51:45.348646367Z" level=error msg="failed to query device keys for some users" context=missing failed=1 server_name=thebrown.men total=1 wait=1h0m0s
time="2022-02-09T07:51:45.382300790Z" level=error msg="failed to query device keys for some users" context=missing failed=2 server_name=chatserver.ca total=2 wait=1h0m0s
time="2022-02-09T07:51:55.292139584Z" level=error msg="failed to query device keys for some users" context=missing failed=1 server_name=bergercloud.eu total=1 wait=1h0m0s

@neilalexander I tried to build the container from commit a2b4777 however the issue persists and the above stated error messages are still flowing.

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @Undef-a at matrix-org/dendrite#2150 (comment).

with a2b4777 federation wasn't perfect, but was far better than 0.6.2 (where just about everything was locked up).

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @grisu48 at matrix-org/dendrite#2150 (comment).

I just updated to 0.6.3 and federation is still broken. When grepping for level=error in the docker container output, I still see a lot of "failed to query device keys for some users" entries, same as above. The instance was working just fine prior to updating to 0.6.0, now internal messaging works but federation is still broken.

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @stintel at matrix-org/dendrite#2150 (comment).

Broken for me as well in 0.6.3. Downgrading to 0.6.0 doesn't help. Downgrading further seems to require a database rollback, which is not an option as these backups have expired already.

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @neilalexander at matrix-org/dendrite#2150 (comment).

Can you please try the latest main commit 5106cc8 and report back on if it helps or not?

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @grisu48 at matrix-org/dendrite#2150 (comment).

I tried a new docker container with matrix-org/dendrite@5106cc8 and the issue is still present.

In addition now also the inter-instance communication is broken. Also: when logging in via my Android phone the app was performing a initial sync, despite me being logged in before.

When grepping for level=error I still see a lot of the "failed to query device keys for some user context=missing" errors.

I'm attaching here the output of docker logs dendrite_monolith_1 in case you find this useful: dendrite-logs.txt

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @Undef-a at matrix-org/dendrite#2150 (comment).

Using 5106cc8 helps a little. Now rooms all correctly sync up immediately after restarting dendrite. Over the next 10-30 minutes the problem rooms will start to drop events.

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @stintel at matrix-org/dendrite#2150 (comment).

Unfortunately 5106cc8 doesn't help at all here. I have not a single message after February 8 in any of the rooms I'm in. At that time I was running 0.6 since January 29. So for me the problem started with 0.6, but only ~10 days after upgrading to it.

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @grisu48 at matrix-org/dendrite#2150 (comment).

Short update: matrix-org/dendrite@5106cc8 seems to help. On Sunday after deploying my instance was still silent, however today I see that some of the noisier channels are finally filling again with messages from Sunday onwards. I still see a message hole between the date of updating to 0.6 and Sunday.
Also intra-instance communications works again.

I keep monitoring this issue, but I'm mildly optimistic that matrix-org/dendrite@5106cc8 might help to resolve the issue or at least help.

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @alistair23 at matrix-org/dendrite#2150 (comment).

I am having the same issue, lots of failed to query device keys for some users in the logs. Using matrix-org/dendrite@5a39512 is no help

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @neilalexander at matrix-org/dendrite#2150 (comment).

FWIW failed to query device keys for some users is specific to E2EE and ultimately a separate issue to failing to federate in public rooms.

Out of curiosity, are you all running the internal NATS deployment built into Dendrite or standalone NATS Server? If any of you are running a standalone NATS Server, which options are you running with?

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @alistair23 at matrix-org/dendrite#2150 (comment).

I deleted the jetstream dir (matrix-org/dendrite#2181) and now it appears to be working after waiting awhile. I also bumped my max connections up to try and avoid matrix-org/dendrite#2173

I'm using the NATS build into Dendrite

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @grisu48 at matrix-org/dendrite#2150 (comment).

I'm also using the NATS build directly into Dendrite. I am using the same config as provided in the project yaml example:

  # Configuration for NATS JetStream
  jetstream:
    # A list of NATS Server addresses to connect to. If none are specified, an
    # internal NATS server will be started automatically when running Dendrite
    # in monolith mode. It is required to specify the address of at least one
    # NATS Server node if running in polylith mode.
    #addresses:
    #  - jetstream:4222

    # Keep all NATS streams in memory, rather than persisting it to the storage
    # path below. This option is present primarily for integration testing and
    # should not be used on a real world Dendrite deployment.
    in_memory: false

    # Persistent directory to store JetStream streams in. This directory 
    # should be preserved across Dendrite restarts.
    storage_path: ./

    # The prefix to use for stream names for this homeserver - really only
    # useful if running more than one Dendrite on the same NATS deployment.
    topic_prefix: Dendrite

  # Configuration for Prometheus metric collection.

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @grisu48 at matrix-org/dendrite#2150 (comment).

@neilalexander Do you want me to file another issue for the failed to query device keys for some users? I did a quick search in the issues list but could not find anything matching for it.

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @stintel at matrix-org/dendrite#2150 (comment).

I deleted the jetstream dir (#2181) and now it appears to be working after waiting awhile.

Thanks, that seems to have helped. Also using built-in.

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @neilalexander at matrix-org/dendrite#2150 (comment).

OK, so to understand what's really going on, I could use a goroutine trace and a profile from Dendrites that are experiencing these issues.

To do this, you need to start Dendrite with the PPROFLISTEN=localhost:65432 environment variable set and then leave it running with that so that the profiler is accessible when the problem occurs. You should see a level=warning msg="Starting pprof on ..." line at startup if the profiler starts successfully.

Then the next time you run into problems, capture the following profiles:

curl -O http://localhost:65432/debug/pprof/goroutine?debug=1
curl -O http://localhost:65432/debug/pprof/goroutine?debug=2
curl -O http://localhost:65432/debug/pprof/profile

... and then upload all three files along with the commit ID that you are running — they don't contain configuration or anything sensitive (apart from possibly the folder names that Dendrite was built in) so should be safe to share. The two goroutine profiles should download pretty much instantly, the profile one will take 30 seconds to complete.

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @neilalexander at matrix-org/dendrite#2150 (comment).

@grisu48

Do you want me to file another issue for the failed to query device keys for some users? I did a quick search in the issues list but could not find anything matching for it.

A lot of those issues will be genuine connection errors or bad keys so I wouldn't worry about those log lines unless you are having problems with E2EE specifically — in that case best to open a separate issue.

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @imyxh at matrix-org/dendrite#2150 (comment).

I think I'm seeing this too. Deleting the jetstream folder and restarting dendrite does fix it but only temporarily and after a while my Element will stop connecting properly again. The log is surprisingly quiet for me though, other than the "context canceled" errors that occur after my client gives up, and the response.WriteHeader messages in #2123.

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @neilalexander at matrix-org/dendrite#2150 (comment).

@imyxh Please follow the instructions a couple posts up and if you can supply profiles from the next time it happens, that’d be amazing.

Deleting the entire JetStream folder is not ideal and doing so is a very good way for downstream components to get in an out-of-sync state with the roomserver, so I can’t recommend that as a fix. A much much safer approach if absolutely necessary is to delete just the jetstream/$G/streams/DendriteInputRoomEvent sub-folder only rather than the whole JetStream folder.

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @imyxh at matrix-org/dendrite#2150 (comment).

@neilalexander

Please follow the instructions a couple posts up and if you can supply profiles from the next time it happens, that’d be amazing.

Whoops, skipped over that. Here they are!

https://upload.disroot.org/r/9QJS70Hn#Kbx45aT6u79C2hAcB8D6ReproE37SPN1s6aZxQvD90U=
https://upload.disroot.org/r/43J5jdKk#2tZNirONZlnYzbl7lEpam5Y+w8r1KKk2jsRIv7ETkJg=
https://upload.disroot.org/r/YnkyrEMW#ZOQ0uLUXmdVlqG9TpkjdYaGa2Ye44DMkGPpyFUajGJU=

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @neilalexander at matrix-org/dendrite#2150 (comment).

@imyxh Thanks for these, the profiles are extremely useful. Can you please just confirm for me which commit ID of Dendrite you are running?

I’m seeing a pattern in the goroutine trace — there are a few roomserver workers that are all blocked on the select query in BulkSelectEventID, and then there are a few other roomserver workers waiting to obtain a database connection, likely blocked behind the others.

Can you please also get a few more details for me:

  • the result of the SHOW max_connections; query against your database
  • the value you have for max_open_conns in the room_server block of your Dendrite config file
  • the sum total value of all of the max_open_conns lines across your entire Dendrite config file in all sections
  • whether all of your components are sharing one PostgreSQL database, or if they are using their own separate databases

Thanks!

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @neilalexander at matrix-org/dendrite#2150 (comment).

@imyxh Actually, looking more closely, I suspect your specific issue may have been fixed already in #2178 — it’s just that it hasn’t made its way into a release yet. I can see this because your goroutine profiles claim to be stuck in QueryMissingAuthPrevEvents which is an API call that no longer exists after that PR.

One way to find out is to update to commit 5106cc8 or anything on the main branch more recent than that — latest commit would do — and then re-test and check the above details.

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @imyxh at matrix-org/dendrite#2150 (comment).

@neilalexander

  • I'm running release 0.6.3, so I guess that's commit 9ac27ca.
  • roomserver max_open_conns: 10
  • total max_open_conns: 10 + 10 + 10 + 5 + 5 + 10 + 10 + 10 + 10 = 80
  • all of the components are sharing one PostgreSQL database
  • from psql: max_connections = 100

I just tested from latest git main and indeed, there is no problem :P

Thanks for all your work!

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @neilalexander at matrix-org/dendrite#2150 (comment).

@imyxh Glad to hear that’s helped — if you run into any more problems, please capture and chuck up some new profiles and we can look again. :-)

You’ve also got headroom of 20 unused database connections so you could increase the roomserver’s max_open_conns by another 10 or 15 to make better use of those resources. That should help with processing data for more rooms in parallel if you aren’t already limited by CPU or RAM.

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @grisu48 at matrix-org/dendrite#2150 (comment).

Will there be a 0.6.4 release soon with the above mentioned PRs in it so we can switch from a custom commit back to the release channel? 🙂

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @neilalexander at matrix-org/dendrite#2150 (comment).

@grisu48 Yep, sometime this week.

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @alistair23 at matrix-org/dendrite#2150 (comment).

Using commit 002429c I am unable to sign in with new element sessions, I see some rooms but the data isn't in sync and on existing sessions Federated messages aren't working.

Output.zip

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @neilalexander at matrix-org/dendrite#2150 (comment).

@alistair23 What happens if you try to send outbound messages?

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @alistair23 at matrix-org/dendrite#2150 (comment).

If I send a message in an E2EE room on an existing session it seems to work, but the recipient can't decrypt it. On a new session it also seems to send, but it is sent unencrypted

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @neilalexander at matrix-org/dendrite#2150 (comment).

There have been a number of improvements in Dendrite 0.6.4 both for the original issue and for E2EE, anyone who is having outstanding issues please test on the latest version and let me know how you get on.

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @grisu48 at matrix-org/dendrite#2150 (comment).

Hi! I've updated now to dendrite 0.6.4 using docker-compose now and encountered the following issues:

  • I got signed out from all of my devices. Needed to use the backup keys to log in again
  • I can not receive or send any messages, even intra-instance communication is not working

Not sure if this is related to this issue or if this is a new one though. It might be related to matrix-org/dendrite#2222.

In the logs I don't find anything really pointing at something, This is when I send a message to a room on my instance:

time="2022-02-23T20:12:28.650696285Z" level=warning msg="Failed to retrieve any keys" context=missing fetcher=DirectKeyFetcher
time="2022-02-23T20:12:29.269011299Z" level=error msg="Failed to fetch key for server" context=missing error="Post \"matrix://mpserver.pro/_matrix/key/v2/query\": lookup failed for \"mpserver.pro\"" fetcher=DirectKeyFetcher
time="2022-02-23T20:12:29.269078681Z" level=warning msg="Failed to retrieve any keys" context=missing fetcher=DirectKeyFetcher
time="2022-02-23T20:12:38.409693783Z" level=info msg="Sent event to roomserver" event_id="$BjG2vabZL7XgR51abnnncg5azDoWtjqZ5KuzH6Z7TfY" req.id=n8re6Vf0KsYa req.method=PUT req.path="/_matrix/client/r0/rooms/!ayPKSV6hofj2uZxI:matrix.feldspaten.org/send/m.room.encrypted/m1645647158230.30" room_id="!ayPKSV6hofj2uZxI:matrix.feldspaten.org" room_version=6 user_id="@phoenix:matrix.feldspaten.org"

I see a lot of those "Failed to retrieve any keys" warnings, but I don't know if they are related or not.

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @stintel at matrix-org/dendrite#2150 (comment).

After updating to 002429c, I wasn't receiving any messages. Deleting just the jetstream/$G/streams/DendriteInputRoomEvent directory didn't help, deleting the complete jetstream directory did.

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @ElDifinitivo at matrix-org/dendrite#2150 (comment).

Continuing to experience this issue with docker 0.6.4

Hi! I've updated now to dendrite 0.6.4 using docker-compose now and encountered the following issues:

  • I got signed out from all of my devices. Needed to use the backup keys to log in again
  • I can not receive or send any messages, even intra-instance communication is not working

Got the former point, but the latter is iffy. Intra-instance is working for me, but federation is incredibly spotty; some messages may send but most do not. And the logs are quiet for me as well.

Deleting just the jetstream/$G/streams/DendriteInputRoomEvent directory didn't help, deleting the complete jetstream directory did.

I also deleted the jetstream directory and it removed the context cancelled errors I had been receiving, but it doesn't do anything to help the federation. I've also been getting big memory increases (go1.17.7) that I hadn't been before, from a fresh server boot startup at 30-40MB constantly increasing to 1GB+ (2-user instance) until the machine becomes unresponsive. Tried it with GODEBUG=madvdontneed=1 and it doesn't change anything. It takes about an hour, but this is always where I end up:

monolith_1  | [1] [INF] pipe - cid:47 - "v1.13.0:go" - Slow Consumer Detected: WriteDeadline of 10s exceeded with 1 chunks of 158 total bytes.
monolith_1  | [1] [WRN] pipe - cid:51 - "v1.13.0:go" - Readloop processing time: 23.583038563s
monolith_1  | [1] [WRN] pipe - cid:54 - "v1.13.0:go" - Readloop processing time: 2m12.323340906s
monolith_1  | [1] [WRN] pipe - cid:50 - "v1.13.0:go" - Readloop processing time: 5m20.187121708s
monolith_1  | [1] [WRN] pipe - cid:42 - "v1.13.0:go" - Readloop processing time: 2m7.013070596s
monolith_1  | [1] [WRN] pipe - cid:53 - "v1.13.0:go" - Readloop processing time: 3m2.370674975s
monolith_1  | time="2022-02-24T18:29:02.139386258Z" level=fatal msg="nats: timeout" subject=DendriteOutputRoomEvent

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @neilalexander at matrix-org/dendrite#2150 (comment).

@ElDifinitivo Built-in NATS or a standalone NATS Server?

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @ElDifinitivo at matrix-org/dendrite#2150 (comment).

@neilalexander Built-in monolith

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @alistair23 at matrix-org/dendrite#2150 (comment).

Any update? Now I'm seeing messages not going out

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @grisu48 at matrix-org/dendrite#2150 (comment).

I could finally solve this issue by deleting the old jetstream directory. Before I had still some ongoing issues with messages not going out and not being received. Now dendrite 0.6.5 is rocking again since Saturday 🙂

@alistair23 maybe that's also worth a shot for you? I just renamed the jetstream directory, and once everything worked, dumped it completely.

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @Undef-a at matrix-org/dendrite#2150 (comment).

I've also had a similar experience. Upgrading to 0.6.5 jetstream broke, unable to parse something in that directory. After removing the directory messages are nearly always showing up correctly for the past few days.

However, #2142 (which I had thought fixed, then a symptom of this issue) is back.

@matrixbot
Copy link
Collaborator Author

This comment was originally posted by @kegsay at matrix-org/dendrite#2150 (comment).

Please open a new issue if this is still a problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. urgent
Projects
None yet
Development

No branches or pull requests

1 participant