This repository has been archived by the owner on Nov 25, 2024. It is now read-only.

stuck in a #dendrite:matrix.org related loop, heavy resource usage #2181

Closed
bones-was-here opened this issue Feb 11, 2022 · 15 comments

Comments

@bones-was-here

Background information

  • Dendrite version or git SHA: 0.6.3 and 5106cc8
  • Monolith or Polylith?: monolith
  • SQLite3 or Postgres?: postgres
  • Running in Docker?: no
  • go version: 1.17.6
  • Client used (if applicable): element, curl

Description

  • What is the problem: dendrite is stuck in a loop printing the same errors
  • Who is affected:
  • How is this bug manifesting: steadily increasing ram and jetstream storage size
  • When did this first appear: some time after updating to 0.6.3

Steps to reproduce

level=error msg="syncapi: failed to QuerySharedUsers for key change event from key server" error="found 2577 users but only have state key nids for 2350 of them"
level=error msg="syncapi: failed to QuerySharedUsers for key change event from key server" error="found 2448 users but only have state key nids for 2221 of them"
  • it has something to do with #dendrite:matrix.org because that's the only room with that many users that this server has participated in
  • these print at about 8 lines per second continuously
  • restarting and deleting ~/jetstream have no effect
  • ~/jetstream is growing steadily
  • joining and leaving #dendrite:matrix.org has no effect
  • forgetting the room has no effect (see "Rooms are not completely forgotten", #2176)
@bones-was-here
Author

Eventually it panics, restarts, and continues the loop

@bones-was-here
Author

It eventually stopped after many hours and several GB of jetstream dir. Now I can restart it and delete jetstream and it continues behaving. Have not attempted to rejoin #dendrite:matrix.org :)

@neilalexander
Contributor

Can you please run this query against your roomserver database?

SELECT COUNT(target_nid) FROM roomserver_membership AS m WHERE NOT EXISTS (
	SELECT event_state_key_nid FROM roomserver_event_state_keys AS s
	WHERE m.target_nid = s.event_state_key_nid
);

@bones-was-here
Author

 count 
-------
   230
(1 row)

@bones-was-here
Author

This dendrite is still running 5106cc8 and is not currently looping those errors at time of query.

@bones-was-here
Author

Currently it's printing
level=error msg="syncapi: failed to QuerySharedUsers for key change event from key server" error="found 2623 users but only have state key nids for 2395 of them"
about once per second and continuing after restart.
Slower than the last time, and every error is the same whereas previously it alternated between two slightly different ones. This is 002429c built with golang 1.17.7.

@neilalexander
Contributor

Please let me know if things are any better or worse in Dendrite 0.6.4.

@bones-was-here
Author

Should I delete this? 4.2G jetstream/$G/streams/DendriteOutputKeyChangeEvent

@neilalexander
Contributor

Should be OK to.

@alistair23

Seems OK at first, but I am having issues with new sessions, so I opened #2222.

@bones-was-here
Author

bones-was-here commented Feb 23, 2022

Much better on CPU so far, even when it's obviously working on something.

It still has "slow" federation for #dendrite.xonotic.org with occasional groups of messages from weeks or months ago appearing.

It tends to log level=warning msg="SelectJoinedUsersSetForRooms found 2632 users but BulkSelectEventStateKey only returned state key NIDs for 2404 of them", sometimes 1-2 times per second for ages, while my Element client is open (the client constantly spams the server, as mentioned in #2184; the same version of Element logged into matrix.org doesn't spam).

@bones-was-here
Author

I spoke too soon, the CPU burn started again, not sure why, it seems inconsistent.

Updating to master to get 4c07374 seems to have given a significant reduction in peak memory and average CPU load during the burn, but has not fixed the problems with this room. It's still a significant load considering how little work the homeserver should be doing.

Possibly it gets worse when people talk in #dendrite:matrix.org, and there are still specific users whose messages never appear.

level=warning msg="SelectJoinedUsersSetForRooms found 2648 users but BulkSelectEventStateKey only returned state key NIDs for 2421 of them" continues to print at about two per second while the CPU burn is happening.

@bones-was-here
Author

@zeeZ

zeeZ commented Feb 28, 2022

I have the same on 0.6.4. Updated two days ago and that's all it's been doing since then

@bones-was-here
Author

The heavy CPU load was caused by clients spamming and was fixed by #2233.
#2234 and #2237 fixed the state key NIDs warnings.


4 participants