[Bug]: Table locks / slow queries on 0.19.6 betas #4983

Open
dessalines opened this issue Aug 20, 2024 · 15 comments

@dessalines
Member

Requirements

  • Is this a bug report? For questions or discussions use https://lemmy.ml/c/lemmy_support
  • Did you check to see if this issue already exists?
  • Is this only a single bug? Do not put multiple bugs in one issue.
  • Do you agree to follow the rules in our Code of Conduct?
  • Is this a backend issue? Use the lemmy-ui repo for UI / frontend issues.

Summary

Earlier tonight I tried to deploy 0.19.6-beta.6 to lemmy.ml, after having tested various versions of it on voyager.lemmy.ml for a few weeks. Post queries start stalling out pretty quickly, and it becomes unusable.

I didn't think we changed anything major with the post queries, so this could be trigger-related, or something to do with the site and community aggregates causing locks.

Also, the controversial-rank migration does take ~30 minutes and locks things up, but I suppose that's unavoidable, and not too big a deal since it's only run once.

I turned on pg_stat_statements and got this:

(screenshot: Screenshot_20240819_212437_Termux)
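
(For reference, the screenshot shows pg_stat_statements output; a query along these lines would produce similar data. This is only a sketch, assuming the extension is enabled and the PostgreSQL 13+ column names apply.)

-- top statements by total execution time since the stats were last reset
SELECT calls,
       round(total_exec_time) AS total_ms,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       left(query, 100) AS query_start
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;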

For now I restored lemmy.ml from the backup I took before.

cc @dullbananas @phiresky @Nutomic

Steps to Reproduce

NA

Technical Details

NA

Version

0.19.6

Lemmy Instance URL

voyager.lemmy.ml

@dessalines dessalines added the bug Something isn't working label Aug 20, 2024
@dessalines dessalines added this to the 0.19.6 milestone Aug 20, 2024
@phiresky
Collaborator

phiresky commented Aug 20, 2024

What's the time frame the screenshot data is from? Purely after the upgrade? How long did it run?

The second query, with 161s of total time and 15k calls, is the get_instance_followed_community_inboxes function. Depending on the time frame that's a lot of calls, and the time is much higher than it should be on average (most of the calls should be WHERE published > now() - interval '1 minute', which is indexed, returns an empty list, and should take <1 ms).

Could just be due to the general DB overload though.
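
One way to check whether that function really is slow per call, rather than just caught up in the general overload (again a sketch, assuming pg_stat_statements with PostgreSQL 13+ column names, and that the function name appears in the recorded statement text):

-- per-call statistics for the suspect function; mean_ms should be well under 1 ms
SELECT calls,
       round(total_exec_time) AS total_ms,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       rows
FROM pg_stat_statements
WHERE query ILIKE '%get_instance_followed_community_inboxes%'
ORDER BY total_exec_time DESC;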

@phiresky
Collaborator

phiresky commented Aug 20, 2024

One more thing: after 30+ minutes of downtime, the instance will get hammered by incoming federation requests trying to catch it up to the current state of the network. Since our incoming federation is not rate-limited, it is implicitly limited only by resource limits (CPU + DB), which might also appear as general load and cause everything to slow down.

So if you're looking at perf issues, make sure that you only start measuring once the federation state is up to date for all incoming instances.
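
While that catch-up is running, a quick way to tell whether the database is merely saturated or actually blocked on locks is to group the active backends by wait state (a sketch using the standard pg_stat_activity view, nothing Lemmy-specific):

-- how many backends are active, idle, or waiting, and on what
SELECT state, wait_event_type, wait_event, count(*) AS backends
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY 1, 2, 3
ORDER BY backends DESC;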

@dessalines
Member Author

> What's the time frame the screenshot data is from? Purely after the upgrade? How long did it run?

I turned this on ~20 minutes after the migrations finished and the server started up, after I saw things were going slow. I let it run for maybe 5-10 minutes.

> One more thing: after 30+ minutes of downtime, the instance will get hammered by incoming federation requests trying to catch it up to the current state of the network.

I checked the federation queue using your site after this happened, and it was up to date. IIRC I also tried turning off that separate dedicated federation Docker container, and it was still slow, so it's probably not federation-related.

This is going to be a tough one to solve, and we probably need to look at changes to the DB and the post query list function since 0.19.5. It's no rush though; we should probably wait until @Nutomic gets back next month anyway for the next release.

@Nutomic
Member

Nutomic commented Aug 21, 2024

> I checked the federation queue using your site after this happened, and it was up to date. IIRC I also tried turning off that separate dedicated federation Docker container, and it was still slow, so it's probably not federation-related.

That container only handles outgoing federation; the incoming federation which phiresky mentioned goes through the main container that handles HTTP requests.

@phiresky
Collaborator

phiresky commented Aug 21, 2024

Yeah, it's a bit harder to check the incoming federation state, which is the important part here. Outgoing federation will be idle after downtime.

A query like the following might work:

select *
from federation_queue_state
join instance on instance_id = instance.id
where not instance.dead and last_successful_id > 1e7
order by last_successful_published_time asc

The 1e7 filter is to ignore small instances, which are unlikely to overload the instance, and last_successful_published_time shows when the last processed activity was published. They should all be close to the current time.

@Nutomic
Member

Nutomic commented Sep 13, 2024

Is this issue still relevant? We've been running 0.19.6-beta1 on lemmy.ml for a while now, and I haven't noticed any problems. This is also the only issue remaining in the 0.19.6 milestone, so once it's closed we can publish the new release.

@dessalines
Member Author

Yes, beta1 doesn't have any of the DB changes, only that one specific federation commit. So we still have to investigate which commit is causing the slowness.

@Nutomic
Member

Nutomic commented Sep 24, 2024

The changes to post_view.rs are really trivial, so that can't be the problem:

git log --oneline --follow 0.19.5..0.19.6-beta.6 -- crates/db_views/src/post_view.rs
53a226b Add show_nsfw override filter to GetPosts. (#4889)
32cee9c Fixing not being able to create comments on local community posts. (#4854)
d09854a Adding a show_read override to GetPosts. (#4846)
6d8d231 Adding an image_details table to store image dimensions. (#4704)

And for triggers there's only a single change, which also looks very simple:

git log --oneline --follow 0.19.5..0.19.6-beta.6 -- crates/db_schema/replaceable_schema/triggers.sql
78702b5 Use trigger to generate apub URL in insert instead of update, and fix query planner options not being set when TLS is disabled (#4797)

And migrations:

git log --oneline --follow 0.19.5..0.19.6-beta.6 -- migrations/
33fd317 Adding a URL max length lemmy error. (#4960)
78702b5 Use trigger to generate apub URL in insert instead of update, and fix query planner options not being set when TLS is disabled (#4797)
fd58b4f Exponential controversy rank (#4872)
6d8d231 Adding an image_details table to store image dimensions. (#4704)

image_details is really the only major change between these versions; everything else is minor bug fixes or dependency upgrades.

@phiresky
Collaborator

> So we still have to investigate which commit is causing the slowness.

I wouldn't necessarily assume that a code change is the cause; it might just be Lemmy in general handling recovery from downtime poorly (as in, if incoming federation gets hammered catching up after downtime, it might cause compounding slowness everywhere).

To test, you could shut down the instance for 30 minutes (or however long it was down before) and just start it again on the same version; I would tentatively expect the same extra load.

One reason I'm saying this is that people have been complaining about Lemmy becoming "slower" after every upgrade for multiple releases, and often it seems to just be temporary for the first few hours after the upgrade.

@dessalines
Member Author

dessalines commented Oct 1, 2024

I'm willing to try it again, as long as @Nutomic and someone else are available to help me test. I don't think it's federation, because I tried turning off federation and it still wasn't usable.

But when I say the site was unusable, I mean that it was inaccessible to apps, and the web UI would only work intermittently.

78702b5 (the apub URL trigger changes) is the only one that sticks out to me as a place where something could've gone wrong.

@Nutomic
Member

Nutomic commented Oct 18, 2024

One problem with 0.19.6 is the migration from #3205. It takes a long time to recalculate all controversial scores. Once Lemmy starts again, Postgres runs autovacuum on the post_aggregates table (probably to regenerate the index). This is quite slow, as it also needs to handle API requests at the same time. So maybe we should run vacuum as part of the migration, so it can use the full server CPU. And it would be good if the migration could filter out some rows, e.g. posts with one or zero votes.
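
A rough sketch of that idea (hypothetical; it assumes the controversy rank lives in post_aggregates as in current schemas, and it elides the actual formula from the exponential controversy rank change):

-- skip rows whose rank cannot meaningfully change (posts with one or zero votes)
UPDATE post_aggregates
SET controversy_rank = controversy_rank  -- placeholder for the real exponential formula
WHERE upvotes + downvotes > 1;

-- vacuum/analyze up front while the server has nothing else to do, instead of
-- leaving it to autovacuum while the API is already serving requests; note that
-- VACUUM cannot run inside a transaction block, so this would have to happen
-- outside the normal transactional migration
VACUUM (ANALYZE) post_aggregates;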

cc @dullbananas

@Nutomic
Member

Nutomic commented Oct 18, 2024

But the main problem is that DB queries are still slowing down extremely, so the site becomes unusable within a few minutes of startup. The slow queries are all annotated as PostQuery::list, so the most likely cause is in that file. We already reverted #4797 and removed the join to image_details which was added in #4704; neither made any difference. Shutting down the federation container didn't help either, so the problem is definitely caused by the API. But the remaining changes in post_view.rs look really trivial, so I don't know what else may be causing problems.
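
To narrow that down further, the annotation can be used to pull only the PostQuery::list statements out of pg_stat_statements (a sketch, assuming the annotation survives into the recorded statement text and PostgreSQL 13+ column names):

-- the PostQuery::list variants, worst offenders first
SELECT calls,
       round(total_exec_time) AS total_ms,
       round(mean_exec_time) AS mean_ms,
       rows,
       left(query, 120) AS query_start
FROM pg_stat_statements
WHERE query LIKE '%PostQuery::list%'
ORDER BY mean_exec_time DESC
LIMIT 10;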

@dessalines
Member Author

dessalines commented Oct 18, 2024

@phiresky
Collaborator

I'm still not convinced it's related to any actual change rather than just a combination of the migration rewriting a table, destroying the in-memory page cache, and then the downtime causing the server to get hammered with federation requests.

Maybe just skip the controversial update? It's mostly eventually consistent anyway without the migration, no?
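
If the cold page cache really is a factor, one mitigation to try (a sketch, assuming the pg_prewarm extension can be installed on the server) is to warm the rewritten table back into shared buffers right after the migration, before opening the instance up again:

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
-- load the rewritten table's heap back into shared buffers;
-- each index would need its own pg_prewarm('<index_name>') call
SELECT pg_prewarm('post_aggregates');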

@Nutomic
Member

Nutomic commented Oct 18, 2024

@phiresky Incoming federation would result in create and update queries, but the stats show only select queries at the top. Besides, if lemmy.ml is down for half an hour, then I believe it would take at least another half hour before other instances start sending activities again. But what we saw was no server load on startup, quickly ramping up to 100% server load within a minute. We also would have seen similar problems during previous upgrades, but those were fine.
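
To back that observation up with numbers, the statement stats can be split into reads vs. writes (a sketch; the ILIKE test is only a rough approximation, and PostgreSQL 13+ column names are assumed):

SELECT CASE WHEN query ILIKE 'select%' THEN 'reads' ELSE 'writes/other' END AS kind,
       sum(calls) AS calls,
       round(sum(total_exec_time)) AS total_ms
FROM pg_stat_statements
GROUP BY 1
ORDER BY total_ms DESC;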

Anyway, if there are no better ideas, we can try to make a beta without any migrations so it's easy to revert. If that fails we need to bisect to find the problematic commit, or else apply the commits with migrations one by one to see which one causes problems.
