[Bug]: Table locks / slow queries on 0.19.6 betas #4983

Open
dessalines opened this issue Aug 20, 2024 · 15 comments

@dessalines
Member

Requirements

  • Is this a bug report? For questions or discussions use https://lemmy.ml/c/lemmy_support
  • Did you check to see if this issue already exists?
  • Is this only a single bug? Do not put multiple bugs in one issue.
  • Do you agree to follow the rules in our Code of Conduct?
  • Is this a backend issue? Use the lemmy-ui repo for UI / frontend issues.

Summary

Earlier tonight I tried to deploy 0.19.6-beta.6 to lemmy.ml, after having tested various versions of it on voyager.lemmy.ml for a few weeks. Post queries start stalling out pretty quickly, and it becomes unusable.

I didn't think we changed anything major with the post queries, so this could be trigger-related, or something to do with the site and community aggregates causing locks.

Also, the controversial-rank migration does take ~30 minutes and locks things up, but I suppose that's unavoidable, and not too big a deal since it's only run once.

I turned on pg_stat_statements and got this:

(screenshot: Screenshot_20240819_212437_Termux)
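
(For reference, the screenshot shows pg_stat_statements output; a query along these lines would produce similar data. This is only a sketch, assuming the extension is enabled and the PostgreSQL 13+ column names apply.)

-- top statements by total execution time since the stats were last reset
SELECT calls,
       round(total_exec_time) AS total_ms,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       left(query, 100) AS query_start
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;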

For now I restored lemmy.ml from the backup I took before.

cc @dullbananas @phiresky @Nutomic

Steps to Reproduce

NA

Technical Details

NA

Version

0.19.6

Lemmy Instance URL

voyager.lemmy.ml

@dessalines dessalines added the bug Something isn't working label Aug 20, 2024
@dessalines dessalines added this to the 0.19.6 milestone Aug 20, 2024
@phiresky
Collaborator

phiresky commented Aug 20, 2024

What's the time frame the screenshot data is from? Purely after the upgrade? How long did it run?

The second query, with 161s of total time and 15k calls, is the get_instance_followed_community_inboxes function. Depending on the time frame that's a lot of calls, and the time is much higher than it should be on average (most of the calls should be WHERE published > now() - interval '1 minute', which is indexed, returns an empty list, and should take <1 ms).

Could just be due to the general DB overload though.
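
One way to check whether that function really is slow per call, rather than just caught up in the general overload (again a sketch, assuming pg_stat_statements with PostgreSQL 13+ column names, and that the function name appears in the recorded statement text):

-- per-call statistics for the suspect function; mean_ms should be well under 1 ms
SELECT calls,
       round(total_exec_time) AS total_ms,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       rows
FROM pg_stat_statements
WHERE query ILIKE '%get_instance_followed_community_inboxes%'
ORDER BY total_exec_time DESC;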

@phiresky
Collaborator

phiresky commented Aug 20, 2024

One more thing: after 30+ minutes of downtime, the instance will get hammered by incoming federation requests trying to catch it up to the current state of the network. Since our incoming federation is not rate-limited, it is implicitly limited only by resource limits (CPU + DB), which might also appear as general load and cause everything to slow down.

So if you're looking at perf issues, make sure that you only start measuring once the federation state is up to date for all incoming instances.
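
While that catch-up is running, a quick way to tell whether the database is merely saturated or actually blocked on locks is to group the active backends by wait state (a sketch using the standard pg_stat_activity view, nothing Lemmy-specific):

-- how many backends are active, idle, or waiting, and on what
SELECT state, wait_event_type, wait_event, count(*) AS backends
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY 1, 2, 3
ORDER BY backends DESC;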

@dessalines
Member Author

> What's the time frame the screenshot data is from? Purely after the upgrade? How long did it run?

I turned this on ~20 minutes after the migrations finished and the server started up, after I saw things were going slow. I let it run for maybe 5-10 minutes.

> One more thing: after 30+ minutes of downtime, the instance will get hammered by incoming federation requests trying to catch it up to the current state of the network.

I checked the federation queue using your site after this happened, and it was up to date. IIRC I also tried turning off that separate dedicated federation Docker container, and it was still slow, so it's probably not federation-related.

This is going to be a tough one to solve, and we probably need to look at changes to the DB and the post query list function since 0.19.5. It's no rush though; we should probably wait until @Nutomic gets back next month anyway for the next release.

@Nutomic
Member

Nutomic commented Aug 21, 2024

> I checked the federation queue using your site after this happened, and it was up to date. IIRC I also tried turning off that separate dedicated federation Docker container, and it was still slow, so it's probably not federation-related.

That container only handles outgoing federation; the incoming federation which phiresky mentioned goes through the main container that handles HTTP requests.

@phiresky
Collaborator

phiresky commented Aug 21, 2024

Yeah, it's a bit harder to check the incoming federation state, which is the important part here. Outgoing federation will be idle after downtime.

A query like the following might work:

select *
from federation_queue_state
join instance on instance_id = instance.id
where not instance.dead and last_successful_id > 1e7
order by last_successful_published_time asc

The 1e7 filter is to ignore small instances, which are unlikely to overload the instance, and last_successful_published_time shows when the last processed activity was published. They should all be close to the current time.

@Nutomic
Member

Nutomic commented Sep 13, 2024

Is this issue still relevant? We've been running 0.19.6-beta1 on lemmy.ml for a while now, and I haven't noticed any problems. This is also the only issue remaining in the 0.19.6 milestone, so once it's closed we can publish the new release.

@dessalines
Member Author

Yes, beta1 doesn't have any of the DB changes, only that one specific federation commit. So we still have to investigate which commit is causing the slowness.

@Nutomic
Member

Nutomic commented Sep 24, 2024

The changes to post_view.rs are really trivial, so that can't be the problem:

git log --oneline --follow 0.19.5..0.19.6-beta.6 -- crates/db_views/src/post_view.rs
53a226b Add show_nsfw override filter to GetPosts. (#4889)
32cee9c Fixing not being able to create comments on local community posts. (#4854)
d09854a Adding a show_read override to GetPosts. (#4846)
6d8d231 Adding an image_details table to store image dimensions. (#4704)

And for triggers there's only a single change, which also looks very simple:

git log --oneline --follow 0.19.5..0.19.6-beta.6 -- crates/db_schema/replaceable_schema/triggers.sql
78702b5 Use trigger to generate apub URL in insert instead of update, and fix query planner options not being set when TLS is disabled (#4797)

And migrations:

git log --oneline --follow 0.19.5..0.19.6-beta.6 -- migrations/
33fd317 Adding a URL max length lemmy error. (#4960)
78702b5 Use trigger to generate apub URL in insert instead of update, and fix query planner options not being set when TLS is disabled (#4797)
fd58b4f Exponential controversy rank (#4872)
6d8d231 Adding an image_details table to store image dimensions. (#4704)

image_details is really the only major change between these versions; everything else is minor bug fixes or dependency upgrades.

@phiresky
Collaborator

> So we still have to investigate which commit is causing the slowness.

I wouldn't necessarily assume that a code change is the cause; it might just be Lemmy in general handling recovery from downtime poorly (as in, if incoming federation gets hammered catching up after downtime, it might cause compounding slowness everywhere).

To test, you could shut down the instance for 30 minutes (or however long it was down before) and just start it again on the same version; I would tentatively expect the same extra load.

One reason I'm saying this is that people have been complaining about Lemmy becoming "slower" after every upgrade for multiple releases, and often it seems to just be temporary for the first few hours after the upgrade.

@dessalines
Member Author

dessalines commented Oct 1, 2024

I'm willing to try it again, as long as @Nutomic and someone else are available to help me test. I don't think it's federation, because I tried turning off federation and it still wasn't usable.

But when I say the site was unusable, I mean that it was inaccessible to apps, and the web UI would only work intermittently.

78702b5 (the apub URL trigger changes) is the only one that sticks out to me as a place where something could've gone wrong.

@Nutomic
Member

Nutomic commented Oct 18, 2024

One problem with 0.19.6 is the migration from #3205. It takes a long time to recalculate all controversial scores. Once Lemmy starts again, Postgres runs autovacuum on the post_aggregates table (probably to regenerate the index). This is quite slow, as it also needs to handle API requests at the same time. So maybe we should run vacuum as part of the migration, so it can use the full server CPU. And it would be good if the migration could filter out some rows, e.g. posts with one or zero votes.
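
A rough sketch of that idea (hypothetical; it assumes the controversy rank lives in post_aggregates as in current schemas, and it elides the actual formula from the exponential controversy rank change):

-- skip rows whose rank cannot meaningfully change (posts with one or zero votes)
UPDATE post_aggregates
SET controversy_rank = controversy_rank  -- placeholder for the real exponential formula
WHERE upvotes + downvotes > 1;

-- vacuum/analyze up front while the server has nothing else to do, instead of
-- leaving it to autovacuum while the API is already serving requests; note that
-- VACUUM cannot run inside a transaction block, so this would have to happen
-- outside the normal transactional migration
VACUUM (ANALYZE) post_aggregates;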

cc @dullbananas

@Nutomic
Member

Nutomic commented Oct 18, 2024

But the main problem is that DB queries are still slowing down extremely, so the site becomes unusable within a few minutes of startup. The slow queries are all annotated as PostQuery::list, so the most likely cause is in that file. We already reverted #4797 and removed the join to image_details which was added in #4704; neither made any difference. Shutting down the federation container didn't help either, so the problem is definitely caused by the API. But the remaining changes in post_view.rs look really trivial, so I don't know what else may be causing problems.
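
To narrow that down further, the annotation can be used to pull only the PostQuery::list statements out of pg_stat_statements (a sketch, assuming the annotation survives into the recorded statement text and PostgreSQL 13+ column names):

-- the PostQuery::list variants, worst offenders first
SELECT calls,
       round(total_exec_time) AS total_ms,
       round(mean_exec_time) AS mean_ms,
       rows,
       left(query, 120) AS query_start
FROM pg_stat_statements
WHERE query LIKE '%PostQuery::list%'
ORDER BY mean_exec_time DESC
LIMIT 10;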

@dessalines
Member Author

dessalines commented Oct 18, 2024

@phiresky
Collaborator

I'm still not convinced it's related to any actual change rather than just a combination of the migration rewriting a table, destroying the in-memory page cache, and then the downtime causing the server to get hammered with federation requests.

Maybe just skip the controversial update? It's mostly eventually consistent anyway without the migration, no?
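
If the cold page cache really is a factor, one mitigation to try (a sketch, assuming the pg_prewarm extension can be installed on the server) is to warm the rewritten table back into shared buffers right after the migration, before opening the instance up again:

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
-- load the rewritten table's heap back into shared buffers;
-- each index would need its own pg_prewarm('<index_name>') call
SELECT pg_prewarm('post_aggregates');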

@Nutomic
Member

Nutomic commented Oct 18, 2024

@phiresky Incoming federation would result in create and update queries, but the stats show only select queries at the top. Besides, if lemmy.ml is down for half an hour, then I believe it would take at least another half hour before other instances start sending activities again. But what we saw was no server load on startup, quickly ramping up to 100% server load within a minute. We also would have seen similar problems during previous upgrades, but those were fine.
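
To back that observation up with numbers, the statement stats can be split into reads vs. writes (a sketch; the ILIKE test is only a rough approximation, and PostgreSQL 13+ column names are assumed):

SELECT CASE WHEN query ILIKE 'select%' THEN 'reads' ELSE 'writes/other' END AS kind,
       sum(calls) AS calls,
       round(sum(total_exec_time)) AS total_ms
FROM pg_stat_statements
GROUP BY 1
ORDER BY total_ms DESC;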

Anyway, if there are no better ideas, we can try to make a beta without any migrations so it's easy to revert. If that fails we need to bisect to find the problematic commit, or else apply the commits with migrations one by one to see which one causes problems.
