Combine action tables #4459

Merged
merged 269 commits into from
Nov 11, 2024

dullbananas
Collaborator

@dullbananas dullbananas commented Feb 16, 2024

This should cause a huge improvement in query plans, especially for queries that previously reached the from/join collapse limits. For example, getting saved posts might now start with an index scan of the post_actions table, which avoids scanning posts that the user didn't do anything with (or all non-saved posts if I add partial indexes, but I don't know if I should do that).

This will also make the code much cleaner and reduce the size of the database. (Edit: it may or may not reduce size)

Indexes for the new action tables will use INCLUDE WHERE with IS NULL for each action column to keep index-only scans possible.

In the new joins, person_id will not use a bind parameter if it's None, so there can still be separate generic query plans for users that are not logged in.
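As a rough illustration of the bind-parameter point above, here is a minimal sketch in plain Rust. It is not the Diesel code from this PR: the function name `post_actions_join`, the alias `pa`, and the use of a literal `FALSE` for the logged-out case are all assumptions made for the example. The idea it shows is only that the logged-out query text contains no placeholder, so it can get its own generic plan.

```rust
/// Hypothetical helper (not part of Lemmy): builds the join clause for
/// `post_actions` and the list of values to bind.
///
/// Assumption: when nobody is logged in, the join condition is written as a
/// literal `FALSE`, so the SQL text has no `$1` placeholder at all.
fn post_actions_join(person_id: Option<i32>) -> (String, Vec<i32>) {
    match person_id {
        // Logged in: one placeholder, one bound value.
        Some(id) => (
            "LEFT JOIN post_actions pa ON pa.post_id = post.id AND pa.person_id = $1"
                .to_string(),
            vec![id],
        ),
        // Logged out: a different SQL text with nothing to bind.
        None => (
            "LEFT JOIN post_actions pa ON pa.post_id = post.id AND FALSE".to_string(),
            Vec::new(),
        ),
    }
}
```

Calling `post_actions_join(Some(7))` yields a clause containing `$1` plus one bound value, while `post_actions_join(None)` yields a clause with no placeholder and an empty value list.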

@dessalines
Member

Before you go forward and spend too much time on this, it needs a lot of discussion, because we could lose a lot of data integrity solely for the sake of post_view query speed. An update to a person_action table, when that action could be many different columns is a lot more confusing than single-action tables with solid constraints.

There are a lot of inside-postgres things we could do before getting rid of the post_like or comment_like table (unfortunately most of them would be some form of caching / non-source data store tho).

@dullbananas
Collaborator Author

dullbananas commented Feb 16, 2024

@dessalines Would that problem be fixed by using a composite type for each action that stores multiple values?

Edit: or multi-column constraints, like (a IS NULL) = (b IS NULL)
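To make the multi-column constraint idea concrete, here is a small sketch that models it in Rust rather than SQL. The struct `LikeColumns` and its field names are hypothetical and not taken from the actual schema; the method simply mirrors a check of the form (a IS NULL) = (b IS NULL) for two columns that belong to the same action.

```rust
/// Hypothetical pair of nullable columns that belong to one "like" action.
/// The field names are assumptions for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
struct LikeColumns {
    liked_at: Option<i64>,   // when the like happened
    like_score: Option<i16>, // the score that was given
}

impl LikeColumns {
    /// Mirrors a constraint like `(liked_at IS NULL) = (like_score IS NULL)`:
    /// both columns must be set together or unset together.
    fn satisfies_constraint(&self) -> bool {
        self.liked_at.is_none() == self.like_score.is_none()
    }
}
```

A row with both fields set or both unset passes; a row with only one of them set is the half-written state the constraint is meant to reject.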

@dessalines
Member

I'm not sure I like that option either, at least for source data.

The only thing I can think of rn, that would also help with the linked issue below, is to do what you're doing with the post_action_table (with many optional columns), but have it act as a cache / secondary store, being filled by triggers on inserts / updates to source tables like post_like. I don't like this too much, since these secondary stores are nearly always imperfect and tend to get out of sync, and solving problems with them can be a nightmare.

We desperately need some SQL experts that could help us with this one, as well as #2444 which is a similar problem.

@Nutomic
Member

Nutomic commented Feb 19, 2024

I don't think this implementation would create any problems with data integrity, as you have mandatory columns for person_id, post_id etc. and then optional columns for each action. In effect it's the same integrity we have with existing table definitions. There is a risk of reading or writing the wrong column, but that seems unlikely as we can keep using existing wrapper methods such as PostLike::like.

On the other hand storing the data in another table and using triggers will definitely give us consistency bugs, as happened with comment counts. So I would say go ahead with this approach.

@dessalines
Member

I've posted this to ![email protected] to see if any SQL experts can chime in on a correct way of doing this.

https://programming.dev/post/10280707

@dullbananas
Collaborator Author

I changed the implementation of the existing post functions to use the post_actions table.

The only remotely scary thing is automatically deleting rows after all actions are unset. I will do that with a trigger that runs DELETE. It shouldn't have concurrency problems because the condition after WHERE is re-checked if needed after locking the row. Also, forgetting to update the trigger after adding columns will be guaranteed to raise an error because tuple comparison with the whole row will be used (e.g. (foo.*) = (foo.a, foo.b, NULL, NULL)).
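The behaviour described above (unset one action, then delete the row if nothing is left) can be modelled in a few lines of Rust. This is only an in-memory sketch under stated assumptions, not the trigger or the uplete code from this PR: `PostActions`, its three fields, and `unset_saved` are hypothetical names. The exhaustive destructuring in `is_empty` plays a role similar to the whole-row tuple comparison: adding a field without updating the check fails to compile.

```rust
use std::collections::HashMap;

/// Hypothetical action row; the real table has more columns.
#[derive(Debug, Clone, Default, PartialEq)]
struct PostActions {
    saved: Option<i64>,
    read: Option<i64>,
    like_score: Option<i16>,
}

impl PostActions {
    /// True when every action column is unset.
    fn is_empty(&self) -> bool {
        // Destructuring without `..` is exhaustive: adding a field to the
        // struct makes this line a compile error until it is updated.
        let PostActions { saved, read, like_score } = self;
        saved.is_none() && read.is_none() && like_score.is_none()
    }
}

/// Key of the in-memory table: (person_id, post_id).
type Key = (i32, i32);

/// Unsets `saved` for one row and removes the row if no action remains.
/// Returns true if the row was deleted.
fn unset_saved(table: &mut HashMap<Key, PostActions>, key: Key) -> bool {
    let now_empty = match table.get_mut(&key) {
        Some(row) => {
            row.saved = None;
            row.is_empty()
        }
        // No such row: nothing to update or delete.
        None => return false,
    };
    if now_empty {
        table.remove(&key);
    }
    now_empty
}
```

This sketch says nothing about concurrency; in the database that part relies on the row lock and the re-checked WHERE condition mentioned above.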

@dullbananas
Collaborator Author

conflicts are resolved now

@dessalines
Member

We still gotta get more ppl than me looking at this. It's been on our PR list for too long, and it'll give a lot of potential performance benefits.

@Nutomic
Member

Nutomic commented Oct 23, 2024

My comments are not addressed yet.

@Nutomic
Member

Nutomic commented Oct 31, 2024

Did you actually compare the query plans, e.g. for PostView, before and after these changes to verify that there is a major benefit? These changes are very complex and can cause strange bugs from AssumeNotNull, as well as making future code changes much more difficult. So if there is only a minor benefit I would rather skip it and keep the current implementation. It may not be the most efficient, but at least it's easy to understand and maintain.

If we merge this then you definitely need to add tests for uplete.rs. In case there is a weird failure in api tests it would be very hard to track it down to a specific part of that file otherwise.

@dullbananas
Collaborator Author

There are now tests in the uplete module.

I don't remember checking the query plans and durations. I will do that soon. Or you could do it if you have enough time in the next few days, which should be super easy with scripts/db_perf.sh. If you do, remember to merge from main right before checking.

I don't completely agree about the maintainability tradeoff. I think the current action-related code is completely the opposite of "easy to understand and maintain". The joins are already much simpler with the combined tables, and adding more actions should be easier overall. In the future there can be fewer maintainability problems by not using separate structs, or separate fields in views, for each individual action type.

Let me know if you want me to reduce the assume_not_null risk before this PR is merged, at the expense of this PR taking a much longer time.

@Nutomic
Member

Nutomic commented Nov 8, 2024

Alright, I've approved it.

@dessalines
Member

@dullbananas once you work out the conflicts, we can get this merged. This is one of the oldest prs so it should take merging order priority so you don't have to keep maintaining it.

@Nutomic Nutomic merged commit 2e8687e into LemmyNet:main Nov 11, 2024
1 check passed
@Nutomic
Member

Nutomic commented Nov 18, 2024

With this PR merged there are some errors in scheduled tasks again:

2024-11-18T14:03:47.666537Z ERROR lemmy_server::scheduled_tasks: Failed to update site stats: relation "post_like" does not exist
2024-11-18T14:03:47.666802Z  INFO actix_server::builder: starting 16 workers
2024-11-18T14:03:47.666868Z  INFO actix_server::server: Tokio runtime found; starting in existing Tokio runtime
2024-11-18T14:03:47.666880Z  INFO actix_server::server: starting service: "actix-web-service-0.0.0.0:8536", workers: 16, listening on: 0.0.0.0:8536
2024-11-18T14:03:47.666942Z ERROR lemmy_server::scheduled_tasks: Failed to update community stats: relation "post_like" does not exist
2024-11-18T14:03:47.667028Z  INFO lemmy_federate: Starting federation workers for process count 1 and index 0
2024-11-18T14:03:47.667313Z ERROR lemmy_server::scheduled_tasks: Failed to update site stats: relation "post_like" does not exist
2024-11-18T14:03:47.667631Z ERROR lemmy_server::scheduled_tasks: Failed to update community stats: relation "post_like" does not exist
2024-11-18T14:03:47.667959Z ERROR lemmy_server::scheduled_tasks: Failed to update site stats: relation "post_like" does not exist
2024-11-18T14:03:47.668302Z ERROR lemmy_server::scheduled_tasks: Failed to update community stats: relation "post_like" does not exist
2024-11-18T14:03:47.668598Z ERROR lemmy_server::scheduled_tasks: Failed to update site stats: relation "post_like" does not exist
2024-11-18T14:03:47.668927Z ERROR lemmy_server::scheduled_tasks: Failed to update community stats: relation "post_like" does not exist

@dessalines
Member

I'll make an issue for that.

@phiresky
Collaborator

phiresky commented Dec 15, 2024

Just saw this - I think you're missing some index(es). For listing saved posts with WHERE saved IS NOT NULL ORDER BY saved DESC, you want an index on post_actions (person_id, saved) WHERE saved IS NOT NULL.

The existing index on post_actions (person_id, post_id) WHERE saved IS NOT NULL is okay, but it does force a full scan through all of the user's saved posts (which is not too bad because users probably have <100 posts saved).
See also my comment on #5264
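The difference between the two index orderings can be sketched with ordered collections in Rust standing in for B-tree indexes. This is an analogy under stated assumptions, not how PostgreSQL executes the query: the function names and the tuple layouts are hypothetical. With keys ordered by (person_id, saved), the newest saved posts come straight out of a reverse range scan; with keys ordered by (person_id, post_id), every saved row of that user has to be visited and then sorted.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Stand-in for an index on (person_id, saved), with post_id carried along.
/// Entries are (person_id, saved_at, post_id).
fn newest_saved(index: &BTreeSet<(i32, i64, i32)>, person_id: i32, limit: usize) -> Vec<i32> {
    index
        // Range over exactly this person's entries, already ordered by saved_at.
        .range((person_id, i64::MIN, i32::MIN)..=(person_id, i64::MAX, i32::MAX))
        .rev() // newest first
        .take(limit) // stops after `limit` entries
        .map(|&(_, _, post_id)| post_id)
        .collect()
}

/// Stand-in for an index on (person_id, post_id), with saved_at as the value.
fn newest_saved_by_post_id(
    index: &BTreeMap<(i32, i32), i64>,
    person_id: i32,
    limit: usize,
) -> Vec<i32> {
    // All of this person's saved rows must be read, because they are
    // ordered by post_id rather than by saved_at.
    let mut rows: Vec<(i64, i32)> = index
        .range((person_id, i32::MIN)..=(person_id, i32::MAX))
        .map(|(&(_, post_id), &saved_at)| (saved_at, post_id))
        .collect();
    rows.sort_by(|a, b| b.cmp(a)); // explicit sort, newest first
    rows.into_iter().take(limit).map(|(_, post_id)| post_id).collect()
}
```

Both functions return the same post ids; the second one just does work proportional to the user's total number of saved posts, which matches the remark that this is tolerable while that number stays small.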
