Skip to content

Commit

Permalink
Reworking query logic to avoid unbounded inner queries when since_id …
Browse files Browse the repository at this point in the history
…is very old or max_id is very new when min_id is very old.

removed call to subject in test because subject is memoized on first call
  • Loading branch information
sneakers-the-rat committed Jan 27, 2024
1 parent 58b2099 commit b6ee8b9
Show file tree
Hide file tree
Showing 2 changed files with 50 additions and 21 deletions.
70 changes: 50 additions & 20 deletions app/models/public_feed.rb
Original file line number Diff line number Diff line change
@@ -1,6 +1,9 @@
# frozen_string_literal: true

class PublicFeed
# number of days ahead to look when filtering boosts
FILTER_REBLOGS_N_DAYS_AHEAD = 1.day

# @param [Account] account
# @param [Hash] options
# @option [Boolean] :with_replies
Expand Down Expand Up @@ -91,36 +94,63 @@ def without_reblogs_scope
Status.without_reblogs
end

def without_duplicate_reblogs(limit, max_id, since_id, min_id)
def without_duplicate_reblogs(limit, max_id, _since_id, min_id)
# See https://wiki.neuromatch.social/Filter_Duplicate_Boosts
# General strategy: for a given page of results, we care about whether
# a post that has been boosted in this page has also been boosted more recently.
# ie. for filtering, we care about all the posts in the future, but none in the past.
#
# we have three nested queries that are roughly:
# - regular (non-boost posts) OR
# - boosts IN
# - most recent boost per original post IN
# - boosts made from now until n days in the future

# First get limited statuses that could contain boosts
# Get this first so the later DISTINCT ON term's need for ordering by reblog_of_id
# doesn't cause us to be filtering over the wrong statuses
candidate_statuses = Status.select(:id).reorder(id: :desc)

# apply similar filtering criteria to outer get call
if min_id.present?
candidate_statuses = candidate_statuses.where(Status.arel_table[:id].gt(min_id))
elsif since_id.present?
candidate_statuses = candidate_statuses.where(Status.arel_table[:id].gt(since_id))
elsif limit.present?
# We only apply the limit when we have no lower bound
#
# To avoid doing the full query twice (incl all the scopes from the outer get method)
# we just get some amount of statuses `n` larger than the final limit s.t. we assume
# the number of statuses that would be filtered out by the `get` contexts aren't >`n`
# This should be fast since it's just a slice from the index
limit *= 5
candidate_statuses = candidate_statuses.limit(limit)
end
# Set lower bound
candidate_statuses = if min_id.present?
# we don't want to reorder id ascending here like the outer min_id param does -
# in this inner query, we want to only show the latest boost, and so the order
# needs to be descending. in both cases, we want to evaluate many more statuses into the future
# than the outer query, which provides this option to use like min_id + limit
candidate_statuses.where(Status.arel_table[:id].gt(min_id))
elsif max_id.present?
# Since_id is only relevant as a limit when there are fewer than `limit` posts between max_id
# and since_id, so we don't use it here. We can't just use `limit` unless we don't have a max_id, which
# means we are only getting the most recent statuses. Otherwise, we need to get some arbitrary number
# of statuses in the future to check for boosts. To set some lower bound on that, we use the limit + max_id
# to get the nth post after the max_id as the lower bound.
candidate_statuses.where(<<~SQL.squish, max_id: max_id.to_i, limit: limit)
"statuses"."id" > (
SELECT LAG(id,:limit) OVER (
ORDER BY id DESC
) FROM statuses
WHERE id = :max_id
)
SQL
else
candidate_statuses.limit(limit)
end

if max_id.present?
# We don't want to use the max_id since that will cause boosts to repeat
# across pages. Instead, if we're provided one, we increase it by a day -
if min_id.present?
# the only time a missing max_id is a problem is when we are doing reverse pagination, ie. with
# min_id, so in that case we set the max_id relative to that.
# Check this before max_id, since if a min_id is present max_id is not limiting
max_time = Mastodon::Snowflake.to_time(min_id.to_i)
max_time += FILTER_REBLOGS_N_DAYS_AHEAD
max_id = Mastodon::Snowflake.id_at(max_time)
candidate_statuses = candidate_statuses.where(Status.arel_table[:id].lt(max_id))
elsif max_id.present?
# We don't want to use the given max_id since that will cause boosts to repeat
# across pages. Instead, if we're provided one, we increase it by some window of future posts to consider -
# so ok if people are boosting the same post a lot and there are fewer than 20 posts
# in a day, you'll see duplicates.
max_time = Mastodon::Snowflake.to_time(max_id.to_i)
max_time += 1.day
max_time += FILTER_REBLOGS_N_DAYS_AHEAD
max_id = Mastodon::Snowflake.id_at(max_time)
candidate_statuses = candidate_statuses.where(Status.arel_table[:id].lt(max_id))
end
Expand Down
1 change: 0 additions & 1 deletion spec/models/public_feed_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -68,7 +68,6 @@
sleep(0.002)

boost = Fabricate(:status, reblog_of_id: status.id, account: poster)
expect(subject).to include(boost.id)

# sleep for 2ms to make sure the other posts come in a greater snowflake ID
sleep(0.002)
Expand Down

0 comments on commit b6ee8b9

Please sign in to comment.