[Q&A] How can I get the top N match per group? #1661
-
What happens?If I have a data of users + groups, each user can belong to only one group, for any paradedb query, what would be the best way to get only the top N users per group? To ReproduceSimilarly with how a query like the following functions: SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY group ORDER BY id DESC) AS row_num
FROM users
) AS subquery
WHERE row_num <= 10; OS:x64 ParadeDB Version:0.9.4 Are you using ParadeDB Docker, Helm, or the extension(s) standalone?ParadeDB Docker Image Full Name:Ed David Affiliation:DSG Did you include all relevant data sets for reproducing the issue?Yes Did you include the code required to reproduce the issue?
Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?
|
Beta Was this translation helpful? Give feedback.
Replies: 1 comment 2 replies
-
Hey @edjdavid, the best version of your query depends on a few factors, like whether you need the BM25 score in your results and the size of The biggest factor on query speed with our search index is the number of results returned from a SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY "group" ORDER BY id DESC) AS row_num
FROM (
SELECT id, name, "group"
FROM user_group.search('name:bob AND group:1', limit_rows => 10)
UNION ALL
SELECT id, name, "group"
FROM user_group.search('name:bob AND group:2', limit_rows => 10)
UNION ALL
SELECT id, name, "group"
FROM user_group.search('name:bob AND group:3', limit_rows => 10)
UNION ALL
SELECT id, name, "group"
FROM user_group.search('name:bob AND group:4', limit_rows => 10)
UNION ALL
SELECT id, name, "group"
FROM user_group.search('name:bob AND group:5', limit_rows => 10)
) AS all_groups
) AS subquery
WHERE row_num <= 10; Here's an alternative that instead returns the BM25 score and performs an inner join on the users table. This may have different performance characteristics to the example above, because we're not pushing down the group filtering to Tantivy. All this depends on the size of your data, the size of the set of matches, and the size of your Top N. SELECT group.id, score_bm25.*
FROM group,
(
SELECT *
FROM idxusers.score_bm25(paradedb.boolean(must => ARRAY [
paradedb.parse('<user query here>'),
paradedb.term('group_id', group.id)
]), limit_rows => 10)
INNER JOIN users ON score_bm25.id = users.id
) scored_users; |
Beta Was this translation helpful? Give feedback.
We've added support for cross-table index joins: https://docs.paradedb.com/api-reference/optimization/joins