Fix bug preventing `orderedAggregate` and `distinctAggregator` from being used together (alternative approach) #578

shane-circuithub · 2023-10-05T16:05:56Z

(This is an alternative fix to the bug described in #577 that doesn't require adding Eq and Ord constraints to PrimExpr.)

This PR contains two commits.

The first commit adds a (failing) test case demonstrating the inability to combine orderedAggregate and distinctAggregator together, even in the limited cases allowed by PostgreSQL. The problem is that PostgreSQL only allows this if the expressions given in the ORDER BY clause match the expressions given as arguments to the aggregation function.

The second commit fixes this bug. It does so by adding Eq and Ord constraints to Symbol (but not PrimExpr) and using them to detect identical references anywhere within the PrimExprs contained within the Aggregate. Instead of renaming entire expressions, we only rename references to fields (which could possibly be lateral references, though this isn't checked). We maintain a Map of already-renamed symbols so that identical references are renamed to the same symbol, satisfying PostgreSQL's restriction.
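The renaming idea can be sketched in isolation. This is a minimal illustration and not the code in this PR: `Symbol` and `Expr` below are simplified, hypothetical stand-ins for Opaleye's `Symbol` and `PrimExpr`, and `rename`/`renameAll` are made-up names. It only shows how a Map keyed on symbols lets identical references share one new name while non-reference expressions are left alone.

```haskell
import qualified Data.Map.Strict as Map
import Data.Traversable (mapAccumL)

-- Hypothetical, simplified stand-ins for Opaleye's Symbol and PrimExpr.
newtype Symbol = Symbol String deriving (Eq, Ord, Show)

data Expr
  = Ref Symbol            -- a reference to a field of the inner query
  | Lit String            -- a non-reference expression such as 0.5
  | BinOp String Expr Expr
  deriving (Eq, Show)

-- Symbols renamed so far, plus the index of the next fresh name.
type Renames = (Map.Map Symbol Symbol, Int)

-- Rename references only; a symbol seen before reuses its earlier new name.
rename :: Renames -> Expr -> (Renames, Expr)
rename st@(seen, next) expr = case expr of
  Lit _ -> (st, expr)
  Ref s -> case Map.lookup s seen of
    Just s' -> (st, Ref s')
    Nothing ->
      let s' = Symbol ("inner_" ++ show next)
      in ((Map.insert s s' seen, next + 1), Ref s')
  BinOp op l r ->
    let (st1, l') = rename st l
        (st2, r') = rename st1 r
    in (st2, BinOp op l' r')

-- Thread the renaming state through a list of expressions.
renameAll :: [Expr] -> (Renames, [Expr])
renameAll = mapAccumL rename (Map.empty, 1)
```

With this sketch, `renameAll [Ref (Symbol "column1"), Lit "0.5", Ref (Symbol "column1")]` renames both references to the same `inner_1` and leaves the literal untouched.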

shane-circuithub · 2023-10-05T16:12:21Z

This is based on the refactoring in #575.

tomjaguarpaw · 2023-10-05T19:02:28Z

I think this is my preferred approach but I won't be able to get to it immediately.

tomjaguarpaw · 2023-10-14T17:06:35Z

Hi @shane-circuithub, please ping me if this is urgent for you. I'll get to it in due course but I'm short of spare mental cycles at the moment.

shane-circuithub · 2023-10-16T07:08:12Z

It's not necessarily urgent that you merge this, because I can just pin to this commit for now, but I am actually using this. I didn't realise when I first wrote this, but this is actually needed for ordered-set aggregation functions to work. Take the following query for example:

SELECT
  percentile_cont(0.5) WITHIN GROUP (ORDER BY column1)
FROM (
  VALUES
    (1),
    (2),
    (3),
    (4)
) _;

 percentile_cont 
-----------------
             2.5
(1 row)

If you try to recreate that with Opaleye (without this commit), it will produce the following (paraphrased):

SELECT
  percentile_cont(inner_1) WITHIN GROUP (ORDER BY inner_2)
FROM (
  SELECT
    0.5 AS inner_1,
    column1 AS inner_2
  FROM (
    VALUES
      (1),
      (2),
      (3),
      (4)
  ) _
) _;

ERROR:  column "_.inner_1" must appear in the GROUP BY clause or be used in an aggregate function
LINE 2:   percentile_cont(inner_1) WITHIN GROUP (ORDER BY inner_2)
                          ^
DETAIL:  Direct arguments of an ordered-set aggregate must use only grouped columns.

With this PR, this becomes:

SELECT
  percentile_cont(0.5) WITHIN GROUP (ORDER BY inner_1)
FROM (
  SELECT
    column1 AS inner_1
  FROM (
    VALUES
      (1),
      (2),
      (3),
      (4)
  ) _
) _;

 percentile_cont 
-----------------
             2.5
(1 row)

With this PR, extractAggregateFields only rewrites references (e.g., column1 to inner_1); it doesn't rewrite non-reference expressions (such as 0.5), which allows Opaleye to produce valid SQL for ordered-set aggregation functions. This is why I closed #577 in favour of this PR: #577 wouldn't solve this problem.

The other "valid" way to construct ordered set aggregation functions (which admittedly I care much less about supporting) is with the direct argument in a GROUP BY clause, e.g.:

SELECT
  percentile_cont(column1) WITHIN GROUP (ORDER BY column2)
FROM (
  VALUES
    (0.5, 1),
    (0.5, 2),
    (0.5, 3),
    (0.5, 4),
    (0.8, 5),
    (0.8, 6),
    (0.8, 7),
    (0.8, 8)
) _
GROUP BY
  column1;

Without this PR, if you try to write this in Opaleye it will produce the following SQL (again paraphrased):

SELECT
  percentile_cont(inner_2) WITHIN GROUP (ORDER BY inner_3)
FROM (
  SELECT
    column1 AS inner_1,
    column1 AS inner_2,
    column2 AS inner_3
  FROM (
    VALUES
      (0.5, 1),
      (0.5, 2),
      (0.5, 3),
      (0.5, 4),
      (0.8, 5),
      (0.8, 6),
      (0.8, 7),
      (0.8, 8)
  ) _
) _
GROUP BY
  inner_1;

ERROR:  column "_.inner_2" must appear in the GROUP BY clause or be used in an aggregate function
LINE 2:   percentile_cont(inner_2) WITHIN GROUP (ORDER BY inner_3)
                          ^
DETAIL:  Direct arguments of an ordered-set aggregate must use only grouped columns.

Even though inner_1 and inner_2 refer to the same underlying column1, Postgres isn't smart enough to detect that; it wants the exact same expression in the GROUP BY clause and the direct argument. However, with this PR, it becomes:

SELECT
  percentile_cont(inner_1) WITHIN GROUP (ORDER BY inner_2)
FROM (
  SELECT
    column1 AS inner_1,
    column2 AS inner_2
  FROM (
    VALUES
      (0.5, 1),
      (0.5, 2),
      (0.5, 3),
      (0.5, 4),
      (0.8, 5),
      (0.8, 6),
      (0.8, 7),
      (0.8, 8)
  ) _
) _
GROUP BY
  inner_1;

 percentile_cont 
-----------------
             2.5
             7.4

Because this PR rewrites only references, and because it keeps track of already-rewritten references and makes sure to rename each reference only once, both the GROUP BY and the direct argument to percentile_cont end up with the same inner_1 expression.

This is not to put pressure on you to deal with this right now if you don't have the capacity (again I'm happy to just pin to this commit in the short-term), but it just explains an additional motivation that I wasn't aware of when I first wrote this.

tomjaguarpaw · 2023-10-20T12:23:56Z

> The first commit adds a (failing) test

Could you double check? That test passes here.

EDIT: No, it fails here (as expected)!

tomjaguarpaw · 2023-10-20T12:31:56Z

Hmm, and it fails in CI. Maybe a difference between Postgres versions? I'm using 13 locally and CI is using 11.

https://github.com/tomjaguarpaw/haskell-opaleye/actions/runs/6587659165/job/17898358429#step:17:627

tomjaguarpaw · 2023-10-20T12:39:30Z

Hmm, no it fails in CI in 13 too. I'm really confused why it doesn't fail for me locally!

https://github.com/tomjaguarpaw/haskell-opaleye/actions/runs/6587744501/job/17898614746#step:17:631

tomjaguarpaw · 2023-10-20T12:44:33Z

No, it does fail locally. Not sure what I was thinking. Never mind.

tomjaguarpaw · 2023-10-20T12:45:26Z

The error is "in an aggregate with DISTINCT, ORDER BY expressions must appear in argument list", so I think I have to understand that first.

tomjaguarpaw · 2023-10-27T09:58:43Z

I think we have a bigger problem, because if I slightly change your test case then your fix is no longer sufficient. I appreciate that your fix is enough to make it work with well-formed examples that should work but I'd much rather find a fix that doesn't allow generating incorrect SQL.

@ Test/Test.hs:626 @ testStringArrayAggregateOrderedDistinct :: Test
 testStringArrayAggregateOrderedDistinct = it "" $ q `selectShouldReturnSorted` expected
   where q =
           O.aggregateOrdered
-            (O.asc snd)
+            (O.asc (\x -> snd x O..++ snd x))
             (PP.p2 (O.arrayAgg, O.distinctAggregator . O.stringAgg . O.sqlString $ ","))
             table7Q
         expected = [ ( map fst sortedData
@ Test/Test.hs:1506 @ main = do

tomjaguarpaw · 2023-10-27T10:26:33Z

I can now understand the aggregator DISTINCT issue (but not the WITHIN GROUP one). If you have an aggregation of the following form then the expressions p, q, r must each occur literally within the set {a,b,c,...}. That's a pretty annoying restriction! And it will be hard to enforce statically, but that's how I'd like to solve this particular issue. I'll have more of a think.

FROM (SELECT
     AGGREGATOR(DISTINCT a,b,c,... ORDER BY p,q,r, ...) as "result3_3"
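The restriction can be written down as a small predicate. This is only a sketch under a loud assumption: each expression is represented by its rendered SQL text (the hypothetical `SqlExpr` type), so "occurs literally" becomes plain string equality. `orderByAllowed` is a made-up name and not part of Opaleye.

```haskell
-- Hypothetical representation: an expression is its rendered SQL text.
type SqlExpr = String

-- True when every ORDER BY expression occurs literally among the
-- arguments of the DISTINCT aggregate.
orderByAllowed :: [SqlExpr] -> [SqlExpr] -> Bool
orderByAllowed args = all (`elem` args)
```

Under this model, ordering by `inner_2` when `inner_2` is an argument is accepted, whereas ordering by `inner_2 || inner_2` (the shape produced by the modified test case above) is rejected, because the concatenation is not itself one of the arguments.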

tomjaguarpaw · 2023-10-27T11:02:54Z

I think the correct API for this is probably something like

makeAggregator ::
  AggregatorFunction w a b ->
  Order a ->
  Aggregator a b

makeDistinctAggregator ::
  AggregatorFunction w a b ->
  Order w ->
  Aggregator a b

where the AggregatorFunctions are like

stringAgg ::
  AggregatorFunction
    (Wrap (Field SqlText), Wrap (Field SqlText))
    (Field SqlText, Field SqlText)
    (Field SqlText)

and Wrap is an opaque wrapper that allows you to reorder Fields, but doesn't allow you to apply any Field functions to them. That is, the w parameter to the AggregatorFunction is the exact arguments supplied to the SQL aggregation function. So for example, we would have

wrap :: Field a -> Wrap (Field a)

ascWrap :: Order (Wrap (Field a))

asc :: Order (Field a)
asc = contramap wrap ascWrap

I think this would work. It's a bit heavyweight, but actually resolves my preexisting concerns about the semantics of orderAggregate, aggregateOrdered and distinctAggregator.

I've no idea whether the same approach would work for the WITHIN GROUP problem.
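To make the Wrap idea concrete, here is a self-contained toy model. Everything in it is hypothetical: `Field`, `Wrap`, `Order`, `append` and `runOrder` are simplified stand-ins written for this illustration, not Opaleye's real definitions. In a real module the `Wrap` constructor would not be exported, so the only way to obtain an `Order (Wrap (Field a))` would be `ascWrap` plus reordering.

```haskell
import Data.Functor.Contravariant (Contravariant (..))

-- Hypothetical stand-ins; not Opaleye's real types.
newtype Field a = Field String           -- a rendered SQL expression
newtype Wrap a = Wrap a                  -- constructor would stay unexported
newtype Order a = Order (a -> [String])  -- ORDER BY terms for a given input

instance Contravariant Order where
  contramap f (Order g) = Order (g . f)

wrap :: Field a -> Wrap (Field a)
wrap = Wrap

-- The only primitive ordering on wrapped fields: order by the field as-is.
ascWrap :: Order (Wrap (Field a))
ascWrap = Order (\(Wrap (Field e)) -> [e ++ " ASC"])

-- Ordering on plain fields is recovered by wrapping first.
asc :: Order (Field a)
asc = contramap wrap ascWrap

-- An operation on plain Fields with no counterpart on Wrap.
append :: Field a -> Field a -> Field a
append (Field l) (Field r) = Field (l ++ " || " ++ r)

-- Run an ordering to see which ORDER BY terms it would produce.
runOrder :: Order a -> a -> [String]
runOrder (Order f) = f
```

In this model `contramap (\x -> append x x) asc` is still a valid `Order (Field a)`, which is the problematic case from the modified test. No analogous function exists on `Wrap (Field a)`, so an `Order w` handed to something like `makeDistinctAggregator` could only mention the aggregate's own arguments.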

It's redundant because we only need to rebind PrimExprs that came from a lateral subquery. As explained at [1], rather than carefully analysing which PrimExprs came from a lateral subquery we can just rebind everything. A previous commit [2] changed things so that everything mentioned in aggregator order expressions is rebound. Instead we probably should have done what this commit does, that is, added an Unpackspec constraint to aggregate.

Fixes #587

This is a simpler approach to resolving the issues discussed at

* #585
* #578

This still suffers from the problem described at: #578 (comment)

i.e. if we find a way of duplicating field names, such as

O.asc (\x -> snd x O..++ snd x)

then we can still create crashing queries. The benefit of this comment is that there is a way of generating non-crashing queries!

[1] https://github.com/tomjaguarpaw/haskell-opaleye/blob/52a4063188dd617ff91050dc0f2e27fc0570633c/src/Opaleye/Internal/Aggregate.hs#L111-L114

[2] d848317, part of #576

shane-circuithub force-pushed the aggregate-distinct-order-by-fix-2 branch from f5c179b to 66c8f7e Compare October 5, 2023 16:11

shane-circuithub mentioned this pull request Oct 5, 2023

Add support for ordered-set aggregation functions (WITHIN GROUP) #576

Closed

shane-circuithub force-pushed the aggregate-distinct-order-by-fix-2 branch 2 times, most recently from 86f19ab to ce8015a Compare October 11, 2023 15:02

shane-circuithub mentioned this pull request Oct 11, 2023

Fix bug preventing orderedAggregate and distinctAggregator from being used together (one approach) #577

Closed

shane-circuithub force-pushed the aggregate-distinct-order-by-fix-2 branch from ce8015a to b4d0804 Compare October 11, 2023 17:45

shane-circuithub force-pushed the aggregate-distinct-order-by-fix-2 branch from b4d0804 to ad27062 Compare October 15, 2023 21:55

shane-circuithub added 2 commits November 20, 2023 11:51

Add (failing) test for the combination of distinctAggregate and `orderAggregate` (856b5b8)

Detect identical references in extractAggregateFields to allow `distinctAggregator` to be used with `orderAggregate` (56e733e)

shane-circuithub force-pushed the aggregate-distinct-order-by-fix-2 branch from ad27062 to 56e733e Compare November 20, 2023 11:52

shane-circuithub mentioned this pull request Jan 9, 2024

Rewrite only references in aggregation, not all values #585

Closed
