Rewrite only references in aggregation, not all values #585

shane-circuithub · 2024-01-09T21:04:44Z

In #576 I added preliminary support for ordered set aggregation functions. However, in the vast majority of cases this generates invalid SQL as things stand.

This PR contains two commits: the first adds a failing test demonstrating the problem, the second one fixes the problem.

To recap, recall PostgreSQL does not allow lateral references in aggregation functions, e.g., the following SQL is invalid:

SELECT
  result
FROM
  (
    VALUES
      (1),
      (2),
      (3)
  ) _(x),
  LATERAL (
    SELECT
      sum(x) AS result
  ) __

It fails with the error message ERROR: aggregate functions are not allowed in FROM clause of their own query level. Opaleye works around this limitation by rewriting the above query as follows:

SELECT
  result
FROM
  (
    VALUES
      (1),
      (2),
      (3)
  ) _(x),
  LATERAL (
    SELECT
      sum(inner1) AS result
    FROM
      (
        SELECT
          x AS inner1
      ) _
  ) __

The current implementation of this rewriting rewrites all arguments of aggregation functions regardless of whether they contain (potentially lateral) references or not. However, the same effect could be achieved if we only rewrote references and not all expressions. That's what this PR does.

This change is beneficial when using set aggregation functions such as percentile_cont. With the current rewriting scheme, Opaleye will generate:

SELECT
  result
FROM
  (
    VALUES
      (1),
      (2),
      (3)
  ) _(x),
  LATERAL (
    SELECT
      percentile_cont(inner1) WITHIN GROUP (ORDER BY inner2) AS result
    FROM
      (
        SELECT
          0.5 AS inner1,
          x AS inner2
      ) _
  ) __

Which fails with the error message:

ERROR:  column "_.inner1" must appear in the GROUP BY clause or be used in an aggregate function
LINE 12:       percentile_cont(inner1) WITHIN GROUP (ORDER BY inner2)...
                               ^
DETAIL:  Direct arguments of an ordered-set aggregate must use only grouped columns.

After the change in this PR, this instead becomes:

SELECT
  result
FROM
  (
    VALUES
      (1),
      (2),
      (3)
  ) _(x),
  LATERAL (
    SELECT
      percentile_cont(0.5) WITHIN GROUP (ORDER BY inner1) AS result
    FROM
      (
        SELECT
          x AS inner1
      ) _
  ) __

Which works.

Incidentally this is the exactly the same fix as the one proposed in #578, which I understand you consider a less than ideal fix for the problem it purports to solve, but I think the API changes that would result from your proposed solution in the comments (although it's correct) are bigger than I want to take on right now. In any case, this is the problem that I am eager to solve at this time, so would you consider this fix as a solution to the problem described above, even if you're not satisfied that it truly solves #578? It shouldn't break any existing code as far as I can see.

tomjaguarpaw · 2024-01-09T21:09:45Z

Thanks. This post contains a lot of repeated bits. I guess that's copy-paste error? If so, could you tidy it up?

shane-circuithub · 2024-01-09T21:28:51Z

Apologies for that, it should be fixed now.

tomjaguarpaw · 2024-01-10T10:55:11Z

Thanks. I'm very short on time and energy but I'd like to get this to a stage where you are at least unblocked for what you need to do.

I think the API changes that would result from your proposed solution in the comments (although it's correct) are bigger than I want to take on right now

When you say "take on" do you mean that the implementation of the changes to Opaleye (described in #578 (comment)) are too big to take on, or once such changes to Opaleye are made, the changes required on the rel8 side are too big? I'm not particularly worried about the former. I think I can probably do it myself given a free weekend afternoon. On the other hand, I don't know if the new API will be useable by rel8 at all, in the way you want to use it.

tomjaguarpaw · 2024-01-10T11:38:09Z

Oh, and I incorporated your tests as "pending" into the test suite. Can you rebase your branch (or maybe all of them)?

To recap, PostgreSQL does not allow lateral references in aggregation functions, e.g., the following SQL is invalid: ```sql SELECT result FROM ( VALUES (1), (2), (3) ) _(x), LATERAL ( SELECT sum(x) AS result ) __ ``` It fails with the error message `ERROR: aggregate functions are not allowed in FROM clause of their own query level`. Opaleye works around this limitation by rewriting the above query as follows: ```sql SELECT result FROM ( VALUES (1), (2), (3) ) _(x), LATERAL ( SELECT sum(inner1) AS result FROM ( SELECT x AS inner1 ) _ ) __ ``` The current implementation of this rewriting rewrites all arguments of aggregation functions regardless of whether they contain (potentially lateral) references or not. However, the same effect could be achieved if we only rewrote references and not all expressions. That's what this PR does. This change is beneficial when using set aggregation functions such as `percentile_cont`. With the current rewriting scheme, Opaleye will generate: ```sql SELECT result FROM ( VALUES (1), (2), (3) ) _(x), LATERAL ( SELECT percentile_cont(inner1) WITHIN GROUP (ORDER BY inner2) AS result FROM ( SELECT 0.5 AS inner1, x AS inner2 ) _ ) __ ``` Which fails with the error message: ``` ERROR: column "_.inner1" must appear in the GROUP BY clause or be used in an aggregate function LINE 12: percentile_cont(inner1) WITHIN GROUP (ORDER BY inner2) ^ DETAIL: Direct arguments of an ordered-set aggregate must use only grouped columns. ``` After the change in this PR, this instead becomes: ```sql SELECT result FROM ( VALUES (1), (2), (3) ) _(x), LATERAL ( SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY inner1) AS result FROM ( SELECT x AS inner1 ) _ ) __ ``` Which works.

shane-circuithub · 2024-01-10T13:45:15Z

When you say "take on" do you mean that the implementation of the changes to Opaleye (described in #578 (comment)) are too big to take on, or once such changes to Opaleye are made, the changes required on the rel8 side are too big? I'm not particularly worried about the former. I think I can probably do it myself given a free weekend afternoon. On the other hand, I don't know if the new API will be useable by rel8 at all, in the way you want to use it.

Yeah, I just mean on the Rel8 side of things. I actually don't care so much about the ability to use DISTINCT and ORDER BY together. I was aware of the problem but it was the ordered set aggregation that I was trying to solve, I just realised that my solution "solved" both problems but maybe I put too much emphasis on the former.

I've rebased this PR.

tomjaguarpaw · 2024-01-12T08:51:44Z

OK, I'll have a closer look at the weekend.

tomjaguarpaw · 2024-01-13T13:27:54Z

After refreshing my memory on this issue, the biggest objection I have is that it's still very easy to write crashing queries after this PR. The proposed change feels too invasive for a partial solution.

By way of alternative, I suggest excluding the Aggregator itself from what is rebound. I don't see any reason why we need to include it. I'll work on that now.

tomjaguarpaw · 2024-01-13T15:19:59Z

By way of alternative, I suggest excluding the Aggregator itself from what is rebound.

This is it. @duairc, please take a look. It fixes the particular test cases you wrote, at least. Does it work for your general use cases?

#586

shane-circuithub · 2024-01-15T20:31:04Z

I've just tested this out. It did require some changes to Rel8 and to CircuitHub code that uses Rel8, but they were relatively tolerable. I would have preferred not to incur a Table Expr i constraint on aggregate but your solution does make a lot of sense and it isn't really the end of the world. So I'm happy to go with this if you're happy with it.

tomjaguarpaw · 2024-01-15T20:57:01Z

Yeah, it's unfortunate to pick up that extra constraint but it does seem like it plays a genuine role. The API is still not safe (lmapping Aggregators can lead to crashing queries) but at least now it's possible to write the queries you want, so it seems like the right first step.

It's redundant because we only need to rebind PrimExprs that came from a lateral subquery. As explained at [1], rather than carefully analysing which PrimExprs came from a lateral subquery we can just rebind everything. A previous commit [2] changed things so that everything mentioned in aggregator order expressions is rebound. Instead we probably should have done what this commit does, that is, added an Unpackspec constraint to aggregate. Fixes #587 This is a simpler approach to resolving the issues discussed at * #585 * #578 This still suffers from the problem described at: #578 (comment) i.e. if we find a way of duplicating field names, such as O.asc (\x -> snd x O..++ snd x) then we can still create crashing queries. The benefit of this comment is that there is a way of generating non-crashing queries! [1] https://github.com/tomjaguarpaw/haskell-opaleye/blob/52a4063188dd617ff91050dc0f2e27fc0570633c/src/Opaleye/Internal/Aggregate.hs#L111-L114 [2] d848317, part of #576

shane-circuithub force-pushed the set-aggregate-test branch from efb6549 to 9183036 Compare January 9, 2024 21:23

shane-circuithub mentioned this pull request Jan 9, 2024

Release 1.5.0.0 circuithub/rel8#298

Merged

shane-circuithub force-pushed the set-aggregate-test branch from 9183036 to 9d7fd96 Compare January 9, 2024 21:30

shane-circuithub force-pushed the set-aggregate-test branch from 9d7fd96 to 43670ff Compare January 10, 2024 13:43

shane-circuithub closed this Feb 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewrite only references in aggregation, not all values #585

Rewrite only references in aggregation, not all values #585

shane-circuithub commented Jan 9, 2024 •

edited

Loading

tomjaguarpaw commented Jan 9, 2024

shane-circuithub commented Jan 9, 2024

tomjaguarpaw commented Jan 10, 2024

tomjaguarpaw commented Jan 10, 2024

shane-circuithub commented Jan 10, 2024

tomjaguarpaw commented Jan 12, 2024

tomjaguarpaw commented Jan 13, 2024

tomjaguarpaw commented Jan 13, 2024

shane-circuithub commented Jan 15, 2024

tomjaguarpaw commented Jan 15, 2024

Rewrite only references in aggregation, not all values #585

Rewrite only references in aggregation, not all values #585

Conversation

shane-circuithub commented Jan 9, 2024 • edited Loading

tomjaguarpaw commented Jan 9, 2024

shane-circuithub commented Jan 9, 2024

tomjaguarpaw commented Jan 10, 2024

tomjaguarpaw commented Jan 10, 2024

shane-circuithub commented Jan 10, 2024

tomjaguarpaw commented Jan 12, 2024

tomjaguarpaw commented Jan 13, 2024

tomjaguarpaw commented Jan 13, 2024

shane-circuithub commented Jan 15, 2024

tomjaguarpaw commented Jan 15, 2024

shane-circuithub commented Jan 9, 2024 •

edited

Loading