[proposal] Reuse common expressions in a query #14196

gortiz · 2024-10-09T13:47:09Z

Introduction

In multi-stage query engine it is not very difficult to create queries that end up reading from the same table or executing the same join twice. This is specially easy when writing a SQL query using WITH expressions, but that is not the only case.

For example, a query like:

SELECT * 
FROM T1 
JOIN T2 as t2first 
    ON T1.col1 = t2first.col2 
JOIN T2 as t2second 
    ON t2first.col3 = t2second.col3

Generates the following plan:

flowchart BT
    J2([JOIN 2])
    J1([JOIN 1])
    S1([Scan T1])
    S2([Scan T2])
    S22([Scan T2])
    
    J2 --> J1
    S1 --> J2
    S2 --> J2
    S22 --> J1

In this case, we have two instances of Scan T2, which means it is executed twice and therefore the table is scanned twice. That may be already expensive, but it gets even worse when the subtree is more complex. For example imagine if it was a join or an aggregation.
To avoid this problem, we can reuse the subtree and therefore read T2 only once. In our example, the plan should be something like:

flowchart BT
    J2([JOIN 2])
    J1([JOIN 1])
    S1([Scan T1])
    S2([Scan T2])
    
    J2 --> J1
    S1 --> J2
    S2 --> J2
    S2 --> J1

This feature is supported in some databases and it is known as Spool in Calcite.

Design

The design document can be found here. This document includes some issues we can expect and how we could resolve them.

Current proposal

The proposal consider that the equivalence unit is the stage. We are not going to look for equivalent individual expressions in the tree but for equivalent stages. If at least two stages are found to be equivalent, then a new stage is created. This stage is equivalent to the equivalent stages in all operators but the mailbox send operator. The equivalent stages are then substituted by new stages with the same id but simplified to be a pair of receive and send mailboxes. The newly created stage broadcast all rows to the simplified stages.

For example, this query:

flowchart RL
    R([Root])
    J2([JOIN 2])
    J1([JOIN 1])
    S1([Scan T1])
    S2([Scan T2])

    subgraph Stage 1
        R
        Receiver1([Receive 1])
        Receiver1 --> R
    end

    subgraph Stage 2
        J1
        Sender2([Send 2])
        Receiver21([Receive 2.1])
        Receiver22([Receive 2.2])
        J1 --> Sender2
        Sender2 --> Receiver1
        Receiver21 -- hashtable --> J1
        Receiver22 -- stream --> J1
    end

    subgraph Stage 3
        J2
        Sender3([Send 3])
        Receiver31([Receive 3.1])
        Receiver32([Receive 3.2])

        J2 --> Sender3
        Sender3 -- broadcast --> Receiver21
        Receiver31 -- stream --> J2
        Receiver32 -- hashtable --> J2
    end

    subgraph Stage 4
        S1
        Sender4([Send 4])
        Sender4 -- random --> Receiver31
        S1 --> Sender4
    end

    subgraph Stage 5
        S2 --> Sender5
        Sender5([Send 5])
        
        Sender5 -- random --> Receiver22
    end
    
    subgraph Stage 6
        T22([Scan T2])
        Sender6([Send 6])
        Sender6 -- broadcast --> Receiver32
        T22 --> Sender6
    end

Will be transformed into:

flowchart RL
    R([Root])
    J2([JOIN 2])
    J1([JOIN 1])
    S1([Scan T1])
    S2([Scan T2])

    subgraph Stage 1
        R
        Receiver1([Receive 1])
        Receiver1 --> R
    end

    subgraph Stage 2
        J1
        Sender2([Send 2])
        Receiver21([Receive 2.1])
        Receiver22([Receive 2.2])
        J1 --> Sender2
        Sender2 --> Receiver1
        Receiver21 -- hashtable --> J1
        Receiver22 -- stream --> J1
    end
    
    subgraph Stage 3
        J2
        Sender3([Send 3])
        Receiver31([Receive 3.1])
        Receiver32([Receive 3.2])

        J2 --> Sender3
        Sender3 -- broadcast --> Receiver21
        Receiver31 -- stream --> J2
        Receiver32 -- hashtable --> J2
    end

    subgraph Stage 4
        S1
        Sender4([Send 4])
        Sender4 -- random --> Receiver31
        S1 --> Sender4
    end

    subgraph Same Server

    subgraph Stage 6
        Receiver6([Receiver 6])
        Sender6([Send 6])

        Receiver6 --> Sender6
        Sender6 -- broadcast --> Receiver32
    end

    subgraph Stage 7
        Receiver7([Receiver 6])
        Sender7([Send 6])

        Receiver7 --> Sender7
        Sender7 -- random --> Receiver22
    end

    subgraph Stage 5
        Sender5([Send 5])
        
        S2 --> Sender5
        Sender5 -- broadcast in memory --> Receiver7
        Sender5 -- broadcast in memory --> Receiver6
    end
    end

Challenges

Blocking and buffering

Given the blocking nature of some of our operators, some queries may end up in a inter-block when this feature is used. The design document describes one of this cases. The way we can face this problem is to buffer the data, probably offheap or even on disk. Alternatively we could just fail in this scenarios.

TODO list

Define when two stages are equivalent. [spool] Define equivalence between stages #14296
Modify MailboxSendNode to be able to send to multiple stages. [Spool] Introduce stage replacer and change send nodes to be able to send to more than one stage #14495
MailboxSendOperator and the plan serialization protocol to support mailboxes that send to multiple stages. In [Spool] Actual implementation #14507
Introduce a flag that enables/disables this optimization. In [Spool] Actual implementation #14507
Document this feature
Optional: Fix stats (right now stats will be counted twice in spools).
Optional: Study the impact explain with segment info.
Optional: Introduce a buffer to solve blocking issues

The text was updated successfully, but these errors were encountered:

ankitsultana · 2024-10-10T14:36:28Z

@gortiz : can you enable comment access on the doc?

gortiz · 2024-10-11T12:28:56Z

Let's try to trust internet and open comments for everyone :D

ankitsultana · 2024-10-24T19:08:31Z

@gortiz : do you folks have some numbers which show the performance gains with this approach? And specifically which queries this will help?

I have seen CTE reuse in query engines like Presto, but the bottlenecks there would be very different than what we have in most of the Pinot workloads (at least speaking from an Uber point of view). I suspect this can help if you have tiered storage (esp. for the Startree tiered storage plugin), but are there any other cases?

gortiz · 2024-10-28T11:48:46Z

I've seen several cases where this feature would help in Pinot queries written by our customers. Remember that this is not limited to WITH expressions (but common expressions in the relational algebra) and that includes intermediate stages. For example there are cases where people use the same join twice in a query. If were able to reuse these computations we would divide the number of rows we need to read from tables, but also the number of workers and allocation we need in intermediate stages.

We don't have actual numbers we can share, but we have some actual cases where an ideal reduce system would improve latency (and/or used workers) linearly.

gortiz · 2024-10-28T11:50:29Z

BTW, in the document I also mention apache/hive#5249, which tries to apply the same in Hive

I'm not familiar enough with how materialized views are implemented in Hive and Presto, but I'm assuming these are batch processes that will end up adding milliseconds to the latency of the query. In our case we will try to reuse the current mailbox system and only buffer on cases like the one explained in the document. Even in that case, latency should not be affected that much given the buffer will be consumed as a queue and therefore we should be able to apply different techniques to reduce the disk IO (if even needed).

gortiz · 2024-11-19T15:45:35Z

In #14495 the stage replacer is introduced. The next PR will probably include the changes in the worker protocol and operators that will let us actually apply the optimization

gortiz self-assigned this Oct 9, 2024

gortiz added Design Review multi-stage Related to the multi-stage query engine labels Oct 9, 2024

gortiz mentioned this issue Oct 24, 2024

[spool] Define equivalence between stages #14296

Merged

gortiz mentioned this issue Nov 19, 2024

[Spool] Introduce stage replacer and change send nodes to be able to send to more than one stage #14495

Merged

gortiz mentioned this issue Nov 20, 2024

[Spool] Actual implementation #14507

Draft

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[proposal] Reuse common expressions in a query #14196

[proposal] Reuse common expressions in a query #14196

gortiz commented Oct 9, 2024 •

edited

Loading

ankitsultana commented Oct 10, 2024

gortiz commented Oct 11, 2024

ankitsultana commented Oct 24, 2024

gortiz commented Oct 28, 2024

gortiz commented Oct 28, 2024 •

edited

Loading

gortiz commented Nov 19, 2024

[proposal] Reuse common expressions in a query #14196

[proposal] Reuse common expressions in a query #14196

Comments

gortiz commented Oct 9, 2024 • edited Loading

Introduction

Design

Current proposal

Challenges

Blocking and buffering

TODO list

ankitsultana commented Oct 10, 2024

gortiz commented Oct 11, 2024

ankitsultana commented Oct 24, 2024

gortiz commented Oct 28, 2024

gortiz commented Oct 28, 2024 • edited Loading

gortiz commented Nov 19, 2024

gortiz commented Oct 9, 2024 •

edited

Loading

gortiz commented Oct 28, 2024 •

edited

Loading