Support ExistsSubquery in PPL #769

Merged
merged 2 commits into from
Oct 12, 2024
61 changes: 15 additions & 46 deletions docs/ppl-lang/PPL-Example-Commands.md
@@ -273,7 +273,7 @@ _- **Limitation: "REPLACE" or "APPEND" clause must contain "AS"**_

**SQL Migration examples with IN-Subquery PPL:**

1. tpch q4 (in-subquery with aggregation)
tpch q4 (in-subquery with aggregation)
```sql
select
o_orderpriority,
@@ -309,52 +309,21 @@ source = orders
| fields o_orderpriority, order_count
```

2.tpch q20 (nested in-subquery)
```sql
select
s_name,
s_address
from
supplier,
nation
where
s_suppkey in (
select
ps_suppkey
from
partsupp
where
ps_partkey in (
select
p_partkey
from
part
where
p_name like 'forest%'
)
)
and s_nationkey = n_nationkey
and n_name = 'CANADA'
order by
s_name
```
#### **ExistsSubquery**
[See additional command details](ppl-subquery-command.md)

Assumptions: `a`, `b` are fields of table outer, `c`, `d` are fields of table inner, `e`, `f` are fields of table inner2
- `source = outer | where exists [ source = inner | where a = c ]`
- `source = outer | where not exists [ source = inner | where a = c ]`
- `source = outer | where exists [ source = inner | where a = c and b = d ]`
- `source = outer | where not exists [ source = inner | where a = c and b = d ]`
- `source = outer | where exists [ source = inner1 | where a = c and exists [ source = inner2 | where c = e ] ]` (nested)
- `source = outer | where exists [ source = inner1 | where a = c | where exists [ source = inner2 | where c = e ] ]` (nested)
- `source = outer | where exists [ source = inner | where c > 10 ]` (uncorrelated exists)
- `source = outer | where not exists [ source = inner | where c > 10 ]` (uncorrelated exists)
- `source = outer | where exists [ source = inner ] | eval l = "Bala" | fields l` (special uncorrelated exists)


Rewritten by PPL InSubquery query:
```sql
source = supplier
| where s_suppkey IN [
source = partsupp
| where ps_partkey IN [
source = part
| where like(p_name, "forest%")
| fields p_partkey
]
| fields ps_suppkey
]
| inner join left=l right=r on s_nationkey = n_nationkey and n_name = 'CANADA'
nation
| sort s_name
```
#### **ScalarSubquery**
[See additional command details](ppl-subquery-command.md)

90 changes: 84 additions & 6 deletions docs/ppl-lang/ppl-subquery-command.md
@@ -112,6 +112,58 @@ source = supplier
| sort s_name
```

**ExistsSubquery usage**

Assumptions: `a`, `b` are fields of table outer, `c`, `d` are fields of table inner, `e`, `f` are fields of table inner2

- `source = outer | where exists [ source = inner | where a = c ]`
- `source = outer | where not exists [ source = inner | where a = c ]`
- `source = outer | where exists [ source = inner | where a = c and b = d ]`
- `source = outer | where not exists [ source = inner | where a = c and b = d ]`
- `source = outer | where exists [ source = inner1 | where a = c and exists [ source = inner2 | where c = e ] ]` (nested)
- `source = outer | where exists [ source = inner1 | where a = c | where exists [ source = inner2 | where c = e ] ]` (nested)
- `source = outer | where exists [ source = inner | where c > 10 ]` (uncorrelated exists)
- `source = outer | where not exists [ source = inner | where c > 10 ]` (uncorrelated exists)
- `source = outer | where exists [ source = inner ] | eval l = "nonEmpty" | fields l` (special uncorrelated exists)
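
For readers migrating from SQL, the correlated patterns above behave like `EXISTS` / `NOT EXISTS` filters on the outer rows. The following is a minimal sketch of that SQL behavior using Python's standard-library `sqlite3` module. It does not run PPL, and the table names `outer_t` and `inner_t` and their rows are hypothetical, chosen only for illustration.

```python
import sqlite3

# Hypothetical stand-ins for the `outer` and `inner` tables (illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE outer_t (a INTEGER, b INTEGER);
    CREATE TABLE inner_t (c INTEGER, d INTEGER);
    INSERT INTO outer_t VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO inner_t VALUES (1, 10), (3, 99);
""")

def rows(sql):
    """Run a query and return all result rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# SQL analogue of: source = outer | where exists [ source = inner | where a = c ]
exists_rows = rows("""
    SELECT a, b FROM outer_t o
    WHERE EXISTS (SELECT 1 FROM inner_t i WHERE o.a = i.c)
    ORDER BY a
""")

# SQL analogue of: source = outer | where not exists [ source = inner | where a = c ]
not_exists_rows = rows("""
    SELECT a, b FROM outer_t o
    WHERE NOT EXISTS (SELECT 1 FROM inner_t i WHERE o.a = i.c)
    ORDER BY a
""")

# SQL analogue of: source = outer | where exists [ source = inner | where a = c and b = d ]
both_keys_rows = rows("""
    SELECT a, b FROM outer_t o
    WHERE EXISTS (SELECT 1 FROM inner_t i WHERE o.a = i.c AND o.b = i.d)
    ORDER BY a
""")

print(exists_rows, not_exists_rows, both_keys_rows)
```

In this sketch every outer row lands in exactly one of the `exists` and `not exists` results, and adding a second correlated condition can only narrow the `exists` result.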

**_SQL Migration examples with Exists-Subquery PPL:_**

tpch q4 (exists subquery with aggregation)
```sql
select
o_orderpriority,
count(*) as order_count
from
orders
where
o_orderdate >= date '1993-07-01'
and o_orderdate < date '1993-07-01' + interval '3' month
and exists (
select
l_orderkey
from
lineitem
where l_orderkey = o_orderkey
and l_commitdate < l_receiptdate
)
group by
o_orderpriority
order by
o_orderpriority
```
Rewritten as a PPL ExistsSubquery query:
```sql
source = orders
| where o_orderdate >= "1993-07-01" and o_orderdate < "1993-10-01"
and exists [
source = lineitem
| where l_orderkey = o_orderkey and l_commitdate < l_receiptdate
]
| stats count(1) as order_count by o_orderpriority
| sort o_orderpriority
| fields o_orderpriority, order_count
```
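
The PPL rewrite replaces the SQL expression `date '1993-07-01' + interval '3' month` with the literal `"1993-10-01"`. As a small sanity check of that arithmetic, here is a sketch in Python; `add_months` is a hypothetical helper written for this illustration, and it assumes the day of month exists in the target month (true for the 1st).

```python
from datetime import date

def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole months (assumes the day exists in the target month)."""
    total = d.month - 1 + months  # zero-based month index after the shift
    return d.replace(year=d.year + total // 12, month=total % 12 + 1)

upper_bound = add_months(date(1993, 7, 1), 3)
print(upper_bound.isoformat())
```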

**ScalarSubquery usage**

Assumptions: `a`, `b` are fields of table outer, `c`, `d` are fields of table inner, `e`, `f` are fields of table nested
@@ -191,14 +243,14 @@ source = spark_catalog.default.outer

### **Additional Context**

The most cases in the description is to request a `InSubquery` expression.
`InSubquery`, `ExistsSubquery` and `ScalarSubquery` are all subquery expressions. Subquery expressions are most commonly used in a `where` clause:

The `where` command syntax is:

```
| where <boolean expression>
```
So the subquery in description is part of boolean expression, such as
So the subquery is part of the boolean expression, such as

```sql
| where orders.order_id in (subquery source=returns | where return_reason="damaged" | return order_id)
@@ -217,10 +269,11 @@ In issue description is a `ScalarSubquery`:
```sql
source=employees
| join source=sales on employees.employee_id = sales.employee_id
| where sales.sale_amount > (subquery source=targets | where target_met="true" | return target_value)
| where sales.sale_amount > [ source=targets | where target_met="true" | fields target_value ]
```

Recall the join command doc: https://github.com/opensearch-project/opensearch-spark/blob/main/docs/PPL-Join-command.md#more-examples, the example is a subquery/subsearch **plan**, rather than a **expression**.
However, `RelationSubquery` is not a subquery expression; it is a subquery plan.
Recall the [join command doc](ppl-join-command.md): the example there is a subquery/subsearch **plan** rather than an **expression**.

```sql
SEARCH source=customer
@@ -245,7 +298,32 @@ SEARCH <leftPlan>
Applying the syntax here, this simplifies to

```sql
search <leftPlan> | left join on <condition> (subquery search ...)
search <leftPlan> | left join on <condition> [ search ... ]
```

The `(subquery search ...)` is not a `expression`, it's `plan`, similar to the `relation` plan
The `[ search ... ]` is not an `expression`; it is a `plan`, similar to the `relation` plan.

**Uncorrelated Subquery**

An uncorrelated subquery is independent of the outer query. It is executed once, and the result is used by the outer query.
It's **less common** when using `ExistsSubquery` because `ExistsSubquery` typically checks for the presence of rows that are dependent on the outer query’s row.

There is one very special exists subquery, marked `(special uncorrelated exists)` in the usage list:
```sql
SELECT 'nonEmpty'
FROM outer
WHERE EXISTS (
SELECT *
FROM inner
);
```
Rewritten as a PPL ExistsSubquery query:
```sql
source = outer
| where exists [
source = inner
]
| eval l = "nonEmpty"
| fields l
```
This query returns "nonEmpty" once per row of `outer` if the `inner` table is not empty, and returns no rows otherwise.
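
The behavior of this uncorrelated form can be checked against the SQL version with a small sketch using Python's standard-library `sqlite3` module. It runs SQL, not PPL; the helper `non_empty_rows` and the tables `outer_t` and `inner_t` are hypothetical names introduced only for this illustration.

```python
import sqlite3

def non_empty_rows(outer_values, inner_values):
    """Run the uncorrelated EXISTS query against throwaway in-memory tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE outer_t (a INTEGER)")
    conn.execute("CREATE TABLE inner_t (c INTEGER)")
    conn.executemany("INSERT INTO outer_t VALUES (?)", [(v,) for v in outer_values])
    conn.executemany("INSERT INTO inner_t VALUES (?)", [(v,) for v in inner_values])
    # The subquery never references outer_t, so it has the same truth value for every outer row.
    result = conn.execute(
        "SELECT 'nonEmpty' AS l FROM outer_t WHERE EXISTS (SELECT * FROM inner_t)"
    ).fetchall()
    conn.close()
    return [row[0] for row in result]

with_inner = non_empty_rows([1, 2], [7])   # inner table has a row
without_inner = non_empty_rows([1, 2], []) # inner table is empty
print(with_inner, without_inner)
```

Because the subquery does not depend on the outer row, the filter either keeps every outer row or drops all of them.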