Support Exists Subquery in PPL (opensearch-project#769)
Signed-off-by: Lantao Jin <[email protected]>
LantaoJin authored Oct 12, 2024
1 parent fe5148c commit d83f61d
Showing 13 changed files with 860 additions and 70 deletions.
61 changes: 15 additions & 46 deletions docs/ppl-lang/PPL-Example-Commands.md
@@ -273,7 +273,7 @@ _- **Limitation: "REPLACE" or "APPEND" clause must contain "AS"**_

**SQL Migration examples with IN-Subquery PPL:**

1. tpch q4 (in-subquery with aggregation)
tpch q4 (in-subquery with aggregation)
```sql
select
o_orderpriority,
@@ -309,52 +309,21 @@ source = orders
| fields o_orderpriority, order_count
```

2.tpch q20 (nested in-subquery)
```sql
select
s_name,
s_address
from
supplier,
nation
where
s_suppkey in (
select
ps_suppkey
from
partsupp
where
ps_partkey in (
select
p_partkey
from
part
where
p_name like 'forest%'
)
)
and s_nationkey = n_nationkey
and n_name = 'CANADA'
order by
s_name
```
#### **ExistsSubquery**
[See additional command details](ppl-subquery-command.md)

Assumptions: `a`, `b` are fields of table outer, `c`, `d` are fields of table inner, `e`, `f` are fields of table inner2
- `source = outer | where exists [ source = inner | where a = c ]`
- `source = outer | where not exists [ source = inner | where a = c ]`
- `source = outer | where exists [ source = inner | where a = c and b = d ]`
- `source = outer | where not exists [ source = inner | where a = c and b = d ]`
- `source = outer | where exists [ source = inner1 | where a = c and exists [ source = inner2 | where c = e ] ]` (nested)
- `source = outer | where exists [ source = inner1 | where a = c | where exists [ source = inner2 | where c = e ] ]` (nested)
- `source = outer | where exists [ source = inner | where c > 10 ]` (uncorrelated exists)
- `source = outer | where not exists [ source = inner | where c > 10 ]` (uncorrelated exists)
- `source = outer | where exists [ source = inner ] | eval l = "Bala" | fields l` (special uncorrelated exists)


Rewritten by PPL InSubquery query:
```sql
source = supplier
| where s_suppkey IN [
source = partsupp
| where ps_partkey IN [
source = part
| where like(p_name, "forest%")
| fields p_partkey
]
| fields ps_suppkey
]
| inner join left=l right=r on s_nationkey = n_nationkey and n_name = 'CANADA'
nation
| sort s_name
```
#### **ScalarSubquery**
[See additional command details](ppl-subquery-command.md)

90 changes: 84 additions & 6 deletions docs/ppl-lang/ppl-subquery-command.md
@@ -112,6 +112,58 @@ source = supplier
| sort s_name
```

**ExistsSubquery usage**

Assumptions: `a`, `b` are fields of table outer, `c`, `d` are fields of table inner, `e`, `f` are fields of table inner2

- `source = outer | where exists [ source = inner | where a = c ]`
- `source = outer | where not exists [ source = inner | where a = c ]`
- `source = outer | where exists [ source = inner | where a = c and b = d ]`
- `source = outer | where not exists [ source = inner | where a = c and b = d ]`
- `source = outer | where exists [ source = inner1 | where a = c and exists [ source = inner2 | where c = e ] ]` (nested)
- `source = outer | where exists [ source = inner1 | where a = c | where exists [ source = inner2 | where c = e ] ]` (nested)
- `source = outer | where exists [ source = inner | where c > 10 ]` (uncorrelated exists)
- `source = outer | where not exists [ source = inner | where c > 10 ]` (uncorrelated exists)
- `source = outer | where exists [ source = inner ] | eval l = "nonEmpty" | fields l` (special uncorrelated exists)
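The correlated forms in the list above follow ordinary SQL `EXISTS` / `NOT EXISTS` semantics. As a rough illustration only, the sketch below runs the SQL analogues of the first four bullets against SQLite from Python. It does not execute PPL; the table names `outer_t` and `inner_t` (renamed to avoid SQL keywords) and the toy rows are assumptions made for this example.

```python
import sqlite3

# Hypothetical toy data; outer_t(a, b) and inner_t(c, d) mirror the
# field assumptions above but are not part of the PPL documentation.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE outer_t (a INTEGER, b INTEGER);
    CREATE TABLE inner_t (c INTEGER, d INTEGER);
    INSERT INTO outer_t VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO inner_t VALUES (1, 10), (2, 99);
""")


def outer_keys(predicate):
    """Return the `a` values of outer_t rows that satisfy the predicate."""
    sql = f"SELECT a FROM outer_t o WHERE {predicate} ORDER BY a"
    return [row[0] for row in conn.execute(sql)]


# where exists [ source = inner | where a = c ]
exists_one = outer_keys("EXISTS (SELECT 1 FROM inner_t i WHERE o.a = i.c)")
# where not exists [ source = inner | where a = c ]
not_exists_one = outer_keys("NOT EXISTS (SELECT 1 FROM inner_t i WHERE o.a = i.c)")
# where exists [ source = inner | where a = c and b = d ]
exists_two = outer_keys(
    "EXISTS (SELECT 1 FROM inner_t i WHERE o.a = i.c AND o.b = i.d)"
)
# where not exists [ source = inner | where a = c and b = d ]
not_exists_two = outer_keys(
    "NOT EXISTS (SELECT 1 FROM inner_t i WHERE o.a = i.c AND o.b = i.d)"
)

print(exists_one, not_exists_one, exists_two, not_exists_two)
```

Each `exists` / `not exists` pair partitions the outer rows: an outer row lands in exactly one of the two results.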

**_SQL Migration examples with Exists-Subquery PPL:_**

tpch q4 (exists subquery with aggregation)
```sql
select
o_orderpriority,
count(*) as order_count
from
orders
where
o_orderdate >= date '1993-07-01'
and o_orderdate < date '1993-07-01' + interval '3' month
and exists (
select
l_orderkey
from
lineitem
where l_orderkey = o_orderkey
and l_commitdate < l_receiptdate
)
group by
o_orderpriority
order by
o_orderpriority
```
Rewritten by PPL ExistsSubquery query:
```sql
source = orders
| where o_orderdate >= "1993-07-01" and o_orderdate < "1993-10-01"
and exists [
source = lineitem
| where l_orderkey = o_orderkey and l_commitdate < l_receiptdate
]
| stats count(1) as order_count by o_orderpriority
| sort o_orderpriority
| fields o_orderpriority, order_count
```
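The reason q4 uses `exists` rather than a plain join is that an order with several late line items must still be counted once. A minimal sketch of that difference, using the SQL form of the query on SQLite from Python; the rows are invented toy data and the three-month window is written as literal dates, as in the PPL rewrite above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (o_orderkey INTEGER, o_orderpriority TEXT, o_orderdate TEXT);
    CREATE TABLE lineitem (l_orderkey INTEGER, l_commitdate TEXT, l_receiptdate TEXT);
    -- Hypothetical rows: order 1 has two late line items, order 2 has none,
    -- order 4 falls outside the date window.
    INSERT INTO orders VALUES
        (1, '1-URGENT', '1993-07-15'),
        (2, '2-HIGH',   '1993-08-01'),
        (3, '1-URGENT', '1993-09-10'),
        (4, '1-URGENT', '1993-11-01'),
        (5, '2-HIGH',   '1993-07-20');
    INSERT INTO lineitem VALUES
        (1, '1993-08-01', '1993-08-05'),
        (1, '1993-08-01', '1993-08-10'),
        (2, '1993-09-01', '1993-08-20'),
        (3, '1993-10-01', '1993-10-02'),
        (4, '1993-12-01', '1993-12-05'),
        (5, '1993-08-01', '1993-08-02');
""")

DATE_FILTER = "o_orderdate >= '1993-07-01' AND o_orderdate < '1993-10-01'"

# EXISTS acts as a semi-join: each qualifying order contributes one row.
exists_sql = f"""
    SELECT o_orderpriority, COUNT(*) AS order_count
    FROM orders
    WHERE {DATE_FILTER}
      AND EXISTS (SELECT 1 FROM lineitem
                  WHERE l_orderkey = o_orderkey
                    AND l_commitdate < l_receiptdate)
    GROUP BY o_orderpriority
    ORDER BY o_orderpriority
"""

# A plain inner join repeats an order once per matching line item.
join_sql = f"""
    SELECT o_orderpriority, COUNT(*) AS order_count
    FROM orders JOIN lineitem ON l_orderkey = o_orderkey
    WHERE {DATE_FILTER} AND l_commitdate < l_receiptdate
    GROUP BY o_orderpriority
    ORDER BY o_orderpriority
"""

exists_result = conn.execute(exists_sql).fetchall()
join_result = conn.execute(join_sql).fetchall()
print(exists_result)
print(join_result)
```

With this data the join over-counts the `1-URGENT` group because order 1 matches two line items, while the `EXISTS` form counts it once.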

**ScalarSubquery usage**

Assumptions: `a`, `b` are fields of table outer, `c`, `d` are fields of table inner, `e`, `f` are fields of table nested
@@ -191,14 +243,14 @@ source = spark_catalog.default.outer

### **Additional Context**

The most cases in the description is to request a `InSubquery` expression.
`InSubquery`, `ExistsSubquery` and `ScalarSubquery` are all subquery expressions. Subquery expressions are most commonly used in the `where` clause:

The `where` command syntax is:

```
| where <boolean expression>
```
So the subquery in description is part of boolean expression, such as
So the subquery is part of the boolean expression, such as

```sql
| where orders.order_id in (subquery source=returns | where return_reason="damaged" | return order_id)
@@ -217,10 +269,11 @@ In issue description is a `ScalarSubquery`:
```sql
source=employees
| join source=sales on employees.employee_id = sales.employee_id
| where sales.sale_amount > (subquery source=targets | where target_met="true" | return target_value)
| where sales.sale_amount > [ source=targets | where target_met="true" | fields target_value ]
```

Recall the join command doc: https://github.com/opensearch-project/opensearch-spark/blob/main/docs/PPL-Join-command.md#more-examples, the example is a subquery/subsearch **plan**, rather than a **expression**.
`RelationSubquery`, however, is not a subquery expression; it is a subquery plan.
Recall the [join command doc](ppl-join-command.md): the example there is a subquery/subsearch **plan** rather than an **expression**.

```sql
SEARCH source=customer
@@ -245,7 +298,32 @@ SEARCH <leftPlan>
Applying the syntax here simplifies it to

```sql
search <leftPlan> | left join on <condition> (subquery search ...)
search <leftPlan> | left join on <condition> [ search ... ]
```

The `(subquery search ...)` is not a `expression`, it's `plan`, similar to the `relation` plan
The `[ search ...]` is not an `expression`; it is a `plan`, similar to the `relation` plan.

**Uncorrelated Subquery**

An uncorrelated subquery is independent of the outer query. It is executed once, and the result is used by the outer query.
It's **less common** when using `ExistsSubquery` because `ExistsSubquery` typically checks for the presence of rows that are dependent on the outer query’s row.

There is one special exists subquery, marked `(special uncorrelated exists)` in the usage list:
```sql
SELECT 'nonEmpty'
FROM outer
WHERE EXISTS (
SELECT *
FROM inner
);
```
Rewritten by PPL ExistsSubquery query:
```sql
source = outer
| where exists [
source = inner
]
| eval l = "nonEmpty"
| fields l
```
This query returns "nonEmpty" once for each row of `outer` if the inner table is not empty, and returns no rows otherwise.
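Because the subquery never references the outer row, its result is the same for every outer row: either all outer rows pass or none do. A small sketch of the SQL form on SQLite from Python, with hypothetical tables `outer_t` and `inner_t` (renamed to avoid SQL keywords) and a helper `non_empty_rows` invented for this example:

```python
import sqlite3


def non_empty_rows(inner_values):
    """Run the uncorrelated EXISTS query against three outer rows.

    `inner_values` populates the hypothetical inner_t table; an empty
    list leaves it empty.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE outer_t (a INTEGER)")
    conn.execute("CREATE TABLE inner_t (c INTEGER)")
    conn.executemany("INSERT INTO outer_t VALUES (?)", [(1,), (2,), (3,)])
    conn.executemany("INSERT INTO inner_t VALUES (?)", [(c,) for c in inner_values])
    sql = "SELECT 'nonEmpty' FROM outer_t WHERE EXISTS (SELECT * FROM inner_t)"
    result = [row[0] for row in conn.execute(sql)]
    conn.close()
    return result


with_inner_rows = non_empty_rows([7])
with_empty_inner = non_empty_rows([])
print(with_inner_rows, with_empty_inner)
```

A single row in the inner table is enough to let every outer row through; the values stored in it are irrelevant.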