-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEATURE]support Subquery
in ppl
#661
Comments
@YANG-DB Should we rename the title to
So the subquery in description is part of boolean expression, such as | where orders.order_id in (subquery source=returns | where return_reason="damaged" | return order_id) The PS: there are many kinds of source=employees
| join source=sales on employees.employee_id = sales.employee_id
| where sales.sale_amount > (subquery source=targets | where target_met="true" | return target_value) Recall the join command doc: https://github.com/opensearch-project/opensearch-spark/blob/main/docs/PPL-Join-command.md#more-examples, the example I wrote is a subquery/subsearch plan, rather than a expression. SEARCH source=customer
| FIELDS c_custkey
| LEFT OUTER JOIN left = c, right = o ON c.c_custkey = o.o_custkey
[
SEARCH source=orders
| WHERE o_comment NOT LIKE '%unusual%packages%'
| FIELDS o_orderkey, o_custkey
]
| STATS ... simply to SEARCH <leftPlan>
| LEFT OUTER JOIN ON <condition>
[
<rightPlan>
]
| STATS ... Apply the syntax here and simply to search <leftPlan> | left join on <condition> (subquery search ...) The |
Base on #661 (comment), I created sub-tasks: #710 #711 #712 and #713, check the details in each issue. |
subquery
in pplInSubquery
in ppl
InSubquery
in pplSubquery
in ppl
Problem we are solving for
Proposed Solution
Syntax
The subquery command should be implemented using a clean, logical syntax that integrates with existing PPL structure.
In this example, the primary search (
source=logs
) is filtered by results from the subquery (source=events
).The subquery command should allow nested queries to be as complex as necessary, supporting multiple levels of nesting.
Example:
1.1 Scoping and Context Awareness
Subqueries must respect the context of the parent query, such as access to fields, filters, and other constraints applied at higher levels.
Field references in the subquery should be able to use aliasing if there are naming conflicts between the parent and subquery.
1.3 Operators and Piping
The subquery command should support the standard operators (
|
,where
,stats
, etc.) and functions available in PPL.Subquery results should be able to be further piped and processed after returning data to the primary query.
1.4 Performance Considerations
Queries using subquery should be optimized to minimize performance overhead. The system should use lazy evaluation or other optimization techniques to prevent unnecessary computation.
Limiting the number of nested subqueries may be required to avoid performance degradation in large datasets.
Subquery and Joins
The subquery command must be able to integrate with joins, allowing users to filter and correlate data across multiple data sources efficiently. This section outlines how subquery should interact with joins in the Piped Processing Language (PPL).
2.1 Use Case 1: Subquery as a Filter within Joins
The subquery command should be usable in the
ON
orWHERE
clause of a join, allowing for dynamic filtering based on the results of a subquery. For example:In this case:
The
join
command combines theusers
andorders
sources based onuser_id
.The
where
clause uses a subquery to filter the results, only returning orders that were returned due to a "damaged" reason.2.2 Use Case 2: Subquery within a Join Clause
Subquery can be used directly in the join condition, allowing a user to join based on the result of a dynamic subquery. For example:
In this case:
The join condition compares the
users.user_id
to the result of a subquery that retrieves high-value customers from another dataset.This allows users to filter which records are joined dynamically based on another dataset or query.
2.3 Use Case 3: Subquery with Different Join Types
The subquery should work with all types of joins, including
INNER
,LEFT OUTER
, andCROSS
joins. The behavior should remain consistent regardless of the join type.INNER JOIN: The subquery will filter results within both data sources. Only rows that match the join condition and the subquery filter will be returned.
Example:
In this case, only employees who exceeded their sales targets are returned.
LEFT OUTER JOIN: In a left join, the subquery will still apply only to the joined source, but unmatched rows from the left side should still appear.
Example:
Example:
Here, all employees are returned, but only those working on approved-budget projects will have project data.
2.4 Handling Performance with Joins and Subquery
Performance considerations are critical when combining joins and subqueries, as these operations could significantly increase query complexity.
Optimization: Queries should be optimized by evaluating subqueries first and passing their results efficiently to the join operation, reducing the overall number of records processed in the join.
Limiting Subquery in Joins: Users may be restricted from creating deeply nested subqueries within join clauses to prevent performance degradation. For example, a maximum depth of 2-3 levels may be enforced for subqueries used in join conditions.
Non-Functional Requirements
3.1 Performance
The subquery feature should not significantly degrade performance compared to running two separate queries independently. Where possible, the engine should optimize the execution plan to run the subquery efficiently.
3.2 Usability
The syntax should be clear and easily understood by users who are familiar with PPL.
The documentation should include detailed examples and edge cases to assist users in building complex subqueries.
3.3 Error Handling
If a subquery query returns no results or fails to execute, the system should return an appropriate error message, allowing the user to debug the query.
Cyclic subqueries (where a subquery references itself) should be detected, and an error should be thrown to prevent infinite loops.
Test Cases
4.1 Basic Query with Subquery
Validate that basic queries with Subquery execute as expected.
Input:
source=logs | where id in (subquery source=users | where active="true" | return id)
Expected Output: Logs only with user IDs returned by the subquery.
4.2 Performance Testing
Test the performance impact of subquery compared to equivalent independent queries.
4.3 Nested Subquery
Test with deeply nested subqueries and validate results.
Input:
source=actions | where user in (subquery source=users | where region in (subquery source=regions | where country="US"))
Expected Output: Actions filtered by users in US regions.
4.4 Edge Case: Empty Subquery
Test behavior when the subquery returns no results.
Input:
source=logs | where id in (subquery source=users | where id=999999 | return id)
Expected Output: No logs returned.
4.5 Error Case: Cyclic Subquery
Test cyclic subquery detection.
Input:
source=logs | where id in (subquery source=logs | where id in (subquery source=logs | ...))
Expected Output: Error for cyclic reference.
4.6 Test Case: Subquery as a Join Filter
Input:
Expected Output: Customers who have returned items with a "damaged" return reason.
4.7. Test Case: Subquery in Join Clause
Input:
Expected Output: Only users who are gold-status loyal customers are joined with purchase data.
4.8 Test Case: Subquery with LEFT JOIN
Input:
Expected Output: All customers are returned, but only those who purchased furniture will have non-null order data.
4.9 Test Case: Subquery with Multiple Joins
Input:
Expected Output: Employees working in departments and on projects with approved budgets are returned.
4.10 Error Handling in Joins with Subquery
Unmatched Subquery Results: If a subquery returns no results, the joined query should behave consistently with the join type. For an
INNER JOIN
, no rows are returned. For aLEFT JOIN
, unmatched rows from the left source should still appear.Cyclic Subquery: As with non-join queries, cyclic subqueries should be detected, and an appropriate error message should be returned.
The text was updated successfully, but these errors were encountered: